Crawl Budget on Large Catalogs

Google's crawl budget is finite, and on sites with thousands or millions of URLs that limit starts to matter. If your XML sitemaps list every SKU, filter combination, and paginated page, crawlers will spend time on low-value URLs and can miss critical ones.

Large catalogs need deliberate controls on discovery and crawling so search engines focus on the pages that matter.

Identify where crawl budget leaks

The first useful step is to understand how your catalog exposes pages:

faceted navigation and filters that create thousands of parameterized URLs
infinite scroll or pagination paths that generate near-endless crawl sets
thin tag and archive pages that cross-link into loops
duplicate variants for color, shipping, or other product options

Mapping these patterns shows where crawlers are wasting time.
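
One way to start that mapping is to bucket a URL export (from a crawler, a sitemap, or logs) by how each URL was generated. The sketch below is a minimal example, not a complete classifier: the parameter names in FACET_PARAMS and PAGINATION_PARAMS and the /tag/ and /archive/ prefixes are assumptions, so replace them with the patterns your catalog actually uses.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names and path prefixes; adjust to your own catalog.
FACET_PARAMS = {"color", "size", "brand", "sort", "price"}
PAGINATION_PARAMS = {"page", "p"}
THIN_PATH_PREFIXES = ("/tag/", "/archive/")


def classify_url(url: str) -> str:
    """Return a coarse bucket describing how a URL was generated."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if params & FACET_PARAMS:
        return "facet"
    if params & PAGINATION_PARAMS:
        return "pagination"
    if parts.path.startswith(THIN_PATH_PREFIXES):
        return "tag_or_archive"
    if params:
        return "other_parameterized"
    return "clean"


def summarize(urls):
    """Count URLs per bucket so the largest sources of noise stand out."""
    return Counter(classify_url(u) for u in urls)
```

A bucket that dwarfs the "clean" count is usually the first place to look for consolidation.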

Consolidate and prioritize

Once you know where the noise comes from, focus on consolidation:

assign a single canonical URL for each product or content item
use noindex or robots.txt rules on faceted paths that add no unique value, but not both on the same URL, because a crawler blocked by robots.txt never sees the noindex
prune sitemaps so they list only primary categories and canonical URLs
strengthen internal links to the pages that actually matter

This does not eliminate the pages from your site. It guides crawlers to spend their time where it counts.
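
As an illustration of sitemap pruning, the sketch below collapses a URL list to canonical forms and writes only those into a sitemap. It assumes, for simplicity, that a canonical URL is the same URL with its query string and fragment removed; real catalogs often need a lookup against the canonical declared on each page instead. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def canonical_form(url: str) -> str:
    """Drop the query string and fragment (assumed canonicalization rule)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def build_sitemap(urls) -> str:
    """Return sitemap XML listing each canonical URL exactly once."""
    canonical = sorted({canonical_form(u) for u in urls})
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for u in canonical:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = u
    return ET.tostring(urlset, encoding="unicode")
```

Filter and pagination variants of the same listing fall away because they share one canonical form, which keeps the sitemap aligned with the canonical tags rather than contradicting them.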

Monitor and iterate

Managing crawl budget is an ongoing process:

review crawl stats to see which URL patterns are being requested most often
check server logs for unexpected bot behavior
compare organic visibility on key categories and products after cleanup

The goal is not to limit Google's access. It is to keep your important pages from competing with pages you never wanted crawled in the first place.
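
For the log review, a rough sketch like the one below can show which URL patterns Googlebot requests most. It assumes access logs in the common "combined" format and groups requests by first path segment, marking those that carry a query string. It matches on the user-agent string only, which can be spoofed, so treat the counts as indicative unless you also verify the requesting IPs.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Request line, status, size, referrer, and user agent of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<target>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits_by_pattern(lines):
    """Count requests whose user agent mentions Googlebot, grouped by URL pattern."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        parts = urlsplit(match.group("target"))
        first_segment = parts.path.strip("/").split("/")[0]
        suffix = "?params" if parts.query else ""
        counts[f"/{first_segment}{suffix}"] += 1
    return counts
```

Calling `googlebot_hits_by_pattern(open("access.log"))` and then `.most_common(20)` on the result gives a quick ranking; the file name here is a placeholder.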

Applied at scale

Selective controls across more than 10,000 URLs

For a content site with more than 10,000 URLs, I audited index coverage and implemented selective noindex and crawl controls instead of applying a blanket sitewide rule.

This describes the implemented control layer; no indexation-recovery or traffic outcome is claimed.
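
To make the difference between selective and blanket rules concrete, here is a generic sketch of a per-URL decision that returns a value for the X-Robots-Tag response header. It is not the rule set used in the audit above: the path prefixes, parameter names, and function name are hypothetical and chosen only for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs

# Hypothetical rules, for illustration only.
NOINDEX_PATH_PREFIXES = ("/tag/", "/search/")
NOINDEX_PARAMS = {"sort", "filter"}


def x_robots_tag(url: str) -> Optional[str]:
    """Return an X-Robots-Tag value for URLs matching a rule, else None."""
    parts = urlsplit(url)
    if parts.path.startswith(NOINDEX_PATH_PREFIXES):
        return "noindex, follow"
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if params & NOINDEX_PARAMS:
        return "noindex, follow"
    return None  # No header: the page stays indexable by default.
```

Because every rule is tied to a named pattern, each one can be reviewed, tested, and rolled back on its own, which a sitewide rule does not allow.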

More notes

Related diagnostic paths

A/B testing infrastructure that keeps search signals stable

Server-side splits, temporary redirects, and cookie-based variants each present different risks. Keep experiments crawlable without serving crawlers special content.

Canonical propagation delays in large sites

Fixing a canonical is not the end of the problem. Google must recrawl and reprocess the affected URLs, so visible convergence can take days or weeks.