Controlling crawl budget on large catalogs

By Niko, May 10, 2026

On large catalogs, crawl budget gets wasted when sitemaps, filters, and archive paths keep promoting low-value URLs instead of the pages that actually matter.

On sites with thousands or millions of URLs, Google works within a finite crawl budget. If your XML sitemaps list every SKU, filter combination, and paginated page, crawlers spend their requests on low-value URLs and reach the critical ones late or not at all.

Large catalogs need deliberate controls on discovery and crawling so search engines focus on the pages that matter.

Identify where crawl budget leaks

The first useful step is to understand how your catalog exposes pages:

  • faceted navigation and filters that create thousands of parameterized URLs
  • infinite scroll or pagination paths that generate near-endless crawl sets
  • thin tag and archive pages that cross-link into loops
  • duplicate variants for color, shipping, or other product options

Mapping these patterns shows where crawlers are wasting time.
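
One way to start that mapping is to bucket a URL export (from a crawl, a sitemap, or logs) by pattern and count each bucket. The sketch below is a minimal illustration, not a finished audit tool: the parameter names, path prefixes, and the shop.example domain are assumptions, so swap in whatever your own catalog uses.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Hypothetical conventions; replace with your catalog's real parameters and paths.
FACET_PARAMS = {"color", "size", "sort", "price"}
PAGE_PARAMS = {"page", "p"}
ARCHIVE_PREFIXES = ("/tag/", "/archive/")


def classify(url: str) -> str:
    """Return a coarse label for the crawl pattern a URL belongs to."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if params & FACET_PARAMS:
        return "facet"
    if params & PAGE_PARAMS:
        return "pagination"
    if parts.path.startswith(ARCHIVE_PREFIXES):
        return "archive"
    return "primary"


def summarize(urls):
    """Count URLs per pattern so the largest leaks stand out."""
    return Counter(classify(u) for u in urls)


if __name__ == "__main__":
    sample = [
        "https://shop.example/shoes",
        "https://shop.example/shoes?color=red&size=9",
        "https://shop.example/shoes?page=14",
        "https://shop.example/tag/summer/",
    ]
    for label, count in summarize(sample).most_common():
        print(label, count)
```

If one bucket dwarfs "primary", that pattern is probably the first place to look.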

Consolidate and prioritize

Once you know where the noise comes from, focus on consolidation:

  1. assign a single canonical URL for each product or content item
  2. use noindex or robots.txt rules on faceted paths that add no unique value, but not both on the same URL, since a crawler blocked by robots.txt never sees the noindex
  3. prune sitemaps so they list only primary categories and canonical URLs
  4. strengthen internal links to the pages that actually matter

This does not eliminate the pages from your site. It guides crawlers to spend their time where it counts.
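
Sitemap pruning is the easiest of these steps to automate. As a rough sketch, assume each page is a record with hypothetical `url`, `canonical`, and `noindex` fields taken from your own crawl export; the filter keeps only self-canonical, indexable, unparameterized URLs and writes them out in the standard sitemap format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def keep_in_sitemap(page: dict) -> bool:
    """Keep a page only if it is self-canonical, indexable, and has no query string."""
    url = page["url"]
    return (
        page["canonical"] == url
        and not page["noindex"]
        and not urlsplit(url).query
    )


def build_sitemap(pages) -> str:
    """Serialize the pages that pass keep_in_sitemap as a sitemap document."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if keep_in_sitemap(page):
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = page["url"]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Illustrative records; the field names are assumptions, not a real export format.
    pages = [
        {"url": "https://shop.example/shoes",
         "canonical": "https://shop.example/shoes", "noindex": False},
        {"url": "https://shop.example/shoes?color=red",
         "canonical": "https://shop.example/shoes", "noindex": False},
        {"url": "https://shop.example/tag/summer/",
         "canonical": "https://shop.example/tag/summer/", "noindex": True},
    ]
    print(build_sitemap(pages))
```

The filtered-out pages stay live on the site. They just stop being advertised to crawlers through the sitemap.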

Monitor and iterate

Managing crawl budget is an ongoing process:

  • review crawl stats to see which URL patterns are being requested most often
  • check server logs for unexpected bot behavior
  • compare organic visibility on key categories and products after cleanup
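
For the log check, a small script is often enough to see which URL patterns Googlebot requests most. The sketch below assumes access logs in the common "combined" format and matches on the user-agent string only. That string can be spoofed, so treat the counts as a first look rather than verified Googlebot traffic. The bucketing rule (first path segment, plus a `?*` marker for parameterized requests) is an arbitrary choice for illustration.

```python
import re
from collections import Counter

# Matches the request, status, and user-agent fields of a combined-format log line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bucket(path: str) -> str:
    """Reduce a request path to its first segment, marking parameterized requests."""
    base, _, query = path.partition("?")
    first = "/" + base.strip("/").split("/")[0]
    return first + ("?*" if query else "")


def googlebot_hits(lines) -> Counter:
    """Count requests per bucket for lines whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[bucket(match.group("path"))] += 1
    return hits


if __name__ == "__main__":
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    sample = [
        f'66.249.66.1 - - [10/May/2026:06:25:01 +0000] "GET /shoes?color=red HTTP/1.1" 200 5123 "-" "{bot}"',
        f'66.249.66.1 - - [10/May/2026:06:25:02 +0000] "GET /shoes HTTP/1.1" 200 7310 "-" "{bot}"',
    ]
    for pattern, count in googlebot_hits(sample).most_common():
        print(pattern, count)
```

Running this before and after a cleanup gives a simple comparison of where crawl requests are going.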

The goal is not to limit Google's access. It is to keep your important pages from competing with pages you never wanted crawled in the first place.
