Crawler / audit evidence

ILCrawler - Technical SEO crawler and audit workbench

A crawler and audit system that turns raw-crawl, rendered-page, Lighthouse, and issue evidence into reviewable SEO handoff data.

Best for

  • technical SEO audits
  • crawl and indexation work
  • crawler or audit-tool proof
  • FastAPI plus Next.js systems proof

Scoped for owned and client-authorized audits.

What it does

Useful proof without the full internal dump.

  • runs raw HTTP crawls with robots.txt handling, sitemap discovery, URL normalization, and depth and crawl limits
  • captures metadata, canonicals, headings, robots directives, hreflang, word count, TTFB, and link graphs
  • records internal links, external checks, resource inventory, rendered screenshots, and Lighthouse artifacts
  • generates issue rows for duplicate metadata/content, broken links, orphan/dead-end pages, canonical problems, redirects, and hreflang issues
  • tracks issue workflow state, ignore rules, schedules, webhooks, API tokens, admin diagnostics, and worker drain controls
  • exports pages, issues, links, resources, errors, and branded crawl reports as CSV or PDF
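
The first bullet above (robots.txt handling, URL normalization, depth and crawl limits) can be illustrated with a small standard-library sketch. It is not ILCrawler's code: the function names, the tracking-parameter list, and the `links_of` callback (a stand-in for fetching a page and extracting its links) are all assumptions made for this example.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical list of tracking parameters to drop; the real list is not given.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop default ports, fragments, and tracking
    parameters, and sort the remaining query parameters."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port and (scheme, port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{port}"
    path = parts.path or "/"
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]
    return urlunsplit((scheme, host, path, urlencode(sorted(kept)), ""))


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_crawl(seed: str,
               links_of: Callable[[str], Iterable[str]],
               robots: RobotFileParser,
               user_agent: str,
               max_depth: int,
               max_pages: int) -> List[str]:
    """Breadth-first walk that respects robots.txt, a depth cap, and a page cap."""
    start = normalize_url(seed)
    seen = {start}
    queue = deque([(start, 0)])
    planned: List[str] = []
    while queue and len(planned) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # disallowed URLs are neither fetched nor expanded
        planned.append(url)
        if depth >= max_depth:
            continue  # depth cap reached: keep the page, skip its links
        for link in links_of(url):
            normalized = normalize_url(link)
            if normalized not in seen:
                seen.add(normalized)
                queue.append((normalized, depth + 1))
    return planned
```

Normalizing before the `seen` check is what stops `https://EXAMPLE.com/a#top` and `https://example.com/a` from being queued as two pages. The sketch ignores userinfo and IPv6 hosts, which a production crawler would need to handle.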

Build notes

Implementation choices that matter.

Raw crawl layer
HTTP crawl data stays separate from rendered and Lighthouse evidence so each audit source can be reviewed on its own.
Rendered audits
Playwright captures rendered-page evidence and screenshots only where it adds useful proof.
Worker controls
Queue state, drain controls, worker health, and run progress are visible because crawler jobs fail operationally, not just logically.
Handoff output
CSV and PDF exports keep the findings portable outside the dashboard for review and implementation handoff.
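
As a rough illustration of the handoff output described above, issue rows can be written to CSV with the standard library alone. The column names and the `issues_to_csv` helper are hypothetical; the actual export schema is not shown on this page.

```python
import csv
import io
from typing import Dict, Iterable

# Hypothetical export columns; the real ILCrawler schema is not documented here.
ISSUE_COLUMNS = ["url", "issue_type", "severity", "state"]


def issues_to_csv(rows: Iterable[Dict[str, object]]) -> str:
    """Serialize issue rows to CSV text, keeping only the export columns."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=ISSUE_COLUMNS,
        extrasaction="ignore",  # drop internal fields that are not part of the export
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # missing columns are written as empty strings
    return buffer.getvalue()
```

Fixing the column list up front keeps the file layout stable between runs, so a reviewer can compare two exports without a dashboard login.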

Run detail

Crawl workbench (product frontend)

Health: 94
Pages: 117
Resources: 3,873

Status: Raw crawl running, ML report requested, Exports ready
FastAPI, Next.js, React, Tailwind CSS, PostgreSQL, Docker Compose, Playwright, Lighthouse, Backblaze B2, CSV/PDF Exports

UI screenshots

ILCrawler frontend

Current Next.js frontend excerpts. Local account details and sensitive identifiers are not shown.

ILCrawler frontend overview showing runtime health, signed-in workspace, operations access, and next workbench lanes.
Overview: runtime health, workspace state, operations access, and the workbench lanes that replaced the retired review screens.
ILCrawler projects screen showing project inventory and crawl defaults for the IndexLane target.
Projects: crawl target inventory, default limits, robots policy, and quick access to each project workspace.
ILCrawler project detail screen showing crawl settings and start crawl controls.
Project detail: crawl settings, anti-bot policy, robots and sitemap controls, external link checks, and new-run setup.
ILCrawler run detail screen showing crawl progress, health, pages, issues, links, resources, and ML report progress.
Run detail: live crawl progress, queue status, health score, issue/link/resource counts, exports, and ML report progress.

Worker lease / job execution

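python
from sqlalchemy import or_, select

# CrawlQueue, session, now, limit, lease_owner, and leased_at come from the
# surrounding worker code, which is not shown in this excerpt.
# Select queued rows that are due, highest priority first (lowest id breaks ties).
# FOR UPDATE SKIP LOCKED makes each worker pass over rows that another
# transaction has already locked instead of waiting on them.
stmt = (
    select(CrawlQueue)
    .where(CrawlQueue.state == "queued")
    .where(or_(CrawlQueue.available_at.is_(None), CrawlQueue.available_at <= now))
    .order_by(CrawlQueue.priority.desc(), CrawlQueue.id.asc())
    .limit(limit)
    .with_for_update(of=CrawlQueue, skip_locked=True)
)

# Mark every selected row as leased by this worker and count the attempt.
for item in session.scalars(stmt):
    item.state = "leased"
    item.lease_owner = lease_owner
    item.leased_at = leased_at
    item.attempts += 1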
Purpose
Lease queued URLs across concurrent crawl workers.
Guardrail
Row locks with SKIP LOCKED keep concurrent workers from leasing the same URL twice or stalling while they wait on each other's locks.
Tradeoff
Uses PostgreSQL row locking instead of a separate queue service; simpler stack, tighter database coupling.
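
The excerpt above records `leased_at` and `attempts` but does not show what happens when a worker dies while holding a lease. One common approach, sketched below in plain Python, is to reclaim leases older than a timeout and requeue them with a backoff. Everything here is an assumption for illustration: the `QueueItem` stand-in, the timeout, the attempt cap, the backoff schedule, and the "failed" state are not taken from ILCrawler.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical tuning values; the source does not state any of them.
LEASE_TIMEOUT = timedelta(minutes=5)
MAX_ATTEMPTS = 3
BASE_BACKOFF = timedelta(seconds=30)


@dataclass
class QueueItem:
    """In-memory stand-in for a queue row, using the fields seen in the excerpt."""
    state: str
    attempts: int
    leased_at: Optional[datetime] = None
    lease_owner: Optional[str] = None
    available_at: Optional[datetime] = None


def reclaim_if_stale(item: QueueItem, now: datetime) -> bool:
    """Release a lease that has outlived LEASE_TIMEOUT.

    Returns True when the item was changed. Items that have used up their
    attempts move to a hypothetical "failed" state; the rest are requeued
    with an exponentially growing delay.
    """
    if item.state != "leased" or item.leased_at is None:
        return False
    if now - item.leased_at < LEASE_TIMEOUT:
        return False  # lease is still fresh; the worker may be mid-fetch
    item.lease_owner = None
    item.leased_at = None
    if item.attempts >= MAX_ATTEMPTS:
        item.state = "failed"
    else:
        item.state = "queued"
        # 30s after the first attempt, 60s after the second, and so on.
        item.available_at = now + BASE_BACKOFF * (2 ** max(item.attempts - 1, 0))
    return True
```

Setting `available_at` on requeue lines up with the `available_at <= now` filter in the lease query, so a reclaimed URL stays invisible to workers until its delay has passed.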

Fit

Relevant if you need crawler, audit, or SEO evidence tooling.