Crawler / audit evidence
ILCrawler - Technical SEO crawler and audit workbench
Crawler and audit system that turns raw-crawl, rendered-page, Lighthouse, and issue evidence into reviewable SEO handoff data.
Best for
- technical SEO audits
- crawl and indexation work
- crawler or audit-tool proof
- FastAPI plus Next.js systems proof
Scoped for owned and client-authorized audits.
What it does
Useful proof without the full internal dump.
- runs raw HTTP crawls with robots.txt, sitemap discovery, URL normalization, depth, and crawl limits
- captures metadata, canonicals, headings, robots directives, hreflang, word count, TTFB, and link graphs
- records internal links, external checks, resource inventory, rendered screenshots, and Lighthouse artifacts
- generates issue rows for duplicate metadata/content, broken links, orphan/dead-end pages, canonical problems, redirects, and hreflang issues
- tracks issue workflow state, ignore rules, schedules, webhooks, API tokens, admin diagnostics, and worker drain controls
- exports pages, issues, links, resources, errors, and branded crawl reports as CSV or PDF
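The orphan and dead-end checks listed above can be sketched from a link graph. This is a minimal illustration, not ILCrawler's actual rule set: the function name `find_link_issues`, the input shapes, and the definitions used here (an orphan has no internal inlinks, a dead end has no internal outlinks) are assumptions.

```python
from collections import defaultdict


def find_link_issues(pages, links, start_url):
    """Return (orphans, dead_ends) for known pages and internal links.

    pages: URLs known to the crawl (crawled or listed in a sitemap)
    links: (source_url, target_url) pairs for internal links
    start_url: crawl entry point, never reported as an orphan
    """
    pages = set(pages)
    inlinks = defaultdict(set)
    outlinks = defaultdict(set)
    for source, target in links:
        if source == target:
            continue  # self-links do not help discovery or navigation
        if source in pages and target in pages:
            outlinks[source].add(target)
            inlinks[target].add(source)
    # Orphan: no other known page links to it (assumed definition).
    orphans = sorted(p for p in pages if p != start_url and not inlinks[p])
    # Dead end: links to no other known page (assumed definition).
    dead_ends = sorted(p for p in pages if not outlinks[p])
    return orphans, dead_ends


# Hypothetical four-page site: "/old" is only known from the sitemap.
example_pages = ["/", "/a", "/b", "/old"]
example_links = [("/", "/a"), ("/a", "/b"), ("/a", "/")]
orphans, dead_ends = find_link_issues(example_pages, example_links, "/")
print(orphans)    # ['/old']
print(dead_ends)  # ['/b', '/old']
```

A page can appear in both lists, as `/old` does here, which is why issue rows are usually keyed by URL and issue type rather than by URL alone.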
Build notes
Implementation choices that matter.
- Raw crawl layer: HTTP crawl data stays separate from rendered and Lighthouse evidence so each audit source can be reviewed on its own.
- Rendered audits: Playwright captures rendered-page evidence and screenshots only where it adds useful proof.
- Worker controls: Queue state, drain controls, worker health, and run progress are visible because crawler jobs fail operationally, not just logically.
- Handoff output: CSV and PDF exports keep the findings portable outside the dashboard for review and implementation handoff.
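As a rough illustration of the CSV handoff described above, the sketch below serializes issue rows with Python's standard `csv` module. The column names in `ISSUE_COLUMNS` and the function `issues_to_csv` are hypothetical; the real export schema is not shown in this page.

```python
import csv
import io

# Hypothetical export columns; the actual ILCrawler schema is not shown.
ISSUE_COLUMNS = ["url", "issue_type", "severity", "detail"]


def issues_to_csv(issues):
    """Serialize issue dicts to CSV text with a fixed column order."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=ISSUE_COLUMNS)
    writer.writeheader()
    for row in issues:
        # Keep only the export columns so internal fields never leak out.
        writer.writerow({col: row.get(col, "") for col in ISSUE_COLUMNS})
    return buffer.getvalue()


sample = [
    {
        "url": "https://example.com/a",
        "issue_type": "broken_link",
        "severity": "high",
        "detail": "404 on /missing",
        "internal_id": 7,  # dropped from the export
    }
]
print(issues_to_csv(sample))
```

Fixing the column order in one place keeps exports stable across runs, which matters when the file is diffed or imported into a ticketing tool.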
Run detail
Crawl workbench
- Health: 94
- Pages: 117
- Resources: 3,873
- Raw crawl: running
- ML report: requested
- Exports: ready
FastAPI
Next.js
React
Tailwind CSS
PostgreSQL
Docker Compose
Playwright
Lighthouse
Backblaze B2
CSV/PDF Exports
UI screenshots
ILCrawler frontend
Current Next.js frontend excerpts. Local account details and sensitive identifiers are not shown.
Worker lease / job execution
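```python
from sqlalchemy import or_, select

# Excerpt: CrawlQueue, session, now, limit, lease_owner, and leased_at
# come from the surrounding worker code.
stmt = (
    select(CrawlQueue)
    .where(CrawlQueue.state == "queued")
    # Only rows that are available now: no available_at, or one in the past.
    .where(or_(CrawlQueue.available_at.is_(None), CrawlQueue.available_at <= now))
    .order_by(CrawlQueue.priority.desc(), CrawlQueue.id.asc())
    .limit(limit)
    # FOR UPDATE SKIP LOCKED: rows locked by another worker are passed over.
    .with_for_update(of=CrawlQueue, skip_locked=True)
)
for item in session.scalars(stmt):
    # Mark each selected row as leased by this worker.
    item.state = "leased"
    item.lease_owner = lease_owner
    item.leased_at = leased_at
    item.attempts += 1
```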
- Purpose: Lease queued URLs across concurrent crawl workers.
- Guardrail: Avoids duplicate work and stalled jobs when several workers run at once.
- Tradeoff: Uses PostgreSQL row locking instead of a separate queue service; simpler stack, tighter database coupling.
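The excerpt above shows how rows are leased but not how a lease is released when a worker dies mid-job. One common way to handle that, sketched here as an assumption rather than as ILCrawler's actual code, is to requeue rows whose lease is older than a timeout. SQLite stands in for PostgreSQL so the example runs on its own; the table name `crawl_queue`, the timeout value, and the function `requeue_expired_leases` are hypothetical, while the column names mirror the excerpt.

```python
import sqlite3

LEASE_TIMEOUT_SECONDS = 300  # hypothetical timeout, not from the source


def requeue_expired_leases(conn, now):
    """Return leased rows to the queue once their lease has expired."""
    cutoff = now - LEASE_TIMEOUT_SECONDS
    cursor = conn.execute(
        """
        UPDATE crawl_queue
        SET state = 'queued', lease_owner = NULL, leased_at = NULL
        WHERE state = 'leased' AND leased_at < ?
        """,
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount  # number of rows put back on the queue


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crawl_queue (id INTEGER PRIMARY KEY, url TEXT, state TEXT, "
    "lease_owner TEXT, leased_at REAL, attempts INTEGER)"
)
conn.executemany(
    "INSERT INTO crawl_queue (url, state, lease_owner, leased_at, attempts) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("https://example.com/", "leased", "worker-1", 1000.0, 1),   # stale lease
        ("https://example.com/a", "leased", "worker-2", 1290.0, 1),  # fresh lease
        ("https://example.com/b", "queued", None, None, 0),
    ],
)
requeued = requeue_expired_leases(conn, now=1400.0)
states = [row[0] for row in conn.execute("SELECT state FROM crawl_queue ORDER BY id")]
print(requeued, states)
```

Because `attempts` is incremented at lease time and left untouched here, a retry cap could be enforced on the next lease without extra bookkeeping.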
Fit
Relevant if you need crawler, audit, or SEO evidence tooling.
Related project pages
SEO decision layer
MarketEngine
Internal system for deciding whether an SEO task is worth doing before it becomes a ticket.
Search intelligence cockpit
SearchCaliber
Private dashboard that connects a site, its crawl evidence, market research, and approved SEO actions in one place.
Workflow / outreach ops
ReachLog
Single-owner outreach system for tracking targets, messages, replies, bounces, forms, follow-ups, and dedupe protection.