How We Index 9,000+ AI-Coder Repos Continuously

Every figure on this blog comes from a live pipeline that runs continuously on a 10-server GPU cluster: the harvester every 5 minutes, the full distill-and-render chain every 30. This post is the methodology disclosure.

Step 1: Harvest

We search GitHub for commits whose message carries an AI coder’s co-author trailer:

Co-authored-by: Claude Opus 4.7 <...>
Co-authored-by: Codex <...>
Co-authored-by: cursoragent@cursor.com
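
A minimal sketch of how a commit message could be matched against these trailers. The corpus labels and the exact regular expressions are assumptions made for illustration; the production scanner may match differently.

```python
import re
from typing import Optional

# Hypothetical mapping from trailer pattern to corpus label.
# The patterns mirror the three signatures listed above.
SIGNATURES = [
    (re.compile(r"^Co-authored-by:\s*Claude Opus 4\.7\b", re.I | re.M), "opus47"),
    (re.compile(r"^Co-authored-by:\s*Codex\b", re.I | re.M), "codex"),
    (re.compile(r"^Co-authored-by:.*cursoragent@cursor\.com", re.I | re.M), "cursor"),
]

def classify_commit(message: str) -> Optional[str]:
    """Return the label of the first trailer that matches, or None."""
    for pattern, label in SIGNATURES:
        if pattern.search(message):
            return label
    return None
```

Anchoring each pattern at the start of a line keeps a signature that is merely quoted mid-sentence in a commit body from counting as a trailer.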

For each matching commit, we run git clone --depth 50 on the parent repo and store the clone in a dated shard under /tank0/claude-archive/opus47/ (for Opus 4.7), /tank0/claude-archive/codex/, etc.
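
The clone step can be sketched as a function that builds the git command line. The archive root and the depth come from the text above; the shard layout (corpus, then ISO date, then owner__repo) and the HTTPS clone URL are assumptions.

```python
from datetime import date
from pathlib import Path

ARCHIVE_ROOT = Path("/tank0/claude-archive")  # root named in this post

def clone_command(full_name: str, corpus: str, day: date) -> list[str]:
    """Build the argv for a depth-50 clone into a dated shard.

    The <corpus>/<YYYY-MM-DD>/<owner>__<repo> layout is hypothetical.
    """
    owner, repo = full_name.split("/", 1)
    dest = ARCHIVE_ROOT / corpus / day.isoformat() / f"{owner}__{repo}"
    url = f"https://github.com/{full_name}.git"
    return ["git", "clone", "--depth", "50", url, str(dest)]
```

Returning an argv list rather than a shell string lets the caller pass it straight to subprocess.run without quoting repo names.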

The scanner runs every 5 minutes, paginates up to 3 pages of GitHub’s commit search (~300 commits per page × 3 ≈ 900 commit checks per run), and deduplicates against what’s already cloned. Current intake rate: ~400–500 new Opus 4.7 repos per hour.
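
The page budget and the deduplication can be sketched as below. Here fetch_page is a hypothetical callable standing in for the GitHub commit-search request, and the "repo" key on each hit is an assumed shape, not the API's actual response format.

```python
from typing import Callable, Iterable

MAX_PAGES = 3  # page budget per run, from the text above

def new_repos(fetch_page: Callable[[int], Iterable[dict]],
              already_cloned: set[str]) -> list[str]:
    """Collect repo names from up to MAX_PAGES pages, skipping known ones.

    fetch_page(n) is hypothetical: it returns the hits of page n, each a
    dict whose "repo" key holds an "owner/name" string.
    """
    seen = set(already_cloned)
    fresh = []
    for page in range(1, MAX_PAGES + 1):
        hits = list(fetch_page(page))
        if not hits:
            break  # results ran out before the page budget did
        for hit in hits:
            name = hit["repo"]
            if name not in seen:
                seen.add(name)  # also dedupes repeats within this run
                fresh.append(name)
    return fresh
```

Because many commits point at the same repo, the number of new clones per run is usually well below the number of commits checked.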

Step 2: Claim + Analyze (distributed)

A PostgreSQL table tracks every repo with a stage column:

  • 0 — ingested
  • 1 — indexed
  • 3 — extracted (analysis done, symbols + files + credential findings stored)
  • 4 — secured (credential scan completed)
  • 5 — distilled (has embedding + summary + catalog)
  • 99 — in-flight (claimed by a worker)
  • -1 — skipped (e.g., name-duplicate of an already-processed repo)
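
For readers who want to work with these codes, the stage values map naturally onto an enum. The numeric values are the ones listed above; the member names and the helper are illustrative.

```python
from enum import IntEnum

class Stage(IntEnum):
    SKIPPED = -1
    INGESTED = 0
    INDEXED = 1
    EXTRACTED = 3
    SECURED = 4
    DISTILLED = 5
    IN_FLIGHT = 99

def is_fully_processed(stage: int) -> bool:
    """True only for stage 5; raises ValueError on an unknown code."""
    return Stage(stage) is Stage.DISTILLED
```

Since 99 and -1 are sentinels rather than points on the progression, a comparison such as stage >= 3 would wrongly count in-flight repos as analyzed; matching on explicit members avoids that.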

Six GPU workers across the cluster each claim a batch of 100 stage-0 repos at a time via FOR UPDATE SKIP LOCKED, run them through batch_analyze.py, and move them to stage 3. A seventh worker on another node handles /storage/disk2/ repos. Each worker prioritizes /opus47/ paths first, then /codex/, then /community/.
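
A sketch of what such a claim query could look like. The table name repos and the columns id and path are assumptions; the stage values, the batch size, and the path priority follow the text above. The function accepts any DB-API cursor, for example one from psycopg.

```python
CLAIM_SQL = """
WITH picked AS (
    SELECT id
    FROM repos                      -- hypothetical table name
    WHERE stage = 0
    ORDER BY
        CASE
            WHEN strpos(path, '/opus47/') > 0 THEN 0
            WHEN strpos(path, '/codex/') > 0 THEN 1
            WHEN strpos(path, '/community/') > 0 THEN 2
            ELSE 3
        END,
        id
    LIMIT %s
    FOR UPDATE SKIP LOCKED          -- skip rows another worker has locked
)
UPDATE repos
SET stage = 99
FROM picked
WHERE repos.id = picked.id
RETURNING repos.id, repos.path
"""

def claim_batch(cur, batch_size: int = 100):
    """Mark up to batch_size queued repos as in-flight and return them.

    The caller is expected to commit the transaction afterwards.
    """
    cur.execute(CLAIM_SQL, (batch_size,))
    return cur.fetchall()
```

SKIP LOCKED is what lets several workers run the same statement concurrently: each one passes over rows that another transaction currently holds and takes the next unlocked ones instead of blocking.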

Analysis extracts:
- file language + role (source/test/docs/config/asset/build)
- symbol table (functions, methods, classes, structs, interfaces) with line ranges and docstrings
- import graph (source → target_external)
- credential findings (JWT secrets, AWS keys, private keys, etc.) via pattern rules
- package manifests (package.json, requirements.txt, Cargo.toml, go.mod) → workspace names
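
To illustrate the pattern-rule approach to credential findings, here is a small sketch. The three rules and their names are examples chosen for illustration, not the actual rule set in batch_analyze.py.

```python
import re

# Illustrative rules only; a real rule set is larger and more careful.
RULES = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "jwt_secret_assignment": re.compile(
        r"(?i)\bjwt[_-]?secret\b\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}

def scan_text(path: str, text: str) -> list[dict]:
    """Return one finding per (line, rule) match, with 1-based line numbers."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append({"path": path, "line": lineno, "rule": rule})
    return findings
```

Pattern rules like these are cheap and easy to audit, but they flag placeholders and test fixtures too, so findings are best read as candidates rather than confirmed leaks.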

Step 3: Catalog + Enrich

A separate fleet of Ollama instances runs:

  • fix4b (4× instances, gemma4 model): generates catalog.category, catalog.tags, catalog.tech from each repo’s file list and README.
  • fix6 (4× instances, gpt-oss:latest): generates structured llm_insights with purpose, primary_domain, key_patterns, notable_risks, reuse_potential, reference_quality.
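
A sketch of how the fix6 step could talk to Ollama. It builds a request body for Ollama’s /api/generate endpoint and validates the reply. The six keys and the model tag come from the text above; the prompt wording, the truncation limits, and the choice of file list plus README as input are assumptions.

```python
import json
from typing import Optional

INSIGHT_KEYS = ("purpose", "primary_domain", "key_patterns",
                "notable_risks", "reuse_potential", "reference_quality")

def build_request(readme: str, files: list[str],
                  model: str = "gpt-oss:latest") -> dict:
    """Body for POST /api/generate (non-streaming, JSON-constrained output)."""
    prompt = (
        "Return a JSON object with the keys " + ", ".join(INSIGHT_KEYS) + ".\n\n"
        "Files:\n" + "\n".join(files[:200]) + "\n\nREADME:\n" + readme[:4000]
    )
    return {"model": model, "prompt": prompt, "format": "json", "stream": False}

def parse_insights(generated: str) -> Optional[dict]:
    """Parse the model's generated text; None if malformed or incomplete."""
    try:
        data = json.loads(generated)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(k not in data for k in INSIGHT_KEYS):
        return None
    return {k: data[k] for k in INSIGHT_KEYS}
```

Rejecting incomplete replies instead of storing partial rows keeps llm_insights uniform, at the cost of retrying some repos.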

Both are priority-patched to process /opus47/ before anything else. Without the patch, Opus 4.7 repos, which have the highest repo_ids, would be processed last; the patch moves the newest, most interesting corpus to the head of the queue.

Step 4: Embed

The nomic-embed-text model (768-dim) turns each repo’s summary + purpose + tags + frameworks into a vector in repo_embeddings. It currently runs as a 4-instance sprint against a single Ollama host, at roughly 88,000 embeds/hour.
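
A sketch of how the embedding input could be assembled, plus the arithmetic behind the throughput figure. The field order, the newline separator, and the even split across instances are assumptions; the request body targets Ollama’s /api/embed endpoint.

```python
def embedding_text(summary: str, purpose: str,
                   tags: list[str], frameworks: list[str]) -> str:
    """Join the four fields into one string, dropping empty ones."""
    parts = [summary.strip(), purpose.strip(),
             ", ".join(tags), ", ".join(frameworks)]
    return "\n".join(p for p in parts if p)

def embed_request(text: str) -> dict:
    """Body for POST /api/embed on an Ollama host."""
    return {"model": "nomic-embed-text", "input": text}

# Throughput arithmetic for the figure quoted above.
EMBEDS_PER_HOUR = 88_000
INSTANCES = 4
per_second = EMBEDS_PER_HOUR / 3600               # about 24.4 embeds/s in total
per_instance_per_second = per_second / INSTANCES  # about 6.1 embeds/s each
```

At that rate a corpus of a few thousand repos re-embeds in minutes, which is why embedding is not the bottleneck in the 30-minute cycle.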

Step 5: Distill (where Repobility comes in)

Every 30 minutes, a pipeline under /tank0/repos_distillate/opus4.7/scripts/ re-queries Postgres and writes:
- stats.json — footprint, languages, types, quality vs community baseline
- golden.json — top 25 per archetype (MCP servers, agents, CLI tools, web apps, libraries)
- frameworks.json — framework popularity + Δ vs previous snapshot
- signatures.json — CLAUDE.md / AGENTS.md / ecosystem / naming patterns
- deep.json — LLM-parsed domains, top imports, function sizes, docstring coverage
- delta.json — what’s new since the previous snapshot
- report.html — combined dashboard
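
As an illustration of the "Δ vs previous snapshot" idea behind frameworks.json, the comparison can be sketched as follows. The input shape (framework name mapped to repo count) and the output field names are assumptions, not the file’s actual schema.

```python
def framework_delta(previous: dict[str, int],
                    current: dict[str, int]) -> list[dict]:
    """Compare two snapshots of framework counts.

    Frameworks missing from one side count as 0 there, so newly seen
    and vanished frameworks both show up in the result.
    """
    rows = []
    for name in set(previous) | set(current):
        before, now = previous.get(name, 0), current.get(name, 0)
        rows.append({"framework": name, "count": now, "delta": now - before})
    # Most popular first; ties broken alphabetically for stable output.
    rows.sort(key=lambda r: (-r["count"], r["framework"]))
    return rows
```

Usage: framework_delta({"react": 10, "flask": 4}, {"react": 12, "fastapi": 5}) lists react first with a delta of 2, then fastapi as a newcomer, then flask with a negative delta.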

A separate generator (/tank0/repobility/scripts/build.py) consumes these JSONs to refresh this site.

Step 6: Publish

This blog’s home-page KPIs are read live from the latest distillate snapshot. The blog index lists each post by date. Individual posts are generated from Markdown source files in /tank0/repobility/posts/.

Every 30 minutes the whole chain re-runs:

harvest → analyze → catalog → embed → distill → render

Numbers from the live pipeline

Stage             Opus 4.7 repos    Share
5 (distilled)     8,355             91%
4 (secured)       289               3%
3 (extracted)     ~100              1%
99 (in-flight)    510               6%
0 (queued)        24                <1%

So about 91% of Opus 4.7 repos were fully processed through the pipeline at the time of this snapshot. New arrivals typically drain through within ~10 minutes.
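
The share column can be recomputed from the counts with a few lines. The counts below are copied from the table; because the stage-3 count is approximate (~100), recomputed percentages may differ from the published ones by about a point.

```python
def stage_shares(counts: dict[str, int]) -> dict[str, float]:
    """Percentage of the total held by each stage."""
    total = sum(counts.values())
    if total == 0:
        return {stage: 0.0 for stage in counts}
    return {stage: 100 * n / total for stage, n in counts.items()}

# Counts from the table above; "3" is an approximate figure.
counts = {"5": 8355, "4": 289, "3": 100, "99": 510, "0": 24}
shares = stage_shares(counts)
```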

Transparency

If you want to verify any figure on this blog:

  • The underlying queries live in scripts/01_stats_snapshot.py and its sibling files in the distillate repo
  • Snapshots are preserved under /tank0/repos_distillate/opus4.7/snapshots/<date>_<hour>/
  • The dashboard at /admin/opus47 (admin-only) shows the live raw numbers
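
For anyone scripting against the snapshot directory, picking the most recent snapshot might look like the sketch below. The post only specifies the <date>_<hour> pattern; the concrete YYYY-MM-DD_HH format used here is an assumption.

```python
from datetime import datetime
from typing import Iterable, Optional

SNAPSHOT_FORMAT = "%Y-%m-%d_%H"  # assumed rendering of <date>_<hour>

def latest_snapshot(names: Iterable[str]) -> Optional[str]:
    """Return the newest snapshot directory name, ignoring other entries."""
    stamped = []
    for name in names:
        try:
            stamped.append((datetime.strptime(name, SNAPSHOT_FORMAT), name))
        except ValueError:
            continue  # not a snapshot directory
    return max(stamped)[1] if stamped else None
```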

Known limitations

  • GitHub API coverage: we only see commits that GitHub’s search API returns, within its rate limits. Private repos and yanked commits are invisible to us.
  • Depth-50 clones: if the signed commit sits more than 50 commits deep in the branch history, the shallow clone does not contain it, so we miss it.
  • LLM grading is imperfect: the 72% “high reuse” stat is a soft judgment, not a hard metric. We show it because it’s directionally useful, not because we’d stake a lawsuit on it.
  • Symbol parser coverage varies by language: TypeScript is better than Kotlin; Go and Rust are good; async keyword detection in TS/JS is known-broken and we call that out where relevant.

This methodology will evolve. We’ll note changes here when we make them.