How We Index 9,000+ AI-Coder Repos Continuously
Every figure on this blog comes from a live pipeline that re-runs every few minutes on a 10-server GPU cluster. This post is the methodology disclosure.
Step 1: Harvest
We search GitHub for commits matching the AI-coder’s signature:
Co-authored-by: Claude Opus 4.7 <...>
Co-authored-by: Codex <...>
Co-authored-by: cursoragent@cursor.com
For each matching commit, we git clone --depth 50 the parent repo into dated shards under /tank0/claude-archive/opus47/ (for Opus 4.7), /tank0/claude-archive/codex/, etc.
The scanner runs every 5 minutes, paginates up to 3 pages of GitHub’s commit search (~300 commits per page × 3 ≈ 900 commit checks per run), and deduplicates against what’s already cloned. Current intake rate: ~400–500 new Opus 4.7 repos per hour.
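The matching and deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the production scanner: the `SIGNATURES` mapping, the corpus names, and the shape of each search hit (`repo` and `message` keys) are assumptions made for the example.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Trailer patterns based on the signatures listed above. The corpus names
# on the left are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "opus47": re.compile(r"^Co-authored-by: Claude Opus 4\.7\b", re.I | re.M),
    "codex": re.compile(r"^Co-authored-by: Codex\b", re.I | re.M),
    "cursor": re.compile(r"^Co-authored-by:.*cursoragent@cursor\.com", re.I | re.M),
}


def classify_commit(message: str) -> Optional[str]:
    """Return the corpus whose trailer appears in the commit message, if any."""
    for corpus, pattern in SIGNATURES.items():
        if pattern.search(message):
            return corpus
    return None


def new_repos(hits: Iterable[dict], already_cloned: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (repo, corpus) once per repo that matches and is not yet cloned."""
    seen = set(already_cloned)
    for hit in hits:
        corpus = classify_commit(hit["message"])
        if corpus is None or hit["repo"] in seen:
            continue
        seen.add(hit["repo"])  # several commits may point at the same repo
        yield hit["repo"], corpus
```

Deduplicating on the repo name rather than the commit keeps a busy repo with many signed commits from being cloned more than once per run.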
Step 2: Claim + Analyze (distributed)
A PostgreSQL table tracks every repo with a stage column:
- 0 — ingested
- 1 — indexed
- 3 — extracted (analysis done, symbols + files + credential findings stored)
- 4 — secured (credential scan completed)
- 5 — distilled (has embedding + summary + catalog)
- 99 — in-flight (claimed by a worker)
- -1 — skipped (e.g., name-duplicate of an already-processed repo)
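One way to keep these codes readable in worker code is an enum. The sketch below only restates the list above; the member names and the `is_fully_processed` helper are hypothetical.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Stage codes from the list above; member names are illustrative."""
    SKIPPED = -1
    INGESTED = 0
    INDEXED = 1
    EXTRACTED = 3
    SECURED = 4
    DISTILLED = 5
    IN_FLIGHT = 99


def is_fully_processed(stage: int) -> bool:
    """True only for distilled repos.

    Compare by identity rather than with >=, because the in-flight
    marker (99) is numerically larger than every real stage.
    """
    return Stage(stage) is Stage.DISTILLED
```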
Six GPU workers across the cluster pull a batch of 100 stage-0 repos at a time via FOR UPDATE SKIP LOCKED, run them through batch_analyze.py, and move them to stage 3. A 7th worker on another node handles /storage/disk2/ repos. Each worker auto-prioritizes /opus47/ paths over /codex/ over /community/.
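A claim step of this kind might look like the sketch below. The table name `repos` and the columns `id` and `path` are assumptions, as is the exact ordering expression; only the stage codes, the batch size of 100, the path priority, and the use of FOR UPDATE SKIP LOCKED come from the description above.

```python
# Hypothetical schema: repos(id, path, stage). SKIP LOCKED lets concurrent
# workers each take a different batch without waiting on one another.
CLAIM_SQL = """
WITH batch AS (
    SELECT id
    FROM repos
    WHERE stage = 0
    ORDER BY
        CASE
            WHEN path LIKE '%/opus47/%' THEN 0
            WHEN path LIKE '%/codex/%' THEN 1
            WHEN path LIKE '%/community/%' THEN 2
            ELSE 3
        END,
        id
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
UPDATE repos
SET stage = 99
FROM batch
WHERE repos.id = batch.id
RETURNING repos.id, repos.path;
"""

PRIORITY = ("/opus47/", "/codex/", "/community/")


def path_priority(path: str) -> int:
    """Python mirror of the CASE expression: lower rank is processed first."""
    for rank, marker in enumerate(PRIORITY):
        if marker in path:
            return rank
    return len(PRIORITY)


def claim_batch(conn):
    """Claim up to 100 queued repos using any DB-API connection."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)  # no parameters, so '%' in LIKE needs no escaping
        rows = cur.fetchall()
        conn.commit()  # committing releases the row locks; rows stay at stage 99
        return rows
    finally:
        cur.close()
```

Marking claimed rows as 99 inside the same statement means a worker that crashes after the commit leaves a visible in-flight row rather than a silently lost one.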
Analysis extracts:
- file language + role (source/test/docs/config/asset/build)
- symbol table (functions, methods, classes, structs, interfaces) with line ranges and docstrings
- import graph (source → target_external)
- credential findings (JWT secrets, AWS keys, private keys, etc.) via pattern rules
- package manifests (package.json, requirements.txt, Cargo.toml, go.mod) → workspace names
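To give a feel for the credential step, here is a small sketch of pattern-rule scanning. The three rules are illustrative stand-ins, not the rule set batch_analyze.py actually ships; the rule names and the `Finding` type are hypothetical.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    rule: str
    line: int  # 1-based line number within the scanned text


# Illustrative rules only; a real rule set would be larger and tuned
# against false positives.
RULES = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(
        r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
    ),
    "jwt_secret_assignment": re.compile(
        r"\bjwt[_-]?secret\b\s*[:=]\s*['\"][^'\"]{8,}['\"]", re.IGNORECASE
    ),
}


def scan_text(text: str) -> List[Finding]:
    """Return one finding per (line, rule) pair that matches."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append(Finding(rule, lineno))
    return findings
```

Storing the rule name and line number, rather than the matched secret itself, keeps the findings table from becoming a second copy of the leaked credentials.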
Step 3: Catalog + Enrich
A separate fleet of Ollama instances runs:
- fix4b (4× instances, gemma4 model): generates catalog.category, catalog.tags, catalog.tech from each repo’s file list and README.
- fix6 (4× instances, gpt-oss:latest): generates structured llm_insights with purpose, primary_domain, key_patterns, notable_risks, reuse_potential, reference_quality.
Both are priority-patched to process /opus47/ before anything else. Without the patch, Opus 4.7’s high repo_ids would be processed last — the patch gets the newest, most interesting corpus to the head of the queue.
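Because the insights come back from a language model, it is worth checking the reply before storing it. The sketch below assumes the model returns a single JSON object; `parse_insights` and `REQUIRED_KEYS` are hypothetical names, while the six field names are the ones listed above.

```python
import json

# The six llm_insights fields named above.
REQUIRED_KEYS = (
    "purpose",
    "primary_domain",
    "key_patterns",
    "notable_risks",
    "reuse_potential",
    "reference_quality",
)


def parse_insights(raw: str) -> dict:
    """Parse a model reply and keep only the expected fields.

    Raises ValueError if the reply is not a JSON object or lacks a field,
    so the caller can retry or mark the repo for a later pass.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return {key: data[key] for key in REQUIRED_KEYS}
```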
Step 4: Embed
The nomic-embed-text model (768-dim) turns each repo’s summary + purpose + tags + frameworks into a vector in repo_embeddings. It currently runs as a 4-instance sprint against a single Ollama host, at 88,000 embeds/hour.
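As a rough illustration, the text handed to the embedding model could be assembled as below, with a dimension check on what comes back. The function names and the newline-joined layout are assumptions; the 768 dimensions, the four input fields, and the throughput figure are from the paragraph above.

```python
from typing import List, Sequence

EMBED_DIM = 768  # nomic-embed-text output size, as stated above


def build_embed_text(summary: str, purpose: str,
                     tags: Sequence[str], frameworks: Sequence[str]) -> str:
    """Join the four inputs into one string, skipping empty parts."""
    parts = [summary, purpose, ", ".join(tags), ", ".join(frameworks)]
    return "\n".join(p.strip() for p in parts if p and p.strip())


def check_vector(vector: List[float]) -> List[float]:
    """Reject vectors of the wrong size before they reach repo_embeddings."""
    if len(vector) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} dims, got {len(vector)}")
    return vector


# Throughput arithmetic for the quoted rate.
EMBEDS_PER_HOUR = 88_000
per_second = EMBEDS_PER_HOUR / 3600  # roughly 24.4 embeds per second overall
per_instance = per_second / 4        # roughly 6.1 per second for each instance
```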
Step 5: Distill (where Repobility comes in)
Every 30 minutes, a pipeline under /tank0/repos_distillate/opus4.7/scripts/ re-queries Postgres and writes:
- stats.json — footprint, languages, types, quality vs community baseline
- golden.json — top 25 per archetype (MCP servers, agents, CLI tools, web apps, libraries)
- frameworks.json — framework popularity + Δ vs previous snapshot
- signatures.json — CLAUDE.md / AGENTS.md / ecosystem / naming patterns
- deep.json — LLM-parsed domains, top imports, function sizes, docstring coverage
- delta.json — what’s new since the previous snapshot
- report.html — combined dashboard
A separate generator (/tank0/repobility/scripts/build.py) consumes these JSONs to refresh this site.
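The Δ column in frameworks.json is a straightforward comparison of two snapshots. A minimal sketch, assuming each snapshot is a plain mapping from framework name to repo count (the real file layout may differ, and `framework_delta` is a hypothetical name):

```python
from typing import Dict, List


def framework_delta(current: Dict[str, int], previous: Dict[str, int]) -> List[dict]:
    """Return per-framework counts with the change since the last snapshot.

    Frameworks that appear in only one snapshot are treated as count 0
    in the other, so newly seen and vanished frameworks both show up.
    """
    rows = []
    for name in set(current) | set(previous):
        now = current.get(name, 0)
        rows.append({
            "framework": name,
            "count": now,
            "delta": now - previous.get(name, 0),
        })
    # Most popular first; ties broken alphabetically for stable output.
    rows.sort(key=lambda row: (-row["count"], row["framework"]))
    return rows
```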
Step 6: Publish
This blog’s home-page KPIs are read live from the latest distillate snapshot. The blog index lists each post by date. Individual posts are generated from Markdown source files in /tank0/repobility/posts/.
Every 30 minutes the whole chain re-runs:
harvest → analyze → catalog → embed → distill → render
Numbers from the live pipeline
| Stage | Opus 4.7 repos | Share |
|---|---|---|
| 5 (distilled) | 8,355 | 91% |
| 4 (secured) | 289 | 3% |
| 3 (extracted) | ~100 | 1% |
| 99 (in-flight) | 510 | 6% |
| 0 (queued) | 24 | <1% |
So 91% of Opus 4.7 repos are fully processed through the pipeline at any given moment. New arrivals drain through within ~10 minutes.
Transparency
If you want to verify any figure on this blog:
- The underlying queries are the scripts/01_stats_snapshot.py and sibling files in the distillate repo
- Snapshots are preserved under /tank0/repos_distillate/opus4.7/snapshots/<date>_<hour>/
- The dashboard at /admin/opus47 (admin-only) shows the live raw numbers
Known limitations
- GitHub API visibility and rate limits: we only see commits that GitHub’s commit search returns within our request budget. Private repos and yanked commits never reach the harvester.
- Depth 50 clones: if the signed commit sits more than 50 commits below the branch tip, the shallow clone does not contain it and we miss it.
- LLM grading is imperfect: the 72% “high reuse” stat is a soft judgment, not a hard metric. We show it because it’s directionally useful, not because we’d stake a lawsuit on it.
- Symbol parser coverage varies by language: TypeScript is better than Kotlin; Go and Rust are good; async keyword detection in TS/JS is known-broken and we call that out where relevant.
This methodology will evolve. We’ll note changes here when we make them.