We harvest via two different signals:
- The opus47 scanner — searches GitHub for "Co-authored-by: Claude Opus 4.7" commits and clones the parent repos into /opus47/.
- The community scanner — a broader search that catches AI-coder-adjacent repos more generically and clones them into /community/.
When we joined the two corpora on repo name (not id), we found:
- Opus 4.7 unique names: 12,095
- Names also in community corpus: 7,124 (59%)
- Names also in codex corpus: 28 (0.2%)
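The join above is a plain set intersection on repo names. A minimal sketch (the function and sample repo names are illustrative, not our actual corpus loader):

```python
def overlap_report(opus47_names, community_names, codex_names):
    """Join corpora on repo full name and report overlap counts."""
    opus47 = set(opus47_names)
    in_community = opus47 & set(community_names)
    in_codex = opus47 & set(codex_names)
    return {
        "opus47_unique": len(opus47),
        "in_community": len(in_community),
        "in_community_pct": round(100 * len(in_community) / len(opus47), 1),
        "in_codex": len(in_codex),
    }

# Toy example: 2 of 3 opus47 names also appear in the community corpus.
report = overlap_report(
    ["alice/webapp", "bob/cli-tool", "carol/ml-demo"],
    ["alice/webapp", "bob/cli-tool", "dave/other"],
    [],
)
print(report)
```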
59% of Opus 4.7 names already exist in the community corpus. That’s a striking number that initially looked like “there’s massive duplication.” It’s not — it’s a methodology artifact.
What’s actually happening
When Opus 4.7 co-authors a commit in a repo, that repo often:
- Has other commits by other AI coders (Codex, Claude Sonnet, Cursor)
- Has human-only commits too
- Is popular enough to show up in the general community scan
So our community scanner (which looks at a broader signal set) finds the same physical repo and clones it into /community/. Meanwhile the opus47 scanner (which specifically hunts for Claude Opus 4.7 co-author strings) also clones it into /opus47/.
Result: one GitHub repo, two copies in our dataset, each under a different path.
How we verified it’s not a quality difference
Of the 7,124 overlapping pairs, we could complete quality scoring for 16 pairs (most of the overlap is still being graded).
| Metric | Opus 4.7 copy | Community copy |
|---|---|---|
| Quality score avg | 56.7 | 56.8 |

| Pairwise comparison | Pairs |
|---|---|
| Opus copy higher | 7 |
| Same score | 1 |
| Community copy higher | 8 |
Statistically identical, which is exactly what you’d expect if they’re copies of the same repo at the same commit. The remaining quality variance reduces to parser noise.
Why it matters
1. Skip-ratio in the analysis pipeline
When Opus 4.7 repos start arriving for analysis, our batch_analyze worker maintains a “completed names” set to avoid re-processing. It found that 99% of the opus47 claims were already in completed_names — because they’d been processed under their community path first.
This is exactly the bug we chased earlier in the week. The solution was to promote by name rather than by id when reclassifying stages.
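The skip check amounts to keying the completed set on the GitHub full name rather than the per-scanner internal id. A hedged sketch (the Repo fields and filter_unprocessed helper are illustrative, not our actual worker code):

```python
from dataclasses import dataclass

@dataclass
class Repo:
    repo_id: str    # internal id; differs between the opus47 and community copies
    full_name: str  # GitHub owner/name; shared by all copies of the same repo

def filter_unprocessed(repos, completed_names):
    """Return only repos not yet processed under any scanner path.

    Keying on full_name (not repo_id) is what makes the opus47 copy of an
    already-analyzed community repo get skipped instead of re-processed.
    """
    todo = []
    for repo in repos:
        if repo.full_name in completed_names:
            continue  # already analyzed, possibly under a different local path
        completed_names.add(repo.full_name)
        todo.append(repo)
    return todo

completed = {"alice/webapp"}  # processed earlier under /community/
batch = [Repo("opus47-001", "alice/webapp"), Repo("opus47-002", "bob/cli-tool")]
print([r.full_name for r in filter_unprocessed(batch, completed)])  # ['bob/cli-tool']
```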
2. Training data de-duplication
If you’re pulling training data from this corpus, you need to de-duplicate by repo name + commit SHA, not by our internal path. Otherwise you’re seeing every overlapping repo twice.
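A minimal sketch of that de-duplication rule, keyed on (full name, commit SHA) and deliberately ignoring the local path (the record fields here are illustrative):

```python
def dedup_examples(examples):
    """Drop duplicate training examples that differ only in scanner path."""
    seen = set()
    unique = []
    for ex in examples:
        key = (ex["full_name"], ex["commit_sha"])  # NOT ex["local_path"]
        if key in seen:
            continue  # same repo+commit already kept from another scanner path
        seen.add(key)
        unique.append(ex)
    return unique

# The same commit harvested by both scanners collapses to one example.
examples = [
    {"full_name": "alice/webapp", "commit_sha": "abc123", "local_path": "/opus47/alice_webapp"},
    {"full_name": "alice/webapp", "commit_sha": "abc123", "local_path": "/community/alice_webapp"},
]
print(len(dedup_examples(examples)))  # 1
```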
3. The community scanner is effective
The positive read: our community scanner is broad enough to pick up 59% of Opus 4.7 repos without specifically targeting them. The opus47 scanner adds about 4,900 repos (the non-overlapping 41%) that the community scan misses. Both are necessary to achieve full coverage.
What the 4,900 non-overlap repos look like
These are Opus 4.7 repos that the community scanner didn’t find. Common traits:
- Smaller repos — less likely to be indexed by the broader community scan
- Lower star counts — the community scanner weights by stars; these fall below the threshold
- Very recent creations — they haven’t been through the general scan’s cycle yet
- Private-ish — some are public but not linked to by other public repos
This subset is where the “Opus 4.7 specific” signal is strongest. If you’re studying what Opus 4.7 uniquely produces (vs. what it happens to co-author in already-well-known projects), filter to this 4,900.
Methodology lesson
Path-based scanning gives you a superset of the signal you want. Name-based dedup is essential. When building corpora via multiple scanners:
- Never claim “N unique repos” based on scanner path — that’s N repo clones, not N unique repos
- Dedup by GitHub URL or repo full_name, not by your local path
- Track signals per commit, not per repo — a repo with 50 commits might have 10 different AI co-authors across them
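Per-commit tracking comes down to reading the standard Git Co-authored-by trailers out of each commit message and counting attributions, rather than tagging the whole repo once. A sketch under that assumption (the regex and sample messages are illustrative):

```python
import re
from collections import Counter

# Matches the standard Git trailer: "Co-authored-by: Name <email>"
TRAILER = re.compile(r"^Co-authored-by:\s*(.+?)\s*<", re.MULTILINE)

def coauthor_counts(commit_messages):
    """Count AI co-author attributions across commits, not per repo."""
    counts = Counter()
    for msg in commit_messages:
        for name in TRAILER.findall(msg):
            counts[name] += 1
    return counts

msgs = [
    "Fix login bug\n\nCo-authored-by: Claude Opus 4.7 <noreply@anthropic.com>",
    "Add tests\n\nCo-authored-by: Codex <codex@openai.com>",
    "Refactor\n\nCo-authored-by: Claude Opus 4.7 <noreply@anthropic.com>",
]
print(coauthor_counts(msgs))
```

A repo whose 50 commits split across several co-authors shows up in each co-author's tally proportionally, instead of being attributed wholesale to whichever scanner found it first.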
Our distillate pipeline now reports cross_source_overlap as a first-class metric for exactly this reason.
The actual Opus 4.7 count
After name-based dedup:
- Opus 4.7 unique repos (across all scanners): ~12,067
- Opus 4.7 exclusive repos (not in other scanners): 4,943
Both numbers are correct. Which one you want depends on the question. For training data, 12,067. For “what does Opus 4.7 uniquely produce?”, 4,943.