We harvest via two different signals:
- The opus47 scanner — searches GitHub for "Co-authored-by: Claude Opus 4.7" commits and clones the parent repos into /opus47/.
- The community scanner — a broader search that catches AI-coder-adjacent repos more generically and clones them into /community/.
When we joined the two corpora on repo name (not id), we found:
- Opus 4.7 unique names: 12,095
- Names also in community corpus: 7,124 (59%)
- Names also in codex corpus: 28 (0.2%)
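The join above is a plain set intersection on repo names. A minimal sketch (the function and sample repo names are illustrative, not our actual corpus loader):

```python
def overlap_report(opus47_names, community_names, codex_names):
    """Join corpora on repo full name and report overlap counts."""
    opus47 = set(opus47_names)
    in_community = opus47 & set(community_names)
    in_codex = opus47 & set(codex_names)
    return {
        "opus47_unique": len(opus47),
        "in_community": len(in_community),
        "in_community_pct": round(100 * len(in_community) / len(opus47), 1),
        "in_codex": len(in_codex),
    }

# Toy example: 2 of 3 opus47 names also appear in the community corpus.
report = overlap_report(
    ["alice/webapp", "bob/cli-tool", "carol/ml-demo"],
    ["alice/webapp", "bob/cli-tool", "dave/other"],
    [],
)
print(report)
```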
59% of Opus 4.7 names already exist in the community corpus. That’s a striking number that initially looked like “there’s massive duplication.” It’s not — it’s a methodology artifact.
What’s actually happening
When Opus 4.7 co-authors a commit in a repo, that repo often:
- Has other commits by other AI coders (Codex, Claude Sonnet, Cursor)
- Has human-only commits too
- Is popular enough to show up in the general community scan
So our community scanner (which looks at a broader signal set) finds the same physical repo and clones it into /community/. Meanwhile the opus47 scanner (which specifically hunts for Claude Opus 4.7 co-author strings) also clones it into /opus47/.
Result: one GitHub repo, two copies in our dataset, each under a different path.
How we verified it’s not a quality difference
Of the 7,124 overlapping pairs, we could complete quality scoring for 16 pairs (most of the overlap is still being graded).
| Metric | Opus 4.7 copy | Community copy |
|---|---|---|
| Quality score avg | 56.7 | 56.8 |

| Pairwise comparison | Pairs |
|---|---|
| Opus copy higher | 7 |
| Same score | 1 |
| Community copy higher | 8 |
Statistically identical, which is exactly what you’d expect if they’re copies of the same repo at the same commit. The remaining quality variance reduces to parser noise.
Why it matters
1. Skip-ratio in the analysis pipeline
When Opus 4.7 repos start arriving for analysis, our batch_analyze worker maintains a “completed names” set to avoid re-processing. It found that 99% of the opus47 claims were already in completed_names — because they’d been processed under their community path first.
This is exactly the bug we chased earlier in the week. The solution was to promote by name rather than by id when reclassifying stages.
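The skip check amounts to keying the completed set on the GitHub full name rather than the per-scanner internal id. A hedged sketch (the Repo fields and filter_unprocessed helper are illustrative, not our actual worker code):

```python
from dataclasses import dataclass

@dataclass
class Repo:
    repo_id: str    # internal id; differs between the opus47 and community copies
    full_name: str  # GitHub owner/name; shared by all copies of the same repo

def filter_unprocessed(repos, completed_names):
    """Return only repos not yet processed under any scanner path.

    Keying on full_name (not repo_id) is what makes the opus47 copy of an
    already-analyzed community repo get skipped instead of re-processed.
    """
    todo = []
    for repo in repos:
        if repo.full_name in completed_names:
            continue  # already analyzed, possibly under a different local path
        completed_names.add(repo.full_name)
        todo.append(repo)
    return todo

completed = {"alice/webapp"}  # processed earlier under /community/
batch = [Repo("opus47-001", "alice/webapp"), Repo("opus47-002", "bob/cli-tool")]
print([r.full_name for r in filter_unprocessed(batch, completed)])  # ['bob/cli-tool']
```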
2. Training data de-duplication
If you’re pulling training data from this corpus, you need to de-duplicate by repo name + commit SHA, not by our internal path. Otherwise you’re seeing every overlapping repo twice.
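A minimal sketch of that de-duplication rule, keyed on (full name, commit SHA) and deliberately ignoring the local path (the record fields here are illustrative):

```python
def dedup_examples(examples):
    """Drop duplicate training examples that differ only in scanner path."""
    seen = set()
    unique = []
    for ex in examples:
        key = (ex["full_name"], ex["commit_sha"])  # NOT ex["local_path"]
        if key in seen:
            continue  # same repo+commit already kept from another scanner path
        seen.add(key)
        unique.append(ex)
    return unique

# The same commit harvested by both scanners collapses to one example.
examples = [
    {"full_name": "alice/webapp", "commit_sha": "abc123", "local_path": "/opus47/alice_webapp"},
    {"full_name": "alice/webapp", "commit_sha": "abc123", "local_path": "/community/alice_webapp"},
]
print(len(dedup_examples(examples)))  # 1
```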
3. The community scanner is effective
The positive read: our community scanner is broad enough to pick up 59% of Opus 4.7 repos without specifically targeting them. The opus47 scanner adds about 4,900 repos (the non-overlapping 41%) that the community scan misses. Both are necessary to achieve full coverage.
What the 4,900 non-overlap repos look like
These are Opus 4.7 repos that the community scanner didn’t find. Common traits:
- Smaller repos — less likely to be indexed by the broader community scan
- Lower star counts — the community scanner weights by stars; these fall below the threshold
- Very recent creations — they haven’t been through the general scan’s cycle yet
- Private-ish — some are public but not linked to by other public repos
This subset is where the “Opus 4.7 specific” signal is strongest. If you’re studying what Opus 4.7 uniquely produces (vs. what it happens to co-author in already-well-known projects), filter to this 4,900.
Methodology lesson
Path-based scanning gives you a superset of the signal you want. Name-based dedup is essential. When building corpora via multiple scanners:
- Never claim “N unique repos” based on scanner path — that’s N repo clones, not N unique repos
- Dedup by GitHub URL or repo full_name, not by your local path
- Track signals per commit, not per repo — a repo with 50 commits might have 10 different AI co-authors across them
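Per-commit tracking comes down to reading the standard Git Co-authored-by trailers out of each commit message and counting attributions, rather than tagging the whole repo once. A sketch under that assumption (the regex and sample messages are illustrative):

```python
import re
from collections import Counter

# Matches the standard Git trailer: "Co-authored-by: Name <email>"
TRAILER = re.compile(r"^Co-authored-by:\s*(.+?)\s*<", re.MULTILINE)

def coauthor_counts(commit_messages):
    """Count AI co-author attributions across commits, not per repo."""
    counts = Counter()
    for msg in commit_messages:
        for name in TRAILER.findall(msg):
            counts[name] += 1
    return counts

msgs = [
    "Fix login bug\n\nCo-authored-by: Claude Opus 4.7 <noreply@anthropic.com>",
    "Add tests\n\nCo-authored-by: Codex <codex@openai.com>",
    "Refactor\n\nCo-authored-by: Claude Opus 4.7 <noreply@anthropic.com>",
]
print(coauthor_counts(msgs))
```

A repo whose 50 commits split across several co-authors shows up in each co-author's tally proportionally, instead of being attributed wholesale to whichever scanner found it first.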
Our distillate pipeline now reports cross_source_overlap as a first-class metric for exactly this reason.
The actual Opus 4.7 count
After name-based dedup:
- Opus 4.7 unique repos (across all scanners): ~12,067
- Opus 4.7 exclusive repos (not in other scanners): 4,943
Both numbers are correct. Which one you want depends on the question. For training data, 12,067. For “what does Opus 4.7 uniquely produce?”, 4,943.