We measured every source file in the Opus 4.7 corpus and bucketed the results by language. Median file length varies about 4× across languages, from 45 lines (Java) to 185 lines (Rust).
Median lines per source file
| Language | Files sampled | Median lines | p90 | Max |
|---|---|---|---|---|
| Rust | 26,859 | 185 | 709 | 15,933 |
| Python | 98,607 | 127 | 546 | 11,443 |
| Go | 17,975 | 122 | 460 | 13,444 |
| TypeScript | 331,030 | 93 | 407 | 15,215 |
| JavaScript | 62,793 | 75 | 462 | 23,907 |
| Java | 18,883 | 45 | 171 | 15,298 |
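
For readers who want to reproduce a table like this on their own code, here is a minimal sketch. It assumes you already have a list of per-file line counts for each language and uses a nearest-rank percentile; the post does not say which percentile method or line-counting rule produced the numbers above, so `percentile` and `summarize` are hypothetical helpers, not the original pipeline.

```python
import math
from statistics import median


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not sorted_values:
        raise ValueError("need at least one value")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(line_counts_by_language):
    """Return {language: stats}, ordered by median line count, largest first."""
    rows = {}
    for language, counts in line_counts_by_language.items():
        ordered = sorted(counts)
        rows[language] = {
            "files": len(ordered),
            "median": median(ordered),
            "p90": percentile(ordered, 90),
            "max": ordered[-1],
        }
    return dict(sorted(rows.items(), key=lambda kv: kv[1]["median"], reverse=True))


# Toy input for illustration only; these are not corpus figures.
sample = {
    "Java": [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000],
    "Rust": [100, 150, 200, 250, 300],
}
summary = summarize(sample)
```

Note how the toy Java row has a modest median and p90 but a very large max: a single outlier moves the max and nothing else.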
The 4× spread
The median Rust file is roughly 4× longer than the median Java file (185 vs. 45 lines, a ratio of about 4.1). Why?
Rust’s verbosity
Idiomatic Rust tends to involve:
- Explicit lifetime annotations wherever elision does not apply
- Borrow-checker-friendly code patterns (often more lines to satisfy the borrow checker)
- Derive attributes for serialization / cloning / debugging
- `mod` declarations in the crate root and parent modules
- Explicit error propagation with `?` or `match`
All of these are “more lines for the same semantic intent” compared to Python/TS. It’s not poor style; it’s the language.
Java’s compensating convention
Java keeps files small through:
- One class per file (the compiler allows at most one public top-level class per file, and convention and linters usually extend that to one class, period)
- Short class bodies split across many files
- Interface/implementation separation (adds more files, each smaller)
- Dependency injection frameworks encourage many small bean classes
A 500-line Java “God class” is considered an anti-pattern. Java’s median of 45 lines reflects that discipline.
Python and TypeScript in the middle
Both are expressive languages where ~100 lines can cover substantial functionality. The gap between Python’s 127 and TS’s 93 may reflect Python’s indentation-heavy control flow (slightly longer on average) vs TS’s shorter syntactic constructs, though the table alone cannot confirm the cause.
Long-tail outliers
Every language’s largest file exceeds 10K lines:
- JavaScript max: 23,907 lines, usually minified bundles or monolithic single-page apps
- Rust max: 15,933, large generated code or big modules
- Java max: 15,298, legacy God-class patterns
- TypeScript max: 15,215, likely autogenerated type definitions
- Go max: 13,444
- Python max: 11,443
These are pathological. The real story is in the median and p90.
p90 tells a story
Look at the p90 column (90% of files are smaller than this):
- Rust p90: 709 lines (long but manageable)
- Python p90: 546 (still manageable)
- TypeScript p90: 407 (very manageable)
- Java p90: 171 (one-class-per-file discipline visible)
Java is the only language where 90% of files stay under 200 lines. Every other language has a p90 above 400 lines, and in Rust and Python at least one file in ten runs past 500.
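
A p90 tells you where the 90% boundary sits, but not how many files cross a particular size such as 500 lines. If that is the number you care about, it is easy to compute directly. The sketch below assumes you have per-file line counts; `share_at_or_above` is a hypothetical helper name and the data is a toy example, not the corpus.

```python
def share_at_or_above(line_counts, threshold):
    """Fraction of files that have at least `threshold` lines."""
    if not line_counts:
        raise ValueError("need at least one file")
    long_files = sum(1 for n in line_counts if n >= threshold)
    return long_files / len(line_counts)


# Toy data: 20 files, two of them long.
toy = [40] * 18 + [600, 900]
share_long = share_at_or_above(toy, 500)  # 2 of the 20 files qualify
```

Running this per language gives a direct answer to “how big is the 500+ line cohort?” instead of inferring it from a single percentile.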
What this means for code-review UIs, diff tools, context windows
If you’re building AI-assisted code review or refactoring for Opus 4.7 code:
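- Rust code: expect files of around 200 lines routinely (median 185) and 700+ lines for the largest 10% (p90 709); size your AI context windows accordingly
- Python code: 127-line median is LLM-friendly; full-file context usually fits
- TypeScript: sweet spot, with a 93-line median and a 407-line p90. A median file fits a 4K token context easily; whether a p90 file does depends on your tokenizer and line lengths, so measure before relying on it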
- Java: tiny file size means cross-file reasoning is always needed; the relevant code is usually spread across 5-10 files
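
To turn line counts into a context budget, you need a tokens-per-line figure. The sketch below assumes roughly 10 tokens per line of code, which is a placeholder and not a measured value for this corpus; substitute a number from your own tokenizer. The function names are hypothetical, and the medians and p90s come from the table above.

```python
TOKENS_PER_LINE = 10  # assumption: rough placeholder, measure with your own tokenizer


def estimated_tokens(line_count, tokens_per_line=TOKENS_PER_LINE):
    """Rough token estimate for a file with `line_count` lines."""
    return line_count * tokens_per_line


def fits(line_count, context_tokens, reserve_tokens=0, tokens_per_line=TOKENS_PER_LINE):
    """True if the file is estimated to fit after reserving room for prompt and reply."""
    budget = context_tokens - reserve_tokens
    return estimated_tokens(line_count, tokens_per_line) <= budget


# (median, p90) line counts taken from the table above.
table = {
    "Rust": (185, 709),
    "Python": (127, 546),
    "TypeScript": (93, 407),
    "Java": (45, 171),
}

# For each language: does a median file fit in 4,096 tokens? Does a p90 file?
report = {
    language: (fits(med, 4096), fits(p90, 4096))
    for language, (med, p90) in table.items()
}
```

Under this assumed ratio a TypeScript p90 file only just fits in 4,096 tokens, so any reserve for instructions or output pushes it over. That is why the tokens-per-line figure is worth measuring rather than guessing.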
Implications for training data curation
When curating training data from this corpus by language:
- Rust and Python: sample whole files comfortably — they’re typically coherent units
- TypeScript: same, though framework conventions such as the Next.js App Router spread a single feature across multiple files
- Java: sample whole directories, not files — a single Java file rarely contains a full feature
- JavaScript: be cautious of bundles/minified files dragging up the max
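
One way to encode these rules is to group files into sampling units before drawing samples. The sketch below is illustrative only: the average-line-length threshold for spotting minified files is an assumed heuristic, and `looks_minified` and `sampling_units` are hypothetical names, not part of any existing tool.

```python
from collections import defaultdict
from pathlib import PurePosixPath

WHOLE_DIRECTORY_LANGUAGES = {"Java"}  # sample these by directory, per the list above
MAX_AVG_LINE_LENGTH = 200  # assumption: heuristic cutoff for minified or bundled files


def looks_minified(text):
    """Heuristic: very long average line length suggests a bundle or minified file."""
    lines = text.splitlines() or [""]
    return len(text) / len(lines) > MAX_AVG_LINE_LENGTH


def sampling_units(files):
    """Group (path, language, text) records into units: one file, or one directory."""
    units = defaultdict(list)
    for path, language, text in files:
        if language == "JavaScript" and looks_minified(text):
            continue  # skip likely bundles so they do not distort the sample
        if language in WHOLE_DIRECTORY_LANGUAGES:
            key = str(PurePosixPath(path).parent)
        else:
            key = path
        units[key].append(path)
    return dict(units)


files = [
    ("src/a/Foo.java", "Java", "class Foo {}\n"),
    ("src/a/Bar.java", "Java", "class Bar {}\n"),
    ("lib/x.py", "Python", "x = 1\n"),
    ("dist/app.min.js", "JavaScript", "a" * 5000),
]
units = sampling_units(files)
```

Here the two Java files end up in a single `src/a` unit, the Python file is its own unit, and the bundle-like JavaScript file is dropped.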
The Java paradox
Java has the smallest median (45 lines) but the third-largest max (15,298). The table shows that at least 90% of Java files are no longer than 171 lines, yet a small set of huge legacy files sits far out in the tail. If you’re grading the “health” of a Java repo, the median alone is misleading; look at a tail percentile such as the p99 as well.
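
A minimal sketch of such a check, assuming you have per-file line counts for one repo. It uses a nearest-rank percentile, and the `tail_ratio` field is an illustrative metric of our own rather than an established standard; the repo data is synthetic.

```python
import math
from statistics import median


def tail_report(line_counts, tail_percentile=99):
    """Compare the median file length with a tail percentile (nearest-rank method)."""
    ordered = sorted(line_counts)
    if not ordered:
        raise ValueError("need at least one file")
    rank = max(1, math.ceil(tail_percentile * len(ordered) / 100))
    tail = ordered[rank - 1]
    mid = median(ordered)
    return {
        "median": mid,
        "tail": tail,
        "max": ordered[-1],
        # How many times longer the tail file is than the typical file.
        "tail_ratio": tail / mid if mid else float("inf"),
    }


# Synthetic repo: 98 small files plus two very large ones.
repo = [45] * 98 + [9000, 15000]
report = tail_report(repo)
```

The median of this synthetic repo looks perfectly healthy at 45 lines, while the p99 file is 200 times longer. That gap is the signal the median hides.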
Related: The 45% Rule: Opus 4.7 Builds Production-Scale Code.