We measured every source file in the Opus 4.7 corpus and bucketed the results by language. Median file length differs by roughly 4× between the longest language (Rust, 185 lines) and the shortest (Java, 45 lines).

Median lines per source file

Language      Files sampled   Median lines   p90      Max
Rust                 26,859            185   709   15,933
Python               98,607            127   546   11,443
Go                   17,975            122   460   13,444
TypeScript          331,030             93   407   15,215
JavaScript           62,793             75   462   23,907
Java                 18,883             45   171   15,298
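
The columns above can be reproduced from raw per-file line counts. The following is a minimal sketch, assuming the p90 column uses a nearest-rank percentile (the article does not say which method was used). The names percentile and summarize and the toy sample are hypothetical.

```python
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(line_counts):
    """Return the four statistics reported per language in the table."""
    return {
        "files": len(line_counts),
        "median": statistics.median(line_counts),
        "p90": percentile(line_counts, 90),
        "max": max(line_counts),
    }


# Hypothetical toy sample of per-file line counts, not taken from the corpus.
sample = [12, 30, 45, 45, 60, 80, 120, 171, 300, 15298]
stats = summarize(sample)
print(stats)
```

Note how a single very large file sets the max without moving the median or p90, which is why the table reports all three.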

The 4× spread

At the median, Rust files are roughly 4× longer than Java files (185 vs. 45 lines). Why?
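
The spread can be checked directly from the median column. A small sketch that uses the table's values, with Java as the baseline (variable names are hypothetical):

```python
# Median lines per file, copied from the table above.
MEDIAN_LINES = {
    "Rust": 185,
    "Python": 127,
    "Go": 122,
    "TypeScript": 93,
    "JavaScript": 75,
    "Java": 45,
}

baseline = min(MEDIAN_LINES.values())  # Java, the smallest median
ratios = {lang: median / baseline for lang, median in MEDIAN_LINES.items()}

for lang, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<11}{ratio:.2f}x")
```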

Rust’s verbosity

Rust code tends to carry:
- Explicit lifetime annotations where elision does not apply
- Borrow-checker-friendly code patterns (often more lines to satisfy the borrow checker)
- Derive attributes for serialization / cloning / debugging
- use imports at the top of each file and mod declarations in parent modules
- Explicit error propagation with ? or match

All of these mean more lines for the same semantic intent compared to Python/TS. It’s not poor style; it’s the language.

Java’s compensating convention

Java keeps files small through:
- One top-level class per file (the compiler requires it for public classes; linters often enforce it for the rest)
- Short class bodies split across many files
- Interface/implementation separation (adds more files, each smaller)
- Dependency injection frameworks encourage many small bean classes

A 500-line Java “God class” is considered an anti-pattern. Java’s median of 45 lines reflects that discipline.

Python and TypeScript in the middle

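Both are expressive languages where ~100 lines can cover substantial functionality. The gap between Python’s 127 and TS’s 93 may reflect Python’s indentation-heavy control flow (slightly longer on average) vs TS’s shorter syntactic constructs, though the data here does not isolate the cause. Go (122) and JavaScript (75) fall in the same band.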

Long-tail outliers

Every language has some 10K+ LOC files:

  • JavaScript max: 23,907 lines (likely bundled output or monolithic one-page apps)
  • Rust max: 15,933 lines (likely large generated code or big modules)
  • Java max: 15,298 lines (likely legacy God-class patterns)
  • TypeScript max: 15,215 lines (likely autogenerated type definitions)

These are pathological. The real story is in the median and p90.
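
One way to keep such files from distorting an analysis is to flag them before computing statistics. This is a rough heuristic sketch, not the method used for the table: the 10,000-line cutoff and the filename suffixes are assumptions, and is_probable_outlier and the example paths are hypothetical.

```python
OUTLIER_LINES = 10_000  # assumed cutoff, matching the "10K+" framing above


def is_probable_outlier(path, line_count, threshold=OUTLIER_LINES):
    """Flag files that are likely generated, bundled, or otherwise atypical."""
    minified_name = path.endswith((".min.js", ".min.css"))
    return minified_name or line_count >= threshold


# Hypothetical (path, line count) pairs.
files = [
    ("src/app.ts", 93),
    ("dist/vendor.min.js", 3),
    ("gen/schema.rs", 15_933),
]
kept = [path for path, lines in files if not is_probable_outlier(path, lines)]
print(kept)
```

The name check matters because a minified file can be very large in bytes while having only a handful of lines, so a line-count cutoff alone would miss it.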

p90 tells a story

Look at the p90 column (90% of files are smaller than this):

  • Rust p90: 709 lines (long but manageable)
  • Python p90: 546 (still manageable)
  • TypeScript p90: 407 (very manageable)
  • Java p90: 171 (one-class-per-file discipline visible)

Java is the only language whose p90 is under 200 lines, meaning 90% of its files are shorter than that. Every other language has a p90 above 400, and in Rust and Python roughly one file in ten runs past 500 lines.
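
The p90 column can be checked against two thresholds, 200 and 500 lines. A minimal sketch using the table's values (variable names are hypothetical):

```python
# p90 lines per file, copied from the table above.
P90_LINES = {
    "Rust": 709,
    "Python": 546,
    "Go": 460,
    "TypeScript": 407,
    "JavaScript": 462,
    "Java": 171,
}

# Languages where 90% of files are shorter than 200 lines.
under_200 = [lang for lang, p90 in P90_LINES.items() if p90 < 200]

# Languages where the p90 itself is past 500 lines.
over_500 = [lang for lang, p90 in P90_LINES.items() if p90 > 500]

print("p90 under 200:", under_200)
print("p90 over 500:", over_500)
```

A p90 above 500 means roughly 10% or more of that language's files exceed 500 lines; a p90 below 500 means fewer than 10% do.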

What this means for code-review UIs, diff tools, context windows

If you’re building AI-assisted code review or refactoring for Opus 4.7 code:

  • Rust code: expect to load 200-line files routinely; size your AI context windows accordingly
  • Python code: 127-line median is LLM-friendly; full-file context usually fits
  • TypeScript: sweet spot. The 93-line median fits easily in a 4K token context; the 407-line p90 fits only if lines average under about 10 tokens
  • Java: tiny file size means cross-file reasoning is usually needed; the relevant code is often spread across 5-10 files
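
Whether a file fits depends on how many tokens a line costs, which the table does not measure. The sketch below assumes an average of 10 tokens per line, a placeholder to replace with a measurement from your own tokenizer. The names estimated_tokens and fits and the 512-token reserve are hypothetical.

```python
TOKENS_PER_LINE = 10  # assumed average; measure this with your own tokenizer


def estimated_tokens(lines, tokens_per_line=TOKENS_PER_LINE):
    """Rough token estimate for a file with the given number of lines."""
    return lines * tokens_per_line


def fits(lines, context_tokens, reserved_tokens=0, tokens_per_line=TOKENS_PER_LINE):
    """True if the file fits after reserving room for the prompt and the reply."""
    needed = estimated_tokens(lines, tokens_per_line) + reserved_tokens
    return needed <= context_tokens


CONTEXT = 4096
for label, lines in [("TypeScript median", 93), ("TypeScript p90", 407), ("Rust p90", 709)]:
    bare = fits(lines, CONTEXT)
    with_reserve = fits(lines, CONTEXT, reserved_tokens=512)
    print(f"{label}: fits={bare}, fits with 512 reserved={with_reserve}")
```

Under this assumption the TypeScript p90 file fits a 4,096-token window only when nothing else is in the prompt, so a reserve for instructions and output is worth budgeting explicitly.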

Implications for training data curation

When curating training data from this corpus by language:

  1. Rust and Python: sample whole files comfortably — they’re typically coherent units
  2. TypeScript: same, though App Router patterns span multiple files
  3. Java: sample whole directories, not files — a single Java file rarely contains a full feature
  4. JavaScript: be cautious of bundles/minified files dragging up the max
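
The file-versus-directory distinction in the list above can be sketched as a grouping step that runs before sampling. This is an illustration under assumptions: treating only .java files as directory-level units is a choice made here, and sampling_units and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: only Java is sampled at directory granularity.
DIRECTORY_LEVEL_SUFFIXES = {".java"}


def sampling_units(paths):
    """Group paths into units: one per directory for Java, one per file otherwise."""
    units = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix in DIRECTORY_LEVEL_SUFFIXES:
            key = str(path.parent)  # all Java files in a directory travel together
        else:
            key = str(path)  # other languages: each file is its own unit
        units[key].append(raw)
    return dict(units)


# Hypothetical repository paths.
paths = [
    "svc/src/OrderService.java",
    "svc/src/OrderRepository.java",
    "tool/main.rs",
]
units = sampling_units(paths)
print(units)
```

Sampling then draws from the keys of units rather than from individual paths, so a Java draw brings its sibling classes along.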

The Java paradox

Java has the smallest median (45 lines) but one of the largest max values (15,298). The p90 of 171 shows that at least 90% of Java files are small, yet a small set of huge legacy files remains. If you’re grading the “health” of a Java repo, the median alone is misleading; look at the p99 and the max as well.
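
A tail-aware report makes the point concrete. The sketch below uses a nearest-rank percentile and a made-up repository of 100 files. The 500-line "big file" cutoff, the names nearest_rank and tail_report, and the data are all hypothetical.

```python
import statistics


def nearest_rank(values, p):
    """Nearest-rank percentile using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def tail_report(line_counts, big_file=500):
    """Summarize the center and the tail of a repo's file-length distribution."""
    return {
        "median": statistics.median(line_counts),
        "p99": nearest_rank(line_counts, 99),
        "max": max(line_counts),
        "share_over_big": sum(n > big_file for n in line_counts) / len(line_counts),
    }


# Hypothetical repo: 97 small files plus three larger ones.
repo = [45] * 97 + [300, 4_000, 15_298]
report = tail_report(repo)
print(report)
```

Here the median stays at 45 lines while the p99 and max expose the large files that the median hides.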

