LiveBench/LiveBench

Component	Sub-score	Weight	Contribution
`structure_score`	60.0	0.15	9.00
`security_score`	30.0	0.25	7.50
`testing_score`	20.0	0.20	4.00
`documentation_score`	81.0	0.15	12.15
`practices_score`	40.0	0.15	6.00
`code_quality`	27.3	0.10	2.73
Overall		1.00	41.4
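
The Overall row is the sum of each sub-score multiplied by its weight. A minimal sketch of that arithmetic, using the values from the table; the dictionary layout and the function name `overall_score` are illustrative and not taken from the scanner:

```python
# Sub-scores and weights copied from the table above.
# The data layout and function name are hypothetical.
COMPONENTS = {
    "structure_score": (60.0, 0.15),
    "security_score": (30.0, 0.25),
    "testing_score": (20.0, 0.20),
    "documentation_score": (81.0, 0.15),
    "practices_score": (40.0, 0.15),
    "code_quality": (27.3, 0.10),
}


def overall_score(components):
    """Return the weighted sum of (sub_score, weight) pairs."""
    return sum(score * weight for score, weight in components.values())


total_weight = sum(weight for _, weight in COMPONENTS.values())
print(f"weights sum to {total_weight:.2f}")
print(f"overall = {overall_score(COMPONENTS):.1f}")
```

Because `security_score` carries the largest weight (0.25), raising it moves the overall figure more than the same gain in any other component.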

critical Security checks software dependencies conf 0.88 django: GHSA-frmv-pr5f-9mcr

Django vulnerable to SQL injection via _connector keyword argument in QuerySet and Q objects.

livebench/code_runner/requirements_eval.txt

critical Security checks software dependencies conf 0.88 django: GHSA-pv4p-cwwg-4rph

Django SQL injection vulnerability

livebench/code_runner/requirements_eval.txt

critical Security checks software dependencies conf 0.88 keras: GHSA-x4wf-678h-2pmq

Keras code injection vulnerability

livebench/code_runner/requirements_eval.txt

high Security checks quality Quality conf 1.00 ✓ Repobility 6 occurrences Missing import: `stat` used but not imported

The file uses `stat.something(...)` but never imports `stat`. This raises NameError at runtime the first time the line executes.

5 files, 6 locations

livebench/process_results/math/AMPS_Hard/utils.py:49, 98 (2 hits)

livebench/code_runner/eval/__init__.py:240

livebench/if_runner/instruction_following_eval/instructions.py:162

livebench/process_results/math/olympiad/utils.py:63

livebench/process_results/util.py:7
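
The failure mode described above can be reproduced in a few lines. This is a sketch with hypothetical function names, assuming the flagged lines really refer to the standard-library `stat` module rather than a local variable of the same name; the report does not show the code itself:

```python
import os
import tempfile


def is_regular_file_broken(path):
    # `stat` is referenced but never imported in this scope, so this line
    # raises NameError the first time it runs, not when the module loads.
    return stat.S_ISREG(os.stat(path).st_mode)  # noqa: F821


def is_regular_file_fixed(path):
    import stat  # normally placed at the top of the module

    return stat.S_ISREG(os.stat(path).st_mode)


with tempfile.NamedTemporaryFile() as handle:
    print(is_regular_file_fixed(handle.name))
    try:
        is_regular_file_broken(handle.name)
    except NameError as exc:
        print("NameError:", exc)
```

Since the error only appears when the line executes, a rarely used code path can hide it until that path is first taken.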

critical Security checks software dependencies conf 0.88 nltk: GHSA-7p94-766c-hgjp

NLTK has a Zip Slip Vulnerability

livebench/code_runner/requirements_eval.txt

critical Security checks software dependencies conf 0.88 tensorflow: GHSA-gw97-ff7c-9v96

TensorFlow has a heap out-of-buffer read vulnerability in the QuantizeAndDequantize operation

livebench/code_runner/requirements_eval.txt

low Security checks quality Quality conf 1.00 ✓ Repobility 3 occurrences [MINED006] Overcatch BaseException: `except BaseException: ...` prevents Ctrl+C and SystemExit from working.

Catch `Exception` or a narrower type instead, or re-raise after cleanup. See CWE-705 for context.

3 files, 3 locations

livebench/agentic_code_runner/minisweagent/agents/interactive.py:73

livebench/agentic_code_runner/minisweagent/run/run_batch.py:209

livebench/agentic_code_runner/minisweagent/run_inference.py:189
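
A minimal sketch of the narrower handler, with hypothetical names (`run_step`, `request_exit`) that do not come from the repository. `KeyboardInterrupt` and `SystemExit` derive from `BaseException` but not from `Exception`, so `except Exception` lets them propagate:

```python
def run_step(step):
    """Run one unit of work, treating ordinary errors as recoverable."""
    try:
        return step()
    except Exception as exc:  # KeyboardInterrupt and SystemExit pass through
        return f"failed: {exc!r}"


def request_exit():
    raise SystemExit(1)


print(run_step(lambda: 1 / 0))  # handled: ZeroDivisionError is an Exception

try:
    run_step(request_exit)
except SystemExit:
    print("SystemExit propagated to the caller")
```

Where a broad catch is needed for cleanup, a bare `raise` at the end of the handler keeps the interrupt behaviour intact.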

low Security checks quality Quality conf 1.00 ✓ Repobility 3 occurrences [MINED012] Curl Pipe Bash: curl ... | sh / bash — runs unverified network code.

Download the script to a file, verify it against a pinned checksum or signature, and only then execute it. See CWE-494 / A08:2021 for context.

3 files, 3 locations

livebench/agentic_code_runner/eval/harness/repos/javascript/axios/axios.py:59

livebench/agentic_code_runner/eval/harness/repos/javascript/sveltejs/svelte.py:51

livebench/agentic_code_runner/eval/harness/repos/typescript/ant_design/ant_design.py:52
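
One common alternative to piping a download straight into a shell is to check the bytes against a pinned digest first. The sketch below shows only the verification step; `verify_sha256` is a hypothetical helper, and the script bytes are a stand-in for a real download:

```python
import hashlib
import hmac


def verify_sha256(payload: bytes, expected_hex: str) -> bool:
    """Return True only if payload hashes to the pinned digest."""
    actual = hashlib.sha256(payload).hexdigest()
    return hmac.compare_digest(actual, expected_hex.lower())


# Stand-in for a downloaded installer script. A real harness would fetch the
# bytes and keep the expected digest next to the harness code.
script = b"#!/bin/sh\necho installing\n"
pinned = hashlib.sha256(script).hexdigest()

print(verify_sha256(script, pinned))
print(verify_sha256(script + b"echo tampered\n", pinned))
```

The digest has to come from a source other than the download itself, otherwise the check verifies nothing.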

high Security checks quality Quality conf 1.00 ✓ Repobility [MINED034] Python Subprocess Shell True: subprocess(..., shell=True) enables command injection.

Pass the command as an argument list with `shell=False`; if a shell is unavoidable, quote untrusted values with `shlex.quote`. See CWE-78 for context.

livebench/agentic_code_runner/minisweagent/environments/local.py:23

high Security checks quality Quality conf 1.00 ✓ Repobility [MINED034] Python Subprocess Shell True: subprocess(..., shell=True) enables command injection.

Pass the command as an argument list with `shell=False`; if a shell is unavoidable, quote untrusted values with `shlex.quote`. See CWE-78 for context.

livebench/agentic_code_runner/minisweagent/environments/docker.py:106
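
For the two `shell=True` findings above, a sketch of the list-argument form. `run_argv` is a hypothetical wrapper, and it assumes the command can be expressed as a fixed program plus arguments, which may not hold for an environment whose purpose is to run arbitrary shell commands:

```python
import subprocess
import sys


def run_argv(argv, timeout=30):
    """Run a command given as a list, with no shell in between."""
    return subprocess.run(
        argv, shell=False, capture_output=True, text=True, timeout=timeout
    )


# The hostile string reaches the child as one literal argument;
# `; echo injected` is never interpreted as a second command.
hostile = "hello; echo injected"
result = run_argv([sys.executable, "-c", "import sys; print(sys.argv[1])", hostile])
print(result.stdout.strip())
```

With `shell=False`, metacharacters such as `;`, `|` and `$()` have no special meaning because no shell parses the command line.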

high Security checks security path traversal conf 0.80 3 occurrences [SEC013] Path Traversal — User Input in File Path: User-controlled input used in file path without sanitization. Allows reading arbitrary files.

Use os.path.realpath() and verify the path starts with your expected base directory. Use secure_filename() for uploads.

3 files, 3 locations

livebench/if_runner/ifbench/evaluation_lib.py:45

livebench/if_runner/instruction_following_eval/evaluation_main.py:191

livebench/scripts/answer_csv_to_jsonl.py:11
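
The remediation above can be sketched as follows. `resolve_inside` is a hypothetical helper; it uses `os.path.commonpath` rather than a plain string-prefix test, since a prefix test would accept a sibling directory such as `/data-evil` for a base of `/data`:

```python
import os
import tempfile


def resolve_inside(base_dir, user_path):
    """Resolve user_path under base_dir, rejecting escapes such as '../'."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return target


with tempfile.TemporaryDirectory() as base:
    print(resolve_inside(base, "answers.jsonl"))
    try:
        resolve_inside(base, "../outside.jsonl")
    except ValueError as exc:
        print("rejected:", exc)
```

For command-line scripts whose paths come from the operator rather than from a remote user, this check may be stricter than necessary.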

low Security checks security Injection conf 1.00 [SEC103] LDAP injection — non-constant search filter: User input concatenated into an LDAP search filter. Attackers inject `*)(uid=*` style payloads to bypass auth or enumerate accounts.

Escape with javax.naming.ldap.Rdn.escapeValue or equivalent. For python-ldap, use ldap.filter.escape_filter_chars. Better: use parameterized search APIs (Spring LdapTemplate filter encoders).

livebench/process_results/reasoning/logic_with_navigation/utils.py:28

low Security checks security Injection conf 1.00 [SEC103] LDAP injection — non-constant search filter: User input concatenated into an LDAP search filter. Attackers inject `*)(uid=*` style payloads to bypass auth or enumerate accounts.

Escape with javax.naming.ldap.Rdn.escapeValue or equivalent. For python-ldap, use ldap.filter.escape_filter_chars. Better: use parameterized search APIs (Spring LdapTemplate filter encoders).

livebench/agentic_code_runner/eval/harness/repos/c/mruby/mruby.py:423
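
To illustrate what escaping a filter value means for the two LDAP findings above, here is a hand-rolled sketch of the RFC 4515 escapes. It is a stand-in for a library routine such as `ldap.filter.escape_filter_chars`, not a replacement for it, and it assumes the flagged code really builds an LDAP filter, which the report does not show:

```python
def escape_filter_value(value: str) -> str:
    """Escape the characters RFC 4515 treats as special in a filter value."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    return "".join(replacements.get(ch, ch) for ch in value)


payload = "*)(uid=*"
print(f"(uid={escape_filter_value(payload)})")
```

After escaping, the payload is matched as literal text and can no longer close the surrounding filter or add a clause.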

high Security checks quality Quality conf 1.00 ✓ Repobility 25 occurrences `self.model` used but never assigned in __init__

Method `add_message` of class `InteractiveAgent` reads `self.model`, but no assignment to it exists in __init__ (and no class-level fallback). This raises AttributeError the first time the method runs against an instance.

6 files, 25 locations

livebench/agentic_code_runner/minisweagent/run/batch_progress.py:111, 140, 155, 174, 175, 176, 178, 181, +1 more (9 hits)

livebench/agentic_code_runner/minisweagent/agents/interactive.py:47, 57, 58, 63, 75, 86, 87 (7 hits)

livebench/code_runner/eval/utils.py:191, 192, 195, 196 (4 hits)

livebench/agentic_code_runner/minisweagent/agents/replay.py:62, 78, 79 (3 hits)

livebench/agentic_code_runner/minisweagent/environments/docker.py:110

livebench/agentic_code_runner/minisweagent/run/run_batch.py:48
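
A sketch of the pattern this check looks for, using a hypothetical `Agent` class rather than the repository's `InteractiveAgent`. If the flagged classes inherit the attribute from a base class that assigns it, the finding may not apply:

```python
class Agent:
    """Hypothetical class; not the InteractiveAgent from the repository."""

    def __init__(self, model=None):
        # Assign every attribute that later methods read, even if only to None.
        self.model = model
        self.messages = []

    def add_message(self, role, content):
        if self.model is None:
            raise RuntimeError("model is not configured")
        self.messages.append({"role": role, "content": content})


agent = Agent(model="example-model")
agent.add_message("user", "hello")
print(len(agent.messages))
```

Assigning the attribute in `__init__` turns a late `AttributeError` into an explicit, earlier check.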

high Security checks software dependencies conf 0.88 cryptography: GHSA-3ww4-gg4f-jr7f

Python Cryptography package vulnerable to Bleichenbacher timing oracle attack

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 cryptography: GHSA-r6ph-v2qm-q3c2

cryptography Vulnerable to a Subgroup Attack Due to Missing Subgroup Validation for SECT Curves

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 cryptography: GHSA-x4qr-2fvf-3mr5

Vulnerable OpenSSL included in cryptography wheels

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 cryptography: PYSEC-2023-11

cryptography is a package designed to expose cryptographic primitives and recipes to Python developers. In affected versions `Cipher.update_into` would accept Python objects which implement the buffer protocol, but provide only immutable buffers. This would allow immutable objects (such as `bytes`)…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 cryptography: PYSEC-2023-254

cryptography is a package designed to expose cryptographic primitives and recipes to Python developers. Calling `load_pem_pkcs7_certificates` or `load_der_pkcs7_certificates` could lead to a NULL-pointer dereference and segfault. Exploitation of this vulnerability poses a serious risk of Denial of …

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 cryptography: PYSEC-2024-225

cryptography is a package designed to expose cryptographic primitives and recipes to Python developers. Starting in version 38.0.0 and prior to version 42.0.4, if `pkcs12.serialize_key_and_certificates` is called with both a certificate whose public key did not match the provided private key and an…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 cryptography: PYSEC-2026-35

cryptography is a package designed to expose cryptographic primitives and recipes to Python developers. Prior to version 46.0.6, DNS name constraints were only validated against SANs within child certificates, and not the "peer name" presented during each validation. Consequently, cryptography woul…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: GHSA-8p8v-wh79-9r56

Django vulnerable to Uncontrolled Resource Consumption

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-102

An issue was discovered in Django 5.1 before 5.1.1, 5.0 before 5.0.9, and 4.2 before 4.2.16. The urlize() and urlizetrunc() template filters are subject to a potential denial-of-service attack via very large inputs with a specific sequence of characters.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-156

An issue was discovered in Django 5.1 before 5.1.4, 5.0 before 5.0.10, and 4.2 before 4.2.17. The strip_tags() method and striptags template filter are subject to a potential denial-of-service attack via certain inputs containing large sequences of nested incomplete HTML entities.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-157

An issue was discovered in Django 5.1 before 5.1.4, 5.0 before 5.0.10, and 4.2 before 4.2.17. Direct usage of the django.db.models.fields.json.HasKey lookup, when an Oracle database is used, is subject to SQL injection if untrusted data is used as an lhs value. (Applications that use the jsonfield.…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-28

An issue was discovered in Django 3.2 before 3.2.24, 4.2 before 4.2.10, and Django 5.0 before 5.0.2. The intcomma template filter was subject to a potential denial-of-service attack when used with very long strings.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-47

In Django 3.2 before 3.2.25, 4.2 before 4.2.11, and 5.0 before 5.0.3, the django.utils.text.Truncator.words() method (with html=True) and the truncatewords_html template filter are subject to a potential regular expression denial-of-service attack via a crafted string. NOTE: this issue exists becau…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-56

An issue was discovered in Django 4.2 before 4.2.14 and 5.0 before 5.0.7. urlize and urlizetrunc were subject to a potential denial of service attack via certain inputs with a very large number of brackets.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-57

An issue was discovered in Django 5.0 before 5.0.7 and 4.2 before 4.2.14. The django.contrib.auth.backends.ModelBackend.authenticate() method allows remote attackers to enumerate users via a timing attack involving login requests for users with an unusable password.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-58

An issue was discovered in Django 5.0 before 5.0.7 and 4.2 before 4.2.14. Derived classes of the django.core.files.storage.Storage base class, when they override generate_filename() without replicating the file-path validations from the parent class, potentially allow directory traversal via certai…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-59

An issue was discovered in Django 5.0 before 5.0.7 and 4.2 before 4.2.14. get_supported_language_variant() was subject to a potential denial-of-service attack when used with very long strings containing specific characters.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-67

An issue was discovered in Django 5.0 before 5.0.8 and 4.2 before 4.2.15. The floatformat template filter is subject to significant memory consumption when given a string representation of a number in scientific notation with a large exponent.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-68

An issue was discovered in Django 5.0 before 5.0.8 and 4.2 before 4.2.15. The urlize() and urlizetrunc() template filters are subject to a potential denial-of-service attack via very large inputs with a specific sequence of characters.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2024-69

An issue was discovered in Django 5.0 before 5.0.8 and 4.2 before 4.2.15. The urlize and urlizetrunc template filters, and the AdminURLFieldWidget widget, are subject to a potential denial-of-service attack via certain inputs with a very large number of Unicode characters.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-1

An issue was discovered in Django 5.1 before 5.1.5, 5.0 before 5.0.11, and 4.2 before 4.2.18. Lack of upper-bound limit enforcement in strings passed when performing IPv6 validation could lead to a potential denial-of-service attack. The undocumented and private functions clean_ipv6_address and is_…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-104

An issue was discovered in 5.2 before 5.2.9, 5.1 before 5.1.15, and 4.2 before 4.2.27. `FilteredRelation` is subject to SQL injection in column aliases, using a suitably crafted dictionary, with dictionary expansion, as the `**kwargs` passed to `QuerySet.annotate()` or `QuerySet.alias()` on Postgre…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-105

An issue was discovered in Django 4.2 before 4.2.24, 5.1 before 5.1.12, and 5.2 before 5.2.6. FilteredRelation is subject to SQL injection in column aliases, using a suitably crafted dictionary, with dictionary expansion, as the **kwargs passed QuerySet.annotate() or QuerySet.alias().

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-106

An issue was discovered in Django 4.2 before 4.2.25, 5.1 before 5.1.13, and 5.2 before 5.2.7. QuerySet.annotate(), QuerySet.alias(), QuerySet.aggregate(), and QuerySet.extra() are subject to SQL injection in column aliases, when using a suitably crafted dictionary, with dictionary expansion, as the…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-107

An issue was discovered in 5.1 before 5.1.14, 4.2 before 4.2.26, and 5.2 before 5.2.8. NFKC normalization in Python is slow on Windows. As a consequence, `django.http.HttpResponseRedirect`, `django.http.HttpResponsePermanentRedirect`, and the shortcut `django.shortcuts.redirect` were subject to a …

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-109

An issue was discovered in 5.2 before 5.2.9, 5.1 before 5.1.15, and 4.2 before 4.2.27. Algorithmic complexity in `django.core.serializers.xml_serializer.getInnerText()` allows a remote attacker to cause a potential denial-of-service attack triggering CPU and memory exhaustion via specially crafted …

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-13

An issue was discovered in Django 5.1 before 5.1.7, 5.0 before 5.0.13, and 4.2 before 4.2.20. The django.utils.text.wrap() method and wordwrap template filter are subject to a potential denial-of-service attack when used with very long strings.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-37

An issue was discovered in Django 4.2 before 4.2.21, 5.1 before 5.1.9, and 5.2 before 5.2.1. The django.utils.html.strip_tags() function is vulnerable to a potential denial-of-service (slow performance) when processing inputs containing large sequences of incomplete HTML tags. The template filter s…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2025-47

An issue was discovered in Django 5.2 before 5.2.2, 5.1 before 5.1.10, and 4.2 before 4.2.22. Internal HTTP response logging does not escape request.path, which allows remote attackers to potentially manipulate log output via crafted URLs. This may lead to log injection or forgery when logs are vie…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-42

An issue was discovered in 6.0 before 6.0.2, 5.2 before 5.2.11, and 4.2 before 4.2.28. The `django.contrib.auth.handlers.modwsgi.check_password()` function for authentication via `mod_wsgi` allows remote attackers to enumerate users via a timing attack. Earlier, unsupported Django series (such as 5…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-43

An issue was discovered in 6.0 before 6.0.2, 5.2 before 5.2.11, and 4.2 before 4.2.28. `ASGIRequest` allows a remote attacker to cause a potential denial-of-service via a crafted request with multiple duplicate headers. Earlier, unsupported Django series (such as 5.0.x, 4.1.x, and 3.2.x) were not e…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-44

An issue was discovered in 6.0 before 6.0.2, 5.2 before 5.2.11, and 4.2 before 4.2.28. Raster lookups on `RasterField` (only implemented on PostGIS) allows remote attackers to inject SQL via the band index parameter. Earlier, unsupported Django series (such as 5.0.x, 4.1.x, and 3.2.x) were not ev…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-45

An issue was discovered in 6.0 before 6.0.2, 5.2 before 5.2.11, and 4.2 before 4.2.28. `django.utils.text.Truncator.chars()` and `Truncator.words()` methods (with `html=True`) and the `truncatechars_html` and `truncatewords_html` template filters allow a remote attacker to cause a potential denial-…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-46

An issue was discovered in 6.0 before 6.0.2, 5.2 before 5.2.11, and 4.2 before 4.2.28. `FilteredRelation` is subject to SQL injection in column aliases via control characters, using a suitably crafted dictionary, with dictionary expansion, as the `**kwargs` passed to `QuerySet` methods `annotate()`…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-47

An issue was discovered in 6.0 before 6.0.2, 5.2 before 5.2.11, and 4.2 before 4.2.28. `.QuerySet.order_by()` is subject to SQL injection in column aliases containing periods when the same alias is, using a suitably crafted dictionary, with dictionary expansion, used in `FilteredRelation`. Earlier,…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-48

An issue was discovered in 6.0 before 6.0.4, 5.2 before 5.2.13, and 4.2 before 4.2.30. `MultiPartParser` allows remote attackers to degrade performance by submitting multipart uploads with `Content-Transfer-Encoding: base64` including excessive whitespace. Earlier, unsupported Django series (such a…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-49

An issue was discovered in 6.0 before 6.0.4, 5.2 before 5.2.13, and 4.2 before 4.2.30. ASGI requests with a missing or understated `Content-Length` header could bypass the `DATA_UPLOAD_MAX_MEMORY_SIZE` limit when reading `HttpRequest.body`, allowing remote attackers to load an unbounded request bod…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-51

An issue was discovered in 6.0 before 6.0.4, 5.2 before 5.2.13, and 4.2 before 4.2.30. `ASGIRequest` allows a remote attacker to spoof headers by exploiting an ambiguous mapping of two header variants (with hyphens or with underscores) to a single version with underscores. Earlier, unsupported Djan…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-52

An issue was discovered in 6.0 before 6.0.4, 5.2 before 5.2.13, and 4.2 before 4.2.30. Add permissions on inline model instances were not validated on submission of forged `POST` data in `GenericInlineModelAdmin`. Earlier, unsupported Django series (such as 5.0.x, 4.1.x, and 3.2.x) were not evaluat…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 django: PYSEC-2026-53

An issue was discovered in 6.0 before 6.0.4, 5.2 before 5.2.13, and 4.2 before 4.2.30. Admin changelist forms using `ModelAdmin.list_editable` incorrectly allowed new instances to be created via forged `POST` data. Earlier, unsupported Django series (such as 5.0.x, 4.1.x, and 3.2.x) were not evalua…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 geopandas: PYSEC-2026-62

SQL injection vulnerability in geopandas before v.1.1.2 allows an attacker to obtain sensitive information via the `to_postgis()` function being used to write GeoDataFrames to a PostgreSQL database.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 keras: GHSA-36fq-jgmw-4r9c

Keras is vulnerable to Deserialization of Untrusted Data

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 keras: GHSA-4f3f-g24h-fr8m

Keras has an untrusted deserialization vulnerability

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 keras: GHSA-hjqc-jx6g-rwp9

Keras Directory Traversal Vulnerability

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 keras: PYSEC-2025-121

An issue in keras 3.7.0 allows attackers to write arbitrary files to the user's machine via downloading a crafted tar file through the get_file function.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 lxml: PYSEC-2026-87

lxml is a library for processing XML and HTML in the Python language. Prior to 6.1.0, using either of the two parsers in the default configuration (with resolve_entities=True) allows untrusted XML input to read local files. Setting the resolve_entities option explicitly to resolve_entities='interna…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 nltk: GHSA-469j-vmhf-r6v7

NLTK has a Downloader Path Traversal Vulnerability (AFO) - Arbitrary File Overwrite

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 nltk: GHSA-jm6w-m3j8-898g

Unauthenticated remote shutdown in nltk.app.wordnet_app

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 nltk: PYSEC-2024-167

NLTK through 3.8.1 allows remote code execution if untrusted packages have pickled Python code, and the integrated data package download functionality is used. This affects, for example, averaged_perceptron_tagger and punkt.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 nltk: PYSEC-2026-97

A vulnerability in the `filestring()` function of the `nltk.util` module in nltk version 3.9.2 allows arbitrary file read due to improper validation of input paths. The function directly opens files specified by user input without sanitization, enabling attackers to access sensitive system files by…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 nltk: PYSEC-2026-98

A vulnerability in NLTK versions up to and including 3.9.2 allows arbitrary file read via path traversal in multiple CorpusReader classes, including WordListCorpusReader, TaggedCorpusReader, and BracketParseCorpusReader. These classes fail to properly sanitize or validate file paths, enabling attac…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 nltk: PYSEC-2026-99

NLTK versions <=3.9.2 are vulnerable to arbitrary code execution due to improper input validation in the StanfordSegmenter module. The module dynamically loads external Java .jar files without verification or sandboxing. An attacker can supply or replace the JAR file, enabling the execution of arbi…

livebench/code_runner/requirements_eval.txt

high Security checks quality Quality conf 1.00 ✓ Repobility 24 occurrences Phantom test coverage: test_patch_run

Test function `test_patch_run` runs code but contains no assert / expect / should call — it passes regardless of behaviour. Adds line coverage without verifying anything.

12 files, 12 locations

livebench/agentic_code_runner/eval/harness/instance.py:56

livebench/agentic_code_runner/eval/harness/repos/c/OpenMathLib/OpenBLAS.py:229

livebench/agentic_code_runner/eval/harness/repos/c/facebook/zstd.py:230

livebench/agentic_code_runner/eval/harness/repos/c/fluent/fluentbit.py:282

livebench/agentic_code_runner/eval/harness/repos/c/jqlang/jq.py:237

livebench/agentic_code_runner/eval/harness/repos/c/libgit2/libgit2.py:402

livebench/agentic_code_runner/eval/harness/repos/c/libsdlorg/SDL.py:229

livebench/agentic_code_runner/eval/harness/repos/c/mruby/mruby.py:368

high Security checks quality Quality conf 1.00 ✓ Repobility Phantom test coverage: test_patch_run_log

Test function `test_patch_run_log` runs code but contains no assert / expect / should call — it passes regardless of behaviour. Adds line coverage without verifying anything.

livebench/agentic_code_runner/eval/harness/report.py:216
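
The difference between a phantom test and a checked one, as a sketch. `build_patch_command` is a hypothetical helper standing in for the code under test. The flagged names may also be ordinary methods whose names begin with `test_` rather than test cases, which the report does not settle:

```python
def build_patch_command(patch_path):
    """Hypothetical helper standing in for the code under test."""
    return ["git", "apply", "--whitespace=nowarn", patch_path]


def test_patch_run_phantom():
    # Runs the code but would pass even if the return value were wrong.
    build_patch_command("fix.patch")


def test_patch_run_checked():
    cmd = build_patch_command("fix.patch")
    assert cmd[:2] == ["git", "apply"]
    assert cmd[-1] == "fix.patch"


test_patch_run_phantom()
test_patch_run_checked()
```

Only the second function fails when `build_patch_command` regresses; the first contributes line coverage and nothing else.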

high Security checks software dependencies conf 0.88 pillow: GHSA-cfh3-3jmp-rvhc

Pillow affected by out-of-bounds write when loading PSD images

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 pillow: GHSA-pwv6-vv43-88gr

Pillow has an OOB Write with Invalid PSD Tile Extents (Integer Overflow)

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 pillow: GHSA-whj4-6x5x-4v2j

FITS GZIP decompression bomb in Pillow

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 pillow: PYSEC-2026-165

Pillow is a Python imaging library. Prior to version 12.2.0, if a font advances for each glyph by an exceeding large amount, when Pillow keeps track of the current position, it may lead to an integer overflow. This issue has been patched in version 12.2.0.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 pycryptodome: GHSA-j225-cvw7-qrx7

PyCryptodome and pycryptodomex side-channel leakage for OAEP decryption

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 scikit-learn: PYSEC-2024-110

A sensitive data leakage vulnerability was identified in scikit-learn's TfidfVectorizer, specifically in versions up to and including 1.4.1.post1, which was fixed in version 1.5.0. The vulnerability arises from the unexpected storage of all tokens present in the training data within the `stop_words…

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 scipy: PYSEC-2023-102

A refcounting issue which leads to potential memory leak was discovered in scipy commit 8627df31ab in Py_FindObjects() function.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 scipy: PYSEC-2023-114

** DISPUTED ** A use-after-free issue was discovered in Py_FindObjects() function in SciPy versions prior to 1.8.0. NOTE: the vendor and discoverer indicate that this is not a security issue.

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-49rq-hwc3-x77w

TensorFlow has Null Pointer Error in QuantizedMatMulWithBiasAndDequantize

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-558h-mq8x-7q9g

TensorFlow has Null Pointer Error in SparseSparseMaximum

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-5w96-866f-6rm8

TensorFlow has Floating Point Exception in TFLite in conv kernel

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-647v-r7qq-24fh

TensorFlow has Floating Point Exception in TensorListSplit with XLA

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-64jg-wjww-7c5w

TensorFlow has Null Pointer Error in TensorArrayConcatV2

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-68v3-g9cm-rmm6

TensorFlow vulnerable to Out-of-Bounds Read in GRUBlockCellGrad

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-6hg6-5c2q-7rcr

TensorFlow has Heap-buffer-overflow in AvgPoolGrad

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-6wfh-89q8-44jq

TensorFlow has null dereference on ParallelConcat with XLA

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-7jvm-xxmr-v5cw

TensorFlow vulnerable to integer overflow in EditDistance

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-7x4v-9gxg-9hwj

TensorFlow has Segfault in Bincount with XLA

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-93vr-9q9m-pj8p

TensorFlow vulnerable to Out-of-Bounds Read in DynamicStitch

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-94mm-g2mv-8p7r

TensorFlow has Null Pointer Error in LookupTableImportV2

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-f49c-87jh-g47q

TensorFlow has double free in Fractional(Max/Avg)Pool

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-f637-vh3r-vfh2

TensorFlow has Floating Point Exception in AudioSpectrogram

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-gf97-q72m-7579

TensorFlow has Null Pointer Error in RandomShuffle with XLA enable

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-gjh7-xx4r-x345

TensorFlow has segfault in array_ops.upper_bound

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-j5w9-hmfh-4cr6

TensorFlow has segmentation fault in tfg-translate

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-qjqc-vqcf-5qvj

TensorFlow vulnerable to seg fault in `tf.raw_ops.Print`

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 tensorflow: GHSA-rcf8-g8jv-vg6p

TensorFlow has Floating Point Exception in AvgPoolGrad with XLA

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.88 werkzeug: GHSA-2g68-c3qc-8985

Werkzeug debugger vulnerable to remote execution when interacting with attacker controlled domain

livebench/code_runner/requirements_eval.txt
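
Several advisories above name the first fixed release. A rough sketch of checking `==` pins against those releases follows; the sample pins are invented for illustration and are not the contents of `requirements_eval.txt`, and the helper names are hypothetical. The version comparison is deliberately simple; the `packaging` library's `Version` class would be the more robust choice:

```python
import re

# First fixed versions quoted in the advisories above. Other advisories for
# the same package may require a newer release.
FIXED_IN = {
    "cryptography": "46.0.6",
    "geopandas": "1.1.2",
    "lxml": "6.1.0",
    "pillow": "12.2.0",
    "scikit-learn": "1.5.0",
}


def release_tuple(version):
    """Leading numeric release, trailing zeros dropped: '6.1.0' -> (6, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = [int(part) for part in match.group().split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def outdated_pins(requirements_text):
    """Yield (name, pinned, fixed) for '==' pins older than FIXED_IN."""
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, pinned = (part.strip() for part in line.split("==", 1))
        fixed = FIXED_IN.get(name.lower())
        if fixed and release_tuple(pinned) < release_tuple(fixed):
            yield name, pinned, fixed


# Hypothetical pins for illustration only.
sample = "lxml==4.9.3\npillow==12.2.0\nnumpy==1.26.4\n"
print(list(outdated_pins(sample)))
```

A dedicated auditing tool that queries an advisory database remains the authoritative check; this sketch only covers the handful of versions quoted in this report.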

high System graph security security conf 1.00 Insecure pattern 'exec_used' in livebench/code_runner/eval/__init__.py:158

Found a known-risky pattern (exec_used). Review and replace if possible.

livebench/code_runner/eval/__init__.py:158 Exec used

low Security checks quality Error handling conf 1.00 [ERR001] Silent Exception Swallowing: Silently swallowing all exceptions hides bugs. Even in cleanup code, log at DEBUG level.

Log the error: `except Exception: logger.debug('cleanup failed', exc_info=True)`. Or handle specific exception types.

livebench/process_results/math/integrals_with_game/utils.py:122

low Security checks quality Error handling conf 1.00 [ERR001] Silent Exception Swallowing: Silently swallowing all exceptions hides bugs. Even in cleanup code, log at DEBUG level.

Log the error: `except Exception: logger.debug('cleanup failed', exc_info=True)`. Or handle specific exception types.

livebench/process_results/data_analysis/tablereformat/utils.py:15
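
A minimal sketch of the suggested fix, assuming a best-effort cleanup step. The helper name `remove_quietly` is hypothetical and is not taken from the flagged files; it only shows a narrow `except` that logs at DEBUG level instead of passing silently.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def remove_quietly(path: str) -> bool:
    """Best-effort file removal that records failures instead of hiding them."""
    try:
        os.remove(path)
        return True
    except OSError:
        # Narrow exception type; the traceback is kept at DEBUG level.
        logger.debug("cleanup failed for %s", path, exc_info=True)
        return False


# Usage: the second call fails because the file is already gone.
fd, temp_path = tempfile.mkstemp()
os.close(fd)
first_attempt = remove_quietly(temp_path)
second_attempt = remove_quietly(temp_path)
```

The boolean return lets a caller decide whether a failed cleanup matters, while the log line keeps the failure visible during debugging.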

low Security checks security Injection conf 0.50 3 occurrences [SEC005] Command Injection Risk: Unsafe shell execution or eval of user input.

Use subprocess with shell=False and a list of args. Never eval user input.

3 files, 3 locations

livebench/agentic_code_runner/minisweagent/environments/docker.py:106

livebench/agentic_code_runner/minisweagent/environments/local.py:23

livebench/code_runner/eval/utils.py:201
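
A hedged illustration of the recommendation, not a patch for the three flagged locations. The wrapper `run_argv` is a hypothetical name; the point is that an argument list with `shell=False` never passes through a shell, so metacharacters in untrusted input stay literal.

```python
import subprocess
import sys


def run_argv(argv: list[str], timeout: float = 10.0) -> str:
    """Run a command from an argument list; no shell ever parses the arguments."""
    completed = subprocess.run(
        argv, shell=False, capture_output=True, text=True, timeout=timeout, check=True
    )
    return completed.stdout


# A value containing shell metacharacters stays one literal argument.
untrusted = "hello; echo injected"
output = run_argv([sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted])
```

Environments that intentionally execute agent-written shell commands may still need a shell; in that case the isolation has to come from the container or sandbox rather than from argument handling.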

medium Security checks quality Quality conf 1.00 3 occurrences [SEC123] Production stack trace / debug output exposed: Debug mode left on in production exposes stack traces, environment variables, framework internals — sometimes triggers RCE (Django debug page with arbitrary template eval).

Set DEBUG=False / APP_DEBUG=false in production. Provide a generic 500 handler that logs to backend but returns a sanitized page to clients.

3 files, 3 locations

livebench/lcb_runner/evaluation/compute_code_generation_metrics.py:29

livebench/scripts/check_grading_flakiness.py:111

livebench/scripts/edit_questions.py:138

low Security checks quality Quality conf 1.00 [SEC136] AI-typical over-broad exception handler swallowing all errors: Catch-all exception block that silently returns success or no-ops. AI agents reach for this pattern when a flaky test or an unfamiliar API throws — wrap, swallow, return success. Real bugs are masked, observability is destroyed, and callers think the operation worked. CWE-396 (improperly-generalized exception). Distinct from intentional fallback because there's no log line and the success value is fabricated.

Catch the specific exception type, log at error level with full exception info, and return a failure-shaped result. If the operation is genuinely best-effort, log at warning and document why in a comment so the next reader (or scanner) knows.

livebench/scripts/check_grading_flakiness.py:43
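
One way the advice above could look in practice, sketched under assumptions: `ParseResult` and `parse_payload` are hypothetical names, and JSON parsing stands in for whatever operation the flagged handler wraps. The handler catches one specific exception, logs it at error level, and returns a failure-shaped value rather than a fabricated success.

```python
import json
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger(__name__)


@dataclass
class ParseResult:
    ok: bool
    value: Optional[Any] = None
    error: Optional[str] = None


def parse_payload(text: str) -> ParseResult:
    """Parse JSON; report failure explicitly instead of pretending success."""
    try:
        return ParseResult(ok=True, value=json.loads(text))
    except json.JSONDecodeError as exc:
        # Specific exception type, logged with the full traceback.
        logger.error("payload is not valid JSON", exc_info=True)
        return ParseResult(ok=False, error=str(exc))


good = parse_payload('{"score": 1}')
bad = parse_payload("{not json")
```

Callers can branch on `ok` and surface `error`, so a malformed input is never mistaken for a completed operation.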

low Security checks quality Error handling conf 0.55 ✓ Repobility 25 occurrences Broad exception handler needs review

This handler catches Exception/BaseException. It is actionable when it swallows errors without logging, re-raising, or returning a structured error. Handlers that intentionally convert exceptions into typed error results should not be treated as high risk.

12 files, 19 locations

livebench/scripts/inspect_agentic_traj.py:141, 144, 181 (3 hits)

livebench/code_runner/eval/__init__.py:182, 346 (2 hits)

livebench/model/completions.py:231, 524 (2 hits)

livebench/scripts/check_grading_flakiness.py:45, 56 (2 hits)

livebench/scripts/edit_questions.py:144, 184 (2 hits)

livebench/scripts/replay_agent_trajectory.py:79, 373 (2 hits)

livebench/agentic_code_runner/minisweagent/run_inference.py:233

livebench/code_runner/eval/utils.py:236

Error handling, quality

medium Security checks software dependencies conf 0.88 cryptography: GHSA-39hc-v87j-747x

Vulnerable OpenSSL included in cryptography wheels

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 cryptography: GHSA-9v9h-cgj8-h64p

Null pointer dereference in PKCS12 parsing

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 cryptography: GHSA-h4gh-qq45-vh27

pyca/cryptography has a vulnerable OpenSSL included in cryptography wheels

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 django: GHSA-rrqc-c2jx-6jgv

Django allows enumeration of user e-mail addresses

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 django: GHSA-vm8q-m57g-pff3

Regular expression denial-of-service in Django

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 keras: GHSA-mq84-hjqx-cwf2

Keras is vulnerable to arbitrary local file loading and Server-Side Request Forgery

livebench/code_runner/requirements_eval.txt

medium Security checks quality Quality conf 1.00 ✓ Repobility 3 occurrences Mutable default argument in `from_reports` (list)

`def from_reports(... = []/{}/set())` — Python's default value is constructed ONCE at function definition time and shared across all calls. Mutating it in one call mutates it for every future call too.

3 files, 3 locations

livebench/agentic_code_runner/eval/harness/report.py:303

livebench/lcb_runner/evaluation/compute_code_generation_metrics.py:157

livebench/lcb_runner/evaluation/pass_k_utils.py:26
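
The shared-default behaviour is easy to reproduce. The two functions below are illustrative only (they are not `from_reports`); they contrast the flagged pattern with the usual `None` sentinel fix.

```python
from typing import Optional


def append_shared(item: int, bucket: list = []) -> list:
    # Anti-pattern: `bucket` is the same list object on every call.
    bucket.append(item)
    return bucket


def append_fresh(item: int, bucket: Optional[list] = None) -> list:
    # Fix: use None as the sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first_shared = list(append_shared(1))   # snapshot after the first call
second_shared = list(append_shared(2))  # still carries the 1 from the first call
first_fresh = append_fresh(1)
second_fresh = append_fresh(2)
```

A function that only reads its default is not affected at runtime, but the sentinel form removes the risk if a later change starts mutating it.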

medium Security checks software dependencies conf 0.88 nltk: GHSA-gfwx-w7gr-fvh7

Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting') in nltk

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 nltk: GHSA-rf74-v2fm-23pw

Natural Language Toolkit (NLTK) has unbounded recursion in JSONTaggedDecoder.decode_obj() may cause DoS

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 numpy: GHSA-fpfv-jqm9-f5jm

Incorrect Comparison in NumPy

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 pillow: GHSA-r73j-pqj5-w3x7

Pillow has a PDF Parsing Trailer Infinite Loop (DoS)

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 pytest: GHSA-6w46-j5rx-g56g

pytest has vulnerable tmpdir handling

livebench/code_runner/requirements_eval.txt

high Security checks software dependencies conf 0.70 4 occurrences Remote install command pipes network code directly to a shell

Agent helper projects often publish one-line installers. `curl | sh` style commands are convenient, but they bypass review unless the script is pinned, signed, or checksum-verified.

4 files, 4 locations

livebench/agentic_code_runner/eval/harness/repos/javascript/Automattic/mongoose.py:98

livebench/agentic_code_runner/eval/harness/repos/javascript/axios/axios.py:60

livebench/agentic_code_runner/eval/harness/repos/javascript/sveltejs/svelte.py:52

livebench/agentic_code_runner/eval/harness/repos/typescript/ant_design/ant_design.py:58
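
A minimal sketch of checksum verification, assuming the installer is downloaded to a file first and that a trusted SHA-256 digest is available from the publisher. The function names are hypothetical, and the digest in the usage lines is computed locally only to keep the example self-contained.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large installers do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_installer(path: Path, expected_sha256: str) -> None:
    """Raise ValueError unless the file matches the pinned digest."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")


with tempfile.TemporaryDirectory() as workdir:
    script = Path(workdir) / "install.sh"
    script.write_bytes(b"echo ok\n")
    pinned = hashlib.sha256(b"echo ok\n").hexdigest()  # stands in for a published digest
    verify_installer(script, pinned)  # matching digest: no exception
    try:
        verify_installer(script, "0" * 64)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

Only after `verify_installer` returns would the script be handed to a shell; a mismatch stops the install before any remote code runs.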

medium Security checks software dependencies conf 0.88 requests: GHSA-9hjg-9r4m-mvj7

Requests vulnerable to .netrc credentials leak via malicious URLs

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 requests: GHSA-9wx4-h78v-vm56

Requests `Session` object does not verify requests after making first request with verify=False

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 requests: GHSA-gc5v-m9x4-r6x2

Requests has Insecure Temp File Reuse in its extract_zipped_paths() utility function

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.90 ✓ Repobility 4 occurrences requirements.txt: `absl-py` has no version pin

Unpinned pip requirement means every fresh install may resolve a different version. Newer releases can introduce malicious code (typosquats, account compromises). Reproducible installs need exact pins.

lines 1, 2, 3, 4

livebench/if_runner/instruction_following_eval/requirements.txt:1, 2, 3, 4 (4 hits)
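
A rough way to find such lines, sketched with a simplified pattern that treats only `name==version` as pinned. The helper `unpinned_requirements` is hypothetical, and every entry in the sample text except `absl-py` is made up for illustration.

```python
import re

# Simplified: a package name, optional extras, then an exact `==` pin.
PIN_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that lack an exact version pin."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PIN_PATTERN.match(line):
            unpinned.append(line)
    return unpinned


sample = "absl-py\nexample-pkg==1.2.3\n# a comment\nother-pkg>=2.0\n"
flagged = unpinned_requirements(sample)
```

Range specifiers such as `>=` are reported too, since they still let a fresh install resolve to a different release.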

medium Security checks software dependencies conf 0.88 tensorflow: GHSA-fqm2-gh8w-gr68

TensorFlow vulnerable to segfault when opening multiframe gif

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 tensorflow: GHSA-fxgc-95xx-grvq

TensorFlow Denial of Service vulnerability

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 werkzeug: GHSA-29vq-49wr-vm6x

Werkzeug safe_join() allows Windows special device names

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 werkzeug: GHSA-87hc-h4r5-73f7

Werkzeug safe_join() allows Windows special device names with compound extensions

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 werkzeug: GHSA-f9vj-2wh5-fj8j

Werkzeug safe_join not safe on Windows

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 werkzeug: GHSA-hgf8-39gv-g3f2

Werkzeug safe_join() allows Windows special device names

livebench/code_runner/requirements_eval.txt

medium Security checks software dependencies conf 0.88 werkzeug: GHSA-q34m-jh98-gwm2

Werkzeug possible resource exhaustion when parsing file data in forms

livebench/code_runner/requirements_eval.txt

medium System graph security security conf 1.00 Insecure pattern 'subprocess_shell_true' in livebench/agentic_code_runner/minisweagent/environments/docker.py:106

Found a known-risky pattern (subprocess_shell_true). Review and replace if possible.

livebench/agentic_code_runner/minisweagent/environments/docker.py:106 Subprocess shell true

medium System graph security security conf 1.00 Insecure pattern 'subprocess_shell_true' in livebench/agentic_code_runner/minisweagent/environments/extra/swerex_docker.py:33

Found a known-risky pattern (subprocess_shell_true). Review and replace if possible.

livebench/agentic_code_runner/minisweagent/environments/extra/swerex_docker.py:33 Subprocess shell true

medium System graph security security conf 1.00 Insecure pattern 'subprocess_shell_true' in livebench/agentic_code_runner/minisweagent/environments/local.py:25

Found a known-risky pattern (subprocess_shell_true). Review and replace if possible.

livebench/agentic_code_runner/minisweagent/environments/local.py:25 Subprocess shell true

medium System graph quality Integrity conf 1.00 Network/subprocess call without timeout or try/except — livebench/agentic_code_runner/eval/utils/git_util.py:33

`subprocess.run(...)` here lacks both a `timeout=` arg and an enclosing try/except. This is exactly the class of bug that took down our git-clone earlier (HTTP/2 stream cancel surfaced as a fatal). Add a `timeout=` and wrap in try/except, or use a wrapper that retries.

runtime safety, Robustness

medium System graph quality Integrity conf 1.00 Network/subprocess call without timeout or try/except — livebench/agentic_code_runner/minisweagent/environments/docker.py:106

`subprocess.Popen(...)` here has no enclosing try/except and no timeout is applied. `Popen` itself takes no `timeout=` argument; the limit belongs on `communicate(timeout=...)` or `wait(timeout=...)`. This is exactly the class of bug that took down our git-clone earlier (HTTP/2 stream cancel surfaced as a fatal). Add the timeout and wrap the call in try/except, or use a wrapper that retries.

runtime safety, Robustness

medium System graph quality Integrity conf 1.00 Network/subprocess call without timeout or try/except — livebench/scripts/check_grading_flakiness.py:148

`subprocess.run(...)` here lacks both a `timeout=` arg and an enclosing try/except. This is exactly the class of bug that took down our git-clone earlier (HTTP/2 stream cancel surfaced as a fatal). Add a `timeout=` and wrap in try/except, or use a wrapper that retries.

runtime safety, Robustness

medium System graph quality Integrity conf 1.00 Network/subprocess call without timeout or try/except — livebench/scripts/check_question_variance.py:93

`subprocess.run(...)` here lacks both a `timeout=` arg and an enclosing try/except. This is exactly the class of bug that took down our git-clone earlier (HTTP/2 stream cancel surfaced as a fatal). Add a `timeout=` and wrap in try/except, or use a wrapper that retries.

runtime safety, Robustness
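
The "wrapper that retries" suggested in these findings could take roughly this shape. This is a sketch under assumptions: `run_with_timeout` and its defaults are hypothetical, and the retry policy (linear backoff, three attempts) is arbitrary.

```python
import logging
import subprocess
import sys
import time

logger = logging.getLogger(__name__)


def run_with_timeout(argv, timeout=30.0, attempts=3, backoff=0.5):
    """Run argv with a time limit; retry on timeout, non-zero exit, or OS error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout, check=True
            )
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as exc:
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(backoff * attempt)
    raise RuntimeError(f"command failed after {attempts} attempts") from last_error


result = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout=30.0)
```

Retrying suits transient failures such as a dropped network stream; a command that fails deterministically will simply fail `attempts` times, so the final `RuntimeError` keeps the last underlying error attached as its cause.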

medium System graph cicd CI/CD security conf 1.00 No CI/CD pipelines detected

No GitHub Actions, GitLab CI, or CircleCI configs found. Without CI you can't gate deploys on tests/lints.

CI/CD security, Coverage

medium System graph quality Tests conf 1.00 Very low test-to-source ratio

8 test file(s) for 334 source file(s) (ratio 0.02). Consider adding integration or unit tests for critical paths.

Coverage

low Security checks software dependencies conf 0.88 cryptography: GHSA-5cpq-8wj7-hf2v

Vulnerable OpenSSL included in cryptography wheels

livebench/code_runner/requirements_eval.txt

low Security checks software dependencies conf 0.88 cryptography: GHSA-jm77-qphf-c4w8

pyca/cryptography's wheels include vulnerable OpenSSL

livebench/code_runner/requirements_eval.txt

low Security checks software dependencies conf 0.88 cryptography: GHSA-v8gr-m533-ghj9

Vulnerable OpenSSL included in cryptography wheels

livebench/code_runner/requirements_eval.txt

low Security checks software dependencies conf 0.88 django: GHSA-mjgh-79qc-68w3

Django has a Race Condition vulnerability

livebench/code_runner/requirements_eval.txt

low Security checks software dependencies conf 0.88 django: GHSA-q95w-c7qg-hrff

Django vulnerable to partial directory traversal via archives

livebench/code_runner/requirements_eval.txt

low Security checks quality Quality conf 0.60 30 occurrences Duplicated implementation block across source files

Duplicate implementation blocks are maintenance debt. Keep them visible, but they are not a high-severity defect unless the duplicated logic is security-sensitive or drifting.

12 files, 30 locations

livebench/agentic_code_runner/eval/harness/repos/c/valkey_io/valkey.py:7, 18, 25, 98, 175 (5 hits)

livebench/agentic_code_runner/eval/harness/repos/c/ponylang/ponyc.py:7, 18, 25, 342 (4 hits)

livebench/agentic_code_runner/eval/harness/repos/c/redis/redis.py:1, 18, 175, 211 (4 hits)

livebench/agentic_code_runner/eval/harness/repos/c/libgit2/libgit2.py:7, 18, 72 (3 hits)

livebench/agentic_code_runner/eval/harness/repos/c/mruby/mruby.py:7, 125, 174 (3 hits)

livebench/agentic_code_runner/eval/harness/repos/c/jqlang/jq.py:1, 18 (2 hits)

livebench/agentic_code_runner/eval/harness/repos/c/libsdlorg/SDL.py:7, 87 (2 hits)

livebench/agentic_code_runner/eval/harness/repos/c/php/phpsrc.py:7, 88 (2 hits)

duplication, quality

low Security checks software dependencies conf 0.88 flask: GHSA-68rp-wp8r-4726

Flask session does not add `Vary: Cookie` header when accessed in some ways

livebench/code_runner/requirements_eval.txt

low System graph quality Integrity conf 1.00 13 env vars used in code but missing from .env.example

Drift between code and config docs. The first few: `BIGCODEBENCH_TIMEOUT_PER_TASK`, `LITELLM_MODEL_REGISTRY_PATH`, `LIVEBENCH_API_KEY`, `MSWEA_CONFIG_DIR`, `MSWEA_DOCKER_EXECUTABLE`, `MSWEA_GLOBAL_CALL_LIMIT`, `MSWEA_GLOBAL_CONFIG_DIR`, `MSWEA_GLOBAL_COST_LIMIT` + 5 more. Add them (with a placehold…

config drift
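
A drift check of this kind can be approximated with a regular expression. The sketch below is an assumption about how such a comparison might work, not the scanner's actual method; the helper names are hypothetical, and the sample source lines are invented around two of the variable names listed above.

```python
import re

# Matches os.getenv("NAME"), os.environ.get("NAME") and os.environ["NAME"].
ENV_READ = re.compile(
    r"""os\.(?:getenv\(|environ\.get\(|environ\[)\s*["']([A-Z][A-Z0-9_]*)["']"""
)


def env_names_in_source(source: str) -> set[str]:
    """Collect environment variable names read by the given source text."""
    return set(ENV_READ.findall(source))


def env_names_in_example(example_text: str) -> set[str]:
    """Collect the keys declared in a .env.example style file."""
    names = set()
    for raw in example_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#") and "=" in line:
            names.add(line.split("=", 1)[0].strip())
    return names


def missing_from_example(source: str, example_text: str) -> list[str]:
    return sorted(env_names_in_source(source) - env_names_in_example(example_text))


sample_source = (
    'import os\n'
    'key = os.environ["LIVEBENCH_API_KEY"]\n'
    'config_dir = os.getenv("MSWEA_CONFIG_DIR", "")\n'
)
sample_example = "# API access\nLIVEBENCH_API_KEY=\n"
missing = missing_from_example(sample_source, sample_example)
```

Names built dynamically (for example from f-strings) are invisible to a pattern like this, so the result is a lower bound on the drift.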

low System graph software Dead code candidate conf 1.00 File has no detected symbols: livebench/agentic_code_runner/eval/harness/constant.py

Source file with no class/function declarations — possible config, dead code, or scratch file.

low System graph software Dead code candidate conf 1.00 File has no detected symbols: livebench/download_leaderboard.py

Source file with no class/function declarations — possible config, dead code, or scratch file.

low System graph software Dead code candidate conf 1.00 File has no detected symbols: livebench/download_questions.py

Source file with no class/function declarations — possible config, dead code, or scratch file.

low System graph software Dead code candidate conf 1.00 File has no detected symbols: livebench/if_runner/instruction_following_eval/json_formatter.py

Source file with no class/function declarations — possible config, dead code, or scratch file.

low System graph security security conf 1.00 Insecure pattern 'debug_true' in livebench/lcb_runner/evaluation/compute_code_generation_metrics.py:29

Found a known-risky pattern (debug_true). Review and replace if possible.

livebench/lcb_runner/evaluation/compute_code_generation_metrics.py:29 Debug true

low System graph security security conf 1.00 Insecure pattern 'debug_true' in livebench/scripts/check_grading_flakiness.py:111

Found a known-risky pattern (debug_true). Review and replace if possible.

livebench/scripts/check_grading_flakiness.py:111 Debug true

low System graph security security conf 1.00 Insecure pattern 'debug_true' in livebench/scripts/edit_questions.py:138

Found a known-risky pattern (debug_true). Review and replace if possible.

livebench/scripts/edit_questions.py:138 Debug true

low System graph quality Integrity conf 1.00 Near-duplicate function bodies in 10 places

Functions with the same first-5-line body hash: livebench/agentic_code_runner/eval/harness/run_evaluation.py:from_dict, livebench/agentic_code_runner/eval/harness/pull_request.py:from_dict, livebench/agentic_code_runner/eval/harness/pull_request.py:from_dict, livebench/agentic_code_runner/eval/harn…

duplicates, duplication
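
The report does not say how the "first-5-line body hash" is computed. The sketch below is one plausible reading, with hypothetical helper names: hash the first five source lines of each function body after stripping indentation, then group functions that share a hash.

```python
import ast
import hashlib
from collections import defaultdict


def body_prefix_hash(source: str, node: ast.FunctionDef, lines: int = 5) -> str:
    """Hash the first `lines` source lines of a function body, ignoring indentation."""
    start = node.body[0].lineno - 1
    end = node.body[-1].end_lineno
    body_lines = source.splitlines()[start:end][:lines]
    normalized = "\n".join(line.strip() for line in body_lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_groups(sources: dict[str, str]) -> list[list[str]]:
    """Group `file:function` labels whose body prefixes hash identically."""
    groups = defaultdict(list)
    for filename, source in sources.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                groups[body_prefix_hash(source, node)].append(f"{filename}:{node.name}")
    return [sorted(names) for names in groups.values() if len(names) > 1]


sample_sources = {
    "a.py": "def from_dict(d):\n    x = d['a']\n    return x\n",
    "b.py": "def load(d):\n    x = d['a']\n    return x\n",
    "c.py": "def other():\n    return 1\n",
}
groups_found = duplicate_groups(sample_sources)
```

Because only a short prefix is compared, small boilerplate methods such as `from_dict` or `from_json` match easily; a hit is a prompt to look, not proof that the functions are interchangeable.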

low System graph quality Integrity conf 1.00 Near-duplicate function bodies in 14 places

Functions with the same first-5-line body hash: livebench/agentic_code_runner/eval/harness/run_evaluation.py:dict, livebench/agentic_code_runner/eval/harness/pull_request.py:dict, livebench/agentic_code_runner/eval/harness/pull_request.py:dict, livebench/agentic_code_runner/eval/harness/pull_reques…

duplicates, duplication

low System graph quality Integrity conf 1.00 12 occurrences Near-duplicate function bodies in 2 places

Functions with the same first-5-line body hash: livebench/code_runner/eval/__init__.py:trusted_check_exec, livebench/code_runner/eval/__init__.py:trusted_check. This is *the* AI-coder failure mode (4× more duplication in vibe-coded repos; see https://jw.hn/ai-code-hygiene). Consolidate or document…

12 occurrences

repo-level (12 hits)

duplicates, duplication

low System graph quality Integrity conf 1.00 3 occurrences Near-duplicate function bodies in 3 places

Functions with the same first-5-line body hash: livebench/agentic_code_runner/eval/harness/run_evaluation.py:json, livebench/agentic_code_runner/eval/harness/gen_report.py:json, livebench/agentic_code_runner/eval/harness/build_dataset.py:json. This is *the* AI-coder failure mode (4× more duplicatio…

3 occurrences

repo-level (3 hits)

duplicates, duplication

low System graph quality Integrity conf 1.00 2 occurrences Near-duplicate function bodies in 4 places

Functions with the same first-5-line body hash: livebench/agentic_code_runner/eval/harness/run_evaluation.py:run_mode_image, livebench/agentic_code_runner/eval/harness/run_evaluation.py:run, livebench/agentic_code_runner/eval/harness/build_dataset.py:run_mode_image, livebench/agentic_code_runner/ev…

2 occurrences

repo-level (2 hits)

duplicates, duplication

low System graph quality Integrity conf 1.00 Near-duplicate function bodies in 9 places

Functions with the same first-5-line body hash: livebench/agentic_code_runner/eval/harness/run_evaluation.py:from_json, livebench/agentic_code_runner/eval/harness/pull_request.py:from_json, livebench/agentic_code_runner/eval/harness/pull_request.py:from_json, livebench/agentic_code_runner/eval/harn…

duplicates, duplication

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `connections_process_results_old` in livebench/process_results/writing/connections/utils.py:15

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `message_copy` in livebench/agentic_code_runner/minisweagent/models/litellm_model.py:420

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `read_df_func_v2` in livebench/process_results/data_analysis/tablereformat/utils.py:41

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `trajectory_copy` in livebench/scripts/replay_agent_trajectory.py:311

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `two_score_pattern_backup` in livebench/common.py:59

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `web_of_lies_v2` in livebench/gen_api_answer.py:342

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `web_of_lies_v2` in livebench/gen_ground_truth_judgment.py:22

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph quality Integrity conf 1.00 Old/deprecated-named symbol `zebra_puzzle_process_results_old` in livebench/process_results/reasoning/zebra_puzzle/utils.py:5

Names with suffixes like `_old`, `_v1`, `_deprecated` usually indicate replaced-but-not-removed code (typical AI-coder leftover). Confirm and delete, or rename if it's the active version.

old marker, Dead code

low System graph software Dead code conf 1.00 Possibly dead Python function: build_image

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/agentic_code_runner/eval/harness/run_evaluation.py:566

low System graph software Dead code conf 1.00 Possibly dead Python function: compatible_eval_result

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/__init__.py:51

low System graph software Dead code conf 1.00 Possibly dead Python function: display_result_single

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/show_livebench_result.py:226

low System graph software Dead code conf 1.00 Possibly dead Python function: evaluate_files

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/__init__.py:254

low System graph software Dead code conf 1.00 Possibly dead Python function: inner_wrapper

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/agentic_code_runner/eval/harness/instance.py:33

low System graph software Dead code conf 1.00 Possibly dead Python function: is_floats

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/__init__.py:101

low System graph software Dead code conf 1.00 Possibly dead Python function: load_single_model_judgments

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/common.py:404

low System graph software Dead code conf 1.00 Possibly dead Python function: normalize_game_key_dict

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/common.py:395

low System graph software Dead code conf 1.00 Possibly dead Python function: play_a_match_wrapper

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/gen_ground_truth_judgment.py:507

low System graph software Dead code conf 1.00 Possibly dead Python function: print_report

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/agentic_code_runner/minisweagent/run/batch_progress.py:183

low System graph software Dead code conf 1.00 Possibly dead Python function: process_instance

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/agentic_code_runner/minisweagent/run/run_batch.py:100

low System graph software Dead code conf 1.00 Possibly dead Python function: process_jsonl_file

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/scripts/syntax_error_finder.py:119

low System graph software Dead code conf 1.00 Possibly dead Python function: readable

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:268

low System graph software Dead code conf 1.00 Possibly dead Python function: remove_readonly

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/agentic_code_runner/eval/utils/fs_utils.py:36

low System graph software Dead code conf 1.00 Possibly dead Python function: run_commands_for_model

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/scripts/rerun_failed_questions.py:77

low System graph software Dead code conf 1.00 Possibly dead Python function: run_instance

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/agentic_code_runner/eval/harness/run_evaluation.py:725

low System graph software Dead code conf 1.00 Possibly dead Python function: run_iteration

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/scripts/check_question_variance.py:59

low System graph software Dead code conf 1.00 Possibly dead Python function: safe_exec

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:204

low System graph software Dead code conf 1.00 Possibly dead Python function: safe_killpg

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:146

low System graph software Dead code conf 1.00 Possibly dead Python function: safe_os_popen

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:198

low System graph software Dead code conf 1.00 Possibly dead Python function: safe_subprocess_call

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:158

low System graph software Dead code conf 1.00 Possibly dead Python function: safe_subprocess_check_output

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:164

low System graph software Dead code conf 1.00 Possibly dead Python function: safe_subprocess_run

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:170

low System graph software Dead code conf 1.00 Possibly dead Python function: safe_system

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/utils.py:152

low System graph software Dead code conf 1.00 Possibly dead Python function: trusted_check

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/__init__.py:351

low System graph software Dead code conf 1.00 Possibly dead Python function: trusted_check_exec

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/__init__.py:341

low System graph software Dead code conf 1.00 Possibly dead Python function: unsafe_execute

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/code_runner/eval/__init__.py:112

low System graph software Dead code conf 1.00 Possibly dead Python function: update_preds_file

No callers detected by AST scan in this repo. Could be exported for external callers or a framework handler.

livebench/agentic_code_runner/minisweagent/run/run_batch.py:75

low System graph quality Integrity conf 1.00 Stub function `get_instruction_args` (body is just `pass`/`return`) — livebench/if_runner/ifbench/instructions.py:227

Likely an AI scaffold that was never filled in. Remove or implement.

Empty handler, Dead code

low System graph quality Integrity conf 1.00 Stub function `get_instruction_args` (body is just `pass`/`return`) — livebench/if_runner/instruction_following_eval/instructions.py:1301

Likely an AI scaffold that was never filled in. Remove or implement.

Empty handler, Dead code

low System graph quality Complexity conf 1.00 Very large file: livebench/if_runner/ifbench/instructions.py (2252 lines)

Files with >800 lines often hide complexity hotspots and discourage tests.

low System graph quality Complexity conf 1.00 Very large file: livebench/if_runner/instruction_following_eval/instructions.py (1570 lines)

Files with >800 lines often hide complexity hotspots and discourage tests.
