Why we no longer evaluate SWE-bench Verified


TL;DR

SWE-bench Verified is no longer used for evaluating autonomous software engineering due to flawed test cases and training data contamination, which skew results and fail to reflect real-world model capabilities.

Key Takeaways

  • SWE-bench Verified has issues with test cases that reject correct solutions, making it unreliable for measuring progress.
  • Models are often trained on the benchmark data, leading to contamination that inflates scores without genuine improvement.
  • The benchmark's problems include narrow or wide test cases that don't align with task descriptions, causing unfair failures.
  • OpenAI recommends discontinuing use of SWE-bench Verified and switching to alternatives like SWE-bench Pro for more accurate assessments.


Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI’s Preparedness Framework. When we initially created the Verified benchmark, we attempted to fix issues in the original evaluation that made certain tasks in the SWE-bench dataset impossible to accomplish.

After initial leaps, state-of-the-art progress on SWE-bench Verified has slowed, improving from 74.9% to 80.9% over the last 6 months. This raises the question: do the remaining failures reflect model limitations or properties of the dataset itself?

In a new analysis, we found two major issues with the Verified set that indicate the benchmark is no longer suitable for measuring progress on autonomous software engineering capabilities for frontier launches at today’s performance levels:

  1. Tests reject correct solutions: We audited a subset comprising 27.6% of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts to address this when we first created SWE-bench Verified.
  2. Training on solutions: Because large frontier models can learn information from their training data, it is important that they are never trained on the problems and solutions they are evaluated on. This is akin to sharing the problems and solutions for an upcoming test with students beforehand: they may not memorize the answers verbatim, but students who have seen them will certainly do better than those who have not. SWE-bench problems are sourced from open-source repositories that many model providers use for training. In our analysis, we found that for certain tasks, all frontier models we tested could reproduce either the original, human-written bug fix used as the ground-truth reference (known as the gold patch) or verbatim specifics of the problem statement, indicating that all of them have seen at least some of the problems and solutions during training.

We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time. This is why we have stopped reporting SWE-bench Verified scores, and we recommend that other model developers do so too.

We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.

Background

The original SWE-bench evaluation was released in 2023. Each problem is sourced from a resolved GitHub issue in one of 12 open-source Python repositories and paired with the corresponding pull request (PR). To determine whether a model-generated code change is correct, each problem comes with two sets of tests:

  • Tests that fail on the unmodified codebase but pass if the issue is correctly fixed
  • Regression tests that pass both before and after the fix, to ensure unrelated functionality remains intact

The model does not see the tests. It has to produce a code change given only the original issue text and the state of the repository before the fix. It passes a problem only if all tests pass after the code change is applied.
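The grading logic described above can be sketched as follows. This is a minimal illustration, not the actual SWE-bench harness: the function and test names are hypothetical, and `run_test` stands in for executing a real test against the repository with the candidate patch applied.

```python
# Minimal sketch of SWE-bench-style grading (hypothetical helpers, not the
# real harness). A candidate patch passes a problem only if every
# fail-to-pass test now passes AND every regression test still passes.

def grade(run_test, fail_to_pass, pass_to_pass):
    """run_test(name) -> bool, evaluated with the candidate patch applied."""
    # Tests that failed before the fix must now pass...
    if not all(run_test(t) for t in fail_to_pass):
        return False
    # ...and unrelated functionality must remain intact.
    return all(run_test(t) for t in pass_to_pass)

# Example: a patch that fixes the issue but breaks a regression test fails overall.
results = {"test_fix": True, "test_unrelated": False}
print(grade(lambda t: results[t], ["test_fix"], ["test_unrelated"]))  # False
```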

We found many issues with that evaluation that could lead to underreporting model capabilities.

  • Some unit tests were overly specific or misaligned with the task, so correct fixes could be rejected.
  • Many task statements were underspecified and open to multiple valid interpretations, while the tests covered only one of them.
  • Depending on the environment setup (for example, Linux vs. Windows, or the Python version), some tests could fail spuriously.

We created SWE-bench Verified in 2024 to address these issues. We worked with expert software engineers to review 1,699 SWE-bench problems and filter out problems that had these issues. Each problem was reviewed by three experts independently. This review process resulted in SWE-bench Verified, a curated set of 500 problems.

Too narrow and too wide tests

While SWE-bench Verified is a big improvement over the initial version, residual issues remain. We conducted an audit of 138 SWE-bench Verified problems that OpenAI o3 did not consistently solve over 64 independent runs. Each case was independently reviewed by at least six experienced software engineers. If an expert flagged an issue, it was re-verified by an additional team.

We found that 59.4% of the 138 problems contained material issues in test design and/or problem description, rendering them extremely difficult or impossible to solve, even for the most capable model or human.

  • 35.5% of the audited tasks have strict test cases that enforce specific implementation details, invalidating many functionally correct submissions; we call these narrow test cases.
  • 18.8% of the audited tasks have tests that check for additional functionality not specified in the problem description; we call these wide test cases.
  • The remaining 5.1% of tasks had miscellaneous issues that did not fit this taxonomy.

An illustrative example of the first failure mode is pylint-dev__pylint-4551, where the PR introduces a new function `get_annotation` as part of the overall solution. This function name is not mentioned in the problem description, but it is imported directly by the tests. While some models might intuit that such a function should be created, implementing a function with this specific name is not strictly necessary to correctly address the problem. Many valid solutions fail the tests with import errors.

Problem description

Plain Text

Use Python type hints for UML generation
It seems that pyreverse does not read python type hints (as defined by [PEP 484](https://www.python.org/dev/peps/pep-0484/)), and this does not help when you use `None` as a default value :
### Code example
`
class C(object):
def __init__(self, a: str = None):
self.a = a
`
### Current behavior
Output of pyreverse :
![classes_test](https://user-images.githubusercontent.com/22218701/27432305-f10fe03e-574f-11e7-81fa-e2b59e493360.png)
### Expected behavior
I would like to see something like : `a : String` in the output.
### pylint --version output
pylint-script.py 1.6.5,
astroid 1.4.9
Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]

PR test snippet

Python

+from pylint.pyreverse.utils import get_annotation, get_visibility, infer_node

PR test failures (truncated for readability)

Python

==================================== ERRORS ====================================
_____________ ERROR collecting tests/unittest_pyreverse_writer.py ______________
ImportError while importing test module '/testbed/tests/unittest_pyreverse_writer.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/miniconda3/envs/testbed/lib/python3.9/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/unittest_pyreverse_writer.py:32: in <module>
from pylint.pyreverse.utils import get_annotation, get_visibility, infer_node
E ImportError: cannot import name 'get_annotation' from 'pylint.pyreverse.utils' (/testbed/pylint/pyreverse/utils.py)
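To see why functionally correct patches are rejected here, consider a small sketch. The alternative helper name `annotation_for` below is hypothetical: the point is that the PR tests import `get_annotation` by name, so a solution implementing the same behavior under any other name fails at test collection time, before any logic is exercised.

```python
import types

# Hypothetical solution module: functionally correct, but the model chose a
# different helper name than the one the hidden tests import.
solution = types.SimpleNamespace(annotation_for=lambda node: "str")

def collect_pr_test(module):
    # Mimics the test module's import line:
    #   from pylint.pyreverse.utils import get_annotation
    if not hasattr(module, "get_annotation"):
        return "ImportError: cannot import name 'get_annotation'"
    return "collected"

print(collect_pr_test(solution))  # rejected at import time, logic never runs
```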

An example of too-wide test cases is sympy__sympy-18199. This task was sourced from a PR that addressed three distinct issues with the `nthroot_mod` function, specifically #17373, #17377, and #18212. The description for the SWE-bench Verified task, however, covers only the final issue, #18212. This creates a mismatch: the PR tests cover all three issues, while the description details only one. In our runs, models often correctly implement the described fix and then fail the tests covering the other two issues.

Original PR description (from the GitHub PR)

Plain Text

Fixes #17373
Fixes #17377
Fixes #18212
- ntheory
- `nthroot_mod` now supports composite moduli

Problem Description for #18212

Plain Text

nthroot_mod function misses one root of x = 0 mod p.

When in the equation x**n = a mod p , when a % p == 0. Then x = 0 mod p is also a root of this equation. But right now `nthroot_mod` does not check for this condition. `nthroot_mod(17*17, 5 , 17)` has a root `0 mod 17`. But it does not return it.

Problem Description for SWE-bench Verified task (only taken from #18212):

Plain Text

nthroot_mod function misses one root of x = 0 mod p.

When in the equation x**n = a mod p , when a % p == 0. Then x = 0 mod p is also a root of this equation. But right now `nthroot_mod` does not check for this condition. `nthroot_mod(17*17, 5 , 17)` has a root `0 mod 17`. But it does not return it.
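The mathematical claim in #18212 is easy to check directly: whenever a ≡ 0 (mod p), x = 0 is a root of x**n ≡ a (mod p), since 0**n ≡ 0 ≡ a (mod p). A quick sanity check for the exact numbers in the issue:

```python
# When a % p == 0, x = 0 satisfies x**n ≡ a (mod p), since 0**n ≡ 0 ≡ a (mod p).
a, n, p = 17 * 17, 5, 17
assert a % p == 0              # the precondition from the issue
assert pow(0, n, p) == a % p   # so 0 mod 17 is indeed a root
print(f"0 mod {p} is a root of x**{n} = {a} mod {p}")
```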

Contamination

SWE-bench Verified and the underlying repositories (codebases and release notes) are all open-source and broadly used and discussed, which makes avoiding contamination difficult for model developers.

We first encountered signs of contamination in our own models: for example, GPT‑5.2 solved 31 tasks that we had identified as almost impossible to solve. In django__django-14725, the tests require a specific new parameter, `edit_only`, which is not explicitly required by the problem statement. While solving the problem, GPT‑5.2 shows in its chain of thought that it has information about the release notes detailing changes to the codebase, and it correctly identifies that the `edit_only` parameter was introduced in Django 4.1.

GPT‑5.2 CoT

Plain Text

There is also `edit_only` parameter maybe added around 4.1 or 4.2. Since this is 4.1 dev 2022, the code might be before introduction. We will implement now. Hidden tests will check new behavior.

To assess how significant contamination is more broadly, we created an automated red-teaming setup. For each SWE-bench Verified question, we tasked GPT‑5 with probing GPT‑5.2‑Chat, Claude Opus 4.5, and Gemini 3 Flash Preview for contamination. These models were chosen to exclude reasoning models, though we acknowledge there is likely a non-trivial capability gap between them.

To probe for contamination, GPT‑5 received the SWE-bench Verified task’s ID, description, gold patch, and PR tests. Over 15 turns, we allowed GPT‑5 to vary the system/developer prompt, user prompt, and assistant prefill, and to try different elicitation strategies, adapting its approach based on prior turns to iteratively recover task-specific details. After each turn, a judge model labeled how much novel task-specific information appeared in the response and rated its contamination severity from “none” to “strong.” For each example of strong contamination, we verified with another judge that GPT‑5 hadn’t leaked too much information to the target model. Finally, we manually reviewed the “strong” examples that make up the transcripts in this post.
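The red-teaming loop described above can be sketched roughly as follows. The function names are hypothetical stand-ins for real model calls, and only the "none" and "strong" endpoints of the severity scale come from our rubric as described; everything else is illustrative scaffolding.

```python
# Rough sketch of the automated contamination red-teaming loop (hypothetical
# API: `probe_target` calls the target model, `judge` rates leaked detail).

def red_team(task, probe_target, judge, max_turns=15):
    """Run up to max_turns elicitation attempts, stopping on strong contamination."""
    history = []
    for turn in range(max_turns):
        # The attacker varies system/developer prompt, user prompt, and
        # assistant prefill, adapting to what worked in earlier turns.
        prompt = {"task_id": task["id"], "turn": turn, "history": list(history)}
        response = probe_target(prompt)
        severity = judge(task, response)  # "none" ... "strong"
        history.append((prompt, response, severity))
        if severity == "strong":
            break  # candidate contamination found; hand off for manual review
    return history
```

In practice, each "strong" hit was then re-checked by a second judge for attacker-side leakage before the manual review.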

Below are examples of strong contamination across different model providers.

GPT‑5.2 

Given a short snippet from the task description, GPT‑5.2 outputs the exact gold patch. In particular, it knows the exact class and method name, and the new early-return condition `if username is None or password is None` that the patch introduces.

Contamination elicitation

Problem statement

ModelBackend.authenticate() shouldn't make a database query when username is None

User prompt

We're playing a SWE-bench Verified memory game.
