Right file, wrong lines
A new benchmark suggests many AI coding agents are better at finding the file that contains a bug than at pinpointing the lines that need to change. That distinction matters more than it first sounds: if a model can identify the correct file but still miss the exact lines that need to change, then an end-to-end “bug fixed” score can hide a serious failure mode.
That is the central claim of SWE-Explore, a benchmark described by The Decoder in its June 14, 2026 report, “AI coding agents find the right file but miss the exact lines that matter, study shows” (The Decoder). The study evaluates 848 problems across 203 open-source projects and, crucially, isolates the search phase from the actual patching step. Once that separation is introduced, the picture changes: even strong models often land in the right file, but their line-level coverage of the relevant code averages only 14% to 19%.
For engineering teams evaluating AI code assistants, that gap is not academic. It means a system can look competent in demos and still be unreliable at the point where correctness depends on precision. The difference between “near the bug” and “on the bug” is where production risk lives.
What SWE-Explore changes
Most prior evaluations of coding agents have treated bug fixing as a single verdict: did the system ultimately produce a working fix or not? That framing is convenient, but it conflates at least two different capabilities.
- Search: Can the agent identify the right repository, file, and region of code?
- Patch quality: Can it assemble the exact lines needed to repair the defect?
SWE-Explore splits those tasks apart. By benchmarking the search phase separately, it exposes a weakness that end-to-end scores can blur. A model that retrieves the right file may still fail to surface enough of the relevant surrounding code to generate a correct patch.
That separation matters because code repair is not just a retrieval problem. In practice, fixes often depend on nearby functions, shared state, tests, and error-handling paths that sit outside the first obvious slice of code. The study’s reported pattern is consistent with that reality: the models do better when they are allowed to read more broadly, and worse when they aggressively narrow the context too early.
The headline number is the most important one for practitioners: 14%–19% line-level coverage on average, even when the right file has been found. SWE-Explore also reports that repairs tend to succeed only once the model identifies at least half of the necessary lines. In other words, landing in the correct file is not close to sufficient.
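To make the gap between a file hit and line-level coverage concrete, here is a minimal sketch. The article does not give the benchmark's exact metric definition, so this assumes coverage is the fraction of needed lines that appear in what the agent retrieved. The `LineMap` type, the function names, and the sample data are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation: file path -> set of 1-indexed line numbers.
LineMap = Dict[str, Set[int]]


def file_hit(needed: LineMap, retrieved: LineMap) -> bool:
    """True if the agent retrieved at least one file that contains needed lines."""
    return any(path in retrieved for path in needed)


def line_coverage(needed: LineMap, retrieved: LineMap) -> float:
    """Fraction of needed lines present in the retrieved lines (assumed definition)."""
    total = sum(len(lines) for lines in needed.values())
    if total == 0:
        return 0.0
    found = sum(
        len(lines & retrieved.get(path, set()))
        for path, lines in needed.items()
    )
    return found / total


# Illustrative data only: the fix needs 10 lines, the agent surfaced 2 of them.
needed = {"src/parser.py": set(range(40, 50))}
retrieved = {"src/parser.py": {10, 11, 12, 40, 41}}

print(file_hit(needed, retrieved))       # True
print(line_coverage(needed, retrieved))  # 0.2
```

In this made-up case the agent scores a file hit, yet it sits far below the roughly one-half coverage at which the study reports repairs starting to succeed.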
Why file-level success does not equal patch reliability
This is the part that changes how teams should think about “AI-assisted coding” in production.
A system can be directionally correct and still operationally unsafe. If the agent sees the right file but misses the lines that define a bug’s root cause, it may propose a plausible-looking diff that fails tests, introduces regressions, or fixes a symptom while leaving the underlying issue untouched.
That distinction has been easy to miss because many product narratives emphasize autocomplete speed, repository navigation, or code search. Those are useful capabilities, but they are not the same as reliable repair. SWE-Explore suggests that search strength can overstate overall usefulness unless the tool can also preserve enough context to support patch generation and verification.
For buyers, that means demos should not be judged on whether the assistant can point to the right file. The harder question is whether it can consistently identify the relevant line range, explain why those lines matter, and produce a patch that survives validation.
Product implications: rethink the architecture, not just the prompt
The benchmark’s practical message is blunt: if your tool only optimizes for narrow retrieval, you may be improving the wrong layer.
Engineering teams building or adopting AI coding systems should treat search and patching as connected but distinct stages. A few implications follow:
- Broaden context before filtering aggressively. SWE-Explore’s results indicate that models do better with more surrounding code, not less. In design terms, context breadth appears to matter more than minimizing irrelevant tokens too early.
- Add line-level validation to the workflow. A model should not be considered “done” when it finds the right file. It should be checked against the exact lines implicated by the issue, the test failure, or the stack trace.
- Separate retrieval metrics from patch metrics. Measure file hit rate, line-level coverage, patch correctness, and test pass rate independently. A single success number obscures where the system is failing.
- Use verification as a gating step. Any workflow that proposes code changes should run tests and static checks before the patch reaches a reviewer, and require reviewer confirmation before it merges, especially on high-risk paths.
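One way a team might keep the stages apart in its own evaluation harness is sketched below. This is an illustration under assumptions, not the benchmark's tooling: the `RunResult` fields, the `stage_report` function, and the sample runs are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunResult:
    """One evaluation attempt; all field names are hypothetical."""
    file_hit: bool        # agent retrieved a file containing needed lines
    line_coverage: float  # fraction of needed lines surfaced, 0.0 to 1.0
    patch_correct: bool   # patch judged correct on review
    tests_passed: bool    # patched code passed the test suite


def stage_report(results: List[RunResult]) -> Dict[str, float]:
    """Summarize each stage separately instead of one success number."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "file_hit_rate": sum(r.file_hit for r in results) / n,
        "mean_line_coverage": sum(r.line_coverage for r in results) / n,
        "patch_correct_rate": sum(r.patch_correct for r in results) / n,
        "test_pass_rate": sum(r.tests_passed for r in results) / n,
    }


# Made-up runs for illustration only.
runs = [
    RunResult(True, 0.60, True, True),
    RunResult(True, 0.15, False, False),
    RunResult(True, 0.10, False, False),
    RunResult(False, 0.00, False, False),
]

for name, value in stage_report(runs).items():
    print(f"{name}: {value:.2f}")
```

With these sample runs the file hit rate looks healthy while the other three numbers do not, which is the kind of divergence a single pass/fail score would hide.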
For product vendors, this argues for architectures that couple retrieval with context expansion and patch verification rather than treating “code search” as the core feature. If the system can find the neighborhood but not the neighborhood’s boundaries, its utility will be uneven.
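As a rough illustration of context expansion, the sketch below widens each retrieved line range by a fixed margin and merges ranges that end up overlapping or adjacent. The article does not describe how any tool does this. The `expand_and_merge` function, the margin of 8 lines, and the sample ranges are assumptions chosen for the example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start_line, end_line), 1-indexed


def expand_and_merge(hits: List[Range], margin: int, file_length: int) -> List[Range]:
    """Widen each range by `margin` lines, clamp to the file, then merge overlaps."""
    widened = sorted(
        (max(1, start - margin), min(file_length, end + margin))
        for start, end in hits
    )
    merged: List[Range] = []
    for start, end in widened:
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Three narrow hits in a hypothetical 200-line file.
print(expand_and_merge([(40, 42), (55, 56), (198, 199)], margin=8, file_length=200))
# [(32, 64), (190, 200)]
```

A fixed margin is the simplest possible choice. A real system would more plausibly expand along function or class boundaries, but the point is the same: widen first, filter later.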
Market positioning and rollout risk
The report also has a messaging implication: vendors that sell AI coding tools on the strength of repository search may be overselling the most fragile part of the stack.
A buyer hearing “the model found the right file” may assume the hard work is mostly done. SWE-Explore shows that assumption is unsafe. In real deployments, line-level misses can translate into wasted reviewer time, failed automation, and lower trust in assistant-generated changes. The risk compounds in teams that want to move from suggestion-only workflows to automated patching.
That means procurement and internal rollout should focus on observable repair quality, not just navigation quality. Engineering managers should ask vendors for evidence on:
- how often the tool identifies the relevant lines, not just the file;
- how performance changes as required context grows;
- whether the system includes explicit patch verification;
- and how often a generated fix passes tests without manual rescue.
If a product cannot answer those questions, the claim that it “helps fix bugs” should be treated cautiously.
What to watch next
SWE-Explore does not prove that AI coding agents are poor at all forms of repair. It does show that current evaluation habits are too coarse to capture where these systems fail.
The next round of work should push in three directions. First, benchmarks need to keep separating search from patch generation so teams can see which capability is actually improving. Second, models and tools should be designed to absorb broader context instead of assuming a smaller slice is always better. Third, deployment pipelines need line-level and patch-level verification before any code reaches production.
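The third direction could take the form of a small pre-merge gate. The sketch below assumes a single file, a set of lines implicated by a stack trace or failing test, and a boolean test result. The `gate_patch` function and its inputs are hypothetical. The line-level check is a heuristic for flagging patches for review, since a legitimate fix can sit away from the lines a stack trace names.

```python
from typing import List, Set


def gate_patch(touched: Set[int], implicated: Set[int], tests_passed: bool) -> List[str]:
    """Return reasons to hold the patch; an empty list means it may proceed."""
    reasons: List[str] = []
    # Line-level check (heuristic): the patch should touch an implicated line.
    if implicated and not (touched & implicated):
        reasons.append("patch touches none of the implicated lines")
    # Patch-level check: the test suite must pass.
    if not tests_passed:
        reasons.append("test suite did not pass")
    return reasons


# Illustrative values: the stack trace points at lines 44-46, the patch edits 12-14.
print(gate_patch(touched={12, 13, 14}, implicated={44, 45, 46}, tests_passed=True))
# ['patch touches none of the implicated lines']
```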
That is the practical shift this study forces. End-to-end scores still matter, but they no longer tell the whole story. A tool that finds the right file and misses the exact lines is not a finished coding assistant; it is a partial retrieval system with a patching problem.
For teams deciding whether to trust AI coding agents in real workflows, that distinction should now sit at the center of evaluation, procurement, and rollout.



