
VIBEPASS: Can Vibe Coders Really Pass the Vibe Check?

VIBEPASS, a new benchmark, reveals a fundamental weakness in modern AI coding assistants: even with near-perfect scores on code generation tasks, frontier models falter when it comes to finding and fixing subtle bugs

The Illusion of Competence

We are living through an era of rapid AI coding capability. Systems like GPT-5, Gemini-3-Pro, and Claude Opus-4.6 routinely exceed 90% on standard code generation benchmarks. AI can now write code—that’s no longer up for debate. Developers are shipping entire features from a few prompts. The industry even has a name for this: “vibe coding”, where humans supervise at a distance while models do the building.

VIBEPASS is designed to test a harder question: not whether models can generate code, but whether they can reason about faults in code that look nearly correct.

“Given a partially correct program with no observable failures, can an LLM judge the solution to be faulty, synthesize a concrete test witnessing the latent fault and exploit that diagnosis to repair it?”

Based on our evaluation of frontier models across carefully constructed test instances, the short answer is no: these systems do not perform this task reliably, and they are not close.

173
Benchmark instances across 76 problems
98%
‘Medium’ or ‘Hard’ coding problems from LeetCode and AtCoder
71%
Median success rate of buggy solutions
12
Frontier models evaluated
5
Tasks evaluating fault-targeted reasoning

What Is VIBEPASS Testing?

VIBEPASS examines a common failure mode in AI-generated code: solutions that mostly work but break on edge cases that matter. The challenge is whether frontier models can catch these bugs and repair the code until it passes every test.
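To make the failure mode concrete, here is a hypothetical example of a “mostly works” solution (not drawn from VIBEPASS): it handles typical inputs correctly but hides a latent fault on an edge case.

```python
# Hypothetical illustration, not a VIBEPASS instance: a solution to
# "product of array except self" that passes ordinary tests but
# contains a latent fault.
from math import prod

def product_except_self(nums):
    # Divides the total product by each element. Works whenever no
    # element is 0, but raises ZeroDivisionError on inputs containing 0.
    total = prod(nums)
    return [total // n for n in nums]

# Passes on typical inputs:
assert product_except_self([1, 2, 3, 4]) == [24, 12, 8, 6]
# Latent fault: product_except_self([1, 0, 3]) crashes with
# ZeroDivisionError, while the expected output is [0, 3, 0].
```

A test suite that never includes a zero would report this solution as fully correct, which is exactly the situation VIBEPASS probes.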

  1. Judge: Determine if the code is correct or faulty
    Given just the problem statement and the solution, classify whether the code contains a bug
  2. Test: Generate a fault-triggering (FT) input
    Produce a concrete test that (a) satisfies the problem constraints, (b) causes the buggy solution to produce the wrong output, and (c) causes a correct solution to produce the right output.
  3. Debug: Repair the code
    Using the fault-triggering test as a diagnostic, generate a corrected solution that passes all official test cases.
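The three criteria in step 2 can be sketched as a small checker. Here `constraints_ok`, `buggy`, and `reference` are placeholder names standing in for the pieces of a benchmark instance, not VIBEPASS’s actual API.

```python
# A minimal sketch of the fault-triggering (FT) test criteria:
# a test is FT iff it (a) satisfies the problem constraints,
# (b) makes the buggy solution produce a wrong output, and
# (c) is handled correctly by a correct (reference) solution.

def is_fault_triggering(test_input, constraints_ok, buggy, reference):
    if not constraints_ok(test_input):   # (a) input must be valid
        return False
    expected = reference(test_input)     # (c) ground truth from the reference
    try:
        actual = buggy(test_input)
    except Exception:
        return True                      # (b) a crash counts as a wrong output
    return actual != expected            # (b) a wrong answer exposes the fault
```

Criterion (c) is enforced here by taking the reference solution’s output as the ground truth that the buggy solution must disagree with.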

This three-step process reflects how software engineers actually work and is a requirement for any coding agent in production. VIBEPASS reveals that current models struggle with the Judge → FT-Test → Debug sequence far more than their standard code-generation performance numbers suggest.

We first study two Fault-Triggering (FT) Test Generation settings. In Bug-Aware, the model is told the program is buggy and must generate a test to expose it, measuring reasoning about known faults. In Bug-Discovery, the model first decides whether the program is buggy and generates a test only when it detects a bug, measuring joint fault detection and test generation. Together, the two settings isolate the effect of bug awareness on fault-targeted reasoning.
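The difference between the two settings can be pictured as two evaluation flows; the helper names below are illustrative placeholders, not the paper’s implementation.

```python
# Sketch of the two FT test-generation settings. The two callables
# stand in for model calls; they are assumptions for illustration.

def bug_aware(instance, generate_test):
    # The model is told the program is buggy; only test generation
    # is exercised.
    return generate_test(instance)

def bug_discovery(instance, judge_is_buggy, generate_test):
    # The model must first detect the bug itself, so judgment errors
    # propagate: a missed bug means no test is ever produced.
    if not judge_is_buggy(instance):
        return None
    return generate_test(instance)
```

In Bug-Discovery, any shortfall in the judge step caps the whole pipeline, which is why its numbers sit below the Bug-Aware ones.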

Finding 1: Syntactic Competence ≠ Fault Reasoning

The first result highlights a clear gap between two things that are easy to confuse: writing a valid test case and writing one that actually finds a bug. On average across 12 models, 86.4% of the test inputs were ‘valid’ and followed all the rules. But only 61.3% actually triggered the fault. That’s a 25-point gap between writing something that passes as a valid test and writing a test that actually exposes a bug.

86%
Avg. syntactically valid inputs
61%
Avg. fault-triggering tests
25pp
Gap: validity vs. discrimination
54pp
Best-to-worst model spread

There are two failure modes here. The primary one is the “fault hypothesis” gap, a 23-point difference between writing a valid input and finding one that actually triggers the bug. The smaller, 2-point “output validation” gap represents the difficulty of predicting the right output once the input is found.

KEY INSIGHT
The problem isn’t generating valid inputs or predicting outputs; it’s the inability to reason about which input will expose the fault. Once a model finds the right test input, it usually gets the answer right. The hard part is fault-targeted reasoning, and models still can’t do it reliably.

The differences between models are substantial. Claude Opus-4.6 reaches 80% on bug-aware fault-triggering tests, while Gemini-3.1-Pro and GPT-5.2 fail to reach 70%, despite all being labeled as “flagship reasoning models”. GPT-5-Nano presents another interesting case: it excels at rule-following (VI: 92.5%) but struggles to actually uncover bugs (DIO: 52%). That 40-percentage-point gap highlights a crucial distinction between appearing competent and being genuinely effective. Together, these results suggest that fault-targeted reasoning is a distinctly discriminative capability.

Finding 2: Bug Detection Is A Second Bottleneck

These first results assume the model already knows a bug exists. In the more realistic “bug-discovery” setting, where the model must decide for itself whether the code is broken, the numbers drop. Judgment accuracy falls to 71.4%, meaning models misclassify nearly 30% of programs. Overall success lands just under 50% when bug detection isn’t handed to them upfront.

A nuanced trend emerges. For strong models such as Sonnet-4.6, being told a bug exists does not matter much, and for some (e.g. GPT-OSS-120B) it can even hurt performance. These models use their own confidence to decide whether to act; when unsure, they let the code pass, keeping the gap between the two settings (DIO → J+DIO) minimal. Weaker models like GPT-5-Nano show the reverse, actually benefiting from bug-awareness because their internal judgment is too unreliable to stand on its own.

DIO → J+DIO change by model:

GPT-OSS-120B: +0.6% (gain)
Sonnet-4.6: −0.6% (drop)
GPT-5.2: −2.9% (drop)
GPT-5 (Mini): −15% (drop)
GPT-5 (Nano): −34.1% (drop)

Finding 3: Test-Guided Repair Is Not What You’d Hope

The natural assumption is straightforward: hand a model a concrete failing test and debugging gets easier. The VIBEPASS results contradict this. Three repair conditions were evaluated, contrasting unguided repair with external and self-generated diagnostic context:

  1. NoTest: the model knows a bug exists but receives no test case.
  2. ExtTest: the model is given an externally provided fault-triggering test (generated by the same model in a separate call).
  3. IntTest: the model first generates its own test internally, then repairs.
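The three conditions differ only in the diagnostic context handed to the model, which might be sketched as follows. The function, its prompt wording, and the condition names as code strings are hypothetical, not the VIBEPASS harness.

```python
# Illustrative sketch (assumed prompt wording, not the paper's) of how
# the three repair conditions differ in the context given to the model.

def make_repair_prompt(problem, code, condition, ext_test=None):
    prompt = f"{problem}\n\nBuggy solution:\n{code}\n"
    if condition == "NoTest":
        # Unguided repair: the model only knows a bug exists.
        prompt += "The code contains a bug. Fix it."
    elif condition == "ExtTest":
        # External diagnostic: a fault-triggering test is supplied.
        prompt += f"This test fails: {ext_test}\nFix the code."
    elif condition == "IntTest":
        # Self-generated diagnostic: test and fix share one
        # reasoning context.
        prompt += ("First write a test that exposes the bug, "
                   "then fix the code so the test passes.")
    return prompt
```

The interesting comparison is then whether the repair succeeds more often when the failing test arrives as a handoff artifact (ExtTest) or emerges inside the same chain of reasoning (IntTest).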

The most surprising finding: when test quality is controlled for (i.e. examining only instances where both the external and self-generated tests successfully trigger the fault), self-generated tests from strong reasoners outperform external ones (e.g. GPT-5.2 Codex gains 16.9pp), while for weak reasoners self-generated tests underperform explicitly provided ones (e.g. Gemini-3.1-Flash-Lite loses 28.1pp).

KEY INSIGHT
Test provenance matters: strong models leverage the implicit context alignment when a test emerges from the same chain of reasoning that produces the fix, while for weaker models, additional test generation splits the reasoning capacity available for debugging.

The broader picture is more unsettling. Repair performance under all three test conditions falls below even code generation, the baseline task models are most practiced at. NoTest barely trails it, but providing any test makes things worse: ExtTest drops further, and IntTest drops furthest of all.

The intuition — give a model a concrete failing test and it debugs better — simply doesn’t hold. A test that fails to genuinely expose the fault doesn’t just fail to help; it actively misleads. Models anchor to the bad diagnostic signal and patch in the wrong direction, performing worse than if they’d received no guidance at all. They currently lack the robustness to filter bad diagnostic signals from good ones.

KEY INSIGHT
Fault-targeted program repair for partially correct solutions is more challenging than code-synthesis from scratch, even for strong reasoning models. Test-guided repair is bottlenecked by test quality. Adding a test isn’t a free upgrade — it’s a bet. And right now, that bet loses more often than it wins.

Finding 4: The Pipeline Has Two Cliffs

VIBEPASS maps the full fault-reasoning pipeline as a cumulative waterfall, each step requiring all prior ones to hold. Performance erodes at every step, but two transitions are steeper than the others.

The first cliff: moving from valid output prediction to fault-triggering input generation. This drop averages −14.7 percentage points. It is the fault-hypothesis bottleneck made quantitative: the point where general execution ability runs out and causal, fault-targeted reasoning must take over.

The second cliff: moving from a valid fault-triggering test to a successful repair. This drop averages −21.2 percentage points. Even models that successfully expose a bug through a targeted test often fail to translate that diagnosis into a working fix.
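Because the waterfall is cumulative, per-stage conditional success rates multiply, so absolute performance can only shrink at each step. A minimal sketch with made-up rates (not VIBEPASS numbers):

```python
# Cumulative waterfall: each stage's rate is conditional on all prior
# stages holding. The rates below are invented for illustration only.
stages = [
    ("valid input",       0.95),
    ("valid output",      0.97),
    ("fault-triggering",  0.80),  # first cliff: fault hypothesis
    ("successful repair", 0.70),  # second cliff: diagnosis -> fix
]

cumulative = 1.0
for name, conditional_rate in stages:
    cumulative *= conditional_rate
    print(f"{name}: {cumulative:.1%}")
```

Even modest per-stage drops compound: with these example rates, end-to-end success is barely half, despite no single stage falling below 70%.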

Why This Matters Beyond Benchmarks

The gap VIBEPASS exposes is not an academic curiosity. The architecture of modern AI coding systems, where one LLM generates code, another evaluates, and another patches, depends on exactly the capability VIBEPASS shows is deficient: reasoning from “this code produces some wrong outputs” to “here is a concrete witness to the fault, and here is why the logic is wrong.”

THE REAL-WORLD IMPLICATION
Every “vibe coded” codebase contains bugs that look correct to the test suite. The question is not whether AI can write code that passes tests. The question is whether AI can find the tests its own code would fail. VIBEPASS shows that today’s models largely cannot.

Code-generation benchmarks and isolated test pass rates are poor proxies for real capability. What matters is fault-targeted evaluation, where diagnosis and repair are tightly coupled within the same agent context. VIBEPASS underscores why: naive pipelines that simply add more tests can backfire. Invalid or non-discriminating tests—those that don’t expose the fault—often mislead models into producing incorrect patches, degrading overall performance rather than improving it.

What Would Progress Look Like?

The larger fault hypothesis gap means that training signals focused on fault-targeted reasoning, not just code generation or output prediction, are most likely to move the needle. Techniques like fault localization training, contrastive examples of buggy vs. correct behavior, and reinforcement learning on execution feedback appear more relevant than scaling code generation capacity alone.

The finding that self-generated tests outperform external ones in controlled conditions hints at an architectural preference: agentic systems may benefit from keeping the fault hypothesis and repair within the same reasoning context, rather than decomposing them across model calls with test cases as the handoff artifact.

The Bottom Line

VIBEPASS arrives at a critical moment. As AI coding systems evolve from autocomplete tools to autonomous agents that write, review, and deploy full modules, fault reasoning becomes essential. This capability is distinct from general coding skill—and today’s frontier models still struggle, achieving under 50% success on end-to-end bug discovery and localization.

The vibe coders can pass the standard checks. They cannot yet reliably pass the vibe check.

FURTHER READING

VIBEPASS paper: https://arxiv.org/abs/2603.15921

Dataset: huggingface.co/datasets/Salesforce/vibepass

Evaluation code: github.com/SalesforceAIResearch/vibepass

The benchmark draws problems from LiveCodeBench and is designed to update continuously to resist contamination.
