
TEX: Test-Time Scaling Testing Agents via Execution-based Cross-Validation

The era of software engineering agents is underway. Benchmarks and real-world usage (e.g., tools like Cursor and Claude Code) show that LLMs can be remarkably effective at writing code for real-world use cases. Over the course of 2025, the best performance on SWE-Bench [1] improved by over 50%, ending the year at 74.4% as measured by the mini-swe-agent [2] evaluations.

While benchmarks like SWE-Bench help gauge performance on real-world software engineering tasks, they fail to measure a critical part of software engineering: writing tests. Test-case generation is not only critical to real-world coding but also part of solving SWE-Bench problems, as most SWE-Bench agent implementations ask the agent to write tests. However, the effectiveness of these tests is left as an afterthought. Luckily, recent work like SWT-Bench [3] has addressed this gap with a benchmark for real-world test-case generation.

In this blog post, we present TEX, a method for building effective test-case generation agents. As it turns out, the best test-case generation agents are truly software engineering agents, i.e. agents that write tests, write code, and learn from each other.

Hybrid Test-Time Scaling

Over the last year, the LLM field has shifted to emphasize test-time scaling, and this paradigm has driven many of the recent improvements. Test-time scaling, otherwise known as inference-time scaling, focuses on building models that benefit from more compute at inference time (i.e. when the model is being used).

One type of inference-time scaling is sequential, or serial, test-time scaling. This includes reasoning/thinking models like OpenAI’s o1 series, Google’s Gemini with thinking mode, and DeepSeek R1. These models produce more reasoning tokens when answering a question or solving a task, which improves performance: the model can search in natural language, consider various approaches to a problem, and pick the best one, as opposed to older generations of LLMs that would answer directly with minimal thinking. Serial scaling, the popular approach at the time of writing, makes the LLM more reliable because it can reason before answering. However, serial scaling is slow: it relies on one LLM generating a long stream of tokens one after the next, which makes it extremely difficult to parallelize.

An alternate approach is parallel test-time scaling, where you run multiple LLMs in parallel on a given question and use some strategy to pick the best answer. Parallel scaling provides increased diversity in responses, since we sample independently multiple times, and, as the name suggests, it is trivially parallelizable. However, parallel sampling produces N responses and requires a selection method to pick the best one, which can be challenging for complex tasks like software engineering.
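To make the parallel picture concrete, below is a minimal Python sketch of a best-of-N loop; `sample_llm` and `score_candidate` are hypothetical placeholders for an LLM call and a selection heuristic, not APIs from any particular library.

```python
# A minimal best-of-N sketch of parallel test-time scaling.
# `sample_llm` and `score_candidate` are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor

def sample_llm(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder: one independently sampled LLM response."""
    raise NotImplementedError

def score_candidate(prompt: str, response: str) -> float:
    """Placeholder: the selection method (e.g. a verifier or reward model)."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 4) -> str:
    # Sampling N responses independently is trivially parallelizable.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: sample_llm(prompt), range(n)))
    # The hard part is selection: picking the best candidate among the N.
    return max(candidates, key=lambda r: score_candidate(prompt, r))
```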

Hence there is a tradeoff: parallel sampling provides increased diversity (meaning an increased chance of finding a valid solution) but requires selecting a candidate, which is non-trivial, while serial scaling produces one final answer at the cost of less diversity and parallelizability.

Is there a way to get the best of both worlds? Recent works [4, 5] have explored hybrid test-time scaling strategies that combine the benefits of parallel and sequential scaling. These methods are “hybrid” because they use parallel sampling, but each model/agent gets access to the full (or summarized) outputs from the other agents during inference. However, these methods are hard to apply to agentic tasks like software engineering, where agents take a potentially unbounded number of actions and output hundreds of thousands of tokens.

To address this, we develop TEX, a hybrid test-time scaling approach for real-world test-case generation and software engineering agents that uses cross-candidate execution feedback as the aggregation method. We hypothesize that test-case generation and software engineering agents will benefit from the verifier-generator loop that is central to software engineering, namely writing tests and code. In the TEX framework, multiple agents work in parallel rounds to solve a specific issue. Each agent generates both a test script to reproduce the problem and a code patch to resolve it. To enable hybrid scaling, we implement cross-candidate execution feedback: we execute every agent’s code patch against the test scripts generated by all other candidates in the ensemble. The results of this ‘peer validation’ are fed back to the agents, allowing them to refine their solutions in the subsequent round. This execution-based aggregation avoids the memory overhead of concatenating long agentic traces and prevents the information loss inherent in natural-language summarization.

An illustration of TEX. At each round, K agents are sampled in parallel and each agent outputs a test file and code patch. Next, we perform cross-validation by running all generated code patches against generated test files. We provide each agent with the execution result of running its code & test against all other tests & code, and continue sampling in the next round.
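To make the round structure concrete, here is a minimal Python sketch of the loop described above. `run_agent_round` and `run_test` are hypothetical placeholders (an agent invocation and a sandboxed patch-plus-test execution), so treat this as an illustration of the cross-candidate feedback idea rather than our actual scaffold code.

```python
# A sketch of TEX-style parallel rounds with cross-candidate execution feedback.
# `run_agent_round` and `run_test` are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Candidate:
    test_script: str
    code_patch: str

def run_agent_round(agent_id: int, feedback: str | None) -> Candidate:
    """Placeholder: one agent round that emits a test script and a code patch."""
    raise NotImplementedError

def run_test(code_patch: str, test_script: str) -> bool:
    """Placeholder: apply the patch to the repo and execute the test script."""
    raise NotImplementedError

def tex(num_agents: int = 4, num_rounds: int = 3) -> list[Candidate]:
    feedback: list[str | None] = [None] * num_agents
    candidates: list[Candidate] = []
    for _ in range(num_rounds):
        # Each agent emits a (test, patch) pair, conditioned on last round's feedback.
        candidates = [run_agent_round(i, feedback[i]) for i in range(num_agents)]
        # Cross-validation: run every candidate's patch against every candidate's test.
        results = [[run_test(c.code_patch, t.test_script) for t in candidates]
                   for c in candidates]
        # Each agent sees how its patch fared on the other tests and how the
        # other patches fared on its test, then refines its solution next round.
        feedback = [
            f"my patch vs all tests: {results[i]}; "
            f"all patches vs my test: {[row[i] for row in results]}"
            for i in range(num_agents)
        ]
    return candidates
```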

Experiments

Experimental Setup

We validate the efficacy of the TEX inference-time framework on the real-world software engineering and test-case generation benchmarks SWE-Bench and SWT-Bench. SWE-Bench is a popular software engineering benchmark that tests a model’s ability to fix bugs described in issues scraped from popular Python GitHub repositories. The benchmark uses real-world coding problems, requires navigating and understanding large codebases, and measures performance via oracle tests written by human developers. SWT-Bench extends SWE-Bench but focuses on test-case generation, measuring the success rate and coverage of model-generated tests given the real-world pre-PR and post-PR codebases for a given issue. Following the authors of SWT-Bench (unless otherwise stated), we report our results on the 433-instance subset of SWE-Bench Verified instances, as these instances have been further validated to reduce measurement noise. We refer to the test-case generation outputs of our method as TEX-T and the code generation outputs as TEX-C.

We implement TEX by extending the CodeMonkeys agentic scaffold [6]. This simple scaffold allows the model to output a test script, output a code patch, and execute generated tests/code against the pre-PR and model-generated code. Each agent can take a maximum of 8 steps per round. The scaffold does not allow the agent to run bash commands (e.g. grep), nor can the agent access the internet. Each agent is provided a fixed context of code files, which we source from Agentless [7]. We use Claude-4 Sonnet (claude-sonnet-4@20250514) as the LLM agent, sample with temperature=0.5, and use an ensemble of size 4.
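For reference, the setup above can be summarized in a small configuration sketch; the key names below are our own shorthand, not the CodeMonkeys scaffold’s actual option names.

```python
# Our shorthand for the experimental setup; key names are illustrative,
# not actual CodeMonkeys configuration options.
TEX_CONFIG = {
    "model": "claude-sonnet-4@20250514",   # Claude-4 Sonnet
    "temperature": 0.5,
    "ensemble_size": 4,                    # K parallel agents per round
    "max_steps_per_round": 8,              # per-agent step budget
    "context": "fixed code files localized via Agentless [7]",
    "tools": ["edit files", "run generated tests/code"],  # no bash, no internet
}
```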

Test-Case Generation Results: SWT-Bench

Results on SWT-Bench show that TEX-T significantly outperforms baselines that only generate tests or that write tests in isolation from the coding problem. Our results further corroborate the benefit of test-case generation via code generation [8]; in other words, test-case generation benefits from attempting to solve the coding problem. We build on this in TEX and have each agent output both tests and code patches for the given issue.

This brings us to our next finding: test-case generation, like software engineering, is better done in teams that learn from each other. This is evidenced by the >6% improvement in pass@1 from using TEX. Notably, TEX-T’s pass@1 sets a new state of the art on SWT-Bench when compared to the previous best (84.0%). While parallel test-time scaling can introduce high variance, since some candidates perform well and others poorly, TEX addresses this problem by having parallel agents learn from each other via cross-candidate execution aggregation. Thus, we see that pass@1 (i.e. random selection from the ensemble) is good enough to provide state-of-the-art results!

For our submission to the SWT-Bench leaderboard, we perform majority voting using the TEX candidates’ generated tests, which are run against the TEX candidates’ generated code patches. We found this led to slightly higher performance (+3%). This is an additional benefit of our method: the code generated by our agents can not only be used to solve difficult, real-world problems but also enables better ways of selecting candidates from the ensemble without needing elaborate LLM-as-a-judge setups.
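Below is one plausible reading of this voting scheme, sketched under the assumption that a candidate test earns a vote each time a peer’s code patch passes it; `run_test` is again a hypothetical sandboxed execution helper, and the actual submission pipeline may count votes differently.

```python
# Execution-based majority voting over the ensemble (one plausible reading).
from collections import Counter

def run_test(code_patch: str, test_script: str) -> bool:
    """Placeholder: apply the patch and execute the test in a sandbox."""
    raise NotImplementedError

def select_test_by_majority(candidates: list[tuple[str, str]]) -> str:
    """candidates: (test_script, code_patch) pairs from the TEX ensemble."""
    votes: Counter[int] = Counter()
    for i, (test_script, _) in enumerate(candidates):
        for _, code_patch in candidates:
            # A test earns a vote whenever an ensemble patch passes it, i.e. the
            # ensemble agrees the test reproduces the issue and the fix resolves it.
            if run_test(code_patch, test_script):
                votes[i] += 1
    if not votes:  # no agreement at all: fall back to the first candidate
        return candidates[0][0]
    best_index, _ = votes.most_common(1)[0]
    return candidates[best_index][0]
```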

Software Engineering Results: SWE-Bench

We additionally provide results for TEX-C (the coding outputs from the same TEX run described in the prior section) on SWE-Bench (graphed above). As on SWT-Bench, TEX improves performance (pass@1) on SWE-Bench across rounds and relative to the code generation baseline. This further supports the finding that software engineering agents are better when they learn from each other, as the only overhead TEX adds over the baseline is the cross-candidate aggregation. Although TEX-C improves coding performance, absolute scores on SWE-Bench were limited by our “simple scaffold,” which restricts agents to basic file editing and execution without access to bash commands, Python kernels, etc. The limitations of our scaffold are evident when comparing our SWE-Bench baseline (57.8%) with the mini-swe-agent baseline (64.9%), where the latter scaffold allows the model to use the full suite of bash commands to explore, edit code, and so on.

Additional Finding: Learning from others helps!

To determine whether TEX improves performance solely through increased inference-time compute, we compared it against a version where agents received feedback only from their own tests and code. Our findings (above) indicate that cross-candidate aggregation is beneficial to test-case generation agents. Specifically, we ran an additional experiment on a subset of 42 instances from SWT-Bench Verified. In this experiment, the agent was run across multiple rounds but did not receive execution feedback gathered across all candidates; instead, it only received the feedback from running its own generated code against its own generated test (gray line above). We find that cross-candidate execution continually improves performance over rounds, whereas performance begins to stagnate in the run without cross-candidate execution aggregation.

Conclusion

In this work, we explore how to build better test-case generation agents without training by developing a novel test-time scaling method. As we found, the best test-case generation agents operate much like real-world software engineers: the best performance comes from writing tests, writing code, and learning from others. Best of all, TEX agents will not just write tests for you; they can solve your coding problems too!

Works Cited

[1] Jimenez, Carlos E., et al. “SWE-bench: Can Language Models Resolve Real-world GitHub Issues?” The Twelfth International Conference on Learning Representations, 2024.

[2] Yang, John, et al. “SWE-agent: Agent-computer interfaces enable automated software engineering.” Advances in Neural Information Processing Systems 37 (2024): 50528-50652.

[3] Mündler, Niels, et al. “SWT-Bench: Testing and validating real-world bug-fixes with code agents.” Advances in Neural Information Processing Systems 37 (2024): 81857-81887.

[4] Venkatraman, Siddarth, et al. “Recursive self-aggregation unlocks deep thinking in large language models.” arXiv preprint arXiv:2509.26626 (2025).

[5] Madaan, Lovish, et al. “Rethinking thinking tokens: LLMs as improvement operators.” arXiv preprint arXiv:2510.01123 (2025).

[6] Ehrlich, Ryan, et al. “CodeMonkeys: Scaling test-time compute for software engineering.” arXiv preprint arXiv:2501.14723 (2025).

[7] Xia, Chunqiu Steven, et al. “Agentless: Demystifying LLM-based software engineering agents.” arXiv preprint arXiv:2407.01489 (2024).

[8] Ahmed, Toufique, et al. “Heterogeneous Prompting and Execution Feedback for SWE Issue Test Generation and Selection.” arXiv preprint arXiv:2508.06365 (2025).
