Salesforce AI Research will present 21 accepted papers at ICLR 2026, the Fourteenth International Conference on Learning Representations. The conference runs April 23–27 at the Riocentro Convention and Event Center in Rio de Janeiro, Brazil.
Our accepted authors will share their work through lightning talks, poster sessions, and workshops throughout the week.
This year’s research reflects the problems we think matter most for enterprise AI: agents that act reliably in complex environments, evaluation frameworks that expose real failure modes, stronger reasoning, and systems that stay efficient and trustworthy at scale.
Workshop Paper
In addition to our main conference acceptances, our work on agent identity failures was accepted to the Agents in the Wild: Safety, Security, and Beyond workshop at ICLR 2026.
ECHOING: Identity Failures When LLM Agents Talk to Each Other Paper
When LLM agents interact autonomously, they can abandon their assigned roles and mirror their conversational partner. We call this ‘echoing.’ Across 2,500+ conversations, echoing rates reached as high as 70% with major model providers, yet 93% of affected conversations still registered as successful by standard metrics. Reasoning models offered minimal improvement, and structured responses reduced but did not eliminate the problem.
Authors: Sarath Shekkizhar, Romain Cosentino, Adam Earle, Silvio Savarese
Main Conference Papers
Agent Architectures and GUI Agents
Our agent work this year spans test-time scaling, tool learning, computer-use benchmarks, and multi-agent coordination.
GTA1: GUI Test-time Scaling Agent Paper
GTA1 introduces test-time scaling for GUI agents, using multiple candidate action proposals and RL-based grounding to achieve state-of-the-art performance on autonomous task completion across platforms.
Authors: Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li
WALT: Web Agents that Learn Tools Paper
WALT reverse-engineers website functionality into reusable tools like search, filter, and create. This shifts from fragile step-by-step interactions to reliable tool invocation with higher success rates and fewer steps on VisualWebArena and WebArena.
Authors: Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu
SCUBA: Salesforce Computer Use Benchmark Paper
SCUBA benchmarks computer-use agents on 300 real Salesforce CRM tasks across admin, sales, and service workflows. In zero-shot settings, open-source agents succeed on fewer than 5% of tasks versus 39% for closed-source models; with demonstrations, success improves to 50% while time and cost drop by 13–16%.
Authors: Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu
CoAct-1: Computer-using Multi-agent System with Coding Actions Paper
CoAct-1 introduces a multi-agent system combining GUI control with programmatic execution. An Orchestrator delegates subtasks to GUI Operator or Programmer agents, achieving 60.76% success on OSWorld (a new state-of-the-art) while reducing average steps from 15 to 10.15.
Authors: Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong
Grounded Test-Time Adaptation for LLM Agents Paper
Parametric online adaptation aligns LLM agents to environment-specific formats, while non-parametric dynamics grounding learns causal state transitions through persona-driven exploration. Together they address syntactic and semantic mismatches, boosting WebArena multi-site success from 2% to 23%.
Authors: Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong
Reasoning and Evaluation
Advancing how LLMs reason and how we measure that reasoning is central to building enterprise AI that works. This cluster addresses test-time scaling, verification dynamics, evaluator training, and efficient reasoning under constraints.
Nudging the Boundaries of LLM Reasoning Paper
NuRL tackles a central limitation of RL training: problems the model never solves yield no learning signal. By using self-generated hints to unlock these previously ‘unsolvable’ problems, it raises performance ceilings where standard methods like GRPO plateau, with consistent improvements across six benchmarks and three models.
Authors: Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, Chien-Sheng Wu
Variation in Verification: Understanding Verification Dynamics in Large Language Models Paper
This paper analyzes how LLM verifiers assess solution candidates in test-time scaling, finding that weak generators can match stronger ones post-verification. Verification effectiveness depends on problem difficulty, generator strength, and verifier capability, revealing when verifier scaling reaches its limits.
Authors: Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty
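The generate-then-verify setup studied here follows a simple best-of-N pattern: sample several candidate solutions, score each with a verifier, and keep the highest-scoring one. A minimal sketch of that pattern, where `generate` and `verify` are toy stand-ins for the generator and verifier models, not the paper's implementation:

```python
import random

def generate(problem: str, seed: int) -> str:
    """Toy stand-in for sampling one candidate solution from a generator."""
    rng = random.Random(f"{problem}-{seed}")  # deterministic per (problem, seed)
    return f"candidate with quality {rng.random():.3f}"

def verify(problem: str, candidate: str) -> float:
    """Toy stand-in for a verifier scoring a candidate's correctness."""
    return float(candidate.rsplit(" ", 1)[1])

def best_of_n(problem: str, n: int = 8) -> str:
    # Test-time scaling: spend more compute by sampling n candidates,
    # then let the verifier pick the most promising one.
    candidates = [generate(problem, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: verify(problem, c))

print(best_of_n("integrate x^2 from 0 to 1"))
```

The paper's finding that weak generators can match strong ones post-verification corresponds to raising `n` and relying on `verify` to surface the rare good candidate.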
Foundational Automatic Evaluators (FARE): Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains Paper
FARE, trained on 2.5M samples, sets new standards for open-source evaluators. The 8B model rivals larger RL-trained models, while the 20B version surpasses 70B+ evaluators and achieves near-oracle reranking on MATH with 14.1% downstream RL improvements.
Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization Paper
Fine-tuned LLM judges struggle with future-proofing (evaluating outputs from newer generators) but handle backward compatibility well, especially when trained with DPO. Continual learning balances adaptation across response distributions, though all judges degrade on unseen questions as generators evolve.
Authors: Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty
Scalable Chain of Thoughts via Elastic Reasoning Paper
Elastic Reasoning separates chain-of-thought into thinking and solution phases with independent budgets, prioritizing solution completeness under constraints. The approach achieves robust performance with lower training costs and more concise reasoning across math and coding benchmarks.
Authors: Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
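The core mechanism is easy to picture: the thinking phase gets a hard cap, and hitting that cap forces a clean transition into the solution phase, so the final answer is never starved of tokens. A toy sketch with invented token streams and budget numbers (not the paper's implementation):

```python
def truncate_thinking(thought_tokens, budget, terminator="</think>"):
    """Hard-cap the thinking phase at `budget` tokens, forcing the
    terminator so decoding transitions cleanly into the solution phase."""
    out = list(thought_tokens[:budget])
    if terminator not in out:
        out.append(terminator)
    return out

def elastic_decode(thought_tokens, answer_tokens, think_budget, solve_budget):
    # Independent budgets: a long chain of thought can be cut short,
    # but the solution phase always keeps its full allocation.
    thinking = truncate_thinking(thought_tokens, think_budget)
    solution = list(answer_tokens[:solve_budget])
    return thinking + solution

thought = ["plan", "derive", "check", "simplify", "</think>"]
answer = ["the", "answer", "is", "4"]
print(elastic_decode(thought, answer, think_budget=2, solve_budget=4))
```

With a single shared budget, a verbose chain of thought would crowd out the answer; splitting the budgets makes the solution's completeness independent of how long the model thinks.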
Learning to Reason over Continuous Tokens with Reinforcement Learning (HyRea) Paper
HyRea dynamically switches between explicit and latent reasoning via entropy-guided cold-start and GRPO fine-tuning, reducing token usage to approximately 60% while maintaining competitive accuracy across mathematical reasoning benchmarks.
Authors: Yiran Zhao, Yuhui Xu, Doyen Sahoo, Caiming Xiong, Junnan Li
Improving LLM Alignment with References Paper
Reference-guided evaluation improves LLM-based evaluators and enables effective semi-self-improvement, achieving 73.1% on AlpacaEval and 58.7% on Arena-Hard with Llama-3-8B-Instruct, comparable to fine-tuned reward models.
Authors: Kejian Shi, Yixin Liu, PeiFeng Wang, Alexander Fabbri, Shafiq Rayhan Joty, Arman Cohan
Deep Research Reliability
As AI systems take on complex research and information synthesis tasks, rigorous evaluation of their outputs becomes critical. These papers establish new frameworks for auditing deep research quality and measuring citation-grounded reliability.
DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence Paper
DeepTRACE audits generative search engines and deep research agents, finding that they produce overconfident, one-sided responses with 20–60% of statements unsupported by their own cited sources.
Authors: Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild Paper
LiveResearchBench introduces 100 expert-curated tasks requiring real-time web search, paired with DeepEval for assessing citation-grounded reports. Evaluation of 17 systems reveals specific strengths, failure modes, and components needed for reliable deep research.
Authors: Jiayu Wang, Yifei Ming, Riya Dulepet, Qinglin Chen, Austin Xu, Zixuan Ke, Frederic Sala, Aws Albarghouthi, Caiming Xiong, Shafiq Joty
Knowledge Graphs and Retrieval
Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency Paper
SynthKG introduces an ontology-free knowledge graph synthesis workflow, distilled into Distill-SynthKG for efficient single-step generation. The distilled model surpasses models 8x its size in KG quality and, combined with a novel graph-based RAG framework, outperforms baselines in retrieval and question answering.
Authors: Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, Chien-Sheng Wu
LLM Behavior and Robustness
Understanding how LLMs behave under varied conditions, from multi-turn dialogue to internal circuit mechanisms, shapes how we build more reliable systems.
LLMs Get Lost in Multi-Turn Conversation Paper
LLMs show a 39% performance drop in multi-turn versus single-turn conversations across six tasks. Analysis of 200,000+ simulated conversations reveals that models make premature assumptions and fail to recover when they take wrong turns.
Authors: Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, Jennifer Neville
Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition Paper
Circuit analysis of off-by-one addition reveals a function induction mechanism where parallel attention heads emit distinct pieces of the +1 function. This reusable structure enables task-level generalization across shifted QA, base-8 addition, and other tasks.
Authors: Qinyuan Ye, Robin Jia, Xiang Ren
Efficiency and Scalability
Making models smaller, faster, and cheaper while preserving performance is critical for enterprise deployment at scale.
Entropy-Based Block Pruning for Efficient Large Language Models Paper
Entropy-based pruning outperforms cosine-similarity methods by exploiting a consistent entropy pattern across Transformer blocks (decreasing in early layers, then increasing) as a more effective measure of information richness, reducing model size while preserving accuracy.
Authors: Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke
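The intuition can be sketched with a simplified "keep the highest-entropy blocks" heuristic. This is a toy proxy for the approach, not the paper's exact criterion, and the activation numbers are invented:

```python
import math

def entropy(activations):
    """Shannon entropy of a block's normalized activations -- a toy proxy
    for the information richness of a Transformer block."""
    total = sum(activations)
    probs = [a / total for a in activations]
    return -sum(p * math.log(p) for p in probs if p > 0)

def prune_blocks(block_activations, keep):
    """Rank blocks by activation entropy and keep the `keep` richest ones;
    low-entropy (nearly one-hot) blocks are assumed to carry less information."""
    ranked = sorted(range(len(block_activations)),
                    key=lambda i: entropy(block_activations[i]),
                    reverse=True)
    return sorted(ranked[:keep])  # preserve original block order

# Toy activations for 4 blocks: block 2 is sharply peaked (low entropy).
blocks = [
    [1.0, 1.0, 1.0, 1.0],   # uniform: high entropy
    [2.0, 1.0, 1.0, 1.0],
    [9.0, 0.1, 0.1, 0.1],   # peaked: low entropy -> pruned first
    [1.5, 1.0, 1.0, 1.0],
]
print(prune_blocks(blocks, keep=3))
```

A cosine-similarity criterion would instead compare each block's input and output vectors; the entropy signal is what the paper argues tracks information richness more faithfully.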
OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs Paper
OFTSR achieves one-step image super-resolution with a tunable fidelity-realism trade-off by aligning student predictions to teacher model sampling trajectories, reaching state-of-the-art performance on FFHQ, DIV2K, and ImageNet without multi-step overhead.
Authors: Yuanzhi Zhu, Ruiqing Wang, Shilin Lu, Junnan Li, Hanshu Yan, Kai Zhang
Scaling Reinforcement Learning
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels Paper
Webscale-RL introduces a scalable pipeline that converts pre-training documents into 1.2M verifiable QA pairs across 9+ domains. RL training on this dataset matches continual pre-training performance with 100x fewer tokens, suggesting RL can reach pre-training-level performance at a fraction of the data cost.
Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
Software Engineering
SweRank: Software Issue Localization with Code Ranking Paper
SweRank introduces an efficient retrieve-and-rerank framework for software issue localization, trained on the SweLoc dataset. It achieves state-of-the-art performance on SWE-Bench-Lite and LocBench while outperforming costly agent-based systems that rely on closed-source LLMs.
Authors: Revanth Gangi Reddy, Tarun Suresh, JaeHyeok Doo, Ye Liu, Xuan Phi Nguyen, Yingbo Zhou, Semih Yavuz, Caiming Xiong, Heng Ji, Shafiq Joty
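The retrieve-and-rerank pattern behind this kind of localization can be sketched generically: a cheap retriever narrows the repository to a shortlist, then a more expensive reranker orders it. The word-overlap and term-count scoring below are toy stand-ins for the learned retriever and reranker, and the issue/repo contents are invented:

```python
def retrieve(query, corpus, k):
    """Cheap first stage: rank files by word overlap with the issue text
    (a toy stand-in for a dense or sparse code retriever)."""
    q = set(query.lower().split())
    def overlap(fid):
        return len(q & set(corpus[fid].lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def rerank(query, candidates, corpus):
    """More expensive second stage: a toy stand-in for a cross-encoder
    reranker that reads the query together with each candidate file."""
    words = query.lower().split()
    def fine_score(fid):
        text = corpus[fid].lower()
        return sum(text.count(w) for w in words)
    return sorted(candidates, key=fine_score, reverse=True)

issue = "crash when parsing empty config file"
repo = {
    "config.py": "parse config file handle empty config edge case parsing",
    "cli.py": "command line entry point",
    "utils.py": "misc helpers for file io",
}
print(rerank(issue, retrieve(issue, repo, k=2), repo)[0])
```

The efficiency claim comes from this division of labor: the reranker only ever sees the retriever's shortlist, so the expensive model runs on `k` files instead of the whole repository.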
Visit Us at ICLR 2026
Our researchers will present throughout the conference. Stop by booth #203 or check our ICLR schedule for specific session times. We’ll also share updates throughout the week on Bluesky and X.
Resources:
- Salesforce AI Research
- Follow us on X: @SFResearch
- Follow us on Bluesky: @SFResearch