Introduction
Reinforcement Learning from Human or AI Feedback (RLHF, RLAIF) has become the standard recipe for aligning large language models (LLMs). But as we push into the agentic era — where models call tools, browse the web, and write and execute code across multi-turn trajectories — the infrastructure demands have changed dramatically. Rollouts are no longer simple single-turn completions. They involve thousands of concurrent tool calls, variable-length trajectories, and models that span hundreds of billions of parameters, often with Mixture-of-Experts (MoE) architectures.
Most existing open-source RL training frameworks were not designed for this new regime of LLM post-training. At Salesforce AI Research, we have built SFR-RL, a production-grade RL training stack purpose-built for agentic RL at scale. Our goals are straightforward:
- Near-100% GPU utilization across the entire cluster
- Train large MoE models at long context lengths with fewer GPUs than previously possible
- Scale tool calling to thousands of concurrent executions with minimal cost
- Stay resilient — auto-recover from inference engine crashes without losing training progress
In this post, we describe the design decisions behind SFR-RL and share early benchmark results showing significant throughput improvements over existing approaches.
The Problem with Current Open-Source Approaches
Synchronous RL (e.g., VERL)
In synchronous RL, all prompts in a batch must complete their rollouts before training can begin. The downside is significant GPU idle time: shorter prompts finish quickly, while longer ones continue generating, leaving most GPUs waiting on a few stragglers. This inefficiency becomes especially severe in agentic workloads, where trajectory lengths can range from a few hundred tokens to tens of thousands.
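A quick back-of-the-envelope calculation shows how severe the straggler effect gets. The sketch below is illustrative only (the trajectory lengths are made up, and it assumes one trajectory per GPU with generation time proportional to length), but it captures why a single long rollout dominates the batch:

```python
# Illustrative only: in synchronous RL, every GPU waits for the longest
# trajectory in the batch before training can start.

def sync_idle_fraction(trajectory_lengths):
    """Fraction of total GPU-time spent idle when each trajectory occupies
    one GPU and all must wait for the slowest to finish."""
    slowest = max(trajectory_lengths)
    busy = sum(trajectory_lengths)
    total = slowest * len(trajectory_lengths)
    return 1 - busy / total

# A mixed agentic batch: most rollouts are short, a few are very long.
lengths = [500] * 28 + [2_000] * 3 + [30_000]
print(f"GPU idle fraction: {sync_idle_fraction(lengths):.0%}")  # → 95%
```

With one 30k-token trajectory in a batch of mostly 500-token rollouts, roughly 95% of the cluster's rollout-phase GPU-time is spent waiting.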
Asynchronous RL (e.g., Areal, VERL-async)
Asynchronous approaches attempt to keep GPUs busy by overlapping rollout and training stages. However, they may introduce a different set of problems:
- Rollout-training mismatch: GPUs are partitioned between rollout and training. When one phase finishes faster than the other, GPUs on the slower side stay idle. With a fixed partition, you either sacrifice rollout throughput or training throughput — you cannot maximize both simultaneously.
- Off-policy staleness: Because training happens concurrently with rollout, the model being used for generation can fall significantly behind the model being trained. This off-policy gap degrades learning signal quality.
- Data distribution instability: Batches are assembled in first-in-first-out order without regard to data composition. Easy, short prompts dominate early batches; harder, longer prompts cluster in later batches. This batch-to-batch data distribution fluctuation makes it difficult to train on diverse task types and difficulty levels jointly.
- Reduced GPU capacity for training: Reserving separate GPUs for rollout means fewer GPUs are available for training. For very large models that require significant parallelism, this is a hard constraint that limits what you can train.
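The FIFO batching problem in particular is easy to see in simulation. This is a hypothetical sketch, not any framework's actual scheduler: if completed rollouts are grouped into training batches in finish order, and shorter trajectories finish first, early batches end up dominated by short/easy prompts while the hard ones cluster at the end:

```python
# Hypothetical sketch of FIFO batch assembly under concurrent rollout.
# Assumption: finish time is proportional to trajectory length, so rollouts
# complete in length order.

def fifo_batches(lengths, batch_size):
    """Group rollouts into batches by completion order (shortest first)."""
    finish_order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    groups = [finish_order[i:i + batch_size]
              for i in range(0, len(finish_order), batch_size)]
    return [[lengths[i] for i in g] for g in groups]

lengths = [300, 25_000, 400, 18_000, 500, 350, 22_000, 450]
for batch in fifo_batches(lengths, batch_size=4):
    print(batch)
# First batch: all short prompts; second batch: almost all long ones.
```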

Lack of MoE Support
Most open-source frameworks lack native support for large MoE models with Expert Parallelism (EP). MoE models like the gpt-oss series (20B and 120B total parameters) require specialized sharding and communication strategies that standard data-parallel or tensor-parallel setups cannot handle efficiently.
Our Approach: Pipelined Synchronous RL
SFR-RL takes a different path. Rather than choosing between sync and async, we designed a pipelined synchronous approach that captures the best of both worlds: the on-policy guarantees of synchronous training with the high GPU utilization of asynchronous systems.
Two Phases, Full Cluster

The overall learning process in SFR-RL is split into two alternating phases — rollout and training — each utilizing the entire GPU cluster. This design allows us to train 5x larger models with the same GPU resources compared to asynchronous RL, where typically only ~20% of GPUs are allocated to training.
Rollout phase: The training model is offloaded, and the policy is loaded onto a resilient inference engine across all GPUs. Prompts are rolled out concurrently, and as each finishes, new work is immediately dispatched to keep utilization high.
Training phase: The inference engine releases GPU memory, and the training model is reloaded for a standard on-policy weight update.
Every GPU participates in every phase — no partitioning, no idle reservations.
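The alternation described above can be sketched as a simple control loop. All class and method names here are hypothetical stand-ins (SFR-RL's actual APIs are not public), and the engine/trainer stubs exist only to make the sketch runnable:

```python
# Minimal sketch of the two-phase loop, assuming hypothetical engine and
# trainer interfaces. The real offload/reload moves model state between
# GPU and host memory; the stubs only track a policy version.

class _StubEngine:
    """Toy inference engine standing in for the real one."""
    def load_weights(self, version):
        self.version = version
    def rollout(self, prompts):
        # Tag each rollout with the policy version that generated it.
        return [(p, self.version) for p in prompts]
    def release_memory(self):
        pass  # real engine frees GPU memory for the training phase

class _StubTrainer:
    """Toy trainer; update() bumps the policy version."""
    def __init__(self):
        self.version = 0
    def offload(self):
        pass  # real trainer moves optimizer/model state off GPU
    def reload(self):
        pass
    def latest_weights(self):
        return self.version
    def update(self, batch):
        # Every rollout came from the current policy: on-policy guarantee.
        assert all(v == self.version for _, v in batch)
        self.version += 1
        return self.version

class PipelinedSyncTrainer:
    """Alternate rollout and training phases over the full cluster."""
    def __init__(self, engine, trainer):
        self.engine, self.trainer = engine, trainer

    def step(self, prompts):
        # Rollout phase: training state offloaded, inference on all GPUs.
        self.trainer.offload()
        self.engine.load_weights(self.trainer.latest_weights())
        batch = self.engine.rollout(prompts)
        # Training phase: inference frees GPU memory, trainer reloads.
        self.engine.release_memory()
        self.trainer.reload()
        return self.trainer.update(batch)

loop = PipelinedSyncTrainer(_StubEngine(), _StubTrainer())
```

Because generation always uses the weights produced by the immediately preceding update, the off-policy staleness of asynchronous schemes never arises.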
Pipelined Batch Management
To further reduce GPU idle time, rollout work is pipelined across batch boundaries so that GPUs are never starved for work. Each training batch is still delivered with proper data composition and mixture guarantees, which lets us mix tasks of diverse domains, types, and complexities. For instance, one can jointly optimize tool-free math reasoning and long-horizon deep-research tasks.
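One way to picture composition-aware batching, in contrast to FIFO assembly, is a per-task pool that only emits a batch once the configured mixture can be satisfied. This is a simplified sketch of the idea, not SFR-RL's actual batch manager:

```python
# Hypothetical sketch: completed rollouts are pooled per task type, and a
# batch is emitted only when the configured mixture is satisfiable, rather
# than in raw finish (FIFO) order.
from collections import defaultdict, deque

class MixtureBatcher:
    def __init__(self, mixture):
        # e.g. {"math": 2, "deep_research": 2} rollouts per batch
        self.mixture = mixture
        self.pools = defaultdict(deque)

    def add(self, task_type, rollout):
        self.pools[task_type].append(rollout)

    def try_emit(self):
        """Return a batch with the exact mixture, or None if not ready."""
        if all(len(self.pools[t]) >= n for t, n in self.mixture.items()):
            return [self.pools[t].popleft()
                    for t, n in self.mixture.items() for _ in range(n)]
        return None
```

Rollout workers keep adding completed trajectories (including ones destined for future batches), so GPUs stay busy even while the current batch waits on its last few long trajectories.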
Resilient and Fault-Tolerant Inference
Agentic rollouts are long-running and often unpredictable due to environment dynamics — a single engine crash can stall an entire batch. SFR-RL’s inference gateway automatically detects failures, recreates engine actors, restores weights, and re-queues in-flight work, all without human intervention.
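The recovery path can be summarized as a retry loop around engine calls. The classes below are a hedged sketch under assumed interfaces (the real gateway manages distributed engine actors, not in-process objects); the demo engine exists only to simulate a crash:

```python
# Hypothetical sketch of the gateway's recovery loop: on engine failure,
# recreate the engine, restore weights, and retry the in-flight request.

class EngineCrash(Exception):
    pass

class ResilientGateway:
    def __init__(self, make_engine, weights):
        self.make_engine, self.weights = make_engine, weights
        self.engine = self._spawn()

    def _spawn(self):
        """Recreate the engine and restore the current policy weights."""
        engine = self.make_engine()
        engine.load_weights(self.weights)
        return engine

    def generate(self, prompt, max_retries=3):
        for _ in range(max_retries):
            try:
                return self.engine.generate(prompt)
            except EngineCrash:
                self.engine = self._spawn()  # recreate, restore, re-queue
        raise RuntimeError("engine kept crashing")

class FlakyEngine:
    """Demo engine whose first instance always crashes."""
    instances = 0
    def __init__(self):
        FlakyEngine.instances += 1
        self.broken = FlakyEngine.instances == 1
    def load_weights(self, weights):
        self.weights = weights
    def generate(self, prompt):
        if self.broken:
            raise EngineCrash("simulated failure")
        return f"{prompt} (weights v{self.weights})"

gw = ResilientGateway(FlakyEngine, weights=3)
print(gw.generate("hello"))  # recovers transparently from the first crash
```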
High-concurrency Agentic Tool Calling
We designed a specialized local-first tool system with a built-in cache that avoids redundant executions. Certain tools are implemented locally so rollouts need not rely on paid API services. The result is high-throughput, low-latency tool execution at high density, supporting up to 4,000 concurrent stateful code-execution environments per machine.
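The caching idea is simple: two tool calls with the same tool name and arguments reuse a prior result instead of re-executing. A minimal sketch, with a hypothetical interface (the real system must also handle statefulness and cache invalidation, which this ignores):

```python
# Minimal sketch of a tool-call cache keyed by tool name + arguments.
import json

class ToolCache:
    def __init__(self, execute):
        self.execute = execute  # the actual (possibly expensive) tool runner
        self.cache = {}
        self.hits = 0

    def call(self, tool, **kwargs):
        # Canonical key: sorted-key JSON makes equal kwargs hash equally.
        key = (tool, json.dumps(kwargs, sort_keys=True))
        if key in self.cache:
            self.hits += 1
        else:
            self.cache[key] = self.execute(tool, **kwargs)
        return self.cache[key]
```

During RL rollouts many trajectories issue identical calls (e.g. the same search query or code snippet), so even this simple exact-match cache removes a large share of redundant executions.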
Specialization for MoE Models
SFR-RL was built from the ground up with aggressive optimizations for large Mixture-of-Experts models, with our primary targets being the gpt-oss series (20B and 120B total parameters).
Key capabilities include:
- Least-Loaded Expert Parallelism (EP): Up to 5x speed improvement and 5x memory reduction for large MoE layers, making previously impractical model sizes trainable.
- Mixture-of-Parallelisms: Our proprietary ultra-memory-efficient and high-throughput strategy to train large models with the fewest GPUs. This allows us to train the gpt-oss-120b model at 1 million-token context length, at full parameters, with just 16 H200 GPUs — beating the state of the art in both throughput and memory footprint.
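The details of Least-Loaded EP are not described in this post, so the sketch below is only generic intuition for what "least-loaded" placement means, not SFR-RL's actual algorithm: experts are greedily assigned, heaviest first, to whichever rank currently carries the least load, keeping per-rank compute balanced:

```python
# Generic least-loaded (greedy bin-packing) placement sketch for intuition.
# expert_loads: estimated load per expert (e.g. routed-token counts).
import heapq

def least_loaded_placement(expert_loads, num_ranks):
    """Assign each expert to the currently least-loaded rank, heaviest
    experts first. Returns {expert_id: rank}."""
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (load, rank)
    placement = {}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, rank = heapq.heappop(heap)
        placement[expert] = rank
        heapq.heappush(heap, (total + load, rank))
    return placement

# Four experts with skewed loads balance evenly across two ranks.
print(least_loaded_placement({0: 4, 1: 3, 2: 2, 3: 1}, num_ranks=2))
```

Balanced expert placement matters because EP throughput is gated by the busiest rank; a skewed assignment leaves the other ranks waiting at every all-to-all.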
Benchmarks
Training Throughput

Memory Efficiency
SFR-RL’s infrastructure employs extremely memory-efficient methods to achieve the highest “intelligence” per GPU capacity. In other words, SFR-RL requires the fewest possible GPUs to train a large model, at full parameters, with a certain context length. As shown in the figure below, SFR-RL is 10x more memory-efficient than Megatron with complex parallelism settings, and 250x more efficient than VERL with FSDP and context parallelism.

Conclusion
The agentic era demands a new kind of RL training infrastructure — one that can handle variable-length multi-turn trajectories, massive MoE models, thousands of concurrent tool calls, and the inevitable failures that come with long-running distributed workloads.
SFR-RL addresses these challenges through pipelined synchronous training that uses the full cluster for both rollout and training, a resilient inference layer with automatic recovery, scalable local-first tool execution, and native support for Expert Parallelism. The result is a system that achieves near-complete GPU utilization while maintaining on-policy training guarantees — enabling us to efficiently train frontier agentic models at scale.
We are excited to continue pushing the boundaries of agentic RL and look forward to sharing more results as our models and infrastructure evolve.
Salesforce AI Research