Introduction
Reinforcement Learning from Human or AI Feedback (RLHF, RLAIF) has become the standard recipe for aligning large language models (LLMs). But as we push into the agentic era — where models call tools, browse the web, and write and execute code across multi-turn trajectories — the infrastructure demands have changed dramatically. Rollouts are no longer simple single-turn completions. They involve thousands of concurrent tool calls, variable-length trajectories, and models that span hundreds of billions of parameters, often with Mixture-of-Experts (MoE) architectures.
Most existing open-source RL training frameworks were not designed for this new regime of LLM post-training. At Salesforce AI Research, we have built SFR-RL, a production-grade RL training stack purpose-built for agentic RL at scale. Our goals are straightforward:
- Near-100% GPU utilization across the entire cluster
- Train large MoE models at long context lengths with fewer GPUs than previously possible
- Scale tool calling to thousands of concurrent executions with minimal cost
- Stay resilient — auto-recover from inference engine crashes without losing training progress
In this post, we describe the design decisions behind SFR-RL and share early benchmark results showing significant throughput improvements over existing approaches.
The Problem with Current Open-Source Approaches
Synchronous RL (e.g., VERL)
In synchronous RL, all prompts in a batch must complete their rollouts before training can begin. The downside is significant GPU idle time: shorter prompts finish quickly, while longer ones continue generating, leaving most GPUs waiting on a few stragglers. This inefficiency becomes especially severe in agentic workloads, where trajectory lengths can range from a few hundred tokens to tens of thousands.
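A quick back-of-the-envelope calculation shows how severe the straggler effect gets. The sketch below is illustrative only (the trajectory lengths are made up, and it assumes one trajectory per GPU with generation time proportional to length), but it captures why a single long rollout dominates the batch:

```python
# Illustrative only: in synchronous RL, every GPU waits for the longest
# trajectory in the batch before training can start.

def sync_idle_fraction(trajectory_lengths):
    """Fraction of total GPU-time spent idle when each trajectory occupies
    one GPU and all must wait for the slowest to finish."""
    slowest = max(trajectory_lengths)
    busy = sum(trajectory_lengths)
    total = slowest * len(trajectory_lengths)
    return 1 - busy / total

# A mixed agentic batch: most rollouts are short, a few are very long.
lengths = [500] * 28 + [2_000] * 3 + [30_000]
print(f"GPU idle fraction: {sync_idle_fraction(lengths):.0%}")  # → 95%
```

With one 30k-token trajectory in a batch of mostly 500-token rollouts, roughly 95% of the cluster's rollout-phase GPU-time is spent waiting.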
Asynchronous RL (e.g., Areal, VERL-async)
Asynchronous approaches attempt to keep GPUs busy by overlapping rollout and training stages. However, they may introduce a different set of problems:
- Rollout-training mismatch: GPUs are partitioned between rollout and training. When one phase finishes faster than the other, GPUs on the slower side stay idle. With a fixed partition, you either sacrifice rollout throughput or training throughput — you cannot maximize both simultaneously.
- Off-policy staleness: Because training happens concurrently with rollout, the model being used for generation can fall significantly behind the model being trained. This off-policy gap degrades learning signal quality.
- Data distribution instability: Batches are assembled in first-in-first-out order without regard to data composition. Easy, short prompts dominate early batches; harder, longer prompts cluster in later batches. This batch-to-batch data distribution fluctuation makes it difficult to train on diverse task types and difficulty levels jointly.
- Reduced GPU capacity for training: Reserving separate GPUs for rollout means fewer GPUs are available for training. For very large models that require significant parallelism, this is a hard constraint that limits what you can train.
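The FIFO batching problem in particular is easy to see in simulation. This is a hypothetical sketch, not any framework's actual scheduler: if completed rollouts are grouped into training batches in finish order, and shorter trajectories finish first, early batches end up dominated by short/easy prompts while the hard ones cluster at the end:

```python
# Hypothetical sketch of FIFO batch assembly under concurrent rollout.
# Assumption: finish time is proportional to trajectory length, so rollouts
# complete in length order.

def fifo_batches(lengths, batch_size):
    """Group rollouts into batches by completion order (shortest first)."""
    finish_order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    groups = [finish_order[i:i + batch_size]
              for i in range(0, len(finish_order), batch_size)]
    return [[lengths[i] for i in g] for g in groups]

lengths = [300, 25_000, 400, 18_000, 500, 350, 22_000, 450]
for batch in fifo_batches(lengths, batch_size=4):
    print(batch)
# First batch: all short prompts; second batch: almost all long ones.
```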

Lack of MoE Support
Most open-source frameworks lack native support for large MoE models with Expert Parallelism (EP). MoE models like the gpt-oss series (20B and 120B total parameters) require specialized sharding and communication strategies that standard data-parallel or tensor-parallel setups cannot handle efficiently.
Our Approach: Pipelined Synchronous RL
SFR-RL takes a different path. Rather than choosing between sync and async, we designed a pipelined synchronous approach that captures the best of both worlds: the on-policy guarantees of synchronous training with the high GPU utilization of asynchronous systems.
Two Phases, Full Cluster

The overall learning process in SFR-RL is split into two alternating phases — rollout and training — each utilizing the entire GPU cluster. This design allows us to train 5x larger models with the same GPU resources compared to asynchronous RL, where typically only ~20% of GPUs are allocated to training.
Rollout phase: The training model is offloaded, and the policy is loaded onto a resilient inference engine across all GPUs. Prompts are rolled out concurrently, and as each finishes, new work is immediately dispatched to keep utilization high.
Training phase: The inference engine releases GPU memory, and the training model is reloaded for a standard on-policy weight update.
Every GPU participates in every phase — no partitioning, no idle reservations.
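The alternation described above can be sketched as a simple control loop. All class and method names here are hypothetical stand-ins (SFR-RL's actual APIs are not public), and the engine/trainer stubs exist only to make the sketch runnable:

```python
# Minimal sketch of the two-phase loop, assuming hypothetical engine and
# trainer interfaces. The real offload/reload moves model state between
# GPU and host memory; the stubs only track a policy version.

class _StubEngine:
    """Toy inference engine standing in for the real one."""
    def load_weights(self, version):
        self.version = version
    def rollout(self, prompts):
        # Tag each rollout with the policy version that generated it.
        return [(p, self.version) for p in prompts]
    def release_memory(self):
        pass  # real engine frees GPU memory for the training phase

class _StubTrainer:
    """Toy trainer; update() bumps the policy version."""
    def __init__(self):
        self.version = 0
    def offload(self):
        pass  # real trainer moves optimizer/model state off GPU
    def reload(self):
        pass
    def latest_weights(self):
        return self.version
    def update(self, batch):
        # Every rollout came from the current policy: on-policy guarantee.
        assert all(v == self.version for _, v in batch)
        self.version += 1
        return self.version

class PipelinedSyncTrainer:
    """Alternate rollout and training phases over the full cluster."""
    def __init__(self, engine, trainer):
        self.engine, self.trainer = engine, trainer

    def step(self, prompts):
        # Rollout phase: training state offloaded, inference on all GPUs.
        self.trainer.offload()
        self.engine.load_weights(self.trainer.latest_weights())
        batch = self.engine.rollout(prompts)
        # Training phase: inference frees GPU memory, trainer reloads.
        self.engine.release_memory()
        self.trainer.reload()
        return self.trainer.update(batch)

loop = PipelinedSyncTrainer(_StubEngine(), _StubTrainer())
```

Because generation always uses the weights produced by the immediately preceding update, the off-policy staleness of asynchronous schemes never arises.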
Pipelined Batch Management
To further reduce GPU idle time, rollout work is pipelined across batch boundaries so that GPUs are never starved for work. Each training batch is still delivered with proper data composition and mixture guarantees, which lets us mix tasks of diverse domains, types, and complexities. For instance, one can jointly optimize tool-free math reasoning and long-horizon deep-research tasks.
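One way to picture composition-aware batching, in contrast to FIFO assembly, is a per-task pool that only emits a batch once the configured mixture can be satisfied. This is a simplified sketch of the idea, not SFR-RL's actual batch manager:

```python
# Hypothetical sketch: completed rollouts are pooled per task type, and a
# batch is emitted only when the configured mixture is satisfiable, rather
# than in raw finish (FIFO) order.
from collections import defaultdict, deque

class MixtureBatcher:
    def __init__(self, mixture):
        # e.g. {"math": 2, "deep_research": 2} rollouts per batch
        self.mixture = mixture
        self.pools = defaultdict(deque)

    def add(self, task_type, rollout):
        self.pools[task_type].append(rollout)

    def try_emit(self):
        """Return a batch with the exact mixture, or None if not ready."""
        if all(len(self.pools[t]) >= n for t, n in self.mixture.items()):
            return [self.pools[t].popleft()
                    for t, n in self.mixture.items() for _ in range(n)]
        return None
```

Rollout workers keep adding completed trajectories (including ones destined for future batches), so GPUs stay busy even while the current batch waits on its last few long trajectories.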
Resilient and Fault-Tolerant Inference
Agentic rollouts are long-running and often unpredictable due to environment dynamics — a single engine crash can stall an entire batch. SFR-RL’s inference gateway automatically detects failures, recreates engine actors, restores weights, and re-queues in-flight work, all without human intervention.
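The recovery path can be summarized as a retry loop around engine calls. The classes below are a hedged sketch under assumed interfaces (the real gateway manages distributed engine actors, not in-process objects); the demo engine exists only to simulate a crash:

```python
# Hypothetical sketch of the gateway's recovery loop: on engine failure,
# recreate the engine, restore weights, and retry the in-flight request.

class EngineCrash(Exception):
    pass

class ResilientGateway:
    def __init__(self, make_engine, weights):
        self.make_engine, self.weights = make_engine, weights
        self.engine = self._spawn()

    def _spawn(self):
        """Recreate the engine and restore the current policy weights."""
        engine = self.make_engine()
        engine.load_weights(self.weights)
        return engine

    def generate(self, prompt, max_retries=3):
        for _ in range(max_retries):
            try:
                return self.engine.generate(prompt)
            except EngineCrash:
                self.engine = self._spawn()  # recreate, restore, re-queue
        raise RuntimeError("engine kept crashing")

class FlakyEngine:
    """Demo engine whose first instance always crashes."""
    instances = 0
    def __init__(self):
        FlakyEngine.instances += 1
        self.broken = FlakyEngine.instances == 1
    def load_weights(self, weights):
        self.weights = weights
    def generate(self, prompt):
        if self.broken:
            raise EngineCrash("simulated failure")
        return f"{prompt} (weights v{self.weights})"

gw = ResilientGateway(FlakyEngine, weights=3)
print(gw.generate("hello"))  # recovers transparently from the first crash
```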
High-concurrency Agentic Tool Calling
We designed a specialized local-first tool system with a built-in cache that avoids redundant executions. Certain tools are implemented locally so rollouts need not rely on paid API services. The result is high-throughput, low-latency tool execution at high density, supporting up to 4,000 concurrent stateful code-execution environments per machine.
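The caching idea is simple: two tool calls with the same tool name and arguments reuse a prior result instead of re-executing. A minimal sketch, with a hypothetical interface (the real system must also handle statefulness and cache invalidation, which this ignores):

```python
# Minimal sketch of a tool-call cache keyed by tool name + arguments.
import json

class ToolCache:
    def __init__(self, execute):
        self.execute = execute  # the actual (possibly expensive) tool runner
        self.cache = {}
        self.hits = 0

    def call(self, tool, **kwargs):
        # Canonical key: sorted-key JSON makes equal kwargs hash equally.
        key = (tool, json.dumps(kwargs, sort_keys=True))
        if key in self.cache:
            self.hits += 1
        else:
            self.cache[key] = self.execute(tool, **kwargs)
        return self.cache[key]
```

During RL rollouts many trajectories issue identical calls (e.g. the same search query or code snippet), so even this simple exact-match cache removes a large share of redundant executions.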
Specialization for MoE Models
SFR-RL was built from the ground up with aggressive optimizations for large Mixture-of-Experts models, with our primary targets being the gpt-oss series (20B and 120B total parameters).
Key capabilities include:
- Least-Loaded Expert Parallelism (EP): Up to 5x speed improvement and 5x memory reduction for large MoE layers, making previously impractical model sizes trainable.
- Mixture-of-Parallelisms: Our proprietary ultra-memory-efficient and high-throughput strategy to train large models with the fewest GPUs. This allows us to train the gpt-oss-120b model at 1 million-token context length, at full parameters, with just 16 H200 GPUs — beating the state of the art in both throughput and memory footprint.
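The details of Least-Loaded EP are not described in this post, so the sketch below is only generic intuition for what "least-loaded" placement means, not SFR-RL's actual algorithm: experts are greedily assigned, heaviest first, to whichever rank currently carries the least load, keeping per-rank compute balanced:

```python
# Generic least-loaded (greedy bin-packing) placement sketch for intuition.
# expert_loads: estimated load per expert (e.g. routed-token counts).
import heapq

def least_loaded_placement(expert_loads, num_ranks):
    """Assign each expert to the currently least-loaded rank, heaviest
    experts first. Returns {expert_id: rank}."""
    heap = [(0.0, rank) for rank in range(num_ranks)]  # (load, rank)
    placement = {}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, rank = heapq.heappop(heap)
        placement[expert] = rank
        heapq.heappush(heap, (total + load, rank))
    return placement

# Four experts with skewed loads balance evenly across two ranks.
print(least_loaded_placement({0: 4, 1: 3, 2: 2, 3: 1}, num_ranks=2))
```

Balanced expert placement matters because EP throughput is gated by the busiest rank; a skewed assignment leaves the other ranks waiting at every all-to-all.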
Benchmarks
Training Throughput

Memory Efficiency
SFR-RL’s infrastructure employs extremely memory-efficient methods to achieve the highest “intelligence” per GPU capacity. In other words, SFR-RL requires the fewest possible GPUs to train a large model, at full parameters, with a certain context length. As shown in the figure below, SFR-RL is 10x more memory-efficient than Megatron with complex parallelism settings, and 250x more efficient than VERL with FSDP and context parallelism.

Conclusion
The agentic era demands a new kind of RL training infrastructure — one that can handle variable-length multi-turn trajectories, massive MoE models, thousands of concurrent tool calls, and the inevitable failures that come with long-running distributed workloads.
SFR-RL addresses these challenges through pipelined synchronous training that uses the full cluster for both rollout and training, a resilient inference layer with automatic recovery, scalable local-first tool execution, and native support for Expert Parallelism. The result is a system that achieves near-complete GPU utilization while maintaining on-policy training guarantees — enabling us to efficiently train frontier agentic models at scale.
We are excited to continue pushing the boundaries of agentic RL and look forward to sharing more results as our models and infrastructure evolve.
Salesforce AI Research