Reinforcement Learning Optimization for Large-scale Learning
A joint project by Alibaba Future Living Lab and AI Engine Team
pioneering the future of autonomous intelligence
ROLL is pioneering the future of Reinforcement Learning (RL), with a strong emphasis on exploring and shaping innovative forms of future living powered by advanced RL technologies. We are dedicated to pushing the boundaries of large-scale learning systems and to developing cutting-edge solutions for real-world applications, from agentic AI to LLM reasoning enhancement.
Key technical reports and system papers
Agentic RL is still at an early stage, and progress is less about algorithmic breakthroughs and more about end-to-end system-level co-design. We share practical lessons from tackling long-horizon credit assignment, partial observability, noisy failures, and fragile environments—not as best practices, but as hard-earned insights from real experiments.
We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models. ALE consists of three core components working in harmony to enable efficient agent development.
RollArc is a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure, achieving a 1.35-2.05x end-to-end training time reduction through intelligent workload mapping.
ROLL is an efficient, scalable, and user-friendly library designed for Reinforcement Learning Optimization for Large-scale Learning, serving tech pioneers, developers, and researchers.
Complete publication list
We introduce ROLL, an efficient, scalable, and user-friendly library designed for Reinforcement Learning Optimization for Large-scale Learning. ROLL features a single-controller architecture, parallel strategy modules, a rollout scheduler for fine-grained sample management, and AutoDeviceMapping for flexible resource allocation across different models and training stages.
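As a rough illustration of the single-controller pattern, the sketch below shows one driver loop that hands prompts to a toy rollout scheduler, scores the responses, and feeds them to a training step. All names (RolloutScheduler, generate, score, train_step) are hypothetical stand-ins, not ROLL's actual API.

```python
# Minimal single-controller sketch with hypothetical names (not the ROLL API):
# one driver process owns the loop and dispatches rollout, reward, and training work.
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    prompt: str
    response: str = ""
    reward: float = 0.0

class RolloutScheduler:
    """Toy fine-grained sample manager: hands out prompts, collects finished samples."""
    def __init__(self, prompts: List[str]):
        self.pending = list(prompts)
        self.finished: List[Sample] = []

    def next_batch(self, n: int) -> List[Sample]:
        batch, self.pending = self.pending[:n], self.pending[n:]
        return [Sample(p) for p in batch]

    def collect(self, samples: List[Sample]) -> None:
        self.finished.extend(samples)

def generate(samples: List[Sample]) -> List[Sample]:   # stand-in for an inference engine
    for s in samples:
        s.response = s.prompt[::-1]
    return samples

def score(samples: List[Sample]) -> List[Sample]:      # stand-in for a reward model / verifier
    for s in samples:
        s.reward = float(len(s.response))
    return samples

def train_step(samples: List[Sample]) -> float:        # stand-in for a policy update
    return sum(s.reward for s in samples) / max(len(samples), 1)

if __name__ == "__main__":
    sched = RolloutScheduler([f"prompt-{i}" for i in range(8)])
    while sched.pending:                                # the single controller drives everything
        sched.collect(score(generate(sched.next_batch(4))))
    print("mean reward:", train_step(sched.finished))
```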
This paper systematically reviews widely adopted RL techniques for LLM reasoning through rigorous reproductions and isolated evaluations. We analyze internal mechanisms, applicable scenarios, and core principles through fine-grained experiments across datasets of varying difficulty, model sizes, and architectures. We reveal that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using vanilla PPO loss.
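The vanilla PPO loss referenced here is the standard clipped surrogate objective; as a minimal sketch, the snippet below computes it with batch-normalized rewards standing in for critic-based advantages. The normalization choice is an illustrative assumption, not the specific two-technique recipe studied in the paper.

```python
# Vanilla PPO clipped surrogate loss, critic-free variant for illustration:
# advantages come from batch reward normalization rather than a learned value head.
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped surrogate objective (returned as a loss to minimize)."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# Critic-free advantage estimate: normalize rewards within the batch (an assumption).
rewards = np.array([1.0, 0.0, 1.0, 0.0])
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = ppo_clip_loss(np.array([-1.2, -0.9, -1.1, -1.0]),
                     np.array([-1.0, -1.0, -1.0, -1.0]), adv)
print(f"loss: {loss:.4f}")
```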
We introduce tail batching, a novel rollout scheduling strategy for synchronous RL that consolidates prompts leading to long-tail responses into a small subset of rollout steps. RollPacker, the system built around this strategy, achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL through holistic optimizations across all three RL stages: elastic parallelism adaptation, dynamic resource allocation, and stream-based training.
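A minimal sketch of the tail-batching idea, under the assumption that prompts likely to produce very long responses can be identified up front: long-tail prompts are routed into a small set of dedicated rollout steps so that regular steps are not stalled by stragglers. The length predictor and threshold are placeholders, not RollPacker internals.

```python
# Toy tail-batching: split prompts into regular steps and consolidated tail steps.
from typing import Dict, List, Tuple

def tail_batch(prompts: List[str],
               predicted_len: Dict[str, int],
               tail_threshold: int = 2048,
               batch_size: int = 4) -> Tuple[List[List[str]], List[List[str]]]:
    short = [p for p in prompts if predicted_len[p] <= tail_threshold]
    tail = [p for p in prompts if predicted_len[p] > tail_threshold]

    def chunk(xs: List[str]) -> List[List[str]]:
        return [xs[i:i + batch_size] for i in range(0, len(xs), batch_size)]

    return chunk(short), chunk(tail)   # regular rollout steps, consolidated tail steps

prompts = [f"p{i}" for i in range(10)]
lengths = {p: (4096 if i % 5 == 0 else 512) for i, p in enumerate(prompts)}
regular, tail = tail_batch(prompts, lengths)
print(len(regular), "regular steps,", len(tail), "tail step(s)")
```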
We introduce GEM (General Experience Maker), an open-source environment simulator designed for agentic LLM training. Analogous to OpenAI Gym for traditional RL, GEM provides a standardized framework with asynchronous vectorized execution, flexible wrappers, and a diverse suite of 24 environments. We provide baselines using REINFORCE with Return Batch Normalization (ReBN) and conduct comprehensive benchmarking of PPO, GRPO, and REINFORCE.
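To make the Gym-style interaction loop and ReBN concrete, here is a toy episode loop with per-episode returns normalized across a batch. The environment and function names are placeholders, not GEM's actual interface.

```python
# Toy Gym-style loop plus return batch normalization (illustrative, not the GEM API).
import random

class ToyEnv:
    def reset(self):
        self.t = 0
        return "obs-0"

    def step(self, action):
        self.t += 1
        reward = random.random()
        done = self.t >= 3
        return f"obs-{self.t}", reward, done

def rollout(env, policy):
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, r, done = env.step(policy(obs))
        total += r
    return total

def normalize_returns(returns, eps=1e-8):
    """ReBN-style step: standardize episode returns across the batch."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return [(r - mean) / ((var ** 0.5) + eps) for r in returns]

returns = [rollout(ToyEnv(), policy=lambda obs: "noop") for _ in range(8)]
print(normalize_returns(returns))
```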
We introduce AsyPPO, a framework that restores the critic's role in RL for LLMs while remaining efficient at scale. AsyPPO employs lightweight mini-critics trained on disjoint prompt shards to encourage diversity while preserving calibration. It leverages inter-critic uncertainty to mask advantages in low-signal states and filter high-divergence states from entropy regularization, achieving over 6% performance gains on Qwen3-4B-Base.
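A hedged sketch of the general idea of gating on inter-critic disagreement: several lightweight critics score the same states, and the spread of their estimates decides which advantages are kept and which states enter entropy regularization. The thresholds and gating directions below are illustrative assumptions, not AsyPPO's exact rules.

```python
# Illustrative gating on inter-critic disagreement (assumed thresholds and directions).
import numpy as np

values = np.array([            # shape: (num_mini_critics, num_states)
    [0.10, 0.50, 0.90, 0.30],
    [0.12, 0.48, 0.20, 0.31],
    [0.09, 0.52, 0.70, 0.29],
])
advantages = np.array([0.4, -0.2, 0.8, 0.1])
entropies = np.array([1.2, 0.9, 1.5, 1.1])

disagreement = values.std(axis=0)      # per-state uncertainty across mini-critics
adv_mask = disagreement > 0.01         # mask out near-zero-signal states (assumed rule)
ent_mask = disagreement < 0.2          # drop high-divergence states from entropy reg (assumed rule)

policy_term = (advantages * adv_mask).mean()
entropy_term = (entropies * ent_mask).mean()
print(policy_term, entropy_term)
```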
We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. Built on fine-grained parallelism and rollout-train decoupling, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture with queue scheduling and environment-level asynchronous execution. It achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks.
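The rollout-train decoupling pattern can be pictured as a producer/consumer pipeline over a queue, as in the minimal sketch below. Threads and names are illustrative; this is not ROLL Flash's interface.

```python
# Toy asynchronous pipeline: rollout workers produce samples, a trainer consumes them.
import queue
import threading
import time

sample_queue: "queue.Queue[dict]" = queue.Queue(maxsize=16)

def rollout_worker(worker_id: int, n: int) -> None:
    for i in range(n):
        time.sleep(0.01)                                   # stand-in for generation latency
        sample_queue.put({"worker": worker_id, "idx": i, "reward": 1.0})

def trainer(total: int) -> None:
    seen = 0
    while seen < total:
        batch = [sample_queue.get() for _ in range(4)]     # consume as samples arrive
        seen += len(batch)                                 # stand-in for an async policy update

workers = [threading.Thread(target=rollout_worker, args=(w, 8)) for w in range(2)]
train = threading.Thread(target=trainer, args=(16,))
for t in workers + [train]:
    t.start()
for t in workers + [train]:
    t.join()
print("done")
```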
We position attention as a mechanistic blueprint of LLM reasoning, revealing a recurring preplan-and-anchor mechanism through two novel metrics: Windowed Average Attention Distance and Future Attention Influence. We introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling), achieving consistent performance gains by aligning optimization with the model's intrinsic reasoning rhythm.
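As a rough illustration, the snippet below computes two attention statistics in the spirit of the named metrics from a toy causal attention matrix; the window handling and normalization are assumptions for illustration, not the paper's definitions.

```python
# Toy attention statistics inspired by the two named metrics (definitions assumed).
import numpy as np

rng = np.random.default_rng(0)
attn = np.tril(rng.random((6, 6)))             # toy causal attention weights (query x key)
attn = attn / attn.sum(axis=1, keepdims=True)

def avg_attention_distance(attn, window=4):
    """Per query token: attention-weighted mean distance to keys in a trailing window."""
    T = attn.shape[0]
    out = []
    for q in range(T):
        lo = max(0, q - window + 1)
        weights = attn[q, lo:q + 1]
        dist = q - np.arange(lo, q + 1)
        out.append(float((weights * dist).sum() / (weights.sum() + 1e-8)))
    return out

def future_attention_influence(attn):
    """Per key token: total attention it receives from strictly later queries."""
    T = attn.shape[0]
    return [float(attn[k + 1:, k].sum()) for k in range(T)]

print(avg_attention_distance(attn))
print(future_attention_influence(attn))
```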
We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. Built on hardware-affinity workload mapping, fine-grained asynchrony, and statefulness-aware computation, RollArc achieves 1.35-2.05x end-to-end training time reduction. We demonstrate its scalability by training a hundreds-of-billions-parameter MoE model on an Alibaba cluster with more than 3,000 GPUs.
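A simplified picture of affinity-based workload mapping: each RL workload is greedily placed on the device pool it has the highest affinity for, subject to capacity. Scores, pool names, and the greedy rule are invented for illustration and are not RollArc's placement algorithm.

```python
# Toy affinity-based placement of RL workloads onto device pools.
affinity = {                       # workload -> {pool: affinity score}
    "rollout":  {"inference_pool": 0.9, "training_pool": 0.4},
    "reward":   {"inference_pool": 0.7, "training_pool": 0.3},
    "training": {"inference_pool": 0.2, "training_pool": 0.95},
}
capacity = {"inference_pool": 2, "training_pool": 1}

placement = {}
# Place the workload with the strongest preference first, then fall back by affinity.
for workload, scores in sorted(affinity.items(), key=lambda kv: -max(kv[1].values())):
    for pool, _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if capacity[pool] > 0:
            placement[workload] = pool
            capacity[pool] -= 1
            break

print(placement)
```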
We introduce the Agentic Learning Ecosystem (ALE), consisting of ROLL (post-training framework), ROCK (sandbox environment manager), and iFlow CLI (agent framework). We release ROME, an open-source agent trained on over one million trajectories, featuring data composition protocols and a novel Interaction-Perceptive Agentic Policy Optimization (IPA) algorithm that assigns credit over semantic interaction chunks for improved long-horizon training stability.
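To illustrate chunk-level credit assignment in the spirit of IPA, the sketch below spreads each interaction chunk's reward over its tokens instead of assigning token-level credit. Chunk boundaries and the sharing rule are illustrative assumptions, not the published algorithm.

```python
# Toy chunk-level credit assignment: share each chunk's reward uniformly over its tokens.
from typing import List

def chunk_advantages(token_rewards: List[float],
                     chunk_bounds: List[int]) -> List[float]:
    """chunk_bounds are exclusive end indices of each semantic interaction chunk."""
    advantages: List[float] = []
    start = 0
    for end in chunk_bounds:
        chunk = token_rewards[start:end]
        share = sum(chunk) / max(len(chunk), 1)
        advantages.extend([share] * len(chunk))
        start = end
    return advantages

rewards = [0.0, 0.0, 1.0, 0.0, 0.5, 0.0]   # toy per-token rewards
bounds = [3, 6]                            # two interaction chunks: tokens [0:3] and [3:6]
print(chunk_advantages(rewards, bounds))
```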
Agentic RL is still at an early stage, and progress is less about algorithmic breakthroughs and more about end-to-end system-level co-design. We share practical lessons from tackling core challenges—long-horizon credit assignment, partial observability, noisy failures, and fragile environments—across diverse, complex terminal environments. These techniques are not presented as best practices or final solutions, but as hard-earned insights from real experiments. We hope this helps others training agentic models in real environments, and perhaps saves them a few of the mistakes we made along the way.