Introduction

We are excited to release AReaL v0.2 (boba), featuring three major milestones:

  • SGLang Support: With the addition of SGLang support and a series of engineering optimizations, AReaL v0.2 achieves a 1.5x speedup over AReaL v0.1 on 7B models.
  • SOTA 7B Model: AReaL's RL training is now more stable and sample-efficient. We obtain a SOTA 7B model for mathematical reasoning, achieving pass@1 scores of 61.9 on AIME 2024 and 48.3 on AIME 2025.
  • Competitive 32B Model: We train a highly competitive 32B model at an extremely low cost, achieving results comparable to QwQ-32B using only 200 data samples.

Model (7B)              AIME 2024   AIME 2025   GPQA-Diamond
R1-Distill-Qwen-7B      55.0        39.7        47.1
Light-R1-7B-DS          56.7        44.9        40.9
AReaL-boba-RL-7B 🤗     61.9        48.3        47.6

Model (32B)             AIME 2024   AIME 2025   GPQA-Diamond
R1-Distill-Qwen-32B     72.6        54.9        63.2
QwQ-32B                 78.9        70.2        64.6
Light-R1-32B-DS         76.2        67.8        63.5
AReaL-boba-SFT-32B 🤗   78.8        62.1        60.1

Training Speed Comparison

[Figure: Throughput comparison with AReaL v0.1.0]

AReaL v0.2.0 features the following system optimizations:

  • Upgraded Generation Backend: vLLM 0.6.3 → SGLang v0.4.0

The generation backend has been upgraded from vLLM 0.6.3 to SGLang v0.4.0, leveraging its radix attention mechanism to significantly improve throughput in scenarios where multiple responses are sampled from the same prompt. Additionally, SGLang automatically flushes radix caches upon weight updates, ensuring correctness in on-policy reinforcement learning (RL). We will continue tracking advancements in the community to integrate further optimizations.

  • Optimized Training for Variable-Length Sequences & Large Batches

To handle variable sequence lengths efficiently, we eliminate padding and instead pack sequences into 1D tensors. A dynamic allocation algorithm distributes sequences approximately optimally under a maximum token budget, balancing micro-batch sizes while minimizing the number of micro-batches. This maximizes GPU memory utilization and enables efficient computation over large batches of variable-length inputs.
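
As a rough illustration, the following is a minimal sketch of greedy packing under a per-micro-batch token budget; the function and variable names are ours, not AReaL's actual implementation.

```python
from typing import List

def pack_sequences(seq_lens: List[int], max_tokens: int) -> List[List[int]]:
    """Greedily pack variable-length sequences into micro-batches under a token budget.

    Sequences are sorted longest-first and each is placed into the existing
    micro-batch with the lowest current load that still fits (best-fit-decreasing),
    which keeps the number of micro-batches small and their token counts balanced.
    """
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    batches: List[List[int]] = []  # sequence indices per micro-batch
    loads: List[int] = []          # current token count per micro-batch
    for i in order:
        fits = [b for b in range(len(batches)) if loads[b] + seq_lens[i] <= max_tokens]
        if fits:
            b = min(fits, key=lambda b: loads[b])
            batches[b].append(i)
            loads[b] += seq_lens[i]
        else:
            batches.append([i])
            loads.append(seq_lens[i])
    return batches

# Example: pack five sequences under a 4096-token budget.
print(pack_sequences([3000, 1200, 800, 2500, 600], max_tokens=4096))
```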

  • High-Performance Data Transfer for 1K-GPU Scaling

AReaL employs NCCL with GPU-Direct RDMA (GDRDMA) over InfiniBand/RoCE, enabling direct GPU-to-GPU communication that bypasses costly CPU-mediated transfers and PCIe bottlenecks. Compared to traditional Ethernet-based approaches, this reduces latency and increases throughput, keeping generation-to-training data transfer overhead below 3 seconds even in a 1,000-GPU cluster.
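
For context, direct GPU-to-GPU transfer with NCCL in PyTorch looks roughly like the sketch below; this is a generic illustration with hypothetical rank roles, not AReaL's actual transfer code.

```python
import torch
import torch.distributed as dist

# NCCL uses GPU-Direct RDMA over InfiniBand/RoCE when the fabric supports it.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

if rank == 0:
    # Hypothetical generation worker: send a packed rollout tensor to the trainer.
    rollout = torch.randn(8192, 4096, device="cuda")
    dist.send(rollout, dst=1)  # GPU-to-GPU, no CPU staging
elif rank == 1:
    # Hypothetical training worker: receive directly into GPU memory.
    buf = torch.empty(8192, 4096, device="cuda")
    dist.recv(buf, src=0)
```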

Training Recipe

SOTA 7B model using RL on math reasoning

Base Model

We use R1-Distill-Qwen-7B as our foundation model.

Dataset Curation

Our training dataset (AReaL-boba-106k) combines resources from multiple open-source projects and is further enhanced with challenging problems drawn from additional sources.

To maintain an appropriate difficulty level, we filter out overly simple questions. Specifically, we generate 8 solutions per question using DeepSeek-R1-Distill-Qwen-7B and discard questions for which all 8 solutions are correct.
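
As an illustration of this difficulty filter, here is a minimal, self-contained sketch; the sampler and verifier are toy stand-ins for the actual model and math checker.

```python
import random
from typing import Callable, Dict, List

def is_too_easy(question: Dict, sample_fn: Callable[[str], str],
                verify_fn: Callable[[str, str], bool], n_samples: int = 8) -> bool:
    """Return True if every sampled solution is correct, i.e. the question
    no longer provides a useful learning signal."""
    solutions = [sample_fn(question["prompt"]) for _ in range(n_samples)]
    return all(verify_fn(sol, question["answer"]) for sol in solutions)

# Toy stand-ins for the real sampler (DeepSeek-R1-Distill-Qwen-7B) and verifier.
def toy_sampler(prompt: str) -> str:
    return random.choice(["42", "41"])

def toy_verifier(solution: str, answer: str) -> bool:
    return solution == answer

dataset: List[Dict] = [{"prompt": "What is 6 * 7?", "answer": "42"}]
filtered = [q for q in dataset if not is_too_easy(q, toy_sampler, toy_verifier)]
print(filtered)
```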

Reward Function

We adopt a sparse sequence-level reward mechanism. The model is instructed to enclose the final answer within \boxed{}, and the boxed answer is then verified. Correct responses receive a reward of +5, while incorrect ones are penalized with -5.

Notably, we observe that the KL reward can impair performance, particularly in long chain-of-thought training, so we set it to zero.
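
A minimal sketch of this sparse, sequence-level reward is shown below; the \boxed{} extraction and string comparison are simplified placeholders for a proper math verifier.

```python
import re

def sequence_reward(response: str, ground_truth: str) -> float:
    """Sparse sequence-level reward: +5 if the final boxed answer matches, -5 otherwise.
    No KL reward is added (the KL coefficient is set to zero)."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not boxed:
        return -5.0
    predicted = boxed[-1].strip()  # treat the last boxed expression as the final answer
    return 5.0 if predicted == ground_truth.strip() else -5.0

print(sequence_reward(r"The answer is \boxed{42}.", "42"))  # 5.0
print(sequence_reward(r"The answer is \boxed{41}.", "42"))  # -5.0
```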

RL Algorithm

We employ Proximal Policy Optimization (PPO) as our training algorithm and remove the critic model to save compute. We set both the discount factor γ and the GAE parameter λ to 1; these choices are also adopted by the Open-Reasoner-Zero project.
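
For reference, under the standard GAE definition these settings collapse the estimator to the undiscounted return-to-go:

$$
\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l},
\qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
$$

which for $\gamma = \lambda = 1$ telescopes to $\hat{A}_t = \sum_{t' \ge t} r_{t'} - V(s_t)$. With the critic removed there is no learned $V(s_t)$, so with our sparse sequence-level reward the advantage effectively reduces to the sequence return, which is subsequently normalized (see Advantage Normalization below).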

Token-Level Loss Normalization

Averaging the loss at the sequence level can underweight the overall contribution of longer responses. To address this, we normalize the loss at the token level, a practice also highlighted in DAPO.
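
A minimal sketch of the difference, assuming per-token losses and a validity mask over generated tokens (names are illustrative):

```python
import torch

def sequence_level_loss(token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average within each sequence first, then across sequences: every sequence
    gets equal weight, so tokens in long responses are underweighted."""
    per_seq = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_seq.mean()

def token_level_loss(token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average over all valid tokens in the batch: every token contributes equally,
    regardless of which sequence it belongs to."""
    return (token_loss * mask).sum() / mask.sum().clamp(min=1)

# Example: two responses of lengths 2 and 6, padded to 6.
token_loss = torch.rand(2, 6)
mask = torch.tensor([[1, 1, 0, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1]], dtype=torch.float32)
print(sequence_level_loss(token_loss, mask), token_level_loss(token_loss, mask))
```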

Rollout Strategy

During the rollout phase, we sample 512 questions per batch and the LLM generates 16 responses per question, yielding 8,192 responses per training batch. To minimize output truncation, we set the maximum generation length to 27K tokens. In our experiments, the truncation rate remained below 5%.

Advantage Normalization

During the training phase, we compute advantages using GAE and normalize them across all generated tokens.
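
A minimal sketch of normalizing advantages across all generated tokens in a batch (the mask and names are illustrative):

```python
import torch

def normalize_advantages(adv: torch.Tensor, mask: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Whiten advantages using the mean and std computed over all generated
    (non-padding) tokens in the batch, leaving padded positions unchanged."""
    valid = mask.bool()
    mean = adv[valid].mean()
    std = adv[valid].std()
    return torch.where(valid, (adv - mean) / (std + eps), adv)

adv = torch.randn(2, 6)  # per-token advantages for two responses
mask = torch.tensor([[1, 1, 0, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1]])
print(normalize_advantages(adv, mask))
```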

Key Hyperparameters

Parameter         Value
PPO Minibatches   4
Learning Rate     2e-5
Adam ε            1e-5
Batch Size        8,192

This configuration balances convergence speed with training stability, avoiding collapse risks from higher learning rates or smaller ε values.
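
For illustration, these values map onto a standard Adam setup roughly as follows; this is a sketch, and AReaL's actual optimizer construction may differ.

```python
import torch

policy = torch.nn.Linear(8, 8)  # stand-in for the policy model
optimizer = torch.optim.Adam(
    policy.parameters(),
    lr=2e-5,   # learning rate from the table above
    eps=1e-5,  # larger than the default 1e-8, for stability
)
```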

Approaching QwQ-32B's performance using only 200 data samples

Model                   AIME 2024
R1-Distill-Qwen-32B     72.6
QwQ-32B                 78.9
AReaL-boba-SFT-32B 🤗   78.8

At the 32B model size, we further refined the training data and released AReaL-boba-SFT-200, a high-quality dataset with only 200 data points. Together with the accompanying training scripts, it replicates QwQ-32B's inference performance on AIME 2024 via supervised fine-tuning (SFT).

Evaluation Best Practices

During evaluation, we use vLLM v0.6.3 as the generation framework. We identified several settings that influence evaluation performance, particularly for long-context generation. We recommend manually configuring the following options:

enforce_eager=True
enable_chunked_prefill=False
disable_custom_all_reduce=True
disable_sliding_window=True
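
These correspond to vLLM engine arguments; a minimal sketch of constructing the engine with them (the model path is a placeholder for whichever checkpoint is being evaluated):

```python
from vllm import LLM

llm = LLM(
    model="inclusionAI/AReaL-boba-SFT-32B",  # placeholder: checkpoint under evaluation
    enforce_eager=True,
    enable_chunked_prefill=False,
    disable_custom_all_reduce=True,
    disable_sliding_window=True,
)
```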

Following the practice of DeepSeek models, we incorporate a directive in the prompt: "Please reason step by step, and enclose your final answer in \boxed{}." To encourage long-context reasoning, we also enforce that the model's output begins with "\n".

To ensure reliable pass@1 estimation (a sampling sketch follows the list below), we:

  • Sample 32 answers per problem
  • Use temperature=0.6 and top_p=0.95 for SFT models
  • Maintain training temperature (1.0) for RL models
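
A minimal sketch of this pass@1 estimate with vLLM sampling; `llm` is the engine from the snippet above, and `is_correct` is a placeholder for the real answer verifier.

```python
from vllm import SamplingParams

# 32 samples per problem; temperature 0.6 / top_p 0.95 for SFT models
# (switch to temperature 1.0 for RL models).
params = SamplingParams(n=32, temperature=0.6, top_p=0.95)

def pass_at_1(llm, prompt: str, answer: str, is_correct) -> float:
    """pass@1 estimated as the fraction of sampled completions that are correct."""
    completions = llm.generate([prompt], params)[0].outputs
    return sum(is_correct(c.text, answer) for c in completions) / len(completions)
```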

Conclusion & Future Work

Our results demonstrate that high-quality data is just as critical as algorithmic innovation. When conducting RL training on a strong base model, more challenging problems are needed to provide a useful learning signal. We therefore integrate resources from multiple recent open-source projects and filter problems by difficulty: a straightforward strategy is to remove problems that the base model consistently solves correctly across multiple sampling attempts, as these no longer contribute to improving performance.

AReaL delivers stable and fast training with cutting-edge model performance. Since the initial release, we have continuously improved system efficiency, training stability, and accessibility.

All aforementioned techniques have been implemented in AReaL, with reproducible configurations for various model sizes and different hardware setups.

Looking ahead, the AReaL team will:

  • Further optimize system performance
  • Introduce new features
  • Continue open-sourcing training data
  • Expand to broader reasoning tasks

We believe these contributions lower the barrier to high-quality RL training while pushing the boundaries of reasoning capabilities. The project will continue to evolve, and we welcome community feedback and collaboration to drive further progress in this exciting field.
