metadata
license: mit
datasets:
- wizardII/ArcherCodeR-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: reinforcement-learning
tags:
- code
new_version: wizardII/ArcherCodeR-1.5B-DAPO
language:
- en
Overview

ArcherCodeR-1.5B-DAPO achieves progressive improvements on LiveCodeBench (LCB), reaching 27.24% LCB score.
ArcherCodeR is an open-source initiative enhancing code reasoning in large language models through scalable, rule-governed reinforcement learning. We provide full-stack reproducibility including:
- Training code and pipelines
- Curated datasets
- Trained models
- Complete training logs
Current Models:
- ArcherCodeR-1.5B-DAPO - achieves state-of-the-art performance on code tasks (LiveCodeBench) among comparable-scale models (excluding our final ArcherCodeR-1.5B). All training components for this model are now fully released.
- ArcherCodeR-1.5B - SOTA among similarly-sized models (training pipeline releasing progressively)
Evaluation
Performance on LiveCodeBench. The Pass@1 metric represents the average performance across 4 independent sampling attempts. To ensure consistency, we re-evaluated all comparable open-source models using identical evaluation scripts and parameters (temperature=0.8, max_gen_length=32k).
The detailed results are shown in the table below.
Model | LCB (8/1/24-2/1/25)(Pass@1) | LCB (8/1/24-2/1/25)(Pass@4) |
---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 16.9 | — |
DeepSeek-R1-Distill-Qwen-1.5B(Tested) | 16.40 | 25.81 |
DeepCoder-1.5B | 25.1 | — |
DeepCoder-1.5B(Tested) | 23.03 | 30.82 |
Nemotron-Research-Reasoning-Qwen-1.5B | 23.81 | — |
Nemotron-Research-Reasoning-Qwen-1.5B(Tested) | 25.45 | 34.40 |
ArcherCodeR-1.5B-DAPO | 26.70 | 36.56 |
ArcherCodeR-1.5B(32k) | 28.49 | 38.71 |
ArcherCodeR-1.5B(48k) | 29.30 | 39.07 |
Note:
- Evaluation variance for the same model is typically within ±0.5 across multiple runs.
- DeepCoder consistently scored around 23 in our tests - lower than its reported performance.
- NVIDIA's Nemotron-Research-Reasoning-Qwen-1.5B slightly outperformed its reported score, potentially due to different parameter settings in their original evaluation.
Technical Report
Coming soon.
Acknowledgements
- We build our model upon
DeepSeek-R1-Distill-Qwen-1.5B
. - Training was carried out with a modified version of verl.