---
license: mit
datasets:
- wizardII/ArcherCodeR-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: reinforcement-learning
tags:
- code
new_version: wizardII/ArcherCodeR-1.5B
language:
- en
---


<div align="center">

# ✨ ArcherCodeR

<div>
🏹️  Reinforcement Learning for Enhanced Code Reasoning in LLMs  🎯
</div>

</div>
<div>
<br>

<div align="center">

[![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/wizard-III/ArcherCodeR)
[![Model](https://img.shields.io/badge/Model-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor)](https://huggingface.co/wizardII/ArcherCodeR-1.5B)
[![Data](https://img.shields.io/badge/Data-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor)](https://huggingface.co/datasets/wizardII/ArcherCodeR-Dataset)
[![Wandb](https://img.shields.io/badge/Wandb-000000?style=for-the-badge&logo=Wandb&logoColor=000&labelColor)](https://wandb.ai/wangjkpkucs-peking-university/ArcherCodeR?nw=nwuserwangjkpkucs)
[![知乎](https://img.shields.io/badge/知乎-0084FF?style=for-the-badge&logo=zhihu&logoColor=white)](https://zhuanlan.zhihu.com/p/1918765619614057424)

</div>

## Overview

<div align="center">
<img src="assets/ArcherCodeR-1.5B-DAPO.png" width="100%"/>

<sub>ArcherCodeR-1.5B-DAPO achieves progressive improvements on LiveCodeBench (LCB), reaching an LCB score of 27.24%.</sub>
</div>

**ArcherCodeR** is an open-source initiative for enhancing code reasoning in large language models through scalable, rule-governed reinforcement learning. For full-stack reproducibility, we provide:

- Training code and pipelines
- Curated datasets
- Trained models
- Complete training logs
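
For example, the curated dataset can be pulled directly from the Hugging Face Hub with the `datasets` library. This is an illustrative sketch only; the split name is an assumption, so check the dataset card for the actual layout.

```python
# Illustrative sketch: load the ArcherCodeR training data from the Hugging Face Hub.
# The "train" split name is an assumption; see the dataset card for the actual layout.
from datasets import load_dataset

ds = load_dataset("wizardII/ArcherCodeR-Dataset", split="train")
print(ds)      # features and number of examples
print(ds[0])   # inspect the first record
```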

**Current Models**:
- **[ArcherCodeR-1.5B-DAPO](https://huggingface.co/wizardII/ArcherCodeR-1.5B-DAPO)** - Achieves state-of-the-art performance on code tasks (LiveCodeBench) among comparable-scale models, excluding our final ArcherCodeR-1.5B. All training components for this model are fully released.
- **[ArcherCodeR-1.5B](https://huggingface.co/wizardII/ArcherCodeR-1.5B)** - Achieves state-of-the-art performance among similarly sized models; its training pipeline is being released progressively.
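
Both checkpoints follow the standard Hugging Face `transformers` interface. Below is a minimal, illustrative inference sketch; the prompt, token budget, and sampling settings are placeholders rather than the official evaluation setup.

```python
# Illustrative usage sketch (requires transformers, torch, and accelerate for device_map="auto").
# Prompt and generation settings are placeholders, not the official evaluation configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wizardII/ArcherCodeR-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that returns the n-th Fibonacci number."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=2048, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```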

## Evaluation

Performance on LiveCodeBench. Pass@1 is the average accuracy over 4 independent samples per problem; Pass@4 counts a problem as solved if at least one of the 4 samples passes. To ensure consistency, we re-evaluated all comparable open-source models with identical evaluation scripts and parameters (temperature=0.8, max_gen_length=32k).
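
For reference, the sketch below shows one simple way to compute Pass@1 and Pass@4 from per-sample verdicts with 4 samples per problem; it is a simplified illustration, not our evaluation harness.

```python
# Simplified Pass@1 / Pass@4 computation from per-sample test verdicts.
# results[i][j] is True if the j-th sample for problem i passes all tests (4 samples per problem).
from typing import List

def pass_at_1(results: List[List[bool]]) -> float:
    # Average per-sample accuracy over all problems.
    return sum(sum(r) / len(r) for r in results) / len(results)

def pass_at_4(results: List[List[bool]]) -> float:
    # Fraction of problems with at least one passing sample.
    return sum(any(r) for r in results) / len(results)

# Toy example with 3 problems and 4 samples each.
toy = [[True, False, False, True], [False, False, False, False], [True, True, True, True]]
print(pass_at_1(toy), pass_at_4(toy))  # 0.5 0.666...
```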

The detailed results are shown in the table below.

| Model                                         | LCB (8/1/24-2/1/25) Pass@1  | LCB (8/1/24-2/1/25) Pass@4  |
| --------------------------------------------- | --------------------------- | --------------------------- |
| DeepSeek-R1-Distill-Qwen-1.5B                 | 16.9                        | —                           |
| DeepSeek-R1-Distill-Qwen-1.5B(Tested)         | 16.40                       | 25.81                       |
| DeepCoder-1.5B                                | 25.1                        | —                           |
| DeepCoder-1.5B(Tested)                        | 23.03                       | 30.82                       |
| Nemotron-Research-Reasoning-Qwen-1.5B         | 23.81                       | —                           |
| Nemotron-Research-Reasoning-Qwen-1.5B(Tested) | 25.45                       | 34.40                       |
| **ArcherCodeR-1.5B-DAPO**                     | 26.70                       | 36.56                       |
| **ArcherCodeR-1.5B(32k)**                     | 28.49                       | 38.71                       |
| **ArcherCodeR-1.5B(48k)**                     | 29.30                       | 39.07                       |

Note:
1. Evaluation variance for the same model is typically within ±0.5 across multiple runs.
2. DeepCoder consistently scored around 23 in our tests, lower than its reported performance.
3. NVIDIA's Nemotron-Research-Reasoning-Qwen-1.5B slightly outperformed its reported score, potentially due to different parameter settings in their original evaluation.

## Technical Report
Coming soon.

## Acknowledgements

- We build our model upon [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- Training was carried out with a modified version of [verl](https://github.com/volcengine/verl).