---
license: mit
datasets:
- wizardII/ArcherCodeR-Dataset
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: reinforcement-learning
tags:
- code
new_version: wizardII/ArcherCodeR-1.5B-DAPO
language:
- en
---


<div align="center">

# ✨ ArcherCodeR

<div>
🏹️  Reinforcement Learning for Enhanced Code Reasoning in LLMs  🎯
</div>

</div>
<br>

<div align="center">

[![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/wizard-III/ArcherCodeR)
[![Model](https://img.shields.io/badge/Model-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor)](https://huggingface.co/wizardII/ArcherCodeR-1.5B)
[![Data](https://img.shields.io/badge/Data-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor)](https://huggingface.co/datasets/wizardII/ArcherCodeR-Dataset)
[![Wandb](https://img.shields.io/badge/Wandb-000000?style=for-the-badge&logo=Wandb&logoColor=000&labelColor)](https://wandb.ai/wangjkpkucs-peking-university/ArcherCodeR?nw=nwuserwangjkpkucs)
[![知乎](https://img.shields.io/badge/知乎-0084FF?style=for-the-badge&logo=zhihu&logoColor=white)](https://zhuanlan.zhihu.com/p/1918765619614057424)

</div>

## Overview

<div align="center">
<img src="assets/ArcherCodeR-1.5B-DAPO.png" width="100%"/>

<sub>ArcherCodeR-1.5B-DAPO improves steadily on LiveCodeBench (LCB) over the course of training, reaching a 27.24% LCB score.</sub>
</div>

**ArcherCodeR** is an open-source initiative for enhancing code reasoning in large language models through scalable, rule-governed reinforcement learning. We provide full-stack reproducibility, including:

- Training code and pipelines
- Curated datasets
- Trained models
- Complete training logs
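
In this setting, "rule-governed" reinforcement learning typically means the reward is computed by deterministic rules, most commonly by executing the generated program against unit tests, rather than by a learned reward model. The sketch below illustrates the idea with a binary, test-execution reward; the function name, the bare `subprocess` call, and the assert-style test format are illustrative assumptions, not the exact reward implementation used in ArcherCodeR.

```python
import os
import subprocess
import tempfile

def rule_based_reward(program: str, tests: list[str], timeout_s: float = 5.0) -> float:
    """Binary rule-based reward: 1.0 if the candidate program passes every
    assert-style test, 0.0 otherwise (including on timeout or crash).
    Illustrative only; a real pipeline would run this in a proper sandbox."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "solution.py")
        with open(path, "w") as f:
            # Run the candidate program followed by its unit tests.
            f.write(program + "\n\n" + "\n".join(tests) + "\n")
        try:
            result = subprocess.run(
                ["python", path], capture_output=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            return 0.0  # rule: timeouts earn no reward
        return 1.0 if result.returncode == 0 else 0.0
```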

**Current Models**:
- **[ArcherCodeR-1.5B-DAPO](https://huggingface.co/wizardII/ArcherCodeR-1.5B-DAPO)** - achieves state-of-the-art performance on LiveCodeBench among models of comparable scale (excluding our final ArcherCodeR-1.5B). All training components for this model are fully released.
- **[ArcherCodeR-1.5B](https://huggingface.co/wizardII/ArcherCodeR-1.5B)** - state-of-the-art among similarly sized models; its training pipeline is being released progressively.
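
For quick experimentation, the snippet below loads ArcherCodeR-1.5B through the standard `transformers` causal-LM interface. It is a minimal usage sketch: the chat-template call and generation settings are reasonable defaults (matching the temperature used in our evaluation), not an official inference recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "wizardII/ArcherCodeR-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Write a Python function that returns the n-th Fibonacci number."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Sampling settings mirror the evaluation setup below (temperature=0.8).
outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```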

## Evaluation

We report performance on LiveCodeBench. Pass@1 is averaged over 4 independent sampling attempts per problem. To ensure consistency, we re-evaluated all comparable open-source models with identical evaluation scripts and parameters (temperature=0.8, max_gen_length=32k).
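
For reference, the helper below shows how Pass@1 and Pass@4 can be computed from per-sample correctness under this protocol (4 samples per problem; Pass@1 averages per-sample correctness, Pass@k counts a problem as solved if any of its k samples passes). This is a minimal sketch of the metric, not the project's evaluation script.

```python
def pass_at_1_and_k(results: list[list[bool]]) -> tuple[float, float]:
    """results[i][j] is True iff sample j for problem i passed all tests.

    Pass@1 averages per-sample correctness; Pass@k (k = samples per problem,
    here 4) is the fraction of problems solved by at least one sample."""
    n = len(results)
    pass_1 = sum(sum(samples) / len(samples) for samples in results) / n
    pass_k = sum(any(samples) for samples in results) / n
    return pass_1, pass_k

# Two problems, 4 samples each: Pass@1 = 0.25, Pass@4 = 0.5
print(pass_at_1_and_k([[True, False, False, True], [False, False, False, False]]))
```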

The detailed results are shown in the table below.

| Model                                          | LCB (8/1/24-2/1/25) Pass@1 | LCB (8/1/24-2/1/25) Pass@4 |
| ---------------------------------------------- | -------------------------- | -------------------------- |
| DeepSeek-R1-Distill-Qwen-1.5B                  | 16.9                       | —                          |
| DeepSeek-R1-Distill-Qwen-1.5B (Tested)         | 16.40                      | 25.81                      |
| DeepCoder-1.5B                                 | 25.1                       | —                          |
| DeepCoder-1.5B (Tested)                        | 23.03                      | 30.82                      |
| Nemotron-Research-Reasoning-Qwen-1.5B          | 23.81                      | —                          |
| Nemotron-Research-Reasoning-Qwen-1.5B (Tested) | 25.45                      | 34.40                      |
| **ArcherCodeR-1.5B-DAPO**                      | 26.70                      | 36.56                      |
| **ArcherCodeR-1.5B (32k)**                     | 28.49                      | 38.71                      |
| **ArcherCodeR-1.5B (48k)**                     | 29.30                      | 39.07                      |

Note:
1. Evaluation variance for the same model is typically within ±0.5 across multiple runs.
2. DeepCoder consistently scored around 23 in our tests, lower than its reported performance.
3. NVIDIA's Nemotron-Research-Reasoning-Qwen-1.5B slightly outperformed its reported score, possibly due to different parameter settings in its original evaluation.

## Technical Report
Coming soon.

## Acknowledgements

- We build our model upon [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
- Training was carried out with a modified version of [verl](https://github.com/volcengine/verl).