---
library_name: transformers
license: other
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama-factory
- value-head
- reward-model
metrics:
- accuracy
model-index:
- name: reward-model
  results:
  - task:
      type: text-classification
      name: Reward Modeling
    dataset:
      name: gsm8k_llama3.2-1B_128_1ep
      type: custom
    metrics:
    - type: accuracy
      value: 0.8810
---

# reward-model

This is a reward model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), fine-tuned on the `gsm8k_llama3.2-1B_128_1ep` dataset for use in Reinforcement Learning from Human Feedback (RLHF). It adds a custom **value head** on top of the base model, making it suitable for reward modeling and preference-based evaluation tasks.
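
For intuition, the value head is a small projection that turns each token's final hidden state into a scalar score. The sketch below is only an illustration, assuming a dropout-plus-linear layout like `trl`'s `ValueHead`; it is not the exact module shipped with this model.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Minimal sketch of a value head: one scalar score per token."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.summary = nn.Linear(hidden_size, 1)  # hidden state -> scalar reward

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        return self.summary(self.dropout(hidden_states)).squeeze(-1)
```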

## Model Details

- **Base Model:** `meta-llama/Llama-3.1-8B-Instruct`
- **Fine-tuning Dataset:** `gsm8k_llama3.2-1B_128_1ep`
- **Accuracy:** 88.10%
- **Framework:** Transformers

## Intended Uses

- Reward modeling for RLHF
- Preference evaluation and ranking
- Fine-tuning reinforcement learning agents

## Limitations

- Performance is likely limited to math-reasoning tasks similar to the GSM8K-style training data; accuracy on other domains has not been evaluated.

## Usage

### Installation

```bash
pip install transformers torch trl
```

### Loading the Model

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model_name = "your-username/reward-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The value head lives on trl's wrapper; a plain AutoModelForCausalLM has no `v_head`.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load the trained value-head weights saved alongside the model (value_head.bin).
value_head_state = torch.load("value_head.bin", map_location="cpu")
model.v_head.summary.weight.data = value_head_state["v_head.summary.weight"]
model.v_head.summary.bias.data = value_head_state["v_head.summary.bias"]
model.eval()
```

### Inference

```python
# The value predicted at the final token serves as the reward for the whole sequence.
inputs = tokenizer("Evaluate this text.", return_tensors="pt").to(model.pretrained_model.device)
with torch.no_grad():
    logits, _, values = model(**inputs)
reward_score = values[:, -1].item()
print("Reward Score:", reward_score)
```
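
Because the reward is simply the value at the last token, preference ranking amounts to scoring each candidate response and comparing. The sketch below reuses `model` and `tokenizer` from above; the helper name, the example prompt/answers, and the use of the tokenizer's chat template are illustrative assumptions rather than part of this model's documented API.

```python
def reward(prompt: str, response: str) -> float:
    """Score a prompt/response pair with the reward model (higher is better)."""
    # Format the pair with the chat template; it already inserts special tokens.
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": response}],
        tokenize=False,
    )
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    inputs = inputs.to(model.pretrained_model.device)
    with torch.no_grad():
        _, _, values = model(**inputs)
    return values[:, -1].item()

# Rank two candidate answers to a GSM8K-style question (example strings only).
prompt = "Natalia sold clips to 48 friends in April and half as many in May. How many in total?"
better = "April: 48 clips. May: 48 / 2 = 24 clips. Total: 48 + 24 = 72 clips."
worse = "She sold 48 + 48 = 96 clips in total."

scores = {answer: reward(prompt, answer) for answer in (better, worse)}
print(max(scores, key=scores.get))  # the answer the reward model prefers
```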

## Training Procedure

### Training Hyperparameters

- Learning Rate: `1e-05`
- Total Batch Size: `128` (8 GPUs × per-device batch size 1 × gradient accumulation 16)
- Gradient Accumulation Steps: `16`
- Optimizer: `adamw_torch` (betas=`(0.9, 0.999)`, epsilon=`1e-08`)
- LR Scheduler: linear with warmup ratio `0.03`
- Number of Epochs: `1`
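
For orientation, here is a hedged sketch of how the settings above would map onto Hugging Face `TrainingArguments`. Training was run with LLaMA-Factory, so this is only an illustration of the listed hyperparameters; the output directory and the `bf16` flag are assumptions.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; not the original training script.
training_args = TrainingArguments(
    output_dir="reward-model",       # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # 8 GPUs x 16 accumulation steps -> total batch 128
    gradient_accumulation_steps=16,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    num_train_epochs=1,
    bf16=True,                       # assumed, matching the bfloat16 usage above
)
```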

### Training Results

| Epoch  | Step | Validation Loss | Accuracy |
|:------:|:----:|:---------------:|:--------:|
| 0.0856 | 5    | 0.4890          | 81.35%   |
| 0.1711 | 10   | 0.2622          | 92.04%   |
| 0.2567 | 15   | 0.1574          | 90.60%   |
| 0.3422 | 20   | 0.2161          | 90.90%   |
| 0.4278 | 25   | 0.2810          | 86.96%   |
| 0.5134 | 30   | 0.2796          | 88.32%   |
| 0.5989 | 35   | 0.2074          | 90.22%   |
| 0.6845 | 40   | 0.1866          | 90.75%   |
| 0.7701 | 45   | 0.2167          | 89.76%   |
| 0.8556 | 50   | 0.2340          | 88.78%   |
| 0.9412 | 55   | 0.2451          | 88.25%   |

### Framework Versions

- Transformers `4.46.1`
- PyTorch `2.5.1`
- Datasets `3.1.0`
- Tokenizers `0.20.1`