---
library_name: transformers
license: other
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama-factory
- value-head
- reward-model
metrics:
- accuracy
model-index:
- name: reward-model
  results:
  - task:
      type: text-classification
      name: Reward Modeling
    dataset:
      name: gsm8k_llama3.2-1B_128_1ep
      type: custom
    metrics:
    - type: accuracy
      value: 0.8810
---
# Reward Model with Value Head
This is a reward model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), fine-tuned on the `gsm8k_llama3.2-1B_128_1ep` dataset for use in Reinforcement Learning from Human Feedback (RLHF). It adds a **value head** on top of the base model that outputs a scalar score, which makes it suitable for reward modeling and preference-based evaluation.
## Model Details
- **Base Model:** `meta-llama/Llama-3.1-8B-Instruct`
- **Fine-tuning Dataset:** `gsm8k_llama3.2-1B_128_1ep`
- **Accuracy:** 88.10%
- **Framework:** Transformers
## Intended Uses
- Reward modeling for RLHF
- Preference evaluation and ranking
- Providing the reward signal when fine-tuning language models with reinforcement learning
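
For preference evaluation, a reward model is commonly used to score several candidate responses to the same prompt and keep the highest-scoring one. Pairwise accuracy, the fraction of pairs in which the preferred response scores higher, is a common way to measure such a model. The sketch below illustrates both ideas; `score_fn` is a hypothetical stand-in for a call to this model, and `toy_score` exists only to make the example runnable.

```python
from typing import Callable, Sequence

ScoreFn = Callable[[str, str], float]  # (prompt, response) -> reward score


def rank_responses(prompt: str, responses: Sequence[str], score_fn: ScoreFn):
    """Return (response, score) pairs sorted from highest to lowest reward."""
    scored = [(response, score_fn(prompt, response)) for response in responses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def pairwise_accuracy(triples, score_fn: ScoreFn) -> float:
    """Fraction of (prompt, chosen, rejected) triples where chosen scores higher."""
    if not triples:
        raise ValueError("need at least one preference triple")
    correct = sum(score_fn(p, chosen) > score_fn(p, rejected)
                  for p, chosen, rejected in triples)
    return correct / len(triples)


# Hypothetical toy scorer; in practice this would call the reward model.
def toy_score(prompt: str, response: str) -> float:
    return 1.0 if "42" in response else 0.0


ranked = rank_responses("What is 6 * 7?", ["It is 41.", "It is 42."], toy_score)
acc = pairwise_accuracy([("What is 6 * 7?", "It is 42.", "It is 41.")], toy_score)
print(ranked[0][0], acc)
```

Passing the scorer as a function keeps the ranking logic independent of how the model is loaded.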
## Limitations
- Performance is likely limited to tasks similar to GSM8K (grade-school math word problems); scores on other domains may be unreliable.
## How to Use
### Installation
```bash
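pip install transformers torch trl
```
### Model Loading Example
```python
import torch
from transformers import AutoTokenizer
# Assumption: the `v_head.summary.*` keys match TRL's value-head wrapper,
# so the model is loaded through TRL instead of plain AutoModelForCausalLM.
from trl import AutoModelForCausalLMWithValueHead

model_name = "your-username/reward-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load Value Head weights (expects value_head.bin in the working directory)
value_head_state = torch.load("value_head.bin", map_location="cpu", weights_only=True)
model.v_head.summary.weight.data = value_head_state["v_head.summary.weight"]
model.v_head.summary.bias.data = value_head_state["v_head.summary.bias"]
model.eval()
```
### Evaluation Example
```python
inputs = tokenizer("Evaluate this text.", return_tensors="pt").to(model.pretrained_model.device)
with torch.no_grad():
    # The wrapper returns (LM logits, loss, per-token values)
    _, _, values = model(**inputs)
# Use the value at the final token as the scalar reward
reward_score = values[:, -1].item()
print("Reward Score:", reward_score)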
```
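
When several texts are scored in one batch, the sequences are padded to a common length, so `values[:, -1]` may read the value at a padding position instead of the last real token. One common approach is to index each row at the last position where the attention mask is 1. The sketch below shows that indexing with plain Python lists standing in for the `values` and `attention_mask` tensors; the helper names are hypothetical and not part of any library.

```python
def last_token_index(mask_row):
    """Index of the last position whose attention-mask entry is 1."""
    for i in range(len(mask_row) - 1, -1, -1):
        if mask_row[i] == 1:
            return i
    raise ValueError("row contains no real tokens")


def batch_rewards(values, attention_mask):
    """Pick one reward per row: the value at the last non-padding token."""
    return [row[last_token_index(mask)]
            for row, mask in zip(values, attention_mask)]


# Two sequences; the first is right-padded by one position.
values = [[0.1, 0.4, 0.9, 0.0],
          [0.2, 0.3, 0.5, 0.7]]
attention_mask = [[1, 1, 1, 0],
                  [1, 1, 1, 1]]

rewards = batch_rewards(values, attention_mask)
print(rewards)  # [0.9, 0.7]
```

Scanning from the right works for both left- and right-padded batches, since it only looks for the last real token.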
## Training Procedure
### Hyperparameters
- Learning Rate: `1e-05`
- Total Batch Size: `128` (distributed across 8 GPUs)
- Gradient Accumulation Steps: `16`
- Optimizer: `adamw_torch` (betas=`(0.9,0.999)`, epsilon=`1e-08`)
- LR Scheduler: Linear with warmup ratio `0.03`
- Number of Epochs: `1`
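
The numbers above can be related with a little arithmetic. The sketch below assumes the usual convention that the total batch size equals per-device batch size × number of GPUs × gradient accumulation steps; the per-device size is not stated in this card and is only derived here. It also sketches a linear warmup-then-decay schedule of the kind the scheduler entry describes. The function names are hypothetical, and `total_steps=58` is an assumption used only for illustration.

```python
import math


def per_device_batch_size(total_batch_size, num_gpus, grad_accum_steps):
    """Derive the per-device batch size implied by the total batch size."""
    per_device, remainder = divmod(total_batch_size, num_gpus * grad_accum_steps)
    if remainder:
        raise ValueError("total batch size is not divisible by GPUs x accumulation")
    return per_device


def linear_warmup_decay_lr(step, total_steps, base_lr=1e-5, warmup_ratio=0.03):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(per_device_batch_size(128, num_gpus=8, grad_accum_steps=16))  # 1

# total_steps=58 is an assumed value, not a figure reported in this card.
for step in (0, 2, 30, 58):
    print(step, linear_warmup_decay_lr(step, total_steps=58))
```

With these assumptions the warmup lasts 2 steps, so the learning rate peaks at `1e-05` almost immediately and decays linearly for the rest of the epoch.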
### Training Results
| Epoch | Step | Validation Loss | Accuracy |
|:------:|:----:|:---------------:|:--------:|
| 0.0856 | 5 | 0.4890 | 81.35% |
| 0.1711 | 10 | 0.2622 | 92.04% |
| 0.2567 | 15 | 0.1574 | 90.60% |
| 0.3422 | 20 | 0.2161 | 90.90% |
| 0.4278 | 25 | 0.2810 | 86.96% |
| 0.5134 | 30 | 0.2796 | 88.32% |
| 0.5989 | 35 | 0.2074 | 90.22% |
| 0.6845 | 40 | 0.1866 | 90.75% |
| 0.7701 | 45 | 0.2167 | 89.76% |
| 0.8556 | 50 | 0.2340 | 88.78% |
| 0.9412 | 55 | 0.2451 | 88.25% |
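
The validation metrics in the table do not improve monotonically, so it can be useful to check which evaluation step scored best. The sketch below copies the table into a list of tuples (the layout is an illustrative choice) and picks the step with the lowest validation loss and the step with the highest accuracy.

```python
# (epoch, step, validation_loss, accuracy_percent), copied from the table above
training_log = [
    (0.0856, 5, 0.4890, 81.35),
    (0.1711, 10, 0.2622, 92.04),
    (0.2567, 15, 0.1574, 90.60),
    (0.3422, 20, 0.2161, 90.90),
    (0.4278, 25, 0.2810, 86.96),
    (0.5134, 30, 0.2796, 88.32),
    (0.5989, 35, 0.2074, 90.22),
    (0.6845, 40, 0.1866, 90.75),
    (0.7701, 45, 0.2167, 89.76),
    (0.8556, 50, 0.2340, 88.78),
    (0.9412, 55, 0.2451, 88.25),
]

best_loss_row = min(training_log, key=lambda row: row[2])
best_acc_row = max(training_log, key=lambda row: row[3])

print("Lowest validation loss:", best_loss_row[2], "at step", best_loss_row[1])
print("Highest accuracy:", best_acc_row[3], "at step", best_acc_row[1])
```

In this log the two criteria point to different steps (15 for loss, 10 for accuracy), which is worth keeping in mind when comparing checkpoints.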
## Framework Versions
- Transformers `4.46.1`
- PyTorch `2.5.1`
- Datasets `3.1.0`
- Tokenizers `0.20.1`