---
library_name: transformers
license: other
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama-factory
- value-head
- reward-model
metrics:
- accuracy
model-index:
- name: reward-model
  results:
  - task:
      type: text-classification
      name: Reward Modeling
    dataset:
      name: gsm8k_llama3.2-1B_128_1ep
      type: custom
    metrics:
    - type: accuracy
      value: 0.8810
---

# Reward Model with Value Head

This is a reward model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), fine-tuned on the `gsm8k_llama3.2-1B_128_1ep` dataset to serve as the reward model in a Reinforcement Learning from Human Feedback (RLHF) pipeline. The model adds a scalar **value head** on top of the base model's final hidden states, making it suitable for reward modeling and preference-based evaluation tasks.

## Model Details

- **Base Model:** `meta-llama/Llama-3.1-8B-Instruct`
- **Fine-tuning Dataset:** `gsm8k_llama3.2-1B_128_1ep`
- **Accuracy:** 88.10%
- **Framework:** Transformers

## Intended Uses

- Reward modeling for RLHF
- Preference evaluation and ranking (see the usage sketches at the end of this card)
- Fine-tuning reinforcement learning agents

## Limitations

- Performance is likely limited to grade-school math word problems similar to those in GSM8K; reward estimates on out-of-domain text may be unreliable.

## How to Use

### Installation

```bash
pip install transformers torch trl
```

### Model Loading Example

The value head is not part of the plain `AutoModelForCausalLM` class, so the example below loads the model with `trl`'s `AutoModelForCausalLMWithValueHead`, the wrapper LLaMA-Factory uses for reward modeling, which exposes the `v_head` attribute:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model_name = "your-username/reward-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load the value-head weights, which are saved separately as value_head.bin
value_head_state = torch.load("value_head.bin", map_location="cpu")
model.v_head.summary.weight.data = value_head_state["v_head.summary.weight"]
model.v_head.summary.bias.data = value_head_state["v_head.summary.bias"]
model.eval()
```

### Evaluation Example

The model's forward pass returns `(logits, loss, values)`; the value at the last token position is the scalar reward for the whole sequence:

```python
inputs = tokenizer("Evaluate this text.", return_tensors="pt")
device = next(model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits, _, values = model(**inputs)

reward_score = values[:, -1].item()
print("Reward Score:", reward_score)
```

## Training Procedure

### Hyperparameters

- Learning Rate: `1e-05`
- Total Batch Size: `128` (8 GPUs × gradient accumulation 16 × per-device batch size 1)
- Gradient Accumulation Steps: `16`
- Optimizer: `adamw_torch` (betas=`(0.9, 0.999)`, epsilon=`1e-08`)
- LR Scheduler: linear with warmup ratio `0.03`
- Number of Epochs: `1`

### Training Results

| Epoch  | Step | Validation Loss | Accuracy |
|:------:|:----:|:---------------:|:--------:|
| 0.0856 | 5    | 0.4890          | 81.35%   |
| 0.1711 | 10   | 0.2622          | 92.04%   |
| 0.2567 | 15   | 0.1574          | 90.60%   |
| 0.3422 | 20   | 0.2161          | 90.90%   |
| 0.4278 | 25   | 0.2810          | 86.96%   |
| 0.5134 | 30   | 0.2796          | 88.32%   |
| 0.5989 | 35   | 0.2074          | 90.22%   |
| 0.6845 | 40   | 0.1866          | 90.75%   |
| 0.7701 | 45   | 0.2167          | 89.76%   |
| 0.8556 | 50   | 0.2340          | 88.78%   |
| 0.9412 | 55   | 0.2451          | 88.25%   |

## Framework Versions

- Transformers `4.46.1`
- PyTorch `2.5.1`
- Datasets `3.1.0`
- Tokenizers `0.20.1`
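
## Additional Usage Sketches

### Ranking a Preference Pair

Reward scores can be compared directly to rank candidate responses to the same prompt. The following is a minimal sketch, not part of the released code: it assumes `model` and `tokenizer` have been loaded as shown above, and the prompt, responses, and the `score` helper are illustrative placeholders.

```python
import torch

def score(text: str) -> float:
    """Return the scalar reward for a single prompt+response string."""
    inputs = tokenizer(text, return_tensors="pt")
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        _, _, values = model(**inputs)  # values: (1, seq_len)
    return values[:, -1].item()  # value at the last token = sequence reward

prompt = "Q: A baker sells 12 muffins for $3 each. How much does she earn? A:"
chosen = prompt + " 12 * 3 = 36, so she earns $36."
rejected = prompt + " 12 + 3 = 15, so she earns $15."

chosen_reward, rejected_reward = score(chosen), score(rejected)
print(f"chosen: {chosen_reward:.3f}, rejected: {rejected_reward:.3f}")
print("prefers chosen:", chosen_reward > rejected_reward)
```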
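
### Batched Scoring with Padding

In a right-padded batch, `values[:, -1]` would read the value head's output at padding positions for the shorter sequences. Below is a sketch, under the same assumptions as above, that instead gathers each sequence's value at its last non-padding token:

```python
import torch

texts = [
    "A short candidate answer.",
    "A considerably longer candidate answer that forces padding of the first one.",
]

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

batch = tokenizer(texts, return_tensors="pt", padding=True)  # assumes right padding
device = next(model.parameters()).device
batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    _, _, values = model(**batch)  # values: (batch, seq_len)

# Index of the last real (non-padding) token in each sequence
last_token_idx = batch["attention_mask"].sum(dim=1) - 1
rewards = values.gather(1, last_token_idx.unsqueeze(1)).squeeze(1)
print(rewards.tolist())
```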