---
library_name: transformers
license: other
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama-factory
- value-head
- reward-model
metrics:
- accuracy
model-index:
- name: reward-model
  results:
  - task:
      type: text-classification
      name: Reward Modeling
    dataset:
      name: gsm8k_llama3.2-1B_128_1ep
      type: custom
    metrics:
    - type: accuracy
      value: 0.8810
---

# Reward Model with Value Head

This is a reward model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), fine-tuned on the `gsm8k_llama3.2-1B_128_1ep` dataset to serve as the reward model in a Reinforcement Learning from Human Feedback (RLHF) pipeline. The model adds a scalar **value head** on top of the base model's final hidden states, making it suitable for reward modeling and preference-based evaluation tasks.

## Model Details

- **Base Model:** `meta-llama/Llama-3.1-8B-Instruct`
- **Fine-tuning Dataset:** `gsm8k_llama3.2-1B_128_1ep`
- **Accuracy:** 88.10%
- **Framework:** Transformers

## Intended Uses

- Reward modeling for RLHF
- Preference evaluation and ranking (see the usage sketches at the end of this card)
- Fine-tuning reinforcement learning agents

## Limitations

- Performance is likely limited to grade-school math word problems similar to those in GSM8K; reward estimates on out-of-domain text may be unreliable.

## How to Use

### Installation

```bash
pip install transformers torch trl
```

### Model Loading Example

The value head is not part of the plain `AutoModelForCausalLM` class, so the example below loads the model with `trl`'s `AutoModelForCausalLMWithValueHead`, the wrapper LLaMA-Factory uses for reward modeling, which exposes the `v_head` attribute:

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model_name = "your-username/reward-model"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load the value-head weights, which are saved separately as value_head.bin
value_head_state = torch.load("value_head.bin", map_location="cpu")
model.v_head.summary.weight.data = value_head_state["v_head.summary.weight"]
model.v_head.summary.bias.data = value_head_state["v_head.summary.bias"]
model.eval()
```

### Evaluation Example

The model's forward pass returns `(logits, loss, values)`; the value at the last token position is the scalar reward for the whole sequence:

```python
inputs = tokenizer("Evaluate this text.", return_tensors="pt")
device = next(model.parameters()).device
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    logits, _, values = model(**inputs)

reward_score = values[:, -1].item()
print("Reward Score:", reward_score)
```

## Training Procedure

### Hyperparameters

- Learning Rate: `1e-05`
- Total Batch Size: `128` (8 GPUs × gradient accumulation 16 × per-device batch size 1)
- Gradient Accumulation Steps: `16`
- Optimizer: `adamw_torch` (betas=`(0.9, 0.999)`, epsilon=`1e-08`)
- LR Scheduler: linear with warmup ratio `0.03`
- Number of Epochs: `1`

### Training Results

| Epoch  | Step | Validation Loss | Accuracy |
|:------:|:----:|:---------------:|:--------:|
| 0.0856 | 5    | 0.4890          | 81.35%   |
| 0.1711 | 10   | 0.2622          | 92.04%   |
| 0.2567 | 15   | 0.1574          | 90.60%   |
| 0.3422 | 20   | 0.2161          | 90.90%   |
| 0.4278 | 25   | 0.2810          | 86.96%   |
| 0.5134 | 30   | 0.2796          | 88.32%   |
| 0.5989 | 35   | 0.2074          | 90.22%   |
| 0.6845 | 40   | 0.1866          | 90.75%   |
| 0.7701 | 45   | 0.2167          | 89.76%   |
| 0.8556 | 50   | 0.2340          | 88.78%   |
| 0.9412 | 55   | 0.2451          | 88.25%   |

## Framework Versions

- Transformers `4.46.1`
- PyTorch `2.5.1`
- Datasets `3.1.0`
- Tokenizers `0.20.1`
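
## Additional Usage Sketches

### Ranking a Preference Pair

Reward scores can be compared directly to rank candidate responses to the same prompt. The following is a minimal sketch, not part of the released code: it assumes `model` and `tokenizer` have been loaded as shown above, and the prompt, responses, and the `score` helper are illustrative placeholders.

```python
import torch

def score(text: str) -> float:
    """Return the scalar reward for a single prompt+response string."""
    inputs = tokenizer(text, return_tensors="pt")
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        _, _, values = model(**inputs)  # values: (1, seq_len)
    return values[:, -1].item()  # value at the last token = sequence reward

prompt = "Q: A baker sells 12 muffins for $3 each. How much does she earn? A:"
chosen = prompt + " 12 * 3 = 36, so she earns $36."
rejected = prompt + " 12 + 3 = 15, so she earns $15."

chosen_reward, rejected_reward = score(chosen), score(rejected)
print(f"chosen: {chosen_reward:.3f}, rejected: {rejected_reward:.3f}")
print("prefers chosen:", chosen_reward > rejected_reward)
```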
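
### Batched Scoring with Padding

In a right-padded batch, `values[:, -1]` would read the value head's output at padding positions for the shorter sequences. Below is a sketch, under the same assumptions as above, that instead gathers each sequence's value at its last non-padding token:

```python
import torch

texts = [
    "A short candidate answer.",
    "A considerably longer candidate answer that forces padding of the first one.",
]

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

batch = tokenizer(texts, return_tensors="pt", padding=True)  # assumes right padding
device = next(model.parameters()).device
batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    _, _, values = model(**batch)  # values: (batch, seq_len)

# Index of the last real (non-padding) token in each sequence
last_token_idx = batch["attention_mask"].sum(dim=1) - 1
rewards = values.gather(1, last_token_idx.unsqueeze(1)).squeeze(1)
print(rewards.tolist())
```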