---
library_name: transformers
license: other
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- llama-factory
- value-head
- reward-model
metrics:
- accuracy
model-index:
- name: reward-model
  results:
  - task:
      type: text-classification
      name: Reward Modeling
    dataset:
      name: gsm8k_llama3.2-1B_128_1ep
      type: custom
    metrics:
    - type: accuracy
      value: 0.8810
---

# reward-model

This is a reward model based on [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), fine-tuned on the `gsm8k_llama3.2-1B_128_1ep` dataset for use in Reinforcement Learning from Human Feedback (RLHF). It adds a custom **value head** on top of the base model, making it suitable for reward modeling and preference-based evaluation tasks.
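
For intuition, the value head is a small projection that turns each token's final hidden state into a scalar score. The sketch below is only an illustration, assuming a dropout-plus-linear layout like `trl`'s `ValueHead`; it is not the exact module shipped with this model.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Minimal sketch of a value head: one scalar score per token."""

    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.summary = nn.Linear(hidden_size, 1)  # hidden state -> scalar reward

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) -> values: (batch, seq_len)
        return self.summary(self.dropout(hidden_states)).squeeze(-1)
```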

## Model Details

- **Base Model:** `meta-llama/Llama-3.1-8B-Instruct`
- **Fine-tuning Dataset:** `gsm8k_llama3.2-1B_128_1ep`
- **Accuracy:** 88.10%
- **Framework:** Transformers

## Intended Uses

- Reward modeling for RLHF
- Preference evaluation and ranking
- Fine-tuning reinforcement learning agents

## Limitations

- Performance is likely limited to math-reasoning tasks similar to the GSM8K-style training data; accuracy on other domains has not been evaluated.

## Usage

### Installation

```bash
pip install transformers torch trl
```

### Loading the Model

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead

model_name = "your-username/reward-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The value head lives on trl's wrapper; a plain AutoModelForCausalLM has no `v_head`.
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Load the trained value-head weights saved alongside the model (value_head.bin).
value_head_state = torch.load("value_head.bin", map_location="cpu")
model.v_head.summary.weight.data = value_head_state["v_head.summary.weight"]
model.v_head.summary.bias.data = value_head_state["v_head.summary.bias"]
model.eval()
```

### Inference

```python
# The value predicted at the final token serves as the reward for the whole sequence.
inputs = tokenizer("Evaluate this text.", return_tensors="pt").to(model.pretrained_model.device)
with torch.no_grad():
    logits, _, values = model(**inputs)
reward_score = values[:, -1].item()
print("Reward Score:", reward_score)
```
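
Because the reward is simply the value at the last token, preference ranking amounts to scoring each candidate response and comparing. The sketch below reuses `model` and `tokenizer` from above; the helper name, the example prompt/answers, and the use of the tokenizer's chat template are illustrative assumptions rather than part of this model's documented API.

```python
def reward(prompt: str, response: str) -> float:
    """Score a prompt/response pair with the reward model (higher is better)."""
    # Format the pair with the chat template; it already inserts special tokens.
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt},
         {"role": "assistant", "content": response}],
        tokenize=False,
    )
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    inputs = inputs.to(model.pretrained_model.device)
    with torch.no_grad():
        _, _, values = model(**inputs)
    return values[:, -1].item()

# Rank two candidate answers to a GSM8K-style question (example strings only).
prompt = "Natalia sold clips to 48 friends in April and half as many in May. How many in total?"
better = "April: 48 clips. May: 48 / 2 = 24 clips. Total: 48 + 24 = 72 clips."
worse = "She sold 48 + 48 = 96 clips in total."

scores = {answer: reward(prompt, answer) for answer in (better, worse)}
print(max(scores, key=scores.get))  # the answer the reward model prefers
```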

## Training Procedure

### Training Hyperparameters

- Learning Rate: `1e-05`
- Total Batch Size: `128` (8 GPUs × per-device batch size 1 × gradient accumulation 16)
- Gradient Accumulation Steps: `16`
- Optimizer: `adamw_torch` (betas=`(0.9, 0.999)`, epsilon=`1e-08`)
- LR Scheduler: linear with warmup ratio `0.03`
- Number of Epochs: `1`
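
For orientation, here is a hedged sketch of how the settings above would map onto Hugging Face `TrainingArguments`. Training was run with LLaMA-Factory, so this is only an illustration of the listed hyperparameters; the output directory and the `bf16` flag are assumptions.

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; not the original training script.
training_args = TrainingArguments(
    output_dir="reward-model",       # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=1,   # 8 GPUs x 16 accumulation steps -> total batch 128
    gradient_accumulation_steps=16,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    num_train_epochs=1,
    bf16=True,                       # assumed, matching the bfloat16 usage above
)
```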

### Training Results

| Epoch  | Step | Validation Loss | Accuracy |
|:------:|:----:|:---------------:|:--------:|
| 0.0856 | 5    | 0.4890          | 81.35%   |
| 0.1711 | 10   | 0.2622          | 92.04%   |
| 0.2567 | 15   | 0.1574          | 90.60%   |
| 0.3422 | 20   | 0.2161          | 90.90%   |
| 0.4278 | 25   | 0.2810          | 86.96%   |
| 0.5134 | 30   | 0.2796          | 88.32%   |
| 0.5989 | 35   | 0.2074          | 90.22%   |
| 0.6845 | 40   | 0.1866          | 90.75%   |
| 0.7701 | 45   | 0.2167          | 89.76%   |
| 0.8556 | 50   | 0.2340          | 88.78%   |
| 0.9412 | 55   | 0.2451          | 88.25%   |

### Framework Versions

- Transformers `4.46.1`
- PyTorch `2.5.1`
- Datasets `3.1.0`
- Tokenizers `0.20.1`