DPO Fine-Tune of Llama-3.2-1B using PairRM Preferences
This repository contains the LoRA adapters for a meta-llama/Llama-3.2-1B-Instruct
model that has been fine-tuned using Direct Preference Optimization (DPO).
The preference dataset was generated with the llm-blender/PairRM reward model, which ranks candidate LLM responses by quality. This makes preference alignment possible without a separate LLM judge or human annotation.
- Preference Dataset: NilayR/pairrm-preferences-llama32
Model Details
Model Description
This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct. It was trained using DPO on a preference dataset in which the 'chosen' and 'rejected' labels were determined by the llm-blender/PairRM model. The goal was to align the base model's outputs with PairRM's learned preferences for high-quality, factual, and concise responses.
- Developed by: NilayR
- Model type: Causal Language Model
- Language(s): English
- License: apache-2.0
- Finetuned from model: meta-llama/Llama-3.2-1B-Instruct
How to Get Started with the Model
To use these LoRA adapters, load the base model (meta-llama/Llama-3.2-1B-Instruct) and then apply the adapters from this repository.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-pairrm"
# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token
# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)
# --- Generate a response ---
prompt = "What are the main differences between renewable and non-renewable energy?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
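# Decode only the newly generated tokens. Splitting the full text on
# "assistant" would break if the reply itself contained that word.
generated_tokens = outputs[0][input_ids.shape[-1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(response.strip())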
Training Details
Training Data
The model was trained on a preference dataset generated using the llm-blender/PairRM model.
- Data Generation Process:
  - Instructions: 50 instructions were extracted from the LIMA dataset.
  - Response Generation: The base Llama-3.2-1B model generated 5 diverse responses for each instruction.
  - Preference Labeling: The llm-blender/PairRM ranker scored all 5 responses for each instruction. The highest-ranked response was selected as 'chosen' and the lowest-ranked as 'rejected', resulting in 50 preference pairs.
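The labeling step above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `build_preference_pairs` is hypothetical, the ranks are assumed to follow a "1 = best" convention, and the `prompt` / `chosen` / `rejected` keys are the column names TRL's DPOTrainer conventionally expects, not names confirmed by this card. The toy ranks stand in for real ranker output.

```python
from typing import Dict, List, Sequence


def build_preference_pairs(
    instructions: Sequence[str],
    candidates: Sequence[Sequence[str]],
    ranks: Sequence[Sequence[int]],
) -> List[Dict[str, str]]:
    """Pick the best- and worst-ranked candidate for each instruction.

    Assumes ranks[i][j] is the rank of candidates[i][j], with 1 = best.
    """
    pairs = []
    for prompt, cands, cand_ranks in zip(instructions, candidates, ranks):
        if len(cands) != len(cand_ranks):
            raise ValueError("each candidate needs exactly one rank")
        best = min(range(len(cands)), key=lambda j: cand_ranks[j])
        worst = max(range(len(cands)), key=lambda j: cand_ranks[j])
        pairs.append(
            {"prompt": prompt, "chosen": cands[best], "rejected": cands[worst]}
        )
    return pairs


# Toy data standing in for real model outputs and ranker results.
demo_pairs = build_preference_pairs(
    instructions=["Explain what a LoRA adapter is."],
    candidates=[["answer A", "answer B", "answer C", "answer D", "answer E"]],
    ranks=[[3, 1, 5, 2, 4]],
)
print(demo_pairs[0])
```

Keeping only the top and bottom candidates gives one pair per instruction, which is how 50 instructions yield 50 pairs; the three middle-ranked responses are discarded.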
Training Procedure
The model was trained for one epoch using the TRL library's DPOTrainer.
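To give a sense of how short this run is, the following sketch estimates the number of optimizer steps in one epoch from the batch settings listed under Training Hyperparameters. It assumes all 50 preference pairs were used for training on a single device, which the card does not state explicitly; the exact count also depends on how the trainer handles the final partial accumulation batch.

```python
import math

num_pairs = 50          # preference pairs from the data generation process
per_device_batch = 1    # batch size
grad_accum = 4          # gradient accumulation steps
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum

# Depending on how the trainer treats the final partial batch,
# one epoch is either the floor or the ceiling of 50 / 4.
steps_low = num_pairs // effective_batch
steps_high = math.ceil(num_pairs / effective_batch)

# Warmup covers roughly the first 10% of those steps.
warmup_steps = math.ceil(steps_high * warmup_ratio)

print(effective_batch, steps_low, steps_high, warmup_steps)
```

Under these assumptions the whole run is about a dozen parameter updates, with the first couple spent in learning-rate warmup.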
Training Hyperparameters
- Framework: trl.DPOTrainer
- Epochs: 1
- Batch Size: 1
- Gradient Accumulation Steps: 4 (Effective Batch Size: 4)
- Optimizer: paged_adamw_8bit
- Learning Rate: 5e-5
- LR Scheduler: cosine with a warmup ratio of 0.1
- DPO Beta (β): 0.1
- Final Training Loss: 0.6872
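For context on the final loss value, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair. It assumes the run used this loss variant (the card does not say which one), and the log-probabilities below are made-up numbers, not values from this training run. When the policy and the reference model agree, the loss equals ln 2 ≈ 0.6931, so a reported loss of 0.6872 sits only slightly below that starting point.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one preference pair, given sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is ln 2.
baseline = dpo_loss(-42.0, -57.0, -42.0, -57.0)

# Policy shifted slightly toward the chosen response (made-up numbers).
shifted = dpo_loss(-41.9, -57.0, -42.0, -57.0)

print(round(baseline, 4), round(shifted, 4))
```

A reported training loss is typically an average over steps or batches, so it should be read as a rough indicator rather than a per-pair value.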
LoRA Configuration
- Rank (r): 16
- Alpha (lora_alpha): 32
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Dropout: 0.05
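As a rough illustration of what this configuration implies, the sketch below counts the trainable LoRA parameters. The layer shapes are assumptions about the Llama-3.2-1B architecture (hidden size 2048, MLP size 8192, key/value projection width 512, 16 decoder layers) and are not stated in this card; each adapted linear layer adds two low-rank matrices with `r * (in_features + out_features)` parameters in total.

```python
# Assumed Llama-3.2-1B shapes (not stated in this card): hidden size 2048,
# MLP size 8192, key/value projection width 512, 16 decoder layers.
HIDDEN, MLP, KV, LAYERS = 2048, 8192, 512, 16
RANK = 16

# (in_features, out_features) for each targeted linear layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * LAYERS

print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumed shapes the adapters hold roughly 11 million trainable parameters, about 1% of the base model's size. With alpha 32 and rank 16, the standard LoRA scaling factor alpha / r is 2.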
Compute Infrastructure
- Hardware: 1x NVIDIA A100 40GB GPU
- Cloud Provider: Google Colab
- Software: transformers, peft, trl, bitsandbytes