DPO Fine-Tune of Llama-3.2-1B using PairRM Preferences

This repository contains LoRA adapters for meta-llama/Llama-3.2-1B-Instruct, fine-tuned with Direct Preference Optimization (DPO).

The preference dataset for this training was generated with the llm-blender/PairRM reward model, which ranks candidate LLM responses by quality through pairwise comparison. Labeling with PairRM makes preference alignment inexpensive, because it requires neither a separate LLM judge nor human annotation.

Preference Dataset: NilayR/pairrm-preferences-llama32

Model Details

Model Description

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct. It was trained using DPO on a preference dataset where the 'chosen' and 'rejected' labels were determined by the llm-blender/PairRM model. The goal was to align the base model's outputs with PairRM's learned preferences for high-quality, factual, and concise responses.

Developed by: NilayR
Model type: Causal Language Model
Language(s): English
License: apache-2.0
Finetuned from model: meta-llama/Llama-3.2-1B-Instruct

How to Get Started with the Model

To use these LoRA adapters, load the base model (meta-llama/Llama-3.2-1B-Instruct) and then apply the adapters from this repository.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-dpo-pairrm"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "What are the main differences between renewable and non-renewable energy?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

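# Decode only the newly generated tokens, so the prompt is not echoed back
# and a reply that itself contains the word "assistant" is not truncated
generated_tokens = outputs[0][input_ids.shape[-1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(response.strip())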

Training Details

Training Data

The model was trained on a preference dataset generated using the llm-blender/PairRM model.

Data Generation Process:
1. Instructions: 50 instructions were extracted from the LIMA dataset.
2. Response Generation: The base model (meta-llama/Llama-3.2-1B-Instruct) generated 5 diverse responses for each instruction.
3. Preference Labeling: The llm-blender/PairRM ranker scored all 5 responses for each instruction. The highest-ranked response was selected as 'chosen' and the lowest-ranked as 'rejected', resulting in 50 preference pairs.
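
The labeling step above can be sketched in plain Python. This is a minimal illustration, not the script used to build the dataset: the function name build_preference_pairs is hypothetical, and the per-candidate scores are assumed to come from a ranker such as PairRM, with a higher score meaning a better response.

```python
from typing import Dict, List


def build_preference_pairs(
    instructions: List[str],
    candidates: List[List[str]],
    scores: List[List[float]],
) -> List[Dict[str, str]]:
    """Turn per-candidate ranker scores into preference pairs.

    Assumption: higher score = better response, and scores[i][j] is the
    ranker's score for candidates[i][j].
    """
    pairs = []
    for prompt, cands, cand_scores in zip(instructions, candidates, scores):
        if len(cands) != len(cand_scores):
            raise ValueError("each candidate needs exactly one score")
        # Highest-scored response becomes 'chosen', lowest becomes 'rejected'
        best = max(range(len(cands)), key=lambda j: cand_scores[j])
        worst = min(range(len(cands)), key=lambda j: cand_scores[j])
        pairs.append(
            {"prompt": prompt, "chosen": cands[best], "rejected": cands[worst]}
        )
    return pairs


# Toy example: one instruction with five placeholder candidates and scores.
example = build_preference_pairs(
    ["Explain photosynthesis."],
    [["a", "b", "c", "d", "e"]],
    [[0.1, 0.9, -0.3, 0.4, 0.0]],
)
print(example)
```

Each instruction yields exactly one pair, so 50 instructions give 50 pairs. The prompt, chosen, and rejected keys follow the column names that TRL's DPOTrainer conventionally reads.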

Training Procedure

The model was trained for one epoch using the TRL library's DPOTrainer.

Training Hyperparameters

Framework: trl.DPOTrainer
Epochs: 1
Batch Size: 1
Gradient Accumulation Steps: 4 (Effective Batch Size: 4)
Optimizer: paged_adamw_8bit
Learning Rate: 5e-5
LR Scheduler: cosine with a warmup ratio of 0.1
DPO Beta (β): 0.1
Final Training Loss: 0.6872
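
To put β and the loss value in context, here is a small sketch of the per-pair DPO objective. It assumes the standard sigmoid DPO loss; the function name dpo_loss and the log-probability values are illustrative and are not taken from the training run.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair sigmoid DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy prefers 'chosen' over 'rejected' than the
    # frozen reference model does, measured in log-probability.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: margin is 0, loss is ln(2)
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy shifted toward 'chosen' and away from 'rejected': margin is +2
loss_after = dpo_loss(-11.0, -16.0, -12.0, -15.0)

print(f"loss at initialization: {loss_at_init:.4f}")
print(f"loss with margin +2:    {loss_after:.4f}")
```

Under this formulation the loss starts at ln 2 ≈ 0.6931 when the policy equals the reference, so a final value of 0.6872 corresponds to a small positive average margin after one epoch on 50 pairs.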

LoRA Configuration

Rank (r): 16
Alpha (lora_alpha): 32
Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Dropout: 0.05
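
As a rough sanity check on adapter size, the sketch below counts the trainable parameters this configuration adds. The layer dimensions are assumptions based on the published Llama-3.2-1B configuration (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of dimension 64) and should be verified against the model's config.json; the helper name lora_param_count is hypothetical.

```python
# Assumed Llama-3.2-1B dimensions; verify against the model's config.json.
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_LAYERS = 16
KV_DIM = 512  # 8 key/value heads x 64 head_dim (grouped-query attention)

RANK = 16
LORA_ALPHA = 32

# (in_features, out_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) to each targeted layer."""
    per_layer = sum(
        rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values()
    )
    return per_layer * num_layers


trainable = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
scaling = LORA_ALPHA / RANK  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / r): {scaling}")
```

With these assumed dimensions the adapters come to roughly 11.3M trainable parameters, and the alpha-to-rank ratio scales the low-rank update by 2. Dropout does not add parameters.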

Compute Infrastructure

Hardware: 1x NVIDIA A100 40GB GPU
Cloud Provider: Google Colab
Software: transformers, peft, trl, bitsandbytes
