Iterative DPO Fine-Tune of Llama-3.2-1B (Iteration 1)

This repository contains the LoRA adapters from the first iteration of a Direct Preference Optimization (DPO) fine-tuning process on the meta-llama/Llama-3.2-1B-Instruct model.

This work is part of a project exploring iterative DPO, where the model refines itself over multiple cycles of preference data generation and training, inspired by the "Self-Rewarding Language Models" paper.

Model Details

Model Description

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct. It was trained with DPO on a preference dataset generated by the base model itself: an LLM judge powered by GPT-3.5-Turbo compared pairs of model-generated responses to produce the chosen/rejected pairs used for training.

The goal of this iteration was to establish the first step in a self-improvement loop, aligning the model more closely with human-like preferences for accuracy, helpfulness, and clarity.
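For context, DPO trains directly on preference pairs. Each record takes the prompt/chosen/rejected form expected by TRL's DPOTrainer; the example below is purely illustrative (the real pairs were selected by the GPT-3.5-Turbo judge).

# One illustrative preference record in the format consumed by DPOTrainer.
# The text is made up for demonstration; the actual chosen/rejected pairs
# were produced by the LLM judge described above.
preference_example = {
    "prompt": "What are the key benefits of meditation?",
    "chosen": "Meditation has several well-documented benefits, including reduced stress and improved focus...",
    "rejected": "Meditation is good. You should try it.",
}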

  • Developed by: NilayR
  • Model type: Causal Language Model
  • Language(s): English
  • License: apache-2.0
  • Finetuned from model: meta-llama/Llama-3.2-1B-Instruct

How to Get Started with the Model

To use these LoRA adapters, load the base model (meta-llama/Llama-3.2-1B-Instruct) and then apply the adapters from this repository.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Set base model ID and adapter path
base_model_id = "meta-llama/Llama-3.2-1B-Instruct"
adapter_id = "NilayR/llama32-iterative-dpo-iter1"

# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load and apply the PEFT adapters
model = PeftModel.from_pretrained(base_model, adapter_id)

# --- Generate a response ---
prompt = "What are the key benefits of meditation?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

# Decode and keep only the text after the final "assistant" role marker
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("assistant")[-1].strip())

Training Details

Training Data

The model was trained on a preference dataset generated by the meta-llama/Llama-3.2-1B-Instruct model itself.

  • Data Generation Process:
    1. Instructions: 20 instructions were selected from the LIMA dataset.
    2. Response Generation: The base model generated multiple diverse responses for each instruction.
    3. Preference Labeling: A custom LLM Judge powered by GPT-3.5-Turbo was used to compare pairs of the generated responses, creating a dataset of 56 chosen/rejected pairs.
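A minimal sketch of how such a judge comparison can be implemented with the OpenAI Python client is shown below. The prompt wording and the judge_pair helper are hypothetical; the exact judging prompt used to build this dataset is not included in this repository.

# Hypothetical judge call: ask GPT-3.5-Turbo which of two responses is better.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    judge_prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is more accurate, helpful, and clear? Answer with 'A' or 'B'."
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0.0,
    )
    return completion.choices[0].message.content.strip()  # "A" or "B"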

Training Procedure

The model was trained for one epoch using the TRL library's DPOTrainer.

Training Hyperparameters

  • Framework: trl.DPOTrainer
  • Epochs: 1
  • Batch Size: 1
  • Gradient Accumulation Steps: 2
  • Optimizer: paged_adamw_8bit
  • Learning Rate: 2e-5
  • DPO Beta (β): 0.1
  • Max Steps: 50
  • Final Training Loss: 0.6405
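A minimal sketch of the training call with these hyperparameters is shown below, assuming a recent TRL release where DPOConfig carries the beta value; dataset loading and the PEFT-wrapped model are assumed to match the rest of this card.

# Sketch of the DPO training setup (hyperparameters as listed above).
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="llama32-dpo-iter1",  # hypothetical output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    optim="paged_adamw_8bit",
    num_train_epochs=1,
    max_steps=50,  # takes precedence over num_train_epochs when > 0
    beta=0.1,      # DPO beta
)

trainer = DPOTrainer(
    model=model,                       # PEFT-wrapped quantized base model
    args=training_args,
    train_dataset=preference_dataset,  # columns: prompt / chosen / rejected
    processing_class=tokenizer,        # `tokenizer=` in older TRL releases
)
trainer.train()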

LoRA Configuration

  • Rank (r): 16
  • Alpha (lora_alpha): 32
  • Target Modules: q_proj, k_proj, v_proj, o_proj
  • Dropout: 0.05
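Expressed as a peft LoraConfig, this corresponds to the following (the bias and task_type values are standard choices for causal-LM adapters, not explicitly stated in the training logs):

# LoRA adapter configuration matching the values above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",            # assumed default
    task_type="CAUSAL_LM",  # assumed task type
)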

Compute Infrastructure

  • Hardware: 1x NVIDIA A100 40GB GPU
  • Cloud Provider: Google Colab
  • Software: transformers, peft, trl, bitsandbytes