codelion/Qwen2-0.5B-Instruct-accuracy-recovery-lora

🎯 Accuracy Recovery LoRA Adapter

This LoRA adapter helps recover accuracy when using INT4 quantized versions of Qwen/Qwen2-0.5B-Instruct. It was trained using self-distillation with Magpie-generated data.

📊 Performance Metrics

  • Base Model: Qwen/Qwen2-0.5B-Instruct
  • Quantization: INT4 (NF4 quant type via bitsandbytes)
  • LoRA Rank: 16
  • LoRA Alpha: 32
  • Training Samples: 46
  • Target Performance Gap: <5% perplexity increase over the FP16 baseline
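
For reference, the rank and alpha above correspond to a PEFT configuration along these lines (a minimal sketch; the target modules and dropout shown here are assumptions, not stated on this card):

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,             # LoRA rank, as listed above
    lora_alpha=32,    # LoRA alpha, as listed above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated on this card
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)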

🔧 Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Load base model with quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "codelion/Qwen2-0.5B-Instruct-accuracy-recovery-lora")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Use the model
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)  # keep inputs on the model's device
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
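
Because the base model is instruction-tuned, prompts generally work better when passed through the chat template. A minimal variation of the snippet above (decoding only the newly generated tokens is a convenience, not something this card prescribes):

# Build an instruction-style prompt with the model's chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
# Decode only the tokens generated after the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))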

🧪 Training Details

  • Method: Self-distillation using Magpie data generation
  • Framework: PEFT + LoRA
  • Loss Function: Combined KL divergence + MSE loss (sketched below)
  • Temperature: 4.0
  • Alpha (distillation weight): 0.8
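
Magpie-style data generation, for context, prompts the model with only the user-turn prefix of its chat template so that it samples plausible instructions by itself. A sketch, assuming Qwen2's ChatML format and reusing model and tokenizer from the Usage snippet:

# Magpie: the model completes a bare user-turn prefix into an instruction
magpie_prefix = "<|im_start|>user\n"
ids = tokenizer(magpie_prefix, return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64, do_sample=True)
instruction = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

The training code itself is not published on this card, but the stated loss and hyperparameters suggest an objective along these lines (a minimal sketch; the T^2 rescaling and the way alpha splits the two terms are assumptions):

import torch.nn.functional as F

def accuracy_recovery_loss(student_logits, teacher_logits, temperature=4.0, alpha=0.8):
    # Soft-target KL between temperature-scaled distributions,
    # rescaled by T^2 as in standard knowledge distillation
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Direct MSE between the raw logits
    mse = F.mse_loss(student_logits, teacher_logits)
    # alpha weights the KL term against the MSE term (assumed split)
    return alpha * kl + (1.0 - alpha) * mse

Presumably the FP16 model acts as teacher and the INT4 model with the adapter as student, so that only the LoRA weights receive gradients.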

📈 Expected Benefits

  • ✅ Maintains accuracy close to the FP16 baseline
  • ✅ ~75% reduction in weight memory (4-bit weights use ~0.5 bytes/parameter vs. 2 bytes/parameter in FP16)
  • ✅ 2-3x faster inference than FP16
  • ✅ Easy to integrate with existing workflows

🏷️ Related


This adapter is part of the Ellora project: standardized recipes for enhancing LLM capabilities.
