# LLaMA 3.2 1B (Finetuned via Distillation)

This model is a distilled version of meta-llama/Llama-3.2-3B-Instruct, finetuned to mimic its behavior with meta-llama/Llama-3.2-1B-Instruct as the student. Distillation was performed on a subset of WikiText-2 with prompt-based soft-label supervision.
## Training Method

We used logit matching (KL-divergence loss) between the teacher and student models. The prompt "How to learn a new language?" served as a simple qualitative test example, and the WikiText-2 corpus was used for training. A minimal sketch of the training step follows the settings list below.
Key settings:

- Teacher: LLaMA-3.2-3B-Instruct
- Student: LLaMA-3.2-1B-Instruct
- Optimizer: AdamW
- Loss: KLDivLoss on logits
- Batch size: 16
- Max tokens: 256
- Training steps: ~10k (can vary)
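
The core update is a KL-divergence loss between temperature-softened teacher and student token distributions. The sketch below illustrates one such step under stated assumptions: the learning rate, softening temperature, and padding handling are not specified on this card and are chosen for illustration only.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model names follow the card; hyperparameters use the card's key settings
# where given and illustrative assumptions otherwise.
teacher_id = "meta-llama/Llama-3.2-3B-Instruct"
student_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(student_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

teacher = AutoModelForCausalLM.from_pretrained(teacher_id).eval()
student = AutoModelForCausalLM.from_pretrained(student_id)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)  # learning rate is assumed
temperature = 2.0  # softening temperature is assumed

def distill_step(batch_texts):
    """One logit-matching step: KL(teacher || student) on softened distributions."""
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=256).to(student.device)
    with torch.no_grad():
        teacher_logits = teacher(**inputs).logits
    student_logits = student(**inputs).logits

    # KLDivLoss on logits: student provides log-probabilities, teacher probabilities
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In an actual run this step would be applied over batches of 16 filtered WikiText-2 passages for roughly 10k steps, per the settings above.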
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the distilled student checkpoint and its tokenizer
model = AutoModelForCausalLM.from_pretrained("YiChuanH/llama1B-finetuned")
tokenizer = AutoTokenizer.from_pretrained("YiChuanH/llama1B-finetuned")

# Generate a response for the test prompt used during development
prompt = "How to learn a new language?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Evaluation

This model aims to preserve the output style and quality of the 3B teacher while using roughly one third of the parameters (1B vs. 3B). Qualitatively, its responses are more informative and instructive than those of the original 1B base model; a simple side-by-side check is sketched below.
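
No quantitative benchmark is reported, so the comparison is qualitative. One way to reproduce it is to generate from the base 1B model and the distilled checkpoint with the same prompt; the generation settings below are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "How to learn a new language?"

# Compare the base 1B model against the distilled checkpoint on the same prompt
for name in ["meta-llama/Llama-3.2-1B-Instruct", "YiChuanH/llama1B-finetuned"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(f"=== {name} ===")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```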
## Dataset

- WikiText-2 (raw), cleaned and filtered to keep passages longer than 30 tokens (a rough reproduction of this filtering is sketched below)
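
The exact cleaning rules are not documented on this card, so the token-count filter below (measured with the student tokenizer) is an assumption meant only to illustrate the described preprocessing.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def keep(example):
    # Drop empty lines and keep passages longer than 30 tokens
    text = example["text"].strip()
    return bool(text) and len(tokenizer(text)["input_ids"]) > 30

filtered = wikitext.filter(keep)
print(len(filtered), "passages retained")
```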
## Limitations
- No RLHF or instruction fine-tuning beyond logit distillation
- Not suitable for safety-critical applications
- Quality may vary across tasks not seen during distillation
## License

This model is released under the same license as the base LLaMA-3.2 models (likely Meta's Llama license), with the distillation code and weights under MIT.