# LLaMA 3.2 1B (Finetuned via Distillation)

This model is a distilled version of meta-llama/Llama-3.2-3B-Instruct, finetuned to mimic its behavior with meta-llama/Llama-3.2-1B-Instruct as the student. Distillation was performed on a subset of WikiText-2 with prompt-based soft-label supervision.
## Training Method

We used logit matching (KL-divergence loss) between the teacher and student models. The prompt "How to learn a new language?" served as a simple qualitative test example, and the WikiText-2 corpus was used for training. A minimal sketch of the training step follows the settings list below.
Key settings:

- Teacher: LLaMA-3.2-3B-Instruct
- Student: LLaMA-3.2-1B-Instruct
- Optimizer: AdamW
- Loss: KLDivLoss on logits
- Batch size: 16
- Max tokens: 256
- Training steps: ~10k (can vary)
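
The core update is a KL-divergence loss between temperature-softened teacher and student token distributions. The sketch below illustrates one such step under stated assumptions: the learning rate, softening temperature, and padding handling are not specified on this card and are chosen for illustration only.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model names follow the card; hyperparameters use the card's key settings
# where given and illustrative assumptions otherwise.
teacher_id = "meta-llama/Llama-3.2-3B-Instruct"
student_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(student_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

teacher = AutoModelForCausalLM.from_pretrained(teacher_id).eval()
student = AutoModelForCausalLM.from_pretrained(student_id)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)  # learning rate is assumed
temperature = 2.0  # softening temperature is assumed

def distill_step(batch_texts):
    """One logit-matching step: KL(teacher || student) on softened distributions."""
    inputs = tokenizer(batch_texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=256).to(student.device)
    with torch.no_grad():
        teacher_logits = teacher(**inputs).logits
    student_logits = student(**inputs).logits

    # KLDivLoss on logits: student provides log-probabilities, teacher probabilities
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In an actual run this step would be applied over batches of 16 filtered WikiText-2 passages for roughly 10k steps, per the settings above.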
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the distilled student checkpoint and its tokenizer
model = AutoModelForCausalLM.from_pretrained("YiChuanH/llama1B-finetuned")
tokenizer = AutoTokenizer.from_pretrained("YiChuanH/llama1B-finetuned")

# Generate a response for the test prompt used during development
prompt = "How to learn a new language?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Evaluation

This model aims to preserve the output style and quality of the 3B teacher while using roughly one third of the parameters (1B vs. 3B). Qualitatively, its responses are more informative and instructive than those of the original 1B base model; a simple side-by-side check is sketched below.
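
No quantitative benchmark is reported, so the comparison is qualitative. One way to reproduce it is to generate from the base 1B model and the distilled checkpoint with the same prompt; the generation settings below are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "How to learn a new language?"

# Compare the base 1B model against the distilled checkpoint on the same prompt
for name in ["meta-llama/Llama-3.2-1B-Instruct", "YiChuanH/llama1B-finetuned"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    print(f"=== {name} ===")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```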
## Dataset

- WikiText-2 (raw), cleaned and filtered to keep passages longer than 30 tokens (a rough reproduction of this filtering is sketched below)
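
The exact cleaning rules are not documented on this card, so the token-count filter below (measured with the student tokenizer) is an assumption meant only to illustrate the described preprocessing.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def keep(example):
    # Drop empty lines and keep passages longer than 30 tokens
    text = example["text"].strip()
    return bool(text) and len(tokenizer(text)["input_ids"]) > 30

filtered = wikitext.filter(keep)
print(len(filtered), "passages retained")
```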
## Limitations
- No RLHF or instruction fine-tuning beyond logit distillation
- Not suitable for safety-critical applications
- Quality may vary across tasks not seen during distillation
## License

This model is released under the same license as the base LLaMA-3.2 models (likely Meta's Llama license), with the distillation code and weights under MIT.