# Model Card for LRC-4B-SFT
LRC-4B-SFT is a Small Language Model (SLM) with approximately 4 billion parameters. It is the Supervised Fine-Tuned (SFT) version of LRC-4B-Base. The base model was constructed from its teacher, Qwen2.5-7B-Instruct, with Low-Rank Clone (LRC), an efficient knowledge-distillation technique, using 18 billion training tokens. The SFT version was then further fine-tuned on the UltraChat (`ultrachat_200k`) instruction-following dataset.
The LRC approach trains a set of low-rank projection matrices that compress teacher weights into the smaller student dimensions (a form of soft pruning), together with an "activation clone" mechanism that aligns student activations, including FFN signals, with those of the teacher.
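The sketch below illustrates these two ingredients in simplified PyTorch. It is not the released training code: the shapes, names, and the particular alignment loss are assumptions for illustration only.

```python
# Minimal, illustrative sketch of the two LRC ingredients (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

D_T, D_S = 3584, 2048  # teacher / student hidden sizes (illustrative values)

# 1) Trainable low-rank projection: compresses teacher weights into the student's
#    smaller width, i.e. a learned ("soft") pruning of dimensions. In practice both
#    the input and output sides are reduced; only one side is shown here.
proj = nn.Parameter(torch.randn(D_T, D_S) / D_T ** 0.5)

def clone_weight(teacher_weight: torch.Tensor) -> torch.Tensor:
    """Map a teacher weight of shape (out_features, D_T) to (out_features, D_S)."""
    return teacher_weight @ proj

# 2) Activation clone: align student activations (hidden states / FFN signals)
#    with the teacher's, here by projecting teacher activations into the student
#    space and minimizing an MSE -- one possible formulation, not necessarily the paper's.
def activation_clone_loss(student_act: torch.Tensor, teacher_act: torch.Tensor) -> torch.Tensor:
    # student_act: (batch, seq, D_S); teacher_act: (batch, seq, D_T)
    return F.mse_loss(student_act, teacher_act @ proj)
```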
## Uses
### Direct Use
LRC-4B-SFT is an instruction-tuned model and is intended for tasks requiring instruction following, question answering, and general chat capabilities.
## Biases, Risks, and Limitations
- SFT Dataset Limitations: LRC-4B-SFT was fine-tuned solely on the UltraChat dataset (0.2B tokens). While UltraChat improves general instruction following, it may not be diverse or targeted enough to instill robust safety alignment or adherence to complex instructions compared with models trained using more extensive or specialized alignment techniques (e.g., RLHF, or SFT on broader safety/instruction datasets). Consequently, the model may show deficiencies in safety and in following highly complex or nuanced instructions. This is particularly evident on the IFeval and TruthfulQA benchmarks, where performance decreased after SFT compared to the base model.
- Inherited Biases: The model may reflect biases present in its pre-training data (a mix of FineWeb-Edu, OpenHermes 2.5, DCLM, and Cosmopedia v2) and in the teacher model (Qwen2.5-7B-Instruct).
- Hallucination: Like all LLMs, LRC-4B-SFT can generate factually incorrect or nonsensical information (hallucinations).
- Limited Scope of Evaluation: The paper's primary evaluation focuses on pre-training efficiency and general downstream tasks. Extensive testing on safety benchmarks or complex reasoning tasks beyond the reported MMLU, ARC, etc., was not detailed.
## How to Get Started with the Model
❗ Critical: For vLLM serving, pass `--model-impl transformers` when serving Qwen-series models such as this one. In vLLM's current native Qwen implementation, a custom `head_dim` cannot be set through the config; vLLM's `transformers` backend does support it. Tested versions that serve correctly: `vllm==0.8.5.post1` and `transformers==4.51.3`.
Serve command:
```bash
vllm serve JitaiHao/LRC-4B --model-impl transformers
```
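Once the server is running, it can be queried through vLLM's OpenAI-compatible API. A minimal sketch, assuming the default host/port and that the served model name equals the repo id passed to `vllm serve`:

```python
# Query the vLLM OpenAI-compatible server (host, port, and model name are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JitaiHao/LRC-4B",  # assumed to match the repo id used in `vllm serve`
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Alternatively, the model can be loaded directly with `transformers`: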
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('JitaiHao/LRC-4B')  # Assuming this will be the HF repo name
model = AutoModelForCausalLM.from_pretrained('JitaiHao/LRC-4B')  # Assuming this will be the HF repo name

# Prepare a multi-turn chat history
messages = [
    {"role": "user", "content": "Hello, who are you?"},
    {"role": "assistant", "content": "Hello, I am an AI assistant."}
]

# Use apply_chat_template to create a prompt for the model
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,             # Only generate the string prompt, do not tokenize yet
    add_generation_prompt=True  # Add a generation prompt for the assistant
)
print(input_text)  # View the generated prompt string

# If you want to generate a response with the model
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
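By default, `generate` produces only a short continuation and the decoded output repeats the prompt. An optional refinement (the parameter values below are illustrative, not recommended settings):

```python
# Optional: request a longer completion and decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```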
## Training Details
### Training Data
- Pre-training (for LRC-4B-Base): 18 billion tokens from the "Mixed-2.0" dataset, which includes FineWeb-Edu, OpenHermes 2.5, DCLM, and Cosmopedia v2.
- Supervised Fine-Tuning (SFT): 0.2 billion tokens from the UltraChat dataset.
### Training Procedure
- Pre-training (LRC-4B-Base): trained using the Low-Rank Clone (LRC) method.
  - Teacher Model: Qwen2.5-7B-Instruct
- Supervised Fine-Tuning (SFT), see the sketch below:
  - Dataset: UltraChat (0.2B tokens)
  - Learning Rate: 1.0 x 10⁻⁵
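Neither the paper nor this card specifies the exact fine-tuning stack. As one plausible setup, the sketch below uses TRL's `SFTTrainer` on the `HuggingFaceH4/ultrachat_200k` dataset at the reported learning rate; the repo ids, split, and remaining hyperparameters are assumptions for illustration.

```python
# Hypothetical SFT setup -- the authors' actual configuration may differ.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

args = SFTConfig(
    output_dir="lrc-4b-sft",        # illustrative
    learning_rate=1e-5,             # reported SFT learning rate
    per_device_train_batch_size=4,  # assumption
    num_train_epochs=1,             # assumption
)

trainer = SFTTrainer(
    model="JitaiHao/LRC-4B-Base",   # assumed repo id of the base model
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```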
## Evaluation
Zero-Shot Comparison with other publicly available SFT models (from Table 2 of the paper):
Model | # Tokens | ARC-E | ARC-C | LogiQA | CSQA | PIQA | WinoG | BoolQ | SciQ | MMLU | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
Gemma3-4B | 4T | 82.53 | 57.08 | 33.03 | 69.37 | 76.44 | 69.38 | 83.94 | 95.50 | 57.58 | 69.43 |
Minitron-4B | 94B | 79.59 | 54.35 | 30.26 | 71.09 | 77.64 | 65.93 | 82.60 | 96.60 | 56.77 | 68.31 |
Qwen3-4B | 36T | 80.47 | 53.58 | 33.64 | 75.76 | 75.08 | 65.27 | 84.95 | 95.50 | 68.38 | 70.29 |
LRC-4B-SFT | 18B | 78.37 | 52.47 | 34.10 | 79.28 | 76.82 | 67.72 | 84.50 | 95.00 | 64.41 | 70.30 |
Performance on safety and instruction-following tasks (LRC-4B-SFT vs. its base model, LRC-4B-Base):
Benchmark | Metric | Score (LRC-4B-SFT) | Score (LRC-4B-Base) |
---|---|---|---|
ToxiGen | Accuracy Norm | 43.72 | 43.83 |
IFeval | Instance-Level Loose Acc | 13.67 | 36.09 |
TruthfulQA | MC2 | 50.71 | 55.89 |
The significant decrease in IFeval and TruthfulQA scores after SFT compared to the base model highlights the impact of the SFT dataset's specificity and potential lack of coverage for these nuanced capabilities.
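Scores of this kind are commonly produced with the EleutherAI LM Evaluation Harness. A minimal zero-shot sketch (the task list and settings are assumptions and may not match the paper's exact evaluation protocol):

```python
# Hypothetical zero-shot evaluation with lm-evaluation-harness (not the paper's exact setup).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JitaiHao/LRC-4B",  # assumed repo id
    tasks=["arc_easy", "arc_challenge", "piqa", "winogrande", "boolq", "sciq", "mmlu"],
    num_fewshot=0,
)
print(results["results"])
```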
## Technical Specifications
### Model Architecture and Objective
- Architecture: Transformer-based decoder-only model.
- Number of Layers: 28
- Hidden Size: 2,048
- FFN Intermediate Size: 18,944
- Attention Q Heads: 28
- Attention KV Heads: 4
- Head Dimension: 128
- RMSNorm Epsilon: 1.0x10⁻⁶
- Vocabulary Size: 152,064
- Word Embeddings: Not Tied
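These values correspond to a Qwen2-style decoder configuration in which the head dimension is decoupled from `hidden_size / num_attention_heads` (hence the vLLM note above). The sketch below shows how the listed values map onto a `transformers` config; the architecture family and field names are assumptions, and this is not necessarily identical to the repository's `config.json`.

```python
# Illustrative mapping of the listed hyperparameters onto a Qwen2-style config
# (assumed architecture family; not necessarily the shipped config.json).
from transformers import Qwen2Config

config = Qwen2Config(
    vocab_size=152064,
    hidden_size=2048,
    intermediate_size=18944,
    num_hidden_layers=28,
    num_attention_heads=28,
    num_key_value_heads=4,
    head_dim=128,             # custom head_dim (not hidden_size // num_attention_heads)
    rms_norm_eps=1e-6,
    tie_word_embeddings=False,
)
print(config)
```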