Model Card for LRC-4B-Base
LRC-4B-Base is a Small Language Model (SLM) with approximately 4 billion parameters. It is the base pre-trained version, developed using the Low-Rank Clone (LRC) method, before any Supervised Fine-Tuning (SFT). The LRC method is an efficient knowledge distillation technique designed to construct SLMs that aspire to behavioral equivalence with larger, more powerful teacher models. This model was distilled from Qwen2.5-7B-Instruct.
The LRC approach trains a set of low-rank projection matrices that enable soft pruning by compressing teacher weights, together with an "activation clone" mechanism that aligns the student's activations (including FFN signals) with those of the teacher. LRC-4B-Base was trained on only 18 billion tokens, a fraction of the trillions of tokens used to train comparable models, demonstrating significant training efficiency.
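The weight-compression idea can be illustrated with a minimal, hypothetical sketch (the dimensions, names, and projection shapes below are illustrative assumptions, not the released training code): a trainable low-rank projection maps teacher weights down to the student's smaller hidden size, and an MSE term pulls the student's activations toward the teacher's.
import torch
import torch.nn as nn

# Illustrative dimensions: teacher hidden size 3584 (Qwen2.5-7B), student hidden size 2048.
d_teacher, d_student = 3584, 2048

# Trainable low-rank projection (the actual method uses a set of such matrices; one is shown here).
proj = nn.Linear(d_teacher, d_student, bias=False)

# A frozen teacher weight matrix standing in for an attention or FFN projection.
W_teacher = torch.randn(d_teacher, d_teacher)

# "Clone" a student weight by compressing the teacher weight through the projection.
W_student = proj.weight @ W_teacher @ proj.weight.T  # shape: (d_student, d_student)

# Activation clone: penalize the gap between student activations and projected teacher activations.
h_teacher = torch.randn(4, d_teacher)                      # teacher hidden states (toy batch)
h_student = torch.randn(4, d_student, requires_grad=True)  # student hidden states (toy batch)
clone_loss = nn.functional.mse_loss(h_student, h_teacher @ proj.weight.T)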
Uses
Direct Use
LRC-4B-Base is a base pre-trained model. While it has not undergone specific Supervised Fine-Tuning (SFT) for instruction following or chat, it was distilled from an instruction-tuned teacher (Qwen2.5-7B-Instruct) and trained on data including OpenHermes (synthetic assistant dialogues) as part of the Mixed-2.0 dataset. Consequently, it may exhibit some nascent instruction-following or conversational capabilities.
How to Get Started with the Model
❗ Critical: For vLLM serving, specify --model-impl transformers when using Qwen-series models. In the current implementation of vLLM, the native Qwen implementation does not support setting a custom head_dim through the config; fortunately, vLLM allows using transformers as the backend. Tested versions that serve properly: vllm==0.8.5.post1 and transformers==4.51.3.
Serve command:
vllm serve JitaiHao/LRC-4B-Base --model-impl transformers
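Once the server is running, it exposes an OpenAI-compatible API (by default on port 8000). Below is a minimal query sketch, assuming the default host/port and the openai Python client; the prompt is only an illustration.
from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint; the API key is unused but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="JitaiHao/LRC-4B-Base",
    prompt="Explain the theory of relativity in simple terms.",
    max_tokens=150,
)
print(completion.choices[0].text)
For local use without a server, the transformers snippet below loads the model directly: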
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained('JitaiHao/LRC-4B-Base')
model = AutoModelForCausalLM.from_pretrained('JitaiHao/LRC-4B-Base')
# Example: Text generation (output quality will depend on the base model's capabilities)
prompt = "Explain the theory of relativity in simple terms."
inputs = tokenizer(prompt, return_tensors="pt")
# Generate text
# Note: Add generation parameters as needed (e.g., max_length, num_beams)
outputs = model.generate(**inputs, max_length=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
LRC-4B-Base was pre-trained as part of the LRC-4B development (which then underwent SFT). The pre-training phase for LRC-4B used 18 billion tokens sampled from the "Mixed-2.0" dataset. As detailed in Table 10 of the paper, the "Mixed-2.0" dataset pool (totaling 21.5B tokens) consists of:
- Fineweb-Edu: 18B tokens
- DCLM: 2B tokens
- Cosmopedia V2: 1B tokens
- OpenHermes 2.5: 450M tokens

From this pool, 18B tokens were used for distillation from the Qwen2.5-7B-Instruct teacher model.
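As a quick sanity check on the token counts above (a small illustrative sketch, not training code):
# Mixed-2.0 pool as listed above, in billions of tokens.
mixed_2_0_pool = {
    "Fineweb-Edu": 18.0,
    "DCLM": 2.0,
    "Cosmopedia V2": 1.0,
    "OpenHermes 2.5": 0.45,
}
pool_total = sum(mixed_2_0_pool.values())  # 21.45, i.e. the ~21.5B-token pool
tokens_used = 18.0                         # tokens sampled for LRC-4B distillation
print(f"pool={pool_total}B, used={tokens_used}B ({tokens_used / pool_total:.0%} of the pool)")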
Training Procedure
LRC-4B-Base was trained using the Low-Rank Clone (LRC) method. Key aspects:
- Distillation Method: Low-Rank Projection of teacher weights and Activation Clone (aligning student's internal activations, including FFNs, with the teacher's via MSE loss).
- Overall Loss: $\mathcal{L} = \mathcal{L}_\mathrm{KL} + \mathcal{L}_\mathrm{LM} + \alpha\mathcal{L}_\mathrm{clone}$ (KL divergence on output logits, next-token prediction loss, and activation-cloning loss; a minimal sketch of this combined loss appears after this list).
- Teacher Model: Qwen2.5-7B-Instruct
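A minimal, hypothetical sketch of the combined objective (variable names and the temperature handling are assumptions for illustration; the released training code may differ):
import torch.nn.functional as F

def lrc_loss(student_logits, teacher_logits, labels, student_acts, teacher_acts_proj,
             alpha=0.5, temperature=40.0):
    """Sketch of L = L_KL + L_LM + alpha * L_clone as described above."""
    # L_LM: standard next-token prediction loss on the student's logits.
    lm_loss = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)), labels.view(-1))
    # L_KL: KL divergence between temperature-softened student and teacher output distributions.
    kl_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    # L_clone: MSE between student activations and the corresponding (projected) teacher activations.
    clone_loss = sum(F.mse_loss(s, t) for s, t in zip(student_acts, teacher_acts_proj))
    return kl_loss + lm_loss + alpha * clone_loss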
Training Hyperparameters (for the LRC-4B pre-training run that produced LRC-4B-Base; a step-count and scheduler sketch follows this list):
- Total Training Tokens: 18B
- Student Hidden Size: 2,048
- Sequence Length: 2,048
- Batch Size (tokens): 32,768
- Clone Loss Weight (α): 0.5
- Learning Rate (Pre-train): 1.0 × 10⁻⁴
- LR Scheduler: Linear decay with a warmup ratio of 0.005
- Optimizer: Adam (β₁=0.9, β₂=0.999)
- Temperature for $\mathcal{L}_\mathrm{KL}$ (KL divergence loss): 40
- RMSNorm ε (for the training process): 1.0 × 10⁻⁵
- Hardware: 8 x NVIDIA H800 GPUs
- Training Time (for pre-training): Approximately 138 Hours
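For orientation, the step counts implied by the table can be derived, and the optimizer/scheduler wiring sketched, as follows (the wiring is an assumption, not the released training script; model is a placeholder):
import torch
from transformers import get_linear_schedule_with_warmup

total_tokens = 18_000_000_000
tokens_per_step = 32_768                        # batch size in tokens
total_steps = total_tokens // tokens_per_step   # = 549,316 optimizer steps
warmup_steps = int(0.005 * total_steps)         # = 2,746 warmup steps

model = torch.nn.Linear(8, 8)  # placeholder for the student model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)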
Evaluation
Zero-shot performance of LRC-4B-Base (pre-SFT base model, referred to as LRC-4B-B) on general downstream tasks (from Table 13):
Benchmark | Metric | Score |
---|---|---|
ARC-E | Accuracy | 78.75 |
ARC-C | Accuracy Norm | 52.22 |
LogiQA | Accuracy Norm | 34.87 |
CSQA | Accuracy | 78.30 |
PIQA | Accuracy | 76.61 |
WinoG | Accuracy | 67.80 |
BoolQ | Accuracy | 84.95 |
SciQ | Accuracy | 94.30 |
MMLU | Accuracy | 64.58 |
Avg. | | 70.26 |
Its SFT version, LRC-4B (trained on 18B tokens), achieves an average of 70.30% on these tasks (Table 13). Below is a comparison of the SFT version (LRC-4B) with other publicly available SFT models with more than 2B parameters (from Table 2 of the paper):
Model | # Tokens | ARC-E | ARC-C | LogiQA | CSQA | PIQA | WinoG | BoolQ | SciQ | MMLU | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
Gemma3-4B | 4T | 82.53 | 57.08 | 33.03 | 69.37 | 76.44 | 69.38 | 83.94 | 95.50 | 57.58 | 69.43 |
Minitron-4B | 94B | 79.59 | 54.35 | 30.26 | 71.09 | 77.64 | 65.93 | 82.60 | 96.60 | 56.77 | 68.31 |
Qwen3-4B | 36T | 80.47 | 53.58 | 33.64 | 75.76 | 75.08 | 65.27 | 84.95 | 95.50 | 68.38 | 70.29 |
LRC-4B-SFT | 18B | 78.37 | 52.47 | 34.10 | 79.28 | 76.82 | 67.72 | 84.50 | 95.00 | 64.41 | 70.30 |
This comparison demonstrates that the SFT version of LRC-4B, despite being trained on fewer tokens (18B vs 36T/4T/94B), achieves performance comparable to other SFT models in the ~4B parameter class like Qwen3-4B and Gemma3-4B, and outperforms Minitron-4B. This highlights the efficiency of the LRC method.
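The paper does not publish the exact evaluation command; a typical way to reproduce zero-shot numbers of this kind is EleutherAI's lm-evaluation-harness, sketched below (task names and settings are assumptions and may not exactly match the paper's setup):
import lm_eval

# Zero-shot evaluation sketch using the lm-evaluation-harness v0.4+ Python API.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JitaiHao/LRC-4B-Base",
    tasks=["arc_easy", "arc_challenge", "logiqa", "commonsense_qa",
           "piqa", "winogrande", "boolq", "sciq", "mmlu"],
    num_fewshot=0,
)
print(results["results"])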
Technical Specifications
Model Architecture and Objective
- Architecture: Transformer-based decoder-only model. Its configuration is a scaled-down version of its teacher, Qwen2.5-7B-Instruct (a config-inspection sketch follows this list).
- Number of Layers: 28
- Hidden Size: 2,048
- FFN Intermediate Size: 18,944
- Attention Q Heads: 28
- Attention KV Heads: 4
- Head Dimension: 128
- Vocabulary Size: 152,064 (inherited from teacher Qwen2.5-7B)
- Tied Word Embeddings: False
- RMSNorm ε (model architecture): 1.0 × 10⁻⁶
- Objective: The model is trained via knowledge distillation. The primary objective is next-token prediction (language modeling, $\mathcal{L}_\mathrm{LM}$ loss). This is augmented by:
- A KL divergence loss ($\mathcal{L}_\mathrm{KL}$) between the student's and teacher's output logits.
- An "Activation Clone" loss ($\mathcal{L}_\mathrm{clone}$) using Mean Squared Error (MSE) to align the student's intermediate hidden states (for attention inputs q,k,v and FFN inputs gate, up) and output activations (from attention and FFN modules after projection by student's output weights) with those of the teacher model. The teacher's weights are compressed into student weights using trainable low-rank projection matrices.
- The total training objective is $\mathcal{L} = \mathcal{L}_\mathrm{KL} + \mathcal{L}_\mathrm{LM} + α\mathcal{L}_\mathrm{clone}$.
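These values can be checked against the released config; a minimal inspection sketch (field names follow the standard transformers Qwen2-style config; attribute availability may vary by transformers version):
from transformers import AutoConfig

config = AutoConfig.from_pretrained("JitaiHao/LRC-4B-Base")
# Expected per the list above: 28 layers, hidden size 2048, intermediate size 18944,
# 28 query heads, 4 KV heads, head_dim 128, vocab 152064, rms_norm_eps 1e-6, untied embeddings.
for field in ["num_hidden_layers", "hidden_size", "intermediate_size",
              "num_attention_heads", "num_key_value_heads",
              "vocab_size", "rms_norm_eps", "tie_word_embeddings", "head_dim"]:
    print(field, getattr(config, field, "n/a"))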