# Outlier-Safe Pre-Training

## Introduction
Quantization plays a crucial role in deploying Large Language Models (LLMs) in resource-constrained environments. However, the presence of outlier features significantly hinders low-bit quantization. While many studies address this problem post hoc, so that already pre-trained models can be reused, the importance of handling outliers during pre-training is often underestimated.
Our work, Outlier-Safe Pre-Training (OSP), proposes a practical approach to training models that are robust to outliers from the start, without sacrificing performance or efficiency. Specifically, OSP focuses on the following goals:
1. **Scaling to production-level training requirements.** Prior methods for quantization-friendly pre-training are often limited to small-scale experiments (e.g., models under 1B parameters or 100B tokens). In contrast, we train a 1.4B-parameter model on 1 trillion tokens, demonstrating that OSP is effective at production scale.
2. **Maintaining computational efficiency comparable to standard training.** A method that prevents outliers but significantly reduces efficiency is unlikely to gain adoption. OSP introduces only a ~2% slowdown while reducing GPU memory usage, making it appealing for those seeking to train quantization-friendly foundation models from scratch.
3. **Ensuring full compatibility with existing inference pipelines.** We prioritize compatibility with widely adopted inference frameworks such as vLLM and SGLang. Rather than introducing architectural changes that break compatibility, OSP preserves computational invariance, allowing models to be directly integrated into existing pipelines without additional effort (see the loading sketch below).
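Because OSP preserves computational invariance, the released checkpoints behave as ordinary Llama-architecture models and load with stock tooling. A minimal sketch using 🤗 Transformers; the repo id below is an illustrative placeholder, not the actual Hub path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- substitute the actual Hugging Face Hub path.
model_id = "OSP-1.4B-1T-Muon-SSNorm-EmbProj"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Plain greedy generation; no chat template, since these are base models.
inputs = tokenizer("Outlier-safe pre-training is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```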
## Model Checkpoints

### Final Models
The models were trained on 1 trillion tokens, following the pre-training recipe of SmolLM. Specifically, training was conducted using the smollm-corpus, a mixture of FineWeb-Edu, Cosmopedia, and Python-Edu.
- 🤗 OSP-1.4B-1T-Adam: Trained with the standard Adam optimizer, without any modifications.
- 🤗 OSP-1.4B-1T-Muon-SSNorm-EmbProj: Trained with the full OSP framework. This is our final model.
### Ablation Models
| Model | Optimizer | SSNorm | EmbProj | Ex. Kurt. | Had. | 4-4-4 Avg. | 4-4-4 PPL |
|---|---|---|---|---|---|---|---|
| 🤗 OSP-1.4B-100B-Adam | Adam | ❌ | ❌ | 1818.56 | ✗<br>✓ | 26.8<br>26.9 | 8e4<br>3e4 |
| 🤗 OSP-1.4B-100B-Muon-Only | Muon (w/o Adam) | ❌ | ❌ | 361.35 | ✗<br>✓ | 26.3<br>33.1 | 8e5<br>24.8 |
| 🤗 OSP-1.4B-100B-Muon | Muon | ❌ | ❌ | 1575.12 | ✗<br>✓ | 29.0<br>38.4 | 1e4<br>15.8 |
| 🤗 OSP-1.4B-100B-Muon-SSNorm | Muon | ✅ | ❌ | 66.69 | ✗<br>✓ | 36.4<br>38.3 | 44.2<br>34.1 |
| 🤗 OSP-1.4B-100B-Muon-EmbProj | Muon | ❌ | ✅ | 703.23 | ✗<br>✓ | 30.4<br>36.2 | 114.6<br>22.3 |
| 🤗 OSP-1.4B-100B-Muon-SSNorm-EmbProj | Muon | ✅ | ✅ | 0.04 | ✗<br>✓ | 37.5<br>38.9 | 19.6<br>13.5 |

SSNorm = Single-Scale RMSNorm; EmbProj = embedding projection; Ex. Kurt. = excess kurtosis of activations; Had. = whether a Hadamard rotation is applied at inference (✗ without, ✓ with); 4-4-4 = 4-bit weights, activations, and KV cache (Avg. = average benchmark accuracy, PPL = perplexity).
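For reference, the Ex. Kurt. column follows the standard definition of excess kurtosis (the fourth standardized moment minus 3, so a Gaussian scores 0). A minimal sketch of that statistic; the paper's exact measurement protocol (e.g., which activations are sampled, per-channel averaging) may differ:

```python
import torch

def excess_kurtosis(x: torch.Tensor) -> float:
    """Excess kurtosis of a tensor: E[z^4] - 3, with z the standardized values."""
    x = x.float().flatten()
    z = (x - x.mean()) / x.std(unbiased=False)
    return (z.pow(4).mean() - 3.0).item()
```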
## Training

### Model
- Architecture: Llama (see the config sketch below)
- Pretraining tokens: 100 billion (ablation models) / 1 trillion (final models)
- Precision: bfloat16
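For orientation, a configuration sketch of a Llama model at roughly this scale; every hyperparameter below is an illustrative placeholder, not the paper's actual configuration:

```python
from transformers import LlamaConfig

# All values are hypothetical placeholders for a ~1.4B-parameter Llama.
config = LlamaConfig(
    hidden_size=2048,
    intermediate_size=8192,
    num_hidden_layers=24,
    num_attention_heads=32,
    torch_dtype="bfloat16",  # matches the stated training precision
)
```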
### Hardware
- TPUs: TPU-v4-512 Pod Slice (supported by the TPU Research Cloud (TRC) Program)
### Software
## Disclaimer
This model family was trained to demonstrate the effectiveness of eliminating outlier occurrences and improving quantization-friendliness. All models are base models, i.e., no instruction tuning or human alignment was applied. These models are not intended for chatting, conversation, or assistant purposes. They may contain toxic or harmful content. Their best use is for evaluating performance degradation on benchmarks after low-bit quantization.
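As a starting point for such evaluations, here is a toy round-to-nearest 4-bit weight quantizer. It covers weights only; the paper's 4-4-4 setting also quantizes activations and the KV cache, and the function below is our own illustration rather than the paper's evaluation code:

```python
import torch

def fake_quant_rtn_4bit(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-output-channel 4-bit round-to-nearest fake quantization."""
    # 4-bit signed integers span [-8, 7]; scale each output channel to that range.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale  # dequantized weights, ready to swap into a model for eval
```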
## Citation
```bibtex
@article{park2025osp,
  title={Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models},
  author={Jungwoo Park and Taewhoo Lee and Chanwoong Yoon and Hyeon Hwang and Jaewoo Kang},
  year={2025},
  eprint={2506.19697},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.19697},
}
```