Outlier-Safe Pre-Training


Introduction

Quantization plays a crucial role in deploying Large Language Models (LLMs) in resource-constrained environments. However, the presence of outlier features significantly hinders low-bit quantization. While many studies address this problem in a post-hoc manner to make use of already pre-trained models, the importance of handling outliers during pre-training is often underestimated.

Our work, Outlier-Safe Pre-Training (OSP), proposes a practical approach to training models that are robust to outliers from the start, without sacrificing performance or efficiency. Specifically, OSP focuses on the following goals:

  1. 📈 Scaling to production-level training requirements
    Prior methods for quantization-friendly pre-training are often limited to small-scale experiments (e.g., models under 1B parameters or 100B tokens). In contrast, we train a 1.4B-parameter model on 1 trillion tokens, demonstrating that OSP is effective at production scale.

  2. ⚡ Maintaining computational efficiency comparable to standard training
    A method that prevents outliers but significantly reduces efficiency is unlikely to gain adoption. OSP introduces only a ~2% slowdown while reducing GPU memory usage, making it appealing for those seeking to train quantization-friendly foundation models from scratch.

  3. 🧩 Ensuring full compatibility with existing inference pipelines
    We prioritize compatibility with widely adopted inference frameworks such as vLLM and SGLang. Rather than introducing architectural changes that break compatibility, OSP preserves computational invariance, allowing models to be directly integrated into existing pipelines without additional effort (see the loading sketch below).
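Because OSP keeps the standard Llama computation graph, checkpoints load like any other causal language model. Below is a minimal loading sketch using Hugging Face Transformers; the repository id dmis-lab/OSP-1.4B-100B-Muon-SSNorm is one of the ablation checkpoints listed further down, and any other OSP checkpoint can be substituted.

```python
# Minimal loading sketch: OSP checkpoints use the unmodified Llama architecture,
# so no custom modeling code or trust_remote_code flag is needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmis-lab/OSP-1.4B-100B-Muon-SSNorm"  # any OSP checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Outlier-safe pre-training aims to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```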

Model Checkpoints

Final Models

The models were trained on 1 trillion tokens, following the pre-training recipe of SmolLM. Specifically, training was conducted using the smollm-corpus, a mixture of FineWeb-Edu, Cosmopedia, and Python-Edu.
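To inspect the same data mixture, the corpus can be streamed from the Hugging Face Hub. This is a sketch only; it assumes the public HuggingFaceTB/smollm-corpus repository and its fineweb-edu-dedup configuration, whose exact names should be checked against the dataset card.

```python
# Sketch: stream a few documents from the smollm-corpus. The repository and
# configuration names ("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup") are
# taken from the public dataset card and may differ from the exact data used here.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup", split="train", streaming=True
)
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break
```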

Ablation Models

| Model | Optimizer | SSNorm | EmbProj | Ex. Kurt. | Had. | 4-4-4 Avg. | 4-4-4 PPL |
|---|---|---|---|---|---|---|---|
| 🤗 OSP-1.4B-100B-Adam | Adam | ✗ | ✗ | 1818.56 | ✗ | 26.8 | 8e4 |
| | | | | | ✔ | 26.9 | 3e4 |
| 🤗 OSP-1.4B-100B-Muon-Only | Muon† (w/o Adam) | ✗ | ✗ | 361.35 | ✗ | 26.3 | 8e5 |
| | | | | | ✔ | 33.1 | 24.8 |
| 🤗 OSP-1.4B-100B-Muon | Muon | ✗ | ✗ | 1575.12 | ✗ | 29.0 | 1e4 |
| | | | | | ✔ | 38.4 | 15.8 |
| 🤗 OSP-1.4B-100B-Muon-SSNorm | Muon | ✔ | ✗ | 66.69 | ✗ | 36.4 | 44.2 |
| | | | | | ✔ | 38.3 | 34.1 |
| 🤗 OSP-1.4B-100B-Muon-EmbProj | Muon | ✗ | ✔ | 703.23 | ✗ | 30.4 | 114.6 |
| | | | | | ✔ | 36.2 | 22.3 |
| 🤗 OSP-1.4B-100B-Muon-SSNorm-EmbProj | Muon | ✔ | ✔ | 0.04 | ✗ | 37.5 | 19.6 |
| | | | | | ✔ | 38.9 | 13.5 |
† Configuration that disables decoupled embedding optimization, i.e., the embedding layers are trained with Muon instead of Adam.
Ex. Kurt. is the excess kurtosis of hidden activations (lower means fewer outliers). Had. indicates whether a Hadamard rotation is applied at inference. 4-4-4 Avg. and 4-4-4 PPL report the average benchmark score and perplexity under 4-bit weight, activation, and KV-cache quantization.
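For reference, excess kurtosis is a standard heavy-tailedness statistic (≈0 for Gaussian activations, large when a few outlier features dominate). The sketch below shows one way to compute it from a model's hidden states; the flattening/aggregation choice here is an assumption and may differ from the measurement protocol used for the table.

```python
# Illustrative excess-kurtosis computation over hidden activations.
# Flattening all hidden states into one vector is an assumption; the reported
# numbers may instead be aggregated per layer or per feature dimension.
import torch

def excess_kurtosis(x: torch.Tensor) -> float:
    x = x.float().flatten()
    centered = x - x.mean()
    var = centered.pow(2).mean()
    fourth = centered.pow(4).mean()
    return (fourth / (var.pow(2) + 1e-12) - 3.0).item()  # 0 for a Gaussian

# Usage with the model/tokenizer loaded above:
# out = model(**inputs, output_hidden_states=True)
# acts = torch.cat([h.flatten() for h in out.hidden_states])
# print(excess_kurtosis(acts))
```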

Training

Model

  • Architecture: Llama
  • Pretraining tokens: 100 billion tokens
  • Precision: bfloat16

Hardware

  • TPUs: TPU v4-512 Pod slice (supported by the TPU Research Cloud (TRC) program)

Software

Disclaimer

This model family was trained to demonstrate the effectiveness of eliminating outlier occurrences and improving quantization-friendliness. All models are base models, i.e., no instruction tuning or human alignment was applied. These models are not intended for chatting, conversation, or assistant purposes. They may contain toxic or harmful content. Their best use is for evaluating performance degradation on benchmarks after low-bit quantization.
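As a starting point for such evaluations, the sketch below measures perplexity after loading a checkpoint with weight-only 4-bit quantization via bitsandbytes. Note that this is not the W4A4KV4 ("4-4-4") setting reported in the ablation table above; it only illustrates the general evaluation loop.

```python
# Hedged sketch: perplexity of a 4-bit (weight-only, bitsandbytes) model on a
# single snippet. The paper's 4-4-4 results quantize weights, activations, and
# the KV cache, which requires a dedicated quantization pipeline instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "dmis-lab/OSP-1.4B-100B-Muon-SSNorm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

text = "Quantization plays a crucial role in deploying large language models."
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.2f}")
```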

Citation

@article{park2025osp,
      title={Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models}, 
      author={Jungwoo Park and Taewhoo Lee and Chanwoong Yoon and Hyeon Hwang and Jaewoo Kang},
      year={2025},
      eprint={2506.19697},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.19697}, 
}