---
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
---
# Outlier-Safe Pre-Training

[![arXiv](https://img.shields.io/badge/arXiv-2506.19697-b31b1b?style=flat-square)](https://arxiv.org/abs/2506.19697)
[![Models](https://img.shields.io/badge/%F0%9F%A4%97Hugging_Face-Collection-ffd200?style=flat-square)](https://huggingface.co/collections/dmis-lab/outlier-safe-pre-training-osp-685bda10aa1e8a19fcb58ea8)
[![code](https://img.shields.io/badge/Github-Code-keygen.svg?logo=github&style=flat-square)](https://github.com/dmis-lab/Outlier-Safe-Pre-Training)

## Introduction

Quantization plays a crucial role in deploying Large Language Models (LLMs) in resource-constrained environments. However, the presence of outlier features severely hinders low-bit quantization. While many studies address this problem post hoc so that already pre-trained models can be reused, the importance of handling outliers during pre-training itself is often underestimated.

Our work, **Outlier-Safe Pre-Training (OSP)**, proposes a practical approach to training models that are robust to outliers from the start, without sacrificing performance or efficiency. Specifically, OSP focuses on the following goals:

1. πŸ“ˆ**Scaling to production-level training requirements**<br/>
Prior methods for quantization-friendly pre-training are often limited to small-scale experiments (e.g., models under 1B parameters or 100B tokens). In contrast, we train a 1.4B-parameter model on 1 trillion tokens, demonstrating that OSP is effective at production scale.

2. ⚑**Maintaining computational efficiency comparable to standard training**<br/>
A method that prevents outliers but significantly reduces efficiency is unlikely to gain adoption. OSP introduces only a ~2% slowdown while reducing GPU memory usage, making it appealing for those seeking to train quantization-friendly foundation models from scratch.

3. 🧩**Ensuring full compatibility with existing inference pipelines**<br/>
We prioritize compatibility with widely adopted inference frameworks such as vLLM and SGLang. Rather than introducing architectural changes that break compatibility, OSP preserves computational invariance, allowing models to be directly integrated into existing pipelines without additional effort (a toy sketch of this invariance follows this list).
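The sketch below is a generic, minimal illustration of why computational invariance matters: rotating activations by an orthogonal matrix and folding the rotation into adjacent weights leaves a linear layer's output unchanged, so a rotated model still runs on stock dense kernels. It is not OSP's specific transformation.

```python
# Toy illustration of computational invariance: fold an orthogonal rotation Q
# into the weights and apply Q to the inputs; the output is unchanged.
# Generic sketch, not OSP's specific transformation.
import torch

torch.manual_seed(0)
d_in, d_out, n = 64, 32, 4
x = torch.randn(n, d_in)
W = torch.randn(d_out, d_in)

# Random orthogonal Q (a normalized Hadamard matrix would also work).
Q, _ = torch.linalg.qr(torch.randn(d_in, d_in))

y_ref = x @ W.T              # original linear layer
y_rot = (x @ Q) @ (W @ Q).T  # rotated inputs, rotation folded into weights
print(torch.allclose(y_ref, y_rot, atol=1e-4))  # True
```

Because the rotated weights are still ordinary dense matrices, frameworks such as vLLM or SGLang can serve the model without modification.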



## Model Checkpoints

### Final Models

The models were trained on 1 trillion tokens, following the pre-training recipe of [SmolLM](https://huggingface.co/blog/smollm). Specifically, training was conducted using the [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), a mixture of FineWeb-Edu, Cosmopedia, and Python-Edu.

- [πŸ€— OSP-1.4B-1T-Adam](https://huggingface.co/dmis-lab/OSP-1.4B-1T-Adam): Trained with the standard Adam optimizer, without any modifications.
- [πŸ€— OSP-1.4B-1T-Muon-SSNorm-EmbProj](https://huggingface.co/dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj): Trained with the OSP framework. This is our final model.
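Since the architecture is a standard Llama variant (see Training below), the checkpoints should load with the usual `transformers` workflow. The snippet below is a minimal sketch, not a verified example; check the model pages for the exact configuration.

```python
# Minimal loading sketch (assumes standard Llama configs compatible with
# AutoModelForCausalLM; verify against the model repository).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Quantization of large language models"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```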


### Ablation Models

<table>
    <thead>
        <tr>
            <th rowspan="2">Model</th>
            <th rowspan="2">Optimizer</th>
            <th rowspan="2">SSNorm</th>
            <th rowspan="2">EmbProj</th>
            <th rowspan="2">Ex. Kurt.</th>
            <th rowspan="2">Had.</th>
            <!-- <th colspan="2">16-16-16</th> -->
            <th colspan="2">4-4-4</th>
        </tr>
        <tr>
            <!-- <th>Avg.</th>
            <th>PPL</th> -->
            <!-- <th>Avg.</th>
            <th>PPL</th>
            <th>Avg.</th>
            <th>PPL</th>
            <th>Avg.</th>
            <th>PPL</th> -->
            <th>Avg.</th>
            <th>PPL</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Adam">πŸ€— OSP-1.4B-100B-Adam</a></td>
            <td>Adam</td>
            <td>βœ—</td>
            <td>βœ—</td>
            <td>1818.56</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.5<br>41.5</td>
            <td>11.4<br>11.4</td> -->
            <!-- <td>39.7<br>40.2</td>
            <td>21.6<br>22.3</td>
            <td>39.7<br>40.3</td>
            <td>21.6<br>22.3</td>
            <td>26.5<br>27.2</td>
            <td>1e5<br>3e4</td> -->
            <td>26.8<br>26.9</td>
            <td>8e4<br>3e4</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-Only">πŸ€— OSP-1.4B-100B-Muon-Only</a></td>
            <td>Muon&dagger;<br/>(w/o Adam)</td>
            <td>βœ—</td>
            <td>βœ—</td>
            <td>361.35</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.0<br>41.0</td>
            <td>11.7<br>11.7</td> -->
            <!-- <td>38.4<br>37.5</td>
            <td>14.8<br>15.4</td>
            <td>38.3<br>37.5</td>
            <td>14.8<br>15.4</td>
            <td>26.3<br>33.3</td>
            <td>1e6<br>24.5</td> -->
            <td>26.3<br>33.1</td>
            <td>8e5<br>24.8</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon">πŸ€— OSP-1.4B-100B-Muon</a></td>
            <td>Muon</td>
            <td>βœ—</td>
            <td>βœ—</td>
            <td>1575.12</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.5<br>41.5</td>
            <td>11.4<br>11.4</td> -->
            <!-- <td>40.0<br>40.6</td>
            <td>13.8<br>12.9</td>
            <td>40.0<br>40.6</td>
            <td>13.8<br>12.9</td>
            <td>29.4<br>38.6</td>
            <td>934.3<br>15.7</td> -->
            <td>29.0<br>38.4</td>
            <td>1e4<br>15.8</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-SSNorm">πŸ€— OSP-1.4B-100B-Muon-SSNorm</a></td>
            <td>Muon</td>
            <td>βœ”</td>
            <td>βœ—</td>
            <td>66.69</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td><strong>41.8</strong><br><strong>41.8</strong></td>
            <td><strong>11.2</strong><br><strong>11.2</strong></td> -->
            <!-- <td><strong>41.0</strong><br><strong>40.8</strong></td>
            <td>12.4<br>12.2</td>
            <td><strong>40.9</strong><br><strong>40.8</strong></td>
            <td>12.4<br>12.2</td>
            <td>36.6<br>38.6</td>
            <td>43.3<br>33.7</td> -->
            <td>36.4<br>38.3</td>
            <td>44.2<br>34.1</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-EmbProj">πŸ€— OSP-1.4B-100B-Muon-EmbProj</a></td>
            <td>Muon</td>
            <td>βœ—</td>
            <td>βœ”</td>
            <td>703.23</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>40.0<br>40.0</td>
            <td>12.3<br>12.3</td> -->
            <!-- <td>38.4<br>39.2</td>
            <td>14.8<br>13.9</td>
            <td>38.4<br>39.3</td>
            <td>14.8<br>13.9</td>
            <td>31.0<br>36.3</td>
            <td>99.7<br>22.1</td> -->
            <td>30.4<br>36.2</td>
            <td>114.6<br>22.3</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-SSNorm-EmbProj">πŸ€— OSP-1.4B-100B-Muon-SSNorm-EmbProj</a></td>
            <td>Muon</td>
            <td>βœ”</td>
            <td>βœ”</td>
            <td><strong>0.04</strong></td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.4<br>41.4</td>
            <td><strong>11.2</strong><br><strong>11.2</strong></td> -->
            <!-- <td>40.6<br>40.5</td>
            <td><strong>12.2</strong><br><strong>12.1</strong></td>
            <td>40.6<br>40.5</td>
            <td><strong>12.2</strong><br><strong>12.1</strong></td>
            <td><strong>37.9</strong><br><strong>39.1</strong></td>
            <td><strong>19.4</strong><br><strong>13.4</strong></td> -->
            <td><strong>37.5</strong><br><strong>38.9</strong></td>
            <td><strong>19.6</strong><br><strong>13.5</strong></td>
        </tr>
    </tbody>
</table>
&dagger; Disables decoupled embedding optimization, i.e., the embedding layers are trained with the Muon optimizer instead of Adam.

*Ex. Kurt.* is the excess kurtosis of hidden activations (lower means fewer outliers), *Had.* indicates whether a Hadamard rotation is applied at inference time, and *4-4-4* denotes 4-bit weight/activation/KV-cache quantization.
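To make the Ex. Kurt. column concrete: excess kurtosis measures how heavy-tailed the activation distribution is, and it is 0 for a Gaussian. The sketch below shows one possible way to estimate it with forward hooks; the hook locations and aggregation are illustrative assumptions, not the paper's exact measurement protocol.

```python
# Illustrative sketch: estimate excess kurtosis of decoder-layer activations
# via forward hooks. Hook points and aggregation are assumptions.
import torch

def excess_kurtosis(x: torch.Tensor) -> float:
    """Excess kurtosis over all elements of x (0 for a Gaussian)."""
    x = x.detach().float().flatten()
    z = (x - x.mean()) / x.std()
    return z.pow(4).mean().item() - 3.0

@torch.no_grad()
def activation_kurtosis(model, tokenizer, text: str) -> dict:
    stats, handles = {}, []
    for name, module in model.named_modules():
        # Assumed hook points: attention and MLP outputs of each decoder layer.
        if name.endswith("self_attn") or name.endswith("mlp"):
            def hook(_, __, output, name=name):
                h = output[0] if isinstance(output, tuple) else output
                stats[name] = excess_kurtosis(h)
            handles.append(module.register_forward_hook(hook))
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    model(**inputs)
    for h in handles:
        h.remove()
    return stats
```

This returns one value per hooked module; the table above reports a single aggregated value per model.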


## Training

### Model

- Architecture: Llama
- Pretraining tokens: 1 trillion
- Precision: bfloat16
  
### Hardware

- TPUs: TPU-v4-512 Pod Slice (supported by [TRC Program](https://sites.research.google/trc/about/))

### Software

- Training Framework: [JAX](https://github.com/jax-ml/jax), [Flax](https://github.com/google/flax)

## Disclaimer

This model family was trained to demonstrate the effectiveness of eliminating outlier features during pre-training and thereby improving quantization-friendliness. All models are base models; no instruction tuning or human alignment has been applied. They are not intended for chat, conversation, or assistant use, and they may produce toxic or harmful content. Their primary intended use is evaluating benchmark performance degradation after low-bit quantization.
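As a rough illustration of that use case, the sketch below applies naive symmetric round-to-nearest 4-bit quantization to the linear weights in place. This is weight-only, and therefore much milder than the 4-4-4 (weight/activation/KV-cache) setting in the ablation table; it is not the evaluation pipeline used in the paper.

```python
# Illustrative weight-only 4-bit round-to-nearest quantization (symmetric,
# per output channel). A quick sensitivity probe, not the paper's 4-4-4 setup.
import torch
import torch.nn as nn

@torch.no_grad()
def rtn_quantize_linear_(linear: nn.Linear, bits: int = 4) -> None:
    w = linear.weight.data
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit signed
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    linear.weight.data = (w / scale).round().clamp(-qmax - 1, qmax) * scale

def rtn_quantize_model_(model: nn.Module, bits: int = 4) -> None:
    for name, module in model.named_modules():
        # Quantize decoder linear layers; leave the LM head in full precision.
        if isinstance(module, nn.Linear) and "lm_head" not in name:
            rtn_quantize_linear_(module, bits)
```

Comparing perplexity on a held-out corpus slice before and after calling `rtn_quantize_model_(model)` gives a first-order sense of a checkpoint's sensitivity to low-bit weight quantization.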

## Citation

```bibtex
@article{park2025osp,
      title={Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models}, 
      author={Jungwoo Park and Taewhoo Lee and Chanwoong Yoon and Hyeon Hwang and Jaewoo Kang},
      year={2025},
      eprint={2506.19697},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.19697}, 
}
```