---
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
---
# Outlier-Safe Pre-Training

[![arXiv](https://img.shields.io/badge/arXiv-2506.19697-b31b1b?style=flat-square)](https://arxiv.org/abs/2506.19697)
[![Models](https://img.shields.io/badge/%F0%9F%A4%97Hugging_Face-Collection-ffd200?style=flat-square)](https://huggingface.co/collections/dmis-lab/outlier-safe-pre-training-osp-685bda10aa1e8a19fcb58ea8)
[![code](https://img.shields.io/badge/Github-Code-keygen.svg?logo=github&style=flat-square)](https://github.com/dmis-lab/Outlier-Safe-Pre-Training)

## Introduction

Quantization plays a crucial role in deploying Large Language Models (LLMs) in resource-constrained environments. However, the presence of outlier features severely hinders low-bit quantization. While many studies address this problem post hoc so that already pre-trained models can be reused, the importance of handling outliers during pre-training itself is often underestimated.

Our work, **Outlier-Safe Pre-Training (OSP)**, proposes a practical approach to training models that are robust to outliers from the start, without sacrificing performance or efficiency. Specifically, OSP focuses on the following goals:

1. πŸ“ˆ**Scaling to production-level training requirements**<br/>
Prior methods for quantization-friendly pre-training are often limited to small-scale experiments (e.g., models under 1B parameters or 100B tokens). In contrast, we train a 1.4B-parameter model on 1 trillion tokens, demonstrating that OSP is effective at production scale.

2. ⚑**Maintaining computational efficiency comparable to standard training**<br/>
A method that prevents outliers but significantly reduces efficiency is unlikely to gain adoption. OSP introduces only a ~2% slowdown while reducing GPU memory usage, making it appealing for those seeking to train quantization-friendly foundation models from scratch.

3. 🧩**Ensuring full compatibility with existing inference pipelines**<br/>
We prioritize compatibility with widely adopted inference frameworks such as vLLM and SGLang. Rather than introducing architectural changes that break compatibility, OSP preserves computational invariance, allowing models to be directly integrated into existing pipelines without additional effort (a toy sketch of this invariance follows this list).
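The sketch below is a generic, minimal illustration of why computational invariance matters: rotating activations by an orthogonal matrix and folding the rotation into adjacent weights leaves a linear layer's output unchanged, so a rotated model still runs on stock dense kernels. It is not OSP's specific transformation.

```python
# Toy illustration of computational invariance: fold an orthogonal rotation Q
# into the weights and apply Q to the inputs; the output is unchanged.
# Generic sketch, not OSP's specific transformation.
import torch

torch.manual_seed(0)
d_in, d_out, n = 64, 32, 4
x = torch.randn(n, d_in)
W = torch.randn(d_out, d_in)

# Random orthogonal Q (a normalized Hadamard matrix would also work).
Q, _ = torch.linalg.qr(torch.randn(d_in, d_in))

y_ref = x @ W.T              # original linear layer
y_rot = (x @ Q) @ (W @ Q).T  # rotated inputs, rotation folded into weights
print(torch.allclose(y_ref, y_rot, atol=1e-4))  # True
```

Because the rotated weights are still ordinary dense matrices, frameworks such as vLLM or SGLang can serve the model without modification.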



## Model Checkpoints

### Final Models

The models were trained on 1 trillion tokens, following the pre-training recipe of [SmolLM](https://huggingface.co/blog/smollm). Specifically, training was conducted using the [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus), a mixture of FineWeb-Edu, Cosmopedia, and Python-Edu.

- [πŸ€— OSP-1.4B-1T-Adam](https://huggingface.co/dmis-lab/OSP-1.4B-1T-Adam): Trained with the standard Adam optimizer, without any modifications.
- [πŸ€— OSP-1.4B-1T-Muon-SSNorm-EmbProj](https://huggingface.co/dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj): Trained with the OSP framework. This is our final model.
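Since the architecture is a standard Llama variant (see Training below), the checkpoints should load with the usual `transformers` workflow. The snippet below is a minimal sketch, not a verified example; check the model pages for the exact configuration.

```python
# Minimal loading sketch (assumes standard Llama configs compatible with
# AutoModelForCausalLM; verify against the model repository).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "dmis-lab/OSP-1.4B-1T-Muon-SSNorm-EmbProj"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Quantization of large language models"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```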


### Ablation Models

<table>
    <thead>
        <tr>
            <th rowspan="2">Model</th>
            <th rowspan="2">Optimizer</th>
            <th rowspan="2">SSNorm</th>
            <th rowspan="2">EmbProj</th>
            <th rowspan="2">Ex. Kurt.</th>
            <th rowspan="2">Had.</th>
            <!-- <th colspan="2">16-16-16</th> -->
            <th colspan="2">4-4-4</th>
        </tr>
        <tr>
            <!-- <th>Avg.</th>
            <th>PPL</th> -->
            <!-- <th>Avg.</th>
            <th>PPL</th>
            <th>Avg.</th>
            <th>PPL</th>
            <th>Avg.</th>
            <th>PPL</th> -->
            <th>Avg.</th>
            <th>PPL</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Adam">πŸ€— OSP-1.4B-100B-Adam</a></td>
            <td>Adam</td>
            <td>βœ—</td>
            <td>βœ—</td>
            <td>1818.56</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.5<br>41.5</td>
            <td>11.4<br>11.4</td> -->
            <!-- <td>39.7<br>40.2</td>
            <td>21.6<br>22.3</td>
            <td>39.7<br>40.3</td>
            <td>21.6<br>22.3</td>
            <td>26.5<br>27.2</td>
            <td>1e5<br>3e4</td> -->
            <td>26.8<br>26.9</td>
            <td>8e4<br>3e4</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-Only">πŸ€— OSP-1.4B-100B-Muon-Only</a></td>
            <td>Muon&dagger;<br/>(w/o Adam)</td>
            <td>βœ—</td>
            <td>βœ—</td>
            <td>361.35</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.0<br>41.0</td>
            <td>11.7<br>11.7</td> -->
            <!-- <td>38.4<br>37.5</td>
            <td>14.8<br>15.4</td>
            <td>38.3<br>37.5</td>
            <td>14.8<br>15.4</td>
            <td>26.3<br>33.3</td>
            <td>1e6<br>24.5</td> -->
            <td>26.3<br>33.1</td>
            <td>8e5<br>24.8</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon">πŸ€— OSP-1.4B-100B-Muon</a></td>
            <td>Muon</td>
            <td>βœ—</td>
            <td>βœ—</td>
            <td>1575.12</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.5<br>41.5</td>
            <td>11.4<br>11.4</td> -->
            <!-- <td>40.0<br>40.6</td>
            <td>13.8<br>12.9</td>
            <td>40.0<br>40.6</td>
            <td>13.8<br>12.9</td>
            <td>29.4<br>38.6</td>
            <td>934.3<br>15.7</td> -->
            <td>29.0<br>38.4</td>
            <td>1e4<br>15.8</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-SSNorm">πŸ€— OSP-1.4B-100B-Muon-SSNorm</a></td>
            <td>Muon</td>
            <td>βœ”</td>
            <td>βœ—</td>
            <td>66.69</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td><strong>41.8</strong><br><strong>41.8</strong></td>
            <td><strong>11.2</strong><br><strong>11.2</strong></td> -->
            <!-- <td><strong>41.0</strong><br><strong>40.8</strong></td>
            <td>12.4<br>12.2</td>
            <td><strong>40.9</strong><br><strong>40.8</strong></td>
            <td>12.4<br>12.2</td>
            <td>36.6<br>38.6</td>
            <td>43.3<br>33.7</td> -->
            <td>36.4<br>38.3</td>
            <td>44.2<br>34.1</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-EmbProj">πŸ€— OSP-1.4B-100B-Muon-EmbProj</a></td>
            <td>Muon</td>
            <td>βœ—</td>
            <td>βœ”</td>
            <td>703.23</td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>40.0<br>40.0</td>
            <td>12.3<br>12.3</td> -->
            <!-- <td>38.4<br>39.2</td>
            <td>14.8<br>13.9</td>
            <td>38.4<br>39.3</td>
            <td>14.8<br>13.9</td>
            <td>31.0<br>36.3</td>
            <td>99.7<br>22.1</td> -->
            <td>30.4<br>36.2</td>
            <td>114.6<br>22.3</td>
        </tr>
        <tr>
            <td><a href="https://huggingface.co/dmis-lab/OSP-1.4B-100B-Muon-SSNorm-EmbProj">πŸ€— OSP-1.4B-100B-Muon-SSNorm-EmbProj</a></td>
            <td>Muon</td>
            <td>βœ”</td>
            <td>βœ”</td>
            <td><strong>0.04</strong></td>
            <td>βœ—<br>βœ”</td>
            <!-- <td>41.4<br>41.4</td>
            <td><strong>11.2</strong><br><strong>11.2</strong></td> -->
            <!-- <td>40.6<br>40.5</td>
            <td><strong>12.2</strong><br><strong>12.1</strong></td>
            <td>40.6<br>40.5</td>
            <td><strong>12.2</strong><br><strong>12.1</strong></td>
            <td><strong>37.9</strong><br><strong>39.1</strong></td>
            <td><strong>19.4</strong><br><strong>13.4</strong></td> -->
            <td><strong>37.5</strong><br><strong>38.9</strong></td>
            <td><strong>19.6</strong><br><strong>13.5</strong></td>
        </tr>
    </tbody>
</table>
&dagger; Disables decoupled embedding optimization, i.e., the embedding layers are trained with the Muon optimizer instead of Adam.

*Ex. Kurt.* is the excess kurtosis of hidden activations (lower means fewer outliers), *Had.* indicates whether a Hadamard rotation is applied at inference time, and *4-4-4* denotes 4-bit weight/activation/KV-cache quantization.
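To make the Ex. Kurt. column concrete: excess kurtosis measures how heavy-tailed the activation distribution is, and it is 0 for a Gaussian. The sketch below shows one possible way to estimate it with forward hooks; the hook locations and aggregation are illustrative assumptions, not the paper's exact measurement protocol.

```python
# Illustrative sketch: estimate excess kurtosis of decoder-layer activations
# via forward hooks. Hook points and aggregation are assumptions.
import torch

def excess_kurtosis(x: torch.Tensor) -> float:
    """Excess kurtosis over all elements of x (0 for a Gaussian)."""
    x = x.detach().float().flatten()
    z = (x - x.mean()) / x.std()
    return z.pow(4).mean().item() - 3.0

@torch.no_grad()
def activation_kurtosis(model, tokenizer, text: str) -> dict:
    stats, handles = {}, []
    for name, module in model.named_modules():
        # Assumed hook points: attention and MLP outputs of each decoder layer.
        if name.endswith("self_attn") or name.endswith("mlp"):
            def hook(_, __, output, name=name):
                h = output[0] if isinstance(output, tuple) else output
                stats[name] = excess_kurtosis(h)
            handles.append(module.register_forward_hook(hook))
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    model(**inputs)
    for h in handles:
        h.remove()
    return stats
```

This returns one value per hooked module; the table above reports a single aggregated value per model.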


## Training

### Model

- Architecture: Llama
- Pretraining tokens: 1 trillion
- Precision: bfloat16
  
### Hardware

- TPUs: TPU-v4-512 Pod Slice (supported by [TRC Program](https://sites.research.google/trc/about/))

### Software

- Training Framework: [JAX](https://github.com/jax-ml/jax), [Flax](https://github.com/google/flax)

## Disclaimer

This model family was trained to demonstrate the effectiveness of eliminating outlier features during pre-training and thereby improving quantization-friendliness. All models are base models; no instruction tuning or human alignment has been applied. They are not intended for chat, conversation, or assistant use, and they may produce toxic or harmful content. Their primary intended use is evaluating benchmark performance degradation after low-bit quantization.
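As a rough illustration of that use case, the sketch below applies naive symmetric round-to-nearest 4-bit quantization to the linear weights in place. This is weight-only, and therefore much milder than the 4-4-4 (weight/activation/KV-cache) setting in the ablation table; it is not the evaluation pipeline used in the paper.

```python
# Illustrative weight-only 4-bit round-to-nearest quantization (symmetric,
# per output channel). A quick sensitivity probe, not the paper's 4-4-4 setup.
import torch
import torch.nn as nn

@torch.no_grad()
def rtn_quantize_linear_(linear: nn.Linear, bits: int = 4) -> None:
    w = linear.weight.data
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit signed
    scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    linear.weight.data = (w / scale).round().clamp(-qmax - 1, qmax) * scale

def rtn_quantize_model_(model: nn.Module, bits: int = 4) -> None:
    for name, module in model.named_modules():
        # Quantize decoder linear layers; leave the LM head in full precision.
        if isinstance(module, nn.Linear) and "lm_head" not in name:
            rtn_quantize_linear_(module, bits)
```

Comparing perplexity on a held-out corpus slice before and after calling `rtn_quantize_model_(model)` gives a first-order sense of a checkpoint's sensitivity to low-bit weight quantization.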

## Citation

```bibtex
@article{park2025osp,
      title={Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models}, 
      author={Jungwoo Park and Taewhoo Lee and Chanwoong Yoon and Hyeon Hwang and Jaewoo Kang},
      year={2025},
      eprint={2506.19697},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.19697}, 
}
```