---
license: llama3.3
datasets:
  - tokyotech-llm/swallow-math
language:
  - en
  - ja
base_model:
  - meta-llama/Llama-3.1-8B
---

# Model Card

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowCodeMath Icon" width="600">

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/finemath-rewriting.png" width="800">

## Model Summary

This model is a continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on a mix of mathematical data from [SwallowMath](https://huggingface.co/datasets/tokyotech-llm/swallow-math), code, and multilingual text.
It was trained to evaluate mathematical reasoning and problem-solving performance as part of the SwallowMath ablation experiments (experiment 2).

The model was trained on **50 billion tokens** using a mix of approximately 4.8% SwallowMath (finemath-4+ rewritten), 13.1% code, and 82.0% multilingual text, following the setup described in the [SwallowMath paper](https://arxiv.org/abs/2505.02881).
Training was performed with [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.9.0).

## Use

### Generation

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/<model-name>"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

inputs = tokenizer.encode("Solve the equation 2x + 3 = 7:", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```

## Training

### Model

- **Architecture**: Llama-3.1
- **Pretraining tokens**: 50B
- **Precision**: bfloat16
- **Sequence length**: 8,192
- **Tokenizer**: Llama-3 tokenizer

### Data

The training mix consists of:

- **Mathematical Data** (~4.84%):
  - SwallowMath (finemath-4+ rewritten): 2.4B tokens
- **Code Data** (~13.12%):
  - SwallowCode (Syntax, Pylint Filtered): 6.5B tokens
- **Multilingual Text** (~82.04%):
  - Japanese Wikipedia: 0.84B tokens
  - Japanese Swallow Corpus v2: 33.0B tokens
  - Laboro-ParaCorpus: 0.22B tokens
  - English Wikipedia: 1.1B tokens
  - English Cosmopedia: 3.3B tokens
  - English DCLM: 2.2B tokens

Details are in the paper’s Appendix.
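
As a sanity check, the mix percentages above follow directly from the per-dataset token counts listed; the short Python sketch below reproduces them (token counts are taken verbatim from the list, in billions).

```python
# Reproduce the stated mix percentages from the per-dataset token counts above
# (all counts in billions of tokens, as listed in this card).
mix = {
    "math": {"SwallowMath (finemath-4+ rewritten)": 2.4},
    "code": {"SwallowCode (syntax/Pylint filtered)": 6.5},
    "multilingual text": {
        "Japanese Wikipedia": 0.84,
        "Japanese Swallow Corpus v2": 33.0,
        "Laboro-ParaCorpus": 0.22,
        "English Wikipedia": 1.1,
        "English Cosmopedia": 3.3,
        "English DCLM": 2.2,
    },
}

total = sum(t for group in mix.values() for t in group.values())  # ~49.56B tokens
for category, group in mix.items():
    share = sum(group.values()) / total
    print(f"{category}: {share:.2%}")  # ~4.84%, ~13.12%, ~82.04%
```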

### Hardware

- GPUs: 64 NVIDIA H100 (94GB)
- Interconnect: InfiniBand NDR200
- Supercomputer: TSUBAME, Institute of Science Tokyo

### Software

- Megatron-LM (version core_r0.9.0) for training
- lm-evaluation-harness for evaluation
- BigCodeBench for code evaluation

## Evaluation

The model was evaluated using the setup described in the SwallowMath paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include mathematical reasoning (GSM8K, MATH), code generation (HumanEval), and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, BBH).
Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
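
As one concrete example, a mathematical-reasoning benchmark such as GSM8K can be run with lm-evaluation-harness roughly as sketched below (assuming its v0.4-style Python API). The task name, few-shot count, and batch size here are illustrative assumptions rather than the paper's exact evaluation configuration, and `<model-name>` is the same placeholder used in the generation example above.

```python
# pip install -q lm_eval  # lm-evaluation-harness (v0.4+ Python API assumed)
import lm_eval

# Illustrative GSM8K run; few-shot count and batch size are assumptions,
# not necessarily the settings used in the SwallowMath paper.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tokyotech-llm/<model-name>,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=4,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```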

### Evaluation Results (SwallowMath experiment 2)

| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | HumanEval | GSM8K | BBH | MATH |
|------------|------------|----------|-----------|----------|-------|------|-----------|-------|-----|------|
| 10 | 0.3720 | 0.6643 | 0.5970 | 0.3443 | 0.9015 | 0.6343 | 0.3439 | 0.5603 | 0.5535 | 0.2480 |
| 20 | 0.3800 | 0.6580 | 0.5946 | 0.3428 | 0.8994 | 0.6293 | 0.3762 | 0.6156 | 0.5669 | 0.2860 |
| 30 | 0.3660 | 0.6618 | 0.5964 | 0.3470 | 0.9011 | 0.6298 | 0.3530 | 0.6262 | 0.6383 | 0.3040 |
| 40 | 0.3700 | 0.6610 | 0.5973 | 0.3535 | 0.9088 | 0.6358 | 0.3738 | 0.6422 | 0.6237 | 0.3100 |
| 50 | 0.3800 | 0.6637 | 0.5972 | 0.3537 | 0.9045 | 0.6337 | 0.3683 | 0.6535 | 0.6414 | 0.3160 |

## Citation

```bibtex
@misc{fujii2025rewritingpretrainingdataboosts,
      title={Rewriting Pre-Training Data Boosts LLM Performance in Math and Code}, 
      author={Kazuki Fujii and Yukito Tajima and Sakae Mizuki and Hinari Shimada and Taihei Shiotani and Koshiro Saito and Masanari Ohi and Masaki Kawamura and Taishi Nakamura and Takumi Okamoto and Shigeki Ishida and Kakeru Hattori and Youmi Ma and Hiroya Takamura and Rio Yokota and Naoaki Okazaki},
      year={2025},
      eprint={2505.02881},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.02881}, 
}
```