Update README.md
base_model:
- meta-llama/Llama-3.1-8B
---

# Model Card

<img src="https://huggingface.co/datasets/tokyotech-llm/swallow-math/resolve/main/figures/swallow-code-math-log.png" alt="SwallowCodeMath Icon" width="600">

## Model Summary

This model is a continual pre-training of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on a mix of mathematical datasets from [SwallowMath](https://huggingface.co/datasets/tokyotech-llm/swallow-math) and multilingual text datasets.
The model was trained to evaluate mathematical reasoning and problem-solving performance as part of the SwallowMath ablation experiments (experiment 2).

It was trained on **50 billion tokens** using a mix of 4.8% SwallowMath (rewritten finemath-4+), 13.1% code, and 82% multilingual text, following the setup described in the [SwallowMath paper](https://arxiv.org/abs/XXXX.XXXXX).
Training was performed using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.9.0).
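
As a rough, back-of-the-envelope illustration (not part of the original card), the stated mix corresponds to approximately the following per-source token budgets out of the 50B total; the percentages are taken from the summary above, and the exact counts in the actual training recipe may differ:

```python
# Approximate per-source token budgets implied by the stated data mix.
# Percentages come from the model summary above; this is only indicative.
TOTAL_TOKENS_B = 50.0  # total continual pre-training budget, in billions of tokens

mix = {
    "SwallowMath (rewritten finemath-4+)": 0.048,
    "code": 0.131,
    "multilingual text": 0.82,
}

for source, fraction in mix.items():
    print(f"{source}: ~{TOTAL_TOKENS_B * fraction:.2f}B tokens")
# SwallowMath (rewritten finemath-4+): ~2.40B tokens
# code: ~6.55B tokens
# multilingual text: ~41.00B tokens
```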
## Use

```python
# pip install -q transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/<model-name>"
device = "cuda"  # for GPU usage, or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

inputs = tokenizer.encode("Solve the equation 2x + 3 = 7:", return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100)
print(tokenizer.decode(outputs[0]))
```

Details are in the paper’s Appendix.

- Megatron-LM (version core_r0.9.0) for training
- lm-evaluation-harness for evaluation
- BigCodeBench for code evaluation
## Evaluation

The model was evaluated using the setup described in the SwallowMath paper, with the lm-evaluation-harness and BigCodeBench. Benchmarks include mathematical reasoning (GSM8K, MATH), code generation (HumanEval), and general tasks (OpenBookQA, TriviaQA, HellaSwag, SQuAD 2.0, XWINO, MMLU, BBH).
Results are reported for checkpoints at 10B, 20B, 30B, 40B, and 50B tokens.
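
For reference, the sketch below shows one way such benchmarks can be run with the lm-evaluation-harness Python API (v0.4+). The task names, batch size, and harness version here are illustrative assumptions, not necessarily the exact configuration used in the SwallowMath paper.

```python
# pip install -q lm-eval  # lm-evaluation-harness; illustrative setup only
import lm_eval

# <model-name> is the placeholder used elsewhere in this card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tokyotech-llm/<model-name>",
    tasks=["gsm8k", "hellaswag"],  # example tasks; the paper reports more benchmarks
    batch_size=8,
)
print(results["results"])  # per-task metric dictionary
```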
### Evaluation Results (SwallowMath experiment 2)

| Tokens (B) | OpenBookQA | TriviaQA | HellaSwag | SQuAD2.0 | XWINO | MMLU | HumanEval | GSM8K | BBH | MATH |
|------------|------------|----------|-----------|----------|-------|------|-----------|-------|-----|------|
| 10 | 0.3720 | 0.6643 | 0.5970 | 0.3443 | 0.9015 | 0.6343 | 0.3439 | 0.5603 | 0.5535 | 0.2480 |
| 20 | 0.3800 | 0.6580 | 0.5946 | 0.3428 | 0.8994 | 0.6293 | 0.3762 | 0.6156 | 0.5669 | 0.2860 |
| 30 | 0.3660 | 0.6618 | 0.5964 | 0.3470 | 0.9011 | 0.6298 | 0.3530 | 0.6262 | 0.6383 | 0.3040 |
| 40 | 0.3700 | 0.6610 | 0.5973 | 0.3535 | 0.9088 | 0.6358 | 0.3738 | 0.6422 | 0.6237 | 0.3100 |
| 50 | 0.3800 | 0.6637 | 0.5972 | 0.3537 | 0.9045 | 0.6337 | 0.3683 | 0.6535 | 0.6414 | 0.3160 |
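
As a quick illustrative reading of the table (a calculation derived from the rows above, not an additional result), the absolute gains on the math- and code-focused benchmarks between the 10B and 50B checkpoints are:

```python
# Absolute score changes between the 10B and 50B checkpoints,
# taken directly from the table above.
scores_10b = {"GSM8K": 0.5603, "MATH": 0.2480, "HumanEval": 0.3439}
scores_50b = {"GSM8K": 0.6535, "MATH": 0.3160, "HumanEval": 0.3683}

for bench in scores_10b:
    print(f"{bench}: +{scores_50b[bench] - scores_10b[bench]:.4f}")
# GSM8K: +0.0932
# MATH: +0.0680
# HumanEval: +0.0244
```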
## Citation

```bibtex
@misc{fujii2025rewritingpretrainingdata,
      title={Rewriting Pre-Training Data: Boosting LLM Performance in Math and Code},