---
license: apache-2.0
language:
- en
library_name: transformers
datasets:
- allenai/olmo-mix-1124
---

# BPE Baseline
This 8B model was trained from scratch with a traditional subword BPE tokenizer and serves as the baseline in our experiments.

The model was trained with the OLMo 2 7B architecture and pretraining data. It has a context length of 4,096 tokens and was trained on 321B tokens. The tokenizer has a vocabulary size of 200k.

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("UW/OLMo2-8B-BPE")
model = AutoModelForCausalLM.from_pretrained("UW/OLMo2-8B-BPE")

tokenizer.convert_ids_to_tokens(tokenizer.encode("By the way, I am a fan of the Milky Way."))
# ['By', 'Ġthe', 'Ġway', ',', 'ĠI', 'Ġam', 'Ġa', 'Ġfan', 'Ġof', 'Ġthe', 'ĠMilky', 'ĠWay', '.']
```
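
To go beyond tokenization, the sketch below generates a short continuation and double-checks the numbers stated above. The dtype, prompt, and greedy decoding settings are illustrative choices, and it assumes the context length is exposed through the standard `max_position_embeddings` config field; adjust as needed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("UW/OLMo2-8B-BPE")
model = AutoModelForCausalLM.from_pretrained("UW/OLMo2-8B-BPE", torch_dtype=torch.bfloat16)

# Sanity check against the card: ~200k-entry vocabulary, 4,096-token context.
print(len(tokenizer), model.config.max_position_embeddings)

# Plain continuation from a base (non-instruction-tuned) model; greedy decoding for reproducibility.
inputs = tokenizer("By the way, I am a fan of", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```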

## Citation
```bibtex
@misc{liu-etal-2025-superbpe,
  title={SuperBPE: Space Travel for Language Models},
  author={Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi},
  year={2025},
  eprint={2503.13423},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.13423},
}
```