---
license: apache-2.0
language:
- en
library_name: transformers
datasets:
- allenai/olmo-mix-1124
---

# SuperBPE

This 8B model was trained from scratch with a SuperBPE tokenizer. [SuperBPE](https://arxiv.org/abs/2503.13423) extends the BPE algorithm to include both traditional subword tokens (contained within word boundaries) and new **superword** tokens (spanning parts of multiple words)! Because it encodes the same amount of text in fewer tokens, this model is on average **33% more efficient at inference time** than a model trained with BPE.

The model was trained with the OLMo 2 7B architecture and pretraining data. It has a context length of 2,756 tokens (chosen to match the effective context size in bytes of a BPE model with a context length of 4,096 tokens) and was trained on 334B tokens. The tokenizer has a vocabulary size of 200k and transitions from learning subword tokens to learning superword tokens at a vocabulary size of 80k.
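
One way to see the efficiency gap concretely is to encode the same text with this tokenizer and a conventional BPE tokenizer and compare token counts. Below is a minimal sketch; the BPE baseline repo id (`allenai/OLMo-2-1124-7B`) is an assumption for illustration, and any BPE tokenizer could stand in:

```python
from transformers import AutoTokenizer

# SuperBPE tokenizer (this repo) vs. a conventional BPE baseline
# (baseline repo id assumed here, used only for illustration).
superbpe = AutoTokenizer.from_pretrained("UW/OLMo2-8B-SuperBPE-t180k")
bpe = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B")

text = "By the way, I am a fan of the Milky Way."
n_superbpe = len(superbpe.encode(text))
n_bpe = len(bpe.encode(text))

# Fewer tokens for the same text means fewer decoding steps at inference.
print(f"SuperBPE: {n_superbpe} tokens, BPE: {n_bpe} tokens")
print(f"Token savings: {1 - n_superbpe / n_bpe:.0%}")
```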

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("UW/OLMo2-8B-SuperBPE-t180k")
model = AutoModelForCausalLM.from_pretrained("UW/OLMo2-8B-SuperBPE-t180k")

# Superword tokens can cross word boundaries, so this sentence encodes
# into just five tokens (Ġ marks a leading space).
tokenizer.convert_ids_to_tokens(tokenizer.encode("By the way, I am a fan of the Milky Way."))
# ['ByĠtheĠway', ',ĠIĠamĠa', 'ĠfanĠofĠthe', 'ĠMilkyĠWay', '.']
```
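
Text generation works through the standard `generate` API. A minimal sketch (the prompt and decoding settings below are illustrative, not tuned):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("UW/OLMo2-8B-SuperBPE-t180k")
model = AutoModelForCausalLM.from_pretrained("UW/OLMo2-8B-SuperBPE-t180k")

# Each decoding step emits one token; because SuperBPE tokens often
# span several words, a fixed token budget yields more text.
inputs = tokenizer("By the way, I am a fan of", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```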

## Citation

```bibtex
@misc{liu-etal-2025-superbpe,
  title={SuperBPE: Space Travel for Language Models},
  author={Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi},
  year={2025},
  eprint={2503.13423},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.13423},
}
```