---
license: apache-2.0
language:
- en
library_name: transformers
datasets:
- allenai/olmo-mix-1124
pipeline_tag: text-generation
---

# SuperBPE
This 11B model was trained from scratch with a SuperBPE tokenizer. [SuperBPE](https://arxiv.org/abs/2503.13423) extends the BPE algorithm to include both traditional subword tokens (contained within word boundaries) and new **superword** tokens (containing parts of multiple words)! It matches the [8B BPE model](https://huggingface.co/UW/OLMo2-8B-BPE) in both train and inference FLOPs.

The model was trained on the OLMo2 pretraining data. It has a context length of 3,000 tokens (to match the effective context size in bytes of a BPE model with a context length of 4,096 tokens) and was trained on 238B tokens. The tokenizer has a vocabulary size of 200k and transitions from learning subword tokens to learning superword tokens at a vocabulary size of 180k.
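
Because superword tokens cover more bytes of text on average, fewer of them are needed for the same input, which is what motivates the shorter 3,000-token context. A minimal sketch for checking the bytes-per-token ratio empirically (assuming both tokenizer repos load as named in this card; the sample text is illustrative):

```python
from transformers import AutoTokenizer

# Tokenizers for the BPE baseline and for this SuperBPE model.
bpe = AutoTokenizer.from_pretrained("UW/OLMo2-8B-BPE")
superbpe = AutoTokenizer.from_pretrained("UW/OLMo2-11B-SuperBPE-t180k")

text = "The quick brown fox jumps over the lazy dog. " * 100
n_bytes = len(text.encode("utf-8"))

# Bytes of text covered by an average token under each tokenizer.
print(f"bytes per BPE token:      {n_bytes / len(bpe.encode(text)):.2f}")
print(f"bytes per SuperBPE token: {n_bytes / len(superbpe.encode(text)):.2f}")
# If SuperBPE covers ~4096/3000 ≈ 1.37x as many bytes per token,
# a 3,000-token SuperBPE context spans roughly the same number of
# bytes as a 4,096-token BPE context.
```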

## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("UW/OLMo2-11B-SuperBPE-t180k")
model = AutoModelForCausalLM.from_pretrained("UW/OLMo2-11B-SuperBPE-t180k")

tokenizer.convert_ids_to_tokens(tokenizer.encode("By the way, I am a fan of the Milky Way."))
# ['ByĠtheĠway', ',ĠIĠam', 'Ġa', 'Ġfan', 'ĠofĠthe', 'ĠMilkyĠWay', '.']
```
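
The same `tokenizer` and `model` can be used for generation in the usual way. A minimal sketch continuing from the snippet above (the prompt and sampling settings are illustrative, not tuned):

```python
inputs = tokenizer("The Milky Way is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```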

## Citation

```bibtex
@misc{liu-etal-2025-superbpe,
      title={SuperBPE: Space Travel for Language Models},
      author={Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi},
      year={2025},
      eprint={2503.13423},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.13423},
}
```