Clarify architecture
README.md
CHANGED
@@ -10,7 +10,7 @@ datasets:
 # SuperBPE
 This 11B model was trained from scratch with a SuperBPE tokenizer. [SuperBPE](https://arxiv.org/abs/2503.13423) extends the BPE algorithm to include both traditional subword tokens (contained within word boundaries), as well as new **superword** tokens (containing parts of multiple words)! It matches the [8B BPE model](huggingface.co/UW/OLMo2-8B-BPE) in both train and inference FLOPs.
 
-The model was trained
+The model was trained with a scaled-up version of the Olmo2 7B architecture and the Olmo2 7B pretraining data. It has a context length of 3,000 tokens (to match the effective context size in bytes of a BPE model with a context length of 4,096 tokens), and is trained on 238B tokens. The tokenizer has a vocabulary size of 200k and transitions from learning subword to learning superword tokens at a vocabulary size of 180k.
 
 ## Example Usage
 
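The hunk ends at the `## Example Usage` heading without showing its body. For readers of this commit, a minimal sketch of how such a model is typically loaded with the standard Hugging Face `transformers` API follows; the repository ID `UW/OLMo2-11B-SuperBPE` and the prompts are placeholders assumed for illustration, not taken from the diff (the card only links `UW/OLMo2-8B-BPE`).

```python
# Minimal usage sketch, assuming the standard `transformers` loading pattern.
# MODEL_ID is a placeholder: the diff does not show this model's actual Hub path.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "UW/OLMo2-11B-SuperBPE"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# SuperBPE tokens may cross word boundaries, so the same text should require
# fewer tokens than with a subword-only BPE tokenizer of similar vocabulary size.
text = "By the way, I am a fan of the Milky Way."
ids = tokenizer(text)["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids))

# Standard generation call; prompts should stay within the 3,000-token context window.
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Printing the tokens is a quick way to see the superword behavior the card describes: with the transition at 180k, the upper part of the 200k vocabulary consists of tokens that span whitespace, which is what lets a 3,000-token context cover roughly as many bytes as 4,096 BPE tokens.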