Jhayase committed (verified) · Commit efca111 · 1 Parent(s): b8d39f7

Clarify architecture

Files changed (1): README.md +1 -1
README.md CHANGED
@@ -10,7 +10,7 @@ datasets:
  # SuperBPE
  This 11B model was trained from scratch with a SuperBPE tokenizer. [SuperBPE](https://arxiv.org/abs/2503.13423) extends the BPE algorithm to include both traditional subword tokens (contained within word boundaries), as well as new **superword** tokens (containing parts of multiple words)! It matches the [8B BPE model](huggingface.co/UW/OLMo2-8B-BPE) in both train and inference FLOPs.
 
- The model was trained on the OLMo2 pretraining data. It has a context length of 3,000 tokens (to match the effective context size in bytes of a BPE model with a context length of 4,096 tokens), and is trained on 238B tokens. The tokenizer has a vocabulary size of 200k and transitions from learning subword to learning superword tokens at vocabulary size of 180k.
+ The model was trained with a scaled-up version of the Olmo2 7B architecture and the Olmo2 7B pretraining data. It has a context length of 3,000 tokens (to match the effective context size in bytes of a BPE model with a context length of 4,096 tokens), and is trained on 238B tokens. The tokenizer has a vocabulary size of 200k and transitions from learning subword to learning superword tokens at vocabulary size of 180k.
 
  ## Example Usage
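
For context, here is a minimal sketch of how a model like this might be loaded and its superword tokens inspected with the `transformers` library. The repo id `UW/OLMo2-11B-SuperBPE` is hypothetical: this commit does not show the model's actual Hub path or the contents of the README's Example Usage section.

```python
# Minimal sketch, assuming a hypothetical repo id; adjust to the real Hub path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "UW/OLMo2-11B-SuperBPE"  # hypothetical, not shown in this commit

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# SuperBPE tokens can span word boundaries, so some decoded pieces
# contain spaces (superword tokens) alongside ordinary subword tokens.
text = "By the way, I am a fan of the Milky Way."
ids = tokenizer(text)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))

# Generation itself works the same as with any causal LM.
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```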