Update README.md
README.md
@@ -213,20 +213,6 @@ print(tokenizer.decode(tokens[0], skip_special_tokens=True))
 </details>
 
 
-### Model Architecture
-
-The model is a decoder-only transformer similar to the LLaMA ([Touvron et al., 2023](https://arxiv.org/abs/2307.09288)) architecture with the following modifications:
-
-| Parameters    | Hidden Size | Layers | Heads | Sequence Length |
-|---------------|-------------|--------|-------|-----------------|
-| 1,644,417,024 | 2048        | 24     | 32    | 4096            |
-
-* **Position Embeddings**: Rotary Position Embeddings ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) applied to the first 25% of head embedding dimensions for improved throughput following [Black et al. (2022)](https://arxiv.org/pdf/2204.06745.pdf).
-* **Normalization**: LayerNorm ([Ba et al., 2016](https://arxiv.org/abs/1607.06450)) with learned bias terms as opposed to RMSNorm ([Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467)).
-* **Biases**: We remove all bias terms from the model except for attention Q,K,V projections ([Bai et al., 2023](https://arxiv.org/abs/2309.16609)).
-* **Tokenizer**: We use Arcade100k, a BPE tokenizer extended from OpenAI's [`tiktoken.cl100k_base`](https://github.com/openai/tiktoken). We split digits into individual tokens following findings by [Liu & Low (2023)](https://arxiv.org/abs/2305.14201).
-
-
 ## Use and Limitations
 
 ### Intended Use
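For anyone wanting to sanity-check the architecture table that this commit removes, a minimal sketch along the following lines can reproduce the numbers from a downloaded checkpoint. It assumes the `transformers` library; the model ID is a placeholder (this diff does not name one), and the `partial_rotary_factor` attribute name is an assumption that may differ between config classes.

```python
# Minimal sketch (not part of this commit): check a checkpoint's config against
# the architecture table removed above. "<model-id>" is a placeholder, and the
# attribute names follow common Hugging Face conventions; adjust as needed.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("<model-id>")  # placeholder, not named in this diff

print(config.hidden_size)              # expected: 2048
print(config.num_hidden_layers)        # expected: 24
print(config.num_attention_heads)      # expected: 32
print(config.max_position_embeddings)  # expected: 4096

# Rotary embeddings cover the first 25% of each head's dimensions:
head_dim = config.hidden_size // config.num_attention_heads  # 2048 // 32 = 64
rotary_pct = getattr(config, "partial_rotary_factor", 0.25)  # attribute name is an assumption
print(int(head_dim * rotary_pct))      # 0.25 * 64 = 16 rotary dimensions per head
```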
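Similarly, the tokenizer bullet's claim that digits are split into individual tokens can be checked in a couple of lines. Again the tokenizer ID is a placeholder, and the expected output shown in the comment is only what the removed text describes, not something verified here.

```python
# Minimal sketch (not part of this commit): inspect how the tokenizer handles
# digits. "<model-id>" is a placeholder; per the removed bullet, Arcade100k
# should split a digit string into one token per digit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<model-id>")  # placeholder
print(tokenizer.tokenize("12345"))  # expected per the README: one token per digit
```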