eswardivi committed
Commit 473bc73 · verified · 1 Parent(s): 8dcd783

Update README.md

Files changed (1):
  1. README.md +0 -14
README.md CHANGED
@@ -213,20 +213,6 @@ print(tokenizer.decode(tokens[0], skip_special_tokens=True))
  </details>
 
 
- ### Model Architecture
-
- The model is a decoder-only transformer similar to the LLaMA ([Touvron et al., 2023](https://arxiv.org/abs/2307.09288)) architecture with the following modifications:
-
- | Parameters | Hidden Size | Layers | Heads | Sequence Length |
- |----------------|-------------|--------|-------|-----------------|
- | 1,644,417,024 | 2048 | 24 | 32 | 4096 |
-
- * **Position Embeddings**: Rotary Position Embeddings ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) applied to the first 25% of head embedding dimensions for improved throughput following [Black et al. (2022)](https://arxiv.org/pdf/2204.06745.pdf).
- * **Normalization**: LayerNorm ([Ba et al., 2016](https://arxiv.org/abs/1607.06450)) with learned bias terms as opposed to RMSNorm ([Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467)).
- * **Biases**: We remove all bias terms from the model except for attention Q,K,V projections ([Bai et al., 2023](https://arxiv.org/abs/2309.16609)).
- * **Tokenizer**: We use Arcade100k, a BPE tokenizer extended from OpenAI's [`tiktoken.cl100k_base`](https://github.com/openai/tiktoken). We split digits into individual tokens following findings by [Liu & Low (2023)](https://arxiv.org/abs/2305.14201).
-
-
  ## Use and Limitations
 
  ### Intended Use
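
The removed architecture notes describe a partial-rotary scheme: rotary position embeddings are applied only to the first 25% of each head's dimensions, as in GPT-NeoX-style models ([Black et al., 2022](https://arxiv.org/pdf/2204.06745.pdf)). As a rough illustration only (this is not the model's actual code; the helper names, `rotary_pct`, and the shapes are assumptions based on the removed table, where head_dim = 2048 / 32 = 64), a minimal PyTorch sketch might look like this:

```python
# Minimal sketch of partial rotary position embeddings (assumed names, not the
# model's actual implementation). Only the first `rotary_pct` fraction of each
# head's dimensions is rotated; the remainder passes through unchanged.
import torch

def build_rope_cache(seq_len: int, rot_dim: int, base: float = 10000.0):
    # Standard RoPE frequencies, computed only for the rotated sub-dimension.
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2).float() / rot_dim))
    positions = torch.arange(seq_len).float()
    freqs = torch.outer(positions, inv_freq)      # (seq_len, rot_dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)       # (seq_len, rot_dim)
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rope(x, cos, sin, rotary_pct=0.25):
    # x: (batch, heads, seq_len, head_dim); rotate only the leading slice.
    rot_dim = int(x.shape[-1] * rotary_pct)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x_rot = x_rot * cos + rotate_half(x_rot) * sin
    return torch.cat((x_rot, x_pass), dim=-1)

# Illustrative shapes only: 32 heads with head_dim 64 (2048 hidden / 32 heads),
# so just the first 16 dimensions of each head are rotated.
q = torch.randn(1, 32, 128, 64)
cos, sin = build_rope_cache(seq_len=128, rot_dim=int(64 * 0.25))
q = apply_partial_rope(q, cos, sin, rotary_pct=0.25)
```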
 