Update README.md
README.md
CHANGED
```diff
@@ -9,6 +9,6 @@ Following quantization suggestions in section 6.2 in the [Llama-3 paper](https:/
 - `token_embd` and `output` are in F16 precision;
 - The attention layers are in F16 precision;
 - The entirety of the first and final transformer layers are in F16 precision;
-- The feed-forward network (FFN) layers are in low precision.
+- The feed-forward network (FFN) layers are in **low** precision.
 
 For the same total model size, perplexity values might not be favorable compared to more uniformly quantized models, but supposedly this quantization scheme might help with real-world long-context performance and complex tasks. Your mileage may vary.
```
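For readers who want the layer-to-precision mapping spelled out, below is a minimal Python sketch of the scheme the diff describes. It is an illustration, not the code used to produce these files: the GGUF-style tensor names (`token_embd`, `output`, `blk.N.attn_*`, `blk.N.ffn_*`), the `N_LAYERS` value, and the `LOW_PRECISION` choice are all assumptions.

```python
import re

N_LAYERS = 32          # assumption: total transformer blocks in the model
LOW_PRECISION = "q3_k" # assumption: the "low" quant type used for FFN weights

def quant_type(tensor_name: str) -> str:
    """Pick a quantization type per tensor, following the mix above."""
    # Embedding and output tensors stay at F16.
    if tensor_name.startswith(("token_embd", "output")):
        return "f16"
    m = re.match(r"blk\.(\d+)\.(\w+)", tensor_name)
    if m is None:
        return "f16"  # norms, biases, anything unrecognized: keep high precision
    layer, kind = int(m.group(1)), m.group(2)
    # The entirety of the first and final transformer layers stays at F16.
    if layer in (0, N_LAYERS - 1):
        return "f16"
    # Attention tensors stay at F16 in every layer.
    if kind.startswith("attn"):
        return "f16"
    # Feed-forward tensors in the middle layers take the low-precision quant.
    if kind.startswith("ffn"):
        return LOW_PRECISION
    return "f16"

print(quant_type("blk.0.ffn_up.weight"))    # f16 (first layer kept whole)
print(quant_type("blk.5.attn_q.weight"))    # f16 (attention)
print(quant_type("blk.5.ffn_down.weight"))  # q3_k (mid-layer FFN)
```

In practice, llama.cpp's `llama-quantize` exposes related per-tensor controls (e.g. `--token-embedding-type` and `--output-tensor-type`) that cover part of this mapping.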