Update README.md
# Llama-3 8B Instruct 1048k

Gradient incorporates your data to deploy autonomous assistants that power critical operations across your business. To learn more or collaborate on a custom model, drop us a message at [email protected].

This model extends Llama-3 8B's context length from 8k to > 1040k tokens. Developed by Gradient and sponsored by compute from [Crusoe Energy](https://huggingface.co/crusoeai), it demonstrates that SOTA LLMs can learn to operate on long context with minimal training by appropriately adjusting RoPE theta. We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.



**Approach:**

- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
- NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by empirical RoPE theta optimization (see the sketch below)
- Progressive training on increasing context lengths, similar to [Large World Model](https://huggingface.co/LargeWorldModel) [2] (see details below)
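
As a rough illustration of the NTK-aware initialization, the sketch below applies the standard NTK-aware scaling rule [1] to Llama-3 8B's default `rope_theta` of 500,000 for each target context length. The function name, the head dimension of 128, and Python as the language are our own assumptions for illustration; the theta values in the table further down were additionally tuned empirically, so this formula only gives the starting point.

```python
# Illustrative sketch of NTK-aware RoPE theta scaling (reference [1]); this is
# not the exact schedule used for this model, whose final values were tuned further.

def ntk_scaled_rope_theta(base_theta: float, scale: float, head_dim: int = 128) -> float:
    """Scale the RoPE base frequency for a longer context window.

    base_theta: original rope_theta (500,000 in Llama-3 8B's public config)
    scale:      target_context / original_context, e.g. 65536 / 8192 = 8
    head_dim:   per-head rotary dimension (128 for Llama-3 8B)
    """
    return base_theta * scale ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    base = 500_000.0
    for target in (65_536, 262_144, 524_288, 1_048_576):
        s = target / 8_192
        print(f"{target:>9} tokens -> initial rope_theta ~ {ntk_scaled_rope_theta(base, s):,.0f}")
```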
**Infra:**

We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on [Crusoe Energy](https://huggingface.co/crusoeai)'s high-performance L40S cluster. Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare the 524k and 1048k runs to the 65k and 262k runs in the table below).
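The sketch below is only a hypothetical illustration of nesting Ring Attention groups inside data-parallel replicas, as described above. It is not the EasyContext API; the function name, group sizes, and `RING_SIZE` environment variable are assumptions, with the 512-GPU / 8-way figures taken from the table below.

```python
# Hypothetical illustration of nesting ring-attention groups inside data-parallel
# replicas (not the EasyContext API; names and sizes are assumptions).
import os
import torch.distributed as dist

def build_ring_groups(ring_size: int):
    """Partition all ranks into groups of `ring_size` ranks each.

    Ranks within one group would pass KV blocks around a ring (Ring Attention);
    different groups would process different batches (data parallelism).
    """
    world, rank = dist.get_world_size(), dist.get_rank()
    assert world % ring_size == 0, "world size must be divisible by ring size"
    my_group = None
    for start in range(0, world, ring_size):
        ranks = list(range(start, start + ring_size))
        group = dist.new_group(ranks)  # every rank must create every group
        if rank in ranks:
            my_group = group
    return my_group

if __name__ == "__main__":
    dist.init_process_group("nccl")  # assumes launch via torchrun
    # e.g. 512 GPUs with 8-way data parallelism -> 64-rank ring groups
    ring_group = build_ring_groups(ring_size=int(os.environ.get("RING_SIZE", "64")))
```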
**Data:**

For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
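The augmentation procedure itself is not spelled out here, so the following is purely a hypothetical sketch of one common way to build long-context samples from shorter documents: greedily concatenating tokenized SlimPajama documents until a target sequence length is reached. The dataset path, tokenizer, and function are assumptions for illustration only.

```python
# Purely hypothetical sketch (the augmentation used for this model is not
# described here): pack short documents into fixed-length long-context samples.
from itertools import islice
from datasets import load_dataset            # assumed: Hugging Face `datasets`
from transformers import AutoTokenizer       # assumed: tokenizer for length accounting

def pack_documents(docs, tokenizer, target_len=65_536):
    """Greedily concatenate tokenized documents until `target_len` tokens are reached."""
    buffer, n_tokens = [], 0
    for doc in docs:
        ids = tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
        buffer.extend(ids + [tokenizer.eos_token_id])
        n_tokens += len(ids) + 1
        if n_tokens >= target_len:
            yield buffer[:target_len]   # leftover tokens are dropped in this toy version
            buffer, n_tokens = [], 0

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
    sample = next(pack_documents(islice(stream, 50_000), tok, target_len=65_536))
    print(f"packed one sample of {len(sample)} tokens")
```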
**Progressive Training Details:**

|                              | 65k        | 262k       | 524k       | 1048k      |
|------------------------------|------------|------------|------------|------------|
| Initialize From              | Llama-3 8B | 65k        | 262k       | 524k       |
| Sequence Length (2^N)        | 16         | 18         | 19         | 20         |
| RoPE theta                   | 15.3 M     | 207.1 M    | 1.06 B     | 2.80 B     |
| batch_size                   | 1          | 1          | 2          | 2          |
| gradient_accumulation_steps  | 32         | 16         | 1          | 1          |
| Steps                        | 30         | 24         | 50         | 50         |
| Total Tokens                 | 62914560   | 100663296  | 419430400  | 838860800  |
| learning_rate                | 2.00E-05   | 2.00E-05   | 2.00E-05   | 2.00E-05   |
| # GPUs                       | 8          | 32         | 512        | 512        |
| Ring or Data parallelism     | 1          | 1          | 8          | 8          |
| GPU Type                     | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S |
| Minutes to Train (Wall)      | 202        | 555        | 61         | 87         |
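
As a quick sanity check on how the table's rows relate, the snippet below recomputes the Total Tokens row from the sequence length, step count, batch size, gradient accumulation, and the Ring-or-Data-parallelism degree, assuming that last row acts as the data-parallel multiplier. The variable names are ours, not the training code's.

```python
# Recompute the "Total Tokens" row of the table above from the other rows.
# Column order: 65k, 262k, 524k, 1048k. Names are ours, not the training code's.
runs = [
    # (seq_len,  steps, batch, grad_accum, data_parallel)
    (2**16, 30, 1, 32, 1),
    (2**18, 24, 1, 16, 1),
    (2**19, 50, 2, 1, 8),
    (2**20, 50, 2, 1, 8),
]

for seq_len, steps, batch, grad_accum, dp in runs:
    total = seq_len * steps * batch * grad_accum * dp
    print(f"{seq_len:>8} tokens/sample -> {total:>11,d} total tokens")
# Expected: 62,914,560  100,663,296  419,430,400  838,860,800
```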
## The Gradient AI Team