Update README.md
# Llama-3 8B Instruct 1048k

Gradient incorporates your data to deploy autonomous assistants that power critical operations across your business. To learn more or collaborate on a custom model, drop us a message at [email protected].

This model extends Llama-3 8B's context length from 8k to > 1040k tokens. Developed by Gradient and sponsored by compute from [Crusoe Energy](https://huggingface.co/crusoeai), it demonstrates that SOTA LLMs can learn to operate on long context with minimal training by appropriately adjusting RoPE theta. We trained on 320M total tokens, which is < 0.002% of Llama-3's original pre-training data.



**Approach:**

- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
- NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by empirical RoPE theta optimization (see the sketch below)
- Progressive training on increasing context lengths, similar to [Large World Model](https://huggingface.co/LargeWorldModel) [2] (see details below)
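
As a rough illustration of the NTK-aware initialization, the sketch below applies the standard NTK-aware scaling rule [1] to Llama-3 8B's default `rope_theta` of 500,000 for each target context length. The function name, the head dimension of 128, and Python as the language are our own assumptions for illustration; the theta values in the table further down were additionally tuned empirically, so this formula only gives the starting point.

```python
# Illustrative sketch of NTK-aware RoPE theta scaling (reference [1]); this is
# not the exact schedule used for this model, whose final values were tuned further.

def ntk_scaled_rope_theta(base_theta: float, scale: float, head_dim: int = 128) -> float:
    """Scale the RoPE base frequency for a longer context window.

    base_theta: original rope_theta (500,000 in Llama-3 8B's public config)
    scale:      target_context / original_context, e.g. 65536 / 8192 = 8
    head_dim:   per-head rotary dimension (128 for Llama-3 8B)
    """
    return base_theta * scale ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    base = 500_000.0
    for target in (65_536, 262_144, 524_288, 1_048_576):
        s = target / 8_192
        print(f"{target:>9} tokens -> initial rope_theta ~ {ntk_scaled_rope_theta(base, s):,.0f}")
```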
**Infra:**

We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 1048k tokens on [Crusoe Energy](https://huggingface.co/crusoeai)'s high-performance L40S cluster. Notably, we layered parallelism on top of Ring Attention with a custom network topology to better leverage large GPU clusters in the face of network bottlenecks from passing many KV blocks between devices. This gave us a 33x speedup in model training (compare the 524k and 1048k runs to the 65k and 262k runs in the table below).
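The sketch below is only a hypothetical illustration of nesting Ring Attention groups inside data-parallel replicas, as described above. It is not the EasyContext API; the function name, group sizes, and `RING_SIZE` environment variable are assumptions, with the 512-GPU / 8-way figures taken from the table below.

```python
# Hypothetical illustration of nesting ring-attention groups inside data-parallel
# replicas (not the EasyContext API; names and sizes are assumptions).
import os
import torch.distributed as dist

def build_ring_groups(ring_size: int):
    """Partition all ranks into groups of `ring_size` ranks each.

    Ranks within one group would pass KV blocks around a ring (Ring Attention);
    different groups would process different batches (data parallelism).
    """
    world, rank = dist.get_world_size(), dist.get_rank()
    assert world % ring_size == 0, "world size must be divisible by ring size"
    my_group = None
    for start in range(0, world, ring_size):
        ranks = list(range(start, start + ring_size))
        group = dist.new_group(ranks)  # every rank must create every group
        if rank in ranks:
            my_group = group
    return my_group

if __name__ == "__main__":
    dist.init_process_group("nccl")  # assumes launch via torchrun
    # e.g. 512 GPUs with 8-way data parallelism -> 64-rank ring groups
    ring_group = build_ring_groups(ring_size=int(os.environ.get("RING_SIZE", "64")))
```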
**Data:**

For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
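The augmentation procedure itself is not spelled out here, so the following is purely a hypothetical sketch of one common way to build long-context samples from shorter documents: greedily concatenating tokenized SlimPajama documents until a target sequence length is reached. The dataset path, tokenizer, and function are assumptions for illustration only.

```python
# Purely hypothetical sketch (the augmentation used for this model is not
# described here): pack short documents into fixed-length long-context samples.
from itertools import islice
from datasets import load_dataset            # assumed: Hugging Face `datasets`
from transformers import AutoTokenizer       # assumed: tokenizer for length accounting

def pack_documents(docs, tokenizer, target_len=65_536):
    """Greedily concatenate tokenized documents until `target_len` tokens are reached."""
    buffer, n_tokens = [], 0
    for doc in docs:
        ids = tokenizer(doc["text"], add_special_tokens=False)["input_ids"]
        buffer.extend(ids + [tokenizer.eos_token_id])
        n_tokens += len(ids) + 1
        if n_tokens >= target_len:
            yield buffer[:target_len]   # leftover tokens are dropped in this toy version
            buffer, n_tokens = [], 0

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
    sample = next(pack_documents(islice(stream, 50_000), tok, target_len=65_536))
    print(f"packed one sample of {len(sample)} tokens")
```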
**Progressive Training Details:**

|                              | 65k        | 262k       | 524k       | 1048k      |
|------------------------------|------------|------------|------------|------------|
| Initialize From              | Llama-3 8B | 65k        | 262k       | 524k       |
| Sequence Length (2^N)        | 16         | 18         | 19         | 20         |
| RoPE theta                   | 15.3 M     | 207.1 M    | 1.06 B     | 2.80 B     |
| batch_size                   | 1          | 1          | 2          | 2          |
| gradient_accumulation_steps  | 32         | 16         | 1          | 1          |
| Steps                        | 30         | 24         | 50         | 50         |
| Total Tokens                 | 62914560   | 100663296  | 419430400  | 838860800  |
| learning_rate                | 2.00E-05   | 2.00E-05   | 2.00E-05   | 2.00E-05   |
| # GPUs                       | 8          | 32         | 512        | 512        |
| Ring or Data parallelism     | 1          | 1          | 8          | 8          |
| GPU Type                     | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S | NVIDIA L40S |
| Minutes to Train (Wall)      | 202        | 555        | 61         | 87         |
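
As a quick sanity check on how the table's rows relate, the snippet below recomputes the Total Tokens row from the sequence length, step count, batch size, gradient accumulation, and the Ring-or-Data-parallelism degree, assuming that last row acts as the data-parallel multiplier. The variable names are ours, not the training code's.

```python
# Recompute the "Total Tokens" row of the table above from the other rows.
# Column order: 65k, 262k, 524k, 1048k. Names are ours, not the training code's.
runs = [
    # (seq_len,  steps, batch, grad_accum, data_parallel)
    (2**16, 30, 1, 32, 1),
    (2**18, 24, 1, 16, 1),
    (2**19, 50, 2, 1, 8),
    (2**20, 50, 2, 1, 8),
]

for seq_len, steps, batch, grad_accum, dp in runs:
    total = seq_len * steps * batch * grad_accum * dp
    print(f"{seq_len:>8} tokens/sample -> {total:>11,d} total tokens")
# Expected: 62,914,560  100,663,296  419,430,400  838,860,800
```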
## The Gradient AI Team