grammar
- dist/index.html +1 -1
- src/index.html +1 -1
- ultra_blog.md +2 -2
dist/index.html
CHANGED
@@ -273,7 +273,7 @@
 <p> All the techniques we'll cover in this book tackle one or several of the following three key challenges, which we'll bump into repeatedly:</p>
 <ol>
 <li><strong>Memory usage:</strong> This is a hard limitation - if a training step doesn't fit in memory, training cannot proceed.</li>
-<li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
+<li><strong>Compute efficiency:</strong> We want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
 <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
 </ol>
 <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). Finding the right balance is key to scaling training.</p>
src/index.html
CHANGED
@@ -273,7 +273,7 @@
 <p> All the techniques we'll cover in this book tackle one or several of the following three key challenges, which we'll bump into repeatedly:</p>
 <ol>
 <li><strong>Memory usage:</strong> This is a hard limitation - if a training step doesn't fit in memory, training cannot proceed.</li>
-<li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
+<li><strong>Compute efficiency:</strong> We want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
 <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
 </ol>
 <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). Finding the right balance is key to scaling training.</p>
ultra_blog.md
CHANGED
@@ -120,8 +120,8 @@ following three key challenges, which we'll keep bumping into throughout the
 book:
 
 1. **Memory Usage** : it's a hard limitation - if a training step doesn't fit in memory, training cannot proceed
-2. **Compute Efficiency** : we want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.
-3. **Communication overhead** : we want to minimize communication overhead as it keeps GPUs idle. To
+2. **Compute Efficiency** : we want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.
+3. **Communication overhead** : we want to minimize communication overhead as it keeps GPUs idle. To achieve this we will try to make best use of intra-node (fast) and inter-node (slower) bandwidths as well as overlap communication with compute as much as possible.
 
 In many places we'll see that we can trade one of these (computation,
 communication, memory) for another (e.g. recomputation or Tensor Parallelism).
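
The "overlap communication with compute" point in the edited communication-overhead lines can be made concrete with a short sketch. The example below is a minimal illustration rather than code from the book: it assumes a PyTorch script launched with `torchrun` (which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`) and the NCCL backend, and the tensor names (`grad_bucket`, `next_layer_input`, `weight`) are hypothetical placeholders.

```python
# Minimal sketch of overlapping communication with compute.
# Assumes launch via torchrun on CUDA GPUs with the NCCL backend;
# all tensor names here are illustrative placeholders.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)

    grad_bucket = torch.randn(1024, 1024, device=device)       # gradients to synchronize
    next_layer_input = torch.randn(1024, 1024, device=device)  # work independent of the gradients
    weight = torch.randn(1024, 1024, device=device)

    # Kick off the all-reduce asynchronously: NCCL runs on its own stream,
    # so the GPU can keep computing while the gradients are in flight.
    work = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

    # Compute that overlaps with the in-flight communication.
    activation = next_layer_input @ weight

    # Block only at the point where the reduced gradients are actually needed.
    work.wait()
    grad_bucket /= dist.get_world_size()

    print(f"rank {dist.get_rank()}: activation norm {activation.norm().item():.2f}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run it with, e.g., `torchrun --nproc_per_node=2 overlap_sketch.py` (the filename is arbitrary). The design point is that `async_op=True` returns a work handle, so the matrix multiply proceeds on the GPU while NCCL moves the gradients over the intra-node or inter-node links, and the script only blocks at `work.wait()` when the reduced values are actually needed.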