Commit 5ba855e
kashif (HF Staff) committed · 1 Parent(s): dcd9688
Files changed (3)
  1. dist/index.html +1 -1
  2. src/index.html +1 -1
  3. ultra_blog.md +2 -2
dist/index.html CHANGED
@@ -273,7 +273,7 @@
 <p> All the techniques we'll cover in this book tackle one or several of the following three key challenges, which we'll bump into repeatedly:</p>
 <ol>
 <li><strong>Memory usage:</strong> This is a hard limitation - if a training step doesn't fit in memory, training cannot proceed.</li>
-<li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
+<li><strong>Compute efficiency:</strong> We want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
 <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
 </ol>
 <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). Finding the right balance is key to scaling training.</p>
src/index.html CHANGED
@@ -273,7 +273,7 @@
 <p> All the techniques we'll cover in this book tackle one or several of the following three key challenges, which we'll bump into repeatedly:</p>
 <ol>
 <li><strong>Memory usage:</strong> This is a hard limitation - if a training step doesn't fit in memory, training cannot proceed.</li>
-<li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
+<li><strong>Compute efficiency:</strong> We want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
 <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
 </ol>
 <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). Finding the right balance is key to scaling training.</p>
ultra_blog.md CHANGED
@@ -120,8 +120,8 @@ following three key challenges, which we'll keep bumping into throughout the
 book:
 
 1. **Memory Usage** : it's a hard limitation - if a training step doesn't fit in memory, training cannot proceed
-2. **Compute Efficiency** : we want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.
-3. **Communication overhead** : we want to minimize communication overhead as it keeps GPUs idle. To archieve this we will try to make best use of intra-node (fast) and inter-node (slower) bandwidths as well as overlap communication with compute as much as possible.
+2. **Compute Efficiency** : we want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.
+3. **Communication overhead** : we want to minimize communication overhead as it keeps GPUs idle. To achieve this we will try to make best use of intra-node (fast) and inter-node (slower) bandwidths as well as overlap communication with compute as much as possible.
 
 In many places we'll see that we can trade one of these (computation,
 communication, memory) for another (e.g. recomputation or Tensor Parallelism).