grammar
- dist/index.html +1 -1
- src/index.html +1 -1
- ultra_blog.md +2 -2
dist/index.html
CHANGED
@@ -273,7 +273,7 @@
 <p> All the techniques we'll cover in this book tackle one or several of the following three key challenges, which we'll bump into repeatedly:</p>
 <ol>
 <li><strong>Memory usage:</strong> This is a hard limitation - if a training step doesn't fit in memory, training cannot proceed.</li>
-<li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
+<li><strong>Compute efficiency:</strong> We want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
 <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
 </ol>
 <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). Finding the right balance is key to scaling training.</p>
src/index.html
CHANGED
@@ -273,7 +273,7 @@
 <p> All the techniques we'll cover in this book tackle one or several of the following three key challenges, which we'll bump into repeatedly:</p>
 <ol>
 <li><strong>Memory usage:</strong> This is a hard limitation - if a training step doesn't fit in memory, training cannot proceed.</li>
-<li><strong>Compute efficiency:</strong> We want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
+<li><strong>Compute efficiency:</strong> We want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.</li>
 <li><strong>Communication overhead:</strong> We want to minimize communication overhead, as it keeps GPUs idle. To achieve this, we will try to make the best use of intra-node (fast) and inter-node (slower) bandwidths and to overlap communication with compute as much as possible.</li>
 </ol>
 <p>In many places, we'll see that we can trade one of these (computation, communication, memory) off against another (e.g., through recomputation or tensor parallelism). Finding the right balance is key to scaling training.</p>
ultra_blog.md
CHANGED
@@ -120,8 +120,8 @@ following three key challenges, which we'll keep bumping into throughout the
 book:
 
 1. **Memory Usage** : it's a hard limitation - if a training step doesn't fit in memory, training cannot proceed
-2. **Compute Efficiency** : we want our hardware to spend most time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.
-3. **Communication overhead** : we want to minimize communication overhead as it keeps GPUs idle. To
+2. **Compute Efficiency** : we want our hardware to spend most of its time computing, so we need to reduce time spent on data transfers or waiting for other GPUs to perform work.
+3. **Communication overhead** : we want to minimize communication overhead as it keeps GPUs idle. To achieve this we will try to make best use of intra-node (fast) and inter-node (slower) bandwidths as well as overlap communication with compute as much as possible.
 
 In many places we'll see that we can trade one of these (computation,
 communication, memory) for another (e.g. recomputation or Tensor Parallelism).
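
The "overlap communication with compute" point in the edited communication-overhead lines can be made concrete with a short sketch. The example below is a minimal illustration rather than code from the book: it assumes a PyTorch script launched with `torchrun` (which sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`) and the NCCL backend, and the tensor names (`grad_bucket`, `next_layer_input`, `weight`) are hypothetical placeholders.

```python
# Minimal sketch of overlapping communication with compute.
# Assumes launch via torchrun on CUDA GPUs with the NCCL backend;
# all tensor names here are illustrative placeholders.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
    torch.cuda.set_device(device)

    grad_bucket = torch.randn(1024, 1024, device=device)       # gradients to synchronize
    next_layer_input = torch.randn(1024, 1024, device=device)  # work independent of the gradients
    weight = torch.randn(1024, 1024, device=device)

    # Kick off the all-reduce asynchronously: NCCL runs on its own stream,
    # so the GPU can keep computing while the gradients are in flight.
    work = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

    # Compute that overlaps with the in-flight communication.
    activation = next_layer_input @ weight

    # Block only at the point where the reduced gradients are actually needed.
    work.wait()
    grad_bucket /= dist.get_world_size()

    print(f"rank {dist.get_rank()}: activation norm {activation.norm().item():.2f}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Run it with, e.g., `torchrun --nproc_per_node=2 overlap_sketch.py` (the filename is arbitrary). The design point is that `async_op=True` returns a work handle, so the matrix multiply proceeds on the GPU while NCCL moves the gradients over the intra-node or inter-node links, and the script only blocks at `work.wait()` when the reduced values are actually needed.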