typo
Changed files:
- dist/index.html +2 -2
- src/index.html +2 -2
- ultra_blog.md +1 -1
dist/index.html
@@ -90,14 +90,14 @@

<aside>If you have questions or remarks, open a discussion on the <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook/discussions?status=open&type=discussion">Community tab</a>!</aside>

-<p>We'll assume you have some basic knowledge about current LLM architectures and are roughly familiar with how deep learning models are trained, but you can be generally new to distributed training. If needed, you can find information on the basics of model training in the great courses available at <a href="https://www.deeplearning.ai">DeepLearning.ai</a> or in the <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch tutorials</a>. This book can be seen as the second part of a trilogy, following our previous blog post on processing data for pretraining (the so-called “<a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb blog post</a>”). Having read both, you should have almost all the core knowledge you need to fully understand how
+<p>We'll assume you have some basic knowledge about current LLM architectures and are roughly familiar with how deep learning models are trained, but you can be generally new to distributed training. If needed, you can find information on the basics of model training in the great courses available at <a href="https://www.deeplearning.ai">DeepLearning.ai</a> or in the <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch tutorials</a>. This book can be seen as the second part of a trilogy, following our previous blog post on processing data for pretraining (the so-called “<a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb blog post</a>”). Having read both, you should have almost all the core knowledge you need to fully understand how high-performing LLMs are being built nowadays and will just be missing the secret sauce regarding data mixing and architecture choices to complete the recipe (stay tuned for part three…).</p>

<aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team for creating
the template on which we based this blog post.</aside>

<p>The book is built on the following <strong>three general foundations</strong>:</p>

-<p><strong>1. Quick intros on theory and concepts:</strong> Before diving into code and experiments, we want you to understand how each method works at a high level and what its advantages and limits are. For example, you’ll learn about which parts of a language model eat away at your memory, and when during training it happens. You’ll also learn how we can work around memory constraints by parallelizing the models and
+<p><strong>1. Quick intros on theory and concepts:</strong> Before diving into code and experiments, we want you to understand how each method works at a high level and what its advantages and limits are. For example, you’ll learn about which parts of a language model eat away at your memory, and when during training it happens. You’ll also learn how we can work around memory constraints by parallelizing the models and increasing throughput by scaling up GPUs. As a result, you'll understand how the following widget to compute the memory breakdown of a Transformer model works: </p>

<aside>Note that we're still missing pipeline parallelism in this widget. To be added as an exercise for the reader.</aside>

<div class="large-image-background-transparent">
src/index.html
@@ -90,14 +90,14 @@

<aside>If you have questions or remarks, open a discussion on the <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook/discussions?status=open&type=discussion">Community tab</a>!</aside>

-<p>We'll assume you have some basic knowledge about current LLM architectures and are roughly familiar with how deep learning models are trained, but you can be generally new to distributed training. If needed, you can find information on the basics of model training in the great courses available at <a href="https://www.deeplearning.ai">DeepLearning.ai</a> or in the <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch tutorials</a>. This book can be seen as the second part of a trilogy, following our previous blog post on processing data for pretraining (the so-called “<a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb blog post</a>”). Having read both, you should have almost all the core knowledge you need to fully understand how
+<p>We'll assume you have some basic knowledge about current LLM architectures and are roughly familiar with how deep learning models are trained, but you can be generally new to distributed training. If needed, you can find information on the basics of model training in the great courses available at <a href="https://www.deeplearning.ai">DeepLearning.ai</a> or in the <a href="https://pytorch.org/tutorials/beginner/basics/intro.html">PyTorch tutorials</a>. This book can be seen as the second part of a trilogy, following our previous blog post on processing data for pretraining (the so-called “<a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb blog post</a>”). Having read both, you should have almost all the core knowledge you need to fully understand how high-performing LLMs are being built nowadays and will just be missing the secret sauce regarding data mixing and architecture choices to complete the recipe (stay tuned for part three…).</p>

<aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team for creating
the template on which we based this blog post.</aside>

<p>The book is built on the following <strong>three general foundations</strong>:</p>

-<p><strong>1. Quick intros on theory and concepts:</strong> Before diving into code and experiments, we want you to understand how each method works at a high level and what its advantages and limits are. For example, you’ll learn about which parts of a language model eat away at your memory, and when during training it happens. You’ll also learn how we can work around memory constraints by parallelizing the models and
+<p><strong>1. Quick intros on theory and concepts:</strong> Before diving into code and experiments, we want you to understand how each method works at a high level and what its advantages and limits are. For example, you’ll learn about which parts of a language model eat away at your memory, and when during training it happens. You’ll also learn how we can work around memory constraints by parallelizing the models and increasing throughput by scaling up GPUs. As a result, you'll understand how the following widget to compute the memory breakdown of a Transformer model works: </p>

<aside>Note that we're still missing pipeline parallelism in this widget. To be added as an exercise for the reader.</aside>

<div class="large-image-background-transparent">
ultra_blog.md
@@ -31,7 +31,7 @@ to understand where each method comes from.

If you have questions or remarks open a discussion on the [Community tab](https://huggingface.co/spaces/nanotron/ultrascale-playbook/discussions?status=open&type=discussion)!

-We'll assumes you have some simple basic knowledge about current LLM architecture and are roughtly familiar with how deep learning model are trained, but you can be generally new to distributed training. If needed, the basics of model training can be found in great courses found at [DeepLearning.ai](https://www.deeplearning.ai) or on the [PyTorch tutorial sections](https://pytorch.org/tutorials/beginner/basics/intro.html). This book can be seen as the second part of a trilogy following our first blog on processing data for pre-training, the so-called “[FineWeb blog post](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)”. Having read both blog posts, you should have almost all the core knowledge needed to fully understand how
+We'll assumes you have some simple basic knowledge about current LLM architecture and are roughtly familiar with how deep learning model are trained, but you can be generally new to distributed training. If needed, the basics of model training can be found in great courses found at [DeepLearning.ai](https://www.deeplearning.ai) or on the [PyTorch tutorial sections](https://pytorch.org/tutorials/beginner/basics/intro.html). This book can be seen as the second part of a trilogy following our first blog on processing data for pre-training, the so-called “[FineWeb blog post](https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)”. Having read both blog posts, you should have almost all the core knowledge needed to fully understand how performing LLMs are being built nowadays, just missing some final spices regarding data mixing and architecture choices to complete the recipe (stay tuned for part three…).

We are extremely thankful to the whole [distill.pub](https://distill.pub/) team for creating the template on which we based this blog post.