luismirandacruz committed on
Commit ba5f744 · 1 Parent(s): 85945dc

fix merge conflict

Files changed (1)
  1. src/index.html +2 -2
src/index.html CHANGED
@@ -80,7 +80,7 @@
   This open-source book is here to change that. Starting from the basics, we'll walk you through the knowledge necessary to scale the training of large language models from one GPU to tens, hundreds and even thousands of GPUs, illustrating theory with practical code examples and reproducible benchmarks.
   </p>
 
- <p>As the size of the clusters used to train these models grew, various techniques such as data parallelism, tensor parallelism, pipeline parallelism or context parallelism as well as ZeRO or kernel fusion have been invented to makes sure that GPUs are highly utilized at all times. This significantly reduces training time and makes the best use of this expensive hardware. Even more, as the challenge of scaling up AI training goes beyond just building the initial models and teams have found that fine-tuning large models on specialized data often produces the best results, generally involving the same distributed training techniques. In this book we'll progressively go over all of these techniques –from the simplest to the most refined ones– while keeping a single story-line to understand where each method comes from.</p>
+ <p>As the size of the clusters used to train these models grew, various techniques such as data parallelism, tensor parallelism, pipeline parallelism or context parallelism as well as ZeRO or kernel fusion have been invented to makes sure that GPUs are highly utilized at all times. This significantly reduces training time and makes the best use of this expensive hardware. Even more, as the challenge of scaling up AI training goes beyond just building the initial models and teams have found that fine-tuning large models on specialized data often produces the best results, generally involving the same distributed training techniques. In this book we'll progressively go over all of these techniques –from the simplest to the most refined one– while keeping a single story-line to understand where each method comes from.</p>
 
   <aside>If you have questions or remarks open a discussion on the <a href="https://huggingface.co/spaces/nanotron/ultrascale-playbook/discussions?status=open&type=discussion">Community tab</a>!</aside>
 
@@ -434,7 +434,7 @@
   \end{aligned}
   </d-math>
 
- <p>Now let’s have look how things change if we use a lower precision. For stability reasons (see <a target="_self" href="#mixed_precision_training">the mixed-precision training section below</a>) we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations –requiring 2 bytes per parameter and gradient– as well as an additional copy of the model weights and gradients in FP32, thus 12 bytes per parameter in total. In addition to the parameters and gradient, we need to store the optimizer states: for the Adam optimizer, this requires the momentum and the variance usually stored in FP32 for numerical stability, each using 4 bytes. </p>
+ <p>Now, let’s have a look at how things change if we use a lower precision. For stability reasons (see <a target="_self" href="#mixed_precision_training">the mixed-precision training section below</a>), we often don't use full low precision training but a mix of higher and lower precision called "mixed precision"<d-cite bibtex-key="micikevicius2018mixedprecisiontraining"></d-cite>. The default nowadays for mixed precision training is to generally use BF16 for most of the computations –requiring 2 bytes per parameter and gradient– as well as an additional copy of the model weights and gradients in FP32, thus 12 bytes per parameter in total. In addition to the parameters and gradient, we need to store the optimizer states: for the Adam optimizer, this requires the momentum and the variance usually stored in FP32 for numerical stability, each using 4 bytes. </p>
 
   <aside>See some more details below when we cover the ZeRO methods.</aside>
 
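
Note on the paragraph touched at line 437 of src/index.html: its byte counts can be tallied with a few lines of Python. This is only a rough sketch that takes the numbers exactly as that paragraph states them (BF16 parameters and gradients at 2 bytes each, FP32 copies of both at 4 bytes each, and two FP32 Adam states at 4 bytes each). The dictionary keys and function names are hypothetical and are not part of the book or of any library.

```python
# Bytes stored per model parameter, as stated in the paragraph at line 437.
# All names here are illustrative assumptions, not an existing API.
BYTES_WEIGHTS_AND_GRADS = {
    "bf16_params": 2,   # BF16 working copy of the weights
    "bf16_grads": 2,    # BF16 gradients
    "fp32_params": 4,   # additional FP32 copy of the weights
    "fp32_grads": 4,    # additional FP32 copy of the gradients
}
BYTES_ADAM_STATES = {
    "momentum_fp32": 4,  # Adam first moment
    "variance_fp32": 4,  # Adam second moment
}


def bytes_per_parameter(include_optimizer: bool = True) -> int:
    """Sum the per-parameter byte counts, optionally with Adam states."""
    total = sum(BYTES_WEIGHTS_AND_GRADS.values())
    if include_optimizer:
        total += sum(BYTES_ADAM_STATES.values())
    return total


def training_memory_gib(num_params: int, include_optimizer: bool = True) -> float:
    """Memory for weights, gradients and optimizer states in GiB (activations excluded)."""
    return num_params * bytes_per_parameter(include_optimizer) / 2**30


if __name__ == "__main__":
    print(bytes_per_parameter(include_optimizer=False))  # weights + gradients only
    print(bytes_per_parameter(include_optimizer=True))   # plus Adam momentum and variance
    print(f"{training_memory_gib(1_000_000_000):.2f} GiB for a hypothetical 1B-parameter model")
```

Under these assumptions, weights and gradients account for the 12 bytes per parameter quoted in the paragraph, and the two Adam states add 8 more. Activation memory is deliberately left out, since the paragraph does not cover it.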