The Little Baby
- A barebones GPT-style LLM implementation: pure Python, zero dependencies.
Description
The Little Baby is a minimalist language model (LLM) crafted entirely in pure Python using hand-built, NumPy-like constructs. It requires no external packages, libraries, or frameworks to function. Both training and inference are achieved through low-level operations and hand-built logic, making this project ideal for educational deep dives and experimental tinkering.
This repository is designed to reveal the inner mechanics of a GPT-style transformer model and demystify the "magic" behind modern language models through readable and hackable code.
Audience
This project is perfect for:
- Curious learners wanting to dissect how GPTs work from the ground up.
- Researchers experimenting with primitive architectures.
- Engineers exploring early-stage LLM behaviors.
- Anyone who enjoys coding like it's 2010: no imports, just raw power.
Inspiration
This project draws its spark from modern titans in the world of machine learning:
- Sebastian Raschka - acclaimed for his lucid teaching style and groundbreaking contributions to deep learning, making complex concepts accessible to learners and practitioners alike.
- Andrej Karpathy - influential in shaping the landscape of computer vision and generative models, while championing open-source AI education that empowers a global community of developers.
- Yann Dubois - instrumental in designing scalable evaluation frameworks for large language models, notably AlpacaEval and AlpacaFarm, which bring automation closer to the nuance of human feedback.
Their work inspired the spirit of transparency, curiosity, and simplicity that fuels The Little Baby - a model built not for production, but for understanding.
- "Build it, break it, learn from it." - The Baby Philosophy
Project Goals
This endeavor is structured around key targets designed to deliver meaningful outcomes:
- Build a GPT-like model using only Python + NumPy-like constructs.
- Support training from scratch on plain text files.
- Provide clear code for attention mechanisms, tokenization, and backprop.
- Encourage experimentation and modification.
Directory Files
Each run generates a set of unique files, identified by a GUID tag. These files capture different aspects of the model's execution:
Config
configs/config_<GUID>.txt
A config file containing the configuration of each iteration.
Report
outputs/report_<GUID>.txt
A comprehensive log containing training analysis and performance metrics.
Model Snapshot
models/model_<GUID>.pkl
The model object, including learned weights and biases (the internal parameters).
Tokenizer Snapshot
models/tokenizer_<GUID>.pkl
The tokenizer object, including the vocabulary of the input data and its positioning.
Completion Output
outputs/completion_<GUID>.txt
The raw generated text from the model's inference - your baby's words in print!
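For illustration, here is a minimal sketch of how such GUID-tagged artifact paths could be produced; the folder and file patterns follow the convention above, but the helper itself is hypothetical and not part of the project's code:

```python
import uuid
from pathlib import Path

def make_run_paths(base_dir: str = ".") -> dict:
    """Build GUID-tagged artifact paths following the naming convention above (illustrative)."""
    run_id = str(uuid.uuid4())  # unique GUID tag for this run
    base = Path(base_dir)
    return {
        "config": base / "configs" / f"config_{run_id}.txt",
        "report": base / "outputs" / f"report_{run_id}.txt",
        "model": base / "models" / f"model_{run_id}.pkl",
        "tokenizer": base / "models" / f"tokenizer_{run_id}.pkl",
        "completion": base / "outputs" / f"completion_{run_id}.txt",
    }

paths = make_run_paths()
print(paths["model"])  # e.g. models/model_<GUID>.pkl
```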
Next Steps
Let's keep The Little Baby alive and help it grow into a full-blown member of the NumPy family!
This means:
- Evolving from hand-crafted loops to efficient vectorized operations.
- Embracing numerical abstractions while maintaining full transparency.
- Exploring performance tricks, batch parallelism, and experimental features.
- Bridging the gap between simplicity and capability, one token at a time.
The journey from babbling to brilliance starts here. Let's raise this little one right!
License Summary
You're free to:
- Use it for any purpose: personal, educational, or commercial
- Suggest ideas and contribute improvements
- Fork it and build upon the code
- Sell it or use it in a product
As long as:
- You reference the original author and project clearly in any public distribution or commercial use
Credits
The Little Baby owes its lineage to the brilliant minds in its AI family tree:
- Owner: Koureas Stavros | Product Architect BI / AI - lovingly crafted and cared for
- Father: OpenAI GPT 4.1 - provider of deep generative DNA and thoughtful token flow
- Mother: Google Gemini 2.5 - donor of wide context windows and clever architectural chromosomes
- Godparent: Claude Sonnet 4.0 - gentle guide and lifelong companion, whispering wisdom and weaving clarity
Together, they gifted the foundational strands that allowed this little one to generate helpful code and take its first linguistic steps.
Instructions
To get started with this project, clone the code, download the tokenizers and pre-trained models if needed, and follow the setup steps below to run the notebook and select your desired configuration.
Get objects
- You can access the code on GitHub (https://github.com/koureasstavros/TheLittleBaby), simply clone the repository.
- You can access the pre-trained tokenizers and models on Hugging Face (https://huggingface.co/koureasstavros/TheLittleBaby), simply download the tokenizer and model files. If you have a slow internet connection, check the analysis table and pick a specific GUID for the tokenizer and model. The tokenizer and model files are needed only if you are going to perform finetuning or inference without training your own.
- Then, you should:
- place the tokenizer file or tokenizer files into the tokenizers folder.
- place the model file or model files into the models folder.
Start the Notebook
- Open the .ipynb file in a Python kernel (e.g. Jupyter, VS Code, Colab).
Select Path
- Choose the relative path between the .ipynb file and the folders:
  - `same`
  - `<path>`
Select Plan
- Choose one of the following plan modes:
  - `train`
  - `finetune`
  - `inference`
That's it!
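For illustration only, the selection might end up looking like the following notebook cell; the variable names (`path`, `plan`, `guid`, `prompt`) are assumptions for this sketch, not necessarily the notebook's actual identifiers:

```python
# Hypothetical configuration cell - variable names are illustrative, not the notebook's real ones.
path = "same"     # or a relative path such as "../TheLittleBaby"
plan = "train"    # one of: "train", "finetune", "inference"
guid = ""         # for "finetune"/"inference": GUID of an existing model/tokenizer pair
prompt = ""       # for "inference": the text the model should continue
```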
What to expect
In Baby's world, each option has its own little job, and below you'll discover what each one does and the cuddly objects it gives back in return.
Train
- Begins training using parameters defined in earlier Python blocks.
- A model file containing the weights will be generated with format `model_<guid>`.
- A tokenizer file containing the vocabulary will be generated with format `tokenizer_<guid>`.
- A report file containing the training analysis will be generated with format `report_<guid>`.
- A completion file containing the generation will be generated with format `completion_<guid>`, using an empty prompt.
Finetune
- Begins finetuning using a base model and a custom training dataset.
- Requires the GUID of the base model to locate `model_<guid>`.
- A model file containing the weights will be generated with format `model_<guid>_finetuned`.
- A tokenizer file containing the vocabulary will be generated with format `tokenizer_<guid>_finetuned`.
- A report file containing the training analysis will be generated with format `report_<guid>_finetuned`.
- A completion file containing the generation will be generated with format `completion_<guid>_finetuned`, using an empty prompt.
Inference
- Requires the GUID of the trained model to locate `model_<guid>`.
- You must also provide a prompt for the model inference to respond to.
- A completion file containing the generation will be generated with format `completion_<guid>_<yyyymmddhhmmss>`, using the prompt.
After many hours of training on a single document of multiple Shakespeare works using a laptop CPU, The Little Baby learns to babble. Its speech is primitive and childlike - just enough to make you smile and realize... the baby is alive. While its capabilities are minimal, its structure is maximal in transparency. Every token, gradient, and parameter is visible and malleable.
*Keep in mind that if you're running a process in VSCode and your workstation, PC, or laptop enters hibernation, the process will resume automatically once the device is powered back on.
Cry. Babble. Speak. Repeat.
Here come the smartest little settings to help the model learn and grow big and strong from this data:
- Age 3 Months - 33bd6583-1b87-4469-b55e-0ccb8fd0441c - Coos and gurgles begin. Sound, not speech, yet something's brewing.
- Age 6 Months - 180eeb27-b1b4-4427-9734-c70e10da2005 - Loud, random cries. It's not talking, but it's definitely expressive.
- Age 12 Months - 5f13a2ab-113a-4c2c-8abd-40384bdd8854 - Joyful noise with hints of intention. Real words still warming up.
- Age 24 Months - cb632ce3-3f3b-432b-b24f-9171005f205e - Words arrive: chaotic, quirky, delightful. Syntax? Optional.
- Age 48 Months - 12b8b053-6c14-42aa-a957-89b809e6f785 - Mini Philosopher Mode: stories, opinions, even jokes. Communication unlocked.
*Keep in mind that these are pre-trained model executions available for finetune or inference. You can bypass the training phase by simply downloading the models and using them directly.
Parameters
These hyperparameters collectively define the training process, where a model's architecture, specified by its depth (n_layers), width (n_emb), attention span (n_ctx), and attention mechanism (n_heads, head_size), is optimized over a set number of num_epochs using a specific batch_size and learning rate (lr), with dropout applied to improve generalization.
n_ctx
- What it is: The maximum number of tokens (characters, in this case) the model can look at in a single sequence to make a prediction. It's the model's "attention span".
- Size: Directly increases the size of the positional embedding table (n_ctx x n_emb), adding more parameters to the model.
- Speed: Has a major impact. The self-attention mechanism's computation grows quadratically with the context length (O(n_ctx²)). Doubling n_ctx will roughly quadruple the time and memory needed for the attention layers, making it one of the most expensive parameters to increase.
- Quality: A larger n_ctx allows the model to learn longer-range dependencies in the text, which can significantly improve quality for tasks that require understanding context over long passages.
n_emb
- What it is: The size of the vector used to represent each token. It defines the "width" of the model.
- Size: Has a major impact on model size. It increases the size of token and positional embeddings, and scales the weight matrices in the attention and MLP layers, significantly increasing the total parameter count.
- Speed: Increasing n_emb increases the size of nearly all weight matrices in the model. This leads to more parameters, which increases both memory usage and the time required for matrix multiplications. The impact is significant but generally more linear than n_ctx.
- Quality: A larger n_emb gives the model more capacity to learn rich, complex representations of tokens and their relationships. This can lead to a more powerful and accurate model, but also increases the risk of overfitting if the model is too large for the dataset.
dropout
- What it is: A regularization technique where a fraction of neuron activations are randomly set to zero during each training step. This prevents the model from becoming too reliant on any single neuron.
- Size: Has no impact on the number of parameters in the model.
- Speed: Has a negligible impact on training speed and no impact on inference speed (it's disabled during evaluation).
- Quality: Crucial for improving model generalization and preventing overfitting. By forcing the network to learn redundant representations, it makes the model more robust. The value (e.g., 0.1) is the probability of a neuron being dropped.
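As a side illustration of the idea, here is a hedged, dependency-free sketch of inverted dropout (illustrative only, not the project's exact routine):

```python
import random

def dropout(activations, p=0.1, training=True):
    """Zero each activation with probability p and rescale the rest (inverted dropout)."""
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at inference time
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

print(dropout([0.5, 1.0, -0.2, 0.8], p=0.25))  # roughly a quarter of values become 0.0
```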
head_size
- What it is: The total dimensionality of the concatenated attention heads. This dimension is projected from the input embedding (n_emb) to create the Query, Key, and Value matrices.
- Size: Directly increases the number of parameters in each attention block by defining the size of the Q, K, V, and output projection matrices.
- Speed: Directly affects the size of the Q, K, and V projection matrices. A larger head_size increases the number of computations and memory usage within each attention block.
- Quality: A larger head_size gives the model more representational power within the attention mechanism. It must be divisible by n_heads.
n_heads
- What it is: The attention mechanism is split into multiple "heads" that perform attention calculations in parallel. Each head can learn to focus on different types of relationships in the data.
- Size: Has no direct impact on model size, as it only determines how the head_size dimension is partitioned for parallel computation.
- Speed: The computations for each head can be parallelized. On capable hardware, increasing the number of heads might not slow down training significantly if the head_size is kept constant.
- Quality: Allows the model to simultaneously attend to information from different representation subspaces at different positions. This is a core concept of the Transformer and generally leads to a much better model than a single attention head.
n_layers
- What it is: The number of Transformer blocks stacked on top of each other. This defines the "depth" of the model.
- Size: Has a direct, linear impact on model size. Each layer adds a complete set of Transformer block parameters, roughly doubling the model's core parameter count if you double the layers.
- Speed: The impact is linear. Doubling n_layers will roughly double the training time and the number of model parameters, as the input data must pass through each block sequentially.
- Quality: More layers allow the model to learn more complex and abstract features. Deeper models are generally more powerful, but also more prone to overfitting and can be harder to train (though residual connections help mitigate this).
num_epochs
- What it is: The number of times the training process will iterate over the entire training dataset.
- Size: Has no impact on the number of parameters in the model.
- Speed: Directly and linearly impacts total training time. More epochs mean longer training.
- Quality: Too few epochs will lead to an undertrained model (underfitting). Too many can lead to the model memorizing the training data (overfitting), which hurts its performance on new data. The ideal number is usually found by monitoring the validation loss.
batch_size
- What it is: The number of training sequences (each of length n_ctx) processed in one forward/backward pass.
- Size: Has no impact on the number of parameters in the model.
- Speed: A larger batch_size allows for more parallelization, generally leading to faster training (fewer updates per epoch). However, it also requires more memory.
- Quality: This is a trade-off. Larger batches provide a more accurate and stable gradient estimate, but the noise from smaller batches can act as a regularizer, helping the model find a better minimum and generalize better.
lr
- What it is: Controls how much the model's weights are adjusted with respect to the loss gradient. It determines the step size at each iteration.
- Size: Has no impact on the number of parameters in the model.
- Speed: Affects the speed of convergence. A higher lr might converge faster, but risks overshooting the optimal weights. A lower lr is more stable but can be very slow to converge.
- Quality: This is one of the most critical parameters. If it's too high, the training can become unstable and diverge. If it's too low, the model may get stuck in a suboptimal solution or take too long to train. The AdamW optimizer helps adapt the learning rate, but the initial value is still very important.
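For reference, here is a hedged sketch of one full configuration as a plain Python dictionary; the values mirror one row of the report table below, but the dictionary itself is illustrative rather than the notebook's exact structure:

```python
# Illustrative hyperparameter set, mirroring one row of the report table below.
config = {
    "n_ctx": 64,       # context length: tokens the model can attend to at once
    "n_emb": 128,      # embedding width
    "dropout": 0.1,    # probability of zeroing an activation during training
    "head_size": 128,  # total dimensionality across attention heads
    "n_heads": 16,     # head_size must be divisible by n_heads (128 / 16 = 8 per head)
    "n_layers": 16,    # depth: number of stacked Transformer blocks
    "num_epochs": 1,   # passes over the training data
    "batch_size": 16,  # sequences per forward/backward pass
    "lr": 1e-3,        # learning rate for the optimizer
}
```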
Formulas
Even our little language models have their favorite rules to follow - it turns out they quietly cuddle up to some clever mathematical formulas that help them make sense of the world.
Learning Rate
LR_new = LR_old * (B_new / B_old)
The new learning rate (LR_new) is based on the old learning rate (LR_old), the new batch size (B_new), and the old batch size (B_old).
Total Parameters
P = V x H + L x [4 x H^2 + 4 x H x F]
Total parameters are based on vocabulary size (V), head size / embedding size (H), number of layers (L), and feedforward intermediate size (F).
Token Throughput for training
T = 20-40 per P
The number of tokens processed per parameter (P) is roughly 20-40.
FLOPs Throughput for training
F = 6 * T * P
FLOPs are based on a factor of 6 (2 ops for the forward pass and 4 ops for the backward pass), the number of tokens (T), and the number of parameters (P).
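A small sketch of these rules as plain Python helpers; the function names and example numbers are made up, while the formulas are exactly the ones stated above:

```python
# Illustrative helpers implementing the formulas above.

def scaled_learning_rate(lr_old: float, batch_old: int, batch_new: int) -> float:
    """LR_new = LR_old * (B_new / B_old)"""
    return lr_old * (batch_new / batch_old)

def total_parameters(vocab_size: int, hidden: int, layers: int, ff: int) -> int:
    """P = V*H + L * (4*H^2 + 4*H*F)"""
    return vocab_size * hidden + layers * (4 * hidden ** 2 + 4 * hidden * ff)

def training_flops(tokens: int, params: int) -> float:
    """F = 6 * T * P (2 ops forward, 4 ops backward per parameter per token)"""
    return 6 * tokens * params

# Example: scale lr = 1e-3 from batch 16 to batch 64, then size a small toy model.
print(scaled_learning_rate(1e-3, 16, 64))          # 0.004
p = total_parameters(vocab_size=65, hidden=128, layers=4, ff=4 * 128)
print(p, training_flops(tokens=20 * p, params=p))  # tokens chosen as ~20 per parameter
```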
Report Analysis
Given Shakespeare's works combined into a single document of 32,777 paragraphs, 12,519 sentences, 202,651 words, and 1,075,394 characters/tokens for learning, and 500 characters/tokens for inference:
version | n_ctx | n_emb | dropout | head_size | n_heads | n_layers | n_epochs | s_batch | lr | batch execution | epoch execution | train execution | inference execution | quality score | model size | baby's brain |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
v0.0.1 | 8 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 0.125s | 7200s | 7200s | 8s | 1/100 | 29,577,062 | fb546251-ec1c-4e00-a713-765693d8c5cf |
v0.0.1 | 8 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 4.50s | 37355s | 37355s | 13s | 1/100 | 58,183,507 | c6832bb3-3f49-493d-9548-62d46065c1e0 |
v0.0.1 | 8 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 0.5s | 41802s | 41802s | 14s | 1/100 | 117,188,617 | 33bd6583-1b87-4469-b55e-0ccb8fd0441c |
v0.0.1 | 16 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 0.25s | 19916s | 19916s | 14s | 1/100 | 29,561,884 | 17e84fc6-57f9-4843-a0f2-6150e7c7f169 |
v0.0.1 | 16 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 0.25s | 60851s | 60851s | 14s | 1/100 | 56,987,898 | ecb6a3b1-ffd5-4cbd-a3e0-d9a9716dacbd |
v0.0.1 | 16 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 1.0s | 83749s | 83749s | 26s | 25/100 | 116,160,341 | 180eeb27-b1b4-4427-9734-c70e10da2005 |
v0.0.1 | 32 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 0.5s | 53771s | 53771s | 12s | 12/100 | 28,310,070 | e64dd257-c048-441b-ad08-47275b22cc0b |
v0.0.1 | 32 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 3.0s | 97984s | 97984s | 23s | 25/100 | 56,292,724 | 465e5804-17af-412c-8bf6-808a34cdf617 |
v0.0.1 | 32 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 2.0s | 134234s | 134234s | 54s | 27/100 | 114,114,671 | 5f13a2ab-113a-4c2c-8abd-40384bdd8854 |
v0.0.1 | 64 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 2.00s | 137095s | 137095s | 39s | 27/100 | 28,302,412 | 0cbeae2b-2884-434d-8fdf-b8a12d8d50c4 |
v0.0.1 | 64 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 3.0s | 237971s | 237971s | 45s | 30/100 | 56,104,284 | e65d4a59-a816-4ffa-b8ac-935db1064433 |
v0.0.1 | 64 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 4.0s | 328598s | 328598s | 88s | 32/100 | 112,890,591 | cb632ce3-3f3b-432b-b24f-9171005f205e |
v0.0.1 | 128 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 4.5s | 320999s | 320999s | 26s | 42/100 | 28,523,148 | be5bf515-5850-41de-9072-af8faca7d27a |
v0.0.1 | 128 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | s | s | s | s | |||
v0.0.1 | 128 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 10.0s | 763757s | 763757s | 199s | 43/100 | 111,737,990 | 12b8b053-6c14-42aa-a957-89b809e6f785 |
v0.0.1 | 256 | 32 | 0.1 | 32 | 16 | 2 | 1 | 16 | 1e-3 | 3.00s | 228208s | 228208s | 26s | 23/100 | 1,323,911 | b3aedc6d-da9a-4398-b067-faeca1afc6da |
v0.0.1 | 256 | 64 | 0.1 | 64 | 16 | 2 | 1 | 16 | 1e-3 | 2.00s | 143777s | 143777s | 25s | 25/100 | 2,585,851 | 652d3409-24a5-4057-b482-9fd9e32fc484 |
v0.0.1 | 64 | 64 | 0.1 | 64 | 16 | 4 | 4 | 16 | 1e-3 | 0.60s | 218232s | 218235s | 9s | 27/100 | 7,367,190 | 82689609-5b39-4fd7-8a42-5d2f04dabf7a |
*Keep in mind that quality should never be assumed without scrutiny: it is evaluated by a larger language model against specific criteria, and such models may not produce the same assessment across different runs or contexts.
Observations
While playing and exploring with our tiny language models, we noticed a few adorable quirks and clever behaviors; here are some of the sweet observations we made along the way.
- When training, if head_size is multiplied, the model size and total training time are multiplied as well.
- When training, if n_layers is multiplied, the model size and total training time are multiplied as well.
- When training, if vocab_size is multiplied, the model size and total training time are multiplied as well.
- During inference, if the cache is enabled, generation is much faster (avoiding the O(T²) recomputation), because previously processed positions do not need to be recalculated at each step (see the sketch after this list).
- During inference with x max tokens for generation:
  - if the output type is plain text, it will have x tokens
  - if the output type is JSON, it will have y tokens where y >= x, because it might contain special characters, for example new lines, which in JSON are represented as two characters ("\" and "n")
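To illustrate the caching observation above, here is a hedged, dependency-free toy showing why reusing cached keys avoids quadratic recomputation; it is conceptual only, not the project's implementation:

```python
# Conceptual sketch of why caching helps during generation (illustrative only).
# Without a cache, every new token recomputes "keys" for all previous positions.
# With a cache, each position's key is computed once and reused.

def fake_key(token):
    # Stand-in for the real per-token key/value projection (assumed, for illustration).
    return hash(token) % 1000

def generate_without_cache(prompt_tokens, steps):
    work = 0
    tokens = list(prompt_tokens)
    for _ in range(steps):
        keys = [fake_key(t) for t in tokens]  # recomputed from scratch: O(T) per step
        work += len(keys)
        tokens.append(sum(keys) % 50)         # dummy "next token"
    return work                               # total work grows roughly as O(T^2)

def generate_with_cache(prompt_tokens, steps):
    work = 0
    tokens = list(prompt_tokens)
    cache = [fake_key(t) for t in tokens]     # computed once for the prompt
    work += len(cache)
    for _ in range(steps):
        nxt = sum(cache) % 50                 # reuse cached keys: O(1) new work per step
        tokens.append(nxt)
        cache.append(fake_key(nxt))
        work += 1
    return work                               # total work grows roughly as O(T)

print(generate_without_cache(range(10), 100))  # much larger
print(generate_with_cache(range(10), 100))     # much smaller
```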
Further Thoughts
Let's imagine what shiny new toys and big upgrades the little model needs to turn into a grown-up LLM who knows all about the big wide world!
Known DataSets
DataSet Availability | DataSet Usage | DataSet Name | DataSet Tokens |
---|---|---|---|
open | train | SlimPajama | 627B |
open | train | RedPajama v1 | 1T |
open | train | RedPajama v2 | 30T |
open | eval | HellaSwag | 30T |
Known Architectures
Model | Type | Parameters | Input Tokens | Output Tokens | Training Model Tokens | Training Model Flops | Training Environment | Training Environment Flops /s | Training Content | Training Duration |
---|---|---|---|---|---|---|---|---|---|---|
GPT2 | s | 117M | 1024 | Shared | 3.3B | 2.3e18F | 1-2 x A100 | 100P | WebText (Reddit outbound links with ≥3 karma; ~40GB of filtered internet text) | 60D |
GPT2 | m | 335M | 1024 | Shared | 3.3B | 7e18F | 4-8 x A100 | 200P | Same as Small; byte-level BPE tokenization, 50,257 vocab size | 60D |
GPT2 | l | 774M | 1024 | Shared | 3.3B | 15e18F | 8-16 x V100 | 400P | Same as Small; trained with causal LM objective | 60D |
GPT2 | xl | 1.5B | 1024 | Shared | 3.3B | ~30e18F | 16-32 x V100 | 800P | Same as Small; trained with causal LM objective | 60D |
GPT3 | s | 125M | 2048 | Shared | 300B | 2.25e21F | 1-2 x A100 | 100P | Common Crawl (filtered), WebText2, Books1/2, Wikipedia (~570GB filtered) | 180D |
GPT3 | m | 350M | 4096 | Shared | 300B | 6.3e21F | 8-16 x A100 | 200P | Same as Small; scaled architecture with 24 layers and 16 attention heads | 180D |
GPT3 | l | 760M | 16384 | 4096 | 300B | 3.7e21F | 100-200 x A100 | 400P | Same as Small; deeper model with wider layers and more attention heads | 180D |
GPT3 | xl | 6.7B | 2048 | Shared | 300B | ~1.2e22F | 32-64 x A100 | 800P | Common Crawl, WebText2, Books1/2, Wikipedia (~570GB filtered) | 180D |
GPT4 | s | 1B | 8192 | 8192 | 6B | 1.8e21F | 100-200 x A100 | 100P | Filtered Common Crawl, Books, Wikipedia, WebText2, code, academic papers | 160D |
GPT4 | m | 13B | 32768 | 8192 | 1.7T | 9.4e23F | 400-600 x A100 | 400P | Same as Small; with broader multilingual and multimodal data | 160D |
GPT4 | l | 65B | 128000 | 4096 | 13T | 3e25F | 2K-4K x A100 | 1E | Massive curated dataset: text, code, images, audio (for GPT-4o), RLHF tuning | 90D |
LLAMA2 | s | 7B | 4096 | Shared | 2T | 1.5e24F | 32-64 x A100 | 400P | Publicly available web data (filtered), books, code, academic papers | 180D |
LLAMA2 | m | 13B | 4096 | Shared | 2T | 2.6e24F | 128-256 x A100 | 400P | Same as Small; with additional curated datasets for scaling | 180D |
LLAMA2 | l | 70B | 4096 | Shared | 2T | 14e24F | 1024+ x A100 | 800P | Same as Small; plus enhanced filtering, grouped-query attention optimization | 180D |
LLAMA3 | s | 8B | 8000 | Shared | 15T | 7.2e24F | 64-128 x A100 | 700P | Books, Wikipedia, GitHub, StackExchange | 70D |
LLAMA3 | m | 70B | 128000 | Shared | 15T | 63e24F | 512-1024 x A100 | 800P | Books, Wikipedia, GitHub, StackExchange | 70D |
LLAMA3 | l | 405B | 128000 | Shared | 15T | 365e24F | 1024+ x A100 | 1E | Books, Wikipedia, GitHub, StackExchange | 70D |
LLAMA4 Scout | s | 109B total / 17B active | 10000000 | Shared | ~30T | ~8e25F | 32-64 x H100 | ~400T | Text, image, video (multimodal) | Unknown |
LLAMA4 Maverick | m | 400B total / 17B active | 10000000 | Shared | ~30T | ~38e25F | 128-256 x H100 | ~3200T | Text, image, code, multilingual data | Unknown |
LLAMA4 Behemoth | l | 2T total / 288B active | 10000000 | Shared | ~30T | ~100e25F | 32K+ x H100 | Unknown | STEM-heavy, multimodal, synthetic distill. | Unknown |
GPT-4o-nano | s | - | 128000 | 4096 | - | - | - | - | - | - |
GPT-4o-mini | m | - | 128000 | 16384 | - | - | - | - | - | - |
GPT-4o | l | - | 128000 | 4096 | - | - | - | - | - | - |
GPT-4.1-nano | s | - | 1000000 | 32768 | - | - | - | - | - | - |
GPT-4.1-mini | m | - | 1000000 | 32768 | - | - | - | - | - | - |
GPT-4.1 | l | - | 1000000 | 32768 | - | - | - | - | - | - |
o1-mini | m | - | 200000 | 100000 | - | - | - | - | - | - |
o1 | l | - | 200000 | 100000 | - | - | - | - | - | - |
o3-mini | s | - | 200000 | 100000 | - | - | - | - | - | - |
o3 | m | - | 200000 | 100000 | - | - | - | - | - | - |
o3-pro | l | - | 200000 | 100000 | - | - | - | - | - | - |
o4-mini | s | - | 200000 | 100000 | - | - | - | - | - | - |
o4 | m | - | 200000 | 100000 | - | - | - | - | - | - |
o4-pro | l | - | 200000 | 100000 | - | - | - | - | - | - |
Grok-3 | - | - | 131072 | 16384 | - | - | - | - | - | - |
Gemini 2.0 | - | - | 1048576 | 8192 | - | - | - | - | - | - |
Gemini 2.0 Flash | - | - | 1048576 | 8192 | - | - | - | - | - | - |
Gemini 2.5 | - | - | 1048576 | 65535 | - | - | - | - | - | - |
Gemini 2.5 Pro | - | - | 1048576 | 65535 | - | - | - | - | - | - |
Claude Sonnet 3.5 | - | - | 200000 | 4096 | - | - | - | - | - | - |
Claude Sonnet 3.7 | - | - | 200000 | 8192 | - | - | - | - | - | - |
Claude Sonnet 4 | - | - | 200000 | 64000 | - | - | - | - | - | - |
*Do not try to directly relate Training Model Flops, Training Environment Flops, and Training Duration, as other factors also play a role, such as number of epochs, numeric precision, parallel efficiency, memory bandwidth, thermal limitations, etc.
Terminology
Core Concepts
Transformer - The backbone of most LLMs. It processes input all at once (not word-by-word) using a technique called self-attention, which helps the model understand relationships between words.
Parameters - The internal settings (weights) that the model learns during training. More parameters equals more learning capacity.
Embedding - A way to turn words into numbers. These numbers (vectors) capture meaning, so similar words have similar embeddings.
Model Architecture
Layer - A building block of the model which transforms the input data and passes it to the next. LLMs have many layers stacked together.
Embedding Layer - Converts tokens into vectors.
Attention Layer - Applies self-attention to understand relationships.
Feed-Forward Layer - Adds complexity and depth to the model's understanding.
Head - A sub-unit inside an attention layer. Each head focuses on different aspects of the input (e.g., grammar, relationships, facts).
Multi-Head Attention - Uses multiple heads in parallel to capture diverse patterns in the data.
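As a toy illustration of the attention terms above, here is a hedged, dependency-free sketch of scaled dot-product attention for a single query; all names and numbers are made up for the example, and the project's real code differs:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector (toy example)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# Three previous tokens, each with a 4-dimensional key and value.
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
values = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
query = [1.0, 0.0, 0.0, 0.0]
print(attend(query, keys, values))  # blends values, weighted toward the most similar keys
```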
Training Process
Training - The process of teaching the model by showing it lots of text and adjusting its parameters to reduce errors. It involves feeding data, calculating predictions, comparing them to actual results, and updating weights.
Epoch - One full pass through the training data. Usually repeated many times to help the model learn better.
Batch - A small group of training examples processed together. This makes training faster and more efficient.
Iteration - One update to the model's parameters. If you have 10,000 samples and a batch size of 100, you'll do 100 iterations per epoch.
Gradient Descent - The method used to adjust parameters during training. It helps the model get better by reducing errors step-by-step.
Loss Function - A mathematical formula that measures how far off the model's predictions are from the correct answers. The goal is to minimize this loss during training.
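To make the training vocabulary above concrete, here is a hedged, minimal sketch of gradient descent on a one-parameter model; it is purely illustrative, and a real training loop is far richer:

```python
# Toy gradient descent: fit w so that the prediction w * x matches the targets.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs; the true w is 2.0
w = 0.0           # the single "parameter"
lr = 0.1          # learning rate
                  # here one batch = the whole dataset, so 1 iteration per epoch

for epoch in range(20):                       # each epoch is a full pass over the data
    grad = 0.0
    loss = 0.0
    for x, y in data:                         # one batch
        pred = w * x
        loss += (pred - y) ** 2 / len(data)   # mean squared error (the loss function)
        grad += 2 * (pred - y) * x / len(data)
    w -= lr * grad                            # one iteration: update the parameter

print(round(w, 3), round(loss, 5))            # w approaches 2.0, loss approaches 0
```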
Inference Process
Inference - When the model uses what it learned to generate answers. This is what happens when you chat with it.
Zero-shot Learning - The model solves tasks it hasn't seen before, using general knowledge.
Few-shot Learning - The model is given a few examples before solving a task.
Hallucination - When the model makes up facts or gives incorrect information confidently.
Evaluation
MMLU (Massive Multitask Language Understanding) - A benchmark that tests how well a model performs across 57 subjects (like math, law, and history). Scores range from 0 to 100.
GLUE (General Language Understanding Evaluation) - A set of tasks used to measure how well a model understands language. Includes things like sentiment analysis and question answering.
Performance
FLOPs (Floating Point Operations) - A measure of how much computing power is needed. More FLOPs = more expensive and slower processing. GPT-3 uses ~350 billion FLOPs per token.
Latency - How long it takes for the model to respond. Lower latency = faster answers.
References
Yann Dubois
https://www.youtube.com/watch?v=9vM4p9NN0Ts / Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)
Sebastian Raschka
https://www.youtube.com/watch?v=79F32D9aM8U / Build LLMs From Scratch with Sebastian Raschka #52
https://www.youtube.com/watch?v=Zar2TJv-sE0 / Build an LLM from Scratch 5: Pretraining on Unlabeled Data
Andrej Karpathy
https://www.youtube.com/watch?v=l8pRSuU81PU / Let's reproduce GPT-2 (124M)
https://www.youtube.com/watch?v=EWvNQjAaOHw / How I use LLMs