👶 The Little Baby

  • A barebones GPT-style LLM implementation – pure Python, zero dependencies.

🧠 Description

The Little Baby is a minimalist GPT-style language model (LLM) crafted entirely in pure Python, without even NumPy. It requires no external packages, libraries, or frameworks to function. Both training and inference are achieved through low-level operations and hand-built logic, making this project ideal for educational deep dives and experimental tinkering.

This repository is designed to reveal the inner mechanics of a GPT-style transformer model and demystify the "magic" behind modern language models through readable and hackable code.

🎯 Audience

This project is perfect for:

  • Curious learners wanting to dissect how GPTs work from the ground up.
  • Researchers experimenting with primitive architectures.
  • Engineers exploring early-stage LLM behaviors.
  • Anyone who enjoys coding like it's 2010 – no imports, just raw power.

🌟 Inspiration

This project draws its spark from modern titans in the world of machine learning:

  • Sebastian Raschka – acclaimed for his lucid teaching style and groundbreaking contributions to deep learning, making complex concepts accessible to learners and practitioners alike.
  • Andrej Karpathy – influential in shaping the landscape of computer vision and generative models, while championing open-source AI education that empowers a global community of developers.
  • Yann Dubois – instrumental in designing scalable evaluation frameworks for large language models, notably AlpacaEval and AlpacaFarm, which bring automation closer to the nuance of human feedback.

Their work inspired the spirit of transparency, curiosity, and simplicity that fuels The Little Baby – a model built not for production, but for understanding.

  • "Build it, break it, learn from it." – The Baby Philosophy

🚀 Project Goals

This endeavor is structured around key targets designed to deliver meaningful outcomes:

  • ✅ Build a GPT-like model using only Python + NumPy-like constructs.
  • ✅ Support training from scratch on plain text files.
  • ✅ Provide clear code for attention mechanisms, tokenization, and backprop.
  • ✅ Encourage experimentation and modification.

📚 Directory Files

Each run generates a set of unique files, identified by a GUID tag. These files capture different aspects of the model's execution:

  • ⚙️ Config
    configs/config_<GUID>.txt
    A config file containing the configuration used for each run.

  • 📝 Report
    outputs/report_<GUID>.txt
    A comprehensive log containing training analysis and performance metrics.

  • 🧠 Model Snapshot
    models/model_<GUID>.pkl
    Model object including the learned weights and biases, which are the model's internal parameters.

  • 🔤 Tokenizer Snapshot
    models/tokenizer_<GUID>.pkl
    Tokenizer object including the vocabulary of the input data and its token-to-index mapping.

  • 🗣️ Completion Output
    outputs/completion_<GUID>.txt
    The raw generated text from the model's inference – your baby's words in print!
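
As a minimal sketch of how these artifacts can be inspected afterwards (assuming they are plain pickle files saved as described above; attribute names inside the objects may differ), loading a saved model and tokenizer might look like this:

    import pickle

    guid = "33bd6583-1b87-4469-b55e-0ccb8fd0441c"  # any GUID from the report analysis table

    # Tokenizer snapshot: vocabulary and token/index mappings
    with open(f"models/tokenizer_{guid}.pkl", "rb") as f:
        tokenizer = pickle.load(f)

    # Model snapshot: learned weights and biases
    with open(f"models/model_{guid}.pkl", "rb") as f:
        model = pickle.load(f)

The notebook performs the equivalent loading itself when you pick the finetune or inference plan; the snippet only shows where the artifacts live.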

🚼 Next Steps

Let's keep The Little Baby alive – and help it grow into a full-blown member of the NumPy family!

This means:

  • 📈 Evolving from hand-crafted loops to efficient vectorized operations.
  • 🧮 Embracing numerical abstractions while maintaining full transparency.
  • 🛠️ Exploring performance tricks, batch parallelism, and experimental features.
  • 🧬 Bridging the gap between simplicity and capability – one token at a time.

The journey from babbling to brilliance starts here. Let's raise this little one right!

⚖️ License Summary

You're free to:

  • ✅ Use it for any purpose – personal, educational, or commercial
  • 💡 Suggest ideas and contribute improvements
  • 🍴 Fork it and build upon the code
  • 💰 Sell it or use it in a product

As long as:

  • 📌 You reference the original author and project clearly in any public distribution or commercial use

👨‍👩‍👧 Credits

The Little Baby owes its lineage to a few brilliant minds in the AI family tree:

  • 👑 Owner: Koureas Stavros | Product Architect BI / AI – lovingly crafted and cared for
  • 🧔 Father: OpenAI GPT 4.1 – provider of deep generative DNA and thoughtful token flow
  • 🧑‍🍼 Mother: Google Gemini 2.5 – donor of wide context windows and clever architectural chromosomes
  • 🧙 Godparent: Claude Sonnet 4.0 – gentle guide and lifelong companion, whispering wisdom and weaving clarity

Together, they gifted the foundational strands that allowed this little one to generate helpful code and take its first linguistic steps.

🧪 Instructions

To get started with this project, clone the code, download the tokenizers and pre-trained models if needed, and follow the setup steps below to run the notebook and select your desired configuration.

Get objects

  • You can access the code on GitHub (https://github.com/koureasstavros/TheLittleBaby); simply clone the repository.
  • You can access the pre-trained tokenizers and models on Hugging Face (https://huggingface.co/koureasstavros/TheLittleBaby); simply download the tokenizer and model files. If you have a slow internet connection, check the analysis table and pick a specific GUID for the tokenizer and model. The tokenizer and model files are only needed if you plan to finetune or run inference without training your own.
  • Then, you should:
    • place the tokenizer file or tokenizer files into the tokenizers folder.
    • place the model file or model files into the models folder.
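
If you prefer scripting the download instead of grabbing the files by hand, here is a hedged sketch using the huggingface_hub helper (an optional tool, not a project dependency; the exact filenames on the Hub follow the naming pattern above and are an assumption here):

    from huggingface_hub import hf_hub_download

    guid = "33bd6583-1b87-4469-b55e-0ccb8fd0441c"  # pick a GUID from the analysis table

    # Download one tokenizer and one model snapshot from the Hugging Face repository
    tokenizer_path = hf_hub_download(
        repo_id="koureasstavros/TheLittleBaby",
        filename=f"tokenizer_{guid}.pkl",  # assumed filename pattern
    )
    model_path = hf_hub_download(
        repo_id="koureasstavros/TheLittleBaby",
        filename=f"model_{guid}.pkl",  # assumed filename pattern
    )
    # Afterwards, place the downloaded files into the tokenizers and models folders.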

Start the Notebook

  • Open the .ipynb file in a Python kernel (e.g. Jupyter, VS Code, Colab).

Select Path

  • Choose the relative path between the .ipynb notebook and the project folders:
    • same
    • <path>

Select Plan

  • Choose one of the following plan modes:
    • train
    • finetune
    • inference

That's it!

🔮 What to expect

In Baby's world, each option has its own little job – and below, you'll discover what each one does and the cuddly objects it gives back in return.

🔧 Train

  • Begins training using parameters defined in earlier Python blocks.
  • A model file containing the weights will be generated with format model_<guid>.
  • A tokenizer file containing the vocabulary will be generated with format tokenizer_<guid>.
  • A report file containing the training analysis will be generated with format report_<guid>.
  • A completion file containing the generated text will be generated with format completion_<guid> using an empty prompt.

🛠️ Finetune

  • Begins finetuning using a base model and a custom training dataset.
  • Requires the GUID of the base model to locate model_<guid>.
  • A model file containing the weights will be generated with format model_<guid>_finetuned.
  • A tokenizer file containing the vocabulary will be generated with format tokenizer_<guid>_finetuned.
  • A report file containing the training analysis will be generated with format report_<guid>_finetuned.
  • A completion file containing the generation will be generated with format completion_<guid>_finetuned using an empty prompt.

💬 Inference

  • Requires the GUID of the trained model to find the model_<guid>.
  • You must also provide a prompt for the model inference to respond to.
  • A completion file containing the generated text will be generated with format completion_<guid>_<yyyymmddhhmmss> using the prompt.
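
For reference, the timestamped completion filename follows a pattern along these lines (a sketch; the notebook's own formatting may differ slightly):

    from datetime import datetime

    guid = "33bd6583-1b87-4469-b55e-0ccb8fd0441c"  # GUID of the trained model
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    completion_path = f"outputs/completion_{guid}_{stamp}.txt"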

After many hours of training on a single document of collected Shakespeare works using a laptop CPU, The Little Baby learns to babble. Its speech is primitive and childlike – just enough to make you smile and realize... the baby is alive. While its capabilities are minimal, its structure is maximal in transparency. Every token, gradient, and parameter is visible and malleable.

*Keep in mind that if you're running a process in VSCode and your workstation, PC, or laptop enters hibernation, the process will resume automatically once the device is powered back on.

๐Ÿผ Cry. Babble. Speak. Repeat.

Here come the smartest little settings to help the model learn and grow big and strong from this data:

  • Age 3 Months - 33bd6583-1b87-4469-b55e-0ccb8fd0441c - Coos and gurgles begin. Sound, not speech – yet something's brewing.
  • Age 6 Months - 180eeb27-b1b4-4427-9734-c70e10da2005 - Loud, random cries. It's not talking, but it's definitely expressive.
  • Age 12 Months - 5f13a2ab-113a-4c2c-8abd-40384bdd8854 - Joyful noise with hints of intention. Real words still warming up.
  • Age 24 Months - cb632ce3-3f3b-432b-b24f-9171005f205e - Words arrive – chaotic, quirky, delightful. Syntax? Optional.
  • Age 48 Months - 12b8b053-6c14-42aa-a957-89b809e6f785 - Mini Philosopher Mode – stories, opinions, even jokes. Communication unlocked.

*Keep in mind that these are pre-trained model executions available for finetune or inference. You can bypass the training phase by simply downloading the models and using them directly.

⚙️ Parameters

These hyperparameters collectively define the training process, where the model's architecture – specified by its depth (n_layers), width (n_emb), attention span (n_ctx), and attention mechanism (n_heads, head_size) – is optimized over a set number of num_epochs using a specific batch_size and learning rate (lr), with dropout applied to improve generalization.

  • n_ctx

    • What it is: The maximum number of tokens (characters, in this case) the model can look at in a single sequence to make a prediction. It's the model's "attention span".
    • Size: Directly increases the size of the positional embedding table (n_ctx x n_emb), adding more parameters to the model.
    • Speed: Has a major impact. The self-attention mechanism's computation grows quadratically with the context length (O(n_ctx²)). Doubling n_ctx will roughly quadruple the time and memory needed for the attention layers, making it one of the most expensive parameters to increase.
    • Quality: A larger n_ctx allows the model to learn longer-range dependencies in the text, which can significantly improve quality for tasks that require understanding context over long passages.
  • n_emb

    • What it is: The size of the vector used to represent each token. It defines the "width" of the model.
    • Size: Has a major impact on model size. It increases the size of token and positional embeddings, and scales the weight matrices in the attention and MLP layers, significantly increasing the total parameter count.
    • Speed: Increasing n_emb increases the size of nearly all weight matrices in the model. This leads to more parameters, which increases both memory usage and the time required for matrix multiplications. The impact is significant but generally more linear than n_ctx.
    • Quality: A larger n_emb gives the model more capacity to learn rich, complex representations of tokens and their relationships. This can lead to a more powerful and accurate model, but also increases the risk of overfitting if the model is too large for the dataset.
  • dropout

    • What it is: A regularization technique where a fraction of neuron activations are randomly set to zero during each training step. This prevents the model from becoming too reliant on any single neuron.
    • Size: Has no impact on the number of parameters in the model.
    • Speed: Has a negligible impact on training speed and no impact on inference speed (it's disabled during evaluation).
    • Quality: Crucial for improving model generalization and preventing overfitting. By forcing the network to learn redundant representations, it makes the model more robust. The value (e.g., 0.1) is the probability of a neuron being dropped.
  • head_size

    • What it is: The total dimensionality of the concatenated attention heads. This dimension is projected from the input embedding (n_emb) to create the Query, Key, and Value matrices.
    • Size: Directly increases the number of parameters in each attention block by defining the size of the Q, K, V, and output projection matrices.
    • Speed: Directly affects the size of the Q, K, and V projection matrices. A larger head_size increases the number of computations and memory usage within each attention block.
    • Quality: A larger head_size gives the model more representational power within the attention mechanism. It must be divisible by n_heads.
  • n_heads

    • What it is: The attention mechanism is split into multiple "heads" that perform attention calculations in parallel. Each head can learn to focus on different types of relationships in the data.
    • Size: Has no direct impact on model size, as it only determines how the head_size dimension is partitioned for parallel computation.
    • Speed: The computations for each head can be parallelized. On capable hardware, increasing the number of heads might not slow down training significantly if the head_size is kept constant.
    • Quality: Allows the model to simultaneously attend to information from different representation subspaces at different positions. This is a core concept of the Transformer and generally leads to a much better model than a single attention head.
  • n_layers

    • What it is: The number of Transformer blocks stacked on top of each other. This defines the "depth" of the model.
    • Size: Has a direct, linear impact on model size. Each layer adds a complete set of Transformer block parameters, roughly doubling the model's core parameter count if you double the layers.
    • Speed: The impact is linear. Doubling n_layers will roughly double the training time and the number of model parameters, as the input data must pass through each block sequentially.
    • Quality: More layers allow the model to learn more complex and abstract features. Deeper models are generally more powerful, but also more prone to overfitting and can be harder to train (though residual connections help mitigate this).
  • num_epochs

    • What it is: The number of times the training process will iterate over the entire training dataset.
    • Size: Has no impact on the number of parameters in the model.
    • Speed: Directly and linearly impacts total training time. More epochs mean longer training.
    • Quality: Too few epochs will lead to an undertrained model (underfitting). Too many can lead to the model memorizing the training data (overfitting), which hurts its performance on new data. The ideal number is usually found by monitoring the validation loss.
  • batch_size

    • What it is: The number of training sequences (each of length n_ctx) processed in one forward/backward pass.
    • Size: Has no impact on the number of parameters in the model.
    • Speed: A larger batch_size allows for more parallelization, generally leading to faster training (fewer updates per epoch). However, it also requires more memory.
    • Quality: This is a trade-off. Larger batches provide a more accurate and stable gradient estimate, but the noise from smaller batches can act as a regularizer, helping the model find a better minimum and generalize better.
  • lr

    • What it is: Controls how much the model's weights are adjusted with respect to the loss gradient. It determines the step size at each iteration.
    • Size: Has no impact on the number of parameters in the model.
    • Speed: Affects the speed of convergence. A higher lr might converge faster, but risks overshooting the optimal weights. A lower lr is more stable but can be very slow to converge.
    • Quality: This is one of the most critical parameters. If it's too high, the training can become unstable and diverge. If it's too low, the model may get stuck in a suboptimal solution or take too long to train. The AdamW optimizer helps adapt the learning rate, but the initial value is still very important.
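
To make these knobs concrete, here is a hypothetical configuration in the spirit of the v0.0.1 runs listed in the report analysis below (the names mirror the bullets above; the notebook's actual variable names may differ):

    # Hypothetical hyperparameter set, similar to one of the reported v0.0.1 runs
    config = {
        "n_ctx": 64,        # context length in tokens (characters)
        "n_emb": 128,       # embedding width
        "dropout": 0.1,     # probability of dropping an activation during training
        "head_size": 128,   # total dimensionality across all attention heads
        "n_heads": 16,      # head_size must be divisible by n_heads (128 / 16 = 8 per head)
        "n_layers": 8,      # number of stacked Transformer blocks
        "num_epochs": 1,    # passes over the training data
        "batch_size": 16,   # sequences per forward/backward pass
        "lr": 1e-3,         # learning rate
    }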

📐 Formulas

Even our little language models have their favorite rules to follow – turns out, they quietly cuddle up to some clever mathematical formulas that help them make sense of the world.

  • Learning Rate - LR_new = LR_old * (B_new / B_old)

    The New Learning Rate (LR_new) is based on the Old Learning Rate (LR_old), the New Batch size (B_new), and the Old Batch size (B_old).

  • Total Parameters - P = V x H + L x [4 x H^2 + 4 x H x F]

    Total parameters are based on Vocabulary Size (V), Head Size / Embedding Size (H), Number of Layers (L), and Feed-forward intermediate Size (F).

  • Token Throughput for training - T = 20-40 per P

    The number of tokens processed per Parameter (P) is roughly 20-40.

  • FLOPs Throughput for training - F = 6 * T * P

    FLOPs are based on the factor 6 (about 2 operations for the forward pass and 4 for the backward pass, per parameter per token), the Number of Tokens (T), and the Number of Parameters (P).
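
A small worked example tying the formulas together (illustrative round numbers only, not measurements from this repository):

    # Hypothetical values for illustration: character-level vocab, small model
    V, H, n_layers, F_ff = 65, 128, 8, 4 * 128   # vocab, embedding/head size, layers, feed-forward size

    # Total parameters: P = V*H + L*(4*H^2 + 4*H*F)
    P = V * H + n_layers * (4 * H ** 2 + 4 * H * F_ff)   # 2,629,760 with these toy numbers

    # Token budget for training: roughly 20-40 tokens per parameter
    tokens_low, tokens_high = 20 * P, 40 * P

    # Training FLOPs: F = 6 * T * P (about 2 ops forward + 4 ops backward)
    flops_low, flops_high = 6 * tokens_low * P, 6 * tokens_high * P

    # Learning-rate scaling when the batch size changes: LR_new = LR_old * (B_new / B_old)
    lr_old, b_old, b_new = 1e-3, 16, 32
    lr_new = lr_old * (b_new / b_old)   # 2e-3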

🔍 Report Analysis

The runs below use the collected Shakespeare works as a single document of 32,777 paragraphs, 12,519 sentences, 202,651 words, and 1,075,394 characters / tokens for learning, with 500 characters / tokens generated for inference.

version | n_ctx | n_emb | dropout | head_size | n_heads | n_layers | n_epochs | s_batch | lr | batch execution | epoch execution | train execution | inference execution | quality evaluation | model size | baby's brain
v0.0.1 | 8 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 0.125s | 7200s | 7200s | 8s | 1/100 | 29,577,062 | fb546251-ec1c-4e00-a713-765693d8c5cf
v0.0.1 | 8 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 4.50s | 37355s | 37355s | 13s | 1/100 | 58,183,507 | c6832bb3-3f49-493d-9548-62d46065c1e0
v0.0.1 | 8 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 0.5s | 41802s | 41802s | 14s | 1/100 | 117,188,617 | 33bd6583-1b87-4469-b55e-0ccb8fd0441c
v0.0.1 | 16 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 0.25s | 19916s | 19916s | 14s | 1/100 | 29,561,884 | 17e84fc6-57f9-4843-a0f2-6150e7c7f169
v0.0.1 | 16 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 0.25s | 60851s | 60851s | 14s | 1/100 | 56,987,898 | ecb6a3b1-ffd5-4cbd-a3e0-d9a9716dacbd
v0.0.1 | 16 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 1.0s | 83749s | 83749s | 26s | 25/100 | 116,160,341 | 180eeb27-b1b4-4427-9734-c70e10da2005
v0.0.1 | 32 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 0.5s | 53771s | 53771s | 12s | 12/100 | 28,310,070 | e64dd257-c048-441b-ad08-47275b22cc0b
v0.0.1 | 32 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 3.0s | 97984s | 97984s | 23s | 25/100 | 56,292,724 | 465e5804-17af-412c-8bf6-808a34cdf617
v0.0.1 | 32 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 2.0s | 134234s | 134234s | 54s | 27/100 | 114,114,671 | 5f13a2ab-113a-4c2c-8abd-40384bdd8854
v0.0.1 | 64 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 2.00s | 137095s | 137095s | 39s | 27/100 | 28,302,412 | 0cbeae2b-2884-434d-8fdf-b8a12d8d50c4
v0.0.1 | 64 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | 3.0s | 237971s | 237971s | 45s | 30/100 | 56,104,284 | e65d4a59-a816-4ffa-b8ac-935db1064433
v0.0.1 | 64 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 4.0s | 328598s | 328598s | 88s | 32/100 | 112,890,591 | cb632ce3-3f3b-432b-b24f-9171005f205e
v0.0.1 | 128 | 128 | 0.1 | 128 | 16 | 4 | 1 | 16 | 1e-3 | 4.5s | 320999s | 320999s | 26s | 42/100 | 28,523,148 | be5bf515-5850-41de-9072-af8faca7d27a
v0.0.1 | 128 | 128 | 0.1 | 128 | 16 | 8 | 1 | 16 | 1e-3 | s | s | s | s | | |
v0.0.1 | 128 | 128 | 0.1 | 128 | 16 | 16 | 1 | 16 | 1e-3 | 10.0s | 763757s | 763757s | 199s | 43/100 | 111,737,990 | 12b8b053-6c14-42aa-a957-89b809e6f785
v0.0.1 | 256 | 32 | 0.1 | 32 | 16 | 2 | 1 | 16 | 1e-3 | 3.00s | 228208s | 228208s | 26s | 23/100 | 1,323,911 | b3aedc6d-da9a-4398-b067-faeca1afc6da
v0.0.1 | 256 | 64 | 0.1 | 64 | 16 | 2 | 1 | 16 | 1e-3 | 2.00s | 143777s | 143777s | 25s | 25/100 | 2,585,851 | 652d3409-24a5-4057-b482-9fd9e32fc484
v0.0.1 | 64 | 64 | 0.1 | 64 | 16 | 4 | 4 | 16 | 1e-3 | 0.60s | 218232s | 218235s | 9s | 27/100 | 7,367,190 | 82689609-5b39-4fd7-8a42-5d2f04dabf7a

*Keep in mind that quality should never be assumed without scrutiny: it is evaluated by a larger language model against specific criteria, and such models may not produce the same assessment across different runs or contexts.

🕵️ Observations

While playing and exploring with our tiny language models, we noticed a few adorable quirks and clever behaviors – here are some of the sweet observations we made along the way.

  • When training, if head_size is multiplied, then the model size and the total training time are multiplied as well.
  • When training, if n_layers is multiplied, then the model size and the total training time are multiplied as well.
  • When training, if vocab_size is multiplied, then the model size and the total training time are multiplied as well.
  • When running inference, if the cache is enabled, generation avoids the O(T²) recomputation and is much faster, as previously processed sequences do not need to be recalculated at each step (see the sketch after this list).
  • When running inference with x max tokens for generation, then
    • if the output type is plain text, it will have x tokens
    • if the output type is JSON, it will have y tokens where y >= x, because it might contain special characters, for example new lines, which in JSON are represented as two characters ("\" and "n")
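
A minimal sketch of the caching idea behind the inference observation above (illustrative only; the notebook's attention code is more involved): keys and values for the prefix are computed once and stored, so each generation step only computes them for the newest token.

    import math

    # Toy key/value cache during generation (pure Python, tiny 2-dimensional vectors)
    k_cache, v_cache = [], []

    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]

    def step(q, k, v):
        """One generation step: cache the new key/value, then attend over everything cached."""
        k_cache.append(k)
        v_cache.append(v)
        weights = softmax([sum(a * b for a, b in zip(q, key)) for key in k_cache])
        # weighted sum of all cached values
        return [sum(w * val[i] for w, val in zip(weights, v_cache)) for i in range(len(v))]

    print(step([1.0, 0.0], [1.0, 0.0], [0.5, 0.5]))   # first token attends only to itself
    print(step([0.0, 1.0], [0.0, 1.0], [0.2, 0.8]))   # second token reuses the cached first key/value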

Further Thoughts

🧠 Let's imagine what shiny new toys and big upgrades the little model needs to turn into a grown-up LLM who knows all about the big wide world!

Known DataSets

DataSet Availability | DataSet Type | DataSet Name | DataSet Tokens
open | train | SlimPajama | 627B
open | train | RedPajama v1 | 1T
open | train | RedPajama v2 | 30T
open | eval | HellaSwag | 30T

Known Architectures

Model | Type | Parameters | Input Tokens | Output Tokens | Training Model Tokens | Training Model Flops | Training Environment | Training Environment Flops /s | Training Content | Training Duration
GPT2 | s | 117M | 1024 | Shared | 3.3B | 2.3e18F | 1-2 x A100 | 100P | WebText (Reddit outbound links with ≥3 karma; ~40GB of filtered internet text) | 60D
GPT2 | m | 335M | 1024 | Shared | 3.3B | 7e18F | 4-8 x A100 | 200P | Same as Small; byte-level BPE tokenization, 50,257 vocab size | 60D
GPT2 | l | 774M | 1024 | Shared | 3.3B | 15e18F | 8-16 x V100 | 400P | Same as Small; trained with causal LM objective | 60D
GPT2 | xl | 1.5B | 1024 | Shared | 3.3B | ~30e18F | 16-32 x V100 | 800P | Same as Small; trained with causal LM objective | 60D
GPT3 | s | 125M | 2048 | Shared | 300B | 2.25e21F | 1-2 x A100 | 100P | Common Crawl (filtered), WebText2, Books1/2, Wikipedia (~570GB filtered) | 180D
GPT3 | m | 350M | 4096 | Shared | 300B | 6.3e21F | 8-16 x A100 | 200P | Same as Small; scaled architecture with 24 layers and 16 attention heads | 180D
GPT3 | l | 760M | 16384 | 4096 | 300B | 3.7e21F | 100-200 x A100 | 400P | Same as Small; deeper model with wider layers and more attention heads | 180D
GPT3 | xl | 6.7B | 2048 | Shared | 300B | ~1.2e22F | 32-64 x A100 | 800P | Common Crawl, WebText2, Books1/2, Wikipedia (~570GB filtered) | 180D
GPT4 | s | 1B | 8192 | 8192 | 6B | 1.8e21F | 100-200 x A100 | 100P | Filtered Common Crawl, Books, Wikipedia, WebText2, code, academic papers | 160D
GPT4 | m | 13B | 32768 | 8192 | 1.7T | 9.4e23F | 400-600 x A100 | 400P | Same as Small; with broader multilingual and multimodal data | 160D
GPT4 | l | 65B | 128000 | 4096 | 13T | 3e25F | 2k-4K x A100 | 1E | Massive curated dataset: text, code, images, audio (for GPT-4o), RLHF tuning | 90D
LLAMA2 | s | 7B | 4096 | Shared | 2T | 1.5e24F | 32-64 x A100 | 400P | Publicly available web data (filtered), books, code, academic papers | 180D
LLAMA2 | m | 13B | 4096 | Shared | 2T | 2.6e24F | 128-256 x A100 | 400P | Same as Small; with additional curated datasets for scaling | 180D
LLAMA2 | l | 70B | 4096 | Shared | 2T | 14e24F | 1024K+ x A100 | 800P | Same as Small; plus enhanced filtering, grouped-query attention optimization | 180D
LLAMA3 | s | 8B | 8000 | Shared | 15T | 7.2e24F | 64-128 x A100 | 700P | Books, Wikipedia, GitHub, StackExchange | 70D
LLAMA3 | m | 70B | 128000 | Shared | 15T | 63e24F | 512-1024 x A100 | 800P | Books, Wikipedia, GitHub, StackExchange | 70D
LLAMA3 | l | 405B | 128000 | Shared | 15T | 365e24F | 1024+ x A100 | 1E | Books, Wikipedia, GitHub, StackExchange | 70D
LLAMA4 Scout | s | 109B total / 17B active | 10000000 | Shared | ~30T | ~8e25F | 32-64 x H100 | ~400T | Text, image, video (multimodal) | Unknown
LLAMA4 Maverick | m | 400B total / 17B active | 10000000 | Shared | ~30T | ~38e25F | 128-256 x H100 | ~3200T | Text, image, code, multilingual data | Unknown
LLAMA4 Maverick | l | 2T total / 288B active | 10000000 | Shared | ~30T | ~100e25F | 32K+ x H100 | Unknown | STEM-heavy, multimodal, synthetic distill. | Unknown
GPT-4o-nano | s | - | 128000 | 4096 | - | - | - | - | - | -
GPT-4o-mini | m | - | 128000 | 16096 | - | - | - | - | - | -
GPT-4o | l | - | 128000 | 4096 | - | - | - | - | - | -
GPT-4.1-nano | s | - | 1000000 | 32768 | - | - | - | - | - | -
GPT-4.1-mini | m | - | 1000000 | 32768 | - | - | - | - | - | -
GPT-4.1 | l | - | 1000000 | 32768 | - | - | - | - | - | -
o1-mini | m | - | 200000 | 100000 | - | - | - | - | - | -
o1 | l | - | 200000 | 100000 | - | - | - | - | - | -
o3-mini | s | - | 200000 | 100000 | - | - | - | - | - | -
o3 | m | - | 200000 | 100000 | - | - | - | - | - | -
o3-pro | l | - | 200000 | 100000 | - | - | - | - | - | -
o4-mini | s | - | 200000 | 100000 | - | - | - | - | - | -
o4 | m | - | 200000 | 100000 | - | - | - | - | - | -
o4-pro | l | - | 200000 | 100000 | - | - | - | - | - | -
Grok-3 | - | - | 131072 | 16384 | - | - | - | - | - | -
Gemini 2.0 | - | - | 1048576 | 8192 | - | - | - | - | - | -
Gemini 2.0 Flash | - | - | 1048576 | 8192 | - | - | - | - | - | -
Gemini 2.5 | - | - | 1048576 | 65535 | - | - | - | - | - | -
Gemini 2.5 Pro | - | - | 1048576 | 65535 | - | - | - | - | - | -
Claude Sonnet 3.5 | - | - | 200000 | 4096 | - | - | - | - | - | -
Claude Sonnet 3.7 | - | - | 200000 | 8192 | - | - | - | - | - | -
Claude Sonnet 4 | - | - | 200000 | 64000 | - | - | - | - | - | -

*Do not try to directly relate Training Model Flops, Training Environment Flops /s, and Training Duration, as other factors also play a role, such as number of epochs, numeric precision, parallel efficiency, memory bandwidth, thermal limitations, etc.

📖 Terminology

🧠 Core Concepts

Transformer – The backbone of most LLMs. It processes input all at once (not word-by-word) using a technique called self-attention, which helps the model understand relationships between words.

Parameters – The internal settings (weights) that the model learns during training. More parameters means more learning capacity.

Embedding – A way to turn words into numbers. These numbers (vectors) capture meaning, so similar words have similar embeddings.
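
As a toy illustration of the embedding idea (invented numbers, character-level like The Little Baby):

    # A tiny, hand-made embedding table: each character maps to a small vector.
    # After training, similar characters end up with similar vectors.
    embedding = {
        "a": [0.2, -0.1, 0.7],
        "b": [0.1, -0.2, 0.6],
        "z": [-0.8, 0.9, 0.0],
    }
    vectors = [embedding[ch] for ch in "ab"]   # the model works on these vectors, not the raw text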

🧮 Model Architecture

Layer – A building block of the model which transforms the input data and passes it to the next. LLMs have many layers stacked together.

Embedding Layer – Converts tokens into vectors.

Attention Layer – Applies self-attention to understand relationships.

Feed-Forward Layer – Adds complexity and depth to the model's understanding.

Head – A sub-unit inside an attention layer. Each head focuses on different aspects of the input (e.g., grammar, relationships, facts).

Multi-Head Attention – Uses multiple heads in parallel to capture diverse patterns in the data.
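
A small sketch of how the attention dimensionality is split across heads, which is why head_size must be divisible by n_heads:

    n_heads, head_size = 4, 16            # 4 heads sharing a 16-dimensional attention space
    per_head = head_size // n_heads       # 4 dimensions per head

    vector = list(range(head_size))       # stand-in for one token's projected representation
    heads = [vector[i * per_head:(i + 1) * per_head] for i in range(n_heads)]
    # each of the 4 heads attends using its own 4-dimensional slice, in parallel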

🔁 Training Process

Training – The process of teaching the model by showing it lots of text and adjusting its parameters to reduce errors. It involves feeding data, calculating predictions, comparing them to actual results, and updating weights.

Epoch – One full pass through the training data. Usually repeated many times to help the model learn better.

Batch – A small group of training examples processed together. This makes training faster and more efficient.

Iteration – One update to the model's parameters. If you have 10,000 samples and a batch size of 100, you'll do 100 iterations per epoch.

Gradient Descent – The method used to adjust parameters during training. It helps the model get better by reducing errors step-by-step.

Loss Function – A mathematical formula that measures how far off the model's predictions are from the correct answers. The goal is to minimize this loss during training.
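
A one-step sketch of the ideas above, with a single weight and invented numbers:

    # One gradient-descent update: nudge the weight against the gradient of the loss
    weight, lr = 0.50, 1e-3
    gradient = 2.4                     # d(loss)/d(weight), as computed by backpropagation
    weight = weight - lr * gradient    # 0.4976, a tiny step that should reduce the loss

    # Iterations per epoch = samples / batch_size, e.g. 10_000 / 100 = 100 updates per epoch
    iterations_per_epoch = 10_000 // 100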

🧪 Inference Process

Inference – When the model uses what it learned to generate answers. This is what happens when you chat with it.

Zero-shot Learning – The model solves tasks it hasn't seen before, using general knowledge.

Few-shot Learning – The model is given a few examples before solving a task.

Hallucination – When the model makes up facts or gives incorrect information confidently.

📊 Evaluation

MMLU (Massive Multitask Language Understanding) – A benchmark that tests how well a model performs across 57 subjects (like math, law, and history). Scores range from 0 to 100.

GLUE (General Language Understanding Evaluation) – A set of tasks used to measure how well a model understands language. Includes things like sentiment analysis and question answering.

📈 Performance

FLOPs (Floating Point Operations) – A measure of how much computing power is needed. More FLOPs = more expensive and slower processing. GPT-3 uses ~350 billion FLOPs per token.

Latency – How long it takes for the model to respond. Lower latency = faster answers.

🧾 References

Yann Dubois

https://www.youtube.com/watch?v=9vM4p9NN0Ts / Stanford CS229 I Machine Learning I Building Large Language Models (LLMs)

Sebastian Raschka

https://www.youtube.com/watch?v=79F32D9aM8U / Build LLMs From Scratch with Sebastian Raschka #52

https://www.youtube.com/watch?v=Zar2TJv-sE0 / Build an LLM from Scratch 5: Pretraining on Unlabeled Data

Andrej Karpathy

https://www.youtube.com/watch?v=l8pRSuU81PU / Let's reproduce GPT-2 (124M)

https://www.youtube.com/watch?v=EWvNQjAaOHw / How I use LLMs
