A Transformer-based Language Model Trained on the Harry Potter Corpus for Experimental Research in Training Dynamics and Architecture.
Authors: Srikiran Bandhakavi, Mayank Pratap Singh
Model Summary
This repository contains multiple checkpoints of a GPT-style transformer model trained from scratch on a curated Harry Potter text dataset sourced from Project Gutenberg. It supports and supplements the findings of the accompanying research paper:
“Training a Transformer-based LLM from Scratch on Project Gutenberg Corpus: An Experimental Study”
The uploaded checkpoints represent different configurations used in experiments studying the effects of training duration, learning rate, model depth, attention heads, and ablation of architectural components. The objective was to observe how each of these choices influences convergence, generalization, and generation quality in small-scale LLM training.
Dataset
- Source: Project Gutenberg (Harry Potter books)
- Preprocessing: Cleaned and tokenized using subword tokenization
- Sequence Length: 128 tokens per input block
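The exact tokenizer and chunking code are not included in this card. The sketch below shows one plausible way to produce the 128-token training blocks described above, using tiktoken's GPT-2 BPE purely as a stand-in subword tokenizer and a hypothetical `harry_potter_clean.txt` file for the cleaned corpus.

```python
# Minimal preprocessing sketch. Assumptions: the actual tokenizer is not
# specified in this card (tiktoken's GPT-2 BPE is a stand-in), and the
# corpus filename is hypothetical.
import tiktoken
import torch

BLOCK_SIZE = 128  # sequence length per input block, as listed above

enc = tiktoken.get_encoding("gpt2")            # stand-in subword tokenizer
text = open("harry_potter_clean.txt").read()   # hypothetical cleaned corpus file
ids = torch.tensor(enc.encode(text), dtype=torch.long)

# Reshape into chunks of 129 tokens so each example yields a 128-token input
# and a 128-token shifted target; the ragged tail is dropped.
chunk = BLOCK_SIZE + 1
num_blocks = ids.numel() // chunk
blocks = ids[: num_blocks * chunk].view(num_blocks, chunk)
x, y = blocks[:, :-1], blocks[:, 1:]           # inputs and next-token targets
```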
Model Architecture
- Model Type: GPT-style transformer (decoder-only)
- Embedding Dimension: 384
- Feedforward Dimension: 1536
- Activation: ReLU
- Attention Heads: Varied between 1 and 8
- Transformer Layers: Varied between 1 and 12
- Output Layer: Linear projection to vocabulary followed by softmax
- Core Components: LayerNorm, Residual Connections, Feed-Forward Network
- Training Optimizer: AdamW with cosine annealing scheduler
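Model code is not shipped with these checkpoints, so the following is a minimal PyTorch sketch of the block structure the list above describes (384-dim embeddings, 1536-dim ReLU feed-forward, causal multi-head attention, LayerNorm, residual connections). Details such as pre-norm ordering, learned positional embeddings, and the `DecoderBlock` / `TinyGPT` names are assumptions for illustration, not confirmed choices.

```python
# Hedged architecture sketch; the exact implementation behind the checkpoints
# is not published here, and pre- vs post-norm ordering is an assumption.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=384, n_heads=4, d_ff=1536):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True positions are blocked from attending.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                   # residual around attention
        x = x + self.ffn(self.ln2(x))      # residual around feed-forward
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=384, n_heads=4, n_layers=5, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)   # learned positions (assumption)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))     # logits over the vocabulary
```

The softmax mentioned in the output-layer item is folded into the cross-entropy loss during training, so this sketch returns raw logits from the final linear projection.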
Training Configuration
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-9, weight decay = 0.1)
- Batch Size: 32
- Training Iterations Tested: 5k, 20k, 50k, and 100k
- Learning Rates Tested: 1e-2, 1e-3, 1e-4
- Loss Function: Cross Entropy (causal language modeling)
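The training script itself is not part of this repository; the sketch below simply wires together the configuration listed above (AdamW with the stated betas, epsilon, and weight decay, cosine annealing over the iteration budget, batch size 32, cross-entropy causal LM loss). `get_batch` is a hypothetical data-loading helper, the vocabulary size is an assumption, and `TinyGPT` refers to the architecture sketch in the previous section.

```python
# Hedged training-loop sketch matching the listed configuration.
import torch
import torch.nn.functional as F

max_iters = 50_000
model = TinyGPT(vocab_size=8_000)          # vocab size is an assumption
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3,
    betas=(0.9, 0.95), eps=1e-9, weight_decay=0.1,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)

for step in range(max_iters):
    x, y = get_batch(batch_size=32)        # hypothetical helper: (32, 128) ids + shifted targets
    logits = model(x)                      # (32, 128, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
```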
Checkpoints and Experiments
Checkpoint Name | Related Experiment | Notes |
---|---|---|
epoch_5000_lr_1e-4_layer_2_head_2 | Training Dynamics | Early-stage underfitting |
epoch_100000_lr_1e-2_layer_2_head_2 | Overfitting Test | High LR, overfit behavior |
epoch_10000_lr_1e-3_layer_12_head_2 | Model Depth | Deep 12-layer transformer |
epoch_10000_lr_1e-3_layer_1_head_2 | Model Depth | Very shallow transformer |
epoch_10000_lr_1e-3_layer_5_head_1 | Attention Heads | Single-head model |
epoch_10000_lr_1e-3_layer_5_head_8 | Attention Heads | Wide attention capacity |
epoch_10000_lr_1e-3_layer_5_head_4_no_ffn | Ablation | FFN removed |
epoch_10000_lr_1e-3_layer_5_head_4_no_residual | Ablation | Residual connection removed |
epoch_10000_lr_1e-3_layer_5_head_4_no_layernorm | Ablation | LayerNorm removed |
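A hedged example of loading one of the checkpoints listed above: it assumes the files are PyTorch state dicts saved with `torch.save` (either bare or wrapped in a dict under a `model_state_dict` key), which this card does not state explicitly, and that the surrounding model code matches the `TinyGPT` sketch in the architecture section.

```python
# Checkpoint loading sketch; the file layout and state-dict key names are
# assumptions, and load_state_dict will only succeed if the model definition
# matches the one used to save the checkpoint.
import torch

ckpt = torch.load("epoch_10000_lr_1e-3_layer_5_head_8", map_location="cpu")
model = TinyGPT(vocab_size=8_000, n_heads=8, n_layers=5)   # match checkpoint config
state = ckpt.get("model_state_dict", ckpt)                 # handle either layout
model.load_state_dict(state)
model.eval()
```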
Key Results Summary
Configuration | Final Val Loss | Output Quality |
---|---|---|
Baseline (5L, 4H, 50k @ 1e-3) | ~3.0 | Mostly coherent |
Underfitted (5L, 4H, 5k @ 1e-4) | ~5.5 | Incoherent |
Overfitted (5L, 4H, 100k @ 1e-2) | ~4.5 | Fluent but repetitive |
Deep (12L, 4H) | ~3.2 | Fluent, long-range coherence |
Shallow (1L, 4H) | ~5.2 | Poor coherence, short phrases |
Single-head (5L, 1H) | ~4.8 | Basic fluency, low consistency |
Multi-head (5L, 8H) | ~3.1 | Fluent, contextually stable |
No LayerNorm | ~3.8 | Some fluency, unstable phrasing |
No Residual | ~5.0 | Failed convergence |
No Feed-Forward | ~3.5 | Coherent, but repetitive output |