A Transformer-based Language Model Trained on the Harry Potter Corpus for Experimental Research in Training Dynamics and Architecture.
Authors: Srikiran Bandhakavi, Mayank Pratap Singh
Model Summary
This repository contains multiple checkpoints of a GPT-style transformer model trained from scratch on a curated Harry Potter text dataset sourced from Project Gutenberg. It supports and supplements the findings of the accompanying research paper:
“Training a Transformer-based LLM from Scratch on Project Gutenberg Corpus: An Experimental Study”
The uploaded checkpoints represent different configurations used in experiments studying the effects of training duration, learning rate, model depth, attention heads, and ablation of architectural components. The objective was to observe how each of these choices influences convergence, generalization, and generation quality in small-scale LLM training.
Dataset
- Source: Project Gutenberg (Harry Potter books)
- Preprocessing: Cleaned and tokenized using subword tokenization
- Sequence Length: 128 tokens per input block
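The exact tokenizer and chunking code are not included in this card. The sketch below shows one plausible way to produce the 128-token training blocks described above, using tiktoken's GPT-2 BPE purely as a stand-in subword tokenizer and a hypothetical `harry_potter_clean.txt` file for the cleaned corpus.

```python
# Minimal preprocessing sketch. Assumptions: the actual tokenizer is not
# specified in this card (tiktoken's GPT-2 BPE is a stand-in), and the
# corpus filename is hypothetical.
import tiktoken
import torch

BLOCK_SIZE = 128  # sequence length per input block, as listed above

enc = tiktoken.get_encoding("gpt2")            # stand-in subword tokenizer
text = open("harry_potter_clean.txt").read()   # hypothetical cleaned corpus file
ids = torch.tensor(enc.encode(text), dtype=torch.long)

# Reshape into chunks of 129 tokens so each example yields a 128-token input
# and a 128-token shifted target; the ragged tail is dropped.
chunk = BLOCK_SIZE + 1
num_blocks = ids.numel() // chunk
blocks = ids[: num_blocks * chunk].view(num_blocks, chunk)
x, y = blocks[:, :-1], blocks[:, 1:]           # inputs and next-token targets
```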
Model Architecture
- Model Type: GPT-style transformer (decoder-only)
- Embedding Dimension: 384
- Feedforward Dimension: 1536
- Activation: ReLU
- Attention Heads: Varied between 1 and 8
- Transformer Layers: Varied between 1 and 12
- Output Layer: Linear projection to vocabulary followed by softmax
- Core Components: LayerNorm, Residual Connections, Feed-Forward Network
- Training Optimizer: AdamW with cosine annealing scheduler
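Model code is not shipped with these checkpoints, so the following is a minimal PyTorch sketch of the block structure the list above describes (384-dim embeddings, 1536-dim ReLU feed-forward, causal multi-head attention, LayerNorm, residual connections). Details such as pre-norm ordering, learned positional embeddings, and the `DecoderBlock` / `TinyGPT` names are assumptions for illustration, not confirmed choices.

```python
# Hedged architecture sketch; the exact implementation behind the checkpoints
# is not published here, and pre- vs post-norm ordering is an assumption.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=384, n_heads=4, d_ff=1536):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True positions are blocked from attending.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out                   # residual around attention
        x = x + self.ffn(self.ln2(x))      # residual around feed-forward
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=384, n_heads=4, n_layers=5, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)   # learned positions (assumption)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.ln_f(x))     # logits over the vocabulary
```

The softmax mentioned in the output-layer item is folded into the cross-entropy loss during training, so this sketch returns raw logits from the final linear projection.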
Training Configuration
- Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-9, weight decay = 0.1)
- Batch Size: 32
- Training Iterations Tested: 5k, 20k, 50k, and 100k
- Learning Rates Tested: 1e-2, 1e-3, 1e-4
- Loss Function: Cross Entropy (causal language modeling)
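The training script itself is not part of this repository; the sketch below simply wires together the configuration listed above (AdamW with the stated betas, epsilon, and weight decay, cosine annealing over the iteration budget, batch size 32, cross-entropy causal LM loss). `get_batch` is a hypothetical data-loading helper, the vocabulary size is an assumption, and `TinyGPT` refers to the architecture sketch in the previous section.

```python
# Hedged training-loop sketch matching the listed configuration.
import torch
import torch.nn.functional as F

max_iters = 50_000
model = TinyGPT(vocab_size=8_000)          # vocab size is an assumption
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3,
    betas=(0.9, 0.95), eps=1e-9, weight_decay=0.1,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)

for step in range(max_iters):
    x, y = get_batch(batch_size=32)        # hypothetical helper: (32, 128) ids + shifted targets
    logits = model(x)                      # (32, 128, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
```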
Checkpoints and Experiments
Checkpoint Name | Related Experiment | Notes |
---|---|---|
epoch_5000_lr_1e-4_layer_2_head_2 | Training Dynamics | Early-stage underfitting |
epoch_100000_lr_1e-2_layer_2_head_2 | Overfitting Test | High LR, overfit behavior |
epoch_10000_lr_1e-3_layer_12_head_2 | Model Depth | Deep 12-layer transformer |
epoch_10000_lr_1e-3_layer_1_head_2 | Model Depth | Very shallow transformer |
epoch_10000_lr_1e-3_layer_5_head_1 | Attention Heads | Single-head model |
epoch_10000_lr_1e-3_layer_5_head_8 | Attention Heads | Wide attention capacity |
epoch_10000_lr_1e-3_layer_5_head_4_no_ffn | Ablation | FFN removed |
epoch_10000_lr_1e-3_layer_5_head_4_no_residual | Ablation | Residual connection removed |
epoch_10000_lr_1e-3_layer_5_head_4_no_layernorm | Ablation | LayerNorm removed |
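A hedged example of loading one of the checkpoints listed above: it assumes the files are PyTorch state dicts saved with `torch.save` (either bare or wrapped in a dict under a `model_state_dict` key), which this card does not state explicitly, and that the surrounding model code matches the `TinyGPT` sketch in the architecture section.

```python
# Checkpoint loading sketch; the file layout and state-dict key names are
# assumptions, and load_state_dict will only succeed if the model definition
# matches the one used to save the checkpoint.
import torch

ckpt = torch.load("epoch_10000_lr_1e-3_layer_5_head_8", map_location="cpu")
model = TinyGPT(vocab_size=8_000, n_heads=8, n_layers=5)   # match checkpoint config
state = ckpt.get("model_state_dict", ckpt)                 # handle either layout
model.load_state_dict(state)
model.eval()
```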
Key Results Summary
Configuration | Final Val Loss | Output Quality |
---|---|---|
Baseline (5L, 4H, 50k @ 1e-3) | ~3.0 | Mostly coherent |
Underfitted (5L, 4H, 5k @ 1e-4) | ~5.5 | Incoherent |
Overfitted (5L, 4H, 100k @ 1e-2) | ~4.5 | Fluent but repetitive |
Deep (12L, 4H) | ~3.2 | Fluent, long-range coherence |
Shallow (1L, 4H) | ~5.2 | Poor coherence, short phrases |
Single-head (5L, 1H) | ~4.8 | Basic fluency, low consistency |
Multi-head (5L, 8H) | ~3.1 | Fluent, contextually stable |
No LayerNorm | ~3.8 | Some fluency, unstable phrasing |
No Residual | ~5.0 | Failed convergence |
No Feed-Forward | ~3.5 | Coherent, but repetitive output |