A Transformer-based Language Model Trained on the Harry Potter Corpus for Experimental Research in Training Dynamics and Architecture.

Authors: Srikiran Bandhakavi, Mayank Pratap Singh


Model Summary

This repository contains multiple checkpoints of a GPT-style transformer model trained from scratch on a curated Harry Potter text dataset sourced from Project Gutenberg. The purpose of this repository is to support and supplement the findings of the accompanying research paper:

“Training a Transformer-based LLM from Scratch on Project Gutenberg Corpus: An Experimental Study”

The uploaded checkpoints represent different configurations used in experiments studying the effects of training duration, learning rate, model depth, attention heads, and ablation of architectural components. The objective was to observe how each of these choices influences convergence, generalization, and generation quality in small-scale LLM training.


Dataset

  • Source: Project Gutenberg (Harry Potter books)
  • Preprocessing: Text cleaned and encoded with a subword tokenizer
  • Sequence Length: 128 tokens per input block
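
As a concrete illustration, the block-building step might look like the sketch below. The exact subword tokenizer used is not specified in this card, so the GPT-2 BPE tokenizer from `transformers` is used purely as a stand-in; only the 128-token block size comes from the list above.

```python
# Sketch of the preprocessing described above, assuming a GPT-2 style BPE tokenizer
# (the actual subword tokenizer used for this model is not specified in this card).
from transformers import GPT2TokenizerFast

BLOCK_SIZE = 128  # tokens per input block, as listed above

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def make_blocks(text: str) -> list[list[int]]:
    """Tokenize cleaned text and split the token stream into fixed-length blocks."""
    ids = tokenizer(text)["input_ids"]
    # Drop the trailing remainder so every block is exactly BLOCK_SIZE tokens long.
    n_blocks = len(ids) // BLOCK_SIZE
    return [ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE] for i in range(n_blocks)]
```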

Model Architecture

  • Model Type: GPT-style transformer (decoder-only)
  • Embedding Dimension: 384
  • Feedforward Dimension: 1536
  • Activation: ReLU
  • Attention Heads: Varied between 1 and 8
  • Transformer Layers: Varied between 1 and 12
  • Output Layer: Linear projection to vocabulary followed by softmax
  • Core Components: LayerNorm, Residual Connections, Feed-Forward Network
  • Training Optimizer: AdamW with cosine annealing scheduler
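
For reference, a single decoder block consistent with the components listed above could be sketched as follows. The class and argument names are illustrative rather than taken from the repository's code, and whether LayerNorm is applied before or after each sub-layer is not stated in this card (the sketch uses pre-LayerNorm).

```python
# Minimal sketch of one decoder block matching the listed hyperparameters
# (embedding dim 384, feed-forward dim 1536, ReLU, LayerNorm, residual connections).
# Names are ours; the repository's actual implementation may differ.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 384, n_heads: int = 4, d_ff: int = 1536):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: True entries mark positions a token may NOT attend to.
        T = x.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out               # residual connection around attention
        x = x + self.ffn(self.ln2(x))  # residual connection around the feed-forward net
        return x
```

The full model stacks 1 to 12 of these blocks over token and position embeddings and projects the final hidden states to the vocabulary with a linear layer; the softmax is applied at the output as listed above (and implicitly by the cross-entropy loss during training).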

Training Configuration

  • Optimizer: AdamW (β₁ = 0.9, β₂ = 0.95, ε = 1e-9, weight decay = 0.1)
  • Batch Size: 32
  • Learning Rates Tested: 1e-2, 1e-3, 1e-4
  • Training Iterations Tested: 5k, 20k, 50k, and 100k
  • Loss Function: Cross Entropy (causal language modeling)
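
A self-contained sketch of this setup is shown below. The toy stand-in model and random batches exist only so the snippet runs on its own; the optimizer settings, scheduler, loss, batch size, and block size follow the values listed in this card.

```python
# Sketch of the optimizer, scheduler, and loss described above, applied to a toy
# stand-in model and random token data so the snippet is runnable. Only the
# hyperparameters (betas, eps, weight decay, lr, batch size, block size) come from this card.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, block_size, batch_size = 8_000, 128, 32  # vocab size is a placeholder
max_iters = 5_000                                    # experiments ranged from 5k to 100k

# Stand-in for the GPT-style model: embedding followed by a projection to the vocabulary.
model = nn.Sequential(nn.Embedding(vocab_size, 384), nn.Linear(384, vocab_size))

optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.95), eps=1e-9, weight_decay=0.1
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_iters)

for step in range(max_iters):
    x = torch.randint(0, vocab_size, (batch_size, block_size))  # random stand-in batch
    y = torch.roll(x, shifts=-1, dims=1)                        # stand-in next-token targets
    logits = model(x)                                           # (batch, seq, vocab)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```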

Checkpoints and Experiments

| Checkpoint Name | Related Experiment | Notes |
|---|---|---|
| epoch_5000_lr_1e-4_layer_2_head_2 | Training Dynamics | Early-stage underfitting |
| epoch_100000_lr_1e-2_layer_2_head_2 | Overfitting Test | High LR, overfit behavior |
| epoch_10000_lr_1e-3_layer_12_head_2 | Model Depth | Deep 12-layer transformer |
| epoch_10000_lr_1e-3_layer_1_head_2 | Model Depth | Very shallow transformer |
| epoch_10000_lr_1e-3_layer_5_head_1 | Attention Heads | Single-head model |
| epoch_10000_lr_1e-3_layer_5_head_8 | Attention Heads | Wide attention capacity |
| epoch_10000_lr_1e-3_layer_5_head_4_no_ffn | Ablation | FFN removed |
| epoch_10000_lr_1e-3_layer_5_head_4_no_residual | Ablation | Residual connections removed |
| epoch_10000_lr_1e-3_layer_5_head_4_no_layernorm | Ablation | LayerNorm removed |
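
If the checkpoints are stored as standard PyTorch `state_dict` files (an assumption, not something this card confirms), loading one could look roughly like the sketch below; `GPTModel`, its constructor arguments, and the `.pt` extension are hypothetical placeholders.

```python
# Hypothetical loading sketch: GPTModel, its constructor signature, and the ".pt"
# filename extension are placeholders; adjust to the actual code and files in this repo.
import torch

from model import GPTModel  # hypothetical module defining the architecture above

# Configuration must match the checkpoint name: 5 layers, 8 heads.
model = GPTModel(n_layers=5, n_heads=8, d_model=384, d_ff=1536)
state_dict = torch.load("epoch_10000_lr_1e-3_layer_5_head_8.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```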

Key Results Summary

| Configuration | Final Validation Loss | Output Quality |
|---|---|---|
| Baseline (5L, 4H, 50k @ 1e-3) | ~3.0 | Mostly coherent |
| Underfitted (5L, 4H, 5k @ 1e-4) | ~5.5 | Incoherent |
| Overfitted (5L, 4H, 100k @ 1e-2) | ~4.5 | Fluent but repetitive |
| Deep (12L, 4H) | ~3.2 | Fluent, long-range coherence |
| Shallow (1L, 4H) | ~5.2 | Poor coherence, short phrases |
| Single-head (5L, 1H) | ~4.8 | Basic fluency, low consistency |
| Multi-head (5L, 8H) | ~3.1 | Fluent, contextually stable |
| No LayerNorm | ~3.8 | Some fluency, unstable phrasing |
| No Residual | ~5.0 | Failed convergence |
| No Feed-Forward | ~3.5 | Coherent, but repetitive output |
