---
license: apache-2.0
language:
- en
widget:
- text: It is raining and my family
  example_title: Example 1
- text: We entered into the forest and
  example_title: Example 2
- text: I sat for doing my homework
  example_title: Example 3
---

# Custom GPT Model

## Model Description

This model was designed and pretrained from scratch, without using the Hugging Face library.

---

## Model Parameters

- **Block Size**: `256` (maximum sequence length)
- **Vocab Size**: `50257` (50,000 BPE merges, 256 byte-level tokens, and 1 special token)
- **Number of Layers**: `8`
- **Number of Heads**: `8`
- **Embedding Dimension**: `768`
- **Max Learning Rate**: `0.0006`
- **Min Learning Rate**: `0.00006` (10% of the max learning rate)
- **Warmup Steps**: `715`
- **Max Steps**: `52000`
- **Total Batch Size**: `524288` (number of tokens per optimization step)
- **Micro Batch Size**: `128`
- **Sequence Length**: `256`

## Model Parameter Details

### Decayed Parameters

- **Total Decayed Parameters**: 95,453,184

Decayed parameters typically include the weights of the model's layers (such as the transformer blocks), which are subject to weight decay during optimization. Weight decay regularizes the model and can reduce overfitting by penalizing large weights.

### Non-Decayed Parameters

- **Total Non-Decayed Parameters**: 81,408

Non-decayed parameters are mainly biases and layer-normalization parameters. They are excluded from weight decay because decaying them can destabilize the learning dynamics. A sketch of how this split is typically implemented follows this section.

### Total Parameters

- **Overall Total Parameters**: 95,534,592

The total parameter count covers both decayed and non-decayed tensors, summing to just over 95 million parameters.
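The decayed/non-decayed split above is usually produced by grouping parameters before constructing the optimizer: tensors with two or more dimensions (weight matrices and embeddings) receive weight decay, while one-dimensional tensors (biases and LayerNorm parameters) do not. The snippet below is a minimal sketch of that common pattern; the weight-decay value and Adam betas shown are illustrative defaults rather than confirmed settings for this model, and the exact grouping used here lives in the repository linked at the end of this card.

```python
import torch

def configure_optimizer(model, weight_decay=0.1, learning_rate=6e-4):
    # Tensors with >= 2 dimensions (weight matrices, embeddings) get weight decay;
    # 1-D tensors (biases, LayerNorm gain/bias) are left un-decayed.
    params = [p for p in model.parameters() if p.requires_grad]
    decay_params = [p for p in params if p.dim() >= 2]
    nodecay_params = [p for p in params if p.dim() < 2]

    optim_groups = [
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": nodecay_params, "weight_decay": 0.0},
    ]

    print(f"decayed parameters:     {sum(p.numel() for p in decay_params):,}")
    print(f"non-decayed parameters: {sum(p.numel() for p in nodecay_params):,}")

    # AdamW applies weight decay per parameter group.
    return torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95), eps=1e-8)
```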
---

## Dataset Description

### Overview

The model was pretrained on a substantial subset of the **HuggingFaceFW/fineweb-edu** dataset: 3 billion tokens drawn from its "Sample 10B" split. This corpus is compiled from educational and academic web sources, making it a strong foundation for language models that need a good grasp of academic and formal text.

### Dataset Source

The dataset is hosted and maintained on the Hugging Face Hub. More detailed information and access are available on its dedicated page: [HuggingFaceFW/fineweb-edu Sample 10B](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/sample/10BT)

### Training Details

- **Total Tokens Used for Training**: 3 billion
- **Training Duration**: 3 epochs, giving the model repeated exposure to the data.

---

## Model Evaluation on HellaSwag Dataset

### Performance Overview

Evaluating the "orator" model on the HellaSwag dataset shows steady progress on context-based sentence completion. Performance is detailed below through loss and accuracy graphs, along with the key metrics.

### Graph Analysis

#### Loss Graph

![Loss Graph](output1.png)

- **Blue Line (Train Loss)**: The model's loss on the training set over training steps. It drops sharply at first, indicating rapid learning, then fluctuates and gradually stabilizes.
- **Orange Line (Validation Loss)**: The loss on the validation set. This curve is smoother than the training loss, indicating that the model remains stable and effective on unseen data.
- **Red Dashed Line**: The validation loss of a baseline model, OpenAI's GPT-2 (124M), for comparison. Our model reaches a lower validation loss, indicating improved performance.

#### Accuracy Graph (HellaSwag Eval)

![Accuracy Graph](output2.png)

- **Blue Line**: The accuracy of the "orator" model on the HellaSwag evaluation set. It increases steadily, reflecting the model's improving ability to complete unseen scenarios correctly.
- **Red Dashed Line**: The accuracy of the baseline OpenAI GPT-2 (124M) model. Our model consistently surpasses this benchmark after the initial training phase.

### Key Metrics

- **Minimum Training Loss**: `2.883471`
- **Minimum Validation Loss**: `3.1989`
- **Maximum HellaSwag Evaluation Accuracy**: `0.3054`

---

### Tokenization

For tokenization, this model uses the GPT-2 byte-pair encoding from `tiktoken`:

```python
tokenizer = tiktoken.get_encoding("gpt2")
```

---

## How to Use the Model

### Load and Generate Text

Below is a Python example showing how to load the model and generate text:

```python
import torch
from torch.nn import functional as F
from gpt_class import GPTConfig, GPT
import tiktoken

# Set up the device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model checkpoint
state_dict = torch.load('model_51999.pt', map_location=device)
config = state_dict['config']
model = GPT(config)
model.load_state_dict(state_dict['model'])
model.to(device)
model.eval()

# Seed for reproducibility
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

def Generate(model, tokenizer, example, num_return_sequences, max_length):
    model.eval()
    # Encode the prompt and replicate it for each requested sample
    tokens = tokenizer.encode(example)
    tokens = torch.tensor(tokens, dtype=torch.long).unsqueeze(0).repeat(num_return_sequences, 1)
    tokens = tokens.to(device)
    sample_rng = torch.Generator(device=device)
    xgen = tokens
    while xgen.size(1) < max_length:
        with torch.no_grad():
            with torch.autocast(device_type=device):
                logits, _ = model(xgen)
            # Sample the next token from the top-50 candidates
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
            ix = torch.multinomial(topk_probs, 1, generator=sample_rng)
            xcol = torch.gather(topk_indices, -1, ix)
            xgen = torch.cat((xgen, xcol), dim=1)
    for i in range(num_return_sequences):
        tokens = xgen[i, :max_length].tolist()
        decoded = tokenizer.decode(tokens)
        print(f"Sample {i+1}: {decoded}")

# Example usage
Generate(model, tokenizer, example="As we entered the forest we saw", num_return_sequences=4, max_length=32)
```

### Sample Output

```
Sample 1: As we entered the forest we saw huge white pine fells at the tops of the high plateaus (the great peaks) and trees standing at ground level.
Sample 2: As we entered the forest we saw a few trees that were too large. We realized they were not going to be very big. There was one tree that was
Sample 3: As we entered the forest we saw a group of small, wood-dwelling bees who had managed to escape a predator. A farmer was holding a handful
Sample 4: As we entered the forest we saw giant, blue-eyed, spotted beetles on the ground, a grayling beetle in my lawn next to the pond, an
```

## Accessing the Original Code

The original code for the Orator model (architecture, pretraining, evaluation, and generation) is available on GitHub.

**GitHub Repository**: [my-temporary-name/my_gpt2](https://github.com/my-temporary-name/my_gpt2)
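### Learning Rate Schedule (Reference Sketch)

The learning-rate hyperparameters listed above (warmup over 715 steps to a peak of 0.0006, decaying to 0.00006 by step 52,000) correspond to a standard warmup-plus-cosine-decay schedule. The sketch below shows one common way such a schedule is computed; the cosine shape is an assumption here, and the authoritative implementation is in the repository linked above.

```python
import math

MAX_LR = 6e-4       # peak learning rate
MIN_LR = 6e-5       # 10% of the peak
WARMUP_STEPS = 715
MAX_STEPS = 52_000

def get_lr(step: int) -> float:
    # 1) Linear warmup from 0 to MAX_LR over the first WARMUP_STEPS steps.
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    # 2) After MAX_STEPS, hold the learning rate at MIN_LR.
    if step > MAX_STEPS:
        return MIN_LR
    # 3) In between, cosine-decay from MAX_LR down to MIN_LR.
    decay_ratio = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

# Learning rate at a few milestones
for s in (0, 714, 26_000, 52_000):
    print(f"step {s:>6}: lr = {get_lr(s):.6f}")
```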