---
license: apache-2.0
datasets:
- common-pile/wikimedia_filtered
language:
- en
library_name: transformers
tags:
- pre-train
- custom_code
- SnowflakeCore
pipeline_tag: text-generation
---
# SnowflakeCore-G1-Tiny2
An improved version of SnowflakeCore-G1-Tiny: a custom GPT-style transformer language model built from scratch in PyTorch and trained on the common-pile/wikimedia\_filtered dataset.
## Model Overview
SnowflakeCore-G1-Tiny2 is a GPT-style autoregressive transformer model with **\~356M parameters**, designed for text generation tasks.
### Key Features
* **2048 token context window** for extended conversations
* **Mixed precision training** (BF16/FP16) for efficiency
* **Custom attention implementation** with fused operations
* **Early stopping mechanisms** for optimal training
* **Gradient accumulation** for effective large batch training
### Architecture Specifications
| Component | Value |
| --------------- | -------------------------- |
| Model Type | Autoregressive Transformer |
| Parameters      | \~356M                     |
| Layers | 24 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Head Dimension | 64 |
| FFN Dimension | 4096 |
| Context Length | 2048 tokens |
| Vocabulary Size | 50,257 (GPT-2 tokenizer) |
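The 355.9M parameter count reported in the benchmarks below can be roughly reproduced from this table. A back-of-the-envelope estimate in Python, assuming GPT-2-style blocks with biases, learned position embeddings, and tied input/output embeddings (these details are inferred from the numbers, not confirmed by the model code):
```python
# Rough parameter-count estimate from the architecture table (illustrative only).
vocab, d_model, n_layers, d_ffn, n_ctx = 50257, 1024, 24, 4096, 2048
embeddings = vocab * d_model + n_ctx * d_model          # token + learned position embeddings
attn_per_layer = 4 * d_model * d_model + 4 * d_model    # Q, K, V, output projections (+ biases)
ffn_per_layer = 2 * d_model * d_ffn + d_ffn + d_model   # two linear layers (+ biases)
norms_per_layer = 2 * 2 * d_model                       # two LayerNorms (weight + bias)
total = embeddings + n_layers * (attn_per_layer + ffn_per_layer + norms_per_layer)
print(f"~{total / 1e6:.1f}M parameters")                # ≈ 355.9M, matching the benchmarks below
```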
## Model Benchmarks
The following benchmarks compare `SnowflakeCore-G1-Tiny2`, its predecessor, and GPT-2 on key performance and text quality metrics.
### Performance & Quality Metrics
| Model | Params | Size (MB) | Speed (tok/s) | Vocab Div. | Dist. Bigrams | Dist. Trigrams | Bigram Repet. | Trigram Repet. |
| -------------------------- | ------ | --------- | ------------- | ---------- | ------------- | -------------- | ------------- | -------------- |
| **SnowflakeCore-G1-Tiny2** | 355.9M | 1357.54 | 22.13 | **0.3440** | **0.7408** | **0.8834** | **0.2592** | **0.1166** |
| SnowflakeCore-G1-Tiny | 355.9M | 1357.54 | 22.12 | 0.2780 | 0.6111 | 0.7421 | 0.3889 | 0.2579 |
| GPT-2 (small) | 124.4M | 474.70 | **47.73** | 0.2590 | 0.6408 | 0.7946 | 0.3592 | 0.2054 |
> **Notes:**
>
> * Vocabulary Diversity = unique tokens / total tokens
> * Distinct N-grams = unique n-grams / total n-grams
> * Lower repetition rates indicate better text novelty
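A minimal sketch of how these diversity and repetition metrics can be computed from a generated token sequence (the exact tokenization and aggregation used in the benchmark are not specified in this card):
```python
def ngram_stats(tokens, n):
    """Distinct-n = unique n-grams / total n-grams; repetition = 1 - distinct-n."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    distinct = len(set(ngrams)) / max(len(ngrams), 1)
    return distinct, 1.0 - distinct

tokens = "the cat sat on the mat and the cat slept".split()  # stand-in for model output tokens
vocab_diversity = len(set(tokens)) / len(tokens)              # unique tokens / total tokens
dist_bi, rep_bi = ngram_stats(tokens, 2)
dist_tri, rep_tri = ngram_stats(tokens, 3)
print(vocab_diversity, dist_bi, rep_bi, dist_tri, rep_tri)
```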
### Memory Usage (CPU)
All models report `N/A` for CPU memory usage across all sequence lengths.
| Sequence Length | SnowflakeCore-G1-Tiny | SnowflakeCore-G1-Tiny2 | GPT-2 |
| --------------- | --------------------- | ---------------------- | ----- |
| 128 | N/A (CPU) | N/A (CPU) | N/A |
| 512 | N/A (CPU) | N/A (CPU) | N/A |
| 1024 | N/A (CPU) | N/A (CPU) | N/A |
| 2048 | N/A (CPU) | N/A (CPU) | N/A |
## Quick Start
### Installation
```bash
pip install torch transformers # if not already installed
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"FlameF0X/SnowflakeCore-G1-Tiny2",
trust_remote_code=True,
force_download=True,
use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
"FlameF0X/SnowflakeCore-G1-Tiny2",
trust_remote_code=True,
force_download=True,
use_safetensors=True,
)
def custom_greedy_generate(prompt, max_length=50):
model.eval()
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated = input_ids
with torch.no_grad():
for _ in range(max_length):
outputs = model(input_ids=generated)
next_token_logits = outputs["logits"][:, -1, :]
next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
generated = torch.cat((generated, next_token_id), dim=1)
if next_token_id.item() == tokenizer.eos_token_id:
break
return tokenizer.decode(generated[0], skip_special_tokens=True)
# Generate text
prompt = "Once upon a time"
result = custom_greedy_generate(prompt)
print(result)
```
### Fine-Tuning
<code>... (same fine-tuning code as above) ...</code>
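The fine-tuning code referenced above is not reproduced in this card. As a stand-in, here is a minimal causal language modeling sketch against the same checkpoint; the toy data, the reuse of the 2e-4 learning rate, and the assumption that the forward pass returns a `logits` entry (as in the generation example) are illustrative, not taken from the original training script:
```python
import torch
from torch.nn import functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("FlameF0X/SnowflakeCore-G1-Tiny2", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FlameF0X/SnowflakeCore-G1-Tiny2", trust_remote_code=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # learning rate from the training details below

texts = ["Example fine-tuning document one.", "Example fine-tuning document two."]  # toy data
for text in texts:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(device)
    logits = model(input_ids=batch["input_ids"])["logits"]
    # Next-token objective: predict token t+1 from tokens up to t.
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        batch["input_ids"][:, 1:].reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"loss: {loss.item():.4f}")
```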
## Training Details
### Dataset
* **Source**: [common-pile/wikimedia\_filtered](https://huggingface.co/datasets/common-pile/wikimedia_filtered)
### Training Configuration
* **Framework**: PyTorch with mixed precision (BF16/FP16)
* **Optimizer**: AdamW (learning rate: 2e-4)
* **Batch Size**: 1 with gradient accumulation (32 steps)
* **Context Window**: 2048 tokens
* **Validation Split**: 10%
* **Early Stopping**: Implemented at epoch and step levels
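A minimal sketch of how this configuration fits together in a PyTorch training step (the actual training script is not part of this repository, so the helper names are placeholders and the clipping threshold is an assumption):
```python
import torch

# `model`, `train_loader`, and `compute_loss` stand in for pieces defined elsewhere.
accum_steps, max_grad_norm = 32, 1.0
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()                   # loss scaling, mainly needed for FP16

for step, batch in enumerate(train_loader):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = compute_loss(model, batch)              # next-token cross-entropy
    scaler.scale(loss / accum_steps).backward()        # accumulate gradients over 32 micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                     # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```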
### Performance Monitoring
* Training loss tracked per epoch with perplexity calculation
* Full validation after each epoch
* Step-level monitoring every 500 steps
* Comprehensive metrics saved in `training_metrics.json`
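Perplexity here presumably follows the standard definition, the exponential of the mean per-token cross-entropy loss:
```python
import math

def perplexity(mean_cross_entropy_loss: float) -> float:
    return math.exp(mean_cross_entropy_loss)

print(perplexity(3.0))  # ≈ 20.09
```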
## Technical Implementation
### Attention Mechanism
* **Causal Masking**: Supports autoregressive generation
* **Key Padding Mask**: Enables batched inference
* **Scaled Dot-Product**: Head dimension normalization included
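A plain-PyTorch sketch of scaled dot-product attention with both masks applied (the model's actual fused implementation lives in the repository's custom code; this only illustrates the mechanism):
```python
import math
import torch

def attention(q, k, v, key_padding_mask=None):
    """q, k, v: (batch, heads, seq, head_dim); key_padding_mask: (batch, seq), True = padding."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))            # scaled dot-product
    seq = q.size(-2)
    causal = torch.triu(torch.ones(seq, seq, device=q.device), diagonal=1).bool()
    scores = scores.masked_fill(causal, float("-inf"))                  # block attention to future tokens
    if key_padding_mask is not None:
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 16, 8, 64)    # 16 heads with head dimension 64, as in the spec
print(attention(q, k, v).shape)          # torch.Size([1, 16, 8, 64])
```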
### Memory Optimization
* **Fused Operations**: Reduces memory fragmentation
* **Mixed Precision**: 30-40% memory reduction
* **Gradient Accumulation**: Simulates larger batch sizes
* **Optional Quantization**: Further model compression
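For the optional quantization, one common route is PyTorch dynamic quantization of the linear layers for CPU inference (a general illustration, not a procedure documented by this repository):
```python
import torch

# Quantize nn.Linear weights to int8; activations stay in floating point.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model.cpu(), {torch.nn.Linear}, dtype=torch.qint8
)
```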
### Training Stability
* **Gradient Clipping**: Prevents exploding gradients
* **Automatic Loss Scaling**: Mixed precision stability
* **Early Stopping**: Prevents overfitting with patience mechanisms
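A minimal sketch of the patience-based early stopping idea at the epoch level (the helper functions, `num_epochs`, and the patience of 3 are placeholders, not values taken from the actual training run):
```python
best_val_loss, patience, bad_epochs = float("inf"), 3, 0

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)      # full validation pass after each epoch
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        save_checkpoint(model)                  # keep the best weights so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch}")
            break
```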
## System Requirements
### Memory Requirements
* **Training**: 16-24GB VRAM (precision dependent)
* **Inference**: 1-6GB VRAM for standard generation
* **Context**: Maximum 2048 tokens input length
### Generation Parameters
Default configuration:
```json
{
"do_sample": true,
"temperature": 1.0,
"top_p": 0.9,
"top_k": 50,
"max_new_tokens": 50,
"pad_token_id": 50256,
"eos_token_id": 50256
}
```
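Since `.generate()` is not supported (see Limitations), these defaults have to be applied by hand. A sampling loop combining temperature, top-k, and top-p, reusing `model` and `tokenizer` from the Quick Start (an illustrative sketch, not code shipped with the model):
```python
import torch

def sample_generate(prompt, max_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9):
    model.eval()
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids=ids)["logits"][:, -1, :] / temperature
            topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)      # keep the k most likely tokens
            probs = torch.softmax(topk_vals, dim=-1)
            sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
            # Top-p: drop tokens once the probability mass before them already exceeds p.
            cutoff = torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p
            sorted_probs = sorted_probs.masked_fill(cutoff, 0.0)
            sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
            choice = torch.multinomial(sorted_probs, num_samples=1)
            next_id = topk_idx.gather(-1, sorted_idx.gather(-1, choice))
            ids = torch.cat([ids, next_id], dim=1)
            if next_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(sample_generate("Once upon a time"))
```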
## Model Files
The repository contains:
* `pytorch_model.bin` - PyTorch model weights
* `model.safetensors` - SafeTensors format weights
* `config.json` - Model configuration
* `generation_config.json` - Generation parameters
* `training_metrics.json` - Training statistics
* `tokenizer.json` - Tokenizer configuration
* `vocab.json` & `merges.txt` - Vocabulary files
## Limitations
* **No HuggingFace `.generate()` support**: Use the custom generation functions shown above instead
* **Output Quality**: May produce repetitive or nonsensical text for some prompts
* **Hardware Requirements**: GPU recommended for practical inference
* **Context Window**: Limited to 2048 tokens
* **Dataset Dependency**: Performance is tied to the quality of the common-pile/wikimedia\_filtered training data
## Example Output
```
N/A
```
## Support Me
You can support me via [Ko-fi](https://ko-fi.com/flamef0x) or you can try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template!
### Small meta-data
* Release date: July 21, 2025.