---
license: apache-2.0
datasets:
- common-pile/wikimedia_filtered
language:
- en
library_name: transformers
tags:
- pre-train
- custom_code
- SnowflakeCore
pipeline_tag: text-generation
---

# SnowflakeCore-G1-Tiny2

An improved version of SnowflakeCore-G1-Tiny: a custom GPT-style transformer language model built from scratch in PyTorch and trained on the `common-pile/wikimedia_filtered` dataset.

## Model Overview

SnowflakeCore-G1-Tiny2 is a GPT-style autoregressive transformer model with **\~356M parameters** designed for text generation tasks.

### Key Features

* **2048 token context window** for extended conversations
* **Mixed precision training** (BF16/FP16) for efficiency
* **Custom attention implementation** with fused operations
* **Early stopping mechanisms** for optimal training
* **Gradient accumulation** for effective large batch training

### Architecture Specifications

| Component       | Value                      |
| --------------- | -------------------------- |
| Model Type      | Autoregressive Transformer |
| Parameters      | \~356M                     |
| Layers          | 24                         |
| Hidden Size     | 1024                       |
| Attention Heads | 16                         |
| Head Dimension  | 64                         |
| FFN Dimension   | 4096                       |
| Context Length  | 2048 tokens                |
| Vocabulary Size | 50,257 (GPT-2 tokenizer)   |

## Model Benchmarks

The following benchmarks compare `SnowflakeCore-G1-Tiny2`, its predecessor, and GPT-2 on key performance and text quality metrics.

### Performance & Quality Metrics

| Model                      | Params | Size (MB) | Speed (tok/s) | Vocab Div. | Dist. Bigrams | Dist. Trigrams | Bigram Repet. | Trigram Repet. |
| -------------------------- | ------ | --------- | ------------- | ---------- | ------------- | -------------- | ------------- | -------------- |
| **SnowflakeCore-G1-Tiny2** | 355.9M | 1357.54   | 22.13         | **0.3440** | **0.7408**    | **0.8834**     | **0.2592**    | **0.1166**     |
| SnowflakeCore-G1-Tiny      | 355.9M | 1357.54   | 22.12         | 0.2780     | 0.6111        | 0.7421         | 0.3889        | 0.2579         |
| GPT-2 (small)              | 124.4M | 474.70    | **47.73**     | 0.2590     | 0.6408        | 0.7946         | 0.3592        | 0.2054         |

> **Notes:**
>
> * Vocabulary Diversity = unique tokens / total tokens
> * Distinct N-grams = unique n-grams / total n-grams
> * Lower repetition rates indicate better text novelty
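The diversity and repetition numbers follow directly from the definitions in the notes above. The sketch below recomputes them for an arbitrary token sequence; how the benchmark text was generated and tokenized is not specified here, so treat it as an illustration of the formulas rather than the exact benchmark script.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a flat token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def text_quality_metrics(tokens):
    """Compute the diversity/repetition metrics reported in the table above.

    `tokens` can be any flat list of token ids or token strings; the
    tokenizer and prompts used for the published numbers are not shown here.
    """
    bigrams = ngrams(tokens, 2)
    trigrams = ngrams(tokens, 3)
    dist_bi = len(set(bigrams)) / max(len(bigrams), 1)
    dist_tri = len(set(trigrams)) / max(len(trigrams), 1)
    return {
        "vocab_diversity": len(set(tokens)) / max(len(tokens), 1),
        "distinct_bigrams": dist_bi,
        "distinct_trigrams": dist_tri,
        "bigram_repetition": 1.0 - dist_bi,    # repetition is the complement of distinctness
        "trigram_repetition": 1.0 - dist_tri,
    }


# Example on a toy whitespace-tokenized string
print(text_quality_metrics("the cat sat on the mat and the cat slept".split()))
```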
### Memory Usage (CPU)

All models report `N/A` for CPU memory usage across all sequence lengths.

| Sequence Length | SnowflakeCore-G1-Tiny | SnowflakeCore-G1-Tiny2 | GPT-2 |
| --------------- | --------------------- | ---------------------- | ----- |
| 128             | N/A (CPU)             | N/A (CPU)              | N/A   |
| 512             | N/A (CPU)             | N/A (CPU)              | N/A   |
| 1024            | N/A (CPU)             | N/A (CPU)              | N/A   |
| 2048            | N/A (CPU)             | N/A (CPU)              | N/A   |

## Quick Start

### Installation

```bash
pip install torch transformers  # if not already installed
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2",
    trust_remote_code=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2",
    trust_remote_code=True,
)


def custom_greedy_generate(prompt, max_length=50):
    """Greedy decoding loop; HuggingFace `.generate()` is not supported."""
    model.eval()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids=generated)
            # Take the logits at the last position and pick the most likely token
            next_token_logits = outputs["logits"][:, -1, :]
            next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
            generated = torch.cat((generated, next_token_id), dim=1)
            if next_token_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)


# Generate text
prompt = "Once upon a time"
result = custom_greedy_generate(prompt)
print(result)
```

### Fine-Tuning

... (same fine-tuning code as above) ...

## Training Details

### Dataset

* **Source**: [common-pile/wikimedia_filtered](https://huggingface.co/datasets/common-pile/wikimedia_filtered)

### Training Configuration

* **Framework**: PyTorch with mixed precision (BF16/FP16)
* **Optimizer**: AdamW (learning rate: 2e-4)
* **Batch Size**: 1 with gradient accumulation (32 steps)
* **Context Window**: 2048 tokens
* **Validation Split**: 10%
* **Early Stopping**: Implemented at epoch and step levels

### Performance Monitoring

* Training loss tracked per epoch with perplexity calculation
* Full validation after each epoch
* Step-level monitoring every 500 steps
* Comprehensive metrics saved in `training_metrics.json`

## Technical Implementation

### Attention Mechanism

* **Causal Masking**: Supports autoregressive generation
* **Key Padding Mask**: Enables batched inference
* **Scaled Dot-Product**: Head dimension normalization included

### Memory Optimization

* **Fused Operations**: Reduces memory fragmentation
* **Mixed Precision**: 30-40% memory reduction
* **Gradient Accumulation**: Simulates larger batch sizes
* **Optional Quantization**: Further model compression

### Training Stability

* **Gradient Clipping**: Prevents exploding gradients
* **Automatic Loss Scaling**: Mixed precision stability
* **Early Stopping**: Prevents overfitting with patience mechanisms

## System Requirements

### Memory Requirements

* **Training**: 16-24GB VRAM (precision dependent)
* **Inference**: 1-6GB VRAM for standard generation
* **Context**: Maximum 2048 tokens input length

### Generation Parameters

Default configuration:

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.9,
  "top_k": 50,
  "max_new_tokens": 50,
  "pad_token_id": 50256,
  "eos_token_id": 50256
}
```
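Because HuggingFace `.generate()` is not supported (see Limitations below), these defaults have to be applied by hand. The sketch below reuses the `model` and `tokenizer` from Basic Usage and applies temperature scaling, then top-k, then top-p filtering before sampling; it is an illustrative implementation of the defaults above, not the repository's own generation code, and the filtering order is an assumption.

```python
import torch


def custom_sample_generate(prompt, max_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9):
    """Sampling loop applying the generation_config.json defaults manually."""
    model.eval()
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Logits of the last position, scaled by temperature
            logits = model(input_ids=generated)["logits"][:, -1, :] / temperature

            # Top-k filtering: drop everything below the k-th largest logit
            if top_k > 0:
                kth_best = torch.topk(logits, top_k).values[:, -1, None]
                logits = logits.masked_fill(logits < kth_best, float("-inf"))

            # Top-p (nucleus) filtering: drop the tail once cumulative probability exceeds top_p
            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative_probs = torch.cumsum(torch.softmax(sorted_logits, dim=-1), dim=-1)
            remove = cumulative_probs > top_p
            remove[..., 1:] = remove[..., :-1].clone()  # always keep the single best token
            remove[..., 0] = False
            logits = logits.masked_fill(remove.scatter(1, sorted_indices, remove), float("-inf"))

            # Sample the next token from the filtered distribution
            next_token_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
            generated = torch.cat((generated, next_token_id), dim=1)
            if next_token_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)


print(custom_sample_generate("Once upon a time"))
```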
## Model Files

The repository contains:

* `pytorch_model.bin` - PyTorch model weights
* `model.safetensors` - SafeTensors format weights
* `config.json` - Model configuration
* `generation_config.json` - Generation parameters
* `training_metrics.json` - Training statistics
* `tokenizer.json` - Tokenizer configuration
* `vocab.json` & `merges.txt` - Vocabulary files

## Limitations

* **No HuggingFace `.generate()` support**: Use the custom generation functions shown above
* **Output Quality**: May produce repetitive or nonsensical text for some prompts
* **Hardware Requirements**: GPU recommended for practical inference
* **Context Window**: Limited to 2048 tokens
* **Dataset Dependency**: Performance is tied to the quality of the common-pile/wikimedia_filtered dataset

## Example Output

```
N/A
```

## Support Me

You can support me via [Ko-fi](https://ko-fi.com/flamef0x), or try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template!

### Small meta-data

* Release date: July 21, 2025.