---
license: apache-2.0
datasets:
- common-pile/wikimedia_filtered
language:
- en
library_name: transformers
tags:
- pre-train
- custom_code
- SnowflakeCore
pipeline_tag: text-generation
---
# SnowflakeCore-G1-Tiny2

An improved version of SnowflakeCore-G1-Tiny: a custom GPT-style transformer language model built from scratch in PyTorch and trained on the common-pile/wikimedia_filtered dataset.
## Model Overview
SnowflakeCore-G1-Tiny2 is a GPT-style autoregressive transformer model with ~400M parameters designed for text generation tasks.
### Key Features
- 2048 token context window for extended conversations
- Mixed precision training (BF16/FP16) for efficiency
- Custom attention implementation with fused operations
- Early stopping mechanisms for optimal training
- Gradient accumulation for effective large batch training
## Architecture Specifications
Component | Value |
---|---|
Model Type | Autoregressive Transformer |
Parameters | ~400M |
Layers | 24 |
Hidden Size | 1024 |
Attention Heads | 16 |
Head Dimension | 64 |
FFN Dimension | 4096 |
Context Length | 2048 tokens |
Vocabulary Size | 50,257 (GPT-2 tokenizer) |
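
For orientation, the table above corresponds to a hyperparameter set roughly like the following; the key names are illustrative and may not match the exact fields in this repository's config.json:

```python
# Illustrative hyperparameters mirroring the table above; the actual
# config.json keys used by this repository may differ.
SNOWFLAKE_G1_TINY2_CONFIG = {
    "model_type": "autoregressive-transformer",
    "n_layer": 24,                     # transformer blocks
    "hidden_size": 1024,               # model width
    "n_head": 16,                      # attention heads
    "head_dim": 64,                    # hidden_size // n_head
    "ffn_dim": 4096,                   # feed-forward inner dimension
    "max_position_embeddings": 2048,   # context length
    "vocab_size": 50257,               # GPT-2 tokenizer vocabulary
}
```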
Model Benchmarks
The following benchmarks compare SnowflakeCore-G1-Tiny2
, its predecessor, and GPT-2 on key performance and text quality metrics.
Performance & Quality Metrics
Model | Params | Size (MB) | Speed (tok/s) | Vocabulary Diversity | Distinct Bigrams | Distinct Trigrams | Bigram Repetition | Trigram Repetition |
---|---|---|---|---|---|---|---|---|
SnowflakeCore-G1-Tiny2 | 355.9M | 1357.54 | 22.13 | 0.3440 | 0.7408 | 0.8834 | 0.2592 | 0.1166 |
SnowflakeCore-G1-Tiny | 355.9M | 1357.54 | 22.12 | 0.2780 | 0.6111 | 0.7421 | 0.3889 | 0.2579 |
GPT-2 (small) | 124.4M | 474.70 | 47.73 | 0.2590 | 0.6408 | 0.7946 | 0.3592 | 0.2054 |
Notes:
- Vocabulary Diversity = unique tokens / total tokens
- Distinct N-grams = unique n-grams / total n-grams
- Lower repetition rates indicate better text novelty; see the sketch below for how these metrics can be computed
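
A minimal sketch of how these metrics can be computed over a list of generated tokens (illustrative; not the exact benchmark harness):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def text_quality_metrics(tokens):
    """Vocabulary diversity, distinct n-gram ratios, and repetition rates."""
    bigrams = ngrams(tokens, 2)
    trigrams = ngrams(tokens, 3)
    dist_bigrams = len(set(bigrams)) / len(bigrams)
    dist_trigrams = len(set(trigrams)) / len(trigrams)
    return {
        "vocab_diversity": len(set(tokens)) / len(tokens),
        "distinct_bigrams": dist_bigrams,
        "distinct_trigrams": dist_trigrams,
        # Repetition rate is the complement of the distinct ratio.
        "bigram_repetition": 1.0 - dist_bigrams,
        "trigram_repetition": 1.0 - dist_trigrams,
    }

print(text_quality_metrics("the cat sat on the mat and the cat slept".split()))
```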
### Memory Usage (CPU)

All models report N/A for CPU memory usage across all sequence lengths.
Sequence Length | SnowflakeCore-G1-Tiny | SnowflakeCore-G1-Tiny2 | GPT-2 |
---|---|---|---|
128 | N/A (CPU) | N/A (CPU) | N/A |
512 | N/A (CPU) | N/A (CPU) | N/A |
1024 | N/A (CPU) | N/A (CPU) | N/A |
2048 | N/A (CPU) | N/A (CPU) | N/A |
## Quick Start

### Installation

```bash
pip install torch transformers  # if not already installed
```
### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer (custom architecture, so trust_remote_code is required)
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)

def custom_greedy_generate(prompt, max_length=50):
    """Greedy decoding: repeatedly append the argmax token until EOS or max_length."""
    model.eval()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids=generated)
            next_token_logits = outputs["logits"][:, -1, :]
            next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
            generated = torch.cat((generated, next_token_id), dim=1)
            if next_token_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Generate text
prompt = "Once upon a time"
result = custom_greedy_generate(prompt)
print(result)
```
### Fine-Tuning
... (same fine-tuning code as above) ...
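The referenced fine-tuning code is not reproduced in this card. As a purely illustrative sketch of a standard causal-LM fine-tuning loop (the corpus, hyperparameters, and whether the custom forward() accepts `labels` are assumptions, not documented facts about this repository):

```python
from torch.optim import AdamW
from torch.utils.data import DataLoader

tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizer has no pad token by default

train_texts = ["example document one", "example document two"]  # placeholder corpus

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True,
                    truncation=True, max_length=2048)
    enc["labels"] = enc["input_ids"].clone()  # causal LM: labels are the inputs
    return enc

loader = DataLoader(train_texts, batch_size=1, shuffle=True, collate_fn=collate)
optimizer = AdamW(model.parameters(), lr=2e-4)

model.train()
for batch in loader:
    # Assumes the custom model returns a loss when `labels` are provided,
    # as GPT-style HuggingFace models typically do; adapt if forward() differs.
    outputs = model(**batch)
    loss = outputs["loss"] if isinstance(outputs, dict) else outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```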
## Training Details

### Dataset
- Source: common-pile/wikimedia_filtered
### Training Configuration
- Framework: PyTorch with mixed precision (BF16/FP16)
- Optimizer: AdamW (learning rate: 2e-4)
- Batch Size: 1 with gradient accumulation (32 steps)
- Context Window: 2048 tokens
- Validation Split: 10%
- Early Stopping: Implemented at epoch and step levels (see the training-step sketch below)
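
To make this configuration concrete, a generic mixed-precision training step with 32-step gradient accumulation looks roughly like the following. This is a sketch of the standard PyTorch AMP pattern, not the actual training script; the FP16 choice, the clipping threshold of 1.0, and the `loader` variable are assumptions:

```python
import torch

accumulation_steps = 32                      # effective batch size = 1 * 32
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()         # loss scaling for FP16 (not needed for BF16)

model.train()
for step, batch in enumerate(loader):        # `loader` yields tokenized batches (hypothetical)
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(**batch)["loss"] / accumulation_steps
    scaler.scale(loss).backward()            # accumulate scaled gradients
    if (step + 1) % accumulation_steps == 0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```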
### Performance Monitoring
- Training loss tracked per epoch with perplexity calculation
- Full validation after each epoch
- Step-level monitoring every 500 steps
- Comprehensive metrics saved in `training_metrics.json`
## Technical Implementation

### Attention Mechanism
- Causal Masking: Supports autoregressive generation
- Key Padding Mask: Enables batched inference
- Scaled Dot-Product: Head dimension normalization included (a minimal sketch follows below)
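
A minimal sketch of this attention pattern (illustrative only; the model's fused implementation differs in detail):

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, key_padding_mask=None):
    """q, k, v: (batch, heads, seq, head_dim); key_padding_mask: (batch, seq), True = padding."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)      # scaled dot-product
    seq = q.size(-2)
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))   # causal mask: no attending to the future
    if key_padding_mask is not None:
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```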
### Memory Optimization
- Fused Operations: Reduces memory fragmentation
- Mixed Precision: 30-40% memory reduction
- Gradient Accumulation: Simulates larger batch sizes
- Optional Quantization: Further model compression (see the example below)
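
For the optional quantization, one common post-training approach is PyTorch dynamic quantization of the linear layers; this is a sketch under the assumption that standard `nn.Linear` modules are used, not necessarily the scheme applied to this model:

```python
import torch

# Quantize nn.Linear weights to int8 for CPU inference; activations remain in float.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```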
### Training Stability
- Gradient Clipping: Prevents exploding gradients
- Automatic Loss Scaling: Mixed precision stability
- Early Stopping: Prevents overfitting with patience mechanisms (see the patience sketch below)
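
The patience mechanism can be pictured as a simple counter over validation loss; the patience value and threshold below are illustrative, not the ones used in training:

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` checks."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience
```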
## System Requirements

### Memory Requirements
- Training: 16-24GB VRAM (precision dependent)
- Inference: 1-6GB VRAM for standard generation
- Context: Maximum 2048 tokens input length
## Generation Parameters
Default configuration:

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.9,
  "top_k": 50,
  "max_new_tokens": 50,
  "pad_token_id": 50256,
  "eos_token_id": 50256
}
```
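
Because `.generate()` is not supported (see Limitations), these defaults must be applied by hand. Here is a sketch of a sampling loop that applies the temperature, top-k, and top-p values above, reusing the `model` and `tokenizer` from the Quick Start:

```python
import torch

def sample_generate(prompt, max_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9):
    model.eval()
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids=generated)["logits"][:, -1, :] / temperature
            # Top-k: keep only the k most likely tokens.
            topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
            probs = torch.softmax(topk_vals, dim=-1)
            # Top-p: drop the tail beyond cumulative probability p, keep at least one token.
            sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
            cutoff = torch.cumsum(sorted_probs, dim=-1) > top_p
            cutoff[..., 0] = False
            sorted_probs[cutoff] = 0.0
            sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
            # Sample from the truncated distribution and map back to a vocabulary id.
            choice = torch.multinomial(sorted_probs, num_samples=1)
            next_id = topk_idx.gather(-1, sorted_idx.gather(-1, choice))
            generated = torch.cat([generated, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(sample_generate("Once upon a time"))
```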
## Model Files

The repository contains:

- `pytorch_model.bin` - PyTorch model weights
- `model.safetensors` - SafeTensors format weights
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `training_metrics.json` - Training statistics
- `tokenizer.json` - Tokenizer configuration
- `vocab.json` & `merges.txt` - Vocabulary files
## Limitations

- No HuggingFace `.generate()` support: Use the custom generation function above
- Output Quality: May produce repetitive or nonsensical text for some prompts
- Hardware Requirements: GPU recommended for practical inference
- Context Window: Limited to 2048 tokens
- Dataset Dependency: Performance tied to the quality of the common-pile/wikimedia_filtered dataset
## Example Output

N/A
## Support Me
You can support me via Ko-fi or you can try my Vast.ai template!
## Small meta-data
- Release date: July 21, 2025.