|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- common-pile/wikimedia_filtered |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- pre-train |
|
- custom_code |
|
- SnowflakeCore |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# SnowflakeCore-G1-Tiny2 |
|
|
|
An improved version of SnowflakeCore-G1-Tiny: a custom GPT-style transformer language model built from scratch in PyTorch and trained on the common-pile/wikimedia\_filtered dataset.
|
|
|
## Model Overview |
|
|
|
SnowflakeCore-G1-Tiny2 is a GPT-style autoregressive transformer model with **\~400M parameters** designed for text generation tasks. |
|
|
|
### Key Features |
|
|
|
* **2048 token context window** for extended conversations |
|
* **Mixed precision training** (BF16/FP16) for efficiency |
|
* **Custom attention implementation** with fused operations |
|
* **Early stopping mechanisms** for optimal training |
|
* **Gradient accumulation** for effective large batch training |
|
|
|
### Architecture Specifications |
|
|
|
| Component | Value | |
|
| --------------- | -------------------------- | |
|
| Model Type | Autoregressive Transformer | |
|
| Parameters | \~400M | |
|
| Layers | 24 | |
|
| Hidden Size | 1024 | |
|
| Attention Heads | 16 | |
|
| Head Dimension | 64 | |
|
| FFN Dimension | 4096 | |
|
| Context Length | 2048 tokens | |
|
| Vocabulary Size | 50,257 (GPT-2 tokenizer) | |
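For orientation, the table above corresponds roughly to the hyperparameter set below. The key names here are illustrative only; the authoritative values and field names are defined by the custom code in the repository's `config.json`.

```python
# Illustrative hyperparameters mirroring the architecture table above.
# Key names are placeholders; see config.json for the real field names.
snowflake_g1_tiny2 = {
    "num_layers": 24,                 # transformer blocks
    "hidden_size": 1024,              # model / embedding dimension
    "num_attention_heads": 16,        # attention heads
    "head_dim": 64,                   # 1024 / 16
    "ffn_dim": 4096,                  # feed-forward inner dimension (4x hidden)
    "max_position_embeddings": 2048,  # context window
    "vocab_size": 50257,              # GPT-2 BPE tokenizer
}
```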
|
|
|
## Model Benchmarks |
|
|
|
The following benchmarks compare `SnowflakeCore-G1-Tiny2`, its predecessor, and GPT-2 on key performance and text quality metrics. |
|
|
|
### Performance & Quality Metrics |
|
|
|
| Model | Params | Size (MB) | Speed (tok/s) | Vocab Div. | Dist. Bigrams | Dist. Trigrams | Bigram Repet. | Trigram Repet. | |
|
| -------------------------- | ------ | --------- | ------------- | ---------- | ------------- | -------------- | ------------- | -------------- | |
|
| **SnowflakeCore-G1-Tiny2** | 355.9M | 1357.54 | 22.13 | **0.3440** | **0.7408** | **0.8834** | **0.2592** | **0.1166** | |
|
| SnowflakeCore-G1-Tiny | 355.9M | 1357.54 | 22.12 | 0.2780 | 0.6111 | 0.7421 | 0.3889 | 0.2579 | |
|
| GPT-2 (small) | 124.4M | 474.70 | **47.73** | 0.2590 | 0.6408 | 0.7946 | 0.3592 | 0.2054 | |
|
|
|
> **Notes:** |
|
> |
|
> * Vocabulary Diversity = unique tokens / total tokens |
|
> * Distinct N-grams = unique n-grams / total n-grams |
|
> * Lower repetition rates indicate better text novelty |
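For reference, these metrics can be computed from a generated token sequence as in the sketch below (the tokenizer and evaluation prompts behind the table above are not reproduced here):

```python
def ngrams(tokens, n):
    """All length-n windows over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocab_diversity(tokens):
    """Vocabulary diversity = unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def distinct_n(tokens, n):
    """Distinct n-grams = unique n-grams / total n-grams."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0

tokens = "the cat sat on the mat and the cat slept".split()
print(f"vocabulary diversity: {vocab_diversity(tokens):.4f}")
print(f"distinct bigrams:     {distinct_n(tokens, 2):.4f}")
print(f"bigram repetition:    {1 - distinct_n(tokens, 2):.4f}")  # lower is better
```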
|
|
|
### Memory Usage (CPU) |
|
|
|
All models report `N/A` for CPU memory usage across all sequence lengths. |
|
|
|
| Sequence Length | SnowflakeCore-G1-Tiny | SnowflakeCore-G1-Tiny2 | GPT-2 | |
|
| --------------- | --------------------- | ---------------------- | ----- | |
|
| 128 | N/A (CPU) | N/A (CPU) | N/A | |
|
| 512 | N/A (CPU) | N/A (CPU) | N/A | |
|
| 1024 | N/A (CPU) | N/A (CPU) | N/A | |
|
| 2048 | N/A (CPU) | N/A (CPU) | N/A | |
|
|
|
## Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install torch transformers # if not already installed |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
|
|
# Load model and tokenizer |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"FlameF0X/SnowflakeCore-G1-Tiny2", |
|
trust_remote_code=True, |
|
force_download=True, |
|
use_safetensors=True, |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained( |
|
"FlameF0X/SnowflakeCore-G1-Tiny2", |
|
trust_remote_code=True, |
|
force_download=True, |
|
use_safetensors=True, |
|
) |
|
|
|
def custom_greedy_generate(prompt, max_length=50): |
|
model.eval() |
|
input_ids = tokenizer(prompt, return_tensors="pt").input_ids |
|
generated = input_ids |
|
|
|
with torch.no_grad(): |
|
for _ in range(max_length): |
|
outputs = model(input_ids=generated) |
|
next_token_logits = outputs["logits"][:, -1, :] |
|
next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1) |
|
generated = torch.cat((generated, next_token_id), dim=1) |
|
|
|
if next_token_id.item() == tokenizer.eos_token_id: |
|
break |
|
|
|
return tokenizer.decode(generated[0], skip_special_tokens=True) |
|
|
|
# Generate text |
|
prompt = "Once upon a time" |
|
result = custom_greedy_generate(prompt) |
|
print(result) |
|
``` |
|
|
|
### Fine-Tuning |
|
|
|
Fine-tuning follows a standard causal language modeling loop over your own text data.
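The sketch below is a minimal full fine-tuning loop, not the script used to train this model. It only assumes what the Basic Usage example already shows (the model returns a dict with `"logits"`); the corpus, epoch count, and hyperparameters are placeholders to adapt.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2", trust_remote_code=True, use_safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2", trust_remote_code=True
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Placeholder corpus -- replace with your own texts (each at most 2048 tokens).
texts = [
    "Once upon a time there was a tiny language model.",
    "It was fine-tuned on a handful of example sentences.",
]

for epoch in range(3):
    for text in texts:
        input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        logits = model(input_ids=input_ids)["logits"]
        # Shift so that position t predicts token t + 1.
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: last loss {loss.item():.4f}")
```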
|
|
|
## Training Details |
|
|
|
### Dataset |
|
|
|
* **Source**: [common-pile/wikimedia\_filtered](https://huggingface.co/datasets/common-pile/wikimedia_filtered) |
|
|
|
### Training Configuration |
|
|
|
* **Framework**: PyTorch with mixed precision (BF16/FP16) |
|
* **Optimizer**: AdamW (learning rate: 2e-4) |
|
* **Batch Size**: 1 with gradient accumulation (32 steps); see the sketch below this list
|
* **Context Window**: 2048 tokens |
|
* **Validation Split**: 10% |
|
* **Early Stopping**: Implemented at epoch and step levels |
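The mixed-precision plus gradient-accumulation pattern above looks roughly like the following generic PyTorch sketch (a stand-in model and random data are used so the snippet runs on its own; this is not the project's actual training script):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Linear(16, 16).to(device)                        # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # automatic loss scaling (needed for FP16)

accumulation_steps = 32
batches = [torch.randn(1, 16) for _ in range(64)]           # micro-batch size 1, dummy data

optimizer.zero_grad()
for step, batch in enumerate(batches):
    batch = batch.to(device)
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=(device == "cuda")):
        loss = model(batch).pow(2).mean()                   # stand-in loss
        loss = loss / accumulation_steps                    # average gradients over the accumulation window
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:                # effective batch size = 1 x 32
        scaler.unscale_(optimizer)                          # so clipping sees true gradient norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```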
|
|
|
### Performance Monitoring |
|
|
|
* Training loss tracked per epoch with perplexity calculation |
|
* Full validation after each epoch |
|
* Step-level monitoring every 500 steps |
|
* Comprehensive metrics saved in `training_metrics.json` |
|
|
|
## Technical Implementation |
|
|
|
### Attention Mechanism |
|
|
|
* **Causal Masking**: Supports autoregressive generation |
|
* **Key Padding Mask**: Enables batched inference |
|
* **Scaled Dot-Product**: Head dimension normalization included (all three points are illustrated in the sketch below)
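A conceptual sketch of these three points using generic scaled dot-product attention (this illustrates the ideas only, not the model's fused implementation):

```python
import math
import torch

def causal_attention(q, k, v, key_padding_mask=None):
    """q, k, v: (batch, heads, seq, head_dim); key_padding_mask: (batch, seq), True marks padding."""
    seq, head_dim = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)   # scaled dot-product (head-dim normalization)
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))        # causal mask: no attending to future positions
    if key_padding_mask is not None:                          # key padding mask for batched inference
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Shapes matching the architecture table: 16 heads, head dimension 64.
q = k = v = torch.randn(2, 16, 8, 64)
print(causal_attention(q, k, v).shape)   # torch.Size([2, 16, 8, 64])
```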
|
|
|
### Memory Optimization |
|
|
|
* **Fused Operations**: Reduces memory fragmentation |
|
* **Mixed Precision**: 30-40% memory reduction |
|
* **Gradient Accumulation**: Simulates larger batch sizes |
|
* **Optional Quantization**: Further model compression |
|
|
|
### Training Stability |
|
|
|
* **Gradient Clipping**: Prevents exploding gradients |
|
* **Automatic Loss Scaling**: Mixed precision stability |
|
* **Early Stopping**: Prevents overfitting with patience mechanisms (see the sketch below)
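A minimal sketch of patience-based early stopping; the validation losses are simulated, and the actual patience values used during training are not documented here.

```python
# Simulated per-epoch validation losses; in practice these come from the validation split.
val_losses = [2.90, 2.50, 2.30, 2.31, 2.32, 2.33, 2.34]

best_val_loss = float("inf")
patience, patience_counter = 3, 0        # stop after 3 epochs without improvement

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0             # improvement: reset patience (and checkpoint the model)
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}; best validation loss {best_val_loss}")
            break
```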
|
|
|
## System Requirements |
|
|
|
### Memory Requirements |
|
|
|
* **Training**: 16-24GB VRAM (precision dependent) |
|
* **Inference**: 1-6GB VRAM for standard generation |
|
* **Context**: Maximum 2048 tokens input length |
|
|
|
### Generation Parameters |
|
|
|
Default configuration: |
|
|
|
```json |
|
{ |
|
"do_sample": true, |
|
"temperature": 1.0, |
|
"top_p": 0.9, |
|
"top_k": 50, |
|
"max_new_tokens": 50, |
|
"pad_token_id": 50256, |
|
"eos_token_id": 50256 |
|
} |
|
``` |
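Because `.generate()` is not supported (see Limitations), these defaults have to be applied in a custom loop. The sketch below reuses the `model` and `tokenizer` loaded in Basic Usage and applies temperature, top-k, and top-p sampling; it is a simplified approximation of the usual sampling pipeline, not the model's own generation code.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Sample one token id from last-step logits of shape (1, vocab_size)."""
    logits = logits / temperature
    top_values, top_indices = torch.topk(logits, top_k)      # keep the k most likely tokens (sorted descending)
    probs = torch.softmax(top_values, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    keep = cumulative <= top_p                                # nucleus: smallest prefix covering top_p mass
    keep[..., 0] = True                                       # always keep the most likely token
    probs = probs * keep
    probs = probs / probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(probs, num_samples=1)          # index within the top-k set
    return top_indices.gather(-1, choice)                     # map back to a vocabulary id

def sample_generate(prompt, max_new_tokens=50):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids=input_ids)["logits"][:, -1, :]
        next_id = sample_next_token(logits)
        input_ids = torch.cat((input_ids, next_id), dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(sample_generate("Once upon a time"))
```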
|
|
|
## Model Files |
|
|
|
The repository contains: |
|
|
|
* `pytorch_model.bin` - PyTorch model weights |
|
* `model.safetensors` - SafeTensors format weights |
|
* `config.json` - Model configuration |
|
* `generation_config.json` - Generation parameters |
|
* `training_metrics.json` - Training statistics |
|
* `tokenizer.json` - Tokenizer configuration |
|
* `vocab.json` & `merges.txt` - Vocabulary files |
|
|
|
## Limitations |
|
|
|
* **No HuggingFace `.generate()` support**: Use custom generation function |
|
* **Output Quality**: May produce repetitive or nonsensical text for some prompts |
|
* **Hardware Requirements**: GPU recommended for practical inference |
|
* **Context Window**: Limited to 2048 tokens |
|
* **Dataset Dependency**: Performance is tied to the quality of the common-pile/wikimedia\_filtered training data
|
|
|
## Example Output |
|
|
|
``` |
|
N/A |
|
``` |
|
|
|
## Support Me |
|
|
|
You can support me via [Ko-fi](https://ko-fi.com/flamef0x) or try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template!
|
|
|
### Metadata
|
|
|
* Release date: July 21, 2025. |