---
license: apache-2.0
datasets:
- FlameF0X/Mixture-of-Thoughts-2048T
language:
- en
library_name: transformers
tags:
- pre-train
- custom_code
- SnowflakeCore
model-index:
- name: FlameF0X/SnowflakeCore-G1-Tiny
results:
- task:
type: generation_speed
name: Generation Speed
metrics:
- type: avg_tokens_per_second
value: 57.257723907839626
- task:
type: model_size
name: Model Size
metrics:
- type: model_size_mb
value: 1357.54296875
- task:
type: gsm8k_accuracy
name: GSM8K Accuracy
metrics:
- type: accuracy
value: 0.2
- task:
type: mmlu_accuracy
name: MMLU Accuracy
metrics:
- type: accuracy
value: 0
- task:
type: humaneval_pass@1
name: HumanEval Pass@1
metrics:
- type: pass@1
value: 0
- task:
type: peak_memory_gb
name: Peak Memory (seq_128)
metrics:
- type: seq_128
value: 5.9882988929748535
- task:
type: peak_memory_gb
name: Peak Memory (seq_512)
metrics:
- type: seq_512
value: 6.0380940437316895
- task:
type: peak_memory_gb
name: Peak Memory (seq_1024)
metrics:
- type: seq_1024
value: 6.123685836791992
- task:
type: peak_memory_gb
name: Peak Memory (seq_2048)
metrics:
- type: seq_2048
value: 6.354169845581055
pipeline_tag: text-generation
new_version: FlameF0X/SnowflakeCore-G1-Tiny2
---
# SnowflakeCore-G1-Tiny
A custom GPT-style transformer language model built from scratch using PyTorch, trained on the Mixture-of-Thoughts dataset for enhanced reasoning capabilities.
## Model Overview
SnowflakeCore-G1-Tiny is a GPT-style autoregressive transformer model with **~400M parameters** designed for text generation tasks.
### Key Features
- **2048 token context window** for extended conversations
- **Mixed precision training** (BF16/FP16) for efficiency
- **Custom attention implementation** with fused operations
- **Early stopping mechanisms** for optimal training
- **Gradient accumulation** for effective large batch training
### Architecture Specifications
| Component | Value |
|-----------|-------|
| Model Type | Autoregressive Transformer |
| Parameters | ~400M |
| Layers | 24 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Head Dimension | 64 |
| FFN Dimension | 4096 |
| Context Length | 2048 tokens |
| Vocabulary Size | 50,257 (GPT-2 tokenizer) |
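As a sanity check, a rough parameter count can be derived from the table above. The sketch below assumes a standard GPT-2-style block layout and ignores biases and layer norms, so it only approximates the ~400M figure quoted earlier.

```python
# Back-of-the-envelope parameter estimate from the architecture table.
# This is an approximation, not the exact count stored in the checkpoint.
vocab, d_model, n_layers, d_ffn, n_ctx = 50_257, 1024, 24, 4096, 2048

embeddings = vocab * d_model + n_ctx * d_model  # token + position embeddings
attention = 4 * d_model * d_model               # Q, K, V and output projections
ffn = 2 * d_model * d_ffn                       # up- and down-projection
per_block = attention + ffn
total = embeddings + n_layers * per_block       # ~356M with a tied output head

print(f"~{total / 1e6:.0f}M parameters "
      f"(plus ~{vocab * d_model / 1e6:.0f}M more if the output head is untied)")
```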
## Quick Start
### Installation
```bash
pip install torch transformers # if not already installed
```
### Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)

def custom_greedy_generate(prompt, max_length=50):
    model.eval()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids=generated)
            next_token_logits = outputs["logits"][:, -1, :]
            next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
            generated = torch.cat((generated, next_token_id), dim=1)
            if next_token_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)
# Generate text
prompt = "Once upon a time"
result = custom_greedy_generate(prompt)
print(result)
```
### Fine-Tuning
```python
import os
import argparse
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import torch

# === Disable W&B logging ===
os.environ["WANDB_DISABLED"] = "true"

# === Config ===
config = {
    "model_name": "FlameF0X/SnowflakeCore-G1-Tiny",
    "output_dir": "./snowflake-chatbot",
    "context_window": 512,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "max_steps": 500,
    "dataloader_workers": 4,
    "dataset_name": "tatsu-lab/alpaca",
    "dataset_split": "train[:10000]",
}

# === Derived ===
config["effective_batch_size"] = (
    config["per_device_batch_size"] * config["gradient_accumulation_steps"]
)
print(f"Effective batch size: {config['effective_batch_size']}")
print(f"Context window: {config['context_window']}")

# === 1. Load tokenizer and model ===
def load_model_and_tokenizer(config):
    print(f"Loading model and tokenizer from {config['model_name']}...")
    tokenizer = AutoTokenizer.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        use_safetensors=True,
        model_max_length=config["context_window"],
    )
    model = AutoModelForCausalLM.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        use_safetensors=True,
    )
    if hasattr(torch, "compile"):
        try:
            print("Compiling model with torch.compile...")
            model = torch.compile(model)
        except Exception as e:
            print(f"Compilation failed: {e}")
    return tokenizer, model

# === 2. Load dataset ===
def load_custom_dataset(name, split):
    print(f"Loading dataset: {name} ({split})...")
    return load_dataset(name, split=split)

# === 3. Format dataset ===
def format_example(example):
    """Update this function to work with different datasets."""
    return {
        "text": f"### Instruction:\n{example['instruction']}\n### Input:\n{example['input']}\n### Response:\n{example['output']}"
    }

# === 4. Tokenize ===
def tokenize_example(example, tokenizer, max_length):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# === 5. Train ===
def train_model(model, tokenizer, tokenized_dataset, config):
    print("Preparing training arguments...")
    training_args = TrainingArguments(
        output_dir=config["output_dir"],
        per_device_train_batch_size=config["per_device_batch_size"],
        gradient_accumulation_steps=config["gradient_accumulation_steps"],
        max_steps=config["max_steps"],
        logging_dir="./logs",
        logging_steps=20,
        save_strategy="no",
        fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
        overwrite_output_dir=True,
        report_to=[],
        dataloader_num_workers=config["dataloader_workers"],
        optim="adamw_torch_fused" if torch.cuda.is_available() and hasattr(torch, "compile") else "adamw_torch",
        remove_unused_columns=False,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
    )
    print("Starting training...")
    trainer.train()
    print("Training completed.")

# === 6. Save ===
def save_model(model, tokenizer, output_dir):
    print(f"Saving model to {output_dir}...")
    model.save_pretrained(output_dir, safe_serialization=False)
    tokenizer.save_pretrained(output_dir)
    print("Model saved.")

# === Main ===
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, default=config["dataset_name"])
    parser.add_argument("--split", type=str, default=config["dataset_split"])
    args = parser.parse_args()

    tokenizer, model = load_model_and_tokenizer(config)
    dataset = load_custom_dataset(args.dataset, args.split)

    print("Formatting dataset...")
    dataset = dataset.map(format_example, num_proc=config["dataloader_workers"], load_from_cache_file=False)

    print("Tokenizing dataset...")
    tokenized = dataset.map(
        lambda x: tokenize_example(x, tokenizer, config["context_window"]),
        batched=True,
        num_proc=config["dataloader_workers"],
        load_from_cache_file=False,
    )
    tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

    train_model(model, tokenizer, tokenized, config)
    save_model(model, tokenizer, config["output_dir"])

if __name__ == "__main__":
    main()
```
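The script is self-contained. Assuming it is saved as, for example, `finetune.py` (the filename is arbitrary), it can be pointed at a different instruction-style dataset with `python finetune.py --dataset <name> --split <split>`, provided `format_example` is updated to match that dataset's columns.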
## Training Details
### Dataset
- **Source**: [FlameF0X/Mixture-of-Thoughts-2048T](https://huggingface.co/datasets/FlameF0X/Mixture-of-Thoughts-2048T)
- **Purpose**: Enhanced reasoning capabilities through mixture-of-thoughts training
### Training Configuration
- **Framework**: PyTorch with mixed precision (BF16/FP16)
- **Optimizer**: AdamW (learning rate: 2e-4)
- **Batch Size**: 1 with gradient accumulation (32 steps)
- **Context Window**: 2048 tokens
- **Validation Split**: 10%
- **Early Stopping**: Implemented at epoch and step levels
### Performance Monitoring
- Training loss tracked per epoch with perplexity calculation
- Full validation after each epoch
- Step-level monitoring every 500 steps
- Comprehensive metrics saved in `training_metrics.json`
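The schema of `training_metrics.json` is not documented in this card, so the snippet below (which assumes the `huggingface_hub` client is installed) only downloads the file and lists its top-level keys rather than relying on particular field names.

```python
import json
from huggingface_hub import hf_hub_download

# Download the training metrics file from the model repo and inspect its contents.
path = hf_hub_download("FlameF0X/SnowflakeCore-G1-Tiny", "training_metrics.json")
with open(path) as f:
    metrics = json.load(f)

if isinstance(metrics, dict):
    for key, value in metrics.items():
        print(key, type(value).__name__)
else:
    print(type(metrics).__name__, len(metrics))
```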
## Technical Implementation
### Attention Mechanism
- **Causal Masking**: Supports autoregressive generation
- **Key Padding Mask**: Enables batched inference
- **Scaled Dot-Product**: Head dimension normalization included
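The actual attention code ships with the repository's `custom_code`. For reference, a minimal causal scaled-dot-product attention with a key padding mask looks roughly like the sketch below; the function name, shapes, and masking convention are illustrative, not the model's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, key_padding_mask=None):
    """Minimal causal scaled-dot-product attention.

    q, k, v: (batch, heads, seq, head_dim); key_padding_mask: (batch, seq),
    True where a key position is padding and should be ignored.
    """
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)  # normalize by head dimension

    # Causal mask: each position may only attend to itself and earlier positions.
    causal = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(causal, float("-inf"))

    # Key padding mask: block attention to padded keys for batched inference.
    if key_padding_mask is not None:
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))

    return F.softmax(scores, dim=-1) @ v
```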
### Memory Optimization
- **Fused Operations**: Reduces memory fragmentation
- **Mixed Precision**: 30-40% memory reduction
- **Gradient Accumulation**: Simulates larger batch sizes
- **Optional Quantization**: Further model compression
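For inference, the simplest memory saving is loading the weights in half precision. A minimal sketch follows; whether `torch_dtype` is honored end to end depends on the repository's custom model code, so treat this as an assumption to verify.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights in FP16 on GPU to roughly halve inference memory.
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.float16,
).to("cuda")
```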
### Training Stability
- **Gradient Clipping**: Prevents exploding gradients
- **Automatic Loss Scaling**: Mixed precision stability
- **Early Stopping**: Prevents overfitting with patience mechanisms
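These pieces compose into a familiar PyTorch pattern. The sketch below is a generic illustration of loss scaling, gradient clipping, and patience-based early stopping, not the project's actual training loop; `max_norm=1.0`, `patience=3`, and the way the loss is read from the model output are placeholder assumptions.

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # automatic loss scaling for FP16 training

def training_step(model, optimizer, batch):
    """One mixed-precision step with loss scaling and gradient clipping."""
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(**batch)  # assumes a loss is returned when labels are passed
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs.loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                                        # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # prevent exploding gradients
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

class EarlyStopping:
    """Stop when validation loss has not improved for `patience` evaluations."""
    def __init__(self, patience=3):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience
```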
## System Requirements
### Memory Requirements
- **Training**: 16-24GB VRAM (precision dependent)
- **Inference**: 4-6GB VRAM for standard generation
- **Context**: Maximum 2048 tokens input length
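The peak-memory figures in the model card metadata (around 6 GB for 128 to 2048 token inputs) can be reproduced approximately with `torch.cuda` memory statistics. A minimal sketch, assuming `model` and `tokenizer` are already loaded on a CUDA device; measured values will vary with hardware and precision.

```python
import torch

def peak_memory_gb(model, tokenizer, seq_len=512):
    """Peak GPU memory (GB) for a single forward pass at a given sequence length."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    input_ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device="cuda")
    with torch.no_grad():
        model(input_ids=input_ids)
    return torch.cuda.max_memory_allocated() / 1024**3

for n in (128, 512, 1024, 2048):
    print(n, f"{peak_memory_gb(model, tokenizer, n):.2f} GB")
```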
### Generation Parameters
Default configuration:
```json
{
"do_sample": true,
"temperature": 1.0,
"top_p": 0.9,
"top_k": 50,
"max_new_tokens": 50,
"pad_token_id": 50256,
"eos_token_id": 50256
}
```
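Because `.generate()` is not supported (see Limitations), these sampling defaults have to be applied by hand. The sketch below builds on the same pattern as `custom_greedy_generate` from Basic Usage and is illustrative, not part of the repository; it assumes `model` and `tokenizer` are already loaded.

```python
import torch

def custom_sample_generate(prompt, max_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9):
    """Top-k / top-p (nucleus) sampling using the default generation parameters."""
    model.eval()
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids=generated)["logits"][:, -1, :] / temperature
            # Top-k: keep only the k most likely tokens.
            topk_vals, topk_idx = torch.topk(logits, top_k)
            probs = torch.softmax(topk_vals, dim=-1)
            # Top-p: drop the tail once cumulative probability exceeds top_p.
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            cutoff = torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p
            sorted_probs = sorted_probs.masked_fill(cutoff, 0.0)
            sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
            # Sample and map back to a vocabulary id.
            choice = torch.multinomial(sorted_probs, num_samples=1)
            next_id = topk_idx.gather(-1, sorted_idx.gather(-1, choice))
            generated = torch.cat([generated, next_id], dim=1)
            if next_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(custom_sample_generate("Once upon a time"))
```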
## Model Files
The repository contains:
- `pytorch_model.bin` - PyTorch model weights
- `model.safetensors` - SafeTensors format weights
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `training_metrics.json` - Training statistics
- `tokenizer.json` - Tokenizer configuration
- `vocab.json` & `merges.txt` - Vocabulary files
## Limitations
- **No Hugging Face `.generate()` support**: use a custom generation loop (see Basic Usage above)
- **Output Quality**: May produce repetitive or nonsensical text for some prompts
- **Hardware Requirements**: GPU recommended for practical inference
- **Context Window**: Limited to 2048 tokens
- **Dataset Dependency**: Performance tied to Mixture-of-Thoughts dataset quality
## Example Output
```
Input: Hello, I am Alex and
Output: Hello, I am Alex andbourg Chip Chip Chip Chip Chip Chip Chip ChipCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCos
```
*Note: The repetitive output shown is typical for small or early-stage models and can be improved with further training or fine-tuning.*
## Support Me
You can support me via [Ko-fi](https://ko-fi.com/flamef0x) or you can try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template!
### Metadata
- Release date: June 29, 2025.