|
---
license: apache-2.0
datasets:
- FlameF0X/Mixture-of-Thoughts-2048T
language:
- en
library_name: transformers
tags:
- pre-train
- custom_code
- SnowflakeCore
model-index:
- name: FlameF0X/SnowflakeCore-G1-Tiny
  results:
  - task:
      type: generation_speed
      name: Generation Speed
    metrics:
    - type: avg_tokens_per_second
      value: 57.257723907839626
  - task:
      type: model_size
      name: Model Size
    metrics:
    - type: model_size_mb
      value: 1357.54296875
  - task:
      type: gsm8k_accuracy
      name: GSM8K Accuracy
    metrics:
    - type: accuracy
      value: 0.2
  - task:
      type: mmlu_accuracy
      name: MMLU Accuracy
    metrics:
    - type: accuracy
      value: 0
  - task:
      type: humaneval_pass@1
      name: HumanEval Pass@1
    metrics:
    - type: pass@1
      value: 0
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_128)
    metrics:
    - type: seq_128
      value: 5.9882988929748535
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_512)
    metrics:
    - type: seq_512
      value: 6.0380940437316895
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_1024)
    metrics:
    - type: seq_1024
      value: 6.123685836791992
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_2048)
    metrics:
    - type: seq_2048
      value: 6.354169845581055
pipeline_tag: text-generation
new_version: FlameF0X/SnowflakeCore-G1-Tiny2
---
|
|
|
# SnowflakeCore-G1-Tiny |
|
|
|
A custom GPT-style transformer language model built from scratch using PyTorch, trained on the Mixture-of-Thoughts dataset for enhanced reasoning capabilities. |
|
|
|
## Model Overview |
|
|
|
SnowflakeCore-G1-Tiny is a GPT-style autoregressive transformer with **~400M parameters**, designed for text generation tasks.
|
|
|
### Key Features |
|
- **2048-token context window** for extended conversations
- **Mixed precision training** (BF16/FP16) for efficiency
- **Custom attention implementation** with fused operations
- **Early stopping** at both epoch and step level to curb overfitting
- **Gradient accumulation** for large effective batch sizes
|
|
|
### Architecture Specifications |
|
|
|
| Component | Value | |
|
|-----------|-------| |
|
| Model Type | Autoregressive Transformer | |
|
| Parameters | ~400M | |
|
| Layers | 24 | |
|
| Hidden Size | 1024 | |
|
| Attention Heads | 16 | |
|
| Head Dimension | 64 | |
|
| FFN Dimension | 4096 | |
|
| Context Length | 2048 tokens | |
|
| Vocabulary Size | 50,257 (GPT-2 tokenizer) | |
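
These values can be cross-checked against the published checkpoint by inspecting the configuration alone, with no weight download. The field names used below (`num_hidden_layers`, `hidden_size`, and so on) are assumptions about the custom config and may differ from the actual `config.json`:

```python
from transformers import AutoConfig

# Fetch only the configuration and print the fields that should match the table above.
config = AutoConfig.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny", trust_remote_code=True
)
for field in ("num_hidden_layers", "hidden_size", "num_attention_heads",
              "max_position_embeddings", "vocab_size"):
    print(field, getattr(config, field, "<not present>"))
```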
|
|
|
## Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install torch transformers # if not already installed |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
|
|
# Load model and tokenizer (force_download=True re-fetches the files on
# every run; drop it once a clean copy is cached)
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
)
|
|
|
def custom_greedy_generate(prompt, max_length=50): |
|
model.eval() |
|
input_ids = tokenizer(prompt, return_tensors="pt").input_ids |
|
generated = input_ids |
|
|
|
with torch.no_grad(): |
|
for _ in range(max_length): |
|
outputs = model(input_ids=generated) |
|
next_token_logits = outputs["logits"][:, -1, :] |
|
next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1) |
|
generated = torch.cat((generated, next_token_id), dim=1) |
|
|
|
if next_token_id.item() == tokenizer.eos_token_id: |
|
break |
|
|
|
return tokenizer.decode(generated[0], skip_special_tokens=True) |
|
|
|
# Generate text |
|
prompt = "Once upon a time" |
|
result = custom_greedy_generate(prompt) |
|
print(result) |
|
``` |
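
Note that `max_length` in the function above counts newly generated tokens, not the total sequence length. For practical speeds a GPU is recommended; a minimal sketch of the adjustment (the generation loop would then also need its `input_ids` moved to the same device):

```python
# Run inference on GPU when available; inputs must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```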
|
|
|
### Fine-Tuning |
|
|
|
```python |
|
import os |
|
import argparse |
|
from transformers import ( |
|
AutoTokenizer, |
|
AutoModelForCausalLM, |
|
Trainer, |
|
TrainingArguments, |
|
) |
|
from datasets import load_dataset |
|
import torch |
|
|
|
# === Disable W&B logging === |
|
os.environ["WANDB_DISABLED"] = "true" |
|
|
|
# === Config === |
|
config = { |
|
"model_name": "FlameF0X/SnowflakeCore-G1-Tiny", |
|
"output_dir": "./snowflake-chatbot", |
|
"context_window": 512, |
|
"per_device_batch_size": 1, |
|
"gradient_accumulation_steps": 16, |
|
"max_steps": 500, |
|
"dataloader_workers": 4, |
|
"dataset_name": "tatsu-lab/alpaca", |
|
"dataset_split": "train[:10000]", |
|
} |
|
|
|
# === Derived === |
|
config["effective_batch_size"] = ( |
|
config["per_device_batch_size"] * config["gradient_accumulation_steps"] |
|
) |
|
|
|
print(f"Effective batch size: {config['effective_batch_size']}") |
|
print(f"Context window: {config['context_window']}") |
|
|
|
|
|
# === 1. Load tokenizer and model === |
|
def load_model_and_tokenizer(config): |
|
print(f"Loading model and tokenizer from {config['model_name']}...") |
|
    tokenizer = AutoTokenizer.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        model_max_length=config["context_window"],
    )
    model = AutoModelForCausalLM.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        use_safetensors=True,
    )

    # GPT-2-style tokenizers ship without a pad token; reuse EOS so that
    # padding="max_length" during tokenization does not fail.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
|
|
|
if hasattr(torch, "compile"): |
|
try: |
|
print("Compiling model with torch.compile...") |
|
model = torch.compile(model) |
|
except Exception as e: |
|
print(f"Compilation failed: {e}") |
|
return tokenizer, model |
|
|
|
|
|
# === 2. Load dataset === |
|
def load_custom_dataset(name, split): |
|
print(f"Loading dataset: {name} ({split})...") |
|
return load_dataset(name, split=split) |
|
|
|
|
|
# === 3. Format dataset === |
|
def format_example(example): |
|
"""Update this function to work with different datasets.""" |
|
return { |
|
"text": f"### Instruction:\n{example['instruction']}\n### Input:\n{example['input']}\n### Response:\n{example['output']}" |
|
} |
|
|
|
|
|
# === 4. Tokenize === |
|
def tokenize_example(example, tokenizer, max_length): |
|
tokens = tokenizer( |
|
example["text"], |
|
truncation=True, |
|
padding="max_length", |
|
max_length=max_length, |
|
) |
|
tokens["labels"] = tokens["input_ids"].copy() |
|
return tokens |
|
|
|
|
|
# === 5. Train === |
|
def train_model(model, tokenizer, tokenized_dataset, config): |
|
print("Preparing training arguments...") |
|
training_args = TrainingArguments( |
|
output_dir=config["output_dir"], |
|
per_device_train_batch_size=config["per_device_batch_size"], |
|
gradient_accumulation_steps=config["gradient_accumulation_steps"], |
|
max_steps=config["max_steps"], |
|
logging_dir="./logs", |
|
logging_steps=20, |
|
save_strategy="no", |
|
fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(), |
|
bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(), |
|
overwrite_output_dir=True, |
|
report_to=[], |
|
dataloader_num_workers=config["dataloader_workers"], |
|
        # Fused AdamW needs CUDA; the torch.compile check doubles as a PyTorch 2.x proxy.
        optim="adamw_torch_fused" if torch.cuda.is_available() and hasattr(torch, "compile") else "adamw_torch",
|
remove_unused_columns=False, |
|
) |
|
|
|
trainer = Trainer( |
|
model=model, |
|
args=training_args, |
|
train_dataset=tokenized_dataset, |
|
) |
|
|
|
print("Starting training...") |
|
trainer.train() |
|
print("Training completed.") |
|
|
|
|
|
# === 6. Save === |
|
def save_model(model, tokenizer, output_dir): |
|
print(f"Saving model to {output_dir}...") |
|
    # safe_serialization=False writes pytorch_model.bin; set True for safetensors.
    model.save_pretrained(output_dir, safe_serialization=False)
|
tokenizer.save_pretrained(output_dir) |
|
print("Model saved.") |
|
|
|
|
|
# === Main === |
|
def main(): |
|
parser = argparse.ArgumentParser() |
|
parser.add_argument("--dataset", type=str, default=config["dataset_name"]) |
|
parser.add_argument("--split", type=str, default=config["dataset_split"]) |
|
args = parser.parse_args() |
|
|
|
tokenizer, model = load_model_and_tokenizer(config) |
|
dataset = load_custom_dataset(args.dataset, args.split) |
|
|
|
print("Formatting dataset...") |
|
dataset = dataset.map(format_example, num_proc=config["dataloader_workers"], load_from_cache_file=False) |
|
|
|
print("Tokenizing dataset...") |
|
tokenized = dataset.map( |
|
lambda x: tokenize_example(x, tokenizer, config["context_window"]), |
|
batched=True, |
|
num_proc=config["dataloader_workers"], |
|
load_from_cache_file=False, |
|
) |
|
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"]) |
|
|
|
train_model(model, tokenizer, tokenized, config) |
|
save_model(model, tokenizer, config["output_dir"]) |
|
|
|
|
|
if __name__ == "__main__": |
|
main() |
|
``` |
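
Once training finishes, the checkpoint in `./snowflake-chatbot` can be reloaded like the base model. This sketch assumes `save_pretrained` copied the custom modeling code into the output directory alongside the weights; otherwise keep loading the code from the original repo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./snowflake-chatbot", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./snowflake-chatbot")
```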
|
|
|
## Training Details |
|
|
|
### Dataset |
|
- **Source**: [FlameF0X/Mixture-of-Thoughts-2048T](https://huggingface.co/datasets/FlameF0X/Mixture-of-Thoughts-2048T) |
|
- **Purpose**: Enhanced reasoning capabilities through mixture-of-thoughts training |
|
|
|
### Training Configuration |
|
- **Framework**: PyTorch with mixed precision (BF16/FP16) |
|
- **Optimizer**: AdamW (learning rate: 2e-4) |
|
- **Batch Size**: 1 per device with 32-step gradient accumulation (effective batch size 32)
|
- **Context Window**: 2048 tokens |
|
- **Validation Split**: 10% |
|
- **Early Stopping**: Implemented at epoch and step levels |
|
|
|
### Performance Monitoring |
|
- Training loss tracked per epoch, with perplexity derived from it (see the note below)
|
- Full validation after each epoch |
|
- Step-level monitoring every 500 steps |
|
- Comprehensive metrics saved in `training_metrics.json` |
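
Perplexity here is presumably the standard exponentiated mean cross-entropy; for reference:

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    # ppl = exp(loss) when the loss is the mean negative log-likelihood in nats
    return math.exp(mean_cross_entropy)

print(perplexity(3.0))  # ~20.1
```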
|
|
|
## Technical Implementation |
|
|
|
### Attention Mechanism |
|
- **Causal Masking**: Supports autoregressive generation
- **Key Padding Mask**: Enables batched inference on variable-length inputs
- **Scaled Dot-Product**: Attention logits scaled by 1/√head_dim (sketched below)
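
The bullets above can be made concrete with a single-head reference implementation. This is an illustrative sketch of causal scaled dot-product attention, not the model's actual fused code:

```python
import math
import torch

def causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim) tensors for one attention head."""
    # Scaled dot-product: logits divided by sqrt(head_dim) (64 in this model).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Causal mask: position i may only attend to positions <= i.
    seq_len = q.size(-2)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```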
|
|
|
### Memory Optimization |
|
- **Fused Operations**: Reduces memory fragmentation |
|
- **Mixed Precision**: 30-40% memory reduction |
|
- **Gradient Accumulation**: Simulates larger batch sizes (see the loop sketch below)
|
- **Optional Quantization**: Further model compression |
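
As a reference for the gradient accumulation pattern, a hand-written loop amounts to the sketch below; `loader` and `optimizer` are placeholders, and the forward pass is assumed to return a dict with a `"loss"` entry:

```python
def train_with_accumulation(model, loader, optimizer, accum_steps=32):
    # Accumulate gradients over `accum_steps` micro-batches before stepping,
    # which simulates a batch `accum_steps` times larger.
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(**batch)["loss"] / accum_steps  # scale so gradients average
        loss.backward()  # gradients accumulate in .grad across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```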
|
|
|
### Training Stability |
|
- **Gradient Clipping**: Prevents exploding gradients |
|
- **Automatic Loss Scaling**: Mixed precision stability |
|
- **Early Stopping**: Prevents overfitting with patience mechanisms |
|
|
|
## System Requirements |
|
|
|
### Memory Requirements |
|
- **Training**: 16-24GB VRAM (precision dependent) |
|
- **Inference**: 4-6GB VRAM for standard generation |
|
- **Context**: Maximum input length of 2048 tokens
|
|
|
### Generation Parameters |
|
|
|
Default configuration: |
|
```json |
|
{ |
|
"do_sample": true, |
|
"temperature": 1.0, |
|
"top_p": 0.9, |
|
"top_k": 50, |
|
"max_new_tokens": 50, |
|
"pad_token_id": 50256, |
|
"eos_token_id": 50256 |
|
} |
|
``` |
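
Since HuggingFace `.generate()` is not supported (see Limitations), these defaults have to be applied by hand. A minimal sampling loop honoring `temperature`, `top_k`, and `top_p` could look like the following sketch, reusing the `model` and `tokenizer` from the Quick Start; it is an illustration, not the model's bundled sampler:

```python
import torch

def sample_generate(prompt, max_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9):
    model.eval()
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids=generated)["logits"][:, -1, :] / temperature
            # Top-k: keep only the k highest-scoring tokens.
            topk_vals, topk_idx = torch.topk(logits, top_k)
            probs = torch.softmax(topk_vals, dim=-1)
            # Top-p: zero out the tail beyond cumulative probability p.
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            tail = torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p
            sorted_probs[tail] = 0.0
            sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
            # Sample and map back to the vocabulary id.
            choice = torch.multinomial(sorted_probs, num_samples=1)
            next_id = topk_idx.gather(-1, sorted_idx.gather(-1, choice))
            generated = torch.cat((generated, next_id), dim=1)
            if next_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```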
|
|
|
## Model Files |
|
|
|
The repository contains: |
|
- `pytorch_model.bin` - PyTorch model weights |
|
- `model.safetensors` - SafeTensors format weights |
|
- `config.json` - Model configuration |
|
- `generation_config.json` - Generation parameters |
|
- `training_metrics.json` - Training statistics |
|
- `tokenizer.json` - Tokenizer configuration |
|
- `vocab.json` & `merges.txt` - Vocabulary files |
|
|
|
## Limitations |
|
|
|
- **No HuggingFace `.generate()` support**: Use the custom generation function shown above
|
- **Output Quality**: May produce repetitive or nonsensical text for some prompts |
|
- **Hardware Requirements**: GPU recommended for practical inference |
|
- **Context Window**: Limited to 2048 tokens |
|
- **Dataset Dependency**: Performance tied to Mixture-of-Thoughts dataset quality |
|
|
|
## Example Output |
|
|
|
``` |
|
Input: Hello, I am Alex and |
|
|
|
Output: Hello, I am Alex andbourg Chip Chip Chip Chip Chip Chip Chip ChipCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCos |
|
``` |
|
|
|
*Note: The repetitive output shown is typical for small or early-stage models and can be improved with further training or fine-tuning.* |
|
|
|
## Support Me |
|
|
|
You can support me via [Ko-fi](https://ko-fi.com/flamef0x) or you can try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template! |
|
|
|
### Metadata
|
- Release date: June 29, 2025. |