---
license: apache-2.0
datasets:
- FlameF0X/Mixture-of-Thoughts-2048T
language:
- en
library_name: transformers
tags:
- pre-train
- custom_code
- SnowflakeCore
model-index:
- name: FlameF0X/SnowflakeCore-G1-Tiny
  results:
  - task:
      type: generation_speed
      name: Generation Speed
    metrics:
    - type: avg_tokens_per_second
      value: 57.257723907839626
  - task:
      type: model_size
      name: Model Size
    metrics:
    - type: model_size_mb
      value: 1357.54296875
  - task:
      type: gsm8k_accuracy
      name: GSM8K Accuracy
    metrics:
    - type: accuracy
      value: 0.2
  - task:
      type: mmlu_accuracy
      name: MMLU Accuracy
    metrics:
    - type: accuracy
      value: 0
  - task:
      type: humaneval_pass@1
      name: HumanEval Pass@1
    metrics:
    - type: pass@1
      value: 0
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_128)
    metrics:
    - type: seq_128
      value: 5.9882988929748535
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_512)
    metrics:
    - type: seq_512
      value: 6.0380940437316895
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_1024)
    metrics:
    - type: seq_1024
      value: 6.123685836791992
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_2048)
    metrics:
    - type: seq_2048
      value: 6.354169845581055
pipeline_tag: text-generation
new_version: FlameF0X/SnowflakeCore-G1-Tiny2
---

# SnowflakeCore-G1-Tiny

A custom GPT-style transformer language model built from scratch in PyTorch and trained on the Mixture-of-Thoughts dataset for enhanced reasoning capabilities.

## Model Overview

SnowflakeCore-G1-Tiny is a GPT-style autoregressive transformer with **~400M parameters**, designed for text generation tasks.

### Key Features

- **2048-token context window** for extended conversations
- **Mixed-precision training** (BF16/FP16) for efficiency
- **Custom attention implementation** with fused operations
- **Early stopping mechanisms** for optimal training
- **Gradient accumulation** for effective large-batch training

### Architecture Specifications

| Component | Value |
|-----------|-------|
| Model Type | Autoregressive Transformer |
| Parameters | ~400M |
| Layers | 24 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Head Dimension | 64 |
| FFN Dimension | 4096 |
| Context Length | 2048 tokens |
| Vocabulary Size | 50,257 (GPT-2 tokenizer) |

## Quick Start

### Installation

```bash
pip install torch transformers  # if not already installed
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
)

def custom_greedy_generate(prompt, max_length=50):
    model.eval()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids=generated)
            next_token_logits = outputs["logits"][:, -1, :]
            # Greedy decoding: always pick the highest-probability next token
            next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
            generated = torch.cat((generated, next_token_id), dim=1)
            if next_token_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Generate text
prompt = "Once upon a time"
result = custom_greedy_generate(prompt)
print(result)
```
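Because HuggingFace's `.generate()` is not supported (see Limitations below), sampling also needs a custom loop. The sketch below is not part of the repository; it is a minimal top-k / top-p sampling variant of the greedy loop above that reuses the `model` and `tokenizer` objects already loaded and defaults to the generation parameters listed later in this card (temperature 1.0, top-p 0.9, top-k 50).

```python
import torch
import torch.nn.functional as F

def custom_sample_generate(prompt, max_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9):
    """Minimal top-k / nucleus sampling loop (assumes `model` and `tokenizer` are loaded above)."""
    model.eval()
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids=generated)["logits"][:, -1, :] / temperature

            # Top-k: discard everything outside the k highest-scoring tokens
            kth_value = torch.topk(logits, top_k).values[:, [-1]]
            logits = logits.masked_fill(logits < kth_value, float("-inf"))

            # Top-p (nucleus): keep the smallest set of tokens whose cumulative probability exceeds top_p
            sorted_logits, sorted_indices = torch.sort(logits, descending=True)
            cumulative = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
            remove = cumulative > top_p
            remove[:, 1:] = remove[:, :-1].clone()  # shift right so the first token over the threshold survives
            remove[:, 0] = False
            sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
            logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_indices, sorted_logits)

            # Sample from the filtered distribution
            next_token_id = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
            generated = torch.cat((generated, next_token_id), dim=1)
            if next_token_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(custom_sample_generate("Once upon a time"))
```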
### Fine-Tuning

```python
import os
import argparse
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import torch

# === Disable W&B logging ===
os.environ["WANDB_DISABLED"] = "true"

# === Config ===
config = {
    "model_name": "FlameF0X/SnowflakeCore-G1-Tiny",
    "output_dir": "./snowflake-chatbot",
    "context_window": 512,
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "max_steps": 500,
    "dataloader_workers": 4,
    "dataset_name": "tatsu-lab/alpaca",
    "dataset_split": "train[:10000]",
}

# === Derived ===
config["effective_batch_size"] = (
    config["per_device_batch_size"] * config["gradient_accumulation_steps"]
)
print(f"Effective batch size: {config['effective_batch_size']}")
print(f"Context window: {config['context_window']}")


# === 1. Load tokenizer and model ===
def load_model_and_tokenizer(config):
    print(f"Loading model and tokenizer from {config['model_name']}...")
    tokenizer = AutoTokenizer.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        model_max_length=config["context_window"],
    )
    model = AutoModelForCausalLM.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        use_safetensors=True,
    )
    if hasattr(torch, "compile"):
        try:
            print("Compiling model with torch.compile...")
            model = torch.compile(model)
        except Exception as e:
            print(f"Compilation failed: {e}")
    return tokenizer, model


# === 2. Load dataset ===
def load_custom_dataset(name, split):
    print(f"Loading dataset: {name} ({split})...")
    return load_dataset(name, split=split)


# === 3. Format dataset ===
def format_example(example):
    """Update this function to work with different datasets."""
    return {
        "text": (
            f"### Instruction:\n{example['instruction']}\n"
            f"### Input:\n{example['input']}\n"
            f"### Response:\n{example['output']}"
        )
    }


# === 4. Tokenize ===
def tokenize_example(example, tokenizer, max_length):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens


# === 5. Train ===
def train_model(model, tokenizer, tokenized_dataset, config):
    print("Preparing training arguments...")
    training_args = TrainingArguments(
        output_dir=config["output_dir"],
        per_device_train_batch_size=config["per_device_batch_size"],
        gradient_accumulation_steps=config["gradient_accumulation_steps"],
        max_steps=config["max_steps"],
        logging_dir="./logs",
        logging_steps=20,
        save_strategy="no",
        fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(),
        overwrite_output_dir=True,
        report_to=[],
        dataloader_num_workers=config["dataloader_workers"],
        optim="adamw_torch_fused"
        if torch.cuda.is_available() and hasattr(torch, "compile")
        else "adamw_torch",
        remove_unused_columns=False,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
    )
    print("Starting training...")
    trainer.train()
    print("Training completed.")


# === 6. Save ===
def save_model(model, tokenizer, output_dir):
    print(f"Saving model to {output_dir}...")
    model.save_pretrained(output_dir, safe_serialization=False)
    tokenizer.save_pretrained(output_dir)
    print("Model saved.")


# === Main ===
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", type=str, default=config["dataset_name"])
    parser.add_argument("--split", type=str, default=config["dataset_split"])
    args = parser.parse_args()

    tokenizer, model = load_model_and_tokenizer(config)
    dataset = load_custom_dataset(args.dataset, args.split)

    print("Formatting dataset...")
    dataset = dataset.map(
        format_example,
        num_proc=config["dataloader_workers"],
        load_from_cache_file=False,
    )

    print("Tokenizing dataset...")
    tokenized = dataset.map(
        lambda x: tokenize_example(x, tokenizer, config["context_window"]),
        batched=True,
        num_proc=config["dataloader_workers"],
        load_from_cache_file=False,
    )
    tokenized.set_format(
        type="torch", columns=["input_ids", "attention_mask", "labels"]
    )

    train_model(model, tokenizer, tokenized, config)
    save_model(model, tokenizer, config["output_dir"])


if __name__ == "__main__":
    main()
```
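If the script above is saved to a file and run (the filename below is illustrative, not part of the repository), the fine-tuned checkpoint written to `./snowflake-chatbot` loads the same way as the base model. A minimal sketch, assuming it runs in the same session as the Basic Usage snippet so that `custom_greedy_generate` is still defined:

```python
# Illustrative invocation of the fine-tuning script above (any filename works):
#   python finetune_snowflake.py --dataset tatsu-lab/alpaca --split "train[:10000]"

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the checkpoint that save_model() wrote to config["output_dir"]
finetuned_dir = "./snowflake-chatbot"
tokenizer = AutoTokenizer.from_pretrained(finetuned_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(finetuned_dir, trust_remote_code=True)

# Smoke test with the greedy helper from "Basic Usage"; the prompt follows the
# Alpaca-style template used in format_example() above.
prompt = "### Instruction:\nSay hello.\n### Input:\n\n### Response:\n"
print(custom_greedy_generate(prompt))
```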
## Training Details

### Dataset

- **Source**: [FlameF0X/Mixture-of-Thoughts-2048T](https://huggingface.co/datasets/FlameF0X/Mixture-of-Thoughts-2048T)
- **Purpose**: Enhanced reasoning capabilities through mixture-of-thoughts training

### Training Configuration

- **Framework**: PyTorch with mixed precision (BF16/FP16)
- **Optimizer**: AdamW (learning rate: 2e-4)
- **Batch Size**: 1 with gradient accumulation (32 steps)
- **Context Window**: 2048 tokens
- **Validation Split**: 10%
- **Early Stopping**: Implemented at epoch and step levels

### Performance Monitoring

- Training loss tracked per epoch with perplexity calculation
- Full validation after each epoch
- Step-level monitoring every 500 steps
- Comprehensive metrics saved in `training_metrics.json`

## Technical Implementation

### Attention Mechanism

- **Causal Masking**: Supports autoregressive generation
- **Key Padding Mask**: Enables batched inference
- **Scaled Dot-Product**: Head-dimension normalization included

### Memory Optimization

- **Fused Operations**: Reduce memory fragmentation
- **Mixed Precision**: 30-40% memory reduction
- **Gradient Accumulation**: Simulates larger batch sizes
- **Optional Quantization**: Further model compression

### Training Stability

- **Gradient Clipping**: Prevents exploding gradients
- **Automatic Loss Scaling**: Mixed-precision stability
- **Early Stopping**: Prevents overfitting with patience mechanisms

## System Requirements

### Memory Requirements

- **Training**: 16-24 GB VRAM (precision dependent)
- **Inference**: 4-6 GB VRAM for standard generation
- **Context**: Maximum 2048-token input length
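The inference figures above, and the peak-memory metrics in this card's metadata, can be approximated with PyTorch's CUDA memory statistics. A minimal sketch, assuming a CUDA device and the `model`/`tokenizer` from the Quick Start; absolute numbers will differ with hardware, precision, and whether a single forward pass or full generation is measured:

```python
import torch

assert torch.cuda.is_available(), "peak-memory measurement requires a CUDA device"
device = "cuda"
model.to(device).eval()

for seq_len in (128, 512, 1024, 2048):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)

    # Dummy batch of the target length drawn from the tokenizer's vocabulary
    input_ids = torch.randint(0, tokenizer.vocab_size, (1, seq_len), device=device)
    with torch.no_grad():
        model(input_ids=input_ids)

    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"seq_{seq_len}: peak memory ~{peak_gb:.2f} GB")
```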
### Generation Parameters

Default configuration:

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.9,
  "top_k": 50,
  "max_new_tokens": 50,
  "pad_token_id": 50256,
  "eos_token_id": 50256
}
```

## Model Files

The repository contains:

- `pytorch_model.bin` - PyTorch model weights
- `model.safetensors` - SafeTensors format weights
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `training_metrics.json` - Training statistics
- `tokenizer.json` - Tokenizer configuration
- `vocab.json` & `merges.txt` - Vocabulary files

## Limitations

- **No HuggingFace `.generate()` support**: Use the custom generation function shown above
- **Output Quality**: May produce repetitive or nonsensical text for some prompts
- **Hardware Requirements**: GPU recommended for practical inference
- **Context Window**: Limited to 2048 tokens
- **Dataset Dependency**: Performance is tied to the quality of the Mixture-of-Thoughts dataset

## Example Output

```
Input: Hello, I am Alex and
Output: Hello, I am Alex andbourg Chip Chip Chip Chip Chip Chip Chip ChipCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCos
```

*Note: The repetitive output shown here is typical of small or early-stage models and can be improved with further training or fine-tuning.*

## Support Me

You can support me via [Ko-fi](https://ko-fi.com/flamef0x), or try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template!

### Small meta-data

- Release date: June 29, 2025.