|
---
license: apache-2.0
datasets:
- FlameF0X/Mixture-of-Thoughts-2048T
language:
- en
library_name: transformers
tags:
- pre-train
- custom_code
- SnowflakeCore
model-index:
- name: FlameF0X/SnowflakeCore-G1-Tiny
  results:
  - task:
      type: generation_speed
      name: Generation Speed
    metrics:
    - type: avg_tokens_per_second
      value: 57.257723907839626
  - task:
      type: model_size
      name: Model Size
    metrics:
    - type: model_size_mb
      value: 1357.54296875
  - task:
      type: gsm8k_accuracy
      name: GSM8K Accuracy
    metrics:
    - type: accuracy
      value: 0.2
  - task:
      type: mmlu_accuracy
      name: MMLU Accuracy
    metrics:
    - type: accuracy
      value: 0
  - task:
      type: humaneval_pass@1
      name: HumanEval Pass@1
    metrics:
    - type: pass@1
      value: 0
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_128)
    metrics:
    - type: seq_128
      value: 5.9882988929748535
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_512)
    metrics:
    - type: seq_512
      value: 6.0380940437316895
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_1024)
    metrics:
    - type: seq_1024
      value: 6.123685836791992
  - task:
      type: peak_memory_gb
      name: Peak Memory (seq_2048)
    metrics:
    - type: seq_2048
      value: 6.354169845581055
pipeline_tag: text-generation
new_version: FlameF0X/SnowflakeCore-G1-Tiny2
---
|
|
|
# SnowflakeCore-G1-Tiny |
|
|
|
A custom GPT-style transformer language model built from scratch using PyTorch, trained on the Mixture-of-Thoughts dataset for enhanced reasoning capabilities. |
|
|
|
## Model Overview |
|
|
|
SnowflakeCore-G1-Tiny is a GPT-style autoregressive transformer with **~400M parameters**, designed for text generation tasks.
|
|
|
### Key Features |
|
- **2048-token context window** for extended conversations
- **Mixed precision training** (BF16/FP16) for efficiency
- **Custom attention implementation** with fused operations
- **Early stopping** at both epoch and step level to curb overfitting
- **Gradient accumulation** for large effective batch sizes
|
|
|
### Architecture Specifications |
|
|
|
| Component | Value | |
|
|-----------|-------| |
|
| Model Type | Autoregressive Transformer | |
|
| Parameters | ~400M | |
|
| Layers | 24 | |
|
| Hidden Size | 1024 | |
|
| Attention Heads | 16 | |
|
| Head Dimension | 64 | |
|
| FFN Dimension | 4096 | |
|
| Context Length | 2048 tokens | |
|
| Vocabulary Size | 50,257 (GPT-2 tokenizer) | |
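
These values can be cross-checked against the published checkpoint by inspecting the configuration alone, with no weight download. The field names used below (`num_hidden_layers`, `hidden_size`, and so on) are assumptions about the custom config and may differ from the actual `config.json`:

```python
from transformers import AutoConfig

# Fetch only the configuration and print the fields that should match the table above.
config = AutoConfig.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny", trust_remote_code=True
)
for field in ("num_hidden_layers", "hidden_size", "num_attention_heads",
              "max_position_embeddings", "vocab_size"):
    print(field, getattr(config, field, "<not present>"))
```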
|
|
|
## Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install torch transformers # if not already installed |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
|
|
# Load model and tokenizer (force_download=True re-fetches the files on
# every run; drop it once a clean copy is cached)
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny",
    trust_remote_code=True,
    force_download=True,
)
|
|
|
def custom_greedy_generate(prompt, max_length=50): |
|
model.eval() |
|
input_ids = tokenizer(prompt, return_tensors="pt").input_ids |
|
generated = input_ids |
|
|
|
with torch.no_grad(): |
|
for _ in range(max_length): |
|
outputs = model(input_ids=generated) |
|
next_token_logits = outputs["logits"][:, -1, :] |
|
next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1) |
|
generated = torch.cat((generated, next_token_id), dim=1) |
|
|
|
if next_token_id.item() == tokenizer.eos_token_id: |
|
break |
|
|
|
return tokenizer.decode(generated[0], skip_special_tokens=True) |
|
|
|
# Generate text |
|
prompt = "Once upon a time" |
|
result = custom_greedy_generate(prompt) |
|
print(result) |
|
``` |
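
Note that `max_length` in the function above counts newly generated tokens, not the total sequence length. For practical speeds a GPU is recommended; a minimal sketch of the adjustment (the generation loop would then also need its `input_ids` moved to the same device):

```python
# Run inference on GPU when available; inputs must live on the same device.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```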
|
|
|
### Fine-Tuning |
|
|
|
```python |
|
import os |
|
import argparse |
|
from transformers import ( |
|
AutoTokenizer, |
|
AutoModelForCausalLM, |
|
Trainer, |
|
TrainingArguments, |
|
) |
|
from datasets import load_dataset |
|
import torch |
|
|
|
# === Disable W&B logging === |
|
os.environ["WANDB_DISABLED"] = "true" |
|
|
|
# === Config === |
|
config = { |
|
"model_name": "FlameF0X/SnowflakeCore-G1-Tiny", |
|
"output_dir": "./snowflake-chatbot", |
|
"context_window": 512, |
|
"per_device_batch_size": 1, |
|
"gradient_accumulation_steps": 16, |
|
"max_steps": 500, |
|
"dataloader_workers": 4, |
|
"dataset_name": "tatsu-lab/alpaca", |
|
"dataset_split": "train[:10000]", |
|
} |
|
|
|
# === Derived === |
|
config["effective_batch_size"] = ( |
|
config["per_device_batch_size"] * config["gradient_accumulation_steps"] |
|
) |
|
|
|
print(f"Effective batch size: {config['effective_batch_size']}") |
|
print(f"Context window: {config['context_window']}") |
|
|
|
|
|
# === 1. Load tokenizer and model === |
|
def load_model_and_tokenizer(config): |
|
print(f"Loading model and tokenizer from {config['model_name']}...") |
|
    tokenizer = AutoTokenizer.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        model_max_length=config["context_window"],
    )
    model = AutoModelForCausalLM.from_pretrained(
        config["model_name"],
        trust_remote_code=True,
        force_download=True,
        use_safetensors=True,
    )

    # GPT-2-style tokenizers ship without a pad token; reuse EOS so that
    # padding="max_length" during tokenization does not fail.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
|
|
|
if hasattr(torch, "compile"): |
|
try: |
|
print("Compiling model with torch.compile...") |
|
model = torch.compile(model) |
|
except Exception as e: |
|
print(f"Compilation failed: {e}") |
|
return tokenizer, model |
|
|
|
|
|
# === 2. Load dataset === |
|
def load_custom_dataset(name, split): |
|
print(f"Loading dataset: {name} ({split})...") |
|
return load_dataset(name, split=split) |
|
|
|
|
|
# === 3. Format dataset === |
|
def format_example(example): |
|
"""Update this function to work with different datasets.""" |
|
return { |
|
"text": f"### Instruction:\n{example['instruction']}\n### Input:\n{example['input']}\n### Response:\n{example['output']}" |
|
} |
|
|
|
|
|
# === 4. Tokenize === |
|
def tokenize_example(example, tokenizer, max_length): |
|
tokens = tokenizer( |
|
example["text"], |
|
truncation=True, |
|
padding="max_length", |
|
max_length=max_length, |
|
) |
|
tokens["labels"] = tokens["input_ids"].copy() |
|
return tokens |
|
|
|
|
|
# === 5. Train === |
|
def train_model(model, tokenizer, tokenized_dataset, config): |
|
print("Preparing training arguments...") |
|
training_args = TrainingArguments( |
|
output_dir=config["output_dir"], |
|
per_device_train_batch_size=config["per_device_batch_size"], |
|
gradient_accumulation_steps=config["gradient_accumulation_steps"], |
|
max_steps=config["max_steps"], |
|
logging_dir="./logs", |
|
logging_steps=20, |
|
save_strategy="no", |
|
fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(), |
|
bf16=torch.cuda.is_available() and torch.cuda.is_bf16_supported(), |
|
overwrite_output_dir=True, |
|
report_to=[], |
|
dataloader_num_workers=config["dataloader_workers"], |
|
        # Fused AdamW needs CUDA; the torch.compile check doubles as a PyTorch 2.x proxy.
        optim="adamw_torch_fused" if torch.cuda.is_available() and hasattr(torch, "compile") else "adamw_torch",
|
remove_unused_columns=False, |
|
) |
|
|
|
trainer = Trainer( |
|
model=model, |
|
args=training_args, |
|
train_dataset=tokenized_dataset, |
|
) |
|
|
|
print("Starting training...") |
|
trainer.train() |
|
print("Training completed.") |
|
|
|
|
|
# === 6. Save === |
|
def save_model(model, tokenizer, output_dir): |
|
print(f"Saving model to {output_dir}...") |
|
    # safe_serialization=False writes pytorch_model.bin; set True for safetensors.
    model.save_pretrained(output_dir, safe_serialization=False)
|
tokenizer.save_pretrained(output_dir) |
|
print("Model saved.") |
|
|
|
|
|
# === Main === |
|
def main(): |
|
parser = argparse.ArgumentParser() |
|
parser.add_argument("--dataset", type=str, default=config["dataset_name"]) |
|
parser.add_argument("--split", type=str, default=config["dataset_split"]) |
|
args = parser.parse_args() |
|
|
|
tokenizer, model = load_model_and_tokenizer(config) |
|
dataset = load_custom_dataset(args.dataset, args.split) |
|
|
|
print("Formatting dataset...") |
|
dataset = dataset.map(format_example, num_proc=config["dataloader_workers"], load_from_cache_file=False) |
|
|
|
print("Tokenizing dataset...") |
|
tokenized = dataset.map( |
|
lambda x: tokenize_example(x, tokenizer, config["context_window"]), |
|
batched=True, |
|
num_proc=config["dataloader_workers"], |
|
load_from_cache_file=False, |
|
) |
|
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"]) |
|
|
|
train_model(model, tokenizer, tokenized, config) |
|
save_model(model, tokenizer, config["output_dir"]) |
|
|
|
|
|
if __name__ == "__main__": |
|
main() |
|
``` |
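
Once training finishes, the checkpoint in `./snowflake-chatbot` can be reloaded like the base model. This sketch assumes `save_pretrained` copied the custom modeling code into the output directory alongside the weights; otherwise keep loading the code from the original repo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./snowflake-chatbot", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./snowflake-chatbot")
```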
|
|
|
## Training Details |
|
|
|
### Dataset |
|
- **Source**: [FlameF0X/Mixture-of-Thoughts-2048T](https://huggingface.co/datasets/FlameF0X/Mixture-of-Thoughts-2048T) |
|
- **Purpose**: Enhanced reasoning capabilities through mixture-of-thoughts training |
|
|
|
### Training Configuration |
|
- **Framework**: PyTorch with mixed precision (BF16/FP16) |
|
- **Optimizer**: AdamW (learning rate: 2e-4) |
|
- **Batch Size**: 1 per device with 32-step gradient accumulation (effective batch size 32)
|
- **Context Window**: 2048 tokens |
|
- **Validation Split**: 10% |
|
- **Early Stopping**: Implemented at epoch and step levels |
|
|
|
### Performance Monitoring |
|
- Training loss tracked per epoch, with perplexity derived from it (see the note below)
|
- Full validation after each epoch |
|
- Step-level monitoring every 500 steps |
|
- Comprehensive metrics saved in `training_metrics.json` |
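
Perplexity here is presumably the standard exponentiated mean cross-entropy; for reference:

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    # ppl = exp(loss) when the loss is the mean negative log-likelihood in nats
    return math.exp(mean_cross_entropy)

print(perplexity(3.0))  # ~20.1
```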
|
|
|
## Technical Implementation |
|
|
|
### Attention Mechanism |
|
- **Causal Masking**: Supports autoregressive generation
- **Key Padding Mask**: Enables batched inference on variable-length inputs
- **Scaled Dot-Product**: Attention logits scaled by 1/√head_dim (sketched below)
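
The bullets above can be made concrete with a single-head reference implementation. This is an illustrative sketch of causal scaled dot-product attention, not the model's actual fused code:

```python
import math
import torch

def causal_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim) tensors for one attention head."""
    # Scaled dot-product: logits divided by sqrt(head_dim) (64 in this model).
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # Causal mask: position i may only attend to positions <= i.
    seq_len = q.size(-2)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```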
|
|
|
### Memory Optimization |
|
- **Fused Operations**: Reduces memory fragmentation |
|
- **Mixed Precision**: 30-40% memory reduction |
|
- **Gradient Accumulation**: Simulates larger batch sizes (see the loop sketch below)
|
- **Optional Quantization**: Further model compression |
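
As a reference for the gradient accumulation pattern, a hand-written loop amounts to the sketch below; `loader` and `optimizer` are placeholders, and the forward pass is assumed to return a dict with a `"loss"` entry:

```python
def train_with_accumulation(model, loader, optimizer, accum_steps=32):
    # Accumulate gradients over `accum_steps` micro-batches before stepping,
    # which simulates a batch `accum_steps` times larger.
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(**batch)["loss"] / accum_steps  # scale so gradients average
        loss.backward()  # gradients accumulate in .grad across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```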
|
|
|
### Training Stability |
|
- **Gradient Clipping**: Prevents exploding gradients |
|
- **Automatic Loss Scaling**: Mixed precision stability |
|
- **Early Stopping**: Prevents overfitting with patience mechanisms |
|
|
|
## System Requirements |
|
|
|
### Memory Requirements |
|
- **Training**: 16-24GB VRAM (precision dependent) |
|
- **Inference**: 4-6GB VRAM for standard generation |
|
- **Context**: Maximum input length of 2048 tokens
|
|
|
### Generation Parameters |
|
|
|
Default configuration: |
|
```json |
|
{ |
|
"do_sample": true, |
|
"temperature": 1.0, |
|
"top_p": 0.9, |
|
"top_k": 50, |
|
"max_new_tokens": 50, |
|
"pad_token_id": 50256, |
|
"eos_token_id": 50256 |
|
} |
|
``` |
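
Since HuggingFace `.generate()` is not supported (see Limitations), these defaults have to be applied by hand. A minimal sampling loop honoring `temperature`, `top_k`, and `top_p` could look like the following sketch, reusing the `model` and `tokenizer` from the Quick Start; it is an illustration, not the model's bundled sampler:

```python
import torch

def sample_generate(prompt, max_new_tokens=50, temperature=1.0, top_k=50, top_p=0.9):
    model.eval()
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids=generated)["logits"][:, -1, :] / temperature
            # Top-k: keep only the k highest-scoring tokens.
            topk_vals, topk_idx = torch.topk(logits, top_k)
            probs = torch.softmax(topk_vals, dim=-1)
            # Top-p: zero out the tail beyond cumulative probability p.
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            tail = torch.cumsum(sorted_probs, dim=-1) - sorted_probs > top_p
            sorted_probs[tail] = 0.0
            sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
            # Sample and map back to the vocabulary id.
            choice = torch.multinomial(sorted_probs, num_samples=1)
            next_id = topk_idx.gather(-1, sorted_idx.gather(-1, choice))
            generated = torch.cat((generated, next_id), dim=1)
            if next_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```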
|
|
|
## Model Files |
|
|
|
The repository contains: |
|
- `pytorch_model.bin` - PyTorch model weights |
|
- `model.safetensors` - SafeTensors format weights |
|
- `config.json` - Model configuration |
|
- `generation_config.json` - Generation parameters |
|
- `training_metrics.json` - Training statistics |
|
- `tokenizer.json` - Tokenizer configuration |
|
- `vocab.json` & `merges.txt` - Vocabulary files |
|
|
|
## Limitations |
|
|
|
- **No HuggingFace `.generate()` support**: Use the custom generation function shown above
|
- **Output Quality**: May produce repetitive or nonsensical text for some prompts |
|
- **Hardware Requirements**: GPU recommended for practical inference |
|
- **Context Window**: Limited to 2048 tokens |
|
- **Dataset Dependency**: Performance tied to Mixture-of-Thoughts dataset quality |
|
|
|
## Example Output |
|
|
|
``` |
|
Input: Hello, I am Alex and |
|
|
|
Output: Hello, I am Alex andbourg Chip Chip Chip Chip Chip Chip Chip ChipCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCosCos |
|
``` |
|
|
|
*Note: The repetitive output shown is typical for small or early-stage models and can be improved with further training or fine-tuning.* |
|
|
|
## Support Me |
|
|
|
You can support me via [Ko-fi](https://ko-fi.com/flamef0x) or you can try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template! |
|
|
|
### Metadata
|
- Release date: June 29, 2025. |