|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- common-pile/wikimedia_filtered |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- pre-train |
|
- custom_code |
|
- SnowflakeCore |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# SnowflakeCore-G1-Tiny2 |
|
|
|
An improved version of SnowflakeCore-G1-Tiny: a custom GPT-style transformer language model built from scratch in PyTorch and trained on the common-pile/wikimedia\_filtered dataset.
|
|
|
## Model Overview |
|
|
|
SnowflakeCore-G1-Tiny2 is a GPT-style autoregressive transformer model with **\~400M parameters** designed for text generation tasks. |
|
|
|
### Key Features |
|
|
|
* **2048 token context window** for extended conversations |
|
* **Mixed precision training** (BF16/FP16) for efficiency |
|
* **Custom attention implementation** with fused operations |
|
* **Early stopping mechanisms** for optimal training |
|
* **Gradient accumulation** for effective large batch training |
|
|
|
### Architecture Specifications |
|
|
|
| Component | Value | |
|
| --------------- | -------------------------- | |
|
| Model Type | Autoregressive Transformer | |
|
| Parameters | \~400M | |
|
| Layers | 24 | |
|
| Hidden Size | 1024 | |
|
| Attention Heads | 16 | |
|
| Head Dimension | 64 | |
|
| FFN Dimension | 4096 | |
|
| Context Length | 2048 tokens | |
|
| Vocabulary Size | 50,257 (GPT-2 tokenizer) | |
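For orientation, the table above corresponds roughly to the hyperparameter set below. The key names here are illustrative only; the authoritative values and field names are defined by the custom code in the repository's `config.json`.

```python
# Illustrative hyperparameters mirroring the architecture table above.
# Key names are placeholders; see config.json for the real field names.
snowflake_g1_tiny2 = {
    "num_layers": 24,                 # transformer blocks
    "hidden_size": 1024,              # model / embedding dimension
    "num_attention_heads": 16,        # attention heads
    "head_dim": 64,                   # 1024 / 16
    "ffn_dim": 4096,                  # feed-forward inner dimension (4x hidden)
    "max_position_embeddings": 2048,  # context window
    "vocab_size": 50257,              # GPT-2 BPE tokenizer
}
```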
|
|
|
## Model Benchmarks |
|
|
|
The following benchmarks compare `SnowflakeCore-G1-Tiny2`, its predecessor, and GPT-2 on key performance and text quality metrics. |
|
|
|
### Performance & Quality Metrics |
|
|
|
| Model | Params | Size (MB) | Speed (tok/s) | Vocab Div. | Dist. Bigrams | Dist. Trigrams | Bigram Repet. | Trigram Repet. | |
|
| -------------------------- | ------ | --------- | ------------- | ---------- | ------------- | -------------- | ------------- | -------------- | |
|
| **SnowflakeCore-G1-Tiny2** | 355.9M | 1357.54 | 22.13 | **0.3440** | **0.7408** | **0.8834** | **0.2592** | **0.1166** | |
|
| SnowflakeCore-G1-Tiny | 355.9M | 1357.54 | 22.12 | 0.2780 | 0.6111 | 0.7421 | 0.3889 | 0.2579 | |
|
| GPT-2 (small) | 124.4M | 474.70 | **47.73** | 0.2590 | 0.6408 | 0.7946 | 0.3592 | 0.2054 | |
|
|
|
> **Notes:** |
|
> |
|
> * Vocabulary Diversity = unique tokens / total tokens |
|
> * Distinct N-grams = unique n-grams / total n-grams |
|
> * Lower repetition rates indicate better text novelty |
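For reference, these metrics can be computed from a generated token sequence as in the sketch below (the tokenizer and evaluation prompts behind the table above are not reproduced here):

```python
def ngrams(tokens, n):
    """All length-n windows over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocab_diversity(tokens):
    """Vocabulary diversity = unique tokens / total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def distinct_n(tokens, n):
    """Distinct n-grams = unique n-grams / total n-grams."""
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams) if grams else 0.0

tokens = "the cat sat on the mat and the cat slept".split()
print(f"vocabulary diversity: {vocab_diversity(tokens):.4f}")
print(f"distinct bigrams:     {distinct_n(tokens, 2):.4f}")
print(f"bigram repetition:    {1 - distinct_n(tokens, 2):.4f}")  # lower is better
```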
|
|
|
### Memory Usage (CPU) |
|
|
|
All models report `N/A` for CPU memory usage across all sequence lengths. |
|
|
|
| Sequence Length | SnowflakeCore-G1-Tiny | SnowflakeCore-G1-Tiny2 | GPT-2 | |
|
| --------------- | --------------------- | ---------------------- | ----- | |
|
| 128 | N/A (CPU) | N/A (CPU) | N/A | |
|
| 512 | N/A (CPU) | N/A (CPU) | N/A | |
|
| 1024 | N/A (CPU) | N/A (CPU) | N/A | |
|
| 2048 | N/A (CPU) | N/A (CPU) | N/A | |
|
|
|
## Quick Start |
|
|
|
### Installation |
|
|
|
```bash |
|
pip install torch transformers # if not already installed |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
|
|
# Load model and tokenizer |
|
model = AutoModelForCausalLM.from_pretrained( |
|
"FlameF0X/SnowflakeCore-G1-Tiny2", |
|
trust_remote_code=True, |
|
force_download=True, |
|
use_safetensors=True, |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained( |
|
"FlameF0X/SnowflakeCore-G1-Tiny2", |
|
trust_remote_code=True, |
|
force_download=True, |
|
use_safetensors=True, |
|
) |
|
|
|
def custom_greedy_generate(prompt, max_length=50): |
|
model.eval() |
|
input_ids = tokenizer(prompt, return_tensors="pt").input_ids |
|
generated = input_ids |
|
|
|
with torch.no_grad(): |
|
for _ in range(max_length): |
|
outputs = model(input_ids=generated) |
|
next_token_logits = outputs["logits"][:, -1, :] |
|
next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1) |
|
generated = torch.cat((generated, next_token_id), dim=1) |
|
|
|
if next_token_id.item() == tokenizer.eos_token_id: |
|
break |
|
|
|
return tokenizer.decode(generated[0], skip_special_tokens=True) |
|
|
|
# Generate text |
|
prompt = "Once upon a time" |
|
result = custom_greedy_generate(prompt) |
|
print(result) |
|
``` |
|
|
|
### Fine-Tuning |
|
|
|
Fine-tuning follows a standard causal language modeling loop over your own text data.
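The sketch below is a minimal full fine-tuning loop, not the script used to train this model. It only assumes what the Basic Usage example already shows (the model returns a dict with `"logits"`); the corpus, epoch count, and hyperparameters are placeholders to adapt.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2", trust_remote_code=True, use_safetensors=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2", trust_remote_code=True
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

# Placeholder corpus -- replace with your own texts (each at most 2048 tokens).
texts = [
    "Once upon a time there was a tiny language model.",
    "It was fine-tuned on a handful of example sentences.",
]

for epoch in range(3):
    for text in texts:
        input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        logits = model(input_ids=input_ids)["logits"]
        # Shift so that position t predicts token t + 1.
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: last loss {loss.item():.4f}")
```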
|
|
|
## Training Details |
|
|
|
### Dataset |
|
|
|
* **Source**: [common-pile/wikimedia\_filtered](https://huggingface.co/datasets/common-pile/wikimedia_filtered) |
|
|
|
### Training Configuration |
|
|
|
* **Framework**: PyTorch with mixed precision (BF16/FP16) |
|
* **Optimizer**: AdamW (learning rate: 2e-4) |
|
* **Batch Size**: 1 with gradient accumulation (32 steps); see the sketch below this list
|
* **Context Window**: 2048 tokens |
|
* **Validation Split**: 10% |
|
* **Early Stopping**: Implemented at epoch and step levels |
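The mixed-precision plus gradient-accumulation pattern above looks roughly like the following generic PyTorch sketch (a stand-in model and random data are used so the snippet runs on its own; this is not the project's actual training script):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Linear(16, 16).to(device)                        # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # automatic loss scaling (needed for FP16)

accumulation_steps = 32
batches = [torch.randn(1, 16) for _ in range(64)]           # micro-batch size 1, dummy data

optimizer.zero_grad()
for step, batch in enumerate(batches):
    batch = batch.to(device)
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=(device == "cuda")):
        loss = model(batch).pow(2).mean()                   # stand-in loss
        loss = loss / accumulation_steps                    # average gradients over the accumulation window
    scaler.scale(loss).backward()
    if (step + 1) % accumulation_steps == 0:                # effective batch size = 1 x 32
        scaler.unscale_(optimizer)                          # so clipping sees true gradient norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```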
|
|
|
### Performance Monitoring |
|
|
|
* Training loss tracked per epoch with perplexity calculation |
|
* Full validation after each epoch |
|
* Step-level monitoring every 500 steps |
|
* Comprehensive metrics saved in `training_metrics.json` |
|
|
|
## Technical Implementation |
|
|
|
### Attention Mechanism |
|
|
|
* **Causal Masking**: Supports autoregressive generation |
|
* **Key Padding Mask**: Enables batched inference |
|
* **Scaled Dot-Product**: Head dimension normalization included (all three points are illustrated in the sketch below)
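A conceptual sketch of these three points using generic scaled dot-product attention (this illustrates the ideas only, not the model's fused implementation):

```python
import math
import torch

def causal_attention(q, k, v, key_padding_mask=None):
    """q, k, v: (batch, heads, seq, head_dim); key_padding_mask: (batch, seq), True marks padding."""
    seq, head_dim = q.size(-2), q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)   # scaled dot-product (head-dim normalization)
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))        # causal mask: no attending to future positions
    if key_padding_mask is not None:                          # key padding mask for batched inference
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Shapes matching the architecture table: 16 heads, head dimension 64.
q = k = v = torch.randn(2, 16, 8, 64)
print(causal_attention(q, k, v).shape)   # torch.Size([2, 16, 8, 64])
```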
|
|
|
### Memory Optimization |
|
|
|
* **Fused Operations**: Reduces memory fragmentation |
|
* **Mixed Precision**: 30-40% memory reduction |
|
* **Gradient Accumulation**: Simulates larger batch sizes |
|
* **Optional Quantization**: Further model compression |
|
|
|
### Training Stability |
|
|
|
* **Gradient Clipping**: Prevents exploding gradients |
|
* **Automatic Loss Scaling**: Mixed precision stability |
|
* **Early Stopping**: Prevents overfitting with patience mechanisms (see the sketch below)
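A minimal sketch of patience-based early stopping; the validation losses are simulated, and the actual patience values used during training are not documented here.

```python
# Simulated per-epoch validation losses; in practice these come from the validation split.
val_losses = [2.90, 2.50, 2.30, 2.31, 2.32, 2.33, 2.34]

best_val_loss = float("inf")
patience, patience_counter = 3, 0        # stop after 3 epochs without improvement

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0             # improvement: reset patience (and checkpoint the model)
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch}; best validation loss {best_val_loss}")
            break
```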
|
|
|
## System Requirements |
|
|
|
### Memory Requirements |
|
|
|
* **Training**: 16-24GB VRAM (precision dependent) |
|
* **Inference**: 1-6GB VRAM for standard generation |
|
* **Context**: Maximum 2048 tokens input length |
|
|
|
### Generation Parameters |
|
|
|
Default configuration: |
|
|
|
```json |
|
{ |
|
"do_sample": true, |
|
"temperature": 1.0, |
|
"top_p": 0.9, |
|
"top_k": 50, |
|
"max_new_tokens": 50, |
|
"pad_token_id": 50256, |
|
"eos_token_id": 50256 |
|
} |
|
``` |
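Because `.generate()` is not supported (see Limitations), these defaults have to be applied in a custom loop. The sketch below reuses the `model` and `tokenizer` loaded in Basic Usage and applies temperature, top-k, and top-p sampling; it is a simplified approximation of the usual sampling pipeline, not the model's own generation code.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Sample one token id from last-step logits of shape (1, vocab_size)."""
    logits = logits / temperature
    top_values, top_indices = torch.topk(logits, top_k)      # keep the k most likely tokens (sorted descending)
    probs = torch.softmax(top_values, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    keep = cumulative <= top_p                                # nucleus: smallest prefix covering top_p mass
    keep[..., 0] = True                                       # always keep the most likely token
    probs = probs * keep
    probs = probs / probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(probs, num_samples=1)          # index within the top-k set
    return top_indices.gather(-1, choice)                     # map back to a vocabulary id

def sample_generate(prompt, max_new_tokens=50):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids=input_ids)["logits"][:, -1, :]
        next_id = sample_next_token(logits)
        input_ids = torch.cat((input_ids, next_id), dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(sample_generate("Once upon a time"))
```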
|
|
|
## Model Files |
|
|
|
The repository contains: |
|
|
|
* `pytorch_model.bin` - PyTorch model weights |
|
* `model.safetensors` - SafeTensors format weights |
|
* `config.json` - Model configuration |
|
* `generation_config.json` - Generation parameters |
|
* `training_metrics.json` - Training statistics |
|
* `tokenizer.json` - Tokenizer configuration |
|
* `vocab.json` & `merges.txt` - Vocabulary files |
|
|
|
## Limitations |
|
|
|
* **No HuggingFace `.generate()` support**: Use custom generation function |
|
* **Output Quality**: May produce repetitive or nonsensical text for some prompts |
|
* **Hardware Requirements**: GPU recommended for practical inference |
|
* **Context Window**: Limited to 2048 tokens |
|
* **Dataset Dependency**: Performance is tied to the quality of the common-pile/wikimedia\_filtered training data
|
|
|
## Example Output |
|
|
|
``` |
|
N/A |
|
``` |
|
|
|
## Support Me |
|
|
|
You can support me via [Ko-fi](https://ko-fi.com/flamef0x) or try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template!
|
|
|
### Metadata
|
|
|
* Release date: July 21, 2025. |