---
datasets:
- tatsu-lab/alpaca
license: apache-2.0
language:
- en
base_model:
- FlameF0X/SnowflakeCore-G1-Tiny
pipeline_tag: text-generation
library_name: transformers
tags:
- pre-train
- custom_code
---

# SnowflakeCore-G1-Tiny-Instruct

A custom GPT-style transformer language model built from scratch using PyTorch.

## Model Overview

SnowflakeCore-G1-Tiny and SnowflakeCore-G1-Tiny-Instruct are GPT-style autoregressive transformer models with **~400M parameters** designed for text generation tasks.

### Key Features

- **2048 token context window** for extended conversations
- **Mixed precision training** (BF16/FP16) for efficiency
- **Custom attention implementation** with fused operations
- **Early stopping mechanisms**: N/A
- **Gradient accumulation** for effective large batch training

### Architecture Specifications

| Component | Value |
|-----------|-------|
| Model Type | Autoregressive Transformer |
| Parameters | ~400M |
| Layers | 24 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Head Dimension | 64 |
| FFN Dimension | 4096 |
| Context Length | 2048 tokens |
| Vocabulary Size | 50,257 (GPT-2 tokenizer) |
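
For orientation, the table above can also be read as a plain configuration object. The field names below are illustrative assumptions rather than the repository's actual config keys; they make the derived relations explicit (head dimension = hidden size / attention heads, FFN dimension = 4 × hidden size).

```python
from dataclasses import dataclass

# Illustrative summary of the architecture table; field names are assumptions,
# not the repository's actual configuration keys.
@dataclass
class SnowflakeCoreG1TinyConfig:
    num_layers: int = 24
    hidden_size: int = 1024
    num_attention_heads: int = 16
    head_dim: int = 64        # hidden_size // num_attention_heads
    ffn_dim: int = 4096       # 4 * hidden_size
    context_length: int = 2048
    vocab_size: int = 50257   # GPT-2 tokenizer
```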

## Quick Start

### Installation

```bash
pip install torch transformers  # if not already installed
```

### Basic Usage

```python
# N/A (an unofficial loading and generation sketch follows below)
```
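
An official usage snippet has not been published yet. The following is a minimal, unverified sketch: it assumes the checkpoint loads through the standard `transformers` Auto classes with `trust_remote_code=True` (the card is tagged with custom code) and that the Instruct variant lives at `FlameF0X/SnowflakeCore-G1-Tiny-Instruct`. Because HuggingFace `.generate()` is not supported (see Limitations), it decodes greedily with the raw forward pass.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for the Instruct variant (the base model is
# FlameF0X/SnowflakeCore-G1-Tiny).
model_id = "FlameF0X/SnowflakeCore-G1-Tiny-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

prompt = "Explain what a transformer language model is."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# .generate() is not supported, so decode greedily with the raw forward pass.
with torch.no_grad():
    for _ in range(50):  # max_new_tokens
        logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```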

## Training Details

### Dataset

- **Source**: [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca)

### Training Configuration

- **Framework**: PyTorch with mixed precision (BF16/FP16); a minimal training-step sketch follows this list
- **Optimizer**: AdamW (learning rate: 2e-4)
- **Batch Size**: N/A
- **Context Window**: 2048 tokens or 512 tokens
- **Validation Split**: N/A
- **Early Stopping**: N/A
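
The sketch below shows how the settings above fit together in a single PyTorch training step. It is illustrative only: `model` and `train_loader` are assumed to exist, and the accumulation steps and clip norm are placeholders, since the card lists the batch size as N/A.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
accum_steps = 8   # assumed; the effective batch size is not documented
clip_norm = 1.0   # assumed gradient-clipping threshold

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    # BF16 autocast; FP16 would additionally use torch.cuda.amp.GradScaler.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss / accum_steps  # assumes labels are in the batch
    loss.backward()
    if (step + 1) % accum_steps == 0:
        # Gradient clipping guards against exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
        optimizer.zero_grad()
```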

### Performance Monitoring

- Training loss tracked per epoch with perplexity calculation
- Full validation after each epoch
- Step-level monitoring every 500 steps
- Comprehensive metrics saved in `training_metrics.json`

## Technical Implementation

### Attention Mechanism

- **Causal Masking**: Supports autoregressive generation
- **Key Padding Mask**: Enables batched inference
- **Scaled Dot-Product**: Head dimension normalization included (see the sketch after this list)
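
As an illustration only (not the model's actual attention code), the three properties above can be expressed with PyTorch's fused `scaled_dot_product_attention`, which applies the 1/sqrt(head_dim) scaling internally:

```python
import torch
import torch.nn.functional as F

def attend(q, k, v, key_padding_mask=None):
    """q, k, v: (batch, heads, seq, head_dim); key_padding_mask: (batch, seq), True = padding."""
    if key_padding_mask is None:
        # Pure causal attention; the lower-triangular mask is applied internally.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    seq_len = q.size(-2)
    # Causal mask: True where attention is allowed.
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device).tril()
    # Invert the padding mask so True marks real (attendable) tokens, broadcast over heads/queries.
    keep = ~key_padding_mask[:, None, None, :]
    return F.scaled_dot_product_attention(q, k, v, attn_mask=causal & keep)
```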

### Memory Optimization

- **Fused Operations**: Reduces memory fragmentation
- **Mixed Precision**: 30-40% memory reduction
- **Gradient Accumulation**: Simulates larger batch sizes
- **Optional Quantization**: Further model compression (sketched below)
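
For the optional quantization mentioned above, one common approach (not necessarily the one used for this model) is post-training dynamic quantization of the linear layers for CPU inference:

```python
import torch

# Assumes `model` is the loaded SnowflakeCore model; compresses Linear layers to int8.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```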

### Training Stability

- **Gradient Clipping**: Prevents exploding gradients
- **Automatic Loss Scaling**: Mixed precision stability
- **Early Stopping**: Prevents overfitting with patience mechanisms

## System Requirements

### Memory Requirements

- **Training**: 16-24GB VRAM (precision dependent)
- **Inference**: 4-6GB VRAM for standard generation
- **Context**: Maximum 2048 tokens input length

### Generation Parameters

Default configuration:

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.9,
  "top_k": 50,
  "max_new_tokens": 50,
  "pad_token_id": 50256,
  "eos_token_id": 50256
}
```
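
Because HuggingFace `.generate()` is not supported (see Limitations), these defaults would be applied inside a custom sampling loop. The function below is an unofficial illustration of how temperature, top-k, and top-p filtering might be applied to one step's next-token logits; it is not the repository's actual sampler.

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Sample one token id per row from next-token logits of shape (batch, vocab)."""
    logits = logits / temperature
    # Top-k: restrict to the k highest-scoring tokens (returned in descending order).
    topk_vals, topk_idx = torch.topk(logits, top_k, dim=-1)
    probs = torch.softmax(topk_vals, dim=-1)
    # Top-p (nucleus): keep tokens until the cumulative probability reaches top_p.
    cumulative = torch.cumsum(probs, dim=-1)
    keep = (cumulative - probs) < top_p  # always keeps the single most likely token
    probs = probs * keep
    probs = probs / probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(probs, num_samples=1)  # index within the top-k set
    return topk_idx.gather(-1, choice)                # map back to vocabulary ids

# Example (assuming `model` and `input_ids` from the Basic Usage sketch):
# next_id = sample_next_token(model(input_ids).logits[:, -1, :])
```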

## Limitations

- **No HuggingFace `.generate()` support**: Use custom generation function
- **Output Quality**: May produce repetitive or nonsensical text for some prompts
- **Hardware Requirements**: GPU recommended for practical inference
- **Context Window**: Limited to 2048 tokens (or 512 tokens)

## Example Output

```
# WIP
```

## Support Me

You can support me via [Ko-fi](https://ko-fi.com/flamef0x), or you can try my [Vast.ai](https://cloud.vast.ai/?ref_id=222345&creator_id=222345&name=Efficient%20Pretraining%20GPU%20Template) template!

### More metadata

- Release date: July 10, 2025