SnowflakeCore G1 Pre-Train Collection

The base models of the G1 generation. All Snowflake models are fully pre-trained, not fine-tuned from a pre-existing model.
An improved version of SnowflakeCore-G1-Tiny: a custom GPT-style transformer language model built from scratch in PyTorch and trained on the common-pile/wikimedia_filtered dataset.
SnowflakeCore-G1-Tiny2 is a GPT-style autoregressive transformer model with ~356M parameters, designed for text generation tasks.
Component | Value |
---|---|
Model Type | Autoregressive Transformer |
Parameters | ~356M (355.9M measured) |
Layers | 24 |
Hidden Size | 1024 |
Attention Heads | 16 |
Head Dimension | 64 |
FFN Dimension | 4096 |
Context Length | 2048 tokens |
Vocabulary Size | 50,257 (GPT-2 tokenizer) |
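As a sanity check, the parameter count can be estimated from the table above. This back-of-the-envelope calculation assumes tied input/output embeddings and learned positional embeddings, and ignores biases and layer norms; it lands within rounding of the 355.9M figure reported in the benchmarks below.

```python
# Rough parameter estimate from the architecture table.
# Assumes tied input/output embeddings; biases and layer norms ignored.
vocab, d_model, n_layers, d_ffn, ctx = 50_257, 1024, 24, 4096, 2048

embeddings = vocab * d_model        # token embeddings   (~51.5M)
positions  = ctx * d_model          # position embeddings (~2.1M)
attention  = 4 * d_model * d_model  # Q, K, V, and output projections
ffn        = 2 * d_model * d_ffn    # up- and down-projections
per_layer  = attention + ffn        # ~12.6M per transformer block

total = embeddings + positions + n_layers * per_layer
print(f"~{total / 1e6:.1f}M")       # ~355.6M
```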
The following benchmarks compare SnowflakeCore-G1-Tiny2, its predecessor, and GPT-2 on key performance and text-quality metrics.
Model | Params | Size (MB) | Speed (tok/s) | Vocab Div. | Dist. Bigrams | Dist. Trigrams | Bigram Repet. | Trigram Repet. |
---|---|---|---|---|---|---|---|---|
SnowflakeCore-G1-Tiny2 | 355.9M | 1357.54 | 22.13 | 0.3440 | 0.7408 | 0.8834 | 0.2592 | 0.1166 |
SnowflakeCore-G1-Tiny | 355.9M | 1357.54 | 22.12 | 0.2780 | 0.6111 | 0.7421 | 0.3889 | 0.2579 |
GPT-2 (small) | 124.4M | 474.70 | 47.73 | 0.2590 | 0.6408 | 0.7946 | 0.3592 | 0.2054 |
Notes:
- Vocabulary Diversity = unique tokens / total tokens
- Distinct N-grams = unique n-grams / total n-grams
- Lower repetition rates indicate better text novelty
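These metrics are straightforward to reproduce, and the table above obeys repetition = 1 − distinct-n for both bigrams and trigrams. Below is a minimal sketch; the whitespace tokenization and function names are illustrative, not taken from the benchmark harness.

```python
def ngrams(tokens, n):
    """All overlapping n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def text_quality_metrics(tokens):
    """Vocabulary diversity, distinct-n ratios, and repetition rates."""
    metrics = {"vocab_diversity": len(set(tokens)) / len(tokens)}
    for n, name in [(2, "bigram"), (3, "trigram")]:
        grams = ngrams(tokens, n)
        distinct = len(set(grams)) / len(grams)
        metrics[f"distinct_{name}s"] = distinct
        metrics[f"{name}_repetition"] = 1.0 - distinct  # repeated fraction
    return metrics

# Toy example with whitespace tokens; the benchmark presumably used token IDs.
print(text_quality_metrics("the cat sat on the mat and the cat slept".split()))
```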
All models report N/A for CPU memory usage across all sequence lengths.
Sequence Length | SnowflakeCore-G1-Tiny | SnowflakeCore-G1-Tiny2 | GPT-2 |
---|---|---|---|
128 | N/A (CPU) | N/A (CPU) | N/A |
512 | N/A (CPU) | N/A (CPU) | N/A |
1024 | N/A (CPU) | N/A (CPU) | N/A |
2048 | N/A (CPU) | N/A (CPU) | N/A |
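If you need the missing memory numbers, one option is to measure the resident-set-size delta around a forward pass with psutil. This is a hedged sketch, not part of the original benchmark harness; it assumes the `model` object from the usage example below and that psutil is installed.

```python
import psutil
import torch

def forward_rss_delta_mb(model, seq_len, vocab_size=50_257):
    """Approximate CPU memory growth (MB) caused by one forward pass."""
    proc = psutil.Process()
    before = proc.memory_info().rss
    input_ids = torch.randint(0, vocab_size, (1, seq_len))
    with torch.no_grad():
        model(input_ids=input_ids)
    return (proc.memory_info().rss - before) / 2**20

for seq_len in (128, 512, 1024, 2048):
    print(seq_len, f"{forward_rss_delta_mb(model, seq_len):.1f} MB")
```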
```bash
pip install torch transformers  # if not already installed
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2",
    trust_remote_code=True,
    force_download=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "FlameF0X/SnowflakeCore-G1-Tiny2",
    trust_remote_code=True,
    force_download=True,
)

def custom_greedy_generate(prompt, max_length=50):
    """Greedy decoding loop; needed because .generate() is not supported."""
    model.eval()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated = input_ids
    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids=generated)
            # Take the logits at the last position and pick the argmax token.
            next_token_logits = outputs["logits"][:, -1, :]
            next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
            generated = torch.cat((generated, next_token_id), dim=1)
            if next_token_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Generate text
prompt = "Once upon a time"
result = custom_greedy_generate(prompt)
print(result)
```
... (same fine-tuning code as above) ...
Training statistics are saved to `training_metrics.json`.
Default configuration (`generation_config.json`):

```json
{
  "do_sample": true,
  "temperature": 1.0,
  "top_p": 0.9,
  "top_k": 50,
  "max_new_tokens": 50,
  "pad_token_id": 50256,
  "eos_token_id": 50256
}
```
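Because `.generate()` is not supported (see the repository notes below), these defaults must be applied by hand. The following is a generic temperature/top-k/top-p sampling loop consistent with the configuration above, not code shipped with the model; it reuses `model` and `tokenizer` from the usage example.

```python
def custom_sample_generate(prompt, max_new_tokens=50,
                           temperature=1.0, top_k=50, top_p=0.9):
    """Sampling variant of the greedy loop, honoring the default config."""
    model.eval()
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids=generated)["logits"][:, -1, :] / temperature
            # Top-k: keep only the k most likely tokens (sorted descending).
            topk_vals, topk_idx = torch.topk(logits, top_k)
            probs = torch.softmax(topk_vals, dim=-1)
            # Top-p: zero out the tail once cumulative probability passes top_p.
            cutoff = torch.cumsum(probs, dim=-1) > top_p
            cutoff[..., 1:] = cutoff[..., :-1].clone()
            cutoff[..., 0] = False
            probs[cutoff] = 0.0
            probs = probs / probs.sum(dim=-1, keepdim=True)
            choice = torch.multinomial(probs, num_samples=1)
            next_id = topk_idx.gather(-1, choice)
            generated = torch.cat((generated, next_id), dim=1)
            if next_id.item() == tokenizer.eos_token_id:
                break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(custom_sample_generate("Once upon a time"))
```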
The repository contains:

- `pytorch_model.bin` - PyTorch model weights
- `model.safetensors` - SafeTensors format weights
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `training_metrics.json` - Training statistics
- `tokenizer.json` - Tokenizer configuration
- `vocab.json` & `merges.txt` - Vocabulary files

`.generate()` support: N/A; use the custom generation function shown above.
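Individual files can be fetched without loading the model via huggingface_hub. This is a generic snippet; the keys inside `training_metrics.json` are not documented in this card.

```python
import json
from huggingface_hub import hf_hub_download

# Download one file from the repo (cached locally by huggingface_hub).
path = hf_hub_download(
    repo_id="FlameF0X/SnowflakeCore-G1-Tiny2",
    filename="training_metrics.json",
)
with open(path) as f:
    print(json.load(f))  # contents depend on the training run
```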
You can support me via Ko-fi, or try my Vast.ai template!