SnowflakeCore-G0-Release-2
SnowflakeCore-G0-Release-2 is part of the initial (G0) generation of the SnowflakeCore series of language models, trained on the DialogMLM-50K dataset with an emphasis on memory-efficient training.
SUPPORT ME
You can support me via https://ko-fi.com/flamef0x
Model details
- Architecture: SnowflakeCore
- Hidden size: 768
- Number of attention heads: 12
- Number of layers: 8
- Feed-forward dimension: 1536
- Maximum sequence length: 768
- Vocabulary size: 30522
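For reference, the hyperparameters above can be summarized in code. This is a hypothetical sketch using common Transformers-style field names; the actual SnowflakeCore config class and its field names may differ:

```python
# Hypothetical summary of the hyperparameters above; field names follow common
# Transformers conventions and are not guaranteed to match the real config.
snowflake_g0_hparams = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 8,
    "intermediate_size": 1536,        # feed-forward dimension
    "max_position_embeddings": 768,   # maximum sequence length
    "vocab_size": 30522,
}

# Sanity check: the hidden size must divide evenly across the attention heads.
head_dim = snowflake_g0_hparams["hidden_size"] // snowflake_g0_hparams["num_attention_heads"]
print(head_dim)  # 64
```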
Flowchart
(Architecture flowchart image omitted.)
HuggingFace Transformers Compatibility
This model is fully compatible with the HuggingFace Transformers library. You can load it using:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FlameF0X/SnowflakeCore-G0-Release-2")
config = AutoConfig.from_pretrained("FlameF0X/SnowflakeCore-G0-Release-2")
model = AutoModel.from_pretrained("FlameF0X/SnowflakeCore-G0-Release-2")
```
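After loading, a minimal forward pass might look as follows. This is a sketch that assumes the standard `AutoModel` output interface (`last_hidden_state`); a custom architecture may expose its outputs differently:

```python
import torch

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():                   # inference only, no gradients needed
    outputs = model(**inputs)

# Standard AutoModel outputs expose the final hidden states like this;
# the exact attribute may differ for a custom architecture.
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, 768)
```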
Memory Optimization Techniques
- Mixed precision training
- Gradient accumulation (8 steps)
- Fused QKV projection
- Pre-norm architecture
- Weight tying between embedding and output layers
- Half-precision model storage
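As an illustration of the first two techniques in the list above, a typical PyTorch pattern for combining automatic mixed precision with 8-step gradient accumulation looks roughly like this (a sketch, not the actual training script; `model`, `optimizer`, and `dataloader` are placeholders):

```python
import torch

def train_one_epoch(model, optimizer, dataloader, accumulation_steps=8):
    """Sketch of mixed-precision training with gradient accumulation."""
    scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid fp16 underflow
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        with torch.cuda.amp.autocast():           # forward pass in mixed precision
            loss = model(**batch).loss / accumulation_steps  # average over accumulated steps
        scaler.scale(loss).backward()             # accumulate scaled gradients
        if (step + 1) % accumulation_steps == 0:  # weight update every 8 mini-batches
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```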
The model weights are stored in both PyTorch (.bin) and safetensors formats for improved security, loading efficiency, and compatibility.
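For example, the half-precision weights can be loaded directly in fp16 with the standard `torch_dtype` argument (safetensors files are picked up automatically when present):

```python
import torch
from transformers import AutoModel

# Load the stored half-precision weights without upcasting to fp32.
model_fp16 = AutoModel.from_pretrained(
    "FlameF0X/SnowflakeCore-G0-Release-2",
    torch_dtype=torch.float16,
)
```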
Training
Epoch | Train Loss | Val Loss |
---|---|---|
0 | 10.0000 | 10.0000 |
1 | 5.1290 | 4.3137 |
2 | 4.1629 | 3.8085 |
3 | 3.7087 | 3.4156 |
4 | 3.4236 | 3.2198 |
5 | 3.2251 | 3.0678 |
6 | 3.0599 | 2.9335 |
7 | 2.9571 | 2.8617 |
8 | 2.8831 | 2.7782 |
9 | 2.8003 | 2.7345 |
10 | 2.7579 | 2.6981 |
11 | 2.7128 | 2.6385 |
12 | 2.6783 | 2.6337 |
13 | 2.6571 | 2.5944 |
14 | 2.6261 | 2.5631 |
15 | 2.5919 | 2.5353 |
16 | 2.5592 | 2.5121 |
17 | 2.5359 | 2.4859 |
18 | 2.4998 | 2.4626 |
19 | 2.4746 | 2.4328 |
20 | 2.4631 | 2.4222 |
21 | 2.4374 | 2.3956 |
22 | 2.3924 | 2.3491 |
23 | 2.3540 | 2.3074 |
24 | 2.3207 | 2.2809 |
25 | 2.2994 | 2.2597 |
26 | 2.2737 | 2.2409 |
27 | 2.2595 | 2.2270 |
28 | 2.2353 | 2.2097 |
29 | 2.2030 | 2.1535 |
30 | 2.1648 | 2.1272 |
31 | 2.1375 | 2.1125 |
32 | 2.1189 | 2.0834 |
33 | 2.1056 | 2.0825 |
34 | 2.0820 | 2.0599 |
35 | 2.0643 | 2.0428 |
36 | 2.0451 | 2.0174 |
37 | 2.0256 | 2.0082 |
38 | 2.0099 | 1.9930 |
39 | 1.9937 | 1.9795 |
40 | 1.9753 | 1.9687 |
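Assuming the reported losses are mean token-level cross-entropy in nats, they translate to perplexity via exp(loss); for the final epoch:

```python
import math

# Final-epoch losses from the table above.
final_train_loss = 1.9753
final_val_loss = 1.9687

print(f"train perplexity ≈ {math.exp(final_train_loss):.2f}")  # ≈ 7.21
print(f"val perplexity   ≈ {math.exp(final_val_loss):.2f}")    # ≈ 7.16
```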
Complexity
The overall time complexity of training SnowflakeCore-G0-Release-2 falls under the O(n²) class due to the self-attention mechanism used in the transformer architecture. Here's a breakdown of the major computational costs:
- Self-attention: O(n² · d), where n is the sequence length (768) and d is the hidden size (768). This term dominates because each token attends to every other token.
- Feed-forward layers: O(n · d²), with two projection layers per block.
- Stacked layers: these per-layer costs are multiplied by the number of layers, L = 8.

Overall per-step complexity:

O(L · (n² · d + n · d²)) ≈ O(n² · d · L)
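Plugging in the model's dimensions gives a back-of-the-envelope count of the dominant per-sequence terms (constant factors such as the number of projections are ignored):

```python
# Dominant per-sequence operation counts from the formula above
# (constant factors omitted).
n, d, L = 768, 768, 8              # sequence length, hidden size, number of layers

attention_ops = n**2 * d           # ≈ 4.5e8 per layer
feedforward_ops = n * d**2         # ≈ 4.5e8 per layer (n == d here, so the terms coincide)
per_sequence = L * (attention_ops + feedforward_ops)

print(f"{per_sequence:.2e}")       # ≈ 7.25e9 operations per sequence
```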
Training over the dataset for E = 40 epochs with batch size B = 8 gives a total training complexity of

O(E · (N / B) · n² · d · L)

where N is the number of training samples.
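Extending that estimate to the full run multiplies by the number of optimizer steps, E · (N / B). N is not stated on this card; purely for illustration, assuming N = 50,000 samples (a guess suggested by the DialogMLM-50K name, not a documented figure):

```python
# Illustrative only: N = 50_000 is an assumption based on the dataset name.
E, B, N = 40, 8, 50_000
per_sequence = 7.25e9                    # dominant per-sequence count from the sketch above

total_steps = E * (N // B)               # 250_000 steps over 40 epochs
print(f"≈ {total_steps * per_sequence:.1e} operations")  # ≈ 1.8e15, up to constant factors
```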
This puts SnowflakeCore-G0-Release-2 in the O(n²) class with respect to sequence length, which is a key scaling bottleneck. Optimizations such as the fused QKV projection, gradient accumulation, and mixed-precision training reduce the practical training cost, though they do not change the asymptotic complexity.