SnowflakeCore-G0-Release-2
SnowflakeCore-G0-Release-2 is part of the initial (G0) generation of the SnowflakeCore series of language models, trained on the DialogMLM-50K dataset with an emphasis on memory-efficient training.
SUPPORT ME
You can support me via https://ko-fi.com/flamef0x
Model details
- Architecture: SnowflakeCore
- Hidden size: 768
- Number of attention heads: 12
- Number of layers: 8
- Feed-forward dimension: 1536
- Maximum sequence length: 768
- Vocabulary size: 30522
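For reference, the hyperparameters above can be summarized in code. This is a hypothetical sketch using common Transformers-style field names; the actual SnowflakeCore config class and its field names may differ:

```python
# Hypothetical summary of the hyperparameters above; field names follow common
# Transformers conventions and are not guaranteed to match the real config.
snowflake_g0_hparams = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 8,
    "intermediate_size": 1536,        # feed-forward dimension
    "max_position_embeddings": 768,   # maximum sequence length
    "vocab_size": 30522,
}

# Sanity check: the hidden size must divide evenly across the attention heads.
head_dim = snowflake_g0_hparams["hidden_size"] // snowflake_g0_hparams["num_attention_heads"]
print(head_dim)  # 64
```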
Flowchart
(Architecture flowchart image omitted.)
HuggingFace Transformers Compatibility
This model is fully compatible with the HuggingFace Transformers library. You can load it using:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FlameF0X/SnowflakeCore-G0-Release-2")
config = AutoConfig.from_pretrained("FlameF0X/SnowflakeCore-G0-Release-2")
model = AutoModel.from_pretrained("FlameF0X/SnowflakeCore-G0-Release-2")
```
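After loading, a minimal forward pass might look as follows. This is a sketch that assumes the standard `AutoModel` output interface (`last_hidden_state`); a custom architecture may expose its outputs differently:

```python
import torch

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():                   # inference only, no gradients needed
    outputs = model(**inputs)

# Standard AutoModel outputs expose the final hidden states like this;
# the exact attribute may differ for a custom architecture.
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, 768)
```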
Memory Optimization Techniques
- Mixed precision training
- Gradient accumulation (8 steps)
- Fused QKV projection
- Pre-norm architecture
- Weight tying between embedding and output layers
- Half-precision model storage
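As an illustration of the first two techniques in the list above, a typical PyTorch pattern for combining automatic mixed precision with 8-step gradient accumulation looks roughly like this (a sketch, not the actual training script; `model`, `optimizer`, and `dataloader` are placeholders):

```python
import torch

def train_one_epoch(model, optimizer, dataloader, accumulation_steps=8):
    """Sketch of mixed-precision training with gradient accumulation."""
    scaler = torch.cuda.amp.GradScaler()          # scales the loss to avoid fp16 underflow
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        with torch.cuda.amp.autocast():           # forward pass in mixed precision
            loss = model(**batch).loss / accumulation_steps  # average over accumulated steps
        scaler.scale(loss).backward()             # accumulate scaled gradients
        if (step + 1) % accumulation_steps == 0:  # weight update every 8 mini-batches
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
```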
The model weights are stored in both PyTorch (.bin) and safetensors formats for improved security, loading efficiency, and compatibility.
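For example, the half-precision weights can be loaded directly in fp16 with the standard `torch_dtype` argument (safetensors files are picked up automatically when present):

```python
import torch
from transformers import AutoModel

# Load the stored half-precision weights without upcasting to fp32.
model_fp16 = AutoModel.from_pretrained(
    "FlameF0X/SnowflakeCore-G0-Release-2",
    torch_dtype=torch.float16,
)
```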
Training
Epoch | Train Loss | Val Loss |
---|---|---|
0 | 10.0000 | 10.0000 |
1 | 5.1290 | 4.3137 |
2 | 4.1629 | 3.8085 |
3 | 3.7087 | 3.4156 |
4 | 3.4236 | 3.2198 |
5 | 3.2251 | 3.0678 |
6 | 3.0599 | 2.9335 |
7 | 2.9571 | 2.8617 |
8 | 2.8831 | 2.7782 |
9 | 2.8003 | 2.7345 |
10 | 2.7579 | 2.6981 |
11 | 2.7128 | 2.6385 |
12 | 2.6783 | 2.6337 |
13 | 2.6571 | 2.5944 |
14 | 2.6261 | 2.5631 |
15 | 2.5919 | 2.5353 |
16 | 2.5592 | 2.5121 |
17 | 2.5359 | 2.4859 |
18 | 2.4998 | 2.4626 |
19 | 2.4746 | 2.4328 |
20 | 2.4631 | 2.4222 |
21 | 2.4374 | 2.3956 |
22 | 2.3924 | 2.3491 |
23 | 2.3540 | 2.3074 |
24 | 2.3207 | 2.2809 |
25 | 2.2994 | 2.2597 |
26 | 2.2737 | 2.2409 |
27 | 2.2595 | 2.2270 |
28 | 2.2353 | 2.2097 |
29 | 2.2030 | 2.1535 |
30 | 2.1648 | 2.1272 |
31 | 2.1375 | 2.1125 |
32 | 2.1189 | 2.0834 |
33 | 2.1056 | 2.0825 |
34 | 2.0820 | 2.0599 |
35 | 2.0643 | 2.0428 |
36 | 2.0451 | 2.0174 |
37 | 2.0256 | 2.0082 |
38 | 2.0099 | 1.9930 |
39 | 1.9937 | 1.9795 |
40 | 1.9753 | 1.9687 |
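Assuming the reported losses are mean token-level cross-entropy in nats, they translate to perplexity via exp(loss); for the final epoch:

```python
import math

# Final-epoch losses from the table above.
final_train_loss = 1.9753
final_val_loss = 1.9687

print(f"train perplexity ≈ {math.exp(final_train_loss):.2f}")  # ≈ 7.21
print(f"val perplexity   ≈ {math.exp(final_val_loss):.2f}")    # ≈ 7.16
```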
Complexity
The overall time complexity of training SnowflakeCore-G0-Release-2 falls under the O(n²) class due to the self-attention mechanism used in the transformer architecture. Here's a breakdown of the major computational costs:
- Self-attention: O(n² · d), where n is the sequence length (768) and d is the hidden size (768). This term dominates because each token attends to every other token.
- Feed-forward layers: O(n · d²), with two projection layers per block.
- Stacked layers: these per-layer costs are multiplied by the number of layers, L = 8.

Overall per-step complexity:

O(L · (n² · d + n · d²)) ≈ O(n² · d · L)
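Plugging in the model's dimensions gives a back-of-the-envelope count of the dominant per-sequence terms (constant factors such as the number of projections are ignored):

```python
# Dominant per-sequence operation counts from the formula above
# (constant factors omitted).
n, d, L = 768, 768, 8              # sequence length, hidden size, number of layers

attention_ops = n**2 * d           # ≈ 4.5e8 per layer
feedforward_ops = n * d**2         # ≈ 4.5e8 per layer (n == d here, so the terms coincide)
per_sequence = L * (attention_ops + feedforward_ops)

print(f"{per_sequence:.2e}")       # ≈ 7.25e9 operations per sequence
```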
Training over the dataset for E = 40 epochs with batch size B = 8 gives a total training complexity of

O(E · (N / B) · n² · d · L)

where N is the number of training samples.
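Extending that estimate to the full run multiplies by the number of optimizer steps, E · (N / B). N is not stated on this card; purely for illustration, assuming N = 50,000 samples (a guess suggested by the DialogMLM-50K name, not a documented figure):

```python
# Illustrative only: N = 50_000 is an assumption based on the dataset name.
E, B, N = 40, 8, 50_000
per_sequence = 7.25e9                    # dominant per-sequence count from the sketch above

total_steps = E * (N // B)               # 250_000 steps over 40 epochs
print(f"≈ {total_steps * per_sequence:.1e} operations")  # ≈ 1.8e15, up to constant factors
```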
This puts SnowflakeCore-G0-Release-2 in the O(n²) class with respect to sequence length, which is a key scaling bottleneck. Optimizations such as the fused QKV projection, gradient accumulation, and mixed-precision training reduce the practical training cost, though they do not change the asymptotic complexity.