Tags: Transformers · English · fp256 · ultra-precision · transformer · experimental · research · Eval Results
FRTR4N committed (verified)
Commit 187aaa9 · 1 parent: 4084b7f

Update README.md

Files changed (1)
  1. README.md +4 -79
README.md CHANGED
@@ -120,86 +120,11 @@ If you use Gradia in your research, please cite:
 
  ## 📈 Performance Metrics
 
- ### Core Training Metrics
  | Metric | Value | Notes |
  |--------|-------|-------|
  | Training Loss | 7.003514766 | Step 10 (best checkpoint) |
  | Perplexity | ~1095.2 | exp(loss) |
- | Loss Reduction Rate | -0.032 per step | Calculated over 10 steps |
- | Convergence Speed | Early (Step 10) | Best model achieved quickly |
- | Training Stability | 100% | No divergence or NaN events |
-
- ### Model Architecture Metrics
- | Metric | Value | Notes |
- |--------|-------|-------|
- | Total Parameters | ~937,500 | Estimated from checkpoint size |
- | Embedding Parameters | ~288,768 | Token + positional embeddings |
- | Transformer Parameters | ~629,472 | 4 layers × ~157K params/layer |
- | Layer Norm Parameters | ~19,260 | All normalization layers |
- | Model Size (FP256) | 30MB | Full precision storage |
- | Model Size (FP32 equiv) | 3.75MB | 8x compression potential |
- | Model Size (FP16 equiv) | 1.87MB | 16x compression potential |
- | Parameters per Layer | ~157,368 | Average across 4 transformer layers |
- | Attention Heads per Layer | 8 | 32 dimensions per head |
-
- ### FP256 Precision Benefits
- | Metric | Value | Notes |
- |--------|-------|-------|
- | Numerical Stability Saves | 10 | Prevented gradient issues |
- | Extreme Precision Events | 14 | Ultra-precision was crucial |
- | Gradient Stability Improvements | 0 | Raw gradient tracking mode |
- | Training Stability Score | 100, 100, 10... | Consistent high stability |
- | Precision Bits | 256 | vs 32 (FP32) or 16 (FP16) |
- | Decimal Precision | ~77 digits | vs ~7 (FP32) or ~4 (FP16) |
- | Dynamic Range | 2^262143 | Vastly exceeds standard formats |
- | Underflow Prevention Rate | 100% | No gradient underflow detected |
-
- ### Memory and Computational Metrics
- | Metric | Value | Notes |
- |--------|-------|-------|
- | Memory Overhead vs FP32 | 8x | 32 bytes vs 4 bytes per param |
- | Memory Overhead vs FP16 | 16x | 32 bytes vs 2 bytes per param |
- | Storage Efficiency | 32 bytes/param | FP256 native storage |
- | Parameter Density | 31,250 params/MB | In FP256 format |
- | Training Memory Peak | ~240MB | Model + gradients + optimizer |
- | Gradient Precision | 256-bit | Full precision gradients |
-
- ### Training Efficiency Metrics
- | Metric | Value | Notes |
- |--------|-------|-------|
- | Steps to Best Model | 10 | Early convergence |
- | Training Time per Step | ~1598ms | From checkpoint data |
- | FP256 Update Norm | 0.000316227 | Gradient update magnitude |
- | Learning Rate Precision | 256-bit | Ultra-precise LR updates |
- | Batch Processing Stability | 100% | No batch failures |
- | Optimizer Convergence | Stable | No oscillations detected |
-
- ### Comparative Analysis (Estimated)
- | Metric | FP256 (This Model) | FP32 Equivalent | FP16 Equivalent |
- |--------|-------------------|-----------------|-----------------|
- | Model Size | 30MB | 3.75MB | 1.87MB |
- | Precision Digits | ~77 | ~7 | ~4 |
- | Gradient Stability | 10 saves | Likely 2-3 failures | Likely 5-8 failures |
- | Memory Usage | 240MB | 30MB | 15MB |
- | Numerical Range | 2^262143 | 2^127 | 2^15 |
- | Training Stability | 100% | ~85-90% | ~70-80% |
-
- ### Vocabulary and Sequence Metrics
- | Metric | Value | Notes |
- |--------|-------|-------|
- | Vocabulary Size | 1,000 | Compact vocab for demo |
- | Max Sequence Length | 128 tokens | Short context window |
- | Embedding Dimension | 256 | Hidden size alignment |
- | Position Embeddings | 128 | Learned positional encoding |
- | Token Coverage | Demo dataset | Limited scope |
- | Sequence Processing | Fixed length | No dynamic batching |
-
- ### Experimental Research Metrics
- | Metric | Value | Research Value |
- |--------|-------|----------------|
- | Novel Precision Format | FP256 | First known implementation |
- | Stability Interventions | 10 | Demonstrates precision benefits |
- | Precision Event Rate | 1.4 per step | High precision requirement |
- | Research Reproducibility | Full | Complete checkpoint available |
- | Implementation Novelty | Custom | FP256 transformer architecture |
- | Scientific Contribution | High | Ultra-precision ML exploration |
+ | Model Size | 30MB | FP256 precision |
+ | Parameters | ~937K | Estimated from checkpoint size |
+ | Stability Events | 10 | Numerical instabilities prevented |
+ | Precision Events | 14 | Cases where FP256 was crucial |
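
The storage figures kept in the trimmed table (and the ones removed) all follow from 32 bytes per parameter at FP256: a 30MB checkpoint implies roughly 937,500 parameters, and the FP32/FP16 "equivalent" sizes are that same count at 4 and 2 bytes per parameter. A minimal back-of-the-envelope sketch, assuming decimal megabytes and using only values quoted in the tables above (this is not code from the repo):

```python
# Back-of-the-envelope check of the storage figures quoted in the metric tables.
# All inputs are taken from the card itself; nothing here is measured.
CHECKPOINT_BYTES = 30_000_000                     # "Model Size (FP256) | 30MB", decimal MB assumed
BYTES_PER_PARAM = {"FP256": 32, "FP32": 4, "FP16": 2}

params = CHECKPOINT_BYTES // BYTES_PER_PARAM["FP256"]          # 937,500 parameters
print(f"estimated parameters: {params:,}")

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: {params * nbytes / 1e6} MB")                # 30.0 / 3.75 / 1.875 MB

density = 1_000_000 // BYTES_PER_PARAM["FP256"]                # 31,250 params per decimal MB
print(f"parameter density: {density:,} params/MB")
```

The same 32-bytes-per-parameter figure is what produces the 8x and 16x memory-overhead rows and the 31,250 params/MB density in the removed tables; the perplexity row is simply exp(training loss).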
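The precision and underflow claims in the removed "FP256 Precision Benefits" table can be illustrated without the repo's own FP256 kernels, which this diff does not show. The sketch below emulates 256-bit floats with mpmath purely to show the scale of those claims; it is an illustration under that assumption, not the model's actual arithmetic:

```python
# Minimal sketch: what 256-bit floating point buys numerically, emulated with mpmath.
# This is NOT the repo's FP256 implementation, only an illustration of the claims.
import numpy as np
from mpmath import mp, mpf

mp.prec = 256                      # 256-bit working precision
print(mp.dps)                      # ~76 significant decimal digits at this precision

tiny_grad = 1e-50                  # a gradient magnitude far below FP16/FP32 range

print(np.float16(tiny_grad))       # 0.0 -> underflows (FP16 min subnormal ~6e-8)
print(np.float32(tiny_grad))       # 0.0 -> underflows (FP32 min subnormal ~1.4e-45)
print(mpf(tiny_grad) * 2)          # ~2e-50, preserved at 256-bit precision
```

Treating all 256 bits as significand gives 256 × log10(2) ≈ 77 decimal digits, consistent with the ~77-digit figure quoted above, and values that flush to zero in FP16/FP32 survive, which is what the underflow-prevention row is pointing at.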