Update README.md
README.md
CHANGED
```diff
@@ -120,86 +120,11 @@ If you use Gradia in your research, please cite:
 
 ## 📈 Performance Metrics
 
-### Core Training Metrics
 | Metric | Value | Notes |
 |--------|-------|-------|
 | Training Loss | 7.003514766 | Step 10 (best checkpoint) |
 | Perplexity | ~1100.5 | exp(loss) |
-
-
-
-
-### Model Architecture Metrics
-| Metric | Value | Notes |
-|--------|-------|-------|
-| Total Parameters | ~937,500 | Estimated from checkpoint size |
-| Embedding Parameters | ~288,768 | Token + positional embeddings |
-| Transformer Parameters | ~629,472 | 4 layers × ~157K params/layer |
-| Layer Norm Parameters | ~19,260 | All normalization layers |
-| Model Size (FP256) | 30MB | Full precision storage |
-| Model Size (FP32 equiv) | 3.75MB | 8x compression potential |
-| Model Size (FP16 equiv) | 1.87MB | 16x compression potential |
-| Parameters per Layer | ~157,368 | Average across 4 transformer layers |
-| Attention Heads per Layer | 8 | 32 dimensions per head |
-
-### FP256 Precision Benefits
-| Metric | Value | Notes |
-|--------|-------|-------|
-| Numerical Stability Saves | 10 | Prevented gradient issues |
-| Extreme Precision Events | 14 | Ultra-precision was crucial |
-| Gradient Stability Improvements | 0 | Raw gradient tracking mode |
-| Training Stability Score | 100, 100, 10... | Consistent high stability |
-| Precision Bits | 256 | vs 32 (FP32) or 16 (FP16) |
-| Decimal Precision | ~77 digits | vs ~7 (FP32) or ~4 (FP16) |
-| Dynamic Range | 2^262143 | Vastly exceeds standard formats |
-| Underflow Prevention Rate | 100% | No gradient underflow detected |
-
-### Memory and Computational Metrics
-| Metric | Value | Notes |
-|--------|-------|-------|
-| Memory Overhead vs FP32 | 8x | 32 bytes vs 4 bytes per param |
-| Memory Overhead vs FP16 | 16x | 32 bytes vs 2 bytes per param |
-| Storage Efficiency | 32 bytes/param | FP256 native storage |
-| Parameter Density | 31,250 params/MB | In FP256 format |
-| Training Memory Peak | ~240MB | Model + gradients + optimizer |
-| Gradient Precision | 256-bit | Full precision gradients |
-
-### Training Efficiency Metrics
-| Metric | Value | Notes |
-|--------|-------|-------|
-| Steps to Best Model | 10 | Early convergence |
-| Training Time per Step | ~1598ms | From checkpoint data |
-| FP256 Update Norm | 0.000316227 | Gradient update magnitude |
-| Learning Rate Precision | 256-bit | Ultra-precise LR updates |
-| Batch Processing Stability | 100% | No batch failures |
-| Optimizer Convergence | Stable | No oscillations detected |
-
-### Comparative Analysis (Estimated)
-| Metric | FP256 (This Model) | FP32 Equivalent | FP16 Equivalent |
-|--------|-------------------|-----------------|-----------------|
-| Model Size | 30MB | 3.75MB | 1.87MB |
-| Precision Digits | ~77 | ~7 | ~4 |
-| Gradient Stability | 10 saves | Likely 2-3 failures | Likely 5-8 failures |
-| Memory Usage | 240MB | 30MB | 15MB |
-| Numerical Range | 2^262143 | 2^127 | 2^15 |
-| Training Stability | 100% | ~85-90% | ~70-80% |
-
-### Vocabulary and Sequence Metrics
-| Metric | Value | Notes |
-|--------|-------|-------|
-| Vocabulary Size | 1,000 | Compact vocab for demo |
-| Max Sequence Length | 128 tokens | Short context window |
-| Embedding Dimension | 256 | Hidden size alignment |
-| Position Embeddings | 128 | Learned positional encoding |
-| Token Coverage | Demo dataset | Limited scope |
-| Sequence Processing | Fixed length | No dynamic batching |
-
-### Experimental Research Metrics
-| Metric | Value | Research Value |
-|--------|-------|----------------|
-| Novel Precision Format | FP256 | First known implementation |
-| Stability Interventions | 10 | Demonstrates precision benefits |
-| Precision Event Rate | 1.4 per step | High precision requirement |
-| Research Reproducibility | Full | Complete checkpoint available |
-| Implementation Novelty | Custom | FP256 transformer architecture |
-| Scientific Contribution | High | Ultra-precision ML exploration |
+| Model Size | 30MB | FP256 precision |
+| Parameters | ~937K | Estimated from checkpoint size |
+| Stability Events | 10 | Numerical instabilities prevented |
+| Precision Events | 14 | Cases where FP256 was crucial |
```
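The perplexity entry follows directly from the reported loss (perplexity = exp(loss) for a mean cross-entropy loss). A quick check in Python, using only the loss value quoted in the table:

```python
import math

loss = 7.003514766          # best-checkpoint training loss (step 10)
perplexity = math.exp(loss) # perplexity = exp(mean cross-entropy loss)
print(f"{perplexity:.1f}")  # 1100.5
```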
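The embedding-parameter figure in the removed architecture table is recoverable from the vocabulary and sequence metrics. A sketch of that arithmetic; the per-layer transformer breakdown is not derivable from the README alone, so the quoted totals are taken as given:

```python
# Figures quoted from the metrics tables above.
VOCAB_SIZE = 1_000   # vocabulary size
D_MODEL = 256        # embedding dimension
MAX_SEQ_LEN = 128    # learned position embeddings

token_emb = VOCAB_SIZE * D_MODEL    # 256,000
pos_emb = MAX_SEQ_LEN * D_MODEL     # 32,768
print(token_emb + pos_emb)          # 288,768 -> the "~288,768" row

# Adding the quoted transformer and layer-norm totals recovers the
# overall parameter estimate.
print(288_768 + 629_472 + 19_260)   # 937,500 -> the "~937,500" row
```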
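The size and density figures are bytes-per-parameter arithmetic (the tables use decimal megabytes, i.e. 10^6 bytes). A minimal sketch, assuming the ~937,500-parameter estimate:

```python
PARAMS = 937_500  # estimated total parameter count

# 32 bytes per FP256 parameter, 4 per FP32, 2 per FP16.
for fmt, nbytes in [("FP256", 32), ("FP32", 4), ("FP16", 2)]:
    print(f"{fmt}: {PARAMS * nbytes / 1e6:.3f} MB")  # 30.000 / 3.750 / 1.875
    # (the table's 1.87MB row truncates 1.875)

# Parameter density at 32 bytes/param.
print(f"{1e6 / 32:.0f} params/MB")  # 31250
```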
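The "~77 digits" and dynamic-range rows can be sanity-checked with arbitrary-precision arithmetic. The sketch below emulates a 256-bit significand with mpmath; this is an assumption for illustration, not the repository's actual FP256 kernel:

```python
from mpmath import mp, mpf

mp.prec = 256   # 256-bit binary precision
print(mp.dps)   # ~76 decimal digits; 256 * log10(2) ≈ 77.06, the "~77 digits" row

# mpmath's exponent is unbounded, so values far outside FP32's range
# (max ~2^128) are representable; an IEEE-style binary256 layout would
# still allow magnitudes up to ~2^262143, the "Dynamic Range" row.
print(mpf(2) ** 2_000 > mpf(10) ** 600)  # True: far beyond FP64's ~1.8e308 ceiling
```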
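The "Underflow Prevention Rate" row refers to gradient products that would flush to zero in FP32. A small illustration of that failure mode, again using mpmath as a stand-in for the 256-bit side rather than the project's own code:

```python
import numpy as np
from mpmath import mp, mpf

mp.prec = 256  # emulate a 256-bit significand

# The product of these factors lies below FP32's smallest subnormal (~1.4e-45).
print(np.float32(1e-25) * np.float32(1e-25))  # 0.0 -- flushed to zero in FP32
print(mpf("1e-25") * mpf("1e-25"))            # ≈ 1.0e-50 -- preserved at 256 bits
```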