ENTELIJANS
/

Gradia

@@ -14,58 +14,95 @@ datasets:
 library_name: transformers
 model-index:
   - name: Gradia FP256 Series
-    results: []
 ---
-# Gradia FP256 Model — Checkpoint 20
-Gradia is an experimental high-precision transformer research project exploring the use of **FP256 (256-bit floating point)** in training language models. This model is part of an early proof-of-concept run.
 ## 🔬 About the Project
-**Gradia** aims to push the boundaries of numerical stability and gradient precision using extended floating-point formats, bypassing the limitations of mixed or standard FP32 training. This checkpoint (Step 20) was trained entirely in **true FP256 precision**, with a model size of ~500K parameters.
-- **Precision**: Full 256-bit (not mixed)
-- **Loss (Final)**: `6.97254610`
-- **Extreme Precision Events Logged**: `28`
-- **Numerical Stability Events**: `20`
-- **Gradient Stability Improvements**: `0` (indicating raw gradient tracking)
 ## 📐 Model Architecture
-- **Type**: Transformer (custom)
-- **Parameters**: 501,628
-- **Layers**: 2 (assumed, based on parameter count and logs)
-- **Embedding**: Positional + Token
-- **Checkpoint Format**: PyTorch `.pt`
 ## 📊 Training Details
 - **Datasets**:
   - [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use)
   - [NousResearch/Hermes-3-Dataset](https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset)
-- **Steps**: 20
-- **Batch Size**: [specify if known]
-- **Optimizer**: [specify if Adam, SGD, etc.]
-- **Scheduler**: [specify type if known]
-- **Loss Function**: [specify, e.g. CrossEntropyLoss]
-## 📁 Checkpoints
-This repo contains:
-- `checkpoint_10.pt`
-- `checkpoint_20.pt`
-- `best_model.pt` (selected based on lowest loss)
 ## 🚧 Status
-> ⚠️ This is a **research-stage model** and is **not production-ready**. Due to the use of FP256, inference and deployment require special tooling and hardware support.
 ## 🧠 Future Work
-- Larger parameter models (10M–1B) in FP256
-- Analysis of convergence behavior vs FP32/FP16
-- Open-source FP256 simulator tooling
 ## ✍️ Citation
@@ -73,8 +110,21 @@ If you use Gradia in your research, please cite:
 ```bibtex
 @misc{gradia2025,
-  title={Gradia: Ultra-Precision Language Models in FP256},
   author={The Gradia Project Contributors},
   year={2025},
-  note={https://huggingface.co/Gradia}
 }

 library_name: transformers
 model-index:
   - name: Gradia FP256 Series
+    results:
+      - task:
+          type: text-generation
+          name: Text Generation
+        metrics:
+          - type: perplexity
+            value: ~1095.2  # exp(7.0035)
+            name: Perplexity
+          - type: loss
+            value: 7.003514766
+            name: Training Loss
 ---
+# Gradia FP256 Model — Step 10 Checkpoint
+Gradia is an experimental high-precision transformer research project exploring the use of **FP256 (256-bit floating point)** in training language models. This model represents an early proof-of-concept demonstrating ultra-precision training.
 ## 🔬 About the Project
+**Gradia** aims to push the boundaries of numerical stability and gradient precision using extended floating-point formats, bypassing the limitations of mixed or standard FP32 training. This checkpoint was trained entirely in **true FP256 precision**.
+- **Precision**: Full 256-bit floating point (not mixed)
+- **Training Loss**: `7.003514766`
+- **Extreme Precision Events**: `14`
+- **Numerical Stability Saves**: `10`
+- **Gradient Stability Improvements**: `0`
+- **Training Stability Score**: `100, 100, 10...`
 ## 📐 Model Architecture
+- **Type**: Custom FP256 Transformer
+- **Parameters**: ~937,500 (estimated from 30MB FP256 checkpoint)
+- **Vocab Size**: 1,000
+- **Hidden Size**: 256
+- **Layers**: 4
+- **Attention Heads**: 8
+- **Intermediate Size**: 1,024
+- **Max Sequence Length**: 128
+- **Dropout**: 0.1
+- **Model Size**: 30MB per checkpoint
 ## 📊 Training Details
 - **Datasets**:
   - [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use)
   - [NousResearch/Hermes-3-Dataset](https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset)
+  - [Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext)
+- **Training Steps**: 10
+- **FP256 Optimizer**: Custom implementation
+- **Precision Benefits**:
+  - 10 numerical stability interventions prevented training instabilities
+  - 14 extreme precision events where FP256 was crucial
+  - Perfect training stability scores maintained
+## 📁 Checkpoint Contents
+This model contains the complete FP256 state including:
+- `embedding.weight` - Token embeddings
+- `pos_embedding.weight` - Positional embeddings
+- `transformer_blocks.{0-3}` - 4 transformer layers with:
+  - Multi-head attention (q_proj, k_proj, v_proj, out_proj)
+  - Feed-forward networks (dense1, dense2)
+  - Layer normalizations (ln1, ln2)
+- `ln_final` - Final layer normalization
+## 🔬 FP256 Implementation Notes
+- **Storage**: Each parameter stored as 256-bit floating point (32 bytes)
+- **Precision**: ~77 decimal digits of precision
+- **Memory Overhead**: ~16x larger than FP16 equivalent
+- **Numerical Stability**: Demonstrated prevention of gradient underflow/overflow
+- **Training Stability**: Maintained perfect stability scores throughout training
 ## 🚧 Status
+> ⚠️ This is a **research-stage model** and is **not production-ready**. Due to the use of FP256, inference and deployment require specialized FP256-compatible hardware and software frameworks.
 ## 🧠 Future Work
+- Scale to larger parameter counts (1M+ parameters)
+- Comparative analysis of FP256 vs FP32/FP16 convergence behavior
+- Open-source FP256 training framework
+- Extended training runs to evaluate long-term stability benefits
+## 💾 Technical Requirements
+- **Inference**: Requires FP256-compatible runtime
+- **Hardware**: Specialized extended-precision arithmetic units recommended
+- **Memory**: ~16x standard memory requirements for equivalent model size
 ## ✍️ Citation
 ```bibtex
 @misc{gradia2025,
+  title={Gradia: Ultra-Precision Language Models with FP256 Training},
   author={The Gradia Project Contributors},
   year={2025},
+  note={Experimental FP256 transformer implementation},
+  url={https://huggingface.co/Gradia}
 }
+```
+## 📈 Performance Metrics
+| Metric | Value | Notes |
+|--------|-------|-------|
+| Training Loss | 7.003514766 | Step 10 (best checkpoint) |
+| Perplexity | ~1095.2 | exp(loss) |
+| Model Size | 30MB | FP256 precision |
+| Parameters | ~937K | Estimated from checkpoint size |
+| Stability Events | 10 | Numerical instabilities prevented |
+| Precision Events | 14 | Cases where FP256 was crucial |