Gradia FP256 Model
Gradia is an experimental high-precision transformer research project exploring the use of FP256 (256-bit floating point) in training language models. This model represents an early proof-of-concept demonstrating ultra-precision training.
About the Project
Gradia aims to push the boundaries of numerical stability and gradient precision using extended floating-point formats, bypassing the limitations of mixed or standard FP32 training. This checkpoint was trained entirely in true FP256 precision.
- Precision: Full 256-bit floating point (not mixed)
- Training Loss: 7.003514766
- Extreme Precision Events: 14
- Numerical Stability Saves: 10
- Gradient Stability Improvements: 0
- Training Stability Score: 100, 100, 10...
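The card does not define "extreme precision events" or "stability saves" formally. As a rough illustration of the failure mode extended precision guards against, the hedged sketch below uses Python's mpmath as a software stand-in for FP256 (Gradia's actual FP256 runtime is not public): a product of small gradient-scale values flushes to zero in float32 but survives at higher precision.

```python
# Illustration only: a product of small gradient-scale values underflows to
# zero in float32 but remains representable at extended precision.
# mpmath stands in for FP256 here; Gradia's actual FP256 runtime is not public.
import numpy as np
from mpmath import mp, mpf

mp.prec = 237  # significand width in the ballpark of a 256-bit float format

fp32_product = np.float32(1e-25) * np.float32(1e-25)  # underflows to 0.0
ext_product = mpf("1e-25") * mpf("1e-25")             # keeps the value 1e-50

print(fp32_product)  # 0.0 -- the gradient signal is lost
print(ext_product)   # 1.0e-50 -- the signal is preserved
```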
Model Architecture
- Type: Custom FP256 Transformer
- Parameters: ~937,500 (estimated from 30MB FP256 checkpoint)
- Vocab Size: 1,000
- Hidden Size: 256
- Layers: 4
- Attention Heads: 8
- Intermediate Size: 1,024
- Max Sequence Length: 128
- Dropout: 0.1
- Model Size: 30MB per checkpoint
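For convenience, the hyperparameters above can be collected into a small configuration object. The sketch below is illustrative only; the field names are assumptions, since the Gradia code and its config schema have not been released.

```python
# Configuration sketch mirroring the architecture listed above.
# Field names are assumptions; the real Gradia config schema is unpublished.
from dataclasses import dataclass

@dataclass
class GradiaConfig:
    vocab_size: int = 1_000
    hidden_size: int = 256
    num_layers: int = 4
    num_attention_heads: int = 8
    intermediate_size: int = 1_024
    max_seq_len: int = 128
    dropout: float = 0.1
    param_dtype: str = "fp256"  # full 256-bit parameters, not mixed precision

config = GradiaConfig()
assert config.hidden_size % config.num_attention_heads == 0  # 32-dim attention heads
```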
Training Details
- Datasets:
- Training Steps: 10
- FP256 Optimizer: Custom implementation
- Precision Benefits:
  - 10 numerical stability interventions prevented training instabilities
  - 14 extreme precision events where FP256 was crucial
  - Perfect training stability scores maintained
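The card does not describe how these interventions are detected. One plausible mechanism, sketched below purely as an assumption (not Gradia's published optimizer logic), is to count gradient magnitudes that would fall outside the representable range of a lower-precision format such as FP32.

```python
# Hedged sketch of one way "stability events" could be counted: gradients whose
# magnitude FP32 would flush to zero or overflow. This is an assumption, not
# Gradia's published optimizer logic.
from mpmath import mp, mpf

mp.prec = 237  # work in an extended-precision regime

FP32_MIN_NORMAL = mpf(2) ** -126                      # smallest normal float32
FP32_MAX = (mpf(2) - mpf(2) ** -23) * mpf(2) ** 127   # largest finite float32

def count_stability_events(gradients):
    """Count gradient values that FP32 could not represent faithfully."""
    events = 0
    for g in gradients:
        mag = abs(mpf(g))
        if mag != 0 and (mag < FP32_MIN_NORMAL or mag > FP32_MAX):
            events += 1
    return events

print(count_stability_events(["1e-45", "3.2e-3", "1e40"]))  # -> 2
```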
Checkpoint Contents
This model contains the complete FP256 state including:
- embedding.weight: Token embeddings
- pos_embedding.weight: Positional embeddings
- transformer_blocks.{0-3}: 4 transformer layers, each with:
  - Multi-head attention (q_proj, k_proj, v_proj, out_proj)
  - Feed-forward networks (dense1, dense2)
  - Layer normalizations (ln1, ln2)
- ln_final: Final layer normalization
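The exact key names and the on-disk encoding of the FP256 tensors are not documented, so the snippet below only reconstructs the parameter names implied by the list above (with weight/bias pairs assumed for each module); treat it as illustrative rather than as the checkpoint's actual schema.

```python
# Reconstruct the parameter names implied by the checkpoint layout above.
# Key names and weight/bias pairing are assumptions; the real schema is unpublished.
expected_keys = [
    "embedding.weight",
    "pos_embedding.weight",
    "ln_final.weight",
    "ln_final.bias",
]
for i in range(4):  # transformer_blocks.0 through transformer_blocks.3
    for module in ("q_proj", "k_proj", "v_proj", "out_proj",
                   "dense1", "dense2", "ln1", "ln2"):
        expected_keys.append(f"transformer_blocks.{i}.{module}.weight")
        expected_keys.append(f"transformer_blocks.{i}.{module}.bias")

print(len(expected_keys), "parameter tensors expected")  # 68
```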
FP256 Implementation Notes
- Storage: Each parameter stored as 256-bit floating point (32 bytes)
- Precision: ~77 decimal digits of precision
- Memory Overhead: ~16x larger than FP16 equivalent
- Numerical Stability: Demonstrated prevention of gradient underflow/overflow
- Training Stability: Maintained perfect stability scores throughout training
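These figures can be sanity-checked directly: 256 bits is 32 bytes per parameter, 16x the 2 bytes of FP16, and 256 * log10(2) is roughly 77 decimal digits (somewhat fewer in an IEEE-style layout, where some bits are reserved for the exponent). The snippet below uses mpmath as a software stand-in, since CPython has no native 256-bit float type.

```python
# Sanity checks on the storage and precision figures above, with mpmath
# standing in for a native 256-bit float type (which CPython does not have).
import math
from mpmath import mp, mpf, nstr

bits = 256
print(bits // 8)                    # 32 bytes per parameter
print((bits // 8) // 2)             # 16x the 2 bytes of an FP16 parameter
print(round(bits * math.log10(2)))  # ~77 decimal digits if every bit carried precision

mp.prec = bits                      # emulate a 256-bit significand
print(nstr(mpf(1) / 3, 77))         # 1/3 rendered to ~77 significant digits
```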
Status
This is a research-stage model and is not production-ready. Due to the use of FP256, inference and deployment require specialized FP256-compatible hardware and software frameworks.
Future Work
- Scale to larger parameter counts (1M+ parameters)
- Comparative analysis of FP256 vs FP32/FP16 convergence behavior
- Open-source FP256 training framework
- Extended training runs to evaluate long-term stability benefits
Technical Requirements
- Inference: Requires FP256-compatible runtime
- Hardware: Specialized extended-precision arithmetic units recommended
- Memory: ~16x the memory footprint of an equivalent FP16 model
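A back-of-the-envelope check ties these requirements to the figures quoted earlier (~937K parameters at 32 bytes each, a 30MB checkpoint):

```python
# Parameter memory for the estimated model size at different precisions.
params = 937_500                 # estimate quoted above (30 MB / 32 bytes)
fp256_bytes = params * 32        # 32 bytes per FP256 parameter
fp16_bytes = params * 2          # 2 bytes per FP16 parameter

print(f"FP256: {fp256_bytes / 1e6:.1f} MB")               # 30.0 MB, matching the checkpoint
print(f"FP16:  {fp16_bytes / 1e6:.1f} MB")                # ~1.9 MB
print(f"Overhead vs FP16: {fp256_bytes // fp16_bytes}x")  # 16x
```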
Citation
If you use Gradia in your research, please cite:
@misc{gradia2025,
title={Gradia: Ultra-Precision Language Models with FP256 Training},
author={Entelijans, GLCTC Corp},
year={2025},
note={Experimental FP256 transformer implementation},
url={https://huggingface.co/ENTELIJANS}
}
Performance Metrics
| Metric | Value | Notes |
|---|---|---|
| Training Loss | 7.003514766 | Step 10 (best checkpoint) |
| Perplexity | ~1095.2 | exp(loss) |
| Model Size | 30MB | FP256 precision |
| Parameters | ~937K | Estimated from checkpoint size |
| Stability Events | 10 | Numerical instabilities prevented |
| Precision Events | 14 | Cases where FP256 was crucial |
Evaluation Results
- Perplexity (self-reported): ~1095.2
- Training Loss (self-reported): 7.004