---
language: en
license: apache-2.0
tags:
- fp256
- ultra-precision
- transformer
- experimental
- research
datasets:
- interstellarninja/hermes_reasoning_tool_use
- NousResearch/Hermes-3-Dataset
- Salesforce/wikitext
library_name: transformers
model-index:
- name: Gradia FP256 Series
results:
- task:
type: text-generation
name: Text Generation
metrics:
- type: perplexity
value: 1100.5 # exp(7.003514766)
name: Perplexity
- type: loss
value: 7.003514766
name: Training Loss
---
# Gradia FP256 Model
Gradia is an experimental high-precision transformer research project exploring the use of **FP256 (256-bit floating point)** in training language models. This model represents an early proof-of-concept demonstrating ultra-precision training.
## About the Project
**Gradia** aims to push the boundaries of numerical stability and gradient precision by using extended floating-point formats, bypassing the limitations of mixed-precision or standard FP32 training. This checkpoint was trained entirely in **true FP256 precision**.
- **Precision**: Full 256-bit floating point (not mixed)
- **Training Loss**: `7.003514766`
- **Extreme Precision Events**: `14`
- **Numerical Stability Saves**: `10`
- **Gradient Stability Improvements**: `0`
- **Training Stability Score**: `100, 100, 10...`
## Model Architecture
- **Type**: Custom FP256 Transformer
- **Parameters**: ~937,500 (estimated from 30MB FP256 checkpoint)
- **Vocab Size**: 1,000
- **Hidden Size**: 256
- **Layers**: 4
- **Attention Heads**: 8
- **Intermediate Size**: 1,024
- **Max Sequence Length**: 128
- **Dropout**: 0.1
- **Model Size**: 30MB per checkpoint
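For reference, the hyperparameters above can be collected into a small config object. This is an illustrative sketch only; `GradiaFP256Config` and its field names are assumptions and do not correspond to a published class in this repository.

```python
from dataclasses import dataclass

@dataclass
class GradiaFP256Config:
    # Hypothetical config mirroring the hyperparameters listed above.
    vocab_size: int = 1_000
    hidden_size: int = 256
    num_layers: int = 4
    num_attention_heads: int = 8
    intermediate_size: int = 1_024
    max_seq_length: int = 128
    dropout: float = 0.1
    precision_bits: int = 256  # full FP256 throughout, not mixed precision
```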
## Training Details
- **Datasets**:
- [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use)
- [NousResearch/Hermes-3-Dataset](https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset)
- [Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext)
- **Training Steps**: 10
- **FP256 Optimizer**: Custom implementation
- **Precision Benefits**:
- 10 numerical stability interventions prevented training instabilities
- 14 extreme precision events where FP256 was crucial
- Perfect training stability scores maintained
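The custom FP256 optimizer itself is not published here. As a rough illustration of the headroom 256-bit arithmetic provides, the sketch below performs a plain SGD-style update with Python's `mpmath` library at 256-bit precision; this is an assumption about how such an update might look, not the project's actual optimizer code.

```python
from mpmath import mp, mpf, nstr

mp.prec = 256  # 256-bit significand, roughly 77 decimal digits

def fp256_sgd_step(param, grad, lr=mpf("1e-3")):
    # Plain SGD update carried out entirely in 256-bit arithmetic.
    return param - lr * grad

w = mpf("0.123456789")
g = mpf("1e-40")  # an update this small relative to w is lost in FP16/FP32/FP64
w_new = fp256_sgd_step(w, g)
print(nstr(w_new, 50))  # the 1e-43 change to w is still represented
```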
## Checkpoint Contents
This model contains the complete FP256 state including:
- `embedding.weight` - Token embeddings
- `pos_embedding.weight` - Positional embeddings
- `transformer_blocks.{0-3}` - 4 transformer layers with:
- Multi-head attention (q_proj, k_proj, v_proj, out_proj)
- Feed-forward networks (dense1, dense2)
- Layer normalizations (ln1, ln2)
- `ln_final` - Final layer normalization
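For orientation, the parameter names above combined with the architecture section imply roughly the following name-to-shape layout. The shapes below are inferred assumptions (bias terms and transpose conventions may differ), not values read from the checkpoint.

```python
# Inferred layout only; shapes are assumptions based on the architecture section.
hidden, inter, vocab, max_seq, n_layers = 256, 1024, 1000, 128, 4

shapes = {
    "embedding.weight": (vocab, hidden),
    "pos_embedding.weight": (max_seq, hidden),
    "ln_final.weight": (hidden,),
}
for i in range(n_layers):
    block = f"transformer_blocks.{i}"
    for proj in ("q_proj", "k_proj", "v_proj", "out_proj"):
        shapes[f"{block}.{proj}.weight"] = (hidden, hidden)
    shapes[f"{block}.dense1.weight"] = (inter, hidden)  # hidden -> intermediate
    shapes[f"{block}.dense2.weight"] = (hidden, inter)  # intermediate -> hidden
    shapes[f"{block}.ln1.weight"] = (hidden,)
    shapes[f"{block}.ln2.weight"] = (hidden,)
```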
## FP256 Implementation Notes
- **Storage**: Each parameter stored as 256-bit floating point (32 bytes)
- **Precision**: ~77 decimal digits of precision
- **Memory Overhead**: ~16x larger than FP16 equivalent
- **Numerical Stability**: Demonstrated prevention of gradient underflow/overflow
- **Training Stability**: Maintained perfect stability scores throughout training
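The precision and storage figures above follow from simple arithmetic and can be checked with plain Python (nothing here is specific to this checkpoint):

```python
import math

significand_bits = 256  # full 256-bit precision, per the notes above
print(math.floor(significand_bits * math.log10(2)))  # 77 decimal digits

bytes_fp256, bytes_fp16 = 256 // 8, 16 // 8
print(bytes_fp256, bytes_fp256 // bytes_fp16)  # 32 bytes per value, 16x FP16
```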
## Status
> This is a **research-stage model** and is **not production-ready**. Due to the use of FP256, inference and deployment require specialized FP256-compatible hardware and software frameworks.
## Future Work
- Scale to larger parameter counts (1M+ parameters)
- Comparative analysis of FP256 vs FP32/FP16 convergence behavior
- Open-source FP256 training framework
- Extended training runs to evaluate long-term stability benefits
## Technical Requirements
- **Inference**: Requires FP256-compatible runtime
- **Hardware**: Specialized extended-precision arithmetic units recommended
- **Memory**: ~16x standard memory requirements for equivalent model size
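As a back-of-the-envelope check of the ~16x figure, using the ~937,500-parameter estimate from the architecture section:

```python
params = 937_500
bytes_per_value = {"FP256": 32, "FP32": 4, "FP16": 2}
for fmt, nbytes in bytes_per_value.items():
    print(f"{fmt}: {params * nbytes / 1e6:.2f} MB")
# FP256: 30.00 MB, FP32: 3.75 MB, FP16: 1.88 MB  (~16x FP16)
```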
## Citation
If you use Gradia in your research, please cite:
```bibtex
@misc{gradia2025,
  title={Gradia: Ultra-Precision Language Models with FP256 Training},
  author={Entelijans, GLCTC Corp},
  year={2025},
  note={Experimental FP256 transformer implementation},
  url={https://huggingface.co/ENTELIJANS}
}
```
## Performance Metrics
| Metric | Value | Notes |
|--------|-------|-------|
| Training Loss | 7.003514766 | Step 10 (best checkpoint) |
| Perplexity | ~1100.5 | exp(loss) |
| Model Size | 30MB | FP256 precision |
| Parameters | ~937K | Estimated from checkpoint size |
| Stability Events | 10 | Numerical instabilities prevented |
| Precision Events | 14 | Cases where FP256 was crucial |
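The perplexity row is simply the exponential of the reported training loss:

```python
import math

loss = 7.003514766
print(math.exp(loss))  # ≈ 1100.5
```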