|
--- |
|
language: en |
|
license: apache-2.0 |
|
tags: |
|
- fp256 |
|
- ultra-precision |
|
- transformer |
|
- experimental |
|
- research |
|
datasets: |
|
- interstellarninja/hermes_reasoning_tool_use |
|
- NousResearch/Hermes-3-Dataset |
|
- Salesforce/wikitext |
|
library_name: transformers |
|
model-index:
- name: Gradia FP256 Series
  results:
  - task:
      type: text-generation
      name: Text Generation
    metrics:
    - type: perplexity
      value: ~1095.2
      name: Perplexity
    - type: loss
      value: 7.003514766
      name: Training Loss
|
--- |
|
|
|
# Gradia FP256 Model |
|
|
|
Gradia is an experimental high-precision transformer research project exploring **FP256 (256-bit floating point)** training for language models. This checkpoint is an early proof of concept for ultra-precision training.
|
|
|
## About the Project
|
|
|
**Gradia** aims to push the boundaries of numerical stability and gradient precision using extended floating-point formats, moving beyond the limits of mixed-precision and standard FP32 training. This checkpoint was trained entirely in **true FP256 precision**.
|
|
|
- **Precision**: Full 256-bit floating point (not mixed) |
|
- **Training Loss**: `7.003514766` |
|
- **Extreme Precision Events**: `14` |
|
- **Numerical Stability Saves**: `10` |
|
- **Gradient Stability Improvements**: `0` |
|
- **Training Stability Score**: `100, 100, 10...` |
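
To make the precision claim concrete, the sketch below emulates 256-bit arithmetic in software with `mpmath`; treating the custom FP256 format as a 256-bit-precision software float is an assumption, and this is not Gradia's actual training code. It shows a tiny weight update that FP32 rounds away but a 256-bit accumulator retains:

```python
# Illustrative sketch only: mpmath at 256 bits of working precision stands in
# for the custom FP256 format (an assumption, not Gradia's implementation).
import numpy as np
from mpmath import mp, mpf

mp.prec = 256  # 256-bit significand, roughly 77 decimal digits

weight_fp32 = np.float32(1.0)
weight_fp256 = mpf("1.0")
tiny_update = 1e-12  # far below float32's ~7 significant decimal digits

for _ in range(1000):
    weight_fp32 += np.float32(tiny_update)  # rounds back to 1.0 every step
    weight_fp256 += mpf(tiny_update)        # retained at 256-bit precision

print(weight_fp32)                # 1.0 -- the updates were silently lost
print(mp.nstr(weight_fp256, 20))  # ~1.000000001 -- the updates survived
```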
|
|
|
## Model Architecture
|
|
|
- **Type**: Custom FP256 Transformer |
|
- **Parameters**: ~937,500 (estimated from 30MB FP256 checkpoint) |
|
- **Vocab Size**: 1,000 |
|
- **Hidden Size**: 256 |
|
- **Layers**: 4 |
|
- **Attention Heads**: 8 |
|
- **Intermediate Size**: 1,024 |
|
- **Max Sequence Length**: 128 |
|
- **Dropout**: 0.1 |
|
- **Model Size**: 30MB per checkpoint |
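
For reference, the hyperparameters above can be collected into a small config object; `GradiaFP256Config` below is an illustrative name, not an API exposed by this repository:

```python
from dataclasses import dataclass

@dataclass
class GradiaFP256Config:
    """Mirror of the hyperparameters listed above (illustrative, not the real API)."""
    vocab_size: int = 1_000
    hidden_size: int = 256
    num_layers: int = 4
    num_attention_heads: int = 8
    intermediate_size: int = 1_024
    max_seq_len: int = 128
    dropout: float = 0.1
    precision_bits: int = 256  # full FP256 end to end, no mixed precision

config = GradiaFP256Config()
assert config.hidden_size % config.num_attention_heads == 0  # 32-dim heads
```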
|
|
|
## Training Details
|
|
|
- **Datasets**: |
|
- [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use) |
|
- [NousResearch/Hermes-3-Dataset](https://huggingface.co/datasets/NousResearch/Hermes-3-Dataset) |
|
- [Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext) |
|
- **Training Steps**: 10 |
|
- **FP256 Optimizer**: Custom implementation |
|
- **Precision Benefits**: |
|
- 10 numerical stability interventions prevented training instabilities |
|
- 14 extreme precision events where FP256 was crucial |
|
- Perfect training stability scores maintained |
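
All three corpora are public on the Hugging Face Hub and can be pulled with the `datasets` library. The splits and the wikitext config below are assumptions, since the exact data mix behind this 10-step run is not documented here:

```python
from datasets import load_dataset

# Splits and the wikitext config are assumptions; the exact mix used for the
# 10-step Gradia run is not documented in this card.
tool_use = load_dataset("interstellarninja/hermes_reasoning_tool_use", split="train")
hermes3 = load_dataset("NousResearch/Hermes-3-Dataset", split="train")
wikitext = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split="train")

print(len(tool_use), len(hermes3), len(wikitext))
```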
|
|
|
## Checkpoint Contents
|
|
|
This checkpoint contains the complete FP256 state, including:
|
- `embedding.weight` - Token embeddings |
|
- `pos_embedding.weight` - Positional embeddings |
|
- `transformer_blocks.{0-3}` - 4 transformer layers with: |
|
- Multi-head attention (q_proj, k_proj, v_proj, out_proj) |
|
- Feed-forward networks (dense1, dense2) |
|
- Layer normalizations (ln1, ln2) |
|
- `ln_final` - Final layer normalization |
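
The expected parameter names can be enumerated from the layout above. The sketch below is structural only: the exact module paths, bias terms, and on-disk serialization format are assumptions, and no checkpoint file is actually loaded:

```python
# Structural sketch of the state dict described above; module paths and bias
# terms are assumptions, and nothing is read from disk.
LAYERS = 4

expected_keys = ["embedding.weight", "pos_embedding.weight"]
for i in range(LAYERS):
    block = f"transformer_blocks.{i}"
    expected_keys += [f"{block}.{name}.weight" for name in
                      ("q_proj", "k_proj", "v_proj", "out_proj",  # attention
                       "dense1", "dense2",                        # feed-forward
                       "ln1", "ln2")]                             # layer norms
expected_keys.append("ln_final.weight")

for key in expected_keys:
    print(key)
```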
|
|
|
## FP256 Implementation Notes
|
|
|
- **Storage**: Each parameter stored as 256-bit floating point (32 bytes) |
|
- **Precision**: ~77 decimal digits of precision |
|
- **Memory Overhead**: ~16x larger than FP16 equivalent |
|
- **Numerical Stability**: Demonstrated prevention of gradient underflow/overflow |
|
- **Training Stability**: Maintained perfect stability scores throughout training |
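
A quick back-of-the-envelope check of the figures above; the digit count assumes all 256 bits act as significand, as in a typical software arbitrary-precision float:

```python
import math

params = 937_500         # estimate quoted in the architecture section
fp256_bytes = 256 // 8   # 32 bytes per parameter
fp16_bytes = 16 // 8     # 2 bytes per parameter

print(f"checkpoint size  ~ {params * fp256_bytes / 1e6:.1f} MB")  # ~30.0 MB
print(f"overhead vs FP16 ~ {fp256_bytes / fp16_bytes:.0f}x")      # 16x
print(f"decimal digits   ~ {256 * math.log10(2):.1f}")            # ~77.1
```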
|
|
|
## Status
|
|
|
> ⚠️ This is a **research-stage model** and is **not production-ready**. Due to the use of FP256, inference and deployment require specialized FP256-compatible hardware and software frameworks.
|
|
|
## Future Work
|
|
|
- Scale to larger parameter counts (1M+ parameters) |
|
- Comparative analysis of FP256 vs FP32/FP16 convergence behavior |
|
- Open-source FP256 training framework |
|
- Extended training runs to evaluate long-term stability benefits |
|
|
|
## Technical Requirements
|
|
|
- **Inference**: Requires FP256-compatible runtime |
|
- **Hardware**: Specialized extended-precision arithmetic units recommended |
|
- **Memory**: ~16x the memory footprint of an equivalent FP16 model
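
Since no mainstream accelerator exposes FP256 natively, inference in practice means an arbitrary-precision software runtime. The sketch below uses `mpmath` purely as an example of such a runtime; Gradia's actual FP256 stack is not specified here:

```python
from mpmath import mpf, workprec

def fp256_matvec(matrix, vector):
    """Dense mat-vec with every multiply-accumulate done at 256-bit precision."""
    with workprec(256):  # scope the extended precision to this computation
        return [sum(mpf(a) * mpf(b) for a, b in zip(row, vector)) for row in matrix]

print(fp256_matvec([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.25]))  # [mpf('1.0'), mpf('2.5')]
```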
|
|
|
## Citation
|
|
|
If you use Gradia in your research, please cite: |
|
|
|
```bibtex |
|
@misc{gradia2025,
  title={Gradia: Ultra-Precision Language Models with FP256 Training},
  author={Entelijans, GLCTC Corp},
  year={2025},
  note={Experimental FP256 transformer implementation},
  url={https://huggingface.co/ENTELIJANS}
}
|
``` |
|
|
|
## Performance Metrics
|
|
|
| Metric | Value | Notes |
|--------|-------|-------|
| Training Loss | 7.003514766 | Step 10 (best checkpoint) |
| Perplexity | ~1095.2 | exp(loss) |
| Model Size | 30MB | FP256 precision |
| Parameters | ~937K | Estimated from checkpoint size |
| Stability Events | 10 | Numerical instabilities prevented |
| Precision Events | 14 | Cases where FP256 was crucial |