---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet
- quantized
- 8bit
- layer-skip
- early-exit
- rope
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
---

# bitnet-8bit

This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.

## Architecture Overview

### Input Processing

- **Token Embeddings**: 128,256-entry vocabulary
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 1,024-dimensional hidden states

### Transformer Layers (12 total)

Each layer contains:

- Layer normalization
- **BitNet Attention**: 8 heads, 64 dimensions per head
- Residual connections
- **BitNet Feed-Forward Network**: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN

### Special Features

- **8-bit Quantization**: Applied in the attention and FFN layers for efficiency (see the sketch in the appendix at the end of this card)
- **Rotary Position Embeddings (RoPE)**: Used in attention with dimension 64
- **Layer Skipping**: Quadratic skip-probability schedule p_l = p_max × (l/L)²
  - Maximum skip probability: 0.1
  - No explicit minimum number of active layers
- **Early Exit**: Inference can terminate at any layer once prediction confidence exceeds 95% (see the sketch in the appendix)

## Model Details

| Parameter | Value |
|-----------|-------|
| Model Type | BitNet with Quantization |
| Vocabulary Size | 128,256 |
| Hidden Size | 1,024 |
| Number of Layers | 12 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| FFN Intermediate Size | 4,096 |
| Max Sequence Length | 128 |
| Quantization Bits | 8 |
| Dropout Rate | 0.1 |

## Training

- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Training Framework**: PyTorch with mixed precision (FP16)
- **Optimization**: Gradient checkpointing and a streaming dataset implementation
- **Hardware**: Training details available in the repository

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
# (trust_remote_code=True may be required if the custom architecture ships as remote code)
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")

# Basic generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# With early exit (custom keyword argument; requires compatible inference code)
outputs = model.generate(
    **inputs,
    max_length=100,
    early_exit_threshold=0.95,  # Exit when 95% confident
    use_cache=True
)
```

## Performance Characteristics

- **Memory Efficiency**: 8-bit quantization significantly reduces the memory footprint
- **Adaptive Computation**: Layer skipping and early exit reduce average computation per token
- **Inference Speed**: Varies with how often early exit and layer skipping are triggered
- **Quality**: Comparable to full-precision models of similar size despite quantization

## Limitations

- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation with a custom architecture
- Early exit and layer skipping require compatible inference code
- Quantization may affect performance on certain tasks

## Citation

If you use this model, please cite:

```bibtex
@misc{bitnet2024,
  title={BitNet with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit}
}
```

## License

Apache 2.0. This model can be used for commercial purposes.
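
## Appendix: Illustrative Sketches

### 8-bit Weight Quantization (Sketch)

The card does not spell out the exact quantization scheme, so the snippet below is a minimal sketch assuming simple symmetric per-tensor absmax int8 quantization of weights. The function names (`quantize_absmax_int8`, `dequantize_int8`) are illustrative assumptions, not part of this repository's API.

```python
# Sketch of symmetric absmax int8 weight quantization, one plausible way to
# realize "8-bit quantization" of attention and FFN weights.
# The checkpoint's actual scheme may differ; this is an assumption for exposition.
import torch


def quantize_absmax_int8(w: torch.Tensor):
    """Map float weights to int8 using a single per-tensor scale."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor for use in matmuls."""
    return q.to(torch.float32) * scale


w = torch.randn(1024, 4096)        # e.g. an FFN projection weight
q, scale = quantize_absmax_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())     # per-element error is bounded by about scale / 2
```

A per-channel scale (one scale per output row) is a common refinement of the same idea and usually reduces the reconstruction error further.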
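### Layer Skipping and Early Exit (Sketch)

The sketch below shows how the quadratic skip schedule p_l = p_max × (l/L)² and confidence-based early exit could be wired into a forward pass. It is not the model's actual implementation: `ToyAdaptiveStack`, `skip_probability`, the shared `lm_head` exit head, and the use of `nn.TransformerEncoderLayer` as a stand-in for the BitNet layer are all assumptions for exposition.

```python
# Minimal sketch (not this model's code) of quadratic layer skipping during
# training and confidence-based early exit during inference.
import torch
import torch.nn as nn


def skip_probability(layer_idx: int, num_layers: int, p_max: float = 0.1) -> float:
    """Quadratic schedule: deeper layers are skipped more often, capped at p_max."""
    return p_max * ((layer_idx + 1) / num_layers) ** 2


class ToyAdaptiveStack(nn.Module):
    def __init__(self, num_layers: int = 12, hidden: int = 1024, vocab: int = 128256,
                 p_max: float = 0.1, exit_threshold: float = 0.95):
        super().__init__()
        # Stand-in layers; the real model uses quantized BitNet attention/FFN blocks.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, nhead=8, dim_feedforward=4096,
                                       dropout=0.1, batch_first=True)
            for _ in range(num_layers)
        )
        # Assumes a single LM head shared by every exit point.
        self.lm_head = nn.Linear(hidden, vocab, bias=False)
        self.p_max = p_max
        self.exit_threshold = exit_threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        num_layers = len(self.layers)
        for i, layer in enumerate(self.layers):
            # Training: stochastically drop the layer with quadratic probability.
            if self.training and torch.rand(1).item() < skip_probability(i, num_layers, self.p_max):
                continue
            hidden_states = layer(hidden_states)

            # Inference: stop once the next-token prediction is confident enough
            # (batch size 1 assumed for simplicity).
            if not self.training:
                probs = self.lm_head(hidden_states[:, -1]).softmax(dim=-1)
                if probs.max() > self.exit_threshold:
                    break
        return self.lm_head(hidden_states)


model = ToyAdaptiveStack().eval()
logits = model(torch.randn(1, 16, 1024))  # (batch, seq, hidden) -> (batch, seq, vocab)
```

With p_max = 0.1 and L = 12, the schedule gives roughly 0.0007 for the first layer and 0.1 for the last, so shallow layers are almost always executed while deep layers carry most of the skipping.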