# bitnet-8bit
This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.
## Architecture Overview

### Input Processing
- Token Embeddings: 128,256 vocabulary size
- Position Embeddings: Up to 128 positions
- Hidden Dimensions: 1024-dimensional hidden states
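A minimal sketch of this input stage with the dimensions above (the module name and the exact way token and position embeddings are combined are assumptions, not the model's actual code):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the input stage; names and details are illustrative.
class BitNetEmbeddings(nn.Module):
    def __init__(self, vocab_size=128_256, max_positions=128, hidden_size=1024):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)        # 128,256-token vocabulary
        self.position_embeddings = nn.Embedding(max_positions, hidden_size)  # up to 128 positions

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.token_embeddings(input_ids) + self.position_embeddings(positions)
```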
### Transformer Layers (12 total)
Each layer contains:
- Layer normalization
- BitNet Attention: 8 heads, 64 dimensions per head
- Residual connections
- BitNet Feed-Forward Network: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN
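A minimal sketch of that per-layer layout in PyTorch. Pre-norm ordering, GELU activation, and a standard attention module are assumptions standing in for the actual BitNet modules, which additionally apply 8-bit quantization and RoPE:

```python
import torch.nn as nn

# Hypothetical sketch of one transformer block matching the layout above.
class BitNetBlock(nn.Module):
    def __init__(self, hidden_size=1024, num_heads=8, ffn_size=4096, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)  # 8 heads x 64 dims
        self.ffn_norm = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_size),   # 1024 -> 4096
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),   # 4096 -> 1024
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Layer norm -> attention -> dropout -> residual connection
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(attn_out)
        # Layer norm -> feed-forward -> dropout -> residual connection
        x = x + self.dropout(self.ffn(self.ffn_norm(x)))
        return x
```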
### Special Features
- 8-bit Quantization: Applied in the attention and FFN layers to reduce memory and compute cost
- Rotary Position Embeddings (RoPE): Used in attention with dimension 64
- Layer Skipping: Quadratic skip-probability schedule p_l = p_max × (l/L)², where l is the layer index and L the total number of layers (see the sketch after this list)
  - Maximum skip probability: p_max = 0.1
  - No explicit minimum number of active layers
- Early Exit: Can terminate at any layer if confidence > 95%
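The sketch below illustrates how these mechanisms could fit together, under stated assumptions: per-tensor absmax int8 quantization (the model's actual BitNet quantizer may differ), the quadratic skip schedule with p_max = 0.1 over 12 layers, and an early-exit test on the softmax confidence of the next-token prediction. All function names are illustrative, not taken from the model's code.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Per-tensor absmax 8-bit quantization (illustrative; the model's
    actual quantization scheme may differ)."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
    return q, scale  # dequantize with q.float() * scale

def skip_probability(layer_idx: int, num_layers: int = 12, p_max: float = 0.1) -> float:
    """Quadratic layer-skip schedule: p_l = p_max * (l / L)^2."""
    return p_max * (layer_idx / num_layers) ** 2

def should_exit_early(logits: torch.Tensor, threshold: float = 0.95) -> bool:
    """Early exit when the most likely next token has probability > threshold."""
    probs = torch.softmax(logits[:, -1, :], dim=-1)
    return bool(probs.max() > threshold)

# Deeper layers are skipped more often; layer 12 reaches the 0.1 maximum.
for l in range(1, 13):
    print(f"layer {l:2d}: skip prob = {skip_probability(l):.4f}")
```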
## Model Details

| Parameter | Value |
|---|---|
| Model Type | BitNet with Quantization |
| Vocabulary Size | 128,256 |
| Hidden Size | 1,024 |
| Number of Layers | 12 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| FFN Intermediate Size | 4,096 |
| Max Sequence Length | 128 |
| Quantization Bits | 8 |
| Dropout Rate | 0.1 |
## Training
- Dataset: FineWeb-EDU (sample-10BT subset)
- Training Framework: PyTorch with mixed precision (FP16)
- Optimization: Gradient checkpointing and streaming dataset implementation
- Hardware: see the repository for training hardware details
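A minimal sketch of that training setup, assuming the public HuggingFaceFW/fineweb-edu dataset (sample-10BT configuration) and standard PyTorch AMP utilities; the learning rate, batch handling, and stopping condition are illustrative, not the values actually used:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stream FineWeb-EDU (sample-10BT) so the dataset never has to fit on disk.
dataset = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                       split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit").cuda()
model.gradient_checkpointing_enable()            # trade compute for memory
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative LR
scaler = torch.cuda.amp.GradScaler()             # FP16 mixed precision

for step, example in enumerate(dataset):
    batch = tokenizer(example["text"], truncation=True, max_length=128,
                      return_tensors="pt").to("cuda")
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(**batch, labels=batch["input_ids"]).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    if step >= 1000:   # illustrative stopping point
        break
```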
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")

# Basic generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# With early exit (only if the inference code supports the custom
# early_exit_threshold argument; it is not part of the standard generate() API)
outputs = model.generate(
    **inputs,
    max_length=100,
    early_exit_threshold=0.95,  # exit when 95% confident
    use_cache=True,
)
```
## Performance Characteristics
- Memory Efficiency: 8-bit quantization reduces memory footprint significantly
- Adaptive Computation: Layer skipping and early exit reduce average computation
- Inference Speed: Variable depending on early exit and layer skipping activation
- Quality: Comparable to full-precision models of similar size despite quantization
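For a rough sense of the memory claim, here is a back-of-the-envelope estimate from the dimensions in the table above (ignoring layer norms, biases, and whether the embeddings themselves are quantized):

```python
# Rough parameter count from the Model Details table (norms and biases ignored).
vocab, hidden, ffn, layers, max_pos = 128_256, 1024, 4096, 12, 128

embeddings = vocab * hidden + max_pos * hidden     # token + position embeddings
attention  = 4 * hidden * hidden                   # Q, K, V, output projections
ffn_params = 2 * hidden * ffn                      # up- and down-projection
total      = embeddings + layers * (attention + ffn_params)

print(f"~{total / 1e6:.0f}M parameters")           # roughly 280M
print(f"FP32 weights:  ~{total * 4 / 1e9:.2f} GB") # ~1.1 GB
print(f"8-bit weights: ~{total * 1 / 1e9:.2f} GB") # ~0.28 GB (plus scales)
```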
## Limitations
- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation with custom architecture
- Early exit and layer skipping require compatible inference code
- Quantization may affect performance on certain tasks
## Citation
If you use this model, please cite:
```bibtex
@misc{bitnet2024,
  title={BitNet with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit}
}
```
## License
Apache 2.0 - This model can be used for commercial purposes.