# bitnet-8bit
This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.
## Architecture Overview

### Input Processing
- Token Embeddings: 128,256 vocabulary size
- Position Embeddings: Up to 128 positions
- Hidden Dimensions: 1024-dimensional hidden states
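A minimal sketch of this input stage with the dimensions above (the module name and the exact way token and position embeddings are combined are assumptions, not the model's actual code):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the input stage; names and details are illustrative.
class BitNetEmbeddings(nn.Module):
    def __init__(self, vocab_size=128_256, max_positions=128, hidden_size=1024):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)        # 128,256-token vocabulary
        self.position_embeddings = nn.Embedding(max_positions, hidden_size)  # up to 128 positions

    def forward(self, input_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return self.token_embeddings(input_ids) + self.position_embeddings(positions)
```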
### Transformer Layers (12 total)
Each layer contains:
- Layer normalization
- BitNet Attention: 8 heads, 64 dimensions per head
- Residual connections
- BitNet Feed-Forward Network: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN
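A minimal sketch of that per-layer layout in PyTorch. Pre-norm ordering, GELU activation, and a standard attention module are assumptions standing in for the actual BitNet modules, which additionally apply 8-bit quantization and RoPE:

```python
import torch.nn as nn

# Hypothetical sketch of one transformer block matching the layout above.
class BitNetBlock(nn.Module):
    def __init__(self, hidden_size=1024, num_heads=8, ffn_size=4096, dropout=0.1):
        super().__init__()
        self.attn_norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)  # 8 heads x 64 dims
        self.ffn_norm = nn.LayerNorm(hidden_size)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_size),   # 1024 -> 4096
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size),   # 4096 -> 1024
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Layer norm -> attention -> dropout -> residual connection
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(attn_out)
        # Layer norm -> feed-forward -> dropout -> residual connection
        x = x + self.dropout(self.ffn(self.ffn_norm(x)))
        return x
```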
### Special Features
- 8-bit Quantization: Applied in the attention and FFN layers to reduce memory and compute cost
- Rotary Position Embeddings (RoPE): Used in attention with dimension 64
- Layer Skipping: Quadratic skip-probability schedule p_l = p_max × (l/L)², where l is the layer index and L the total number of layers (see the sketch after this list)
  - Maximum skip probability: p_max = 0.1
  - No explicit minimum number of active layers
- Early Exit: Can terminate at any layer if confidence > 95%
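The sketch below illustrates how these mechanisms could fit together, under stated assumptions: per-tensor absmax int8 quantization (the model's actual BitNet quantizer may differ), the quadratic skip schedule with p_max = 0.1 over 12 layers, and an early-exit test on the softmax confidence of the next-token prediction. All function names are illustrative, not taken from the model's code.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Per-tensor absmax 8-bit quantization (illustrative; the model's
    actual quantization scheme may differ)."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
    return q, scale  # dequantize with q.float() * scale

def skip_probability(layer_idx: int, num_layers: int = 12, p_max: float = 0.1) -> float:
    """Quadratic layer-skip schedule: p_l = p_max * (l / L)^2."""
    return p_max * (layer_idx / num_layers) ** 2

def should_exit_early(logits: torch.Tensor, threshold: float = 0.95) -> bool:
    """Early exit when the most likely next token has probability > threshold."""
    probs = torch.softmax(logits[:, -1, :], dim=-1)
    return bool(probs.max() > threshold)

# Deeper layers are skipped more often; layer 12 reaches the 0.1 maximum.
for l in range(1, 13):
    print(f"layer {l:2d}: skip prob = {skip_probability(l):.4f}")
```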
## Model Details

| Parameter | Value |
|---|---|
| Model Type | BitNet with Quantization |
| Vocabulary Size | 128,256 |
| Hidden Size | 1,024 |
| Number of Layers | 12 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| FFN Intermediate Size | 4,096 |
| Max Sequence Length | 128 |
| Quantization Bits | 8 |
| Dropout Rate | 0.1 |
## Training
- Dataset: FineWeb-EDU (sample-10BT subset)
- Training Framework: PyTorch with mixed precision (FP16)
- Optimization: Gradient checkpointing and streaming dataset implementation
- Hardware: see the repository for training hardware details
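A minimal sketch of that training setup, assuming the public HuggingFaceFW/fineweb-edu dataset (sample-10BT configuration) and standard PyTorch AMP utilities; the learning rate, batch handling, and stopping condition are illustrative, not the values actually used:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stream FineWeb-EDU (sample-10BT) so the dataset never has to fit on disk.
dataset = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                       split="train", streaming=True)

tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit").cuda()
model.gradient_checkpointing_enable()            # trade compute for memory
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # illustrative LR
scaler = torch.cuda.amp.GradScaler()             # FP16 mixed precision

for step, example in enumerate(dataset):
    batch = tokenizer(example["text"], truncation=True, max_length=128,
                      return_tensors="pt").to("cuda")
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(**batch, labels=batch["input_ids"]).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    if step >= 1000:   # illustrative stopping point
        break
```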
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")

# Basic generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# With early exit (only if the inference code supports the custom
# early_exit_threshold argument; it is not part of the standard generate() API)
outputs = model.generate(
    **inputs,
    max_length=100,
    early_exit_threshold=0.95,  # exit when 95% confident
    use_cache=True,
)
```
## Performance Characteristics
- Memory Efficiency: 8-bit quantization reduces memory footprint significantly
- Adaptive Computation: Layer skipping and early exit reduce average computation
- Inference Speed: Variable depending on early exit and layer skipping activation
- Quality: Comparable to full-precision models of similar size despite quantization
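For a rough sense of the memory claim, here is a back-of-the-envelope estimate from the dimensions in the table above (ignoring layer norms, biases, and whether the embeddings themselves are quantized):

```python
# Rough parameter count from the Model Details table (norms and biases ignored).
vocab, hidden, ffn, layers, max_pos = 128_256, 1024, 4096, 12, 128

embeddings = vocab * hidden + max_pos * hidden     # token + position embeddings
attention  = 4 * hidden * hidden                   # Q, K, V, output projections
ffn_params = 2 * hidden * ffn                      # up- and down-projection
total      = embeddings + layers * (attention + ffn_params)

print(f"~{total / 1e6:.0f}M parameters")           # roughly 280M
print(f"FP32 weights:  ~{total * 4 / 1e9:.2f} GB") # ~1.1 GB
print(f"8-bit weights: ~{total * 1 / 1e9:.2f} GB") # ~0.28 GB (plus scales)
```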
## Limitations
- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation with custom architecture
- Early exit and layer skipping require compatible inference code
- Quantization may affect performance on certain tasks
## Citation
If you use this model, please cite:
```bibtex
@misc{bitnet2024,
  title={BitNet with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit}
}
```
## License
Apache 2.0 - This model can be used for commercial purposes.