bitnet-8bit

This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.

Architecture Overview

Input Processing

  • Token Embeddings: 128,256 vocabulary size
  • Position Embeddings: Up to 128 positions
  • Hidden Dimensions: 1024-dimensional hidden states
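
These pieces combine into the model's input representation. A minimal sketch, assuming the position embeddings listed above are learned and added to the token embeddings; the module names are illustrative, not the model's actual implementation:

import torch
import torch.nn as nn

# Hypothetical input stack: token + position embeddings summed into hidden states.
vocab_size, max_positions, hidden_size = 128_256, 128, 1024
token_emb = nn.Embedding(vocab_size, hidden_size)
pos_emb = nn.Embedding(max_positions, hidden_size)

input_ids = torch.randint(0, vocab_size, (1, 16))          # (batch, seq_len)
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # (1, seq_len)
hidden_states = token_emb(input_ids) + pos_emb(positions)  # (1, 16, 1024)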

Transformer Layers (12 total)

Each layer contains (a rough sketch follows this list):

  • Layer normalization
  • BitNet Attention: 8 heads, 64 dimensions per head
  • Residual connections
  • BitNet Feed-Forward Network: 1024 → 4096 → 1024
  • Dropout (0.1) after attention and FFN
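
A minimal sketch of one layer's forward pass, assuming a pre-norm arrangement and substituting standard PyTorch attention/FFN modules for the quantized BitNet kernels (the class name and the GELU activation are assumptions):

import torch.nn as nn

class SketchLayer(nn.Module):
    # Stand-in for one BitNet layer; the real model uses 8-bit quantized
    # attention and FFN projections with RoPE instead of these modules.
    def __init__(self, hidden=1024, heads=8, ffn=4096, p_drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.drop(attn_out)                   # residual + dropout after attention
        x = x + self.drop(self.ffn(self.norm2(x)))    # residual + dropout after FFN
        return x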

Special Features

  • 8-bit Quantization: Applied in attention and FFN layers for extreme efficiency
  • Rotary Position Embeddings (RoPE): Used in attention with dimension 64
  • Layer Skipping: skip probability grows quadratically with depth, p_l = p_max × (l/L)² (see the sketch after this list)
    • Maximum skip probability: 0.1
    • No explicit minimum active layers
  • Early Exit: Can terminate at any layer if confidence > 95%
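
A minimal sketch of both mechanisms, assuming layers are indexed 1..L and that "confidence" means the softmax probability of the top next-token prediction; the function names and the per-layer prediction head are hypothetical:

import torch

def skip_probability(layer_idx: int, num_layers: int = 12, p_max: float = 0.1) -> float:
    # Quadratic schedule from above: p_l = p_max * (l / L)^2,
    # so deeper layers are skipped more often than earlier ones.
    return p_max * (layer_idx / num_layers) ** 2

print([round(skip_probability(l), 4) for l in range(1, 13)])
# first layer ~0.0007, last layer 0.1

def should_exit_early(logits: torch.Tensor, threshold: float = 0.95) -> bool:
    # Exit once the most likely next token exceeds the confidence threshold.
    confidence = torch.softmax(logits, dim=-1).max().item()
    return confidence > threshold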

Model Details

Parameter                Value
Model Type               BitNet with Quantization
Vocabulary Size          128,256
Hidden Size              1,024
Number of Layers         12
Attention Heads          8
Head Dimension           64
FFN Intermediate Size    4,096
Max Sequence Length      128
Quantization Bits        8
Dropout Rate             0.1

Training

  • Dataset: FineWeb-EDU (sample-10BT subset)
  • Training Framework: PyTorch with mixed precision (FP16)
  • Optimization: Gradient checkpointing and a streaming dataset implementation (sketched below)
  • Hardware: training details are available in the repository
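
A minimal sketch of the training ingredients named above (streaming data, FP16, gradient checkpointing). The dataset path follows the public HuggingFaceFW release; the snippet is illustrative, not the actual training script:

import torch
from datasets import load_dataset

# Stream the sample-10BT subset instead of downloading the full corpus.
stream = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                      split="train", streaming=True)

scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())  # FP16 loss scaling
# model.gradient_checkpointing_enable()   # trades recompute for activation memory

for example in stream.take(2):
    text = example["text"]
    # tokenize, batch, and run forward/backward under torch.autocast("cuda", dtype=torch.float16)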

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer (the custom architecture may require trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Ram07/bitnet-8bit")
tokenizer = AutoTokenizer.from_pretrained("Ram07/bitnet-8bit")

# Basic generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# With early exit (if supported in inference)
outputs = model.generate(
    **inputs, 
    max_length=100,
    early_exit_threshold=0.95,  # Exit when 95% confident
    use_cache=True
)
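
Because the maximum sequence length is 128 tokens (see Limitations), longer prompts should be truncated at tokenization time:

inputs = tokenizer("A very long prompt ...", return_tensors="pt",
                   truncation=True, max_length=128)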

Performance Characteristics

  • Memory Efficiency: 8-bit weights cut the memory footprint roughly in half versus FP16 (see the estimate after this list)
  • Adaptive Computation: Layer skipping and early exit reduce average computation
  • Inference Speed: Variable depending on early exit and layer skipping activation
  • Quality: Comparable to full-precision models of similar size despite quantization
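
A back-of-envelope estimate of weight memory from the Model Details table, ignoring layer norms, biases, and quantization scales, and assuming the output head is tied to the token embeddings:

vocab, hidden, layers, ffn, max_pos = 128_256, 1024, 12, 4096, 128

embeddings = vocab * hidden + max_pos * hidden        # ~131.5M parameters
per_layer = 4 * hidden * hidden + 2 * hidden * ffn    # attention Q/K/V/O + FFN, ~12.6M
total = embeddings + layers * per_layer               # ~282.5M parameters

for name, bytes_per_param in [("int8", 1), ("fp16", 2), ("fp32", 4)]:
    print(f"{name}: ~{total * bytes_per_param / 1e9:.2f} GB")
# int8: ~0.28 GB, fp16: ~0.56 GB, fp32: ~1.13 GB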

Limitations

  • Maximum sequence length is limited to 128 tokens
  • This is an experimental BitNet implementation with a custom architecture
  • Early exit and layer skipping require compatible inference code
  • Quantization may affect performance on certain tasks

Citation

If you use this model, please cite:

@misc{bitnet2024,
  title={BitNet with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/Ram07/bitnet-8bit}
}

License

Apache 2.0 - This model can be used for commercial purposes.
