---
language:
- en
license: apache-2.0
tags:
- pytorch
- causal-lm
- bitnet
- quantized
- 8bit
- layer-skip
- early-exit
- rope
- safetensors
- fineweb-edu
datasets:
- HuggingFaceFW/fineweb-edu
---

# bitnet-8bit

This is a BitNet model with 8-bit quantization, layer skipping, and early exit capabilities, trained on the FineWeb-EDU dataset.

## Architecture Overview

### Input Processing

- **Token Embeddings**: 128,256-entry vocabulary
- **Position Embeddings**: Up to 128 positions
- **Hidden Dimensions**: 1,024-dimensional hidden states

### Transformer Layers (12 total)

Each layer contains:

- Layer normalization
- **BitNet Attention**: 8 heads, 64 dimensions per head
- Residual connections
- **BitNet Feed-Forward Network**: 1024 → 4096 → 1024
- Dropout (0.1) after attention and FFN

### Special Features

- **8-bit Quantization**: Applied in the attention and FFN layers for efficiency (see the sketch in the appendix at the end of this card)
- **Rotary Position Embeddings (RoPE)**: Used in attention with dimension 64
- **Layer Skipping**: Quadratic skip-probability schedule p_l = p_max × (l/L)²
  - Maximum skip probability: 0.1
  - No explicit minimum number of active layers
- **Early Exit**: Inference can terminate at any layer once prediction confidence exceeds 95% (see the sketch in the appendix)

## Model Details

| Parameter | Value |
|-----------|-------|
| Model Type | BitNet with Quantization |
| Vocabulary Size | 128,256 |
| Hidden Size | 1,024 |
| Number of Layers | 12 |
| Attention Heads | 8 |
| Head Dimension | 64 |
| FFN Intermediate Size | 4,096 |
| Max Sequence Length | 128 |
| Quantization Bits | 8 |
| Dropout Rate | 0.1 |

## Training

- **Dataset**: FineWeb-EDU (sample-10BT subset)
- **Training Framework**: PyTorch with mixed precision (FP16)
- **Optimization**: Gradient checkpointing and a streaming dataset implementation
- **Hardware**: Training details available in the repository

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
# (trust_remote_code=True may be required if the custom architecture ships as remote code)
model = AutoModelForCausalLM.from_pretrained("bitnet-8bit")
tokenizer = AutoTokenizer.from_pretrained("bitnet-8bit")

# Basic generation
inputs = tokenizer("The key to understanding BitNet is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# With early exit (custom keyword argument; requires compatible inference code)
outputs = model.generate(
    **inputs,
    max_length=100,
    early_exit_threshold=0.95,  # Exit when 95% confident
    use_cache=True
)
```

## Performance Characteristics

- **Memory Efficiency**: 8-bit quantization significantly reduces the memory footprint
- **Adaptive Computation**: Layer skipping and early exit reduce average computation per token
- **Inference Speed**: Varies with how often early exit and layer skipping are triggered
- **Quality**: Comparable to full-precision models of similar size despite quantization

## Limitations

- Maximum sequence length is limited to 128 tokens
- This is an experimental BitNet implementation with a custom architecture
- Early exit and layer skipping require compatible inference code
- Quantization may affect performance on certain tasks

## Citation

If you use this model, please cite:

```bibtex
@misc{bitnet2024,
  title={BitNet with Layer Skipping and Early Exit},
  author={Your Name},
  year={2024},
  url={https://huggingface.co/bitnet-8bit}
}
```

## License

Apache 2.0. This model can be used for commercial purposes.
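
## Appendix: Illustrative Sketches

### 8-bit Weight Quantization (Sketch)

The card does not spell out the exact quantization scheme, so the snippet below is a minimal sketch assuming simple symmetric per-tensor absmax int8 quantization of weights. The function names (`quantize_absmax_int8`, `dequantize_int8`) are illustrative assumptions, not part of this repository's API.

```python
# Sketch of symmetric absmax int8 weight quantization, one plausible way to
# realize "8-bit quantization" of attention and FFN weights.
# The checkpoint's actual scheme may differ; this is an assumption for exposition.
import torch


def quantize_absmax_int8(w: torch.Tensor):
    """Map float weights to int8 using a single per-tensor scale."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale


def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor for use in matmuls."""
    return q.to(torch.float32) * scale


w = torch.randn(1024, 4096)        # e.g. an FFN projection weight
q, scale = quantize_absmax_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().max())     # per-element error is bounded by about scale / 2
```

A per-channel scale (one scale per output row) is a common refinement of the same idea and usually reduces the reconstruction error further.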
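### Layer Skipping and Early Exit (Sketch)

The sketch below shows how the quadratic skip schedule p_l = p_max × (l/L)² and confidence-based early exit could be wired into a forward pass. It is not the model's actual implementation: `ToyAdaptiveStack`, `skip_probability`, the shared `lm_head` exit head, and the use of `nn.TransformerEncoderLayer` as a stand-in for the BitNet layer are all assumptions for exposition.

```python
# Minimal sketch (not this model's code) of quadratic layer skipping during
# training and confidence-based early exit during inference.
import torch
import torch.nn as nn


def skip_probability(layer_idx: int, num_layers: int, p_max: float = 0.1) -> float:
    """Quadratic schedule: deeper layers are skipped more often, capped at p_max."""
    return p_max * ((layer_idx + 1) / num_layers) ** 2


class ToyAdaptiveStack(nn.Module):
    def __init__(self, num_layers: int = 12, hidden: int = 1024, vocab: int = 128256,
                 p_max: float = 0.1, exit_threshold: float = 0.95):
        super().__init__()
        # Stand-in layers; the real model uses quantized BitNet attention/FFN blocks.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(hidden, nhead=8, dim_feedforward=4096,
                                       dropout=0.1, batch_first=True)
            for _ in range(num_layers)
        )
        # Assumes a single LM head shared by every exit point.
        self.lm_head = nn.Linear(hidden, vocab, bias=False)
        self.p_max = p_max
        self.exit_threshold = exit_threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        num_layers = len(self.layers)
        for i, layer in enumerate(self.layers):
            # Training: stochastically drop the layer with quadratic probability.
            if self.training and torch.rand(1).item() < skip_probability(i, num_layers, self.p_max):
                continue
            hidden_states = layer(hidden_states)

            # Inference: stop once the next-token prediction is confident enough
            # (batch size 1 assumed for simplicity).
            if not self.training:
                probs = self.lm_head(hidden_states[:, -1]).softmax(dim=-1)
                if probs.max() > self.exit_threshold:
                    break
        return self.lm_head(hidden_states)


model = ToyAdaptiveStack().eval()
logits = model(torch.randn(1, 16, 1024))  # (batch, seq, hidden) -> (batch, seq, vocab)
```

With p_max = 0.1 and L = 12, the schedule gives roughly 0.0007 for the first layer and 0.1 for the last, so shallow layers are almost always executed while deep layers carry most of the skipping.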