Add support for bitnet.cpp
#10 opened by Xenobd
Adding Support for Phi-4-mini-flash-reasoning in bitnet.cpp
Overview
The Microsoft Phi-4-mini-flash-reasoning model is a 3.8B parameter transformer-based model optimized for mathematical reasoning, featuring the SambaY architecture with Gated Memory Units (GMU), Mamba (State Space Model), Sliding Window Attention (SWA), and a full attention layer. To integrate it into bitnet.cpp, a framework optimized for 1.58-bit LLMs, you need to:
Download and convert the model to GGUF format.
Extend bitnet.cpp to support the SambaY architecture.
Optimize inference for CPU performance.
Test and validate the implementation.
This guide provides step-by-step instructions to achieve this, ensuring compatibility with bitnet.cpp's efficient CPU inference.
Prerequisites
Hardware: A machine with at least 20GB of RAM (for CPU inference of the 3.8B model at ~10 tokens/s).
Software:
Python 3.9 or later.
Git, CMake, and Visual Studio 2022 (for Windows) or clang (for Linux/macOS).
Conda environment (recommended).
Hugging Face CLI (huggingface-cli).
Required Python packages: torch==2.5.1, transformers==4.51.3, accelerate==1.3.0, flash_attn==2.7.4.post1.
bitnet.cpp Repository: Clone from GitHub.
Phi-4-mini-flash-reasoning Model: Available on Hugging Face.
Step 1: Set Up the Environment
Create a conda environment and install dependencies for bitnet.cpp and model conversion.
```bash
# Create a new conda environment
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp

# Clone bitnet.cpp (provides the requirements.txt used below)
git clone --recursive https://github.com/microsoft/BitNet.git bitnet.cpp

# Install requirements
pip install -r bitnet.cpp/requirements.txt
pip install torch==2.5.1 transformers==4.51.3 accelerate==1.3.0 flash_attn==2.7.4.post1

# Pinned transformers commit (overrides the release installed above)
pip install git+https://github.com/huggingface/transformers.git@096f25ae1f501a084d8ff2dcaf25fbc2bd60eba4
```
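A quick sanity check that the environment imports cleanly (the flash_attn import is expected to fail on CPU-only machines, which is fine for this workflow):

```python
import torch
import transformers
import accelerate

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError:
    print("flash_attn not importable (acceptable for CPU-only inference)")
```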
Step 2: Download and Convert Phi-4-mini-flash-reasoning to GGUF
The Phi-4-mini-flash-reasoning model must be converted to GGUF format, as bitnet.cpp uses this for efficient CPU inference. Since the model is not natively 1.58-bit quantized, you’ll need to quantize it to a compatible format (e.g., Q8_0 or Q4_K_M for testing).
Download the Model
Use the Hugging Face CLI to download the model weights.
```bash
huggingface-cli download microsoft/Phi-4-mini-flash-reasoning --local-dir ./models/phi-4-mini-flash-reasoning
```
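If you prefer to script the download, the same thing can be done from Python with huggingface_hub (already installed as a transformers dependency):

```python
from huggingface_hub import snapshot_download

# Download the full model repository into the local directory used below
snapshot_download(
    repo_id="microsoft/Phi-4-mini-flash-reasoning",
    local_dir="./models/phi-4-mini-flash-reasoning",
)
```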
Convert to GGUF
Use the convert-hf-to-gguf.py script from the llama.cpp repository (bitnet.cpp is based on llama.cpp) to convert the model. Because Phi-4-mini-flash-reasoning uses the hybrid SambaY architecture, make sure the conversion handles its components (Mamba, SWA, and GMU); a stock conversion script may not recognize them without modification.
```bash
# Clone llama.cpp for conversion tools
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Convert the model to GGUF
python convert-hf-to-gguf.py ../models/phi-4-mini-flash-reasoning --outfile ../models/phi-4-mini-flash-reasoning-Q8_0.gguf --outtype q8_0
```
Note: The q8_0 quantization is used initially for compatibility. For 1.58-bit quantization (W1.58A8, as in BitNet b1.58), you’ll need to adapt the quantization process (see Step 3).
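Before moving on, it can help to inspect the resulting GGUF file and confirm the metadata and tensor list look sane. A minimal sketch using the gguf Python package (pip install gguf; adjust the path to wherever the file was written):

```python
from gguf import GGUFReader

reader = GGUFReader("../models/phi-4-mini-flash-reasoning-Q8_0.gguf")

# Print the key/value metadata fields stored in the GGUF header
for name in reader.fields:
    print("field:", name)

# Print tensor names, shapes, and quantization types
for tensor in reader.tensors:
    print("tensor:", tensor.name, list(tensor.shape), tensor.tensor_type)
```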
Step 3: Extend bitnet.cpp for SambaY Architecture
The bitnet.cpp framework is optimized for BitNet b1.58 models with ternary weights {-1, 0, +1} and 8-bit activations. Phi-4-mini-flash-reasoning’s SambaY architecture includes:
Mamba (SSM): State Space Model for long-range dependencies.
Sliding Window Attention (SWA): Efficient attention for local context.
Gated Memory Unit (GMU): Shares memory states across layers.
Full Attention Layer: For global context.
To support this, modify bitnet.cpp to handle these components.
Update Model Architecture Handling
Locate Model Loading Code:
In bitnet.cpp, model loading is handled by the code inherited from llama.cpp. Modify the model parser there to recognize SambaY's layer types.
Add support for GMU layers by defining a new layer type in the model struct (e.g., ggml_tensor for GMU weights).
Implement Mamba and GMU Kernels:
Add custom kernels for Mamba (SSM) operations, which differ from standard transformer attention. Refer to the Mamba implementation in Hugging Face’s transformers.
For GMU, implement element-wise multiplication for memory state sharing, replacing costly cross-attention operations. The GMU operation is approximately output = input * memory_state with learnable gates.
Example pseudocode for GMU:

```cpp
// In a bitnet.cpp (ggml) kernel: gated element-wise memory sharing
static ggml_tensor * ggml_gmu_layer(ggml_context * ctx,
                                    ggml_tensor * input,
                                    ggml_tensor * memory_state,
                                    ggml_tensor * gate) {
    ggml_tensor * gated_output = ggml_mul(ctx, input, memory_state); // element-wise multiply with the shared memory state
    ggml_tensor * output = ggml_mul(ctx, gated_output, gate);        // apply the learnable gate
    return output;
}
```
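To validate such a kernel, a tiny PyTorch reference of the same gated element-wise operation can be compared against the kernel's output (toy tensor shapes chosen only for illustration):

```python
import torch

def gmu_reference(x: torch.Tensor, memory_state: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    # Same computation as the kernel sketch: multiply by the shared memory
    # state, then apply the learnable gate, all element-wise.
    return x * memory_state * gate

x = torch.randn(2, 16, 64)                    # (batch, seq, hidden)
memory_state = torch.randn(2, 16, 64)         # state shared from an earlier layer
gate = torch.sigmoid(torch.randn(2, 16, 64))  # gate values squashed into (0, 1)
print(gmu_reference(x, memory_state, gate).shape)  # torch.Size([2, 16, 64])
```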
Support SWA and Full Attention:
bitnet.cpp already supports attention mechanisms via llama.cpp. For SWA, set the window size to 256 tokens (as specified for Phi-4-mini-flash-reasoning).
Ensure flash attention is disabled for non-GPU environments (e.g., use attn_implementation="eager" for CPU).
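For the SWA path, the key detail is the causal local mask. A small sketch of how a sliding-window mask can be built for testing (window size 256 per the model description above; a shorter window is printed here for readability):

```python
import torch

def sliding_window_mask(seq_len: int, window: int = 256) -> torch.Tensor:
    # Position i may attend to positions max(0, i - window + 1) .. i:
    # causal (no future tokens) and limited to the local window.
    idx = torch.arange(seq_len)
    rel = idx[:, None] - idx[None, :]   # i - j
    return (rel >= 0) & (rel < window)  # True where attention is allowed

print(sliding_window_mask(seq_len=8, window=4).int())
```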
Quantization to 1.58-bit:
Phi-4-mini-flash-reasoning is not natively 1.58-bit. Use absmean quantization for the weights (ternary {-1, 0, +1}) and absmax quantization for the 8-bit activations (a reference sketch for the activation side appears at the end of this step).
Modify convert-helper-bitnet.py to support Phi-4’s architecture:
```python
# In convert-helper-bitnet.py
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./models/phi-4-mini-flash-reasoning", torch_dtype=torch.bfloat16
)

def quantize_to_1_58bit(tensor: torch.Tensor) -> torch.Tensor:
    # Absmean-style quantization to ternary values {-1, 0, +1}
    mean = tensor.abs().mean()
    return torch.where(
        tensor > mean,
        torch.ones_like(tensor),
        torch.where(tensor < -mean, -torch.ones_like(tensor), torch.zeros_like(tensor)),
    )

for name, param in model.named_parameters():
    if "weight" in name:
        param.data = quantize_to_1_58bit(param.data)

# Save as a Hugging Face model directory so convert-hf-to-gguf.py can read it
model.save_pretrained("./models/phi-4-mini-flash-reasoning-1.58bit")
```
Convert the quantized model directory to GGUF:

```bash
python convert-hf-to-gguf.py ./models/phi-4-mini-flash-reasoning-1.58bit --outfile ./models/phi-4-mini-flash-reasoning-1.58bit-Q8_0.gguf --outtype q8_0
```
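The script above only handles weights. Activations in the W1.58A8 scheme use per-tensor absmax 8-bit quantization at inference time; a minimal reference sketch (not part of convert-helper-bitnet.py) for checking a kernel implementation:

```python
import torch

def absmax_quantize_activations(x: torch.Tensor, bits: int = 8):
    # Per-tensor absmax scaling to signed integers, e.g. [-127, 127] for 8 bits
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-5) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale  # dequantize with q.float() * scale

activations = torch.randn(4, 128)
q, scale = absmax_quantize_activations(activations)
print(q.dtype, float(scale), (q.float() * scale - activations).abs().max())
```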
Step 4: Build and Run bitnet.cpp
Compile bitnet.cpp with the updated code to support Phi-4-mini-flash-reasoning.
```bash
# Navigate to bitnet.cpp directory
cd bitnet.cpp

# Build with CMake
mkdir build && cd build
cmake ..
make

# Run inference with the converted model
./main -m ../models/phi-4-mini-flash-reasoning-1.58bit-Q8_0.gguf -p "Solve the equation x^2 + 2x + 1 = 0" -n 2048
```
Step 5: Test and Validate
Test the model on mathematical reasoning tasks to ensure performance aligns with reported benchmarks (e.g., 92.45% Pass@1 on Math500, 52% on AIME24). Use the ChatML template for prompts:
```
<|im_start|>system<|im_sep|>You are Phi, a language model trained by Microsoft to help users. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions.<|im_end|>
<|im_start|>user<|im_sep|>Solve the equation x^2 + 2x + 1 = 0<|im_end|>
<|im_start|>assistant<|im_sep|><think>
1. Recognize the equation as a quadratic: x^2 + 2x + 1 = 0.
2. Notice it resembles a perfect square trinomial: (x + 1)^2 = x^2 + 2x + 1.
3. Rewrite the equation: (x + 1)^2 = 0.
4. Solve: x + 1 = 0, so x = -1.
5. Verify by substituting: (-1)^2 + 2(-1) + 1 = 1 - 2 + 1 = 0, which holds.
6. Check for other roots using the quadratic formula: x = [-b ± √(b^2 - 4ac)] / 2a, where a=1, b=2, c=1.
7. Discriminant: b^2 - 4ac = 2^2 - 4*1*1 = 4 - 4 = 0, indicating one real root.
8. Compute: x = [-2 ± √0] / 2 = -2/2 = -1.
</think>
Solution: The equation x^2 + 2x + 1 = 0 has one real root, x = -1.
```
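Rather than hand-writing the special tokens, the prompt can be built from the model's bundled chat template via transformers (the rendered string should match the format above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models/phi-4-mini-flash-reasoning")
messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users."},
    {"role": "user", "content": "Solve the equation x^2 + 2x + 1 = 0"},
]
# Render the ChatML-style prompt string; pass it to bitnet.cpp with -p
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```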
Expected Performance
Throughput: Up to 10x higher than Phi-4-mini-reasoning, ~10 tokens/s on CPU with 20GB RAM.
Latency: 2–3x lower than Phi-4-mini-reasoning for 2K-token prompts and 32K-token generations.
Accuracy: Comparable to larger models on math tasks (e.g., 92.45% on Math500).
Step 6: Optimize for Efficiency
To maximize bitnet.cpp’s efficiency (1.37x–6.17x speedups, 55.4%–82.2% energy reduction):
Use optimized kernels for GMU and Mamba operations.
Set SWA window size to 256 tokens.
Test on x86 or ARM CPUs for best performance.
Profile energy usage with tools like Intel Power Gadget or ARM Streamline; a rough throughput check is sketched below.
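Throughput on a given machine can be measured directly by timing a generation run with the binary from Step 4 (a rough wall-clock estimate; the flags mirror the earlier inference command):

```python
import subprocess
import time

n_tokens = 256  # a short run is enough for a throughput estimate
cmd = [
    "./main",
    "-m", "../models/phi-4-mini-flash-reasoning-1.58bit-Q8_0.gguf",
    "-p", "Solve the equation x^2 + 2x + 1 = 0",
    "-n", str(n_tokens),
]

start = time.perf_counter()
subprocess.run(cmd, check=True)
elapsed = time.perf_counter() - start
print(f"~{n_tokens / elapsed:.1f} tokens/s (wall clock, includes prompt processing)")
```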
Notes
Quantization Trade-offs: 1.58-bit quantization may reduce accuracy slightly compared to the original bfloat16 model. Test thoroughly on reasoning tasks.
Hardware Compatibility: On hardware without flash attention support (CPUs or non-NVIDIA GPUs), load the reference model with attn_implementation="eager".
Community Support: Check the bitnet.cpp GitHub issues for updates on SambaY support.
Reporting Issues: If critical issues arise, report via the MSRC Researcher Portal or [email protected].
Conclusion
By converting Phi-4-mini-flash-reasoning to GGUF, extending bitnet.cpp for SambaY, and optimizing for CPU inference, you can leverage its advanced reasoning capabilities on resource-constrained devices. This makes it one of the best models for bitnet.cpp in terms of efficiency and math reasoning performance, ideal for educational and edge applications.