---
base_model: google/txgemma-9b-chat
language:
- en
library_name: transformers
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
pipeline_tag: text-generation
tags:
- therapeutics
- drug-development
- llama-cpp
- matrixportal
extra_gated_heading: Access TxGemma on Hugging Face
extra_gated_prompt: To access TxGemma on Hugging Face, you're required to review and
  agree to [Health AI Developer Foundation's terms of use](https://developers.google.com/health-ai-developer-foundations/terms).
  To do this, please ensure you're logged in to Hugging Face and click below. Requests
  are processed immediately.
extra_gated_button_content: Acknowledge license
---

# matrixportal/txgemma-9b-chat-GGUF

This model was converted to GGUF format from [`google/txgemma-9b-chat`](https://huggingface.co/google/txgemma-9b-chat) using llama.cpp via ggml.ai's [all-gguf-same-where](https://huggingface.co/spaces/matrixportal/all-gguf-same-where) space.

Refer to the [original model card](https://huggingface.co/google/txgemma-9b-chat) for more details on the model.

## ✅ Quantized Models Download List

### 🔍 Recommended Quantizations
- **✨ General CPU Use:** [`Q4_K_M`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf) (Best balance of speed/quality)
- **📱 ARM Devices:** [`Q4_0`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_0.gguf) (Optimized for ARM CPUs)
- **🏆 Maximum Quality:** [`Q8_0`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q8_0.gguf) (Near-original quality)

### 📦 Full Quantization Options

| 🚀 Download | 🔢 Type | 📝 Notes |
|:---------|:-----|:------|
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q2_k.gguf) | Q2_K | Basic quantization |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_s.gguf) | Q3_K_S | Small size |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_m.gguf) | Q3_K_M | Balanced quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_l.gguf) | Q3_K_L | Better quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_0.gguf) | Q4_0 | Fast on ARM |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_s.gguf) | Q4_K_S | Fast, recommended |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf) | Q4_K_M ⭐ | Best balance |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_0.gguf) | Q5_0 | Good quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_k_s.gguf) | Q5_K_S | Balanced |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_k_m.gguf) | Q5_K_M | High quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q6_k.gguf) | Q6_K 🏆 | Very good quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q8_0.gguf) | Q8_0 ⚡ | Fast, best quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-f16.gguf) | F16 | Maximum accuracy |

💡 **Tip:** Use `F16` for maximum precision when quality is critical.

# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference

**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference
- (With minor accuracy trade-offs; a sketch of the typical conversion workflow follows below)

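For context, this is roughly how the GGUF files in this repo are produced. The exact script and binary names depend on your llama.cpp checkout; `convert-hf-to-gguf.py` and `./quantize`, shown here, match older llama.cpp layouts, while newer builds ship them as `convert_hf_to_gguf.py` and `llama-quantize`:

```bash
# 1. Export the original HF model to GGUF at F16 precision
#    (script name varies by llama.cpp version)
python convert-hf-to-gguf.py /path/to/txgemma-9b-chat --outtype f16 --outfile txgemma-9b-chat-f16.gguf

# 2. Quantize the F16 file down to the desired format (Q4_K_M here)
#    (binary is ./quantize in older builds, llama-quantize in newer ones)
./quantize txgemma-9b-chat-f16.gguf txgemma-9b-chat-q4_k_m.gguf Q4_K_M
```
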
## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
```

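Note that recent llama.cpp releases have replaced the Makefile build with CMake. If `make` fails on your checkout, something along these lines should work instead (exact targets may differ by version):

```bash
# CMake-based build (current llama.cpp layout); binaries land in build/bin/
cmake -B build
cmake --build build --config Release -j 4
```
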
### 2. Using Quantized Models from Hugging Face

My automated quantization script produces models in this format:
```
https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf
```

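If you prefer the Hugging Face tooling over `wget`, the `huggingface_hub` CLI can fetch a single file from this repo as well (shown here as an optional alternative):

```bash
pip install -U huggingface_hub

# Download one quantized file into the current directory
huggingface-cli download matrixportal/txgemma-9b-chat-GGUF \
  txgemma-9b-chat-q4_k_m.gguf --local-dir .
```
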
### 3. Running the Quantized Model

Basic usage:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf -p "Your prompt here" -n 128
```

Example with a creative writing prompt (TxGemma is Gemma-based, so it uses Gemma's turn markers rather than `[INST]` tags):
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf -p "<start_of_turn>user
Write a short poem about AI quantization in the style of Shakespeare<end_of_turn>
<start_of_turn>model
" -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf -p "Question: What is the GGUF format?
Answer:" -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```

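llama.cpp can also serve the model over HTTP instead of running one-off prompts. As a rough sketch (the binary is `./server` in older builds and `llama-server` in newer ones, and flags can vary between versions):

```bash
# Serve the model over HTTP on port 8080
./server -m txgemma-9b-chat-q4_k_m.gguf -c 2048 --host 0.0.0.0 --port 8080
```
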
### 4. Python Integration

Install the Python package:
```bash
pip install llama-cpp-python
```

Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="txgemma-9b-chat-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Run inference with Gemma-style turn markers
response = llm(
    "<start_of_turn>user\nExplain GGUF quantization to a beginner<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```

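`llama-cpp-python` can also apply the chat template for you via its chat-completion API, so you don't have to hand-write the turn markers. A minimal sketch, assuming the GGUF file carries the Gemma chat template in its metadata:

```python
from llama_cpp import Llama

llm = Llama(model_path="txgemma-9b-chat-q4_k_m.gguf", n_ctx=2048, n_threads=8)

# Messages are formatted with the model's built-in chat template
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain GGUF quantization to a beginner"}
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response["choices"][0]["message"]["content"])
```
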
## Performance Tips

1. **Hardware Utilization**:
   - Set thread count with `-t` (typically your CPU core count)
   - Compile with CUDA/OpenCL support to offload work to the GPU

2. **Memory Optimization**:
   - Lower-bit quantizations (like Q4_K_M) use less RAM
   - Adjust context size with the `-c` parameter

3. **Speed/Accuracy Balance**:
   - Higher-bit quantizations are slower but more accurate
   - Reduce randomness with `--temp 0` for consistent results
   - A command combining these flags is shown below

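Putting those tips together, a typical invocation might look like the following (the `-ngl` value is an illustrative guess; it sets how many layers are offloaded to the GPU and only applies to GPU-enabled builds):

```bash
# 8 threads, 4096-token context, deterministic sampling, 35 layers offloaded to GPU
./main -m txgemma-9b-chat-q4_k_m.gguf \
  -t 8 -c 4096 --temp 0 -ngl 35 \
  -p "Your prompt here" -n 256
```
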
## FAQ

**Q: What quantization levels are available?**
A: Common options include Q4_0, Q4_K_M, Q5_0, Q5_K_M, and Q8_0.

**Q: How much performance loss occurs with Q4_K_M?**
A: Typically a 2-5% accuracy reduction, for a file roughly 4x smaller than F16.

**Q: How to enable GPU support?**
A: Build with `make LLAMA_CUBLAS=1` for NVIDIA GPUs.

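On that last point: `LLAMA_CUBLAS=1` applies to the older Makefile build. Recent llama.cpp versions use a CMake flag instead; assuming a current checkout, something like this enables CUDA:

```bash
# CMake-based CUDA build for NVIDIA GPUs (recent llama.cpp versions)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 4
```
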
## Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)