---
base_model: google/txgemma-9b-chat
language:
- en
library_name: transformers
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
pipeline_tag: text-generation
tags:
- therapeutics
- drug-development
- llama-cpp
- matrixportal
extra_gated_heading: Access TxGemma on Hugging Face
extra_gated_prompt: To access TxGemma on Hugging Face, you're required to review and
  agree to [Health AI Developer Foundation's terms of use](https://developers.google.com/health-ai-developer-foundations/terms).
  To do this, please ensure you're logged in to Hugging Face and click below. Requests
  are processed immediately.
extra_gated_button_content: Acknowledge license
---

# matrixportal/txgemma-9b-chat-GGUF
This model was converted to GGUF format from [`google/txgemma-9b-chat`](https://huggingface.co/google/txgemma-9b-chat) using llama.cpp via ggml.ai's [all-gguf-same-where](https://huggingface.co/spaces/matrixportal/all-gguf-same-where) space.
Refer to the [original model card](https://huggingface.co/google/txgemma-9b-chat) for more details on the model.

## ✅ Quantized Models Download List

### πŸ” Recommended Quantizations
- **✨ General CPU Use:** [`Q4_K_M`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf) (Best balance of speed/quality)
- **πŸ“± ARM Devices:** [`Q4_0`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_0.gguf) (Optimized for ARM CPUs)
- **πŸ† Maximum Quality:** [`Q8_0`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q8_0.gguf) (Near-original quality)

### 📦 Full Quantization Options
| 🚀 Download | 🔒 Type | 📝 Notes |
|:---------|:-----|:------|
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q2_k.gguf) | ![Q2_K](https://img.shields.io/badge/Q2_K-1A73E8) | Basic quantization |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_s.gguf) | ![Q3_K_S](https://img.shields.io/badge/Q3_K_S-34A853) | Small size |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_m.gguf) | ![Q3_K_M](https://img.shields.io/badge/Q3_K_M-FBBC05) | Balanced quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_l.gguf) | ![Q3_K_L](https://img.shields.io/badge/Q3_K_L-4285F4) | Better quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_0.gguf) | ![Q4_0](https://img.shields.io/badge/Q4_0-EA4335) | Fast on ARM |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_s.gguf) | ![Q4_K_S](https://img.shields.io/badge/Q4_K_S-673AB7) | Fast, recommended |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf) | ![Q4_K_M](https://img.shields.io/badge/Q4_K_M-673AB7) ⭐ | Best balance |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_0.gguf) | ![Q5_0](https://img.shields.io/badge/Q5_0-FF6D01) | Good quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_k_s.gguf) | ![Q5_K_S](https://img.shields.io/badge/Q5_K_S-0F9D58) | Balanced |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_k_m.gguf) | ![Q5_K_M](https://img.shields.io/badge/Q5_K_M-0F9D58) | High quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q6_k.gguf) | ![Q6_K](https://img.shields.io/badge/Q6_K-4285F4) 🏆 | Very good quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q8_0.gguf) | ![Q8_0](https://img.shields.io/badge/Q8_0-EA4335) ⚡ | Fast, best quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |

💡 **Tip:** Use `F16` for maximum precision when quality is critical.

# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference
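
To see the format's simplicity for yourself, here is a minimal Python sketch that inspects a GGUF header (per the spec linked at the end of this guide, every file starts with the 4-byte magic `GGUF` followed by a little-endian `uint32` version; the filename is just an example):

```python
import struct

# Read the fixed GGUF header fields: 4-byte magic, then a
# little-endian uint32 version, as defined in the GGUF spec.
with open("txgemma-9b-chat-q4_k_m.gguf", "rb") as f:
    magic = f.read(4)
    (version,) = struct.unpack("<I", f.read(4))

if magic != b"GGUF":
    raise ValueError("not a GGUF file")
print(f"GGUF version: {version}")
```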

**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 16- or 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference
- Trade away a small amount of accuracy (see the size sketch below)
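
The size savings fall straight out of the arithmetic. A rough estimate for a ~9B-parameter model (the bits-per-weight figures are approximate effective values; real GGUF files also carry metadata, so actual sizes differ somewhat):

```python
# Rough file-size estimates for a ~9B-parameter model at the
# approximate effective bits-per-weight of common GGUF schemes.
PARAMS = 9e9

for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    gib = PARAMS * bpw / 8 / 1024**3
    print(f"{name:7s} ~{gib:4.1f} GiB")
```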

## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
```
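
Note: recent llama.cpp releases deprecate the Makefile build in favor of CMake, and the `main` binary has been renamed `llama-cli`. If `make` fails on a current checkout, the equivalent build is:

```bash
cmake -B build
cmake --build build --config Release -j4
# binaries land in build/bin/, e.g. build/bin/llama-cli
```

The examples below use the classic `./main` name; on newer builds substitute `./build/bin/llama-cli`, which accepts the same flags.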

### 2. Using Quantized Models from Hugging Face

My automated quantization script publishes models at URLs of this form:
```
https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf
```
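
Alternatively, if you prefer the Hugging Face CLI (part of the `huggingface_hub` package), it can fetch a single file from the repo into the current directory:

```bash
pip install -U huggingface_hub
huggingface-cli download matrixportal/txgemma-9b-chat-GGUF \
    txgemma-9b-chat-q4_k_m.gguf --local-dir .
```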

### 3. Running the Quantized Model

Basic usage:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf -p "Your prompt here" -n 128
```

Example with a creative writing prompt. TxGemma is Gemma-based, so it expects Gemma's turn markers rather than `[INST]` tags; `-e` tells llama.cpp to process the `\n` escapes:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf \
    -p "<start_of_turn>user\nWrite a short poem about AI quantization in the style of Shakespeare<end_of_turn>\n<start_of_turn>model\n" \
    -e -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf \
    -p "Question: What is the GGUF format?
Answer:" \
    -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```
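
For longer sessions, llama.cpp also ships an HTTP server (`./server` in older builds, `llama-server` in current ones). A minimal sketch:

```bash
# start the server
./server -m txgemma-9b-chat-q4_k_m.gguf -c 2048 -t 8 --port 8080

# then, from another shell, hit the completion endpoint
curl http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Question: What is quantization?\nAnswer:", "n_predict": 128}'
```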

### 4. Python Integration

Install the Python package:
```bash
pip install llama-cpp-python
```

Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="txgemma-9b-chat-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Run inference (TxGemma expects Gemma's turn markers, not [INST] tags)
response = llm(
    "<start_of_turn>user\nExplain GGUF quantization to a beginner<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```
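
Rather than hand-writing turn markers, you can let llama-cpp-python apply the chat template stored in the GGUF metadata (chat-model conversions normally include one) via the chat-completion API:

```python
from llama_cpp import Llama

llm = Llama(model_path="txgemma-9b-chat-q4_k_m.gguf", n_ctx=2048, n_threads=8)

# create_chat_completion formats the messages with the model's own
# chat template, so no manual <start_of_turn> markers are needed
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization to a beginner"}],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```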

## Performance Tips

1. **Hardware Utilization**:
   - Set thread count with `-t` (typically your physical CPU core count)
   - Compile with a GPU backend (e.g., CUDA, Metal, Vulkan) for GPU support

2. **Memory Optimization**:
   - Lower-bit quantizations (like q4_k_m) use less RAM
   - Adjust context size with the `-c` parameter

3. **Speed/Accuracy Balance**:
   - Higher-bit quantizations are slower but more accurate
   - Reduce randomness with `--temp 0` for consistent, reproducible results

A typical invocation combining these flags is sketched below.
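
Putting these together on, say, an 8-core machine with a CUDA build (`-ngl` offloads that many layers to the GPU and is ignored on CPU-only builds; 35 is illustrative, not tuned):

```bash
./main -m txgemma-9b-chat-q4_k_m.gguf \
    -t 8 -c 2048 -ngl 35 --temp 0 \
    -p "Question: What is quantization?\nAnswer:" -e -n 256
```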

## FAQ

**Q: What quantization levels are available?**  
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, q8_0

**Q: How much performance loss occurs with q4_k_m?**  
A: Typically a small accuracy reduction (figures around 2-5% are commonly quoted, varying by task) for a roughly 4x smaller file than F16

**Q: How to enable GPU support?**  
A: On older releases, build with `make LLAMA_CUBLAS=1`; on current CMake-based releases, use `cmake -B build -DGGML_CUDA=ON` for NVIDIA GPUs

## Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)