---
base_model: google/txgemma-9b-chat
language:
- en
library_name: transformers
license: other
license_name: health-ai-developer-foundations
license_link: https://developers.google.com/health-ai-developer-foundations/terms
pipeline_tag: text-generation
tags:
- therapeutics
- drug-development
- llama-cpp
- matrixportal
extra_gated_heading: Access TxGemma on Hugging Face
extra_gated_prompt: To access TxGemma on Hugging Face, you're required to review and
  agree to [Health AI Developer Foundation's terms of use](https://developers.google.com/health-ai-developer-foundations/terms).
  To do this, please ensure you're logged in to Hugging Face and click below. Requests
  are processed immediately.
extra_gated_button_content: Acknowledge license
---

# matrixportal/txgemma-9b-chat-GGUF
This model was converted to GGUF format from [`google/txgemma-9b-chat`](https://huggingface.co/google/txgemma-9b-chat) using llama.cpp via ggml.ai's [all-gguf-same-where](https://huggingface.co/spaces/matrixportal/all-gguf-same-where) space.
Refer to the [original model card](https://huggingface.co/google/txgemma-9b-chat) for more details on the model.

## ✅ Quantized Models Download List

### 🔍 Recommended Quantizations
- **✨ General CPU Use:** [`Q4_K_M`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf) (Best balance of speed/quality)
- **📱 ARM Devices:** [`Q4_0`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_0.gguf) (Optimized for ARM CPUs)
- **🏆 Maximum Quality:** [`Q8_0`](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q8_0.gguf) (Near-original quality)

### 📦 Full Quantization Options
| 🚀 Download | 🔢 Type | 📝 Notes |
|:---------|:-----|:------|
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q2_k.gguf) | ![Q2_K](https://img.shields.io/badge/Q2_K-1A73E8) | Basic quantization |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_s.gguf) | ![Q3_K_S](https://img.shields.io/badge/Q3_K_S-34A853) | Small size |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_m.gguf) | ![Q3_K_M](https://img.shields.io/badge/Q3_K_M-FBBC05) | Balanced quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q3_k_l.gguf) | ![Q3_K_L](https://img.shields.io/badge/Q3_K_L-4285F4) | Better quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_0.gguf) | ![Q4_0](https://img.shields.io/badge/Q4_0-EA4335) | Fast on ARM |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_s.gguf) | ![Q4_K_S](https://img.shields.io/badge/Q4_K_S-673AB7) | Fast, recommended |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf) | ![Q4_K_M](https://img.shields.io/badge/Q4_K_M-673AB7) ⭐ | Best balance |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_0.gguf) | ![Q5_0](https://img.shields.io/badge/Q5_0-FF6D01) | Good quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_k_s.gguf) | ![Q5_K_S](https://img.shields.io/badge/Q5_K_S-0F9D58) | Balanced |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q5_k_m.gguf) | ![Q5_K_M](https://img.shields.io/badge/Q5_K_M-0F9D58) | High quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q6_k.gguf) | ![Q6_K](https://img.shields.io/badge/Q6_K-4285F4) 🏆 | Very good quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q8_0.gguf) | ![Q8_0](https://img.shields.io/badge/Q8_0-EA4335) ⚡ | Fast, best quality |
| [Download](https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |

💡 **Tip:** Use `F16` for maximum precision when quality is critical.

# GGUF Model Quantization & Usage Guide with llama.cpp

## What is GGUF and Quantization?

**GGUF** (GPT-Generated Unified Format) is an efficient model file format developed by the `llama.cpp` team that:
- Supports multiple quantization levels
- Works cross-platform
- Enables fast loading and inference

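Because the format is self-describing (the header stores the architecture, tokenizer, and quantization metadata next to the tensors), you can inspect a downloaded file without loading it into an inference engine. A minimal sketch, assuming the `gguf` Python package published from the llama.cpp repository (`pip install gguf`) and the Q4_K_M file name from this repo:

```python
# Minimal sketch using the gguf-py package that ships with llama.cpp (pip install gguf).
# Assumes the Q4_K_M file from this repository is already downloaded locally.
from gguf import GGUFReader

reader = GGUFReader("txgemma-9b-chat-q4_k_m.gguf")

# Key/value metadata stored in the header (architecture, context length, tokenizer, ...)
for name in reader.fields:
    print(name)

# First few tensors with their shapes and quantization types
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```
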
**Quantization** converts model weights to lower precision data types (e.g., 4-bit integers instead of 32-bit floats) to:
- Reduce model size
- Decrease memory usage
- Speed up inference
- (With minor accuracy trade-offs; see the rough size estimate below)

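To put rough numbers on those trade-offs for a 9B-parameter model, here is a back-of-the-envelope size estimate. The bits-per-weight figures are approximations, so real GGUF files differ a little (some tensors are kept at higher precision):

```python
# Back-of-the-envelope file size estimate for a ~9B-parameter model.
# Bits-per-weight values are approximations; actual GGUF sizes vary slightly
# because some tensors (e.g. embeddings) may stay at higher precision.
PARAMS = 9e9  # approximate parameter count of txgemma-9b-chat

approx_bits_per_weight = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
}

for quant, bpw in approx_bits_per_weight.items():
    size_gb = PARAMS * bpw / 8 / 1e9  # bits -> bytes -> GB
    print(f"{quant:7s} ~{size_gb:4.1f} GB")
```
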
## Step-by-Step Guide

### 1. Prerequisites

```bash
# System updates
sudo apt update && sudo apt upgrade -y

# Dependencies
sudo apt install -y build-essential cmake python3-pip

# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j4
# Note: recent llama.cpp releases replace the Makefile with CMake
# (cmake -B build && cmake --build build --config Release) and rename the
# `main` binary to `llama-cli`; the commands below assume the older layout.
```

### 2. Using Quantized Models from Hugging Face

My automated quantization script publishes each quantized file at a URL of this form:
```
https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf
```

Download your quantized model directly:

```bash
wget https://huggingface.co/matrixportal/txgemma-9b-chat-GGUF/resolve/main/txgemma-9b-chat-q4_k_m.gguf
```

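If you prefer to stay in Python, the same file can be fetched with the `huggingface_hub` client instead of `wget`. A short sketch (the repo and file names match this repository; `hf_hub_download` returns the local cache path):

```python
# Sketch: download one quantized file from this repo with huggingface_hub.
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="matrixportal/txgemma-9b-chat-GGUF",
    filename="txgemma-9b-chat-q4_k_m.gguf",
)
print(model_path)  # local path inside the Hugging Face cache
```
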
### 3. Running the Quantized Model

Basic usage:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf -p "Your prompt here" -n 128
```

TxGemma's chat variants use the Gemma turn format (`<start_of_turn>user ... <end_of_turn>`) rather than `[INST]` tags. Example with a creative writing prompt:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf -p "<start_of_turn>user
Write a short poem about AI quantization in the style of Shakespeare<end_of_turn>
<start_of_turn>model
" -n 256 -c 2048 -t 8 --temp 0.7
```

Advanced parameters:
```bash
./main -m txgemma-9b-chat-q4_k_m.gguf -p "Question: What is the GGUF format?
Answer:" -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
```

### 4. Python Integration

Install the Python package:
```bash
pip install llama-cpp-python
```

Example script:
```python
from llama_cpp import Llama

# Initialize the model
llm = Llama(
    model_path="txgemma-9b-chat-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=8
)

# Build a prompt in the Gemma chat turn format used by TxGemma chat models
prompt = (
    "<start_of_turn>user\n"
    "Explain GGUF quantization to a beginner<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# Run inference
response = llm(
    prompt,
    max_tokens=256,
    temperature=0.7,
    top_p=0.9
)

print(response["choices"][0]["text"])
```

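llama-cpp-python can also apply the chat template embedded in the GGUF file for you via `create_chat_completion`, which avoids hand-writing the turn markers. A sketch along the same lines as the script above:

```python
# Sketch: let llama-cpp-python apply the model's built-in chat template.
from llama_cpp import Llama

llm = Llama(model_path="txgemma-9b-chat-q4_k_m.gguf", n_ctx=2048, n_threads=8)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF quantization to a beginner"}],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```
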
## Performance Tips

1. **Hardware Utilization**:
   - Set thread count with `-t` (typically your physical CPU core count)
   - Compile with CUDA/OpenCL support for GPU acceleration

2. **Memory Optimization**:
   - Lower-bit quantizations (like q4_k_m) use less RAM
   - Adjust the context size with the `-c` parameter

3. **Speed/Accuracy Balance**:
   - Higher-bit quantizations are slower but more accurate
   - Reduce randomness with `--temp 0` for more consistent results (see the Python sketch below)

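A rough illustration of how those tips map onto llama-cpp-python constructor arguments (the values below are placeholders to tune for your hardware, and `n_gpu_layers` only has an effect if the package was built with GPU support):

```python
from llama_cpp import Llama

# Each constructor argument mirrors one of the tips above.
llm = Llama(
    model_path="txgemma-9b-chat-q4_k_m.gguf",
    n_threads=8,      # hardware utilization: match your physical core count
    n_ctx=2048,       # memory optimization: smaller context uses less RAM
    n_gpu_layers=0,   # offload layers to the GPU if built with CUDA/Metal support
)

# Speed/accuracy balance: temperature 0 gives near-deterministic output.
out = llm("Question: What is the GGUF format?\nAnswer:", max_tokens=128, temperature=0.0)
print(out["choices"][0]["text"])
```
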
## FAQ

**Q: What quantization levels are available?**
A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, q8_0

**Q: How much performance loss occurs with q4_k_m?**
A: Typically a 2-5% accuracy reduction, in exchange for a file roughly 4x smaller than F16

**Q: How to enable GPU support?**
A: Older Makefile builds used `make LLAMA_CUBLAS=1` for NVIDIA GPUs; current releases use CMake, e.g. `cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release`

## Useful Resources

1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
3. [Hugging Face Model Hub](https://huggingface.co/models)