Update README with GGUF format documentation and usage instructions
README.md
@@ -102,6 +102,85 @@ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_spec
- **LoRA Adapter**: Smaller adapter files (`adapter_model.safetensors`, `adapter_config.json`)
- **Tokenizer**: Shared tokenizer files for both options

## GGUF Format Models

This repository also includes GGUF-format models for use with **llama.cpp**, **Ollama**, and other GGUF-compatible inference engines. GGUF files load directly in C/C++ runtimes, so they run efficiently across platforms without a Python environment.

### Available GGUF Models

| File | Size | Format | Use Case | RAM Required |
|------|------|--------|----------|--------------|
| `merged-sci-model.gguf` | 14 GB | F16 | Maximum-quality inference | ~16 GB |
| `merged-sci-model-q4_k_m.gguf` | 4.1 GB | Q4_K_M | Balanced quality/performance | ~6 GB |
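If you prefer the Hugging Face CLI to the plain `wget` commands shown below, a sketch (assumes the `huggingface_hub` package and its `huggingface-cli download` subcommand):

```bash
# Sketch: fetch the quantized model via the Hugging Face CLI instead of wget
pip install -U huggingface_hub
huggingface-cli download basiphobe/sci-assistant \
  merged-sci-model-q4_k_m.gguf --local-dir .
```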
### Usage with Ollama

**1. Download the model and create a Modelfile:**
```bash
# Download the quantized model (recommended)
wget https://huggingface.co/basiphobe/sci-assistant/resolve/main/merged-sci-model-q4_k_m.gguf

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./merged-sci-model-q4_k_m.gguf
TEMPLATE """<|im_start|>system
You are a specialized medical assistant for people with spinal cord injuries. Your responses should always consider the unique needs, challenges, and medical realities of individuals living with SCI.<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF
```
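The Modelfile above bakes the system prompt into the template. If you want clients to supply their own system message, a variant using Ollama's `SYSTEM` instruction and `{{ .System }}` template variable should behave the same by default (a sketch, not tested against this model; the `Modelfile.system` filename is just an example):

```bash
# Sketch: Modelfile variant with an overridable system prompt
cat > Modelfile.system << 'EOF'
FROM ./merged-sci-model-q4_k_m.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
SYSTEM """You are a specialized medical assistant for people with spinal cord injuries."""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
EOF

ollama create sci-assistant -f Modelfile.system
```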
**2. Create and run the model:**
```bash
ollama create sci-assistant -f Modelfile
ollama run sci-assistant "What are the signs of autonomic dysreflexia?"
```
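Once created, the model is also reachable through Ollama's local REST API (port 11434 by default), which is handy for scripting; a minimal sketch:

```bash
# Sketch: query the model through Ollama's local HTTP API
curl http://localhost:11434/api/generate -d '{
  "model": "sci-assistant",
  "prompt": "What are the signs of autonomic dysreflexia?",
  "stream": false
}'
```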
### Usage with llama.cpp

**1. Install and set up:**
```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make   # newer checkouts drop the Makefile; use: cmake -B build && cmake --build build

# Download the model
wget https://huggingface.co/basiphobe/sci-assistant/resolve/main/merged-sci-model-q4_k_m.gguf
```
**2. Interactive chat:**
```bash
# note: the `main` binary is named `llama-cli` in newer llama.cpp builds
./main -m merged-sci-model-q4_k_m.gguf \
  --temp 0.7 \
  --repeat-penalty 1.1 \
  -c 4096 \
  --interactive \
  --in-prefix "<|im_start|>user\n" \
  --in-suffix "<|im_end|>\n<|im_start|>assistant\n"
```
**3. Single prompt:**
```bash
./main -m merged-sci-model-q4_k_m.gguf \
  --temp 0.7 \
  -c 2048 \
  -p "<|im_start|>system\nYou are a specialized medical assistant for people with spinal cord injuries.<|im_end|>\n<|im_start|>user\nWhat exercises are good for someone with paraplegia?<|im_end|>\n<|im_start|>assistant\n"
```
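For an HTTP endpoint instead of the CLI, llama.cpp also ships a server binary (`./server` in older builds, `llama-server` in newer ones). A sketch, with the port choice as an assumption:

```bash
# Sketch: serve the model over HTTP and query the /completion endpoint
./server -m merged-sci-model-q4_k_m.gguf -c 4096 --port 8080 &
# wait for the model to finish loading before querying

curl http://localhost:8080/completion -d '{
  "prompt": "<|im_start|>user\nWhat are the signs of autonomic dysreflexia?<|im_end|>\n<|im_start|>assistant\n",
  "n_predict": 256,
  "temperature": 0.7
}'
```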
### Performance Comparison

- **F16 Model** (`merged-sci-model.gguf`): maximum quality, largest memory footprint
- **Q4_K_M Model** (`merged-sci-model-q4_k_m.gguf`): ~99% quality retention at ~3.5x smaller size; recommended for most users

Both models use the **ChatML** prompt template and support up to **32K tokens of context**.
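To confirm the template and context length from the file itself, the `gguf` Python package includes a metadata dump utility (a sketch; the exact script name and metadata keys may vary by package version and conversion tool):

```bash
# Sketch: inspect GGUF metadata for the context length and chat template
pip install gguf
gguf-dump merged-sci-model-q4_k_m.gguf | grep -iE 'context_length|chat_template'
```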
## Intended Use
This model is designed to:
|