---
license: apache-2.0
base_model: FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview
tags:
- mlx
- code
---

13 TPS on its own.

27 TPS with draft model [DeepScaleR-1.5B-Preview-Q8](https://huggingface.co/mlx-community/DeepScaleR-1.5B-Preview-Q8) for speculative decoding.

Oh yeah! Roughly 100% faster for math/code stuff (see the command sketch below).

.


MacBook M4 Max, high-power mode (10 TPS on low power, where the GPU draws only 5 watts... less than your brain)

System prompt: "You are Fuse01. You answer very direct brief and concise"

Prompt: "Write a quick sort in C++"

Context: 131072 tokens, Temp: 0
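
If you want to try to reproduce these numbers from the terminal, here is a rough sketch using mlx-lm's generate CLI. It assumes a recent mlx-lm whose `mlx_lm.generate` command supports `--draft-model` for speculative decoding; flag names can change between releases, so check `mlx_lm.generate --help`. The system prompt above is not passed on this command line; set it in whatever client or wrapper you use.

```bash
# Rough sketch, not the exact benchmark harness.
# Assumes a recent mlx-lm whose generate CLI supports --draft-model
# (speculative decoding); verify flag names with `mlx_lm.generate --help`.
mlx_lm.generate \
  --model bobig/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-Q8 \
  --draft-model mlx-community/DeepScaleR-1.5B-Preview-Q8 \
  --prompt "Write a quick sort in C++" \
  --temp 0 \
  --max-tokens 2048
```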

.


Try this model in Visual Studio Code with the Roo Code extension. Start in Architect Mode and let it auto-switch to Code Mode: it actually spits out decent code for small projects with multiple files.
It gets close to last year's Claude Sonnet for small projects, and it stays reasonably stable even with Roo Code's huge ~10k-token system prompt. The model still shits the bed on big projects, but it does better after adding roo-code-memory-bank.
So far (Feb 20, 2025) this is the only model & quant I've found that runs fast on a Mac, spits out decent code on real projects, AND works with speculative decoding.
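
One way to hook it up (a sketch, assuming you use Roo Code's OpenAI-compatible provider): serve the model with mlx-lm's built-in server and point the extension at the local endpoint. The host and port below are the usual defaults; check `mlx_lm.server --help` if your version differs.

```bash
# Sketch: expose the model over an OpenAI-compatible API for editor extensions.
# Point Roo Code's OpenAI-compatible provider at http://127.0.0.1:8080/v1
# (adjust if your mlx-lm server uses a different host/port).
mlx_lm.server \
  --model bobig/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-Q8 \
  --host 127.0.0.1 \
  --port 8080
```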


Huge thanks to all who helped Macs get this far! 

# bobig/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-Q8

The Model [bobig/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-Q8](https://huggingface.co/bobig/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-Q8) was
converted to MLX format from [FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview](https://huggingface.co/FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview)
using mlx-lm version **0.21.4**. (FYI: the base model and the draft model should be converted with the same mlx-lm version.)

## Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Download (if needed) and load the quantized model and its tokenizer.
model, tokenizer = load("bobig/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview-Q8")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is available.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```

Are you still reading down here?  

Maybe check out this new lossless Q4 quant from NexaAI and tell the MLX community how to improve mlx-lm to get 8-bit quality at 4-bit speed!

[DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant](https://huggingface.co/NexaAIDev/DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant)
