Really good work
Hello. Just wanted to comment that this model works flawlessly on my 4090.
Just out of curiosity: did you do anything special during the quantization of the model? For some reason, casperhansen's version doesn't run nearly as well or as accurately as this one.
Thanks!
Hi!
Happy to hear that! I don't do anything special; I just use the code snippet from casper's README. Maybe it's the versions of the Python libs, the NVIDIA/CUDA drivers, or the GPU I use for quantization (currently an NVIDIA L40).
We could compare my setup with that of @casperhansen - I think he might also be interested.
For the software part:
accelerate 1.2.1
autoawq 0.2.7.post3
huggingface-hub 0.27.0
safetensors 0.4.5
tokenizers 0.21.0
torch 2.5.1
transformers 4.47.1
triton 3.1.0
Running on Ubuntu 22.04 LTS with:
nvidia-driver-570-open - 570.86.15-0ubuntu1
cuda-tools-12-6 - 12.6.3-1
quantize() is called with quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
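For reference, this is roughly the README snippet I run, sketched from memory; the model and output paths are just placeholders:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-24B-Instruct-2501"  # source model (placeholder)
quant_path = "Mistral-Small-24B-Instruct-2501-AWQ"        # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration/quantization with the config above
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized weights and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```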
Best regards, cos
@divmgl
Can you please share the steps for how you run the model? When I try vLLM with: vllm serve stelterlab/Mistral-Small-24B-Instruct-2501-AWQ --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --quantization awq
I get a runtime error.
@olegivaniv What error message do you get? Do you limit the MAX_MODEL_LEN?
For vLLM on my RTX 4090, I use:
--kv-cache-dtype fp8
--max-model-len 8192
to reduce the memory footprint (see the Python sketch at the end of this post). If you don't limit the max model length, you will probably get:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (11296). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Memory usage is then:
INFO 02-02 09:20:34 worker.py:266] Memory profiling takes 2.86 seconds
INFO 02-02 09:20:34 worker.py:266] the current vLLM instance can use total_gpu_memory (23.53GiB) x gpu_memory_utilization (0.90) = 21.17GiB
INFO 02-02 09:20:34 worker.py:266] model weights take 13.30GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.73GiB; the rest of the memory reserved for KV Cache is 6.07GiB.
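If you'd rather use the Python API than the CLI, the same settings would look roughly like this (untested sketch; the prompt and sampling settings are just placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    quantization="awq",
    kv_cache_dtype="fp8",         # same as --kv-cache-dtype fp8
    max_model_len=8192,           # same as --max-model-len 8192
    gpu_memory_utilization=0.90,  # vLLM default, matches the log above
)

# Simple smoke test: generate a short completion
outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```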
@stelterlab
That's very helpful, thank you! It seems like the issue was with using --config_format mistral --load_format mistral. Btw, are you using tool calling? None of my tools are getting picked up by the model, even with --tool-call-parser mistral --enable-auto-tool-choice
@olegivaniv Nope, I haven't used tool calling yet.
@olegivaniv
--tokenizer-mode mistral --tokenizer mistralai/Mistral-Small-24B-Instruct-2501 --enable-auto-tool-choice --tool-call-parser mistral
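For a quick check that tool calls actually come through with those flags, something like this against the OpenAI-compatible endpoint should work (a minimal sketch; it assumes the server runs on localhost:8000 with no API key and uses a dummy get_weather tool):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; adjust host/port/key to your setup
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Dummy tool definition just to see whether the parser picks it up
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)

# If tool calling works, this should print a tool_call instead of None
print(resp.choices[0].message.tool_calls)
```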