Really good work
Hello. Just wanted to comment that this model works flawlessly on my 4090.
Just out of curiosity: did you do anything special during the quantization of the model? For some reason, casperhansen's version doesn't run nearly as well or as accurately as this one.
Thanks!
Hi!
Happy to hear that! I don't do anything special; I just use the code snippet from casper's README. Maybe it's the versions of the Python libs, the NVIDIA/CUDA drivers, or the GPU I use for quantization (currently an NVIDIA L40).
We could compare my setup with that of @casperhansen - I think he might also be interested.
For the software part:
accelerate 1.2.1
autoawq 0.2.7.post3
huggingface-hub 0.27.0
safetensors 0.4.5
tokenizers 0.21.0
torch 2.5.1
transformers 4.47.1
triton 3.1.0
Running on Ubuntu 22.04 LTS with:
nvidia-driver-570-open - 570.86.15-0ubuntu1
cuda-tools-12-6 - 12.6.3-1
quantize() is called with quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
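For reference, this is roughly the README snippet I run, sketched from memory; the model and output paths are just placeholders:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-Small-24B-Instruct-2501"  # source model (placeholder)
quant_path = "Mistral-Small-24B-Instruct-2501-AWQ"        # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration/quantization with the config above
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized weights and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```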
Best regards, cos
@divmgl
Can you please share the steps for how you run the model? When I try vLLM with: vllm serve stelterlab/Mistral-Small-24B-Instruct-2501-AWQ --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice --quantization awq
I get a runtime error.
@olegivaniv What error message do you get? Do you limit the MAX_MODEL_LEN?
For vLLM on my RTX 4090, I use:
--kv-cache-dtype fp8
--max-model-len 8192
to reduce the memory footprint (see the Python sketch at the end of this post). If you don't limit the max model length, you will probably get:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (11296). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Memory usage is then:
INFO 02-02 09:20:34 worker.py:266] Memory profiling takes 2.86 seconds
INFO 02-02 09:20:34 worker.py:266] the current vLLM instance can use total_gpu_memory (23.53GiB) x gpu_memory_utilization (0.90) = 21.17GiB
INFO 02-02 09:20:34 worker.py:266] model weights take 13.30GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.73GiB; the rest of the memory reserved for KV Cache is 6.07GiB.
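If you'd rather use the Python API than the CLI, the same settings would look roughly like this (untested sketch; the prompt and sampling settings are just placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    quantization="awq",
    kv_cache_dtype="fp8",         # same as --kv-cache-dtype fp8
    max_model_len=8192,           # same as --max-model-len 8192
    gpu_memory_utilization=0.90,  # vLLM default, matches the log above
)

# Simple smoke test: generate a short completion
outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```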
@stelterlab
That's very helpful, thank you! It seems like the issue was with using --config_format mistral --load_format mistral. Btw, are you using tool calling? None of my tools are getting picked up by the model, even with --tool-call-parser mistral --enable-auto-tool-choice
@olegivaniv Nope, I haven't used tool calling yet.
@olegivaniv
--tokenizer-mode mistral --tokenizer mistralai/Mistral-Small-24B-Instruct-2501 --enable-auto-tool-choice --tool-call-parser mistral
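For a quick check that tool calls actually come through with those flags, something like this against the OpenAI-compatible endpoint should work (a minimal sketch; it assumes the server runs on localhost:8000 with no API key and uses a dummy get_weather tool):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; adjust host/port/key to your setup
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Dummy tool definition just to see whether the parser picks it up
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)

# If tool calling works, this should print a tool_call instead of None
print(resp.choices[0].message.tool_calls)
```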