Qwen-32B overflow issue
Hi, Qwen-32B and its variants are special models that can easily cause int4 kernel overflow when the chat template is applied. Besides checking accuracy, you may want to check the generations themselves, or directly follow our recipe in the OPEA space when using AutoRound.
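For a quick generation check, something like the sketch below is usually enough (a sketch only, not the OPEA recipe itself; the model path is a placeholder for your own quantized checkpoint):

```python
# Rough sketch of a post-quantization generation check (model ID is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/your-qwen-32b-int4"  # placeholder for the quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Apply the chat template, since the overflow tends to show up with templated prompts.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# A broken int4 kernel typically shows up here as repeated tokens or gibberish.
```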
Hi,
Yes, I observed it with some models. My current strategy is to check accuracy on IFEval to monitor generation issues. If generation is broken, it should result in a significant drop in accuracy on this benchmark.
Do you think this is correct, or did you observe cases where the model performed well on generative benchmarks (like IFEval) while having generation issues? I don't see how that would be possible, but I may be missing something.
Yes, I believe so, if I remember correctly. By default, lm-eval sets apply_chat_template to False.
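For reference, this is roughly how I enable it through the Python API (a sketch; whether simple_evaluate exposes apply_chat_template depends on your lm-eval version):

```python
# Sketch: running IFEval with the chat template enabled via lm-eval's Python API.
# Availability of the apply_chat_template argument depends on the lm-eval version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/quantized-model,dtype=auto",  # placeholder path
    tasks=["leaderboard_ifeval"],
    apply_chat_template=True,  # defaults to False, so templated generation is not exercised
)
print(results["results"]["leaderboard_ifeval"])
```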
Another question: I noticed you mentioned that AutoRound produced unstable IFEval results for Qwen2.5-72B. Could you share the exact task name? We tested several hyperparameter settings a long time ago, including auto-round-best, auto-round, and auto-round-light, and all yielded satisfactory results for leaderboard_ifeval.
I didn't observe issues with the chat template in this case, but I don't systematically test for it. I'll work on this.
As for Qwen2.5-72B Instruct, I also get very good results when quantizing it, except with this (very) specific configuration:
- nsamples = 512
- iterations = 500
- model_dtype = float16
- symmetric quantization
- auto_gptq export
- group size = 128
This configuration produced a bad quantization for the 4-bit and 8-bit versions, but worked well for 2-bit with a group size of 32.
Other hyperparameter values, as you suggested, performed well.
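For reference, that configuration corresponds roughly to the following AutoRound call (a sketch; argument names may differ slightly across AutoRound versions, and the 8-bit run only changed the bit width):

```python
# Rough reconstruction of the problematic run (argument names may vary by AutoRound version).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen2.5-72B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,          # the 8-bit run used bits=8 with the same remaining settings
    group_size=128,
    sym=True,        # symmetric quantization
    nsamples=512,
    iters=500,
)
autoround.quantize()
autoround.save_quantized("Qwen2.5-72B-Instruct-AutoRoundGPTQ-4bit", format="auto_gptq")
```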
@bnjmnmarie Hi, do you still have the 72B int4 model whose IFEval accuracy was notably low? I couldn't reproduce the issue in either of my two environments. I'd like to check whether it's related to lm-eval or something else.
Yes, they are here:
kaitchup/Qwen2.5-72B-Instruct-AutoRoundGPTQ-8bit
kaitchup/Qwen2.5-72B-Instruct-AutoRoundGPTQ-4bit
Thanks for the quick reply!
(not sure it matters but for evaluation with IFEval, I use the vLLM backend)
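Concretely, the run looks roughly like this (a sketch; the vLLM settings such as tensor_parallel_size are placeholders, not my exact values):

```python
# Sketch of the IFEval run through lm-eval's vLLM backend (model_args values are placeholders).
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=kaitchup/Qwen2.5-72B-Instruct-AutoRoundGPTQ-4bit,"
        "tensor_parallel_size=2,gpu_memory_utilization=0.9"
    ),
    tasks=["leaderboard_ifeval"],
)
print(results["results"]["leaderboard_ifeval"])
```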
Ok, thanks for the information! We'll evaluate it using both the HF and vLLM backends.