Problem Running Model
Hi, mate! Whatever I do, I can't run this exact model version properly. It replies with nonsense every time.
I've tried your deepseek-coder-33B-instruct GPTQ and the 6.7B AWQ version - they all run fine (but slowly, which is why I'm desperately looking for AWQ models). Is there something you're aware of, or something I'm doing wrong? Does it run fine on your side?
Thank you in advance for all your work here!
I guess I have the same issue. It just outputs one token on repeat until it hits the max_token limit. How does it look for you? Are you trying to run it in the oobabooga webui?
@adamo1139 You've described the behaviour on my side pretty well too. Sometimes it produces something like "YouYouYouYouYouYouYou.." instead of an empty token, but it's never anything meaningful and it only stops at the max token limit. I'm using my own custom solution for inference, and this is the first time I've seen that kind of output across the many, many models I've tried.
@TheBloke I really hope you'll look into this, as this model is considered one of the best for coding and the AWQ version is essential for some of us :)
That sounds like a rope_scaling issue. What are your sequence length settings?
Try at sequence length 16384 and see what happens there
If that doesn't work, or you don't have the VRAM for 16384, try editing config.json locally to remove the rope_scaling section and set max_position_embeddings to 4096
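For reference, a minimal sketch of that local edit (the path is just a placeholder for wherever the model was downloaded):

```python
import json

# Placeholder path to the locally downloaded AWQ model directory
cfg_path = "models/deepseek-coder-33B-instruct-AWQ/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Drop the rope_scaling section and cap the context at 4096
cfg.pop("rope_scaling", None)
cfg["max_position_embeddings"] = 4096

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```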
I've not tested AWQ / AutoAWQ with longer sequence lengths - possibly the clients lack the controls that GPTQ would have for automatically setting the right rope scaling value according to the chosen sequence length
Try those things and let me know
@TheBloke Thanks for reply!
Sure, here is more info:
- I have an A6000 with 48 GB VRAM and I wasn't even able to run the model with its default rope_scaling factor of 4.0 because of OOM (I could easily run all the 70B AWQ models before).
- After removing rope_scaling (or setting the factor to something low like 1.1 instead of 4.0) I was able to run the model, but that doesn't fix the wrong output (and last time I had to do the same just to get it running). Setting max_position_embeddings to 4096 also did nothing.
Could it be some error during quantization? The GPTQ version (btw, I don't need to change rope_scaling there to avoid OOM) and the smaller AWQ work as intended. It's very strange - this is the first time I've seen such OOM and wrong-output problems across all the AWQ models of yours I've tried.
Yeah that could be the case
One final test: could you try setting max_position_embeddings=4096 and leaving rope_scaling as it is set in my repo, i.e. factor 4.0.
Then set the inference seqlen to 16384 and see if that makes any difference to the OOM or the output
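A short sketch of that combination, using the same placeholder path as above (the inference seqlen of 16384 is then set in whichever client you use):

```python
import json

# Same placeholder model directory as in the earlier sketch
cfg_path = "models/deepseek-coder-33B-instruct-AWQ/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Keep the repo's rope_scaling untouched, only cap max_position_embeddings
cfg["max_position_embeddings"] = 4096
print("rope_scaling:", cfg.get("rope_scaling"))  # expect a factor of 4.0 here

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```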
By the way, are you loading with Flash Attention 2? That will significantly lower VRAM requirements at longer contexts
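If you're loading through Transformers, a minimal sketch of what that looks like (assuming a transformers version with native AWQ loading and flash-attn installed; on older versions the flag was use_flash_attention_2=True, and the repo id here is an assumption - double-check it):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id - adjust to the AWQ repo you actually downloaded
model_id = "TheBloke/deepseek-coder-33B-instruct-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Flash Attention 2 noticeably lowers VRAM use at long contexts
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                 # AWQ kernels run in fp16
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```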
I run out of VRAM if I use sequence length 16384, but that just slows things down and moves stuff to RAM. It didn't help - the model still outputs "'''''''''"
Removing rope_scaling, setting max_position_embeddings to 4096 and loading it in the webui at sequence length 4096 didn't fix it either.
The SHA256 of both safetensors files matches. I re-downloaded the JSONs from your repo, but that didn't help either.
Messing with BOS token and special tokens settings in oobabooga didn't help.
This is the first time I'm using AWQ, so there is probably something wrong with my setup. I will check other AutoAWQ versions; my oobabooga setup is currently on 0.1.6 (latest)
@TheBloke
Hmm, it actually helped (max_position_embeddings=4096 with the default rope_scaling restored).
While the model often produces some strange text (nonsense) at the end of a reply, it has started to respond as expected with code etc. Maybe that can be fixed by tuning the generation params/template.
For AWQ inference I'm using the vLLM lib with default params. As far as I know, it uses PagedAttention.
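For anyone who wants to try the vLLM route, a minimal sketch of that setup (the repo id and context length are just the values discussed above - adjust as needed):

```python
from vllm import LLM, SamplingParams

# Assumed repo id and settings based on the discussion above
llm = LLM(
    model="TheBloke/deepseek-coder-33B-instruct-AWQ",
    quantization="awq",
    dtype="half",          # AWQ kernels run in fp16
    max_model_len=4096,    # matches max_position_embeddings=4096
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Write a Python function that reverses a string."], params)
print(outputs[0].outputs[0].text)
```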
Yeah, it only started working with the vLLM back-end; with plain Hugging Face transformers it still doesn't work. oobabooga probably uses the standard Hugging Face inference path, so the problem remains for the many people who don't know about vLLM.
I tried running the AWQ without the webui, i.e. outside the webui miniconda env, and got basically the same result - the model outputs about 50 ' characters and then newlines until it reaches the max output token limit. I tried the 6.7B GPTQ model and it works fine. Off topic, but the Instruct model was clearly trained on some data from the ChatGPT/GPT APIs - that's what I get with the default Alpaca system prompt
I tried updating oobabooga, downgrading AutoAWQ to 0.1.5, and doing what helped @bezale.
The same issue remains. I will try the GPTQ quant; maybe I won't have this issue there.
Have you tried GPTQ? I'm using this AWQ version with TGI and encountered a similar issue.
When the input is short it seems to work; for example, if I ask "please introduce yourself" the model responds correctly.
But when the input gets longer, the output becomes weird. For example, when I included some table info and some RAG content in the prompt, the output became a long sequence of "!"
This may be caused by torch_dtype == float16 in the config.json file. I tried the failing input on the original non-quantized model and the result was good. However, when I changed torch_dtype from bfloat16 to float16 in the original model's config.json, the error occurred.
However, AWQ does NOT support the bfloat16 type, so does anybody have a good solution to this issue?
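A minimal sketch of that dtype comparison on the unquantized model (the repo id and prompt are placeholders for the actual failing input, and it assumes enough GPU memory or offloading via device_map):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ids/prompt - swap in the actual failing input
base_id = "deepseek-ai/deepseek-coder-33b-instruct"
prompt = "### Instruction:\nSummarize the following table...\n### Response:\n"

tokenizer = AutoTokenizer.from_pretrained(base_id)

for dtype in (torch.bfloat16, torch.float16):
    model = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=dtype, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(dtype, tokenizer.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()
```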