Error when quantizing my finetuned 405B model using AutoAWQ
Update transformers to 4.43.x:

```bash
pip install -U transformers
```

and add `device_map="cuda"` or `device_map="auto"`:

```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False, device_map="cuda"
)
```
Hi here @Atomheart-Father, thanks for opening this issue! May I ask which autoawq version you have installed? I believe there was a recent release, https://github.com/casper-hansen/AutoAWQ/releases/tag/v0.2.6, adding batched quantization (https://github.com/casper-hansen/AutoAWQ/pull/516), which may have something to do with the device placement issue you mentioned above.

So AFAIK, if that's the case and you have AutoAWQ 0.2.6 installed, there are two solutions: either downgrade to AutoAWQ 0.2.5, or run the quantization script with CUDA_VISIBLE_DEVICES=0 (assuming 0 is the index of the GPU you want to use to quantize the model). But since the issue is apparently with the cpu and cuda:0 devices, the best option may be to downgrade to 0.2.5 in the meantime. If that does solve the issue, I'd recommend you open an issue with detailed information about it in https://github.com/casper-hansen/AutoAWQ/issues, since if that's the case it may affect other users too.
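For anyone who prefers to pin the GPU from inside the quantization script rather than on the command line, here is a minimal sketch of that workaround (the model path is a placeholder, and `low_cpu_mem_usage=True` / `use_cache=False` are taken from the snippet above):

```python
import os

# Equivalent to launching with CUDA_VISIBLE_DEVICES=0: must be set before
# torch/awq initialize CUDA, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/finetuned-llama-3.1-405b"  # placeholder

model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# The other workaround mentioned above is simply: pip install autoawq==0.2.5
```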
I am using awq 0.2.5 and still get this exception. CUDA_VISIBLE_DEVICES=0 does not solve the problem...
I've hit the same issue as @Atomheart-Father; I also tried different device_map settings, but none of them solved the issue.
Hi @alvarobartt, I have the exact same issue as the users above. The code provided in the model card works with neither autoawq==0.2.5 nor the latest 0.2.6.

I have access to an 8xA100 80G machine with plenty of CPU RAM. So my question is: how did you do it? I mean, what machine did you use and which exact package versions? Trying to reverse engineer what changes autoawq might have made is a really long process. Thank you.

Btw, I have tried playing with max_memory as in the code below, but that still fails during quantization with OOM (just mentioning it to save other people's time):
```python
import fire
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


def main(model_path, quant_path):
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    tokenizer.save_pretrained(quant_path)

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        # safetensors=True,
        device_map="auto",
        max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", 4: "20GiB",
                    5: "20GiB", 6: "20GiB", 7: "20GiB", "cpu": "900GiB"},
        torch_dtype=torch.float16,
        # offload_folder="offload"
    )

    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)

    # Save quantized model
    model.save_quantized(quant_path, safetensors=False)


if __name__ == '__main__':
    fire.Fire(main)
```
(I don't want to get anyone's hopes up, because it's still running and might crash at any moment, as awq often does), but this line here seems to avoid the above issue: https://github.com/casper-hansen/AutoAWQ/compare/main...davedgd:AutoAWQ:patch-1#diff-5ea134b0db33752ee601a18b73d2e41aa050f99961dfcf2be285580c44bb4eed
cc: @Atomheart-Father @dong-liuliu
Hi here @Atomheart-Father, @dong-liuliu and @yannisp, the machine I used had 8 x H100 80GiB and ~2TB of CPU RAM (out of which we used just a single H100 80GiB and ~1TB of CPU RAM). ~1TB of CPU RAM should be enough to load the model on CPU, and then what needs to fit in GPU VRAM are the hidden layers (126 in this case), which are processed sequentially by AutoAWQ. Also, it would be great to know where the OOM is coming from, i.e. is it OOM from CPU or OOM from GPU?
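To illustrate the point about the layers being processed sequentially, here is a rough conceptual sketch of that pattern (a toy module stack, not AutoAWQ's actual code): the full stack stays in CPU RAM and only the layer currently being calibrated/quantized occupies GPU VRAM.

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of decoder layers kept in CPU RAM.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])
calib_batch = torch.randn(2, 4096)

device = "cuda" if torch.cuda.is_available() else "cpu"

hidden = calib_batch
for i, layer in enumerate(layers):
    layer.to(device)  # only this layer occupies GPU VRAM
    with torch.no_grad():
        hidden = layer(hidden.to(device))  # calibrate / quantize this layer here
    layer.to("cpu")   # move it back so the next layer has room
    hidden = hidden.cpu()
    print(f"processed layer {i + 1}/{len(layers)}")
```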
Thank you, this is very helpful. Just to confirm, did you use version 0.2.5 or maybe an older one, like 0.2.4 (I mean for autoawq)?
The OOM happens not with your code but with the code I shared, during the quantization process; it comes from the GPU and it's a bit random (it can happen at 22/126 or 40/126 of the quantization process).
Oh, you're right, the version is not pinned. I used AutoAWQ v0.2.5 (see release notes). As for transformers and accelerate, I guess they're not that relevant here, but for context I used transformers 4.43.0 (see release notes) and accelerate 0.32.0 (see release notes).
Additionally, I used CUDA 12.1 and PyTorch 2.2.1.
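If it helps with reproducing this, a quick way to check that your environment matches those versions (the pins below are just the ones listed above):

```python
from importlib.metadata import version

import torch

# Versions reported above to work for the 405B quantization.
expected = {"autoawq": "0.2.5", "transformers": "4.43.0", "accelerate": "0.32.0"}

for pkg, want in expected.items():
    print(f"{pkg}: installed {version(pkg)}, expected {want}")

print(f"torch: {torch.__version__}, CUDA: {torch.version.cuda}")
```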
Thank you, this is helpful. I will post how it all goes!
Hi here @yannisp, any update? Is there anything I can help you with?
So the patch above works: https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4/discussions/13#670625a95e3dee9d671d04c8
Hope it helps someone else too!
Thank you @alvarobartt for helping out.