Error when quantizing my finetuned 405B model using AutoAWQ
Update transformers to 4.43.x:

```bash
pip install -U transformers
```

and add `device_map="cuda"` or `device_map="auto"`:

```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False, device_map="cuda"
)
```
Hi here @Atomheart-Father, thanks for opening this issue! May I ask which autoawq version you have installed? I believe there was a recent release, https://github.com/casper-hansen/AutoAWQ/releases/tag/v0.2.6, adding batched quantization (https://github.com/casper-hansen/AutoAWQ/pull/516), which may have something to do with the device placement issue you mentioned above.

So AFAIK, if that's the case and you have AutoAWQ 0.2.6 installed, there are two solutions: either downgrade to AutoAWQ 0.2.5, or run the quantization script with CUDA_VISIBLE_DEVICES=0 (assuming 0 is the index of the GPU you want to use to quantize the model). But since the issue is apparently with the cpu and cuda:0 devices, the best option may be to downgrade to 0.2.5 in the meantime. If that does solve the issue, I'd recommend you open an issue with detailed information about it in https://github.com/casper-hansen/AutoAWQ/issues, since if that's the case it may affect other users too.
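For anyone who prefers to pin the GPU from inside the quantization script rather than on the command line, here is a minimal sketch of that workaround (the model path is a placeholder, and `low_cpu_mem_usage=True` / `use_cache=False` are taken from the snippet above):

```python
import os

# Equivalent to launching with CUDA_VISIBLE_DEVICES=0: must be set before
# torch/awq initialize CUDA, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/finetuned-llama-3.1-405b"  # placeholder

model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# The other workaround mentioned above is simply: pip install autoawq==0.2.5
```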
I am using awq 0.2.5 and still get this exception. CUDA_VISIBLE_DEVICES=0 does not solve the problem...
I've hit the same issue as @Atomheart-Father; I also tried different device_map settings, but none of them solved the issue.
Hi @alvarobartt, I have the exact same issue as the users above. The code provided in the model card works with neither autoawq==0.2.5 nor the latest 0.2.6.

I have access to an 8xA100 80G machine with plenty of CPU RAM. So my question is: how did you do it? I mean, what machine did you use and which exact package versions? Trying to reverse engineer what changes autoawq might have made is a really long process. Thank you.

Btw, I have tried playing with max_memory as in the code below, but that still fails during quantization with OOM (just mentioning it to save other people's time):
```python
import fire
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer


def main(model_path, quant_path):
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    tokenizer.save_pretrained(quant_path)

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        # safetensors=True,
        device_map="auto",
        max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", 4: "20GiB",
                    5: "20GiB", 6: "20GiB", 7: "20GiB", "cpu": "900GiB"},
        torch_dtype=torch.float16,
        # offload_folder="offload"
    )

    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)

    # Save quantized model
    model.save_quantized(quant_path, safetensors=False)


if __name__ == '__main__':
    fire.Fire(main)
```
(I don't want to get anyone's hopes up, because it's still running and might crash at any moment, as awq often does), but this line here seems to avoid the above issue: https://github.com/casper-hansen/AutoAWQ/compare/main...davedgd:AutoAWQ:patch-1#diff-5ea134b0db33752ee601a18b73d2e41aa050f99961dfcf2be285580c44bb4eed
cc: @Atomheart-Father @dong-liuliu
Hi here @Atomheart-Father, @dong-liuliu and @yannisp, the machine I used had 8 x H100 80GiB and ~2TB of CPU RAM (out of which we used just a single H100 80GiB and ~1TB of CPU RAM). ~1TB of CPU RAM should be enough to load the model on CPU, and then what needs to fit in GPU VRAM are the hidden layers (126 in this case), which are processed sequentially by AutoAWQ. Also, it would be great to know where the OOM is coming from, i.e. is it OOM from CPU or OOM from GPU?
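To illustrate the point about the layers being processed sequentially, here is a rough conceptual sketch of that pattern (a toy module stack, not AutoAWQ's actual code): the full stack stays in CPU RAM and only the layer currently being calibrated/quantized occupies GPU VRAM.

```python
import torch
import torch.nn as nn

# Toy stand-in for a stack of decoder layers kept in CPU RAM.
layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(8)])
calib_batch = torch.randn(2, 4096)

device = "cuda" if torch.cuda.is_available() else "cpu"

hidden = calib_batch
for i, layer in enumerate(layers):
    layer.to(device)  # only this layer occupies GPU VRAM
    with torch.no_grad():
        hidden = layer(hidden.to(device))  # calibrate / quantize this layer here
    layer.to("cpu")   # move it back so the next layer has room
    hidden = hidden.cpu()
    print(f"processed layer {i + 1}/{len(layers)}")
```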
Thank you, this is very helpful. Just to confirm, did you use version 0.2.5 or maybe an older one, like 0.2.4 (I mean for autoawq)?
The OOM happens not with your code but with the code I shared, during the quantization process; it comes from the GPU and it's a bit random (it can happen at 22/126 or 40/126 of the quantization process).
Oh, you're right, the version is not pinned. I used AutoAWQ v0.2.5 (see release notes). As for transformers and accelerate, I guess they're not that relevant here, but for context I used transformers 4.43.0 (see release notes) and accelerate 0.32.0 (see release notes).
Additionally, I used CUDA 12.1 and PyTorch 2.2.1.
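If it helps with reproducing this, a quick way to check that your environment matches those versions (the pins below are just the ones listed above):

```python
from importlib.metadata import version

import torch

# Versions reported above to work for the 405B quantization.
expected = {"autoawq": "0.2.5", "transformers": "4.43.0", "accelerate": "0.32.0"}

for pkg, want in expected.items():
    print(f"{pkg}: installed {version(pkg)}, expected {want}")

print(f"torch: {torch.__version__}, CUDA: {torch.version.cuda}")
```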
Thank you, this is helpful. I will post how it all goes!
Hi here @yannisp, any update? Is there anything I can help you with?
So the patch above works: https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4/discussions/13#670625a95e3dee9d671d04c8
Hope it helps someone else too!
Thank you @alvarobartt for helping out.