How to load this model?
Hi, I have tried to load this model in various ways, but all attempts have failed.
I am using an A100 GPU.
Could I get some help, please?
Attempt 1
model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
trust_remote_code=True,
)
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([40960, 1]) from checkpoint, the shape in current model is torch.Size([16, 5120]).
Attempt 2
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
I have also tried the following parameters (combined as sketched below), but I got the same error.
bnb_4bit_quant_type="nf4"
bnb_4bit_compute_dtype=torch.bfloat16
bnb_4bit_compute_dtype=torch.float16
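Combined into one config, that looks roughly like this (a sketch of the combination, not the verbatim code I ran):

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of the combined settings; the same "Blockwise quantization only
# supports 16/32-bit floats, but got torch.uint8" error is raised either way.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # also tried torch.float16
)
```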
Attempt 3
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.50 GiB. GPU 0 has a total capacity of 79.15 GiB of which 79.62 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.43 GiB is allocated by PyTorch, and 236.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This attempt does seem to start loading the model in 8-bit, but it runs out of GPU memory.
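As a side note, the allocator setting mentioned at the end of that error message would be set like this (just a sketch; it only helps if the failure is fragmentation rather than the model genuinely not fitting):

```python
import os

# Allocator hint suggested by the error message above.
# Must be set before the first CUDA allocation (ideally before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402
```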
Hi @lechga, since this is a VLM, you can load it using Llama4ForConditionalGeneration:
import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"
model = Llama4ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Hi, I'm a little bit confused about the structure inside the safetensors files.
There is a weight named `language_model.model.layers.0.feed_forward.experts.0.gate_proj.weight`. But in the transformers GitHub repo, inside modeling_llama4.py, there is no weight with that name. Instead, there is `language_model.model.layers.0.feed_forward.experts.gate_up_proj`.
It seems that you split the `gate_up_proj` weight. Will it still work?
Yes, we split the experts: both `gate_up_proj` and `down_proj` are split specifically for quantization, and it still works because of the logic we implemented here: https://github.com/huggingface/transformers/blob/main/src/transformers/quantizers/base.py#L305
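Roughly, the idea of the split looks like the sketch below. This is not the exact conversion code from the repo, and the tensor shapes are illustrative assumptions; only the per-expert key names match the checkpoint layout mentioned above.

```python
import torch

# Small illustrative sizes, not the real model dimensions.
num_experts, hidden, expert_dim = 4, 64, 128

# Fused parameter as in modeling_llama4.py: (num_experts, hidden, 2 * expert_dim).
gate_up_proj = torch.randn(num_experts, hidden, 2 * expert_dim)

per_expert = {}
for i in range(num_experts):
    # Split the fused tensor into gate / up halves for this expert.
    gate, up = gate_up_proj[i].chunk(2, dim=-1)  # each (hidden, expert_dim)
    # Store them in the (out_features, in_features) layout of a regular nn.Linear weight.
    per_expert[f"experts.{i}.gate_proj.weight"] = gate.T.contiguous()
    per_expert[f"experts.{i}.up_proj.weight"] = up.T.contiguous()

# At load time, the quantizer logic linked above maps these split, quantized
# tensors back onto the fused modules in modeling_llama4.py.
```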
Thank you for your response @medmekk.
When I run the code you provided, the following error occurs.
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
So I used multiple GPUs (3 × A100) and set `llm_int8_enable_fp32_cpu_offload=True`, but the same error still occurs.
import os

# CUDA_VISIBLE_DEVICES has to be set before torch initializes CUDA,
# so it is set before importing torch here.
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6,7"

import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, BitsAndBytesConfig

model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"
quantization_config = BitsAndBytesConfig(
    llm_int8_enable_fp32_cpu_offload=True,
)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
)
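For reference, my understanding of the "custom device_map" that the error message mentions is something like the sketch below, using `max_memory` limits so accelerate builds the map itself. Is that the intended approach? (The memory limits here are placeholder guesses, not values I have verified.)

```python
from transformers import BitsAndBytesConfig, Llama4ForConditionalGeneration

# Placeholder per-device limits; the remainder of the model would be offloaded to CPU RAM.
max_memory = {0: "70GiB", 1: "70GiB", 2: "70GiB", "cpu": "200GiB"}

model = Llama4ForConditionalGeneration.from_pretrained(
    "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit",
    device_map="auto",
    max_memory=max_memory,  # lets accelerate compute a custom device map within these limits
    quantization_config=BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True),
)
```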