How to load this model?
Hi, I have tried to load this model in various ways, but all attempts have failed.
I am using an A100 GPU.
Could I get some help, please?
Attempt 1
model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
trust_remote_code=True,
)
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([40960, 1]) from checkpoint, the shape in current model is torch.Size([16, 5120]).
Attempt 2
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
I have also tried the following parameters (combined as sketched below), but I got the same error.
bnb_4bit_quant_type="nf4"
bnb_4bit_compute_dtype=torch.bfloat16
bnb_4bit_compute_dtype=torch.float16
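Combined into one config, that looks roughly like this (a sketch of the combination, not the verbatim code I ran):

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of the combined settings; the same "Blockwise quantization only
# supports 16/32-bit floats, but got torch.uint8" error is raised either way.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # also tried torch.float16
)
```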
Attempt 3
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.50 GiB. GPU 0 has a total capacity of 79.15 GiB of which 79.62 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.43 GiB is allocated by PyTorch, and 236.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This attempt does seem to start loading the model in 8-bit, but it runs out of GPU memory.
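As a side note, the allocator setting mentioned at the end of that error message would be set like this (just a sketch; it only helps if the failure is fragmentation rather than the model genuinely not fitting):

```python
import os

# Allocator hint suggested by the error message above.
# Must be set before the first CUDA allocation (ideally before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402
```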
Hi @lechga, since this is a VLM, you can load it using Llama4ForConditionalGeneration:
import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration
model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"
model = Llama4ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Hi, I'm a little bit confused about the structure inside the safetensors files.
There is a weight named `language_model.model.layers.0.feed_forward.experts.0.gate_proj.weight`. But in the transformers GitHub repo, inside modeling_llama4.py, there is no weight with that name. Instead, there is `language_model.model.layers.0.feed_forward.experts.gate_up_proj`.
It seems that you split the `gate_up_proj` weight. Will it still work?
Yes, we split the experts: both `gate_up_proj` and `down_proj` are split specifically for quantization, and it still works because of the logic we implemented here: https://github.com/huggingface/transformers/blob/main/src/transformers/quantizers/base.py#L305
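Roughly, the idea of the split looks like the sketch below. This is not the exact conversion code from the repo, and the tensor shapes are illustrative assumptions; only the per-expert key names match the checkpoint layout mentioned above.

```python
import torch

# Small illustrative sizes, not the real model dimensions.
num_experts, hidden, expert_dim = 4, 64, 128

# Fused parameter as in modeling_llama4.py: (num_experts, hidden, 2 * expert_dim).
gate_up_proj = torch.randn(num_experts, hidden, 2 * expert_dim)

per_expert = {}
for i in range(num_experts):
    # Split the fused tensor into gate / up halves for this expert.
    gate, up = gate_up_proj[i].chunk(2, dim=-1)  # each (hidden, expert_dim)
    # Store them in the (out_features, in_features) layout of a regular nn.Linear weight.
    per_expert[f"experts.{i}.gate_proj.weight"] = gate.T.contiguous()
    per_expert[f"experts.{i}.up_proj.weight"] = up.T.contiguous()

# At load time, the quantizer logic linked above maps these split, quantized
# tensors back onto the fused modules in modeling_llama4.py.
```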
Thank you for your response @medmekk.
When I run the code you provided, the following error occurs.
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
So I used multiple GPUs (3 × A100) and set `llm_int8_enable_fp32_cpu_offload=True`, but the same error still occurs.
import os

# CUDA_VISIBLE_DEVICES has to be set before torch initializes CUDA,
# so it is set before importing torch here.
os.environ["CUDA_VISIBLE_DEVICES"] = "5,6,7"

import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, BitsAndBytesConfig

model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"
quantization_config = BitsAndBytesConfig(
    llm_int8_enable_fp32_cpu_offload=True,
)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
)
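For reference, my understanding of the "custom device_map" that the error message mentions is something like the sketch below, using `max_memory` limits so accelerate builds the map itself. Is that the intended approach? (The memory limits here are placeholder guesses, not values I have verified.)

```python
from transformers import BitsAndBytesConfig, Llama4ForConditionalGeneration

# Placeholder per-device limits; the remainder of the model would be offloaded to CPU RAM.
max_memory = {0: "70GiB", 1: "70GiB", 2: "70GiB", "cpu": "200GiB"}

model = Llama4ForConditionalGeneration.from_pretrained(
    "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit",
    device_map="auto",
    max_memory=max_memory,  # lets accelerate compute a custom device map within these limits
    quantization_config=BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True),
)
```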