How to load this model?

#2
by lechga - opened

Hi, I have tried to load this model in various ways, but all attempts have failed.

I am using an A100 GPU.

Could I get some help, please?

Attempt 1

model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)
RuntimeError: Error(s) in loading state_dict for Linear:
    size mismatch for weight: copying a param with shape torch.Size([40960, 1]) from checkpoint, the shape in current model is torch.Size([16, 5120]).

Attempt 2

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8

I have tried the following parameters, but I got the same result each time (see the side note after this list).

  • bnb_4bit_quant_type="nf4"
  • bnb_4bit_compute_dtype=torch.bfloat16
  • bnb_4bit_compute_dtype=torch.float16
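
(Side note: the error above suggests the checkpoint weights are already stored as quantized uint8 tensors. One way to check which quantization settings the repo already ships, assuming they are stored in its config.json as pre-quantized bnb repos usually do, is:)

from transformers import AutoConfig

model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"

# Pre-quantized repos typically embed their BitsAndBytes settings in config.json;
# if present, the loaded config exposes them as a plain dict.
config = AutoConfig.from_pretrained(model_name)
print(getattr(config, "quantization_config", None))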

Attempt 3

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 2.50 GiB. GPU 0 has a total capacity of 79.15 GiB of which 79.62 MiB is free. Including non-PyTorch memory, this process has 79.06 GiB memory in use. Of the allocated memory 78.43 GiB is allocated by PyTorch, and 236.64 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This attempt does seem to load the model in 8-bit, but it then runs out of GPU memory.
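
(For rough intuition, assuming the often-quoted total of about 109B parameters for Scout, an 8-bit copy of the weights alone would not fit on a single 80 GB A100:)

# Back-of-the-envelope estimate; the ~109B total parameter count (17B active x 16 experts)
# is an assumption based on Meta's published figures, not taken from this repo.
total_params = 109e9
bytes_per_param_8bit = 1
print(f"~{total_params * bytes_per_param_8bit / 1024**3:.0f} GiB")  # ~101 GiB, more than one 80 GiB A100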

Bitsandbytes Community org

Hi @lechga, since this is a VLM, you can load it using Llama4ForConditionalGeneration:

import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration

model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"

model = Llama4ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "What are we having for dinner?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Hi, I'm a little confused about the structure inside the safetensors files.

There is a weight named language_model.model.layers.0.feed_forward.experts.0.gate_proj.weight, but in the transformers GitHub repo, inside modeling_llama4.py, there is no weight with that name. Instead, there is language_model.model.layers.0.feed_forward.experts.gate_up_proj.

It seems that you split the gate_up_proj weight. Will it still work?

Bitsandbytes Community org

Yes, we split the experts: both gate_up_proj and down_proj are split specifically for quantization, and it still works because of the logic we implemented here: https://github.com/huggingface/transformers/blob/main/src/transformers/quantizers/base.py#L305
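
For anyone curious, here is a rough sketch of what that split looks like conceptually. The dimensions, the gate-first/up-second ordering, and the exact weight orientation below are my own assumptions for illustration, not code from this repo:

import torch

# Tiny illustrative dimensions only, not the real Scout sizes.
num_experts, hidden_size, intermediate_size = 4, 8, 16

# modeling_llama4.py keeps one fused tensor per layer, e.g.
# ...feed_forward.experts.gate_up_proj, assumed here to have shape
# (num_experts, hidden_size, 2 * intermediate_size).
gate_up_proj = torch.randn(num_experts, hidden_size, 2 * intermediate_size)

# Splitting it per expert gives keys like the ones in this repo's safetensors,
# e.g. ...experts.0.gate_proj.weight / ...experts.0.up_proj.weight, so each
# per-expert matrix can be quantized like an ordinary Linear weight
# (ignoring the exact transpose convention of nn.Linear weights here).
split_state_dict = {}
for e in range(num_experts):
    gate, up = gate_up_proj[e].chunk(2, dim=-1)  # assumes gate is the first half
    split_state_dict[f"experts.{e}.gate_proj.weight"] = gate
    split_state_dict[f"experts.{e}.up_proj.weight"] = up

# At load time, the quantizer logic linked above maps the split keys back onto
# the fused module, which is why generation still works.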

Thank you for your response, @medmekk.

When I run the code you provided, the following error occurs.

ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set `llm_int8_enable_fp32_cpu_offload=True` and pass a custom `device_map` to `from_pretrained`. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details. 

So I used multiple GPUs (3 × A100) and set llm_int8_enable_fp32_cpu_offload=True.
But the same error still occurs.

import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration, BitsAndBytesConfig
import os


os.environ["CUDA_VISIBLE_DEVICES"] = "5,6,7"

model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"

quantization_config = BitsAndBytesConfig(
    llm_int8_enable_fp32_cpu_offload=True,
)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quantization_config,
)
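
If it helps while debugging, the error message itself points at passing a custom device map. A minimal sketch of that direction, assuming you want to cap per-GPU usage explicitly with max_memory (the limits below are made-up placeholders, not verified for this model), would look like:

import torch
from transformers import Llama4ForConditionalGeneration

model_name = "bnb-community/Llama-4-Scout-17B-16E-Instruct-bnb-4bit"

# Hypothetical per-device caps; adjust to the memory actually free on your GPUs.
# The idea is to leave enough headroom across the visible GPUs so that "auto"
# never spills a module onto the CPU or disk, which is what triggers the
# ValueError above.
max_memory = {0: "70GiB", 1: "70GiB", 2: "70GiB"}

model = Llama4ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    max_memory=max_memory,
)

Whether this resolves the dispatch error depends on how much memory is actually free on the GPUs you expose; it is only the direction the error message suggests, not a confirmed fix.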