Model size when loaded with bf16 and nf4

#81 by mdmy

I initially load the model in bf16:

import torch
from transformers import Qwen2VLForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2" if device == "cuda" else "eager",
    device_map=device,
)
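
To sanity-check the 2-bytes-per-parameter arithmetic, I can count the parameters the loaded model actually has. A minimal sketch run right after the call above (sum over model.parameters() and get_memory_footprint() are standard torch/transformers calls; the footprint covers weights and buffers only):

# Compare the measured weight footprint with the 2-bytes-per-parameter estimate.
# get_memory_footprint() sums parameters and buffers only; it does not include
# the CUDA context or any activation/workspace memory.
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters:       {n_params / 1e9:.2f}B")
print(f"bf16 estimate:    {n_params * 2 / 1e9:.2f} GB")
print(f"measured weights: {model.get_memory_footprint() / 1e9:.2f} GB")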

Later on I load it with NF4:

from transformers import BitsAndBytesConfig

# BitsAndBytesConfig 4-bit (NF4) config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the model with 4-bit quantization
model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
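
To see how much of the model actually ends up in 4-bit, one can walk the modules and check which layers were replaced by bitsandbytes' Linear4bit. This is a rough sketch; the exact set of modules left unquantized depends on the model and the integration defaults:

import bitsandbytes as bnb

# Count weights stored in 4-bit vs. weights kept at higher precision.
# bitsandbytes only replaces nn.Linear layers; embeddings, norms, and any
# explicitly skipped modules keep their original dtype.
quant_params, other_params = 0, 0
for module in model.modules():
    if isinstance(module, bnb.nn.Linear4bit):
        # weight.numel() would report the packed 4-bit storage, so use the
        # logical weight shape instead (the bias, if any, stays unquantized)
        quant_params += module.in_features * module.out_features
        other_params += module.bias.numel() if module.bias is not None else 0
    else:
        other_params += sum(p.numel() for p in module.parameters(recurse=False))
print(f"4-bit params: {quant_params / 1e9:.2f}B, other params: {other_params / 1e9:.2f}B")
print(f"measured footprint: {model.get_memory_footprint() / 1e9:.2f} GB")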

The bf16 model occupies 16 GB of GPU memory and the NF4 model occupies 6 GB. I am curious why these don't match the rough estimates of 14 GB (7B × 2 bytes) and 3.5 GB (7B × 0.5 bytes). What is causing the difference?
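
To separate the weight footprint from other GPU memory, something along these lines should help (assuming a CUDA device; these are standard torch calls, not specific to Qwen2-VL):

import torch

# nvidia-smi reports the whole process footprint, which also includes the CUDA
# context and memory the caching allocator has reserved but not yet released.
weights   = model.get_memory_footprint() / 1e9    # parameters + buffers only
allocated = torch.cuda.memory_allocated() / 1e9   # live tensors on the GPU
reserved  = torch.cuda.memory_reserved() / 1e9    # allocator's cached pool
print(f"weights: {weights:.2f} GB, allocated: {allocated:.2f} GB, reserved: {reserved:.2f} GB")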
