Model size when loaded with bf16 and nf4
#81 · opened by mdmy
I initially load the model in bf16:
import torch
from transformers import Qwen2VLForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# bf16: 2 bytes per parameter
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2" if device == "cuda" else "eager",
    device_map=device,
)
Later I load the same model with NF4:
from transformers import BitsAndBytesConfig

# BitsAndBytesConfig NF4 (4-bit) config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the quantized model
model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
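To see how the weights are actually stored in each case, I also dump the parameter bytes grouped by dtype (a quick sketch; model here is whichever of the two variants was just loaded):

from collections import Counter

# Group parameter storage by dtype; with NF4 the quantized Linear weights
# show up as packed uint8, while anything bitsandbytes skips stays in bf16.
bytes_by_dtype = Counter()
for name, p in model.named_parameters():
    bytes_by_dtype[str(p.dtype)] += p.numel() * p.element_size()

for dtype, nbytes in sorted(bytes_by_dtype.items()):
    print(f"{dtype}: {nbytes / 1024**3:.2f} GiB")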
The bf16 model occupies about 16 GB of GPU memory and the NF4 model about 6 GB. I am curious why they don't match the naive estimates of 14 GB (7B × 2 bytes) and 3.5 GB (7B × 0.5 bytes). What is causing the difference?
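For reference, here is a minimal sketch of how I check the footprint programmatically (assuming the model is already loaded on a CUDA device):

import torch

# Weight + buffer storage as reported by transformers
print(f"get_memory_footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")

# What the CUDA caching allocator has actually handed out / reserved;
# this also counts non-weight allocations made during inference
print(f"memory_allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")
print(f"memory_reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")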