Molmo-7B-D-0924 4-Bit Quantization (NF4)
Model size (disk): 30GB original → 6.2GB
VRAM usage: ~7GB for the loaded model, up to ~10GB during inference (4K image input)
This model uses NF4 (4-bit NormalFloat) quantization while keeping key modules in FP16 to avoid deteriorating performance.
Compared to quantizing every module to 4-bit, this costs only a little extra VRAM and aims to strike a good performance/memory trade-off.
The model loads significantly faster than the original, making it suitable for serverless hosting.
It fits on a 12GB GPU for serving and leaves room for batching on a T4 (16GB).
How to run
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests
import torch
# Can also be a local path if you have already cloned the Hugging Face repo
MODEL_PATH = "Scoolar/Molmo-7B-D-0924-NF4"
# load the processor
processor = AutoProcessor.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto'
)
# load the model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
)
# process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)
# move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}
# Compute is done in float16, while most weights are NF4
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.float16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer
    )
# only get generated tokens; decode them to text
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
# print the generated text
print(generated_text)
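If you want to sanity-check the memory numbers from the top of this card on your own hardware, the snippet below (a minimal sketch, assuming the model, inputs, and CUDA device from the example above) prints the weight footprint and the peak VRAM allocated by the last generate call:

# Size of the loaded (quantized) weights as reported by transformers
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")
# Peak VRAM allocated so far (weights + activations of the generate call above)
print(f"Peak allocated:  {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
# Reset the peak counter if you want to measure a single call in isolation
torch.cuda.reset_peak_memory_stats()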
How was the model converted to NF4?
I decided to write this down since I would have been happy to have something like this, so enjoy :)
To convert the model, load the weights with the desired data types/quantization settings and save them again. This produces the quantized SafeTensors files along with some configuration files. All missing files can be copied from the original model repository; you only need to remove the local file path in config.json. The applied quantization strategy can also be inspected in config.json (under quantization_config).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Can also be a local path if you have already cloned the Hugging Face repo
MODEL_PATH = "allenai/Molmo-7B-D-0924"
YOUR_OUTPUT_PATH = "enter_local_model_output_path"
DEFAULT_DTYPE = torch.float16
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=DEFAULT_DTYPE,
    llm_int8_skip_modules=[
        # Module names can also be relative like "ff_norm" which would apply to all such layers
        "model.vision_backbone", "model.transformer.ff_out", "model.transformer.ln_f"
    ]
)
# load the model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=DEFAULT_DTYPE,
    quantization_config=nf4_config,
)
# Save model
model.save_pretrained(
    save_directory=YOUR_OUTPUT_PATH,
    safe_serialization=True,
    # Set a maximum shard size if you don't like the default
    max_shard_size="4GB"
)
Details
Inspired by observations from SeanScripts/Molmo-72B-0924-nf4, I experimented with keeping certain modules in FP16, particularly the vision_backbone. The vision backbone has relatively few parameters but deteriorates significantly under NF4. I also found that the transformer output layers (ff_out and the final norm ln_f, see llm_int8_skip_modules above) are crucial, whereas the other layer-norm layers within the transformer stack had no significant impact.
Layers can be easily inspected in model.safetensors.index.json or analyzed in more detail in modeling_molmo.py.
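To verify which modules actually ended up quantized and which were skipped, one option (a small sketch, assuming the quantized model is loaded as in the snippets above) is to check the linear layers against bitsandbytes' Linear4bit class:

import torch
import bitsandbytes as bnb

# NF4-quantized linear layers are replaced by bnb.nn.Linear4bit;
# anything still a plain torch.nn.Linear was skipped and stays in FP16
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and not isinstance(module, bnb.nn.Linear4bit):
        print(name, module.weight.dtype)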