---
license: apache-2.0
language:
- en
base_model:
- openai/clip-vit-large-patch14-336
- Qwen/Qwen2-7B
pipeline_tag: image-text-to-text
tags:
- multimodal
---

<img src="molmo_logo.png" alt="Logo for the Molmo Project" style="width: auto; height: 50px;">

# Molmo 7B-D

Molmo is a family of open vision-language models developed by the Allen Institute for AI. Molmo models are trained on PixMo, a dataset of 1 million highly curated image-text pairs. They achieve state-of-the-art performance among multimodal models of similar size while being fully open-source. You can find all models in the Molmo family [here](https://huggingface.co/collections/allenai/molmo-66f379e6fe3b8ef090a8ca19).

Molmo 7B-D is based on [Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B) and uses [OpenAI CLIP](https://huggingface.co/openai/clip-vit-large-patch14-336) as its vision backbone.
Its performance falls comfortably between that of GPT-4V and GPT-4o on both academic benchmarks and human evaluation.

This checkpoint is a **preview** of the Molmo release. All artifacts used in creating Molmo (PixMo dataset, training code, evaluations, intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.

**[Sign up here](https://docs.google.com/forms/d/e/1FAIpQLSdML1MhNNBDsCHpgWG65Oydg2SjZzVasyqlP08nBrWjZp_c7A/viewform)** to be the first to know when artifacts are released.

## Quick Start

To run Molmo, first install dependencies:

```bash
pip install einops tensorflow torchvision
```

Then, follow these steps:

```python
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# load the processor
processor = AutoProcessor.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# load the model
model = AutoModelForCausalLM.from_pretrained(
    'allenai/Molmo-7B-D-0924',
    trust_remote_code=True,
    torch_dtype='auto',
    device_map='auto'
)

# process the image and text
inputs = processor.process(
    images=[Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)],
    text="Describe this image."
)

# move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# generate output; maximum 200 new tokens; stop generation when <|endoftext|> is generated
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer
)

# only get generated tokens; decode them to text
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

# print the generated text
print(generated_text)

# >>> This photograph captures an adorable black Labrador puppy sitting on a weathered
#     wooden deck. The deck's planks, which are a mix of light and dark brown with ...
```
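The two shape-handling steps in the Quick Start (adding a batch dimension, then slicing the prompt off the output) may be easier to follow with plain Python lists standing in for tensors. The sketch below is illustrative only: `add_batch_dim` and `strip_prompt` are hypothetical helper names, not part of Molmo or `transformers`, and the token IDs are made up. It assumes, as the slicing in the Quick Start implies, that the generated output contains the prompt tokens followed by the new tokens.

```python
# Plain-list stand-ins for the tensor operations used in the Quick Start.
# Helper names and token IDs are hypothetical, for illustration only.

def add_batch_dim(sequence):
    """Mimic `tensor.unsqueeze(0)`: wrap one sequence into a batch of size 1."""
    return [sequence]


def strip_prompt(batch_output, prompt_len, index=0):
    """Mimic `output[index, prompt_len:]`: keep only the newly generated tokens."""
    return batch_output[index][prompt_len:]


prompt_ids = [101, 7, 8, 9]          # made-up prompt token IDs
batch = add_batch_dim(prompt_ids)    # "shape" (1, 4)

# Assumed output layout: each row is the prompt followed by the continuation.
new_ids = [42, 43, 2]                # made-up generated token IDs
batch_output = [batch[0] + new_ids]  # "shape" (1, 7)

generated = strip_prompt(batch_output, prompt_len=len(prompt_ids))
print(generated)  # [42, 43, 2]
```

In the real code, `inputs['input_ids'].size(1)` plays the role of `prompt_len`, so only the continuation is passed to the tokenizer for decoding.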

## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use.
For more information, please see our [Responsible Use Guidelines](https://allenai.org/responsible-use).