---
license: apache-2.0
pipeline_tag: image-text-to-text
---

Moondream is a small vision language model designed to run efficiently on edge devices.

[Website](https://moondream.ai/) / [Demo](https://moondream.ai/playground) / [GitHub](https://github.com/vikhyat/moondream)

This repository contains the latest (**2025-01-09**) release of Moondream, as well as historical releases. The model is updated frequently, so we recommend pinning a specific revision, as shown below, if you're using it in a production application.

**Usage**
```python
from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    # Uncomment to run on GPU.
    # device_map={"": "cuda"}
)

# Load the image you want to run the model on (placeholder path).
image = Image.open("path/to/image.jpg")

# Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])

print("\nNormal caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    # Streaming generation example, supported for caption() and detect()
    print(t, end="", flush=True)
print(model.caption(image, length="normal"))

# Visual Querying
print("\nVisual query: 'How many people are in the image?'")
print(model.query(image, "How many people are in the image?")["answer"])

# Object Detection
print("\nObject detection: 'face'")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")

# Pointing
print("\nPointing: 'person'")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")
```
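
The detection and pointing calls return coordinates you can draw back onto the image. Below is a minimal sketch of how that might look with PIL; it assumes the objects returned by `detect()` carry normalized `x_min`/`y_min`/`x_max`/`y_max` values and the points returned by `point()` carry normalized `x`/`y` values (check the repository for the exact output schema of the revision you're using), and it reuses the `model` and `image` objects from the snippet above.

```python
from PIL import ImageDraw

# Reuses `model` and `image` from the usage snippet above.
# Assumes coordinates are normalized to [0, 1]; verify against your revision.
objects = model.detect(image, "face")["objects"]
points = model.point(image, "person")["points"]

annotated = image.copy()
draw = ImageDraw.Draw(annotated)
w, h = annotated.size

for obj in objects:
    # Scale normalized box coordinates back to pixel space.
    draw.rectangle(
        (obj["x_min"] * w, obj["y_min"] * h, obj["x_max"] * w, obj["y_max"] * h),
        outline="red",
        width=3,
    )

for pt in points:
    # Draw a small marker at each pointed location.
    x, y = pt["x"] * w, pt["y"] * h
    draw.ellipse((x - 5, y - 5, x + 5, y + 5), fill="blue")

annotated.save("annotated.jpg")
```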