README.md · moondream/moondream-2b-2025-04-14-4bit at 48640e9ea1b6d4aa17abd837db93974db9048ed9

metadata

license: apache-2.0
pipeline_tag: image-text-to-text

Moondream is a small vision language model designed to run efficiently everywhere.

Website / Demo / GitHub

This repository contains the 2025-04-14 int4 release of Moondream, as well as historical releases. The model is updated frequently, so we recommend pinning a specific revision if you're using it in a production application.
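A minimal sketch of pinning a revision, assuming the standard `revision` argument of `from_pretrained` in Hugging Face Transformers. The helper names `pinned_kwargs` and `load_pinned` are hypothetical and not part of Moondream; the revision string is whatever commit hash, tag, or branch you have validated.

```python
MODEL_ID = "moondream/moondream-2b-2025-04-14-4bit"


def pinned_kwargs(revision: str) -> dict:
    """Build from_pretrained keyword arguments that pin one exact revision."""
    if not revision:
        raise ValueError("revision must be a non-empty commit hash, tag, or branch")
    return {
        "revision": revision,        # commit hash, tag, or branch on the Hub
        "trust_remote_code": True,   # Moondream ships custom modeling code
        "device_map": {"": "cuda"},  # same placement as the usage example
    }


def load_pinned(revision: str):
    """Load the model at a fixed revision (downloads weights on first call)."""
    # Imported lazily so pinned_kwargs() can be used without transformers installed.
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(MODEL_ID, **pinned_kwargs(revision))
```

Calling `load_pinned("<commit-hash-or-tag>")` with a full commit hash is the strictest choice, since tags and branches can be moved later.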

Make sure to install the requirements:

pip install -r https://depot.moondream.ai/transformers/requirements.txt

Usage

from transformers import AutoModelForCausalLM
from PIL import Image

model = AutoModelForCausalLM.from_pretrained(
    "moondream/moondream-2b-2025-04-14-4bit",
    trust_remote_code=True,
    # Load the model onto the GPU.
    device_map={"": "cuda"},
)

# Placeholder path: replace with the image you want to analyze.
image = Image.open("path/to/image.jpg")

# Captioning
print("Short caption:")
print(model.caption(image, length="short")["caption"])

print("\nNormal caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    # Streaming generation example, supported for caption() and query()
    print(t, end="", flush=True)
print()  # end the streamed line
# Non-streaming equivalent of the call above
print(model.caption(image, length="normal")["caption"])

# Visual Querying
print("\nVisual query: 'How many people are in the image?'")
print(model.query(image, "How many people are in the image?")["answer"])

# Object Detection
print("\nObject detection: 'face'")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")

# Pointing
print("\nPointing: 'person'")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")
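The results of `detect()` and `point()` can be mapped back onto the image. The sketch below assumes (this is not stated above) that each detected object is a dict with `x_min`, `y_min`, `x_max`, `y_max`, that each point is a dict with `x`, `y`, and that all values are normalized to the 0 to 1 range. The helper names are hypothetical.

```python
def box_to_pixels(obj: dict, width: int, height: int) -> tuple:
    """Scale one normalized box to integer pixels as (left, top, right, bottom)."""
    return (
        round(obj["x_min"] * width),
        round(obj["y_min"] * height),
        round(obj["x_max"] * width),
        round(obj["y_max"] * height),
    )


def point_to_pixels(pt: dict, width: int, height: int) -> tuple:
    """Scale one normalized point to integer pixels as (x, y)."""
    return (round(pt["x"] * width), round(pt["y"] * height))
```

With a PIL image, `image.size` is `(width, height)`, so `[box_to_pixels(o, *image.size) for o in objects]` would yield boxes suitable for cropping or drawing, provided the assumed keys match what the model returns.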

Changelog

int4-2025-04-15 (full release notes)

Moondream uses much less memory (down from 4.12 GB to 2.47 GB)
Small devices get a big speedup (44.54 to 67.84 tok/sec on an RTX 4050 Mobile)
Improved spatial understanding (RealWorldQA up from 58.3 to 60.13)

2025-04-15 (full release notes)

Improved chart understanding (ChartQA up from 74.8 to 77.5, 82.2 with PoT)
Added temperature and nucleus sampling to reduce repetitive outputs
Better OCR for documents and tables (prompt with “Transcribe the text” or “Transcribe the text in natural reading order”)
Object detection supports document layout detection (figure, formula, text, etc.)
UI understanding (ScreenSpot F1@0.5 up from 53.3 to 60.3)
Improved text understanding (DocVQA up from 76.5 to 79.3, TextVQA up from 74.6 to 76.3)
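The two OCR prompts mentioned above can be wrapped in a small helper. This is a sketch only: `transcribe` is a hypothetical name, and it assumes the prompts are passed through `query()` in the same way as the visual query in the usage example.

```python
OCR_PROMPT = "Transcribe the text"
OCR_PROMPT_READING_ORDER = "Transcribe the text in natural reading order"


def transcribe(model, image, reading_order: bool = False) -> str:
    """Run OCR through query() using one of the two prompts from the changelog."""
    prompt = OCR_PROMPT_READING_ORDER if reading_order else OCR_PROMPT
    # query() returns a dict; the transcription is under the "answer" key.
    return model.query(image, prompt)["answer"]
```

For a multi-column page or a table, `transcribe(model, image, reading_order=True)` selects the reading-order prompt.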

2025-03-27 (full release notes)

Added support for long-form captioning
Open vocabulary image tagging
Improved counting accuracy (e.g. CountBenchQA increased from 80 to 86.4)
Improved text understanding (e.g. OCRBench increased from 58.3 to 61.2)
Improved object detection, especially for small objects (e.g. COCO up from 30.5 to 51.2)
Fixed token streaming bug affecting multi-byte unicode characters
gpt-fast style compile() now supported in HF Transformers implementation
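For background on the streaming fix, the sketch below shows the general problem: when text arrives in byte-level pieces, a multi-byte UTF-8 character can be split across two pieces, and decoding each piece on its own fails or produces replacement characters. This is an illustration using Python's standard incremental decoder, not Moondream's actual fix; `stream_text` is a hypothetical name.

```python
import codecs


def stream_text(byte_chunks):
    """Yield text from byte chunks, holding back incomplete UTF-8 sequences."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # Returns only the characters that are complete so far; a trailing
        # partial sequence stays buffered inside the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush at the end; raises UnicodeDecodeError if the stream was truncated.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```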