How to use remyxai/SpaceMantis with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="remyxai/SpaceMantis")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
},
]
pipe(text=messages)
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("remyxai/SpaceMantis")
model = AutoModelForImageTextToText.from_pretrained("remyxai/SpaceMantis")
messages = [
{
"role": "user",
"content": [
{"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
{"type": "text", "text": "What animal is on the candy?"}
]
},
]
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
How to use remyxai/SpaceMantis with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "remyxai/SpaceMantis"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "remyxai/SpaceMantis",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
How to use remyxai/SpaceMantis with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "remyxai/SpaceMantis" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "remyxai/SpaceMantis",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in one sentence."
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}'
# Alternatively, start the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "remyxai/SpaceMantis" \
--host 0.0.0.0 \
--port 30000
# Call the server with the same curl request shown above.
How to use remyxai/SpaceMantis with Docker Model Runner:
docker model run hf.co/remyxai/SpaceMantis
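The curl requests above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send_chat_request` are hypothetical, and the base URL must match the server you started (port 8000 in the vLLM example, port 30000 in the SGLang example).

```python
import json
import urllib.request


def build_chat_request(base_url, model, question, image_url):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the text of the first returned choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("http://localhost:8000", "remyxai/SpaceMantis", "Describe this image in one sentence.", image_url))` should return the model's reply as a string.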
SpaceMantis fine-tunes Mantis-8B-siglip-llama3 for enhanced spatial reasoning.
The model was trained with a LoRA fine-tune on the spacellava dataset, which was built with VQASynth to enhance spatial reasoning as in SpatialVLM.
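To give a sense of why LoRA keeps fine-tuning cheap: it freezes a weight matrix of shape d_out × d_in and trains two small matrices of shapes d_out × r and r × d_in, whose product is added to the frozen weights. The sketch below counts trainable parameters for a single square projection. The hidden size of 4096 matches Llama-3-8B; the rank of 16 is an illustrative assumption, not the setting used to train SpaceMantis.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out            # parameters updated by full fine-tuning
    lora = rank * (d_in + d_out)   # parameters in the two low-rank factors
    return full, lora


full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(full, lora, lora / full)  # 16777216 131072 0.0078125
```

Under these assumptions, the adapter trains under 1% of the parameters of the matrix it modifies.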
This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM, which enhances the spatial reasoning of multimodal models. A pipeline of expert models infers spatial relationships between objects in a scene, and these relationships are used to create a VQA dataset for spatial reasoning.
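As an illustration of the final step of such a pipeline, the sketch below turns 3D object positions into a distance question and answer. It assumes that upstream expert models (for example, detection, segmentation, and depth estimation) have already produced a label and a metric 3D centroid for each object. The `SceneObject` type, the question template, and the helper names are hypothetical and are not taken from VQASynth.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    label: str
    centroid_m: tuple  # (x, y, z) in meters, assumed to come from upstream models


def distance_m(a, b):
    """Euclidean distance between two object centroids, in meters."""
    return math.dist(a.centroid_m, b.centroid_m)


def make_distance_qa(a, b):
    """Render one question/answer pair from a fixed template."""
    d = distance_m(a, b)
    return {
        "question": f"How far is the {a.label} from the {b.label}?",
        "answer": f"The {a.label} is about {d:.2f} meters from the {b.label}.",
    }


chair = SceneObject("chair", (0.0, 0.0, 2.0))
lamp = SceneObject("lamp", (3.0, 0.0, 6.0))
print(make_distance_qa(chair, lamp))
```

Repeating this over many object pairs and images, with varied templates, is one way to build up a spatial VQA dataset from unlabeled images.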
To run SpaceMantis, follow these steps:
import torch
from PIL import Image
# Assumes the Mantis codebase, which provides models.mllava, is importable
from models.mllava import MLlavaProcessor, LlavaForConditionalGeneration, chat_mllava

# Load the model and processor
attn_implementation = None  # or "flash_attention_2"
processor = MLlavaProcessor.from_pretrained("remyxai/SpaceMantis")
model = LlavaForConditionalGeneration.from_pretrained(
    "remyxai/SpaceMantis",
    device_map="cuda",
    torch_dtype=torch.float16,
    attn_implementation=attn_implementation,
)

# Greedy decoding: a single beam and no sampling
generation_kwargs = {
    "max_new_tokens": 1024,
    "num_beams": 1,
    "do_sample": False
}

# Function to run inference
def run_inference(image_path, content):
    # Load the image and normalize it to RGB
    image = Image.open(image_path).convert("RGB")
    # chat_mllava expects a list of PIL images
    images = [image]
    # Run the inference; the returned chat history is not used here
    response, history = chat_mllava(content, images, model, processor, **generation_kwargs)
    return response

# Example usage
image_path = "path/to/your/image.jpg"
content = "Your question here."
response = run_inference(image_path, content)
print("Response:", response)
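If the answers feed downstream code, a numeric estimate can be pulled out of the free-text response. The sketch below is one way to do this, assuming the response states a distance as a number followed by a unit (for example, "about 1.5 meters"). The phrasing of model responses is not guaranteed, and `extract_distance_m` is a hypothetical helper, not part of Mantis.

```python
import re

# Conversion factors to meters for the units this sketch recognizes
_UNIT_TO_METERS = {
    "m": 1.0, "meter": 1.0, "meters": 1.0,
    "cm": 0.01, "centimeter": 0.01, "centimeters": 0.01,
    "ft": 0.3048, "foot": 0.3048, "feet": 0.3048,
    "inch": 0.0254, "inches": 0.0254,
}

# Longest unit names first so "meters" is tried before "m"
_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*("
    + "|".join(sorted(_UNIT_TO_METERS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_distance_m(response):
    """Return the first distance found in the text, in meters, or None."""
    match = _PATTERN.search(response)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    return value * _UNIT_TO_METERS[unit]
```

For example, `extract_distance_m(response)` returns `None` when the response contains no recognizable distance, so callers can fall back to the raw text.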
@article{chen2024spatialvlm,
title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
journal = {arXiv preprint arXiv:2401.12168},
year = {2024},
url = {https://arxiv.org/abs/2401.12168},
}
@article{jiang2024mantis,
title={MANTIS: Interleaved Multi-Image Instruction Tuning},
author={Jiang, Dongfu and He, Xuan and Zeng, Huaye and Wei, Con and Ku, Max and Liu, Qian and Chen, Wenhu},
journal={arXiv preprint arXiv:2405.01483},
year={2024}
}
Base model
TIGER-Lab/Mantis-8B-siglip-llama3-pretraind