Inference Providers documentation
Image-Text to Text
Image-text-to-text models take an image and a text prompt as input and output text. These models are also called vision-language models, or VLMs. Unlike image-to-text models, they accept an additional text input, so they are not restricted to a single use case such as image captioning. They may also be trained to accept a conversation as input.
For more details about the image-text-to-text task, check out its dedicated page! You will find examples and related materials.
Recommended models
- zai-org/GLM-4.5V: Cutting-edge reasoning vision language model.
Explore all available models and find the one that suits you best here.
Using the API
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)
completion = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct:cerebras",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ],
)
print(completion.choices[0].message)
API specification
For the API specification of conversational image-text-to-text models, please refer to the Chat Completion API documentation.
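The request shown earlier references the image by a public URL. For an image stored locally, one common approach with OpenAI-compatible chat APIs is to embed the bytes as a base64 data URL in the same `image_url` field. The following is a minimal sketch under that assumption; this page does not confirm that every provider accepts data URLs, and the helper names `image_to_data_url` and `build_user_message` are hypothetical, not part of any library.

```python
import base64
import mimetypes


def image_to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file name; falling back to JPEG is an
    # assumption made for this sketch.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "image/jpeg"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_user_message(prompt: str, image_url: str) -> dict:
    """Build a user message with one text part and one image part,
    mirroring the message structure used in the request example."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

To use it, read the file in binary mode, pass the bytes and file name to `image_to_data_url`, and send `messages=[build_user_message(prompt, data_url)]` through `client.chat.completions.create` as in the request example. Large images inflate the request body by roughly a third after base64 encoding, so a hosted URL may be preferable when one is available.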