---
inference: false
library_name: transformers
language:
- en
- fr
- de
- es
- it
- pt
license: cc-by-nc-4.0
extra_gated_prompt: >-
  By submitting this form, you agree to the [License
  Agreement](https://cohere.com/c4ai-cc-by-nc-license) and acknowledge that the
  information you provide will be collected, used, and shared in accordance with
  Cohere’s [Privacy Policy](https://cohere.com/privacy). You’ll receive email
  updates about C4AI and Cohere research, events, products and services. You can
  unsubscribe at any time.
extra_gated_fields:
  Name: text
  Affiliation: text
  Country: country
  I agree to use this model for non-commercial use ONLY: checkbox
pipeline_tag: image-text-to-text
base_model:
- CohereLabs/c4ai-command-a-03-2025
- google/siglip2-so400m-patch16-512
---
# Model Card for Cohere Labs Command A Vision

## Model Summary
Cohere Labs Command A Vision is an open weights research release of a 112 billion parameter model optimized for enterprise image understanding tasks while keeping a low compute footprint.
- Developed by: Cohere and Cohere Labs
- Point of Contact: Cohere Labs
- License: CC-BY-NC; also requires adhering to Cohere Labs' Acceptable Use Policy
- Model: command-a-vision-07-2025
- Model Size: 112B
- Context length: 32K
For more details about this model, please check out our blog post.
Note: The model supports a context length of 128K, but it is configured for 32K in Hugging Face. This value can be updated in the configuration if needed, as sketched below.
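A minimal sketch of raising that limit at load time, assuming the 32K cap lives in the text config's `max_position_embeddings` field (check the model's `config.json` for the exact attribute name before relying on this):

```python
import torch
from transformers import AutoConfig, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-vision-07-2025"

config = AutoConfig.from_pretrained(model_id)
# Assumption: the context cap lives here; verify against config.json.
config.text_config.max_position_embeddings = 131072  # raise 32K -> 128K

model = AutoModelForImageTextToText.from_pretrained(
    model_id, config=config, device_map="auto", torch_dtype=torch.float16
)
```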
## Try Cohere Labs Command A Vision
You can try out Cohere Labs Command A Vision before downloading the weights in our hosted Hugging Face Space.
## Usage

Please install `transformers` from source, since the source repository includes the necessary changes for this model:
```python
# pip install "transformers[dev-torch]@git+https://github.com/huggingface/transformers.git"

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-vision-07-2025"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the Command-A-Vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://images.pexels.com/photos/1108099/pexels-photo-1108099.jpeg",
            },
            {"type": "text", "text": "what is in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    padding=True,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=True,
    temperature=0.3,
)

print(
    processor.tokenizer.decode(
        gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
)
```
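The same chat template can also carry more than one image in a single turn. Below is a minimal sketch reusing the `processor` and `model` loaded above; the image URLs are illustrative placeholders, and per-turn multi-image support is assumed here rather than stated by this card:

```python
# Sketch: two images in one user turn (placeholder URLs).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart_q1.png"},
            {"type": "image", "url": "https://example.com/chart_q2.png"},
            {"type": "text", "text": "Compare these two charts."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

gen_tokens = model.generate(**inputs, max_new_tokens=300)

# Decode only the newly generated tokens.
print(
    processor.tokenizer.decode(
        gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
)
```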
You can also use the model directly through the `transformers` `pipeline` abstraction:
```python
from transformers import pipeline

pipe = pipeline(
    model="CohereLabs/command-a-vision-07-2025",
    task="image-text-to-text",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://media.istockphoto.com/id/458012057/photo/istanbul-turkey.jpg?s=612x612&w=0&k=20&c=qogAOVvkpfUyqLUMr_XJQyq-HkACXyYUSZbKhBlPrxo=",
            },
            {"type": "text", "text": "Where was this taken?"},
        ],
    },
]

outputs = pipe(text=messages, max_new_tokens=300, return_full_text=False)
print(outputs)
```
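With `return_full_text=False`, the pipeline typically returns a list of dicts with the reply under the `generated_text` key, e.g. `print(outputs[0]["generated_text"])`.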
## Model Details

Input: Model accepts input text and images.

Output: Model generates text.

Model Architecture: This is a vision-language model that pairs a language model based on Command A with the SigLIP2 SO400M patch16-512 vision encoder through a multimodal adapter for vision-language understanding.
Image Processing: We use 256 visual tokens to encode a single image tile at a resolution of 512x512 pixels. Input images of arbitrary size are mapped to the nearest supported resolution based on their aspect ratio. Command A Vision uses up to 12 input tiles, depending on image resolution, plus an additional thumbnail tile (resized to 512x512), for up to 3328 tokens per image. We recommend images of up to 2048x1536 (3 megapixel) resolution; the token arithmetic is sketched below.
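As a back-of-the-envelope check of those numbers (the processor's actual tile-selection heuristic may differ; the ceil-division grid below is an assumption):

```python
import math

TILE_SIDE = 512        # pixels per tile side
TOKENS_PER_TILE = 256  # visual tokens per 512x512 tile
MAX_TILES = 12         # plus one 512x512 thumbnail tile

def estimated_visual_tokens(width: int, height: int) -> int:
    # Assumption: tile count behaves like a ceil-division grid; the real
    # processor maps to the nearest supported aspect-ratio layout.
    tiles = min(MAX_TILES, math.ceil(width / TILE_SIDE) * math.ceil(height / TILE_SIDE))
    return (tiles + 1) * TOKENS_PER_TILE  # +1 for the thumbnail tile

print(estimated_visual_tokens(2048, 1536))  # 12 tiles + thumbnail -> 3328 tokens
print(estimated_visual_tokens(1024, 1024))  # 4 tiles + thumbnail -> 1280 tokens
```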
Languages covered:
- English
- Portuguese
- Italian
- French
- German
- Spanish
Context Length: 32K.
## Safety Guardrails

Like Cohere Labs Command A, Cohere Labs Command A Vision can be configured with two safety modes, contextual and strict, which enable users to set guardrails suited to their needs. Contextual mode is appropriate for wide-ranging interactions with fewer constraints on output, while maintaining core protections by rejecting harmful or illegal suggestions. Command A Vision defaults to contextual mode. Strict mode aims to avoid all sensitive topics, such as violent or sexual acts and profanity. For more information, see the Command A prompt format docs; a sketch of selecting a mode follows below.
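A hypothetical sketch of opting into strict mode, assuming (per the Command A prompt format docs) that the mode is selected through the system preamble; the preamble string below is a placeholder, not the real text:

```python
# Placeholder: substitute the exact strict-mode preamble from the
# Command A prompt format docs.
STRICT_MODE_PREAMBLE = "<strict-mode safety preamble from the Command A docs>"

messages = [
    {"role": "system", "content": STRICT_MODE_PREAMBLE},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
```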
## Model Card Contact
For errors or additional questions about details in this model card, contact [[email protected]].
## Terms of Use

We hope that releasing the weights of this highly performant 112 billion parameter model will make community-based research efforts more accessible to researchers all over the world. This model is governed by a CC-BY-NC (Non-Commercial) license with an acceptable use addendum, and also requires adhering to Cohere Labs' Acceptable Use Policy. If you are interested in commercial use, please contact Cohere’s Sales team.
## Try it now

You can try Command A Vision in the playground here. You can also use it in our dedicated Hugging Face Space here.