Devstral-Vision-Small-2507

Created by Eric Hartford at Cognitive Computations

Model Description

Devstral-Vision-Small-2507 is a multimodal language model that combines the exceptional coding capabilities of Devstral-Small-2507 with the vision understanding of Mistral-Small-3.2-24B-Instruct-2506.

This model enables vision-augmented software engineering tasks, allowing developers to:

  • Analyze screenshots and UI mockups to generate code
  • Debug visual rendering issues with actual screenshots
  • Convert designs and wireframes directly into implementation
  • Understand and modify codebases with visual context

Model Details

  • Base Architecture: Mistral Small 3.2 with vision encoder
  • Parameters: 24B (language model) + vision components
  • Context Window: 128k tokens
  • License: Apache 2.0
  • Language Model: Fine-tuned Devstral weights for superior coding performance
  • Vision Model: Mistral-Small vision encoder and multimodal projector

How It Was Created

This model was created by surgically transplanting the language model weights from Devstral-Small-2507 into the Mistral-Small-3.2-24B-Instruct-2506 architecture while preserving all vision components:

  1. Started with Mistral-Small-3.2-24B-Instruct-2506 (complete multimodal model)
  2. Replaced only the core language model weights with Devstral-Small-2507's fine-tuned weights
  3. Preserved Mistral's vision encoder, multimodal projector, vision-language adapter, and token embeddings
  4. Kept Mistral's tokenizer to maintain proper image token handling

The result is a model that combines Devstral's state-of-the-art coding capabilities with Mistral's vision understanding.

Below is a simplified sketch of the conversion procedure (not the exact script used): it assumes single-file safetensors checkpoints and hypothetical key prefixes for the vision components.
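
from safetensors.torch import load_file, save_file

# Hypothetical key prefixes for the vision components; inspect the real
# checkpoints for the actual names before attempting this.
PRESERVE_PREFIXES = ("vision_tower.", "multi_modal_projector.")

# Assumes single-file checkpoints; real 24B models are sharded across
# several safetensors files, and the two state dicts may use different
# key layouts that need remapping.
mistral = load_file("Mistral-Small-3.2-24B-Instruct-2506/model.safetensors")
devstral = load_file("Devstral-Small-2507/model.safetensors")

merged = {}
for key, tensor in mistral.items():
    if key.startswith(PRESERVE_PREFIXES) or "embed_tokens" in key:
        # Keep Mistral's vision encoder, projector, and token embeddings.
        merged[key] = tensor
    else:
        # Take the corresponding language-model weight from Devstral,
        # falling back to Mistral's if a key has no direct counterpart.
        merged[key] = devstral.get(key, tensor)

save_file(merged, "Devstral-Vision-Small-2507/model.safetensors")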

Intended Use

Primary Use Cases

  • Visual Software Engineering: Analyze UI screenshots, mockups, and designs to generate implementation code
  • Code Review with Visual Context: Review code changes alongside their visual output
  • Debugging Visual Issues: Debug rendering problems by analyzing screenshots
  • Design-to-Code: Convert visual designs directly into code
  • Documentation with Visual Examples: Generate documentation that references visual elements

Example Applications

  • Building UI components from screenshots
  • Debugging CSS/styling issues with visual feedback
  • Converting Figma/design mockups to code
  • Analyzing and reproducing visual bugs
  • Creating visual test cases

Usage

With OpenHands

The model is optimized for use with OpenHands for agentic coding tasks:

# Using vLLM
vllm serve cognitivecomputations/Devstral-Vision-Small-2507 \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor-parallel-size 2

# Configure OpenHands to use the model
# Set Custom Model: openai/cognitivecomputations/Devstral-Vision-Small-2507
# Set Base URL: http://localhost:8000/v1
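
Once the vLLM server is running, any OpenAI-compatible client can send it screenshots. A minimal sketch using the openai Python package; the file name, prompt, and api_key value are placeholders:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local screenshot as a data URL so it can travel in the request.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="cognitivecomputations/Devstral-Vision-Small-2507",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Analyze this UI screenshot and generate React code to reproduce it."},
        ],
    }],
    max_tokens=2000,
)
print(response.choices[0].message.content)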

With Transformers

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "cognitivecomputations/Devstral-Vision-Small-2507"

# AutoModelForCausalLM would resolve only the language model; the
# image-text-to-text auto class loads the full multimodal architecture.
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load an image
image = Image.open("screenshot.png")

# Build a chat-style prompt; the template inserts the image placeholder
# tokens the processor expects alongside the text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze this UI screenshot and generate React code to reproduce it."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs
inputs = processor(
    text=prompt,
    images=image,
    return_tensors="pt"
).to(model.device)

# Generate (do_sample=True so the temperature setting actually applies)
outputs = model.generate(
    **inputs,
    max_new_tokens=2000,
    do_sample=True,
    temperature=0.7
)

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)


Performance Expectations

Coding Performance

Inherits Devstral's exceptional performance on coding tasks:

  • 53.6% on SWE-Bench Verified (when used with OpenHands)
  • Superior performance on multi-file editing and codebase exploration
  • Excellent tool use and agentic behavior

Vision Performance

Maintains Mistral-Small's vision capabilities:

  • Strong understanding of UI elements and layouts
  • Accurate interpretation of charts, diagrams, and visual documentation
  • Reliable screenshot analysis for debugging

Hardware Requirements

  • GPU Memory: ~48GB for the native bf16 weights, ~24GB with 4-bit quantization
  • Recommended: 2x RTX 4090 or better for optimal performance
  • Minimum: a single GPU with 24GB VRAM, using 4-bit quantization (see the sketch below)
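
For the single-GPU case, 4-bit loading via bitsandbytes is one option. A minimal sketch, assuming transformers and bitsandbytes are installed; the quantization settings are illustrative, not tuned for this model:

import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

# Illustrative 4-bit settings: nf4 weights with bf16 compute.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForImageTextToText.from_pretrained(
    "cognitivecomputations/Devstral-Vision-Small-2507",
    quantization_config=quant_config,
    device_map="auto",
)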

Limitations

  • Vision capabilities are limited to what Mistral-Small-3.2 supports
  • Not specifically fine-tuned on vision-to-code tasks (uses Devstral's text-only fine-tuning)
  • Large model size may be prohibitive for some deployment scenarios
  • Best performance achieved when used with appropriate scaffolding (OpenHands, Cline, etc.)

Ethical Considerations

This model inherits both the capabilities and limitations of its parent models. Users should:

  • Review generated code for security vulnerabilities
  • Verify visual interpretations are accurate
  • Be aware of potential biases in code generation
  • Use appropriate safety measures in production deployments

Citation

If you use this model, please cite:

@misc{devstral-vision-2507,
  author = {Hartford, Eric},
  title = {Devstral-Vision-Small-2507},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cognitivecomputations/Devstral-Vision-Small-2507}
}

Acknowledgments

This model builds upon the excellent work by:

  • Mistral AI for both Mistral-Small and Devstral
  • All Hands AI for their collaboration on Devstral
  • The open-source community for testing and feedback

License

Apache 2.0 - Same as the base models


Created with dolphin passion 🐬 by Cognitive Computations
