Codestral-ViT

A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it integrates CLIP's visual capabilities with Codestral's code generation abilities.

Overview

Codestral-ViT extends the Codestral language model with visual understanding capabilities. It can:

  • Generate code from text descriptions
  • Understand and explain code from screenshots
  • Suggest improvements to code based on visual context
  • Process multiple images using overlapping crops and tiling

Technical Details

  • Base Models:

    • Language: Codestral-22B (4-bit quantized)
    • Vision: CLIP ViT-Large/14
    • Framework: MLX (Apple Silicon)
  • Architecture:

    • Vision encoder processes images into 512-dim embeddings
    • Learned projection layer maps vision features to language space
    • Dynamic RoPE scaling for 32K context window
    • Support for overlapping image crops and tiling
  • Input Processing:

    • Images: 224x224 pixels, CLIP normalization
    • Text: Up to 32,768 tokens
    • Special tokens for image-text fusion
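
The input-processing step above (224x224 images with CLIP normalization) can be sketched in plain Python. This is a minimal illustration, not this repository's code: the mean and standard deviation are the per-channel constants published with CLIP, while the resize-shorter-side-then-center-crop geometry and the helper names `resize_and_crop_box` and `normalize_pixel` are assumptions made for the example.

```python
# Per-channel RGB mean and std used by the original CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
TARGET = 224  # input side length stated in this README


def resize_and_crop_box(width, height, target=TARGET):
    """Scale the shorter side to `target`, then compute a center-crop box.

    Returns ((new_width, new_height), (left, top, right, bottom)).
    Assumed geometry; the model's own pipeline may differ.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def normalize_pixel(rgb):
    """Map an 8-bit (r, g, b) pixel to CLIP-normalized floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


if __name__ == "__main__":
    print(resize_and_crop_box(1920, 1080))
    print(normalize_pixel((255, 255, 255)))
```

The size and box returned by `resize_and_crop_box` are in the form Pillow's `Image.resize` and `Image.crop` accept, so the same numbers could drive a real image pipeline.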

Example Usage

from PIL import Image
from src.model import MultimodalCodestral

model = MultimodalCodestral()

# Explain code from a screenshot and suggest improvements
image = Image.open("code_screenshot.png")
response = model.generate_with_images(
    prompt="Explain this code and suggest improvements",
    images=[image]
)

# Multiple image processing
images = [Image.open(f) for f in ["img1.png", "img2.png"]]
response = model.generate_with_images(
    prompt="Compare these code implementations",
    images=images
)

Capabilities

  • Code Understanding:

    • Analyzes code structure from screenshots
    • Identifies patterns and anti-patterns
    • Suggests contextual improvements
  • Image Processing:

    • Handles multiple image inputs
    • Supports various image formats
    • Crop, resize, and tiling strategies, including overlapping crops
  • Generation Features:

    • Context-aware code completion
    • Documentation generation
    • Code refactoring suggestions
    • Bug identification and fixes
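
To make the crop and tiling behavior listed under Image Processing more concrete, here is one way overlapping tiles could be laid out over a large screenshot. It is a sketch under stated assumptions: the function name `tile_boxes`, the 32-pixel default overlap, and the choice to place the last tile flush with the image edge are illustrative and not taken from this model's source.

```python
def tile_boxes(width, height, tile=224, overlap=32):
    """Return (left, top, right, bottom) boxes that cover the image.

    Neighboring tiles share `overlap` pixels. The last tile in each row
    and column is placed flush with the image edge, so it may overlap
    its neighbor by more than `overlap`.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap

    def starts(length):
        # An image no larger than one tile needs a single crop.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # final tile ends at the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    for box in tile_boxes(448, 224):
        print(box)
```

Each box uses the `(left, upper, right, lower)` convention of Pillow's `Image.crop`, so the tiles could be cropped from an image and passed as a list of images.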

Requirements

  • Apple Silicon hardware (M1/M2/M3)
  • 32GB+ RAM recommended
  • MLX framework
  • Python 3.8+
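
As a rough guide to the RAM recommendation, the footprint of the quantized language-model weights can be estimated with simple arithmetic. The sketch below takes a nominal 22 billion parameters from the model name and 4-bit weights; the group size of 64 with a 16-bit scale and 16-bit bias per group is an assumption about the quantization layout, and the function name is hypothetical. The KV cache, activations, and the vision encoder need additional memory and are not counted.

```python
def quantized_weight_bytes(n_params, bits=4, group_size=64, overhead_bits_per_group=32):
    """Estimate the bytes needed to store group-quantized weights.

    Each weight takes `bits` bits, and every group of `group_size`
    weights carries `overhead_bits_per_group` extra bits (assumed here
    to be a 16-bit scale plus a 16-bit bias).
    """
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return n_params * bits_per_weight / 8


if __name__ == "__main__":
    n_params = 22e9  # nominal parameter count, taken from the "22B" in the name
    gib = quantized_weight_bytes(n_params) / 2**30
    print(f"Estimated weight storage: {gib:.1f} GiB")
```

Under these assumptions the weights alone take roughly 11.5 GiB, which leaves room on a 32GB machine for the cache and for image inputs.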

Limitations

  • Apple Silicon only (no support for other CPUs or CUDA GPUs)
  • Memory intensive for large images/codebases
  • Visual understanding bounded by CLIP's capabilities
  • Generation quality depends on input clarity

License

This model is released under the Mistral AI Non-Production License (MNPL). See the license text for details.

Citation

@software{codestral-vit,
  author = {Mike Casale},
  title = {Codestral-ViT: A Vision-Language Model for Code Generation},
  year = {2023},
  publisher = {Hugging Face},
  url = {https://huggingface.co/casale-xyz/codestral-vit}
}