---
language: en
tags:
- codestral
- vision-language
- code-generation
- multimodal
- mlx
license: other
library_name: mlx
inference: false
license_name: mnpl
license_link: https://mistral.ai/licences/MNPL-0.1.md
---

# Codestral-ViT

A multimodal code generation model that combines vision and language understanding. Built on MLX for Apple Silicon, it integrates CLIP's visual capabilities with Codestral's code generation abilities.

## Overview

Codestral-ViT extends the Codestral language model with visual understanding capabilities. It can:

- Generate code from text descriptions
- Understand and explain code from screenshots
- Suggest improvements to code based on visual context
- Process multiple images with advanced tiling strategies

## Technical Details

- **Base Models:**
  - Language: Codestral-22B (4-bit quantized)
  - Vision: CLIP ViT-Large/14
  - Framework: MLX (Apple Silicon)
- **Architecture:**
  - Vision encoder processes images into 512-dim embeddings
  - Learned projection layer maps vision features to language space
  - Dynamic RoPE scaling for 32K context window
  - Support for overlapping image crops and tiling
- **Input Processing:**
  - Images: 224x224 pixels, CLIP normalization
  - Text: Up to 32,768 tokens
  - Special tokens for image-text fusion

## Example Usage

```python
from PIL import Image
from src.model import MultimodalCodestral

model = MultimodalCodestral()

# Code generation from screenshot
image = Image.open("code_screenshot.png")
response = model.generate_with_images(
    prompt="Explain this code and suggest improvements",
    images=[image]
)

# Multiple image processing
images = [Image.open(f) for f in ["img1.png", "img2.png"]]
response = model.generate_with_images(
    prompt="Compare these code implementations",
    images=images
)
```

## Capabilities

- **Code Understanding:**
  - Analyzes code structure from screenshots
  - Identifies patterns and anti-patterns
  - Suggests contextual improvements
- **Image Processing:**
  - Handles multiple image inputs
  - Supports various image formats
  - Advanced crop and resize strategies
- **Generation Features:**
  - Context-aware code completion
  - Documentation generation
  - Code refactoring suggestions
  - Bug identification and fixes

## Requirements

- Apple Silicon hardware (M1/M2/M3)
- 32GB+ RAM recommended
- MLX framework
- Python 3.8+

## Limitations

- Apple Silicon only (no CPU/CUDA support)
- Memory intensive for large images/codebases
- Visual understanding bounded by CLIP's capabilities
- Generation quality depends on input clarity

## License

This model is released under the Mistral AI Non-Production License (MNPL). See [license details](https://mistral.ai/licences/MNPL-0.1.md).

## Citation

```bibtex
@software{codestral-vit,
  author = {Mike Casale},
  title = {Codestral-ViT: A Vision-Language Model for Code Generation},
  year = {2023},
  publisher = {Hugging Face},
  url = {https://huggingface.co/casale-xyz/codestral-vit}
}
```
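
## Appendix: Overlapping Tile Layout

The Technical Details list mentions support for overlapping image crops and tiling with 224x224 inputs. This card does not document how the model lays out its tiles, so the following is only a minimal sketch of one common approach, not the model's actual implementation. The function names (`tile_boxes`, `_axis_starts`) and the default overlap of 32 pixels are assumptions made for illustration.

```python
from typing import List, Tuple

# (left, upper, right, lower), the box convention used by PIL's Image.crop
Box = Tuple[int, int, int, int]


def _axis_starts(length: int, tile: int, stride: int) -> List[int]:
    """Return tile start offsets along one axis so the tiles cover [0, length)."""
    if length <= tile:
        # Image is no larger than one tile on this axis: a single tile at 0.
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    # If the stride leaves an uncovered strip, add one tile flush with the edge.
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts


def tile_boxes(width: int, height: int, tile: int = 224, overlap: int = 32) -> List[Box]:
    """Hypothetical helper: crop boxes for overlapping square tiles.

    `overlap` is the number of pixels shared by neighbouring tiles; the
    default of 32 is an assumed value, not taken from the model.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    stride = tile - overlap
    return [
        (x, y, x + tile, y + tile)
        for y in _axis_starts(height, tile, stride)
        for x in _axis_starts(width, tile, stride)
    ]


if __name__ == "__main__":
    # A 500x224 screenshot yields three tiles; the last is flush with the right edge.
    for box in tile_boxes(500, 224):
        print(box)
```

Each box can be passed to PIL's `Image.crop` to obtain a tile. Images smaller than one tile produce a single box that extends past the image bounds, so a caller would likely resize such images to 224x224 first.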