VoRA: Vision as LoRA for Command R

This model implements VoRA (Vision as LoRA) - a novel approach for adding vision capabilities to large language models using Low-Rank Adaptation (LoRA). Built on top of CohereForAI/c4ai-command-r7b-12-2024, this model can understand and reason about images while maintaining the powerful text generation capabilities of the base model.

Model Description

VoRA introduces the concept of "Vision as LoRA": instead of attaching a separate vision encoder and fusion module, as in traditional vision-language models, visual information is injected into the LLM through a lightweight vision embedding and LoRA adaptation layers. Key innovations:

  • Minimal Parameter Training: Only the vision embedding (3.8M params) and the LoRA weights (27M params) are trainable
  • Existing Token Reuse: Uses the "«" token as a vision placeholder instead of expanding the vocabulary
  • Lightweight Vision Encoder: A simple CNN + MLP vision embedding converts image patches into LLM-compatible embeddings (see the sketch after this list)
  • LoRA-Only Language Adaptation: Base LLM weights remain frozen; adaptation happens purely through LoRA layers
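
The vision path can be pictured as a patchifying convolution followed by an MLP that projects patch features into the LLM's embedding space. The sketch below only illustrates the idea; the module names and the assumed hidden size of 4096 are placeholders and may not match modeling_vora.py.

import torch
import torch.nn as nn

class TinyVisionEmbedding(nn.Module):
    """Illustrative CNN + MLP vision embedding (hypothetical, not the repo's code)."""
    def __init__(self, hidden_size=4096, patch_size=14, image_size=224, channels=3):
        super().__init__()
        # A non-overlapping conv acts as the patchifier: one feature vector per 14x14 patch.
        self.patchify = nn.Conv2d(channels, hidden_size // 4,
                                  kernel_size=patch_size, stride=patch_size)
        # A small MLP lifts patch features into the LLM embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size // 4, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        num_patches = (image_size // patch_size) ** 2   # 16 x 16 = 256
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_size))

    def forward(self, pixel_values):                    # (B, 3, 224, 224)
        x = self.patchify(pixel_values)                 # (B, C, 16, 16)
        x = x.flatten(2).transpose(1, 2)                # (B, 256, C)
        return self.mlp(x) + self.pos_embed             # (B, 256, hidden_size)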

Training Details

  • Base Model: CohereForAI/c4ai-command-r7b-12-2024
  • Dataset: Hon-Wong/VoRA-Recap-GLDv2-1.4M
  • Training Epochs: 1
  • Batch Size: 32
  • Learning Rate: 2e-05
  • LoRA Rank: 32 (see the example configuration after this list)
  • Image Size: 224x224
  • Vision Placeholder: "«"
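
If the LoRA side were reproduced with the peft library, the hyperparameters above might translate into a configuration like the one below. Only the rank of 32 comes from this card; lora_alpha, dropout, and the target module list are illustrative assumptions.

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                          # LoRA rank reported above
    lora_alpha=64,                 # assumed scaling factor (not documented here)
    lora_dropout=0.05,             # assumed
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[               # assumed: attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)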

Model Architecture

  • Total Parameters: ~8B (Command R base)
  • Trainable Parameters: ~31M (0.39% of total)
  • LoRA Parameters: ~27M
  • Vision Parameters: ~3.8M
  • Image Resolution: 224x224
  • Patch Size: 14x14 (see the quick check after this list)
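
A quick back-of-the-envelope check of the figures above (rounded values from this card):

image_size, patch_size = 224, 14
num_image_tokens = (image_size // patch_size) ** 2   # 16 * 16 = 256 patches per image

total_params = 8_000_000_000                         # ~8B base model
trainable_params = 27_000_000 + 3_800_000            # LoRA + vision embedding, ~31M
print(num_image_tokens)                              # 256
print(trainable_params / total_params)               # 0.00385, i.e. roughly the ~0.39% above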

Usage

Basic Usage

import torch
from transformers import AutoTokenizer, AutoProcessor
from modeling_vora import VoRAModelForCausalLM
from processing_vora import VoRAProcessor
from PIL import Image

# Load model and processor
model = VoRAModelForCausalLM.from_pretrained("maximuspowers/cmd-r-vora-2")
processor = VoRAProcessor.from_pretrained("maximuspowers/cmd-r-vora-2")

# Load an image
image = Image.open("your_image.jpg")

# Process inputs
inputs = processor(
    text="« What do you see in this image?",
    images=image,
    return_tensors="pt"
)

# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=processor.tokenizer.eos_token_id
    )

# Decode response
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)

Pipeline Usage (Future)

# Coming soon: pipeline support
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="maximuspowers/cmd-r-vora-2",
    processor="maximuspowers/cmd-r-vora-2"
)

result = pipe({"image": "path/to/image.jpg", "text": "Describe this image"})

Vision Placeholder

This model uses the "«" character as a vision placeholder token. When processing text with images:

  • Include "«" in your text prompt where you want the image to be processed
  • If no "«" is found, the processor automatically prepends one to the prompt (as shown in the example after this list)
  • Example: "« What's happening in this image?"
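
Assuming the auto-insertion behaviour described above, the two calls below should produce equivalent inputs:

from PIL import Image
from processing_vora import VoRAProcessor

processor = VoRAProcessor.from_pretrained("maximuspowers/cmd-r-vora-2")
image = Image.open("your_image.jpg")

# Placeholder written explicitly in the prompt.
explicit = processor(text="« What's happening in this image?", images=image, return_tensors="pt")

# No placeholder: the processor prepends "«" automatically.
implicit = processor(text="What's happening in this image?", images=image, return_tensors="pt")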

Performance

The model demonstrates efficient vision-language understanding with minimal parameter overhead:

  • Memory Efficient: Only 0.39% of parameters are trainable
  • Fast Training: Converges quickly because the base model stays frozen and only ~31M parameters are updated
  • Flexible: Can be easily adapted to different vision tasks

Technical Implementation

Based on the VoRA paper "Vision as LoRA" (arXiv:2503.20680), this implementation includes:

  1. Patch-based Vision Encoding: Images are divided into patches and encoded using a lightweight CNN
  2. Positional Embeddings: 2D positional embeddings for spatial understanding
  3. RMS Normalization: Stable normalization for vision features
  4. LoRA Integration: Efficient adaptation of attention and MLP layers
  5. Token Replacement: Vision embeddings replace placeholder tokens during forward pass
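
One way step 5 could be wired is sketched below. This is an illustration, not the exact code in modeling_vora.py, and it assumes a single "«" placeholder per sample that gets expanded into the full set of vision tokens.

import torch

def merge_vision_tokens(input_ids, inputs_embeds, vision_embeds, placeholder_id):
    """Splice vision embeddings into the text sequence at the placeholder position.

    input_ids:     (B, T)     token ids containing one placeholder per sample
    inputs_embeds: (B, T, H)  text embeddings from the LLM's embedding layer
    vision_embeds: (B, N, H)  patch embeddings for each image (e.g. N = 256)
    """
    merged = []
    for b in range(input_ids.size(0)):
        pos = int((input_ids[b] == placeholder_id).nonzero(as_tuple=True)[0][0])
        merged.append(torch.cat([
            inputs_embeds[b, :pos],      # text before the placeholder
            vision_embeds[b],            # vision tokens stand in for "«"
            inputs_embeds[b, pos + 1:],  # text after the placeholder
        ], dim=0))
    return torch.stack(merged)           # (B, T - 1 + N, H)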

Limitations

  • Currently optimized for single-image understanding
  • Vision placeholder must be included in text prompts
  • Requires specific processor for proper image preprocessing

Citation

If you use this model, please cite the original VoRA paper:

@article{vora2025,
  title={Vision as LoRA},
  author={[Authors]},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}

License

This model is released under the Apache 2.0 License.
