VoRA: Vision as LoRA for Command R

This model implements VoRA (Vision as LoRA) - a novel approach for adding vision capabilities to large language models using Low-Rank Adaptation (LoRA). Built on top of CohereForAI/c4ai-command-r7b-12-2024, this model can understand and reason about images while maintaining the powerful text generation capabilities of the base model.

Model Description

VoRA introduces the concept of "Vision as LoRA": instead of attaching a separate vision encoder and fusion module, as in traditional vision-language models, visual information is injected into the LLM through a lightweight vision embedding and LoRA adaptation layers. Key innovations:

  • Minimal Parameter Training: Only the vision embedding (3.8M params) and the LoRA weights (27M params) are trainable
  • Existing Token Reuse: Uses the "«" token as a vision placeholder instead of expanding the vocabulary
  • Lightweight Vision Encoder: A simple CNN + MLP vision embedding converts image patches into LLM-compatible embeddings (see the sketch after this list)
  • LoRA-Only Language Adaptation: Base LLM weights remain frozen; adaptation happens purely through LoRA layers
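
The vision path can be pictured as a patchifying convolution followed by an MLP that projects patch features into the LLM's embedding space. The sketch below only illustrates the idea; the module names and the assumed hidden size of 4096 are placeholders and may not match modeling_vora.py.

import torch
import torch.nn as nn

class TinyVisionEmbedding(nn.Module):
    """Illustrative CNN + MLP vision embedding (hypothetical, not the repo's code)."""
    def __init__(self, hidden_size=4096, patch_size=14, image_size=224, channels=3):
        super().__init__()
        # A non-overlapping conv acts as the patchifier: one feature vector per 14x14 patch.
        self.patchify = nn.Conv2d(channels, hidden_size // 4,
                                  kernel_size=patch_size, stride=patch_size)
        # A small MLP lifts patch features into the LLM embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size // 4, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        num_patches = (image_size // patch_size) ** 2   # 16 x 16 = 256
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, hidden_size))

    def forward(self, pixel_values):                    # (B, 3, 224, 224)
        x = self.patchify(pixel_values)                 # (B, C, 16, 16)
        x = x.flatten(2).transpose(1, 2)                # (B, 256, C)
        return self.mlp(x) + self.pos_embed             # (B, 256, hidden_size)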

Training Details

  • Base Model: CohereForAI/c4ai-command-r7b-12-2024
  • Dataset: Hon-Wong/VoRA-Recap-GLDv2-1.4M
  • Training Epochs: 1
  • Batch Size: 32
  • Learning Rate: 2e-05
  • LoRA Rank: 32 (see the example configuration after this list)
  • Image Size: 224x224
  • Vision Placeholder: "«"
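
If the LoRA side were reproduced with the peft library, the hyperparameters above might translate into a configuration like the one below. Only the rank of 32 comes from this card; lora_alpha, dropout, and the target module list are illustrative assumptions.

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                          # LoRA rank reported above
    lora_alpha=64,                 # assumed scaling factor (not documented here)
    lora_dropout=0.05,             # assumed
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[               # assumed: attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)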

Model Architecture

  • Total Parameters: ~8B (Command R base)
  • Trainable Parameters: ~31M (0.39% of total)
  • LoRA Parameters: ~27M
  • Vision Parameters: ~3.8M
  • Image Resolution: 224x224
  • Patch Size: 14x14 (see the quick check after this list)
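
A quick back-of-the-envelope check of the figures above (rounded values from this card):

image_size, patch_size = 224, 14
num_image_tokens = (image_size // patch_size) ** 2   # 16 * 16 = 256 patches per image

total_params = 8_000_000_000                         # ~8B base model
trainable_params = 27_000_000 + 3_800_000            # LoRA + vision embedding, ~31M
print(num_image_tokens)                              # 256
print(trainable_params / total_params)               # 0.00385, i.e. roughly the ~0.39% above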

Usage

Basic Usage

import torch
from transformers import AutoTokenizer, AutoProcessor
from modeling_vora import VoRAModelForCausalLM
from processing_vora import VoRAProcessor
from PIL import Image

# Load model and processor
model = VoRAModelForCausalLM.from_pretrained("maximuspowers/cmd-r-vora-2")
processor = VoRAProcessor.from_pretrained("maximuspowers/cmd-r-vora-2")

# Load an image
image = Image.open("your_image.jpg")

# Process inputs
inputs = processor(
    text="« What do you see in this image?",
    images=image,
    return_tensors="pt"
)

# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=processor.tokenizer.eos_token_id
    )

# Decode response
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)

Pipeline Usage (Future)

# Coming soon: pipeline support
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="maximuspowers/cmd-r-vora-2",
    processor="maximuspowers/cmd-r-vora-2"
)

result = pipe({"image": "path/to/image.jpg", "text": "Describe this image"})

Vision Placeholder

This model uses the "«" character as a vision placeholder token. When processing text with images:

  • Include "«" in your text prompt where you want the image to be processed
  • If no "«" is found, the processor automatically prepends one to the prompt (as shown in the example after this list)
  • Example: "« What's happening in this image?"
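
Assuming the auto-insertion behaviour described above, the two calls below should produce equivalent inputs:

from PIL import Image
from processing_vora import VoRAProcessor

processor = VoRAProcessor.from_pretrained("maximuspowers/cmd-r-vora-2")
image = Image.open("your_image.jpg")

# Placeholder written explicitly in the prompt.
explicit = processor(text="« What's happening in this image?", images=image, return_tensors="pt")

# No placeholder: the processor prepends "«" automatically.
implicit = processor(text="What's happening in this image?", images=image, return_tensors="pt")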

Performance

The model demonstrates efficient vision-language understanding with minimal parameter overhead:

  • Memory Efficient: Only 0.39% of parameters are trainable
  • Fast Training: Converges quickly because the base model stays frozen and only ~31M parameters are updated
  • Flexible: Can be easily adapted to different vision tasks

Technical Implementation

Based on the VoRA paper "Vision as LoRA" (arXiv:2503.20680), this implementation includes:

  1. Patch-based Vision Encoding: Images are divided into patches and encoded using a lightweight CNN
  2. Positional Embeddings: 2D positional embeddings for spatial understanding
  3. RMS Normalization: Stable normalization for vision features
  4. LoRA Integration: Efficient adaptation of attention and MLP layers
  5. Token Replacement: Vision embeddings replace placeholder tokens during forward pass
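
One way step 5 could be wired is sketched below. This is an illustration, not the exact code in modeling_vora.py, and it assumes a single "«" placeholder per sample that gets expanded into the full set of vision tokens.

import torch

def merge_vision_tokens(input_ids, inputs_embeds, vision_embeds, placeholder_id):
    """Splice vision embeddings into the text sequence at the placeholder position.

    input_ids:     (B, T)     token ids containing one placeholder per sample
    inputs_embeds: (B, T, H)  text embeddings from the LLM's embedding layer
    vision_embeds: (B, N, H)  patch embeddings for each image (e.g. N = 256)
    """
    merged = []
    for b in range(input_ids.size(0)):
        pos = int((input_ids[b] == placeholder_id).nonzero(as_tuple=True)[0][0])
        merged.append(torch.cat([
            inputs_embeds[b, :pos],      # text before the placeholder
            vision_embeds[b],            # vision tokens stand in for "«"
            inputs_embeds[b, pos + 1:],  # text after the placeholder
        ], dim=0))
    return torch.stack(merged)           # (B, T - 1 + N, H)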

Limitations

  • Currently optimized for single-image understanding
  • Vision placeholder must be included in text prompts
  • Requires specific processor for proper image preprocessing

Citation

If you use this model, please cite the original VoRA paper:

@article{vora2025,
  title={Vision as LoRA},
  author={[Authors]},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}

License

This model is released under the Apache 2.0 License.
