VoRA: Vision as LoRA for Command R
This model implements VoRA (Vision as LoRA) - a novel approach for adding vision capabilities to large language models using Low-Rank Adaptation (LoRA). Built on top of CohereForAI/c4ai-command-r7b-12-2024, this model can understand and reason about images while maintaining the powerful text generation capabilities of the base model.
Model Description
VoRA introduces the concept of "Vision as LoRA": visual information is treated as an additional adaptation applied through LoRA layers, rather than fused in through traditional vision-language methods (e.g., a separate pretrained vision tower plus projector). Key innovations:
- Minimal Parameter Training: Only the vision embedding (~3.8M params) and the LoRA weights (~27M params) are trainable
- Existing Token Reuse: Uses the "«" token as a vision placeholder instead of expanding the vocabulary
- Lightweight Vision Encoder: A simple CNN + MLP vision embedding converts image patches to LLM-compatible embeddings (see the sketch after this list)
- LoRA-Only Language Adaptation: Base LLM weights stay frozen; adaptation happens purely through the LoRA layers
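To make the vision side concrete, here is a minimal sketch of what such a patch-embedding module could look like. The class name `VisionEmbedding`, the 768-dim intermediate width, and the 4096-dim LLM hidden size are illustrative assumptions, not the exact shipped implementation:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the lightweight vision embedding: a strided Conv2d
# patchifier, learned positional embeddings, RMS normalization, and an MLP
# projecting patch features into the LLM's hidden space. Names and layer
# widths are illustrative, not the exact implementation.
class VisionEmbedding(nn.Module):
    def __init__(self, llm_hidden=4096, embed_dim=768, patch_size=14, image_size=224):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # (224/14)^2 = 256
        # CNN patchifier: one conv whose kernel and stride equal the patch size
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learned positional embedding, one vector per patch position
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.norm = nn.RMSNorm(embed_dim)                        # requires torch >= 2.4
        # MLP projecting patch features into the LLM embedding space
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, llm_hidden),
        )

    def forward(self, pixel_values):                             # (B, 3, 224, 224)
        x = self.patchify(pixel_values)                          # (B, D, 16, 16)
        x = x.flatten(2).transpose(1, 2)                         # (B, 256, D)
        x = self.norm(x + self.pos_embed)
        return self.proj(x)                                      # (B, 256, llm_hidden)
```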
Training Details
- Base Model: CohereForAI/c4ai-command-r7b-12-2024
- Dataset: Hon-Wong/VoRA-Recap-GLDv2-1.4M
- Training Epochs: 1
- Batch Size: 32
- Learning Rate: 2e-05
- LoRA Rank: 32 (see the example config after this list)
- Image Size: 224x224
- Vision Placeholder: "«"
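For reference, a LoRA setup with these hyperparameters could be expressed with the peft library roughly as follows. Whether the original training used peft, and the exact `target_modules`, `lora_alpha`, and dropout values, are assumptions (typical choices), not the confirmed training configuration:

```python
from peft import LoraConfig

# Illustrative LoRA configuration matching the rank above; other values are
# assumptions, not the exact training setup.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,            # assumption: common 2x-rank scaling
    lora_dropout=0.05,        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumption: attention + MLP projections
    task_type="CAUSAL_LM",
)
```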
Model Architecture
- Total Parameters: ~8B (Command R base)
- Trainable Parameters: ~31M (0.39% of total)
- LoRA Parameters: ~27M
- Vision Parameters: ~3.8M
- Image Resolution: 224x224
- Patch Size: 14x14 (so each 224x224 image yields (224/14)² = 256 patch tokens)
Usage
Basic Usage
```python
import torch
from PIL import Image

from modeling_vora import VoRAModelForCausalLM
from processing_vora import VoRAProcessor

# Load model and processor
model = VoRAModelForCausalLM.from_pretrained("maximuspowers/cmd-r-vora-2")
processor = VoRAProcessor.from_pretrained("maximuspowers/cmd-r-vora-2")

# Load an image
image = Image.open("your_image.jpg")

# Process inputs; "«" marks where the image is injected
inputs = processor(
    text="« What do you see in this image?",
    images=image,
    return_tensors="pt"
)

# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        pad_token_id=processor.tokenizer.eos_token_id
    )

# Decode response
response = processor.decode(output_ids[0], skip_special_tokens=True)
print(response)
```
Pipeline Usage (Future)
```python
# Coming soon: pipeline support
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="maximuspowers/cmd-r-vora-2",
    processor="maximuspowers/cmd-r-vora-2"
)
result = pipe({"image": "path/to/image.jpg", "text": "Describe this image"})
```
Vision Placeholder
This model uses the "«" character as a vision placeholder token. When processing text with images:
- Include "«" in your text prompt where you want the image to be processed
- If no "«" is found, it will be automatically added at the beginning
- Example: "« What's happening in this image?"
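Continuing from the usage example above (where `processor` and `image` are already defined), the two calls below should therefore be equivalent, since the processor prepends the placeholder when it is missing:

```python
# Explicit placeholder at the start of the prompt
inputs = processor(text="« Describe this image.", images=image, return_tensors="pt")

# No placeholder: the processor automatically prepends "«"
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")
```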
Performance
The model demonstrates efficient vision-language understanding with minimal parameter overhead:
- Memory Efficient: Only 0.39% of parameters are trainable
- Fast Training: Converges quickly because the base model stays frozen
- Flexible: Can be easily adapted to different vision tasks
Technical Implementation
Based on the VoRA paper "Vision as LoRA" (arXiv:2503.20680), this implementation includes:
- Patch-based Vision Encoding: Images are divided into patches and encoded using a lightweight CNN
- Positional Embeddings: 2D positional embeddings for spatial understanding
- RMS Normalization: Stable normalization for vision features
- LoRA Integration: Efficient adaptation of attention and MLP layers
- Token Replacement: Vision embeddings replace placeholder tokens during the forward pass (sketched below)
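As an illustration of the token-replacement step, a common pattern in open vision-language models looks roughly like this. The function name and the assumption that the processor repeats the placeholder once per image patch are hypothetical, not confirmed details of this model:

```python
import torch

# Hypothetical sketch: overwrite the embeddings at placeholder positions with
# the vision patch embeddings. Assumes the processor has already expanded the
# "«" placeholder to one token per image patch; names are illustrative.
def merge_vision_embeddings(input_ids, inputs_embeds, vision_embeds, vision_token_id):
    # input_ids:     (B, T)     token ids containing N placeholder positions per row
    # inputs_embeds: (B, T, H)  embeddings from the frozen LLM's embedding layer
    # vision_embeds: (B, N, H)  patch embeddings from the vision encoder
    mask = input_ids == vision_token_id                # (B, T) boolean
    inputs_embeds = inputs_embeds.clone()              # avoid in-place edits on the original
    inputs_embeds[mask] = vision_embeds.reshape(-1, vision_embeds.size(-1)).to(inputs_embeds.dtype)
    return inputs_embeds
```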
Limitations
- Currently optimized for single-image understanding
- Vision placeholder must be included in text prompts
- Requires specific processor for proper image preprocessing
Citation
If you use this model, please cite the original VoRA paper:
```bibtex
@article{vora2025,
  title={Vision as LoRA},
  author={[Authors]},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}
```
License
This model is released under the Apache 2.0 License.