# VoRA Video Question Answering Model
This is a VoRA (Vision as LoRA) model fine-tuned for video question answering tasks.
## Model Details
- Base Model: CohereLabs/c4ai-command-r7b-12-2024
- Architecture: VoRA (Vision as LoRA)
- Task: Video Question Answering
- LoRA Rank: 16
- LoRA Alpha: 32
- Frames per Video: 8
- Frame Resolution: 224x224
- Vocabulary Size: 255034 (includes special tokens)
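For orientation, the adapter hyperparameters above map onto a `peft` configuration roughly like the sketch below; `target_modules` is an assumption (the usual attention projections) and is not confirmed by this card:

```python
from peft import LoraConfig

# Hypothetical reconstruction from the numbers above; target_modules
# is an assumption, not taken from this repo's training code
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```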
## Usage

### With Transformers Pipeline (Recommended)
```python
from transformers import pipeline

# Load the custom pipeline (trust_remote_code is required because the
# "video-question-answering" task is implemented by custom pipeline
# classes shipped with this repo)
pipe = pipeline(
    "video-question-answering",
    model="maximuspowers/cmd-r-vora",
    trust_remote_code=True,
)

# Ask an open-ended question about a video
result = pipe({
    "video": "path/to/your/video.mp4",
    "question": "What is happening in the video?",
    "question_format": "Open Ended",
})
print(result["answer"])

# For multiple-choice questions
result = pipe({
    "video": "path/to/your/video.mp4",
    "question": "What color is the car?",
    "question_format": "MCQ",
    "options": ["Red", "Blue", "Green", "Yellow"],
})
print(result["answer"])  # One of A, B, C, or D
```
### Loading Issues
If you encounter vocabulary size mismatches, ensure you have the latest version of the custom pipeline classes. The model includes special tokens that expand the vocabulary beyond the base model.
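A quick way to confirm the sizes line up (assuming `model` and `tokenizer` have been loaded, e.g. as in the manual example below) is to compare the two counts directly:

```python
# Both numbers should equal 255034; a mismatch means the embedding
# matrix was not resized after the special tokens were added
print(len(tokenizer))
print(model.get_input_embeddings().weight.shape[0])
```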
### Manual Loading (Advanced)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the tokenizer first so the embedding matrix can be sized to match
tokenizer = AutoTokenizer.from_pretrained("your-model-path")

# Load the base model and resize its embeddings to the expanded vocabulary
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
base_model.resize_token_embeddings(len(tokenizer))

# Load the LoRA adapter on top of the resized base model
model = PeftModel.from_pretrained(base_model, "your-model-path")

# Load the vision embedding weights separately
vision_state = torch.load("your-model-path/vision_embedding.pt", map_location="cpu")
# ... (additional setup required)
```
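Because the remaining wiring lives in the repo's custom modeling code, a practical first step is to inspect what the checkpoint actually contains (this sketch assumes `vision_embedding.pt` holds a flat state dict of tensors):

```python
# List the parameter names and shapes the custom code expects to receive
for name, tensor in vision_state.items():
    print(name, tuple(tensor.shape))
```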
## Training Details
- Training Data: Custom video-text pairs
- Training Epochs: 3
- Learning Rate: 0.0002
- Batch Size: 1
- Gradient Accumulation: 4
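These settings give an effective batch size of 1 × 4 = 4 and map onto `transformers.TrainingArguments` roughly as follows (a sketch; `output_dir` is an assumed name, not taken from the repo):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction from the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="cmd-r-vora",          # assumed
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,    # effective batch size: 1 * 4 = 4
)
```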
## Special Tokens

The model uses the following special tokens:

- `<video>`: placeholder for video frame embeddings
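As an illustration of how the vocabulary grows beyond the base model's, this is the standard Transformers idiom for registering such a token (a sketch of the equivalent step, not the repo's exact training code):

```python
# Register the <video> placeholder and grow the embedding matrix to match
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<video>"]})
base_model.resize_token_embeddings(len(tokenizer))
```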
## Limitations
- Requires video input in common formats (mp4, avi, etc.)
- Optimized for 8 frames per video
- Performance may vary on videos significantly different from training data
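Since the model expects 8 uniformly sampled frames at 224x224, a hedged sketch of that preprocessing with OpenCV is shown below; the custom pipeline performs its own frame extraction internally, so this is illustrative only:

```python
import cv2
import numpy as np

def sample_frames(path, num_frames=8, size=(224, 224)):
    """Uniformly sample and resize frames to the model's expected geometry."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # shape: (8, 224, 224, 3)
```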
## Citation

If you use this model, please cite:

```bibtex
@article{vora2025,
  title={VoRA: Vision as LoRA for Multimodal Large Language Models},
  author={Wang, Han and Ye, Yongjie and Li, Bingru and others},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}
```