# VoRA Video Question Answering Model
This is a VoRA (Vision as LoRA) model fine-tuned for video question answering tasks.
## Model Details
- Base Model: CohereLabs/c4ai-command-r7b-12-2024
- Architecture: VoRA (Vision as LoRA)
- Task: Video Question Answering
- LoRA Rank: 16
- LoRA Alpha: 32
- Frames per Video: 8
- Frame Resolution: 224x224
- Vocabulary Size: 255034 (includes special tokens)
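For orientation, the adapter hyperparameters above map onto a `peft` configuration roughly like the sketch below; `target_modules` is an assumption (the usual attention projections) and is not confirmed by this card:

```python
from peft import LoraConfig

# Hypothetical reconstruction from the numbers above; target_modules
# is an assumption, not taken from this repo's training code
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```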
## Usage

### With Transformers Pipeline (Recommended)
```python
from transformers import pipeline

# Load the custom pipeline (trust_remote_code is required because the
# "video-question-answering" task is implemented by custom pipeline
# classes shipped with this repo)
pipe = pipeline(
    "video-question-answering",
    model="maximuspowers/cmd-r-vora",
    trust_remote_code=True,
)

# Ask an open-ended question about a video
result = pipe({
    "video": "path/to/your/video.mp4",
    "question": "What is happening in the video?",
    "question_format": "Open Ended",
})
print(result["answer"])

# For multiple-choice questions
result = pipe({
    "video": "path/to/your/video.mp4",
    "question": "What color is the car?",
    "question_format": "MCQ",
    "options": ["Red", "Blue", "Green", "Yellow"],
})
print(result["answer"])  # One of A, B, C, or D
```
### Loading Issues
If you encounter vocabulary size mismatches, ensure you have the latest version of the custom pipeline classes. The model includes special tokens that expand the vocabulary beyond the base model.
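A quick way to confirm the sizes line up (assuming `model` and `tokenizer` have been loaded, e.g. as in the manual example below) is to compare the two counts directly:

```python
# Both numbers should equal 255034; a mismatch means the embedding
# matrix was not resized after the special tokens were added
print(len(tokenizer))
print(model.get_input_embeddings().weight.shape[0])
```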
### Manual Loading (Advanced)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the tokenizer first so the embedding matrix can be sized to match
tokenizer = AutoTokenizer.from_pretrained("your-model-path")

# Load the base model and resize its embeddings to the expanded vocabulary
base_model = AutoModelForCausalLM.from_pretrained("CohereLabs/c4ai-command-r7b-12-2024")
base_model.resize_token_embeddings(len(tokenizer))

# Load the LoRA adapter on top of the resized base model
model = PeftModel.from_pretrained(base_model, "your-model-path")

# Load the vision embedding weights separately
vision_state = torch.load("your-model-path/vision_embedding.pt", map_location="cpu")
# ... (additional setup required)
```
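Because the remaining wiring lives in the repo's custom modeling code, a practical first step is to inspect what the checkpoint actually contains (this sketch assumes `vision_embedding.pt` holds a flat state dict of tensors):

```python
# List the parameter names and shapes the custom code expects to receive
for name, tensor in vision_state.items():
    print(name, tuple(tensor.shape))
```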
## Training Details
- Training Data: Custom video-text pairs
- Training Epochs: 3
- Learning Rate: 0.0002
- Batch Size: 1
- Gradient Accumulation: 4
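These settings give an effective batch size of 1 × 4 = 4 and map onto `transformers.TrainingArguments` roughly as follows (a sketch; `output_dir` is an assumed name, not taken from the repo):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction from the hyperparameters listed above
training_args = TrainingArguments(
    output_dir="cmd-r-vora",          # assumed
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,    # effective batch size: 1 * 4 = 4
)
```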
## Special Tokens

The model uses the following special tokens:

- `<video>`: placeholder for video frame embeddings
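As an illustration of how the vocabulary grows beyond the base model's, this is the standard Transformers idiom for registering such a token (a sketch of the equivalent step, not the repo's exact training code):

```python
# Register the <video> placeholder and grow the embedding matrix to match
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<video>"]})
base_model.resize_token_embeddings(len(tokenizer))
```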
## Limitations
- Requires video input in common formats (mp4, avi, etc.)
- Optimized for 8 frames per video
- Performance may vary on videos significantly different from training data
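Since the model expects 8 uniformly sampled frames at 224x224, a hedged sketch of that preprocessing with OpenCV is shown below; the custom pipeline performs its own frame extraction internally, so this is illustrative only:

```python
import cv2
import numpy as np

def sample_frames(path, num_frames=8, size=(224, 224)):
    """Uniformly sample and resize frames to the model's expected geometry."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # shape: (8, 224, 224, 3)
```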
## Citation

If you use this model, please cite:

```bibtex
@article{vora2025,
  title={VoRA: Vision as LoRA for Multimodal Large Language Models},
  author={Wang, Han and Ye, Yongjie and Li, Bingru and others},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}
```