---
license: apache-2.0
tags:
- multimodal
- vision-language
- video-understanding
- spatial-reasoning
- visuospatial-cognition
- llava
- qwen
- llava-video
datasets:
- nkkbr/ViCA-322K
- nkkbr/ViCA-thinking-2.68k
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA-ARKitScenes-7B
base_model: lmms-lab/LLaVA-Video-7B-Qwen2
---

## Usage and Full Documentation
For the detailed model description, training setup, datasets, evaluation results, and inference code, please refer to the main ViCA-7B README:
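As a quick start, the sketch below shows one plausible way to run video inference, assuming this checkpoint follows the same LLaVA-NeXT loading pattern as its base model (`lmms-lab/LLaVA-Video-7B-Qwen2`) and that the `llava` package from the LLaVA-NeXT repository is installed. The repo id `nkkbr/ViCA-ARKitScenes-7B`, the sample video path, and the question text are illustrative assumptions; see the main README for the authoritative inference code.

```python
# Minimal inference sketch, assuming the LLaVA-NeXT ("llava") codebase is installed.
import copy

import numpy as np
import torch
from decord import VideoReader, cpu
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token
from llava.model.builder import load_pretrained_model


def load_video(video_path, max_frames_num=64):
    # Uniformly sample up to max_frames_num frames from the clip.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()  # (T, H, W, C) uint8 array


pretrained = "nkkbr/ViCA-ARKitScenes-7B"  # assumed repo id, inferred from this card
tokenizer, model, image_processor, _ = load_pretrained_model(
    pretrained, None, "llava_qwen", torch_dtype="bfloat16", device_map="auto"
)
model.eval()

frames = load_video("sample_video.mp4")  # placeholder path
video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
video = [video.to(model.device, dtype=torch.bfloat16)]

# Build the Qwen-style chat prompt with the video placeholder token.
question = DEFAULT_IMAGE_TOKEN + "\nHow many chairs are visible in this room?"
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(
    conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).to(model.device)

with torch.inference_mode():
    output = model.generate(
        input_ids,
        images=video,
        modalities=["video"],
        do_sample=False,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0].strip())
```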