Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding? Paper • 2505.14321 • Published 19 days ago • 10
MMAudio: generating synchronized audio from video/text 🔊 Generate audio from video or text prompts • Running on Zero • 729
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization Paper • 2503.23377 • Published Mar 30 • 57
BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain Paper • 2409.20075 • Published Sep 30, 2024 • 2
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering Paper • 2503.16867 • Published Mar 21 • 11 • 2
YuLan-Mini: An Open Data-efficient Language Model Paper • 2412.17743 • Published Dec 23, 2024 • 67
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models Paper • 2410.02740 • Published Oct 3, 2024 • 55
RA-DIT: Retrieval-Augmented Dual Instruction Tuning Paper • 2310.01352 • Published Oct 2, 2023 • 7