VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
Abstract
We present VLMEvalKit, an open-source toolkit for evaluating large multi-modality models, built on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. VLMEvalKit implements over 70 large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 multi-modal benchmarks. New models can be added by implementing a single interface, and the toolkit automatically handles the remaining workload: data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently used mainly for evaluating large vision-language models, its design is compatible with future extensions to additional modalities such as audio and video. Based on the evaluation results obtained with the toolkit, we host the OpenVLM Leaderboard, a comprehensive leaderboard that tracks the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
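As a rough illustration of the "single interface" design described in the abstract, the sketch below shows how a new model might be wrapped so that the toolkit can drive data preparation, distributed inference, post-processing, and metric calculation around it. The class name, method signature, and message format here are assumptions made for illustration, not VLMEvalKit's exact API.

```python
# Minimal sketch of a single-interface model wrapper (illustrative only).
# The class name, generate() signature, and message format are assumptions,
# not necessarily the exact interface used by VLMEvalKit.

class MyVLM:
    """Hypothetical wrapper around a custom vision-language model."""

    def __init__(self, model_path: str):
        # Load weights, tokenizer/processor, etc.; details depend on the model.
        self.model_path = model_path

    def generate(self, message, dataset=None):
        """Take one multi-modal prompt and return a text prediction.

        `message` is assumed to be a list of items such as
        {'type': 'image', 'value': '/path/to/img.jpg'} and
        {'type': 'text', 'value': 'Describe the image.'}.
        The toolkit would handle everything around this call:
        benchmark data loading, distributed inference, answer
        post-processing, and metric computation.
        """
        images = [m["value"] for m in message if m["type"] == "image"]
        prompt = "\n".join(m["value"] for m in message if m["type"] == "text")
        # ... run the underlying model on (images, prompt) here ...
        return "model answer"
```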
Community
Code: https://github.com/open-compass/VLMEvalKit
OpenVLM Leaderboard: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Related papers recommended by the Semantic Scholar API:
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (2024)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (2024)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs (2024)
- A-Bench: Are LMMs Masters at Evaluating AI-generated Images? (2024)
- MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning (2024)