jw2yang committed
Commit f7db9ca · 1 Parent(s): 5e21a94
Files changed (1):
  1. README.md +2 -2
README.md CHANGED
@@ -194,7 +194,7 @@ Our training data consists of:
 
  * Generic Video SFT Data: [ShareGPT4Video](https://sharegpt4video.github.io/) and [LLaVA-Video](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K).
 
- * Instructional Video Data: [Ego4d](https://ego4d-data.org/), [Somethingv2](https://www.qualcomm.com/developer/software/something-something-v-2-dataset), [Epic-Kitchen](https://epic-kitchens.github.io/2025), [COIN](https://coin-dataset.github.io/).
+ * Instructional Video Data: [Ego4d](https://ego4d-data.org/), [Somethingv2](https://www.qualcomm.com/developer/software/something-something-v-2-dataset), [Epic-Kitchen](https://epic-kitchens.github.io/2025) and other related instructional videos.
 
  * Robotics Manipulation Data: [Open-X-Embodiment](https://robotics-transformer-x.github.io/).
 
@@ -276,7 +276,7 @@ We evaluate the model's performance after finetuning on the following datasets:
 
  * Multimodal Image Understanding and Reasoning: [VQAv2](https://visualqa.org/), [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html), [MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation), [POPE](https://huggingface.co/datasets/lmms-lab/POPE), [TextVQA](https://textvqa.org/), [ChartQA](https://github.com/vis-nlp/ChartQA), [DocVQA](https://www.docvqa.org/).
 
- * Multimodal Video Understanding and Reasoning: [Next-QA](https://github.com/doc-doc/NExT-QA), [VideoMME](https://video-mme.github.io/home_page.html), [MVBench](https://arxiv.org/abs/2311.17005).
+ * Multimodal Video Understanding and Reasoning: [Next-QA](https://github.com/doc-doc/NExT-QA), [VideoMME](https://video-mme.github.io/home_page.html), [MVBench](https://huggingface.co/datasets/OpenGVLab/MVBench).
 
  #### Metrics
  <!-- {{ testing_metrics | default("[More Information Needed]", true)}} -->
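
To sanity-check the updated MVBench link, here is a minimal sketch of fetching the benchmark's annotation files from the Hugging Face dataset repo referenced above. It assumes `huggingface_hub` is installed and that the QA annotations live under a `json/` folder in the repo; check the dataset card for the actual file layout before relying on these patterns.

```python
# Minimal sketch: download the MVBench QA annotations referenced in the
# evaluation list above. The "json/*" pattern is an assumption about the
# repo layout; the large video archives are deliberately skipped.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OpenGVLab/MVBench",
    repo_type="dataset",
    allow_patterns=["json/*", "*.md"],
)
print(f"MVBench annotations downloaded to: {local_dir}")
```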