About MMEB v2 evaluation
Ops-MM-embedding achieved impressive results on MMEB. Could you share the evaluation methods and specific instructions you used when testing on MMEB v2?
Looking forward to it, especially the instructions for video retrieval!
Any update? I also want to know the instructions when evaluating on MMEB v2.
In both training and eval I simply split the instruction and the text in mmeb-train/mmeb-eval and feed them to the model separately—no extra tricks. For video retrieval, I treat the video frames as multiple images and run evaluation with batch size = 1.
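A minimal sketch of the video case, assuming frames are sampled with OpenCV and the model exposes a hypothetical `encode` helper (neither detail is from the original setup, only the frames-as-images idea):

```python
import cv2  # assumption: frames are sampled with OpenCV

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# Batch size 1: each video becomes a list of frame images and is encoded
# like a multi-image query. `model.encode` is a hypothetical helper, not
# the actual Ops-MM-embedding API.
def encode_video_query(model, video_path, instruction, text):
    frames = sample_frames(video_path)
    return model.encode(images=frames, instruction=instruction, text=text)
```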
Why split the instruction and text? For an image-qa example in mmeb-eval:
{
    'qry_text': 'What sport can you use this for?',
    'qry_inst': '<|image_1|>\nRepresent the given image with the following question: ',
}
Should the text prompt combine qry_inst and qry_text (maybe <|image_1|> needs to be processed), or should it just use qry_text?
In TIGER-Lab/MMEB-eval, qry_inst and qry_text were merged into the single qry_text field. I separate the instruction from the text so that the instruction can be placed in the corresponding part of the Qwen-VL chat template.
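To make that concrete, here is a minimal sketch, assuming a Qwen2-VL style processor from transformers; putting the instruction in the system turn is one reasonable reading of "the corresponding part of the chat template", not a confirmed detail:

```python
from PIL import Image
from transformers import AutoProcessor

# Assumptions: a Qwen2-VL style processor and a system-turn instruction slot.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

instruction = "Represent the given image with the following question:"
text = "What sport can you use this for?"
image = Image.open("example.jpg")

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": text},
    ]},
]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")
# `inputs` can then be fed to the embedding model's forward pass.
```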
So qry_inst is actually not used in MMEB-eval?
In MMEB-eval, the original qry_text is "<|image_1|>\nRepresent the given image with the following question: What sport can you use this for?".
After splitting it into an instruction and a text component, we have
- instruction: "Represent the given image with the following question:"
- text: "<|image_1|>What sport can you use this for?"
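For anyone reproducing this, a rough splitting heuristic along these lines should work; splitting on the first ": " after the image placeholders is an illustrative assumption, not the exact script behind the reported results:

```python
import re

def split_qry(qry_text):
    """Split an MMEB-eval qry_text into (instruction, text).

    Heuristic: strip leading <|image_N|> placeholders, treat everything up
    to the first ': ' as the instruction, and prepend the placeholders to
    the remaining text.
    """
    placeholders = "".join(re.findall(r"<\|image_\d+\|>", qry_text))
    remainder = re.sub(r"<\|image_\d+\|>\n?", "", qry_text)
    if ": " in remainder:
        instruction, text = remainder.split(": ", 1)
        instruction += ":"
    else:
        instruction, text = "", remainder
    return instruction, placeholders + text

qry = "<|image_1|>\nRepresent the given image with the following question: What sport can you use this for?"
inst, text = split_qry(qry)
# inst -> "Represent the given image with the following question:"
# text -> "<|image_1|>What sport can you use this for?"
```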