Multimodal R1

#10
by salma-remyx - opened

"... we don’t want to stop at math datasets."

Right on, I'm experimenting with VLM fine-tunes using R1 distillations as the base LLM to see if its CoT reasoning can improve spatial reasoning.

This synthetic dataset uses a pipeline of models to infer distances and spatial relationships in a scene: https://huggingface.co/datasets/remyxai/OpenSpaces

Each image sample includes 5 QA pairs sampled from 40 templates.
Can the model learn to use the relationships between different objects in a scene to reason toward the best answer to a question?

User: What is the distance between the lamp and the chair?

Assistant: Let me solve this step by step.
<think>
The height of the lamp is X.
The sofa is to the left of the painting.
...
</think>
<answer>5.3 meters</answer>
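
To make the target format concrete, here is a minimal sketch of how a response in the <think>/<answer> layout could be checked and split into its parts, for example when filtering generations or scoring a format reward. The function name `parse_response` and the regex are my own assumptions for illustration, not part of any existing pipeline.

```python
import re

# Assumed layout: one <think>...</think> block followed by one <answer>...</answer> block.
RESPONSE_PATTERN = re.compile(
    r"<think>\s*(?P<think>.*?)\s*</think>\s*<answer>\s*(?P<answer>.*?)\s*</answer>",
    re.DOTALL,
)


def parse_response(text):
    """Return (reasoning, answer) if the text follows the layout, else None."""
    match = RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    return match.group("think"), match.group("answer")


# Hypothetical usage on a response shaped like the example in this post.
sample = (
    "Let me solve this step by step.\n"
    "<think>\nThe sofa is to the left of the painting.\n</think>\n"
    "<answer>5.3 meters</answer>"
)
parsed = parse_response(sample)
```

A malformed tag (a misspelled opening <answer>, say) makes `parse_response` return None, so it doubles as a cheap sanity check on generated samples.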

Any thoughts from the community on restructuring the dataset samples so that 4 of the QA pairs serve as context for reasoning about the fifth?
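
As a rough sketch of what I mean by that restructuring: hold out one QA pair as the target and fold the others into the <think> block. Everything here is an assumption for illustration. The `(question, answer)` tuple layout, the `build_context_sample` name, and the way context pairs are written into the reasoning are hypothetical and do not reflect the actual OpenSpaces schema.

```python
def build_context_sample(qa_pairs, target_index=-1):
    """Hold out one QA pair as the target and use the rest as reasoning context.

    qa_pairs: list of (question, answer) string tuples for a single image
    (a hypothetical layout, not the real dataset schema).
    """
    if len(qa_pairs) < 2:
        raise ValueError("need at least two QA pairs per image")

    # Normalize negative indices so -1 means "the last pair".
    target_index = target_index % len(qa_pairs)
    target_q, target_a = qa_pairs[target_index]

    # Every remaining pair becomes one line of the reasoning trace.
    context = [pair for i, pair in enumerate(qa_pairs) if i != target_index]
    think = "\n".join(f"{q} {a}" for q, a in context)

    completion = (
        "Assistant: Let me solve this step by step.\n"
        f"<think>\n{think}\n</think>\n"
        f"<answer>{target_a}</answer>"
    )
    return {"prompt": f"User: {target_q}", "completion": completion}
```

With 5 QA pairs per image this yields 4 context lines and 1 target; varying `target_index` would give up to 5 different samples from the same image.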
