Multimodal R1
#10 by salma-remyx - opened
"... we don’t want to stop at math datasets."
Right on, I'm experimenting with VLM fine-tunes that use R1 distillations as the base LLM to see if its CoT reasoning can improve spatial reasoning.
This synthetic dataset uses a pipeline of models to infer distances and spatial relationships in a scene: https://huggingface.co/datasets/remyxai/OpenSpaces
Each image sample includes 5 QA pairs sampled from 40 templates.
Can the model learn to use the relationships between different objects in a scene to reason about the best answer to the question?
User: What is the distance between the lamp and the chair?
Assistant: Let me solve this step by step.
<think>
The height of the lamp is X.
The sofa is to the left of the painting.
...
</think>
<answer>5.3 meters</answer>
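For anyone who wants to check completions against this layout, here is a minimal sketch of a parser. The helper name `parse_completion` and the regex are my own assumptions for illustration, not part of any existing training code.

```python
import re

# Assumed layout: one <think> block followed by one <answer> block.
# This pattern is illustrative, not taken from any library.
PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)


def parse_completion(text):
    """Return (reasoning, answer) if the text matches the layout, else None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    reasoning = match.group(1).strip()
    answer = match.group(2).strip()
    return reasoning, answer
```

A completion with a malformed or missing tag returns `None`, which could be used as a simple format check before comparing the answer to the label.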
Any thoughts from the community on restructuring the dataset samples so that 4 of the QA pairs serve as context for reasoning about the fifth?
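To make the idea concrete, here is a rough sketch of what that restructuring might look like. It assumes each sample's QA pairs are available as a list of `(question, answer)` tuples; that layout, the `build_sample` name, and the prompt wording are hypothetical and may not match the actual OpenSpaces schema.

```python
def build_sample(qa_pairs, target_index=-1):
    """Turn N QA pairs into one prompt: N-1 pairs as context, one as the target.

    qa_pairs: list of (question, answer) tuples (assumed layout).
    target_index: which pair to hold out as the question to answer.
    """
    if len(qa_pairs) < 2:
        raise ValueError("need at least one context pair and one target pair")

    pairs = list(qa_pairs)  # copy so the caller's list is not modified
    target_question, target_answer = pairs.pop(target_index)

    # The remaining pairs become known facts about the scene.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in pairs)

    prompt = (
        "Known facts about the scene:\n"
        f"{context}\n\n"
        f"User: {target_question}\n"
        "Assistant: Let me solve this step by step."
    )
    return {"prompt": prompt, "answer": target_answer}
```

Rotating `target_index` over the 5 pairs would give 5 training examples per image, each with a different held-out question.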