Spaces: open-r1/README


Multimodal R1

#10

by salma-remyx - opened Feb 1

Discussion

salma-remyx

Feb 1

"... we don’t want to stop at math datasets."

Right on, I'm experimenting with VLM fine-tunes using R1 distillations as the base LLM to see if its CoT reasoning can improve spatial reasoning.

This synthetic dataset uses a pipeline of models to infer distances and spatial relationships in a scene: https://huggingface.co/datasets/remyxai/OpenSpaces

Each image sample includes 5 QA pairs sampled from 40 templates.
Can the model learn to use relationships between different objects in a scene to reason about the best answer to the question?

User: What is the distance between the lamp and the chair?

Assistant: Let me solve this step by step.
<think>
The height of the lamp is X.
The sofa is to the left of the painting.
...
</think>
<answer>5.3 meters</answer>
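
For anyone wanting to check completions against this tag layout, here is a minimal sketch of a parser and a binary format reward. It assumes exactly one `<think>` block followed by one `<answer>` block; the function names `parse_trace` and `format_reward` are illustrative and not taken from any existing library.

```python
import re

# Assumed layout: one <think> block followed by one <answer> block.
_TRACE_RE = re.compile(
    r"<think>\s*(?P<think>.*?)\s*</think>\s*<answer>\s*(?P<answer>.*?)\s*</answer>",
    re.DOTALL,
)


def parse_trace(completion: str):
    """Return (reasoning, answer) if the completion follows the tag layout, else None."""
    match = _TRACE_RE.search(completion)
    if match is None:
        return None
    return match.group("think"), match.group("answer")


def format_reward(completion: str) -> float:
    """1.0 when both tag pairs are present and well formed, 0.0 otherwise."""
    return 1.0 if parse_trace(completion) is not None else 0.0


example = (
    "Let me solve this step by step.\n"
    "<think>\nThe height of the lamp is X.\n"
    "The sofa is to the left of the painting.\n</think>\n"
    "<answer>5.3 meters</answer>"
)
reasoning, answer = parse_trace(example)
```

A misspelled or missing tag makes `parse_trace` return `None`, so malformed samples can be dropped before training or scored as 0.0 during it.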

Thoughts from the community about restructuring the dataset samples to use the context of 4 QA pairs to reason about the last one?
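
One possible way to do that restructuring, sketched under assumptions: each image's QA pairs are held in a list, one pair is held out as the target, and the rest become context facts. The `QAPair` class, the output keys (`question`, `context_facts`, `answer`), and the sample pairs are all hypothetical and do not reflect the actual OpenSpaces schema.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str


def build_reasoning_sample(qa_pairs, target_index=-1):
    """Use all QA pairs except one as context facts for the remaining pair."""
    pairs = list(qa_pairs)
    target = pairs.pop(target_index)  # held-out question to reason about
    facts = [f"{p.question} {p.answer}" for p in pairs]
    return {
        "question": target.question,
        "context_facts": facts,
        "answer": target.answer,
    }


# Illustrative pairs only; real samples come from the dataset templates.
pairs = [
    QAPair("How tall is the lamp?", "X"),
    QAPair("Is the sofa to the left of the painting?", "Yes"),
    QAPair("Which is closer to the camera, the lamp or the chair?", "The chair"),
    QAPair("Is the chair in front of the sofa?", "Yes"),
    QAPair("What is the distance between the lamp and the chair?", "5.3 meters"),
]
sample = build_reasoning_sample(pairs)
```

Rotating `target_index` over the pairs would yield several reasoning samples per image from the same facts.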

salma-remyx

Feb 24

Here, I generate R1-style reasoning traces with tags: after resolving the set of facts from the remaining QA pairs, I use another model to rephrase that information into a justification for the provided answer.

salma-remyx

Mar 15

Here are a little over 12K samples, generated as described above at a cost of $50.
https://huggingface.co/datasets/remyxai/OpenSpaces_MC_R1

salma-remyx

Mar 15

It may be worth experimenting with filtering the dataset further using Prometheus's VLM-as-a-Judge, as documented here:
https://huggingface.co/datasets/remyxai/SpaceJudgeDataset
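
A minimal sketch of what such filtering could look like, assuming each sample has already been scored by the judge and the score is stored on the sample. The `judge_score` key and the threshold are hypothetical; the actual column names and score scale in SpaceJudgeDataset may differ.

```python
def filter_by_judge_score(samples, min_score, score_key="judge_score"):
    """Keep samples whose judge score is at least min_score; drop unscored ones."""
    kept = []
    for sample in samples:
        score = sample.get(score_key)  # hypothetical field name
        if score is not None and score >= min_score:
            kept.append(sample)
    return kept


# Illustrative records, not real dataset rows.
scored = [
    {"id": 0, "judge_score": 5},
    {"id": 1, "judge_score": 2},
    {"id": 2},  # never scored by the judge
    {"id": 3, "judge_score": 4},
]
high_quality = filter_by_judge_score(scored, min_score=4)
```

Dropping unscored samples is a conservative choice; keeping them instead would trade precision for dataset size.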

taesiri

Mar 25

Hey @salma-remyx

I would love to learn more about your experiments. Are you still working on this?
We recently created a dataset of hands with varying numbers of fingers. Do you think GRPO can help a model count fingers better?

salma-remyx

Mar 27

Hey @taesiri, thank you, and yes! I'm still actively adding to VQASynth.

And the taesiri/FluxHands-FingerCount dataset looks awesome! Building an expert pipeline to verify finger count could be tricky, especially since you have special instances where there are 5+ fingers. If you could get a pipeline of expert models, with a heuristic to catch the special cases, you could build a verification engine and then train a VLM to do this task.
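
A rough sketch of that idea, under stated assumptions: several expert models each return an integer count, a majority vote gives the verified label, and counts above the usual five are flagged for review. The verified label then drives a simple binary reward of the kind a GRPO-style trainer could consume. All names and thresholds (`consensus_count`, `count_reward`, `min_agreement`, `usual_max`) are hypothetical.

```python
import re
from collections import Counter


def consensus_count(expert_counts, min_agreement=2, usual_max=5):
    """Return (count, needs_review) from several expert-model predictions."""
    if not expert_counts:
        return None, True
    count, votes = Counter(expert_counts).most_common(1)[0]
    if votes < min_agreement:
        return None, True  # experts disagree, so no label is trusted
    # Heuristic: hands with more than the usual number of fingers are the
    # special cases, so flag them for review even when the experts agree.
    return count, count > usual_max


def count_reward(completion, verified_count):
    """1.0 if the integer inside <answer> tags equals the verified count, else 0.0."""
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", completion)
    if match is None:
        return 0.0
    return 1.0 if int(match.group(1)) == verified_count else 0.0
```

For example, `consensus_count([5, 5, 4])` trusts the majority label, while `consensus_count([3, 4, 5])` returns no label because no two experts agree.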

salma-remyx

Mar 31 (edited Mar 31)

VQASynth has been updated: given an HF dataset of images, it now generates samples with CoT spatial reasoning traces for training your VLM to use test-time compute.

It also includes improved 3D scene reconstruction with VGGT and Molmo point-prompting SAM2.

salma-remyx

Apr 6

Sharing the SpaceThinker dataset: https://huggingface.co/datasets/remyxai/SpaceThinker
It was synthesized from a subset of the Cauldron using VQASynth.

salma-remyx

Apr 17

Check out the model: https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B
