Kai Chen

KaiChen1998

AI & ML interests

Omni-modal Large Language Models & Controllable Visual World Modeling & Autonomous Driving

Recent Activity

liked a Space 2 days ago
Emova-ollm/RACRO-demo
posted an update 3 days ago
updated a Space 3 days ago
Emova-ollm/RACRO-demo

Organizations

EMOVA Hugging Face

KaiChen1998's activity

posted an update 3 days ago
🤔 Advanced reasoning LLMs keep being released before your current MLLM alignment is done? Try our RACRO! Train once, then flexibly swap in novel LLM reasoners at inference time!

📢 RACRO is a novel methodology for building multi-modal large reasoning models. By decoupling multi-modal reasoning into 1) query-conditioned captioning and 2) text-only reasoning, we achieve SoTA results on multi-modal reasoning benchmarks while supporting flexible swaps to any advanced reasoning model at inference time. We further propose CRO, a novel GRPO variant that reinforces query-conditioned captioning using only verifiable data for multi-modal mathematical questions.
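
A minimal sketch of the two-stage pipeline described above; `mllm` and `reasoner` stand for any captioning MLLM and any text-only reasoner, and the prompts and function names are illustrative placeholders, not the released RACRO code:

```python
# Conceptual sketch of RACRO's decoupled inference (placeholder components,
# not the released implementation).

def query_conditioned_caption(mllm, image, question: str) -> str:
    """Stage 1: the (CRO-trained) MLLM writes a caption focused on the question."""
    prompt = f"Describe the image with the details needed to answer: {question}"
    return mllm(image, prompt)           # placeholder call

def text_only_reasoning(reasoner, caption: str, question: str) -> str:
    """Stage 2: any text-only LLM reasoner answers from the caption alone."""
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer step by step."
    return reasoner(prompt)              # placeholder call

def racro_answer(mllm, reasoner, image, question: str) -> str:
    caption = query_conditioned_caption(mllm, image, question)
    # Swapping `reasoner` for a newer LLM requires no re-alignment of `mllm`.
    return text_only_reasoning(reasoner, caption, question)
```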

✨ Highlights
✅ State-of-the-art multi-modal reasoning: we achieve SoTA performance on multi-modal mathematical benchmarks, exceeding advanced commercial models such as Claude-3.7-Sonnet and Gemini-2.0-Flash.
✅ Inference-time scalability: thanks to the perceptual decoupling, we can flexibly swap LLM reasoners at inference time, providing unique inference-time scalability for multi-modal reasoning.
✅ Highly efficient: with only a single round of Caption Reward Optimization (CRO) training on ~39K samples, RACRO avoids burdensome multi-modal alignment (e.g., 4.1T tokens for Qwen2.5-VL).
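
For intuition, a rough sketch of the CRO idea mentioned above, under a simplified reading: reward a sampled caption by whether a frozen text reasoner recovers the verifiable answer from it (the actual objective in the paper differs in detail):

```python
# Simplified, hypothetical CRO-style reward (illustrative only).

def cro_reward(caption: str, question: str, gold_answer: str, reasoner) -> float:
    """Score a sampled caption: 1.0 if a frozen text-only reasoner can recover
    the verifiable answer from the caption alone, else 0.0 (GRPO-style scalar)."""
    predicted = reasoner(f"Image description: {caption}\nQuestion: {question}")
    return 1.0 if is_equivalent(predicted, gold_answer) else 0.0

def is_equivalent(pred: str, gold: str) -> bool:
    # Placeholder answer matcher; a real math verifier would normalize and parse.
    return pred.strip().lower() == gold.strip().lower()
```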

🔥 You are all welcome to try and star!
- Paper: Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning (https://huggingface.co/papers/2506.04559)
- Github: https://github.com/gyhdog99/RACRO2
- Demo: https://huggingface.co/spaces/Emova-ollm/RACRO-demo
posted an update 3 months ago
📢 Our EMOVA paper has been accepted by CVPR 2025, and we are glad to release all resources, including code (training & inference), datasets (training & evaluation), and checkpoints (EMOVA-3B/7B/72B)!

🤗 EMOVA is a novel end-to-end omni-modal LLM that can see, hear, and speak. Given omni-modal (i.e., textual, visual, and speech) inputs, EMOVA can generate both textual and speech responses with vivid emotional control by utilizing a speech decoder and a style controller.
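
Purely for intuition, a conceptual sketch of that flow; every component and call below is a placeholder, not the EMOVA codebase or its API:

```python
# Conceptual omni-modal turn (placeholder components, illustrative only).

def emova_respond(llm, vision_encoder, speech_encoder, speech_decoder,
                  style_controller, image=None, audio=None, text=""):
    """Fuse text / vision / speech inputs, then emit both text and speech."""
    tokens = [text]
    if image is not None:
        tokens.append(vision_encoder(image))           # visual tokens
    if audio is not None:
        tokens.append(speech_encoder(audio))           # discrete speech units
    text_reply, speech_units = llm(tokens)             # LLM emits both streams
    style = style_controller(text_reply)               # e.g., emotion control
    audio_reply = speech_decoder(speech_units, style)  # emotional speech output
    return text_reply, audio_reply
```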

✨ EMOVA Highlights
✅ State-of-the-art omni-modality: EMOVA achieves SoTA-comparable results on both vision-language and speech benchmarks simultaneously.
✅ Device adaptation: our codebase supports training/inference on both NVIDIA GPUs (e.g., A800 & H20) and Ascend NPUs (e.g., 910B3)!
✅ Modular design: we integrate multiple implementations of the vision encoder, vision projector, and language model, even including the most recent DeepSeekMoE-tiny!
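
As a toy illustration of that kind of modularity (hypothetical registry and keys, not the actual EMOVA config schema):

```python
# Toy registry-based composition of swappable parts (illustrative only).
REGISTRY = {
    "dummy-vit":         lambda: "vision_encoder_stub",
    "mlp-2x":            lambda: "vision_projector_stub",
    "deepseek-moe-tiny": lambda: "language_model_stub",
}

def build(arch: dict) -> dict:
    """Assemble a model from the components named in a config dict."""
    return {part: REGISTRY[impl]() for part, impl in arch.items()}

model = build({
    "vision_encoder": "dummy-vit",
    "vision_projector": "mlp-2x",
    "language_model": "deepseek-moe-tiny",  # swap keys to change components
})
```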

🔥 You are all welcome to try and star!
- Project page: https://emova-ollm.github.io/
- Github: https://github.com/emova-ollm/EMOVA
- Demo: Emova-ollm/EMOVA-demo