Kai Chen

KaiChen1998

https://kaichen1998.github.io/

AI & ML interests

Omni-modal Large Language Models & Controllable Visual World Modeling & Autonomous Driving

Recent Activity

liked a Space 2 days ago

Emova-ollm/RACRO-demo

posted an update 3 days ago

🤔 Do advanced reasoning LLMs keep being released before your current MLLM alignment is done? Try our RACRO! Train once, then flexibly switch to new LLM reasoners at inference time!

📢 RACRO is a novel methodology for building multi-modal large reasoning models. By decoupling multi-modal reasoning into 1) query-conditioned captioning and 2) text-only reasoning, we achieve SoTA results on multi-modal reasoning benchmarks while supporting flexible switching to any advanced reasoning model at inference time. We further propose CRO, a novel GRPO variant that reinforces query-conditioned captioning using only verifiable data for multi-modal mathematical questions.

✨ Highlights

✅ State-of-the-art multi-modal reasoning: we achieve SoTA performance on multi-modal mathematical benchmarks, exceeding advanced commercial models like Claude-3.7-Sonnet and Gemini-2.0-Flash.

✅ Inference-time scalability: thanks to the perceptual decoupling, we can flexibly swap LLM reasoners at inference time, which gives multi-modal reasoning a unique form of inference-time scalability.

✅ Highly efficient: with only a single round of Caption Reward Optimization (CRO) training on ~39K samples, RACRO avoids burdensome multi-modal alignment (e.g., 4.1T tokens for Qwen2.5-VL).

🔥 You are all welcome to try it out and star the repo!

- Paper: https://huggingface.co/papers/2506.04559
- GitHub: https://github.com/gyhdog99/RACRO2
- Demo: https://huggingface.co/spaces/Emova-ollm/RACRO-demo
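To make the decoupling idea in the post concrete, here is a minimal sketch of a two-stage pipeline: a captioner turns the image into text conditioned on the question, and a text-only reasoner answers from that text. This is not the RACRO codebase; the names `Captioner`, `Reasoner`, `decoupled_answer`, and the toy stand-ins below are all hypothetical, and the prompt format is an assumption made for illustration.

```python
from typing import Callable

# Hypothetical interfaces for illustration only (not the RACRO API).
Captioner = Callable[[bytes, str], str]  # (image bytes, question) -> caption
Reasoner = Callable[[str], str]          # text prompt -> answer


def decoupled_answer(
    image: bytes, question: str, captioner: Captioner, reasoner: Reasoner
) -> str:
    """Answer a visual question in two stages: caption first, then reason on text."""
    # Stage 1: query-conditioned captioning. The caption depends on the question,
    # so it can focus on the visual details the question actually needs.
    caption = captioner(image, question)
    # Stage 2: text-only reasoning. The reasoner never sees the image,
    # which is what lets it be replaced without retraining the captioner.
    prompt = f"Image description: {caption}\nQuestion: {question}"
    return reasoner(prompt)


# Toy stand-ins so the sketch runs without any model weights.
def toy_captioner(image: bytes, question: str) -> str:
    return f"a right triangle with sides 3, 4, 5 (asked: {question})"


def reasoner_small(prompt: str) -> str:
    return "small:" + prompt.splitlines()[-1]


def reasoner_large(prompt: str) -> str:
    return "large:" + prompt.splitlines()[-1]


if __name__ == "__main__":
    q = "What is the perimeter?"
    # Same captioner, two different reasoners swapped at call time.
    print(decoupled_answer(b"", q, toy_captioner, reasoner_small))
    print(decoupled_answer(b"", q, toy_captioner, reasoner_large))
```

Because the reasoner is just a function from text to text in this sketch, switching to a stronger one is a change at the call site only, which mirrors the "train once, swap reasoners at inference" claim.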

updated a Space 3 days ago

Emova-ollm/RACRO-demo

KaiChen1998's activity

upvoted a paper 12 days ago

Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning

Paper • 2506.04559 • Published 14 days ago • 2

upvoted 2 collections 3 months ago

EMOVA-Datasets

A collection of EMOVA datasets (https://emova-ollm.github.io/) • 6 items • Updated Mar 14 • 2

EMOVA-Models

A collection of EMOVA models (https://emova-ollm.github.io/) • 11 items • Updated Mar 14 • 3

upvoted a paper 7 months ago

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Paper • 2411.13807 • Published Nov 21, 2024 • 11

upvoted a collection 8 months ago

GeoDiffusion

A collection of GeoDiffusion checkpoints (https://kaichen1998.github.io/projects/geodiffusion/) • 11 items • Updated Dec 5, 2024 • 2

upvoted a paper 9 months ago

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Paper • 2409.18042 • Published Sep 26, 2024 • 41