Any plan to release a vision-enabled version with the same (or nearly the same) base and instruct model?

#7
by drmcbride - opened

As you can tell, we all really enjoy the model and the way it sounds, because it doesn't sound as robotic as other SOTA models. Adding full vision capability later on, in the form of an update/upgrade, would be a very welcome addition and would make it a better replacement for some use cases, especially if you follow the same design philosophy you used to create this amazing model.

Moonshot AI org

Thanks. We do plan to give Kimi K2 vision capability. We already have the relevant technical expertise (see Kimi-VL on our homepage), but it will still take some time.

Will it still act like this model, or will it be retrained entirely and act differently?

Moonshot AI org

You can refer to our Kimi-VL report: https://arxiv.org/abs/2504.07491. The behavior should be similar, but it cannot be exactly the same.

Moonshot AI org

Thank you for your interest in the vision-enabled version. We will ship it when it meets our expectations at this scale, and we hope that when it arrives, it will not let you down.

As for a prototype of a tiny vision-language model, please refer to Kimi-VL-A3B: https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506.

Hey @teowu @lsw825, is it feasible in your assessment to finetune moonshotai/Kimi-K2-Instruct or Base for vision? Or should the prototype VL model mentioned above be sufficient for general video tasks as a SOTA open-weights VL model as of today? I'm particularly interested in the agentic capabilities of K2, so I'm not sure whether Kimi-VL-A3B-Thinking-2506 will perform similarly and handle agentic tool usage well, but I have not yet tested it against such tasks (a rough sketch of the kind of probe I have in mind is below).
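For context, this is a minimal sketch of the tool-calling probe I'd start with, assuming the model is served behind an OpenAI-compatible chat completions endpoint. The base URL, API key, model id, and the get_weather tool are all placeholders I made up for illustration, not anything from Moonshot's docs.

```python
# Minimal tool-calling probe against an OpenAI-compatible endpoint.
# Assumptions: base_url, api_key, and the model id are placeholders;
# get_weather is a made-up example tool, not a real Moonshot/Kimi tool.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Kimi-VL-A3B-Thinking-2506",  # placeholder: whatever id the server exposes
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
)

# If the model handles tool calling, tool_calls should contain a get_weather call;
# if not, it will usually just answer in plain text instead.
print(resp.choices[0].message.tool_calls)
print(resp.choices[0].message.content)
```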

Moonshot AI org

No, we didn't optimize Kimi-VL-A3B-Thinking-2506 for agentic tool usage.

Are you planning to open source Kimi's chat platform, with context management, OCR, etc.? Or not at all?

As a software engineer, I hope you'd train (not now or in the next model, maybe much later) on more of the IETF RFC standards (e.g. test it by making it recall any section's title/content from just the RFC number + section number, in random contexts, but without making it break the instructions of the current context, of course) and on programming docs such as the latest versions of frameworks and libraries; its current technical knowledge retrieval is already awesome, but it could be even more awesome. A rough sketch of the kind of recall probe I mean follows below.
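To make that recall test concrete, here is a minimal sketch, again assuming an OpenAI-compatible endpoint. The base URL, API key, and model id are placeholders, and the tiny expected-title table is only illustrative; a real probe would rebuild it from the actual RFC texts.

```python
# Minimal RFC section-title recall probe.
# Assumptions: base_url, api_key, and model id are placeholders; the EXPECTED
# table is a tiny illustrative sample and should be rebuilt (and verified)
# from the actual RFC texts, e.g. an index scraped from rfc-editor.org.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

EXPECTED = {
    ("9110", "9.3.1"): "GET",             # RFC 9110 (HTTP Semantics), Section 9.3.1
    ("9112", "2.1"): "Message Format",    # RFC 9112 (HTTP/1.1), Section 2.1
}

def recalled_title(rfc: str, section: str) -> str:
    resp = client.chat.completions.create(
        model="kimi-k2-instruct",  # placeholder model id
        messages=[{
            "role": "user",
            "content": (
                f"From memory only, what is the title of Section {section} "
                f"of RFC {rfc}? Reply with the title and nothing else."
            ),
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

hits = sum(
    recalled_title(rfc, sec).lower() == title.lower()
    for (rfc, sec), title in EXPECTED.items()
)
print(f"exact-title recall: {hits}/{len(EXPECTED)}")
```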

RFCs alone dictate most of how software and the entire IT industry are built, so having the model know these blueprints like the back of its hand is going to be big (it can already do it to some extent with the right few-shot prompting, which amazed me). The same goes for web standards like W3C, WHATWG, and TC39. For example, WHATWG (https://html.spec.whatwg.org/multipage/) is the HTML web standard; it has a ton of new content that is already standardized in browsers right now but that even most LLMs still know nothing about, say CSS anchor positioning, the View Transitions API (and its CSS properties), field-sizing, the selectedcontent element, and so on. There are a ton of examples on CodePen and around the web, too.

Anyway, data collection (and organization) can be hard, so please take this as a simple to-do for later rather than anything urgent.

Back to the topic: adding vision will also add a whole new dimension of complexity to the training data (and inference data); honestly, I can't wait for it! I can already tell the entire Moonshot AI team is cooking something great, so I'd rather not pile on my silly suggestions above. :)
