Aymeric Roucher

m-ric

AI & ML interests

Leading Agents at Hugging Face 🤗

Organizations

Hugging Face, Orange, Atmos Bank, Hugging Test Lab, Tools, HuggingFaceM4, lecocqassociate, huggingPartyParis, Supreme, Propulse Lab, FactSet, Leaderboard Organization, CGIAR, Aperture Laboratories, AI Energy Score, C&A, Social Post Explorers, Dev Mode Explorers, Agent Collab, SLLHF, Data Agents, Hugging Face Party @ PyTorch Conference, Hugging Face FineVideo, Nerdy Face, Hugging Face Science, Agents Leaderboard, smolagents, Hugging Face Agents Course, Open R1, SIMS, Open Agents, GeekAgents

Posts 98

New king of open VLMs: InternVL3 takes Qwen 2.5's crown! 👑

InternVL has been a wildly successful series of models, and the latest iteration has just taken back the crown thanks to its superior, natively multimodal vision training pipeline.

āž”ļø Most of the vision language models (VLMs) these days are built like Frankenstein : take a good text-only Large Language Model (LLM) backbone, stitch a specific vision transformer (ViT) on top of it. Then the training is sequential šŸ”¢ : 1. Freeze the LLM weights while you train the ViT only to work with the LLM part, then 2. Unfreeze all weights to train all weights in order to work together.

💫 The Shanghai Lab decided to challenge this paradigm with an approach they call "native". For each of their model sizes, they still start from a good LLM (mostly the Qwen-2.5 series, did I tell you I'm a huge fan of Qwen? ❤️) and stitch on the ViT, but they don't freeze anything: they train all weights together on interleaved text and image understanding data in a single pre-training phase 🎨.
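Sketched the same way, the "native" recipe is a single phase with nothing frozen (again hypothetical: it assumes a model whose forward pass returns an object with a `.loss`, and a dataloader that interleaves text-only and image+text batches):

```python
def native_pretraining(model, dataloader, optimizer):
    # No frozen stage: ViT, projector, and LLM weights are all trainable from step one.
    for p in model.parameters():
        p.requires_grad = True

    model.train()
    for batch in dataloader:
        # Each batch interleaves pure-text and image-understanding samples,
        # so both modalities shape every component in one pre-training phase.
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```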

They claim it results in more seamless interactions between modalities. And the results prove them right: they took the crown of top VLMs, at nearly all sizes, from their Qwen-2.5 parents. 👑

Articles 10

Trace & Evaluate your Agent with Arize Phoenix