Diffusion LLMs are coming for autoregressive LLMs ⚡️⚡️ Inception Labs' new diffusion model demolishes all leading LLMs on generation speed, with equal quality!
Inception Labs was founded a few months ago, and they're not sleeping: after dropping a code model, they just published Mercury chat, a diffusion-based chat model that reaches 1,000 tokens/second on an H100, i.e. 10x faster than models of equivalent quality on the same hardware!
What's the breakthrough? Well, instead of generating tokens left-to-right like the more common autoregressive LLMs, diffusion models generate whole blocks of text at once, with successive steps refining the entire text.
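To make the idea concrete, here's a toy Python sketch of that kind of decoding loop (purely illustrative, nothing from Inception Labs' actual code): start from a fully masked block, predict every position in parallel at each step, and only commit the most confident guesses while the rest keep getting refined. The `predict_all_positions` function is a made-up stand-in for the real denoising model.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "hello", "world"]
MASK = "<mask>"

def predict_all_positions(tokens):
    """Stand-in for the denoising model: returns (token, confidence) for every position."""
    return [(random.choice(VOCAB), random.random()) if t == MASK else (t, 1.0)
            for t in tokens]

def diffusion_generate(length=8, steps=4):
    tokens = [MASK] * length                   # start from a fully masked block
    per_step = max(1, length // steps)         # positions to commit at each refinement step
    for _ in range(steps):
        preds = predict_all_positions(tokens)  # one parallel forward pass over the whole block
        # commit only the most confident still-masked positions, refine the rest next step
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    # final pass fills anything still masked
    tokens = [p[0] for p in predict_all_positions(tokens)]
    return " ".join(tokens)

print(diffusion_generate())
```

The speedup intuition is right there: each step is one forward pass that touches the whole block, so a handful of refinement steps can replace hundreds of one-token-at-a-time autoregressive steps.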
Diffusion models being really fast isn't new: Google already showed promising results back in May with Gemini Diffusion, and Inception Labs themselves had already published their coding model a few months ago.
But reaching that quality is new - and Inception Labs has now shown that their models work well in chat too, which could have been challenging given that streaming generation naturally suits left-to-right decoding.
They have a playground available at chat.inceptionlabs.ai, and I recommend giving it a try!