Muddit Plus
Text-to-image and image-to-text with one model
Machine Learning, Content Creation, Generative Modeling
MeissonFlow Research is a non-commercial research group dedicated to advancing generative modeling techniques for structured visual and multimodal content creation. We aim to design models and algorithms that help creators produce high-quality content with greater efficiency and control.
Our journey began with MaskGIT, a pioneering work by Huiwen Chang, which introduced a bidirectional transformer decoder for image synthesis—outperforming traditional raster-scan autoregressive (AR) generation. This paradigm was later extended to text-to-image synthesis in MUSE.
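For readers unfamiliar with this paradigm, the sketch below illustrates the core idea of MaskGIT-style iterative parallel decoding: start from a fully masked grid of discrete image tokens and, at each step, let a bidirectional transformer predict every position at once, keep only the most confident predictions, and re-mask the rest on a shrinking schedule. This is a minimal, illustrative sketch only; the `model` call, the grid size, the mask-token index, and the cosine schedule are hypothetical placeholders, not the actual MaskGIT, MUSE, or Meissonic implementations.

```python
import math
import torch

@torch.no_grad()
def iterative_masked_decode(model, seq_len=256, mask_id=8192,
                            num_steps=12, device="cpu"):
    """MaskGIT-style iterative parallel decoding (illustrative sketch).

    `model` is a hypothetical stand-in: it is assumed to map a (1, seq_len)
    grid of discrete image tokens to logits over a codebook that includes
    the special mask token at index `mask_id`.
    """
    # Start from a fully masked token grid.
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(num_steps):
        logits = model(tokens)                  # bidirectional: attends to every position
        logits[..., mask_id] = -float("inf")    # never predict the mask token itself
        probs = logits.softmax(dim=-1)
        pred = probs.argmax(dim=-1)             # greedy prediction at every position
        conf = probs.max(dim=-1).values         # confidence of each prediction

        # Cosine schedule: fraction of positions left masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_remask = int(seq_len * mask_ratio)

        still_masked = tokens == mask_id
        # Commit predictions only at positions that are still masked.
        tokens = torch.where(still_masked, pred, tokens)

        if num_remask > 0:
            # Re-mask the least confident of the freshly predicted positions.
            conf = conf.masked_fill(~still_masked, float("inf"))
            remask_idx = conf.topk(num_remask, largest=False).indices
            tokens[0, remask_idx[0]] = mask_id

    return tokens  # discrete tokens for a VQ decoder to turn into pixels
```

In the published MaskGIT and MUSE recipes, tokens are sampled rather than taken greedily, temperature-scaled noise is added to the confidence scores, and MUSE additionally conditions the transformer on text embeddings; those details are omitted here for brevity.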
Building upon these foundations, we scaled masked generative modeling with modern architectural designs and sampling strategies, culminating in Monetico and Meissonic, both trained from scratch, which perform on par with leading diffusion models such as SDXL while remaining more efficient.
Having verified the effectiveness of this approach, we began to ask a deeper question — one that reaches beyond performance benchmarks: what foundations are required for general-purpose generative intelligence?
Through discussions with researchers at Safe Superintelligence (SSI) Club, University of Illinois Urbana-Champaign (UIUC) and Riot Video Games, we converged on the vision of a visual-centric world model — a generative and interactive system capable of simulating, interacting with, and reasoning about multimodal environments.
We believe that masking is a fundamental abstraction for building such controllable, efficient, and generalizable intelligence.
A similar vision was shared by Stefano Ermon at ICLR 2025, where he described diffusion as a unified paradigm for a multi-modal world model, a message that echoes and strengthens our belief that unified generative modeling is the path toward general-purpose superintelligence.
To pursue this vision, we introduced Muddit and Muddit Plus, unified generative models built upon the visual priors of Meissonic and capable of generating both text and images within a single architecture and paradigm.
We look forward to releasing more models and algorithms in this direction.
We thank our amazing teammates — and you, the reader — for your interest in our work.
Special thanks to Style2Paints Research, which helped shape our taste and research direction in the early days.