Add model card (#1)
- Add model card (c7e43b1a9d90b0c22c5472f863508fe3c3a28f5f)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED
@@ -1,12 +1,13 @@
 ---
+library_name: transformers
 license: mit
 pipeline_tag: image-text-to-text
-library_name: transformers
 tags:
 - text-to-image
 - image-to-image
 - image-to-text
 ---
+
 <h1 align="center">Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction</h1>
 
 Ming-Lite-Uni is an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel **multi-scale learnable tokens** and **multi-scale representation alignment strategy**. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results reveal the strong performance of Ming-Lite-Uni and illustrate the impressively fluid nature of its interactive process. Ming-Lite-Uni is in an alpha stage and will soon be further refined.
@@ -116,4 +117,6 @@ inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)
 result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0, height=512, width=512)[1]
 result.save("result.png")
 ```
-For more advanced usage, such as fine-tuning or generating images, refer to the documentation.
+For more advanced usage, such as fine-tuning or generating images, refer to the documentation.
+
+Link to the code: https://github.com/inclusionAI/Ming
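For readers who want to try the snippet shown in the diff, here is a minimal, hypothetical sketch of how it might be wired together. Only the `my_proc.process(...)`, `model.image_gen_generate(...)`, and `result.save(...)` calls come from the README being diffed; the `load_ming_lite_uni` helper and the example paths/prompt are placeholders, not the repository's actual API (see https://github.com/inclusionAI/Ming for the real setup).

```python
# Hypothetical sketch only: the loading helper below is a placeholder, not the
# repository's real API. Only process(), image_gen_generate(), and save()
# appear in the README snippet shown in the diff above.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder: replace with the actual model/processor construction from
# https://github.com/inclusionAI/Ming (not shown in this diff).
model, my_proc = load_ming_lite_uni(device=device)  # hypothetical helper

image_file = "input.png"  # optional reference image; path is a placeholder
prompt = "a watercolor painting of a lighthouse at sunset"  # example prompt

# The calls and generation parameters below mirror the README snippet.
inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)
result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0,
                                  height=512, width=512)[1]
result.save("result.png")
```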