lbwang committed
Commit 1c92716 · verified · 1 Parent(s): 1bd02b6

Update README.md

Files changed (1)
  1. README.md +10 -2
README.md CHANGED
@@ -1,3 +1,12 @@
+---
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
+tags:
+- text-to-image
+- image-to-image
+- image-to-text
+---
 <h1 align="center">Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction</h1>
 
 Ming-Lite-Uni is an open-source multimodal framework featuring a newly designed unified visual generator and a native multimodal autoregressive model tailored for unifying vision and language. Specifically, this project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, while introducing the novel **multi-scale learnable tokens** and **multi-scale representation alignment strategy**. By leveraging a fixed MLLM and a learnable diffusion model, Ming-Lite-Uni enables native multimodal AR models to perform both text-to-image generation and instruction-based image editing tasks, expanding their capabilities beyond pure visual understanding. Our experimental results reveal the strong performance of Ming-Lite-Uni and illustrate the impressively fluid nature of its interactive process. Ming-Lite-Uni is in the alpha stage and will soon be further refined.
@@ -107,5 +116,4 @@ inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)
 result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0, height=512, width=512)[1]
 result.save("result.png")
 ```
-For more advanced usage, such as fine-tuning or generating images, refer to the documentation.
-
+For more advanced usage, such as fine-tuning or generating images, refer to the documentation.
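For context, the code touched by the second hunk is the tail of the README's image-generation example. The sketch below is a minimal, hypothetical wrapper around the two calls visible in this diff (`my_proc.process` and `model.image_gen_generate`); how `model` and `my_proc` are constructed is not shown in this commit, so they are passed in as-is, and the `image_file=None` text-to-image assumption and the `out_path` parameter are illustrative only.

```python
# Hypothetical convenience wrapper around the calls shown in the README diff.
# `model` and `my_proc` must be created per the repository's own setup steps,
# which are outside this commit; nothing here is an official Ming-Lite-Uni API.

def generate_image(model, my_proc, prompt, image_file=None, device="cuda",
                   out_path="result.png"):
    """Run the README's generation snippet: preprocess, sample, save.

    Passing an existing `image_file` would correspond to the instruction-based
    editing path; leaving it None is assumed (not confirmed by this diff) to
    trigger plain text-to-image generation.
    """
    # Preprocessing call taken verbatim from the hunk header context.
    inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)

    # Sampling call taken verbatim from the diff; index [1] selects the image,
    # as in the README snippet.
    result = model.image_gen_generate(
        inputs, steps=30, seed=42, cfg=5.0, height=512, width=512
    )[1]

    result.save(out_path)
    return result
```

The sampling parameters (`steps=30`, `seed=42`, `cfg=5.0`, `height=512`, `width=512`) are taken directly from the diff context rather than chosen here.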