|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
pipeline_tag: text-to-image |
|
--- |
|
# BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation (Glyph-ByT5-v3) |
|
|
|
<a href="https://arxiv.org/abs/2503.20672"><img src="https://img.shields.io/badge/Paper-arXiv-red?style=for-the-badge" height=22.5></a> |
|
<a href="https://github.com/1230young/bizgen"><img src="https://img.shields.io/badge/Gihub-Code-succees?style=for-the-badge&logo=GitHub" height=22.5></a> |
|
<a href="https://bizgen-msra.github.io"><img src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge" height=22.5></a> |
|
|
|
<table> |
|
<tr> |
|
<td><img src="assets/teaser_info.png" alt="teaser example 0" width="1200"/></td> |
|
</tr> |
|
<tr> |
|
<td><img src="assets/teaser_slide.png" alt="teaser example 1" width="1200"/></td> |
|
</tr> |
|
</table> |
|
|
|
## Abstract |
|
<p> |
|
Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made |
|
significant progress in sentence-level visual text rendering. In this paper, we focus on the more |
|
challenging scenarios of article-level visual text rendering and address a novel task of generating |
|
high-quality business content, including infographics and slides, based on user-provided article-level
|
descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly |
|
longer context lengths and the scarcity of high-quality business content data. |
|
</p> |
|
<p> |
|
In contrast to most previous works that focus on a limited number of sub-regions and sentence-level |
|
prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in |
|
business content is far more challenging. We make two key technical contributions: (i) the construction |
|
of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with
|
ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation |
|
scheme; and (ii) a layout-guided cross attention scheme, which injects tens of region-wise prompts into |
|
a set of cropped region latent spaces according to the ultra-dense layouts, and refines each sub-region

flexibly during inference using a layout-conditional CFG.
|
</p> |
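
The layout-guided cross attention described above can be pictured as cross attention run separately on each cropped sub-region of the latent, with that region's own prompt supplying the keys and values. The sketch below is purely illustrative and is not the BizGen implementation: the class name, bounding-box format, and tensor layout are our assumptions, and the layout-conditional CFG is not shown; see the GitHub repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionCrossAttention(nn.Module):
    """Hypothetical sketch: inject one prompt embedding per layout region by
    running cross attention on the cropped latent of that region only."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(text_dim, channels)
        self.to_v = nn.Linear(text_dim, channels)
        self.to_out = nn.Linear(channels, channels)

    def forward(self, latent, bboxes, prompt_embeds):
        # latent: (C, H, W); bboxes: list of (x0, y0, x1, y1) in latent coords;
        # prompt_embeds: list of (L_i, text_dim) region-wise prompt embeddings.
        out = latent.clone()
        for (x0, y0, x1, y1), emb in zip(bboxes, prompt_embeds):
            crop = latent[:, y0:y1, x0:x1]              # (C, h, w) sub-region
            c, h, w = crop.shape
            q = self.to_q(crop.reshape(c, h * w).T)     # queries from crop pixels
            k, v = self.to_k(emb), self.to_v(emb)       # keys/values from region prompt
            attn = F.scaled_dot_product_attention(
                q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0)
            ).squeeze(0)                                # (h*w, C)
            # Paste the refined crop back; overlaps are simply overwritten here.
            out[:, y0:y1, x0:x1] = self.to_out(attn).T.reshape(c, h, w)
        return out


# Toy usage: a 64x64 latent split into two regions with their prompt embeddings.
layer = RegionCrossAttention(channels=320, text_dim=768)
latent = torch.randn(320, 64, 64)
bboxes = [(0, 0, 32, 64), (32, 0, 64, 64)]
prompt_embeds = [torch.randn(16, 768), torch.randn(24, 768)]
refined = layer(latent, bboxes, prompt_embeds)          # same shape as latent
```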
|
<p> |
|
We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 |
|
on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the |
|
effectiveness of each component. We hope our constructed Infographics-650K and BizEval can encourage |
|
the broader community to advance the progress of business content generation. |
|
</p> |
|
|
|
## Model Description |
|
|
|
The ByT5 model (Glyph-ByT5-v3) is fine-tuned from [Glyph-ByT5-v2](https://arxiv.org/abs/2406.10208), which supports accurate visual text rendering in ten different languages.

The [SPO](https://huggingface.co/SPO-Diffusion-Models) model replaces the original SDXL-base-1.0 checkpoint to improve aesthetics. The [lora/infographic](https://huggingface.co/PYY2001/BizGen/tree/main/lora/infographic) and [lora/slides](https://huggingface.co/PYY2001/BizGen/tree/main/lora/slides) weights are LoRA adapters tuned on our infographics and slides datasets, respectively.

You can follow our [GitHub repository](https://github.com/1230young/bizgen) to set up and run the model.
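
For a rough idea of how the released weights fit together, below is a hedged sketch of loading the SPO SDXL base with the infographic LoRA via `diffusers`. The SPO checkpoint id and the assumption that the LoRA weights load directly with `load_lora_weights` are ours; the full BizGen pipeline (Glyph-ByT5-v3 text encoder, layout-guided cross attention, layout-conditional CFG) still requires the code in the GitHub repository.

```python
# Illustrative only: not the official inference script.
import torch
from diffusers import StableDiffusionXLPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = StableDiffusionXLPipeline.from_pretrained(
    "SPO-Diffusion-Models/SPO-SDXL_4k-p_10ep",  # assumed SPO SDXL checkpoint id
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Attach the infographic LoRA from this repository (assumes the weights are in a
# diffusers-loadable format; see the GitHub repo for the exact loading procedure).
pipe.load_lora_weights("PYY2001/BizGen", subfolder="lora/infographic")
```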
|
|
|
## Citation |
|
If you find our work or codebase useful, please consider giving us a star and citing our work. |
|
``` |
|
@misc{peng2025bizgenadvancingarticlelevelvisual, |
|
title={BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation}, |
|
author={Yuyang Peng and Shishi Xiao and Keming Wu and Qisheng Liao and Bohan Chen and Kevin Lin and Danqing Huang and Ji Li and Yuhui Yuan}, |
|
year={2025}, |
|
eprint={2503.20672}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2503.20672}, |
|
} |
|
``` |
|
``` |
|
@article{liu2024glyphv2, |
|
title={Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering}, |
|
author={Liu, Zeyu and Liang, Weicong and Zhao, Yiming and Chen, Bohan and Li, Ji and Yuan, Yuhui}, |
|
journal={arXiv preprint arXiv:2406.10208}, |
|
year={2024} |
|
} |
|
``` |
|
``` |
|
@article{liu2024glyph, |
|
title={Glyph-byt5: A customized text encoder for accurate visual text rendering}, |
|
author={Liu, Zeyu and Liang, Weicong and Liang, Zhanhao and Luo, Chong and Li, Ji and Huang, Gao and Yuan, Yuhui}, |
|
journal={arXiv preprint arXiv:2403.09622}, |
|
year={2024} |
|
} |
|
``` |