---
license: apache-2.0
language:
- en
pipeline_tag: text-to-image
---
# BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation (Glyph-ByT5-v3)

<a href="https://arxiv.org/abs/2503.20672"><img src="https://img.shields.io/badge/Paper-arXiv-red?style=for-the-badge" height=22.5></a>
<a href="https://github.com/1230young/bizgen"><img src="https://img.shields.io/badge/GitHub-Code-success?style=for-the-badge&logo=GitHub" height=22.5></a>
<a href="https://bizgen-msra.github.io"><img src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge" height=22.5></a>

<table>
  <tr>
    <td><img src="assets/teaser_info.png" alt="teaser example 0" width="1200"/></td>
  </tr>
  <tr>
    <td><img src="assets/teaser_slide.png" alt="teaser example 1" width="1200"/></td>
  </tr>
</table>

## Abstract
<p>
  Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made 
  significant progress in sentence-level visual text rendering. In this paper, we focus on the more 
  challenging scenarios of article-level visual text rendering and address a novel task of generating 
  high-quality business content, including infographics and slides, based on user-provided article-level 
  descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly 
  longer context lengths and the scarcity of high-quality business content data.
</p>
<p>
  In contrast to most previous works that focus on a limited number of sub-regions and sentence-level 
  prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in
  business content is far more challenging. We make two key technical contributions: (i) the construction 
  of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with 
  ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation 
  scheme; and (ii) a layout-guided cross attention scheme, which injects tens of region-wise prompts into 
  a set of cropped region latent spaces according to the ultra-dense layouts, and refines each sub-region 
  flexibly during inference using a layout-conditional CFG.
</p>
<p>
  We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 
  on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the 
  effectiveness of each component. We hope our constructed Infographics-650K and BizEval can encourage 
  the broader community to advance the progress of business content generation.
</p>
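
The abstract mentions refining each sub-region with a layout-conditional CFG. The paper defines the exact formulation; the snippet below is only a minimal sketch of the general idea of region-dependent classifier-free guidance, assuming binary region masks and one guidance scale per region. The function name `region_wise_cfg` and all of its parameters are hypothetical and are not taken from the BizGen codebase.

```python
import numpy as np


def region_wise_cfg(eps_uncond, eps_cond, region_masks, region_scales, base_scale=1.0):
    """Apply classifier-free guidance with a spatially varying scale.

    eps_uncond, eps_cond: noise predictions of shape (C, H, W).
    region_masks: iterable of binary masks of shape (H, W), one per region.
    region_scales: guidance scale for each region (same length as region_masks).
    base_scale: scale used for pixels not covered by any region.
    All names here are illustrative assumptions, not the authors' API.
    """
    # Start from a uniform scale map, then overwrite it region by region.
    scale_map = np.full(eps_cond.shape[-2:], base_scale, dtype=np.float64)
    for mask, scale in zip(region_masks, region_scales):
        scale_map = np.where(np.asarray(mask) > 0, scale, scale_map)
    # Standard CFG update, with the scalar scale replaced by the per-pixel map.
    return eps_uncond + scale_map * (eps_cond - eps_uncond)


# Toy usage: a 2x2 latent with one region covering the top-left pixel.
eps_u = np.zeros((1, 2, 2))
eps_c = np.ones((1, 2, 2))
mask = np.array([[1, 0], [0, 0]])
guided = region_wise_cfg(eps_u, eps_c, [mask], [7.0], base_scale=2.0)
```

Later masks overwrite earlier ones where regions overlap; this is a simplification chosen for the sketch, not a claim about how the model resolves overlapping layers.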

## Model Description

The ByT5 text encoder is fine-tuned from [Glyph-ByT5-v2](https://arxiv.org/abs/2406.10208), which supports accurate visual text rendering in ten languages.
The [SPO](https://huggingface.co/SPO-Diffusion-Models) model replaces the original sdxl-base-1.0 to improve aesthetics. The [lora/infographic](https://huggingface.co/PYY2001/BizGen/tree/main/lora/infographic) and [lora/slides](https://huggingface.co/PYY2001/BizGen/tree/main/lora/slides) weights are tuned on our infographics and slides datasets, respectively.  
Follow the instructions in our [GitHub repository](https://github.com/1230young/bizgen) to organize and run the model.
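
As a starting point, one of the two LoRA folders can be fetched with `huggingface_hub`. This is a minimal sketch, not the official setup: the repository ID and folder names come from the links above, while the helper names and the `checkpoints` target directory are assumptions. The GitHub repository defines the layout the inference scripts actually expect.

```python
from pathlib import Path

REPO_ID = "PYY2001/BizGen"  # repository ID taken from the LoRA links above
TASKS = ("infographic", "slides")  # the two LoRA folders mentioned in this card


def lora_patterns(task: str) -> list[str]:
    """Return the glob patterns that select one LoRA folder of the repository."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    return [f"lora/{task}/*"]


def download_lora(task: str, local_dir: str = "checkpoints") -> Path:
    """Download only the files of the chosen LoRA folder.

    `local_dir` is a hypothetical default; adjust it to the directory
    structure described in the GitHub repository.
    """
    # Imported lazily so lora_patterns() works without the dependency installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=lora_patterns(task),
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_lora("slides")` would place the slides LoRA files under `checkpoints/lora/slides`; the ByT5 and SPO checkpoints still need to be arranged as the GitHub instructions describe.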

## Citation
If you find our work or codebase useful, please consider giving us a star and citing our work.
```
@misc{peng2025bizgenadvancingarticlelevelvisual,
  title={BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation}, 
  author={Yuyang Peng and Shishi Xiao and Keming Wu and Qisheng Liao and Bohan Chen and Kevin Lin and Danqing Huang and Ji Li and Yuhui Yuan},
  year={2025},
  eprint={2503.20672},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.20672}, 
}
```
```
@article{liu2024glyphv2,
  title={Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering},
  author={Liu, Zeyu and Liang, Weicong and Zhao, Yiming and Chen, Bohan and Li, Ji and Yuan, Yuhui},
  journal={arXiv preprint arXiv:2406.10208},
  year={2024}
}
```
```
@article{liu2024glyph,
  title={Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering},
  author={Liu, Zeyu and Liang, Weicong and Liang, Zhanhao and Luo, Chong and Li, Ji and Huang, Gao and Yuan, Yuhui},
  journal={arXiv preprint arXiv:2403.09622},
  year={2024}
}
```