|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
pipeline_tag: text-to-image |
|
--- |
|
# BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation (Glyph-ByT5-v3) |
|
|
|
<a href="https://arxiv.org/abs/2503.20672"><img src="https://img.shields.io/badge/Paper-arXiv-red?style=for-the-badge" height=22.5></a> |
|
<a href="https://github.com/1230young/bizgen"><img src="https://img.shields.io/badge/Gihub-Code-succees?style=for-the-badge&logo=GitHub" height=22.5></a> |
|
<a href="https://bizgen-msra.github.io"><img src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge" height=22.5></a> |
|
|
|
<table> |
|
<tr> |
|
<td><img src="assets/teaser_info.png" alt="teaser example 0" width="1200"/></td> |
|
</tr> |
|
<tr> |
|
<td><img src="assets/teaser_slide.png" alt="teaser example 1" width="1200"/></td> |
|
</tr> |
|
</table> |
|
|
|
## Abstract |
|
<p> |
|
Recently, state-of-the-art text-to-image generation models, such as Flux and Ideogram 2.0, have made |
|
significant progress in sentence-level visual text rendering. In this paper, we focus on the more |
|
challenging scenarios of article-level visual text rendering and address a novel task of generating |
|
high-quality business content, including infographics and slides, based on user-provided article-level
|
descriptive prompts and ultra-dense layouts. The fundamental challenges are twofold: significantly |
|
longer context lengths and the scarcity of high-quality business content data. |
|
</p> |
|
<p> |
|
In contrast to most previous works that focus on a limited number of sub-regions and sentence-level |
|
prompts, ensuring precise adherence to ultra-dense layouts with tens or even hundreds of sub-regions in |
|
business content is far more challenging. We make two key technical contributions: (i) the construction |
|
of a scalable, high-quality business content dataset, i.e., Infographics-650K, equipped with
|
ultra-dense layouts and prompts by implementing a layer-wise retrieval-augmented infographic generation |
|
scheme; and (ii) a layout-guided cross attention scheme, which injects tens of region-wise prompts into |
|
a set of cropped region latent spaces according to the ultra-dense layouts, and refines each sub-region

flexibly during inference using a layout-conditional CFG.
|
</p> |
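
The layout-guided cross attention described above can be pictured as cross attention run separately on each cropped sub-region of the latent, with that region's own prompt supplying the keys and values. The sketch below is purely illustrative and is not the BizGen implementation: the class name, bounding-box format, and tensor layout are our assumptions, and the layout-conditional CFG is not shown; see the GitHub repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionCrossAttention(nn.Module):
    """Hypothetical sketch: inject one prompt embedding per layout region by
    running cross attention on the cropped latent of that region only."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.to_q = nn.Linear(channels, channels)
        self.to_k = nn.Linear(text_dim, channels)
        self.to_v = nn.Linear(text_dim, channels)
        self.to_out = nn.Linear(channels, channels)

    def forward(self, latent, bboxes, prompt_embeds):
        # latent: (C, H, W); bboxes: list of (x0, y0, x1, y1) in latent coords;
        # prompt_embeds: list of (L_i, text_dim) region-wise prompt embeddings.
        out = latent.clone()
        for (x0, y0, x1, y1), emb in zip(bboxes, prompt_embeds):
            crop = latent[:, y0:y1, x0:x1]              # (C, h, w) sub-region
            c, h, w = crop.shape
            q = self.to_q(crop.reshape(c, h * w).T)     # queries from crop pixels
            k, v = self.to_k(emb), self.to_v(emb)       # keys/values from region prompt
            attn = F.scaled_dot_product_attention(
                q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0)
            ).squeeze(0)                                # (h*w, C)
            # Paste the refined crop back; overlaps are simply overwritten here.
            out[:, y0:y1, x0:x1] = self.to_out(attn).T.reshape(c, h, w)
        return out


# Toy usage: a 64x64 latent split into two regions with their prompt embeddings.
layer = RegionCrossAttention(channels=320, text_dim=768)
latent = torch.randn(320, 64, 64)
bboxes = [(0, 0, 32, 64), (32, 0, 64, 64)]
prompt_embeds = [torch.randn(16, 768), torch.randn(24, 768)]
refined = layer(latent, bboxes, prompt_embeds)          # same shape as latent
```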
|
<p> |
|
We demonstrate the strong results of our system compared to previous SOTA systems such as Flux and SD3 |
|
on our BizEval prompt set. Additionally, we conduct thorough ablation experiments to verify the |
|
effectiveness of each component. We hope our constructed Infographics-650K and BizEval can encourage |
|
the broader community to advance the progress of business content generation. |
|
</p> |
|
|
|
## Model Description |
|
|
|
The ByT5 model (Glyph-ByT5-v3) is fine-tuned from [Glyph-ByT5-v2](https://arxiv.org/abs/2406.10208), which supports accurate visual text rendering in ten different languages.

The [SPO](https://huggingface.co/SPO-Diffusion-Models) model replaces the original SDXL-base-1.0 checkpoint to improve aesthetics. The [lora/infographic](https://huggingface.co/PYY2001/BizGen/tree/main/lora/infographic) and [lora/slides](https://huggingface.co/PYY2001/BizGen/tree/main/lora/slides) weights are LoRA adapters tuned on our infographics and slides datasets, respectively.

You can follow our [GitHub repository](https://github.com/1230young/bizgen) to set up and run the model.
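
For a rough idea of how the released weights fit together, below is a hedged sketch of loading the SPO SDXL base with the infographic LoRA via `diffusers`. The SPO checkpoint id and the assumption that the LoRA weights load directly with `load_lora_weights` are ours; the full BizGen pipeline (Glyph-ByT5-v3 text encoder, layout-guided cross attention, layout-conditional CFG) still requires the code in the GitHub repository.

```python
# Illustrative only: not the official inference script.
import torch
from diffusers import StableDiffusionXLPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

pipe = StableDiffusionXLPipeline.from_pretrained(
    "SPO-Diffusion-Models/SPO-SDXL_4k-p_10ep",  # assumed SPO SDXL checkpoint id
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Attach the infographic LoRA from this repository (assumes the weights are in a
# diffusers-loadable format; see the GitHub repo for the exact loading procedure).
pipe.load_lora_weights("PYY2001/BizGen", subfolder="lora/infographic")
```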
|
|
|
## Citation |
|
If you find our work or codebase useful, please consider giving us a star and citing our work. |
|
``` |
|
@misc{peng2025bizgenadvancingarticlelevelvisual, |
|
title={BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation}, |
|
author={Yuyang Peng and Shishi Xiao and Keming Wu and Qisheng Liao and Bohan Chen and Kevin Lin and Danqing Huang and Ji Li and Yuhui Yuan}, |
|
year={2025}, |
|
eprint={2503.20672}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2503.20672}, |
|
} |
|
``` |
|
``` |
|
@article{liu2024glyphv2, |
|
title={Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering}, |
|
author={Liu, Zeyu and Liang, Weicong and Zhao, Yiming and Chen, Bohan and Li, Ji and Yuan, Yuhui}, |
|
journal={arXiv preprint arXiv:2406.10208}, |
|
year={2024} |
|
} |
|
``` |
|
``` |
|
@article{liu2024glyph, |
|
title={Glyph-byt5: A customized text encoder for accurate visual text rendering}, |
|
author={Liu, Zeyu and Liang, Weicong and Liang, Zhanhao and Luo, Chong and Li, Ji and Huang, Gao and Yuan, Yuhui}, |
|
journal={arXiv preprint arXiv:2403.09622}, |
|
year={2024} |
|
} |
|
``` |