# TextDiffuser: Diffusion Models as Text Painters (NeurIPS 2023)
<a href='https://arxiv.org/pdf/2305.10855.pdf'><img src='https://img.shields.io/badge/Arxiv-2305.10855-red'></a>
<a href='https://github.com/microsoft/unilm/tree/master/textdiffuser'><img src='https://img.shields.io/badge/Code-aka.ms/textdiffuser-yellow'></a>
<a href='https://jingyechen.github.io/textdiffuser/'><img src='https://img.shields.io/badge/Project Page-link-green'></a>
[Hugging Face demo](https://huggingface.co/spaces/JingyeChen22/TextDiffuser)
<a href='https://colab.research.google.com/drive/115Qw0l5dhjlTtrbywMWRwhz9IxKE4_Dg?usp=sharing'><img src='https://img.shields.io/badge/GoogleColab-link-purple'></a>
TextDiffuser generates images with visually appealing text that is coherent with backgrounds. It offers flexible, controllable creation of high-quality text images from text prompts alone or together with text template images, and it supports text inpainting to reconstruct incomplete images with text.
<img src="assets/readme_images/introduction.jpg" width="80%">
## :star2: Highlights
* We propose **TextDiffuser**, a two-stage diffusion-based framework for text rendering. It generates accurate and coherent text images from text prompts, optionally with template images, and performs text inpainting to reconstruct incomplete images.
* We release **MARIO-10M**, containing large-scale image-text pairs with OCR annotations, including text recognition, detection, and character-level segmentation masks.
* We construct **MARIO-Eval**, a comprehensive text rendering benchmark containing 5,414 prompts.
* We **release the demo** at [link](https://huggingface.co/spaces/JingyeChen22/TextDiffuser). Feedback is welcome :hugs:.
## :stopwatch: News
- __[2023.09.22]__: :tada: TextDiffuser is accepted to NeurIPS 2023.
- __[2023.06.22]__: Evaluation script is released.
- __[2023.06.15]__: :raised_hands: :raised_hands: :raised_hands: The Demo of TextDiffuser pre-trained with SD v2.1 is released in this [link](https://huggingface.co/spaces/JingyeChen22/TextDiffuser). Meanwhile, GoogleColab is available in this [link](https://colab.research.google.com/drive/115Qw0l5dhjlTtrbywMWRwhz9IxKE4_Dg?usp=sharing).
- __[2023.06.08]__: Training script is released.
- __[2023.06.07]__: MARIO-LAION is released.
- __[2023.06.02]__: :raised_hands: :raised_hands: :raised_hands: Demo is available in this [link](https://huggingface.co/spaces/JingyeChen22/TextDiffuser).
- __[2023.05.26]__: Upload the inference code and checkpoint.
- __[2023.05.19]__: The paper is available at [link](https://arxiv.org/pdf/2305.10855.pdf).
## :hammer_and_wrench: Installation
Clone this repo:
```
git clone github_path_to/TextDiffuser
cd TextDiffuser
```
Create a new environment and install the packages as follows:
```
conda create -n textdiffuser python=3.8
conda activate textdiffuser
pip install -r requirements.txt
```
Please also install the torch and torchvision versions that match your system and CUDA version (refer to this [link](https://download.pytorch.org/whl/torch_stable.html)).
Install Hugging Face Diffusers and replace some files:
```
git clone https://github.com/JingyeChen/diffusers
cp ./assets/files/scheduling_ddpm.py ./diffusers/src/diffusers/schedulers/scheduling_ddpm.py
cp ./assets/files/unet_2d_condition.py ./diffusers/src/diffusers/models/unet_2d_condition.py
cp ./assets/files/modeling_utils.py ./diffusers/src/diffusers/models/modeling_utils.py
cd diffusers && pip install -e .
```
In addition, a font file is needed for layout generation. Please put your font in ```assets/font/```. We recommend using ```Arial.ttf```.
## :floppy_disk: Checkpoint
The checkpoints are in this [link](https://layoutlm.blob.core.windows.net/textdiffuser/textdiffuser-ckpt-new.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) or [HFLink](https://huggingface.co/datasets/JingyeChen22/TextDiffuser/resolve/main/textdiffuser-ckpt-new.zip) (3.2GB). Please download it and unzip it. The file structures should be as follows:
```
textdiffuser
├── textdiffuser-ckpt
│   ├── diffusion_backbone/             # for diffusion backbone
│   ├── character_aware_loss_unet.pth   # for character-aware loss
│   ├── layout_transformer.pth          # for layout transformer
│   └── text_segmenter.pth              # for character-level segmenter
└── README.md
```
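After unzipping, a quick check can confirm that the layout matches the listing above. The following is a minimal sketch, not part of the released code; the helper name `missing_checkpoint_entries` is hypothetical, and it only tests for the four entries shown in the listing.

```python
from pathlib import Path

# Entries taken from the directory listing above; a trailing "/" marks a directory.
EXPECTED_ENTRIES = (
    "diffusion_backbone/",
    "character_aware_loss_unet.pth",
    "layout_transformer.pth",
    "text_segmenter.pth",
)


def missing_checkpoint_entries(ckpt_dir):
    """Return the expected entries that are absent from ckpt_dir (hypothetical helper)."""
    root = Path(ckpt_dir)
    missing = []
    for entry in EXPECTED_ENTRIES:
        path = root / entry.rstrip("/")
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing


if __name__ == "__main__":
    # "textdiffuser-ckpt" is the folder name shown in the listing above.
    missing = missing_checkpoint_entries("textdiffuser-ckpt")
    print("missing entries:", missing or "none")
```

An empty result means all four entries were found; it does not verify the contents of the files.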
## :books: Dataset
<img src="assets/readme_images/laion-ocr.jpg" width="80%">
**MARIO-LAION**'s meta information is at this [link](https://layoutlm.blob.core.windows.net/textdiffuser/laion-ocr-new.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) or [onedrive](https://mail2sysueducn-my.sharepoint.com/personal/huangyp28_mail2_sysu_edu_cn/_layouts/15/onedrive.aspx?ct=1686245253173&or=Teams%2DHL&ga=1&LOF=1&id=%2Fpersonal%2Fhuangyp28%5Fmail2%5Fsysu%5Fedu%5Fcn%2FDocuments%2Frelease%2Ftextdiffuser%2Fdata) (40GB), containing 9,194,613 samples. Please download it and unzip it by running ```python data/maion-laion-unzip.py```. The file structures of each folder should be as follows and ```data/maion-laion-example``` is provided for reference. We also provide ```data/visualize_charseg.ipynb``` to visualize the character-level segmentation mask.
```
├── 28330/
│   ├── 283305839/
│   │   ├── caption.txt     # caption of the image
│   │   ├── charseg.npy     # character-level segmentation mask
│   │   ├── info.json       # more meta information given by LAION, such as original height and width
│   │   └── ocr.txt         # OCR detection and recognition results
```
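To illustrate how one sample folder could be read, here is a minimal sketch that uses only the standard library. The function name `load_sample_metadata` is hypothetical, and the line format of ```ocr.txt``` is not described here, so the sketch keeps those lines as raw strings and only returns the path of ```charseg.npy```.

```python
import json
from pathlib import Path


def load_sample_metadata(sample_dir):
    """Read the text-based files of one sample folder (hypothetical helper)."""
    folder = Path(sample_dir)
    caption = (folder / "caption.txt").read_text(encoding="utf-8").strip()
    info = json.loads((folder / "info.json").read_text(encoding="utf-8"))
    # The line format of ocr.txt is an unknown here, so lines are kept raw.
    ocr_lines = (folder / "ocr.txt").read_text(encoding="utf-8").splitlines()
    return {
        "caption": caption,
        "info": info,
        "ocr_lines": ocr_lines,
        # The mask can be loaded separately, e.g. with numpy.load, if NumPy is installed.
        "charseg_path": folder / "charseg.npy",
    }
```

For example, `load_sample_metadata("28330/283305839")` would return a dictionary with the caption, the parsed ```info.json```, and the raw OCR lines for that folder.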
The URLs of the images are at this [link](https://layoutlm.blob.core.windows.net/textdiffuser/mario_laion_image_url.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) or [onedrive](https://mail2sysueducn-my.sharepoint.com/personal/huangyp28_mail2_sysu_edu_cn/_layouts/15/onedrive.aspx?ct=1686245253173&or=Teams%2DHL&ga=1&LOF=1&id=%2Fpersonal%2Fhuangyp28%5Fmail2%5Fsysu%5Fedu%5Fcn%2FDocuments%2Frelease%2Ftextdiffuser%2Fdata) (794.6MB). The file structure is as follows:
```
├── maion_laion_image_url/
│   ├── mario-laion-url.txt          # URLs for downloading with img2dataset
│   ├── mario-laion-index-url.txt    # URLs and indices for each image
│   └── mario-laion-test-index.txt   # all indices for the test dataset
```
Please install img2dataset with ```pip install img2dataset```, and download the images using the following command:
```
img2dataset --url_list=url.txt --output_folder=laion_ocr --thread_count=64 --resize_mode=no
```
After downloading, you need to resize each image to ```512x512```. Please follow ```mario-laion-index-url.txt``` to move each image to the corresponding folder. Images with indices in ```mario-laion-test-index.txt``` are used for testing. Please note that some links may be <span style="color:red">**invalid**</span> because the owners have removed the images from their websites.
## :steam_locomotive: Train
Please use ```accelerate config``` to configure your acceleration policy first, then modify output_dir, dataset_path, and train_dataset_index_file in ```train.sh```. The train_dataset_index_file should be a .txt file in which each line is the index of one training sample.
```txt
06269_062690093
27197_271975251
27197_271978467
...
```
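One possible way to produce such an index file is sketched below. It assumes that an index has the form `<parent folder>_<sample folder>` (so the folder `28330/283305839/` would map to `28330_283305839`), which is inferred from the examples above rather than stated by the authors. The function names are hypothetical.

```python
from pathlib import Path


def build_train_index(dataset_root, test_indices=()):
    """Collect '<parent>_<sample>' indices, skipping those reserved for testing.

    The index naming scheme is an assumption inferred from the example lines.
    """
    excluded = set(test_indices)
    indices = []
    for parent in sorted(Path(dataset_root).iterdir()):
        if not parent.is_dir():
            continue
        for sample in sorted(parent.iterdir()):
            index = f"{parent.name}_{sample.name}"
            if sample.is_dir() and index not in excluded:
                indices.append(index)
    return indices


def write_index_file(indices, out_path):
    """Write one index per line."""
    Path(out_path).write_text("\n".join(indices) + "\n", encoding="utf-8")
```

Passing the indices listed in ```mario-laion-test-index.txt``` as `test_indices` keeps the test samples out of the training list.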
Then you can train TextDiffuser with the following command:
```bash
accelerate launch train.py \
--train_batch_size=24 \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--mixed_precision="fp16" \
--num_train_epochs=2 \
--learning_rate=1e-5 \
--max_grad_norm=1 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--output_dir="experiment_name" \
--enable_xformers_memory_efficient_attention \
--dataloader_num_workers=4 \
--character_aware_loss_lambda=0.01 \
--resume_from_checkpoint="latest" \
--drop_caption \
--mask_all_ratio=0.5 \
--segmentation_mask_aug \
--dataset_path=/home/path/to/laion-ocr-unzip \
--train_dataset_index_file=/path/to/index_file.txt \
--vis_num=8
```
If you encounter an "out-of-memory" error, please consider reducing the batch size appropriately.
## :firecracker: Inference
TextDiffuser supports three tasks: text-to-image, text-to-image-with-template, and text-inpainting.
### Text-to-Image
This task is designed to generate images based on given prompts. Users are required to enclose the keywords to be drawn with single quotation marks.
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
--mode="text-to-image" \
--resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
--prompt="A sign that says 'Hello'" \
--output_dir="./output" \
--vis_num=4
```
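The quoting convention can be illustrated with a small regular expression. This is only a sketch of the convention, not the parsing code used by ```inference.py```; the function name `extract_keywords` is hypothetical.

```python
import re

# Matches the text between a pair of single quotation marks.
QUOTED = re.compile(r"'([^']+)'")


def extract_keywords(prompt):
    """Return the keywords enclosed in single quotes (hypothetical helper)."""
    return QUOTED.findall(prompt)


print(extract_keywords("A sign that says 'Hello'"))  # ['Hello']
```

Note that a pattern this simple would be confused by apostrophes inside a prompt (for example in contractions), so such prompts would need extra handling.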
### Text-to-Image-with-Template
This task aims to generate images based on given prompts and template images (can be printed, handwritten, or scene text images). A pre-trained character-level segmentation model is used to extract layout information from the template image.
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
--mode="text-to-image-with-template" \
--resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
--prompt="a poster of monkey music festival" \
--template_image="assets/examples/text-to-image-with-template/case2.jpg" \
--output_dir="./output" \
--vis_num=4
```
### Text-Inpainting
This task aims to modify a given image in an inpainting manner. The provided text mask image should contain the inpainting region and the text to be drawn within the region.
```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
--mode="text-inpainting" \
--resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
--prompt="a boy draws good morning on a board" \
--original_image="assets/examples/text-inpainting/case2.jpg" \
--text_mask="assets/examples/text-inpainting/case2_mask.jpg" \
--output_dir="./output" \
--vis_num=4
```
## :chart_with_upwards_trend: Evaluation
For evaluation, please download [MARIOEval](https://layoutlm.blob.core.windows.net/textdiffuser/MARIOEval.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D). The generation results of each method are available at this [link](https://layoutlm.blob.core.windows.net/textdiffuser/marioeval_generation.zip?sv=2022-11-02&ss=b&srt=o&sp=r&se=2033-06-08T16:48:15Z&st=2023-06-08T08:48:15Z&spr=https&sig=a9VXrihTzbWyVfaIDlIT1Z0FoR1073VB0RLQUMuudD4%3D) for reference. MARIOEval contains 5,414 prompts for evaluation, including the following subsets:
| Subset | #Sample | Subset | #Sample |
| --- | ---: | --- | ---: |
| LAIONEval4000 | 4,000 | ChineseDrawText | 175 |
| TMDBEval500 | 500 | DrawBenchText | 21 |
| OpenLibrary500 | 500 | DrawTextCreative | 218 |
The structure of each folder is as follows:
```bash
├── LAIONEval4000/
│   ├── images/                        # ground truth images
│   ├── render/                        # layouts of keywords generated by Layout Transformer
│   ├── LAIONEval4000.txt              # prompts with keywords enclosed in quotes
│   └── LAIONEval4000_wo_quote.txt     # prompts without quotes
```
Please note that the ground truth images are only available for the LAIONEval4000, TMDBEval500, and OpenLibrary500 subsets. The render images are used for evaluating ControlNet. We manually enclosed keywords in quotes according to the OCR results. Please refer to the ```_wo_quote.txt``` version for the original prompts.
To evaluate TextDiffuser, please use the following command for sampling:
```bash
CUDA_VISIBLE_DEVICES=0 python evaluate.py \
--mode="text-to-image" \
--resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
--prompt_list="/path/to/MARIOEval/TMDBEval500/TMDBEval500.txt" \
--output_dir="/path/to/output_dir" \
--vis_num=4
```
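To sample for every subset, the command above can be generated in a loop. The sketch below only builds and prints the commands. It assumes that each subset stores its prompts at `<subset>/<subset>.txt`, as in the two examples shown in this section, and the per-subset output folder is an illustrative choice. The function name `build_eval_command` is hypothetical.

```python
import shlex

# Subset names taken from the table of MARIOEval subsets.
SUBSETS = [
    "LAIONEval4000", "TMDBEval500", "OpenLibrary500",
    "ChineseDrawText", "DrawBenchText", "DrawTextCreative",
]


def build_eval_command(eval_root, subset, output_root):
    """Build the evaluate.py argument list for one subset (hypothetical helper)."""
    return [
        "python", "evaluate.py",
        "--mode=text-to-image",
        "--resume_from_checkpoint=textdiffuser-ckpt/diffusion_backbone",
        # Assumed layout: <eval_root>/<subset>/<subset>.txt
        f"--prompt_list={eval_root}/{subset}/{subset}.txt",
        # One output folder per subset is an illustrative choice.
        f"--output_dir={output_root}/{subset}",
        "--vis_num=4",
    ]


if __name__ == "__main__":
    for subset in SUBSETS:
        command = build_eval_command("/path/to/MARIOEval", subset, "/path/to/output_dir")
        print(shlex.join(command))
```

The printed lines can be pasted into a shell, prefixed with ```CUDA_VISIBLE_DEVICES=0``` as in the command above.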
To sample from other baseline methods (e.g., Stable Diffusion, ControlNet, and DeepFloyd), scripts are provided in the ```./eval``` folder. We also provide scripts for calculating FID, CLIPScore, and the OCR metrics.
| Metrics | Stable Diffusion | ControlNet | DeepFloyd | TextDiffuser (Ours) |
| :---: | :---: | :---: | :---: | :---: |
| FID↓ | 51.295 | 51.485 | **34.902** | 38.758 |
| CLIPScore↑ | 0.3015 | 0.3424 | 0.3267 | **0.3436** |
| OCR-Accuracy↑ | 0.0003 | 0.2390 | 0.0262 | **0.5609** |
| OCR-Precision↑ | 0.0173 | 0.5211 | 0.1450 | **0.7846** |
| OCR-Recall↑ | 0.0280 | 0.6707 | 0.2245 | **0.7802** |
| OCR-Fmeasure↑ | 0.0214 | 0.5865 | 0.1762 | **0.7824** |
| *OCR-Accuracy↑ | 0.0178 | 0.2705 | 0.0457 | **0.5712** |
| *OCR-Precision↑ | 0.0192 | 0.5391 | 0.1738 | **0.7795** |
| *OCR-Recall↑ | 0.0260 | 0.6438 | 0.2235 | **0.7498** |
| *OCR-Fmeasure↑ | 0.0221 | 0.5868 | 0.1955 | **0.7643** |
Please note that OCR metrics prefixed with "\*" are computed with the open-source [MaskTextSpotterV3](https://github.com/MhLiao/MaskTextSpotterV3), and those without "\*" are computed with the [Microsoft OCR API](https://azure.microsoft.com/en-us/updates/computer-vision-v3-preview-6/). The table compares text-to-image performance on MARIO-Eval with existing methods. TextDiffuser performs the best regarding CLIPScore and OCR evaluation while achieving comparable performance on FID.
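As a sanity check on the table, the reported OCR-Fmeasure values are consistent with the harmonic mean of the reported precision and recall. The sketch below assumes that definition; it is not taken from the evaluation scripts, and the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above.
print(round(f_measure(0.7846, 0.7802), 4))  # TextDiffuser: 0.7824
print(round(f_measure(0.5211, 0.6707), 4))  # ControlNet: 0.5865
```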
<img src="assets/readme_images/userstudy.jpg" width="90%">
User studies for whole-image generation and part-image generation tasks. (a) For whole-image generation, our method clearly outperforms others in both aspects of text rendering quality and image-text matching. (b) For part-image generation, our method receives high scores from human evaluators in these two aspects.
## :joystick: Demo
TextDiffuser has been deployed on [Hugging Face](https://huggingface.co/spaces/JingyeChen22/TextDiffuser). If you have advanced GPUs, you may deploy the demo locally as follows:
```bash
CUDA_VISIBLE_DEVICES=0 python gradio_app.py
```
Then you can open the demo in a local browser:
<img src="assets/readme_images/demo.jpg" width="90%">
## :framed_picture: Gallery
### Text-to-Image
<img src="assets/readme_images/gallery_text-to-image.jpg" width="80%">
### Text-to-Image-with-Template
<img src="assets/readme_images/gallery_text-to-image-with-template.jpg" width="80%">
### Text-Inpainting
<img src="assets/readme_images/gallery_text-inpainting.jpg" width="80%">
## :love_letter: Acknowledgement
We sincerely thank the following projects: [Hugging Face Diffusers](https://github.com/huggingface/diffusers), [LAION](https://laion.ai/laion-400-open-dataset/), [DB](https://github.com/MhLiao/DB), [PARSeq](https://github.com/baudm/parseq), [img2dataset](https://github.com/rom1504/img2dataset).
Also, special thanks to the following open-source diffusion projects and available demos: [DALLE](https://openai.com/product/dall-e-2), [Stable Diffusion](https://github.com/CompVis/stable-diffusion), [Stable Diffusion XL](https://dreamstudio.ai/generate), [Midjourney](https://www.midjourney.com/home/?callbackUrl=%2Fapp%2F), [ControlNet](https://github.com/lllyasviel/ControlNet), [DeepFloyd](https://github.com/deep-floyd/IF).
## :exclamation: Disclaimer
Please note that the code is intended for academic and research purposes **ONLY**. Any use of the code for generating inappropriate content is **strictly prohibited**. The responsibility for any misuse or inappropriate use of the code lies solely with the users who generated such content, and this code shall not be held liable for any such use.
## :envelope: Contact
For help or issues using TextDiffuser, please email Jingye Chen ([email protected]), Yupan Huang ([email protected]) or submit a GitHub issue.
For other communications related to TextDiffuser, please contact Lei Cui ([email protected]) or Furu Wei ([email protected]).
## :herb: Citation
If you find this code useful in your research, please consider citing:
```
@article{chen2023textdiffuser,
title={TextDiffuser: Diffusion Models as Text Painters},
author={Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu},
journal={arXiv preprint arXiv:2305.10855},
year={2023}
}
```