Stable Diffusion for Khmer Text Generation (KHM-53)

This repository hosts a fine-tuned Stable Diffusion model customized for Khmer text image generation. The model aims to generate high-quality synthetic data, particularly for applications such as Khmer OCR, document layout analysis, and AI-based Khmer text systems.


📌 Problem

Despite rapid advances in AI and generative models, Khmer remains a low-resource language: it lacks the high-quality datasets and models needed for tasks such as text-to-image generation, OCR, and scene text analysis. Compared with languages like Thai or Vietnamese, far less Khmer data is publicly available, especially in image form, making it difficult to develop robust AI systems.


🎯 Objective

The primary objectives of this project were to:

  • Develop a text-to-image generation pipeline capable of generating synthetic Khmer word images from text prompts.
  • Support the development of Khmer OCR and other Khmer language-related AI tools by providing training-grade synthetic data.
  • Evaluate and compare generative models, including DCGAN, UNet2D, and Stable Diffusion, to determine the best fit for Khmer.

🧭 Goal and Scope

Goal: To build a complete, scalable, and publicly accessible pipeline that can transform Khmer text into realistic images for downstream use in OCR and machine learning.

Scope:

  • Covers Khmer character-level and word-level image generation.
  • Implements and compares several generative architectures (GAN, Diffusion).
  • Includes full pipeline: data preprocessing, model training, fine-tuning, evaluation, and deployment.
  • Does not yet include a production-ready API, though a Telegram chatbot integration and public hosting on the Hugging Face Hub are provided.
  • Future extensions are planned for longer texts, handwritten data, and a web GUI.

🚀 Project Summary

This model was developed as part of a 4-month internship at Factory.io under the Cambodia Academy of Digital Technology (CADT), with the main objective of generating synthetic images of Khmer script from text prompts. The final output is an end-to-end text-to-image generation pipeline, fine-tuned on Khmer word images using the Stable Diffusion architecture.

🛠️ Process Overview

1. Literature Review & Experimentation

  • Compared DCGAN, UNet2D, and Stable Diffusion.
  • Stable Diffusion with a RoBERTa text encoder and VAE decoder showed the best qualitative and quantitative results.
  • Used techniques such as the UNet2DConditionModel, text embeddings, and diffusion denoising (a minimal sketch follows this list).
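
Below is a minimal, illustrative sketch of conditional diffusion denoising with the diffusers API; the latent shape, timestep counts, and embedding width are assumptions for illustration, not this project's exact configuration.

# Minimal sketch of conditional diffusion denoising (assumed configuration).
import torch
from diffusers import UNet2DConditionModel, DDPMScheduler

unet = UNet2DConditionModel(cross_attention_dim=768)  # must match text-encoder width
scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)  # fewer denoising steps at inference time

text_emb = torch.randn(1, 16, 768)  # stand-in for RoBERTa token embeddings
latents = torch.randn(1, 4, 8, 16)  # start from Gaussian noise in latent space

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample  # remove predicted noise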

2. Dataset

3. Preprocessing

  • Filtering based on image size.
  • Conversion from RGB to grayscale to fit the limited GPU memory (12GB VRAM).
  • Normalization, resizing, and rescaling (see the sketch after this list).
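
A minimal sketch of this preprocessing with torchvision follows; the size threshold and target resolution are assumed values for illustration.

# Sketch of the preprocessing steps above; thresholds are assumptions.
from PIL import Image
from torchvision import transforms

MIN_WIDTH, MIN_HEIGHT = 32, 16  # assumed minimum size for the filtering step

def keep_image(img: Image.Image) -> bool:
    """Filter out images too small to carry legible Khmer text."""
    width, height = img.size
    return width >= MIN_WIDTH and height >= MIN_HEIGHT

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # RGB -> grayscale (saves VRAM)
    transforms.Resize((64, 128)),                 # (height, width)
    transforms.ToTensor(),                        # rescales pixels to [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),  # normalizes to [-1, 1]
])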

4. Model Architecture

  • Text encoder: RoBERTa
  • Latent generator: UNet2DConditionModel
  • Decoder: Variational Autoencoder (VAE)
  • The pipeline operates in latent space for memory efficiency and sharp image generation (see the sketch after this list).
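
The sketch below illustrates this text → latent → pixel flow; loading the components through the published pipeline is an assumption about the checkpoint layout.

# Sketch of the flow through the three components (illustrative).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("channudam/stable-diffusion-khm-53")

# RoBERTa turns the prompt into conditioning embeddings for the UNet.
tokens = pipe.tokenizer("បាត់ដំបង", return_tensors="pt")
text_emb = pipe.text_encoder(tokens.input_ids).last_hidden_state

# 128x64 RGB images compress to 4x8x16 latents (8x VAE downsampling),
# which is what keeps training within 12GB of VRAM.
latents = torch.randn(1, 4, 8, 16)  # in practice, denoised by the UNet
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample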

5. Training & Fine-Tuning

  • Trained with the AdamW optimizer and MSE loss.
  • Fine-tuned the text encoder, UNet2DConditionModel, and VAE simultaneously (a training-step sketch follows this list).
  • Evaluated on both unconditional and conditional generation tasks.
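
A hedged sketch of one such training step is shown below, reusing `pipe` from the previous sketch; the learning rate and batch handling are assumptions.

# One fine-tuning step: add noise to VAE latents, predict it with the UNet,
# and regress with MSE loss against the true noise.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
params = (list(pipe.unet.parameters())
          + list(pipe.vae.parameters())
          + list(pipe.text_encoder.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-5)  # assumed learning rate

def train_step(pixel_values, input_ids):
    latents = pipe.vae.encode(pixel_values).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    text_emb = pipe.text_encoder(input_ids).last_hidden_state
    noise_pred = pipe.unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(noise_pred, noise)  # MSE between predicted and true noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()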

6. Deployment

  • Final models are hosted here on Hugging Face (see the publishing sketch after this list) for:
    • Community sharing
    • Version control
    • Future fine-tuning
    • Public reproducibility
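
As an illustration of the hosting step, diffusers pipelines can be pushed back to the Hub after further fine-tuning; the target repo id below is hypothetical.

# Hypothetical example of publishing a fine-tuned pipeline to the Hub.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("channudam/stable-diffusion-khm-53")
# ... fine-tune the pipeline components ...
pipe.push_to_hub("your-username/stable-diffusion-khm-53-finetuned")  # hypothetical id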

Hugging Face Collection:
🔗 https://huggingface.co/collections/channudam/textimagegeneration-khm-35-67d916c2505635db1ba8fc3c

📈 Results

Model Type             Output Quality   Params         Image Size
DCGAN                  Low              239K           64×64, 1-channel
UNet2D                 Good             106M           64×64, 1-channel
UNet2DConditionModel   Very Good        147M           64×32, 1-channel
Stable Diffusion       Excellent        881M (total)   128×64, RGB

โœ”๏ธ Stable Diffusion outperformed all other methods, producing sharper, more accurate Khmer text images.
โœ”๏ธ Works well under 12GB GPU constraints using compressed latent representation.

🧠 Key Challenges

  • Limited public Khmer datasets.
  • Khmer script complexity (stacked diacritics, varied fonts).
  • Hardware constraints (12GB VRAM).
  • Evaluation had to rely on manual visual inspection.

🔮 Future Work

  • Expand dataset to include longer texts and handwritten Khmer.
  • Integrate speech and handwriting modules for multimodal Khmer AI.
  • Develop a web-based GUI for real-time Khmer text-to-image generation.
  • Optimize for edge devices (mobile, low-power GPUs).
  • Explore larger transformer-based encoders for better text understanding.

Fine-Tuning

This is a base model and is intended to be fine-tuned for specific tasks or datasets. The model was trained on 128×64 RGB images, but the resolution can be adjusted during fine-tuning to match your desired output size.

For best results, it is recommended to fine-tune the following three main components rather than just the core UNet model (a loading sketch follows this list):

  • Text Encoder – [RobertaModel]
  • Variational Autoencoder – [AutoencoderKL]
  • Image Generation Model – [UNet2DConditionModel]
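
A minimal sketch of unfreezing all three components for fine-tuning, assuming they are accessed through the loaded pipeline:

# Load the pipeline and enable gradients on all three components.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("channudam/stable-diffusion-khm-53")
for module in (pipe.text_encoder, pipe.vae, pipe.unet):
    module.requires_grad_(True)  # fine-tune rather than freeze
    module.train()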

Usage (with GPU)

import matplotlib.pyplot as plt
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned pipeline in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "channudam/stable-diffusion-khm-53",
    torch_dtype=torch.float16,
).to("cuda")

# Generate an image for a Khmer word (here: បាត់ដំបង, "Battambang").
image = pipe("បាត់ដំបង", guidance_scale=2).images[0]
plt.imshow(image)
plt.show()
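
Since the model targets synthetic OCR data, here is a hedged follow-up sketch for batch generation; the word list and file naming are illustrative.

# Generate and save word images as synthetic OCR training data.
words = ["បាត់ដំបង", "ភ្នំពេញ"]  # illustrative word list
for i, word in enumerate(words):
    image = pipe(word, guidance_scale=2).images[0]
    image.save(f"khm_word_{i:05d}.png")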

Generated Khmer Text Image


Made with ❤️ by Channudam Ray | Factory.io & CADT, 2025
