Stable Diffusion for Khmer Text Generation (KHM-53)

This repository hosts a fine-tuned Stable Diffusion model customized for Khmer text image generation. The model aims to generate high-quality synthetic data, particularly for applications such as Khmer OCR, document layout analysis, and AI-based Khmer text systems.


📌 Problem

Despite rapid advances in AI and generative models, Khmer remains a low-resource language: it lacks the high-quality datasets and models needed for tasks such as text-to-image generation, OCR, and scene text analysis. Compared with languages like Thai or Vietnamese, far less Khmer data is publicly available, especially in image form, making it difficult to develop robust AI systems.


🎯 Objective

The primary objectives of this project were to:

  • Develop a text-to-image generation pipeline capable of generating synthetic Khmer word images from text prompts.
  • Support the development of Khmer OCR and other Khmer language-related AI tools by providing training-grade synthetic data.
  • Evaluate and compare generative models, including DCGAN, UNet2D, and Stable Diffusion, to determine the best fit for Khmer.

🧭 Goal and Scope

Goal: To build a complete, scalable, and publicly accessible pipeline that can transform Khmer text into realistic images for downstream use in OCR and machine learning.

Scope:

  • Covers Khmer character-level and word-level image generation.
  • Implements and compares several generative architectures (GAN, Diffusion).
  • Includes full pipeline: data preprocessing, model training, fine-tuning, evaluation, and deployment.
  • Does not yet include a production-ready API, though a Telegram chatbot integration and public hosting on the Hugging Face Hub are provided.
  • Future extensions are planned for longer texts, handwritten data, and a web GUI.

🚀 Project Summary

This model was developed as part of a 4-month internship at Factory.io under the Cambodia Academy of Digital Technology (CADT), with the main objective of generating synthetic images of Khmer script from text prompts. The final output is an end-to-end text-to-image generation pipeline, fine-tuned on Khmer word images using the Stable Diffusion architecture.

🛠️ Process Overview

1. Literature Review & Experimentation

  • Compared DCGAN, UNet2D, and Stable Diffusion.
  • Stable Diffusion with a RoBERTa text encoder and VAE decoder showed the best qualitative and quantitative results.
  • Used techniques such as the UNet2DConditionModel, text embeddings, and diffusion denoising (a minimal sketch follows this list).
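
Below is a minimal, illustrative sketch of conditional diffusion denoising with the diffusers API; the latent shape, timestep counts, and embedding width are assumptions for illustration, not this project's exact configuration.

# Minimal sketch of conditional diffusion denoising (assumed configuration).
import torch
from diffusers import UNet2DConditionModel, DDPMScheduler

unet = UNet2DConditionModel(cross_attention_dim=768)  # must match text-encoder width
scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)  # fewer denoising steps at inference time

text_emb = torch.randn(1, 16, 768)  # stand-in for RoBERTa token embeddings
latents = torch.randn(1, 4, 8, 16)  # start from Gaussian noise in latent space

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample  # remove predicted noise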

2. Dataset

3. Preprocessing

  • Filtering based on image size.
  • Conversion from RGB to grayscale to fit the limited GPU memory (12GB VRAM).
  • Normalization, resizing, and rescaling (see the sketch after this list).
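
A minimal sketch of this preprocessing with torchvision follows; the size threshold and target resolution are assumed values for illustration.

# Sketch of the preprocessing steps above; thresholds are assumptions.
from PIL import Image
from torchvision import transforms

MIN_WIDTH, MIN_HEIGHT = 32, 16  # assumed minimum size for the filtering step

def keep_image(img: Image.Image) -> bool:
    """Filter out images too small to carry legible Khmer text."""
    width, height = img.size
    return width >= MIN_WIDTH and height >= MIN_HEIGHT

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # RGB -> grayscale (saves VRAM)
    transforms.Resize((64, 128)),                 # (height, width)
    transforms.ToTensor(),                        # rescales pixels to [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),  # normalizes to [-1, 1]
])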

4. Model Architecture

  • Text encoder: RoBERTa
  • Latent generator: UNet2DConditionModel
  • Decoder: Variational Autoencoder (VAE)
  • The pipeline operates in latent space for memory efficiency and sharp image generation (see the sketch after this list).
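
The sketch below illustrates this text → latent → pixel flow; loading the components through the published pipeline is an assumption about the checkpoint layout.

# Sketch of the flow through the three components (illustrative).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("channudam/stable-diffusion-khm-53")

# RoBERTa turns the prompt into conditioning embeddings for the UNet.
tokens = pipe.tokenizer("បាត់ដំបង", return_tensors="pt")
text_emb = pipe.text_encoder(tokens.input_ids).last_hidden_state

# 128x64 RGB images compress to 4x8x16 latents (8x VAE downsampling),
# which is what keeps training within 12GB of VRAM.
latents = torch.randn(1, 4, 8, 16)  # in practice, denoised by the UNet
image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample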

5. Training & Fine-Tuning

  • Trained with the AdamW optimizer and MSE loss.
  • Fine-tuned the text encoder, UNet2DConditionModel, and VAE simultaneously (a training-step sketch follows this list).
  • Evaluated on both unconditional and conditional generation tasks.
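
A hedged sketch of one such training step is shown below, reusing `pipe` from the previous sketch; the learning rate and batch handling are assumptions.

# One fine-tuning step: add noise to VAE latents, predict it with the UNet,
# and regress with MSE loss against the true noise.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
params = (list(pipe.unet.parameters())
          + list(pipe.vae.parameters())
          + list(pipe.text_encoder.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-5)  # assumed learning rate

def train_step(pixel_values, input_ids):
    latents = pipe.vae.encode(pixel_values).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    text_emb = pipe.text_encoder(input_ids).last_hidden_state
    noise_pred = pipe.unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(noise_pred, noise)  # MSE between predicted and true noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()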

6. Deployment

  • Final models are hosted here on Hugging Face (see the publishing sketch after this list) for:
    • Community sharing
    • Version control
    • Future fine-tuning
    • Public reproducibility
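
As an illustration of the hosting step, diffusers pipelines can be pushed back to the Hub after further fine-tuning; the target repo id below is hypothetical.

# Hypothetical example of publishing a fine-tuned pipeline to the Hub.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("channudam/stable-diffusion-khm-53")
# ... fine-tune the pipeline components ...
pipe.push_to_hub("your-username/stable-diffusion-khm-53-finetuned")  # hypothetical id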

Hugging Face Collection:
🔗 https://huggingface.co/collections/channudam/textimagegeneration-khm-35-67d916c2505635db1ba8fc3c

📈 Results

Model Type             Output Quality   Params         Image Size
DCGAN                  Low              239K           64×64, 1-channel
UNet2D                 Good             106M           64×64, 1-channel
UNet2DConditionModel   Very Good        147M           64×32, 1-channel
Stable Diffusion       Excellent        881M (total)   128×64, RGB

โœ”๏ธ Stable Diffusion outperformed all other methods, producing sharper, more accurate Khmer text images.
โœ”๏ธ Works well under 12GB GPU constraints using compressed latent representation.

🧠 Key Challenges

  • Limited public Khmer datasets.
  • Khmer script complexity (stacked diacritics, varied fonts).
  • Hardware constraints (12GB VRAM).
  • Evaluation had to rely on manual visual inspection.

🔮 Future Work

  • Expand dataset to include longer texts and handwritten Khmer.
  • Integrate speech and handwriting modules for multimodal Khmer AI.
  • Develop a web-based GUI for real-time Khmer text-to-image generation.
  • Optimize for edge devices (mobile, low-power GPUs).
  • Explore larger transformer-based encoders for better text understanding.

Fine-Tuning

This is a base model and is intended to be fine-tuned for specific tasks or datasets. The model was trained on 128×64 RGB images, but the resolution can be adjusted during fine-tuning to match your desired output size.

For best results, it is recommended to fine-tune the following three main components rather than just the core UNet model (a loading sketch follows this list):

  • Text Encoder – [RobertaModel]
  • Variational Autoencoder – [AutoencoderKL]
  • Image Generation Model – [UNet2DConditionModel]
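
A minimal sketch of unfreezing all three components for fine-tuning, assuming they are accessed through the loaded pipeline:

# Load the pipeline and enable gradients on all three components.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("channudam/stable-diffusion-khm-53")
for module in (pipe.text_encoder, pipe.vae, pipe.unet):
    module.requires_grad_(True)  # fine-tune rather than freeze
    module.train()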

Usage (with GPU)

import matplotlib.pyplot as plt
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned pipeline in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "channudam/stable-diffusion-khm-53",
    torch_dtype=torch.float16,
).to("cuda")

# Generate an image for a Khmer word (here: បាត់ដំបង, "Battambang").
image = pipe("បាត់ដំបង", guidance_scale=2).images[0]
plt.imshow(image)
plt.show()
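
Since the model targets synthetic OCR data, here is a hedged follow-up sketch for batch generation; the word list and file naming are illustrative.

# Generate and save word images as synthetic OCR training data.
words = ["បាត់ដំបង", "ភ្នំពេញ"]  # illustrative word list
for i, word in enumerate(words):
    image = pipe(word, guidance_scale=2).images[0]
    image.save(f"khm_word_{i:05d}.png")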

Generated Khmer Text Image


Made with ❤️ by Channudam Ray | Factory.io & CADT, 2025
