🇻🇳 Vietnamese Image Captioning Model

EfficientNet-B0 × BARTPho | Trained on the UIT-ViIC dataset

📌 Introduction

A Vietnamese image captioning model trained on the UIT-ViIC dataset. Given an image, it generates a natural and accurate description in Vietnamese.

Applications:

🔍 Searching images with natural-language queries
🦯 Helping visually impaired users access image content
🤖 Integration into multimodal AI systems

🧠 Model architecture

Component	Description
Encoder	EfficientNet-B0 (pretrained weights from NVIDIA's Torch Hub) → extracts image features as an embedding vector
Decoder	BARTPho-Syllable → generates the caption from the image features

Pipeline:

Image → EncoderCNN (EfficientNet-B0) → feature vector (embed size = 768)
    → Linear projection → BARTPho encoder
    → BARTPho decoder → Vietnamese caption
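
The projection step in the pipeline can be sketched as below. This is a shape-only illustration, not the author's implementation: the class name `ProjectionSketch`, the 1280-dimensional pooled EfficientNet-B0 feature, the BARTPho hidden size of 1024, and the choice to present the image as a one-token sequence are all assumptions. Check `config.d_model` on the loaded BARTPho model for the actual hidden size.

```python
import torch
import torch.nn as nn


class ProjectionSketch(nn.Module):
    """Hypothetical stand-in for the CNN head plus the linear projection."""

    def __init__(self, cnn_dim=1280, embed_size=768, bart_dim=1024):
        super().__init__()
        # Maps pooled CNN features to the embed size quoted in the pipeline.
        self.embed = nn.Linear(cnn_dim, embed_size)
        # Maps the embedding to the (assumed) BARTPho hidden size.
        self.project = nn.Linear(embed_size, bart_dim)

    def forward(self, pooled_features):
        embedded = self.embed(pooled_features)   # (batch, embed_size)
        projected = self.project(embedded)       # (batch, bart_dim)
        # Add a sequence axis so the encoder sees a length-1 sequence.
        return projected.unsqueeze(1)            # (batch, 1, bart_dim)


pooled = torch.randn(2, 1280)  # random stand-in for pooled image features
encoder_inputs = ProjectionSketch()(pooled)
print(encoder_inputs.shape)  # torch.Size([2, 1, 1024])
```

A tensor of this shape could be passed as `inputs_embeds` to a Hugging Face BART-style encoder, assuming that is how the model is wired.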

⚙️ Training configuration

Parameter	Value
Dataset	UIT-ViIC (train/val/test)
Loss	CrossEntropyLoss (padding tokens ignored)
Optimizer	Adam (lr = 5e-5)
Batch size	32
Epochs	30
Gradient clipping	1.0
Mixed precision	torch.cuda.amp
Image augmentation	Resize(256) → RandomCrop(224) → Normalize(ImageNet)
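
The settings in the table fit together in a single optimisation step roughly as shown below. This is a minimal sketch under stated assumptions, not the actual training script: the tiny `model` is a stand-in for the captioning network, and `PAD_ID`, `VOCAB_SIZE`, and `train_step` are hypothetical names. In real training the padding id would come from the BARTPho tokenizer. Mixed precision is only enabled when a CUDA device is available.

```python
import torch
import torch.nn as nn

PAD_ID = 1        # assumption: replace with tokenizer.pad_token_id
VOCAB_SIZE = 50   # toy vocabulary size for the stand-in model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"

# Stand-in for the captioning model: token ids -> per-token logits.
model = nn.Sequential(nn.Embedding(VOCAB_SIZE, 16), nn.Linear(16, VOCAB_SIZE)).to(device)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)      # padding tokens ignored
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(tokens, targets):
    """Run one optimisation step and return the loss as a float."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        logits = model(tokens)                             # (batch, seq, vocab)
        loss = criterion(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                             # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


tokens = torch.randint(2, VOCAB_SIZE, (4, 6), device=device)
targets = tokens.clone()
targets[:, -2:] = PAD_ID   # pretend the last two positions are padding
loss_value = train_step(tokens, targets)
print(f"loss: {loss_value:.4f}")
```

Calling `scaler.unscale_` before `clip_grad_norm_` means the 1.0 threshold applies to the true gradient norm rather than the scaled one.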

📊 Supported metrics

BLEU
ROUGE-L
METEOR
CIDEr
Mean token-level F1
Mean token-level recall

Actual scores depend on which checkpoint is loaded.
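
The token-level scores could be computed along the lines below. The exact definition used by this model's evaluation code is not stated here, so this sketch assumes whitespace tokenization and bag-of-tokens overlap between one predicted caption and one reference. The function name `token_scores` is hypothetical.

```python
from collections import Counter


def token_scores(prediction, reference):
    """Return (precision, recall, f1) from multiset token overlap."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = token_scores(
    "một người đàn ông đang chơi tennis",
    "người đàn ông đang đánh tennis trên sân",
)
print(round(recall, 3), round(f1, 3))  # 0.625 0.667
```

Averaging these per-caption values over the test set would give the mean scores listed above, under the same assumptions.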

🚀 Usage

import torch
from PIL import Image
from torchvision import transforms
from image_caption import ImageCaptioningModel, Vocabulary
from huggingface_hub import hf_hub_download

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_name = "vinai/bartpho-syllable"

# Load vocab & model
vocab = Vocabulary(model_name=model_name)
model = ImageCaptioningModel(
    embed_size=768,
    bartpho_model_name=model_name,
    train_CNN=False,
    freeze_bartpho=False,
).to(DEVICE)

# Download the checkpoint from Hugging Face
# ("username" is a placeholder for the repository owner)
ckpt_path = hf_hub_download(
    repo_id="username/vietnamese-image-captioning",
    filename="best_image_captioning_model_vietnamese.pth.tar",
)
model.load_state_dict(torch.load(ckpt_path, map_location=DEVICE)["state_dict"])
model.eval()

# Image preprocessing (deterministic center crop for inference)
tfm = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("your_image.jpg").convert("RGB")
img = tfm(img).to(DEVICE)

with torch.no_grad():
    caption = model.predict(img, vocab, max_length=50)

print("Caption:", caption)

📜 License

Model: Subject to the licenses of BARTPho and EfficientNet
Dataset: UIT-ViIC (research and educational use only)

👤 Author

Nguyễn Thành Đạt