---
library_name: transformers
license: mit
datasets:
- kimleang123/khmer-text-dataset
language:
- km
base_model:
- google/mt5-small
pipeline_tag: summarization
---
# Khmer mT5 Summarization Model

## 📌 Introduction
This repository contains a **fine-tuned mT5 model for Khmer text summarization**. The model is based on Google's [mT5-small](https://huggingface.co/google/mt5-small) and fine-tuned on a dataset of Khmer text and corresponding summaries.

Fine-tuning was performed using the Hugging Face `Trainer` API, optimizing the model to **generate concise and meaningful summaries of Khmer text**.
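
As a rough illustration of the data preparation that such a fine-tuning run typically needs, the sketch below tokenizes article/summary pairs into model inputs and labels. It is not the exact training script: the column names `text` and `summary` and the length limits are assumptions chosen for illustration, and the `summarize: ` prefix simply mirrors the inference examples in this README.

```python
def preprocess(batch, tokenizer, max_input_len=512, max_target_len=150):
    """Tokenize a batch of articles and summaries for Seq2Seq fine-tuning.

    `batch` is assumed to be a dict with hypothetical "text" and "summary"
    columns; rename them to match the actual dataset schema.
    """
    # Prepend the same task prefix that the inference examples use.
    sources = [f"summarize: {t}" for t in batch["text"]]
    model_inputs = tokenizer(sources, max_length=max_input_len, truncation=True)

    # `text_target` tells the tokenizer to encode the summaries as targets.
    labels = tokenizer(
        text_target=batch["summary"],
        max_length=max_target_len,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With the `datasets` library, a function like this would usually be applied via `dataset.map(lambda b: preprocess(b, tokenizer), batched=True)` before handing the result to the trainer.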

---

## 🚀 Model Details
- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization
- **Training Dataset:** `kimleang123/khmer-text-dataset`
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents)
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score

---

## 🔧 Installation & Setup
### 1️⃣ Install Dependencies
Ensure you have `transformers`, `torch`, and `datasets` installed. The mT5 tokenizer is SentencePiece-based, so `sentencepiece` is included as well:
```bash
pip install transformers torch datasets sentencepiece
```
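
To confirm the environment is ready before loading the model, a small check like the following may help. It is a minimal sketch that only tests whether each package can be found; `sentencepiece` is listed on the assumption that the mT5 tokenizer needs it, and the helper name `missing_packages` is made up for this example.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# `sentencepiece` is an assumption: mT5's tokenizer is SentencePiece-based.
missing = missing_packages(["transformers", "torch", "datasets", "sentencepiece"])
print("Missing packages:", missing or "none")
```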

### 2️⃣ Load the Model
To load and use the fine-tuned model:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

---

## 📌 How to Use
### 1️⃣ Using Python Code
```python
def summarize_khmer(text, max_length=150):
    input_text = f"summarize: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
```
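
The `num_beams=5` and `length_penalty=2.0` arguments above control beam search. As far as the `transformers` documentation describes it, finished beams are ranked by the sum of token log-probabilities divided by the sequence length raised to `length_penalty`, so values above zero favor longer outputs. The sketch below illustrates that scoring rule with made-up numbers; it is not the library's implementation.

```python
def beam_score(sum_logprobs, length, length_penalty=2.0):
    """Length-normalized beam score: sum of token log-probs / length**penalty."""
    return sum_logprobs / (length ** length_penalty)

# Two hypothetical finished beams with the same average log-prob per token (-0.5).
short = beam_score(sum_logprobs=-5.0, length=10)   # -5 / 10**2
long_ = beam_score(sum_logprobs=-10.0, length=20)  # -10 / 20**2

# With length_penalty=2.0 the longer beam ranks higher (its score is closer to zero).
print("short:", short, "long:", long_)

# With length_penalty=1.0 both beams receive the same score.
print(beam_score(-5.0, 10, 1.0) == beam_score(-10.0, 20, 1.0))
```

If summaries come out longer than desired, lowering `length_penalty` toward 1.0 is one knob to try.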

### 2️⃣ Using Hugging Face Pipeline
For a simpler approach:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer Summary:", summary[0]['summary_text'])
```

### 3️⃣ Deploy as an API using FastAPI
You can create a simple API for summarization:
```python
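from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup rather than on every request.
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str  # Khmer text to summarize, sent in the JSON request body

@app.post("/summarize/")
def summarize(request: SummarizeRequest):
    inputs = tokenizer(f"summarize: {request.text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload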
```
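
The examples above truncate the input at 512 tokens, so anything beyond that limit is ignored. One possible workaround for longer documents is to split the text into chunks at the Khmer full stop (។) and summarize each chunk separately. The sketch below shows only the splitting step; `chunk_khmer_text` is a hypothetical helper, and the character budget `max_chars` is a rough stand-in for the token limit, not a measured value.

```python
import re

def chunk_khmer_text(text, max_chars=1000):
    """Split text at the Khmer full stop (។) and pack sentences into chunks.

    `max_chars` is an assumed proxy for the model's 512-token limit.
    A single sentence longer than `max_chars` is kept whole.
    """
    # Each match is a run of text plus its trailing ។ (if present).
    sentences = [s.strip() for s in re.findall(r"[^។]+។?", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the summarization code and the partial summaries joined. How well the combined summary reads has not been evaluated here.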

---

## 📊 Model Evaluation
The model was evaluated using **ROUGE scores**, which measure how much the generated summaries overlap with the reference (ground-truth) summaries.

```python
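import numpy as np
import evaluate  # pip install evaluate rouge_score

# `datasets.load_metric` is deprecated; the `evaluate` library replaces it.
rouge = evaluate.load("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    # Assumes predictions are generated token IDs rather than logits.
    pred_ids = pred.predictions
    # Labels use -100 for ignored positions; restore the pad token before decoding.
    labels_ids = np.where(labels_ids != -100, labels_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the trainer instance from fine-tuning, created with compute_metrics=compute_metrics.
trainer.evaluate()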
```
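
Khmer is written without spaces between words, so ROUGE implementations that tokenize on whitespace or assume English text may not split it into useful units. To make the idea behind the metric concrete, the sketch below computes a character-level ROUGE-1 F1 in plain Python. Treating each character as a token is an assumption made for this illustration, not the evaluation setup used for this model, and `rouge1_char_f1` is a hypothetical helper.

```python
from collections import Counter

def rouge1_char_f1(prediction, reference):
    """Character-level ROUGE-1 F1: unigram overlap with characters as tokens."""
    pred = Counter(prediction.replace(" ", ""))
    ref = Counter(reference.replace(" ", ""))

    # Clipped overlap: each character counts at most as often as in the reference.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_char_f1("កម្ពុជា", "កម្ពុជា"))  # identical strings score 1.0
```

A word-level variant would need a Khmer word segmenter, which is outside the scope of this README.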

---

## 💾 Saving & Uploading the Model
After fine-tuning, the model and tokenizer were uploaded to the Hugging Face Hub:
```python
model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")
```
To download it later:
```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")
```

---

## 🎯 Summary
| **Feature** | **Details** |
|------------|------------|
| **Base Model** | `google/mt5-small` |
| **Task** | Summarization |
| **Language** | Khmer (ខ្មែរ) |
| **Dataset** | `kimleang123/khmer-text-dataset` |
| **Framework** | Hugging Face Transformers |
| **Evaluation Metric** | ROUGE Score |
| **Deployment** | Hugging Face Model Hub, API (FastAPI), Python Code |

---

## 🤝 Contributing
Contributions are welcome! Feel free to **open issues or submit pull requests** if you have improvements to suggest.

### 📬 Contact
If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

📌 **Built for the Khmer NLP Community** 🇰🇭 🚀