---
library_name: transformers
license: mit
datasets:
- kimleang123/khmer-text-dataset
language:
- km
base_model:
- google/mt5-small
pipeline_tag: summarization
---
# Khmer mT5 Summarization Model

## 📌 Introduction
This repository contains a **fine-tuned mT5 model for Khmer text summarization**. The model is based on Google's [mT5-small](https://huggingface.co/google/mt5-small) and fine-tuned on a dataset of Khmer text and corresponding summaries.

Fine-tuning was performed using the Hugging Face `Trainer` API, optimizing the model to **generate concise and meaningful summaries of Khmer text**.

---

## 🚀 Model Details
- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization
- **Training Dataset:** `kimleang123/khmer-text-dataset`
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents)
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score

---

## 🔧 Installation & Setup
### 1️⃣ Install Dependencies
Ensure you have `transformers`, `torch`, and `datasets` installed. The mT5 tokenizer may also require `sentencepiece`:
```bash
pip install transformers torch datasets sentencepiece
```

### 2️⃣ Load the Model
To load and use the fine-tuned model:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

---

## 📌 How to Use
### 1️⃣ Using Python Code
```python
def summarize_khmer(text, max_length=150):
    # Prefix the input with the "summarize: " task prompt
    input_text = f"summarize: {text}"
    # Inputs longer than 512 tokens are truncated
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    # Beam search with 5 beams; a length_penalty above 1.0 favors longer summaries
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Roughly: "Cambodia has a population of about 16 million, and it is a country in Southeast Asia."
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarize_khmer(khmer_text)
print("🔹 Khmer Summary:", summary)
```
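
Because the tokenizer call above truncates inputs at 512 tokens, anything beyond that point in a long document is ignored. One possible workaround, sketched below, is to split the text on the Khmer full stop (`។`) and summarize each chunk separately. The helper name `split_khmer_text` and the `max_chars` budget are hypothetical and not part of this repository; the character count is only a rough stand-in for the token limit.

```python
import re

def split_khmer_text(text, max_chars=1000):
    """Group sentences into chunks of at most `max_chars` characters.

    Sentences are split on the Khmer full stop "។". A single sentence that is
    longer than `max_chars` is kept whole rather than cut mid-sentence.
    `max_chars` is an assumed budget, not a value taken from the model.
    """
    # Each match is a run of non-terminator characters plus an optional "។"
    sentences = [s.strip() for s in re.findall(r"[^។]+។?", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current = sentence if not current else current + " " + sentence
    if current:
        chunks.append(current)
    return chunks

# Demo: repeat one sentence to simulate a longer document
sentence = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
for chunk in split_khmer_text(sentence * 3, max_chars=100):
    print(len(chunk), chunk)
```

Each chunk can then be passed to `summarize_khmer` on its own, for example `[summarize_khmer(c) for c in split_khmer_text(long_text)]`. Whether the per-chunk summaries combine into a good overall summary has not been evaluated here.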

### 2️⃣ Using Hugging Face Pipeline
For a simpler approach:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់ ហើយវាគឺជាប្រទេសនៅតំបន់អាស៊ីអាគ្នេយ៍។"
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("🔹 Khmer Summary:", summary[0]['summary_text'])
```

### 3️⃣ Deploy as an API using FastAPI
You can create a simple API for summarization:
```python
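from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the model once at startup so every request reuses it
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

class SummarizeRequest(BaseModel):
    text: str  # Khmer text to summarize, sent as JSON: {"text": "..."}

@app.post("/summarize/")
def summarize(request: SummarizeRequest):
    inputs = tokenizer(f"summarize: {request.text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Requires: pip install fastapi uvicorn
# Run with: uvicorn filename:app --reload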
```

---

## 📊 Model Evaluation
The model was evaluated using **ROUGE scores**, which measure how similar the generated summaries are to the ground truth summaries.

```python
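# Requires: pip install evaluate rouge_score
# `datasets.load_metric` is deprecated and missing from recent `datasets` releases;
# the `evaluate` package provides the same metric.
import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(pred):
    pred_ids = pred.predictions
    # Label padding is stored as -100; restore the pad token id before decoding
    labels_ids = np.where(pred.label_ids != -100, pred.label_ids, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

# `trainer` is the Trainer instance used for fine-tuning (not shown here),
# created with compute_metrics=compute_metrics
trainer.evaluate()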
```
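
To make concrete what ROUGE measures, here is a minimal sketch of a ROUGE-1 style F1 score, which counts the unigrams shared by a generated summary and its reference. It is a simplified illustration, not the implementation used to evaluate this model. The function name `rouge1_f1` is hypothetical, and the character-level tokenization is an assumption made only because Khmer is written without spaces between words.

```python
from collections import Counter

def rouge1_f1(prediction_tokens, reference_tokens):
    """Unigram-overlap F1 between two token lists (a simplified ROUGE-1)."""
    pred_counts = Counter(prediction_tokens)
    ref_counts = Counter(reference_tokens)
    # Clipped overlap: a token counts only as often as it occurs in both lists
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

# Assumption: characters stand in for words here; a real evaluation would more
# likely use a Khmer word segmenter before counting overlaps.
reference = list("កម្ពុជាមានប្រជាជនប្រមាណ ១៦ លាននាក់")
prediction = list("កម្ពុជាមានប្រជាជន")
print(f"ROUGE-1 F1 (character level): {rouge1_f1(prediction, reference):.3f}")
```

Precision falls when the summary contains tokens that are absent from the reference, and recall falls when it omits reference tokens; the F1 score balances the two.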

---

## 💾 Saving & Uploading the Model
After fine-tuning, the model was uploaded to Hugging Face Hub:
```python
model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")
```
To download it later:
```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")
```

---

## 🎯 Summary
| **Feature** | **Details** |
|------------|------------|
| **Base Model** | `google/mt5-small` |
| **Task** | Summarization |
| **Language** | Khmer (ខ្មែរ) |
| **Dataset** | `kimleang123/khmer-text-dataset` |
| **Framework** | Hugging Face Transformers |
| **Evaluation Metric** | ROUGE Score |
| **Deployment** | Hugging Face Model Hub, API (FastAPI), Python Code |

---

## 🀝 Contributing
Contributions are welcome! Feel free to **open issues or submit pull requests** if you find any improvements.

### 📬 Contact
If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.

📌 **Built for the Khmer NLP Community** 🇰🇭 🚀