---
library_name: transformers
license: mit
datasets:
- kimleang123/khmer-text-dataset
language:
- km
base_model:
- google/mt5-small
pipeline_tag: summarization
---
# Khmer mT5 Summarization Model
## Introduction
This repository contains a **fine-tuned mT5 model for Khmer text summarization**. The model is based on Google's [mT5-small](https://huggingface.co/google/mt5-small) and fine-tuned on a dataset of Khmer text and corresponding summaries.
Fine-tuning was performed using the Hugging Face `Trainer` API, optimizing the model to **generate concise and meaningful summaries of Khmer text**.
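For reference, the snippet below is a minimal sketch of what such a `Seq2SeqTrainer` setup typically looks like. The dataset column names (`text`, `summary`) and all hyperparameters are illustrative assumptions, not the exact configuration used to train this checkpoint.
```python
# Hypothetical fine-tuning sketch; column names and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
dataset = load_dataset("kimleang123/khmer-text-dataset")

def preprocess(batch):
    # Tokenize the source text and the reference summary into input/label pairs.
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="khmer-mt5-summarization",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=5e-5,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```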
---
## Model Details
- **Base Model:** `google/mt5-small`
- **Fine-tuned for:** Khmer text summarization
- **Training Dataset:** `kimleang123/khmer-text-dataset`
- **Framework:** Hugging Face `transformers`
- **Task Type:** Sequence-to-Sequence (Seq2Seq)
- **Input:** Khmer text (articles, paragraphs, or documents)
- **Output:** Summarized Khmer text
- **Training Hardware:** GPU (Tesla T4)
- **Evaluation Metric:** ROUGE Score
---
## Installation & Setup
### 1. Install Dependencies
Ensure you have `transformers`, `torch`, and `datasets` installed:
```bash
pip install transformers torch datasets
```
### 2. Load the Model
To load and use the fine-tuned model:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```
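If a GPU is available, you can optionally move the model onto it; the examples below also work on CPU.
```python
import torch

# Optional: use a GPU when one is available; everything below also runs on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```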
---
## How to Use
### 1. Using Python Code
```python
def summarize_khmer(text, max_length=150):
    # Add the summarization task prefix to the input.
    input_text = f"summarize: {text}"
    # Keep the inputs on the same device as the model (CPU or GPU).
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
    summary_ids = model.generate(**inputs, max_length=max_length, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

khmer_text = "<your Khmer text here>"  # replace with the Khmer article or paragraph to summarize
summary = summarize_khmer(khmer_text)
print("Khmer Summary:", summary)
```
### 2. Using the Hugging Face Pipeline
For a simpler approach:
```python
from transformers import pipeline
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization")
khmer_text = "<your Khmer text here>"  # replace with the Khmer text to summarize
summary = summarizer(khmer_text, max_length=150, min_length=30, do_sample=False)
print("Khmer Summary:", summary[0]['summary_text'])
```
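The pipeline also accepts a list of documents and an optional `device` argument; the inputs below are placeholders for real Khmer text.
```python
# Batch summarization: pass a list of documents and get one summary per input.
# device=0 selects the first GPU; omit it to stay on CPU.
summarizer = pipeline("summarization", model="songhieng/khmer-mt5-summarization", device=0)
documents = ["<Khmer document 1>", "<Khmer document 2>"]  # placeholder inputs
for result in summarizer(documents, max_length=150, min_length=30, do_sample=False):
    print(result["summary_text"])
```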
### 3. Deploy as an API using FastAPI
You can create a simple API for summarization:
```python
from fastapi import FastAPI
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "songhieng/khmer-mt5-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

app = FastAPI()

@app.post("/summarize/")
def summarize(text: str):
    # `text` is received as a query parameter because of the plain `str` annotation.
    inputs = tokenizer(f"summarize: {text}", return_tensors="pt", truncation=True, max_length=512)
    summary_ids = model.generate(**inputs, max_length=150, num_beams=5, length_penalty=2.0, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return {"summary": summary}

# Run with: uvicorn filename:app --reload
```
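Once the server is running, the endpoint can be called from Python, for example with `requests` (the input text below is a placeholder):
```python
import requests

# Assumes the FastAPI app above is running locally on uvicorn's default port.
response = requests.post(
    "http://127.0.0.1:8000/summarize/",
    params={"text": "<your Khmer text here>"},  # placeholder input
)
print(response.json()["summary"])
```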
---
## Model Evaluation
The model was evaluated using **ROUGE scores**, which measure how similar the generated summaries are to the ground truth summaries.
```python
from datasets import load_metric

# Note: recent versions of `datasets` removed load_metric in favour of the
# standalone `evaluate` package (see the sketch below).
rouge = load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # Replace the -100 padding used for loss masking before decoding the labels.
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

trainer.evaluate()
```
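Since `datasets.load_metric` is deprecated in recent releases, an equivalent stand-alone check can be done with the `evaluate` package; the texts below are placeholders:
```python
# Requires: pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
predictions = [summarize_khmer("<Khmer article here>")]   # model-generated summaries
references = ["<reference Khmer summary here>"]           # ground-truth summaries
print(rouge.compute(predictions=predictions, references=references))
```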
---
## Saving & Uploading the Model
After fine-tuning, the model was uploaded to Hugging Face Hub:
```python
model.push_to_hub("songhieng/khmer-mt5-summarization")
tokenizer.push_to_hub("songhieng/khmer-mt5-summarization")
```
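Pushing to the Hub requires an authenticated session with a write token, for example:
```python
# Log in with a Hugging Face access token before calling push_to_hub
# (alternatively, run `huggingface-cli login` in a terminal).
from huggingface_hub import login

login()  # prompts for a token; a token string can also be passed directly
```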
To download it later:
```python
model = AutoModelForSeq2SeqLM.from_pretrained("songhieng/khmer-mt5-summarization")
tokenizer = AutoTokenizer.from_pretrained("songhieng/khmer-mt5-summarization")
```
---
## Summary
| **Feature** | **Details** |
|------------|------------|
| **Base Model** | `google/mt5-small` |
| **Task** | Summarization |
| **Language** | Khmer (ខ្មែរ) |
| **Dataset** | `kimleang123/khmer-text-dataset` |
| **Framework** | Hugging Face Transformers |
| **Evaluation Metric** | ROUGE Score |
| **Deployment** | Hugging Face Model Hub, API (FastAPI), Python Code |
---
## Contributing
Contributions are welcome! Feel free to **open issues or submit pull requests** if you find any improvements.
### Contact
If you have any questions, feel free to reach out via [Hugging Face Discussions](https://huggingface.co/) or create an issue in the repository.
**Built for the Khmer NLP Community**