---
license: apache-2.0
base_model: google/mt5-small
tags:
- generated_from_trainer
- ukrainian
- style-transfer
- text-editing
- mt5
model-index:
- name: mt5-small-ukrainian-style-editor
  results: []
---

# mt5-small-ukrainian-style-editor

This model is a fine-tuned version of [google/mt5-small](https://huggingface.co/google/mt5-small) designed for **stylistic editing of Ukrainian texts**.
It transforms raw or non-native phrasing into polished, stylistically refined Ukrainian, making it suitable for academic, journalistic, or official contexts.

It achieves the following results on the evaluation set:
- Loss: 0.2027
- Score: 41.4271
- Sys Len: 25663
- Ref Len: 34270

## 🧠 Model Description

This model was trained using a hybrid approach that combines:
- Dictionary-based style correction (e.g., calque removal), illustrated by the sketch below.
- Fine-tuning on paragraph-aligned pairs of original and stylistically improved Ukrainian text.

The base model is multilingual T5 (mT5), whose encoder-decoder architecture and cross-lingual pretraining are adapted here to the specifics of Ukrainian syntax and style.
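
As a rough illustration of the dictionary-based step, the sketch below applies a small lookup of common calques. The entries and the helper function are hypothetical examples, not the actual resource used during training.

```python
import re

# Hypothetical calque dictionary: common calques mapped to preferred Ukrainian phrasing.
CALQUE_MAP = {
    "приймати участь": "брати участь",  # "take part" (calque) -> idiomatic form
    "у якості": "як",                   # "in the capacity of" -> "as"
    "на протязі": "протягом",           # "over the course of" (calque) -> correct form
}

def apply_dictionary_corrections(text: str) -> str:
    """Replace known calques with their preferred equivalents."""
    for calque, replacement in CALQUE_MAP.items():
        text = re.sub(re.escape(calque), replacement, text, flags=re.IGNORECASE)
    return text

# "We plan to take part in the conference." -> rewritten with the idiomatic phrasing
print(apply_dictionary_corrections("Ми плануємо приймати участь у конференції."))
```
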
## 📌 Intended Uses & Limitations

### ✅ Intended Uses
- Stylistic enhancement of Ukrainian texts.
- Detection and correction of translationese or poor phrasing.
- Text improvement for public communication, official writing, and journalism.

### ⚠️ Limitations
- Not intended for grammar correction or spell-checking.
- May occasionally preserve non-stylistic errors if present in training data.
- Performance is best on formal or semi-formal text.

## 📊 Training and Evaluation Data

Training used a custom dataset uploaded to Hugging Face: [Kulynych/training_data](https://huggingface.co/datasets/Kulynych/training_data).
Each entry contains two fields (previewed in the sketch below):
- `input_text`: raw Ukrainian text (possibly containing calques or awkward phrasing).
- `target_text`: human-edited version of the same paragraph, stylistically improved.

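A minimal sketch for inspecting the data; the `train` split name is an assumption about the dataset configuration, while the field names follow the description above.

```python
from datasets import load_dataset

# Load the paired style-editing data (assumed default configuration and "train" split).
ds = load_dataset("Kulynych/training_data", split="train")

example = ds[0]
print(example["input_text"])   # raw paragraph, possibly containing calques
print(example["target_text"])  # stylistically edited reference
```
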
## Training procedure

### Framework versions
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1

### Evaluation Metric

- **SacreBLEU** score: **41.43** (after the 2nd epoch)
- **Validation loss**: **0.2027**

| Epoch | Step | Val Loss | SacreBLEU | Brevity Penalty | Precisions (1-4 gram, %) |
|-------|------|----------|-----------|-----------------|--------------------------------|
| 1 | 3129 | 0.2095 | 41.11 | 0.7240 | 71.48 / 58.88 / 52.94 / 46.63 |
| 2 | 6258 | 0.2027 | 41.43 | 0.7151 | 72.67 / 60.20 / 54.19 / 47.51 |

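For reference, a score in this format can be reproduced with the `evaluate` library; the sentences below are placeholders, not the actual evaluation set.

```python
import evaluate

# SacreBLEU takes plain-text predictions and a list of reference sets per prediction.
sacrebleu = evaluate.load("sacrebleu")
predictions = ["Згідно з отриманими даними, ситуація погіршилася."]
references = [["За отриманими даними, ситуація погіршилася."]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level score, same scale as the table above
```
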
## 💻 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Kulynych/mt5-small-ukrainian-style-editor")
model = AutoModelForSeq2SeqLM.from_pretrained("Kulynych/mt5-small-ukrainian-style-editor")

# Example input: "According to the data that we received, the situation has worsened."
text = "Згідно з даними, котрі ми отримали, ситуація погіршилась."
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_length=192)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
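
The decoding settings above are minimal. For longer documents, splitting the input into paragraphs and tuning generation parameters such as `num_beams` or `max_length` may improve output quality.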