Kulynych committed (verified) commit 9bf9fa7 · Parent(s): aaa0d67

Update README.md

Files changed (1): README.md (+50 −10)
---
license: apache-2.0
base_model: google/mt5-small
tags:
- generated_from_trainer
- ukrainian
- style-transfer
- text-editing
- mt5
model-index:
- name: mt5-small-ukrainian-style-editor
  results: []
---

# mt5-small-ukrainian-style-editor

This model is a fine-tuned version of [google/mt5-small](https://huggingface.co/google/mt5-small) designed for **stylistic editing of Ukrainian texts**. It transforms raw or non-native phrasing into polished, stylistically improved Ukrainian, making it suitable for academic, journalistic, or official contexts.

It achieves the following results on the evaluation set:
- Loss: 0.2027
- Score: 41.4271
- Sys Len: 25663
- Ref Len: 34270

## 🧠 Model Description

This model was trained using a hybrid approach, combining:
- Dictionary-based style correction (e.g., calque removal).
- Fine-tuning on paragraph-aligned pairs of original and stylistically improved Ukrainian text.

The base model is multilingual T5 (mT5), whose encoder-decoder architecture and cross-lingual generalization are adapted here to the specifics of Ukrainian syntax and style.
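The dictionary-based stage can be illustrated with a minimal sketch. The mapping below is a hypothetical two-entry excerpt for illustration only, not the actual dictionary used in training:

```python
import re

# Hypothetical excerpt of a calque dictionary: Russian-influenced phrasings
# mapped to their preferred Ukrainian equivalents (illustrative entries only).
CALQUE_MAP = {
    "приймати участь": "брати участь",
    "у більшості випадків": "здебільшого",
}

def apply_dictionary_corrections(text: str) -> str:
    """Replace known calques before the text is passed to the model."""
    for calque, replacement in CALQUE_MAP.items():
        text = re.sub(re.escape(calque), replacement, text, flags=re.IGNORECASE)
    return text

# "We will take part in the meeting." — calque replaced with the idiomatic verb.
print(apply_dictionary_corrections("Ми будемо приймати участь у зустрічі."))
# → Ми будемо брати участь у зустрічі.
```

In the actual pipeline this pre-correction would run before fine-tuned generation, handling the deterministic substitutions and leaving subtler rewording to the model.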
 
## 📌 Intended Uses & Limitations

### Intended Uses
- Stylistic enhancement of Ukrainian texts.
- Detection and correction of translationese and awkward phrasing.
- Text improvement for public communication, official writing, and journalism.

### ⚠️ Limitations
- Not intended for grammar correction or spell-checking.
- May occasionally preserve non-stylistic errors present in the training data.
- Performance is best on formal or semi-formal text.
 
## 📊 Training and Evaluation Data

Training used a custom dataset uploaded to Hugging Face: [Kulynych/training_data](https://huggingface.co/datasets/Kulynych/training_data). Each entry contains:
- `input_text`: raw Ukrainian text (possibly containing calques or awkward phrasing).
- `target_text`: a human-edited version of the same paragraph, stylistically improved.
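A record in this schema can be sketched as follows. The example pair is illustrative (invented for this card, not taken from the dataset):

```python
# Illustrative record mirroring the dataset schema described above.
record = {
    "input_text": "Згідно з даними, котрі ми отримали, ситуація погіршилась.",
    "target_text": "За отриманими даними, ситуація погіршилася.",
}

def to_training_pair(record: dict) -> tuple[str, str]:
    """Map a dataset record to the (source, target) pair used for seq2seq fine-tuning."""
    return record["input_text"], record["target_text"]

src, tgt = to_training_pair(record)
print(src)  # the raw paragraph fed to the encoder
print(tgt)  # the stylistically edited paragraph used as the decoder target
```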
 
## Training procedure

### Framework versions
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1

### Evaluation Metric
- **SacreBLEU** score: **41.43** (after the 2nd epoch)
- **Validation Loss**: **0.2027**

| Epoch | Step | Val Loss | SacreBLEU | BP     | Precisions (%)               |
|-------|------|----------|-----------|--------|------------------------------|
| 1     | 3129 | 0.2095   | 41.11     | 0.7240 | 71.48, 58.88, 52.94, 46.63   |
| 2     | 6258 | 0.2027   | 41.43     | 0.7151 | 72.67, 60.20, 54.19, 47.51   |
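These numbers are internally consistent with SacreBLEU's standard formula: BLEU = BP × exp(mean of the log n-gram precisions), where the brevity penalty is BP = exp(1 − ref_len/sys_len) when the system output is shorter than the reference. A quick sanity check against the epoch-2 row:

```python
import math

# Epoch-2 statistics from the table above.
precisions = [72.67, 60.20, 54.19, 47.51]  # 1- to 4-gram precisions, %
sys_len, ref_len = 25663, 34270

# Brevity penalty: applied because the system output is shorter than the reference.
bp = math.exp(1 - ref_len / sys_len)

# BLEU = BP * geometric mean of the four n-gram precisions.
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(f"BP = {bp:.4f}, BLEU = {bleu:.2f}")
# → BP = 0.7151, BLEU = 41.42 (matches the reported 41.43 up to rounding of the inputs)
```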

## 💻 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Kulynych/mt5-small-ukrainian-style-editor")
model = AutoModelForSeq2SeqLM.from_pretrained("Kulynych/mt5-small-ukrainian-style-editor")

# "According to the data we received, the situation worsened."
text = "Згідно з даними, котрі ми отримали, ситуація погіршилась."
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_length=192)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
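Since the model was trained on paragraph-aligned pairs and the snippet above generates with `max_length=192`, long documents are best edited paragraph by paragraph. Below is a minimal, model-agnostic pre-chunking helper; the word-count cap is an illustrative proxy for the token budget, not an exact limit:

```python
def split_into_paragraphs(text: str, max_words: int = 120) -> list[str]:
    """Split a document into paragraph-sized chunks suitable for the editor.

    Chunks are taken from blank-line boundaries; an over-long paragraph is
    further split on sentence ends so each chunk stays under max_words.
    """
    chunks = []
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if len(para.split()) <= max_words:
            chunks.append(para)
            continue
        # Greedily pack sentences into chunks under the word budget.
        current = []
        sentences = para.replace("! ", "!\n").replace("? ", "?\n").replace(". ", ".\n").split("\n")
        for sentence in sentences:
            if current and len(" ".join(current + [sentence]).split()) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.append(sentence)
        if current:
            chunks.append(" ".join(current))
    return chunks

doc = "Перший абзац.\n\nДругий абзац."
print(split_into_paragraphs(doc))  # → ['Перший абзац.', 'Другий абзац.']
```

Each chunk can then be passed through `model.generate` as in the example above and the edited paragraphs rejoined with blank lines.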