---
license: apache-2.0
base_model: google/mt5-small
tags:
- generated_from_trainer
- ukrainian
- style-transfer
- text-editing
- mt5
model-index:
- name: mt5-small-ukrainian-style-editor
  results: []
---

# mt5-small-ukrainian-style-editor

This model is a fine-tuned version of [google/mt5-small](https://huggingface.co/google/mt5-small) designed for **stylistic editing of Ukrainian texts**.
It transforms raw or non-native phrasing into polished, stylistically refined Ukrainian, making it suitable for academic, journalistic, or official contexts.

It achieves the following results on the evaluation set:
- Loss: 0.2027
- Score: 41.4271
- Sys Len: 25663
- Ref Len: 34270

## 🧠 Model Description

This model was trained using a hybrid approach that combines:
- Dictionary-based style correction (e.g., calque removal), illustrated by the sketch below.
- Fine-tuning on paragraph-aligned pairs of original and stylistically improved Ukrainian text.

The base model is multilingual T5 (mT5), whose encoder-decoder architecture and cross-lingual pretraining are adapted here to the specifics of Ukrainian syntax and style.
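
As a rough illustration of the dictionary-based step, the sketch below applies a small lookup of common calques. The entries and the helper function are hypothetical examples, not the actual resource used during training.

```python
import re

# Hypothetical calque dictionary: common calques mapped to preferred Ukrainian phrasing.
CALQUE_MAP = {
    "приймати участь": "брати участь",  # "take part" (calque) -> idiomatic form
    "у якості": "як",                   # "in the capacity of" -> "as"
    "на протязі": "протягом",           # "over the course of" (calque) -> correct form
}

def apply_dictionary_corrections(text: str) -> str:
    """Replace known calques with their preferred equivalents."""
    for calque, replacement in CALQUE_MAP.items():
        text = re.sub(re.escape(calque), replacement, text, flags=re.IGNORECASE)
    return text

# "We plan to take part in the conference." -> rewritten with the idiomatic phrasing
print(apply_dictionary_corrections("Ми плануємо приймати участь у конференції."))
```
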
## 📌 Intended Uses & Limitations

### ✅ Intended Uses
- Stylistic enhancement of Ukrainian texts.
- Detection and correction of translationese or poor phrasing.
- Text improvement for public communication, official writing, and journalism.

### ⚠️ Limitations
- Not intended for grammar correction or spell-checking.
- May occasionally preserve non-stylistic errors if present in training data.
- Performance is best on formal or semi-formal text.

## 📊 Training and Evaluation Data

Training used a custom dataset uploaded to Hugging Face: [Kulynych/training_data](https://huggingface.co/datasets/Kulynych/training_data).
Each entry contains two fields (previewed in the sketch below):
- `input_text`: raw Ukrainian text (possibly containing calques or awkward phrasing).
- `target_text`: human-edited version of the same paragraph, stylistically improved.

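A minimal sketch for inspecting the data; the `train` split name is an assumption about the dataset configuration, while the field names follow the description above.

```python
from datasets import load_dataset

# Load the paired style-editing data (assumed default configuration and "train" split).
ds = load_dataset("Kulynych/training_data", split="train")

example = ds[0]
print(example["input_text"])   # raw paragraph, possibly containing calques
print(example["target_text"])  # stylistically edited reference
```
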
## Training procedure

### Framework versions
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1

### Evaluation Metric

- **SacreBLEU** score: **41.43** (after the 2nd epoch)
- **Validation loss**: **0.2027**

| Epoch | Step | Val Loss | SacreBLEU | Brevity Penalty | Precisions (1-4 gram, %) |
|-------|------|----------|-----------|-----------------|--------------------------------|
| 1 | 3129 | 0.2095 | 41.11 | 0.7240 | 71.48 / 58.88 / 52.94 / 46.63 |
| 2 | 6258 | 0.2027 | 41.43 | 0.7151 | 72.67 / 60.20 / 54.19 / 47.51 |

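For reference, a score in this format can be reproduced with the `evaluate` library; the sentences below are placeholders, not the actual evaluation set.

```python
import evaluate

# SacreBLEU takes plain-text predictions and a list of reference sets per prediction.
sacrebleu = evaluate.load("sacrebleu")
predictions = ["Згідно з отриманими даними, ситуація погіршилася."]
references = [["За отриманими даними, ситуація погіршилася."]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(result["score"])  # corpus-level score, same scale as the table above
```
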
## 💻 How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Kulynych/mt5-small-ukrainian-style-editor")
model = AutoModelForSeq2SeqLM.from_pretrained("Kulynych/mt5-small-ukrainian-style-editor")

# Example input: "According to the data that we received, the situation has worsened."
text = "Згідно з даними, котрі ми отримали, ситуація погіршилась."
inputs = tokenizer(text, return_tensors="pt")
output = model.generate(**inputs, max_length=192)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
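
The decoding settings above are minimal. For longer documents, splitting the input into paragraphs and tuning generation parameters such as `num_beams` or `max_length` may improve output quality.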