---
library_name: transformers
pipeline_tag: translation
tags:
- transformers
- translation
- pytorch
- russian
- kazakh

license: apache-2.0
language:
- ru
- kk
datasets:
- issai/kazparc
---

# kazRush-ru-kk

kazRush-ru-kk is a translation model for translating from Russian to Kazakh. The model was trained from scratch (with randomly initialized weights) using a T5 configuration on available open-source parallel data.
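
Training from scratch amounts to instantiating a T5 model from a fresh configuration rather than from a pretrained checkpoint. The sketch below illustrates this setup with toy hyperparameters; the values shown are illustrative only, not the ones used for this checkpoint (the actual configuration is stored in the model's `config.json`):

```python
from transformers import T5Config, T5ForConditionalGeneration

# Toy hyperparameters for illustration only.
config = T5Config(
    vocab_size=32_000,
    d_model=256,
    d_kv=32,
    d_ff=512,
    num_layers=4,
    num_heads=4,
)
# Instantiating from a config (not from_pretrained) gives randomly
# initialized weights, ready for training from scratch.
model = T5ForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```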

## Usage

Using the model requires the `sentencepiece` library to be installed.

After installing the necessary dependencies, the model can be run with the following code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/kazRush-ru-kk').to(device)
tokenizer = AutoTokenizer.from_pretrained('deepvk/kazRush-ru-kk')

@torch.inference_mode()
def generate(text, **kwargs):
    inputs = tokenizer(text, return_tensors='pt').to(device)
    hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
    return tokenizer.decode(hypotheses[0], skip_special_tokens=True)

print(generate("Как Кока-Кола может помочь автомобилисту?"))
```

You can also access the model via _pipeline_ wrapper:
```python
>>> from transformers import pipeline

>>> pipe = pipeline(model="deepvk/kazRush-ru-kk")
>>> pipe("Мама мыла раму")
[{'translation_text': 'Анам жақтауды сабындады'}]
```

## Data and Training

This model was trained on the following data (Russian-Kazakh language pairs):  

| Dataset  | Number of pairs | 
|-----------------------------------------|-------|
| [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)     | 718K   |
| [kazparc](<https://huggingface.co/datasets/issai/kazparc>)            | 2,150K   |
| [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)                  | 5,063K   |
| [TIL dataset](<https://github.com/turkic-interlingua/til-mt/tree/master/til_corpus>)                  | 4,403K   |

Preprocessing of the data included:
1. deduplication
2. removing garbage symbols, special tags, multiple whitespaces, etc. from texts
3. removing texts that were not in Russian or Kazakh (language detection was performed with [facebook/fasttext-language-identification](<https://huggingface.co/facebook/fasttext-language-identification>))
4. removing pairs with a low alignment score (computed with [sentence-transformers/LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>))
5. filtering the data with [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools
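
Steps 1 and 2 above can be sketched in plain Python (a minimal illustration; the actual pipeline also used the model-based language, alignment, and opusfilter filters described above):

```python
import re

def clean_pair(src: str, tgt: str):
    """Strip markup-like tags and normalize whitespace on both sides.
    Returns None if either side becomes empty."""
    def clean(text: str) -> str:
        text = re.sub(r"<[^>]+>", " ", text)      # drop special tags
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
        return text
    src, tgt = clean(src), clean(tgt)
    return (src, tgt) if src and tgt else None

def deduplicate(pairs):
    """Keep only the first occurrence of each (source, target) pair."""
    seen, result = set(), []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result

pairs = [
    ("Привет ,  мир!", "Сәлем, әлем!"),
    ("<b>Привет , мир!</b>", "Сәлем, әлем!"),
]
cleaned = [p for p in (clean_pair(s, t) for s, t in pairs) if p]
# Both raw pairs normalize to the same text, so one survives deduplication.
print(deduplicate(cleaned))
```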

The model was trained for 56 hours on 2 NVIDIA A100 80 GB GPUs.

## Evaluation

We compared our model to [NLLB](<https://huggingface.co/docs/transformers/model_doc/nllb>), another family of open-source translation models. We evaluated against all versions of NLLB except nllb-moe-54b, which was excluded due to its size.
The metrics, BLEU, chrF, and COMET, were calculated on the `devtest` split of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
BLEU and chrF follow the standard [sacreBLEU](<https://github.com/mjpost/sacrebleu>) implementation, and COMET is calculated using the default model described in the [COMET repository](<https://github.com/Unbabel/COMET>).

| Model  | Size | BLEU | chrF | COMET |
|-----------------------------------------|-------|-----------------------------|------------------------|--------|
| [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M   | 13.8  |  48.2  | 86.8  |
| [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B)                   | 1.3B   | 14.8 | 50.1  | 88.1  |
| [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B)   | 1.3B    | 15.2 | 50.2 | 88.4   |
| [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B)                    | 3.3B    | 15.6 | 50.7  | **88.9**    |
| This model                             | 197M    | **16.2**  | **51.8**   |   88.3   |
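
As a toy illustration of what chrF measures, a simplified character n-gram F-score can be sketched as below. This is not sacreBLEU's exact implementation (which also supports word n-grams and differs in detail); real evaluations should use sacreBLEU:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    text = "".join(text.split())  # chrF ignores whitespace by default
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average char n-gram precision/recall over orders 1..max_n,
    combined into an F-score with recall weighted by beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

print(round(chrf("Сәлем әлем", "Сәлем әлем"), 1))  # identical strings score 100.0
```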

## Examples of usage

```python
>>> print(generate("Каждый охотник желает знать, где сидит фазан."))
Әрбір аңшы ғибадатхананың қайда отырғанын білгісі келеді.

>>> print(generate("Местным продуктом-специалитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті мамандандырылған өнім болып люнебургтік дала қошқар болып саналады.

>>> print(generate("Помогите мне удивить девушку"))
Қызды таң қалдыруға көмектесіңіз
```

## Citations

```
@misc{deepvk2024kazRushrukk,
    title={kazRush-ru-kk: translation model from Russian to Kazakh},
    author={Lebedeva, Anna and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/kazRush-ru-kk},
    publisher={Hugging Face},
    year={2024},
}
```