Kartoshkina committed on
Commit ed30b6c · 1 Parent(s): 457aa15

upload model, tokenizer and readme
README.md CHANGED
@@ -1,3 +1,85 @@
  ---
+ library_name: transformers
+ pipeline_tag: translation
+ tags:
+ - transformers
+ - translation
+ - pytorch
+ - russian
+ - kazakh
+
  license: apache-2.0
+ language:
+ - ru
+ - kk
  ---
+
+ # KazRush-ru-kk
+
+ KazRush-ru-kk is a translation model for translating from Russian to Kazakh.
+
+ ## Usage
+
+ Using the model requires the following packages to be installed:
+
+ ```bash
+ pip install numpy==1.26.4 torch~=2.2.2 transformers~=4.39.2 sentencepiece~=0.2.0
+ ```
+
+ After installing the necessary dependencies, the model can be run with the following code:
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ import torch
+
+ model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/KazRush-ru-kk')
+ tokenizer = AutoTokenizer.from_pretrained('deepvk/KazRush-ru-kk')
+
+ def generate(text, **kwargs):
+     inputs = tokenizer(text, return_tensors='pt')
+     with torch.no_grad():
+         hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
+     return tokenizer.decode(hypotheses[0], skip_special_tokens=True)
+
+ print(generate("Как Кока-Кола может помочь автомобилисту?"))
+ ```
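+
+ To translate several sentences at once, batching is usually faster than calling `generate` in a loop. A minimal sketch building on the helper above (the batched helper, its `padding=True` call, and the example inputs are illustrative, not part of the original snippet):
+
+ ```python
+ def generate_batch(texts, **kwargs):
+     # Pad all sentences in the batch to a common length
+     inputs = tokenizer(texts, return_tensors='pt', padding=True)
+     with torch.no_grad():
+         hypotheses = model.generate(**inputs, num_beams=5, **kwargs)
+     return tokenizer.batch_decode(hypotheses, skip_special_tokens=True)
+
+ print(generate_batch(["Привет, мир!", "Как дела?"]))
+ ```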
+
+ ## Data
+
+ This model was trained on the following data (Russian-Kazakh language pairs):
+ - [OPUS Corpora](<https://opus.nlpl.eu/results/ru&kk/corpus-result-table>)
+ - [kazparc](<https://huggingface.co/datasets/issai/kazparc>)
+ - [wmt19 dataset](<https://statmt.org/wmt19/translation-task.html#download>)
+
+ Preprocessing of the data included:
+ - deduplication;
+ - removing garbage symbols, special tags, multiple whitespaces, etc. from the texts;
+ - removing texts that were not in Russian or Kazakh (language detection was performed via [fasttext](<https://huggingface.co/facebook/fasttext-language-identification>));
+ - removing pairs with a low alignment score (comparison was performed via [LaBSE](<https://huggingface.co/sentence-transformers/LaBSE>); see the sketch after this list);
+ - filtering the data using [opusfilter](<https://github.com/Helsinki-NLP/OpusFilter>) tools.
+
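+ As an illustration of the alignment-score step, a pair can be scored with the cosine similarity of its LaBSE embeddings. This is a minimal sketch, assuming the `sentence-transformers` package; the 0.7 cutoff is hypothetical, as the actual threshold is not stated in this card:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ import numpy as np
+
+ labse = SentenceTransformer('sentence-transformers/LaBSE')
+
+ def alignment_score(ru_text, kk_text):
+     # Cosine similarity of L2-normalized embeddings is a plain dot product
+     emb = labse.encode([ru_text, kk_text], normalize_embeddings=True)
+     return float(np.dot(emb[0], emb[1]))
+
+ pairs = [("Привет, мир!", "Сәлем, әлем!"), ("Добрый день", "42")]
+ kept = [p for p in pairs if alignment_score(*p) >= 0.7]  # hypothetical cutoff
+ ```
+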
+ ## Experiments
+
+ The current model was compared to another open-source translation model, NLLB. We compared our model to all versions of NLLB, excluding nllb-moe-54b due to its size.
+ The metrics (BLEU, chrF and COMET) were calculated on the `devtest` part of the [FLORES+ evaluation benchmark](<https://github.com/openlanguagedata/flores>), the most recent evaluation benchmark for multilingual machine translation.
+
+ | Model | Size | BLEU | chrF | COMET |
+ |-------|------|------|------|-------|
+ | [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 13.8 | 48.2 | 0.8684 |
+ | [nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 14.8 | 50.1 | 0.8819 |
+ | [nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 15.2 | 50.2 | 0.8843 |
+ | [nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 15.6 | 50.7 | 0.8891 |
+ | [our model (kzgqkn0f)](https://huggingface.co/deepvk/KazRush-ru-kk) | 196M | **16.2** | **51.9** | 0.8836 |
+
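+ For reference, BLEU and chrF scores of this kind can be computed with [sacrebleu](https://github.com/mjpost/sacrebleu). A minimal sketch; the exact evaluation settings behind the table above are not specified in this card:
+
+ ```python
+ import sacrebleu
+
+ hypotheses = ["Сәлем, әлем!"]    # model outputs for the devtest sentences
+ references = [["Сәлем, әлем!"]]  # one stream of reference translations
+
+ print(sacrebleu.corpus_bleu(hypotheses, references).score)
+ print(sacrebleu.corpus_chrf(hypotheses, references).score)
+ ```
+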
+ ## Examples of usage
+
+ ```python
+ print(generate("Каждый охотник желает знать, где сидит фазан."))
+ # Әр аңшы қырғауылдың қайда отырғанын білгісі келеді.
+
+ print(generate("Местным продуктом-специалитетом с защищённым географическим наименованием по происхождению считается люнебургский степной барашек."))
+ # Шығу тегі бойынша қорғалған географиялық атауы бар жергілікті өнім-маман болып люнебург далалық барақ есептеледі.
+
+ print(generate("Помогите мне закадрить девушку"))
+ # Қызды бауыздауға көмектесіңіз.
+ ```
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "architectures": [
+     "T5ForConditionalGeneration"
+   ],
+   "classifier_dropout": 0.0,
+   "d_ff": 4096,
+   "d_kv": 64,
+   "d_model": 512,
+   "decoder_start_token_id": 0,
+   "dense_act_fn": "gelu_new",
+   "dropout_rate": 0.1,
+   "eos_token_id": 1,
+   "feed_forward_proj": "gated-gelu",
+   "initializer_factor": 0.05,
+   "is_encoder_decoder": true,
+   "is_gated_act": true,
+   "layer_norm_epsilon": 1e-06,
+   "model_type": "t5",
+   "num_decoder_layers": 12,
+   "num_heads": 8,
+   "num_layers": 12,
+   "pad_token_id": 0,
+   "relative_attention_max_distance": 128,
+   "relative_attention_num_buckets": 32,
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.39.3",
+   "use_cache": true,
+   "vocab_size": 8000
+ }
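
This config describes a compact gated-GELU T5 encoder-decoder (12 encoder and 12 decoder layers, `d_model` 512, `d_ff` 4096, 8k shared vocabulary), consistent with the roughly 196M parameters implied by the float32 checkpoint size below. A quick sanity-check sketch:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('deepvk/KazRush-ru-kk')
# float32 stores 4 bytes per parameter, so ~196M params ≈ 788 MB on disk
print(sum(p.numel() for p in model.parameters()))
```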
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "decoder_start_token_id": 0,
+   "early_stopping": true,
+   "eos_token_id": 1,
+   "max_new_tokens": 127,
+   "num_beams": 3,
+   "pad_token_id": 0,
+   "transformers_version": "4.39.3"
+ }
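
These defaults (`num_beams: 3`, `max_new_tokens: 127`, `early_stopping`) are applied whenever `model.generate` is called without explicit arguments; the README example above overrides the beam count. Per-call arguments take precedence, e.g. (a sketch with illustrative values):

```python
# Overrides generation_config.json for this call only
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=256)
```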
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c76de487e8e2a6007287b2ebc33b86729cf1bcf3d574edca1f78058cedcd178e
+ size 787905368
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+ {
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:377110f5fc7b44c69c6edbf6b0f2638fdb4398a6a942535cb323ca3b41cb99b0
+ size 401220
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,39 @@
+ {
+   "add_prefix_space": true,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "</s>",
+   "extra_ids": 0,
+   "legacy": false,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "T5Tokenizer",
+   "unk_token": "<unk>"
+ }
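
The tokenizer is a SentencePiece-based `T5Tokenizer` sharing one 8k vocabulary between Russian and Kazakh, with `<pad>`, `</s>` and `<unk>` as its only special tokens. A quick inspection sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('deepvk/KazRush-ru-kk')
print(tokenizer.vocab_size)                 # 8000, matching vocab_size in config.json
print(tokenizer.tokenize("Сәлем, әлем!"))   # SentencePiece subword pieces
```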