anegda committed
Commit 9c157a6 · verified · 1 Parent(s): c29a238

Update README.md

Files changed (1)
  1. README.md +135 -3
README.md CHANGED
@@ -1,3 +1,135 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - ca
5
+ - eu
6
+ metrics:
7
+ - BLEU
8
+ - TER
9
+ ---
10
+ ## HiTZ Center’s Catalan-Basque machine translation model
11
+
12
+ ## Model description
13
+
14
+ This model was trained from scratch using [Marian NMT](https://marian-nmt.github.io/) on a combination of Catalan-Basque datasets totalling 11,224,976 sentence pairs. Of these, 1,531,980 sentence pairs are parallel data collected from the web, while the remaining 9,692,996 are synthetic parallel data created using the [ES-CA translator from the Aina project](https://huggingface.proxy.nlp.skieer.com/projecte-aina/aina-translator-eu-ca). The model was evaluated on the Flores-200, TaCon and NTREX evaluation datasets.
15
+
16
+ - **Developed by:** HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
17
+ - **Model type:** translation
18
+ - **Source Language:** Catalan
19
+ - **Target Language:** Basque
20
+ - **License:** apache-2.0
21
+
22
+ ## Intended uses and limitations
23
+
24
+ You can use this model for machine translation from Catalan to Basque.
25
+
26
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources.
27
+
28
+ ## How to Get Started with the Model
29
+
30
+ Use the code below to get started with the model.
31
+
32
+ ```
33
+ from transformers import AutoModelForSeq2SeqLM, MarianTokenizer
34
+
35
+ # Catalan source sentences to translate
36
+ src_text = ["això és una prova"]
37
+
38
+ model_name = "HiTZ/mt-hitz-ca-eu"
39
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
40
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
41
+
42
+ # Tokenize the batch, generate the translations, and decode them back to text
43
+ batch = tokenizer(src_text, return_tensors="pt", padding=True)
44
+ translated = model.generate(**batch)
45
+ print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])
46
+ ```
47
+
48
+ ## Training Details
49
+
50
+ ### Training Data
51
+
52
+ The Catalan-Basque data collected from the web was a combination of the following datasets:
53
+
54
+ | Dataset | Sentences before cleaning |
55
+ |-----------------|--------------------------:|
56
+ | CCMatrix v1 | 1,083,677 |
57
+ | XLENT | 219,566 |
58
+ | WikiMatrix | 77,233 |
59
+ | GNOME | 14,828 |
60
+ | KDE4 | 93,787 |
61
+ | QED | 6,554 |
62
+ | TED2020 v1 | 4,469 |
63
+ | OpenSubtitles | 29,114 |
64
+ | Ubuntu | 2,752 |
65
+ | **Total**       |             **1,531,980** |
66
+
67
+ The 9,692,996 sentence pairs of synthetic parallel data were created by translating a compendium of ES-EU parallel corpora into Catalan using the [ES-CA translator from the Aina project](https://huggingface.proxy.nlp.skieer.com/projecte-aina/aina-translator-eu-ca).
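The corpus sizes reported above can be cross-checked with a few lines of arithmetic. This is only a sketch that re-adds the figures given in this README; the variable names are illustrative and not part of the model's tooling.

```python
# Sentence counts before cleaning, copied from the table above.
web_datasets = {
    "CCMatrix v1": 1_083_677,
    "XLENT": 219_566,
    "WikiMatrix": 77_233,
    "GNOME": 14_828,
    "KDE4": 93_787,
    "QED": 6_554,
    "TED2020 v1": 4_469,
    "OpenSubtitles": 29_114,
    "Ubuntu": 2_752,
}

# Synthetic pairs: ES-EU corpora whose Spanish side was machine-translated into Catalan.
synthetic_pairs = 9_692_996

web_total = sum(web_datasets.values())
corpus_total = web_total + synthetic_pairs

print(f"web-collected pairs: {web_total:,}")
print(f"all pairs:           {corpus_total:,}")
print(f"synthetic share:     {synthetic_pairs / corpus_total:.1%}")
```

The two totals match the 1,531,980 and 11,224,976 sentence pairs stated in the model description.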
68
+
69
+ ### Training Procedure
70
+
71
+ #### Preprocessing
72
+
73
+ After concatenation, all datasets are cleaned and deduplicated using the [bifixer](https://github.com/bitextor/bifixer) and [bicleaner](https://github.com/bitextor/bicleaner) tools [(Ramírez-Sánchez et al., 2020)](https://aclanthology.org/2020.eamt-1.31/). Any sentence pair with a classification score below 0.5 is removed. The filtered corpus is composed of 10,582,279 parallel sentences.
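The threshold step can be illustrated with a minimal sketch. It does not call bifixer or bicleaner: it assumes the sentence pairs have already been scored and stored one per line as `source<TAB>target<TAB>score`, which is an assumed layout for illustration. The function name `filter_scored_pairs` and the sample sentences are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

THRESHOLD = 0.5  # pairs scoring below this value are removed, as described above


def filter_scored_pairs(
    lines: Iterable[str], threshold: float = THRESHOLD
) -> Iterator[Tuple[str, str]]:
    """Yield unique (source, target) pairs whose score is at least `threshold`.

    Assumes each line is `source<TAB>target<TAB>score` (hypothetical layout).
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed lines
        source, target, score = fields[0], fields[1], float(fields[-1])
        if score < threshold or (source, target) in seen:
            continue  # drop low-scoring pairs and exact duplicates
        seen.add((source, target))
        yield source, target


sample = [
    "Bon dia\tEgun on\t0.93",
    "Bon dia\tEgun on\t0.93",        # exact duplicate, dropped
    "Gràcies\tasdf qwerty\t0.12",    # score below 0.5, dropped
    "Gràcies\tEskerrik asko\t0.88",
]
kept = list(filter_scored_pairs(sample))
print(kept)
```

Taking the counts in this README at face value, cleaning keeps 10,582,279 of the 11,224,976 original pairs, or about 94.3%.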
74
+
75
+ #### Tokenization
76
+ All data is tokenized using SentencePiece, with a 32,000-token SentencePiece model learned from the combination of all filtered training data. The SentencePiece model is included with this model.
77
+
78
+ ## Evaluation
79
+ ### Variables and metrics
80
+ We report BLEU (higher is better) and TER (lower is better) scores on three test sets: [Flores-200](https://github.com/facebookresearch/flores/tree/main/flores200), [TaCon](https://elrc-share.eu/repository/browse/tacon-spanish-constitution-mt-test-set/84a96138b98611ec9c1a00155d02670628f3e6857b0f422abd82abc3795ec8c2/) and [NTREX](https://github.com/MicrosoftTranslator/NTREX).
81
+
82
+ ### Evaluation results
83
+ Below are the evaluation results for Catalan-to-Basque machine translation, compared with [Google Translate](https://translate.google.com/), [NLLB-200 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) and [NLLB-200's distilled 1.3B variant](https://huggingface.co/facebook/nllb-200-distilled-1.3B):
84
+
85
+ #### BLEU scores
86
+
87
+ | Test set             |Google Translate | NLLB 1.3B | NLLB 3.3B |mt-hitz-ca-eu|
88
+ |----------------------|-----------------|-----------|-----------|-------------|
89
+ | Flores-200 devtest   |**18.0**         | 13.2      | 12.9      | 17.2        |
90
+ | TaCon                | 13.2            | 11.8      | 11.2      | **14.0**    |
91
+ | NTREX                | 13.8            | 11.1      | 10.5      | **14.0**    |
92
+ | Average              | 15.0            | 12.0      | 11.5      | **15.1**    |
93
+
94
+ #### TER scores
95
+
96
+ | Test set             |Google Translate | NLLB 1.3B | NLLB 3.3B |mt-hitz-ca-eu|
97
+ |----------------------|-----------------|-----------|-----------|-------------|
98
+ | Flores-200 devtest   |**63.1**         | 76.5      | 70.8      | 65.0        |
99
+ | TaCon                | 65.0            | 76.5      | 72.1      | **48.4**    |
100
+ | NTREX                |**69.4**         | 79.4      | 75.5      | 69.7        |
101
+ | Average              | 65.8            | 77.5      | 72.8      | **61.0**    |
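The Average rows can be reproduced from the per-test-set scores. The sketch below assumes each average is the unweighted mean of the three test sets, rounded to one decimal; that reading matches the published figures but is an assumption, not something the README states. It then picks the best system per metric, treating higher BLEU and lower TER as better.

```python
# Scores copied from the two tables above, in the order Flores-200 devtest, TaCon, NTREX.
bleu = {
    "Google Translate": [18.0, 13.2, 13.8],
    "NLLB 1.3B": [13.2, 11.8, 11.1],
    "NLLB 3.3B": [12.9, 11.2, 10.5],
    "mt-hitz-ca-eu": [17.2, 14.0, 14.0],
}
ter = {
    "Google Translate": [63.1, 65.0, 69.4],
    "NLLB 1.3B": [76.5, 76.5, 79.4],
    "NLLB 3.3B": [70.8, 72.1, 75.5],
    "mt-hitz-ca-eu": [65.0, 48.4, 69.7],
}


def averages(scores):
    """Unweighted mean per system, rounded to one decimal (assumed convention)."""
    return {
        system: round(sum(values) / len(values), 1)
        for system, values in scores.items()
    }


bleu_avg = averages(bleu)
ter_avg = averages(ter)

best_bleu = max(bleu_avg, key=bleu_avg.get)  # BLEU: higher is better
best_ter = min(ter_avg, key=ter_avg.get)     # TER: lower is better

print("BLEU averages:", bleu_avg, "best:", best_bleu)
print("TER averages: ", ter_avg, "best:", best_ter)
```

Under that assumption the computed means agree with both Average rows, and mt-hitz-ca-eu has the best average on both metrics.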
102
+
103
+
104
+ <!-- For now we have no paper. If something is produced within ILENIA, this will need to be updated. -->
105
+
106
+ <!--
107
+ ## Citation [optional]
108
+
109
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. - ->
110
+
111
+ **BibTeX:**
112
+
113
+ [More Information Needed]
114
+
115
+ **APA:**
116
+
117
+ [More Information Needed]
118
+ -->
119
+
120
+ ## Additional information
121
+ ### Author
122
+ HiTZ Research Center & IXA Research group (University of the Basque Country UPV/EHU)
123
+ ### Contact information
124
+ For further information, send an email to <[email protected]>
125
+ ### Licensing information
126
+ This work is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
127
+ ### Funding
128
+ This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU through NextGenerationEU, within the framework of the [ILENIA project](https://proyectoilenia.es/), with references 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335 and 2022/TL22/00215334.
129
+ ### Disclaimer
130
+ <details>
131
+ <summary>Click to expand</summary>
132
+ The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
133
+ When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
134
+ In no event shall the owner and creator of the models (HiTZ Research Center) be liable for any results arising from the use made by third parties of these models.
135
+ </details>