mphi committed
Commit
2d844b8
1 Parent(s): 49a5eeb

Update README.md

Files changed (1)
  1. README.md +40 -38
README.md CHANGED

---
license: mit

language:
- en

widget:
- text: "Let us translate some text from Livonian to Võro!"
---

# NMT for Finno-Ugric Languages

This is an NMT system for translating between Võro, Livonian, North Sami and South Sami, as well as Estonian, Finnish, Latvian and English. It was created by fine-tuning Facebook's m2m100-418M model on the liv4ever and smugri datasets.

## Tokenizer
Four language codes were added to the tokenizer: __liv__, __vro__, __sma__ and __sme__. The m2m100 tokenizer currently reads its supported languages from a hard-coded list, so the added codes have to be registered manually after loading; the usage example below includes the necessary patch.
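
For illustration, this is what goes wrong without the patch: the new codes are missing from the tokenizer's language maps, so looking one of them up is expected to fail (a minimal sketch, assuming `M2M100Tokenizer.get_lang_id` raises a `KeyError` for an unregistered code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/m2m100_418M_smugri")

# "liv" exists as an added token but is not yet in the hard-coded
# language maps, so this lookup should fail until they are patched
try:
    tokenizer.get_lang_id("liv")
except KeyError:
    print("liv is not registered yet")
```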

## Usage example
Install the transformers and sentencepiece libraries: `pip install sentencepiece transformers`

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Fix the language codes in the tokenizer: merge the four added tokens
# into the hard-coded language maps, then rebuild the code-to-token and
# code-to-id maps from them
tokenizer.id_to_lang_token = dict(list(tokenizer.id_to_lang_token.items()) + list(tokenizer.added_tokens_decoder.items()))
tokenizer.lang_token_to_id = dict(list(tokenizer.lang_token_to_id.items()) + list(tokenizer.added_tokens_encoder.items()))
tokenizer.lang_code_to_token = {k.replace("_", ""): k for k in tokenizer.additional_special_tokens}
tokenizer.lang_code_to_id = {k.replace("_", ""): v for k, v in tokenizer.lang_token_to_id.items()}

model = AutoModelForSeq2SeqLM.from_pretrained("tartuNLP/m2m100_418M_smugri")

# Translate from Livonian to North Sami
tokenizer.src_lang = "liv"
encoded_src = tokenizer("Līvõ kēļ jelāb!", return_tensors="pt")
encoded_out = model.generate(**encoded_src, forced_bos_token_id=tokenizer.get_lang_id("sme"))
print(tokenizer.batch_decode(encoded_out, skip_special_tokens=True))
```

The output is `Livčča giella eallá.`
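
Other directions work the same way: set `src_lang` to the source code and pass the target code to `get_lang_id`. A minimal sketch for Estonian to Võro, reusing the patched tokenizer and model from above (the input sentence is illustrative; `et` is a stock m2m100 code and `vro` is one of the added codes):

```python
# Translate from Estonian to Võro with the already-loaded model
tokenizer.src_lang = "et"
encoded_src = tokenizer("Tere tulemast!", return_tensors="pt")
encoded_out = model.generate(**encoded_src, forced_bos_token_id=tokenizer.get_lang_id("vro"))
print(tokenizer.batch_decode(encoded_out, skip_special_tokens=True))
```

Several sentences can also be translated in one call by passing a list of strings to the tokenizer with `padding=True`.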