alirezamsh committed
Commit efd7721
1 Parent(s): 326dbc8

Update README.md

Files changed (1): README.md +42 -6
README.md CHANGED
@@ -121,6 +121,29 @@ The model architecture and config are the same as [M2M-100](https://huggingface.
 
 **Note**: SMALL100Tokenizer requires sentencepiece, so make sure to install it by ```pip install sentencepiece```
 
+ # Supervised Training
+
+ SMaLL-100 is a sequence-to-sequence model for the translation task. The source input is ```[tgt_lang_code] + src_tokens + [EOS]``` and the target is ```tgt_tokens + [EOS]```. An example of supervised training is shown below:
+
+ ```
+ from transformers import M2M100ForConditionalGeneration
+ from tokenization_small100 import SMALL100Tokenizer
+
+ model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
+ tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")
+
+ src_text = "Life is like a box of chocolates."
+ tgt_text = "La vie est comme une boîte de chocolat."
+
+ model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
+
+ loss = model(**model_inputs).loss  # forward pass
+ ```
+
+ Training data can be provided upon request.
+
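The returned loss plugs into a standard optimization loop. A minimal sketch of one fine-tuning step, assuming plain PyTorch and the variables from the example above (the AdamW optimizer and learning rate are illustrative choices, not taken from the original repository):

```
import torch

# Illustrative fine-tuning step; model, tokenizer, src_text, and tgt_text
# are assumed to come from the supervised-training example above.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # assumed hyperparameters

model.train()
optimizer.zero_grad()
model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt")
loss = model(**model_inputs).loss  # labels come from text_target
loss.backward()   # backpropagate
optimizer.step()  # update parameters
```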
+ # Generation
+
 ```
 from transformers import M2M100ForConditionalGeneration
 from tokenization_small100 import SMALL100Tokenizer
@@ -146,7 +169,9 @@ tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
 # => "Life is like a box of chocolate."
 ```
 
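The unchanged middle of the generation example falls between the two hunks and is not shown. For orientation, a minimal self-contained sketch consistent with the fragments above (the source sentence and target language are illustrative):

```
from transformers import M2M100ForConditionalGeneration
from tokenization_small100 import SMALL100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
# SMaLL-100 moves the target language code to the source side,
# so only the target language needs to be given to the tokenizer
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100", tgt_lang="fr")

encoded = tokenizer("Life is like a box of chocolates.", return_tensors="pt")
generated_tokens = model.generate(**encoded)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
# e.g. ["La vie est comme une boîte de chocolat."]
```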
- Please refer to [original repository](https://github.com/alirezamshi/small100) for further details.
+ # Evaluation
+
+ Please refer to the [original repository](https://github.com/alirezamshi/small100) for spBLEU computation.
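spBLEU is BLEU computed on text segmented with the FLORES-101 SentencePiece model. A minimal sketch using the sacrebleu library rather than the original repository's scripts (the hypothesis/reference pair is illustrative, and the tokenize="flores101" option assumes sacrebleu >= 2.0):

```
import sacrebleu  # pip install sacrebleu

hypotheses = ["La vie est comme une boîte de chocolat."]     # system outputs
references = [["La vie est comme une boîte de chocolats."]]  # one reference stream

# tokenize="flores101" applies the SentencePiece tokenization used for spBLEU
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores101")
print(score.score)
```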
 
 
 # Languages Covered
 
@@ -156,10 +181,21 @@ Afrikaans (af), Amharic (am), Arabic (ar), Asturian (ast), Azerbaijani (az), Bas
 
 If you use this model for your research, please cite the following work:
 ```
- @article{mohammadshahi2022small,
-   title={SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages},
-   author={Mohammadshahi, Alireza and Nikoulina, Vassilina and Berard, Alexandre and Brun, Caroline and Henderson, James and Besacier, Laurent},
-   journal={arXiv preprint arXiv:2210.11621},
-   year={2022}
+ @misc{mohammadshahi2022small100,
+   title={SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages},
+   author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
+   year={2022},
+   eprint={2210.11621},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+
+ @misc{mohammadshahi2022compressed,
+   title={What Do Compressed Multilingual Machine Translation Models Forget?},
+   author={Alireza Mohammadshahi and Vassilina Nikoulina and Alexandre Berard and Caroline Brun and James Henderson and Laurent Besacier},
+   year={2022},
+   eprint={2205.10828},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
 }
 ```