Transferring Monolingual Model to Low-Resource Language: The Case Of Tigrinya:
Proposed Method:
The proposed method transfers a monolingual Transformer model to a new target language at the lexical level by learning new token embeddings. All implementations in this repo use XLNet as the source Transformer model; other Transformer models can be used in the same way.
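The lexical-level step can be pictured as building a new input embedding table for the target-language vocabulary. The snippet below is a minimal sketch under assumptions, not the code from the notebooks: `build_embedding_matrix`, the toy vocabulary, and the toy vectors are hypothetical. In practice the vectors would come from the trained token embeddings, their dimension would have to match the hidden size of the source model, and the resulting table would be loaded into the model's input embedding layer before fine-tuning.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build one embedding row per token id of the target-language vocabulary.

    vocab:      dict mapping token -> integer id (ids are 0 .. len(vocab) - 1)
    pretrained: dict mapping token -> list of `dim` floats (e.g. word2vec vectors)
    Tokens without a pretrained vector get a small random vector (an assumption
    made for this sketch; other initialisations are possible).
    Returns (matrix, missing_tokens).
    """
    rng = random.Random(seed)
    matrix = [None] * len(vocab)
    missing = []
    for token, idx in vocab.items():
        vec = pretrained.get(token)
        if vec is None:
            vec = [rng.gauss(0.0, 0.02) for _ in range(dim)]
            missing.append(token)
        elif len(vec) != dim:
            raise ValueError(f"vector for {token!r} has size {len(vec)}, expected {dim}")
        matrix[idx] = list(vec)
    return matrix, missing


# Toy, hypothetical data: two tokens with pretrained vectors, one without.
toy_vocab = {"<unk>": 0, "tok_a": 1, "tok_b": 2}
toy_vectors = {"tok_a": [0.1, 0.2], "tok_b": [0.3, 0.4]}
toy_matrix, toy_missing = build_embedding_matrix(toy_vocab, toy_vectors, dim=2)
print(len(toy_matrix), toy_missing)
```

The returned list of rows can be converted to a tensor and copied into the embedding weights of the source model; which parameters are then frozen or fine-tuned is a separate design choice that this sketch does not cover.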
Main files:
All files are IPython Notebook files that can be executed directly in Google Colab.
train.ipynb : Fine-tunes XLNet (monolingual Transformer) on a sentiment analysis dataset in the new target language (Tigrinya).
token_embeddings.ipynb : Trains word2vec token embeddings for the Tigrinya language.
process_Tigrinya_comments.ipynb : Extracts Tigrinya comments from mixed-language content.
extract_YouTube_comments.ipynb : Downloads the available comments from a YouTube channel ID.
auto_labelling.ipynb : Automatically labels Tigrinya comments as positive or negative based on the sentiment of the emoji they contain.
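The comment filtering and emoji-based labelling steps listed above can be illustrated with a short sketch. It is not the logic of the notebooks: the function names, the 0.5 threshold, and the tiny emoji lexicon are hypothetical choices made for illustration. Note also that a script-based filter only detects Ethiopic script, so it cannot by itself tell Tigrinya apart from other languages written in the same script, such as Amharic.

```python
# Unicode ranges: Ethiopic, Ethiopic Supplement, Ethiopic Extended.
ETHIOPIC_RANGES = ((0x1200, 0x137F), (0x1380, 0x139F), (0x2D80, 0x2DDF))

# Hypothetical emoji lexicon; a real one would be much larger.
POSITIVE_EMOJI = {"😍", "👍", "😊"}
NEGATIVE_EMOJI = {"😡", "👎", "😢"}


def is_ethiopic(ch):
    """True if the character lies in one of the Ethiopic Unicode ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ETHIOPIC_RANGES)


def ethiopic_ratio(text):
    """Share of alphabetic characters in `text` that are Ethiopic (0.0 if none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_ethiopic(ch) for ch in letters) / len(letters)


def label_by_emoji(text):
    """Return 'positive', 'negative', or None when the emoji give no clear signal."""
    pos = sum(text.count(e) for e in POSITIVE_EMOJI)
    neg = sum(text.count(e) for e in NEGATIVE_EMOJI)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None


def filter_and_label(comments, min_ratio=0.5):
    """Keep comments that are mostly Ethiopic script and carry a clear emoji label."""
    labelled = []
    for comment in comments:
        if ethiopic_ratio(comment) < min_ratio:
            continue  # mostly non-Ethiopic text, skip
        label = label_by_emoji(comment)
        if label is not None:
            labelled.append((comment, label))
    return labelled


sample = ["ሰላም 😍", "great video 👍", "ሰላም ዓለም 😡", "ሰላም"]
for comment, label in filter_and_label(sample):
    print(label, comment)
```

Comments without emoji, or with a tie between positive and negative emoji, are dropped rather than guessed, which keeps the automatically created labels cleaner at the cost of a smaller dataset.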
Tigrinya Tokenizer:
A SentencePiece-based tokenizer for Tigrinya has been released to the public and can be used as follows:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
tokenizer.tokenize("αααα α₯α ααα α«α₯α°α αα΅ααα ααα² α’α« α α₯α£αα α’α ααα΅αα ααα² αα₯α α°αα¨ αααΉ αα°α«α£αΉα α£α₯ αααΉα α°α¨αα‘")
TigXLNet:
A new general-purpose Transformer model for the low-resource language Tigrinya has also been released to the public and can be loaded as follows:
from transformers import AutoConfig, AutoModel
config = AutoConfig.from_pretrained("abryee/TigXLNet")
config.d_head = 64
model = AutoModel.from_pretrained("abryee/TigXLNet", config=config)
Evaluation:
The proposed method is evaluated using two datasets:
- A newly created sentiment analysis dataset for a low-resource language (Tigrinya).
- Cross-lingual Sentiment dataset (CLS).
| Models | English Books | English DVD | English Music | German Books | German DVD | German Music | French Books | French DVD | French Music | Japanese Books | Japanese DVD | Japanese Music | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XLNet | 92.90 | 93.31 | 92.02 | 85.23 | 83.30 | 83.89 | 73.05 | 69.80 | 70.12 | 83.20 | 86.07 | 85.24 | 83.08 |
| mBERT | 92.78 | 90.30 | 91.88 | 88.65 | 85.85 | 90.38 | 91.09 | 88.57 | 93.67 | 84.35 | 81.77 | 87.53 | 88.90 |
Dataset used for this paper:
We have constructed a new sentiment analysis dataset for the Tigrinya language; it can be found in the zip file (Tigrinya Sentiment Analysis Dataset).
Citing our paper:
Our paper is available on arXiv. Please consider citing our work:
@misc{tela2020transferring,
title={Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya},
author={Abrhalei Tela and Abraham Woubie and Ville Hautamaki},
year={2020},
eprint={2006.07698},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Questions, comments, and feedback are appreciated and can be sent to the following email: [email protected]