Middle Irish FastText models were trained on Middle Irish texts from CELT. A text was included in the training dataset if "Middle Irish" or the dates "900-1200" were explicitly mentioned in its metadata on CELT. This includes texts marked as "Old and Middle Irish" or "Old, Middle and Early Modern Irish". As a result, the vocabulary of the Middle Irish models may contain some Old and Early Modern Irish words, as well as some Latin due to code-switching.
There are three models in this family:
middle_irish_cased_ft_100_5_2.txt
middle_irish_lower_ft_100_5_2.txt
middle_irish_lower_demutated_ft_100_5_2.txt
All models are trained with the same hyperparameters (emb_size=100, window=5, min_count=2, n_epochs=100) and saved as KeyedVectors (see Gensim Documentation).
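The `.txt` files are loaded with `load_word2vec_format(..., binary=False)`, which reads the plain-text word2vec layout: a header line with the vocabulary size and the vector dimension, then one line per word with its space-separated components. The following is a minimal sketch of that layout in pure Python, not a replacement for Gensim. The helper names `read_word2vec_text` and `cosine` are hypothetical, and the two-dimensional sample vectors are made up for illustration (the real models use 100 dimensions).

```python
import io
import math


def read_word2vec_text(handle):
    """Parse a word2vec text stream into (vocab_size, dim, {word: vector})."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


def cosine(u, v):
    """Cosine similarity, the measure behind Gensim's similarity scores."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up toy data in the same layout as the model files (assumption: dim=2).
sample = io.StringIO("3 2\nTemra 1.0 0.0\nTemrach 0.8 0.6\nMide 0.0 1.0\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
print(vocab_size, dim, sorted(vectors))
print(cosine(vectors["Temra"], vectors["Temrach"]))
```

For real use, Gensim's loader is preferable: it stores the vectors in a NumPy matrix and provides indexed similarity queries.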
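from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the cased model file from the Hugging Face Hub (cached locally after the first call)
model_path = hf_hub_download(repo_id="ancatmara/middle-irish-ft-vectors", filename="middle_irish_cased_ft_100_5_2.txt")
# The file is in plain-text word2vec format, hence binary=False
model = KeyedVectors.load_word2vec_format(model_path, binary=False)
# Ten nearest neighbours of 'Temra' by cosine similarity
model.similar_by_word('Temra')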
Out:
[('Temrach', 0.6949042677879333),
('Temraig', 0.6130734086036682),
('Temraich', 0.5354859828948975),
('Mide', 0.49614325165748596),
('Mumam', 0.49278897047042847),
('aenach', 0.4891957640647888),
('Midi', 0.4783679246902466),
('Muman', 0.47727957367897034),
('Lagen', 0.4697839319705963),
('Erenn', 0.4670616388320923)]
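Vectors loaded with `load_word2vec_format` contain whole-word entries only, so `similar_by_word` raises a `KeyError` for a word that is not in the vocabulary. The following is a minimal sketch of a guarded lookup, assuming a Gensim 4 `KeyedVectors` object. The helper name `safe_neighbours` is hypothetical and not part of Gensim.

```python
def safe_neighbours(model, word, topn=10):
    """Return up to `topn` (word, similarity) pairs, or [] if `word` is out of vocabulary."""
    # key_to_index maps every known word to its row in the vector matrix.
    if word not in model.key_to_index:
        return []
    return model.similar_by_word(word, topn=topn)
```

With the `model` loaded above, `safe_neighbours(model, 'Temra')` returns the same list as `model.similar_by_word('Temra')`, while an unknown word yields an empty list instead of an exception. For the lowercased models it is probably necessary to lowercase the query first, for example `safe_neighbours(model, 'Temra'.lower())`.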