
# Slovak RoBERTa Masked Language Model

### 83M parameters in the small model

Medium and Large models coming soon!

The pretrained RoBERTa tokenizer vocab and merges are included.
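
For reference, a minimal sketch of loading the model and tokenizer with the `transformers` library; the Hub ID below is a placeholder, not this repo's actual path:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder Hub ID; substitute this repo's actual path.
repo_id = "your-username/slovak-roberta-small"

# AutoTokenizer picks up the included vocab.json and merges.txt.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
```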


## Training params

- Dataset: an 8 GB monolingual Slovak dataset, combining the monolingual portion of ParaCrawl, OSCAR, and several gigabytes of text I collected and cleaned myself.

- Preprocessing: tokenized with a pretrained ByteLevelBPETokenizer trained on the same dataset. Uncased, with the `<s>`, `<pad>`, `</s>`, `<unk>`, and `<mask>` special tokens (see the sketch below).
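
A minimal sketch of how such a tokenizer could be trained with the `tokenizers` library; the corpus path, vocab size, and minimum frequency are illustrative assumptions, not the exact settings used for this model:

```python
from tokenizers import ByteLevelBPETokenizer

# Uncased byte-level BPE, mirroring the preprocessing described above.
tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files=["slovak_corpus.txt"],  # hypothetical path to the training text
    vocab_size=52_000,            # assumed, not the model's documented setting
    min_frequency=2,              # assumed
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt
```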

## Evaluation results

Qualitative fill-mask examples, with the model's top completions for the `<mask>` token (English glosses in parentheses); see the sketch after the list for how to reproduce them:

- Mnoho ľudí tu `<mask>` ("Many people here `<mask>`")
  - žije. ("live")
  - žijú. ("live", plural agreement)
  - je. ("are")
  - trpí. ("suffer")
- Ako sa `<mask>` ("How `<mask>`")
  - máte ("are you", formal)
  - máš ("are you", informal)
  - hovorí ("does one say it")
- Plážová sezóna pod Zoborom patrí medzi `<mask>` obdobia. ("The beach season below Zobor ranks among the `<mask>` periods.")
  - ročné ("yearly")
  - najkrajšie ("most beautiful")
  - najobľúbenejšie ("most popular")
  - najnáročnejšie ("most demanding")
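
Completions like these can be reproduced with the fill-mask pipeline; the Hub ID is again a placeholder:

```python
from transformers import pipeline

# Placeholder Hub ID; substitute this repo's actual path.
fill_mask = pipeline("fill-mask", model="your-username/slovak-roberta-small")

# Prints the top predicted tokens and their scores for one prompt above.
for pred in fill_mask("Mnoho ľudí tu <mask>"):
    print(pred["token_str"], round(pred["score"], 3))
```
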
## Limitations

The current model is fairly small, although it performs well. It is meant to be finetuned on downstream tasks, e.g. part-of-speech tagging, question answering, or anything in GLUE or SuperGLUE; a sketch of one such setup follows.
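
As an illustration of downstream use, a minimal sketch of attaching a token-classification head for POS tagging; the Hub ID and label count are assumptions, and the freshly initialized head would still need training on labeled Slovak data:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder Hub ID and assumed label count, for illustration only.
repo_id = "your-username/slovak-roberta-small"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id, num_labels=17)

# The pretrained weights load into the encoder; the classification head is
# newly initialized and would be finetuned, e.g. with transformers' Trainer.
inputs = tokenizer("Mnoho ľudí tu žije.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)
print(logits.shape)
```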

## Credit

If you use this or any of my models in research or professional work, please credit me, Christopher Brousseau, in said work.
