Model & tokenizer pretrained from scratch exclusively on Hungarian text that was cleaned (removal of toxic and sexual content, automatic language-correctness filtering) and semantically segmented.
Vocabulary size: 52k
Context length: up to 8,192 tokens
Training dataset: ~1 billion tokens
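A minimal usage sketch, assuming the model is published on the Hugging Face Hub as a causal language model loadable with the `transformers` library; the repository id below is a hypothetical placeholder, not the actual model name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id; replace with the actual Hub path of this model.
repo_id = "your-org/hungarian-base-model"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# "Magyarország fővárosa" = "The capital of Hungary"
prompt = "Magyarország fővárosa"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```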