# plT5 Large

plT5 models are T5-based language models trained on Polish corpora. The models were optimized for the original T5 denoising objective.
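The T5 denoising objective corrupts random spans of the input, replaces each span with a sentinel token, and trains the model to generate the masked spans. A minimal sketch of this span-corruption scheme (a simplified illustration, not the actual plT5 training code; `span_corrupt` and its default parameters are hypothetical):

```python
import random

def span_corrupt(tokens, mask_ratio=0.15, span_len=3, seed=0):
    """Illustrative T5-style span corruption (simplified sketch).

    Masks random spans of up to span_len tokens until roughly
    mask_ratio of the input is covered, then builds an input with
    sentinel tokens in place of the spans and a target that lists
    each sentinel followed by the tokens it replaced.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    masked = set()
    while len(masked) < n_mask:
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + span_len, len(tokens))):
            masked.add(i)

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(tokens):
        if i in masked:
            tok = f"<extra_id_{sentinel}>"  # T5 sentinel token format
            inputs.append(tok)
            targets.append(tok)
            while i < len(tokens) and i in masked:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

The model sees `inputs` as the encoder sequence and learns to produce `targets`; splicing the spans from `targets` back into `inputs` recovers the original text.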
## Corpus

plT5 was trained on six corpora available for the Polish language:
| Corpus | Tokens | Documents |
| :----- | -----: | --------: |
| CCNet Middle | 3243M | 7.9M |
| CCNet Head | 2641M | 7.0M |
| National Corpus of Polish | 1357M | 3.9M |
| Open Subtitles | 1056M | 1.1M |
| Wikipedia | 260M | 1.4M |
| Wolne Lektury | 41M | 5.5k |
## Tokenizer

The training dataset was tokenized into subwords using a sentencepiece unigram model with a vocabulary size of 50k tokens.
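At segmentation time, a unigram model picks the sequence of subword pieces with the highest total probability under a learned unigram language model, typically via Viterbi search. A minimal sketch of that step (the toy `VOCAB` and its log-probabilities are made up for illustration; a real sentencepiece model learns roughly 50k pieces from the corpus):

```python
import math

# Toy unigram vocabulary with made-up log-probabilities.
# "▁" marks a word boundary, as in sentencepiece.
VOCAB = {
    "▁": -4.0, "▁k": -5.0, "ot": -5.5, "▁kot": -3.0,
    "y": -4.5, "▁koty": -3.5, "k": -6.0, "o": -6.0, "t": -6.0,
}

def segment(text, vocab):
    """Viterbi search for the most probable segmentation of text
    into vocabulary pieces (simplified unigram inference; assumes
    the text is segmentable with the given vocabulary)."""
    text = "▁" + text.replace(" ", "▁")
    n = len(text)
    # best[i] = (log-prob of best segmentation of text[:i], start of last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] + vocab[piece] > best[end][0]:
                best[end] = (best[start][0] + vocab[piece], start)
    # Trace back the winning pieces.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]
```

With this vocabulary, `segment("koty", VOCAB)` prefers the single piece `▁koty` (log-prob −3.5) over `▁kot` + `y` (−7.5), which is the frequency-driven behavior that makes unigram tokenization compact on in-domain text.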
## Usage

Example code:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allegro/plt5-large")
model = AutoModel.from_pretrained("allegro/plt5-large")
```
## License

CC BY 4.0
## Citation

If you use this model, please cite the following paper:

```bibtex
@article{chrabrowa2022evaluation,
  title={Evaluation of Transfer Learning for Polish with a Text-to-Text Model},
  author={Chrabrowa, Aleksandra and Dragan, {\L}ukasz and Grzegorczyk, Karol and Kajtoch, Dariusz and Koszowski, Miko{\l}aj and Mroczkowski, Robert and Rybak, Piotr},
  journal={arXiv preprint arXiv:2205.08808},
  year={2022}
}
```
## Authors

The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: [email protected]