POS-Tagger Portuguese

We fine-tuned the BERTimbau model on the Mac-Morpho corpus for the POS-tagging task over 10 epochs, achieving an overall F1-score of 0.9826.
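A minimal usage sketch, assuming the model is loaded through the `transformers` token-classification pipeline. The `MODEL_ID` value is a hypothetical placeholder, not the real Hub id, and `load_tagger` and `merge_wordpieces` are illustrative helper names. Because BERT-style tokenizers split words into `##` continuation pieces, the helper joins the pieces back together and keeps the tag predicted for the first piece of each word; this is one common convention, not necessarily the one used by the authors.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder: replace with the actual Hub id of this model.
MODEL_ID = "your-namespace/postagger-portuguese"


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the merging helper below can be used offline.
    from transformers import pipeline
    return pipeline("token-classification", model=model_id)


def merge_wordpieces(predictions: List[Dict]) -> List[Tuple[str, str]]:
    """Join '##' continuation pieces and keep the tag of each word's first piece."""
    words: List[Tuple[str, str]] = []
    for pred in predictions:
        piece, tag = pred["word"], pred["entity"]
        if piece.startswith("##") and words:
            prev_word, prev_tag = words[-1]
            words[-1] = (prev_word + piece[2:], prev_tag)
        else:
            words.append((piece, tag))
    return words


# Offline illustration with hand-written predictions (not real model output).
fake = [
    {"word": "O", "entity": "ART"},
    {"word": "paciente", "entity": "N"},
    {"word": "apresent", "entity": "V"},
    {"word": "##ou", "entity": "V"},
    {"word": "febre", "entity": "N"},
]
print(merge_wordpieces(fake))
```

With a real model, the same helper would be applied to the pipeline output, for example `merge_wordpieces(load_tagger()("O paciente apresentou febre"))`.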

Metrics:

              Precision  Recall  F1    Support
accuracy                         0.98  33729
macro avg     0.96       0.95    0.95  33729
weighted avg  0.98       0.98    0.98  33729

F1: 0.9826, Accuracy: 0.9826
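The gap between the macro average (0.95) and the weighted average (0.98) comes from how per-tag scores are combined: the macro average treats every tag equally, while the weighted average scales each tag by its support, so rare tags with lower scores pull the macro figure down more. Identical F1 and accuracy values are also what one would expect if the reported F1 is micro-averaged, since micro-F1 equals accuracy in single-label classification. The sketch below illustrates the two averages; the per-tag scores and supports are invented for illustration and are not taken from this model's evaluation.

```python
from typing import Dict


def macro_avg(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-tag scores."""
    return sum(scores.values()) / len(scores)


def weighted_avg(scores: Dict[str, float], support: Dict[str, int]) -> float:
    """Mean of the per-tag scores, weighted by the number of examples per tag."""
    total = sum(support.values())
    return sum(scores[tag] * support[tag] for tag in scores) / total


# Invented values: two frequent tags and one rare, harder tag.
f1 = {"N": 0.99, "V": 0.98, "IN": 0.60}
support = {"N": 700, "V": 290, "IN": 10}

print(round(macro_avg(f1), 4))
print(round(weighted_avg(f1, support), 4))
```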

Parameters:

nclasses = 27
nepochs = 30
batch_size = 32
batch_status = 32
learning_rate = 1e-5
early_stop = 3
max_length = 200
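The parameters above can be collected in a small configuration object. The sketch below also shows how an epoch cap and an early-stopping setting typically interact, assuming `early_stop` is a patience value (the number of consecutive epochs without validation improvement before training halts). That interpretation, the `TrainConfig` and `epochs_run` names, and the validation scores are all assumptions for illustration; the source does not describe the training loop.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TrainConfig:
    nclasses: int = 27
    nepochs: int = 30
    batch_size: int = 32
    batch_status: int = 32
    learning_rate: float = 1e-5
    early_stop: int = 3  # assumed to be a patience value, in epochs
    max_length: int = 200


def epochs_run(val_scores: Iterable[float], cfg: TrainConfig) -> int:
    """Count epochs until patience is exhausted or the nepochs cap is reached."""
    best = float("-inf")
    stale = 0   # consecutive epochs without improvement
    epochs = 0
    for score in val_scores:
        if epochs >= cfg.nepochs:
            break
        epochs += 1
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= cfg.early_stop:
                break
    return epochs


# Invented validation scores: the best value is at epoch 7, followed by
# three epochs without improvement.
scores = [0.90, 0.95, 0.97, 0.975, 0.98, 0.981, 0.9826, 0.982, 0.9825, 0.982, 0.98]
print(epochs_run(scores, TrainConfig()))
```

Under this reading, a run configured with `nepochs = 30` can stop well before the cap, although whether that is what happened in the original training is not stated.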

Tags:

Tag Meaning
ADJ Adjective
ADV Adverb
ADV-KS Subordinating conjunctive adverb
ADV-KS-REL Subordinating relative adverb
ART Article
CUR Currency
IN Interjection
KC Coordinating conjunction
KS Subordinating conjunction
N Noun
NPROP Proper noun
NUM Number
PCP Participle
PDEN Denotative word
PREP Preposition
PROADJ Adjective pronoun
PRO-KS Subordinating conjunctive pronoun
PRO-KS-REL Subordinating relative connective pronoun
PROPESS Personal pronoun
PROSUB Nominal pronoun
V Verb
VAUX Auxiliary verb

How to cite

@article{
Schneider_postagger_2023,
place={Brasil},
title={Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese},
volume={15},
url={https://jhi.sbis.org.br/index.php/jhi-sbis/article/view/1086},
DOI={10.59681/2175-4411.v15.iEspecial.2023.1086},
abstractNote={<p>Electronic Health Records are a valuable source of information to be extracted by means of natural language processing (NLP) tasks, such as morphosyntactic word tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents taggers developed for Portuguese texts, fine-tuned using BioBERtpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and others in the literature, using authentic clinical narratives. Our clinical model achieved 0.8145 in accuracy compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically with clinical texts, evidencing domain impact on the base model in NLP tasks.</p>},
number={Especial}, journal={Journal of Health Informatics},
author={Schneider, Elisa Terumi Rubel and Gumiel, Yohan Bonescki and Oliveira, Lucas Ferro Antunes de and Montenegro, Carolina de Oliveira and Barzotto, Laura Rubel and Moro, Claudia and Pagano, Adriana and Paraiso, Emerson Cabrera},
year={2023},
month={jul.} }

Questions?

Please post a GitHub issue on the NLP Portuguese POS-Tagger repository.
