# ptt5-v2 for Medical Data Anonymization

## Model description
ptt5-v2-AnonyMED-BR is a Brazilian Portuguese adaptation of the T5 architecture, fine-tuned and evaluated on the AnonyMED-BR dataset for anonymizing sensitive medical records. Unlike extractive (token-classification) models, ptt5-v2 generates a rewritten copy of the record with tags embedded around sensitive entities.
- Architecture: ptt5-v2-large
- Pretraining corpus: Large Brazilian Portuguese web text
- Evaluation domain: Medical anonymization with AnonyMED-BR
- Language: Brazilian Portuguese
## Intended uses & limitations

### Intended uses
- Text-to-text anonymization of Brazilian Portuguese medical records.
- Identifies and tags sensitive entities such as names, dates, IDs, and hospitals.
- Suitable for local deployment, preserving privacy compared to API-based models.
### How to use

Example usage with Hugging Face Transformers:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Venturus/ptt5-v2-AnonyMED-BR"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "O paciente João da Silva foi internado no Hospital das Clínicas em 12/05/2023."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Example output:

```
O paciente <PATIENT>João da Silva</PATIENT> foi internado no <HOSPITAL>Hospital das Clínicas</HOSPITAL> em <DATE>12/05/2023</DATE>.
```
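Since the model emits tags rather than removing the sensitive spans, a small post-processing step is needed for full redaction. The sketch below (not part of this model's release; the `redact` helper is illustrative) replaces each tagged span with a generic placeholder, assuming well-formed `<TAG>...</TAG>` pairs in the generated text:

```python
import re

# Matches <TAG>span</TAG> pairs, where TAG is an uppercase entity label
# such as PATIENT, HOSPITAL, or DATE (see the entity list on this card).
TAG_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>")

def redact(tagged_text: str) -> str:
    """Replace every <TAG>span</TAG> with a [TAG] placeholder."""
    return TAG_RE.sub(lambda m: f"[{m.group(1)}]", tagged_text)

example = ("O paciente <PATIENT>João da Silva</PATIENT> foi internado no "
           "<HOSPITAL>Hospital das Clínicas</HOSPITAL> em "
           "<DATE>12/05/2023</DATE>.")
print(redact(example))
# → O paciente [PATIENT] foi internado no [HOSPITAL] em [DATE].
```

The placeholder format (`[TAG]`) is a design choice; the same regex could substitute fixed pseudonyms instead if downstream readability matters.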
## Training procedure
- Base architecture: ptt5-v2 Large
- Fine-tuning: text-to-text anonymization on AnonyMED-BR
- Hyperparameters:
  - Learning rate: 5e-5
  - Batch size: 4
  - Epochs: 2
  - Precision: FP32
- Hardware: NVIDIA Tesla T4 (16 GB)
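For reference, the hyperparameters above can be expressed as a Transformers training configuration. This is a minimal sketch, not the authors' actual training script: the output directory name is illustrative, and dataset loading, tokenization, and the `Seq2SeqTrainer` wiring are omitted.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of training arguments mirroring the values reported on this card:
# lr 5e-5, per-device batch size 4, 2 epochs, FP32 precision.
args = Seq2SeqTrainingArguments(
    output_dir="ptt5-v2-anonymed-br",  # illustrative path
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    fp16=False,  # FP32, as reported
    predict_with_generate=True,
)
```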
## Evaluation results

The model was evaluated on the AnonyMED-BR test set.

### Overall results
- F1-score: 0.8748
- Precision: 0.8918
- Recall: 0.8645
### Entities covered

- Personal data: `<PATIENT>`, `<DOCTOR>`, `<AGE>`, `<PROFESSION>`
- Identifiers: `<IDNUM>`, `<MEDICAL_RECORD>`, `<HEALTH_PLAN>`
- Locations: `<CITY>`, `<STATE>`, `<COUNTRY>`, `<STREET>`, `<HOSPITAL>`, `<LOCATION_OTHER>`, `<ZIP>`
- Other sensitive data: `<DATE>`, `<EMAIL>`, `<PHONE>`, `<ORGANIZATION>`, `<OTHER>`
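Beyond redaction, the tagged output also makes it easy to audit which sensitive spans the model detected. A minimal sketch, assuming well-formed `<TAG>...</TAG>` pairs with tag names drawn from the entity list above (the `extract_entities` helper is illustrative):

```python
import re

# Matches <TAG>span</TAG> pairs and captures both the tag and the span.
ENTITY_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>")

def extract_entities(tagged_text: str) -> list[tuple[str, str]]:
    """Return (tag, span) pairs for every tagged entity in the output."""
    return ENTITY_RE.findall(tagged_text)

sample = ("O paciente <PATIENT>João da Silva</PATIENT> mora em "
          "<CITY>Campinas</CITY>.")
print(extract_entities(sample))
# → [('PATIENT', 'João da Silva'), ('CITY', 'Campinas')]
```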
## Citation

If you use this model, please cite:

```bibtex
@article{schiezzaro2025guardians,
  title={Guardians of the Data: NER and LLMs for Effective Medical Record Anonymization in Brazilian Portuguese},
  author={Schiezzaro, Mauricio and Rosa, Guilherme and Pedrini, Helio and Campos, Bruno Augusto Goulart},
  journal={Frontiers in Public Health},
  year={2025},
  publisher={Frontiers},
  url={https://github.com/venturusbr/AnonyMED-BR}
}
```