ptt5-v2 for Medical Data Anonymization

Model description

ptt5-v2-AnonyMED-BR is a Brazilian Portuguese adaptation of the T5 architecture, fine-tuned and evaluated on the AnonyMED-BR dataset for the task of anonymizing sensitive medical records. Unlike extractive NER models, it generates a rewritten version of each record with tags embedded around sensitive entities.

  • Architecture: ptt5-v2-large
  • Pretraining corpus: Large Brazilian Portuguese web text
  • Evaluation domain: Medical anonymization with AnonyMED-BR
  • Language: Brazilian Portuguese

Intended uses & limitations

Intended uses

  • Text-to-text anonymization of Brazilian Portuguese medical records.
  • Identifies and tags sensitive entities such as names, dates, IDs, and hospitals.
  • Suitable for local deployment, preserving privacy compared to API-based models.

How to use

Example usage with Hugging Face Transformers:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and its tokenizer from the Hub
model_name = "Venturus/ptt5-v2-AnonyMED-BR"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The model rewrites the input text, wrapping sensitive entities in tags
text = "O paciente João da Silva foi internado no Hospital das Clínicas em 12/05/2023."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example output:

O paciente <PATIENT>João da Silva</PATIENT> foi internado no <HOSPITAL>Hospital das Clínicas</HOSPITAL> em <DATE>12/05/2023</DATE>.
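
The tagged output can be post-processed for full de-identification. A minimal sketch (the `redact` helper is an assumption, not part of the model card) that collapses each tagged span into a placeholder:

```python
import re

# Hypothetical helper: replace each <TAG>value</TAG> span with [TAG],
# dropping the original sensitive value.
def redact(tagged_text: str) -> str:
    return re.sub(r"<([A-Z_]+)>.*?</\1>", r"[\1]", tagged_text)

example = ("O paciente <PATIENT>João da Silva</PATIENT> foi internado no "
           "<HOSPITAL>Hospital das Clínicas</HOSPITAL> em <DATE>12/05/2023</DATE>.")
print(redact(example))
# → O paciente [PATIENT] foi internado no [HOSPITAL] em [DATE].
```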

Training procedure

  • Base architecture: ptt5-v2 Large.
  • Objective: fine-tuning for text-to-text anonymization on AnonyMED-BR.

  • Hyperparameters:

    • Learning rate: 5e-5
    • Batch size: 4
    • Epochs: 2
    • Precision: FP32

  • Hardware: NVIDIA Tesla T4 (16 GB).
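
A training configuration mirroring the hyperparameters above might look like the following sketch (the `output_dir` and `Trainer` wiring are illustrative assumptions, not taken from the card):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: values mirror the hyperparameters listed above;
# output_dir is an illustrative assumption.
args = Seq2SeqTrainingArguments(
    output_dir="ptt5-v2-anonymed-br",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    fp16=False,  # FP32 training, per the card
    predict_with_generate=True,
)
```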

Evaluation results

The model was evaluated on the AnonyMED-BR test set.

Overall results

  • F1-score: 0.8748
  • Precision: 0.8918
  • Recall: 0.8645

Entities covered

  • Personal data:

    • <PATIENT>
    • <DOCTOR>
    • <AGE>
    • <PROFESSION>
  • Identifiers:

    • <IDNUM>
    • <MEDICAL_RECORD>
    • <HEALTH_PLAN>
  • Locations:

    • <CITY>
    • <STATE>
    • <COUNTRY>
    • <STREET>
    • <HOSPITAL>
    • <LOCATION_OTHER>
    • <ZIP>
  • Other sensitive data:

    • <DATE>
    • <EMAIL>
    • <PHONE>
    • <ORGANIZATION>
    • <OTHER>
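
To compare the model's predictions against gold annotations (e.g. when computing the precision/recall figures above), the tagged spans can be parsed back into (tag, value) pairs. A minimal sketch, with the `extract_entities` helper as an assumption:

```python
import re

# Hypothetical helper: pull (tag, value) pairs out of a tagged output,
# e.g. for entity-level precision/recall computation.
def extract_entities(tagged_text: str) -> list[tuple[str, str]]:
    return re.findall(r"<([A-Z_]+)>(.*?)</\1>", tagged_text)

tagged = ("O paciente <PATIENT>João da Silva</PATIENT> foi internado no "
          "<HOSPITAL>Hospital das Clínicas</HOSPITAL> em <DATE>12/05/2023</DATE>.")
print(extract_entities(tagged))
# → [('PATIENT', 'João da Silva'), ('HOSPITAL', 'Hospital das Clínicas'), ('DATE', '12/05/2023')]
```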

Citation

If you use this model, please cite:

@article{schiezzaro2025guardians,
  title={Guardians of the Data: NER and LLMs for Effective Medical Record Anonymization in Brazilian Portuguese},
  author={Schiezzaro, Mauricio and Rosa, Guilherme and Pedrini, Helio and Campos, Bruno Augusto Goulart},
  journal={Frontiers in Public Health},
  year={2025},
  publisher={Frontiers},
  url={https://github.com/venturusbr/AnonyMED-BR}
}