|
--- |
|
datasets: |
|
- ai4privacy/pii-masking-400k |
|
metrics: |
|
- accuracy |
|
- recall |
|
- precision |
|
- f1 |
|
base_model: |
|
- answerdotai/ModernBERT-base |
|
pipeline_tag: token-classification |
|
tags: |
|
- pii |
|
- privacy |
|
- personal |
|
- identification |
|
--- |
|
# 🐟 PII-RANHA: Privacy-Preserving Token Classification Model |
|
|
|
## Overview |
|
PII-RANHA is a fine-tuned token classification model based on **ModernBERT-base** from Answer.AI. It is designed to identify and classify Personally Identifiable Information (PII) in text data. The model is trained on the `ai4privacy/pii-masking-400k` dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, email addresses, and more. |
|
|
|
This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations. |
|
|
|
## Model Details |
|
|
|
### Model Architecture |
|
- **Base Model**: `answerdotai/ModernBERT-base` |
|
- **Task**: Token Classification |
|
- **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens) |
|
|
|
|
|
## Usage |
|
|
|
### Installation |
|
To use the model, install the `transformers` and `datasets` libraries. ModernBERT support requires `transformers` 4.48.0 or later, and inference also needs a backend such as PyTorch:
|
|
|
```bash |
|
pip install transformers datasets |
|
``` |
|
|
|
### Inference Example
|
Here’s how to load and use the model for PII detection: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForTokenClassification |
|
from transformers import pipeline |
|
|
|
# Load the model and tokenizer |
|
model_name = "scampion/piiranha" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForTokenClassification.from_pretrained(model_name) |
|
|
|
# Create a token classification pipeline |
|
pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer) |
|
|
|
# Example input |
|
text = "My email is john.doe@example.com and my phone number is 555-123-4567."
|
|
|
# Detect PII |
|
results = pii_pipeline(text) |
|
for entity in results: |
|
print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}") |
|
|
|
``` |
|
|
|
Example output:

```text
|
Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445 |
|
Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657 |
|
Entity: ., Label: I-USERNAME, Score: 0.5871 |
|
Entity: do, Label: I-USERNAME, Score: 0.5350 |
|
Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399 |
|
Entity: -, Label: I-SOCIALNUM, Score: 0.5948 |
|
Entity: 123, Label: I-SOCIALNUM, Score: 0.6309 |
|
Entity: -, Label: I-SOCIALNUM, Score: 0.6151 |
|
Entity: 45, Label: I-SOCIALNUM, Score: 0.3742 |
|
Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440 |
|
``` |
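Because the pipeline returns one prediction per sub-word token, a redaction step usually has to merge neighbouring tokens back into character spans before masking them. The following is a minimal sketch of that idea, not part of the model itself: `merge_spans` and `redact` are hypothetical helper names, the sample predictions are hand-written for illustration, and the code assumes each prediction carries the `start`, `end`, and `score` keys that the `transformers` token-classification pipeline emits for fast tokenizers.

```python
def merge_spans(entities, max_gap=1):
    """Merge token-level predictions whose character spans touch or sit
    at most `max_gap` characters apart (e.g. separated by one space)."""
    spans = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        if spans and ent["start"] - spans[-1][1] <= max_gap:
            spans[-1][1] = max(spans[-1][1], ent["end"])
        else:
            spans.append([ent["start"], ent["end"]])
    return [(start, end) for start, end in spans]


def redact(text, entities, placeholder="[REDACTED]", min_score=0.0):
    """Replace every merged PII span in `text` with `placeholder`."""
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier character offsets stay valid.
    for start, end in reversed(merge_spans(kept)):
        # Some tokenizers include the leading space in a token's offsets;
        # skip it so the surrounding whitespace is preserved.
        while start < end and text[start].isspace():
            start += 1
        text = text[:start] + placeholder + text[end:]
    return text


# Hand-written predictions in the pipeline's output shape (illustrative only).
sample_text = "Call 555-123-4567 now."
sample_entities = [
    {"entity": "I-TELEPHONENUM", "score": 0.90, "start": 4, "end": 8},
    {"entity": "I-TELEPHONENUM", "score": 0.88, "start": 8, "end": 9},
    {"entity": "I-TELEPHONENUM", "score": 0.91, "start": 9, "end": 12},
    {"entity": "I-TELEPHONENUM", "score": 0.87, "start": 12, "end": 13},
    {"entity": "I-TELEPHONENUM", "score": 0.85, "start": 13, "end": 17},
]

print(redact(sample_text, sample_entities))  # Call [REDACTED] now.
```

With real model output, `redact(text, pii_pipeline(text))` would apply the same masking. The sketch ignores the predicted label on purpose, since neighbouring tokens of one entity can receive different labels; raising `min_score` trades recall for fewer false redactions. The pipeline's own `aggregation_strategy` argument is an alternative way to group tokens.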
|
|
|
## Training Details |
|
|
|
### Dataset |
|
The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.
|
|
|
### Training Configuration |
|
- **Batch Size:** 32 |
|
- **Learning Rate:** 5e-5 |
|
- **Epochs:** 4 |
|
- **Optimizer:** AdamW |
|
- **Weight Decay:** 0.01 |
|
- **Scheduler:** Linear learning rate scheduler |
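To make the scheduler setting concrete, here is a small sketch of how a linear schedule scales the learning rate over training. It assumes zero warmup steps and a decay to zero, neither of which is stated above, and it uses a hypothetical `steps_per_epoch` value because the actual number of optimizer steps is not given. The formula mirrors the shape of `get_linear_schedule_with_warmup` in `transformers`.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical step count; the real value depends on the training split size
# and the batch size of 32.
steps_per_epoch = 1000
epochs = 4
total_steps = epochs * steps_per_epoch

for epoch in range(epochs + 1):
    lr = linear_lr(epoch * steps_per_epoch, total_steps)
    print(f"start of epoch {epoch + 1}: lr = {lr:.2e}")
```

Under these assumptions the rate falls by a quarter of its initial value each epoch and reaches zero at the final step.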
|
|
|
### Evaluation Metrics |
|
The model was evaluated using the following metrics: |
|
- Precision |
|
- Recall |
|
- F1 Score |
|
- Accuracy |
|
|
|
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 1 | 0.017100 | 0.017944 | 0.897562 | 0.905612 | 0.901569 | 0.993549 |
| 2 | 0.011300 | 0.014114 | 0.915451 | 0.923319 | 0.919368 | 0.994782 |
| 3 | 0.005000 | 0.015703 | 0.919432 | 0.928394 | 0.923892 | 0.995136 |
| 4 | 0.001000 | 0.022899 | 0.921234 | 0.927212 | 0.924213 | 0.995267 |
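As a quick consistency check, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes the usual definition of F1 as the harmonic mean of precision and recall; the card does not say which library produced the scores, so small rounding differences are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the table above.
rows = [
    (1, 0.897562, 0.905612, 0.901569),
    (2, 0.915451, 0.923319, 0.919368),
    (3, 0.919432, 0.928394, 0.923892),
    (4, 0.921234, 0.927212, 0.924213),
]

for epoch, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"epoch {epoch}: computed F1 = {computed:.6f}, reported F1 = {reported:.6f}")
```

The recomputed values agree with the reported ones to within rounding of the six-decimal inputs.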
|
|
|
|
|
|
## License |
|
This model is licensed under the Apache License 2.0 with the Commons Clause condition. For more details, see the Commons Clause website.

For alternative licensing terms, contact the author.
|
|
|
## Author |
|
- **Name:** Sébastien Campion

- **Email:** [email protected]

- **Date:** 2025-01-30

- **Version:** 0.1
|
|
|
## Citation |
|
If you use this model in your work, please cite it as follows: |
|
|
|
```bibtex |
|
@misc{piiranha2025, |
|
author = {Sébastien Campion}, |
|
title = {PII-RANHA: A Privacy-Preserving Token Classification Model}, |
|
year = {2025}, |
|
version = {0.1}, |
|
url = {https://huggingface.co/sebastien-campion/piiranha}, |
|
} |
|
``` |
|
|
|
## Disclaimer |
|
This model is provided "as-is" without any guarantees of performance or suitability for specific use cases. |
|
Always evaluate the model's performance in your specific context before deployment. |