---
datasets:
- ai4privacy/pii-masking-400k
metrics:
- accuracy
- recall
- precision
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: token-classification
tags:
- pii
- privacy
- personal
- identification
---
# 🐟 PII-RANHA: Privacy-Preserving Token Classification Model
## Overview
PII-RANHA is a fine-tuned token classification model based on **ModernBERT-base** from Answer.AI. It is designed to identify and classify Personally Identifiable Information (PII) in text data. The model is trained on the `ai4privacy/pii-masking-400k` dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, email addresses, and more.
This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.
## Model Details
### Model Architecture
- **Base Model**: `answerdotai/ModernBERT-base`
- **Task**: Token Classification
- **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens)
## Usage
### Installation
To use the model, ensure you have the `transformers` and `datasets` libraries installed:
```bash
pip install transformers datasets
```
### Inference Example
Here’s how to load and use the model for PII detection:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
# Load the model and tokenizer
model_name = "scampion/piiranha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Create a token classification pipeline
pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)
# Example input
text = "My email is [email protected] and my phone number is 555-123-4567."
# Detect PII
results = pii_pipeline(text)
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
```
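The pipeline returns one prediction per sub-word token (`Ġ` marks a token that begins with a space):
```text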
Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
Entity: ., Label: I-USERNAME, Score: 0.5871
Entity: do, Label: I-USERNAME, Score: 0.5350
Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
Entity: -, Label: I-SOCIALNUM, Score: 0.5948
Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
Entity: -, Label: I-SOCIALNUM, Score: 0.6151
Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
```
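
Because the predictions above are per sub-word token, a redaction workflow usually needs them merged into character spans first. The following is a minimal sketch of one way to do that, not the author's method. It assumes each pipeline result also carries `start` and `end` character offsets (the `transformers` token-classification pipeline normally provides these with fast tokenizers). The helper names `merge_spans` and `redact`, the `max_gap` parameter, and the hand-written sample predictions are all hypothetical.

```python
from typing import Dict, List


def merge_spans(entities: List[Dict], max_gap: int = 1) -> List[Dict]:
    """Merge token-level predictions into character spans.

    Tokens whose offsets are at most `max_gap` characters apart are merged.
    The merged span takes the label of its highest-scoring token.
    """
    spans: List[Dict] = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Strip a leading "I-" / "B-" prefix, e.g. "I-SOCIALNUM" -> "SOCIALNUM".
        label = ent["entity"].split("-", 1)[-1]
        if spans and ent["start"] - spans[-1]["end"] <= max_gap:
            last = spans[-1]
            last["end"] = max(last["end"], ent["end"])
            if ent["score"] > last["score"]:
                last["label"], last["score"] = label, ent["score"]
        else:
            spans.append({"start": ent["start"], "end": ent["end"],
                          "label": label, "score": ent["score"]})
    return spans


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each merged span with a "[LABEL]" placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in reversed(merge_spans(entities)):
        text = text[:span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


# Hand-written sample predictions (hypothetical, not real model output).
sample_text = "Call 555-123-4567 now."
sample_entities = [
    {"entity": "I-TELEPHONENUM", "score": 0.84, "start": 5, "end": 8},
    {"entity": "I-SOCIALNUM", "score": 0.59, "start": 8, "end": 9},
    {"entity": "I-TELEPHONENUM", "score": 0.63, "start": 9, "end": 12},
    {"entity": "I-TELEPHONENUM", "score": 0.61, "start": 12, "end": 17},
]
print(redact(sample_text, sample_entities))  # Call [TELEPHONENUM] now.
```

Merging by adjacency rather than by label is a deliberate choice here: as the sample output shows, neighbouring tokens of one value can receive different labels, and for redaction it is often safer to mask the whole run.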
## Training Details
### Dataset
The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.
### Training Configuration
- **Batch Size:** 32
- **Learning Rate:** 5e-5
- **Epochs:** 4
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
- **Scheduler:** Linear learning rate scheduler
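
For readers who want to reproduce a similar setup, the configuration above could be written down as follows. This is a sketch under assumptions, not the author's training script: it assumes the Hugging Face `Trainer` was used, that the batch size is per device, and that no warmup was applied (none of which the card states). The keys mirror `transformers.TrainingArguments` parameter names; `linear_lr` and its `total_steps` argument are hypothetical helpers that only illustrate what a linear schedule does.

```python
# Reported hyperparameters, keyed by TrainingArguments parameter names.
# Assumption: batch size is per device; the card does not say.
training_kwargs = {
    "per_device_train_batch_size": 32,
    "learning_rate": 5e-5,
    "num_train_epochs": 4,
    "weight_decay": 0.01,
    "lr_scheduler_type": "linear",
    "optim": "adamw_torch",
}

# With transformers installed, these could be passed on like so:
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="piiranha-out", **training_kwargs)


def linear_lr(step: int, total_steps: int,
              base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative run of 1,000 optimizer steps (hypothetical step count).
for step in (0, 500, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

With no warmup, the rate starts at the full 5e-5 and falls in a straight line to zero at the final step.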
### Evaluation Metrics
The model was evaluated using the following metrics:
- Precision
- Recall
- F1 Score
- Accuracy
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|-------|----------|
| 1 | 0.017100 | 0.017944 | 0.897562 | 0.905612 | 0.901569 | 0.993549 |
| 2 | 0.011300 | 0.014114 | 0.915451 | 0.923319 | 0.919368 | 0.994782 |
| 3 | 0.005000 | 0.015703 | 0.919432 | 0.928394 | 0.923892 | 0.995136 |
| 4 | 0.001000 | 0.022899 | 0.921234 | 0.927212 | 0.924213 | 0.995267 |
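
The table can be sanity-checked with a few lines of code. The sketch below recomputes F1 as the harmonic mean of precision and recall for each epoch and picks the best epoch by two different criteria. The values are copied from the table; the variable and function names are illustrative.

```python
# (epoch, training_loss, validation_loss, precision, recall, f1, accuracy)
rows = [
    (1, 0.017100, 0.017944, 0.897562, 0.905612, 0.901569, 0.993549),
    (2, 0.011300, 0.014114, 0.915451, 0.923319, 0.919368, 0.994782),
    (3, 0.005000, 0.015703, 0.919432, 0.928394, 0.923892, 0.995136),
    (4, 0.001000, 0.022899, 0.921234, 0.927212, 0.924213, 0.995267),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Each reported F1 should match the value recomputed from P and R.
for epoch, _, _, p, r, reported_f1, _ in rows:
    assert abs(f1_score(p, r) - reported_f1) < 1e-5, epoch

best_by_val_loss = min(rows, key=lambda row: row[2])[0]
best_by_f1 = max(rows, key=lambda row: row[5])[0]
print(best_by_val_loss, best_by_f1)  # 2 4
```

Validation loss is lowest at epoch 2, while F1 and accuracy peak at epoch 4. Training loss keeps falling as validation loss rises after epoch 2, a pattern that is often a sign of overfitting, so which checkpoint is preferable may depend on the metric that matters for a given deployment.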
## License
This model is licensed under the Commons Clause Apache License 2.0. For more details, see the Commons Clause website.
For other licensing terms, contact the author.
## Author
- **Name:** Sébastien Campion
- **Email:** [email protected]
- **Date:** 2025-01-30
- **Version:** 0.1
## Citation
If you use this model in your work, please cite it as follows:
```bibtex
@misc{piiranha2025,
author = {Sébastien Campion},
title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
year = {2025},
version = {0.1},
url = {https://huggingface.co/sebastien-campion/piiranha},
}
```
## Disclaimer
This model is provided "as-is" without any guarantees of performance or suitability for specific use cases.
Always evaluate the model's performance in your specific context before deployment.