	---
	datasets:
	- ai4privacy/pii-masking-400k
	metrics:
	- accuracy
	- recall
	- precision
	- f1
	base_model:
	- answerdotai/ModernBERT-base
	pipeline_tag: token-classification
	tags:
	- pii
	- privacy
	- personal
	- identification
	---
	# 🐟 PII-RANHA: Privacy-Preserving Token Classification Model

	## Overview
	PII-RANHA is a fine-tuned token classification model based on ModernBERT-base from Answer.AI. It identifies and classifies Personally Identifiable Information (PII) in text. The model is trained on the `ai4privacy/pii-masking-400k` dataset and detects 17 PII categories, including account numbers, credit card numbers, and email addresses.

	This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.

	## Model Details

	### Model Architecture
	- Base Model: `answerdotai/ModernBERT-base`
	- Task: Token Classification
	- Number of Labels: 18 (17 PII categories + "O" for non-PII tokens)


	## Usage

	### Installation
	To use the model, ensure you have the `transformers` and `datasets` libraries installed:

	```bash
	pip install transformers datasets
	```

	### Inference Example
	Here’s how to load and use the model for PII detection:

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	from transformers import pipeline

	# Load the model and tokenizer
	model_name = "scampion/piiranha"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Create a token classification pipeline
	pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)

	# Example input
	text = "My email is [email protected] and my phone number is 555-123-4567."

	# Detect PII
	results = pii_pipeline(text)
	for entity in results:
	    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")

	```

	```text
	Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
	Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
	Entity: ., Label: I-USERNAME, Score: 0.5871
	Entity: do, Label: I-USERNAME, Score: 0.5350
	Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
	Entity: -, Label: I-SOCIALNUM, Score: 0.5948
	Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
	Entity: -, Label: I-SOCIALNUM, Score: 0.6151
	Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
	Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
	```
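The raw output above is reported per sub-word token, so a redaction step usually needs to merge neighbouring predictions into a single character span. The following is a minimal sketch, not part of the model's API: `redact` is a hypothetical helper, the `min_score` threshold and the `[REDACTED]` placeholder are arbitrary assumptions, and the entity dicts are hand-written stand-ins that use the `start`, `end`, and `score` keys returned by the `transformers` token-classification pipeline.

```python
def redact(text, entities, min_score=0.5, placeholder="[REDACTED]"):
    """Replace character spans flagged as PII with a placeholder.

    `entities` is a list of dicts with `start`, `end`, and `score` keys.
    Hypothetical helper for illustration only.
    """
    # Keep sufficiently confident predictions, ordered by start offset.
    spans = sorted(
        (e["start"], e["end"]) for e in entities if e["score"] >= min_score
    )

    # Merge spans that touch or overlap so one PII value gets one placeholder.
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Rebuild the text, swapping each merged span for the placeholder.
    pieces, cursor = [], 0
    for start, end in merged:
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-written stand-ins for pipeline output (not real model predictions).
sample_text = "Call 555-123-4567 now"
sample_entities = [
    {"start": 5, "end": 8, "score": 0.84},
    {"start": 8, "end": 9, "score": 0.59},
    {"start": 9, "end": 12, "score": 0.63},
    {"start": 12, "end": 13, "score": 0.62},
    {"start": 13, "end": 17, "score": 0.90},
    {"start": 18, "end": 21, "score": 0.10},  # below the threshold, kept as text
]
print(redact(sample_text, sample_entities))  # Call [REDACTED] now
```

Lowering `min_score` redacts more aggressively at the cost of masking more non-PII text. The right threshold depends on the application and should be chosen on your own data.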

	## Training Details

	### Dataset
	The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.

	### Training Configuration
	- Batch Size: 32
	- Learning Rate: 5e-5
	- Epochs: 4
	- Optimizer: AdamW
	- Weight Decay: 0.01
	- Scheduler: Linear learning rate scheduler
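To illustrate what a linear scheduler does with the learning rate listed above, here is a sketch under stated assumptions. The warmup length and total number of optimizer steps are not given in this card, so `warmup_steps=0` and `total_steps=1000` are placeholder values, and `linear_lr` is a hypothetical helper that follows the common linear-warmup-then-linear-decay rule rather than the author's actual training code.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Placeholder step count: the real value depends on dataset size, batch size, and epochs.
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps))
```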

	### Evaluation Metrics
	The model was evaluated using the following metrics:
	- Precision
	- Recall
	- F1 Score
	- Accuracy

	| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
	|-------|---------------|-----------------|-----------|----------|----------|----------|
	| 1     | 0.017100      | 0.017944        | 0.897562  | 0.905612 | 0.901569 | 0.993549 |
	| 2     | 0.011300      | 0.014114        | 0.915451  | 0.923319 | 0.919368 | 0.994782 |
	| 3     | 0.005000      | 0.015703        | 0.919432  | 0.928394 | 0.923892 | 0.995136 |
	| 4     | 0.001000      | 0.022899        | 0.921234  | 0.927212 | 0.924213 | 0.995267 |
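The F1 column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it from the precision and recall values in the table. `f1_score` here is a small local helper written for this example, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (epoch, precision, recall, reported F1) copied from the table above.
rows = [
    (1, 0.897562, 0.905612, 0.901569),
    (2, 0.915451, 0.923319, 0.919368),
    (3, 0.919432, 0.928394, 0.923892),
    (4, 0.921234, 0.927212, 0.924213),
]

for epoch, precision, recall, reported in rows:
    recomputed = f1_score(precision, recall)
    # Differences at the sixth decimal come from rounding in the reported values.
    print(f"epoch {epoch}: recomputed={recomputed:.6f} reported={reported:.6f}")
```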


	## License
	This model is licensed under the Commons Clause Apache License 2.0. For more details, see the Commons Clause website.
	For other licensing terms, contact the author.

	## Author
	Name: Sébastien Campion

	Email: [email protected]

	Date: 2025-01-30

	Version: 0.1

	## Citation
	If you use this model in your work, please cite it as follows:

	```bibtex
	@misc{piiranha2025,
	author = {Sébastien Campion},
	title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
	year = {2025},
	version = {0.1},
	url = {https://huggingface.co/sebastien-campion/piiranha},
	}
	```

	## Disclaimer
	This model is provided "as-is" without any guarantees of performance or suitability for specific use cases.
	Always evaluate the model's performance in your specific context before deployment.