dp_pii_luganda_ner_model
Model Description
This is a token classification model fine-tuned from Conrad747/luganda-ner-v6 to detect Personally Identifiable Information (PII) such as names, emails, phone numbers, and dates of birth. It was trained with differential privacy (noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4), which limits how much any single training example can influence the learned weights and makes the model better suited to sensitive data applications.
Intended Uses
- Primary Use Case: Identifying PII in text data, particularly for Luganda and English texts.
- Supported Entities: NAME, EMAIL, PHONE, DOB (the exact label set depends on the dataset labels; see model.config.id2label).
- Applications: Data anonymization, compliance with privacy regulations (e.g., GDPR), secure text processing.
How to Use
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import json

# Load model and tokenizer
model_name = "e4gl33y3/dp_pii_luganda_ner_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)


def classify_pii(text, model, tokenizer, device=None, max_length=128):
    """Tag `text` and return the PII entities decoded from the BIO labels."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    inputs = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    ).to(device)

    # Use the model's own id2label for accurate label mapping
    label_map = model.config.id2label

    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)[0].cpu().numpy()

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    word_ids = inputs.word_ids()  # only available with a fast tokenizer

    pii_entities = []
    current_entity = {"type": None, "value": [], "start": None}

    def flush(entity):
        # Store the entity collected so far, if there is one
        if entity["type"] is not None:
            pii_entities.append({
                "type": entity["type"],
                "value": tokenizer.convert_tokens_to_string(entity["value"]).strip(),
                "start": entity["start"],  # token index, not a character offset
            })

    for idx, (token, pred, word_idx) in enumerate(zip(tokens, predictions, word_ids)):
        # Skip special and padding tokens, which map to no word
        if word_idx is None or token in tokenizer.all_special_tokens:
            continue
        label = label_map.get(int(pred), "O")
        if label.startswith("B-"):
            # A new entity begins; close any entity that is still open
            flush(current_entity)
            current_entity = {"type": label[2:], "value": [token], "start": idx}
        elif label.startswith("I-") and current_entity["type"] == label[2:]:
            # Continue the open entity, including across word boundaries,
            # so that multi-word values such as full names stay together
            current_entity["value"].append(token)
        else:
            flush(current_entity)
            current_entity = {"type": None, "value": [], "start": None}

    # Close an entity that runs to the end of the sequence
    flush(current_entity)
    return {"text": text, "entities": pii_entities}


# Example usage
text = "Contact me at [email protected] or call 123-456-7890. My name is Abner Smith, born on 23/03/1990."
result = classify_pii(text, model, tokenizer)
print(json.dumps(result, indent=2))
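For the data anonymization use case, the dictionary returned by classify_pii can be turned into redacted text. The following is only a minimal sketch: redact_entities is a hypothetical helper, not part of this model or of transformers, and the sample dictionary is hand-written to mirror the output shape rather than taken from a real model run. It relies on plain string matching, so it assumes each entity value appears verbatim in the input text, which may not hold if the tokenizer normalizes characters or whitespace.

```python
def redact_entities(result, placeholder="[{type}]"):
    """Replace each detected entity value in the text with a type tag."""
    redacted = result["text"]
    # Replace longer values first so a short value cannot clobber part of a longer one
    ordered = sorted(result["entities"], key=lambda e: len(e["value"]), reverse=True)
    for entity in ordered:
        value = entity["value"]
        if value:  # skip empty values, which would match everywhere
            redacted = redacted.replace(value, placeholder.format(type=entity["type"]))
    return redacted


# Hand-written input in the shape returned by classify_pii (not real model output)
sample = {
    "text": "My name is Abner Smith, call 123-456-7890.",
    "entities": [
        {"type": "NAME", "value": "Abner Smith", "start": 4},
        {"type": "PHONE", "value": "123-456-7890", "start": 8},
    ],
}
print(redact_entities(sample))  # My name is [NAME], call [PHONE].
```

The "start" field is ignored here because it is a token index rather than a character offset.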
Training Details
- Dataset: Trained on the Conrad747/lg-ner dataset.
- Privacy: Differential privacy applied with noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4.
- Optimizer: AdamW with learning rate 5e-5.
- Epochs: 5
- Batch Size: 8 (with BatchMemoryManager for memory efficiency).
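To make the privacy settings above more concrete, here is a rough pure-Python sketch of the clip-and-noise step that differentially private SGD performs on each batch. It is an illustration of the general technique under simplifying assumptions, not the training code used for this model: the function names and the toy gradients are made up, and a real run would rely on a DP training library rather than hand-rolled lists.

```python
import math
import random

NOISE_MULTIPLIER = 3.0  # value listed in the training details
MAX_GRAD_NORM = 0.5     # per-example clipping threshold listed in the training details


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grad]


def dp_average(per_example_grads, rng):
    """Clip each gradient, sum them, add Gaussian noise, then average."""
    clipped = [clip(g, MAX_GRAD_NORM) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation scales with the clipping threshold
    std = NOISE_MULTIPLIER * MAX_GRAD_NORM
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [n / len(per_example_grads) for n in noisy]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
toy_grads = [[3.0, 4.0], [0.1, 0.0], [-0.2, 0.3]]  # made-up gradients
print(dp_average(toy_grads, rng))
```

Clipping bounds the contribution of any one example to 0.5, and the added noise then has standard deviation 3.0 × 0.5 = 1.5 on the summed gradient before averaging.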
Evaluation
- Precision: 0.9445
- Recall: 0.9438
- F1 Score: 0.9436
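As a quick arithmetic check on the figures above, F1 for a single class is the harmonic mean of precision and recall. The sketch below applies that formula to the reported numbers; f1_score is a local helper written for this illustration. The result comes out slightly higher than the reported F1, which can happen when the three figures are averaged across entity classes. That reading is an assumption, since the averaging method is not stated here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.9445
reported_recall = 0.9438
reported_f1 = 0.9436

harmonic = f1_score(reported_precision, reported_recall)
print(f"harmonic mean of reported P and R: {harmonic:.4f}")
print(f"reported F1:                       {reported_f1:.4f}")
```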
Limitations
- Optimized for Luganda and English PII detection; performance may vary for other languages.
- Differential privacy adds noise during training, which may reduce accuracy, particularly for rare entities.
- Label mapping must match dataset labels for accurate inference.
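One way to guard against a label mismatch is to compare the entity types you expect with those present in the label mapping before running inference. The helper names below are hypothetical, and example_id2label is a made-up mapping for illustration only; in practice you would pass model.config.id2label from the loaded model.

```python
def entity_types(id2label):
    """Collect entity types from BIO labels, e.g. 'B-NAME' gives 'NAME'."""
    return {label.split("-", 1)[1] for label in id2label.values() if "-" in label}


def missing_types(id2label, expected):
    """Return the expected entity types that the label mapping lacks."""
    return sorted(set(expected) - entity_types(id2label))


# Made-up mapping for illustration; not this model's actual labels
example_id2label = {0: "O", 1: "B-NAME", 2: "I-NAME", 3: "B-EMAIL", 4: "I-EMAIL"}
print(missing_types(example_id2label, ["NAME", "EMAIL", "PHONE", "DOB"]))  # ['DOB', 'PHONE']
```

An empty list means every expected type has at least one matching label.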
Contact
For issues or contributions, please visit the repository on Hugging Face or contact e4gl33y3.