dp_pii_luganda_ner_model

Model Description

This is a token classification model fine-tuned from Conrad747/luganda-ner-v6 to detect Personally Identifiable Information (PII) such as names, email addresses, phone numbers, and dates of birth. The model was trained with differential privacy (noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4), which limits how much any single training example can influence the learned weights and makes the model better suited to sensitive-data applications.

Intended Uses

  • Primary Use Case: Identifying PII in text data, particularly for Luganda and English texts.
  • Supported Entities: NAME, EMAIL, PHONE, DOB (check model.config.id2label for the exact label set).
  • Applications: Data anonymization, compliance with privacy regulations (e.g., GDPR), secure text processing.
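
The exact entity types depend on the label set stored with the model. As a minimal sketch, assuming the labels follow the BIO scheme (B-/I- prefixes plus O) that the inference code in this card relies on, the supported types can be read from an id2label mapping. The helper name entity_types and the mapping below are hypothetical, for illustration only; in practice pass model.config.id2label.

```python
def entity_types(id2label):
    """Return the sorted entity types found in a BIO-style id2label mapping."""
    types = set()
    for label in id2label.values():
        # "B-NAME" and "I-NAME" both contribute the type "NAME"; "O" has no type
        if label.startswith(("B-", "I-")):
            types.add(label[2:])
    return sorted(types)


# Hypothetical mapping for illustration; not the model's actual configuration
example_id2label = {
    0: "O",
    1: "B-NAME", 2: "I-NAME",
    3: "B-EMAIL", 4: "I-EMAIL",
    5: "B-PHONE", 6: "I-PHONE",
    7: "B-DOB", 8: "I-DOB",
}
print(entity_types(example_id2label))
```

If the printed types differ from the ones listed above, any downstream code that filters on entity type should follow the model's mapping rather than this list.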

How to Use

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import json

# Load model and tokenizer
model_name = "e4gl33y3/dp_pii_luganda_ner_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Define classify_pii function
def classify_pii(text, model, tokenizer, device="cuda" if torch.cuda.is_available() else "cpu", max_length=128):
    model.to(device)
    model.eval()
    inputs = tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt"
    ).to(device)
    
    # Use model's id2label for accurate label mapping
    label_map = model.config.id2label
    
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)[0].cpu().numpy()
    
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    word_ids = inputs.word_ids()
    previous_word_idx = None
    pii_entities = []
    current_entity = {"type": None, "value": [], "start": None}
    
    for idx, (token, pred, word_idx) in enumerate(zip(tokens, predictions, word_ids)):
        # Special and padding tokens have no word index; skip them
        if word_idx is None:
            continue
        label = label_map.get(int(pred), "O")
        if word_idx == previous_word_idx:
            # Continuation subword: keep it with the open entity, whatever its own label
            if current_entity["type"] is not None:
                current_entity["value"].append(token)
        elif label.startswith("B-"):
            if current_entity["type"] is not None:
                pii_entities.append({
                    "type": current_entity["type"],
                    "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
                    "start": current_entity["start"]
                })
            # "start" is the token index of the entity's first token, not a character offset
            current_entity = {"type": label[2:], "value": [token], "start": idx}
        elif label.startswith("I-") and current_entity["type"] == label[2:]:
            # A new word tagged I- extends a multi-word entity such as a full name
            current_entity["value"].append(token)
        else:
            if current_entity["type"] is not None:
                pii_entities.append({
                    "type": current_entity["type"],
                    "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
                    "start": current_entity["start"]
                })
            current_entity = {"type": None, "value": [], "start": None}
        previous_word_idx = word_idx
    
    if current_entity["type"] is not None:
        pii_entities.append({
            "type": current_entity["type"],
            "value": tokenizer.convert_tokens_to_string(current_entity["value"]).strip(),
            "start": current_entity["start"]
        })
    
    return {"text": text, "entities": pii_entities}

# Example usage (user@example.com is a placeholder address)
text = "Contact me at user@example.com or call 123-456-7890. My name is Abner Smith, born on 23/03/1990."
result = classify_pii(text, model, tokenizer)
print(json.dumps(result, indent=2))
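
For the anonymization use case, the dictionary returned by classify_pii can drive a simple redaction pass. The following is a minimal sketch, not part of the published model code: the helper name redact is hypothetical, and the sample result is hand-written in the shape classify_pii returns rather than real model output. It matches entity values by plain string search, so it can miss an entity whose detokenized value differs from the source text (for example in spacing around punctuation).

```python
def redact(text, entities):
    """Replace each detected entity value in text with a [TYPE] placeholder."""
    # Replace longer values first so a short value cannot split a longer one
    for entity in sorted(entities, key=lambda e: len(e["value"]), reverse=True):
        value = entity["value"]
        if not value:
            continue  # nothing to search for
        text = text.replace(value, f"[{entity['type']}]")
    return text


# Hand-written result in the shape returned by classify_pii (not real model output)
sample = {
    "text": "My name is Abner Smith, call 123-456-7890.",
    "entities": [
        {"type": "NAME", "value": "Abner Smith", "start": 4},
        {"type": "PHONE", "value": "123-456-7890", "start": 8},
    ],
}
print(redact(sample["text"], sample["entities"]))
# prints: My name is [NAME], call [PHONE].
```

A more robust variant would carry character offsets through from the tokenizer and cut the text by position instead of by value.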

Training Details

  • Dataset: Trained on Conrad747/lg-ner dataset.
  • Privacy: Differential privacy applied with noise_multiplier=3.0, max_grad_norm=0.5, target_delta=1e-4.
  • Optimizer: AdamW with learning rate 5e-5.
  • Epochs: 5
  • Batch Size: 8 (with BatchMemoryManager for memory efficiency).
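
To make the privacy parameters above concrete, here is a minimal pure-Python sketch of the standard differentially private gradient step: clip each example's gradient to max_grad_norm, sum the clipped gradients, add Gaussian noise with standard deviation noise_multiplier × max_grad_norm, and average. This card does not include the training script, so this is an illustration of the general technique under that assumption, not the author's code. The parameter names match those used by the Opacus library, and BatchMemoryManager is an Opacus utility, which suggests but does not confirm that Opacus was used. The function name dp_noisy_mean_gradient is hypothetical.

```python
import math
import random


def dp_noisy_mean_gradient(per_example_grads, max_grad_norm=0.5,
                           noise_multiplier=3.0, rng=None):
    """Clip each example's gradient, sum, add Gaussian noise, and average."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds max_grad_norm
        scale = min(1.0, max_grad_norm / (norm + 1e-12))
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Noise standard deviation is noise_multiplier * max_grad_norm
    sigma = noise_multiplier * max_grad_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


# Toy batch of 8 two-dimensional gradients (made-up values)
batch = [[3.0, 4.0]] * 4 + [[0.3, 0.4]] * 4
print(dp_noisy_mean_gradient(batch, rng=random.Random(0)))
```

In a real training loop the noisy average would be handed to the optimizer (AdamW in this card) in place of the ordinary mini-batch gradient. With a batch of 8 and noise_multiplier=3.0, the noise is large relative to the clipped signal, which is one reason rare entity types can suffer.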

Evaluation

  • Precision: 0.9445
  • Recall: 0.9438
  • F1 Score: 0.9436

Limitations

  • Optimized for Luganda and English PII detection; performance may vary for other languages.
  • Differentially private training adds noise to the gradients, which can reduce accuracy, especially for rare entities.
  • Inference relies on the model's id2label mapping; any custom label map must match the labels used during training.

Contact

For issues or contributions, please visit the repository on Hugging Face or contact e4gl33y3.

Model size: 277M params (Safetensors, tensor type F32)