Arabic NER PII

Personally Identifiable Information Detection for Arabic Text

Overview

A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It addresses challenges specific to Arabic NLP, including morphological complexity and the absence of the capitalization cues that signal named entities in Latin-script languages.

Base Model: MutazYoune/ARAB_BERT | Task: Token Classification | Language: Arabic

Quick Start

pip install transformers torch

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model
tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

# Create pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Detect PII
text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
entities = ner_pipeline(text)
print(entities)
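
With aggregation_strategy="simple", each detected entity is a dict carrying the entity group, a confidence score, the matched text, and character offsets. The values below illustrate the structure only; they are not actual model output:

# Illustrative output shape (scores and offsets are made up, not real predictions):
# [
#     {"entity_group": "PII", "score": 0.97, "word": "أحمد محمد", "start": 5, "end": 14},
#     {"entity_group": "CONTACT", "score": 0.99, "word": "0501234567", "start": 49, "end": 59},
# ]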

Supported Entities

Entity       Description                            Examples
CONTACT      Email addresses, phone numbers         [email protected], 0501234567
NETWORK      IP addresses, network identifiers      192.168.1.1, 10-20-30-40
IDENTIFIER   National IDs, structured identifiers   ID_123456, user.name
NUMERIC_ID   Numeric identifiers                    123456789, 12-34-56
PII          Generic personal information           Names, personal details
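
A common downstream step is redaction: replacing each detected span with its entity tag. The helper below is a minimal sketch built on the offsets returned by the pipeline above; it is not part of the model itself.

def redact(text, entities):
    # Replace spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + "[" + ent["entity_group"] + "]" + text[ent["end"]:]
    return text

print(redact(text, entities))  # e.g. "... [PII] ... ورقم هاتفه [CONTACT]"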

Performance

Maqsam Arabic PII Redaction Challenge - Rank #16

Metric      Exact    Partial   IoU50
Precision   0.029    0.647     0.295
Recall      0.020    0.455     0.208
F1          0.024    0.534     0.244

Overall Score: 0.5341

Training Details

Dataset
  • Source: Maqsam Arabic PII Redaction Competition Dataset
  • Size: 20,000 sentences (10k original + 10k LLM-augmented)
  • Annotation: BIO tagging scheme with regex pattern matching
  • Labels: 11 total (O + B-/I- for each entity type; see the sketch below)
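
A minimal sketch of how the 11 labels are derived from the five entity types (the exact index order in the released checkpoint may differ; consult model.config.id2label):

ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]
LABELS = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
assert len(LABELS) == 11
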
Training Configuration
base_model: MutazYoune/ARAB_BERT
epochs: 12
batch_size: 16
learning_rate: 3e-5
max_length: 512
optimization: AdamW
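
These hyperparameters map naturally onto the Hugging Face Trainer API. The snippet below is a hedged sketch, not the published training script; output_dir and any unlisted arguments are assumptions:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="arabic-ner-pii",      # hypothetical output path
    num_train_epochs=12,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
)
# AdamW is the Trainer's default optimizer; max_length=512 is applied
# at tokenization time, not here.
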
Pattern Recognition
PATTERNS = {
    # Email addresses and URLs
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    # Dotted IPv4 addresses and hyphen-separated network identifiers
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    # Underscore- or dot-joined structured identifiers (e.g. ID_123456, user.name)
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    # Hyphenated number pairs or long digit runs
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
}
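
For illustration, these patterns can be applied with re.finditer to pre-annotate character spans before converting them to BIO tags. This is a sketch of the idea, not the actual competition annotation script:

import re

def find_pattern_spans(text):
    # Return (label, start, end) for every regex match in the text.
    spans = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            spans.append((label, m.start(), m.end()))
    return spans

print(find_pattern_spans("راسلني على user_name1 أو 192.168.1.1"))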

Advanced Usage

Custom Processing Pipeline
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def process_arabic_text(text, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)
    
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
    
    # Filter out special tokens
    results = []
    for token, label in zip(tokens, labels):
        if token not in ['[CLS]', '[SEP]', '[PAD]']:
            results.append((token, label))
    
    return results
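
For example, printing only the non-O predictions for the Quick Start sentence:

for token, label in process_arabic_text(text, model, tokenizer):
    if label != "O":
        print(token, label)
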
Batch Processing
def batch_process_texts(texts, ner_pipeline, batch_size=8):
    # Feed the pipeline one chunk of texts at a time; a pipeline called
    # with a list returns one entity list per input text.
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        results.extend(ner_pipeline(batch))
    return results

# Alternatively, ner_pipeline(texts, batch_size=8) lets the pipeline
# handle the chunking internally.

Model Architecture

Input: Arabic Text
    ↓
Tokenization (Arabic BERT Tokenizer)
    ↓
ARAB_BERT Encoder (12 layers)
    ↓
Classification Head (11 classes)
    ↓
BIO Tag Predictions
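
You can verify the classification head from the loaded config (assumes the model is loaded as in Quick Start):

print(model.config.num_labels)  # 11
print(model.config.id2label)    # index -> BIO tag mapping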

Limitations & Considerations

  • Exact Boundary Detection: Low exact-match scores show the model struggles to place precise entity boundaries
  • Dialectal Coverage: Primarily trained on Modern Standard Arabic; dialectal text may degrade accuracy
  • Context Sensitivity: May struggle with PII whose identification depends on surrounding context
  • Performance Trade-offs: The model favors partial-overlap detection over exact span matches

Competition Context

Developed for the Maqsam Arabic PII Redaction Challenge, which addresses critical gaps in Arabic PII detection systems. The competition emphasized:

  • Token-level evaluation methodology
  • Real-world deployment considerations
  • Speed optimization for practical applications
  • Arabic-specific linguistic challenges

Evaluation Formula:

Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
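
As a minimal sketch, the formula in Python (the unit of avg_time is defined by the competition and assumed here to be seconds per example):

def final_score(precision, recall, avg_time):
    # Quality terms dominate (90%); inference speed contributes the remaining 10%.
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time)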

Citation

@misc{arabic-ner-pii-2024,
  author = {MutazYoune},
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
}
