Arabic NER PII

Personally Identifiable Information Detection for Arabic Text

Overview

A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It addresses challenges specific to Arabic NLP, including morphological complexity and the absence of the capitalization cues that signal named entities in Latin-script languages.

Base Model: MutazYoune/ARAB_BERT | Task: Token Classification | Language: Arabic

Quick Start

pip install transformers torch

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model
tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

# Create pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Detect PII
text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
entities = ner_pipeline(text)
print(entities)
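
With aggregation_strategy="simple", each detected entity is a dict carrying the entity group, a confidence score, the matched text, and character offsets. The values below illustrate the structure only; they are not actual model output:

# Illustrative output shape (scores and offsets are made up, not real predictions):
# [
#     {"entity_group": "PII", "score": 0.97, "word": "أحمد محمد", "start": 5, "end": 14},
#     {"entity_group": "CONTACT", "score": 0.99, "word": "0501234567", "start": 49, "end": 59},
# ]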

Supported Entities

Entity       Description                            Examples
CONTACT      Email addresses, phone numbers         [email protected], 0501234567
NETWORK      IP addresses, network identifiers      192.168.1.1, 10-20-30-40
IDENTIFIER   National IDs, structured identifiers   ID_123456, user.name
NUMERIC_ID   Numeric identifiers                    123456789, 12-34-56
PII          Generic personal information           Names, personal details
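
A common downstream step is redaction: replacing each detected span with its entity tag. The helper below is a minimal sketch built on the offsets returned by the pipeline above; it is not part of the model itself.

def redact(text, entities):
    # Replace spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + "[" + ent["entity_group"] + "]" + text[ent["end"]:]
    return text

print(redact(text, entities))  # e.g. "... [PII] ... ورقم هاتفه [CONTACT]"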

Performance

Maqsam Arabic PII Redaction Challenge - Rank #16

Metric      Exact    Partial   IoU50
Precision   0.029    0.647     0.295
Recall      0.020    0.455     0.208
F1          0.024    0.534     0.244

Overall Score: 0.5341

Training Details

Dataset
  • Source: Maqsam Arabic PII Redaction Competition Dataset
  • Size: 20,000 sentences (10k original + 10k LLM-augmented)
  • Annotation: BIO tagging scheme with regex pattern matching
  • Labels: 11 total (O + B-/I- for each entity type; see the sketch below)
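
A minimal sketch of how the 11 labels are derived from the five entity types (the exact index order in the released checkpoint may differ; consult model.config.id2label):

ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]
LABELS = ["O"] + [f"{prefix}-{ent}" for ent in ENTITY_TYPES for prefix in ("B", "I")]
assert len(LABELS) == 11
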
Training Configuration
base_model: MutazYoune/ARAB_BERT
epochs: 12
batch_size: 16
learning_rate: 3e-5
max_length: 512
optimization: AdamW
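
These hyperparameters map naturally onto the Hugging Face Trainer API. The snippet below is a hedged sketch, not the published training script; output_dir and any unlisted arguments are assumptions:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="arabic-ner-pii",      # hypothetical output path
    num_train_epochs=12,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
)
# AdamW is the Trainer's default optimizer; max_length=512 is applied
# at tokenization time, not here.
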
Pattern Recognition
PATTERNS = {
    # Email addresses and URLs
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    # Dotted IPv4 addresses and hyphen-separated network identifiers
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    # Underscore- or dot-joined structured identifiers (e.g. ID_123456, user.name)
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    # Hyphenated number pairs or long digit runs
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
}
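
For illustration, these patterns can be applied with re.finditer to pre-annotate character spans before converting them to BIO tags. This is a sketch of the idea, not the actual competition annotation script:

import re

def find_pattern_spans(text):
    # Return (label, start, end) for every regex match in the text.
    spans = []
    for label, pattern in PATTERNS.items():
        for m in re.finditer(pattern, text):
            spans.append((label, m.start(), m.end()))
    return spans

print(find_pattern_spans("راسلني على user_name1 أو 192.168.1.1"))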

Advanced Usage

Custom Processing Pipeline
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def process_arabic_text(text, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)
    
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
    
    # Filter out special tokens
    results = []
    for token, label in zip(tokens, labels):
        if token not in ['[CLS]', '[SEP]', '[PAD]']:
            results.append((token, label))
    
    return results
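
For example, printing only the non-O predictions for the Quick Start sentence:

for token, label in process_arabic_text(text, model, tokenizer):
    if label != "O":
        print(token, label)
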
Batch Processing
def batch_process_texts(texts, ner_pipeline, batch_size=8):
    # Feed the pipeline one chunk of texts at a time; a pipeline called
    # with a list returns one entity list per input text.
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        results.extend(ner_pipeline(batch))
    return results

# Alternatively, ner_pipeline(texts, batch_size=8) lets the pipeline
# handle the chunking internally.

Model Architecture

Input: Arabic Text
    ↓
Tokenization (Arabic BERT Tokenizer)
    ↓
ARAB_BERT Encoder (12 layers)
    ↓
Classification Head (11 classes)
    ↓
BIO Tag Predictions
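
You can verify the classification head from the loaded config (assumes the model is loaded as in Quick Start):

print(model.config.num_labels)  # 11
print(model.config.id2label)    # index -> BIO tag mapping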

Limitations & Considerations

  • Exact Boundary Detection: Low exact-match scores show the model struggles to place precise entity boundaries
  • Dialectal Coverage: Primarily trained on Modern Standard Arabic; dialectal text may degrade accuracy
  • Context Sensitivity: May struggle with PII whose identification depends on surrounding context
  • Performance Trade-offs: The model favors partial-overlap detection over exact span matches

Competition Context

Developed for the Maqsam Arabic PII Redaction Challenge, which addresses critical gaps in Arabic PII detection systems. The competition emphasized:

  • Token-level evaluation methodology
  • Real-world deployment considerations
  • Speed optimization for practical applications
  • Arabic-specific linguistic challenges

Evaluation Formula:

Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
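
As a minimal sketch, the formula in Python (the unit of avg_time is defined by the competition and assumed here to be seconds per example):

def final_score(precision, recall, avg_time):
    # Quality terms dominate (90%); inference speed contributes the remaining 10%.
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time)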

Citation

@misc{arabic-ner-pii-2024,
  author = {MutazYoune},
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
}
