Arabic NER PII
Personally Identifiable Information Detection for Arabic Text
Overview
A BERT-based token classification model fine-tuned to detect Personally Identifiable Information (PII) in Arabic text. It targets challenges specific to Arabic NLP, including morphological complexity and the absence of capitalization cues for named entities.
Base Model: MutazYoune/ARAB_BERT | Task: Token Classification | Language: Arabic
Quick Start
```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model
tokenizer = AutoTokenizer.from_pretrained("MutazYoune/Arabic-NER-PII")
model = AutoModelForTokenClassification.from_pretrained("MutazYoune/Arabic-NER-PII")

# Create pipeline; "simple" aggregation merges sub-word tokens into entity spans
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Detect PII
# English: "Ahmed Mohammed works at Google in Riyadh and his phone number is 0501234567"
text = "يعمل أحمد محمد في شركة جوجل في الرياض ورقم هاتفه 0501234567"
entities = ner_pipeline(text)
print(entities)
```
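Since the model targets PII redaction, a natural next step is to mask the detected spans. The following is a minimal sketch, not part of the model's own tooling. It assumes each entity dictionary carries the `entity_group`, `start`, and `end` keys that the `transformers` NER pipeline emits when aggregation is enabled and a fast tokenizer is used, and that entities do not overlap. The `redact` helper and its mask format are hypothetical.

```python
def redact(text, entities, mask="[{label}]"):
    """Replace each detected span with a label placeholder.

    Assumes non-overlapping entities with integer 'start'/'end' offsets.
    """
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = mask.format(label=ent["entity_group"])
        redacted = redacted[:ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


# Hand-written entity for illustration only; real values come from ner_pipeline(text)
sample_text = "call 0501234567 now"
sample_entities = [{"entity_group": "NUMERIC_ID", "start": 5, "end": 15}]
redacted_text = redact(sample_text, sample_entities)
print(redacted_text)
```

With real pipeline output, the call would be `redact(text, ner_pipeline(text))`.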
Supported Entities
| Entity | Description | Examples |
|---|---|---|
| CONTACT | Email addresses, phone numbers | [email protected], 0501234567 |
| NETWORK | IP addresses, network identifiers | 192.168.1.1, 10-20-30-40 |
| IDENTIFIER | National IDs, structured identifiers | ID_123456, user.name |
| NUMERIC_ID | Numeric identifiers | 123456789, 12-34-56 |
| PII | Generic personal information | Names, personal details |
Performance
Maqsam Arabic PII Redaction Challenge - Rank #16
| Metric | Exact | Partial | IoU50 |
|---|---|---|---|
| Precision | 0.029 | 0.647 | 0.295 |
| Recall | 0.020 | 0.455 | 0.208 |
| F1 | 0.024 | 0.534 | 0.244 |
Overall Score: 0.5341
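The F1 rows in the table follow from the precision and recall rows through the usual harmonic mean. The short check below recomputes them from the rounded values reported above; the `f1_score` helper is illustrative and not taken from the competition's scoring code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
reported = {
    "exact": (0.029, 0.020),
    "partial": (0.647, 0.455),
    "iou50": (0.295, 0.208),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")
```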
Training Details
Dataset
- Source: Maqsam Arabic PII Redaction Competition Dataset
- Size: 20,000 sentences (10k original + 10k LLM-augmented)
- Annotation: BIO tagging scheme with regex pattern matching
- Labels: 11 total (O + B-/I- for each entity type)
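The label count can be reproduced from the five entity types: one `O` tag plus a `B-` and an `I-` tag per type gives 11. The sketch below builds such a label set. The ordering is an assumption made for illustration; the authoritative mapping is whatever `model.config.id2label` contains.

```python
# Entity types as listed under Supported Entities
ENTITY_TYPES = ["CONTACT", "NETWORK", "IDENTIFIER", "NUMERIC_ID", "PII"]


def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")  # first token of an entity
        labels.append(f"I-{entity}")  # continuation token of the same entity
    return labels


labels = build_bio_labels(ENTITY_TYPES)
# Hypothetical id mapping; the trained model may use a different order
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}
print(len(labels), labels)
```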
Training Configuration
```yaml
base_model: MutazYoune/ARAB_BERT
epochs: 12
batch_size: 16
learning_rate: 3e-5
max_length: 512
optimization: AdamW
```
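As a rough sense of scale, the configuration implies a number of optimizer steps that can be estimated from the dataset size. The calculation below assumes all 20,000 sentences were used for training, with no held-out split, no gradient accumulation, and a single device. None of these details are stated in this card, so treat the result as an upper-bound estimate.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps, counting a final partial batch per epoch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 20,000 sentences, batch size 16, 12 epochs (values from the sections above)
estimated_steps = total_optimizer_steps(20_000, 16, 12)
print(estimated_steps)
```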
Pattern Recognition
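```python
PATTERNS = {
    # Email addresses, or http/https/ftp URLs
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    # Four dot-separated or four hyphen-separated number groups
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    # letters_letters with optional trailing digits, or letters.letters
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    # Two hyphen-separated digit groups, or any run of six or more digits
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}'
}
```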
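To see how these patterns behave on a sentence, the sketch below scans a text with each regex and collects labeled character spans. The patterns overlap (for example, `IDENTIFIER` also matches fragments of an email address), and this card does not say how conflicts were resolved during annotation. The rule used here, where earlier dictionary entries win, is an assumption, and `find_pattern_spans` is a hypothetical helper.

```python
import re

# Same patterns as listed above
PATTERNS = {
    "CONTACT": r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|(?:https?|ftp)://[^\s/$.?#].[^\s]*',
    "NETWORK": r'\d+\.\d+\.\d+\.\d+|\d+\-\d+\-\d+\-\d+',
    "IDENTIFIER": r'[a-zA-Z]+_[a-zA-Z]+\d*|[a-zA-Z]+\.[a-zA-Z]+',
    "NUMERIC_ID": r'\d+\-\d+|\d{6,}',
}


def find_pattern_spans(text):
    """Return sorted (start, end, label) spans, skipping overlaps.

    Assumption: patterns listed earlier in PATTERNS take priority.
    """
    accepted = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            start, end = match.span()
            # Keep the match only if it is disjoint from every accepted span
            if all(end <= s or start >= e for s, e, _ in accepted):
                accepted.append((start, end, label))
    return sorted(accepted)


sample = "contact a.b@example.com ip 192.168.1.1 phone 0501234567"
found = [(sample[s:e], label) for s, e, label in find_pattern_spans(sample)]
print(found)
```

Under these regexes a bare phone number such as `0501234567` is picked up by `NUMERIC_ID`, because the `CONTACT` pattern covers only email addresses and URLs.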
Advanced Usage
Custom Processing Pipeline
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

def process_arabic_text(text, model, tokenizer):
    # Tokenize, truncating to the 512-token training length
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)

    # Pick the highest-scoring label for each token
    predictions = torch.argmax(outputs.logits, dim=-1)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

    # Filter out special tokens
    results = []
    for token, label in zip(tokens, labels):
        if token not in ['[CLS]', '[SEP]', '[PAD]']:
            results.append((token, label))
    return results
```
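The function above returns one `(token, label)` pair per sub-word token. If the tokenizer uses the WordPiece convention of prefixing continuation pieces with `##` (an assumption here, typical for BERT tokenizers but not confirmed in this card), the pairs can be merged back into whole words. The hypothetical helper below keeps the label of the first piece of each word.

```python
def merge_wordpieces(pairs):
    """Merge '##' continuation pieces into the preceding token.

    pairs: list of (token, label); the first piece's label is kept.
    """
    words = []
    for token, label in pairs:
        if token.startswith("##") and words:
            prev_token, prev_label = words[-1]
            # Glue the continuation piece onto the previous word
            words[-1] = (prev_token + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hand-written pairs for illustration; real input comes from process_arabic_text
example_pairs = [("play", "B-PII"), ("##ing", "I-PII"), ("now", "O")]
merged = merge_wordpieces(example_pairs)
print(merged)
```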
Batch Processing
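```python
from transformers import pipeline

def batch_process_texts(texts, model, tokenizer, batch_size=8):
    # Build the pipeline from the arguments instead of relying on a global variable
    ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
    # Passing a list lets the pipeline batch inputs internally;
    # the result holds one list of entities per input text
    return ner(list(texts), batch_size=batch_size)
```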
Model Architecture
Input: Arabic Text
↓
Tokenization (Arabic BERT Tokenizer)
↓
ARAB_BERT Encoder (12 layers)
↓
Classification Head (11 classes)
↓
BIO Tag Predictions
Limitations & Considerations
- Exact Boundary Detection: Exact-match F1 is 0.024 against a partial-match F1 of 0.534, so predicted spans usually overlap the true entity but rarely match its boundaries exactly
- Dialectal Coverage: Primarily trained on Modern Standard Arabic
- Context Sensitivity: May struggle with context-dependent PII identification
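- Performance Trade-offs: The gap between partial and exact scores suggests the model fits use cases that tolerate approximate spans better than those that require exact boundaries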
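The difference between exact, partial, and IoU50 matching can be illustrated on character spans. The definitions below are assumptions chosen for illustration (exact means identical spans, partial means any overlap, IoU50 means intersection-over-union of at least 0.5); the competition's own scoring rules may differ in detail.

```python
def span_iou(pred, gold):
    """Intersection-over-union of two (start, end) character spans."""
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union else 0.0


def match_kinds(pred, gold):
    """Classify a predicted span against a gold span under three criteria."""
    iou = span_iou(pred, gold)
    return {
        "exact": pred == gold,   # identical boundaries
        "partial": iou > 0,      # any overlap at all
        "iou50": iou >= 0.5,     # overlap covers at least half the union
    }


gold_span = (10, 20)
close_miss = match_kinds((12, 20), gold_span)   # boundary off by two characters
loose_miss = match_kinds((18, 30), gold_span)   # only a small overlap
print(close_miss, loose_miss)
```

A prediction that misses a boundary by a couple of characters fails the exact criterion while still counting under the other two, which is consistent with the pattern in the reported scores.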
Competition Context
The model was developed for the Maqsam Arabic PII Redaction Challenge, which addresses critical gaps in Arabic PII detection systems. The competition emphasized:
- Token-level evaluation methodology
- Real-world deployment considerations
- Speed optimization for practical applications
- Arabic-specific linguistic challenges
Evaluation Formula:
Final Score = 0.45 × Precision + 0.45 × Recall + 0.1 × (1/avg_time)
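The formula translates directly into a small function. The sketch below is illustrative only: the unit of `avg_time` is not stated in this card, and the value used in the example is made up rather than a measured latency.

```python
def final_score(precision, recall, avg_time):
    """Final Score = 0.45 * Precision + 0.45 * Recall + 0.1 * (1 / avg_time)."""
    if avg_time <= 0:
        raise ValueError("avg_time must be positive")
    return 0.45 * precision + 0.45 * recall + 0.1 * (1.0 / avg_time)


# Partial-match precision and recall from the Performance table;
# avg_time=2.0 is a hypothetical value for illustration
example_score = final_score(0.647, 0.455, avg_time=2.0)
print(f"{example_score:.4f}")
```

Because the speed term is proportional to `1/avg_time`, halving the average time doubles that term, while the accuracy terms are capped at 0.9 combined.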
Citation
```bibtex
@misc{arabic-ner-pii-2024,
  author = {MutazYoune},
  title = {Arabic NER PII: Personally Identifiable Information Detection for Arabic Text},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/MutazYoune/Arabic-NER-PII}}
}
```
Resources
- Base Model: MutazYoune/ARAB_BERT
- Competition: Maqsam Arabic PII Redaction Challenge
- Dataset: Maqsam/ArabicPIIRedaction