# Resume NER BERT v2 - Advanced Resume Information Extraction
A state-of-the-art Named Entity Recognition (NER) model designed specifically for extracting structured information from resumes and CVs. It achieves a 90.87% F1 score and was trained on a comprehensive dataset of 22,542 resume samples drawn from multiple sources.
## Model Description
This model has been extensively fine-tuned to extract key information from resumes, including personal details, contact information, work experience, education, skills, and more. It uses a BERT-based architecture with a token-classification head, making it highly effective for resume parsing tasks.
### Key Features
- High Accuracy: 90.87% F1 score on comprehensive resume parsing
- Comprehensive Coverage: 25 BIO labels spanning 12 entity types across all major resume sections
- Large Training Dataset: 22,542 samples from multiple sources
- Production Ready: Tested and optimized for real-world applications
- Memory Efficient: CPU-optimized with reasonable model size (431MB)
## Performance Metrics
| Metric | Score | Status |
|---|---|---|
| F1 Score | 90.87% | Excellent |
| Precision | 91.44% | High |
| Recall | 90.81% | High |
| Training Loss | 0.2604 | Low |
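These are the standard entity-level metrics for sequence labeling. As a minimal sketch of how such scores are typically computed from BIO-tagged predictions with the `seqeval` library (installed separately via `pip install seqeval`); the model's actual evaluation split is not published here, so the tag sequences below are purely illustrative:

```python
# Illustrative only: entity-level precision/recall/F1 over BIO tag sequences.
from seqeval.metrics import precision_score, recall_score, f1_score

# One list of BIO tags per evaluated resume (toy data, not the real eval split)
y_true = [["B-Name", "I-Name", "O", "O", "B-Skills"]]
y_pred = [["B-Name", "I-Name", "O", "B-Skills", "B-Skills"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```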
## Label Schema

The model tags tokens with a 25-label BIO (Beginning-Inside-Outside) scheme covering 12 entity types:
### Core Personal Information
- Name: Person's full name (e.g., "John Smith", "Sarah Johnson")
- Email Address: Email contact information (e.g., "[email protected]")
- Phone: Phone number (e.g., "(555) 123-4567", "+1-555-123-4567")
- Location: Geographic location (e.g., "San Francisco, CA", "New York")
### Professional Information
- Companies worked at: Previous employers and organizations (e.g., "Google", "Microsoft", "Amazon")
- Designation: Job titles and roles (e.g., "Senior Software Engineer", "Data Scientist", "Marketing Manager")
- Skills: Technical and soft skills (e.g., "Python", "JavaScript", "Machine Learning", "Leadership")
- Years of Experience: Work experience duration (e.g., "5 years", "10+ years")
### Educational Information
- Degree: Educational qualifications (e.g., "Bachelor's in Computer Science", "Master's in Data Science")
- College Name: Educational institutions (e.g., "Stanford University", "MIT", "Harvard")
- Graduation Year: Year of degree completion (e.g., "2020", "2018")
### Additional
- UNKNOWN: Unclassified entities that don't fit other categories
### BIO Tags

- `B-` (Beginning): start of an entity
- `I-` (Inside): continuation of an entity
- `O` (Outside): non-entity tokens
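The full 25-tag inventory is stored in the model configuration and can be listed directly; a short sketch:

```python
# Print every label the model can emit, straight from the model config.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("yashpwr/resume-ner-bert-v2")
for idx, label in sorted(model.config.id2label.items()):
    print(idx, label)
```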
## Usage

### Using the Model Directly
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example resume text
text = "John Smith is a senior software engineer with 8 years of experience at Google. He has expertise in Python, JavaScript, and machine learning. Contact: [email protected]"

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding=True
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Walk the BIO-tagged tokens and group them into entities.
# Note: these are WordPiece tokens, so entity text may contain '##' subword
# fragments; the offset-mapping example further below reconstructs clean text.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
entities = []
current_entity = None

for i, pred in enumerate(predictions[0]):
    label = model.config.id2label[pred.item()]
    token = tokens[i]

    if label.startswith("B-"):
        if current_entity:
            entities.append(current_entity)
        current_entity = {
            "text": token,
            "label": label[2:],  # strip the 'B-' prefix
            "start": i,
        }
    elif label.startswith("I-") and current_entity:
        current_entity["text"] += " " + token
    elif label == "O":
        if current_entity:
            entities.append(current_entity)
        current_entity = None

if current_entity:
    entities.append(current_entity)

print("Extracted Entities:")
for entity in entities:
    print(f"- {entity['label']}: {entity['text']}")
```
### Using the Pipeline
```python
from transformers import pipeline

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model="yashpwr/resume-ner-bert-v2",
    aggregation_strategy="simple"
)

# Extract entities
text = "Sarah Johnson holds a Master's degree in Computer Science from Stanford University. Skills: Python, TensorFlow, SQL."
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.3f})")
```
### Advanced Usage with Confidence Scores
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

def extract_entities_with_confidence(text, confidence_threshold=0.5):
    """Extract entities with character offsets and confidence scores."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
        padding=True,
        return_offsets_mapping=True
    )
    # The model's forward() does not accept offset_mapping, so pop it first
    offset_mapping = inputs.pop("offset_mapping")[0]

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)
        probabilities = torch.softmax(outputs.logits, dim=2)

    entities = []
    current_entity = None

    for i, (pred, offset) in enumerate(zip(predictions[0], offset_mapping)):
        label = model.config.id2label[pred.item()]
        confidence = probabilities[0][i][pred].item()
        start, end = int(offset[0]), int(offset[1])

        # Special tokens ([CLS], [SEP], padding) have an empty (0, 0) offset
        if start == 0 and end == 0:
            continue

        if label.startswith("B-"):
            if current_entity and current_entity["confidence"] >= confidence_threshold:
                entities.append(current_entity)
            current_entity = {
                "text": text[start:end],
                "label": label[2:],  # strip the 'B-' prefix
                "start": start,
                "end": end,
                "confidence": confidence,
            }
        elif label.startswith("I-") and current_entity:
            if label[2:] == current_entity["label"]:
                current_entity["text"] += " " + text[start:end]
                current_entity["end"] = end
                current_entity["confidence"] = min(current_entity["confidence"], confidence)
        elif label == "O":
            if current_entity and current_entity["confidence"] >= confidence_threshold:
                entities.append(current_entity)
            current_entity = None

    if current_entity and current_entity["confidence"] >= confidence_threshold:
        entities.append(current_entity)

    return entities

# Example usage
text = "Michael Brown is a marketing manager with 10 years of experience at Coca-Cola. Contact: [email protected]"
entities = extract_entities_with_confidence(text, confidence_threshold=0.3)

for entity in entities:
    print(f"{entity['label']}: '{entity['text']}' (confidence: {entity['confidence']:.3f})")
```
## Training Details

### Dataset Composition
- Total Samples: 22,542
- Sources:
  - Resume-Corpus Dataset: 349 samples (structured resume data)
  - DataTurks Resume NER: 420 samples (manually annotated resumes)
  - Custom Training Data: 21,773 samples (rule-based extraction from conversation data)
  - Mehyaar Skills Dataset: integrated skills-focused data
### Training Configuration

- Base Model: `yashpwr/resume-ner-bert`
- Learning Rate: 3e-5
- Batch Size: 4 (effective: 32 with gradient accumulation)
- Max Sequence Length: 128 tokens
- Epochs: 1.0 (early stopping applied)
- Device: CPU (optimized for memory efficiency)
- Gradient Accumulation Steps: 8
- Optimizer: AdamW
- Loss Function: Cross-Entropy Loss
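A rough sketch of how the configuration above maps onto Hugging Face `TrainingArguments`. The actual training script is not published with this card, so the output path is a placeholder and dataset loading, label alignment, max_length=128 tokenization, and early stopping are omitted:

```python
# Sketch only: hyperparameters copied from the list above; data handling is omitted.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "yashpwr/resume-ner-bert"  # base checkpoint named above
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForTokenClassification.from_pretrained(base)

args = TrainingArguments(
    output_dir="resume-ner-bert-v2",   # placeholder output path
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch size of 32
    num_train_epochs=1.0,
    # AdamW and cross-entropy loss are the Trainer defaults for token classification
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```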
### Training Process Improvements
The model was trained using a comprehensive pipeline that addressed several key challenges:
- Tokenization Consistency: used `bert-base-cased` throughout the pipeline to ensure consistent tokenization between training and inference
- Entity Extraction Enhancement: implemented proper character-to-token alignment using `return_offsets_mapping=True` for accurate text reconstruction
- Label Mapping: unified diverse label schemas into the DataTurks format (25 labels) for consistency
- Performance Optimization: CPU-optimized training with memory efficiency and gradient accumulation
- Dataset Integration: successfully integrated 21,773 additional samples from conversation data using rule-based extraction
## Use Cases

### Primary Applications
- Recruitment Platforms: Automated resume parsing and candidate screening
- Resume Parsing Engines: Extract structured data from unstructured resumes
- Talent Analytics Tools: Analyze candidate skills and experience patterns
- Document Processing Pipelines: Integrate with HR systems and ATS
- ATS (Applicant Tracking Systems): Automated candidate data extraction and categorization
### Industry Sectors
- Human Resources & Recruitment: Streamline hiring processes
- Technology & Software Development: Technical skill assessment
- Finance & Banking: Compliance and background verification
- Healthcare: Medical credential verification
- Education: Academic credential processing
- Government & Public Sector: Public service recruitment
### Specific Use Cases
- Automated Resume Screening: Filter candidates based on skills and experience
- Data Migration: Convert legacy resume databases to structured formats
- Compliance Checking: Verify educational and professional credentials
- Skill Gap Analysis: Identify missing skills in candidate pools
- Market Research: Analyze job market trends and skill demands
## Limitations
- Language: Currently optimized for English resumes only
- Format: Works best with text-based resumes (PDF conversion may be required)
- Domain: Primarily trained on technology and business resumes
- Length: Optimal for resumes under 512 tokens; longer texts are truncated (see the chunking sketch after this list)
- Accuracy: 90.87% F1 score - may miss some entities in complex or non-standard formats
- Context: Limited to resume-specific entities (not general NER)
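For resumes longer than the sequence-length limit noted above, one common workaround (not part of this model's tooling) is to run the pipeline over overlapping character windows and map the offsets back to the full document. A minimal sketch with illustrative window sizes; entities that are detected twice in an overlap are not deduplicated here:

```python
# Sketch: run the NER pipeline over overlapping windows of a long resume.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="yashpwr/resume-ner-bert-v2",
    aggregation_strategy="simple",
)

def extract_from_long_text(text, window_chars=1000, overlap_chars=200):
    """Naive character-based windowing; entities spanning a window boundary may be split."""
    entities = []
    start = 0
    while start < len(text):
        chunk = text[start:start + window_chars]
        for ent in ner(chunk):
            ent = dict(ent)
            ent["start"] += start  # map offsets back to the full document
            ent["end"] += start
            entities.append(ent)
        start += window_chars - overlap_chars
    return entities
```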
## Technical Requirements

### System Requirements
- Python: 3.8 or higher
- PyTorch: 1.9 or higher
- Transformers: 4.20 or higher
- Memory: 2GB+ RAM recommended
- Storage: 431MB model size
- CPU: Multi-core recommended for inference
### Dependencies

```bash
pip install transformers torch datasets scikit-learn numpy
```
### Installation

```bash
# Install required packages
pip install transformers[torch] datasets scikit-learn

# Or using conda
conda install pytorch transformers -c pytorch
```
## License
This model is licensed under the Apache 2.0 License. This means you can:
- Use the model for commercial purposes
- Modify and distribute the model
- Use it in proprietary software
- Distribute modified versions
See the LICENSE file for complete details.
## Contributing
We welcome contributions from the community! Here's how you can contribute:
- Report Issues: Create an issue for bugs or feature requests
- Submit Improvements: Fork the repository and submit pull requests
- Share Datasets: Contribute additional training data
- Documentation: Help improve documentation and examples
- Testing: Test the model on different resume formats
## Support
For questions, issues, or support:
- GitHub Issues: Create an issue on the model repository
- Hugging Face Discussions: Use the discussion tab on the model page
- Email: Contact through the Hugging Face profile
- Documentation: Check the model card and examples
## Acknowledgments

- Base Model: `yashpwr/resume-ner-bert` for the foundation architecture
- Datasets: Resume-Corpus, DataTurks, and custom training data contributors
- Hugging Face: For the transformers library and platform
- Open Source Community: For contributions and feedback
- Research Community: For advancing NER and information extraction techniques
## Model Evolution

### Version History
- v1: Initial release with basic resume parsing
- v2: Comprehensive model with 22,542 samples and 90.87% F1 score
### Future Improvements
- Multi-language support
- Enhanced entity types
- Better handling of complex resume formats
- Integration with document processing pipelines
- Real-time inference optimization
Last Updated: August 7, 2025
Version: v2
Status: Production Ready
License: Apache 2.0
Repository: https://huggingface.co/yashpwr/resume-ner-bert-v2