Resume NER BERT v2 - Advanced Resume Information Extraction

A state-of-the-art Named Entity Recognition (NER) model specifically designed for extracting structured information from resumes and CVs. This model achieves 90.87% F1 score and is trained on a comprehensive dataset of 22,542 resume samples from multiple sources.

🎯 Model Description

This model has been extensively fine-tuned to extract key information from resumes including personal details, contact information, work experience, education, skills, and more. The model uses a BERT-based architecture with token classification capabilities, making it highly effective for resume parsing tasks.

Key Features:

  • High Accuracy: 90.87% F1 score on comprehensive resume parsing
  • Comprehensive Coverage: 12 entity types (25 BIO labels) covering all major resume sections
  • Large Training Dataset: 22,542 samples from multiple sources
  • Production Ready: Tested and optimized for real-world applications
  • Memory Efficient: CPU-optimized with reasonable model size (431MB)

📊 Performance Metrics

  • F1 Score: 90.87% ✅ Excellent
  • Precision: 91.44% ✅ High
  • Recall: 90.81% ✅ High
  • Training Loss: 0.2604 ✅ Low
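
These are entity-level scores. The evaluation script is not published with the model; the sketch below shows one common way to compute entity-level precision, recall, and F1 for BIO-tagged output using seqeval (pip install seqeval; the tag sequences are illustrative, not taken from the real evaluation set).

from seqeval.metrics import classification_report, f1_score, precision_score, recall_score

# Illustrative gold and predicted BIO sequences for a single sentence;
# the model's actual evaluation data is not reproduced here.
y_true = [["B-Name", "I-Name", "O", "O", "B-Skills", "O", "B-Degree"]]
y_pred = [["B-Name", "I-Name", "O", "O", "B-Skills", "O", "O"]]

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1:        {f1_score(y_true, y_pred):.4f}")
print(classification_report(y_true, y_pred))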

🏷️ Label Schema

The model recognizes 12 entity types, tagged with the BIO (Beginning-Inside-Outside) scheme for a total of 25 labels; the complete label set can be listed programmatically, as shown after the tag descriptions below:

Core Personal Information:

  • Name: Person's full name (e.g., "John Smith", "Sarah Johnson")
  • Email Address: Email contact information (e.g., "[email protected]")
  • Phone: Phone number (e.g., "(555) 123-4567", "+1-555-123-4567")
  • Location: Geographic location (e.g., "San Francisco, CA", "New York")

Professional Information:

  • Companies worked at: Previous employers and organizations (e.g., "Google", "Microsoft", "Amazon")
  • Designation: Job titles and roles (e.g., "Senior Software Engineer", "Data Scientist", "Marketing Manager")
  • Skills: Technical and soft skills (e.g., "Python", "JavaScript", "Machine Learning", "Leadership")
  • Years of Experience: Work experience duration (e.g., "5 years", "10+ years")

Educational Information:

  • Degree: Educational qualifications (e.g., "Bachelor's in Computer Science", "Master's in Data Science")
  • College Name: Educational institutions (e.g., "Stanford University", "MIT", "Harvard")
  • Graduation Year: Year of degree completion (e.g., "2020", "2018")

Additional:

  • UNKNOWN: Unclassified entities that don't fit other categories

BIO Tags:

  • B- (Beginning): Start of an entity
  • I- (Inside): Continuation of an entity
  • O (Outside): Non-entity tokens
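
A quick way to inspect the exact label set shipped with this checkpoint is to read it from the model configuration (a minimal sketch using the repository id from the Usage section below):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("yashpwr/resume-ner-bert-v2")

# id2label maps class indices to BIO tags such as "B-Name", "I-Skills", or "O"
for idx in sorted(config.id2label):
    print(idx, config.id2label[idx])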

🚀 Usage

Using the Model Directly

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example resume text
text = "John Smith is a senior software engineer with 8 years of experience at Google. He has expertise in Python, JavaScript, and machine learning. Contact: [email protected]"

# Tokenize
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=128,
    padding=True
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Extract entities (simple token-level aggregation)
entities = []
current_entity = None

# Convert the whole id sequence to tokens once
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

for i, pred in enumerate(predictions[0]):
    label = model.config.id2label[pred.item()]
    token = tokens[i]

    # Skip special tokens such as [CLS], [SEP], and padding
    if token in tokenizer.all_special_tokens:
        continue

    if label.startswith('B-'):
        if current_entity:
            entities.append(current_entity)
        current_entity = {
            'text': token,
            'label': label[2:],  # remove the 'B-' prefix
            'start': i
        }
    elif label.startswith('I-') and current_entity:
        # Re-attach WordPiece continuations ("##...") without a space
        if token.startswith('##'):
            current_entity['text'] += token[2:]
        else:
            current_entity['text'] += ' ' + token
    elif label == 'O':
        if current_entity:
            entities.append(current_entity)
            current_entity = None

if current_entity:
    entities.append(current_entity)

print("Extracted Entities:")
for entity in entities:
    print(f"- {entity['label']}: {entity['text']}")

Using the Pipeline

from transformers import pipeline

# Create NER pipeline
ner_pipeline = pipeline(
    "token-classification",
    model="yashpwr/resume-ner-bert-v2",
    aggregation_strategy="simple"
)

# Extract entities
text = "Sarah Johnson holds a Master's degree in Computer Science from Stanford University. Skills: Python, TensorFlow, SQL."
results = ner_pipeline(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.3f})")

Advanced Usage with Confidence Scores

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

# Load model
model_name = "yashpwr/resume-ner-bert-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

def extract_entities_with_confidence(text, confidence_threshold=0.5):
    """Extract entities with per-entity confidence scores."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128,
        padding=True,
        return_offsets_mapping=True
    )
    # The model's forward pass does not accept offset_mapping, so pop it first
    offset_mapping = inputs.pop("offset_mapping")[0]

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=2)
        probabilities = torch.softmax(outputs.logits, dim=2)

    entities = []
    current_entity = None

    for i, (pred, offset) in enumerate(zip(predictions[0], offset_mapping)):
        label = model.config.id2label[pred.item()]
        confidence = probabilities[0][i][pred].item()
        start, end = int(offset[0]), int(offset[1])

        # Skip special tokens ([CLS], [SEP], padding), which map to the empty span (0, 0)
        if start == 0 and end == 0:
            continue

        if label.startswith('B-'):
            if current_entity and current_entity['confidence'] >= confidence_threshold:
                entities.append(current_entity)

            current_entity = {
                'text': text[start:end],
                'label': label[2:],  # remove the 'B-' prefix
                'start': start,
                'end': end,
                'confidence': confidence
            }

        elif label.startswith('I-') and current_entity:
            if label[2:] == current_entity['label']:
                # Extend the entity by re-slicing the original text, which keeps
                # spacing, punctuation, and subword boundaries intact
                current_entity['end'] = end
                current_entity['text'] = text[current_entity['start']:end]
                current_entity['confidence'] = min(current_entity['confidence'], confidence)

        elif label == 'O':
            if current_entity and current_entity['confidence'] >= confidence_threshold:
                entities.append(current_entity)
            current_entity = None

    if current_entity and current_entity['confidence'] >= confidence_threshold:
        entities.append(current_entity)

    return entities

# Example usage
text = "Michael Brown is a marketing manager with 10 years of experience at Coca-Cola. Contact: [email protected]"
entities = extract_entities_with_confidence(text, confidence_threshold=0.3)

for entity in entities:
    print(f"{entity['label']}: '{entity['text']}' (confidence: {entity['confidence']:.3f})")

📚 Training Details

Dataset Composition

  • Total Samples: 22,542
  • Sources:
    • Resume-Corpus Dataset: 349 samples (structured resume data)
    • DataTurks Resume NER: 420 samples (manually annotated resumes)
    • Custom Training Data: 21,773 samples (rule-based extraction from conversation data)
    • Mehyaar Skills Dataset: Integrated skills-focused data

Training Configuration

  • Base Model: yashpwr/resume-ner-bert
  • Learning Rate: 3e-5
  • Batch Size: 4 (effective: 32 with gradient accumulation)
  • Max Sequence Length: 128 tokens
  • Epochs: 1.0 (early stopping applied)
  • Device: CPU (optimized for memory efficiency)
  • Gradient Accumulation Steps: 8
  • Optimizer: AdamW
  • Loss Function: Cross-Entropy Loss
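
The original training script is not distributed with this model; the sketch below shows how the hyperparameters above could be expressed with the Hugging Face TrainingArguments API (the output directory name is illustrative):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="resume-ner-bert-v2-finetune",  # illustrative path
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size of 4 x 8 = 32
    num_train_epochs=1.0,
    no_cuda=True,                    # CPU-only training (use_cpu=True in newer Transformers releases)
    logging_steps=100,
    save_strategy="epoch",
)

AdamW and the cross-entropy loss listed above are the Trainer defaults for token classification, so they need no additional configuration in this sketch.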

Training Process Improvements

The model was trained using a comprehensive pipeline that addressed several key challenges:

  1. ✅ Tokenization Consistency: Used bert-base-cased throughout the pipeline to ensure consistent tokenization between training and inference
  2. ✅ Entity Extraction Enhancement: Implemented proper character-to-token alignment using return_offsets_mapping=True for accurate text reconstruction (a preprocessing sketch follows this list)
  3. ✅ Label Mapping: Unified diverse label schemas into the DataTurks format (25 labels) for consistency
  4. ✅ Performance Optimization: CPU-optimized training with memory efficiency and gradient accumulation
  5. ✅ Dataset Integration: Successfully integrated 21,773 additional samples from conversation data using rule-based extraction
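
A minimal sketch of the character-to-token alignment idea from point 2, assuming DataTurks-style character-span annotations (the sample text, spans, and helper name are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def align_spans_to_bio(text, spans, max_length=128):
    """spans: list of (start, end, label) character-level annotations."""
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_offsets_mapping=True)
    labels, prev = [], None
    for start, end in enc["offset_mapping"]:
        if start == end:          # special tokens such as [CLS] and [SEP]
            labels.append("O")
            prev = None
            continue
        span = next((s for s in spans if start >= s[0] and end <= s[1]), None)
        if span is None:
            labels.append("O")
        elif span is prev:        # continuation of the same annotated span
            labels.append("I-" + span[2])
        else:
            labels.append("B-" + span[2])
        prev = span
    return enc, labels

enc, labels = align_spans_to_bio(
    "John Smith worked at Google",
    [(0, 10, "Name"), (21, 27, "Companies worked at")],
)
print(list(zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), labels)))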

🎯 Use Cases

Primary Applications

  • Recruitment Platforms: Automated resume parsing and candidate screening
  • Resume Parsing Engines: Extract structured data from unstructured resumes
  • Talent Analytics Tools: Analyze candidate skills and experience patterns
  • Document Processing Pipelines: Integrate with HR systems and ATS
  • ATS (Applicant Tracking Systems): Automated candidate data extraction and categorization

Industry Sectors

  • Human Resources & Recruitment: Streamline hiring processes
  • Technology & Software Development: Technical skill assessment
  • Finance & Banking: Compliance and background verification
  • Healthcare: Medical credential verification
  • Education: Academic credential processing
  • Government & Public Sector: Public service recruitment

Specific Use Cases

  1. Automated Resume Screening: Filter candidates based on skills and experience
  2. Data Migration: Convert legacy resume databases to structured formats
  3. Compliance Checking: Verify educational and professional credentials
  4. Skill Gap Analysis: Identify missing skills in candidate pools
  5. Market Research: Analyze job market trends and skill demands

⚠️ Limitations

  1. Language: Currently optimized for English resumes only
  2. Format: Works best with text-based resumes (PDF conversion may be required)
  3. Domain: Primarily trained on technology and business resumes
  4. Length: Optimal for resumes under 512 tokens (longer texts are truncated; see the chunking sketch after this list)
  5. Accuracy: 90.87% F1 score - may miss some entities in complex or non-standard formats
  6. Context: Limited to resume-specific entities (not general NER)
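
For limitation 4, longer documents can be processed in overlapping windows instead of being truncated. A minimal sketch, assuming the ner_pipeline object created in the Usage section (window and overlap sizes are illustrative):

def ner_long_text(text, window_chars=1000, overlap_chars=200):
    """Run the NER pipeline over overlapping character windows of a long resume."""
    entities = []
    start = 0
    while start < len(text):
        chunk = text[start:start + window_chars]
        for ent in ner_pipeline(chunk):
            ent = dict(ent)
            ent["start"] += start   # map offsets back to the full document
            ent["end"] += start
            entities.append(ent)
        if start + window_chars >= len(text):
            break
        start += window_chars - overlap_chars
    return entities

Entities detected twice inside an overlap region should be deduplicated by their character offsets before downstream use.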

🔧 Technical Requirements

System Requirements

  • Python: 3.8 or higher
  • PyTorch: 1.9 or higher
  • Transformers: 4.20 or higher
  • Memory: 2GB+ RAM recommended
  • Storage: 431MB model size
  • CPU: Multi-core recommended for inference

Dependencies

pip install transformers torch datasets scikit-learn numpy

Installation

# Install required packages
pip install transformers[torch] datasets scikit-learn

# Or using conda
conda install pytorch transformers -c pytorch

📄 License

This model is licensed under the Apache 2.0 License. This means you can:

  • Use the model for commercial purposes
  • Modify and distribute the model
  • Use it in proprietary software
  • Distribute modified versions

See the LICENSE file for complete details.

🤝 Contributing

We welcome contributions from the community! Here's how you can contribute:

  1. Report Issues: Create an issue for bugs or feature requests
  2. Submit Improvements: Fork the repository and submit pull requests
  3. Share Datasets: Contribute additional training data
  4. Documentation: Help improve documentation and examples
  5. Testing: Test the model on different resume formats

📞 Support

For questions, issues, or support:

  • GitHub Issues: Create an issue on the model repository
  • Hugging Face Discussions: Use the discussion tab on the model page
  • Email: Contact through the Hugging Face profile
  • Documentation: Check the model card and examples

πŸ™ Acknowledgments

  • Base Model: yashpwr/resume-ner-bert for the foundation architecture
  • Datasets: Resume-Corpus, DataTurks, and custom training data contributors
  • Hugging Face: For the transformers library and platform
  • Open Source Community: For contributions and feedback
  • Research Community: For advancing NER and information extraction techniques

📈 Model Evolution

Version History

  • v1: Initial release with basic resume parsing
  • v2: Comprehensive model with 22,542 samples and 90.87% F1 score

Future Improvements

  • Multi-language support
  • Enhanced entity types
  • Better handling of complex resume formats
  • Integration with document processing pipelines
  • Real-time inference optimization

Last Updated: August 7, 2025
Version: v2
Status: Production Ready ✅
License: Apache 2.0
Repository: https://huggingface.co/yashpwr/resume-ner-bert-v2
