|
--- |
|
library_name: transformers |
|
license: mit |
|
datasets: |
|
- ahmedheakl/resume-atlas |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
- f1 |
|
- recall |
|
- precision |
|
pipeline_tag: text-classification |
|
--- |
|
|
|
# How to use |
|
|
|
In this example, we run inference on a sample from our dataset (_ResumeAtlas_). You can increase `max_length` (up to the model's 512-token limit) for more accurate predictions.
|
|
|
```python |
|
!pip install datasets

import numpy as np
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from datasets import load_dataset
from sklearn import preprocessing

dataset_id = 'ahmedheakl/resume-atlas'
model_id = 'ahmedheakl/bert-resume-classification'
label_column = 'Category'
num_labels = 43
output_attentions = False
output_hidden_states = False
do_lower_case = True
add_special_tokens = True
max_length = 512
padding = 'max_length'  # replaces the deprecated pad_to_max_length flag
return_attention_mask = True
truncation = True

ds = load_dataset(dataset_id, trust_remote_code=True)

# Fit a label encoder on the training categories so predicted class ids
# can be mapped back to category names.
le = preprocessing.LabelEncoder()
le.fit(ds['train'][label_column])

tokenizer = BertTokenizer.from_pretrained(model_id, do_lower_case=do_lower_case)
model = BertForSequenceClassification.from_pretrained(
    model_id,
    num_labels=num_labels,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
)

# Use a GPU if one is available, otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()
sent = ds['train'][0]['Text']

# Tokenize one resume, padding/truncating it to max_length tokens.
encoded_dict = tokenizer.encode_plus(
    sent,
    add_special_tokens=add_special_tokens,
    max_length=max_length,
    padding=padding,
    return_attention_mask=return_attention_mask,
    return_tensors='pt',
    truncation=truncation,
)
input_ids = encoded_dict['input_ids'].to(device)
attention_mask = encoded_dict['attention_mask'].to(device)

with torch.no_grad():
    outputs = model(
        input_ids,
        token_type_ids=None,
        attention_mask=attention_mask,
    )

# Pick the highest-scoring class and map it back to its category name.
label_id = np.argmax(outputs['logits'].cpu().numpy(), axis=1)
print(f'Predicted: {le.inverse_transform(label_id)[0]} | Ground: {ds["train"][0][label_column]}')
|
``` |
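The card lists accuracy, F1, recall, and precision as metrics. As a minimal sketch of how you might compute them yourself, the loop below runs the same preprocessing in batches over the test split. It assumes the dataset exposes a `test` split with the same `Text` and `Category` columns; the batch size is illustrative.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

texts = ds['test']['Text']
true_ids = le.transform(ds['test'][label_column])  # assumes all test labels appear in train

pred_ids = []
batch_size = 32  # illustrative; tune to your GPU memory
for start in range(0, len(texts), batch_size):
    batch = tokenizer(
        texts[start:start + batch_size],
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='pt',
    ).to(device)
    with torch.no_grad():
        logits = model(**batch).logits
    pred_ids.extend(logits.argmax(dim=1).cpu().tolist())

precision, recall, f1, _ = precision_recall_fscore_support(
    true_ids, pred_ids, average='weighted'
)
print(f'Accuracy: {accuracy_score(true_ids, pred_ids):.4f} | '
      f'Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}')
```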
|
|
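Note that mapping class ids back to names above requires downloading the whole dataset just to refit the label encoder. A minimal sketch of one way around this, assuming you can run the fitting step once (the file name is illustrative), is to persist the fitted encoder with joblib:

```python
import joblib

# Save the encoder once, right after fitting it on the training labels.
joblib.dump(le, 'resume_label_encoder.joblib')  # illustrative path

# At serving time, reload it without touching the dataset.
le = joblib.load('resume_label_encoder.joblib')
print(le.inverse_transform([0])[0])  # category name for class id 0
```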
|
# Model Card for bert-resume-classification
|
|
|
**Please see the paper and code for more information:**
|
- https://github.com/noran-mohamed/Resume-Classification-Dataset |
|
- https://arxiv.org/abs/2406.18125 |
|
|
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
``` |
|
@article{heakl2024resumeatlas, |
|
title={ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models}, |
|
author={Heakl, Ahmed and Mohamed, Youssef and Mohamed, Noran and Sharkaway, Ali and Zaky, Ahmed}, |
|
journal={arXiv preprint arXiv:2406.18125}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
**APA:** |
|
``` |
|
Heakl, A., Mohamed, Y., Mohamed, N., Sharkaway, A., & Zaky, A. (2024). ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2406.18125 |
|
``` |
|
|
|
|
|
## Model Card Authors
|
|
|
Email: [email protected] |
|
LinkedIn: https://linkedin.com/in/ahmed-heakl