Best results from combining with en_core_web_trf
Dear Jean-Baptiste,
first of all, thanks for bringing this to life! Great work! It shows really good performance on NER tasks for English texts indeed. I see the advantage over spaCy's transformer model in general, but I also find that en_core_web_trf delivers incremental value by recognizing different entities, especially when the input text doesn't come with perfect grammar but is rather short form, e.g.: "John V. marketing feedback, shop mgmt. report, OMS deadline".
Actually, I get the best NER results when combining both models as follows:
- Run roberta-ner first,
- Map the entity labels to spaCy's label convention,
- Add the entities through a separate spaCy pipe (add_pipe("entity_ruler", before="ner")),
- Run NER again with spaCy's pre-trained transformer model.
This finds all three entities in the text above (John V., shop mgmt., OMS), whereas either model alone would only find two out of three.
Again, thanks for your work! Just wanted to add this in case it's helpful for anybody.
A brief code example below:
import spacy
from transformers import pipeline

# 1. Run the RoBERTa NER model first
hf_pipeline = pipeline("ner", model=roberta_model_path, tokenizer=roberta_model_path, aggregation_strategy="simple")
hf_entities = hf_pipeline(text)

# 2. Load spaCy's transformer model and add an entity_ruler before its ner component
nlp = spacy.load("en_core_web_trf")
ruler = nlp.add_pipe("entity_ruler", before="ner")

# 3. Map the RoBERTa labels to spaCy's label convention
hf_to_spacy_labels = {
    "PER": "PERSON",
    "LOC": "GPE",
    "ORG": "ORG",
    "MISC": "NORP",
    "DATE": "DATE"
}

# 4. Turn the RoBERTa entities into token patterns for the ruler
patterns = []
for ent in hf_entities:
    entity_text = ent["word"]
    if not is_valid_entity(entity_text):  # user-defined sanity filter
        continue  # skip
    label = hf_to_spacy_labels.get(ent["entity_group"], ent["entity_group"])
    token_pattern = [{"LOWER": tok.lower()} for tok in entity_text.split()]
    patterns.append({"label": label, "pattern": token_pattern})
ruler.add_patterns(patterns)

# 5. Run NER again with spaCy's pre-trained transformer model
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
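The is_valid_entity filter above is whatever sanity check fits your data; a minimal sketch (my own assumption, not part of the original) that drops empty strings, stray subword artifacts, and pure punctuation could look like this:

```python
import re

def is_valid_entity(text: str) -> bool:
    """Reject spans that are empty, leftover subword pieces
    (e.g. '##ing'), or contain no letters or digits at all."""
    text = text.strip()
    if not text or text.startswith("##"):
        return False
    # require at least one alphanumeric character
    return bool(re.search(r"[A-Za-z0-9]", text))
```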
Thanks for sharing