Best results from combining with en_core_web_trf
Dear Jean-Baptiste,
first of all, thanks for bringing this to life! Great work! It shows really good performance on NER tasks for English texts indeed. I see the advantage over spaCy's transformer model in general, but I also find that en_core_web_trf delivers incremental value by recognizing different entities, especially when the input text doesn't come with perfect grammar but is rather short form, e.g.: "John V. marketing feedback, shop mgmt. report, OMS deadline".
Actually, I get the best NER results when combining both models as follows:
- Run roberta-ner first,
- Map the entity labels to spaCy's label convention,
- Add the entities through a separate spaCy pipe (add_pipe("entity_ruler", before="ner")),
- Run NER again with spaCy's pre-trained transformer model.
This finds all three entities in the text above (John V., shop mgmt., OMS), whereas either model alone would only find two out of three.
Again, thanks for your work! Just wanted to add this in case it's helpful for anybody.
A brief code example below:
import spacy
from transformers import pipeline

# 1. Run the RoBERTa NER model first
hf_pipeline = pipeline("ner", model=roberta_model_path, tokenizer=roberta_model_path, aggregation_strategy="simple")
hf_entities = hf_pipeline(text)

# 2. Load spaCy's transformer model and add an entity_ruler before its ner component
nlp = spacy.load("en_core_web_trf")
ruler = nlp.add_pipe("entity_ruler", before="ner")

# 3. Map the RoBERTa labels to spaCy's label convention
hf_to_spacy_labels = {
    "PER": "PERSON",
    "LOC": "GPE",
    "ORG": "ORG",
    "MISC": "NORP",
    "DATE": "DATE"
}

# 4. Turn the RoBERTa entities into token patterns for the ruler
patterns = []
for ent in hf_entities:
    entity_text = ent["word"]
    if not is_valid_entity(entity_text):  # user-defined sanity filter
        continue  # skip
    label = hf_to_spacy_labels.get(ent["entity_group"], ent["entity_group"])
    token_pattern = [{"LOWER": tok.lower()} for tok in entity_text.split()]
    patterns.append({"label": label, "pattern": token_pattern})
ruler.add_patterns(patterns)

# 5. Run NER again with spaCy's pre-trained transformer model
doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
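The is_valid_entity filter above is whatever sanity check fits your data; a minimal sketch (my own assumption, not part of the original) that drops empty strings, stray subword artifacts, and pure punctuation could look like this:

```python
import re

def is_valid_entity(text: str) -> bool:
    """Reject spans that are empty, leftover subword pieces
    (e.g. '##ing'), or contain no letters or digits at all."""
    text = text.strip()
    if not text or text.startswith("##"):
        return False
    # require at least one alphanumeric character
    return bool(re.search(r"[A-Za-z0-9]", text))
```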
Thanks for sharing