---
license: apache-2.0
datasets:
  - custom
  - chatgpt
language:
  - en
metrics:
  - precision
  - recall
  - f1
  - accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.1
tags:
  - token-classification
  - ner
  - named-entity-recognition
  - text-classification
  - sequence-labeling
  - transformer
  - bert
  - nlp
  - pretrained-model
  - dataset-finetuning
  - deep-learning
  - huggingface
  - conll2012
  - real-time-inference
  - efficient-nlp
  - high-accuracy
  - gpu-optimized
  - chatbot
  - information-extraction
  - search-enhancement
  - knowledge-graph
  - legal-nlp
  - medical-nlp
  - financial-nlp
base_model:
  - boltuix/bert-lite
---


# 🌟 Boltuix BERT-NER Model 🌟

## 🚀 Model Details

### 🌈 Description

- ✨ Fine-tuned for Named Entity Recognition (NER)
- 📚 Dataset: CoNLL-2012
- 🔍 Recognizes 37 entity labels (18 entity types in BIO format, plus `O`) spanning people, places, organizations, laws, events, and more
- 💬 Suited to both sentence-level and document-level tagging in English
- 🧠 Training examples: 115,812 | ✅ Validation: 15,680 | 🧪 Test: 12,217

### 🔧 Info

- Developer: Boltuix 🧙‍♂️
- Fuel: Passion 🧠
- License: Apache 2.0 📜
- Language: English 🇬🇧
- Type: Transformer-based Token Classification 🤖
- Version: v1.0 🎈
- Trained: Before March 27, 2025

### 🔗 Links


## 🎯 Use Cases for NER

### 🌟 Direct Applications

- Extracting names, places, and dates from news, blogs, and reports
- Powering chatbots with contextual awareness
- Enhancing search with semantic understanding
- Building dynamic knowledge graphs (see the sketch below)
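
To make the knowledge-graph item concrete, here is a minimal sketch that groups pipeline-style NER output (see Getting Started below for how to produce it) into per-type node sets. The `entities` list and the graph structure are hypothetical stand-ins, not part of the original card.

```python
from collections import defaultdict

# Hypothetical pipeline-style output for one sentence; in practice this
# would come from the token-classification pipeline shown in Getting Started.
entities = [
    {"entity_group": "PERSON", "word": "barack obama"},
    {"entity_group": "ORG", "word": "microsoft"},
    {"entity_group": "GPE", "word": "seattle"},
]

# Group entity mentions by type to form simple knowledge-graph node sets.
graph = defaultdict(set)
for entity in entities:
    graph[entity["entity_group"]].add(entity["word"])

print(dict(graph))  # {'PERSON': {'barack obama'}, 'ORG': {'microsoft'}, 'GPE': {'seattle'}}
```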

### 🌱 Downstream Tasks

- Medical & legal domain adaptation
- Multilingual extensions (with retraining)
- Custom entity sets for finance, e-commerce, etc. (as sketched below)
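
For custom entity sets, one common recipe (a sketch, not the published training procedure) is to reload the checkpoint with a new label list and let `ignore_mismatched_sizes=True` re-initialize the classification head before fine-tuning on domain data. The finance labels below are hypothetical.

```python
from transformers import AutoModelForTokenClassification

# Hypothetical custom label set for a finance domain (BIO format)
custom_labels = ["O", "B-TICKER", "I-TICKER", "B-AMOUNT", "I-AMOUNT"]
id2label = {i: label for i, label in enumerate(custom_labels)}
label2id = {label: i for i, label in id2label.items()}

# Reuse the encoder weights; the token-classification head is re-initialized
# to the new label count and must then be fine-tuned on domain data.
model = AutoModelForTokenClassification.from_pretrained(
    "boltuix/bert-ner",
    num_labels=len(custom_labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)
```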

## ❌ Limitations

- 📌 English-only out of the box
- 🚫 May not generalize to informal, low-resource, or code-mixed texts
- ⚖️ May reflect dataset bias (CoNLL-2012 is newswire-heavy)

## 🛠️ Getting Started

### 🧪 Inference Code

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the fine-tuned NER tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-ner")
model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")

text = "Barack Obama visited Microsoft headquarters in Seattle."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label id for each token
predictions = outputs.logits.argmax(dim=-1)

# Map token ids back to tokens and label ids back to label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
label_map = model.config.id2label
labels = [label_map[p.item()] for p in predictions[0]]

# Print token/label pairs, skipping special tokens like [CLS] and [SEP]
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:15} → {label}")
```

### ✨ Example Output

```text
barack          → B-PERSON
obama           → I-PERSON
visited         → O
microsoft       → B-ORG
headquarters    → O
in              → O
seattle         → B-GPE
.               → O
```
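
If you prefer whole entities over raw subword tags, the standard `transformers` pipeline API (a convenience alternative, not code from the original card) can merge B-/I- pieces for you:

```python
from transformers import pipeline

# aggregation_strategy="simple" merges subword tokens and B-/I- tags into
# whole entities, each with a confidence score and character offsets.
ner = pipeline(
    "token-classification",
    model="boltuix/bert-ner",
    aggregation_strategy="simple",
)

for entity in ner("Barack Obama visited Microsoft headquarters in Seattle."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```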

## 🧠 Entity Labels (CoNLL-2012)

Here are all 37 labels supported by the model:

- 🔹 O – Outside

### 🔢 Beginning (B-) Tags

- 🔢 B-CARDINAL – CARDINAL
- 📅 B-DATE – DATE
- 🎉 B-EVENT – EVENT
- 🏢 B-FAC – FACILITY
- 🌍 B-GPE – COUNTRY/CITY
- 🗣️ B-LANGUAGE – LANGUAGE
- ⚖️ B-LAW – LAW
- 🗺️ B-LOC – LOCATION
- 💰 B-MONEY – MONEY
- 🧑‍🤝‍🧑 B-NORP – GROUP
- 🔟 B-ORDINAL – ORDINAL
- 🏛️ B-ORG – ORGANIZATION
- 📊 B-PERCENT – PERCENT
- 👤 B-PERSON – PERSON
- 📦 B-PRODUCT – PRODUCT
- 📏 B-QUANTITY – QUANTITY
- ⏰ B-TIME – TIME
- 🎨 B-WORK_OF_ART – WORK_OF_ART

### 🔢 Inside (I-) Tags

- 🔢 I-CARDINAL
- 📅 I-DATE
- 🎉 I-EVENT
- 🏢 I-FAC
- 🌍 I-GPE
- 🗣️ I-LANGUAGE
- ⚖️ I-LAW
- 🗺️ I-LOC
- 💰 I-MONEY
- 🧑‍🤝‍🧑 I-NORP
- 🔟 I-ORDINAL
- 🏛️ I-ORG
- 📊 I-PERCENT
- 👤 I-PERSON
- 📦 I-PRODUCT
- 📏 I-QUANTITY
- ⏰ I-TIME
- 🎨 I-WORK_OF_ART
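
To confirm this label inventory programmatically (a quick check, assuming the checkpoint is reachable), the mapping is stored in the model config:

```python
from transformers import AutoConfig

# The id-to-label mapping ships with the checkpoint's configuration.
config = AutoConfig.from_pretrained("boltuix/bert-ner")
print(len(config.id2label))              # expected: 37
print(sorted(config.id2label.values()))  # O plus the B-/I- tags listed above
```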


## 📈 Performance

| Metric       | Score |
|--------------|-------|
| 🎯 Precision | 0.85  |
| 🕸️ Recall    | 0.87  |
| 🎶 F1 Score  | 0.86  |
| ✅ Accuracy  | 0.92  |
- 📊 Evaluation tool: seqeval
- 🧪 Dataset: CoNLL-2012 test split
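
For reference, this is roughly how seqeval scores tag sequences, at the entity level rather than the token level. The sequences below are made up for illustration, not test-set data.

```python
from seqeval.metrics import classification_report, f1_score

# Made-up gold and predicted BIO sequences, one inner list per sentence.
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "O"]]

print(f1_score(y_true, y_pred))            # entity-level F1 on this toy pair
print(classification_report(y_true, y_pred))
```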

## ⚙️ Training Setup

- 💻 Hardware: NVIDIA GPU
- ⏱️ Training Time: ~2 hours
- 🐘 Parameters: ~11M
- 🎛️ Optimizer: AdamW (default)
- 📦 Mixed precision: No (fp32)
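
The exact hyperparameters were not published, so the configuration below is a hedged sketch consistent with the bullets above (AdamW as the Trainer default, no mixed precision); every value marked as an assumption is one.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-ner-finetuned",  # hypothetical output path
    num_train_epochs=3,               # assumption; not stated above
    per_device_train_batch_size=16,   # assumption; not stated above
    learning_rate=5e-5,               # assumption; not stated above
    optim="adamw_torch",              # AdamW, the Trainer default
    fp16=False,                       # fp32 training, per the note above
)
```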

## 🌍 Carbon Impact

- 💻 Trained locally
- ☁️ Region: Boltuix’s Base
- 🌱 Emissions: ~50 g CO₂eq
- 📊 Measured via: ML Impact

## ✍️ Contact