---
license: apache-2.0
datasets:
- custom
- chatgpt
language:
- en
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: token-classification
library_name: transformers
new_version: v1.1
tags:
- token-classification
- ner
- named-entity-recognition
- text-classification
- sequence-labeling
- transformer
- bert
- nlp
- pretrained-model
- dataset-finetuning
- deep-learning
- huggingface
- conll2012
- real-time-inference
- efficient-nlp
- high-accuracy
- gpu-optimized
- chatbot
- information-extraction
- search-enhancement
- knowledge-graph
- legal-nlp
- medical-nlp
- financial-nlp
base_model:
- boltuix/bert-lite
---
# Boltuix BERT-NER Model

## Model Details

### Description
- Fine-tuned for Named Entity Recognition (NER)
- Dataset: CoNLL-2012
- Recognizes 37 entity labels (18 entity types in BIO format, plus `O`) across diverse domains: people, places, organizations, laws, events, and more
- Works well for sentence-level and document-level tagging in English
- Training examples: 115,812 | Validation: 15,680 | Test: 12,217
### Info

- Developer: Boltuix
- Fuel: Passion
- License: Apache 2.0
- Language: English
- Type: Transformer-based Token Classification
- Version: v1.0
- Trained: Before March 27, 2025
### Links

- Model Repo
- CoNLL-2012 Paper
- Demo: Coming Soon
## Use Cases for NER

### Direct Applications
- Extracting names, places, and dates from news, blogs, and reports
- Powering chatbots with contextual awareness
- Enhancing search with semantic understanding
- Building dynamic knowledge graphs
### Downstream Tasks
- Medical & legal domain adaptation
- Multilingual extensions (with retraining)
- Custom entity sets for finance, e-commerce, etc. (see the sketch below)
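For domain adaptation with a custom entity set, one plausible starting point is to reuse this checkpoint with a freshly initialized classification head. A minimal sketch; the finance label names below are hypothetical, not part of this model:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set for a finance use case (not shipped with the model)
labels = ["O", "B-TICKER", "I-TICKER", "B-AMOUNT", "I-AMOUNT"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-ner")

# ignore_mismatched_sizes swaps in a new, randomly initialized classifier head
model = AutoModelForTokenClassification.from_pretrained(
    "boltuix/bert-ner",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)
# From here, fine-tune on your own labeled data (e.g. with transformers' Trainer).
```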
## Limitations

- English-only out of the box
- May not generalize to informal, low-resource, or code-mixed texts
- May reflect dataset bias (CoNLL-2012 is newswire-heavy)
## Getting Started

### Inference Code
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the fine-tuned tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("boltuix/bert-ner")
model = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")
model.eval()

text = "Barack Obama visited Microsoft headquarters in Seattle."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label id for each token
predictions = outputs.logits.argmax(dim=-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
label_map = model.config.id2label
labels = [label_map[p.item()] for p in predictions[0]]

# Print each (token, label) pair, skipping [CLS]/[SEP]/[PAD]
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:15} → {label}")
```
### Example Output

```
barack          → B-PERSON
obama           → I-PERSON
visited         → O
microsoft       → B-ORG
headquarters    → O
in              → O
seattle         → B-GPE
.               → O
```
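If you prefer grouped entity spans over per-token tags, the `pipeline` helper can merge B-/I- predictions for you. A minimal sketch, assuming the same checkpoint as above:

```python
from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- subword predictions into whole spans
ner = pipeline(
    "token-classification",
    model="boltuix/bert-ner",
    aggregation_strategy="simple",
)

for entity in ner("Barack Obama visited Microsoft headquarters in Seattle."):
    # Each result carries entity_group, score, word, start, and end
    print(f"{entity['word']:20} → {entity['entity_group']} ({entity['score']:.2f})")
```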
## Entity Labels (CoNLL-2012)

Here are all 37 labels supported by the model:

- O → Outside any named entity

### Beginning (B-) Tags

- B-CARDINAL → CARDINAL
- B-DATE → DATE
- B-EVENT → EVENT
- B-FAC → FACILITY
- B-GPE → COUNTRY/CITY
- B-LANGUAGE → LANGUAGE
- B-LAW → LAW
- B-LOC → LOCATION
- B-MONEY → MONEY
- B-NORP → GROUP
- B-ORDINAL → ORDINAL
- B-ORG → ORGANIZATION
- B-PERCENT → PERCENT
- B-PERSON → PERSON
- B-PRODUCT → PRODUCT
- B-QUANTITY → QUANTITY
- B-TIME → TIME
- B-WORK_OF_ART → WORK_OF_ART

### Inside (I-) Tags

Each B- tag has a matching I- tag for tokens inside a multi-token entity:

- I-CARDINAL
- I-DATE
- I-EVENT
- I-FAC
- I-GPE
- I-LANGUAGE
- I-LAW
- I-LOC
- I-MONEY
- I-NORP
- I-ORDINAL
- I-ORG
- I-PERCENT
- I-PERSON
- I-PRODUCT
- I-QUANTITY
- I-TIME
- I-WORK_OF_ART
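The full label set also ships in the model config, so you can verify it programmatically rather than relying on this list. A minimal sketch:

```python
from transformers import AutoConfig

# id2label in the config holds the complete BIO tag inventory
config = AutoConfig.from_pretrained("boltuix/bert-ner")
for idx in sorted(config.id2label):
    print(idx, config.id2label[idx])

print("Total labels:", len(config.id2label))  # expected: 37
```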
## Performance

| Metric    | Score |
|-----------|-------|
| Precision | 0.85  |
| Recall    | 0.87  |
| F1 Score  | 0.86  |
| Accuracy  | 0.92  |

- Evaluation tool: seqeval
- Dataset: CoNLL-2012 test split
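To score your own predictions the same way, seqeval computes entity-level metrics from aligned BIO sequences. A minimal sketch with toy sequences (illustrative only, not the actual test split):

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy gold/predicted BIO sequences, one list per sentence
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "O", "B-GPE", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "O", "B-GPE", "O"]]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
```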
## Training Setup

- Hardware: NVIDIA GPU
- Training Time: ~2 hours
- Parameters: ~11M
- Optimizer: AdamW (default)
- Mixed precision: No (fp32)
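For reference, a `TrainingArguments` sketch consistent with this setup; every hyperparameter below except the fp32 choice is an assumption, since the card does not state them:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-ner-finetune",   # hypothetical output path
    learning_rate=5e-5,               # assumption: not stated on the card
    num_train_epochs=3,               # assumption: not stated on the card
    per_device_train_batch_size=16,   # assumption: not stated on the card
    fp16=False,                       # matches the card: fp32, no mixed precision
)
# Trainer's default optimizer is AdamW, matching the setup above.
```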
## Carbon Impact

- Trained locally
- Region: Boltuix's Base
- Emissions: ~50 g CO₂eq
- Measured via: ML Impact
## Contact

- Author: Boltuix
- Email: [email protected]