disham993 committed
Commit 82e60e3 · verified · 1 Parent(s): feee559

Update README.md

Files changed (1): README.md +129 -30

README.md CHANGED
@@ -8,58 +8,157 @@ tags:
  datasets:
  - disham993/ElectricalNER
  metrics:
- - epoch: 1.0
- - eval_precision: 0.8835414301929625
- - eval_recall: 0.9227851102505334
- - eval_f1: 0.9027369723210142
- - eval_accuracy: 0.956991714467814
- - eval_runtime: 2.6822
- - eval_samples_per_second: 562.603
- - eval_steps_per_second: 8.948
  ---

- # disham993/electrical-ner-bert-base

- ## Model description

- This model is fine-tuned from [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) for token-classification tasks.

  ## Training Data

- The model was trained on the disham993/ElectricalNER dataset.

  ## Model Details
- - **Base Model:** google-bert/bert-base-uncased
- - **Task:** token-classification
- - **Language:** en
- - **Dataset:** disham993/ElectricalNER

- ## Training procedure

- ### Training hyperparameters
- [Please add your training hyperparameters here]

- ## Evaluation results

- ### Metrics\n- epoch: 1.0\n- eval_precision: 0.8835414301929625\n- eval_recall: 0.9227851102505334\n- eval_f1: 0.9027369723210142\n- eval_accuracy: 0.956991714467814\n- eval_runtime: 2.6822\n- eval_samples_per_second: 562.603\n- eval_steps_per_second: 8.948

  ## Usage

- ```python
- from transformers import AutoTokenizer, AutoModel

- tokenizer = AutoTokenizer.from_pretrained("disham993/electrical-ner-bert-base")
- model = AutoModel.from_pretrained("disham993/electrical-ner-bert-base")
  ```

- ## Limitations and bias

- [Add any known limitations or biases of the model]

  ## Training Infrastructure

- [Add details about training infrastructure used]

- ## Last update

- 2024-12-30
 
  datasets:
  - disham993/ElectricalNER
  metrics:
+ - epoch: 5.0
+ - eval_precision: 0.9193
+ - eval_recall: 0.9303
+ - eval_f1: 0.9247
+ - eval_accuracy: 0.9669
+ - eval_runtime: 2.2917
+ - eval_samples_per_second: 658.454
+ - eval_steps_per_second: 10.472
  ---

+ # electrical-ner-bert-base

+ ## Model Description

+ This model is fine-tuned from [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) for token-classification tasks, specifically Named Entity Recognition (NER) in the electrical engineering domain. The model has been optimized to extract entities such as components, materials, standards, and design parameters from technical texts with high precision and recall.

  ## Training Data

+ The model was trained on the [disham993/ElectricalNER](https://huggingface.co/datasets/disham993/ElectricalNER) dataset, a GPT-4o-mini-generated dataset curated for the electrical engineering domain. It covers diverse technical contexts, including circuit design, testing, maintenance, installation, troubleshooting, and research.
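+
+ For reference, the dataset can be loaded directly from the Hub (a minimal sketch; the split and column layout are assumptions here, so inspect the returned `DatasetDict` to confirm):
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the ElectricalNER dataset from the Hugging Face Hub.
+ dataset = load_dataset("disham993/ElectricalNER")
+ print(dataset)
+ print(dataset["train"][0])  # assumes a "train" split with tokens and NER tags
+ ```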

  ## Model Details

+ - **Base Model:** [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
+ - **Task:** Token Classification (NER)
+ - **Language:** English (en)
+ - **Dataset:** [disham993/ElectricalNER](https://huggingface.co/datasets/disham993/ElectricalNER)

+ ## Training Procedure

+ ### Training Hyperparameters

+ The model was fine-tuned using the following hyperparameters (a `TrainingArguments` sketch follows the list):
+
+ - **Evaluation Strategy:** epoch
+ - **Learning Rate:** 1e-5
+ - **Batch Size:** 64 (for both training and evaluation)
+ - **Number of Epochs:** 5
+ - **Weight Decay:** 0.01
+
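+ As a rough reconstruction of these settings (a minimal sketch assuming a standard `transformers` `Trainer` setup; `output_dir` and any unlisted arguments are assumptions, since the training script is not part of this card):
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Sketch of the reported hyperparameters; values mirror the list above,
+ # everything else (e.g. output_dir) is an assumption.
+ training_args = TrainingArguments(
+     output_dir="electrical-ner-bert-base",
+     evaluation_strategy="epoch",  # renamed to eval_strategy in newer releases
+     learning_rate=1e-5,
+     per_device_train_batch_size=64,
+     per_device_eval_batch_size=64,
+     num_train_epochs=5,
+     weight_decay=0.01,
+ )
+ ```
+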
+ ## Evaluation Results
+
+ The following metrics were achieved during evaluation (a computation sketch follows the list):
+
+ - **Precision:** 0.9193
+ - **Recall:** 0.9303
+ - **F1 Score:** 0.9247
+ - **Accuracy:** 0.9669
+ - **Evaluation Runtime:** 2.2917 seconds
+ - **Samples Per Second:** 658.454
+ - **Steps Per Second:** 10.472
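+
+ Metrics of this kind are typically computed at the entity level with `seqeval` via the `evaluate` library (a sketch, not this card's own evaluation script; the BIO labels shown are illustrative assumptions about the ElectricalNER tag set):
+
+ ```python
+ import evaluate
+
+ # seqeval scores predictions at the entity level from BIO-tagged sequences.
+ seqeval = evaluate.load("seqeval")
+ predictions = [["B-COMPONENT", "I-COMPONENT", "O", "B-DESIGN_PARAM"]]
+ references = [["B-COMPONENT", "I-COMPONENT", "O", "B-DESIGN_PARAM"]]
+ print(seqeval.compute(predictions=predictions, references=references))
+ ```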
 
  ## Usage

+ You can use this model for Named Entity Recognition tasks as follows:

+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+
+ model_name = "disham993/electrical-ner-bert-base"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
+
+ nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+
+ text = "The Xilinx Vivado development suite was used to program the Artix-7 FPGA."
+
+ ner_results = nlp(text)
+
+ def clean_and_group_entities(ner_results, min_score=0.40):
+     """
+     Cleans and groups named entity recognition (NER) results based on a minimum score threshold.
+
+     Args:
+         ner_results (list of dict): A list of dictionaries containing NER results. Each dictionary should have the keys:
+             - "word" (str): The recognized word or token.
+             - "entity_group" (str): The entity group or label.
+             - "start" (int): The start position of the entity in the text.
+             - "end" (int): The end position of the entity in the text.
+             - "score" (float): The confidence score of the entity recognition.
+         min_score (float, optional): The minimum score threshold for considering an entity. Defaults to 0.40.
+
+     Returns:
+         list of dict: A list of grouped entities that meet the minimum score threshold. Each dictionary contains:
+             - "entity_group" (str): The entity group or label.
+             - "word" (str): The concatenated word or token.
+             - "start" (int): The start position of the entity in the text.
+             - "end" (int): The end position of the entity in the text.
+             - "score" (float): The minimum confidence score of the grouped entity.
+     """
+     grouped_entities = []
+     current_entity = None
+
+     for result in ner_results:
+         # Skip entities with score below threshold
+         if result["score"] < min_score:
+             if current_entity:
+                 # Add current entity if it meets threshold
+                 if current_entity["score"] >= min_score:
+                     grouped_entities.append(current_entity)
+                 current_entity = None
+             continue
+
+         word = result["word"].replace("##", "")  # Remove subword token markers
+
+         if current_entity and result["entity_group"] == current_entity["entity_group"] and result["start"] == current_entity["end"]:
+             # Continue the current entity
+             current_entity["word"] += word
+             current_entity["end"] = result["end"]
+             current_entity["score"] = min(current_entity["score"], result["score"])
+
+             # If combined score drops below threshold, discard the entity
+             if current_entity["score"] < min_score:
+                 current_entity = None
+         else:
+             # Finalize the current entity if it meets threshold
+             if current_entity and current_entity["score"] >= min_score:
+                 grouped_entities.append(current_entity)
+
+             # Start a new entity
+             current_entity = {
+                 "entity_group": result["entity_group"],
+                 "word": word,
+                 "start": result["start"],
+                 "end": result["end"],
+                 "score": result["score"]
+             }
+
+     # Add the last entity if it meets threshold
+     if current_entity and current_entity["score"] >= min_score:
+         grouped_entities.append(current_entity)
+
+     return grouped_entities
+
+ cleaned_results = clean_and_group_entities(ner_results)
  ```
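+
+ To inspect the grouped entities, a small follow-up sketch (the field names mirror the dictionaries built by `clean_and_group_entities` above):
+
+ ```python
+ # Print each surviving entity with its label and confidence.
+ for entity in cleaned_results:
+     print(f"{entity['entity_group']:>15}  {entity['word']}  (score={entity['score']:.2f})")
+ ```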

+ ## Limitations and Bias
+
+ While this model performs well in the electrical engineering domain, it is not designed for use in other domains. Additionally, it may:
+
+ - Misclassify entities due to potential inaccuracies in the GPT-4o-mini-generated dataset.
+ - Struggle with ambiguous contexts or low-confidence predictions; this is mitigated with the help of the `clean_and_group_entities` function.
+
+ This model is intended for research and educational purposes only, and users are encouraged to validate results before applying them to critical applications.

  ## Training Infrastructure

+ For a complete guide covering the entire process - from data tokenization to pushing the model to the Hugging Face Hub - please refer to the [GitHub repository](https://github.com/di37/ner-electrical-finetuning).

+ ## Last Update

+ 2024-12-31