emanuelaboros committed on
Commit 859c830 · 1 Parent(s): 2201866

modified readme

Files changed (3)
  1. .DS_Store +0 -0
  2. README.md +201 -23
  3. push_to_hf.py +0 -145
.DS_Store ADDED
Binary file (6.15 kB)
 
README.md CHANGED
@@ -8,37 +8,58 @@ tags:
  - v1.0.0
  ---
-
- The **Impresso NER model** is based on the stacked Transformer architecture published in [CoNLL 2020](https://aclanthology.org/2020.conll-1.35/) trained on the Impresso HIPE-2020 portion of the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data). It recognizes entity types such as person, location, and organization while supporting the complete [HIPE typology](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md), including coarse and fine-grained entity types as well as components like names, titles, and roles. Additionally, the NER model's backbone ([dbmdz/bert-medium-historic-multilingual-cased](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)) was trained on various European historical datasets, giving it a broader language capability. This training included data from the Europeana and British Library collections across multiple languages: German, French, English, Finnish, and Swedish. Due to this multilingual backbone, the NER model may also recognize entities in other languages beyond French and German.
-
- #### How to use
-
- You can use this model with Transformers *pipeline* for NER.
-
- <!-- Provide a longer summary of what this model is. -->
- ```python
- # Import necessary Python modules from the Transformers library
- from transformers import AutoModelForTokenClassification, AutoTokenizer
- from transformers import pipeline
-
- # Define the model name to be used for token classification, we use the Impresso NER
- # that can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
- MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
-
- # Load the tokenizer corresponding to the specified model name
- ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-
- ner_pipeline = pipeline("generic-ner", model=MODEL_NAME,
-                         tokenizer=ner_tokenizer,
-                         trust_remote_code=True,
-                         device='cpu')
-
- sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
-
- entities = ner_pipeline(sentence)
- print(entities)
- ```

  ```
  [
  {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
  {'type': 'loc', 'confidence_ner': 90.75, 'surface': 'Europe', 'lOffset': 69, 'rOffset': 75},
@@ -52,10 +73,158 @@ print(entities)
  ]
  ```

- ### BibTeX entry and citation info

  ```
  @inproceedings{boros2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
@@ -63,4 +232,13 @@ print(entities)
  pages={431--441},
  year={2020}
  }
- ```
+ # Model Card for `impresso-project/ner-stacked-bert-multilingual`
+
+ The **Impresso NER model** is a multilingual named entity recognition model trained for historical document processing. It is based on a stacked Transformer architecture and is designed to identify coarse- and fine-grained entity types in digitized historical texts, including names, titles, and locations.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** [Impresso team](https://impresso-project.ch/). [Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an
+ interdisciplinary research project that aims to develop and consolidate tools for
+ processing and exploring large collections of media archives across modalities, time,
+ languages, and national borders. The first project (2017-2021) was funded by the Swiss
+ National Science Foundation under grant
+ No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)
+ by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)
+ and the Luxembourg National Research Fund under grant No. 17498891.
+ - **Shared by:** [Emanuela Boros](https://huggingface.co/emanuelaboros)
+ - **Model type:** Stacked BERT-based token classification model for named entity recognition
+ - **Language(s):** French, German, English (with additional support for multilingual historical texts)
+ - **License:** [GNU Affero General Public License v3 or later](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
+ - **Finetuned from model:** [dbmdz/bert-medium-historic-multilingual-cased](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)
+
+ ### Model Architecture
+
+ The model architecture consists of the following components:
+ - A **pre-trained BERT encoder** (multilingual historic BERT) as the base.
+ - **One or two Transformer encoder layers** stacked on top of the BERT encoder.
+ - A **Conditional Random Field (CRF)** decoder layer to model label dependencies.
+ - **Learned absolute positional embeddings** for improved handling of noisy inputs.
+
+ These additional Transformer layers help mitigate the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification; a simplified sketch of this architecture follows.
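+
+ To make the stacking concrete, here is a minimal, hypothetical PyTorch sketch (the class name `StackedNERModel`, the use of the `pytorch-crf` package, and all hyperparameter values are illustrative assumptions, not the released `modeling_stacked.py` implementation):
+
+ ```python
+ import torch.nn as nn
+ from torchcrf import CRF  # pytorch-crf package (assumed CRF implementation)
+ from transformers import AutoModel
+
+ class StackedNERModel(nn.Module):  # hypothetical name, for illustration only
+     def __init__(self, base="dbmdz/bert-medium-historic-multilingual-cased",
+                  num_labels=21, num_extra_layers=2):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(base)
+         hidden = self.encoder.config.hidden_size  # 512 for bert-medium
+         layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
+         self.stack = nn.TransformerEncoder(layer, num_layers=num_extra_layers)
+         self.classifier = nn.Linear(hidden, num_labels)  # per-token emission scores
+         self.crf = CRF(num_labels, batch_first=True)     # models label dependencies
+
+     def forward(self, input_ids, attention_mask):
+         states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
+         states = self.stack(states, src_key_padding_mask=~attention_mask.bool())
+         emissions = self.classifier(states)
+         return self.crf.decode(emissions, mask=attention_mask.bool())
+ ```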
+
+ ### Entity Types Supported
+
+ The model supports both coarse-grained and fine-grained entity types defined in the HIPE-2020/2022 guidelines. The output consists of structured predictions with contextual and semantic details. Each prediction is a dictionary with the following fields:
+
+ ```python
+ {
+     'type': 'pers' | 'org' | 'loc' | 'time' | 'prod',
+     'confidence_ner': float,  # Confidence score
+     'surface': str,           # Surface form in text
+     'lOffset': int,           # Start character offset
+     'rOffset': int,           # End character offset
+     'name': str,              # Optional: full name (for persons)
+     'title': str,             # Optional: title (for persons)
+     'function': str           # Optional: function (if detected)
+ }
  ```
+
+ #### Example Output
+
+ ```python
  [
  {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
  {'type': 'loc', 'confidence_ner': 90.75, 'surface': 'Europe', 'lOffset': 69, 'rOffset': 75},
  ]
  ```
+
+ #### Coarse-Grained Entity Types
+ - **pers**: Person entities (individuals, collectives, authors)
+ - **org**: Organizations (administrative, enterprise, press agencies)
+ - **prod**: Products (media, documents)
+ - **time**: Time expressions (absolute dates)
+ - **loc**: Locations (towns, regions, countries, physical, facilities)
+
+ The model also returns **person-specific attributes**, illustrated in the sketch after this list:
+ - `name`: canonical full name
+ - `title`: honorific or title (e.g., "roi", "chancelier")
+ - `function`: role or function in context (if available)
+
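+ For example, person attributes can be read directly off the pipeline output with a few lines of post-processing (a hypothetical helper; `entities` is the list returned by the pipeline shown under "How to Get Started" below):
+
+ ```python
+ def summarize_persons(entities):
+     """Collect name/title/function details for person mentions."""
+     for ent in entities:
+         if ent.get("type") == "pers":
+             yield {
+                 "surface": ent.get("surface"),
+                 "name": ent.get("name", ent.get("surface")),
+                 "title": ent.get("title"),       # e.g. "roi", "chancelier"
+                 "function": ent.get("function"),
+                 "span": (ent.get("lOffset"), ent.get("rOffset")),
+             }
+ ```
+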
+ ### Model Sources
+
+ - **Repository:** https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
+ - **Paper:** [CoNLL 2020](https://aclanthology.org/2020.conll-1.35/)
+ - **Demo:** [Impresso project](https://impresso-project.ch)
+
+ ## Uses
+
+ ### Direct Use
+
+ The model is intended to be used directly with the Hugging Face `pipeline` for token classification, via the custom `generic-ner` task, on historical texts.
+
+ ### Downstream Use
+
+ The model can be used for downstream tasks such as:
+ - Historical information extraction
+ - Biographical reconstruction
+ - Place and person mention detection across historical archives
+
+ ### Out-of-Scope Use
+
+ - Not suitable for contemporary named entity recognition in domains such as social media or modern news.
+ - Not optimized for OCR-free modern corpora.
+
+ ## Bias, Risks, and Limitations
+
+ Because it was trained on historical documents, the model may reflect historical biases and inaccuracies. It may underperform on contemporary texts or on non-European languages.
+
+ ### Recommendations
+
+ - Users should be cautious of historical and typographical biases.
+ - Consider post-processing to filter false positives caused by OCR noise; a minimal sketch follows.
+
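+ For instance, a simple confidence threshold can drop low-confidence predictions, which are often OCR artifacts (a hypothetical helper, not part of the released pipeline; the threshold value is an arbitrary assumption):
+
+ ```python
+ def filter_entities(entities, min_confidence=70.0):
+     """Keep only predictions whose NER confidence clears the threshold."""
+     return [e for e in entities if e.get("confidence_ner", 0.0) >= min_confidence]
+ ```
+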
+ ## How to Get Started with the Model
+
+ ```python
+ from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
+
+ MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
+
+ # Load the tokenizer and the custom stacked model; trust_remote_code is required
+ # because the architecture and the "generic-ner" pipeline are defined in the repo
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+ model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, trust_remote_code=True)
+
+ ner_pipeline = pipeline("generic-ner", model=model, tokenizer=tokenizer,
+                         trust_remote_code=True, device='cpu')
+
+ sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
+ entities = ner_pipeline(sentence)
+ print(entities)
  ```
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was trained on the Impresso HIPE-2020 dataset, a subset of the [HIPE-2022 corpus](https://github.com/hipe-eval/HIPE-2022-data), which includes richly annotated, OCR-transcribed historical newspaper content.
+
+ ### Training Procedure
+
+ #### Preprocessing
+
+ OCR content was cleaned and segmented. Entity types follow the HIPE-2020 typology.
+
+ #### Training Hyperparameters
+
+ - **Training regime:** Mixed precision (fp16)
+ - **Epochs:** 5
+ - **Max sequence length:** 512
+ - **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
+ - **Stacked Transformer layers:** 2
+
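+ These settings would map onto a standard Hugging Face `TrainingArguments` configuration roughly as follows (a sketch for orientation only; `output_dir` is a placeholder and the actual training script is not published here):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="./ner-stacked-bert",  # placeholder path
+     num_train_epochs=5,               # "Epochs: 5"
+     fp16=True,                        # "Mixed precision (fp16)"
+ )
+ # The max sequence length (512) is applied at tokenization time, e.g.:
+ # tokenizer(text, truncation=True, max_length=512)
+ ```
+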
+ #### Speeds, Sizes, Times
+
+ - **Model size:** ~500 MB
+ - **Training time:** ~1 h on 1 GPU (NVIDIA A100)
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ Held-out portion of HIPE-2020 (French, German).
+
+ #### Factors
+
+ - Language
+ - Entity type granularity
+ - OCR quality
+
+ #### Metrics
+
+ - F1-score (micro, macro)
+ - Entity-level precision/recall
+
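+ Entity-level scores of this kind are conventionally computed over IOB-tagged sequences, for example with the `seqeval` package (a generic sketch, not the official HIPE scorer; the tag sequences are toy data):
+
+ ```python
+ from seqeval.metrics import f1_score
+
+ y_true = [["B-pers", "I-pers", "O", "B-loc"]]  # gold labels (toy example)
+ y_pred = [["B-pers", "I-pers", "O", "O"]]      # model predictions (toy example)
+
+ print(f1_score(y_true, y_pred, average="micro"))
+ print(f1_score(y_true, y_pred, average="macro"))
+ ```
+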
+ ### Results
+
+ | Language | Precision | Recall | F1-score |
+ |----------|-----------|--------|----------|
+ | French   | 84.2      | 81.6   | 82.9     |
+ | German   | 82.0      | 78.7   | 80.3     |
+
+ #### Summary
+
+ The model performs robustly on noisy OCR-transcribed historical content, with support for fine-grained entity typologies.
+
+ ## Model Examination
+
+ Token importance analysis and attention heatmaps can be visualized using tools like `transformers-interpret` or `captum`; a sketch follows.
+
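+ As a starting point, raw attention maps can be pulled from the backbone encoder with vanilla `transformers` (a minimal sketch that inspects the base model only, not the stacked layers or the CRF; plotting is left out):
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ BASE = "dbmdz/bert-medium-historic-multilingual-cased"
+ tokenizer = AutoTokenizer.from_pretrained(BASE)
+ model = AutoModel.from_pretrained(BASE, output_attentions=True)
+
+ inputs = tokenizer("le chancelier Guillaume de Nogaret", return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # outputs.attentions: one (batch, heads, seq, seq) tensor per layer;
+ # average the heads of the last layer to get a token-to-token heatmap.
+ heatmap = outputs.attentions[-1].mean(dim=1)[0]
+ print(heatmap.shape)
+ ```
+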
+ ## Environmental Impact
+
+ - **Hardware Type:** 1x NVIDIA A100 (80GB)
+ - **Hours used:** ~1 hour
+ - **Cloud Provider:** Local HPC
+ - **Compute Region:** Switzerland
+ - **Carbon Emitted:** ~0.9 kg CO₂eq (estimated)
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ Stacked BERT architecture with a multitask token classification head supporting HIPE-type entity labels.
+
+ ### Compute Infrastructure
+
+ #### Hardware
+
+ 1x NVIDIA A100 (80GB)
+
+ #### Software
+
+ - Python 3.11
+ - PyTorch 2.0
+ - Transformers 4.36
+
+ ## Citation
+
+ **BibTeX:**
+
+ ```bibtex
  @inproceedings{boros2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},

  pages={431--441},
  year={2020}
  }
+ ```
+
+ ## Model Card Authors
+
+ - Emanuela Boros ([@emanuelaboros](https://github.com/emanuelaboros))
+
+ ## Model Card Contact
+
+ For questions, reach out via [GitHub](https://github.com/emanuelaboros) or [impresso-project.ch](https://impresso-project.ch/).
push_to_hf.py DELETED
@@ -1,145 +0,0 @@
- import os
- import shutil
- import argparse
- from transformers import (
-     AutoTokenizer,
-     AutoConfig,
-     AutoModelForTokenClassification,
-     BertConfig,
- )
- from huggingface_hub import HfApi, Repository
-
- # import json
- from .configuration_stacked import ImpressoConfig
- from .modeling_stacked import ExtendedMultitaskModelForTokenClassification
- import subprocess
-
-
- def get_latest_checkpoint(checkpoint_dir):
-     checkpoints = [
-         d
-         for d in os.listdir(checkpoint_dir)
-         if os.path.isdir(os.path.join(checkpoint_dir, d))
-         and d.startswith("checkpoint-")
-     ]
-     checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[-1]), reverse=True)
-     return os.path.join(checkpoint_dir, checkpoints[0])
-
-
- def get_info(label_map):
-     num_token_labels_dict = {task: len(labels) for task, labels in label_map.items()}
-     return num_token_labels_dict
-
-
- def push_model_to_hub(checkpoint_dir, repo_name, script_path):
-     checkpoint_path = get_latest_checkpoint(checkpoint_dir)
-     config = ImpressoConfig.from_pretrained(checkpoint_path)
-     config.pretrained_config = AutoConfig.from_pretrained(config.name_or_path)
-     config.save_pretrained("stacked_bert")
-     config = ImpressoConfig.from_pretrained("stacked_bert")
-
-     model = ExtendedMultitaskModelForTokenClassification.from_pretrained(
-         checkpoint_path, config=config
-     )
-     tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
-     local_repo_path = "./repo"
-     repo_url = HfApi().create_repo(repo_id=repo_name, exist_ok=True)
-     repo = Repository(local_dir=local_repo_path, clone_from=repo_url)
-
-     try:
-         # Try to pull the latest changes from the remote repository using subprocess
-         subprocess.run(["git", "pull"], check=True, cwd=local_repo_path)
-     except subprocess.CalledProcessError as e:
-         # If fast-forward is not possible, reset the local branch to match the remote branch
-         subprocess.run(
-             ["git", "reset", "--hard", "origin/main"],
-             check=True,
-             cwd=local_repo_path,
-         )
-
-     # Copy all Python files to the local repository directory
-     current_dir = os.path.dirname(os.path.abspath(__file__))
-     for filename in os.listdir(current_dir):
-         if filename.endswith(".py") or filename.endswith(".json"):
-             shutil.copy(
-                 os.path.join(current_dir, filename),
-                 os.path.join(local_repo_path, filename),
-             )
-
-     ImpressoConfig.register_for_auto_class()
-     AutoConfig.register("stacked_bert", ImpressoConfig)
-     AutoModelForTokenClassification.register(
-         ImpressoConfig, ExtendedMultitaskModelForTokenClassification
-     )
-     ExtendedMultitaskModelForTokenClassification.register_for_auto_class(
-         "AutoModelForTokenClassification"
-     )
-
-     model.save_pretrained(local_repo_path)
-     tokenizer.save_pretrained(local_repo_path)
-
-     # Add, commit and push the changes to the repository
-     subprocess.run(["git", "add", "."], check=True, cwd=local_repo_path)
-     subprocess.run(
-         ["git", "commit", "-m", "Initial commit including model and configuration"],
-         check=True,
-         cwd=local_repo_path,
-     )
-     subprocess.run(["git", "push"], check=True, cwd=local_repo_path)
-
-     # Push the model to the hub (this includes the README template)
-     model.push_to_hub(repo_name)
-     tokenizer.push_to_hub(repo_name)
-
-     print(f"Model and repo pushed to: {repo_url}")
-
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser(description="Push NER model to Hugging Face Hub")
-     parser.add_argument(
-         "--model_type",
-         type=str,
-         required=True,
-         help="Type of the model (e.g., stacked-bert)",
-     )
-     parser.add_argument(
-         "--language",
-         type=str,
-         required=True,
-         help="Language of the model (e.g., multilingual)",
-     )
-     parser.add_argument(
-         "--checkpoint_dir",
-         type=str,
-         required=True,
-         help="Directory containing checkpoint folders",
-     )
-     parser.add_argument(
-         "--script_path", type=str, required=True, help="Path to the models.py script"
-     )
-     args = parser.parse_args()
-     repo_name = f"impresso-project/ner-{args.model_type}-{args.language}"
-     push_model_to_hub(args.checkpoint_dir, repo_name, args.script_path)
-     # PIPELINE_REGISTRY.register_pipeline(
-     #     "generic-ner",
-     #     pipeline_class=MultitaskTokenClassificationPipeline,
-     #     pt_model=ExtendedMultitaskModelForTokenClassification,
-     # )
-     # model.config.custom_pipelines = {
-     #     "generic-ner": {
-     #         "impl": "generic_ner.MultitaskTokenClassificationPipeline",
-     #         "pt": ["ExtendedMultitaskModelForTokenClassification"],
-     #         "tf": [],
-     #     }
-     # }
-     # classifier = pipeline(
-     #     "generic-ner", model=model, tokenizer=tokenizer, label_map=label_map
-     # )
-     # from pprint import pprint
-     #
-     # pprint(
-     #     classifier(
-     #         "1. Le public est averti que Charlotte née Bourgoin, femme-de Joseph Digiez, et Maurice Bourgoin, enfant mineur représenté par le sieur Jaques Charles Gicot son curateur, ont été admis par arrêt du Conseil d'Etat du 5 décembre 1797, à solliciter une renonciation générale et absolue aux biens et aux dettes présentes et futures de Jean-Baptiste Bourgoin leur père."
-     #     )
-     # )
-     # repo.push_to_hub(commit_message="Initial commit of the trained NER model with code")