---
library_name: transformers
language:
- en
- fr
- de
tags:
- v1.0.0
---

# Model Card for `impresso-project/ner-stacked-bert-multilingual`

The **Impresso NER model** is a multilingual named entity recognition model trained for historical document processing. It is based on a stacked Transformer architecture and identifies coarse- and fine-grained entity types, such as persons, organizations, locations, time expressions, and products, in digitized historical texts.

## Model Details

### Model Description

- **Developed by:** The [Impresso team](https://impresso-project.ch) at EPFL. Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Stacked BERT-based token classification for named entity recognition
- **Languages:** French, German, English (with support for multilingual historical texts)
- **License:** [AGPL v3+](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
- **Finetuned from:** [`dbmdz/bert-medium-historic-multilingual-cased`](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)


### Model Architecture

The model architecture consists of the following components:
- A **pre-trained BERT encoder** (multilingual historic BERT) as the base.
- **One or two Transformer encoder layers** stacked on top of the BERT encoder.
- A **Conditional Random Field (CRF)** decoder layer to model label dependencies.
- **Learned absolute positional embeddings** for improved handling of noisy inputs.

These additional Transformer layers help mitigate the effects of OCR noise, spelling variation, and non-standard language use found in historical documents. The entire stack is fine-tuned end-to-end for token classification, as sketched below.
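For orientation, here is a simplified PyTorch sketch of this stacking pattern. It is not the project's actual implementation: the CRF comes from the third-party `pytorch-crf` package, and the hyperparameters shown (number of attention heads, label count) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf
from transformers import AutoModel

class StackedBertCRF(nn.Module):
    """Simplified sketch: BERT encoder -> stacked Transformer layers -> CRF."""

    def __init__(self, base="dbmdz/bert-medium-historic-multilingual-cased",
                 num_labels=21, num_stacked_layers=2, max_len=512):
        super().__init__()
        self.bert = AutoModel.from_pretrained(base)
        hidden = self.bert.config.hidden_size
        # Learned absolute positional embeddings added to the BERT output
        self.pos_emb = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.stack = nn.TransformerEncoder(layer, num_layers=num_stacked_layers)
        self.classifier = nn.Linear(hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        positions = torch.arange(h.size(1), device=h.device)
        h = self.stack(h + self.pos_emb(positions),
                       src_key_padding_mask=~attention_mask.bool())
        emissions = self.classifier(h)
        if labels is not None:
            # Training: CRF negative log-likelihood of the gold label sequence
            return -self.crf(emissions, labels, mask=attention_mask.bool())
        # Inference: Viterbi decoding of the best label sequence per example
        return self.crf.decode(emissions, mask=attention_mask.bool())
```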

### Entity Types Supported

The model supports both coarse-grained and fine-grained entity types defined in the HIPE-2020/HIPE-2022 guidelines. It outputs structured predictions with contextual and semantic details; each prediction is a dictionary with the following fields:

```python
{
  'type': 'pers' | 'org' | 'loc' | 'time' | 'prod',
  'confidence_ner': float,              # Confidence score
  'surface': str,                       # Surface form in text
  'lOffset': int,                       # Start character offset
  'rOffset': int,                       # End character offset
  'name': str,                          # Optional: full name (for persons)
  'title': str,                         # Optional: title (for persons)
  'function': str                       # Optional: function (if detected)
}
```


#### Coarse-Grained Entity Types
- **pers**: Person entities (individuals, collectives, authors)
- **org**: Organizations (administrative, enterprise, press agencies)
- **prod**: Products (media)
- **time**: Time expressions (absolute dates)
- **loc**: Locations (towns, regions, countries, physical, facilities)

When the text surrounding an entity contains them, the model also returns **person-specific attributes** such as:
- `name`: canonical full name
- `title`: honorific or title (e.g., "king", "chancellor")
- `function`: role or function in context (if available)

### Model Sources

- **Repository:** https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
- **Paper:** [CoNLL 2020](https://aclanthology.org/2020.conll-1.35/)
- **Demo:** [Impresso project](https://impresso-project.ch)

## Uses

### Direct Use

The model is intended to be used directly with the Hugging Face `pipeline` for token classification, specifically via the custom `generic-ner` task (loaded with `trust_remote_code=True`) on historical texts.

### Downstream Use

The model can be used for downstream tasks such as:
- Historical information extraction
- Biographical reconstruction
- Place and person mention detection across historical archives

### Out-of-Scope Use

- Not suitable for contemporary named entity recognition in domains such as social media or modern news.
- Not optimized for clean, born-digital (OCR-free) modern corpora.

## Bias, Risks, and Limitations

Because it was trained on historical documents, the model may reflect historical biases and inaccuracies. It may also underperform on contemporary texts or non-European languages.

### Recommendations

- Users should be cautious of historical and typographical biases.
- Consider post-processing to filter false positives from OCR noise.

## How to Get Started with the Model

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

ner_pipeline = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer, trust_remote_code=True, device='cpu')

sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
entities = ner_pipeline(sentence)
print(entities)
```
#### Example Output

```python
[
  {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
  {'type': 'loc', 'confidence_ner': 90.75, 'surface': "Europe", 'lOffset': 69, 'rOffset': 75},
  {'type': 'loc', 'confidence_ner': 75.45, 'surface': "Royaume de France", 'lOffset': 80, 'rOffset': 97},
  {'type': 'pers', 'confidence_ner': 85.27, 'surface': "roi Philippe VI", 'lOffset': 181, 'rOffset': 196, 'title': "roi", 'name': "roi Philippe VI"},
  {'type': 'loc', 'confidence_ner': 30.59, 'surface': "Louvre", 'lOffset': 210, 'rOffset': 216},
  {'type': 'loc', 'confidence_ner': 94.46, 'surface': "Paris", 'lOffset': 266, 'rOffset': 271},
  {'type': 'pers', 'confidence_ner': 96.1, 'surface': "chancelier Guillaume de Nogaret", 'lOffset': 350, 'rOffset': 381, 'title': "chancelier", 'name': "Guillaume de Nogaret"},
  {'type': 'loc', 'confidence_ner': 49.35, 'surface': "Royaume", 'lOffset': 80, 'rOffset': 87},
  {'type': 'loc', 'confidence_ner': 24.18, 'surface': "France", 'lOffset': 91, 'rOffset': 97}
]
```
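As noted in the Recommendations above, a light post-processing pass can filter low-confidence predictions, which often stem from OCR noise. Below is a minimal, illustrative example using the `entities` list returned above; the threshold of 80 is an arbitrary choice, not a recommended value.

```python
# Keep only person mentions above an (arbitrary) confidence threshold
confident_persons = [
    e for e in entities
    if e["type"] == "pers" and e["confidence_ner"] >= 80.0
]
for e in confident_persons:
    # Surface form, optional title, and character span in the input text
    print(f'{e["surface"]} ({e.get("title", "-")}) at [{e["lOffset"]}:{e["rOffset"]}]')
```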

## Training Details

### Training Data

The model was trained on the Impresso HIPE-2020 dataset, a subset of the [HIPE-2022 corpus](https://github.com/hipe-eval/HIPE-2022-data), which includes richly annotated OCR-transcribed historical newspaper content.

### Training Procedure

#### Preprocessing

OCR content was cleaned and segmented. Entity types follow the HIPE-2020 typology.

#### Training Hyperparameters

- **Training regime:** Mixed precision (fp16)
- **Epochs:** 5
- **Max sequence length:** 512
- **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
- **Stacked Transformer layers:** 2
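For orientation, a minimal sketch of how the documented values might map onto a Hugging Face `TrainingArguments` configuration; everything not listed above (output directory, batch size, learning rate) is an illustrative assumption, not a reported value.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ner-stacked-bert-multilingual",  # assumption
    num_train_epochs=5,                  # from this card
    fp16=True,                           # mixed-precision regime from this card
    per_device_train_batch_size=16,      # assumption
    learning_rate=5e-5,                  # assumption
)
```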

#### Speeds, Sizes, Times

- **Model size:** ~500 MB
- **Training time:** ~1 h on a single GPU (NVIDIA TITAN X)

## Evaluation

### Testing Data

A held-out portion of HIPE-2020 (French and German).

### Metrics

- F1-score (micro, macro)
- Entity-level precision/recall
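Entity-level precision, recall, and F1 of this kind can be computed with a standard sequence-labeling scorer such as the `seqeval` package. The snippet below is a generic illustration with toy IOB2 sequences, not the project's evaluation harness.

```python
from seqeval.metrics import classification_report

# Toy gold and predicted IOB2 label sequences, purely illustrative
y_true = [["B-pers", "I-pers", "O", "B-loc", "O"]]
y_pred = [["B-pers", "I-pers", "O", "O", "O"]]

# Entity-level precision/recall/F1 per type, plus micro/macro averages
print(classification_report(y_true, y_pred))
```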

### Results

| Language | Precision | Recall | F1-score |
|----------|-----------|--------|----------|
| French   | 84.2      | 81.6   | 82.9     |
| German   | 82.0      | 78.7   | 80.3     |

#### Summary

The model performs robustly on noisy, OCR-derived historical content and supports fine-grained entity typologies.

## Environmental Impact

- **Hardware Type:** NVIDIA TITAN X (Pascal, 12GB)
- **Hours used:** ~1 hour
- **Compute Provider:** EPFL (on-premise), Switzerland
- **Carbon Emitted:** ~0.022 kg CO₂eq (estimated)
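For reference, this estimate is roughly consistent with 1 h × 250 W (the TITAN X's TDP) ≈ 0.25 kWh of energy, multiplied by a grid carbon intensity of about 0.09 kg CO₂eq/kWh; both inputs are approximations rather than measured values.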

## Technical Specifications

### Model Architecture and Objective

Stacked BERT architecture with multitask token classification head supporting HIPE-type entity labels.

### Compute Infrastructure

#### Hardware

1x NVIDIA TITAN X (Pascal, 12GB)

#### Software

- Python 3.11
- PyTorch 2.0
- Transformers 4.36

## Citation

**BibTeX:**

```bibtex
@inproceedings{boros2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
  booktitle={Proceedings of the 24th Conference on Computational Natural Language Learning},
  pages={431--441},
  year={2020}
}
```

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>