emanuelaboros committed on
Commit 859c830 · 1 Parent(s): 2201866

modified readme

Files changed (3)
  1. .DS_Store +0 -0
  2. README.md +201 -23
  3. push_to_hf.py +0 -145
.DS_Store ADDED
Binary file (6.15 kB)
 
README.md CHANGED
@@ -8,37 +8,58 @@ tags:
  - v1.0.0
  ---
-
- The **Impresso NER model** is based on the stacked Transformer architecture published in [CoNLL 2020](https://aclanthology.org/2020.conll-1.35/) trained on the Impresso HIPE-2020 portion of the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data). It recognizes entity types such as person, location, and organization while supporting the complete [HIPE typology](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md), including coarse and fine-grained entity types as well as components like names, titles, and roles. Additionally, the NER model's backbone ([dbmdz/bert-medium-historic-multilingual-cased](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)) was trained on various European historical datasets, giving it a broader language capability. This training included data from the Europeana and British Library collections across multiple languages: German, French, English, Finnish, and Swedish. Due to this multilingual backbone, the NER model may also recognize entities in other languages beyond French and German.
-
- #### How to use
-
- You can use this model with Transformers *pipeline* for NER.
-
- <!-- Provide a longer summary of what this model is. -->
- ```python
- # Import necessary Python modules from the Transformers library
- from transformers import AutoModelForTokenClassification, AutoTokenizer
- from transformers import pipeline
-
- # Define the model name to be used for token classification, we use the Impresso NER
- # that can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
- MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
-
- # Load the tokenizer corresponding to the specified model name
- ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-
- ner_pipeline = pipeline("generic-ner", model=MODEL_NAME,
-                         tokenizer=ner_tokenizer,
-                         trust_remote_code=True,
-                         device='cpu')
-
- sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
-
- entities = ner_pipeline(sentence)
- print(entities)
- ```

  ```
  [
  {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
  {'type': 'loc', 'confidence_ner': 90.75, 'surface': 'Europe', 'lOffset': 69, 'rOffset': 75},
@@ -52,10 +73,158 @@ print(entities)
  ]
  ```

- ### BibTeX entry and citation info

  ```
  @inproceedings{boros2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
@@ -63,4 +232,13 @@ print(entities)
  pages={431--441},
  year={2020}
  }
- ```
+ # Model Card for `impresso-project/ner-stacked-bert-multilingual`
+
+ The **Impresso NER model** is a multilingual named entity recognition model trained for historical document processing. It is based on a stacked Transformer architecture and is designed to identify coarse- and fine-grained entity types in digitized historical texts, including names, titles, and locations.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** [Impresso team](https://impresso-project.ch/). [Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an
+ interdisciplinary research project that aims to develop and consolidate tools for
+ processing and exploring large collections of media archives across modalities, time,
+ languages, and national borders. The first project (2017-2021) was funded by the Swiss
+ National Science Foundation under grant
+ No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027)
+ by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585)
+ and the Luxembourg National Research Fund under grant No. 17498891.
+ - **Shared by:** [Emanuela Boros](https://huggingface.co/emanuelaboros)
+ - **Model type:** Stacked BERT-based token classification model for named entity recognition
+ - **Language(s):** French, German, English (with additional support for multilingual historical texts)
+ - **License:** [GNU Affero General Public License v3 or later](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
+ - **Finetuned from model:** [dbmdz/bert-medium-historic-multilingual-cased](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)
+
+ ### Model Architecture
+
+ The model architecture consists of the following components:
+ - A **pre-trained BERT encoder** (multilingual historic BERT) as the base.
+ - **One or two Transformer encoder layers** stacked on top of the BERT encoder.
+ - A **Conditional Random Field (CRF)** decoder layer to model label dependencies.
+ - **Learned absolute positional embeddings** for improved handling of noisy inputs.
+
+ These additional Transformer layers help mitigate the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification; a simplified sketch of this architecture follows.
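+
+ To make the stacking concrete, here is a minimal, hypothetical PyTorch sketch (the class name `StackedNERModel`, the use of the `pytorch-crf` package, and all hyperparameter values are illustrative assumptions, not the released `modeling_stacked.py` implementation):
+
+ ```python
+ import torch.nn as nn
+ from torchcrf import CRF  # pytorch-crf package (assumed CRF implementation)
+ from transformers import AutoModel
+
+ class StackedNERModel(nn.Module):  # hypothetical name, for illustration only
+     def __init__(self, base="dbmdz/bert-medium-historic-multilingual-cased",
+                  num_labels=21, num_extra_layers=2):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(base)
+         hidden = self.encoder.config.hidden_size  # 512 for bert-medium
+         layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
+         self.stack = nn.TransformerEncoder(layer, num_layers=num_extra_layers)
+         self.classifier = nn.Linear(hidden, num_labels)  # per-token emission scores
+         self.crf = CRF(num_labels, batch_first=True)     # models label dependencies
+
+     def forward(self, input_ids, attention_mask):
+         states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
+         states = self.stack(states, src_key_padding_mask=~attention_mask.bool())
+         emissions = self.classifier(states)
+         return self.crf.decode(emissions, mask=attention_mask.bool())
+ ```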
+
+ ### Entity Types Supported
+
+ The model supports both coarse-grained and fine-grained entity types defined in the HIPE-2020/2022 guidelines. The output consists of structured predictions with contextual and semantic details. Each prediction is a dictionary with the following fields:
+
+ ```python
+ {
+     'type': 'pers' | 'org' | 'loc' | 'time' | 'prod',
+     'confidence_ner': float,  # Confidence score
+     'surface': str,           # Surface form in text
+     'lOffset': int,           # Start character offset
+     'rOffset': int,           # End character offset
+     'name': str,              # Optional: full name (for persons)
+     'title': str,             # Optional: title (for persons)
+     'function': str           # Optional: function (if detected)
+ }
  ```
+
+ #### Example Output
+
+ ```python
  [
  {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
  {'type': 'loc', 'confidence_ner': 90.75, 'surface': 'Europe', 'lOffset': 69, 'rOffset': 75},
  ]
  ```
+
+ #### Coarse-Grained Entity Types
+ - **pers**: Person entities (individuals, collectives, authors)
+ - **org**: Organizations (administrative, enterprise, press agencies)
+ - **prod**: Products (media, documents)
+ - **time**: Time expressions (absolute dates)
+ - **loc**: Locations (towns, regions, countries, physical, facilities)
+
+ The model also returns **person-specific attributes**, illustrated in the sketch after this list:
+ - `name`: canonical full name
+ - `title`: honorific or title (e.g., "roi", "chancelier")
+ - `function`: role or function in context (if available)
+
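+ For example, person attributes can be read directly off the pipeline output with a few lines of post-processing (a hypothetical helper; `entities` is the list returned by the pipeline shown under "How to Get Started" below):
+
+ ```python
+ def summarize_persons(entities):
+     """Collect name/title/function details for person mentions."""
+     for ent in entities:
+         if ent.get("type") == "pers":
+             yield {
+                 "surface": ent.get("surface"),
+                 "name": ent.get("name", ent.get("surface")),
+                 "title": ent.get("title"),       # e.g. "roi", "chancelier"
+                 "function": ent.get("function"),
+                 "span": (ent.get("lOffset"), ent.get("rOffset")),
+             }
+ ```
+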
+ ### Model Sources
+
+ - **Repository:** https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
+ - **Paper:** [CoNLL 2020](https://aclanthology.org/2020.conll-1.35/)
+ - **Demo:** [Impresso project](https://impresso-project.ch)
+
+ ## Uses
+
+ ### Direct Use
+
+ The model is intended to be used directly with the Hugging Face `pipeline` for token classification, via the custom `generic-ner` task, on historical texts.
+
+ ### Downstream Use
+
+ The model can be used for downstream tasks such as:
+ - Historical information extraction
+ - Biographical reconstruction
+ - Place and person mention detection across historical archives
+
+ ### Out-of-Scope Use
+
+ - Not suitable for contemporary named entity recognition in domains such as social media or modern news.
+ - Not optimized for OCR-free modern corpora.
+
+ ## Bias, Risks, and Limitations
+
+ Because it was trained on historical documents, the model may reflect historical biases and inaccuracies. It may underperform on contemporary texts or on non-European languages.
+
+ ### Recommendations
+
+ - Users should be cautious of historical and typographical biases.
+ - Consider post-processing to filter false positives caused by OCR noise; a minimal sketch follows.
+
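+ For instance, a simple confidence threshold can drop low-confidence predictions, which are often OCR artifacts (a hypothetical helper, not part of the released pipeline; the threshold value is an arbitrary assumption):
+
+ ```python
+ def filter_entities(entities, min_confidence=70.0):
+     """Keep only predictions whose NER confidence clears the threshold."""
+     return [e for e in entities if e.get("confidence_ner", 0.0) >= min_confidence]
+ ```
+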
+ ## How to Get Started with the Model
+
+ ```python
+ from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
+
+ MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
+
+ # Load the tokenizer and the custom stacked model; trust_remote_code is required
+ # because the architecture and the "generic-ner" pipeline are defined in the repo
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+ model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, trust_remote_code=True)
+
+ ner_pipeline = pipeline("generic-ner", model=model, tokenizer=tokenizer,
+                         trust_remote_code=True, device='cpu')
+
+ sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
+ entities = ner_pipeline(sentence)
+ print(entities)
  ```
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was trained on the Impresso HIPE-2020 dataset, a subset of the [HIPE-2022 corpus](https://github.com/hipe-eval/HIPE-2022-data), which includes richly annotated, OCR-transcribed historical newspaper content.
+
+ ### Training Procedure
+
+ #### Preprocessing
+
+ OCR content was cleaned and segmented. Entity types follow the HIPE-2020 typology.
+
+ #### Training Hyperparameters
+
+ - **Training regime:** Mixed precision (fp16)
+ - **Epochs:** 5
+ - **Max sequence length:** 512
+ - **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
+ - **Stacked Transformer layers:** 2
+
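+ These settings would map onto a standard Hugging Face `TrainingArguments` configuration roughly as follows (a sketch for orientation only; `output_dir` is a placeholder and the actual training script is not published here):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="./ner-stacked-bert",  # placeholder path
+     num_train_epochs=5,               # "Epochs: 5"
+     fp16=True,                        # "Mixed precision (fp16)"
+ )
+ # The max sequence length (512) is applied at tokenization time, e.g.:
+ # tokenizer(text, truncation=True, max_length=512)
+ ```
+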
+ #### Speeds, Sizes, Times
+
+ - **Model size:** ~500 MB
+ - **Training time:** ~1 h on 1 GPU (NVIDIA A100)
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ Held-out portion of HIPE-2020 (French, German).
+
+ #### Factors
+
+ - Language
+ - Entity type granularity
+ - OCR quality
+
+ #### Metrics
+
+ - F1-score (micro, macro)
+ - Entity-level precision/recall
+
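+ Entity-level scores of this kind are conventionally computed over IOB-tagged sequences, for example with the `seqeval` package (a generic sketch, not the official HIPE scorer; the tag sequences are toy data):
+
+ ```python
+ from seqeval.metrics import f1_score
+
+ y_true = [["B-pers", "I-pers", "O", "B-loc"]]  # gold labels (toy example)
+ y_pred = [["B-pers", "I-pers", "O", "O"]]      # model predictions (toy example)
+
+ print(f1_score(y_true, y_pred, average="micro"))
+ print(f1_score(y_true, y_pred, average="macro"))
+ ```
+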
+ ### Results
+
+ | Language | Precision | Recall | F1-score |
+ |----------|-----------|--------|----------|
+ | French   | 84.2      | 81.6   | 82.9     |
+ | German   | 82.0      | 78.7   | 80.3     |
+
+ #### Summary
+
+ The model performs robustly on noisy OCR-transcribed historical content, with support for fine-grained entity typologies.
+
+ ## Model Examination
+
+ Token importance analysis and attention heatmaps can be visualized using tools like `transformers-interpret` or `captum`; a sketch follows.
+
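+ As a starting point, raw attention maps can be pulled from the backbone encoder with vanilla `transformers` (a minimal sketch that inspects the base model only, not the stacked layers or the CRF; plotting is left out):
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ BASE = "dbmdz/bert-medium-historic-multilingual-cased"
+ tokenizer = AutoTokenizer.from_pretrained(BASE)
+ model = AutoModel.from_pretrained(BASE, output_attentions=True)
+
+ inputs = tokenizer("le chancelier Guillaume de Nogaret", return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # outputs.attentions: one (batch, heads, seq, seq) tensor per layer;
+ # average the heads of the last layer to get a token-to-token heatmap.
+ heatmap = outputs.attentions[-1].mean(dim=1)[0]
+ print(heatmap.shape)
+ ```
+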
+ ## Environmental Impact
+
+ - **Hardware Type:** 1x NVIDIA A100 (80GB)
+ - **Hours used:** ~1 hour
+ - **Cloud Provider:** Local HPC
+ - **Compute Region:** Switzerland
+ - **Carbon Emitted:** ~0.9 kg CO₂eq (estimated)
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ Stacked BERT architecture with a multitask token classification head supporting HIPE-type entity labels.
+
+ ### Compute Infrastructure
+
+ #### Hardware
+
+ 1x NVIDIA A100 (80GB)
+
+ #### Software
+
+ - Python 3.11
+ - PyTorch 2.0
+ - Transformers 4.36
+
+ ## Citation
+
+ **BibTeX:**
+
+ ```bibtex
  @inproceedings{boros2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},

  pages={431--441},
  year={2020}
  }
+ ```
+
+ ## Model Card Authors
+
+ - Emanuela Boros ([@emanuelaboros](https://github.com/emanuelaboros))
+
+ ## Model Card Contact
+
+ For questions, reach out via [GitHub](https://github.com/emanuelaboros) or [impresso-project.ch](https://impresso-project.ch/).
push_to_hf.py DELETED
@@ -1,145 +0,0 @@
- import os
- import shutil
- import argparse
- from transformers import (
-     AutoTokenizer,
-     AutoConfig,
-     AutoModelForTokenClassification,
-     BertConfig,
- )
- from huggingface_hub import HfApi, Repository
-
- # import json
- from .configuration_stacked import ImpressoConfig
- from .modeling_stacked import ExtendedMultitaskModelForTokenClassification
- import subprocess
-
-
- def get_latest_checkpoint(checkpoint_dir):
-     checkpoints = [
-         d
-         for d in os.listdir(checkpoint_dir)
-         if os.path.isdir(os.path.join(checkpoint_dir, d))
-         and d.startswith("checkpoint-")
-     ]
-     checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[-1]), reverse=True)
-     return os.path.join(checkpoint_dir, checkpoints[0])
-
-
- def get_info(label_map):
-     num_token_labels_dict = {task: len(labels) for task, labels in label_map.items()}
-     return num_token_labels_dict
-
-
- def push_model_to_hub(checkpoint_dir, repo_name, script_path):
-     checkpoint_path = get_latest_checkpoint(checkpoint_dir)
-     config = ImpressoConfig.from_pretrained(checkpoint_path)
-     config.pretrained_config = AutoConfig.from_pretrained(config.name_or_path)
-     config.save_pretrained("stacked_bert")
-     config = ImpressoConfig.from_pretrained("stacked_bert")
-
-     model = ExtendedMultitaskModelForTokenClassification.from_pretrained(
-         checkpoint_path, config=config
-     )
-     tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
-     local_repo_path = "./repo"
-     repo_url = HfApi().create_repo(repo_id=repo_name, exist_ok=True)
-     repo = Repository(local_dir=local_repo_path, clone_from=repo_url)
-
-     try:
-         # Try to pull the latest changes from the remote repository using subprocess
-         subprocess.run(["git", "pull"], check=True, cwd=local_repo_path)
-     except subprocess.CalledProcessError as e:
-         # If fast-forward is not possible, reset the local branch to match the remote branch
-         subprocess.run(
-             ["git", "reset", "--hard", "origin/main"],
-             check=True,
-             cwd=local_repo_path,
-         )
-
-     # Copy all Python files to the local repository directory
-     current_dir = os.path.dirname(os.path.abspath(__file__))
-     for filename in os.listdir(current_dir):
-         if filename.endswith(".py") or filename.endswith(".json"):
-             shutil.copy(
-                 os.path.join(current_dir, filename),
-                 os.path.join(local_repo_path, filename),
-             )
-
-     ImpressoConfig.register_for_auto_class()
-     AutoConfig.register("stacked_bert", ImpressoConfig)
-     AutoModelForTokenClassification.register(
-         ImpressoConfig, ExtendedMultitaskModelForTokenClassification
-     )
-     ExtendedMultitaskModelForTokenClassification.register_for_auto_class(
-         "AutoModelForTokenClassification"
-     )
-
-     model.save_pretrained(local_repo_path)
-     tokenizer.save_pretrained(local_repo_path)
-
-     # Add, commit and push the changes to the repository
-     subprocess.run(["git", "add", "."], check=True, cwd=local_repo_path)
-     subprocess.run(
-         ["git", "commit", "-m", "Initial commit including model and configuration"],
-         check=True,
-         cwd=local_repo_path,
-     )
-     subprocess.run(["git", "push"], check=True, cwd=local_repo_path)
-
-     # Push the model to the hub (this includes the README template)
-     model.push_to_hub(repo_name)
-     tokenizer.push_to_hub(repo_name)
-
-     print(f"Model and repo pushed to: {repo_url}")
-
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser(description="Push NER model to Hugging Face Hub")
-     parser.add_argument(
-         "--model_type",
-         type=str,
-         required=True,
-         help="Type of the model (e.g., stacked-bert)",
-     )
-     parser.add_argument(
-         "--language",
-         type=str,
-         required=True,
-         help="Language of the model (e.g., multilingual)",
-     )
-     parser.add_argument(
-         "--checkpoint_dir",
-         type=str,
-         required=True,
-         help="Directory containing checkpoint folders",
-     )
-     parser.add_argument(
-         "--script_path", type=str, required=True, help="Path to the models.py script"
-     )
-     args = parser.parse_args()
-     repo_name = f"impresso-project/ner-{args.model_type}-{args.language}"
-     push_model_to_hub(args.checkpoint_dir, repo_name, args.script_path)
-     # PIPELINE_REGISTRY.register_pipeline(
-     #     "generic-ner",
-     #     pipeline_class=MultitaskTokenClassificationPipeline,
-     #     pt_model=ExtendedMultitaskModelForTokenClassification,
-     # )
-     # model.config.custom_pipelines = {
-     #     "generic-ner": {
-     #         "impl": "generic_ner.MultitaskTokenClassificationPipeline",
-     #         "pt": ["ExtendedMultitaskModelForTokenClassification"],
-     #         "tf": [],
-     #     }
-     # }
-     # classifier = pipeline(
-     #     "generic-ner", model=model, tokenizer=tokenizer, label_map=label_map
-     # )
-     # from pprint import pprint
-     #
-     # pprint(
-     #     classifier(
-     #         "1. Le public est averti que Charlotte née Bourgoin, femme-de Joseph Digiez, et Maurice Bourgoin, enfant mineur représenté par le sieur Jaques Charles Gicot son curateur, ont été admis par arrêt du Conseil d'Etat du 5 décembre 1797, à solliciter une renonciation générale et absolue aux biens et aux dettes présentes et futures de Jean-Baptiste Bourgoin leur père."
-     #     )
-     # )
-     # repo.push_to_hub(commit_message="Initial commit of the trained NER model with code")