---
library_name: transformers
language:
- fr
- de
- en
- it
- lb
license: agpl-3.0
tags:
- language-identification
- multilingual
- historical
- impresso
---

# Model Card for `impresso-project/language-identifier`

## Overview

`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)**, the core languages of the [Impresso Project](https://impresso-project.ch), which focuses on analyzing historical media across national and linguistic borders.

The model has been adapted to the short, OCR-noisy, and fragmentary inputs typical of digitized historical texts.

## Model Details

### Model Description

- **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Language identification using a transformer-based classification architecture
- **Languages:** French, German, English, Italian, Luxembourgish
- **License:** AGPL-3.0
- **Finetuned from:** Custom model trained on historical newspaper data from the Impresso corpus

## How to Use

```python
from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is a custom pipeline task shipped with the model repository,
# so trust_remote_code=True is needed to load it.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

text = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""

# Returns a dictionary with the predicted language and its confidence score.
langs = lang_pipeline(text)
print(langs)
```

## Output Format

The output is a single dictionary with the predicted language and confidence score:

```python
{
    "language": "fr",
    "score": 1.0
}
```

## Use Cases

- Preprocessing for OCR and NLP tasks on historical corpora
- Document- and segment-level language tagging
- Filtering and sorting multilingual newspaper archives

## Limitations

- Works best on **sentence- or paragraph-length** texts
- May struggle with code-switching or with OCR-degraded text that mixes languages
- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)

## Installation

```bash
pip install transformers floret
```

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)
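
## Example: Filtering by Confidence

Because very short or heavily degraded segments may be misclassified, it can help to discard low-confidence predictions before tagging an archive. The sketch below works only on dictionaries in the documented output format (`language`, `score`) and does not call the model. The helper name `tag_segments`, the `MIN_SCORE` cut-off, and the sample predictions are illustrative assumptions, not part of the model's API or real model output.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cut-off; choose a value that suits your own data.
MIN_SCORE = 0.8


def tag_segments(
    segments: List[str],
    predictions: List[Dict[str, object]],
    min_score: float = MIN_SCORE,
) -> List[Tuple[str, Optional[str]]]:
    """Pair each segment with its predicted language.

    Predictions scoring below min_score are tagged None so they can be
    reviewed or re-run on a longer context.
    """
    if len(segments) != len(predictions):
        raise ValueError("segments and predictions must have the same length")
    tagged = []
    for text, pred in zip(segments, predictions):
        confident = float(pred["score"]) >= min_score
        tagged.append((text, str(pred["language"]) if confident else None))
    return tagged


# Illustrative values in the documented output format, not real model output.
segments = ["Le Royaume de France", "Gazette 1848"]
predictions = [
    {"language": "fr", "score": 1.0},
    {"language": "de", "score": 0.42},
]

for text, lang in tag_segments(segments, predictions):
    print(f"{lang or 'uncertain'}\t{text}")
```

In practice, `predictions` would be collected by calling the pipeline on each segment as shown under "How to Use".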
