Model Card for impresso-project/language-identifier

Overview

impresso-project/language-identifier is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb) — the core languages of the Impresso Project, which focuses on analyzing historical media across national and linguistic borders.

The model is adapted to the short, fragmentary, OCR-noisy inputs typical of digitized historical texts.

Model Details

  • Model type: Language identification
  • Interface: Hugging Face transformers pipeline
  • Languages supported: fr, de, en, it, lb
  • License: AGPL-3.0
  • Developed by: University of Zurich (UZH), Switzerland
  • Training data: Historical newspapers from the impresso corpus and related sources

How to Use

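from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is a custom pipeline task shipped with the model repository,
# so trust_remote_code=True is required to load it.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

# Sample input: a short French passage.
text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

# Run language identification and print the prediction.
langs = lang_pipeline(text)
print(langs)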

Output Format

The pipeline returns a single dictionary containing the predicted language code and its confidence score:

{
  "language": "fr",
  "score": 1.0
}
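
Downstream code often needs to decide what to do with low-confidence predictions. The following is a minimal sketch, assuming the output has exactly the two keys shown above. The helper name `accept_language` and the 0.8 threshold are illustrative choices, not part of the model's interface.

```python
from typing import Optional

# Language codes listed in this model card.
SUPPORTED = {"de", "fr", "it", "en", "lb"}


def accept_language(prediction: dict, min_score: float = 0.8) -> Optional[str]:
    """Return the predicted language code, or None if the prediction
    is below the threshold or names an unsupported language.

    `min_score` is an arbitrary illustrative default, not a recommended value.
    """
    lang = prediction.get("language")
    score = prediction.get("score", 0.0)
    if lang not in SUPPORTED or score < min_score:
        return None
    return lang


print(accept_language({"language": "fr", "score": 1.0}))  # fr
```

A suitable threshold depends on the corpus, so it is worth calibrating on a labelled sample before relying on it.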

Use Cases

  • Preprocessing for OCR and NLP tasks on historical corpora
  • Document and segment-level language tagging
  • Filtering and sorting multilingual newspaper archives
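
As an illustration of the filtering and sorting use case, the sketch below groups texts by predicted language. It assumes an `identify` callable that returns a dictionary with a "language" key for a single string, as in the Output Format section. The function name `group_by_language` is hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str],
    identify: Callable[[str], dict],
) -> Dict[str, List[str]]:
    """Bucket texts by the language code that `identify` predicts for each."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        prediction = identify(text)  # expected shape: {"language": ..., "score": ...}
        groups[prediction["language"]].append(text)
    return dict(groups)
```

With the pipeline from How to Use loaded, a call such as `group_by_language(articles, lang_pipeline)` would be the intended usage, where `articles` is any iterable of strings.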

Limitations

  • Works best on sentence- or paragraph-length texts
  • May struggle with code-switching or OCR-degraded text that mixes languages
  • Primarily optimized for Impresso-like sources (19th–20th century newspapers)
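
One way to work around the code-switching limitation is to classify sentence-sized segments separately and inspect the distribution of predictions. This is a rough sketch under stated assumptions: the regex-based splitter, the 20-character minimum, and the names `split_segments` and `language_profile` are all illustrative, and `identify` stands for any callable that returns a dictionary with a "language" key.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple


def split_segments(text: str, min_chars: int = 20) -> List[str]:
    """Split after sentence-ending punctuation and drop very short fragments."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if len(part) >= min_chars]


def language_profile(
    text: str,
    identify: Callable[[str], dict],
    min_chars: int = 20,
) -> List[Tuple[str, int]]:
    """Count predicted languages per segment, most frequent first."""
    counts = Counter(
        identify(segment)["language"]
        for segment in split_segments(text, min_chars)
    )
    return counts.most_common()
```

A profile with more than one language suggests a mixed-language document that may deserve segment-level tags rather than a single document-level label.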

Installation

pip install transformers floret
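
To confirm that both packages are importable before loading the model, a small check along these lines can help. It uses only the standard library. The helper name `missing_packages` is illustrative, and the sketch assumes the import names match the pip package names.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both dependencies are available.
print(missing_packages(["transformers", "floret"]))
```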

Contact
