Model Card for impresso-project/language-identifier
Overview
impresso-project/language-identifier is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb), the core languages of the Impresso Project, which focuses on analyzing historical media across national and linguistic borders.
The model has been adapted to the short, fragmentary, OCR-noisy inputs typical of digitized historical texts.
Model Details
- Model type: Language identification
- Interface: Hugging Face transformers pipeline
- Languages supported: fr, de, en, it, lb
- License: AGPL-3.0
- Developed by: UZH, Switzerland
- Training data: Historical newspapers from the impresso corpus and related sources
How to Use
from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is not a built-in transformers task, so trust_remote_code=True
# lets the pipeline code shipped in the model repository be loaded.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

# Sample French input spanning several lines.
text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

langs = lang_pipeline(text)
print(langs)
Output Format
The output is a single dictionary with the predicted language and confidence score:
{
"language": "fr",
"score": 1.0
}
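A downstream step might discard low-confidence predictions before tagging. The following is a minimal sketch, assuming the output dictionary shown above; the helper name `accept_prediction`, the 0.5 threshold, and the `und` fallback label (the ISO 639 code for an undetermined language) are illustrative choices, not part of the model's API.

```python
def accept_prediction(prediction, min_score=0.5, fallback="und"):
    """Return the predicted language code, or `fallback` when the score is low.

    `prediction` is assumed to be a dict shaped like the output above,
    e.g. {"language": "fr", "score": 1.0}. The threshold is hypothetical.
    """
    if prediction["score"] >= min_score:
        return prediction["language"]
    return fallback


print(accept_prediction({"language": "fr", "score": 1.0}))   # fr
print(accept_prediction({"language": "lb", "score": 0.31}))  # und
```

A suitable threshold depends on the corpus and would need to be tuned on labeled samples.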
Use Cases
- Preprocessing for OCR and NLP tasks on historical corpora
- Document and segment-level language tagging
- Filtering and sorting multilingual newspaper archives
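Document-level tagging from segment-level predictions could be sketched as a score-weighted vote. This is one possible way to combine results, not a method described by the model authors; `document_language` is a hypothetical helper, and the hard-coded predictions stand in for real pipeline outputs.

```python
from collections import defaultdict


def document_language(segment_predictions):
    """Pick the language with the highest summed score across segments.

    Each item is assumed to be a dict like {"language": "de", "score": 0.98}.
    Returns None for an empty input.
    """
    totals = defaultdict(float)
    for pred in segment_predictions:
        totals[pred["language"]] += pred["score"]
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hard-coded stand-ins for per-segment pipeline outputs.
segments = [
    {"language": "de", "score": 0.98},
    {"language": "fr", "score": 0.60},
    {"language": "de", "score": 0.75},
]
print(document_language(segments))  # de
```

Summing scores lets a few confident segments outweigh many uncertain ones; a plain majority count would be a simpler alternative.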
Limitations
- Works best on sentence- or paragraph-length texts
- May struggle with code-switching or OCR-degraded text that mixes languages
- Primarily optimized for Impresso-like sources (19th–20th century newspapers)
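Since the model works best on sentence- or paragraph-length texts, a caller might skip very short fragments rather than rely on their predictions. A minimal sketch, assuming any callable that returns a dictionary in the output format described above (for instance the `lang_pipeline` object from the usage section); `identify_if_long_enough`, the 20-character minimum, and the stub identifier are hypothetical.

```python
def identify_if_long_enough(text, identify, min_chars=20):
    """Call `identify` only when the cleaned text has at least `min_chars` characters.

    `identify` is any callable returning a dict like
    {"language": "fr", "score": 1.0}; shorter inputs yield None.
    """
    cleaned = " ".join(text.split())  # collapse whitespace and line breaks
    if len(cleaned) < min_chars:
        return None
    return identify(cleaned)


# Stub standing in for the real pipeline so the sketch runs offline.
def fake_identify(text):
    return {"language": "fr", "score": 0.9}


print(identify_if_long_enough("Le  roi\n", fake_identify))  # None
print(identify_if_long_enough(
    "Le Royaume de France se trouvait au bord du désespoir.", fake_identify
))
```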
Installation
pip install transformers floret
Contact
- Website: https://impresso-project.ch