Floret Language Identification & OCR Quality Scoring
This repository provides a simple and lightweight way to:
- Identify the language of a given text using a pre-trained Floret model.
- Score the quality of OCR-extracted text using predefined Bloom filters.
Installation
Before using language identification or OCR scoring, install the required dependencies:
pip install floret fasttext huggingface_hub
pip install cython pybloomfiltermmap3
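To confirm that the installation worked, a quick check such as the following sketch may help. The import names are assumptions based on how these packages are usually imported (in particular, pybloomfiltermmap3 is normally imported as pybloomfilter), and missing_modules is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names assumed for the pip packages listed above.
# pybloomfiltermmap3 is normally imported as `pybloomfilter`.
REQUIRED_MODULES = ["floret", "fasttext", "huggingface_hub", "pybloomfilter"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
```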
Language Identification
To use the Floret-based language detection model, load it dynamically with huggingface_hub:
from huggingface_hub import hf_hub_download
exec(open(hf_hub_download("Maslionok/sudo_pipelines", "floret_language_recognition.py")).read())
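The exec call runs the downloaded script in the current global namespace. If you prefer to keep its names separate, one possible alternative is to load the file as a module with importlib. This is a sketch under assumptions: load_pipeline and download_and_load are hypothetical helpers, and it assumes the script defines floret_model at its top level, as the usage in this README suggests.

```python
import importlib.util


def load_pipeline(local_path, module_name="sudo_pipeline"):
    """Load a Python file from disk as a module object (hypothetical helper)."""
    spec = importlib.util.spec_from_file_location(module_name, local_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script inside its own namespace
    return module


def download_and_load(filename, repo_id="Maslionok/sudo_pipelines"):
    """Download a script from the Hub, then load it as a module."""
    # Imported lazily so load_pipeline stays usable without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return load_pipeline(hf_hub_download(repo_id, filename))


# Assumed usage (requires network access):
# floret_model = download_and_load("floret_language_recognition.py").floret_model
```

Either way, the script is downloaded and executed locally, so only load it from a repository you trust.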
Usage
Once loaded, call the model on a plain text input to detect its language:
floret_model("this is a simple text")
Example output:
'en'
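When processing many texts, it can be convenient to group them by detected language. The sketch below assumes only that the detector is a callable returning a language code string, as in the example output above. group_by_language is a hypothetical helper, and the stub detector exists only so the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_by_language(texts, detect):
    """Group texts by the language code returned by `detect` (hypothetical helper)."""
    groups = defaultdict(list)
    for text in texts:
        groups[detect(text)].append(text)
    return dict(groups)


if __name__ == "__main__":
    # Stub detector for illustration only; pass floret_model instead once it is loaded.
    def stub_detect(text):
        return "fr" if "bonjour" in text.lower() else "en"

    print(group_by_language(["Bonjour le monde", "hello world"], stub_detect))
```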
OCR Quality Score Calculation
To assess OCR text quality, load the OCR scoring model:
from huggingface_hub import hf_hub_download
exec(open(hf_hub_download("Maslionok/sudo_pipelines", "OCR_score.py")).read())
Usage
Call OCR_score() on your text:
- Automatic language detection:
OCR_score("some OCR-extracted text")
- Specify a language manually:
OCR_score("some OCR-extracted text", language="en")
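This README does not describe how OCR_score uses its Bloom filters internally. As a general illustration of the idea, the sketch below scores a text by the fraction of its tokens found in a Bloom filter built from a known vocabulary. Everything in it is an assumption for illustration: TinyBloomFilter and known_word_ratio are hypothetical names, the tokenization is a simple regex, and the real scoring in OCR_score.py may differ.

```python
import hashlib
import re


class TinyBloomFilter:
    """Minimal pure-Python Bloom filter (illustrative, not the repository's filter)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return false positives, but never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def known_word_ratio(text, bloom):
    """Fraction of lowercase word tokens that the filter reports as known."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in bloom for token in tokens) / len(tokens)


if __name__ == "__main__":
    vocabulary = TinyBloomFilter()
    for word in ["the", "quick", "brown", "fox"]:
        vocabulary.add(word)
    print(known_word_ratio("The quick brown fox", vocabulary))
    print(known_word_ratio("Tbe qu1ck brovvn fox", vocabulary))
```

A Bloom filter keeps a large vocabulary in a compact bit array at the cost of occasional false positives, so a score of this kind can slightly overestimate quality but will not penalize words that are in the vocabulary.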