Gleb Vinarskis

Floret Language Identification & OCR Quality Scoring

This repository provides a simple and lightweight way to:

  • Identify the language of a given text using a pre-trained Floret model.
  • Score the quality of OCR-extracted text using predefined Bloom filters.

Installation

Before using the language identification or OCR scoring, install the required dependencies:

pip install floret huggingface_hub
pip install cython pybloomfiltermmap3
pip install fasttext

Language Identification

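To use the Floret-based language detection model, download its loader script with huggingface_hub and execute it. Executing the script defines floret_model in the current namespace:

from huggingface_hub import hf_hub_download

script_path = hf_hub_download("Maslionok/sudo_pipelines", "floret_language_recognition.py")
with open(script_path, encoding="utf-8") as f:
    exec(f.read())  # defines floret_model in the current namespace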
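Because exec runs whatever code the downloaded file contains directly in your namespace, you may prefer to load the script as a module instead. The following is a minimal sketch, not part of this repository: load_script_as_module is a hypothetical helper, and it assumes the script defines floret_model at its top level.

```python
import importlib.util


def load_script_as_module(path, module_name="hub_script"):
    """Load a .py file as a module object instead of exec-ing it in place."""
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load a Python module from {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script inside the module's own namespace
    return module


# Usage (requires network access, so it is left commented out):
# from huggingface_hub import hf_hub_download
# path = hf_hub_download("Maslionok/sudo_pipelines", "floret_language_recognition.py")
# lang_id = load_script_as_module(path, "floret_language_recognition")
# lang_id.floret_model("this is a simple text")
```

With this approach the names defined by the script stay on the returned module object rather than overwriting variables in your own code.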

Usage

Once loaded, call the model on a plain text input to detect its language:

floret_model("this is a simple text")

Output Example:

'en'
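To label many texts at once, the call can be wrapped in a small helper. This is a sketch under assumptions: language_counts is a hypothetical name, and detect stands for any callable that maps a string to a language code, such as floret_model. Whitespace is collapsed as a precaution, since fastText-style predictors generally expect single-line input.

```python
from collections import Counter


def language_counts(texts, detect):
    """Count detected languages over an iterable of texts.

    `detect` is any callable mapping a string to a language code
    (assumption: floret_model behaves this way, as in the example above).
    """
    counts = Counter()
    for text in texts:
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if cleaned:  # skip empty or whitespace-only inputs
            counts[detect(cleaned)] += 1
    return counts


# Usage, once floret_model has been loaded:
# language_counts(["this is a simple text", "das ist ein Text"], floret_model)
```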

OCR Quality Score Calculation

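To assess OCR text quality, download the OCR scoring script and execute it. Executing the script defines OCR_score in the current namespace:

from huggingface_hub import hf_hub_download

script_path = hf_hub_download("Maslionok/sudo_pipelines", "OCR_score.py")
with open(script_path, encoding="utf-8") as f:
    exec(f.read())  # defines OCR_score in the current namespace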

Usage

Call OCR_score() on your text:

  • Automatic language detection:
    OCR_score("some OCR-extracted text")
    
  • Specify a language manually:
    OCR_score("some OCR-extracted text", language="en")
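This README does not describe how the score is computed. As background, a Bloom filter answers "have I seen this word?" with no false negatives and a small false-positive rate, so one common way to use it for OCR quality is to measure the share of tokens that appear in a known-word filter. The sketch below illustrates that general idea only: lexicon_hit_ratio and known_words are hypothetical names, a plain Python set stands in for the Bloom filter, and none of this is taken from OCR_score.py.

```python
import re


def lexicon_hit_ratio(text, known_words):
    """Return the fraction of alphabetic tokens found in `known_words`.

    `known_words` is any container supporting `in` (a set here; a Bloom
    filter would offer the same membership test with less memory).
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not tokens:
        return 0.0  # nothing to score
    hits = sum(1 for tok in tokens if tok in known_words)
    return hits / len(tokens)


known = {"the", "quick", "brown", "fox"}
print(lexicon_hit_ratio("The quick brwn fox", known))  # 3 of 4 tokens are known
```

A text with many OCR errors produces many unknown tokens and therefore a lower ratio, which is why a per-language word list matters and why the language argument above is useful when automatic detection is unreliable.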