MarineLives

non-profit

http://www.marinelives.org/wiki/MarineLives

Addaci

Activity Feed Request to join this org

AI & ML interests

Use of LLMs in post-production clean up of HTR for Early Modern Legal depositions

Organization Card

Community About org cards

1.1 Fine-tuning Small LLMs

Exploring the potential of small LLMs for cleaning Raw HTR outputs from machine-transcribed English Admiralty depositions.

Fine-Tuned Models

mT5-small (300M parameters)
GPT-2 Small (124M parameters)
LLaMA 3.1 (1B parameters)

Current Training Data

100 pages: 40,000 lines (~0.4M words)
200 pages: 80,000 lines (~0.8M words)
400 pages: 160,000 lines (~1.6M words)

Objectives

Word Correction: Identify and correct errors using contextual and grammatical cues.
Language Identification: Distinguish English from Latin text.
Artefact Removal: Eliminate HTR-generated artefacts.
Structural Recognition: Detect depositions’ components (e.g., front matter, headings, articles).
Insertion Logic: Handle missing text at marked positions.

1.2 Integration with RAG Pipeline

Components:

Retriever: BM25 or Sentence-BERT
LLM: mT5-small
Corpus: Curated historical texts or JSON/SQLite databases

Deployment Highlights:

Scalable: Easily runs on platforms like Hugging Face Spaces with lightweight GPU instances.
API-Friendly: Supports integrations via Hugging Face Inference API for retrieval-augmented tasks.

📚 2.0 Datasets

2.1 Published Datasets

ENGLISH HIGH COURT OF ADMIRALTY DEPOSITIONS

YIDDISH LETTERS

2.2 Unpublished Datasets

Dataset 1: 420K tokens, full diplomatic transcription (1627–1660)
Dataset 2: 4.5M tokens, semi-diplomatic transcription (1607–1660)
Dataset 3: 100K tokens, diplomatic transcription of Early Modern letters (1600–1685)

🌍 Explore MarineLives

Join us in unlocking Early Modern history by exploring our Hugging Face organization and datasets! You can follow us on BlueSky at @marinelives.bsky.social You can explore our content on our MarineLives wiki and on our ai-and-history-collaboratory GitHub repository.

Collections 1

spaces 6

Early Modern Legal Rag

Demonstration of research augmented retrieval

Mistral 7B V0.2 Summarizer

Chat bot and sumamrizer based on Mistral-7B-v0.2

MarineLives Legal Assistant

HTR correct Text summarization Text Question Answering

Yiddish English Translation

UI to translate Hebrew script Yiddish into English

Yiddish Transcription Correction

byt5-small-fine-tuned-yiddish-experiment-10 test UI

MT5 Small Experiment 4 Fine Tune Deployment

Deployment of Expriment 4: 3 Epoch fine-tuned mT5-small

models 10

MarineLives/byt5-finetuned-yiddish-experiment-11

Updated Dec 8, 2024

MarineLives/byt5-finetuned-yiddish-experiment-10

Updated Dec 8, 2024

MarineLives/byt5-finetuned-yiddish-experiment-9

Updated Dec 8, 2024

MarineLives/byt5-finetuned-yiddish-experiment-8

Updated Dec 8, 2024

MarineLives/byt5-finetuned-yiddish-experiment-7

Updated Dec 8, 2024

MarineLives/mBert-finetuned-yiddish-experiment-1

Updated Dec 8, 2024 • 1

MarineLives/mBert-finetuned-yiddish-experiment-3

Fill-Mask • Updated Dec 8, 2024 • 17

MarineLives/bert-base-multilingual-cased-finetuned-yiddish-experiment-1

Updated Dec 7, 2024

MarineLives/hca-1370-mt5-paragraph-embedding-rag

Updated Oct 25, 2024

MarineLives/mt5-small-raw-htr-clean-ver.1.0

Updated Oct 19, 2024 • 7

datasets 8

MarineLives/Gavin_yiddish_raw_HTR_and_groundtruth_paragraphs

Viewer • Updated Dec 8, 2024 • 98 • 5 • 1

MarineLives/Gavin_yiddish_raw_HT_and_groundtruth_lines

Updated Dec 8, 2024 • 9

MarineLives/raw-htr-handchecked-groundtruth-small

Viewer • Updated Oct 16, 2024 • 697 • 30

MarineLives/HCA-1358-HTR-Errors-In-Phrases

Viewer • Updated Sep 18, 2024 • 194 • 29

MarineLives/Line-Insertions

Viewer • Updated Sep 17, 2024 • 177 • 32 • 1

MarineLives/English-Expansions

Viewer • Updated Sep 16, 2024 • 175 • 39

MarineLives/Latin-Expansions

Viewer • Updated Sep 15, 2024 • 192 • 33

MarineLives/HCA-13-58-TEXT

Viewer • Updated Sep 14, 2024 • 65.8k • 41