📚 Institutional Books Topic Classifier

This model was trained as part of the analysis and post-processing work performed in preparation for the release of the Institutional Books 1.0 dataset by the Institutional Data Initiative.

We used this text classifier to assign each volume a topic derived from the first level of the Library of Congress Classification Outline.

Complete experimental setup and results are available in our technical report (Section 4.5).

Code: https://github.com/instdin/institutional-books-1-pipeline

Base model

google-bert/bert-base-multilingual-uncased

Input format

Book metadata, formatted as follows:

Title: A treatise on analytical geometry of three dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note

All of the fields listed in this example are optional.
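As a minimal sketch of how such an input could be assembled, the helper below joins whichever fields are available into the "Field: value" layout shown above. The function name `format_metadata` is hypothetical, and dropping missing fields entirely (rather than emitting empty "Field:" lines) is an assumption: the model card only states that the fields are optional.

```python
from typing import Optional

# Field labels, in the order used by the example input above.
FIELD_ORDER = ("Title", "Author", "Year", "Language", "General Note")


def format_metadata(
    title: Optional[str] = None,
    author: Optional[str] = None,
    year: Optional[object] = None,
    language: Optional[str] = None,
    general_note: Optional[str] = None,
) -> str:
    """Build the "Field: value" block, skipping missing or blank fields.

    Assumption: omitted fields are dropped rather than left empty.
    """
    values = (title, author, year, language, general_note)
    lines = []
    for label, value in zip(FIELD_ORDER, values):
        if value is None:
            continue
        text = str(value).strip()
        if text:
            lines.append(f"{label}: {text}")
    return "\n".join(lines)


# Only three of the five fields are known for this record.
example = format_metadata(
    title="A treatise on analytical geometry of three dimensions",
    author="Hymers, J.",
    year=1848,
)
print(example)
```

The resulting string can be passed to the classifier as-is.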

Categories

  • GENERAL WORKS
  • PHILOSOPHY. PSYCHOLOGY. RELIGION
  • AUXILIARY SCIENCES OF HISTORY
  • WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.
  • HISTORY OF THE AMERICAS
  • GEOGRAPHY. ANTHROPOLOGY. RECREATION
  • SOCIAL SCIENCES
  • POLITICAL SCIENCE
  • LAW
  • EDUCATION
  • MUSIC AND BOOKS ON MUSIC
  • FINE ARTS
  • LANGUAGE AND LITERATURE
  • SCIENCE
  • MEDICINE
  • AGRICULTURE
  • TECHNOLOGY
  • MILITARY SCIENCE
  • NAVAL SCIENCE
  • BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
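For orientation, the categories above correspond to the top-level class letters of the Library of Congress Classification, with classes E and F sharing a single label. The sketch below maps the first letter of a call number to one of these labels. It assumes the label strings match the list above exactly, and it is not necessarily how the training labels were produced; `category_for_call_number` is a hypothetical helper.

```python
from typing import Optional

# Top-level LCC class letter -> category label from the list above.
# E and F both cover the history of the Americas.
LCC_FIRST_LEVEL = {
    "A": "GENERAL WORKS",
    "B": "PHILOSOPHY. PSYCHOLOGY. RELIGION",
    "C": "AUXILIARY SCIENCES OF HISTORY",
    "D": "WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.",
    "E": "HISTORY OF THE AMERICAS",
    "F": "HISTORY OF THE AMERICAS",
    "G": "GEOGRAPHY. ANTHROPOLOGY. RECREATION",
    "H": "SOCIAL SCIENCES",
    "J": "POLITICAL SCIENCE",
    "K": "LAW",
    "L": "EDUCATION",
    "M": "MUSIC AND BOOKS ON MUSIC",
    "N": "FINE ARTS",
    "P": "LANGUAGE AND LITERATURE",
    "Q": "SCIENCE",
    "R": "MEDICINE",
    "S": "AGRICULTURE",
    "T": "TECHNOLOGY",
    "U": "MILITARY SCIENCE",
    "V": "NAVAL SCIENCE",
    "Z": "BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)",
}


def category_for_call_number(call_number: str) -> Optional[str]:
    """Return the category for an LCC call number, or None if unrecognized."""
    stripped = call_number.strip()
    if not stripped:
        return None
    # Only the first letter determines the top-level class.
    return LCC_FIRST_LEVEL.get(stripped[0].upper())


print(category_for_call_number("QA555 .H9"))
```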

Training data

  • Train split: 80,830 samples
  • Test split: 5,000 samples
  • An additional set of 1,000 samples was set aside for benchmarking purposes

Validation Metrics

Metric              Value
loss                0.157407745718956
f1_macro            0.9613886456444749
f1_micro            0.9694
f1_weighted         0.9693030681223207
precision_macro     0.9679892485977634
precision_micro     0.9694
precision_weighted  0.9695713537396466
recall_macro        0.9560667596679707
recall_micro        0.9694
recall_weighted     0.9694
accuracy            0.9694

Post-training benchmark accuracy: 97.8% (978/1000)
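The micro-averaged precision, recall and F1 values above all equal the accuracy (0.9694). This is expected for single-label classification, where every misclassified sample counts as one false positive and one false negative. Macro averages weight each category equally, so they shift when smaller categories are predicted better or worse than larger ones. The following is an illustrative sketch on invented toy labels, not the evaluation code used for this model; `micro_macro_f1` is a hypothetical helper.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (accuracy, micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            # One error is a false positive for the predicted class
            # and a false negative for the true class.
            fp[pred] += 1
            fn[truth] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return accuracy, micro, macro


# Toy labels, invented for illustration only.
y_true = ["SCIENCE"] * 8 + ["LAW"] * 2
y_pred = ["SCIENCE"] * 8 + ["SCIENCE", "LAW"]
accuracy, micro, macro = micro_macro_f1(y_true, y_pred)
print(accuracy, micro, macro)
```

In this toy case the single error falls on the smaller class, so the macro F1 comes out lower than the micro F1, while the micro F1 matches the accuracy.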

Quickstart

from transformers import pipeline

# Metadata block to classify; every field is optional.
to_label = """
Title: A treatise on analytical geometry of three dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note
"""

# Load the fine-tuned classifier from the Hugging Face Hub.
pipe = pipeline("text-classification", model="instdin/institutional-books-topic-classifier-bert")

# Strip the leading and trailing newlines added by the triple-quoted string.
result = pipe(to_label.strip())
print(result[0]) # {'label': 'SCIENCE', 'score': 0.9996894598007202}
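When labeling many volumes, it may help to classify in batches and to flag low-confidence predictions for review. The sketch below is one possible wrapper, not part of the published pipeline. `classify_batch` and the `min_score` threshold of 0.5 are hypothetical. It assumes `pipe` behaves like the text-classification pipeline above, returning one `{'label', 'score'}` dict per input string.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def classify_batch(
    pipe: Callable,
    texts: Iterable[str],
    min_score: float = 0.5,  # hypothetical threshold, tune for your data
) -> List[Tuple[Optional[str], float]]:
    """Classify several metadata blocks and return (label, score) pairs.

    Predictions scoring below `min_score` get a label of None so they
    can be routed to manual review.
    """
    cleaned = [text.strip() for text in texts]
    if not cleaned:
        return []
    # truncation=True is forwarded to the tokenizer, so inputs longer
    # than the model's maximum length are cut instead of raising an error.
    predictions = pipe(cleaned, truncation=True)
    results = []
    for prediction in predictions:
        score = float(prediction["score"])
        label = prediction["label"] if score >= min_score else None
        results.append((label, score))
    return results
```

With the `pipe` object created in the quickstart, a call would look like `classify_batch(pipe, [to_label, other_record])`, where `other_record` stands for any second metadata block.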

About IDI

The Institutional Data Initiative at Harvard Law School Library works with knowledge institutions, from libraries and museums to cultural groups and government agencies, to refine and publish their collections as data. Reach out to collaborate on your collections.

Cite

@misc{cargnelutti2025institutionalbooks10242b,
      title={Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability}, 
      author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
      year={2025},
      eprint={2506.08300},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08300}, 
}
Model size: 167M params (Safetensors, tensor type F32)