Institutional Books Topic Classifier
This model was trained as part of the analysis and post-processing work performed in preparation for the release of the Institutional Books 1.0 dataset by the Institutional Data Initiative.
We used this text classifier to assign a topic, derived from the first level of the Library of Congress' Classification Outline, to individual volumes.
Complete experimental setup and results are available in our technical report (Section 4.5).
Code: https://github.com/instdin/institutional-books-1-pipeline
Base model
google-bert/bert-base-multilingual-uncased
Input format
Book metadata, formatted as follows:
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note
All of the fields listed in this example are optional.
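As a minimal sketch of how such an input string could be assembled, the helper below renders a metadata record as `Label: value` lines and leaves out any field that is missing. The function name `format_metadata` and the dictionary keys are hypothetical, and dropping a missing field's line entirely (rather than keeping an empty label) is an assumption, not something the training pipeline is confirmed to do.

```python
# Field keys (hypothetical) paired with the labels used in the example above.
FIELDS = (
    ("title", "Title"),
    ("author", "Author"),
    ("year", "Year"),
    ("language", "Language"),
    ("general_note", "General Note"),
)


def format_metadata(record: dict) -> str:
    """Render a metadata record as 'Label: value' lines, skipping empty fields."""
    lines = []
    for key, label in FIELDS:
        value = record.get(key)
        if value is None or str(value).strip() == "":
            # Every field is optional; this sketch assumes absent fields are omitted.
            continue
        lines.append(f"{label}: {str(value).strip()}")
    return "\n".join(lines)


# Only three of the five fields are provided here.
print(format_metadata({"author": "Hymers, J.", "year": 1848, "language": "English"}))
```

The resulting string can be passed to the classifier as-is.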
Categories
- GENERAL WORKS
- PHILOSOPHY. PSYCHOLOGY. RELIGION
- AUXILIARY SCIENCES OF HISTORY
- WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.
- HISTORY OF THE AMERICAS
- GEOGRAPHY. ANTHROPOLOGY. RECREATION
- SOCIAL SCIENCES
- POLITICAL SCIENCE
- LAW
- EDUCATION
- MUSIC AND BOOKS ON MUSIC
- FINE ARTS
- LANGUAGE AND LITERATURE
- SCIENCE
- MEDICINE
- AGRICULTURE
- TECHNOLOGY
- MILITARY SCIENCE
- NAVAL SCIENCE
- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
Training data
- Train split: 80,830 samples
- Test split: 5,000 samples
- An additional set of 1,000 samples was set aside for benchmarking purposes
Validation Metrics
| Metric | Value |
|---|---|
| loss | 0.157407745718956 |
| f1_macro | 0.9613886456444749 |
| f1_micro | 0.9694 |
| f1_weighted | 0.9693030681223207 |
| precision_macro | 0.9679892485977634 |
| precision_micro | 0.9694 |
| precision_weighted | 0.9695713537396466 |
| recall_macro | 0.9560667596679707 |
| recall_micro | 0.9694 |
| recall_weighted | 0.9694 |
| accuracy | 0.9694 |
Post-training benchmark accuracy: 97.8% (978/1000)
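In the table above, the micro-averaged precision, recall and F1 all equal the accuracy (0.9694). This is expected for single-label multiclass classification, where every wrong prediction counts as one false positive and one false negative. Macro averages weight each category equally, so they can differ. The sketch below illustrates this on made-up toy labels. It is not the evaluation code used for this model, and `f1_summary` is a hypothetical helper.

```python
from collections import Counter


def f1_summary(y_true, y_pred):
    """Compute macro F1, micro F1 and accuracy for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denom = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denom if denom else 0.0

    macro = sum(f1(tp[label], fp[label], fn[label]) for label in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(y_true)
    return {"f1_macro": macro, "f1_micro": micro, "accuracy": accuracy}


# Toy data (invented for illustration): one MEDICINE volume is mislabeled as SCIENCE.
y_true = ["SCIENCE", "SCIENCE", "SCIENCE", "SCIENCE", "LAW", "MEDICINE"]
y_pred = ["SCIENCE", "SCIENCE", "SCIENCE", "SCIENCE", "LAW", "SCIENCE"]
print(f1_summary(y_true, y_pred))
```

On this toy data the micro F1 matches the accuracy, while the macro F1 is lower because the single missed MEDICINE volume pulls that category's F1 to zero.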
Quickstart
from transformers import pipeline
to_label = """
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note
"""
pipe = pipeline("text-classification", model="instdin/institutional-books-topic-classifier-bert")
result = pipe(to_label.strip())
print(result[0]) # {'label': 'SCIENCE', 'score': 0.9996894598007202}
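To label several volumes at once, a text-classification pipeline can also be called with a list of strings, in which case it returns one `{'label', 'score'}` dictionary per input. The sketch below wraps that call and flags low-confidence predictions for review. The helper `label_volumes` and the `min_score` threshold are hypothetical and not part of the model or of the Institutional Books pipeline.

```python
def label_volumes(classify, texts, min_score=0.5):
    """Classify formatted metadata strings and flag low-confidence predictions.

    `classify` is any callable that maps a list of strings to a list of
    {'label': ..., 'score': ...} dicts, such as the `pipe` object created above.
    """
    texts = list(texts)
    predictions = classify(texts)
    results = []
    for text, pred in zip(texts, predictions):
        results.append({
            "text": text,
            "label": pred["label"],
            "score": pred["score"],
            # min_score is an arbitrary illustrative threshold, not a tuned value.
            "confident": pred["score"] >= min_score,
        })
    return results


# Usage with the pipeline from the Quickstart (downloads the model on first use):
# results = label_volumes(pipe, ["Title: ...\nYear: 1848", "Author: Hymers, J."])
```

Taking the classifier as an argument keeps the helper independent of how the pipeline was constructed.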
About IDI
The Institutional Data Initiative at Harvard Law School Library works with knowledge institutions, from libraries and museums to cultural groups and government agencies, to refine and publish their collections as data. Reach out to collaborate on your collections.
Cite
@misc{cargnelutti2025institutionalbooks10242b,
title={Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability},
author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
year={2025},
eprint={2506.08300},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.08300},
}