---
base_model: google-bert/bert-base-multilingual-uncased
library_name: transformers
license: apache-2.0
tags:
- autotrain
- text-classification
---

# 📚 Institutional Books Topic Classifier

This model was trained as part of the analysis and post-processing work performed in preparation for the release of the [Institutional Books 1.0 dataset](https://huggingface.co/collections/instdin/institutional-books-68366258bfb38364238477cf) by the Institutional Data Initiative.

We used this text classifier to assign a topic, derived from the first level of the [Library of Congress' Classification Outline](https://www.loc.gov/catdir/cpso/lcco/), to individual volumes.

Complete experimental setup and results are available in our [technical report](https://arxiv.org/abs/2506.08300) (Section 4.5).

**Code:** https://github.com/instdin/institutional-books-1-pipeline

## Base model

[google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)

## Input format

Book metadata, formatted as follows:

```
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note
```

All of the fields listed in this example are optional.

## Categories

- GENERAL WORKS
- PHILOSOPHY. PSYCHOLOGY. RELIGION
- AUXILIARY SCIENCES OF HISTORY
- WORLD HISTORY AND HISTORY OF EUROPE, ASIA, AFRICA, AUSTRALIA, NEW ZEALAND, ETC.
- HISTORY OF THE AMERICAS
- GEOGRAPHY. ANTHROPOLOGY. RECREATION
- SOCIAL SCIENCES
- POLITICAL SCIENCE
- LAW
- EDUCATION
- MUSIC AND BOOKS ON MUSIC
- FINE ARTS
- LANGUAGE AND LITERATURE
- SCIENCE
- MEDICINE
- AGRICULTURE
- TECHNOLOGY
- MILITARY SCIENCE
- NAVAL SCIENCE
- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)

## Training data

- Train split: 80,830 samples
- Test split: 5,000 samples
- An additional set of 1,000 samples was set aside for benchmarking purposes

## Validation Metrics

| Metric | Value |
| --- | --- |
| loss | 0.157407745718956 |
| f1_macro | 0.9613886456444749 |
| f1_micro | 0.9694 |
| f1_weighted | 0.9693030681223207 |
| precision_macro | 0.9679892485977634 |
| precision_micro | 0.9694 |
| precision_weighted | 0.9695713537396466 |
| recall_macro | 0.9560667596679707 |
| recall_micro | 0.9694 |
| recall_weighted | 0.9694 |
| accuracy | 0.9694 |

Post-training benchmark accuracy: 97.8% (978/1000)

## Quickstart

```python
from transformers import pipeline

# Book metadata in the format described under "Input format".
to_label = """
Title: A treatise on analytical geometry of tree dimensions, containing the theory of curve surfaces and of curves of double curvature.
Author: Hymers, J.
Year: 1848
Language: English
General Note: Example of a general note
"""

pipe = pipeline("text-classification", model="instdin/institutional-books-topic-classifier-bert")

# Strip the leading and trailing newlines before classifying.
result = pipe(to_label.strip())
print(result[0])
# {'label': 'SCIENCE', 'score': 0.9996894598007202}
```

## About IDI

The Institutional Data Initiative at Harvard Law School Library works with knowledge institutions, from libraries and museums to cultural groups and government agencies, to refine and publish their collections as data. [Reach out to collaborate on your collections](https://institutionaldatainitiative.org/#get-involved).

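
## Building the input string

Since every field in the input format is optional, it can be convenient to assemble the input string from whatever metadata is available. The following is a minimal sketch, not the method used in the training pipeline: the helper name `format_book_metadata` is hypothetical, and it assumes one `Label: value` pair per line, in the order shown under "Input format", with missing fields left out entirely.

```python
from typing import Optional, Union

# Field labels, in the order shown in the "Input format" example.
FIELD_ORDER = ("Title", "Author", "Year", "Language", "General Note")


def format_book_metadata(
    title: Optional[str] = None,
    author: Optional[str] = None,
    year: Optional[Union[str, int]] = None,
    language: Optional[str] = None,
    general_note: Optional[str] = None,
) -> str:
    """Build a classifier input string from optional metadata fields.

    Assumption: fields that are None or blank are skipped entirely,
    rather than emitted with an empty value.
    """
    values = (title, author, year, language, general_note)
    lines = []
    for label, value in zip(FIELD_ORDER, values):
        if value is None:
            continue
        text = str(value).strip()
        if text:
            lines.append(f"{label}: {text}")
    return "\n".join(lines)


example = format_book_metadata(
    title=(
        "A treatise on analytical geometry of tree dimensions, containing "
        "the theory of curve surfaces and of curves of double curvature."
    ),
    author="Hymers, J.",
    year=1848,
    language="English",
    general_note="Example of a general note",
)
print(example)
```

The resulting string can be passed to the `pipe` object from the Quickstart, for example `pipe(example)`.
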
## Cite

```bibtex
@misc{cargnelutti2025institutionalbooks10242b,
      title={Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability},
      author={Matteo Cargnelutti and Catherine Brobston and John Hess and Jack Cushman and Kristi Mukk and Aristana Scourtas and Kyle Courtney and Greg Leppert and Amanda Watson and Martha Whitehead and Jonathan Zittrain},
      year={2025},
      eprint={2506.08300},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08300},
}
```