BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP Paper • 2506.10896 • Published 1 day ago • 1
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training Paper • 2506.10952 • Published 1 day ago • 20
Institutional Books Collection A growing corpus of public domain books from library collections, seeded by Harvard Library. • 3 items • Updated 2 days ago • 1
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability Paper • 2506.08300 • Published 4 days ago • 6
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published 9 days ago • 36
Static Word Embeddings for Sentence Semantic Representation Paper • 2506.04624 • Published 9 days ago • 3
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training Paper • 2506.01732 • Published 12 days ago • 2
XToM: Exploring the Multilingual Theory of Mind for Large Language Models Paper • 2506.02461 • Published 11 days ago • 1
Article No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL By toslali-ibm and 5 others • 11 days ago • 49
EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian Paper • 2505.23297 • Published 16 days ago • 1
LLM in the Loop: Creating the PARADEHATE Dataset for Hate Speech Detoxification Paper • 2506.01484 • Published 12 days ago • 5
Novel Benchmark for NER in the Wastewater and Stormwater Domain Paper • 2506.01938 • Published 12 days ago • 1
Common Pile v0.1 Collection All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text • 4 items • Updated 8 days ago • 25
ModernGBERT: German-only 1B Encoder Model Trained from Scratch Paper • 2505.13136 • Published 26 days ago • 21
Understanding Gated Neurons in Transformers from Their Input-Output Functionality Paper • 2505.17936 • Published 22 days ago • 1
Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes Paper • 2505.14815 • Published 25 days ago • 1