BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP Paper • 2506.10896 • Published 1 day ago • 1 • 2
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability Paper • 2506.08300 • Published 4 days ago • 6 • 3
taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades Paper • 2506.05388 • Published 11 days ago • 2
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published 9 days ago • 36 • 1
LLM in the Loop: Creating the PARADEHATE Dataset for Hate Speech Detoxification Paper • 2506.01484 • Published 12 days ago • 5 • 3
Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs Paper • 2505.04519 • Published May 7 • 2 • 1
ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations Paper • 2505.02819 • Published May 5 • 24 • 4
Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation Paper • 2505.00022 • Published Apr 24 • 2 • 1