LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations Paper • 2509.03405 • Published 4 days ago • 17
Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting Paper • 2508.21084 • Published 12 days ago • 1
KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications Paper • 2503.17247 • Published Mar 21 • 1
German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German Paper • 2508.17973 • Published 13 days ago • 1
Influence-driven Curriculum Learning for Pre-training on Limited Data Paper • 2508.15475 • Published 17 days ago • 1
Tokens with Meaning: A Hybrid Tokenization Approach for NLP Paper • 2508.14292 • Published 19 days ago • 1
GLiClass: Generalist Lightweight Model for Sequence Classification Tasks Paper • 2508.07662 • Published 28 days ago • 8
gpt-oss Collection Open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. • 2 items • Updated Aug 7 • 338
German BabyLM Collection Data that can be used for developing developmentally plausible language models in German. • 13 items • Updated May 28 • 2
Teuken-7B-v0.6 Collection OpenGPT-X Teuken 7B models trained on 6 trillion tokens. • 2 items • Updated Jul 28 • 4
Article Say hello to `hf`: a faster, friendlier Hugging Face CLI ✨ By Wauplin and 2 others • Jul 25 • 80
GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface Paper • 2507.18546 • Published Jul 24 • 20
Effective Multi-Task Learning for Biomedical Named Entity Recognition Paper • 2507.18542 • Published Jul 24 • 1
Checklists Are Better Than Reward Models For Aligning Language Models Paper • 2507.18624 • Published Jul 24 • 2
Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language Paper • 2507.16557 • Published Jul 22 • 2
GG-BBQ: German Gender Bias Benchmark for Question Answering Paper • 2507.16410 • Published Jul 22 • 2