🪚 Cutting Parquet data

Quentin Lhoest PRO

lhoestq

huggingface

AI & ML interests

Maintainer of 🤗 Dataset Hub ecosystem: NLP, Multimodal data loading, viewing, processing and sharing

Recent Activity

new activity about 2 hours ago

5CD-AI/Viet-Handwriting-OCR-v2:Dataset Viewer issue

new activity about 2 hours ago

MTEB-BR/wikipedia-categories:Dataset Viewer issue

new activity about 2 hours ago

BW/ROS_Hackathon_2026_Toys_Dataset:Dataset Viewer issue

upvoted a collection 13 days ago

ConvFill: Inference-Time Knowledge Transfer

Model weights, dataset, and paper for https://arxiv.org/abs/2511.07397. • 16 items • Updated 21 days ago • 8

upvoted 2 collections 20 days ago

OpenThinker-Agent-Complete

OpenThinkerAgent-32B SFT data-scaling ladder (models + matching datasets, 316->100K) plus TaskTrove & AgentTrove sources. • 15 items • Updated Jun 10 • 5

OpenThinker-Agent2

OpenThinker-Agent2: agentic SFT/RL datasets and 8B/32B models (cold-start SFT, RL, and the OpenThinkerAgent-32B release). • 11 items • Updated Jun 11 • 9

upvoted a collection 27 days ago

Training Datasets

All data and models from our ArXivMath-Training and BrokenArXiv-training pipelines. • 8 items • Updated 27 days ago • 1

upvoted a collection about 1 month ago

Kimi K2.5

Moonshot's most powerful model • 4 items • Updated Jun 12 • 76

upvoted a changelog about 1 month ago

Hugging Face Changelog

Service Accounts for Enterprise organizations

Jun 12 • 150

upvoted 3 articles about 1 month ago

Article

Introducing Serge: GitHub-Native AI Code Review

huggingface • Jun 12 • 14

Article

Arcee Becomes the First Major American AI Lab to Replace AWS S3 with Hugging Face Private Storage, in a Multi-Million Dollar Commercial Partnership

clem • Jun 9 • 35

Article

Designing the hf CLI as an agent-optimized way to work with the Hub

celinah, Wauplin • Jun 4 • 59

upvoted a collection about 1 month ago

Cosmos3

Omnimodal World Models for Physical AI • 21 items • Updated 5 days ago • 146

upvoted 2 articles about 1 month ago

Article

ClawHub Security Signals: Large Corpus Multi-Scanner Dataset for Agent Skill Security Research

OpenClaw • Jun 1 • 15

Article

Harness, Scaffold, and the AI Agent Terms Worth Getting Right

sergiopaniego, ariG23498 • May 25 • 132

upvoted an article about 2 months ago

Article

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

ariG23498, sayakpaul, sergiopaniego, ror, pcuenq • May 29 • 143

upvoted a collection about 2 months ago

UltraData

Ultra Scale, Ultra Quality, Ultra Coverage • 14 items • Updated 2 days ago • 98

upvoted an article about 2 months ago

Article

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

aminediroHF, qgallouedec, kashif, lewtun, edbeeching, albertvillanova, lvwerra, sergiopaniego • May 27 • 42

upvoted a collection about 2 months ago

Toto-2.0

5 items • Updated May 11 • 36

upvoted a paper about 2 months ago

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Paper • 2605.13301 • Published May 13 • 166

upvoted 3 articles 2 months ago

Article

Two Years of Local AI on a Laptop: When Open Models Outpaced Moore's Law

mishig • May 11 • 24

Article

Hugging Face on JFrog Artifactory: An Enterprise Guide (and What Changes in June 2026)

jeffboudier • May 8 • 5

Article

EMO: Pretraining mixture of experts for emergent modularity

allenai • May 8 • 38