Guilherme Penedo

guipenedo

AI & ML interests

None yet

Recent Activity

new activity 5 days ago

HuggingFaceFW/fineweb:Update README.md

new activity 5 days ago

HuggingFaceFW/fineweb:Upload 4 files

authored a paper 8 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

guipenedo's activity

upvoted a paper 9 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published 10 days ago • 36

upvoted an article 2 months ago

Welcome Llama 4 Maverick & Scout on Hugging Face!

Article • and 6 others • Apr 5 • 145

upvoted an article 3 months ago

Open R1: Update #3

Article • and 9 others • Mar 11 • 293

upvoted 2 articles 4 months ago

Finding Moroccan Arabic (Darija) in Fineweb 2

Article • and 3 others • Dec 8, 2024 • 23

Open R1: Update #2

Article • and 6 others • Feb 10 • 214

upvoted a paper 4 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4 • 232

upvoted an article 4 months ago

Open-R1: Update #1

Article • and 7 others • Feb 2 • 305

upvoted a paper 5 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 64

upvoted an article 6 months ago

FineWeb2-C: Help Build Better Language Models in Your Language

Article • and 5 others • Dec 23, 2024 • 20

upvoted a collection 6 months ago

🥂 FineWeb2

Collection • 3 items • Updated Dec 8, 2024 • 15

upvoted a paper 7 months ago

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Paper • 2411.04905 • Published Nov 7, 2024 • 126

upvoted an article 7 months ago

Releasing the largest multilingual open pretraining dataset

Article • and 2 others • Nov 13, 2024 • 101

upvoted an article 9 months ago

🇨🇿 BenCzechMark - Can your LLM Understand Czech?

Article • and 12 others • Oct 1, 2024 • 21

upvoted a paper 12 months ago

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Paper • 2406.17557 • Published Jun 25, 2024 • 98

upvoted a paper about 1 year ago

What matters when building vision-language models?

Paper • 2405.02246 • Published May 3, 2024 • 104

upvoted 2 papers over 1 year ago

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Paper • 2312.00752 • Published Dec 1, 2023 • 143

The Falcon Series of Open Language Models

Paper • 2311.16867 • Published Nov 28, 2023 • 14

upvoted a paper about 2 years ago

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Paper • 2306.01116 • Published Jun 1, 2023 • 35