Guilherme Penedo
guipenedo
guipenedo's activity
1 of 2 TODOs
#4 opened 4 days ago
by
meg

How to get more test samples of one problem?
2
#2 opened 25 days ago
by
lixiaonan
Discussion on comparison with previous work and citation?
1
#1 opened 28 days ago
by
JeremiahZ
Downloading the 350BT sample uses 990GB of disk space
4
#57 opened 3 months ago
by
ddh0

Create Ffcc
1
#58 opened 2 months ago
by
Ricky23184
New update returns a 500 server error using the datasets-server API
6
#18 opened 4 months ago
by
jonna32
Synthetic Data Generator
1
#5 opened 3 months ago
by
kishorekashyap
Cannot load with datasets
3
#4 opened 3 months ago
by
mbanon

A lot of load errors after new update
14
#19 opened 3 months ago
by
yzhangcs

Add "date" column to "default" subset
#20 opened 3 months ago
by
lhoestq

Simple exact deduplication removes 2/3 of data.
4
#49 opened 8 months ago
by
egor-pakhomov
Torrent?
3
3
#4 opened 12 months ago
by
emilss
Any plan to train models on larger subset of dataset?
1
#8 opened 12 months ago
by
mrfakename

Are copyrighted works included in this dataset?
1
4
#9 opened 12 months ago
by
umm-maybe

Reprocessing for a new language
2
14
#12 opened 12 months ago
by
pere

Training configs for data ablation study
1
2
#14 opened 12 months ago
by
jimmyhbx
tiny-fineweb
1
3
#19 opened 12 months ago
by
3thn

Unsafe files
1
#25 opened 11 months ago
by
alielfilali01

"Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20" using fineweb by Karpathy
5
#28 opened 11 months ago
by
clem

Regarding the newly updated indexes (written as deduplication issues)
5
#29 opened 11 months ago
by
kimcando
