Maurice Weber
mauriceweber
AI & ML interests
None yet
Recent Activity
new activity about 1 month ago
togethercomputer/RedPajama-Data-V2: Add paper citation
authored a paper about 1 month ago
RedPajama: an Open Dataset for Training Large Language Models
mauriceweber's activity
Add paper citation
1
#30 opened about 1 month ago by davanstrien
RPV2 ccnet preprocessing
1
#29 opened 4 months ago by bpwl0121
sample split details
3
#4 opened about 1 year ago by sujantkumarkv
How can I download the sample-10B fastest?
1
#28 opened 6 months ago by zgxiao
defunct book subset
4
#28 opened about 1 year ago by polinaeterna
How much disk space would the whole HF dataset take?
1
#27 opened 9 months ago by protossw512
rpv2-subsamples
1
#26 opened 12 months ago by mauriceweber
What should the doc_id in duplicates contain?
3
#24 opened 12 months ago by newbietuan
Deduplication steps
23
#15 opened about 1 year ago by ilyayudkovich
Here's a download script parallelized using Spark
1
#22 opened about 1 year ago by srowen
what is the meaning of snapshots in redpajama-data-v2?
2
#21 opened about 1 year ago by choidonghun
How to join documents and quality signals when downloading directly
3
#19 opened about 1 year ago by tgshdyfuhuf
Missing duplicates parquet files
5
#18 opened about 1 year ago by bebensee
Script to download all files of 1B sample data locally
2
#13 opened about 1 year ago by ivanzhouyq
What is the total size of the entirety of this dataset in TB?
1
#10 opened about 1 year ago by Bayaz
What's the concept of partitions?
2
#5 opened about 1 year ago by SwatCat
quality_signals, minhash and duplicates missing
2
#3 opened about 1 year ago by sheshanshag
Request to add retries into RedPajama-Data-V2.py script
1
#16 opened about 1 year ago by yura38
How to obtain duplicates from minhash?
1
#8 opened about 1 year ago by cq
Obtaining Filtered Samples
4
#12 opened about 1 year ago by ssingh22