davanstrien 
posted an update Mar 27
Would 1-2 sentence tl;dr summaries of datasets on the Hub be useful for you?

For example, for the togethercomputer/RedPajama-Data-1T dataset, would the following summary help give you a quick sense of its content?

> tl;dr: RedPajama is a fully open-source implementation of the LLaMa dataset, consisting of 1.2 trillion tokens from sources like Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange, primarily in English, and is structured with metadata for each text sample.

I've created a dataset with example summaries of the 500 most liked datasets on the Hub: davanstrien/dataset-tldr

Would these kinds of summaries be helpful?

I think it would be nice if the tl;dr also covered what the data looks like, how it was curated, the license, what type of model it can be used to train, and so on. That would be very useful for me 🤩

That's a good point! It might be nice to combine the textual tl;dr description with some critical bits of metadata (where it exists).
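
To make the idea concrete, here is a rough sketch of how a tl;dr could be combined with metadata only where it exists. The function name `render_tldr` and the field names in `METADATA_FIELDS` are hypothetical, chosen for illustration; they are not a Hub schema or part of the davanstrien/dataset-tldr dataset.

```python
from typing import Mapping, Optional

# Hypothetical metadata keys, for illustration only (not a Hub schema).
METADATA_FIELDS = ("license", "language", "size")


def render_tldr(summary: str, metadata: Optional[Mapping[str, str]] = None) -> str:
    """Append whichever metadata fields are present to a one-line tl;dr."""
    metadata = metadata or {}
    # Keep only fields that exist and are non-empty, in a fixed order.
    present = [f"{key}: {metadata[key]}" for key in METADATA_FIELDS if metadata.get(key)]
    line = f"tl;dr: {summary.strip()}"
    if present:
        line += " [" + "; ".join(present) + "]"
    return line


# Values taken from the RedPajama summary above; no license is supplied,
# so that field is simply skipped.
example = render_tldr(
    "RedPajama is a fully open-source implementation of the LLaMa dataset.",
    {"language": "primarily English", "size": "1.2 trillion tokens"},
)
print(example)
```

Missing fields are left out rather than shown as "unknown", so the summary stays short for datasets with sparse metadata.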