@davanstrien on Hugging Face: "Would 1-2 sentence tl;dr summaries of datasets on the Hub be useful for you?…"

davanstrien

posted an update Mar 27, 2024

Would 1-2 sentence tl;dr summaries of datasets on the Hub be useful for you?

For example, for the togethercomputer/RedPajama-Data-1T dataset, would the following summary help give you a quick sense of its content?

> tl;dr: RedPajama is a fully open-source implementation of the LLaMA dataset, consisting of 1.2 trillion tokens from sources like Common Crawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange, primarily in English, and is structured with metadata for each text sample.

I've created a dataset with example summaries of the 500 most liked datasets on the Hub: davanstrien/dataset-tldr

Would these kinds of summaries be helpful?

merve

Mar 27, 2024 (edited Mar 27, 2024)

I think it would be nice if the tl;dr also covered what the data looks like, how it was curated, the license, what type of model can be trained on it, and so on. That would be very useful for me 🤩

davanstrien

Mar 27, 2024

That's a good point! It might be nice to combine the textual tl;dr description with some critical bits of metadata (where it exists).
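
A rough sketch of what that combination could look like, assuming the metadata arrives as a plain dictionary. The function name `format_tldr` and the field names (`license`, `language`, `task`) are hypothetical and chosen only for illustration; they are not taken from the Hub or from the dataset-tldr dataset. Fields that are missing or empty are simply skipped, which matches the "where it exists" caveat.

```python
from typing import Mapping, Optional


def format_tldr(summary: str, metadata: Optional[Mapping[str, object]] = None) -> str:
    """Append whichever metadata fields are present to a tl;dr summary."""
    # Hypothetical field names and display labels, for illustration only.
    labels = {"license": "License", "language": "Language", "task": "Task"}
    parts = [f"tl;dr: {summary.strip()}"]
    metadata = metadata or {}
    for key, label in labels.items():
        value = metadata.get(key)
        if value:  # skip fields that are missing or empty
            if isinstance(value, (list, tuple)):
                value = ", ".join(str(v) for v in value)
            parts.append(f"{label}: {value}")
    return " | ".join(parts)


# Made-up example values, not real dataset metadata.
example = format_tldr(
    "A toy corpus of short English texts.",
    {"license": "mit", "language": ["en"]},
)
print(example)
```

Keeping the summary first and the metadata as short labelled fields after it means the text still reads fine when no metadata is available.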
