Dataset Tools
community
AI & ML interests
Tools for creating and exploring datasets
Recent Activity
Dataset-Tools's activity
prithivMLmods posted an update 3 days ago
davanstrien posted an update 5 days ago
Post
1518
Introducing FineWeb-C, a community-built dataset for improving language models in ALL languages.
Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages.
318 annotators, 32K+ annotations, 12 languages - and growing!
data-is-better-together/fineweb-c
Post
1137
From instruction-following to creative storytelling, dive into 2024's most impactful AI datasets! These gems are shaping everything from scientific research to video understanding.
Check it out: huggingface/open-source-ai-year-in-review-2024
lhoestq authored a paper 6 days ago
prithivMLmods posted an update 6 days ago
Post
2062
Qwen2VL Models: Vision and Language Processing
FT: [ LaTeX OCR, Math Parsing, Text Analogy OCRTest ]
Demo: prithivMLmods/Qwen2-VL-2B. The demo includes the Qwen2VL 2B Base Model.
The space documents the content of the input image as standardized plain text. It includes adjustment tools with over 30 font styles, file formatting support for PDF and DOCX, text alignment, font size adjustments, and line spacing modifications.
PDFs are rendered using the ReportLab library.
Models:
+ prithivMLmods/Qwen2-VL-OCR-2B-Instruct
+ prithivMLmods/Qwen2-VL-Ocrtest-2B-Instruct
+ prithivMLmods/Qwen2-VL-Math-Prase-2B-Instruct
Sample Document:
+ https://drive.google.com/file/d/1Hfqqzq4Xc-3eTjbz-jcQY84V5E1YM71E/view?usp=sharing
Collection:
+ prithivMLmods/vision-language-models-67639f790e806e1f9799979f
.
.
.
@prithivMLmods
davidberenstein1957 posted an update 7 days ago
Post
1264
Tumble down the AI rabbit hole without any technical knowledge!
Explore AI models on the Hub with a simple, quick search.
Demo: davidberenstein1957/transformers-pipeline-playground
prithivMLmods posted an update 7 days ago
Post
3168
Here Before - Xmas
Models
+ [ Xmas 2D Illustration ] : strangerzonehf/Flux-Xmas-Illustration-LoRA
+ [ Xmas 3D Art ] : strangerzonehf/Flux-Xmas-3D-LoRA
+ [ Xmas Chocolate ] : strangerzonehf/Flux-Xmas-Chocolate-LoRA
+ [ Xmas Isometric Kit ] : strangerzonehf/Flux-Xmas-Isometric-Kit-LoRA
+ [ Xmas Realpix ] : strangerzonehf/Flux-Xmas-Realpix-LoRA
+ [ Xmas Anime ] : strangerzonehf/Flux-Anime-Xmas-LoRA
Collections
+ [ Xmas Art ] : strangerzonehf/christmas-pack-6758b199487adafaddb68f82
+ [ Stranger Zone Collection ] : prithivMLmods/stranger-zone-collections-org-6737118adcf2cb40d66d0c7e
Page
+ [ Stranger Zone ] : https://huggingface.co/strangerzonehf
.
.
.
@prithivMLmods
Post
1133
Want to share your AI models while protecting your work? Licenses are key!
Fascinating to see that nearly 60% of models on the Hub use Apache & MIT licenses.
Explore the viz here: huggingface/open-source-ai-year-in-review-2024
Post
1262
Did a fun experiment: What are the main themes emerging from the 100+ Nieman Journalism Lab predictions for 2025?
I used natural language processing to cluster and map them, which really helps spot patterns that weren't obvious when reading predictions one by one. So what will shape journalism next year? A lot of AI and US politics (surprise!), but there's also this horizontal axis that spans from industry strategies to deep reflections on how to talk to the public.
Click any dot to explore the original prediction. What themes surprise/interest you the most?
fdaudens/nieman_lab_2025_predictions_visualization
P.S.: I discovered that Nieman Lab's content is under a Creative Commons license!
nataliaElv posted an update 8 days ago
Post
1598
If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!
I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!
https://www.youtube.com/watch?v=_-ORB4WAVGU
davidberenstein1957 posted an update 9 days ago
Post
4106
Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: A simple step-by-step process, making dataset creation a non-technical breeze, allowing anyone to create datasets and models in minutes and without any code.
Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator
Post
647
The #NeurIPS2024 Class: Explore which research institutions are leading
huggingface/open-source-ai-year-in-review-2024
prithivMLmods posted an update 12 days ago
alielfilali01 posted an update 12 days ago
Post
3301
Unpopular opinion: Open Source takes courage to do!
Not everyone is brave enough to release what they have done (the way they've done it) to the wild to be judged!
It really requires a high level of "knowing wth you are doing"! It's kind of a superpower!
Cheers to the heroes here who see this!
davidberenstein1957 updated a collection 13 days ago
Post
1606
Made an HF Dataset editor à la Google Sheets here:
lhoestq/dataset-spreadsheets
With Dataset Spreadsheets:
- Edit datasets in the UI
- Share link with collaborators
- Use locally in DuckDB or Python
Available for the 100,000+ parquet datasets on HF :)
Post
1529
Are you at #NeurIPS2024? Check out our cool data visualizations about research papers in the Year in Review!
huggingface/open-source-ai-year-in-review-2024
nataliaElv posted an update 14 days ago
Post
1244
How do your annotations for FineWeb2 compare to your teammates'?
I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.
I did some analysis and I wasn't surprised to see that I'm being a bit harsher in my evaluations than my mates.
Do you want to see how your annotations compare to others?
- Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
- Enter the dataset that you've contributed to and your Hugging Face username.
How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket chat: HuggingFaceFW/discussion
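For anyone curious what such a comparison could look like in code, here is a rough sketch. It is not the code behind the Gradio space: the records, the annotator names, the 0-4 score scale, and the `harshness` helper are all made up for illustration. It only measures how far one annotator's scores sit from the average of their teammates on the texts they both rated.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (annotator, text_id, score). The 0-4 score scale is an
# assumption for this sketch, not the actual FineWeb-C label schema.
ANNOTATIONS = [
    ("me", "t1", 1), ("ana", "t1", 2), ("li", "t1", 3),
    ("me", "t2", 0), ("ana", "t2", 1),
    ("me", "t3", 3), ("li", "t3", 3),
    ("ana", "t4", 4), ("li", "t4", 2),
]


def harshness(annotations, user):
    """Average gap between the user's score and the teammates' mean score.

    Only texts rated by the user and at least one teammate are counted.
    A negative value means the user scores lower (harsher) than teammates.
    Returns None when there is no overlap to compare.
    """
    # Group scores per text: {text_id: {annotator: score}}
    by_text = defaultdict(dict)
    for annotator, text_id, score in annotations:
        by_text[text_id][annotator] = score

    gaps = []
    for scores in by_text.values():
        if user not in scores or len(scores) < 2:
            continue  # no overlap between the user and teammates on this text
        teammates = [s for a, s in scores.items() if a != user]
        gaps.append(scores[user] - mean(teammates))
    return mean(gaps) if gaps else None


if __name__ == "__main__":
    print(harshness(ANNOTATIONS, "me"))
```

With the toy data above, "me" comes out negative, i.e. harsher than the teammates on the shared texts. Real annotations would need to be exported from the dataset first, and categorical labels would need a mapping to numbers before this kind of averaging makes sense.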
davidberenstein1957 posted an update 16 days ago
Post
2047
Open Preference Dataset for Text-to-Image Generation by the 🤗 Community
Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. This dataset contains 10K text-to-image preference pairs across common image generation categories, while using different model families and varying prompt complexities.
https://huggingface.co/blog/image-preferences
alielfilali01 posted an update 17 days ago
Post
1476
Apparently I forgot to put this here!
Well, this is a bit late, but consider giving our recent blog a read if you are interested in evaluation.
You don't have to be into Arabic NLP in order to read it. The main contribution we are introducing is a new evaluation measure for NLG. We made the first application of this measure on Arabic for now, and we will be working with colleagues from the community to expand it to other languages.
Blog:
Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
https://huggingface.co/blog/leaderboard-3c3h-aragen
Space:
inceptionai/AraGen-Leaderboard
Give it a read and let me know your thoughts