Dataset Tools

community

AI & ML interests

Tools for creating and exploring datasets

Recent Activity

Dataset-Tools's activity

prithivMLmods
posted an update about 8 hours ago
Dropping the domain-specific downstream image classification content moderation models, including the anime image type classification, GeoSceneNet, indoor-outdoor scene classification, and black-and-white vs. colored image classification models, along with the datasets. 🔥

╰┈➤Models :
+ GeoSceneNet : prithivMLmods/Multilabel-GeoSceneNet
+ IndoorOutdoorNet : prithivMLmods/IndoorOutdoorNet
+ B&W vs Colored : prithivMLmods/BnW-vs-Colored-Detection
+ Anime Image Type : prithivMLmods/Anime-Classification-v1.0
+ Multilabel Portrait : prithivMLmods/Multilabel-Portrait-SigLIP2

╰┈➤Datasets :
- GeoSceneNet : prithivMLmods/Multilabel-GeoSceneNet-16K
- IndoorOutdoorNet : prithivMLmods/IndoorOutdoorNet-20K
- BnW vs Colored : prithivMLmods/BnW-vs-Colored-10K
- Multilabel Portrait : prithivMLmods/Multilabel-Portrait-18K

╰┈➤Collections :
> Multilabel Image Classification Datasets : prithivMLmods/multilabel-image-classification-datasets-6809aa64637f45d4c47fa6ca
> Model Collection : prithivMLmods/siglip2-content-filters-models-v2-68053a958c42ef17a3a3f4d1

Note: The anime scene type dataset is not mentioned in the list because it is private and only accessible to members of the DeepGHS organization.

For raw ZIP files or more information about the datasets, visit: https://www.kaggle.com/prithivsakthiur/datasets
fdaudens
posted an update about 17 hours ago
davanstrien
posted an update 1 day ago
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)
prithivMLmods
posted an update 7 days ago
Dropping an entire collection of Style Intermixing Adapters on StrangerZone HF, including Realism, Anime, Sketch, Texture-Rich 3D Experimentals, Automotive Concept Images, and LoRA models based on Flux.1, SD 3.5 Turbo/Large, Stable Diffusion XL 🎨

╰┈➤Collection :
➜ sketch : strangerzonehf/sketch-fav-675ba869c7ceaec7e652ee1c
➜ sketch2 : strangerzonehf/q-series-sketch-678e3503bf3a661758429717
➜ automotive : strangerzonehf/automotive-3d-675bb31a491d8c264d45d843
➜ texture 3d : strangerzonehf/flux-3dxl-engine-674833c14a001d5b1fdb5139
➜ super 3d : strangerzonehf/super-3d-engine-6743231d69f496df97addd2b
➜ style mix : strangerzonehf/mixer-engine-673582c9c5939d8aa5bf9533
➜ realism : strangerzonehf/realism-engine-67343495b6daf0fbdb904cc1

╰┈➤The Entire Collection :
➜ flux.1 : prithivMLmods/flux-lora-collections-66dd5908be2206cfaa8519be
➜ flux-ultimate-lora-collection : strangerzonehf/Flux-Ultimate-LoRA-Collection
➜ sd 3.5 large / turbo : prithivMLmods/sd-35-large-lora-671b39d7bc2e7f71a446b163
➜ sdxl : prithivMLmods/sdxl-dev-models-667803a6d5ac75b59110e527

╰┈➤Pages :
➜ page 1: strangerzonehf
➜ page 2: @prithivMLmods
➜ demo : prithivMLmods/FLUX-LoRA-DLC

🤗
fdaudens
posted an update 8 days ago
Just tested something this morning that feels kind of game-changing for how we publish, discover, and consume news with AI: connecting Claude directly to the New York Times through MCP.

Picture this: You ask Claude about a topic, and it instantly pulls verified and trusted NYT content - no more guessing if the info is accurate.

The cool part? Publishers stay in control of what they share via API, and users get fast, reliable access through the AI tools they already use. Instead of scraping random stuff off the web, we get a future where publishers actively shape how their journalism shows up in AI.

It's still a bit technical to set up right now, but this could get super simple soon - like installing apps on your phone, but for your chatbot. And you keep the brand connection, too.

Not saying it solves everything, but it's definitely a new way to distribute content - and maybe even find some fresh value in the middle of this whole news + AI shakeup. Early movers will have a head start.

Curious what folks think - could MCPs be a real opportunity for journalism?
prithivMLmods
posted an update 8 days ago
Try out the demo for Multimodal OCR featuring the implementation of models including RolmOCR and Qwen2VL OCR. The use case showcases image-text-to-text conversion and video understanding support for the RolmOCR model! 🚀

🤗Multimodal OCR Space : prithivMLmods/Multimodal-OCR

📦The models implemented in this Space are:
+ Qwen2VL OCR : prithivMLmods/Qwen2-VL-OCR-2B-Instruct [ or ]
+ Qwen2VL OCR2 : prithivMLmods/Qwen2-VL-OCR2-2B-Instruct
+ RolmOCR : reducto/RolmOCR

Qwen2VL OCR supports only image-text-to-text in the space.
fdaudens
posted an update 13 days ago
Want AI that truly understands your country's culture? Public institutions are sitting on the next AI revolution - and here's the practical guide to unlock it.

I've had fascinating conversations recently about sovereign AI, with people trying to solve this recurring question: "How do we build AI that truly understands our culture?"

This guide by @evijit and @yjernite brings lots of insights about this question. It's not just about throwing data at models. It's about partnering cultural expertise with tech infrastructure in ways we're just starting to figure out.

An example? The National Library of Norway already has 150+ AI models on Hugging Face. They're not just digitizing books - they're building AI that thinks in Norwegian, understands Norwegian values, and serves Norwegian citizens.

This is sovereign AI in practice: technology that understands your culture, values, and languages.

Especially loved the practical examples on how to do this:
- Real examples from museums, libraries, and government agencies
- How to convert complex documents (PDFs, PowerPoints) into ML-ready formats
- Code templates for processing public data
- Technical recipes for sharing datasets on open platforms

The stakes? Citizens' ability to leverage their collective digital intelligence.

The technology is ready. The infrastructure exists. The guide shows exactly how to use it. What's needed is your cultural expertise to shape these tools.

Check it out: https://huggingface.co/blog/evijit/public-org-data-ai

P.S.: Building cool projects in a public institution? Share them in the comments for others to learn from!
fdaudens
posted an update 14 days ago
Do chatbots lie about Céline Dion? We now have answers, not speculation.

Ai2 just released OLMoTrace and it's a game-changer for transparency. You can literally see where an AI's responses come from in its training data - in real time.

The demo shows results about Céline. So I tried it out myself! Watch what happens in the video.

For journalists, researchers studying hallucinations and anyone who needs to trust their AI, this is like getting X-ray vision into AI systems. When the model made claims, I could instantly verify them against original sources. When it hallucinated, I could see why.

You can finally 1) understand how LLMs actually work and 2) verify if what they're saying is true. No more blind trust.

This pushes the open data movement to the next level.

👉 Blog post: https://allenai.org/blog/olmotrace
👉 Paper: https://www.datocms-assets.com/64837/1743890415-olmotrace.pdf

P.S.: A word of caution: never use a chatbot as a knowledge base. It's not Google. Better use it with a connection to the internet.
fdaudens
posted an update 15 days ago
🎨 Designers, meet OmniSVG! This new model helps you create professional vector graphics from text/images, generate editable SVGs from icons to detailed characters, convert rasters to vectors, maintain style consistency with references, and integrate into your workflow.

@OmniSVG
davanstrien
posted an update 15 days ago
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.
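
The filter-then-generate workflow described in this post can be sketched roughly as below. This is a minimal illustration, not the author's pipeline: `cheap_reasoning_score` is a hypothetical keyword heuristic standing in for a small classifier that returns a 0-4 score, `expensive_step` stands in for an LLM call, and the threshold of 3 is an arbitrary assumption.

```python
from typing import Callable, Iterable, List, Tuple


def cheap_reasoning_score(text: str) -> int:
    """Hypothetical stand-in for a small classifier; returns a score from 0 to 4."""
    cues = ("because", "therefore", "however", "implies", "why")
    lowered = text.lower()
    hits = sum(lowered.count(cue) for cue in cues)
    return min(hits, 4)


def two_stage_filter(
    texts: Iterable[str],
    scorer: Callable[[str], int],
    expensive_step: Callable[[str], str],
    threshold: int = 3,
) -> List[Tuple[str, str]]:
    """Run the cheap scorer on every text; call the expensive step only on high scorers."""
    results = []
    for text in texts:
        if scorer(text) >= threshold:
            results.append((text, expensive_step(text)))
    return results


if __name__ == "__main__":
    docs = [
        "The sky is blue.",
        "Prices rose because supply fell; therefore demand shifted. However, why did wages lag?",
    ]
    # Placeholder for an LLM call (assumption): here it only tags the text.
    kept = two_stage_filter(docs, cheap_reasoning_score, lambda t: "LLM output for: " + t[:30])
    print(len(kept), "of", len(docs), "documents reached the expensive step")
```

The saving comes from the expensive step running only on the texts that pass the threshold; in practice the scorer would be a trained model rather than a keyword count.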
fdaudens
posted an update 17 days ago
I read the 456-page AI Index report so you don't have to (kidding). The wild part? While AI gets ridiculously more accessible, the power gap is actually widening:

1️⃣ The democratization of AI capabilities is accelerating rapidly:
- The gap between open and closed models is basically closed: difference in benchmarks like MMLU and HumanEval shrunk to just 1.7% in 2024
- The cost to run GPT-3.5-level performance dropped 280x in 2 years
- Model size is shrinking while maintaining performance - Phi-3-mini hitting 60%+ MMLU at a fraction of the parameters of early models like PaLM

2️⃣ But we're seeing concerning divides deepening:
- Geographic: US private investment ($109B) dwarfs everyone else - 12x China's $9.3B
- Research concentration: US and China dominate highly-cited papers (50 and 34 respectively in 2023), while next closest is only 7
- Gender: Major gaps in AI skill penetration rates - US shows 2.39 vs 1.71 male/female ratio

The tech is getting more accessible but the benefits aren't being distributed evenly. Worth thinking about as these tools become more central to the economy.

Give it a read - fascinating portrait of where AI is heading! https://hai-production.s3.amazonaws.com/files/hai_ai_index_report_2025.pdf
prithivMLmods
posted an update 18 days ago
Loaded some domain-specific downstream image classification models for content moderation (essentially the practice of monitoring and filtering user-generated content on platforms), based on SigLIP-2 Base Patch16 with newly initialized trainable parameters. 🥠

+ Age-Classification-SigLIP2 : prithivMLmods/Age-Classification-SigLIP2
[ Age range classification from 0 to 65+ years ]
+ Facial-Emotion-Detection-SigLIP2 : prithivMLmods/Facial-Emotion-Detection-SigLIP2
[ Designed to classify different facial emotions ]
+ Hand-Gesture-2-Robot : prithivMLmods/Hand-Gesture-2-Robot
[ Human Hand Gesture Classification for Robot Control ]
+ Mature-Content-Detection : prithivMLmods/Mature-Content-Detection
[ Mature [adult] or neutral content categories ]
+ Vit-Mature-Content-Detection : prithivMLmods/Vit-Mature-Content-Detection
[ Mature [adult] or neutral content categories ft. ViT]
+ Human-Action-Recognition : prithivMLmods/Human-Action-Recognition
[ Human actions including clapping, sitting, running, and more ]
+ Mirage-Photo-Classifier : prithivMLmods/Mirage-Photo-Classifier
[ Whether an image is real or AI-generated (fake) ]
+ Food-101-93M : prithivMLmods/Food-101-93M
[ Classify food images into one of 101 popular dishes ]
+ Hand-Gesture-19 : prithivMLmods/Hand-Gesture-19
[ Classify hand gesture images into different categories ]
+ Trash-Net : prithivMLmods/Trash-Net
[ Classification of trash into six distinct categories ]
+ Gender-Classifier-Mini : prithivMLmods/Gender-Classifier-Mini
[ Classify images based on gender [Male / Female] ]

🎡Collections :

+ SigLIP2 Content Filters : https://huggingface.co/collections/prithivMLmods/siglip2-content-filters-models-67f001055ec2bed56ca41f6d
fdaudens
posted an update 19 days ago
See that purple banner on the Llama 4 models? It's Xet storage, and this is actually huge for anyone building with AI models. Let's geek out a little bit 🤓

Current problem: AI models are massive files using Git LFS. But with models getting bigger and downloads exploding, we needed something better.
Xet lets you version large files like code, with compression and deduplication, all Git-compatible. That means less bandwidth, faster sharing, and smoother collaboration.

Real numbers: ~25% deduplication on Llama 4 models, hitting ~40% for finetunes.

Scale matters here - the Hub served 2B model downloads in 30 days, Llama models alone at 60M. The upcoming Llama 4 Behemoth has 2T parameters! Xet's chunk-based system was built exactly for this.

This is the kind of engineering that makes the next wave of large models actually usable. Kudos to the team! 🧨

Check out the models collection: meta-llama/llama-4-67f0c30d9fe03840bc9d0164
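
To make the chunk-level deduplication idea in this post concrete, here is a toy sketch. It is not Xet's implementation: it assumes fixed-size chunks and SHA-256 hashes purely for simplicity, and `ChunkStore` and its methods are hypothetical names. It only shows why a fine-tune that shares most bytes with its base model needs far less new storage.

```python
import hashlib
from typing import Dict, List, Tuple


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size: int = 64) -> None:
        self.chunk_size = chunk_size
        self.chunks: Dict[str, bytes] = {}

    def add(self, data: bytes) -> Tuple[List[str], int]:
        """Store data; return its manifest (chunk hashes) and the bytes newly stored."""
        manifest = []
        new_bytes = 0
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            manifest.append(digest)
        return manifest, new_bytes

    def restore(self, manifest: List[str]) -> bytes:
        """Rebuild the original bytes from a manifest."""
        return b"".join(self.chunks[digest] for digest in manifest)


if __name__ == "__main__":
    # Made-up "model files": the fine-tune shares its first 768 bytes with the base.
    base = b"".join(i.to_bytes(4, "big") for i in range(256))
    finetune = base[:768] + b"".join(i.to_bytes(4, "big") for i in range(1000, 1064))

    store = ChunkStore()
    _, base_new = store.add(base)
    ft_manifest, ft_new = store.add(finetune)
    print("base stored:", base_new, "bytes; fine-tune added:", ft_new, "bytes")
    assert store.restore(ft_manifest) == finetune
```

Fixed-size chunks stop matching as soon as bytes are inserted or removed earlier in a file, which is one reason real systems tend to choose chunk boundaries based on content; that refinement is left out here.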
prithivMLmods
posted an update 19 days ago
ChatGPT-4o's image generation goes wild for a week, featuring everything from Studio Ghibli-style art and image colorization to style intermixing. Here are some examples showcasing the generation of highly detailed images from freestyle design templates. Want to know more? Check out the blog 🚀

🔗Blog : https://huggingface.co/blog/prithivMLmods/chatgpt-4o-image-gen
fdaudens
posted an update 20 days ago
"Am I going to be replaced by AI?" - Crucial question, but maybe we're asking the wrong one.

📈 There's a statistic from my reads this week that stays with me: Tomer Cohen, LinkedIn's CPO, tells Jeremy Kahn that 70% of skills used in most jobs will change by 2030. Not jobs disappearing, but transforming. And he calls out bad leadership: "If in one year's time, you are disappointed that your workforce is not 'AI native,' it is your fault."

🔄 Apparently, the Great Recalibration has begun. We're now heading into an era where AI is fundamentally redefining the nature of work itself, by forcing a complete reassessment of human value in the workplace, according to a piece in Fast Company. But it might be driven more by "the need for humans to change the way they work" than AI.

⚡ The Washington Post draws a crucial parallel: We're facing an "AI shock" similar to manufacturing's "China shock" - but hitting knowledge workers. Entry-level, white-collar work especially could get automated. The key difference? "Winning the AI tech competition with other countries won't be enough. It's equally vital to win the battle to re-skill workers."

Digging into these big questions in this week's AI in the News: https://fdaudens.substack.com/publish/posts/detail/160596301

Also, I'm curious: how are you keeping up with this pace of change? What strategies are working for you?
zamal
posted an update 21 days ago
🚀 DeepGit Lite is live! 🔍✨

Hey folks!
Just launched DeepGit Lite - a lighter version of DeepGit with fewer components under the hood.
It won't perform quite like the full powerhouse, but it's great for a quick peek and first-hand feel! ⚙️👀

Give it a spin and tell us what you think!
👉 Try it here: zamal/DeepGit-lite
#opensource #DeepGit #gradio #githubresearch
fdaudens
posted an update 22 days ago
Did we just drop personalized AI evaluation?! This tool auto-generates custom benchmarks on your docs to test which models are the best.

Most benchmarks test general capabilities, but what matters is how models handle your data and tasks. YourBench helps answer critical questions like:
- Do you really need a hundreds-of-billions-parameter model sledgehammer to crack a nut?
- Could a smaller, fine-tuned model work better?
- How well do different models understand your domain?

Some cool features:
📚 Generates custom benchmarks from your own documents (PDFs, Word, HTML)
🎯 Tests models on real tasks, not just general capabilities
🔄 Supports multiple models for different pipeline stages
🧠 Generates both single-hop and multi-hop questions
🔝 Evaluates top models and deploys leaderboards instantly
💰 Full cost analysis to optimize for your budget
🛠️ Fully configurable via a single YAML file

26 SOTA models tested for question generation. Interesting finding: Qwen2.5 32B leads in question diversity, while smaller Qwen models and Gemini 2.0 Flash offer great value for cost.

You can also run it locally on any models you want.

I'm impressed. Try it out: yourbench/demo