dvilasuero (Daniel Vila)

reacted to frascuchon's post with 🔥 1 day ago

Extending datasets just got a whole lot easier! 🚀 With Sheets, I was able to create a Spanish version of the popular fka/awesome-chatgpt-prompts dataset in just a few minutes ⏱️.

Check out the resulting dataset: frascuchon/fka_awesome_chatgpt_es 📊

Want to try it out for yourself? Head over to the Sheets space and see how easy it is to extend and modify existing datasets 🤯. The possibilities are endless! 🌐

replied to their post 2 days ago

we're currently discussing this!

reacted to burtenshaw's post with 🔥 6 days ago

Super excited to release Autotrain MCP. This is an MCP server for training AI models, so you can use your AI tools to train your AI models 🤯.

🔗 burtenshaw/autotrain-mcp

Use this MCP server with tools like Claude Desktop, Cursor, VSCode, or Continue to do this:

- Define an ML problem like Image Classification, LLM fine-tuning, Text Classification, etc.
- The AI can retrieve models and datasets from the hub using the hub MCP.
- Training happens on a Hugging Face space, so no worries about hardware constraints.
- Models are pushed to the hub to be used with inference tools like Llama.cpp, vLLM, MLX, etc.
- Built on top of the AutoTrain library, so it has full integration with transformers and other libraries.

Everything is still under active development, but I’m super excited to hear what people build, and I’m open to contributions!

reacted to frascuchon's post with 🚀 7 days ago

Unlock the full potential of your datasets with SHEETS! It's incredibly easy to extend existing datasets and unlock new insights.

Leverage open-source models to translate, summarize, classify, and more - all directly within your existing columns.

Ready to give it a try? Explore the possibilities here: aisheets/sheets

reacted to Ameeeee's post with 🧠❤️🚀 8 days ago

With Sheets, try a new way to create structured content with the help of AI!

No installs. No login. Just open a link and 🤩

This app lets you create a dataset by importing a file or starting from a prompt.

What’s different about SHEETS?
🔎 Web search integration to ground answers in real-world data
📚 In-context learning from validated sources
🔗 Transparent sourcing — every result is linked
🧩 Runs on multiple open-source models

Fight hallucinations and start creating content you can rely on.

posted an update 8 days ago

Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.

A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.

Today, I'm thrilled to introduce our first step in this direction.

In a nutshell:

📁 Effortlessly run prompts and models over your data.
🌐 Agentic search for accuracy and real-time information.
🖼️ Familiar, minimalistic interface for interacting with data.
🎯 Human feedback 2.0: Your input directly improves generated data.
💯 Access hundreds of open models and leading inference providers.

Go to this space to try it out!

aisheets/sheets

Leave your questions below, we're just getting started!

reacted to burtenshaw's post with 🚀🤗 8 days ago

MCP course is now LIVE! We just dropped quizzes, videos, and live streams to make it a fully interactive course:

🔗 join in now:

mcp-course

- It’s still free!
- Video 1 walks you through onboarding to the course
- The first live session is next week!
- You can now get a certificate via the exam app
- We improved the written material with interactive quizzes

If you’re studying MCP and want a live, interactive, visual, certified course, then join us on the hub!

reacted to davanstrien's post with 👍 8 days ago

Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
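
To illustrate the "parameter count via safetensors metadata" idea, here is a minimal sketch, not the tool's actual implementation. It assumes the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header of tensor names, dtypes, and shapes) and builds a tiny in-memory file for the demo. The helper names `build_safetensors_bytes` and `count_parameters` are hypothetical.

```python
import json
import struct
from math import prod

# Bytes per element for the few dtypes used in this demo.
DTYPE_SIZES = {"F32": 4, "F16": 2, "BF16": 2, "I64": 8}


def build_safetensors_bytes(tensors):
    """Build a minimal safetensors-style buffer (JSON header plus zero-filled data)."""
    header = {}
    offset = 0
    for name, (dtype, shape) in tensors.items():
        nbytes = prod(shape) * DTYPE_SIZES[dtype]
        header[name] = {
            "dtype": dtype,
            "shape": list(shape),
            "data_offsets": [offset, offset + nbytes],
        }
        offset += nbytes
    header_bytes = json.dumps(header).encode("utf-8")
    # Layout: u64 little-endian header length, header JSON, then raw tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(offset)


def count_parameters(buffer):
    """Count parameters per dtype by reading only the JSON header, never the weights."""
    (header_len,) = struct.unpack("<Q", buffer[:8])
    header = json.loads(buffer[8:8 + header_len])
    counts = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry, not a tensor
            continue
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + prod(info["shape"])
    return counts


demo_buffer = build_safetensors_bytes({
    "embed.weight": ("F16", (10, 4)),
    "linear.weight": ("F32", (4, 4)),
    "linear.bias": ("F32", (4,)),
})
param_counts = count_parameters(demo_buffer)
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

Because only the header is parsed, the same approach works on the first few kilobytes of a large checkpoint without downloading the weights themselves.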

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp

reacted to davanstrien's post with 🔥 15 days ago

Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)

reacted to frascuchon's post with ❤️👍 16 days ago

Hey! I built RAG MCP Server Space, a simple Gradio MCP server for RAG systems that allows you to search relevant results without passing huge contexts to your LLM.

You can use this space to integrate with your agents and improve the efficiency of your search results. Feel free to try it out and let me know if you have any feedback or questions!

frascuchon/rag-mcp-server

Thanks for checking it out!

reacted to davidberenstein1957's post with 🔥❤️👀 5 months ago

You can now use the "Synthetic Data Generator" at a much larger scale with your preferred inference engine: Ollama, vLLM, TGI, and serverless inference! 🔥

Install, configure, launch!

Space: https://huggingface.co/spaces/argilla/synthetic-data-generator?duplicate=true
Examples: https://github.com/argilla-io/synthetic-data-generator/tree/main/examples

reacted to nataliaElv's post with 🔥❤️ 5 months ago

New chapter in the Hugging Face NLP course! 🤗 🚀

We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.

Any feedback for improvements welcome!

https://huggingface.co/learn/nlp-course/chapter10

reacted to davanstrien's post with 🚀 5 months ago

The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pre-training.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.
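
As a rough illustration of the annotate-then-distill workflow described above, here is a toy sketch under loud assumptions. `llm_annotate` is a hypothetical keyword-based stand-in for a real LLM call, and a small naive Bayes word model stands in for the encoder-only classifier. Neither reflects how the FineWeb pipeline is actually implemented.

```python
import math
from collections import Counter


def llm_annotate(text):
    """Hypothetical stand-in for an expensive LLM judge: 1 = educational, 0 = not."""
    educational_words = {"explains", "lesson", "theorem", "tutorial"}
    return int(any(word in educational_words for word in text.lower().split()))


class TinyClassifier:
    """Naive Bayes over words; a placeholder for a fine-tuned encoder-only model."""

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def predict(self, text):
        n_examples = sum(self.class_counts.values())
        scores = {}
        for label in (0, 1):
            # Add-one smoothing on both the class prior and the word likelihoods.
            score = math.log((self.class_counts[label] + 1) / (n_examples + 2))
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / total)
            scores[label] = score
        return max(scores, key=scores.get)


# Step 1: the "LLM" labels only a small subset.
subset = [
    "this tutorial explains photosynthesis step by step",
    "buy cheap watches online now",
    "a lesson on fractions for beginners",
    "click here to win a prize now",
]
subset_labels = [llm_annotate(text) for text in subset]

# Step 2: train the cheap model on those labels.
model = TinyClassifier().fit(subset, subset_labels)

# Step 3: the cheap model labels the rest of the data.
unlabeled = ["a tutorial on fractions", "win cheap watches now"]
predicted = [model.predict(text) for text in unlabeled]
print(predicted)
```

The weak point of this recipe is step 1: if the LLM's labels are unreliable in a given language, the smaller model inherits those errors, which is why human annotations are useful for checking them.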

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluating whether an LLM can label educational quality well for texts in that language
- Training quality classifiers directly
- Helping discover other rules and heuristics for refining fineweb2 further for different languages.

This week the following languages were completed:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more? https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c
