Daniel Vila PRO

dvilasuero

AI & ML interests

Data

Recent Activity

updated a dataset about 4 hours ago
dvilasuero/alpaca-spanish-translation-quality-eval
published a dataset about 4 hours ago
dvilasuero/alpaca-spanish-translation-quality-eval
liked a dataset about 5 hours ago
Ameeeee/Real_estate_table_Generated_ad

Organizations

Hugging Face, SomosNLP, Libre Euro Lingua-Alliance, Hugging Face H4, Hugging Face OSS Metrics, Argilla, Blog-explorers, Hugging Face Smol Models Research, h4-argilla-collab, ZeroGPU Explorers, mLLM multilingual, DIBT Spanish, Data is Better Together - Russian Language Team, Open Arabic LLM Leaderboard, distilabel-internal-testing, ORPO Explorers, Data Is Better Together, Social Post Explorers, HuggingFaceFW-Dev, LLHF, UCSF-JHU Opioid Industry Documents Archive, SLLHF, Hugging Quants, argilla-internal-testing, Argilla Warehouse, rg-preview, Dataset Tools, open/ acc, Data Is Better Together Contributor, Open R1, Hugging Face Sheets, Hugging Face MCP Course

dvilasuero's activity

reacted to frascuchon's post with 🔥 1 day ago
Post
1536
Extending datasets just got a whole lot easier! 🚀 With Sheets, I was able to create a Spanish version of the popular fka/awesome-chatgpt-prompts dataset in just a few minutes ⏱️.

Check out the resulting dataset: frascuchon/fka_awesome_chatgpt_es 📊

Want to try it out for yourself? Head over to the Sheets space and see how easy it is to extend and modify existing datasets 🤯. The possibilities are endless! 🌐
replied to their post 2 days ago
reacted to burtenshaw's post with 🔥 6 days ago
Post
1333
Super excited to release Autotrain MCP. This is an MCP server for training AI models, so you can use your AI tools to train your AI models 🤯.

🔗 burtenshaw/autotrain-mcp

Use this MCP server with tools like Claude Desktop, Cursor, VSCode, or Continue to do this:

- Define an ML problem like Image Classification, LLM fine-tuning, Text Classification, etc.
- The AI can retrieve models and datasets from the hub using the hub MCP.
- Training happens on a Hugging Face space, so no worries about hardware constraints.
- Models are pushed to the hub to be used with inference tools like Llama.cpp, vLLM, MLX, etc.
- Built on top of the AutoTrain library, so it has full integration with transformers and other libraries.

Everything is still under active development, but I'm super excited to hear what people build, and I'm open to contributions!
  • 1 reply
reacted to frascuchon's post with 🚀 7 days ago
Post
1277
Unlock the full potential of your datasets with SHEETS! It's incredibly easy to extend existing datasets and unlock new insights.

Leverage open-source models to translate, summarize, classify, and more - all directly within your existing columns.

Ready to give it a try? Explore the possibilities here: aisheets/sheets
  • 2 replies
reacted to Ameeeee's post with 🧠❤️🚀 8 days ago
Post
1690
With Sheets, try a new way to create structured content with the help of AI!

No installs. No login. Just open a link and 🤩

This app lets you create a dataset by importing a file or starting from a prompt.

What's different about SHEETS?
🔎 Web search integration to ground answers in real-world data
📚 In-context learning from validated sources
🔗 Transparent sourcing: every result is linked
🧩 Runs on multiple open-source models

Fight hallucinations and start creating content you can rely on.

posted an update 8 days ago
Post
2426
Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.

A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.

Today, I'm thrilled to introduce our first step in this direction.


In a nutshell:

πŸ“ Effortlessly run prompts and models over your data.
🌐 Agentic search for accuracy and real-time information.
πŸ–ΌοΈ Familiar, minimalistic interface for interacting with data.
🎯 Human feedback 2.0: Your input directly improves generated data.
πŸ’― Access hundreds of open models and leading inference providers.

Go to this space to try it out!

aisheets/sheets

Leave your questions below, we're just getting started!
  • 2 replies
reacted to burtenshaw's post with 🚀🤗 8 days ago
Post
2551
MCP course is now LIVE! We just dropped quizzes, videos, and live streams to make it a fully interactive course:

🔗 Join in now: mcp-course

- It's still free!
- Video 1 walks you through onboarding to the course
- The first live session is next week!
- You can now get a certificate via exam app
- We improved the written material with interactive quizzes

If you're studying MCP and want a live, interactive, visual, certified course, then join us on the hub!
reacted to davanstrien's post with 👍 8 days ago
Post
2704
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
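
To make the "beyond keyword matching" point concrete, here is a minimal sketch of ranking items by cosine similarity between embedding vectors. It is not the implementation behind the tool described above: the dataset names, the three-dimensional vectors, and the `rank` helper are all made up for illustration, and a real system would get its vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank(query_vec, items):
    """Return item names sorted from most to least similar to the query."""
    return sorted(items, key=lambda name: cosine(query_vec, items[name]), reverse=True)

# Hypothetical, hand-written "embeddings"; real ones come from a model.
toy_embeddings = {
    "math-reasoning-dataset": [0.9, 0.1, 0.0],
    "legal-reasoning-dataset": [0.7, 0.0, 0.6],
    "image-captions-dataset": [0.0, 0.9, 0.1],
}

# Stand-in for an embedded query such as "reasoning outside maths".
query = [0.6, 0.0, 0.8]
ranked = rank(query, toy_embeddings)
print(ranked)
```

Because ranking depends on vector direction and not on shared words, an item can score highly even when its name or description has no keyword in common with the query.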

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp
reacted to davanstrien's post with 🔥 15 days ago
Post
2246
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)
reacted to frascuchon's post with ❤️👍 16 days ago
Post
2984
Hey! I built RAG MCP Server Space, a simple Gradio MCP server for RAG systems that allows you to search relevant results without passing huge contexts to your LLM.

You can use this space to integrate with your agents and improve the efficiency of your search results. Feel free to try it out and let me know if you have any feedback or questions!

frascuchon/rag-mcp-server

Thanks for checking it out!
reacted to davidberenstein1957's post with 🔥❤️👀 5 months ago
reacted to nataliaElv's post with 🔥❤️ 5 months ago
Post
1525
New chapter in the Hugging Face NLP course! 🤗 🚀

We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.

Any feedback for improvements welcome!

https://huggingface.co/learn/nlp-course/chapter10
reacted to davanstrien's post with 🚀 5 months ago
Post
2310
The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have got 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pretraining.
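
As a rough illustration of that filtering step, the sketch below keeps only documents at or above a quality threshold. The `edu_score` field, the 0 to 5 style scale, and the sample documents are assumptions made for the example, not the actual fineweb-c or fineweb-2 schema.

```python
def filter_by_quality(docs, min_score):
    """Keep only documents whose quality score meets the threshold."""
    return [doc for doc in docs if doc["edu_score"] >= min_score]

# Hypothetical documents with a made-up `edu_score` field.
docs = [
    {"text": "An explanation of photosynthesis for students.", "edu_score": 4},
    {"text": "Buy cheap watches now!!!", "edu_score": 0},
    {"text": "A short note on local weather.", "edu_score": 2},
]

kept = filter_by_quality(docs, min_score=3)
print(len(kept), "of", len(docs), "documents kept")
```

Raising `min_score` trades dataset size for quality, which is where the reduction in pretraining data comes from.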

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages other than English. This is where fineweb-c (community) comes in.

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality for texts in that language well
- Directly be used for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages.
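
For the first use above, one simple way to check an LLM against community annotations is to measure raw agreement and Cohen's kappa, which corrects for agreement expected by chance. This is a sketch under assumptions: the label names and the two label lists are invented for the example and are not taken from the dataset.

```python
from collections import Counter

def agreement_report(human_labels, llm_labels):
    """Compare LLM-assigned labels against human annotations."""
    if len(human_labels) != len(llm_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    # Fraction of items where both annotators chose the same label.
    observed = sum(h == m for h, m in zip(human_labels, llm_labels)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    human_counts = Counter(human_labels)
    llm_counts = Counter(llm_labels)
    expected = sum(human_counts[c] * llm_counts[c] for c in human_counts) / (n * n)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return {"accuracy": observed, "kappa": kappa}

# Hypothetical annotations for six texts in one language.
human = ["none", "minimal", "good", "good", "none", "excellent"]
llm = ["none", "good", "good", "good", "minimal", "excellent"]

report = agreement_report(human, llm)
print(report)
```

A kappa near 0 means the LLM agrees with annotators no better than chance for that language, even if raw accuracy looks acceptable.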

This week the following languages were done:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c
  • 1 reply