h4-argilla-collab

Recent Activity

h4-argilla's activity

dvilasuero
posted an update 8 days ago
Super excited to launch Hugging Face Sheets: Spreadsheets meet AI and unstructured data.

A few months ago, we started imagining new ways to build and transform datasets with the latest open-source models.

Today, I'm thrilled to introduce our first step in this direction.


In a nutshell:

- Effortlessly run prompts and models over your data.
- Agentic search for accuracy and real-time information.
- Familiar, minimalistic interface for interacting with data.
- Human feedback 2.0: Your input directly improves generated data.
- Access hundreds of open models and leading inference providers.

Go to this space to try it out!

aisheets/sheets

Leave your questions below, we're just getting started!
lewtun
posted an update 3 months ago
Introducing OlympicCoder: a series of open reasoning models that can solve olympiad-level programming problems

- 7B open-r1/OlympicCoder-7B
- 32B open-r1/OlympicCoder-32B

We find that OlympicCoder models outperform Claude 3.7 Sonnet, as well as other models over 100x larger.

Together with the models, we are releasing:

- CodeForces-CoTs: a new dataset of code problems from the most popular competitive coding platform, with R1 traces in C++ and Python: open-r1/codeforces-cots

- IOI'2024: a new benchmark of VERY hard programming problems where even frontier models struggle to match human performance: open-r1/ioi

For links to the models and datasets, check out our latest progress report from Open R1: https://huggingface.co/blog/open-r1/update-3
lewtun
posted an update 4 months ago
Introducing OpenR1-Math-220k!

open-r1/OpenR1-Math-220k

The community has been busy distilling DeepSeek-R1 from inference providers, but we decided to have a go at doing it ourselves from scratch.

What's new compared to existing reasoning datasets?

- Based on AI-MO/NuminaMath-1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the popular NuminaMath-CoT dataset.

- 800k R1 reasoning traces: we generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.

- 512 H100s running locally: instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.

- Automated filtering: we apply Math Verify to retain only problems with at least one correct answer. We also leverage Llama3.3-70B-Instruct as a judge to retrieve more correct examples (e.g. for cases with malformed answers that can't be verified with a rules-based parser).

- We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.

Read our blog post for all the nitty gritty details: https://huggingface.co/blog/open-r1/update-2
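
The filtering rule described in this post (retain a problem only if at least one of its generated answers verifies as correct) can be sketched in a few lines. This is a minimal illustration and not the Open R1 pipeline: `is_correct` is a hypothetical stand-in for a real verifier such as Math Verify, and the record layout (`reference`, `generations`, `correctness`) is assumed for the example.

```python
def normalize(answer: str) -> str:
    """Toy normalization: drop surrounding whitespace, dollar signs, and inner spaces."""
    return answer.strip().strip("$").replace(" ", "")


def is_correct(predicted: str, reference: str) -> bool:
    """Hypothetical stand-in for a rules-based verifier such as Math Verify."""
    return normalize(predicted) == normalize(reference)


def filter_problems(problems, verifier=is_correct):
    """Keep problems that have at least one verified answer.

    Each kept problem gets a `correctness` list with one flag per generation
    (an assumed layout, chosen for illustration).
    """
    kept = []
    for problem in problems:
        flags = [verifier(g["answer"], problem["reference"]) for g in problem["generations"]]
        if any(flags):
            kept.append({**problem, "correctness": flags})
    return kept


# Two made-up problems with two generations each.
problems = [
    {"id": 1, "reference": "42", "generations": [{"answer": "$ 42 $"}, {"answer": "41"}]},
    {"id": 2, "reference": "x+1", "generations": [{"answer": "x-1"}, {"answer": "2x"}]},
]
filtered = filter_problems(problems)
```

A judge model, as mentioned in the post, could be plugged in as a second `verifier` that is tried only on answers the rules-based check rejects.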
lewtun
posted an update 5 months ago
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!

- Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.

- Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.

- Step 3: show we can go from base model -> SFT -> RL via multi-stage training.

Follow along: https://github.com/huggingface/open-r1
lewtun
posted an update 5 months ago
I was initially pretty sceptical about Meta's Coconut paper [1] because the largest perf gains were reported on toy linguistic problems. However, these results on machine translation are pretty impressive!

https://x.com/casper_hansen_/status/1875872309996855343

Together with the recent PRIME method [2] for scaling RL, reasoning for open models is looking pretty exciting for 2025!

[1] Training Large Language Models to Reason in a Continuous Latent Space (2412.06769)
[2] https://huggingface.co/blog/ganqu/prime
lewtun
posted an update 6 months ago
This paper (HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (2412.18925)) has a really interesting recipe for inducing o1-like behaviour in Llama models:

* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify the correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine, where there are lots of aliases).
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc. that one sees in o1.
* Use the resulting data for SFT & RL
* Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks and SFT on this data already gives a strong improvement.

Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
lewtun
posted an update 6 months ago
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute.

How? By combining step-wise reward models with tree search algorithms :)

We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"

We're open sourcing the full recipe and sharing a detailed blog post.

In our blog post we cover:

- Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.

- Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.

- Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM.

Here are the links:

- Blog post: HuggingFaceH4/blogpost-scaling-test-time-compute

- Code: https://github.com/huggingface/search-and-learn

Enjoy!
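
To make the idea of pairing a reward model with search concrete, here is a minimal sketch of the simplest verifier-guided strategy: sample several candidate solutions, score each one, and pick a final answer. It is deliberately not the tree search or DVTS method from the post. It shows plain Best-of-N next to a weighted variant that sums scores across candidates sharing the same final answer. The scores are made up; in a real system they would come from a reward model.

```python
from collections import defaultdict


def best_of_n(candidates):
    """Return the final answer of the single highest-scoring candidate."""
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer


def weighted_best_of_n(candidates):
    """Return the final answer whose candidates have the highest summed score.

    `candidates` is a list of (final_answer, reward_score) pairs.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)


# Five sampled solutions to one problem, with illustrative (made-up) scores.
samples = [("12", 0.9), ("15", 0.5), ("15", 0.3), ("15", 0.25), ("7", 0.1)]

top_single = best_of_n(samples)          # picks the one best-scored sample
top_weighted = weighted_best_of_n(samples)  # rewards agreement across samples
```

In this toy example the two strategies disagree: one sample for "12" has the highest individual score, while the three samples for "15" accumulate a larger total.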
dvilasuero
posted an update 6 months ago
Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

200+ contributors used Argilla to annotate MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. Culturally Agnostic: no specific regional or cultural knowledge is required.
2. Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.

Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: https://huggingface.co/datasets/CohereForAI/Global-MMLU
dvilasuero
posted an update 7 months ago
dvilasuero
posted an update 8 months ago
Build datasets for AI on the Hugging Face Hub, 10x easier than ever!

Today, I'm excited to share our biggest feature since we joined Hugging Face.

Here's how it works:

1. Pick a dataset: upload your own or choose from 240K open datasets.
2. Paste the Hub dataset ID into Argilla and set up your labeling interface.
3. Share the URL with your team or the whole community!

And the best part? It's:
- No code: no Python needed
- Integrated: all within the Hub
- Scalable: from solo labeling to 100s of contributors

I am incredibly proud of the team for shipping this after weeks of work and many quick iterations.

Let's make this sentence obsolete: "Everyone wants to do the model work, not the data work."


Read, share, and like the HF blog post:
https://huggingface.co/blog/argilla-ui-hub
dvilasuero
posted an update 8 months ago
Big news! You can now build strong ML models without days of human labelling.

You simply:
- Define your dataset, including annotation guidelines, labels, and fields.
- Optionally label some records manually.
- Use an LLM to auto-label your data with a human (you? your team?) in the loop!

Get started with this blog post:
https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
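
One common way to keep a human in the loop is to accept model suggestions only when they are confident and to queue everything else for review. The sketch below illustrates that routing idea under stated assumptions; it is not the workflow from the linked blog post. `suggest_label` is a hypothetical stand-in for an LLM call (here a toy keyword matcher), and the 0.8 threshold is an arbitrary example value.

```python
# Toy keyword lists standing in for a model's knowledge (assumption).
KEYWORDS = {
    "positive": {"great", "love"},
    "negative": {"bad", "broken"},
}


def suggest_label(text):
    """Hypothetical stand-in for an LLM call: returns (label, confidence)."""
    words = set(text.lower().split())
    hits = {label: len(words & vocab) for label, vocab in KEYWORDS.items()}
    label = max(hits, key=hits.get)
    total = sum(hits.values())
    confidence = hits[label] / total if total else 0.0
    return label, confidence


def triage(records, threshold=0.8):
    """Split records into auto-accepted suggestions and a human review queue."""
    auto, review = [], []
    for text in records:
        label, confidence = suggest_label(text)
        target = auto if confidence >= threshold else review
        target.append((text, label))
    return auto, review


records = [
    "I love this great tool",
    "bad docs but great support",
    "it arrived on Tuesday",
]
auto, review = triage(records)
```

Records in `review` would then be shown to annotators with the suggested label pre-filled, so that a person confirms or corrects it.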
dvilasuero
posted an update 9 months ago
Explore FinePersonas visually with Argilla and black-forest-labs/FLUX.1-schnell


Excited to share this space where the community can explore a tiny subset of FinePersonas

argilla/finepersonas


Dataset built with distilabel and free Serverless endpoints.

This is just a first step towards more interesting experiments with FinePersonas. For example, can we use it to assess biases in text2image models?

If you have ideas I'd love to hear them in the comments!

dvilasuero
posted an update about 1 year ago
Today is a huge day in Argilla's history. We couldn't be more excited to share this with the community: we're joining Hugging Face!

We're embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we've been collaborating with Hugging Face on countless projects: launching partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr's learnings, the Data is Better Together initiative with hundreds of community contributors, or releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we're now the same team.

To those of you who've been following us, this won't be a huge surprise, but it will be a big deal in the coming months. This acquisition means we'll double down on empowering the community to build and collaborate on high quality datasets, we'll bring full support for multimodal datasets, and we'll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
lewtun
posted an update about 1 year ago
Introducing Zephyr 141B-A35B:

HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1

Yesterday, Mistral released their latest base model (via magnet link of course) and the community quickly converted it to transformers format and pushed it to the Hub: mistral-community/Mixtral-8x22B-v0.1

Early evals of this model looked extremely strong, so we teamed up with Argilla and KAIST AI to cook up a Zephyr recipe with a few new alignment techniques that came out recently:

- Align the base model with Odds Ratio Preference Optimisation (ORPO). This novel algorithm, developed by @JW17, @nlee-208, and @j6mes, does not require an SFT step to achieve high performance and is thus much more computationally efficient than methods like DPO and PPO.

- Use a brand new dataset of 7k high-quality, multi-turn preferences that has been developed by our friends at Argilla. To create this dataset, they took the excellent Capybara SFT dataset from @LDJnr (LDJnr/Capybara) and converted it into a preference dataset by augmenting the final turn with responses from new LLMs that were then ranked by GPT-4.

What we find especially neat about this approach is that training on 7k samples only takes ~1.3h on 4 H100 nodes, yet produces a model that is very strong on chat benchmarks like IFEval and BBH.

Kudos to @alvarobartt @JW17 and @nlee-208 for this very nice and fast-paced collab!

For more details on the paper and dataset, check out our collection: HuggingFaceH4/zephyr-orpo-6617eba2c5c0e2cc3c151524
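
For readers curious why ORPO needs no separate SFT stage, here is a minimal scalar sketch of its objective as described in the ORPO paper: a standard negative log-likelihood term on the chosen response plus a weighted odds-ratio penalty that pushes the chosen response's odds above the rejected one's. This is an illustration and not the training code used for Zephyr. The inputs are assumed to be average per-token log-probabilities, and the weight `lam=0.1` is an arbitrary example value.

```python
import math


def log_odds(logp: float) -> float:
    """Compute log(p / (1 - p)) from log p, for p strictly between 0 and 1."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """NLL on the chosen response plus lam times the odds-ratio term."""
    nll = -logp_chosen
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z)).
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + lam * odds_ratio_term


# The loss is lower when the model already prefers the chosen response.
prefers_chosen = orpo_loss(logp_chosen=-0.5, logp_rejected=-2.0)
prefers_rejected = orpo_loss(logp_chosen=-2.0, logp_rejected=-0.5)
```

Because both terms are computed from the same forward passes over the chosen and rejected responses, the preference signal is folded into a single fine-tuning stage.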
dvilasuero
posted an update over 1 year ago
Community and Data Quality Are More For Alignment

A recipe to replicate SPIN (Self-Play Fine Tuning) with 30x less data:

- 50K samples vs 1.8K prompts curated by the 350+ amazing DIBT contributors.
- Distillation of Mistral Large instead of OpenAI
- Open data & code with distilabel

SPIN Paper:
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)

SPIN DIBT Collection with datasets and models:
argilla/dibt-prompt-collective-spin-65ef59062518776024395fc3

Repo:
https://github.com/argilla-io/distilabel-spin-dibt

Joint work with the amazing DIBT community:
@aashish1904 , @flozi00 , @sayhan , @munish0838 , @0-hero , @dvilasuero , @eren23 , @davanstrien , @ahnz , @BlackKakapo , @kitano-o , @mmhamdy , @sdiazlor , @Stopwolf , @gabrielmbmb , @tculler91 , @plaguss , @ignacioct , @Hugi-R , @davidberenstein1957 , @Korla , @alvarobartt , @Hugs4Llamas , @Sumandora , @nataliaElv , @jfcalvo , @Averill , @steventrouble , @vasilis , @aeros93 , @kayyshf , @thomasgauthier , @jeromebas , @Ameeeee , @ayoubelmhamdi , @TuringsSolutions , @efels , @Haleyok , @abrazador , @emessy , @Nindaleth , @burtenshaw , @vicgalle , @CortexPE , @casey-martin , @Leire-aguirre-eguiluz , @mrfakename , @Portias600kNeurons , @nathaliepett , @Filippo