Natalia Elvira

nataliaElv

AI & ML interests

Data curation, high-quality data, multilinguality, NLP & computational linguistics

Organizations

SomosNLP · Blog-explorers · Hugging Face Discord Community · Dataset Tools · Data Is Better Together Contributor · The Newsroom

nataliaElv's activity

posted an update 3 months ago
New chapter in the Hugging Face NLP course! 🤗 🚀

We've added a new chapter about the very basics of Argilla to the Hugging Face NLP course. Learn how to set up an Argilla instance, load & annotate datasets, and export them to the Hub.

Any feedback for improvements welcome!

https://huggingface.co/learn/nlp-course/chapter10
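
A minimal sketch of the workflow the chapter covers, using the Argilla Python SDK (v2.x). The instance URL, API key, and dataset names below are placeholders, and the chapter itself is the authoritative reference:

```python
import argilla as rg

# Connect to your Argilla instance (e.g. a deployed Hugging Face Space).
client = rg.Argilla(
    api_url="https://<your-argilla-space>.hf.space",  # placeholder
    api_key="<your-api-key>",  # placeholder
)

# Define a simple annotation task: one text field, one label question.
settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[rg.LabelQuestion(name="sentiment", labels=["positive", "negative"])],
)
dataset = rg.Dataset(name="course_demo", settings=settings, client=client)
dataset.create()

# Log a record to annotate in the UI...
dataset.records.log([{"text": "This course is great!"}])

# ...and, once annotated, export the dataset to the Hugging Face Hub.
dataset.to_hub(repo_id="<your-username>/course_demo")
```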
reacted to davanstrien's post with 🚀 4 months ago
The data-is-better-together/fineweb-c dataset is growing!

This week a few more languages have reached 1,000 annotations for the educational quality of data from HuggingFaceFW/fineweb-2.

Why should you care?

The quality of pre-training data can have a big impact on the performance of downstream language models trained on that data (HuggingFaceFW/blogpost-fineweb-v1).

Being able to filter by educational quality is one way of improving the quality of the data you use for training an LLM. Very importantly, this approach can also reduce the amount of data needed for pretraining.

Why not use an LLM?

LLMs can be used to annotate educational quality for a subset of data. This data can then be used to train a smaller encoder-only model to label the full dataset. However, this may not work well for languages outside of English. This is where fineweb-c (community) comes in.
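
A rough sketch of that annotate-then-distill step: once an LLM has labelled a subset, a small encoder-only model can be fine-tuned on those labels with the transformers Trainer. The model choice, dataset id, and column names here are illustrative assumptions, not details from the post:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "FacebookAI/xlm-roberta-base"  # assumed multilingual encoder

# Hypothetical dataset with a "text" column and an integer "label" (0-5 score).
ds = load_dataset("my-org/llm-annotated-edu-quality")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenized = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

# Six classes for a 0-5 educational-quality scale.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=6)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="edu-classifier", num_train_epochs=1),
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```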

The community is annotating the educational quality of fineweb2 data. Currently 114 languages have some annotations. These annotations will enable a number of things:

- Evaluate whether an LLM can label the educational quality of texts in that language well
- Be used directly for training quality classifiers
- Help discover other rules and heuristics for refining fineweb2 further for different languages

This week the following languages were completed:

Swedish thanks to: @Lauler @AntonVic @ohallstrom @bjarlestam @menbom @Ekgren @apsod

Ukrainian thanks to: @hannayukhymenko @robinhad @realPivo @RabotiahovDmytro @reciprocate

Assamese thanks to: @moyoor97 @Arpanjyoti @nawaf-helmi123 @pahigogoi1 @aelhence @kishorekashyap

Want to learn more: https://huggingface.co/blog/davanstrien/fineweb2-community

Contribute yourself here: data-is-better-together/fineweb-c
posted an update 4 months ago
Do you want to easily save annotations to a dataset on the Hub?

In the latest version of Argilla (v2.6.0), you can export your data directly from the UI to the Hub.

Check all the changes and update to the latest version: https://github.com/argilla-io/argilla/releases/tag/v2.6.0
posted an update 4 months ago
If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines, or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU
posted an update 5 months ago
How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm being a bit harsher in my evaluations than my teammates 😂


Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket.Chat: HuggingFaceFW/discussion
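
If you'd rather script the comparison than use the Space, something along these lines works with the datasets library. Note that the config name and the column names ("annotator_ids", "educational_value_labels") are assumptions about the fineweb-c schema, so check the dataset card first:

```python
from collections import Counter
from datasets import load_dataset

# Config name is an assumed example; pick the language you annotated.
ds = load_dataset("data-is-better-together/fineweb-c", "spa_Latn", split="train")

MY_ID = "<your-annotator-id>"  # placeholder
mine, others = Counter(), Counter()
for row in ds:
    # Assumed schema: parallel lists of annotator ids and their labels.
    for annotator, label in zip(row["annotator_ids"], row["educational_value_labels"]):
        (mine if annotator == MY_ID else others)[label] += 1

def shares(counts):
    total = sum(counts.values())
    return {label: round(n / total, 2) for label, n in counts.items()}

print("my label distribution:        ", shares(mine))
print("teammates' label distribution:", shares(others))
```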
posted an update 5 months ago
We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard
posted an update 5 months ago
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌍

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6
posted an update 5 months ago
You can now add your Bluesky handle to your Hugging Face profile! 🦋
Have you noticed?
reacted to maxiw's post with ❤️ 5 months ago
I was curious to see what people post here on HF so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
โค๏ธ: 7048 times
๐Ÿ”ฅ: 5921 times
๐Ÿ‘: 4856 times
๐Ÿš€: 2549 times
๐Ÿค—: 2065 times
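
Stats like these can be reproduced in a few lines once the dataset is loaded; the column names below ("author", "totalUniqueImpressions") are guesses at the schema rather than confirmed fields:

```python
from datasets import load_dataset

# Pull all posts into pandas and rank authors by summed impressions.
posts = load_dataset("maxiw/hf-posts", split="train").to_pandas()

top = (
    posts.groupby("author")
    .agg(impressions=("totalUniqueImpressions", "sum"), posts=("author", "size"))
    .sort_values("impressions", ascending=False)
    .head(5)
)
print(top)
```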
reacted to gabrielmbmb's post with ❤️ 9 months ago
distilabel 1.3.0 is out! This release contains many core improvements and new tasks that helped us build argilla/magpie-ultra-v0.1!

Distributed pipeline execution with Ray, new Magpie tasks, reward models, components for dataset diversity based on sentence embeddings, Argilla 2.0 compatibility, and many more features!

Check the new release in GitHub: https://github.com/argilla-io/distilabel
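
For a flavour of the new Magpie support, here is a minimal pipeline sketch against the 1.x API. The model id and template choice are placeholders, and exact argument names may differ between releases, so treat this as a sketch rather than the release's canonical example:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import MagpieGenerator

with Pipeline(name="magpie-demo") as pipeline:
    # MagpieGenerator synthesizes instruction-response pairs from an open LLM.
    magpie = MagpieGenerator(
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
            magpie_pre_query_template="llama3",
        ),
        num_rows=100,
    )

distiset = pipeline.run()
distiset.push_to_hub("<your-username>/magpie-demo")  # placeholder repo id
```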

reacted to alex-abb's post with 🔥 10 months ago
Hi everyone!
I'm Alex, I'm 16, and I've been doing an internship at Hugging Face for a little over a week. I've already learned a lot about using and prompting LLMs. With @victor as my tutor, I've just finished a Space that analyzes your feelings by prompting an LLM chat model. The aim is to extend it so that it can categorize Hugging Face posts.

alex-abb/LLM_Feeling_Analyzer
reacted to dvilasuero's post with 🚀 11 months ago
Today is a huge day in Argilla's history. We couldn't be more excited to share this with the community: we're joining Hugging Face!

We're embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we've been collaborating with Hugging Face on countless projects: becoming a launch partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr's learnings, the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we're now the same team.

To those of you who've been following us, this won't be a huge surprise, but it will be a big deal in the coming months. This acquisition means we'll double down on empowering the community to build and collaborate on high quality datasets, we'll bring full support for multimodal datasets, and we'll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
reacted to Salama1429's post with 🧠 11 months ago
📺 Introducing the YouTube-Commons Dataset 📺

๐ŸŒ Overview: The YouTube Commons Dataset is a comprehensive collection of 30 billion words from 15,112,121 original and automatically translated transcripts, drawn from 2,063,066 videos on YouTube.

🔗 License: All videos are shared under the CC-BY license, with the majority (71%) in English.

🤖 Applications: This dataset is ideal for training powerful speech-to-text (ASR) and translation models.

📊 Utilization: The text can be used for model training and is republishable for reproducibility purposes.

๐Ÿค Collaboration: This dataset is the result of a collaboration between state start-up LANGU:IA, the French Ministry of Culture, and DINUM. It will be expanded in the coming months.

🔗 Explore the dataset here: https://lnkd.in/d_paWKFE

#YouTubeCommons #AIResearch #MachineLearning #OpenData #ArtificialIntelligence #NLP #Dataset #TechCollaboration #Innovation #DigitalTransformation
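
The corpus is also on the Hugging Face Hub; a hedged loading sketch, assuming the repo id is PleIAs/YouTube-Commons and streaming to avoid downloading all 30 billion words up front:

```python
import itertools
from datasets import load_dataset

# Stream a few records to inspect the schema before committing to a download.
yt = load_dataset("PleIAs/YouTube-Commons", split="train", streaming=True)
for row in itertools.islice(yt, 3):
    print(row)
```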
reacted to dvilasuero's post with 🚀❤️ about 1 year ago
🔥 Community and Data Quality Are More For Alignment

A recipe to replicate SPIN (Self-Play Fine-Tuning) with 30x less data:

๐Ÿ—ฃ๏ธ 50K samples vs 1.8K prompts curated by the 350+ amazing DIBT contributors.
โš—๏ธ Distillation of Mistral Large instead of OpenAI
๐Ÿ™Œ Open data & code with โš—๏ธdistilabel

SPIN Paper:
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)

SPIN DIBT Collection with datasets and models:
argilla/dibt-prompt-collective-spin-65ef59062518776024395fc3

Repo:
https://github.com/argilla-io/distilabel-spin-dibt

Joint work with the amazing DIBT community 👇
@aashish1904 , @flozi00 , @sayhan , @munish0838 , @0-hero , @dvilasuero , @eren23 , @davanstrien , @ahnz , @BlackKakapo , @kitano-o , @mmhamdy , @sdiazlor , @Stopwolf , @gabrielmbmb , @tculler91 , @plaguss , @ignacioct , @Hugi-R , @davidberenstein1957 , @Korla , @alvarobartt , @Hugs4Llamas , @Sumandora , @nataliaElv , @jfcalvo , @Averill , @steventrouble , @vasilis , @aeros93 , @kayyshf , @thomasgauthier , @jeromebas , @Ameeeee , @ayoubelmhamdi , @TuringsSolutions , @efels , @Haleyok , @abrazador , @emessy , @Nindaleth , @burtenshaw , @vicgalle , @CortexPE , @casey-martin , @Leire-aguirre-eguiluz , @mrfakename , @Portias600kNeurons , @nathaliepett , @Filippo