AI & ML interests

Democratize Spanish NLP and encourage its application to generate social impact 💛

Recent Activity

frascuchon posted an update about 1 month ago
frascuchon posted an update about 2 months ago
mariagrandury published a dataset about 2 months ago
frascuchon posted an update about 2 months ago
Extending datasets just got a whole lot easier! 🚀 With Sheets, I was able to create a Spanish version of the popular fka/awesome-chatgpt-prompts dataset in just a few minutes ⏱️.

Check out the resulting dataset: frascuchon/fka_awesome_chatgpt_es 📊

Want to try it out for yourself? Head over to the Sheets space and see how easy it is to extend and modify existing datasets 🤯. The possibilities are endless! 🌐
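
For a quick programmatic look at the result, a minimal sketch with the datasets library (the split name and column layout are assumptions; check the dataset card):

```python
from datasets import load_dataset

# Load the translated prompts dataset (split name assumed to be "train")
ds = load_dataset("frascuchon/fka_awesome_chatgpt_es", split="train")

# Peek at the first row to see the translated columns
print(ds[0])
```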
frascuchon posted an update about 2 months ago
Unlock the full potential of your datasets with Sheets! It's incredibly easy to extend existing datasets and surface new insights.

Leverage open-source models to translate, summarize, classify, and more - all directly within your existing columns.

Ready to give it a try? Explore the possibilities here: aisheets/sheets
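
As a rough illustration of the kind of per-column transformation Sheets automates, here is a sketch of translating a cell with an open-source model via transformers (the model choice is an assumption, not necessarily what Sheets uses under the hood):

```python
from transformers import pipeline

# Example open-source translation model (an assumption; Sheets may use different models)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

# Translate one cell's worth of text into Spanish
result = translator("Summarize the following article in two sentences.")
print(result[0]["translation_text"])
```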
frascuchon posted an update 2 months ago
Hey! I built the RAG MCP Server Space, a simple Gradio MCP server for RAG systems that lets you retrieve relevant results without passing huge contexts to your LLM.

You can integrate this space with your agents to make retrieval more efficient. Feel free to try it out and let me know if you have any feedback or questions!

frascuchon/rag-mcp-server

Thanks for checking it out!
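
If you prefer to call the Space from code rather than the UI, a minimal sketch with gradio_client might look like this; the endpoint name and arguments below are assumptions, so inspect the Space's API docs for the real signature:

```python
from gradio_client import Client

# Connect to the Space (identifier taken from the post)
client = Client("frascuchon/rag-mcp-server")

# The api_name and argument are assumptions; call client.view_api() to see the actual endpoints
query = "What is retrieval-augmented generation?"
result = client.predict(query, api_name="/predict")
print(result)
```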
lewtun posted an update 5 months ago
Introducing OlympicCoder: a series of open reasoning models that can solve olympiad-level programming problems 🧑‍💻

- 7B open-r1/OlympicCoder-7B
- 32B open-r1/OlympicCoder-32B

We find that OlympicCoder models outperform Claude 3.7 Sonnet, as well as models over 100x larger 💪

Together with the models, we are releasing:

📊 CodeForces-CoTs: a new dataset of code problems from the most popular competitive coding platform, with R1 traces in C++ and Python: open-r1/codeforces-cots

🏆 IOI'2024: a new benchmark of VERY hard programming problems where even frontier models struggle to match human performance: open-r1/ioi

For links to the models and datasets, check out our latest progress report from Open R1: https://huggingface.co/blog/open-r1/update-3
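
To try OlympicCoder locally, a minimal transformers sketch (the prompt and generation settings are placeholders; adjust to your hardware):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "open-r1/OlympicCoder-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Placeholder competitive-programming style prompt
messages = [{"role": "user", "content": "Write a C++ program that prints the first 10 Fibonacci numbers."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```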
lewtun posted an update 6 months ago
Introducing OpenR1-Math-220k!

open-r1/OpenR1-Math-220k

The community has been busy distilling DeepSeek-R1 from inference providers, but we decided to have a go at doing it ourselves from scratch 💪

What’s new compared to existing reasoning datasets?

♾ Based on AI-MO/NuminaMath-1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the popular NuminaMath-CoT dataset.

🐳 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.

📀 512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.

⏳ Automated filtering: We apply Math Verify to retain only problems with at least one correct answer. We also leverage Llama3.3-70B-Instruct as a judge to retrieve more correct examples (e.g., for cases with malformed answers that can't be verified with a rules-based parser).

📊 We match the performance of DeepSeek-Distill-Qwen-7B by finetuning Qwen-7B-Math-Instruct on our dataset.

🔎 Read our blog post for all the nitty gritty details: https://huggingface.co/blog/open-r1/update-2
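
To peek at a few reasoning traces without downloading the full dataset, a quick streaming sketch (the split and column names are assumptions; see the dataset card):

```python
from itertools import islice

from datasets import load_dataset

# Stream the dataset so we can inspect a handful of rows without a full download
ds = load_dataset("open-r1/OpenR1-Math-220k", split="train", streaming=True)

for row in islice(ds, 3):
    # Column names vary by config; print the keys to see what is available
    print(list(row.keys()))
```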
tadeodonegana posted an update 6 months ago
At RooMix(dot)ai we’re looking for an expert in generative image models for a short consulting gig. Any recommendations?
lewtun posted an update 7 months ago
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!

🧪 Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.

🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.

🔥 Step 3: show we can go from base model -> SFT -> RL via multi-stage training.

Follow along: https://github.com/huggingface/open-r1
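
As a rough illustration of what the SFT stage in Step 3 could look like (not the actual Open R1 recipe; the base model and dataset below are placeholders), a minimal trl sketch:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset and small base model, just to show the shape of an SFT stage
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(output_dir="sft-output"),
)
trainer.train()
```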