Raghav Prabhakar

raghavprabhakar

https://www.raghavprabhakar.com

AI & ML interests

Computer Vision, Deep Learning, Robotics

Recent Activity

reacted to Tonic's post with 👍 22 days ago

Post

3268

🙋🏻‍♂️ Normalize adding compute & runtime traces to your model cards

2 replies

reacted to thomwolf's post with 🔥🚀 8 months ago

Post

6251

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions at our chat place: https://huggingface.co/spaces/HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

2 replies

liked a dataset 9 months ago

raghavprabhakar/commonsense-embodied-ai

Updated Jul 17, 2024 • 165 • 4

upvoted a paper 9 months ago

Physical Reasoning and Object Planning for Household Embodied Agents

Paper • 2311.13577 • Published Nov 22, 2023 • 2

updated 2 datasets about 1 year ago

raghavprabhakar/commonsense-embodied-ai

Updated Jul 17, 2024 • 165 • 4

raghavprabhakar/spacenet3

Updated May 14, 2024 • 4

updated a collection about 1 year ago

Space

Collection

0 items • Updated May 14, 2024

authored a paper about 1 year ago

Physical Reasoning and Object Planning for Household Embodied Agents

Paper • 2311.13577 • Published Nov 22, 2023 • 2

reacted to merve's post with ❤️ over 1 year ago

Post

2852

I see you all send your documents to closed-source APIs; this is not OK 👎 and it breaks my heart 💔
I have seen many open-source document models, and I am amazed by what IDEFICS2 has done with document understanding 🤯🤩 It's not something you've ever seen before! HuggingFaceM4/idefics-8b

Please use it! It has an Apache 2.0 license ❤️

reacted to akhaliq's post with ❤️ over 1 year ago

Post

Aya Dataset

An Open-Access Collection for Multilingual Instruction Tuning

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning (2402.06619)

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the finetuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages. In total, we contribute four key resources: we develop and open-source the Aya Annotation Platform, the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as a valuable framework for future research collaborations that aim to bridge gaps in resources.

3 replies

reacted to Tonic's post with ❤️ over 1 year ago

Post

🙋🏻‍♂️ Hey there folks,

🤗 Aya has been released! It's an absolutely massive undertaking to create a huge multilingual dataset and multilingual model of very high quality.

Papers:
https://cohere.com/research/papers/aya-dataset-paper-2024-02-13
https://cohere.com/research/papers/aya-model-paper-2024-02-13

Model: https://huggingface.co/CohereForAI/aya-101
Dataset: https://huggingface.co/datasets/CohereForAI/aya_dataset

I am proud to be one of 3,000 humans who built Aya - a new massively multilingual, generative LLM that outperforms existing open-source models and covers 101 different languages. Together, we are accelerating multilingual AI. 🤗