INCOME

university

https://github.com/NThakur20/income


nthakur

authored 3 papers about 1 month ago

FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents

Paper • 2504.13128 • Published Apr 17 • 5

Chatbot Arena Meets Nuggets: Towards Explanations and Diagnostics in the Evaluation of LLM Responses

Paper • 2504.20006 • Published Apr 28

Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Paper • 2505.16967 • Published May 22 • 23

nthakur

posted an update 3 months ago

Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope they help the community with pretraining or instruction-tuning multilingual LLMs! I added a small diagram that briefly shows which datasets are included and where they come from.

Happy to collaborate with anyone who wants to use these datasets for instruction fine-tuning, or who wishes to extend the collection with translated versions of newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

nthakur

authored a paper 4 months ago

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published Feb 19 • 37

nthakur

authored a paper 8 months ago

MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

Paper • 2410.13716 • Published Oct 17, 2024

nthakur

authored a paper 10 months ago

Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

Paper • 2406.16828 • Published Jun 24, 2024

nthakur

posted an update about 1 year ago

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7

nthakur

authored 9 papers over 1 year ago

Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard

Paper • 2306.07471 • Published Jun 13, 2023

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

Paper • 2312.11361 • Published Dec 18, 2023 • 1

HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution

Paper • 2307.16883 • Published Jul 31, 2023

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks

Paper • 2010.08240 • Published Oct 16, 2020

Evaluating Embedding APIs for Information Retrieval

Paper • 2305.06300 • Published May 10, 2023 • 1

GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval

Paper • 2112.07577 • Published Dec 14, 2021

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

Paper • 2210.09984 • Published Oct 18, 2022 • 2

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Paper • 2104.08663 • Published Apr 17, 2021 • 3

Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

Paper • 2311.05800 • Published Nov 10, 2023 • 4

nreimers

authored a paper over 2 years ago

MTEB: Massive Text Embedding Benchmark

Paper • 2210.07316 • Published Oct 13, 2022 • 6

nthakur

updated 2 models over 2 years ago

income/bpr-contriever-gpl-scidocs

Updated Feb 10, 2023

income/bpr-contriever-gpl-arguana

Updated Feb 10, 2023

Team members 2