Nandan Thakur

nthakur

https://thakur-nandan.github.io

AI & ML interests

NLP, IR, QA

Recent Activity

upvoted an article 3 days ago

Nano-BEIR: A Multilingual Information Retrieval Benchmark with Quality-Enhanced Queries

upvoted a collection 3 days ago

Bharat-NanoBEIR: Indian Language Retrieval Benchmarks

upvoted a collection 3 days ago

Bharat-NanoBEIR

Posts 2

Post

Last year, I curated & generated a few multilingual SFT and DPO datasets by translating English SFT/DPO datasets into 9-10 languages using the mistralai/Mistral-7B-Instruct-v0.2 model.

I hope they help the community with pretraining and instruction-tuning multilingual LLMs! I added a small diagram that briefly describes which datasets are included and their sources.

Happy to collaborate, whether on using these datasets for instruction fine-tuning or on extending the translations to newer English SFT/DPO datasets!

nthakur/multilingual-sft-and-dpo-datasets-67eaf56fe3feca5a57cf7d74

Post

🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥

The SWIM-IR dataset contains three subsets:
- Cross-lingual: nthakur/swim-ir-cross-lingual
- Monolingual: nthakur/swim-ir-monolingual
- Indic Cross-lingual: nthakur/indic-swim-ir-cross-lingual

Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7
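
For anyone who wants to try the three subsets listed above, here is a minimal sketch of how they might be loaded with the Hugging Face `datasets` library. The repo ids are copied from the post; the helper names (`swim_ir_repo`, `load_swim_ir`), the `"train"` split, and the use of streaming are assumptions for illustration, not something the post specifies. If a subset requires a configuration name (for example a language code), look it up on the dataset card and pass it as `config`.

```python
# Hub repo ids as listed in the post; everything else is an illustrative assumption.
SWIM_IR_REPOS = {
    "cross-lingual": "nthakur/swim-ir-cross-lingual",
    "monolingual": "nthakur/swim-ir-monolingual",
    "indic-cross-lingual": "nthakur/indic-swim-ir-cross-lingual",
}


def swim_ir_repo(subset: str) -> str:
    """Map a subset label (e.g. 'Indic Cross-lingual') to its Hub repo id."""
    key = subset.strip().lower().replace(" ", "-")
    if key not in SWIM_IR_REPOS:
        raise ValueError(
            f"Unknown subset {subset!r}; expected one of {sorted(SWIM_IR_REPOS)}"
        )
    return SWIM_IR_REPOS[key]


def load_swim_ir(subset: str, config=None, split="train", streaming=True):
    """Hypothetical helper: open one SWIM-IR subset from the Hub.

    `split="train"` is an assumption; check the dataset card for the
    actual split and configuration names. Streaming avoids downloading
    the full dataset up front, which matters at this scale.
    """
    # Imported lazily so the repo-id helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        swim_ir_repo(subset), name=config, split=split, streaming=streaming
    )
```

Calling `load_swim_ir("monolingual", config=...)` with a configuration name taken from the dataset card should return an iterable dataset that can be inspected with `next(iter(ds))`.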
