Post
3289
🦢 The SWIM-IR dataset contains 29 million text-retrieval training pairs across 27 diverse languages. It is one of the largest synthetic multilingual datasets generated using PaLM 2 on Wikipedia! 🔥🔥
SWIM-IR dataset contains three subsets :
- Cross-lingual:
- Monolingual:
- Indic Cross-lingual:
Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7
SWIM-IR dataset contains three subsets :
- Cross-lingual:
nthakur/swim-ir-cross-lingual
- Monolingual:
nthakur/swim-ir-monolingual
- Indic Cross-lingual:
nthakur/indic-swim-ir-cross-lingual
Check it out:
https://huggingface.co/collections/nthakur/swim-ir-dataset-662ddaecfc20896bf14dd9b7