
RAG Datasets
AI & ML interests
Information Retrieval & Question Answering
This organization and its dataset are no longer actively maintained. You are still welcome to add similar datasets to it.
Feel free to join the organization if you want to add a dataset with a similar purpose :) Please tell me about your dataset before asking to join the org.
To test RAG and other semantic information retrieval solutions, it helps to have a dataset that consists of a text corpus and correct responses to queries (e.g. question-answer pairs), so the solution can be tested end-to-end. Ideally it also includes a set of relevant passages from the text corpus for each query, so the retrieval component can be tested separately. We call this a question-answer-passages dataset.
There are plenty of large-scale datasets of this kind, such as Google's Natural Questions.
What is still missing are small-scale, narrow-domain datasets for quickly testing a RAG solution or for seeing how it performs in a specific domain.
We created this space to build a collection of such datasets to boost the development of RAG solutions, and we welcome any feedback on what your ideal RAG dataset would look like. :)
Datasets consist of:
- A text corpus already split into passages, with each passage referenced by id.
- A test dataset in which each entry consists of:
  - A question, and one or ideally both of the following:
    - A correct short answer.
    - A list of the ids of the passages that are relevant to answering the question.
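
The structure listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Passage`, `TestItem`, and `retrieval_recall`, as well as the toy corpus, are hypothetical and are not taken from any dataset in this organization. The sketch shows how the relevant passage ids allow the retrieval component to be scored on its own, here with a simple recall measure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Passage:
    """One passage of the text corpus, referenced by its id (hypothetical schema)."""
    id: int
    text: str


@dataclass
class TestItem:
    """One test entry: a question plus a short answer and/or relevant passage ids."""
    question: str
    answer: Optional[str] = None
    relevant_passage_ids: List[int] = field(default_factory=list)


def retrieval_recall(retrieved_ids: List[int], relevant_ids: List[int]) -> Optional[float]:
    """Fraction of the relevant passages that appear among the retrieved ones.

    Returns None when the test item lists no relevant passages, because
    recall is undefined in that case.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    hits = len(relevant & set(retrieved_ids))
    return hits / len(relevant)


# Toy data invented purely for illustration.
corpus = [
    Passage(0, "Paris is the capital of France."),
    Passage(1, "Berlin is the capital of Germany."),
    Passage(2, "The Seine flows through Paris."),
]

item = TestItem(
    question="What is the capital of France?",
    answer="Paris",
    relevant_passage_ids=[0],
)

# Pretend a retriever returned passages 0 and 2 for the question.
recall = retrieval_recall([0, 2], item.relevant_passage_ids)
print(recall)
```

The short answer can be compared against the generated response to test the solution end-to-end, while the passage ids are only needed for scoring the retriever separately.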