arxiv:2507.22752

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Published on Jul 30, 2025

Abstract

We introduce a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. As a baseline, we evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results highlight a significant gap in regional knowledge among current LLMs. Moreover, apart from LLM-based evaluation, there is minimal correlation between automated metrics and human judgment. We release this dataset as a resource to (1) assess regional knowledge in LLMs, (2) study cross-lingual generation consistency in a challenging setting, and (3) advance the development of evaluation metrics for open-ended question answering.

AI-generated summary

A benchmark for open-ended regional question answering using LLMs highlights gaps in regional knowledge and minimal correlation between automated metrics and human judgment.
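The abstract's central methodological point, that most automatic metrics correlate poorly with human judgments of answer correctness, is typically checked with a correlation analysis between per-answer metric scores and human ratings. The sketch below is a minimal illustration of such a check, not the paper's evaluation code: the score values, the 0/1/2 human judgment scale, and the choice of Pearson and Spearman statistics are all assumptions for demonstration.

```python
# Minimal sketch of a metric-reliability check: correlate per-answer scores
# from an automatic metric with human correctness judgments. All values below
# are illustrative stand-ins, not data from the CUS-QA paper.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-answer automatic metric scores (e.g., from an n-gram
# metric or an LLM judge) and human judgments on an assumed 0/1/2 scale
# (0 = wrong, 1 = partially correct, 2 = correct).
metric_scores = [0.12, 0.85, 0.40, 0.91, 0.33, 0.77, 0.05, 0.60]
human_judgments = [0, 2, 1, 2, 0, 2, 0, 1]

pearson_r, pearson_p = pearsonr(metric_scores, human_judgments)
spearman_rho, spearman_p = spearmanr(metric_scores, human_judgments)

# A low correlation here would mirror the paper's finding that, apart from
# LLM-based evaluation, automatic metrics track human judgment poorly.
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

Spearman's rank correlation is a common choice for this kind of analysis because human judgments are ordinal; reporting Pearson alongside it shows whether the metric also tracks judgments linearly.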
