arxiv:2507.05980

RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Published on Jul 8 · Submitted by gabrielchua on Jul 10
Abstract

Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and evaluation code are publicly available.
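To make the Label stage concrete, here is a minimal sketch of per-category majority voting across several LLM labelers, with severity resolved among the labelers that flagged a category. All function, category, and field names are illustrative assumptions; the paper's released code may differ.

```python
from collections import Counter

# Illustrative category names; the paper defines six fine-grained categories.
CATEGORIES = ["hate", "insults", "sexual", "violence", "self_harm", "other_misconduct"]

def majority_vote(labeler_outputs, threshold=0.5):
    """Aggregate multi-label severity annotations from several LLM labelers.

    Each element of `labeler_outputs` maps category -> severity (0 = not present).
    A category is kept when a strict majority of labelers flag it; its severity
    is the most common non-zero level among the labelers that flagged it.
    """
    n = len(labeler_outputs)
    final = {}
    for cat in CATEGORIES:
        flagged = [out.get(cat, 0) for out in labeler_outputs if out.get(cat, 0) > 0]
        if n and len(flagged) / n > threshold:
            final[cat] = Counter(flagged).most_common(1)[0][0]
        else:
            final[cat] = 0
    return final

# Example: three labelers on one example; "hate" wins a 2/3 majority at severity 2.
print(majority_vote([
    {"hate": 2},
    {"hate": 2, "violence": 1},
    {"hate": 1},
]))
```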

Community

Paper author · Paper submitter

๐ŸŒ every country has a linguistic fingerprint - a blend of dialects and languages shaping daily life. in the age of global ai, capturing these local nuances isn't optional; it's essential for responsible deployments.

to address this, the govtech ai practice and sutd's social ai studio teamed up to build RabakBench. singapore's rich linguistic landscape - singlish/english, chinese, malay, tamil - creates the perfect stress test for llms and their guardrails.

we think this is a meaningful and challenging benchmark 🏋🏼: evaluations on eleven popular open- and closed-source guardrails show major inconsistencies. for example, popular guardrail options like openai moderation or llamaguard are not necessarily the best.
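as a rough illustration of this kind of evaluation (a hedged sketch, not our released evaluation code), here is how one might score any guardrail per language with binary safe/unsafe F1; `classify` and the dataset fields are stand-ins:

```python
from sklearn.metrics import f1_score

def evaluate_guardrail(classify, dataset):
    """Score a guardrail per language with binary safe/unsafe F1.

    `dataset` is an iterable of dicts with 'text', 'language', and 'unsafe' (0/1);
    `classify` is any callable mapping text -> 0/1. Both are stand-ins.
    """
    by_lang = {}
    for ex in dataset:
        y_true, y_pred = by_lang.setdefault(ex["language"], ([], []))
        y_true.append(ex["unsafe"])
        y_pred.append(classify(ex["text"]))
    return {lang: f1_score(t, p) for lang, (t, p) in by_lang.items()}

# Usage with a trivial keyword stand-in for a real guardrail API:
demo = [
    {"text": "hello lah", "language": "singlish", "unsafe": 0},
    {"text": "some toxic text", "language": "singlish", "unsafe": 1},
]
print(evaluate_guardrail(lambda t: int("toxic" in t), demo))
```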

building high-quality multilingual safety benchmarks for low-resource languages is labor-intensive and difficult to scale. to overcome this, we built RabakBench through a three-stage process that combines human-in-the-loop annotation with LLM-assisted red-teaming 🔦 and multilingual translation 💬. we share this as a demonstration of how LLM-assisted workflows and collaborative workshops can scale up human annotation, enabling rigorous, culturally grounded benchmarks even in data-scarce settings. a sketch of the translate stage's label-preservation idea follows below.
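this sketch is illustrative only - `translate` and `label` are stand-in LLM calls, not our actual pipeline - but it captures the core check: keep a translation only if its safety labels survive re-labeling.

```python
def translate_with_label_check(example, target_lang, translate, label):
    """Keep a machine translation only if its safety labels survive re-labeling.

    `example` is a dict with 'text' and 'labels'; `translate` and `label` are
    stand-in LLM calls (text -> translated text, text -> label dict).
    Returns the translated example, or None to route it to human translators.
    """
    translated = translate(example["text"], target_lang)
    if label(translated) == example["labels"]:  # toxicity/labels preserved
        return {"text": translated, "labels": example["labels"], "language": target_lang}
    return None  # label drifted in translation: send for human verification
```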

we hope RabakBench helps devs, researchers, and policymakers. tell us what you think!

