arxiv:2507.22720

Investigating Hallucination in Conversations for Low Resource Languages

Published on Jul 30
· Submitted by amanchadha on Aug 4

Abstract

AI-generated summary: LLMs generate fewer hallucinations in Mandarin than in Hindi and Farsi across multiple models.

Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often generate factually incorrect statements, a problem typically referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.

Community

Paper submitter

The paper provides the first systematic hallucination evaluation of multilingual conversational LLM outputs (GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, Qwen-3) across Hindi, Farsi, and Mandarin, revealing high hallucination rates in Hindi and Farsi versus minimal hallucination in Mandarin, and proposes benchmark-style evaluations using translated dialogue corpora.

➡️ Key Highlights of our Low-Resource Hallucination Benchmark:

🧪 Multilingual Conversational Hallucination Evaluation:
Introduces a hallucination benchmark for three low-resource languages (Hindi, Farsi, and Mandarin) using LLM-translated versions of the BlendedSkillTalk and DailyDialog datasets, scoring model responses with ROUGE-1 and ROUGE-L against reference replies and verifying them with human annotators (see the sketch below).
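
Below is a minimal sketch of how such ROUGE-based screening could look, assuming the open-source rouge-score package; the 0.2 cutoff and the helper name are illustrative assumptions rather than the paper's published setup, and the paper pairs automatic scores with human verification instead of relying on ROUGE alone.

```python
# Hypothetical ROUGE-based screening for candidate hallucinations.
# Assumes: pip install rouge-score. The 0.2 threshold is illustrative.
from rouge_score import rouge_scorer

# NOTE: the default tokenizer is Latin-script oriented; Hindi, Farsi, or
# Mandarin text would need a language-appropriate tokenizer.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)

def flag_candidate_hallucination(reference: str, response: str,
                                 threshold: float = 0.2) -> bool:
    """Flag a response whose lexical overlap with the reference reply is low."""
    scores = scorer.score(reference, response)  # (target, prediction)
    return (scores["rouge1"].fmeasure < threshold
            and scores["rougeL"].fmeasure < threshold)

# Toy example: an on-topic reply passes, an off-topic reply is flagged.
ref = "I love listening to jazz on the weekend."
print(flag_candidate_hallucination(ref, "Jazz is my favorite weekend music too."))  # False
print(flag_candidate_hallucination(ref, "The capital of France is Paris."))         # True
```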

🧩 Comparative Analysis across LLM Families and Languages:
Finds that GPT-4o and GPT-3.5 outperform open-source models (LLaMA, Gemma, DeepSeek, Qwen) in minimizing hallucinations, especially in Mandarin; however, all models hallucinate more in Hindi and Farsi, indicating the limitations of current LLMs in low-resource settings.

🧠 Resource-Aware Hallucination Patterns and Fixes:
Attributes the hallucination gap to differences in training-data availability and proposes retrieval-augmented generation (RAG), grounded decoding, and language-specific fine-tuning to improve factuality in low-resource conversational agents (a RAG sketch follows below), with native-speaker evaluation confirming the hallucination types observed (partial vs. complete).
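
As a rough illustration of the proposed RAG-style grounding, the sketch below prepends retrieved knowledge snippets to the conversational prompt; the TF-IDF retriever, the toy romanized-Hindi snippets, and the prompt template are all assumptions for demonstration, not the paper's implementation.

```python
# Illustrative retrieval-augmented prompting for a low-resource language.
# Assumes scikit-learn; the snippets and template are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge snippets (romanized Hindi, with translations).
snippets = [
    "Dilli Bharat ki rajdhani hai.",          # "Delhi is the capital of India."
    "Taj Mahal Agra mein sthit hai.",         # "The Taj Mahal is in Agra."
    "Ganga Bharat ki sabse lambi nadi hai.",  # "The Ganges is India's longest river."
]

vectorizer = TfidfVectorizer().fit(snippets)
snippet_vecs = vectorizer.transform(snippets)

def build_grounded_prompt(user_turn: str, k: int = 1) -> str:
    """Retrieve the k most similar snippets and prepend them to the prompt,
    so the reply can be grounded in retrieved facts rather than relying on
    sparse low-resource training data."""
    sims = cosine_similarity(vectorizer.transform([user_turn]), snippet_vecs)[0]
    top = sims.argsort()[::-1][:k]
    context = "\n".join(snippets[i] for i in top)
    return f"Context:\n{context}\n\nUser: {user_turn}\nAssistant:"

# "What is the capital of India?" retrieves the Delhi snippet as context.
print(build_grounded_prompt("Bharat ki rajdhani kya hai?"))
```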

