Investigating Hallucination in Conversations for Low Resource Languages
Abstract
Across multiple models, LLMs generate far fewer hallucinations in Mandarin than in Hindi and Farsi.
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often produce factually incorrect statements, a problem commonly referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a conversational dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
Community
The paper provides the first systematic hallucination evaluation of multilingual conversational LLM outputs (GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, Qwen-3) across Hindi, Farsi, and Mandarin, revealing high hallucination in Hindi/Farsi versus minimal hallucination in Mandarin, and proposes benchmark-style evaluations using translated dialogue corpora.
➡️ Key Highlights of our Low-Resource Hallucination Benchmark:
🧪 Multilingual Conversational Hallucination Evaluation:
Introduces a hallucination benchmark for three low-resource languages (Hindi, Farsi, Mandarin) built from LLM-translated versions of the BlendedSkillTalk and DailyDialog datasets, evaluating model responses with ROUGE-1 and ROUGE-L scores plus human verification (a scoring sketch is shown after these highlights).
🧩 Comparative Analysis across LLM Families and Languages:
Finds that GPT-4o and GPT-3.5 outperform open-source models (LLaMA, Gemma, DeepSeek, Qwen) in minimizing hallucinations, especially in Mandarin; however, all models hallucinate more in Hindi and Farsi, indicating the limitations of current LLMs in low-resource settings.
🧠 Resource-Aware Hallucination Patterns and Fixes:
Attributes hallucination differences to training-data availability; proposes retrieval-augmented generation (RAG), grounded decoding, and language-specific fine-tuning to improve factuality in low-resource conversational agents, with native-speaker evaluation confirming the hallucination types observed (partial vs. complete); a minimal retrieval sketch is shown below.
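For readers who want to see what the ROUGE-based comparison mentioned above looks like in practice, here is a minimal sketch using the open-source `rouge_score` package. The reference and candidate strings are invented placeholders, and this is a generic illustration rather than the authors' exact evaluation pipeline.

```python
# Minimal sketch: ROUGE-1 / ROUGE-L scoring of one model response against
# a reference turn, using the `rouge_score` package (pip install rouge-score).
# The strings below are hypothetical examples, not taken from the paper's data.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The museum opens at nine and closes at five."   # gold response
candidate = "The museum is open from nine to five."          # model response

scores = scorer.score(reference, candidate)  # signature: score(target, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```

Note that the default tokenizer and stemmer are English-oriented; for Hindi, Farsi, or Mandarin, language-appropriate tokenization (e.g., character-level segmentation for Mandarin) would be needed before ROUGE scores are meaningful.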
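As one illustration of the proposed retrieval-augmented direction, the sketch below grounds a prompt in retrieved evidence before generation. The in-memory knowledge snippets, the Hindi query, and the naive token-overlap retriever are all placeholder assumptions for demonstration; they are not the authors' system, and a real setup would use a dense retriever and an actual LLM call.

```python
# Minimal sketch of retrieval-augmented prompting for a low-resource
# conversational setting. All snippets and the retriever are toy placeholders.
from collections import Counter

KNOWLEDGE = [
    "दिल्ली भारत की राजधानी है।",   # "Delhi is the capital of India."
    "तेहरान ईरान की राजधानी है।",    # "Tehran is the capital of Iran."
    "बीजिंग चीन की राजधानी है।",     # "Beijing is the capital of China."
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank snippets by naive token overlap with the query (stand-in for a real retriever)."""
    q_tokens = Counter(query.split())
    scored = [(sum((q_tokens & Counter(s.split())).values()), s) for s in KNOWLEDGE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]

def build_grounded_prompt(query: str) -> str:
    """Prepend retrieved evidence so the model answers from context, not parametric memory."""
    context = "\n".join(retrieve(query))
    return (
        "निम्नलिखित संदर्भ का उपयोग करके उत्तर दें:\n"  # "Answer using the following context:"
        f"{context}\n\nप्रश्न: {query}\nउत्तर:"          # "Question: ... Answer:"
    )

if __name__ == "__main__":
    print(build_grounded_prompt("भारत की राजधानी क्या है?"))  # "What is the capital of India?"
```

The design point is simply that the generation step sees retrieved, language-matched evidence in the prompt, which is the mechanism by which RAG is expected to reduce factual hallucination in low-resource languages.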
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation (2025)
- FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models (2025)
- Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study (2025)
- Token Level Hallucination Detection via Variance in Language Models (2025)
- Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index (2025)
- Enhancing Hallucination Detection via Future Context (2025)
- Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation (2025)