Investigating Hallucination in Conversations for Low Resource Languages
Abstract
Across multiple models, LLMs generate far fewer hallucinations in Mandarin than in Hindi and Farsi.
Large Language Models (LLMs) have demonstrated remarkable proficiency in generating text that closely resembles human writing. However, they often produce factually incorrect statements, a problem commonly referred to as 'hallucination'. Addressing hallucination is crucial for enhancing the reliability and effectiveness of LLMs. While much research has focused on hallucinations in English, our study extends this investigation to conversational data in three languages: Hindi, Farsi, and Mandarin. We offer a comprehensive analysis of a conversational dataset to examine both factual and linguistic errors in these languages for GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, and Qwen-3. We found that LLMs produce very few hallucinated responses in Mandarin but generate a significantly higher number of hallucinations in Hindi and Farsi.
Community
The paper provides the first systematic hallucination evaluation of multilingual conversational LLM outputs (GPT-3.5, GPT-4o, Llama-3.1, Gemma-2.0, DeepSeek-R1, Qwen-3) across Hindi, Farsi, and Mandarin, revealing high hallucination in Hindi/Farsi versus minimal hallucination in Mandarin, and proposes benchmark-style evaluations using translated dialogue corpora.
➡️ Key Highlights of our Low-Resource Hallucination Benchmark:
🧪 Multilingual Conversational Hallucination Evaluation:
Introduces a hallucination benchmark for three low-resource languages (Hindi, Farsi, Mandarin) built from LLM-translated versions of the BlendedSkillTalk and DailyDialog datasets, evaluating model responses with ROUGE-1 and ROUGE-L scores plus human verification (a scoring sketch is shown after these highlights).
🧩 Comparative Analysis across LLM Families and Languages:
Finds that GPT-4o and GPT-3.5 outperform open-source models (LLaMA, Gemma, DeepSeek, Qwen) in minimizing hallucinations, especially in Mandarin; however, all models hallucinate more in Hindi and Farsi, indicating the limitations of current LLMs in low-resource settings.
🧠 Resource-Aware Hallucination Patterns and Fixes:
Attributes hallucination differences to training-data availability; proposes retrieval-augmented generation (RAG), grounded decoding, and language-specific fine-tuning to improve factuality in low-resource conversational agents, with native-speaker evaluation confirming the hallucination types observed (partial vs. complete); a minimal retrieval sketch is shown below.
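For readers who want to see what the ROUGE-based comparison mentioned above looks like in practice, here is a minimal sketch using the open-source `rouge_score` package. The reference and candidate strings are invented placeholders, and this is a generic illustration rather than the authors' exact evaluation pipeline.

```python
# Minimal sketch: ROUGE-1 / ROUGE-L scoring of one model response against
# a reference turn, using the `rouge_score` package (pip install rouge-score).
# The strings below are hypothetical examples, not taken from the paper's data.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The museum opens at nine and closes at five."   # gold response
candidate = "The museum is open from nine to five."          # model response

scores = scorer.score(reference, candidate)  # signature: score(target, prediction)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```

Note that the default tokenizer and stemmer are English-oriented; for Hindi, Farsi, or Mandarin, language-appropriate tokenization (e.g., character-level segmentation for Mandarin) would be needed before ROUGE scores are meaningful.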
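As one illustration of the proposed retrieval-augmented direction, the sketch below grounds a prompt in retrieved evidence before generation. The in-memory knowledge snippets, the Hindi query, and the naive token-overlap retriever are all placeholder assumptions for demonstration; they are not the authors' system, and a real setup would use a dense retriever and an actual LLM call.

```python
# Minimal sketch of retrieval-augmented prompting for a low-resource
# conversational setting. All snippets and the retriever are toy placeholders.
from collections import Counter

KNOWLEDGE = [
    "दिल्ली भारत की राजधानी है।",   # "Delhi is the capital of India."
    "तेहरान ईरान की राजधानी है।",    # "Tehran is the capital of Iran."
    "बीजिंग चीन की राजधानी है।",     # "Beijing is the capital of China."
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank snippets by naive token overlap with the query (stand-in for a real retriever)."""
    q_tokens = Counter(query.split())
    scored = [(sum((q_tokens & Counter(s.split())).values()), s) for s in KNOWLEDGE]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]

def build_grounded_prompt(query: str) -> str:
    """Prepend retrieved evidence so the model answers from context, not parametric memory."""
    context = "\n".join(retrieve(query))
    return (
        "निम्नलिखित संदर्भ का उपयोग करके उत्तर दें:\n"  # "Answer using the following context:"
        f"{context}\n\nप्रश्न: {query}\nउत्तर:"          # "Question: ... Answer:"
    )

if __name__ == "__main__":
    print(build_grounded_prompt("भारत की राजधानी क्या है?"))  # "What is the capital of India?"
```

The design point is simply that the generation step sees retrieved, language-matched evidence in the prompt, which is the mechanism by which RAG is expected to reduce factual hallucination in low-resource languages.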
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation (2025)
- FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models (2025)
- Are LLMs Good Text Diacritizers? An Arabic and Yorùbá Case Study (2025)
- Token Level Hallucination Detection via Variance in Language Models (2025)
- Reducing Hallucinations in Summarization via Reinforcement Learning with Entity Hallucination Index (2025)
- Enhancing Hallucination Detection via Future Context (2025)
- Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation (2025)