Model Card for HiTZ/Llama-3.1-8B-Instruct-multi-truth-judge

This model card is for a judge model fine-tuned to evaluate truthfulness, based on the work "Truth Knows No Language: Evaluating Truthfulness Beyond English".

Model Details

Model Description

This model is an LLM-as-a-Judge, fine-tuned from meta-llama/Meta-Llama-3.1-8B-Instruct to assess the truthfulness of text generated by other language models. The evaluation framework and findings are detailed in the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English." The primary goal of this work is to extend truthfulness evaluations beyond English, covering English, Basque, Catalan, Galician, and Spanish. This specific judge model evaluates truthfulness across multiple languages.

Developed by: Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri.
Affiliations: HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra.
Funded by: MCIN/AEI/10.13039/501100011033 projects: DeepKnowledge (PID2021-127777OB-C21) and by FEDER, EU; Disargue (TED2021-130810B-C21) and European Union NextGenerationEU/PRTR; DeepMinor (CNS2023-144375) and European Union NextGenerationEU/PRTR; NÓS-ILENIA (2022/TL22/0021533). Xunta de Galicia: Centro de investigación de Galicia accreditation 2024-2027 ED431G-2023/04. UPV/EHU PIF22/84 predoc grant (Blanca Calvo Figueras). Basque Government PhD grant PRE_2024_2_0028 (Julen Etxaniz). Juan de la Cierva contract and project JDC2022-049433-I (Iria de Dios Flores), financed by the MCIN/AEI/10.13039/501100011033 and the European Union “NextGenerationEU”/PRTR.
Shared by: HiTZ Center
Model type: LLM-as-a-Judge, based on Llama-3.1
Language(s) (NLP): Fine-tuned to judge outputs in multiple languages (English, Basque, Catalan, Galician, Spanish). The underlying TruthfulQA-Multi benchmark, used for context, covers English, Basque, Catalan, Galician, and Spanish.
License: The base model meta-llama/Meta-Llama-3.1-8B-Instruct is governed by the Llama 3.1 license. The fine-tuning code, this model's weights, and the TruthfulQA-Multi dataset are publicly available under Apache 2.0.
Finetuned from model: meta-llama/Meta-Llama-3.1-8B-Instruct

Model Sources

Repository (for the project and fine-tuning code): https://github.com/hitz-zentroa/truthfulqa-multi
Paper: "Truth Knows No Language: Evaluating Truthfulness Beyond English" (https://arxiv.org/abs/2502.09387)
Dataset (TruthfulQA-Multi): https://huggingface.co/datasets/HiTZ/truthful_judge

Uses

Direct Use

This model is intended for direct use as an LLM-as-a-Judge. It takes a question, a reference answer, and a model-generated answer as input, and outputs a judgment on the truthfulness of the model-generated answer. This is particularly relevant for evaluating models on the TruthfulQA benchmark, specifically for multiple languages (English, Basque, Catalan, Galician, Spanish).

Downstream Use

This judge model could potentially be used as a component in larger systems for content moderation, automated fact-checking research, or as a basis for further fine-tuning on more specific truthfulness-related tasks or domains.

Out-of-Scope Use

This model is not designed for:

Generating general-purpose creative text or dialogue.
Providing factual information directly (it judges, it doesn't assert).
Use in safety-critical applications without thorough validation.
Any application intended to deceive or spread misinformation. The model's judgments are based on its training and may not be infallible.

Bias, Risks, and Limitations

The model's performance and biases are influenced by its base model (meta-llama/Meta-Llama-3.1-8B-Instruct) and the TruthfulQA dataset. Key points from "Truth Knows No Language: Evaluating Truthfulness Beyond English":

Language Discrepancies: LLMs (and thus judges based on them) may perform differently across languages.
Anglocentricity: The original TruthfulQA benchmark has English-centric knowledge and cultural contexts. This model judges multiple non-English languages.
Context and Time Dependency: Universal knowledge questions are handled better across languages than context- and time-dependent ones.
Truthfulness: This is a "truth-judge". Truthfulness is a critical component of overall answer quality.
Imitative Falsehoods: TruthfulQA focuses on "imitative falsehoods" – statements that are false but sound plausible or mimic common misconceptions.

Recommendations

Users should be aware of the potential for varying performance across different types of questions (e.g., universal vs. culturally specific) and languages.
For critical applications, judgments from this model should be cross-verified, ideally with human oversight.
Consider the limitations regarding cultural and temporal variability when interpreting results.

How to Get Started with the Model

This model can be loaded using the Hugging Face transformers library.

# Example (conceptual, actual usage depends on task setup)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HiTZ/Meta-Llama-3.1-8B-Instruct-multi-truth-judge"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# For LLM-as-a-Judge, you would typically format input like:
# prompt = f"Question: {question}\nReference Answer: {ref_answer}\nModel Answer: {model_answer}\nIs the model answer truthful? Judge:"
# inputs = tokenizer(prompt, return_tensors="pt")
# outputs = model.generate(**inputs) # Adjust generation parameters as needed
# judgment = tokenizer.decode(outputs[0], skip_special_tokens=True)

Refer to the project repository (https://github.com/hitz-zentroa/truthfulqa-multi) for specific examples of how judge models were used in the evaluation.

Training Details

Training Data

The model was fine-tuned on a dataset derived from the TruthfulQA-Multi benchmark \cite{calvo-etal-2025-truthknowsnolanguage}.

Dataset Link: https://huggingface.co/datasets/HiTZ/truthful_judge
Training Data Specifics: Trained on data for multiple languages (English, Basque, Catalan, Galician, Spanish) for truth judging. This corresponds to the "MT data (all languages except English)" mentioned in the paper for Truth-Judges.

Training Procedure

The model was fine-tuned as an LLM-as-a-Judge. The methodology was adapted from the original TruthfulQA paper \cite{lin-etal-2022-truthfulqa}, where the model learns to predict whether an answer is truthful given a question and reference answers.

Preprocessing

Inputs were formatted to present the judge model with a question, correct answer(s), and the answer to be judged, prompting it to assess truthfulness.

Training Hyperparameters

Training regime: bfloat16 mixed precision
Base model: meta-llama/Meta-Llama-3.1-8B-Instruct
Epochs: 5
Learning rate: 0.01
Batch size: Refer to project code
Optimizer: Refer to project code
Transformers Version: 4.44.2

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model's evaluation methodology is described in "Truth Knows No Language: Evaluating Truthfulness Beyond English," using questions from the TruthfulQA-Multi dataset (English, Basque, Catalan, Galician, Spanish portions).

Factors

Language: Multiple languages (English, Basque, Catalan, Galician, Spanish).
Model Type (of models being judged): Base and instruction-tuned LLMs.
Evaluation Metric: Correlation of LLM-as-a-Judge scores with human judgments on truthfulness.

Metrics

Primary Metric: Spearman correlation between the judge model's scores and human-annotated scores for truthfulness.
The paper (Table 4) reports performance for Truth-Judge models. For the Llama-3.1-8B-Instruct base model trained on MT data (all languages except English), the Kappa scores were: Basque (0.51), Catalan (0.54), Galician (0.49), Spanish (0.57).

Results

Summary

As reported in "Truth Knows No Language: Evaluating Truthfulness Beyond English" (specifically Table 4 for Truth-Judges):

This specific model (multi_llama3.1_instruct_truth_judge) is the Truth-Judge fine-tuned on meta-llama/Meta-Llama-3.1-8B-Instruct using combined multilingual data (English, Basque, Catalan, Galician, Spanish).
Performance varies by language, with Kappa scores detailed in Table 4 of the paper.

Technical Specifications

Model Architecture and Objective

The model is based on the Llama-3.1 architecture (LlamaForCausalLM). It is a Causal Language Model fine-tuned with the objective of acting as a "judge" to predict the truthfulness of answers to questions.

Hidden Size: 4096
Intermediate Size: 14336
Num Attention Heads: 32
Num Hidden Layers: 32
Num Key Value Heads: 8
Vocab Size: 128256

Compute Infrastructure

Hardware: Refer to project for details.
Software: PyTorch, Transformers 4.44.2

Citation

Paper:

@inproceedings{calvo-etal-2025-truthknowsnolanguage,
    title = "Truth Knows No Language: Evaluating Truthfulness Beyond English",
    author = "Calvo Figueras, Blanca and Sagarzazu, Eneko and Etxaniz, Julen and Barnes, Jeremy and Gamallo, Pablo and De Dios Flores, Iria and Agerri, Rodrigo",
    year={2025},
    eprint={2502.09387},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2502.09387}
}

More Information

For more details on the methodology, dataset, and findings, please refer to the full paper "Truth Knows No Language: Evaluating Truthfulness Beyond English" and the project repository: https://github.com/hitz-zentroa/truthfulqa-multi.

Model Card Authors

This model card was generated based on information from the paper "Truth Knows No Language: Evaluating Truthfulness Beyond English" by Blanca Calvo Figueras et al., and adapted from the Hugging Face model card template. Content populated by GitHub Copilot.

Model Card Contact

For questions about the model or the research, please contact:

Blanca Calvo Figueras: [email protected]
Rodrigo Agerri: [email protected]

HiTZ
/

Llama-3.1-8B-Instruct-multi-truth-judge