---
license: mit
language:
- de
base_model:
- EuroBERT/EuroBERT-610m
pipeline_tag: token-classification
tags:
- token classification
- hallucination detection
- transformers
- question answer
datasets:
- KRLabsOrg/ragtruth-de-translated
---

# LettuceDetect: German Hallucination Detection Model

<p align="center">
  <img src="https://github.com/KRLabsOrg/LettuceDetect/blob/feature/cn_llm_eval/assets/lettuce_detective_multi.png?raw=true" alt="LettuceDetect Logo" width="400"/>
</p>

**Model Name:** KRLabsOrg/lettucedect-610m-eurobert-de-v1  
**Organization:** KRLabsOrg  
**GitHub:** https://github.com/KRLabsOrg/LettuceDetect

## Overview

LettuceDetect is a transformer-based model for hallucination detection on context and answer pairs, designed for multilingual Retrieval-Augmented Generation (RAG) applications. This model is built on **EuroBERT-610M**, chosen for its extended context support (up to **8192 tokens**) and strong multilingual capabilities. Long-context support matters because RAG contexts are often long, detailed documents, and the model has to process them in full to determine whether an answer is supported by the provided context.

**This is our large German model, built on the EuroBERT-610M architecture.**

## Model Details

- **Architecture:** EuroBERT-610M with extended context support (up to 8192 tokens)
- **Task:** Token Classification / Hallucination Detection
- **Training Dataset:** RAGTruth-DE (translated from the original RAGTruth dataset)
- **Language:** German

## How It Works

The model is trained to identify tokens in the German answer text that are not supported by the given context. During inference, the model returns token-level predictions, which are then aggregated into spans. This lets users see exactly which parts of the answer are considered hallucinated.
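
The aggregation step can be illustrated with a minimal sketch. It assumes each token prediction is a dict with character offsets (`start`, `end`), a binary label (`pred`, where 1 means unsupported), and a probability (`prob`), and it merges consecutive flagged tokens into one span that keeps the highest token probability as its confidence. The function name, the token format, and the merging rule are assumptions made for illustration; the library's actual aggregation may differ.

```python
from typing import Dict, List


def merge_token_predictions(tokens: List[Dict], text: str) -> List[Dict]:
    """Merge runs of consecutive flagged tokens into character-level spans.

    Hypothetical helper: the token format and the max-probability rule
    are assumptions, not the library's documented behavior.
    """
    spans: List[Dict] = []
    current = None
    for tok in tokens:
        if tok["pred"] == 1:
            if current is None:
                # Open a new span at the first flagged token.
                current = {"start": tok["start"], "end": tok["end"], "confidence": tok["prob"]}
            else:
                # Extend the open span and keep the highest probability seen.
                current["end"] = tok["end"]
                current["confidence"] = max(current["confidence"], tok["prob"])
        elif current is not None:
            # A supported token closes the open span.
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Illustrative input: the labels and probabilities are made up.
text = "Paris hat 69 Millionen Einwohner."
tokens = [
    {"start": 0, "end": 5, "pred": 0, "prob": 0.02},
    {"start": 6, "end": 9, "pred": 0, "prob": 0.05},
    {"start": 10, "end": 12, "pred": 1, "prob": 0.97},
    {"start": 13, "end": 22, "pred": 1, "prob": 0.91},
    {"start": 23, "end": 33, "pred": 0, "prob": 0.10},
]
spans = merge_token_predictions(tokens, text)
print(spans)
```

In this toy input the two flagged tokens are adjacent, so they collapse into a single span covering "69 Millionen".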

## Usage

### Installation

Install the `lettucedetect` package:

```bash
pip install lettucedetect
```

### Using the model

```python
from lettucedetect.models.inference import HallucinationDetector

# For a transformer-based approach:
detector = HallucinationDetector(
    method="transformer", 
    model_path="KRLabsOrg/lettucedect-610m-eurobert-de-v1",
    lang="de",
    trust_remote_code=True
)

contexts = ["Frankreich ist ein Land in Europa. Die Hauptstadt von Frankreich ist Paris. Die Bevölkerung Frankreichs beträgt 67 Millionen."]
question = "Was ist die Hauptstadt von Frankreich? Wie groß ist die Bevölkerung Frankreichs?"
answer = "Die Hauptstadt von Frankreich ist Paris. Die Bevölkerung Frankreichs beträgt 69 Millionen."

# Get span-level predictions indicating which parts of the answer are considered hallucinated.
predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")
print("Predictions:", predictions)

# Predictions: [{'start': 41, 'end': 88, 'confidence': 0.9873546123504639, 'text': ' Die Bevölkerung Frankreichs beträgt 69 Millionen.'}]
```
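
Span predictions are easier to review when they are marked directly in the answer text. The sketch below shows one way to do that, assuming each span is a dict with `start` and `end` character offsets into the answer, as in the output format above. `highlight_spans` is a hypothetical helper and not part of `lettucedetect`, and the span in the example is built by hand rather than taken from a model run.

```python
def highlight_spans(answer, spans, open_mark="[[", close_mark="]]"):
    """Wrap each predicted span of `answer` in marker strings.

    Hypothetical helper: overlapping or out-of-range spans are clipped.
    """
    parts = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        start = max(span["start"], cursor)
        end = min(span["end"], len(answer))
        if start >= end:
            continue  # Nothing left of this span after clipping.
        parts.append(answer[cursor:start])
        parts.append(open_mark + answer[start:end] + close_mark)
        cursor = end
    parts.append(answer[cursor:])
    return "".join(parts)


answer = "Die Hauptstadt von Frankreich ist Paris. Die Bevölkerung Frankreichs beträgt 69 Millionen."
# Hand-built span for illustration; real spans come from detector.predict(...).
start = answer.index("69 Millionen")
example_spans = [{"start": start, "end": start + len("69 Millionen")}]
highlighted = highlight_spans(answer, example_spans)
print(highlighted)
```

The markers are plain strings, so the same helper can emit HTML tags or terminal color codes by passing different `open_mark` and `close_mark` values.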

## Performance

**Results on Translated RAGTruth-DE**

We evaluate our German models on a German translation of the [RAGTruth](https://aclanthology.org/2024.acl-long.585/) dataset. The EuroBERT-610M German model achieves an F1 score of 74.95%, outperforming prompt-based methods such as GPT-4.1-mini (60.91%) by 14.04 percentage points.

Detailed metrics for both German models are shown in the table below:

| Language | Model           | Precision (%) | Recall (%) | F1 (%) | GPT-4.1-mini F1 (%) | Δ F1 (%) |
|----------|-----------------|---------------|------------|--------|---------------------|----------|
| German  | EuroBERT-210M   | 66.70         | 66.70      | 66.70  | 60.91               | +5.79    |
| German  | EuroBERT-610M   | **77.04**     | **72.96**  | **74.95**  | 60.91               | **+14.04**   |

While the 610M variant requires more computational resources, it improves F1 by 8.25 percentage points over the 210M model (74.95% vs. 66.70%).
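
As a quick sanity check on how the columns relate, the sketch below recomputes F1 as the harmonic mean of precision and recall and the Δ F1 column as a simple difference. It assumes the reported F1 follows the standard harmonic-mean definition; recomputing from the rounded precision and recall in the table can differ from the reported value in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the table above (percentages).
f1_610m = f1_score(77.04, 72.96)
delta_vs_gpt = 74.95 - 60.91   # reported F1 minus GPT-4.1-mini F1
delta_vs_210m = 74.95 - 66.70  # reported F1 minus EuroBERT-210M F1

print(f"Recomputed F1 (610M): {f1_610m:.2f}")
print(f"Delta vs GPT-4.1-mini: {delta_vs_gpt:+.2f}")
print(f"Delta vs EuroBERT-210M: {delta_vs_210m:+.2f}")
```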

### Manual Validation

We performed additional validation on a manually reviewed set of 300 examples covering all task types from the data (QA, summarization, data-to-text). The EuroBERT-610M German model showed strong performance with an F1 score of 71.79% on this curated dataset.

| Model         | Precision (%) | Recall (%) | F1 (%) |
|---------------|---------------|------------|--------|
| EuroBERT-210M | 68.32         | 68.32      | 68.32  |
| EuroBERT-610M | **74.47**     | **69.31**  | **71.79** |

## Citing

If you use the model or the tool, please cite the following paper:

```bibtex
@misc{Kovacs:2025,
      title={LettuceDetect: A Hallucination Detection Framework for RAG Applications}, 
      author={Ádám Kovács and Gábor Recski},
      year={2025},
      eprint={2502.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17125}, 
}
```