---
title: L3Score
datasets:
- google/spiqa
tags:
- evaluate
- metric
- semantic-similarity
- qa
- llm-eval
description: >
  L3Score is a metric for evaluating the semantic similarity of free-form
  answers in question answering tasks. It uses log-probabilities of "Yes"/"No"
  tokens from a language model acting as a judge. Based on the SPIQA benchmark:
  https://arxiv.org/pdf/2407.09413
sdk: gradio
sdk_version: 5.25.1
app_file: app.py
pinned: false
---

# Metric Card: L3Score

## 📌 Description

**L3Score** evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a **language model as a judge** using the following format:

```text
You are given a question, ground-truth answer, and a candidate answer.

Question: {question}  
Ground-truth answer: {gt}  
Candidate answer: {answer}

Is the semantic meaning of the ground-truth and candidate answers similar?  
Answer in one word - Yes or No.
```

The model's **log-probabilities** for "Yes" and "No" tokens are used to compute the score.
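As a minimal sketch of how the prompt above could be filled in for one triplet, the snippet below uses plain `str.format`. The names `PROMPT_TEMPLATE` and `build_judge_prompt` are hypothetical and chosen for illustration; the metric's actual implementation may assemble the prompt differently.

```python
# Template text copied from the prompt format shown above.
PROMPT_TEMPLATE = (
    "You are given a question, ground-truth answer, and a candidate answer.\n\n"
    "Question: {question}\n"
    "Ground-truth answer: {gt}\n"
    "Candidate answer: {answer}\n\n"
    "Is the semantic meaning of the ground-truth and candidate answers similar?\n"
    "Answer in one word - Yes or No."
)


def build_judge_prompt(question: str, reference: str, prediction: str) -> str:
    """Fill the judge prompt for a single (question, reference, prediction) triplet.

    Hypothetical helper, not part of the L3Score API.
    """
    return PROMPT_TEMPLATE.format(question=question, gt=reference, answer=prediction)


prompt = build_judge_prompt("What is the capital of France?", "Paris", "Paris")
print(prompt)
```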

### 🧮  Scoring Logic

Let $l_{\text{yes}}$ and $l_{\text{no}}$ be the log-probabilities of "Yes" and "No", respectively.

- If neither token is among the top-5 most likely tokens returned by the judge:

$$
\text{L3Score} = 0
$$

- If both are present:

$$
\text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
$$

- If only one is present, the missing token's probability is estimated as the minimum of the two quantities below, and the same ratio is then computed with that estimate:
    - the probability mass not covered by the top-5 tokens
    - the probability of the least likely top-5 token

The score ranges from 0 to 1, where 1 indicates the highest confidence by the LLM that the predicted and reference answers are semantically equivalent.
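The three cases above can be sketched in a few lines of Python. This is an illustrative sketch, not the metric's actual source: the function name `l3score_from_top_logprobs` is hypothetical, the input is assumed to be a mapping from each top-5 token to its log-probability, and tokens are matched by the exact strings `"Yes"` and `"No"` (a real implementation may normalize case or whitespace).

```python
import math


def l3score_from_top_logprobs(top_logprobs: dict[str, float]) -> float:
    """Compute the score for one example from the judge's top-5 token log-probs.

    Hypothetical helper; `top_logprobs` maps token -> log-probability.
    """
    l_yes = top_logprobs.get("Yes")
    l_no = top_logprobs.get("No")

    # Case 1: neither token is in the top-5.
    if l_yes is None and l_no is None:
        return 0.0

    probs = [math.exp(lp) for lp in top_logprobs.values()]

    # Estimate for a missing token: the smaller of the probability mass
    # outside the top-5 and the probability of the least likely top-5 token.
    remaining_mass = max(0.0, 1.0 - sum(probs))
    missing_estimate = min(remaining_mass, min(probs))

    # Cases 2 and 3: use the observed probability where available,
    # otherwise fall back to the estimate.
    p_yes = math.exp(l_yes) if l_yes is not None else missing_estimate
    p_no = math.exp(l_no) if l_no is not None else missing_estimate
    return p_yes / (p_yes + p_no)


# Both tokens present: the score is p_yes / (p_yes + p_no).
both = {"Yes": math.log(0.8), "No": math.log(0.1), "Maybe": math.log(0.05)}
print(l3score_from_top_logprobs(both))
```

When only "No" is present, the estimate takes the place of the "Yes" probability, so the score stays close to 0 whenever the judge puts most of its mass on "No".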

See [SPIQA paper](https://arxiv.org/pdf/2407.09413) for details.

## 🚀 How to Use

```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

questions = ["What is the capital of France?", "What is the capital of Germany?"]
predictions = ["Paris", "Moscow"]
references = ["Paris", "Berlin"]

score = l3score.compute(
    questions=questions,
    predictions=predictions,
    references=references,
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)

print(score)
# {'L3Score': 0.49..., 'Cost':...}
```

---

### 🔠 Inputs

| Name         | Type         | Description                                                                 |
|--------------|--------------|-----------------------------------------------------------------------------|
| `questions`  | `list[str]`  | The list of input questions.                                                |
| `predictions`| `list[str]`  | Generated answers by the model being evaluated.                            |
| `references` | `list[str]`  | Ground-truth or reference answers.                                         |
| `api_key`    | `str`        | API key for the selected LLM provider.                                     |
| `provider`   | `str`        | Must support top-n token log-probabilities (currently available: `"openai"`, `"deepseek"`, `"xai"`). |
| `model`      | `str`        | Name of the evaluation LLM (e.g., `"gpt-4o-mini"`).                         |

---

### 📄 Output

A dictionary with the score and the cost of querying the LLM provider's API:

```python
{"L3Score": float, "Cost": float}
```

`L3Score` is the **average score** over all (question, prediction, reference) triplets; `Cost` is the total cost of all API calls.
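The aggregation can be pictured as follows. This is a sketch under the assumption that one score and one cost are available per triplet; the helper name `aggregate` and its arguments are hypothetical and not part of the metric's API.

```python
from statistics import mean


def aggregate(per_example_scores: list[float], per_call_costs: list[float]) -> dict[str, float]:
    """Average the per-triplet scores and sum the per-call costs.

    Hypothetical helper mirroring the output format described above.
    """
    return {"L3Score": mean(per_example_scores), "Cost": sum(per_call_costs)}


# One near-perfect match and one clear mismatch average to roughly 0.5.
result = aggregate([0.99, 0.01], [0.0001, 0.0001])
print(result)
```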

---

## 💡 Examples

```python
import evaluate

l3score = evaluate.load("nhop/L3Score")

score = l3score.compute(
    questions=["What is the capital of France?"],
    predictions=["Paris"],
    references=["Paris"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)
# {'L3Score': 0.99..., 'Cost':...}

score = l3score.compute(
    questions=["What is the capital of Germany?"],
    predictions=["Moscow"],
    references=["Berlin"],
    api_key="your-openai-api-key",
    provider="openai",
    model="gpt-4o-mini"
)
# {'L3Score': 0.00..., 'Cost':...}
```

---

## ⚠️ Limitations and Bias

- Requires models that expose **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, xAI).
- Scores are **only comparable when using the same judge model**.

---

## 📖 Citation

```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={arXiv preprint arXiv:2407.09413},
  year={2024}
}
```

---

## 🔗 Further References

- 🤗 [Dataset on Hugging Face](https://huggingface.co/datasets/google/spiqa)  
- 🐙 [GitHub Repository](https://github.com/google/spiqa)  
- 📄 [SPIQA Paper (arXiv:2407.09413)](https://arxiv.org/pdf/2407.09413)