Niklas Hoepner committed
Commit 26a4d51 · 1 Parent(s): 8f7a170

Update gradio app

Files changed (1)
1. app.py +132 -3
app.py CHANGED
@@ -1,6 +1,135 @@
+import gradio as gr
 import evaluate
-from evaluate.utils import launch_gradio_widget
-
-
-module = evaluate.load("nhop/L3Score")
-launch_gradio_widget(module)
+
+l3score = evaluate.load("nhop/L3Score")
+
+def compute_l3score(api_key, provider, model, questions, predictions, references):
+    try:
+        result = l3score.compute(
+            questions=[q.strip() for q in questions.split("\n") if q.strip()],
+            predictions=[p.strip() for p in predictions.split("\n") if p.strip()],
+            references=[r.strip() for r in references.split("\n") if r.strip()],
+            api_key=api_key,
+            provider=provider,
+            model=model,
+        )
+        return result
+    except Exception as e:
+        return {"error": str(e)}
+
+with gr.Blocks() as demo:
+    gr.Markdown(
+        r"""
+# 🦢 L3Score Evaluation Demo
+
+## 📌 Description
+**L3Score** evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a **language model as a judge** using:
+
+```text
+You are given a question, ground-truth answer, and a candidate answer.
+
+Question: {{question}}
+Ground-truth answer: {{gt}}
+Candidate answer: {{answer}}
+
+Is the semantic meaning of the ground-truth and candidate answers similar?
+Answer in one word - Yes or No.
+```
+
+The judge's **log-probabilities** for the "Yes" and "No" tokens are used to compute the score.
+
+---
+
+## 🧮 Scoring Logic
+
+Let $l_{\text{yes}}$ and $l_{\text{no}}$ be the log-probabilities of "Yes" and "No", respectively.
+
+- If neither token is in the top-5:
+
+$$
+\text{L3Score} = 0
+$$
+
+- If both are present:
+
+$$
+\text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
+$$
+
+- If only one is present, the missing token's probability is estimated from the remaining probability mass or the least likely top-5 token, and the same ratio is applied.
+
+---
+
+## 🚀 How to Use
+
+```python
+import evaluate
+
+l3score = evaluate.load("nhop/L3Score")
+
+score = l3score.compute(
+    questions=["What is the capital of France?"],
+    predictions=["Paris"],
+    references=["Paris"],
+    api_key="your-openai-api-key",
+    provider="openai",
+    model="gpt-4o-mini"
+)
+print(score)
+# {'L3Score': 0.99...}
+```
+
+---
+
+## 🔠 Inputs
+| Name          | Type        | Description                                              |
+|---------------|-------------|----------------------------------------------------------|
+| `questions`   | `list[str]` | The list of input questions.                             |
+| `predictions` | `list[str]` | Answers generated by the model being evaluated.          |
+| `references`  | `list[str]` | Ground-truth or reference answers.                       |
+| `api_key`     | `str`       | API key for the selected LLM provider.                   |
+| `provider`    | `str`       | LLM provider; must expose top-n token log-probabilities. |
+| `model`       | `str`       | Name of the judge LLM.                                   |
+
+## 📄 Output
+```python
+{"L3Score": float}
+```
+The value is the **average score** over all (question, prediction, reference) triplets.
+
+---
+
+## ⚠️ Limitations and Bias
+- Requires models that expose **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, Groq).
+- Scores are **only comparable when using the same judge model**.
+
+## 📖 Citation
+```bibtex
+@article{pramanick2024spiqa,
+  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
+  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
+  journal={arXiv preprint arXiv:2407.09413},
+  year={2024}
+}
+```
+""",
+        # Gradio renders $$...$$ by default; add $...$ so inline math displays too.
+        latex_delimiters=[
+            {"left": "$$", "right": "$$", "display": True},
+            {"left": "$", "right": "$", "display": False},
+        ],
+    )
+
+    with gr.Row():
+        api_key = gr.Textbox(label="API Key", type="password")
+        provider = gr.Dropdown(label="Provider", choices=["openai", "deepseek", "xai"], value="openai")
+        model = gr.Textbox(label="Model", value="gpt-4o-mini")
+
+    with gr.Row():
+        questions = gr.Textbox(label="Questions (one per line)", lines=4, placeholder="What is the capital of France?")
+        predictions = gr.Textbox(label="Predictions (one per line)", lines=4, placeholder="Paris")
+        references = gr.Textbox(label="References (one per line)", lines=4, placeholder="Paris")
+
+    compute_button = gr.Button("Compute L3Score")
+    output = gr.JSON(label="L3Score Result")
+
+    compute_button.click(
+        fn=compute_l3score,
+        inputs=[api_key, provider, model, questions, predictions, references],
+        outputs=output,
+    )
+
+demo.launch()
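
The scoring rule documented in the app can be sanity-checked standalone. The sketch below assumes the judge's top-5 token log-probabilities arrive as a plain dict, as typical chat-completion logprobs APIs return them; `l3score_from_top5` is a hypothetical helper for illustration, not the metric's actual implementation, and its single-token fallback is one plausible reading of the estimation rule.

```python
import math

def l3score_from_top5(top5_logprobs: dict[str, float]) -> float:
    # Hypothetical helper: scores one judged answer from the judge's
    # top-5 token log-probabilities, following the rule documented above.
    l_yes = top5_logprobs.get("Yes")
    l_no = top5_logprobs.get("No")

    # Neither "Yes" nor "No" made the top 5: score 0.
    if l_yes is None and l_no is None:
        return 0.0

    # Only one token present: estimate the missing log-probability as the
    # smaller of the leftover probability mass and the least likely
    # top-5 token (assumption: this is how the two estimates combine).
    if l_yes is None or l_no is None:
        leftover = max(1.0 - sum(math.exp(l) for l in top5_logprobs.values()), 1e-12)
        fallback = min(min(top5_logprobs.values()), math.log(leftover))
        l_yes = fallback if l_yes is None else l_yes
        l_no = fallback if l_no is None else l_no

    # Two-way softmax over the "Yes"/"No" log-probabilities.
    return math.exp(l_yes) / (math.exp(l_yes) + math.exp(l_no))

# Judge nearly certain the answers match: score close to 1.
print(l3score_from_top5({"Yes": -0.01, "No": -4.61, ",": -6.91}))  # ≈ 0.99
```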
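
Because the Blocks handler takes newline-separated strings rather than lists, a headless smoke test of `compute_l3score` (bypassing the UI) might look like the following; the API key is a placeholder and the model simply mirrors the demo default.

```python
# Hypothetical smoke test: assumes compute_l3score from app.py is in scope
# (e.g., run in a REPL after executing the definitions above).
result = compute_l3score(
    api_key="sk-...",  # placeholder, not a real key
    provider="openai",
    model="gpt-4o-mini",
    questions="What is the capital of France?\nWho wrote Hamlet?",
    predictions="Paris\nShakespeare",
    references="Paris\nWilliam Shakespeare",
)
print(result)  # e.g. {"L3Score": ...}, or {"error": "..."} on failure
```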