Update README.md
README.md CHANGED
@@ -39,6 +39,12 @@ The table below summarizes the evaluation results:
 | google/gemma-2-9b-it | 54.13% |
 | ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1 | 36.89% |
 
+
+### Voting Methodology
+
+A question and two answers from different models were presented to human judges. The judges selected the better answer based on their preferences. For example, in the question below, the judge selected the answer on the right:
+![Question example](images/example.png)
+
 ### 📊 Turkish Evaluation Benchmark Results (via `malhajar17/lm-evaluation-harness_turkish`)
 
 | Model Name | Average | MMLU | Truthful_QA | ARC | Hellaswag | Gsm8K | Winogrande |
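
For reference, here is a minimal sketch of how win rates like those in the first table could be tallied from pairwise human judgments. The `votes` records and their field layout are hypothetical illustrations, not part of this repository, and the actual aggregation used for the table may differ (e.g. handling of ties).

```python
from collections import defaultdict

# Hypothetical judgment records: each vote pairs two model answers to the
# same question and names the model whose answer the judge preferred.
votes = [
    ("google/gemma-2-9b-it", "ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1",
     "google/gemma-2-9b-it"),
    # ... one tuple per judged question ...
]

wins = defaultdict(int)         # times a model's answer was preferred
appearances = defaultdict(int)  # times a model appeared in a comparison

for model_a, model_b, winner in votes:
    appearances[model_a] += 1
    appearances[model_b] += 1
    wins[winner] += 1

# Win rate = preferred answers / total comparisons the model took part in.
for model in appearances:
    print(f"{model}: {wins[model] / appearances[model]:.2%}")
```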