Reproducing belebele evaluation
I tried to reproduce the accuracy values for the belebele evaluation dataset that are listed here:
https://openeurollm.eu/blog/hplt-oellm-38-reference-models
Basically, I am using this script, simplified a bit:
https://github.com/geronimi73/belebele-llama
The accuracy I get is about 25% for several languages (including eng and deu, which is why I am posting here), i.e. essentially random guessing. I also compared with several other models such as gemma-3, and they all reach the accuracy I would expect. Is there anything I need to check or add to reach the accuracies reported in the blog post?
Hello!
The difference could be due to the evaluation method. We used a closed-form evaluation, which can produce different results compared to a multiple-choice format.
To reproduce our results, you can run the following lighteval command:
lighteval accelerate \
    "pretrained=HPLT/hplt2c_deu_checkpoints" \
    --output-dir ${OUTPUT_PATH} \
    --override-batch-size 32 \
    --custom-tasks "lighteval.tasks.multilingual.tasks" \
    "lighteval|belebele_deu_Latn_cf|0|1"
Yes, I could reproduce your scores with lighteval. lm-evaluation-harness gives me an accuracy of roughly 0.25:
lm_eval --model hf \
    --model_args pretrained=HPLT/hplt2c_deu_checkpoints \
    --tasks belebele_deu_Latn \
    --device cuda:0 \
    --batch_size 8
How does closed-form evaluation work exactly? I thought multiple-choice would already entail closed-form evaluation. That's also what my favorite LLM tells me, so I am not sure what "closed-form" means in this case :)