Reproducing belebele evaluation

#2, opened by pbouda

I tried to reproduce the accuracy values for the belebele evaluation dataset that are listed here:

https://openeurollm.eu/blog/hplt-oellm-38-reference-models

Basically I am using this script here, simplified a bit:

https://github.com/geronimi73/belebele-llama

The accuracy I get is around 25% for several languages (including eng and deu, which is why I am posting here), so basically random guessing. I also compared with several other models, like gemma-3, and they all reach the accuracy I would expect. Is there anything I need to check or add to reach the accuracy reported in the blog?
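
For reference, this is roughly what my simplified setup does (a minimal sketch, not the exact script from the repo; the prompt template and the letter scoring are my own simplification, dataset fields as on facebook/belebele, model loaded at the default revision):

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HPLT/hplt2c_deu_checkpoints"
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
model.eval()

ds = load_dataset("facebook/belebele", "deu_Latn", split="test")
letters = ["A", "B", "C", "D"]

correct = 0
for row in ds:
    prompt = (
        f"{row['flores_passage']}\n\n"
        f"Question: {row['question']}\n"
        f"A: {row['mc_answer1']}\nB: {row['mc_answer2']}\n"
        f"C: {row['mc_answer3']}\nD: {row['mc_answer4']}\n"
        "Answer:"
    )
    inputs = tok(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Compare the logits of the four answer letters and take the most likely one.
    # Assumes " A", " B", ... end in a single letter token for this tokenizer.
    letter_ids = [tok.encode(" " + l, add_special_tokens=False)[-1] for l in letters]
    pred = letters[int(next_token_logits[letter_ids].argmax())]
    gold = letters[int(row["correct_answer_num"]) - 1]
    correct += int(pred == gold)

print(f"accuracy: {correct / len(ds):.3f}")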

Hello!
The difference could be due to the evaluation method. We used a closed-form evaluation, which can produce different results than a multiple-choice format (there is a rough sketch below the command).
To reproduce our results, you can run the following lighteval command:

python3 lighteval accelerate \
  "pretrained=HPLT/hplt2c_deu_checkpoints" \
  --output-dir ${OUTPUT_PATH} \
  --override-batch-size 32 \
  --custom-tasks "lighteval.tasks.multilingual.tasks" \
  "lighteval|belebele_deu_Latn_cf|0|1"

Yes, I could reproduce your scores with lighteval. lm-evaluation-harness gives me an accuracy of roughly 0.25:

lm_eval --model hf \
    --model_args pretrained=HPLT/hplt2c_deu_checkpoints \
    --tasks belebele_deu_Latn \
    --device cuda:0 \
    --batch_size 8

How does the closed-form evaluation work exactly? I thought multiple choice would already count as closed-form evaluation. That's also what my favorite LLM tells me, so I am not sure what "closed-form" means in this case :)
