
Evaluation process

git clone https://huggingface.co/spaces/d0rj/romb-leaderboard
cd romb-leaderboard

1. Generate responses

The first and main step is to generate the answers. You can do this in any way that is convenient for you, including with the scripts in this repository. What matters is that you end up with a file of answers in JSONL format, where each object contains the fields id (question id, int) and generated_answer (model response, JSON object). Example:

{"id":0,"generated_answer":{"answer":"А","context":{}}}
{"id":1,"generated_answer":{"answer":"А","context":{}}}
{"id":2,"generated_answer":{"answer":36,"context":{}}}
{"id":3,"generated_answer":{"answer":10,"context":{}}}
{"id":4,"generated_answer":{"answer":3000000000000000,"context":{}}}
{"id":5,"generated_answer":{"answer":"А","context":{}}}
{"id":6,"generated_answer":{"answer":10,"context":{}}}
{"id":7,"generated_answer":{"answer":"А","context":{}}}
{"id":8,"generated_answer":{"answer":{"Удав":4,"Слоненок":1,"Мартышка":3},"context":{}}}
...
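If you work with such a file in your own code, a minimal sketch for loading it (the `load_answers` helper is hypothetical, not part of this repository; it only assumes the `id`/`generated_answer` fields shown above):

```python
import json

def load_answers(path):
    """Load a JSONL answers file into a dict keyed by question id."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            answers[record["id"]] = record["generated_answer"]
    return answers
```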

Generation utils

Two prompt types are currently supported (answering immediately, or answering after a first line of reasoning), along with two model providers (Ollama and OpenAI-API-compatible).

python3 cli.py generate --help

An example of generating responses with the Gemma 3 1B model:

ollama run gemma3:1b
python3 cli.py generate --config-path configs/gemma-3-1b.yaml --output-path ./gemma-3-1b_nothink.jsonl --temp-path ./tmp_gemma-3-1b/
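If you generate answers with your own code instead of cli.py, a minimal sketch of producing the required JSONL (here `ask_model` is a hypothetical stand-in for your own inference call, e.g. an Ollama or OpenAI-compatible client):

```python
import json

def ask_model(question: str) -> dict:
    """Hypothetical stand-in for your inference call; must return an
    object with at least an "answer" field, as in the examples above."""
    return {"answer": "А", "context": {}}

def write_answers(questions, out_path):
    """questions: iterable of (id, text) pairs; writes one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for qid, text in questions:
            record = {"id": qid, "generated_answer": ask_model(text)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```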

2. Validate responses

The generated responses can be checked for type correctness with the utility:

python3 cli.py type-sanitycheck --help
python3 cli.py type-sanitycheck --file ./gemma-3-1b_nothink.jsonl
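Roughly, such a check verifies that every line parses and carries the required fields. A minimal stand-in (not the repository's actual checker, which may validate more):

```python
import json

def sanity_check(path):
    """Return a list of (line_number, problem) pairs; empty means OK.
    Checks only that each line is valid JSON with an int `id` and an
    object `generated_answer`."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                problems.append((n, "invalid JSON"))
                continue
            if not isinstance(rec.get("id"), int):
                problems.append((n, "missing or non-int id"))
            if not isinstance(rec.get("generated_answer"), dict):
                problems.append((n, "missing or non-object generated_answer"))
    return problems
```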

3. Evaluate responses

Once you have the answers file, you can score each answer as solved or unsolved with the utility:

python3 cli.py evaluate --help
python3 cli.py evaluate --file ./gemma-3-1b_nothink.jsonl

As a result, you will receive the file gemma-3-1b_nothink.eval.jsonl with a new boolean field is_correct: the result of checking each response.

4. Calculate overall metrics

python3 cli.py metrics --help
python3 cli.py metrics --model-name gemma-3-1b --file ./gemma-3-1b_nothink.eval.jsonl --model-size 1.0 --model-url https://huggingface.co/google/gemma-3-1b-it --model-config "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"

As a result, you will receive the file gemma-3-1b_nothink.eval.metrics.json with the aggregate metrics for the model:

[
    {
        "model_name": "gemma-3-1b",
        "model_size": 1.0,
        "model_url": "https://huggingface.co/google/gemma-3-1b-it",
        "pass1": 0.10148902821316615,
        "weighted_pass1": 0.10207932648691802,
        "arith_pass1": 0.08566433566433566,
        "geometry_pass1": 0.125,
        "logic_pass1": 0.13664596273291926,
        "config": "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"
    }
]
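The overall pass1 here is, in essence, the fraction of correct answers. A minimal sketch computing it from the .eval.jsonl file (the weighted and per-category metrics need extra question metadata and are produced by `cli.py metrics` itself):

```python
import json

def pass_at_1(eval_path):
    """Fraction of records in a .eval.jsonl file with is_correct == True."""
    with open(eval_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return sum(r["is_correct"] for r in records) / len(records)
```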