# Evaluation process

```shell
git clone https://huggingface.co/spaces/d0rj/romb-leaderboard
cd romb-leaderboard
```
## 1. Generate responses

The first and main step is to generate the answers. You can do this in any way that is convenient for you, including with the scripts in this repository. What matters is that you end up with a file of answers in JSONL format, where each object contains the fields `id` (question id, int) and `generated_answer` (model response, JSON object). Example:

```json
{"id":0,"generated_answer":{"answer":"А","context":{}}}
{"id":1,"generated_answer":{"answer":"А","context":{}}}
{"id":2,"generated_answer":{"answer":36,"context":{}}}
{"id":3,"generated_answer":{"answer":10,"context":{}}}
{"id":4,"generated_answer":{"answer":3000000000000000,"context":{}}}
{"id":5,"generated_answer":{"answer":"А","context":{}}}
{"id":6,"generated_answer":{"answer":10,"context":{}}}
{"id":7,"generated_answer":{"answer":"А","context":{}}}
{"id":8,"generated_answer":{"answer":{"Удав":4,"Слоненок":1,"Мартышка":3},"context":{}}}
...
```
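If you generate answers with your own code, writing this file is straightforward. A minimal sketch, assuming the answers have already been collected into a Python list (the sample rows below are placeholders, not real model output):

```python
import json

# Placeholder rows; in practice each "answer" comes from your model.
answers = [
    {"id": 0, "generated_answer": {"answer": "А", "context": {}}},
    {"id": 2, "generated_answer": {"answer": 36, "context": {}}},
]

# JSONL: one JSON object per line; ensure_ascii=False keeps Cyrillic readable.
with open("answers.jsonl", "w", encoding="utf-8") as f:
    for row in answers:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```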
### Generation utils

Two types of prompts are currently supported (responding immediately, or after a first line of reasoning) and two types of model providers (Ollama and OpenAI-compatible APIs).

```shell
python3 cli.py generate --help
```

An example of generating responses with the Gemma 3 1B model:

```shell
ollama run gemma3:1b
python3 cli.py generate --config-path configs/gemma-3-1b.yaml --output-path ./gemma-3-1b_nothink.jsonl --temp-path ./tmp_gemma-3-1b/
```
## 2. Validate responses

The generated responses can be checked for structural correctness with:

```shell
python3 cli.py type-sanitycheck --help
python3 cli.py type-sanitycheck --file ./gemma-3-1b_nothink.jsonl
```
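The real checks live in `cli.py`; as a rough sketch of the kind of validation this step performs (the function name and exact rules here are illustrative, not the tool's actual implementation):

```python
import json

def check_answers_file(path: str) -> list[str]:
    """Return a list of structural problems found in an answers JSONL file.

    Illustrative only: mirrors the expected schema (int `id`, object
    `generated_answer` containing an `answer` field), not the real tool's rules.
    """
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {lineno}: not valid JSON")
                continue
            if not isinstance(obj.get("id"), int):
                errors.append(f"line {lineno}: 'id' must be an int")
            ga = obj.get("generated_answer")
            if not isinstance(ga, dict) or "answer" not in ga:
                errors.append(f"line {lineno}: 'generated_answer' must be an object with an 'answer' field")
    return errors
```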
## 3. Evaluate responses

Once you have the answers file, you can run the solved/unsolved assessment:

```shell
python3 cli.py evaluate --help
python3 cli.py evaluate --file ./gemma-3-1b_nothink.jsonl
```

As a result, you will receive the file `gemma-3-1b_nothink.eval.jsonl` with a new field `is_correct` (bool), the result of checking each response.
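The per-question flags can also be aggregated by hand. A minimal sketch, assuming only the `is_correct` field described above (this computes plain accuracy, not the weighted or per-category metrics produced in step 4):

```python
import json

def pass_rate(path: str) -> float:
    """Fraction of responses with is_correct == true in an .eval.jsonl file."""
    flags = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            flags.append(json.loads(line)["is_correct"])
    return sum(flags) / len(flags)
```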
## 4. Calculate overall metrics

```shell
python3 cli.py metrics --help
python3 cli.py metrics --model-name gemma-3-1b --file ./gemma-3-1b_nothink.eval.jsonl --model-size 1.0 --model-url https://huggingface.co/google/gemma-3-1b-it --model-config "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"
```

As a result, you will receive the file `gemma-3-1b_nothink.eval.metrics.json` with overall metrics for the model:
```json
[
  {
    "model_name": "gemma-3-1b",
    "model_size": 1.0,
    "model_url": "https://huggingface.co/google/gemma-3-1b-it",
    "pass1": 0.10148902821316615,
    "weighted_pass1": 0.10207932648691802,
    "arith_pass1": 0.08566433566433566,
    "geometry_pass1": 0.125,
    "logic_pass1": 0.13664596273291926,
    "config": "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"
  }
]
```