## Evaluation process

```bash
git clone https://huggingface.co/spaces/d0rj/romb-leaderboard
cd romb-leaderboard
```

### 1. Generate responses

The first and main step is to generate the answers. You can do this in any way that is convenient for you, including with the scripts in this repository. The only requirement is that you end up with a file of answers in JSONL format, where each object contains the fields `id` (question id, int) and `generated_answer` (model response, JSON object). A minimal sketch of writing such a file is given at the end of this section.

Example:

```json
{"id":0,"generated_answer":{"answer":"А","context":{}}}
{"id":1,"generated_answer":{"answer":"А","context":{}}}
{"id":2,"generated_answer":{"answer":36,"context":{}}}
{"id":3,"generated_answer":{"answer":10,"context":{}}}
{"id":4,"generated_answer":{"answer":3000000000000000,"context":{}}}
{"id":5,"generated_answer":{"answer":"А","context":{}}}
{"id":6,"generated_answer":{"answer":10,"context":{}}}
{"id":7,"generated_answer":{"answer":"А","context":{}}}
{"id":8,"generated_answer":{"answer":{"Удав":4,"Слоненок":1,"Мартышка":3},"context":{}}}
...
```

#### Generation utils

Two prompt types are currently supported (the model answers immediately, or answers after a first line of reasoning), along with two model providers (Ollama and OpenAI-compatible APIs).

```bash
python3 cli.py generate --help
```

An example of generating responses with the Gemma 3 1B model:

```bash
ollama run gemma3:1b
```

```bash
python3 cli.py generate --config-path configs/gemma-3-1b.yaml --output-path ./gemma-3-1b_nothink.jsonl --temp-path ./tmp_gemma-3-1b/
```

### 2. Validate responses

The generated responses can be checked for structural validity (expected field names and value types) with the utility:

```bash
python3 cli.py type-sanitycheck --help
```

```bash
python3 cli.py type-sanitycheck --file ./gemma-3-1b_nothink.jsonl
```

### 3. Evaluate responses

Once you have the answers file, you can run the solved/unsolved assessment with the utility:

```bash
python3 cli.py evaluate --help
```

```bash
python3 cli.py evaluate --file ./gemma-3-1b_nothink.jsonl
```

As a result, you will get the file `gemma-3-1b_nothink.eval.jsonl` with a new field `is_correct` (bool), the result of checking each response.

### 4. Calculate overall metrics

```bash
python3 cli.py metrics --help
```

```bash
python3 cli.py metrics --model-name gemma-3-1b --file ./gemma-3-1b_nothink.eval.jsonl --model-size 1.0 --model-url https://huggingface.co/google/gemma-3-1b-it --model-config "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"
```

As a result, you will get the file `gemma-3-1b_nothink.eval.metrics.json` with the aggregate metrics for the model:

```json
[
  {
    "model_name": "gemma-3-1b",
    "model_size": 1.0,
    "model_url": "https://huggingface.co/google/gemma-3-1b-it",
    "pass1": 0.10148902821316615,
    "weighted_pass1": 0.10207932648691802,
    "arith_pass1": 0.08566433566433566,
    "geometry_pass1": 0.125,
    "logic_pass1": 0.13664596273291926,
    "config": "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"
  }
]
```
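
#### Writing the answers file yourself

As noted in step 1, the answers can be produced by any means, not only by the bundled scripts. Below is a minimal Python sketch of writing a file in the expected JSONL layout; the output file name and the answer values are placeholders, and only the `id` / `generated_answer` structure matters to the pipeline.

```python
import json

# Placeholder answers from your own generation pipeline.
# Each record needs an integer "id" and a JSON-object "generated_answer".
answers = [
    {"id": 0, "generated_answer": {"answer": "А", "context": {}}},
    {"id": 2, "generated_answer": {"answer": 36, "context": {}}},
]

with open("my-model_answers.jsonl", "w", encoding="utf-8") as f:
    for record in answers:
        # ensure_ascii=False keeps Cyrillic answer options readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A file written this way can then be passed to `type-sanitycheck`, `evaluate`, and `metrics` exactly as in the steps above.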