## Evaluation process

```bash
git clone https://huggingface.co/spaces/d0rj/romb-leaderboard
cd romb-leaderboard
```

### 1. Generate responses

The first and main step is to generate the answers. You can do this in any way that is convenient for you, including with the scripts in this repository. The only requirement is that you end up with a file of answers in JSONL format, where each object contains the fields `id` (question id, int) and `generated_answer` (model response, JSON object). A minimal sketch of writing such a file is given at the end of this section.

Example:

```json
{"id":0,"generated_answer":{"answer":"А","context":{}}}
{"id":1,"generated_answer":{"answer":"А","context":{}}}
{"id":2,"generated_answer":{"answer":36,"context":{}}}
{"id":3,"generated_answer":{"answer":10,"context":{}}}
{"id":4,"generated_answer":{"answer":3000000000000000,"context":{}}}
{"id":5,"generated_answer":{"answer":"А","context":{}}}
{"id":6,"generated_answer":{"answer":10,"context":{}}}
{"id":7,"generated_answer":{"answer":"А","context":{}}}
{"id":8,"generated_answer":{"answer":{"Удав":4,"Слоненок":1,"Мартышка":3},"context":{}}}
...
```

#### Generation utils

Two prompt types are currently supported (the model answers immediately, or answers after a first line of reasoning), along with two model providers (Ollama and OpenAI-compatible APIs).

```bash
python3 cli.py generate --help
```

An example of generating responses with the Gemma 3 1B model:

```bash
ollama run gemma3:1b
```

```bash
python3 cli.py generate --config-path configs/gemma-3-1b.yaml --output-path ./gemma-3-1b_nothink.jsonl --temp-path ./tmp_gemma-3-1b/
```

### 2. Validate responses

The generated responses can be checked for structural validity (expected field names and value types) with the utility:

```bash
python3 cli.py type-sanitycheck --help
```

```bash
python3 cli.py type-sanitycheck --file ./gemma-3-1b_nothink.jsonl
```

### 3. Evaluate responses

Once you have the answers file, you can run the solved/unsolved assessment with the utility:

```bash
python3 cli.py evaluate --help
```

```bash
python3 cli.py evaluate --file ./gemma-3-1b_nothink.jsonl
```

As a result, you will get the file `gemma-3-1b_nothink.eval.jsonl` with a new field `is_correct` (bool), the result of checking each response.

### 4. Calculate overall metrics

```bash
python3 cli.py metrics --help
```

```bash
python3 cli.py metrics --model-name gemma-3-1b --file ./gemma-3-1b_nothink.eval.jsonl --model-size 1.0 --model-url https://huggingface.co/google/gemma-3-1b-it --model-config "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"
```

As a result, you will get the file `gemma-3-1b_nothink.eval.metrics.json` with the aggregate metrics for the model:

```json
[
  {
    "model_name": "gemma-3-1b",
    "model_size": 1.0,
    "model_url": "https://huggingface.co/google/gemma-3-1b-it",
    "pass1": 0.10148902821316615,
    "weighted_pass1": 0.10207932648691802,
    "arith_pass1": 0.08566433566433566,
    "geometry_pass1": 0.125,
    "logic_pass1": 0.13664596273291926,
    "config": "{'build_function': 'singleturn', 'top_k': 1, 'top_p': 1, 'temperature': 0.0}"
  }
]
```
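
#### Writing the answers file yourself

As noted in step 1, the answers can be produced by any means, not only by the bundled scripts. Below is a minimal Python sketch of writing a file in the expected JSONL layout; the output file name and the answer values are placeholders, and only the `id` / `generated_answer` structure matters to the pipeline.

```python
import json

# Placeholder answers from your own generation pipeline.
# Each record needs an integer "id" and a JSON-object "generated_answer".
answers = [
    {"id": 0, "generated_answer": {"answer": "А", "context": {}}},
    {"id": 2, "generated_answer": {"answer": 36, "context": {}}},
]

with open("my-model_answers.jsonl", "w", encoding="utf-8") as f:
    for record in answers:
        # ensure_ascii=False keeps Cyrillic answer options readable in the file
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A file written this way can then be passed to `type-sanitycheck`, `evaluate`, and `metrics` exactly as in the steps above.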