Model Evaluation and Leaderboard
1) Model Evaluation
Before integrating a model into the leaderboard, it must first be evaluated using the lm-eval-harness library in both zero-shot and 5-shot configurations.
This can be done with the following command:
lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \
    --tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \
    --output_path model_output --num_fewshot 5
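The command above runs the 5-shot configuration; for the zero-shot configuration, the same command can be repeated with --num_fewshot 0, for example:
lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \
    --tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \
    --output_path model_output --num_fewshot 0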
The output generated by the library will include the model's accuracy scores on the benchmark tasks.
This output is written to standard output and should be saved to a text file (e.g., slurm-8368.out), which must then be placed in the evalita_llm_models_output directory for further processing.
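One convenient way to do this is to redirect the command's standard output (and standard error) straight into that directory when launching the evaluation; the file name below is only an example:
lm_eval [same arguments as above] > evalita_llm_models_output/slurm-8368.out 2>&1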
2) Extracting Model Metadata | |
To display model details on the leaderboard (e.g., organization/group, model name, and parameter count), metadata must be retrieved from Hugging Face. | |
This can be done by running: | |
python get_model_info.py | |
This script processes the evaluation files from Step 1 and saves each model's metadata in a JSON file within the evalita_llm_requests directory. | |
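As a rough sketch of the kind of lookup this step performs, the snippet below retrieves model metadata with the huggingface_hub client; the selected fields, file names, and output layout are illustrative assumptions, not necessarily what get_model_info.py produces:
import json
import os
from huggingface_hub import HfApi

# Sketch: fetch basic metadata for an evaluated model from Hugging Face.
# The selected fields and the output file name are illustrative assumptions.
api = HfApi()
model_id = "google/gemma-3-12b-it"
info = api.model_info(model_id)

metadata = {
    "organization": model_id.split("/")[0],
    "model_name": model_id.split("/")[1],
    # safetensors metadata, when available, reports the total parameter count
    "num_parameters": info.safetensors.total if info.safetensors else None,
}

os.makedirs("evalita_llm_requests", exist_ok=True)
with open(f"evalita_llm_requests/{model_id.replace('/', '_')}.json", "w") as f:
    json.dump(metadata, f, indent=2)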
3) Generating Leaderboard Submission File | |
The leaderboard requires a structured file containing each model’s metadata along with its benchmark accuracy scores. | |
To generate this file, run: | |
python preprocess_model_output.py
This script combines the accuracy results from Step 1 with the metadata from Step 2 and outputs a JSON file in the evalita_llm_results directory. | |
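As a rough illustration of what this combination step produces, the sketch below merges a metadata record from Step 2 with parsed accuracy scores from Step 1; the field names, file naming convention, and placeholder scores are assumptions rather than the actual schema used by preprocess_model_output.py:
import json
import os

# Sketch: merge model metadata (Step 2) with benchmark accuracies (Step 1)
# into a single leaderboard submission record. All names are illustrative.
model_key = "google_gemma-3-12b-it"  # hypothetical file naming convention

with open(f"evalita_llm_requests/{model_key}.json") as f:
    metadata = json.load(f)

# Placeholder values standing in for the accuracies parsed from the
# lm-eval-harness output saved in evalita_llm_models_output.
accuracies = {"evalita-mp_0_shot": None, "evalita-mp_5_shot": None}

submission = {**metadata, "results": accuracies}

os.makedirs("evalita_llm_results", exist_ok=True)
with open(f"evalita_llm_results/{model_key}.json", "w") as f:
    json.dump(submission, f, indent=2)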
4) Updating the Hugging Face Repository | |
The evalita_llm_results repository on Hugging Face must be updated with the newly generated files from Step 3.
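A minimal way to push these files from a script is sketched below using huggingface_hub's upload_folder; the repository id and repo_type are assumptions and must be adapted to the actual evalita_llm_results repository:
from huggingface_hub import HfApi

# Sketch: upload the generated result files to the Hub.
# Requires a token with write access (e.g., obtained via `huggingface-cli login`).
api = HfApi()
api.upload_folder(
    folder_path="evalita_llm_results",
    repo_id="your-org/evalita_llm_results",  # hypothetical repository id
    repo_type="dataset",                     # assumption; adjust if the repo is a different type
    commit_message="Add evaluation results for google/gemma-3-12b-it",
)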
5) Running the Leaderboard Application | |
Finally, execute the leaderboard application by running: | |
python app.py | |