Model Evaluation and Leaderboard
1) Model Evaluation
Before integrating a model into the leaderboard, it must first be evaluated using the lm-eval-harness library in both zero-shot and 5-shot configurations.
This can be done with the following command:
lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \
    --tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \
    --output_path model_output --num_fewshot 5
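The command above runs the 5-shot configuration; for the zero-shot configuration, the same command can be repeated with --num_fewshot 0, for example:
lm_eval --model hf --model_args pretrained=google/gemma-3-12b-it \
    --tasks evalita-mp --device cuda:0 --batch_size 1 --trust_remote_code \
    --output_path model_output --num_fewshot 0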
The output generated by the library will include the model's accuracy scores on the benchmark tasks.
This output is written to standard output and should be saved to a text file (e.g., slurm-8368.out), which must then be placed in the evalita_llm_models_output directory for further processing.
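One convenient way to do this is to redirect the command's standard output (and standard error) straight into that directory when launching the evaluation; the file name below is only an example:
lm_eval [same arguments as above] > evalita_llm_models_output/slurm-8368.out 2>&1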
2) Extracting Model Metadata | |
To display model details on the leaderboard (e.g., organization/group, model name, and parameter count), metadata must be retrieved from Hugging Face. | |
This can be done by running: | |
python get_model_info.py | |
This script processes the evaluation files from Step 1 and saves each model's metadata in a JSON file within the evalita_llm_requests directory. | |
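As a rough sketch of the kind of lookup this step performs, the snippet below retrieves model metadata with the huggingface_hub client; the selected fields, file names, and output layout are illustrative assumptions, not necessarily what get_model_info.py produces:
import json
import os
from huggingface_hub import HfApi

# Sketch: fetch basic metadata for an evaluated model from Hugging Face.
# The selected fields and the output file name are illustrative assumptions.
api = HfApi()
model_id = "google/gemma-3-12b-it"
info = api.model_info(model_id)

metadata = {
    "organization": model_id.split("/")[0],
    "model_name": model_id.split("/")[1],
    # safetensors metadata, when available, reports the total parameter count
    "num_parameters": info.safetensors.total if info.safetensors else None,
}

os.makedirs("evalita_llm_requests", exist_ok=True)
with open(f"evalita_llm_requests/{model_id.replace('/', '_')}.json", "w") as f:
    json.dump(metadata, f, indent=2)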
3) Generating Leaderboard Submission File | |
The leaderboard requires a structured file containing each model’s metadata along with its benchmark accuracy scores. | |
To generate this file, run: | |
python preprocess_model_output.py
This script combines the accuracy results from Step 1 with the metadata from Step 2 and outputs a JSON file in the evalita_llm_results directory. | |
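As a rough illustration of what this combination step produces, the sketch below merges a metadata record from Step 2 with parsed accuracy scores from Step 1; the field names, file naming convention, and placeholder scores are assumptions rather than the actual schema used by preprocess_model_output.py:
import json
import os

# Sketch: merge model metadata (Step 2) with benchmark accuracies (Step 1)
# into a single leaderboard submission record. All names are illustrative.
model_key = "google_gemma-3-12b-it"  # hypothetical file naming convention

with open(f"evalita_llm_requests/{model_key}.json") as f:
    metadata = json.load(f)

# Placeholder values standing in for the accuracies parsed from the
# lm-eval-harness output saved in evalita_llm_models_output.
accuracies = {"evalita-mp_0_shot": None, "evalita-mp_5_shot": None}

submission = {**metadata, "results": accuracies}

os.makedirs("evalita_llm_results", exist_ok=True)
with open(f"evalita_llm_results/{model_key}.json", "w") as f:
    json.dump(submission, f, indent=2)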
4) Updating the Hugging Face Repository | |
The evalita_llm_results repository on Hugging Face must be updated with the newly generated files from Step 3.
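A minimal way to push these files from a script is sketched below using huggingface_hub's upload_folder; the repository id and repo_type are assumptions and must be adapted to the actual evalita_llm_results repository:
from huggingface_hub import HfApi

# Sketch: upload the generated result files to the Hub.
# Requires a token with write access (e.g., obtained via `huggingface-cli login`).
api = HfApi()
api.upload_folder(
    folder_path="evalita_llm_results",
    repo_id="your-org/evalita_llm_results",  # hypothetical repository id
    repo_type="dataset",                     # assumption; adjust if the repo is a different type
    commit_message="Add evaluation results for google/gemma-3-12b-it",
)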
5) Running the Leaderboard Application | |
Finally, execute the leaderboard application by running: | |
python app.py | |