open-llm-leaderboard/open_llm_leaderboard · Feature request: Progress meter/indicator (for currently running models)

Jul 11

For the currently running models, could there be a progress indicator showing which task/sub-test each one is on?

This could be especially useful for:

models that take a long time to finish running, so we know they're not stuck or frozen, and to help with ETAs
models that repeatedly fail, so we can see whether they always fail at the same test/spot or at random, and whether the same thing happens with other models. That would help diagnose and debug failures, and potentially improve the tests/setup to mitigate them

CombinHorizon changed discussion title from Feature request: Progress meter/indicator for currently running models) to Feature request: Progress meter/indicator (for currently running models) Jul 11

clefourrier

Open LLM Leaderboard org Jul 11

Hi!
It would be very hard to implement.
Example: sometimes, a model (even small) will run for several days. What is happening?
The node on which it is running gets preempted, then the job gets rescheduled, starts running again from scratch, gets preempted again, etc. This is entirely due to cluster usage, which is intense at the moment. Some research teams have to wait several days just to launch their experiments.
So it would be impossible to give a good ETA at the moment.

For model failures, they can be due to:

not following the submission instructions (model not in safetensors, config not properly parametrized, etc.)
a hardware or network failure on our side
an OOM error for the bigger models (this was not possible on the previous leaderboard as we used to run in batch size 1, but now we run with automatic batch size detection)

@alozowski is going to set up an internal logging system to help us track model failures more easily (we do it manually for the moment), but we'll see later whether it makes sense to make it public.

As an order of magnitude, a normal 70B model (not a MoE or structured state model) should take about 20h to evaluate, and a 7B about 2h. If it takes longer, there has been some rescheduling going on.
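
For anyone who wants to sanity-check their own submission against those two figures, here is a rough sketch. Only the 7B/2h and 70B/20h reference points come from the reply above. The linear scaling between them, the function names, and the `slack` factor are assumptions made for illustration, not anything the leaderboard actually uses, and none of it applies to MoE or structured state models.

```python
# Reference points quoted in the thread: a 7B model takes ~2h, a 70B ~20h.
# Both lie on the line hours = params_b * 2 / 7, so this sketch assumes
# (hypothetically) that evaluation time scales linearly with parameter count.
HOURS_PER_7B = 2.0


def expected_hours(params_b: float) -> float:
    """Very rough expected evaluation time, in hours, for a dense model."""
    if params_b <= 0:
        raise ValueError("params_b must be positive")
    return params_b * HOURS_PER_7B / 7.0


def likely_rescheduled(params_b: float, elapsed_hours: float, slack: float = 1.5) -> bool:
    """Guess whether a run has been preempted and restarted.

    `slack` is an arbitrary tolerance (assumption): a run is only flagged
    once it has taken more than `slack` times the rough estimate.
    """
    return elapsed_hours > slack * expected_hours(params_b)


if __name__ == "__main__":
    for size, elapsed in [(7, 2.5), (70, 72.0)]:
        estimate = expected_hours(size)
        flagged = likely_rescheduled(size, elapsed)
        print(f"{size}B: estimate {estimate:.1f}h, elapsed {elapsed:.1f}h, rescheduled? {flagged}")
```

A run flagged this way is not necessarily broken. Per the explanation above, it most likely means the node was preempted and the evaluation restarted.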

clefourrier changed discussion status to closed Jul 11