Benchmark weight
Hello,
I believe averaging benchmark scores is not a fair way to compare models: the MMLU medical subsets have around 100 questions each, whereas MedQA has around 10,000 questions. MMLU currently carries too much weight in the average. I see two options:
- Merge all MMLU scores into a single benchmark score, "MMLU medical".
- Multiply each benchmark score by its number of questions and use a weighted average (a rough sketch follows below).
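For concreteness, here is a minimal sketch of the second option; the benchmark names, scores, and question counts below are illustrative placeholders rather than the leaderboard's actual values:

```python
# Illustrative sketch of the proposed question-count weighting; the benchmark
# names, scores, and test-set sizes are placeholders, not the leaderboard's
# real numbers.
scores = {"MMLU Anatomy": 0.72, "MedQA": 0.65, "MedMCQA": 0.61}      # accuracy per benchmark
n_questions = {"MMLU Anatomy": 135, "MedQA": 1273, "MedMCQA": 4183}  # number of questions

# Plain average: every benchmark counts equally, regardless of its size.
plain_avg = sum(scores.values()) / len(scores)

# Weighted average: each benchmark contributes in proportion to its question
# count, which is equivalent to pooling all questions and scoring them as one set.
total_questions = sum(n_questions.values())
weighted_avg = sum(scores[b] * n_questions[b] for b in scores) / total_questions

print(f"plain average:    {plain_avg:.3f}")
print(f"weighted average: {weighted_avg:.3f}")
```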
Hello,
Thank you for your insights on benchmark weighting. We agree that simply averaging scores across benchmarks like MMLU may not yield an accurate evaluation, given the differing numbers of questions.
While we won't be merging all MMLU scores into a single score—as we prefer to evaluate LLMs on an even more fine-grained level—it's clear that using a weighted average based on the number of questions is necessary. We're currently developing this adjustment and plan to implement it soon to ensure a fair comparison across all benchmarks.
Let us know if you disagree; we would love to hear your take!
Hello,
I agree with you that keeping granularity in the results can be useful and that a weighted average is probably the best approach. I would also suggest a tunable parameter to adjust the per-question weight of each benchmark. For example, PubMedQA and MedMCQA are not at the level of quality of MedQA or MMLU.
Maybe in the future, putting more emphasis on MedQA and MMLU, as they are very high quality and sourced from the USMLE, would make more sense.
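Something along these lines, for example; the quality multipliers here are made-up placeholders just to illustrate the knob:

```python
# Sketch of a tunable per-benchmark weight on top of the question counts.
# The quality multipliers are arbitrary placeholders, not a recommendation.
scores      = {"MMLU Anatomy": 0.72, "MedQA": 0.65, "PubMedQA": 0.74}
n_questions = {"MMLU Anatomy": 135, "MedQA": 1273, "PubMedQA": 500}
quality     = {"MMLU Anatomy": 1.0, "MedQA": 1.0, "PubMedQA": 0.5}  # tunable knob per benchmark

# Effective weight = question count x quality multiplier; setting every
# multiplier to 1.0 recovers the plain question-count weighting.
weights = {b: n_questions[b] * quality[b] for b in scores}
weighted_avg = sum(scores[b] * weights[b] for b in scores) / sum(weights.values())
print(f"quality-weighted average: {weighted_avg:.3f}")
```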
Is there an ETA for this? Models like OpenBioLLM are already exploiting this flaw to do leaderboard hacking.
Just realized the same people who run the leaderboard are the ones who made OpenBioLLM and its very suspicious results.
I would advise everyone to stay away from this leaderboard and anything produced by this org, considering their unethical practices.
Thank you for your feedback regarding the leaderboard, @maximegmd.
I appreciate you taking the time to share your concerns and thoughts on this matter.
I would like to clarify that the leaderboard has been widely adopted in the industry for several years and has been used by hundreds of models and papers.
Moreover, it was actually proposed by Google and has been used to evaluate their own models, including MedPaLM [https://arxiv.org/abs/2212.13138] and MedPaLM-2 [https://arxiv.org/abs/2305.09617]. Even OpenAI has used this leaderboard to evaluate GPT-4 [https://arxiv.org/abs/2303.13375]. These leading organizations have used it to evaluate their models, and it has become a standard benchmark in the field.
If you are not satisfied with the current benchmark, I would respectfully suggest reaching out to the authors of the Google papers or proposing a new leaderboard of your own. Your contribution to the field would be highly valued and appreciated.
Furthermore, we don't "own" the benchmark; we just added the GPU and the Hugging Face Space to make the leaderboard accessible to researchers. It is important to note that OpenBioLLM is owned by Saama, not by Open Life Science AI (OLSA). OLSA simply aims to provide resources and support to the research community.
I would be more than happy to clarify any further doubts or questions you may have. Please feel free to reach out to me anytime.
Thank you once again for your feedback. We value your input and look forward to continuing to improve and advance the field together.
Hello, we are discussing the score weighting internally; unfortunately, we cannot share an ETA just yet (correct me if I'm wrong, @aaditya).
Maybe in the future, putting more emphasis on MedQA and MMLU, as they are very high quality and sourced from the USMLE, would make more sense.
This is, unfortunately, a debatable claim. USMLE-based benchmarks, despite being used as an exam for human clinicians, do not guarantee alignment to clinical settings. At the time of writing, we don't know which benchmarks are the most representative of clinical settings. Putting more emphasis on specific datasets without knowing the benefit would unnecessarily skew the medical NLP field. Let me know if you disagree!
Just realized the same people who run the leaderboard are the ones who made OpenBioLLM and its very suspicious results.
This benchmark is developed by people NOT only within the Open Life Science AI organisation, with the hope of inviting initiatives from everyone. I'm NOT involved in the OpenBioLLM project in any capacity, yet I'm involved in this leaderboard project. We have not engaged in any unethical practices. We ran the evaluations under the same settings as for the other models. If you suspect any questionable practices (e.g., you failed to reproduce the results while running on the same setup), please reach out to the OpenBioLLM contributors rather than this leaderboard project, as we, the leaderboard maintainers, will always remain impartial.
Our goal is to invite constructive discussions to advance medical NLP, and it's very upsetting to smear a good discussion with such a statement.
No unethical practices, yet the person deciding the weight of each benchmark on the leaderboard submits models and uses it to claim surpassing GPT-4? You have got to be kidding, right?
This is, unfortunately, a debatable claim. USMLE-based benchmarks, despite being used as an exam for human clinicians, do not guarantee alignment to clinical settings.
It isn't: just open MedMCQA and have a look for yourself; most questions aren't even OCRed correctly.
Putting more emphasis on specific datasets without knowing the benefit would unnecessarily skew the medical NLP field.
But giving a question about soil in MMLU biology the same weight as 150 USMLE questions sounds fine to you?
If you are not satisfied with the current benchmark, I would respectfully suggest reaching out to the authors of the Google papers or proposing a new leaderboard of your own. Your contribution to the field would be highly valued and appreciated.
I treat patients, not benchmarks; the fact that these benchmarks are considered standard is quite alarming. Also, all serious work employs doctors to perform manual, blinded evaluations before making any claims; see Google or Meditron.
Our goal is to invite constructive discussions to advance medical NLP, and it's very upsetting to smear a good discussion with such a statement.
I stand by what I said, there is a serious conflict of interest.
@maximegmd I appreciate you sharing your concerns about the benchmark. You raise some fair points. However, we don't have the time to propose a new benchmark or change the subjects at the moment. This benchmark has been used by many models for a long time and was proposed by highly qualified researchers who have been conducting research in this area for years. Changing the benchmark now would not allow for a fair comparison with the models that have already reported their scores on these datasets. If participating in or using this benchmark is causing you significant issues, I would gently suggest that the simplest solution may be to opt out of using it. The wonderful thing about open-source projects is that participation is entirely voluntary: no one is under any obligation to use them.
That being said, if you strongly believe that the current benchmark has significant limitations or biases in how it weights different components, then proposing a new, fairly weighted benchmark could indeed be a valuable contribution to the ML community.
Creating a well-designed benchmark that addresses the issues you've identified and provides a more balanced and representative evaluation could help advance the field in a positive direction. Moreover, making it easily accessible to other researchers via a HuggingFace Space with voluntary GPU support would further enhance its impact and encourage widespread adoption.
The only relevant way to evaluate models is a clinical trial.
This conversation is a clear demonstration of the disconnect between pure computer science and medicine, and given the reactions, I don't think there is a way to bridge this gap here.