clem posted an update 3 days ago:
What are you using to evaluate models or AI systems? So far we're building lighteval & leaderboards on the Hub, but it still feels early & there's a lot more to build. What would be useful to you?

I'm using https://artificialanalysis.ai/ just because it puts everything in one place! It's not the best resource but these days I'm all about saving time.

The biggest pain point is still inference providers. Even decent labs like Ai2 or THUDM have to lobby for inference support. My leaderboard is for web developers, but I can only evaluate the most visible models, the ones with token API support. https://huggingface.co/spaces/onekq-ai/WebApp1K-models-leaderboard

Maybe some players have the GPUs but keep the results to themselves. We can only hope they will give back in return for what they gain from this community.

I call mine Artificial Human Alignment, but it could also be called liberating knowledge. Humans want to live free, happy, and healthy.

https://huggingface.co/blog/etemiz/aha-leaderboard

Typically, I shortlist potentially suitable models for a given task by running them on a small dataset, then, among the suitable variants, use the one with the best speed or tokens per dollar.
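
In code, that two-step selection might look roughly like the sketch below (purely illustrative: the `Candidate` type, the `run_model` callable, the pricing field, and the 0.9 quality bar are all assumptions, not anything from this thread):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    name: str
    usd_per_million_tokens: float  # illustrative price field

def accuracy(candidate: Candidate,
             dataset: list[tuple[str, str]],
             run_model: Callable[[str, str], str]) -> float:
    """Fraction of (prompt, expected) pairs the model answers correctly."""
    correct = sum(run_model(candidate.name, prompt) == expected
                  for prompt, expected in dataset)
    return correct / len(dataset)

def pick_model(candidates: list[Candidate],
               dataset: list[tuple[str, str]],
               run_model: Callable[[str, str], str],
               quality_bar: float = 0.9) -> Candidate | None:
    # Step 1: shortlist models that clear the quality bar on the small dataset.
    suitable = [c for c in candidates
                if accuracy(c, dataset, run_model) >= quality_bar]
    # Step 2: among the suitable variants, take the best tokens per dollar.
    return min(suitable, key=lambda c: c.usd_per_million_tokens, default=None)
```

Passing `run_model` in as a callable keeps the sketch independent of any particular inference provider.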


I really appreciate the tools and courses built by the Hugging Face team! I'm still learning the ropes of AI systems, and I'm happy to be part of this collaborative community.

I feel that current evaluations rely heavily on qualitative feedback, when they could be designed to prioritize validity against business outcomes. The most useful tool for me would be a solution similar to Job Evaluation, but for AI models.

For example: in the past, when we first started evaluating people for jobs, we began by evaluating each person and their capabilities. That approach lacked a formal, transparent structure and fairness.

Today we start by evaluating the job itself, using a Job Evaluation methodology like Mercer, Hay, or Towers Watson: a point-based system that is scalable, assigns each job a weight, and can be graded against real business outcomes like revenue.

I believe AI needs a similar shift. Instead of comparing every new model against multiple benchmarks, I would like an enterprise tool that starts by evaluating my own business task: the role AI is meant to play inside the enterprise or ecosystem. Then we can assess what type of model fulfills that role, and at what proficiency level, just as we compare candidates against a competency model.
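
A toy version of that role-first scoring might look like this (a minimal sketch: the competency names, weights, and per-model scores below are invented for illustration and don't come from any real methodology's point tables):

```python
# Weighted competency profile for the role the AI is meant to fill.
# In a real system these would come from a Job Evaluation exercise.
role_profile = {
    "domain_reasoning": 0.4,
    "tool_use": 0.3,
    "latency_tolerance": 0.1,
    "cost_efficiency": 0.2,
}

# Per-model proficiency on each competency, on a 0-1 scale (made-up numbers).
model_scores = {
    "model_a": {"domain_reasoning": 0.9, "tool_use": 0.6,
                "latency_tolerance": 0.8, "cost_efficiency": 0.5},
    "model_b": {"domain_reasoning": 0.7, "tool_use": 0.8,
                "latency_tolerance": 0.9, "cost_efficiency": 0.9},
}

def role_fit(scores: dict[str, float], profile: dict[str, float]) -> float:
    """Weighted sum of competency scores: how well a model fits this role."""
    return sum(weight * scores.get(comp, 0.0)
               for comp, weight in profile.items())

# Rank models by fit to the role, not by generic benchmark averages.
ranked = sorted(model_scores,
                key=lambda m: role_fit(model_scores[m], role_profile),
                reverse=True)
print(ranked)
```

The point is the ordering of steps: the weighted profile for the role is fixed first, and models are then graded against that profile, mirroring how candidates are compared against a competency model.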

I would love a tool that simply takes my current organizational competency model and process-mining data as input, and outputs an evaluation of how different models could impact my own unique business objectives.