clem posted an update 3 days ago:
What are you using to evaluate models or AI systems? So far we're building lighteval & leaderboards on the Hub, but it still feels early & there's a lot more to build. What would be useful to you?

I'm using https://artificialanalysis.ai/ just because it puts everything in one place! It's not the best resource but these days I'm all about saving time.

The biggest pain point is still inference providers. Even decent labs like Ai2 or THUDM have to lobby for inference support. My leaderboard is for web developers, but I can only evaluate the most visible models, the ones with token API support. https://huggingface.co/spaces/onekq-ai/WebApp1K-models-leaderboard

Maybe some players have the GPUs but keep the results to themselves. We can only hope they will give back in return for what they gain from this community.

I call mine Artificial Human Alignment, but it could also be called liberating knowledge. Humans want to live free, happy, and healthy.

https://huggingface.co/blog/etemiz/aha-leaderboard

Typically, I shortlist potentially suitable models for a given task by running them on a small dataset, then, among the suitable variants, use the one with the best speed or tokens per dollar.
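
In code, that two-step selection might look roughly like the sketch below (purely illustrative: the `Candidate` type, the `run_model` callable, the pricing field, and the 0.9 quality bar are all assumptions, not anything from this thread):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    name: str
    usd_per_million_tokens: float  # illustrative price field

def accuracy(candidate: Candidate,
             dataset: list[tuple[str, str]],
             run_model: Callable[[str, str], str]) -> float:
    """Fraction of (prompt, expected) pairs the model answers correctly."""
    correct = sum(run_model(candidate.name, prompt) == expected
                  for prompt, expected in dataset)
    return correct / len(dataset)

def pick_model(candidates: list[Candidate],
               dataset: list[tuple[str, str]],
               run_model: Callable[[str, str], str],
               quality_bar: float = 0.9) -> Candidate | None:
    # Step 1: shortlist models that clear the quality bar on the small dataset.
    suitable = [c for c in candidates
                if accuracy(c, dataset, run_model) >= quality_bar]
    # Step 2: among the suitable variants, take the best tokens per dollar.
    return min(suitable, key=lambda c: c.usd_per_million_tokens, default=None)
```

Passing `run_model` in as a callable keeps the sketch independent of any particular inference provider.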


I really appreciate the tools and courses built by the Hugging Face team! I'm still learning the ropes of AI systems, and I'm happy to be part of this collaborative community.

I feel that current evaluations rely heavily on qualitative feedback, when they could be designed to prioritize validity against business outcomes. The most useful tool for me would be a solution similar to Job Evaluation, but for AI models.

For example: in the past, when we first started evaluating people for jobs, we began by evaluating each person and their capabilities. That approach lacked a formal, transparent structure and fairness.

Today we start by evaluating the job itself, using a Job Evaluation methodology like Mercer, Hay, or Towers Watson: a point-based system that is scalable, assigns each job a weight, and can be graded against real business outcomes like revenue.

I believe AI needs a similar shift. Instead of comparing every new model against multiple benchmarks, I would like an enterprise tool that starts by evaluating my own business task: the role AI is meant to play inside the enterprise or ecosystem. Then we can assess what type of model fulfills that role, and at what proficiency level, just as we compare candidates against a competency model.
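
A toy version of that role-first scoring might look like this (a minimal sketch: the competency names, weights, and per-model scores below are invented for illustration and don't come from any real methodology's point tables):

```python
# Weighted competency profile for the role the AI is meant to fill.
# In a real system these would come from a Job Evaluation exercise.
role_profile = {
    "domain_reasoning": 0.4,
    "tool_use": 0.3,
    "latency_tolerance": 0.1,
    "cost_efficiency": 0.2,
}

# Per-model proficiency on each competency, on a 0-1 scale (made-up numbers).
model_scores = {
    "model_a": {"domain_reasoning": 0.9, "tool_use": 0.6,
                "latency_tolerance": 0.8, "cost_efficiency": 0.5},
    "model_b": {"domain_reasoning": 0.7, "tool_use": 0.8,
                "latency_tolerance": 0.9, "cost_efficiency": 0.9},
}

def role_fit(scores: dict[str, float], profile: dict[str, float]) -> float:
    """Weighted sum of competency scores: how well a model fits this role."""
    return sum(weight * scores.get(comp, 0.0)
               for comp, weight in profile.items())

# Rank models by fit to the role, not by generic benchmark averages.
ranked = sorted(model_scores,
                key=lambda m: role_fit(model_scores[m], role_profile),
                reverse=True)
print(ranked)
```

The point is the ordering of steps: the weighted profile for the role is fixed first, and models are then graded against that profile, mirroring how candidates are compared against a competency model.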

I would love a tool that simply takes my current organizational competency model and process-mining data as input, and outputs an evaluation of how different models could impact my own unique business objectives.