Clémentine Fourrier

clefourrier

AI & ML interests

None yet

Organizations

Hugging Face, Long Range Graph Benchmark, Evaluation datasets, BigScience: LMs for Historical Texts, HuggingFaceBR4, Huggingface Projects, Open Graph Benchmark, HuggingFaceGECLM, Pretrained Graph Transformers, Graph Datasets, BigCode, Hugging Face H4, InternLM, Vectara, GAIA, Hugging Face Smol Cluster, plfe, Open LLM Leaderboard, Qwen, Secure Learning Lab, Open Life Science AI, LLM360, TTS Eval (OLD), hallucinations-leaderboard, Bias Leaderboard Development, Leaderboard Organization, Demo Leaderboard, Demo leaderboard with an integrated backend, gg-hf, AIM-Harvard, Clinical & Biomedical ML Leaderboards, Women on Hugging Face, LMLLO2, Lighthouz AI, Open Arabic LLM Leaderboard, mx-test, IBM Granite, FineData, HF-contamination-detection, TTS AGI, Leader Board Test Org, Social Post Explorers, hsramall, Open RL Leaderboard, The Fin AI, La Leaderboard, Open Hebrew LLM's Leaderboard, gg-tt, HuggingFaceEval, HP Inc., Novel Challenge, Open LLM Leaderboard Archive, LLHF, SLLHF, lbhf, nltpt, Lighteval testing org, CléMax, Hugging Face Science, test_org, Coordination Nationale pour l'IA, LeMaterial, open-llm-leaderboard-react, Prompt Leaderboard, UBC-NLP Collaborations, smolagents, Your Bench, leaderboard explorer, Open R1, SIMS, OpenEvals, GeekAgents, piupiu.xyz, PoseidonDemo, interview 4298, LightEval Internal Testing

clefourrier's activity

New activity in gaia-benchmark/GAIA about 5 hours ago
upvoted 3 changelogs about 6 hours ago
- Filter by MCP compatibility available in HF Spaces
- AI-generated Abstract summaries on Hugging Face Papers
- Static Spaces can now have a build step

posted an update about 6 hours ago
Saying Claude 4 is "the best coding model in the world" based on the SWE-bench scores is super misleading, and here is why:

If you look at the announcement table, their model has the best scores, but... if you look at the very bottom, in tiny size-4 font, you'll see that the metric they report for it is actually not the same metric as the one used for the other models!


Comparing "pass@1 averaged 10 times" to "normal pass@1" is like grading one student by letting them take the test 10 times and averaging their scores, while the other students only get a single attempt.

The first way to grade (avg@10) is actually quite good statistically, much better than what model creators usually report, because models tend to be quite inconsistent - sometimes good, sometimes bad...
But! You then want to do it for all models, and report the results with error bars.
The issue is that, if you do... well, it's going to be harder to say your model is the best, because the error bars between models will overlap, by a lot.
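To make the gap concrete, here is a minimal sketch of the two scoring schemes, including the error bar you would want on avg@10 - my own illustration with made-up numbers, not the announcement's evaluation code:

```python
# Minimal sketch, not the announcement's evaluation code: each problem result
# is 0/1, one run of the benchmark gives a single pass@1, ten independent runs
# give avg@10 plus an error bar (standard error of the mean).
import statistics

def pass_at_1(results):
    """results: list of 0/1 outcomes, one per problem, for a single run."""
    return sum(results) / len(results)

def avg_at_k(runs):
    """runs: k result lists, one per independent run of the whole benchmark."""
    scores = [pass_at_1(r) for r in runs]
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # error bar on the mean
    return mean, sem

# Made-up numbers: model A scored once, model B scored over 10 runs.
model_a_single_run = [1, 0, 1, 1, 0, 1, 0, 1]
model_b_runs = [[1, 1, 0, 1, 0, 1, 1, 0], [1, 0, 1, 1, 0, 1, 1, 1]] * 5

print("model A, plain pass@1:", pass_at_1(model_a_single_run))
print("model B, avg@10 ± SEM:", avg_at_k(model_b_runs))
```

If every model in the table came with that error bar, a lot of "best model" claims would turn into "indistinguishable from the next few models".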

Also, you'll see that 2 numbers are reported: the first one uses avg@10 (what I explained above), and the second, higher one uses this plus many other tricks:
- test time compute (so having the model generate a tree of answers and selecting the best as you go, more or less)
- removing the times when the model breaks the tests
- and using another model to select the most promising solution!
You can't really use that number to say it's better than the rest, mostly because it's **way less efficient** at reaching a similar result.
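For reference, that kind of scaffold looks roughly like the sketch below - a hypothetical illustration, where generate, passes_tests and ranker_score are stand-ins for whatever sampling, test-running and reranking components are actually used (none of these names come from the announcement):

```python
def solve_with_extra_compute(problem, generate, passes_tests, ranker_score, n_candidates=16):
    """Hypothetical best-of-n scaffold, not Anthropic's actual pipeline.

    generate(problem, n)     -> n candidate patches (the test-time compute part)
    passes_tests(problem, c) -> False if candidate c breaks the existing tests
    ranker_score(problem, c) -> score from a second model used to pick a winner
    """
    candidates = generate(problem, n_candidates)                     # sample many answers
    survivors = [c for c in candidates if passes_tests(problem, c)]  # drop test-breaking ones
    if not survivors:
        return None
    return max(survivors, key=lambda c: ranker_score(problem, c))    # let another model choose
```

Every one of those steps burns extra compute per problem, which is the efficiency point: a plain single-sample pass@1 score and this scaffold are not doing remotely the same amount of work.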

It's honestly a bit sad because, from user reports, the model sounds good - however, this announcement overblows the numbers, and I'm quite sure it's more a problem of "too much marketing" than of "bad science".

Another thing which makes the comparison invalid is the complete absence of open-source models from the report - are they not aware of DeepSeek, Qwen, the new Mistral for code, and all the cool specialised models found on the Hub?
reacted to fdaudens's post with ❤️ about 6 hours ago
Here’s what happens when a national institution builds its own digital intelligence: France’s Ministry of Culture just released 17K+ conversations from real users testing 30+ chatbots in French. Raw, diverse, and a goldmine for studying LLMs in the wild.

ministere-culture/comparia-conversations
New activity in gaia-benchmark/GAIA about 10 hours ago

hanson

#24 opened about 13 hours ago by hanson888
replied to their post about 10 hours ago

Hi! The official GAIA leaderboard is unrelated to the test leaderboard for the agents course :)
You should contact @burtenshaw, who's managing the course :)