lbhf (lbhf)

hfkwr

authored a paper 3 months ago

Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

Paper • 2509.04501 • Published Sep 2, 2025 • 1

lysandre

posted an update 4 months ago

Post

7430

We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez !

v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025.

Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!


clefourrier

posted an update 8 months ago

Post

2129

Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval actually tells you precisely, during training, how well & what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers in depth prompt choice, metrics, dataset, across languages/capabilities, and my fave section is "which properties should evals have"👌
(so you know how to select the best evals for your use case)

Blog: HuggingFaceFW/blogpost-fine-tasks


pcuenq

authored a paper 9 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7, 2025 • 202

clefourrier

posted an update 10 months ago

Post

2668

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparisons will be apples to apples, but in a potentially suboptimal way for a given model family (just as some users interact suboptimally with models).

It also contains a cool section (6) on training data memorization rate! It's important to see whether your model will output the training data it has seen verbatim: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.

lysandre

posted an update 11 months ago

Post

8252

SmolVLM-2 and SigLIP-2 are now part of transformers in dedicated releases!

They're added on top of the v4.49.0 release, and can be installed from the following tags: v4.49.0-SmolVLM-2 and v4.49.0-SigLIP-2.
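As a rough illustration of installing from one of these tags, here is a minimal sketch that builds the pip requirement string for a given tag. The helper name `pip_requirement` is hypothetical, and it assumes the tags live on the main `huggingface/transformers` GitHub repository and are installed through pip's standard `git+URL@ref` syntax.

```python
# Assumption: the tags are published on the main transformers GitHub repo.
REPO_URL = "https://github.com/huggingface/transformers"


def pip_requirement(tag: str) -> str:
    """Return a pip VCS requirement that pins transformers to a git tag."""
    return f"git+{REPO_URL}@{tag}"


# Print the install command for each of the two tags named in the post.
for tag in ("v4.49.0-SmolVLM-2", "v4.49.0-SigLIP-2"):
    print("pip install", pip_requirement(tag))
```

Running the printed command in a shell would install transformers at that tag instead of the latest PyPI release.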

This marks a new beginning for the release process of transformers. For the past five years, we've been doing monthly releases featuring many models (v4.49.0, the latest release, features 9 new architectures).

Starting with SmolVLM-2 & SigLIP-2, we'll now additionally release tags supporting new models on a stable branch. These models are therefore directly available for use by installing from the tag itself. These tags will continue to be updated with fixes applied to these models.

Going forward, continue to expect software releases following semantic versioning: v4.50.0 will have ~10 new architectures compared to v4.49.0, as well as a myriad of new features, improvements and bug fixes. Accompanying these software releases, we'll release tags offering brand new models as fast as possible, to make them accessible to all immediately.


clefourrier

authored a paper 11 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4, 2025 • 253

clefourrier

authored a paper about 1 year ago

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Paper • 2412.03304 • Published Dec 4, 2024 • 19

wukaixingxp

updated a dataset about 1 year ago

meta-llama/Llama-3.3-70B-Instruct-evals

Viewer • Updated Dec 6, 2024 • 41.3k • 108 • 42

SaylorTwift

posted an update about 1 year ago

Post

1138

How do I test an LLM for my unique needs?
If you work in finance, law, or medicine, generic benchmarks are not enough.
This blog post uses Argilla, distilabel and 🌤️Lighteval to generate an evaluation dataset and evaluate models.

https://github.com/argilla-io/argilla-cookbook/blob/main/domain-eval/README.md

wukaixingxp

authored a paper over 1 year ago

The Llama 3 Herd of Models

Paper • 2407.21783 • Published Jul 31, 2024 • 117

clefourrier

authored 2 papers over 1 year ago

The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models

Paper • 2404.05904 • Published Apr 8, 2024 • 9

GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 244

pcuenq

posted an update over 1 year ago

Post

10253

OpenELM in Core ML

Apple recently released a set of efficient LLMs in sizes varying between 270M and 3B parameters. Their quality, according to benchmarks, is similar to OLMo models of comparable size, but they required half the pre-training tokens because they use layer-wise scaling, where the number of attention heads increases in deeper layers.

I converted these models to Core ML, for use on Apple Silicon, using this script: https://gist.github.com/pcuenca/23cd08443460bc90854e2a6f0f575084. The converted models were uploaded to this community in the Hub for anyone who wants to integrate them into their apps: corenet-community/openelm-core-ml-6630c6b19268a5d878cfd194

The conversion was done with the following parameters:
- Precision: float32.
- Sequence length: fixed to 128.

With swift-transformers (https://github.com/huggingface/swift-transformers), I'm getting about 56 tok/s with the 270M on my M1 Max, and 6.5 with the largest 3B model. These speeds could be improved by converting to float16. However, there's some precision loss somewhere and generation doesn't work in float16 mode yet. I'm looking into this and will keep you posted! Or take a look at this issue if you'd like to help: https://github.com/huggingface/swift-transformers/issues/95
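One reason float16 may help is that it halves the bytes stored per weight, so less data has to be read for each generated token. The back-of-the-envelope sketch below only estimates weight storage. It uses the nominal parameter counts quoted above (the exact counts may differ), ignores activations and any cache, and the helper name `weight_size_gb` is made up for illustration.

```python
# Bytes needed to store one weight in each precision.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2}

# Nominal parameter counts taken from the model names (assumption: the
# exact counts differ slightly from these round numbers).
NOMINAL_PARAMS = {"270M": 270_000_000, "3B": 3_000_000_000}


def weight_size_gb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


for name, num_params in NOMINAL_PARAMS.items():
    fp32 = weight_size_gb(num_params, "float32")
    fp16 = weight_size_gb(num_params, "float16")
    print(f"{name}: float32 ~{fp32:.2f} GB, float16 ~{fp16:.2f} GB")
```

This is only a storage estimate, not a prediction of tok/s: actual speedups depend on the hardware and on where the time is really spent.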

I'm also looking at optimizing inference using an experimental kv cache in swift-transformers. It's a bit tricky because the layers have a varying number of attention heads, but I'm curious to see how much this feature can accelerate performance in this model family :)

Regarding the instruct fine-tuned models, I don't know the chat template that was used. The models use the Llama 2 tokenizer, but neither the Llama 2 chat template nor the default Alignment Handbook one that was used for training is recognized. Any ideas on this are welcome!


clefourrier

posted an update over 1 year ago

Post

6169

In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm

clefourrier

posted an update over 1 year ago

Post

4797

Contamination-free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only on new problems, published after the model's training data was collected. This means... contamination-free code evals! 🚀
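To make the idea concrete, here is a minimal sketch of date-based filtering. It is not LiveCodeBench's actual code: the per-problem results, the cutoff date, and the helper `score_after` are all invented for illustration.

```python
from datetime import date

# Hypothetical per-problem results: (publication date, did the model pass?).
results = [
    (date(2023, 5, 1), True),
    (date(2023, 9, 15), True),
    (date(2024, 1, 10), False),
    (date(2024, 3, 2), True),
]


def score_after(results, cutoff):
    """Pass rate over problems published strictly after `cutoff`."""
    fresh = [passed for published, passed in results if published > cutoff]
    if not fresh:
        raise ValueError("no problems published after the cutoff")
    return sum(fresh) / len(fresh)


# Assumed training-data cutoff for some model (made up for this example).
training_cutoff = date(2023, 12, 31)

overall = sum(passed for _, passed in results) / len(results)
print(f"all problems: {overall:.2f}")
print(f"after cutoff: {score_after(results, training_cutoff):.2f}")
```

On this toy data the score drops once older problems are excluded, which is the kind of gap that would hint at contamination.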

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!

clefourrier

posted an update over 1 year ago

Post

2279

🆕 Evaluate your RL agents - who's best at Atari?🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard

clefourrier

posted an update over 1 year ago

Post

2265

Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from `Prompt question?` to `Question: prompt question?\nChoices: enumeration of all choices\nAnswer: `), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

In the plot, the prompt format is on the x axis; all these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
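For a feel of what "only changing the prompt" means, here is a small sketch of three formats in the spirit of the ones above, applied to one multiple-choice sample. The question, the choices, and the function names are made up, and these are not the exact templates used in the experiment.

```python
# Toy multiple-choice sample (invented for illustration).
question = "What is the capital of France?"
choices = ["Paris", "Rome", "Berlin", "Madrid"]


def bare(question, choices):
    """Just the question, with no scaffolding."""
    return question


def question_prefix(question, choices):
    """Question and answer markers, but no choices listed."""
    return f"Question: {question}\nAnswer:"


def with_choices(question, choices):
    """Question, lettered enumeration of all choices, then the answer marker."""
    listed = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
    return f"Question: {question}\nChoices:\n{listed}\nAnswer:"


FORMATS = {"bare": bare, "question_prefix": question_prefix, "with_choices": with_choices}

# Same sample, three different prompts.
for name, fmt in FORMATS.items():
    print(f"--- {name} ---")
    print(fmt(question, choices))
```

An evaluation would then score the same model on each variant and compare; the sample itself never changes.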

clefourrier

posted an update almost 2 years ago

Post

2398

Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️the order in which the few shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.

I then compared the results for all these methods, in 5-shot, during 2 runs. The *only difference* between the first and second run being that the samples used in few-shot are not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.
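Here is a minimal sketch of that setup, using the two orders from the example above. The sample texts and the `Question:`/`Answer:` template are placeholders, not the actual MMLU data or the exact template from the experiment.

```python
# Five placeholder few-shot samples, keyed by the letters used above.
SAMPLES = {name: (f"question {name}", f"answer {name}") for name in "ABCDE"}


def build_prompt(order, current_question):
    """Concatenate the few-shot samples in `order`, then the current sample."""
    blocks = [
        f"Question: {SAMPLES[name][0]}\nAnswer: {SAMPLES[name][1]}"
        for name in order
    ]
    blocks.append(f"Question: {current_question}\nAnswer:")
    return "\n\n".join(blocks)


# Same five samples and same current sample; only the order differs.
run_1 = build_prompt("ABCDE", "current sample")
run_2 = build_prompt("DCEAB", "current sample")

print(run_1 == run_2)  # the prompts differ as strings
print(sorted(run_1.split("\n\n")) == sorted(run_2.split("\n\n")))  # same content
```

The two prompts contain exactly the same blocks, so any score difference between the runs comes from ordering alone.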

As you can see in the attached picture, you get a difference of up to 3 points between the two orderings of the few-shot samples.

So, when just changing *the order of the few shot samples* can change your results by several points, what is the impact of all other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or comms).
-> This is why we need reproducible evaluation in a fair and identical setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.


clefourrier

posted an update almost 2 years ago

Post

2041

Are you looking for the perfect leaderboard/arena for your use case? 👀

There's a new tool for this!
https://huggingface.co/spaces/leaderboards/LeaderboardFinder

Select your modality, language, task... then search! 🔍
Some categories of interest:
- does the leaderboard accept submissions?
- is the test set private or public?
- is it using an automatic metric, human evaluators, or llm as a judge?

The spaces list is built from space metadata, and reloaded every hour.

Enjoy!
