Hugging Test Lab

company
Activity Feed

AI & ML interests

Offering a great user experience for organizations on Hugging Face!

Recent Activity

HF-test-lab's activity

jeffboudierย 
posted an update 5 days ago
m-ricย 
posted an update 7 days ago
view post
Post
2454
A new research paper from KAIST builds on smolagents to push boundaries of distillation ๐Ÿฅณ
โžก๏ธ "Distilling LLM Agent into Small Models with Retrieval and Code Tools" teaches that, when trying to distil reasoning capability from a strong LLM ("teacher") into a smaller one ("student"), it's much better to use Agent traces than CoT traces.

Advantages are:
1. Improved generalization
Intuitively, this is because your agent can encounter more "surprising" results by interacting with its environment : for example, a web research called by the LLM teacher in agent mode can bring results that the LLM teacher would not have generated in CoT.

2. Reduce hallucinations
The trace won't hallucinate tool call outputs!

Thank you @akseljoonas for mentioning this paper!
jeffboudierย 
posted an update 10 days ago
view post
Post
457
Wrapping up a week of shipping and announcements with Dell Enterprise Hub now featuring AI Applications, on-device models for AI PCs, a new CLI and Python SDK... all you need for building AI on premises!

Blog post has all the details: https://huggingface.co/blog/dell-ai-applications
reach-vbย 
posted an update 14 days ago
view post
Post
3527
hey hey @mradermacher - VB from Hugging Face here, we'd love to onboard you over to our optimised xet backend! ๐Ÿ’ฅ

as you know we're in the process of upgrading our storage backend to xet (which helps us scale and offer blazingly fast upload/ download speeds too): https://huggingface.co/blog/xet-on-the-hub and now that we are certain that the backend can scale with even big models like Llama 4/ Qwen 3 - we;re moving to the next phase of inviting impactful orgs and users on the hub over as you are a big part of the open source ML community - we would love to onboard you next and create some excitement about it in the community too!

in terms of actual steps - it should be as simple as one of the org admins to join hf.co/join/xet - we'll take care of the rest.

p.s. you'd need to have a the latest hf_xet version of huggingface_hub lib but everything else should be the same: https://huggingface.co/docs/hub/storage-backends#using-xet-storage

p.p.s. this is fully backwards compatible so everything will work as it should! ๐Ÿค—
ยท
jeffboudierย 
posted an update 20 days ago
view post
Post
2560
Transcribing 1 hour of audio for less than $0.01 ๐Ÿคฏ

@mfuntowicz cooked with 8x faster Whisper speech recognition - whisper-large-v3-turbo transcribes at 100x real time on a $0.80/hr L4 GPU!

How they did it: https://huggingface.co/blog/fast-whisper-endpoints

1-click deploy with HF Inference Endpoints: https://endpoints.huggingface.co/new?repository=openai%2Fwhisper-large-v3-turbo&vendor=aws&region=us-east&accelerator=gpu&instance_id=aws-us-east-1-nvidia-l4-x1&task=automatic-speech-recognition&no_suggested_compute=true
m-ricย 
posted an update 20 days ago
view post
Post
2607
๐—”๐—ฏ๐˜€๐—ผ๐—น๐˜‚๐˜๐—ฒ ๐—ญ๐—ฒ๐—ฟ๐—ผ: ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ฐ๐—ฎ๐—ป ๐˜๐—ฟ๐—ฎ๐—ถ๐—ป ๐˜„๐—ถ๐˜๐—ต๐—ผ๐˜‚๐˜ ๐—ฎ๐—ป๐˜† ๐—ฒ๐˜…๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น ๐—ฑ๐—ฎ๐˜๐—ฎ ๐Ÿคฏ

Has the "data wall" just been breached?

Recent RL paradigms often relied on a set of questions an answers that needs to be manually curated. Researchers from Tsinghua University went like "why though".

๐Ÿค” Indeed, why learn from question designed by a human teacher, when the model can start from their base knowledge and learn by experimenting in a code environment, proposing coding tasks themselves and trying to solve them?

Thus they created โ€œAbsolute Zero Reasoningโ€ (AZR), an approach that removes any need for human curated data.

๐ŸŽญ ๐——๐˜‚๐—ฎ๐—น ๐—ฟ๐—ผ๐—น๐—ฒ๐˜€:
โ€ฃ Proposer: Generates challenging but solvable coding tasks
โ€ฃ Solver: Attempts to solve those self-proposed tasks

๐Ÿงช ๐—ง๐—ต๐—ฟ๐—ฒ๐—ฒ ๐˜๐—ฎ๐˜€๐—ธ ๐˜๐˜†๐—ฝ๐—ฒ๐˜€: all types are defined as triplets of program, input and output
โ€ฃ Deduction: Give model an input and program, it must deduce the output
โ€ฃ Abduction: Give model an program and output, it must find the input that gave said output
โ€ฃ Induction: Synthesize a program from input/output pairs
Btw this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world analogies, while Plato was more on the deduction side, trying to progress quite far with just one input and his reasoning.

๐Ÿ“Š ๐—ฅ๐—ฒ๐˜€๐˜‚๐—น๐˜๐˜€:
โ€ฃ AZR post-training creates a nice improvement on known models like Qwen2.5-7B
โ€ฃ Shows strong cross-domain transfer: coding โ†”๏ธ math reasoning

๐Ÿง ๐—ข๐˜๐—ต๐—ฒ๐—ฟ ๐—ณ๐—ถ๐—ป๐—ฑ๐—ถ๐—ป๐—ด๐˜€:
โ€ฃ Having a better base performance (general or code specific) amplify the gains from Absolute Zero Reasoning
โ€ฃ Researchers warn about "Uh-oh moments" (winking to the "aha moments" of DeepSeek) where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!

Paper here: Absolute Zero: Reinforced Self-play Reasoning with Zero Data (2505.03335)
m-ricย 
posted an update 24 days ago
view post
Post
4403
I've made an open version of Google's NotebookLM, and it shows the superiority of the open source tech task! ๐Ÿ’ช

The app's workflow is simple. Given a source PDF or URL, it extracts the content from it, then tasks Meta's Llama 3.3-70B with writing the podcast script, with a good prompt crafted by @gabrielchua ("two hosts, with lively discussion, fun notes, insightful question etc.")
Then it hands off the text-to-speech conversion to Kokoro-82M, and there you go, you have two hosts discussion any article.

The generation is nearly instant, because:
> Llama 3.3 70B is running at 1,000 tokens/seconds with Cerebras inference
> The audio is generated in streaming mode by the tiny (yet powerful) Kokoro, generating voices faster than real-time.

And the audio generation runs for free on Zero GPUs, hosted by HF on H200s.

Overall, open source solutions rival the quality of closed-source solutions at close to no cost!

Try it here ๐Ÿ‘‰๐Ÿ‘‰ m-ric/open-notebooklm
ยท
jeffboudierย 
posted an update 26 days ago
view post
Post
3011
So many orgs on HF would really benefit from security and governance built into Enterprise Hub - I wrote a guide on why and how upgrade: jeffboudier/how-to-upgrade-to-enterprise

For instance, did you know about Resource Groups?
pagezyhfย 
posted an update about 1 month ago
view post
Post
1976
If you haven't had the chance to test the latest open model from Meta, Llama 4 Maverick, go try it on AMD MI 300 on Hugging Face!

amd/llama4-maverick-17b-128e-mi-amd
m-ricย 
posted an update about 2 months ago
view post
Post
2856
New king of open VLMs: InternVL3 takes Qwen 2.5's crown! ๐Ÿ‘‘

InternVL have been a wildly successful series of model : and the latest iteration has just taken back their crown thanks to their superior, natively multimodal vision training pipeline.

โžก๏ธ Most of the vision language models (VLMs) these days are built like Frankenstein : take a good text-only Large Language Model (LLM) backbone, stitch a specific vision transformer (ViT) on top of it. Then the training is sequential ๐Ÿ”ข : 1. Freeze the LLM weights while you train the ViT only to work with the LLM part, then 2. Unfreeze all weights to train all weights in order to work together.

๐Ÿ’ซ The Shanghai Lab decided to challenge this paradigm and chose this approach that they call "native". For each of their model sizes, they still start from a good LLM (mostly Qwen-2.5 series, did I tell you I'm a huge fan of Qwen? โค๏ธ), and stitch the ViT, but they don't freeze anything : they train all weights together with interleaved text and image understanding data in a single pre-training phase ๐ŸŽจ.

They claim it results in more seamless interactions between modalities. And the results prove them right: they took the crown of top VLMs, at nearly all sizes, from their Qwen-2.5 parents. ๐Ÿ‘‘
  • 2 replies
ยท
jeffboudierย 
posted an update about 2 months ago
view post
Post
2203
Llama4 is out and Scout is already on the Dell Enterprise Hub to deploy on Dell systems ๐Ÿ‘‰ dell.huggingface.co
jeffboudierย 
posted an update 2 months ago
view post
Post
1564
Enterprise orgs now enable serverless Inference Providers for all members
- includes $2 free usage per org member (e.g. an Enterprise org with 1,000 members share $2,000 free credit each month)
- admins can set a monthly spend limit for the entire org
- works today with Together, fal, Novita, Cerebras and HF Inference.

Here's the doc to bill Inference Providers usage to your org: https://huggingface.co/docs/inference-providers/pricing#organization-billing
  • 2 replies
ยท
m-ricย 
posted an update 2 months ago
view post
Post
2397
๐Ÿš€ DeepSeek R1 moment has come for GUI agents: Rule-based Reinforcement Learning gives better results than SFT with 500x smaller datasets!

Traditionally (by which I mean "in the last few months"), GUI agents have been trained with supervised fine-tuning (SFT). This meant, collecting huge datasets of screen captures from people using computers, and using these to fine-tune your model. ๐Ÿ“š

๐Ÿ‘‰ But last week, a new paper introduced UI-R1, applying DeepSeek's R1-style rule-based reinforcement learning (RL) specifically to GUI action prediction tasks.
This is big news: with RL, maybe we could build good agents without the need for huge datasets.

UI-R1 uses a unified reward function that evaluates multiple responses from models, optimizing via policy algorithms like Group Relative Policy Optimization (GRPO).

Specifically, the reward function assesses:
๐ŸŽฏ Action type accuracy: Does the predicted action match the ground truth?
๐Ÿ“ Coordinate accuracy (specifically for clicks): Is the predicted click within the correct bounding box?
๐Ÿ“‘ Output format: Does the model clearly articulate both its reasoning and final action?

Using just 136 carefully selected mobile tasksโ€”compared to 76,000 tasks for larger models like OS-Atlasโ€”UI-R1 shows significant efficiency and improved performance:
๐Ÿ“ˆ Boosted action prediction accuracy from 76% to 89% on AndroidControl.
๐ŸŒ Outperformed larger, SFT-trained models (e.g., OS-Atlas-7B), demonstrating superior results with vastly fewer data points (136 tasks vs. 76K).
๐Ÿ” Enhanced adaptability and generalization, excelling even in out-of-domain scenarios.

The paper tests this RL-based method only in low-level GUI tasks. Could it generalize to more complex interactions? ๐Ÿง

Read the full paper here ๐Ÿ‘‰ UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning (2503.21620)
Wauplinย 
posted an update 2 months ago
view post
Post
2178
โ€ผ๏ธ huggingface_hub's v0.30.0 is out with our biggest update of the past two years!

Full release notes: https://github.com/huggingface/huggingface_hub/releases/tag/v0.30.0.

๐Ÿš€ Ready. Xet. Go!

Xet is a groundbreaking new protocol for storing large objects in Git repositories, designed to replace Git LFS. Unlike LFS, which deduplicates files, Xet operates at the chunk levelโ€”making it a game-changer for AI builders collaborating on massive models and datasets. Our Python integration is powered by [xet-core](https://github.com/huggingface/xet-core), a Rust-based package that handles all the low-level details.

You can start using Xet today by installing the optional dependency:

pip install -U huggingface_hub[hf_xet]


With that, you can seamlessly download files from Xet-enabled repositories! And donโ€™t worryโ€”everything remains fully backward-compatible if youโ€™re not ready to upgrade yet.

Blog post: https://huggingface.co/blog/xet-on-the-hub
Docs: https://huggingface.co/docs/hub/en/storage-backends#xet


โšก Inference Providers

- Weโ€™re thrilled to introduce Cerebras and Cohere as official inference providers! This expansion strengthens the Hub as the go-to entry point for running inference on open-weight models.

- Novita is now our 3rd provider to support text-to-video task after Fal.ai and Replicate.

- Centralized billing: manage your budget and set team-wide spending limits for Inference Providers! Available to all Enterprise Hub organizations.

from huggingface_hub import InferenceClient
client = InferenceClient(provider="fal-ai", bill_to="my-cool-company")
image = client.text_to_image(
    "A majestic lion in a fantasy forest",
    model="black-forest-labs/FLUX.1-schnell",
)
image.save("lion.png")


- No more timeouts when generating videos, thanks to async calls. Available right now for Fal.ai, expecting more providers to leverage the same structure very soon!
ยท
m-ricย 
posted an update 3 months ago
view post
Post
5061
smolagents now support vLLM! ๐Ÿฅณ

As one of the most popular local inference solutions, the community had been asking us to integrate vLLM: after a heavy refactoring of our LLM classes, we've just released smolagents 1.11.0, with a brand new VLLMModel class.

Go try it and tell us what you think!

https://github.com/huggingface/smolagents/blob/45b2c86857b7f7657daaa74e4d17d347e9e2c4a4/src/smolagents/models.py#L497
m-ricย 
posted an update 3 months ago
view post
Post
1087
Our new Agentic leaderboard is now live!๐Ÿ’ฅ

If you ever asked which LLM is best for powering agents, we've just made a leaderboard that ranks them all! Built with @albertvillanova , this ranks LLMs powering a smolagents CodeAgent on subsets of various benchmarks. โœ…

๐Ÿ† GPT-4.5 comes on top, even beating reasoning models like DeepSeek-R1 or o1. And Claude-3.7-Sonnet is a close second!

The leaderboard also allows you to show the scores of vanilla LLMs (without any agentic setup) on the same benchmarks: this shows the huge improvements brought by agentic setups. ๐Ÿ’ช

(Note that results will be added manually, so the leaderboard might not always have the latest LLMs)
  • 1 reply
ยท
alvarobarttย 
posted an update 3 months ago
view post
Post
3159
๐Ÿ”ฅ Agents can do anything! @microsoft Research just announced the release of Magma 8B!

Magma is a new Visual Language Model (VLM) with 8B parameters for multi-modal agents designed to handle complex interactions across virtual and real environments; and it's MIT licensed!

Magma comes with exciting new features such as:
- Introduces the Set-of-Mark and Trace-of-Mark techniques for fine-tuning
- Leverages a large amount of unlabeled video data to learn the spatial-temporal grounding and planning
- A strong generalization and ability to be fine-tuned for other agentic tasks
- SOTA in different multi-modal benchmarks spanning across UI navigation, robotics manipulation, image / video understanding and spatial understanding and reasoning
- Generates goal-driven visual plans and actions for agentic use cases

Model: microsoft/Magma-8B
Technical Report: Magma: A Foundation Model for Multimodal AI Agents (2502.13130)
m-ricย 
posted an update 3 months ago
view post
Post
4866
We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones ๐Ÿ”ฅ

Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.

To make a research survey, you generally follow two steps, preparation (collect and organize papers) and writing (outline creation, writing, polishing). Researchers followed the same two steps and automated them.

๐ŸŽฏ For the preparation part, a key part is find all the important references on the given subject.
Researchers first cast a wide net of all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an โ€œAttributeTreeโ€ object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!

๐Ÿ“ For the writing part, key was to get a synthesis that's both short and true. This is not easy to get with LLMs! So they used methods like LLM-based deduplication to shorten the too verbose listings made by LLMs, and RAG to grab original quotes instead of made-up ones.

As a result, their system outperforms previous approaches by far!

As assessed by LLM-judges, the quality score os SurveyX even approaches this of human experts, with 4.59/5 vs 4.75/5 ๐Ÿ†

I advise you to read the paper, it's a great overview of the kind of assistants that we'll get in the short future! ๐Ÿ‘‰ SurveyX: Academic Survey Automation via Large Language Models (2502.14776)
Their website shows examples of generated surveys ๐Ÿ‘‰ http://www.surveyx.cn/