Hugging Face H4


AI & ML interests

Aligning LLMs to be helpful, honest, harmless, and huggy (H4)

Recent Activity

HuggingFaceH4's activity

merve 
posted an update about 4 hours ago
Don't sleep on the new Vision-Language release from AI at Meta! 🔥

facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
facebook/perception-lm-67f9783f171948c383ee7498

Meta dropped Swiss Army knives for vision with an Apache 2.0 license 👏
> image/video encoders for vision language modelling and spatial understanding (object detection etc) 👏
> The vision LM outperforms InternVL3 and Qwen2.5VL 👏
> They also release gigantic video and image datasets

The authors attempt to come up with a single versatile vision encoder that can be aligned to a diverse set of tasks.

They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms the latest SOTA, SigLIP2 👏



> Among the fine-tuned ones, the first is PE-Spatial. It's a model for bounding box detection, segmentation, and depth estimation, and it outperforms all other models 😮



> The second is PLM, the Perception Language Model, where they combine PE-Core with the Qwen2.5 LM 7B. It outperforms all other models (including InternVL3, which was trained with the Qwen2.5 LM too!)

The authors release the following checkpoints in sizes base, large and giant:

> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
> Datasets



The authors release the following datasets 📑
> PE Video: Gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: New video benchmark on MCQA
merve 
posted an update 2 days ago
New foundation model on image and video captioning just dropped by NVIDIA AI 🔥

Describe Anything Model (DAM) is a 3B vision language model to generate detailed captions with localized references 😮

The team released the models, the dataset, a new benchmark and a demo 🤩 nvidia/describe-anything-680825bb8f5e41ff0785834c

Most of the vision LMs focus on image as a whole, lacking localized references in captions, and not taking in visual prompts (points, boxes, drawings around objects)

DAM addresses this on two levels: new vision backbone that takes in focal crops and the image itself, and a large scale dataset 👀

They generate a dataset by extending existing segmentation and referring expression generation datasets like RefCOCO: they pass the images and classes to VLMs and generate captions.

Lastly, they also release a new benchmark, again with self-supervision: they use an LLM to evaluate the detailed captions, focusing on localization 👏
albertvillanova 
posted an update 3 days ago
smolagents v1.14.0 is out! 🚀
🔌 MCPClient: A sleek new client for connecting to remote MCP servers, making integrations more flexible and scalable.
🪨 Amazon Bedrock: Native support for Bedrock-hosted models.
SmolAgents is now more powerful, flexible, and enterprise-ready. 💼

Full release 👉 https://github.com/huggingface/smolagents/releases/tag/v1.14.0
#smolagents #LLM #AgenticAI
yjernite 
posted an update 9 days ago
Today in Privacy & AI Tooling - introducing a nifty new tool to examine where data goes in open-source apps on 🤗

HF Spaces have tons (100Ks!) of cool demos leveraging or examining AI systems - and because most of them are OSS we can see exactly how they handle user data 📚🔍

That requires actually reading the code though, which isn't always easy or quick! Good news: code LMs have gotten pretty good at automatic review, so we can offload some of the work - here I'm using Qwen/Qwen2.5-Coder-32B-Instruct to generate reports and it works pretty OK 🙌

The app works in four stages:
1. Download all code files
2. Use the Code LM to generate a detailed report pointing to code where data is transferred/(AI-)processed (screen 1)
3. Summarize the app's main functionality and data journeys (screen 2)
4. Build a Privacy TLDR with those inputs
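
A rough sketch of how the stages above could fit together, not the actual code of the Space. Everything here is an assumption made for illustration: the `review_space` function, the `PrivacyReview` container, the prompt wording, and the idea of passing the code LM in as a plain callable (in the real app this role is played by Qwen/Qwen2.5-Coder-32B-Instruct). Stage 1, downloading the files, is assumed to have already produced a path-to-source mapping.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical type: any callable that maps a prompt string to a completion.
CodeLM = Callable[[str], str]


@dataclass
class PrivacyReview:
    """Hypothetical container for the outputs of stages 2 to 4."""
    detailed_report: str
    summary: str
    tldr: str


def review_space(files: Dict[str, str], code_lm: CodeLM) -> PrivacyReview:
    """Run stages 2-4 over already-downloaded code files (stage 1)."""
    # Concatenate the files in a stable order, tagging each with its path
    # so the model can point back to specific files.
    code = "\n\n".join(
        f"# FILE: {path}\n{source}" for path, source in sorted(files.items())
    )

    # Stage 2: detailed report on where data is transferred or AI-processed.
    detailed_report = code_lm(
        "Point to the code where user data is transferred or processed "
        "by an AI model, citing file names.\n\n" + code
    )

    # Stage 3: summary of main functionality and data journeys.
    summary = code_lm(
        "Summarize the app's main functionality and data journeys.\n\n" + code
    )

    # Stage 4: privacy TLDR built from the two previous outputs.
    tldr = code_lm(
        "Write a short privacy TLDR from these inputs.\n\n"
        f"REPORT:\n{detailed_report}\n\nSUMMARY:\n{summary}"
    )
    return PrivacyReview(detailed_report, summary, tldr)


def placeholder_lm(prompt: str) -> str:
    """Stand-in for a real code LM; only reports the prompt length."""
    return f"[model output for a {len(prompt)}-character prompt]"


review = review_space({"app.py": "print('hello')"}, placeholder_lm)
```

Keeping the model behind a plain callable means the same pipeline can be tried with a stub, a local model, or a hosted endpoint without changing the stage logic.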

It comes with a bunch of pre-reviewed apps/Spaces, great to see how many process data locally or through (private) HF endpoints 🤗

Note that this is a POC, lots of exciting work to do to make it more robust, so:
- try it: yjernite/space-privacy
- reach out to collab: yjernite/space-privacy
thomwolf 
posted an update 11 days ago
If you've followed the progress of robotics in the past 18 months, you've likely noticed how robotics is increasingly becoming the next frontier that AI will unlock.

At Hugging Face—in robotics and across all AI fields—we believe in a future where AI and robots are open-source, transparent, and affordable; community-built and safe; hackable and fun. We've had so much mutual understanding and passion working with the Pollen Robotics team over the past year that we decided to join forces!

You can already find our open-source humanoid robot platform, Reachy 2, on the Pollen website, and the Pollen community and people here on the Hub at pollen-robotics

We're so excited to build and share more open-source robots with the world in the coming months!
merve 
posted an update 11 days ago
sooo many open AI releases past week, let's summarize! 🤗
merve/april-11-releases-67fcd78be33d241c0977b9d2

multimodal
> Moonshot AI released Kimi VL Thinking, first working open-source multimodal reasoning model and Kimi VL Instruct, both 16B MoEs with 3B active params (OS)
> InternVL3 released based on Qwen2.5VL, 7 ckpts with various sizes (1B to 78B)

LLMs
> NVIDIA released Llama-3_1-Nemotron-Ultra-253B-v1 an LLM built on Llama 405B for reasoning, chat and tool use
> Agentica released DeepCoder-14B-Preview, a fine-tuned version of DeepSeek-R1-Distill-Qwen-14B on problem-test pairs, along with the compiled dataset
> Zyphra/ZR1-1.5B is a new small reasoning LLM built on R1-Distill-1.5B (OS)
> Skywork-OR1-32B-Preview is a new reasoning model by Skywork

Image Generation
> HiDream released three new models for image generation: HiDream I1 Dev, I1 Full, and I1 Fast (OS)

*OS ones have Apache 2.0 or MIT licenses
abidlabs 
posted an update 22 days ago
JOURNEY TO 1 MILLION DEVELOPERS

5 years ago, we launched Gradio as a simple Python library to let researchers at Stanford easily demo computer vision models with a web interface.

Today, Gradio is used by >1 million developers each month to build and share AI web apps. This includes some of the most popular open-source projects of all time, like Automatic1111, Fooocus, Oobabooga’s Text WebUI, Dall-E Mini, and LLaMA-Factory.

How did we get here? How did Gradio keep growing in the very crowded field of open-source Python libraries? I get this question a lot from folks who are building their own open-source libraries. This post distills some of the lessons that I have learned over the past few years:

1. Invest in good primitives, not high-level abstractions
2. Embed virality directly into your library
3. Focus on a (growing) niche
4. Your only roadmap should be rapid iteration
5. Maximize ways users can consume your library's outputs

1. Invest in good primitives, not high-level abstractions

When we first launched Gradio, we offered only one high-level class (gr.Interface), which created a complete web app from a single Python function. We quickly realized that developers wanted to create other kinds of apps (e.g. multi-step workflows, chatbots, streaming applications), but as we started listing out the apps users wanted to build, we realized what we needed to do:

Read the rest here: https://x.com/abidlabs/status/1907886