llrehf

community
Activity Feed

AI & ML interests

None defined yet.

Recent Activity

ll-re-hf's activity

merve
posted an update 2 days ago
Google released MedGemma at I/O'25: google/medgemma-release-680aade845f90bec6a3f60c4

> 4B and 27B instruction fine-tuned vision LMs and a 4B pre-trained vision LM for medicine
> available with transformers from the get-go 🤗

they also released a cool demo for scan reading ➡️ google/rad_explain

use with transformers ⤵️
burtenshaw
posted an update 2 days ago
MCP course is now LIVE! We just dropped quizzes, videos, and live streams to make it a fully interactive course:

🔗 join in now: mcp-course

- It's still free!
- Video 1 walks you through onboarding to the course
- The first live session is next week!
- You can now get a certificate via the exam app
- We improved the written material with interactive quizzes

If you're studying MCP and want a live, interactive, visual, certified course, then join us on the hub!
merve
posted an update 2 days ago
You can translate this post 🤗💗
merve
posted an update 2 days ago
tis the year of any-to-any/omni models 🤠
ByteDance-Seed/BAGEL-7B-MoT: a 7B native multimodal model that understands and generates both image + text

it outperforms leading VLMs like Qwen 2.5-VL and has an Apache 2.0 license 😱
merve
posted an update 4 days ago
NVIDIA released a new vision reasoning model for robotics: Cosmos-Reason1-7B 🤖 nvidia/cosmos-reason1-67c9e926206426008f1da1b7

> first reasoning model for robotics
> based on Qwen 2.5-VL-7B, use with Hugging Face transformers or vLLM 🤗
> comes with SFT & alignment datasets and a new benchmark
reach-vb
posted an update 5 days ago
hey hey @mradermacher - VB from Hugging Face here, we'd love to onboard you over to our optimised xet backend! 💥

as you know, we're in the process of upgrading our storage backend to xet (which helps us scale and offer blazingly fast upload/download speeds too): https://huggingface.co/blog/xet-on-the-hub and now that we are certain that the backend can scale with even big models like Llama 4/Qwen 3 - we're moving to the next phase of inviting impactful orgs and users on the hub over. as you are a big part of the open source ML community, we would love to onboard you next and create some excitement about it in the community too!

in terms of actual steps - it should be as simple as one of the org admins joining hf.co/join/xet - we'll take care of the rest.

p.s. you'd need to have the latest hf_xet version of the huggingface_hub lib but everything else should be the same: https://huggingface.co/docs/hub/storage-backends#using-xet-storage

p.p.s. this is fully backwards compatible so everything will work as it should! 🤗
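
A quick local check for the p.s. above might look like the sketch below. It assumes the Xet client is importable under the module name `hf_xet` (the name used in the post); `has_module` is a hypothetical helper, not part of huggingface_hub, and the check only tells you whether the modules are present, not whether they are the latest versions.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Module names are assumptions taken from the post above.
    for module in ("huggingface_hub", "hf_xet"):
        status = "installed" if has_module(module) else "missing"
        print(f"{module}: {status}")
```

If either module is reported missing, the storage backend docs linked above are the place to look for install steps.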
merve
posted an update 5 days ago
It was the week of video generation at @huggingface, on top of many new LLMs, VLMs and more!
Let's have a wrap 🌯 merve/may-16-releases-682aeed23b97eb0fe965345c

LLMs 💬
> Alibaba Qwen released WorldPM-72B, a new World Preference Model trained with 15M preference samples (OS)
> II-Medical-8B, a new 8B LLM for medical reasoning by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)

Multimodal 🖼️💬
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output 💬 it's based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio, text (interleaved)

Image Generation ⏯️
> Alibaba Wan-AI released Wan2.1-VACE, a video foundation model for image and text to video, video-to-audio and more tasks, comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, a new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (like AirBnB assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text and image to video) model by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators

Audio 🗣️
> stabilityai released stable-audio-open-small, a new text-to-audio model
> TEN-framework released ten-vad, a voice activity detection model (OS)

merve
posted an update 8 days ago
New sota open-source depth estimation: Marigold v1-1 🌼

> normal maps, depth maps of scenes & faces prs-eth/marigold-normals prs-eth/marigold
> get albedo (true color) and BRDF (texture) maps of scenes prs-eth/marigold-intrinsics
> they even release a depth-to-3D printer format demo 😮 prs-eth/depth-to-3d-print

All models are here prs-eth/marigold-computer-vision-6669e9e3d3ee30f48214b9ba
burtenshaw
posted an update 8 days ago
We're thrilled to announce the launch of our comprehensive Model Context Protocol (MCP) Course! This free program is designed to take learners from foundational understanding to practical application of MCP in AI.

Follow the course on the hub: mcp-course

In this course, you will:
📖 Study Model Context Protocol in theory, design, and practice.
🧑‍💻 Learn to use established MCP SDKs and frameworks.
💾 Share your projects and explore applications created by the community.
🏆 Participate in challenges and evaluate your MCP implementations.
🎓 Earn a certificate of completion.

At the end of this course, you'll understand how MCP works and how to build your own AI applications that leverage external data and tools using the latest MCP standards.
merve
posted an update 12 days ago
VLMS 2025 UPDATE 🔥

We just shipped a blog covering the latest in vision language models, including
🤖 GUI agents, agentic VLMs, omni models
📑 multimodal RAG
⏯️ video LMs
🤏🏻 smol models
..and more! https://huggingface.co/blog/vlms-2025
merve
posted an update 18 days ago
A ton of impactful models and datasets in open AI this past week, let's summarize the best 🤩 merve/releases-apr-21-and-may-2-6819dcc84da4190620f448a3

💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
🖼️ > ByteDance released UI-TARS-1.5, a native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation, conversation model
👩🏻‍💻 JetBrains released Mellum models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩
merve
posted an update 19 days ago
A real-time object detector much faster and more accurate than YOLO, with an Apache 2.0 license, just landed in Hugging Face transformers 🔥

D-FINE is the sota real-time object detector that runs on T4 (free Colab) 🤩

> Collection with all checkpoints and demo ustc-community/d-fine-68109b427cbe6ee36b4e7352

Notebooks:
> Tracking https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
> Inference https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
> Fine-tuning https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper 🎩

Regular object detectors attempt to predict bounding boxes as pixel-perfect (x, y, w, h) coordinates, which is very rigid and hard to solve 🥲☹️

D-FINE formulates object detection as predicting a distribution over bounding box coordinates and refines it iteratively, which makes it more accurate 🤩

Another core idea behind this model is Global Optimal Localization Self-Distillation ⤵️

this model uses the final layer's distribution output (sort of like a teacher) to distill into earlier layers, making the early layers more performant.
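
To make the two ideas above concrete, here is a minimal pure-Python sketch, not D-FINE's actual implementation. It assumes each box edge is predicted as a softmax distribution over a handful of candidate offsets (the bin values and logits are made up), decodes the edge as the expected offset, sketches refinement as adding a residual to the previous layer's logits, and measures how far an early layer's distribution is from the final layer's with a KL divergence, one common choice of distillation loss. The helper names `softmax`, `expected_offset` and `kl_divergence` are illustrative.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(probs, bin_values):
    """Decode one box edge as the probability-weighted mean of candidate offsets."""
    return sum(p * v for p, v in zip(probs, bin_values))


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student): how much the student distribution differs from the teacher."""
    return sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(teacher, student))


# Hypothetical candidate offsets (in pixels) for one edge of a box.
BIN_VALUES = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Made-up logits: the early layer is unsure about the edge position.
early_logits = [0.1, 0.3, 0.5, 0.4, 0.2]
# Iterative refinement sketched as a residual added to the previous layer's logits.
residual = [-1.0, -0.5, 0.5, 2.0, -0.5]
final_logits = [a + b for a, b in zip(early_logits, residual)]

early_probs = softmax(early_logits)
final_probs = softmax(final_logits)

if __name__ == "__main__":
    print("early edge offset:", round(expected_offset(early_probs, BIN_VALUES), 3))
    print("final edge offset:", round(expected_offset(final_probs, BIN_VALUES), 3))
    # Self-distillation signal: pull the early layer toward the final layer.
    print("distillation loss:", round(kl_divergence(final_probs, early_probs), 3))
```

Predicting a distribution instead of a single number lets the model express uncertainty about each edge, and the final layer's sharper distribution gives earlier layers a soft target to learn from.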

merve
posted an update 22 days ago
burtenshaw
posted an update 23 days ago
Qwen 3 Fine tuning >> MoE. Updated the experiment thread to include the config and script for fine-tuning the Qwen3-30B-A3B model.

The goal is to make a low-latency non-thinking model as a daily driver for coding, so 3 billion active parameters should be perfect.

✔️ training running
✔️ evals running
⏭️ improve dataset

The MoE isn't going to fit into Colab's A100 even with quantization (🙏 @UnslothAI). So I've been working on HF Spaces' H100s for this. Everything is available in the thread and I'll share more tomorrow.

burtenshaw/Qwen3-Code-Lite#1
merve
posted an update 24 days ago
Meta released Llama Guard 4 and new Prompt Guard 2 models 🔥

Llama Guard 4 is a new model to filter model inputs/outputs, both text-only and image 🛡️ use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B

Prompt Guard 2 22M & 86M are smol models to prevent model jailbreaks and prompt injections ⚔ meta-llama/Llama-Prompt-Guard-2-22M meta-llama/Llama-Guard-4-12B
Both come with the new release of transformers 🤗

Try the model right away 👉🏻 https://github.com/huggingface/huggingface-llama-recipes/blob/main/llama_guard_4.ipynb

Read our blog to learn more and easily get started 👉🏻 https://huggingface.co/blog/llama-guard-4 🦙
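
The "before and after" placement mentioned above can be sketched as a small wrapper. This is a toy illustration, not the Llama Guard API: `guarded_generate`, `toy_is_safe`, `toy_generate` and `REFUSAL` are hypothetical names, and the keyword check only stands in for a real safety classifier such as Llama Guard 4.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> str:
    """Run a safety check on the input, then on the output of the model."""
    if not is_safe(prompt):  # filter before the LLM/VLM
        return REFUSAL
    response = generate(prompt)
    if not is_safe(response):  # filter after the LLM/VLM
        return REFUSAL
    return response


def toy_is_safe(text: str) -> bool:
    """Stand-in classifier: flags any text containing the word 'forbidden'."""
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    """Stand-in model: simply echoes the prompt."""
    return f"echo: {prompt}"


if __name__ == "__main__":
    print(guarded_generate("hello there", toy_generate, toy_is_safe))
    print(guarded_generate("something forbidden", toy_generate, toy_is_safe))
```

Checking both sides matters because a harmless-looking prompt can still lead to an unsafe response, and the output check catches that case. The notebook and blog linked above show how to call the real model.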
merve
posted an update 29 days ago
Don't sleep on the new AI at Meta Vision-Language release! 🔥

facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
facebook/perception-lm-67f9783f171948c383ee7498

Meta dropped swiss army knives for vision with an Apache 2.0 license
> image/video encoders for vision language modelling and spatial understanding (object detection etc)
> The vision LM outperforms InternVL3 and Qwen2.5VL
> They also release gigantic video and image datasets

The authors attempt to come up with a single versatile vision encoder that can be aligned on a diverse set of tasks.

They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms the latest sota SigLIP2

> Among the fine-tuned ones, the first is PE-Spatial. It's a model for bounding box detection, segmentation and depth estimation, and it outperforms all other models 😮

> The second is PLM, Perception Language Model, where they combine PE-Core with Qwen2.5 LM 7B. It outperforms all other models (including InternVL3, which was trained with Qwen2.5LM too!)

The authors release the following checkpoints in sizes base, large and giant:

> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
> Datasets

The authors release the following datasets 📑
> PE Video: Gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: New video benchmark on MCQA
pcuenq
in meta-llama/Llama-Guard-4-12B about 1 month ago

Fix rope and use static cache

#1 opened about 1 month ago by pcuenq
burtenshaw
posted an update about 1 month ago
The rebooted LLM course starts today with an overhauled chapter 1 on Transformers:

👉 Follow the org to join the course: huggingface-course

We're starting from the foundations of modern generative AI by looking at transformers. This chapter has been expanded in depth and features, so it contains new material like:

FREE and CERTIFIED exam on fundamentals of transformers
deeper exploration of transformer architectures and attention mechanisms
end-to-end exploration of inference strategies for prefill and decode steps

The course has leveled up in complexity and depth, so this is a great time to join in if you want to build your own AI models.
merve
posted an update about 1 month ago
New foundation model on image and video captioning just dropped by NVIDIA AI 🔥

Describe Anything Model (DAM) is a 3B vision language model to generate detailed captions with localized references 😮

The team released the models, the dataset, a new benchmark and a demo 🤩 nvidia/describe-anything-680825bb8f5e41ff0785834c

Most vision LMs focus on the image as a whole, lacking localized references in captions and not taking in visual prompts (points, boxes, drawings around objects)

DAM addresses this on two levels: a new vision backbone that takes in focal crops and the image itself, and a large-scale dataset 👀

They generate a dataset by extending existing segmentation and referring expression generation datasets like REFCOCO, passing the images and classes to VLMs and generating captions.

Lastly, they also release a new benchmark, again with self-supervision: they use an LLM to evaluate the detailed captions, focusing on localization