Post
2515
It was the week of video generation at
@huggingface
, on top of many new LLMs, VLMs and more!
Let’s have a wrap 🌯 merve/may-16-releases-682aeed23b97eb0fe965345c
LLMs 💬
> Alibaba Qwen released WorldPM-72B, new World Preference Model trained with 15M preference samples (OS)
> II-Medical-8B, new LLM for medical reasoning that comes in 8B by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)
Multimodal 🖼️💬
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output 💬it’s based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio, text (interleaved)
Image Generation ⏯️
> Alibaba Wan-AI released Wan2.1-VACE, video foundation model for image and text to video, video-to-audio and more tasks, comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (like AirBnB assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text and image to video) model by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators
Audio 🗣️
> stabilityai released stable-audio-open-small new text-to-audio model
> TEN-framework released ten-vad, voice activity detection model (OS)
Let’s have a wrap 🌯 merve/may-16-releases-682aeed23b97eb0fe965345c
LLMs 💬
> Alibaba Qwen released WorldPM-72B, new World Preference Model trained with 15M preference samples (OS)
> II-Medical-8B, new LLM for medical reasoning that comes in 8B by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)
Multimodal 🖼️💬
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output 💬it’s based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio, text (interleaved)
Image Generation ⏯️
> Alibaba Wan-AI released Wan2.1-VACE, video foundation model for image and text to video, video-to-audio and more tasks, comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (like AirBnB assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text and image to video) model by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators
Audio 🗣️
> stabilityai released stable-audio-open-small new text-to-audio model
> TEN-framework released ten-vad, voice activity detection model (OS)