Another great week in open ML! Here's a small recap 🫰🏻
Model releases

⏯️ Video Language Models
AI at Meta released Vision-CAIR/LongVU_Qwen2_7B, a new state-of-the-art long-video language model based on DINOv2, SigLIP, Qwen2 and Llama 3.2

💬 Small language models
Hugging Face released HuggingFaceTB/SmolLM2-1.7B, part of a new family of smol language models under the Apache 2.0 license that comes in sizes 135M, 360M and 1.7B, along with datasets.
Meta released facebook/MobileLLM-1B, part of a new family of on-device LLMs of sizes 125M, 350M, 600M and 1B

🖼️💬 Any-to-Any
gpt-omni/mini-omni2, the closest reproduction of GPT-4o, has been released! It is a new LLM that takes image, text and audio input and outputs speech

Dataset releases

🖼️ Spawning/PD12M, a new captioning dataset of 12.4 million examples generated using Florence-2

Meta AI vision has been cooking @facebook. They shipped multiple models and demos for their papers at @ECCV
Here's a compilation of my top picks:
- Sapiens is a family of foundation models for human-centric depth estimation, segmentation and more. All models have open weights, demos, and even TorchScript checkpoints! A collection of models and demos: facebook/sapiens-66d22047daa6402d565cb2fc
- VFusion3D is a state-of-the-art consistent 3D generation model from images