Extremely bullish on @CohereForAI's Aya Vision (8B & 32B), new SOTA open-weight VLMs:
- The 8B wins up to 81% of the time against models in its class, beating even Gemini Flash
- The 32B beats Llama 3.2 90B!
- Covers 23 languages; excels at image captioning, VQA & more
- Integrated into transformers from day 0!
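Since the models land in transformers from day 0, trying them should be a few lines. A minimal sketch, hedged: the Hub id `CohereForAI/aya-vision-8b` and the use of `AutoModelForImageTextToText` are assumptions based on the announcement, not confirmed by it; heavy imports are done lazily inside the function so the sketch can be read without the dependencies installed.

```python
def build_messages(image_url: str, question: str) -> list:
    """Build a chat-style message list pairing one image with a question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]

def ask_aya(image_url: str, question: str, max_new_tokens: int = 128) -> str:
    # Lazy imports: transformers and a GPU are only needed at call time.
    from transformers import AutoProcessor, AutoModelForImageTextToText

    model_id = "CohereForAI/aya-vision-8b"  # assumed Hub id
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
    inputs = processor.apply_chat_template(
        build_messages(image_url, question),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```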
We're thrilled to share SmolVLM (256M & 500M), the smallest Visual Language Models ever built. Think: running on <1GB of GPU memory. You can fine-tune it on your laptop and run it on your toaster!
Why It's Game-Changing:
- Outperforms Larger Models: Even the 256M model surpasses our SOTA 80B-parameter model from just 17 months ago. Over 300x reduction!
- Mighty Efficiency: The 256M version delivers 80% of our 2.2B model's performance, and the 500M version hits 90%.
- Lightning-Fast Search: SmolVLM integrates with ColPali for state-of-the-art retrieval speeds, on par with models 10x bigger. That means cheaper, faster indexing and real-world impact.
What's New Under the Hood:
- New Vision Encoder: Smaller overall size (400M -> 93M), but with higher resolution.
- Higher Pixels/Token: 4096 vs. 1820, more efficient image processing.
- Smart Tokenization: Faster training and a performance boost.
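The "<1GB of GPU memory" claim checks out with back-of-the-envelope math: at half precision (2 bytes per parameter), the weights alone for both sizes fit under a gigabyte. A quick sketch (note this counts weights only; activations and KV cache add some overhead on top):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1024**3 bytes), fp16 by default."""
    return n_params * bytes_per_param / 1024**3

for name, n in [("SmolVLM-256M", 256e6), ("SmolVLM-500M", 500e6)]:
    print(f"{name}: ~{weight_memory_gb(n):.2f} GB of weights in fp16")
```

Roughly 0.48 GB and 0.93 GB respectively, so both sizes leave headroom inside a 1 GB budget.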
SmolVLM speeding locally on a laptop thanks to mlx-vlm and @Gradio! Try it with two lines:

pip install git+https://github.com/andimarafioti/mlx-vlm.git@stream-generate-fix
python -m mlx_vlm.chat_ui --model mlx-community/SmolVLM-Instruct-8bit
Gotta love the MLX community! Big thanks to @pcuenq and @prince_canuma!
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and token throughput.
- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL!
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a MacBook!
- SmolVLM can be fine-tuned on a Google Colab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models in video benchmarks, despite not even being trained on videos!
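Fine-tuning a 2B VLM on a free Colab GPU usually means parameter-efficient methods. A hedged sketch of one common route, LoRA via peft: the Hub id `HuggingFaceTB/SmolVLM-Instruct` and the `target_modules` names are assumptions for illustration, not the release's documented recipe.

```python
def lora_trainable_fraction(total_params: int, lora_params: int) -> float:
    """Fraction of parameters actually updated under LoRA."""
    return lora_params / total_params

def prepare_lora_model():
    # Lazy imports: transformers and peft are only needed at call time.
    from transformers import AutoModelForVision2Seq
    from peft import LoraConfig, get_peft_model

    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
    config = LoraConfig(
        r=8,                                   # low-rank adapter dimension
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    )
    # Freezes the base weights; only the small adapters train.
    return get_peft_model(model, config)
```

With adapters in the low millions of parameters against a 2B base, well under 1% of the model trains, which is what makes a Colab-sized GPU viable.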
Hugging Face presents FineVideo! Unlocking the next generation of video understanding.
3,400 hours of annotated Creative Commons videos with rich character descriptions, scene splits, mood, and content descriptions per scene, as well as QA pairs. @mfarre processed over 2M videos from YouTube-CC to make this incredibly powerful selection.
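For a dataset this size, streaming a few samples first is the sane way to explore it. A minimal sketch, assuming the Hub id `HuggingFaceFV/finevideo` and a `train` split (both assumptions from the announcement, not stated in it):

```python
def hours_to_clip_count(total_hours: float, avg_clip_minutes: float) -> int:
    """Rough number of clips implied by a total duration and an average clip length."""
    return int(total_hours * 60 / avg_clip_minutes)

def peek_finevideo(n: int = 3) -> list:
    # Lazy import: the datasets library is only needed at call time.
    from datasets import load_dataset

    # streaming=True avoids downloading 3,400 hours of video up front.
    ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)
    return [sample for _, sample in zip(range(n), ds)]
```

As a sanity check on scale: 3,400 hours at a hypothetical 10-minute average clip would be on the order of 20,000 clips.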
Introducing Hugging Face's Multilingual Speech-to-Speech! Our modular, cross-platform pipeline to run GPT4o-like experiences on device can now seamlessly switch languages mid-conversation with an imperceptible 100ms delay.
Building on an amazing early reception, 2,600 stars on GitHub!
We are expanding the library to support multiple languages.
Try it out with a flag: --language fr
Or don't set the flag and let the system detect the language.
These past months, I've been busy baking a special sort of Croissant with an awesome team!
CroissantLLM is a truly bilingual language model trained on 3 trillion tokens of French and English data. In its size category (<2B), it is the best model in French, but it also rivals the best monolingual English models!
To train it, we collected, filtered and cleaned huge quantities of permissively licensed French data, across various domains (legal, administrative, cultural, scientific), and different text modalities (speech transcriptions, movie subtitles, encyclopedias, forums, webpages)...
Assessing LLM performance is not easy, especially outside of English, and to this end we crafted a novel evaluation benchmark, FrenchBench, aiming to assess reasoning, factual knowledge, and linguistic capabilities of models in French!
The best current LLMs are hidden behind a shroud of mystery, trained with undisclosed training data mixes or strategies. We go the opposite way, releasing all of the project's artefacts (model checkpoints, data, training details, evaluation benchmarks...). We meet 81% of the Stanford FMTI transparency criteria, far ahead of even most open initiatives!
Beyond being a powerful industrial resource, our transparent initiative is a stepping stone for many scientific questions! How does teaching a model two languages instead of one affect its monolingual abilities? Does training on so much French help the model integrate French-centric knowledge and cultural biases? How does the model memorize the training data?
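Trying CroissantLLM yourself should be standard causal-LM boilerplate. A hedged sketch: the Hub id `croissantllm/CroissantLLMBase` is an assumption from the release naming, and the prompt-stripping helper is mine, not part of the project.

```python
def strip_prompt(text: str, prompt: str) -> str:
    """Drop the echoed prompt from a generated string (hypothetical helper)."""
    return text[len(prompt):].lstrip() if text.startswith(prompt) else text

def generate_french(prompt: str, max_new_tokens: int = 40) -> str:
    # Lazy imports: transformers is only needed at call time.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "croissantllm/CroissantLLMBase"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens)
    return strip_prompt(tokenizer.decode(out[0], skip_special_tokens=True), prompt)
```

Since it is a base (non-chat) model, plain continuation prompts in either French or English are the natural way to probe the bilingual questions above.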
Many more things to say, for those interested, I recommend checking out: