We just released TRL v0.20 with major multimodal upgrades!
- VLM support for GRPO (highly requested by the community!)
- New GSPO trainer (from @Qwen, released last week, VLM-ready)
- New MPO trainer (multimodal by design, as in the paper)
Introducing Voxtral WebGPU: State-of-the-art audio transcription directly in your browser!
- Transcribe videos, meeting notes, songs and more
- Runs on-device, meaning no data is sent to a server
- Multilingual (8 languages)
- Completely free (forever) & open source
That's right, we're running Mistral's new Voxtral-Mini-3B model 100% locally in-browser on WebGPU, powered by Transformers.js and ONNX Runtime Web!
Yet Another New Multimodal Fine-Tuning Recipe
In this @HuggingFace Cookbook notebook, we demonstrate how to align a multimodal model (VLM) with Mixed Preference Optimization (MPO) using trl.
This recipe is powered by the new MPO support in trl, enabled through a recent upgrade to the DPO trainer!
We align the multimodal model using multiple optimization objectives (losses), guided by a preference dataset (chosen vs. rejected multimodal pairs).
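To make "multiple optimization objectives" concrete, here is a minimal sketch of how a mixed preference loss could be assembled for one chosen/rejected pair. It does not use trl: the function name `mpo_loss`, the `delta` shift, and the default weights are illustrative assumptions, and the three terms (a DPO-style preference term, a BCO-style quality term, and an SFT-style generation term) reflect my reading of the MPO idea rather than the trainer's actual code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def mpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_num_tokens: int,
    beta: float = 0.1,
    delta: float = 0.0,               # assumed reward shift for the quality term
    weights=(0.8, 0.2, 1.0),          # assumed (preference, quality, generation) weights
) -> float:
    # Implicit rewards: scaled log-ratio between the policy and the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

    # Preference term (DPO-style): the chosen answer should out-score the rejected one.
    preference = -log_sigmoid(chosen_reward - rejected_reward)

    # Quality term (BCO-style): judge each answer on its own against the shift delta.
    quality = -log_sigmoid(chosen_reward - delta) - log_sigmoid(-(rejected_reward - delta))

    # Generation term (SFT-style): average negative log-likelihood of the chosen answer.
    generation = -policy_chosen_logp / chosen_num_tokens

    w_p, w_q, w_g = weights
    return w_p * preference + w_q * quality + w_g * generation


# A policy identical to the reference vs. one that already prefers the chosen answer.
baseline = mpo_loss(-10.0, -12.0, -10.0, -12.0, chosen_num_tokens=10)
improved = mpo_loss(-8.0, -14.0, -10.0, -12.0, chosen_num_tokens=10)
print(f"baseline={baseline:.4f} improved={improved:.4f}")
```

The sequence log-probabilities are plain floats here; in a real training loop they would come from summing token log-probs of the policy and reference models over each multimodal response.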
New Multimodal Fine-Tuning Recipe
In this new @huggingface Cookbook recipe, I walk you through the process of fine-tuning a Visual Language Model (VLM) for Object Detection with Visual Grounding, using TRL.
Object detection typically involves detecting categories in images (e.g., vase).
By combining it with visual grounding, we add contextual understanding, so instead of detecting just "vase", we can detect "middle vase" in an image.
VLMs are super powerful!
In this case, I use PaliGemma 2, which already supports object detection, and extend it to also handle visual grounding.
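For a feel of what a grounded detection looks like on the output side, here is a small sketch that parses detection strings into pixel boxes. It assumes the PaliGemma-style convention, as I understand it, of four `<locNNNN>` tokens per box in `y_min, x_min, y_max, x_max` order, binned to 1024 positions, followed by the label, with detections separated by `;`. The helper name `parse_detections` is hypothetical and the recipe itself may handle this differently.

```python
import re

# Four consecutive location tokens followed by a free-text label.
LOC_PATTERN = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text: str, width: int, height: int) -> list:
    """Turn '<locYYYY><locXXXX><locYYYY><locXXXX> label' strings into pixel boxes."""
    results = []
    for locs, label in LOC_PATTERN.findall(text):
        # Assumed token order: y_min, x_min, y_max, x_max, each in [0, 1023].
        y_min, x_min, y_max, x_max = (int(v) for v in re.findall(r"<loc(\d{4})>", locs))
        results.append(
            {
                "label": label.strip(),
                # Rescale from the 1024-bin grid to pixels, as (x_min, y_min, x_max, y_max).
                "box": (
                    x_min / 1024 * width,
                    y_min / 1024 * height,
                    x_max / 1024 * width,
                    y_max / 1024 * height,
                ),
            }
        )
    return results


output = "<loc0256><loc0512><loc0768><loc1023> middle vase ; <loc0000><loc0000><loc0512><loc0256> left vase"
for detection in parse_detections(output, width=1024, height=512):
    print(detection)
```

With grounding, the label carries the context ("middle vase", "left vase") rather than just the category, which is the behavior the fine-tuning is meant to teach.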
Fine-tune Gemma3n on videos with audio on a Colab A100! Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!
Keep in mind, it's made for educational purposes: we use LoRA, audio resampling & video downsampling to be able to train in <40GB VRAM. Stretch modalities and unfreeze layers as you wish! merve/smol-vision
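As a rough idea of what video downsampling can mean in practice, here is a minimal sketch that picks evenly spaced frame indices so a clip fits a fixed frame budget. The function name `sample_frame_indices` and the uniform-spacing strategy are my assumptions for illustration; the notebook may downsample differently.

```python
def sample_frame_indices(total_frames: int, max_frames: int) -> list:
    """Pick at most `max_frames` evenly spaced frame indices from a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    # Long clip: take one frame per equally sized window.
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


print(sample_frame_indices(total_frames=100, max_frames=4))
print(sample_frame_indices(total_frames=3, max_frames=8))
```

Fewer frames means fewer image tokens per example, which is one of the levers for staying under a VRAM budget.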