Okay, this is insane... WebGPU-accelerated semantic video tracking, powered by DINOv3 and Transformers.js! 🤯 Demo (+ source code): webml-community/DINOv3-video-tracking
This will revolutionize AI-powered video editors, which can now run 100% locally in your browser, with no server inference required (costs $0)!
How does it work? 🤔
1️⃣ Generate and cache image features for each frame
2️⃣ Create a list of embeddings for the selected patch(es)
3️⃣ Compute the cosine similarity between each patch and the selected patch(es)
4️⃣ Highlight patches whose score is above a threshold
... et voilà! 🥳
You can also make selections across frames to improve temporal consistency! This is super useful if the object changes its appearance slightly throughout the video.
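The matching steps above can be sketched in a few lines of NumPy (the actual demo runs in JavaScript with Transformers.js; the function names and the 0.6 default threshold here are my own, for illustration only):

```python
import numpy as np

def highlight_patches(frame_features, selected_embeddings, threshold=0.6):
    """Return a boolean mask over patches whose best cosine similarity
    to any selected patch embedding exceeds the threshold."""
    # L2-normalize so a plain dot product equals cosine similarity
    f = frame_features / np.linalg.norm(frame_features, axis=-1, keepdims=True)
    s = selected_embeddings / np.linalg.norm(selected_embeddings, axis=-1, keepdims=True)
    # (num_patches, num_selected) similarity matrix
    sims = f @ s.T
    # Selecting patches across frames just adds rows to `selected_embeddings`;
    # each patch is scored against its best-matching reference.
    return sims.max(axis=1) >= threshold

# Toy example: 4 patches with 8-dim features, one selected patch
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
selected = feats[[1]]  # "click" on patch 1
mask = highlight_patches(feats, selected, threshold=0.9)
print(mask)
```

Multi-frame selections drop straight into this formulation: each extra selection is one more row in `selected_embeddings`, and the `max` over references is what gives the temporal robustness.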
Introducing Voxtral WebGPU: state-of-the-art audio transcription directly in your browser! 🤯
🗣️ Transcribe videos, meeting notes, songs and more
🔒 Runs on-device, meaning no data is sent to a server
🌍 Multilingual (8 languages)
🤗 Completely free (forever) & open source
That's right, we're running Mistral's new Voxtral-Mini-3B model 100% locally in-browser on WebGPU, powered by Transformers.js and ONNX Runtime Web! 🔥
Fine-tune Gemma3n on videos (with their audio tracks) using a Colab A100 🔥 Just dropped a notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
Keep in mind, it's made for educational purposes 🫡 We use LoRA, audio resampling & video downsampling to fit training into <40 GB of VRAM. Stretch modalities and unfreeze layers as you wish! 👇🏻 merve/smol-vision
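The two VRAM-saving preprocessing tricks mentioned above can be sketched like this (this is illustrative only, not the notebook's code; real pipelines resample audio with torchaudio or librosa, and the 16 kHz / 8-frame targets are assumed values):

```python
import numpy as np

def resample_audio(waveform, orig_sr, target_sr=16_000):
    """Naive linear-interpolation resampling to the model's expected sample rate."""
    n_out = int(round(len(waveform) * target_sr / orig_sr))
    old_t = np.linspace(0.0, 1.0, num=len(waveform))
    new_t = np.linspace(0.0, 1.0, num=n_out)
    return np.interp(new_t, old_t, waveform)

def downsample_video(frames, max_frames=8):
    """Keep at most `max_frames` evenly spaced frames from a video."""
    if len(frames) <= max_frames:
        return frames
    idx = np.linspace(0, len(frames) - 1, num=max_frames).round().astype(int)
    return [frames[i] for i in idx]

audio = np.zeros(48_000)                     # 1 s of audio at 48 kHz
frames = [f"frame_{i}" for i in range(64)]   # 64 raw video frames
print(len(resample_audio(audio, 48_000)))    # 16000 samples
print(len(downsample_video(frames)))         # 8 frames
```

Fewer frames and a lower sample rate mean fewer tokens per training example, which is where the VRAM savings come from.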
They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model). The model is actually a full LLM (Qwen2); the tokenizer converts images into discrete tokens 🤯