Meta dropped Swiss army knives for vision with an Apache 2.0 license 👏
> image/video encoders for vision language modelling and spatial understanding (object detection etc.) 👏
> The vision LM outperforms InternVL3 and Qwen2.5VL 👏
> They also release gigantic video and image datasets
The authors set out to build a single versatile vision encoder that can be aligned to a diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. On zero-shot image tasks it outperforms the latest SOTA, SigLIP2 👏
> Among the fine-tuned variants, the first is PE-Spatial: a model for bounding box detection, segmentation and depth estimation, and it outperforms all other models 😮
> The second is PLM, the Perception Language Model, where they combine PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
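If you want to try a PLM-style model, transformers' image-text-to-text pipeline is the natural entry point. A minimal sketch below uses a small LLaVA checkpoint as a stand-in, since the post doesn't give the PLM repo id; swap in the PLM checkpoint once it's supported in transformers:

```python
from transformers import pipeline

# Stand-in checkpoint: replace with the PLM repo id once it is available in transformers.
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe the scene and say where the animal is."},
        ],
    }
]

out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])
```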
The authors release the checkpoints in base, large and giant sizes.
The authors release the following datasets 📑
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: human- and auto-annotated image and video datasets for region-based tasks
> PLM-VideoBench: a new video benchmark on MCQA
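The datasets should be loadable with 🤗 datasets; the repo id below is a guess based on the name in the post, so check Meta's org on the Hub for the exact one:

```python
from datasets import load_dataset

# "facebook/PE-Video" is a guessed repo id; verify the exact name on the Hub.
# Streaming avoids downloading 1M videos just to peek at a sample.
pev = load_dataset("facebook/PE-Video", split="train", streaming=True)
sample = next(iter(pev))
print(sample.keys())
```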
We’re starting from the foundations of modern generative AI by looking at transformers. This chapter has been expanded in depth and features, so it contains new material like:
- a FREE and CERTIFIED exam on the fundamentals of transformers
- a deeper exploration of transformer architectures and attention mechanisms
- an end-to-end exploration of inference strategies for prefill and decode steps (sketched below)
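The prefill/decode split is easy to see in plain transformers code: one forward pass builds the KV cache over the whole prompt, then decoding feeds one token at a time against that cache. A minimal sketch (SmolLM2-135M is just a convenient small model, not part of the course material):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-135M"  # any small causal LM works for the demo
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tok("The prefill step processes the whole prompt", return_tensors="pt")

# Prefill: a single forward pass over the full prompt, caching keys/values per layer.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
past = out.past_key_values
next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Decode: one token per step, reusing the cache so each step only computes new keys/values.
generated = [next_id]
with torch.no_grad():
    for _ in range(20):
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```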
The course has leveled up in complexity and depth, so this is a great time to join in if you want to build your own AI models.
Most vision LMs focus on the image as a whole, lacking localized references in their captions and not taking in visual prompts (points, boxes, drawings around objects).
DAM addresses this on two levels: a new vision backbone that takes in focal crops alongside the image itself, and a large-scale dataset 👀
They generate the dataset by extending existing segmentation and referring expression generation datasets like REFCOCO: they pass the images and classes to VLMs and have them generate captions.
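That recipe is easy to reproduce at small scale: crop the annotated region, then send the full image plus the crop to a hosted VLM. A rough sketch via huggingface_hub's chat-completion client; the Qwen2.5-VL model id, the box coordinates and the prompt are my own placeholders, not the paper's actual pipeline:

```python
import base64, io
import requests
from PIL import Image
from huggingface_hub import InferenceClient

def to_data_url(img: Image.Image) -> str:
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

# Placeholder image and a COCO-style (x, y, w, h) box for the region to describe.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
x, y, w, h = 100, 80, 220, 200
crop = image.crop((x, y, x + w, y + h))

client = InferenceClient()  # reads HF_TOKEN from the environment
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "The second image is a crop of the first. Write a detailed caption of the object in the crop and where it sits in the full image."},
        {"type": "image_url", "image_url": {"url": to_data_url(image)}},
        {"type": "image_url", "image_url": {"url": to_data_url(crop)}},
    ],
}]
resp = client.chat.completions.create(model="Qwen/Qwen2.5-VL-7B-Instruct", messages=messages, max_tokens=200)
print(resp.choices[0].message.content)
```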
Lastly, they also release a new benchmark, again with self-supervision: they use an LLM to evaluate the detailed captions, focusing on localization 👏
The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:
- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model
It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.
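A toy version of the recipe (not the actual Curator pipeline; the teacher model id, persona prompt and token budgets are all my own placeholders) looks roughly like this:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()  # reads HF_TOKEN from the environment

PERSONA = (
    "You are the paper's lead researcher *before* running any experiments. "
    "Think out loud about how you would approach the problem."
)

def reasoning_chain(abstract: str, style: str = "single-long", budget: int = 800) -> str:
    """Ask an LLM for a reasoning trace in one of the two formats mentioned above."""
    prompt = (
        f"{PERSONA}\n\nPaper abstract:\n{abstract}\n\n"
        f"Produce a {style} reasoning trace (max ~{budget} tokens), separating the "
        "logical structure of the argument from your intuition about it."
    )
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=budget,
    )
    return resp.choices[0].message.content

print(reasoning_chain("We study how minimum wage changes affect employment at small firms..."))
```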
I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.
smolagents v1.14.0 is out! 🚀
🔌 MCPClient: a sleek new client for connecting to remote MCP servers, making integrations more flexible and scalable.
🪨 Amazon Bedrock: native support for Bedrock-hosted models.
smolagents is now more powerful, flexible, and enterprise-ready. 💼
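Roughly how the two new pieces fit together (a sketch only: the MCP server URL is made up, and the MCPClient / AmazonBedrockServerModel usage is my reading of the release notes, so double-check the smolagents docs):

```python
from smolagents import CodeAgent, MCPClient, AmazonBedrockServerModel

# Assumes an MCP server already running locally and AWS credentials configured for Bedrock.
with MCPClient({"url": "http://127.0.0.1:8000/sse"}) as tools:
    model = AmazonBedrockServerModel(model_id="us.amazon.nova-pro-v1:0")  # any Bedrock model id
    agent = CodeAgent(tools=tools, model=model)
    agent.run("Use the available tools to summarize what this MCP server exposes.")
```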
Hacked my presentation building with inference providers, Cohere Command A, and sheer simplicity. Use this script if you’re burning too much time on presentations:
This is what it does:
- uses Command A to generate slides and speaker notes based on some material
- renders the material in the remark open format and imports all images, tables, etc.
- you can then review the slides as markdown and iterate
- export to either PDF or PPTX using backslide
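The core of it fits in a few lines. The sketch below assumes the Cohere provider on Inference Providers and the CohereLabs/c4ai-command-a-03-2025 repo id (both worth double-checking), and leaves the export step to the backslide CLI:

```python
import os
from huggingface_hub import InferenceClient

# Command A through the Cohere inference provider; the model id is an assumption.
client = InferenceClient(provider="cohere", api_key=os.environ["HF_TOKEN"])

material = open("notes.md").read()  # whatever material you want slides for
prompt = (
    "Turn the following material into remark markdown slides separated by '---', "
    "with speaker notes after '???' on each slide:\n\n" + material
)
resp = client.chat.completions.create(
    model="CohereLabs/c4ai-command-a-03-2025",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=2000,
)

with open("presentation.md", "w") as f:
    f.write(resp.choices[0].message.content)
# Review/iterate on the markdown, then export with backslide (e.g. `bs export` or `bs pdf`).
```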
🚀 Next steps: add text-to-speech for the audio and generate a video. This should help Hugging Face educational content scale to a billion AI learners.