SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion • Paper 2503.11576 • Published Mar 14, 2025
SmolVLM2: Bringing Video Understanding to Every Device • Article by orrzohar and 6 others • Published Feb 20, 2025
Open-source DeepResearch – Freeing our search agents • Article by m-ric and 4 others • Published Feb 4, 2025
Timm ❤️ Transformers: Use any timm model with transformers • Article by ariG23498 and 4 others • Published Jan 16, 2025
MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild • Paper 2411.11098 • Published Nov 17, 2024
Qwen2-VL • Collection: vision-language model series based on Qwen2 • 16 items • Updated 15 days ago
MobileNetV4 pretrained weights • Collection: weights for MobileNet-V4 pretrained in timm • 17 items • Updated 4 days ago
Transformer Explainer: Interactive Learning of Text-Generative Models • Paper 2408.04619 • Published Aug 8, 2024
🍃 MINT-1T • Collection: data for "MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens" • 13 items • Updated Jul 24, 2024
Searching for Better ViT Baselines • Collection: exploring ViT hyperparameters and model shapes for the GPU-poor (between tiny and base) • 28 items • Updated 4 days ago
Introducing Idefics2: A Powerful 8B Vision-Language Model for the community • Article by Leyo and 2 others • Published Apr 15, 2024
Uni-SMART: Universal Science Multimodal Analysis and Research Transformer • Paper 2403.10301 • Published Mar 15, 2024