Article: Efficient MultiModal Data Pipeline • By ariG23498 and 4 others • 1 day ago • 27
Article: cocogold: training Marigold for text-grounded segmentation • By pcuenq • about 13 hours ago • 17
Space: VLM Object Understanding • Running on Zero • 54 • Explore object detection, visual grounding, keypoint detection
Post • 942
ByteDance released Tar 1.5B and 7B: image-text in, image-text out models, fully open source. Collection: ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260. They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model). The model is actually a full LLM (Qwen2); the tokenizer converts images into tokens.
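The post is terse, so here is a toy, self-contained sketch of the idea it describes: image tokens are folded into the same vocabulary as the text tokens, so a single causal LM can consume and emit both, and image spans in the output are later handed to one of the two de-tokenizers. The vocabulary and codebook sizes below are assumptions for illustration, not Tar's actual values, and none of this is the Tar codebase.

```python
# Toy sketch (not the Tar code): a unified tokenizer places image-codebook
# indices into the same id space as text tokens, so one causal LM sees a
# single token stream for both modalities.

TEXT_VOCAB_SIZE = 32_000   # assumed text vocabulary size
NUM_IMAGE_CODES = 8_192    # assumed image-tokenizer codebook size

def image_code_to_token_id(code: int) -> int:
    """Map an image-codebook index into the shared LM vocabulary,
    placed after all text token ids."""
    return TEXT_VOCAB_SIZE + code

def token_id_to_image_code(token_id: int) -> int | None:
    """Inverse mapping; returns None for ordinary text tokens."""
    if token_id >= TEXT_VOCAB_SIZE:
        return token_id - TEXT_VOCAB_SIZE
    return None

# A generated sequence is just a stream of ids; image spans are recovered by
# filtering and then passed to a de-tokenizer (LLM-based or diffusion-based).
generated = [17, 942, image_code_to_token_id(5), image_code_to_token_id(77), 3]
image_codes = [c for t in generated if (c := token_id_to_image_code(t)) is not None]
print(image_codes)  # [5, 77]
```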
moca-embed/MoCa-Qwen25VL-3B • Zero-Shot Image Classification • 4B params • Updated 8 days ago • 75 • 4
moca-embed/MoCa-Qwen25VL-7B • Zero-Shot Image Classification • 8B params • Updated 8 days ago • 95 • 2
Tar Collection: Unifying Visual Understanding and Generation via Text-Aligned Representations • 5 items • Updated 7 days ago • 12
Paper: HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context • arXiv 2506.21277 • Published 13 days ago • 15