Article: Efficient MultiModal Data Pipeline • By ariG23498 and 4 others • 1 day ago • 27
Article: cocogold: training Marigold for text-grounded segmentation • By pcuenq • about 13 hours ago • 17
Space: VLM Object Understanding • Running on Zero • 54 • Explore object detection, visual grounding, keypoint detection
Post • 942
ByteDance released Tar 1.5B and 7B: image-text in, image-text out models, fully open source. Collection: ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260. They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model). The model is actually a full LLM (Qwen2); the tokenizer converts images into tokens.
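The post is terse, so here is a toy, self-contained sketch of the idea it describes: image tokens are folded into the same vocabulary as the text tokens, so a single causal LM can consume and emit both, and image spans in the output are later handed to one of the two de-tokenizers. The vocabulary and codebook sizes below are assumptions for illustration, not Tar's actual values, and none of this is the Tar codebase.

```python
# Toy sketch (not the Tar code): a unified tokenizer places image-codebook
# indices into the same id space as text tokens, so one causal LM sees a
# single token stream for both modalities.

TEXT_VOCAB_SIZE = 32_000   # assumed text vocabulary size
NUM_IMAGE_CODES = 8_192    # assumed image-tokenizer codebook size

def image_code_to_token_id(code: int) -> int:
    """Map an image-codebook index into the shared LM vocabulary,
    placed after all text token ids."""
    return TEXT_VOCAB_SIZE + code

def token_id_to_image_code(token_id: int) -> int | None:
    """Inverse mapping; returns None for ordinary text tokens."""
    if token_id >= TEXT_VOCAB_SIZE:
        return token_id - TEXT_VOCAB_SIZE
    return None

# A generated sequence is just a stream of ids; image spans are recovered by
# filtering and then passed to a de-tokenizer (LLM-based or diffusion-based).
generated = [17, 942, image_code_to_token_id(5), image_code_to_token_id(77), 3]
image_codes = [c for t in generated if (c := token_id_to_image_code(t)) is not None]
print(image_codes)  # [5, 77]
```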
moca-embed/MoCa-Qwen25VL-3B • Zero-Shot Image Classification • 4B params • Updated 8 days ago • 75 • 4
moca-embed/MoCa-Qwen25VL-7B • Zero-Shot Image Classification • 8B params • Updated 8 days ago • 95 • 2
Tar Collection: Unifying Visual Understanding and Generation via Text-Aligned Representations • 5 items • Updated 7 days ago • 12
Paper: HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context • arXiv 2506.21277 • Published 13 days ago • 15