Merve Noyan

merve

AI & ML interests

VLMs, vision & co

Recent Activity

liked a Space about 9 hours ago
tianbaoxiexxx/Jedi
liked a model about 9 hours ago
xlangai/Jedi-7B-1080p

Organizations

Hugging Face, Google, SODA, Notebooks-explorers, Deprem Yapay Zeka, Deprem Private, PyTorch Image Models, Turkish NLP Dataset Creators, Templates, Demo Crafters 🤗, Keras, tensorflow, Mukayese, HugGAN Community, EPFL VILAB, Hugging Face Fellows, Huggingface.js, Tools, HuggingFaceM4, scikit-learn, JAX ♥️ Diffusers 🧨, 2023 Jan Offsite hackathon, HF Canonical Model Maintainers, fastai X Hugging Face Group 2022, Huggingface Projects, boun-tabi-LMG, Kornia AI, skops-tests, Hugging Face H4, Keras Dreambooth Event, Turkish T5 - BERT - GPT-2, Blog-explorers, Hugging Face for Computer Vision, Hacktoberfest 2023, Hugging Face Smol Models Research, adept-hf-collab, Qwen, ZeroGPU Explorers, kotol, Magic Leap Community, Llava Hugging Face, MLX Community, Social Post Explorers, Top Contributors: Profile Followers, Dev Mode Explorers, Paris AI Running Club, yorg, CVPR2024, Les papiers de Merve, nltpt, s0409, Hugging Face FineVideo, mv, Cookbook Authors, open/ acc, Agents, wut?, University of Sydney, smolagents, s0225, Orr and associates org, gg-hf-g, llrehf, University of Science and Technology of China, VLMs, all things vision LMs

merve's activity

reacted to prithivMLmods's post with ❤️🤗 about 9 hours ago
Just made a demo for Cosmos-Reason1, a physical AI model that understands physical common sense and generates appropriate embodied decisions in natural language through long chain-of-thought reasoning. Also added video understanding support to it. 🤗🚀

✦ Try the demo here: prithivMLmods/Cosmos-x-DocScope

⤹ Cosmos-Reason1-7B: nvidia/Cosmos-Reason1-7B
⤹ docscopeOCR-7B-050425-exp: prithivMLmods/docscopeOCR-7B-050425-exp
⤹ Captioner-Relaxed: Ertugrul/Qwen2.5-VL-7B-Captioner-Relaxed

⤹ Multimodal Implementations: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

⤹ GitHub: https://github.com/PRITHIVSAKTHIUR/Nvidia-Cosmos-Reason1-Demo

To learn more, visit the model card of each respective model.
posted an update about 10 hours ago
HOT: MiMo-VL, new 7B vision LMs by Xiaomi surpassing GPT-4o (March), competitive in GUI agentic + reasoning tasks ❤️‍🔥 XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212

not only that, but also MIT license & usable with transformers 🔥
posted an update 1 day ago
introducing: VLM vibe eval 🪭 visionLMsftw/VLMVibeEval

vision LMs are saturated over benchmarks, so we built vibe eval 💬

> compare different models with refreshed in-the-wild examples in different categories 🤠
> submit your favorite model for eval
no numbers -- just vibes!
reacted to clem's post with ❤️👀🚀🔥 3 days ago
Playing with Veo3 this morning. Share your prompt if you want me to create videos for you (bonus points if they funnily reference HF/open-source). These videos are from the prompt "a cat on the moon rapping 'I love Hugging Face'"!
posted an update 3 days ago
emerging trend: models that can understand image + text and generate image + text

don't miss out ⤵️
> MMaDA: single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: 7B MoT model based on Qwen2.5, SigLIP-so-400M, Flux VAE ByteDance-Seed/BAGEL
both by ByteDance! 😱

I keep track of all any-input → any-output models here: merve/any-to-any-models-6822042ee8eb7fb5e38f9b62
posted an update 5 days ago
what happened in open AI this past week? so many vision LM & omni releases 🔥 merve/releases-23-may-68343cb970bbc359f9b5fb05

multimodal 💬🖼️
> new moondream (VLM) is out: it's a 4-bit quantized (QAT) version of moondream-2b, runs on 2.5GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS) 🌚
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM 🐬 (OS)
> Google DeepMind dropped MedGemma in I/O, VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance

> MMaDa is a new 8B diffusion language model that can generate image and text



LLMs
> Mistral released Devstral, a 24B coding assistant (OS) 👩🏻‍💻
> Fairy R1-32B is a new reasoning model -- distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released ACEReason-Nemotron-14B, new 14B math and code reasoning model
> sarvam-m is a new Indic LM with hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)

image generation 🎨
> MTVCrafter is a new human motion animation generator
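The moondream VRAM figure above is easy to sanity-check: ~2B parameters at 4 bits per weight is about 1 GB of weights, leaving the rest of the reported 2.5 GB for activations, the vision encoder, and KV cache. A quick back-of-envelope (parameter count is approximate):

```python
# Rough memory estimate for a 4-bit quantized ~2B-parameter model.
params = 2e9             # moondream-2b parameter count (approximate)
bits_per_weight = 4      # QAT 4-bit quantization
weights_gb = params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB
print(f"quantized weights: ~{weights_gb:.1f} GB")

reported_vram_gb = 2.5
overhead_gb = reported_vram_gb - weights_gb
print(f"headroom for activations, vision encoder, KV cache: ~{overhead_gb:.1f} GB")
```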
posted an update 8 days ago
Google released MedGemma at I/O '25 👏 google/medgemma-release-680aade845f90bec6a3f60c4

> 4B and 27B instruction fine-tuned vision LMs and a 4B pre-trained vision LM for medicine
> available with transformers from the get-go 🤗

they also released a cool demo for scan reading ➡️ google/rad_explain

use with transformers ⤵️
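A minimal sketch of the usual transformers path; the 4B instruction-tuned checkpoint id `google/medgemma-4b-it` is assumed from the release naming, so verify it against the collection. The heavy pipeline call is shown commented out since it downloads ~4B weights:

```python
# Chat-format payload that transformers multimodal pipelines expect:
# one user turn mixing an image part and a text part.
def build_messages(image_url, question):
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]

messages = build_messages("chest_xray.png", "Describe the findings in this scan.")

# Heavy part (model id assumed), shown but not executed here:
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="google/medgemma-4b-it")
# print(pipe(text=messages, max_new_tokens=128)[0]["generated_text"])
```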
posted an update 9 days ago
You can translate this post 🤗💗
posted an update 9 days ago
tis the year of any-to-any/omni models 🤠
ByteDance-Seed/BAGEL-7B-MoT: a 7B native multimodal model that understands and generates both image + text

it outperforms leading VLMs like Qwen 2.5-VL 👏 and comes with an Apache 2.0 license 😱
reacted to cfahlgren1's post with ❤️ 10 days ago
reacted to sayakpaul's post with 🚀🤗 10 days ago
Despite the emergence of combined LLM + DiT architectures for T2I synthesis, this design remains severely understudied.

This was done long ago and got into CVPR25 -- super excited to finally share it now, along with the data and code ♥️

We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.

Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.

Despite its compelling results and other performance virtues, this design remains underexplored, which is what we want to improve on in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and set out to explore what makes a "good deep fusion" between the two for T2I.

We explore several key questions in the work, such as:

Q1: How should we do attention? We considered several alternatives. PixArt-Alpha like attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?

Based on these findings, we arrive at FuseDiT, with the following components on top of the base architecture:

* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly

We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While not the best model, it's encouraging to develop something in a guided manner using open datasets.
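The "PixArt-Alpha like" cross-attention from Q1 is the core fusion mechanism: DiT image tokens act as queries over the LLM's hidden states. A toy single-head version in plain Python (no learned projections; purely illustrative, not the paper's implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """queries: DiT image tokens; keys/values: LLM hidden states.
    Each token is a list of floats (single head, no projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product scores of this image token against every text state.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted mix of the LLM value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

img_tokens = [[1.0, 0.0]]                # one DiT query token
llm_hidden = [[10.0, 0.0], [0.0, 10.0]]  # two LLM key states
llm_values = [[1.0], [0.0]]
fused = cross_attention(img_tokens, llm_hidden, llm_values)
print(fused)  # nearly all attention weight lands on the aligned LLM state
```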

To learn more (code and models are all available), please check out the paper:
https://lnkd.in/gg6qyqZX
reacted to AdinaY's post with 🚀🔥 10 days ago
ByteDance is absolutely cooking lately🔥

BAGEL 🥯, a 7B-active-parameter open multimodal foundation model by the ByteDance Seed team.

ByteDance-Seed/BAGEL-7B-MoT

✨ Apache 2.0
✨ Outperforms top VLMs (Qwen2.5-VL & InternVL-2.5)
✨ Mixture-of-Transformer-Experts + dual encoders
✨ Trained on trillions of interleaved tokens
posted an update 11 days ago
NVIDIA released a new vision reasoning model for robotics: Cosmos-Reason1-7B 🤖 nvidia/cosmos-reason1-67c9e926206426008f1da1b7

> first reasoning model for robotics
> based on Qwen 2.5-VL-7B, use with Hugging Face transformers or vLLM 🤗
> comes with SFT & alignment datasets and a new benchmark 👏
posted an update 12 days ago
It was the week of video generation at @huggingface, on top of many new LLMs, VLMs, and more!
Let’s have a wrap 🌯 merve/may-16-releases-682aeed23b97eb0fe965345c

LLMs 💬
> Alibaba Qwen released WorldPM-72B, new World Preference Model trained with 15M preference samples (OS)
> II-Medical-8B, new LLM for medical reasoning that comes in 8B by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)

Multimodal 🖼️💬
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output 💬 it's based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio, text (interleaved)

Image Generation ⏯️
> Alibaba Wan-AI released Wan2.1-VACE, video foundation model for image and text to video, video-to-audio and more tasks, comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (like AirBnB assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text and image to video) model by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators

Audio 🗣️
> stabilityai released stable-audio-open-small, a new text-to-audio model
> TEN-framework released ten-vad, voice activity detection model (OS)