Deprem Yapay Zeka

community

AI & ML interests

Yapay zeka kullanarak afet müdahaleye yardım etmeye çalışıyoruz. Geliştirilen bütün uygulamalara bakmak için: https://github.com/acikkaynak

deprem-ml's activity

merve 
posted an update about 5 hours ago
merve 
posted an update 1 day ago
merve 
posted an update 3 days ago
view post
Post
1863
HOT: MiMo-VL new 7B vision LMs by Xiaomi surpassing gpt-4o (Mar), competitive in GUI agentic + reasoning tasks ❤️‍🔥 XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212

not only that, but also MIT license & usable with transformers 🔥
merve 
posted an update 4 days ago
view post
Post
2628
introducing: VLM vibe eval 🪭 visionLMsftw/VLMVibeEval

vision LMs are saturated over benchmarks, so we built vibe eval 💬

> compare different models with refreshed in-the-wild examples in different categories 🤠
> submit your favorite model for eval
no numbers -- just vibes!
merve 
posted an update 6 days ago
view post
Post
2500
emerging trend: models that can understand image + text and generate image + text

don't miss out ⤵️
> MMaDA: single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: 7B MoT model based on Qwen2.5, SigLIP-so-400M, Flux VAE ByteDance-Seed/BAGEL
both by ByteDance! 😱

I keep track of all any input → any output models here https://huggingface.co/collections/merve/any-to-any-models-6822042ee8eb7fb5e38f9b62
  • 1 reply
·
merve 
posted an update 7 days ago
view post
Post
3084
what happened in open AI past week? so many vision LM & omni releases 🔥 merve/releases-23-may-68343cb970bbc359f9b5fb05

multimodal 💬🖼️
> new moondream (VLM) is out: it's 4-bit quantized (with QAT) version of moondream-2b, runs on 2.5GB VRAM at 184 tps with only 0.6% drop in accuracy (OS) 🌚
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM 🐬 (OS)
> Google DeepMind dropped MedGemma in I/O, VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance

> MMaDa is a new 8B diffusion language model that can generate image and text



LLMs
> Mistral released Devstral, a 24B coding assistant (OS) 👩🏻‍💻
> Fairy R1-32B is a new reasoning model -- distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released ACEReason-Nemotron-14B, new 14B math and code reasoning model
> sarvam-m is a new Indic LM with hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)

image generation 🎨
> MTVCrafter is a new human motion animation generator
  • 1 reply
·
merve 
posted an update 11 days ago
view post
Post
2564
Google released MedGemma on I/O'25 👏 google/medgemma-release-680aade845f90bec6a3f60c4

> 4B and 27B instruction fine-tuned vision LMs and a 4B pre-trained vision LM for medicine
> available with transformers from the get-go 🤗

they also released a cool demo for scan reading ➡️ google/rad_explain

use with transformers ⤵️
  • 1 reply
·
merve 
posted an update 11 days ago
view post
Post
3077
Bu post'u çevirebilirsiniz 🤗💗
·
merve 
posted an update 11 days ago
view post
Post
2361
tis the year of any-to-any/omni models 🤠
ByteDance-Seed/BAGEL-7B-MoT 7B native multimodal model that understands and generates both image + text

it outperforms leading VLMs like Qwen 2.5-VL 👏 and has Apache 2.0 license 😱
merve 
posted an update 13 days ago
view post
Post
1710
NVIDIA released new vision reasoning model for robotics: Cosmos-Reason1-7B 🤖 nvidia/cosmos-reason1-67c9e926206426008f1da1b7

> first reasoning model for robotics
> based on Qwen 2.5-VL-7B, use with Hugging Face transformers or vLLM 🤗
> comes with SFT & alignment datasets and a new benchmark 👏
merve 
posted an update 14 days ago
view post
Post
2570
It was the week of video generation at @huggingface , on top of many new LLMs, VLMs and more!
Let’s have a wrap 🌯 merve/may-16-releases-682aeed23b97eb0fe965345c

LLMs 💬
> Alibaba Qwen released WorldPM-72B, new World Preference Model trained with 15M preference samples (OS)
> II-Medical-8B, new LLM for medical reasoning that comes in 8B by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)

Multimodal 🖼️💬
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output 💬it’s based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio, text (interleaved)

Image Generation ⏯️
> Alibaba Wan-AI released Wan2.1-VACE, video foundation model for image and text to video, video-to-audio and more tasks, comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (like AirBnB assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text and image to video) model by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators

Audio 🗣️
> stabilityai released stable-audio-open-small new text-to-audio model
> TEN-framework released ten-vad, voice activity detection model (OS)

merve 
posted an update 17 days ago
merve 
posted an update 21 days ago
view post
Post
5019
VLMS 2025 UPDATE 🔥

We just shipped a blog on everything latest on vision language models, including
🤖 GUI agents, agentic VLMs, omni models
📑 multimodal RAG
⏯️ video LMs
🤏🏻 smol models
..and more! https://huggingface.co/blog/vlms-2025
  • 1 reply
·
merve 
posted an update 27 days ago
view post
Post
5061
A ton of impactful models and datasets in open AI past week, let's summarize the best 🤩 merve/releases-apr-21-and-may-2-6819dcc84da4190620f448a3

💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, any-to-any model in 3B and 7B!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
🖼️ > ByteDance released UI-TARS-1.5, native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation, conversation model
👩🏻‍💻 JetBrains released Melium models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩
merve 
posted an update 28 days ago
view post
Post
6566
A real-time object detector much faster and accurate than YOLO with Apache 2.0 license just landed to Hugging Face transformers 🔥

D-FINE is the sota real-time object detector that runs on T4 (free Colab) 🤩

> Collection with all checkpoints and demo ustc-community/d-fine-68109b427cbe6ee36b4e7352

Notebooks:
> Tracking https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
> Inference https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
> Fine-tuning https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper 🎩

Regular object detectors attempt to predict bounding boxes in (x, y, w, h) pixel perfect coordinates, which is very rigid and hard to solve 🥲☹️



D-FINE formulates object detection as a distribution for bounding box coordinates, refines them iteratively, and it's more accurate 🤩

Another core idea behind this model is Global Optimal Localization Self-Distillation ⤵️

this model uses final layer's distribution output (sort of like a teacher) to distill to earlier layers to make early layers more performant.

  • 2 replies
·
merve 
posted an update about 1 month ago
merve 
posted an update about 1 month ago
view post
Post
2645
Meta released Llama Guard 4 and new Prompt Guard 2 models 🔥

Llama Guard 4 is a new model to filter model inputs/outputs both text-only and image 🛡️ use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B

Prompt Guard 2 22M & 86M are smol models to prevent model jailbreaks and prompt injections ⚔ meta-llama/Llama-Prompt-Guard-2-22M meta-llama/Llama-Guard-4-12B
Both come with new release of transformers 🤗

Try the model right away 👉🏻https://github.com/huggingface/huggingface-llama-recipes/blob/main/llama_guard_4.ipynb

Read our blog to learn more and easily get started 👉🏻 https://huggingface.co/blog/llama-guard-4 🦙
  • 1 reply
·
merve 
posted an update about 1 month ago
view post
Post
4021
Don't sleep on new AI at Meta Vision-Language release! 🔥

facebook/perception-encoder-67f977c9a65ca5895a7f6ba1
facebook/perception-lm-67f9783f171948c383ee7498

Meta dropped swiss army knives for vision with A2.0 license 👏
> image/video encoders for vision language modelling and spatial understanding (object detection etc) 👏
> The vision LM outperforms InternVL3 and Qwen2.5VL 👏
> They also release gigantic video and image datasets

The authors attempt to come up with single versatile vision encoder to align on diverse set of tasks.

They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms latest sota SigLIP2 👏



> Among fine-tuned ones, first one is PE-Spatial. It's a model to detect bounding boxes, segmentation, depth estimation and it outperforms all other models 😮



> Second one is PLM, Perception Language Model, where they combine PE-Core with Qwen2.5 LM 7B. it outperforms all other models (including InternVL3 which was trained with Qwen2.5LM too!)

The authors release the following checkpoints in sizes base, large and giant:

> 3 PE-Core checkpoints (224, 336, 448)
> 2 PE-Lang checkpoints (L, G)
> One PE-Spatial (G, 448)
> 3 PLM (1B, 3B, 8B)
> Datasets



Authors release following datasets 📑
> PE Video: Gigantic video datasete of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: Human and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: New video benchmark on MCQA
  • 2 replies
·
merve 
posted an update about 1 month ago
view post
Post
3425
New foundation model on image and video captioning just dropped by NVIDIA AI 🔥

Describe Anything Model (DAM) is a 3B vision language model to generate detailed captions with localized references 😮

The team released the models, the dataset, a new benchmark and a demo 🤩 nvidia/describe-anything-680825bb8f5e41ff0785834c

Most of the vision LMs focus on image as a whole, lacking localized references in captions, and not taking in visual prompts (points, boxes, drawings around objects)

DAM addresses this on two levels: new vision backbone that takes in focal crops and the image itself, and a large scale dataset 👀

They generate a dataset by extending existing segmentation and referring expression generation datasets like REFCOCO, by passing in the images and classes to VLMs and generating captions.

Lastly, they also release a new benchmark again with self-supervision, they use an LLM to evaluate the detailed captions focusing on localization 👏
merve 
posted an update about 2 months ago
view post
Post
4482
sooo many open AI releases past week, let's summarize! 🤗
merve/april-11-releases-67fcd78be33d241c0977b9d2

multimodal
> Moonshot AI released Kimi VL Thinking, first working open-source multimodal reasoning model and Kimi VL Instruct, both 16B MoEs with 3B active params (OS)
> InternVL3 released based on Qwen2.5VL, 7 ckpts with various sizes (1B to 78B)

LLMs
> NVIDIA released Llama-3_1-Nemotron-Ultra-253B-v1 an LLM built on Llama 405B for reasoning, chat and tool use
> Agentica released DeepCoder-14B-Preview, fine-tuned version of DeepSeek-R1-Distilled-Qwen-14B on problem-test pairs, along with the compiled dataset
> Zyphra/ZR1-1.5B is a new small reasoning LLM built on R1-Distill-1.5B (OS)
> Skywork-OR1-32B-Preview is a new reasoning model by Skywork

Image Generation
> HiDream releases three new models, HiDream I1 Dev, I1 Full, and I1 fast for image generation (OS)

*OS ones have Apache 2.0 or MIT licenses
·