gg-hf-g

Community Activity Feed

AI & ML interests: none defined yet.

Recent Activity

merve posted an update 1 day ago
Fine-tune Gemma3n on videos with audio in them, on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we use LoRA, audio resampling & video downsampling to be able to train in under 40 GB of VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
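For a feel of the recipe, here is a minimal sketch of the LoRA setup with transformers + peft; the checkpoint id, rank, and target modules are illustrative assumptions, not the notebook's exact values.

```python
# Minimal LoRA sketch (assumptions: checkpoint id, rank, target modules).
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3n-E2B-it"  # assumed checkpoint; the notebook may use another size
processor = AutoProcessor.from_pretrained(model_id)  # handles image/audio/text inputs
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA freezes the base weights and trains small adapter matrices, which
# (together with resampled audio and downsampled video) is what keeps the
# run under ~40 GB of VRAM.
lora_config = LoraConfig(
    r=16,  # assumed rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction should be trainable
```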
merve posted an update 3 days ago
past week had huuuge releases 💗
here are our picks 🔥 find more models, datasets, and demos here: merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new SOTA LLM with 1T total / 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size, offers a thinking mode 💭 (see the sketch after this list) as well as the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA
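As promised above, a quick sketch of trying SmolLM3-3B's thinking mode with transformers; the enable_thinking chat-template flag is an assumption borrowed from similar reasoning models, so check the model card for the exact switch.

```python
# Sketch: SmolLM3-3B with thinking mode (the enable_thinking flag is assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain LoRA in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # assumption: toggles the thinking mode mentioned above
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```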
mlabonne posted an update 7 days ago
LiquidAI open-sources a new generation of edge LLMs! 🥳

Based on a new hybrid architecture, these 350M, 700M, and 1.2B models are both fast and performant, ideal for on-device deployment.

I recommend fine-tuning them to power your next edge application. We already provide Colab notebooks to guide you. More to come soon!

📝 Blog post: https://www.liquid.ai/blog/liquid-foundation-models-v2-our-second-series-of-generative-ai-models
🤗 Models: LiquidAI/lfm2-686d721927015b2ad73eaa38
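If you just want to poke at one of these before fine-tuning, a minimal sketch, assuming the 1.2B checkpoint id from the collection above and a recent transformers release with LFM2 support:

```python
# Quick generation sketch for an LFM2 checkpoint (checkpoint id assumed).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="LiquidAI/LFM2-1.2B",  # assumed id; pick the size that fits your device
    device_map="auto",
)
print(generator("Edge deployment matters because", max_new_tokens=64)[0]["generated_text"])
```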
merve posted an update 9 days ago
GitHub has been refusing to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision
merve posted an update 10 days ago
ByteDance released Tar 1.5B and 7B: image-text in, image-text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (an LLM or a diffusion model).
The model is actually a full LLM (Qwen2); the tokenizer converts images into tokens 🤯
merve posted an update 10 days ago
Huge drops in open AI this past week!
Find more models, datasets, and demos here: merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫡
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here: BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and French
👀 aharley/alltracker is a new pixel tracking model from Stanford, demo is here: aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset
merve posted an update 15 days ago
SOOOO MANY MODEL RELEASES 😍
Here are some picks from the past week 🤗

> ByteDance/XVerse is a new identity-preserving image generation model 🖼️
> google/gemma-3n-E4B-it is an any-to-text model supported by transformers 🤗 (see the sketch after this list)
> nvidia/llama-nemoretriever-colembed-3b-v1: two new state-of-the-art visual document retrievers 📑
> A new version of the Dia TTS model is up: nari-labs/Dia-1.6B-0626
> Black Forest Labs released the Kontext benchmark: black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c
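For the Gemma 3n pick, a small sketch of the any-to-text usage via the transformers image-text-to-text pipeline; the chat-style message format follows the pipeline docs, and the sample image URL is only an example.

```python
# Sketch: image + text in, text out with Gemma 3n via the generic pipeline.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-3n-E4B-it", device_map="auto")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
print(pipe(text=messages, max_new_tokens=48)[0]["generated_text"])
```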
merve posted an update 22 days ago
Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview all the PDFs easier than before!

on top of this, there's the PdfFolder format to load PDF datasets quicker 💨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🤝

read the document dataset docs: https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here: https://huggingface.co/datasets?modality=modality:document&sort=trending 📖
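A minimal loading sketch for a dataset laid out as above, assuming the "pdffolder" loader name from the linked document-dataset docs:

```python
# Sketch: loading a PdfFolder-style dataset (loader name assumed from the docs above).
from datasets import load_dataset

# expects folder/train/doc1.pdf, folder/train/doc2.pdf, ...
# with an optional metadata.csv next to the PDFs for labels, bounding boxes, etc.
dataset = load_dataset("pdffolder", data_dir="folder")
print(dataset["train"][0])  # each row carries the PDF plus any metadata columns
```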
merve posted an update 24 days ago
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⤵️
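In code, that test looks roughly like the sketch below; the class and checkpoint names mirror the existing SuperGlue integration's conventions (assumed: LightGlueForKeypointMatching and the ETH-CVG/lightglue_superpoint checkpoint), so verify them against the transformers docs.

```python
# Sketch: matching keypoints between two photos of the same place.
# Class/checkpoint names are assumptions modeled on the SuperGlue integration.
import torch
from PIL import Image
from transformers import AutoImageProcessor, LightGlueForKeypointMatching

image1 = Image.open("view1.jpg")  # two photos of the same scene
image2 = Image.open("view2.jpg")

checkpoint = "ETH-CVG/lightglue_superpoint"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = LightGlueForKeypointMatching.from_pretrained(checkpoint)

inputs = processor([image1, image2], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# map matches back to original image coordinates and keep confident ones
image_sizes = [[(im.height, im.width) for im in (image1, image2)]]
matches = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
print(matches[0]["keypoints0"].shape, matches[0]["matching_scores"].shape)
```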
merve posted an update 24 days ago
Release picks of the past week are here! Find more models, datasets, and Spaces here: merve/june-20-releases-68594824d1f4dfa61aee3433

🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM with 3B active params, smarter with fewer tokens, supports long documents and videos 👏 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS, see the sketch after this list)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

🗣️ Audio
> Google released google/magenta-realtime for real-time music generation & audio synthesis (CC-BY-4.0)
> kyutai released new speech-to-text models in 1B & 2B sizes (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1, an image-to-3D model (see below)
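For the Nanonets OCR pick flagged above, a hedged inference sketch; since the model is based on Qwen2.5VL-3B-Instruct, the standard image-text-to-text flow should apply, though the prompt wording here is an assumption rather than the model's official prompt.

```python
# Sketch: OCR with Nanonets-OCR-s via the standard Qwen2.5-VL-style flow.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "nanonets/Nanonets-OCR-s"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("document.png")  # your scanned page
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text of this document as markdown."},  # assumed prompt
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```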
merve posted an update 28 days ago
stop using VLMs blindly ✋🏻

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 visionLMsftw/comparevlms

> has support for multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend new models or inputs to us and we'll add them 🫡

so far I've figured out:
> for fact-checks, you need a relatively bigger size (7B is OK!)
> Gemma 3 gets a downgrade without pan-and-scan (especially for 📑)
> Qwen2.5VL-32B is very talkative; great for reasoning but not good for simple tasks 🗣️