kornia (Kornia AI)

posted an update 3 days ago

Post

2202

Demo of OCR & Math QA using multi-capable VLMs like MonkeyOCR-pro-1.2B, R1-One-Vision, VisionaryR1, Vision Matters-7B, and VIGAL-7B, all running together with support for both image and video inference. 🪐

✦ Demo Spaces :
⤷ Multimodal VLMs : prithivMLmods/Multimodal-VLMs

✦ Models :
⤷ Visionary R1 : maifoundations/Visionary-R1
⤷ MonkeyOCR [1.2B] : echo840/MonkeyOCR-pro-1.2B
⤷ ViGaL 7B : yunfeixie/ViGaL-7B
⤷ Lh41-1042-Magellanic-7B-0711 : prithivMLmods/Lh41-1042-Magellanic-7B-0711
⤷ Vision Matters 7B : Yuting6/Vision-Matters-7B
⤷ WR30a-Deep-7B-0711 : prithivMLmods/WR30a-Deep-7B-0711

✦ MonkeyOCR-pro-1.2B Colab T4 Demo [ notebook ]
⤷ MonkeyOCR-pro-1.2B-ReportLab : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab/blob/main/MonkeyOCR-0709/MonkeyOCR-pro-1.2B-ReportLab.ipynb

✦ GitHub : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab

The community GPU grant was given by Hugging Face — special thanks to them.🤗🚀

.
.
.
To know more about it, visit the model card of the respective model. !!

merve

posted an update 3 days ago

Post

2841

GitHub refuses to render notebooks for a long time now 💔

so smol-vision now lives in Hugging Face model repository 🤗 merve/smol-vision

1 reply

·

merve

posted an update 4 days ago

Post

3305

ByteDance released Tar 1.5B and 7B: image-text in image-text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion)
The model is actually a full LLM (Qwen2), the tokenizer converts image tokens 🤯

merve

posted an update 5 days ago

Post

3583

Huge drops in open AI past week!
Find more models, datasets, demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫡
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and France
👀 aharley/alltracker is a new pixel tracking model by Stanford, demo is here aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset

1 reply

·

prithivMLmods

posted an update 9 days ago

Post

3462

Multimodal OCR with ReportLab? On Colab T4? (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoo OCR 3B?) .. Yeah, it’s possible. I’ve made a dedicated Colab notebook to experiment with these models (all built on top of Qwen2.5 VL). 🤗🚀

Download notebooks here :

✦︎ NanonetsOCR : https://colab.research.google.com/drive/1VvA-amvSVxGdWgIsh4_by6KWOtEs_Iqp
✦︎ MonkeyOCR : https://colab.research.google.com/drive/1vPCojbmlXjDFUt06FJ1tjgnj_zWK4mUo
✦︎ OCRFluxOCR : https://colab.research.google.com/drive/1TDoCXzWdF2hxVLbISqW6DjXAzOyI7pzf
✦︎ TyphoonOCR : https://colab.research.google.com/drive/1_59zvLNnn1kvbiSFxzA1WiqhpbW8RKbz

🜲 Github : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab

What does it do?

1. Performs OCR on the input image
2. Generates a DOCX or PDF file with the input image and the extracted text

.
.
.
To know more about it, visit the model card of the respective model. !!

merve

posted an update 10 days ago

Post

892

SOOOO MANY MODEL RELEASES 😍
Here's some picks from past week 🤗

> ByteDance/XVerse is a new identity preserving image generation model 🖼️
> google/gemma-3n-E4B-it, any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1 two new state-of-the-art visual document retrievers 📑
> New version of Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c

merve

posted an update 10 days ago

Post

2978

visual reasoning is now in transformers 🔥
THUDM/GLM-4.1V-9B-Thinking is just released and merged into transformers, we gave it a vibe test run 🤠

it's very good, comes with 64k context length and MIT license 😍
it supports 4k image tokens and any aspect ratio as well!
Notebook: http://colab.research.google.com/drive/1atODIiV57hOZLv16Bjzwd6fwx0yoTorj?usp=sharing
Demo: THUDM/GLM-4.1V-9B-Thinking-Demo

prithivMLmods

posted an update 11 days ago

Post

1636

The bunch of comparable demos for Multimodal VLMs (excels in OCR, cinematography understanding, spatial reasoning, etc.) now up on the Hub 🤗 — max recent till Jun'25.

✦ Demo Spaces —

> [Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B, SmolDocling] : prithivMLmods/Multimodal-OCR2
> [GLM-4.1v, docscopeOCR-7B, MonkeyOCR, coreOCR-7B] : prithivMLmods/core-OCR
> [Camel-Doc-OCR, ViLaSR-7B, OCRFlux-3B, ShotVL-7B] : prithivMLmods/Doc-VLMs-v2-Localization
> [SkyCaptioner-V1, SpaceThinker-3B, coreOCR-7B, SpaceOm-3B] : prithivMLmods/VisionScope-R2
> [RolmOCR-7B, Qwen2-VL-OCR-2B, Aya-Vision-8B, Nanonets-OCR-s] : prithivMLmods/Multimodal-OCR
> [DREX-062225-7B, Typhoon-OCR-3B, olmOCR-7B-0225, VIREX-062225-7B] : prithivMLmods/Doc-VLMs-OCR
> [Cosmos-Reason1-7B, docscopeOCR-7B, Captioner-7B, visionOCR-3B] : prithivMLmods/DocScope-R1

✦ Space Collection : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To know more about it, visit the model card of the respective model. !!

1 reply

·

Nymbo

posted an update 12 days ago

Post

1663

Anyone know how to reset Claude web's MCP config? I connected mine when the HF MCP first released with just the default example spaces added. I added lots of other MCP spaces but Claude.ai doesn't update the available tools... "Disconnecting" the HF integration does nothing, deleting it and adding it again does nothing.

Refreshing tools works fine in VS Code because I can manually restart it in mcp.json, but claude.ai has no such option. Anyone got any ideas?

3 replies

·

merve

posted an update 12 days ago

Post

2502

so many multimodal releases these days 🤠
> ERNIE-4.5-VL: new vision language MoE models by Baidu https://huggingface.co/models?search=ernie-4.5-vl
> new visual document retrievers by NVIDIA (sota on ViDoRe!) nvidia/llama-nemoretriever-colembed-3b-v1 nvidia/llama-nemoretriever-colembed-1b-v1
> Ovis-3b: new image-text in image-text out models by Alibaba ⤵️ https://huggingface.co/spaces/AIDC-AI/Ovis-U1-

prithivMLmods

posted an update 12 days ago

Post

2396

The demo for Camel-Doc-OCR-062825 (exp) is optimized for document retrieval and direct Markdown (.md) generation from images and PDFs. Additional demos include OCRFlux-3B (document OCR), VilaSR (spatial reasoning with visual drawing), and ShotVL (cinematic language understanding). 🐪

✦ Space : prithivMLmods/Doc-VLMs-v2-Localization

Models :
⤷ camel-doc-ocr-062825 : prithivMLmods/Camel-Doc-OCR-062825
⤷ ocrflux-3b : ChatDOC/OCRFlux-3B
⤷ vilasr : AntResearchNLP/ViLaSR
⤷ shotvl : Vchitect/ShotVL-7B

⤷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

The community GPU grant was given by Hugging Face — special thanks to them. This space supports the following tasks: (image inference, video inference) with result markdown canvas and object detection/localization. 🤗🚀

.
.
.
To know more about it, visit the model card of the respective model. !!

merve

posted an update 16 days ago

Post

585

Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview all the PDFs easier than before!

on top of this, there's PdfFolder format to load the PDF datasets quicker 💨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc1.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🤝

read document dataset docs https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here https://huggingface.co/datasets?modality=modality:document&sort=trending 📖

1 reply

·

prithivMLmods

posted an update 18 days ago

Post

1966

The demo for DREX-062225-exp (Document Retrieval and Extraction eXpert ~ experimental) / typhoon-ocr-3b (a bilingual document parsing model built specifically for real-world documents) / VIREX-062225-exp (Video Information Retrieval and Extraction eXpert ~ experimental) / olmOCR-7B-0225-preview (the document parsing model based on Qwen2VL). 🤗

✦ Demo : prithivMLmods/Doc-VLMs-OCR ~ ( with .md canvas )

⤷ DREX-062225-exp : prithivMLmods/DREX-062225-exp
⤷ typhoon-ocr-3b : scb10x/typhoon-ocr-3b
⤷ VIREX-062225-exp : prithivMLmods/VIREX-062225-exp
⤷ olmOCR-7B-0225-preview : allenai/olmOCR-7B-0225-preview

⤷ Collection : prithivMLmods/doc-vl-685839064a863e1cd23be3f1
⤷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
.
.
.

To know more about it, visit the model card of the respective model. !!

5 replies

·

merve

posted an update 18 days ago

Post

634

we've merged LightGlue keypoint matcher to Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⤵️

1 reply

·

prithivMLmods

posted an update 19 days ago

Post

2687

Updated the docscopeOCR-7B-050425-exp with the DREX-062225-exp, with improved preciseness in table structure and line spacing in the markdown used on the document page. And though this is still an experimental one, it's expected to perform well in the defined DREX use cases [ Document Retrieval and Extraction eXpert – experimental ocr ]. 💻

⤷ Model : prithivMLmods/DREX-062225-exp
⤷ Demo : prithivMLmods/Doc-VLMs-OCR

⤷ Collection : prithivMLmods/doc-vl-685839064a863e1cd23be3f1
⤷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
⤷ Git : https://github.com/PRITHIVSAKTHIUR/DREX.git
.
.
.

To know more about it, visit the model card of the respective model. !!

merve

posted an update 19 days ago

Post

4326

Release picks of the past week is here! Find more models, datasets, Spaces here merve/june-20-releases-68594824d1f4dfa61aee3433

🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with less tokens, supports long documents, videos 👏 (OS)
> nanonets/Nanonets-OCR-s is 3.75B params OCR model based on Qwen2.5VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

🗣️ Audio
> Google released google/magenta-realtime, real time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models that come in 1B & 2B ( kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1 an image-to-3D model (see below)

merve

posted an update 20 days ago

Post

5029

fav open-source multimodal reasoning model just got an update 🔥

moonshotai/Kimi-VL-A3B-Thinking-2506 has
> smarter with less tokens, small size (only 3B active params!!!)
> better accuracy
> video reasoning
> higher resolution 🤯
Read their blog https://huggingface.co/blog/moonshotai/kimi-vl-a3b-thinking-2506

merve

posted an update 22 days ago

Post

2292

y'all have been asking my opinion on how OCR models compare to each other 👀
I will leave three apps to compare newest models by @prithivMLmods instead ⤵️
> compare Nanonets-OCR-s, Qwen2-VL-OCR-2B-Instruct, RolmOCR, Aya-Vision prithivMLmods/Multimodal-OCR
> SmolDocling, Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B prithivMLmods/Multimodal-OCR2
> docscopeOCR, MonkeyOCR, coreOCR prithivMLmods/core-OCR

1 reply

·

prithivMLmods

posted an update 22 days ago

Post

1898

The demo for smoldocling / nanonets ocr / typhoon ocr / monkey ocr explores the document OCR capabilities of various newly released multimodal VLMs in a single space. And if you're experiencing or demoing long document image OCR, kindly use the Smoldocling 256M preview [ Smoldocling is back in demo here. ] 🤗.

✦ Try the demo here : prithivMLmods/Multimodal-OCR2

⤷ MonkeyOCR Recognition : echo840/MonkeyOCR
⤷ Nanonets-OCR-s : nanonets/Nanonets-OCR-s
⤷ SmolDocling-256M-preview : ds4sd/SmolDocling-256M-preview
⤷ typhoon-ocr-7b : scb10x/typhoon-ocr-7b

⤷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

⤷ Github : https://github.com/PRITHIVSAKTHIUR/Multimodal-OCR2

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To know more about it, visit the model card of the respective model. !!

2 replies

·

merve

posted an update 23 days ago

Post

1917

stop using VLMs blindly ✋🏻

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 visionLMsftw/comparevlms

> has support for multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend us new models or inputs, we'll add 🫡

so far I figured out
> for fact-checks, you need a relatively bigger size (7B is ok!)
> Gemma 3 gets downgrade without pan and scan (especially for 📑)
> Qwen2.5VL-32B is very talkative, great for reasoning but not good for simple tasks 🗣️

2 replies

·

Kornia AI

AI & ML interests

Recent Activity

AI & ML interests

Recent Activity

Team members 63

kornia's activity