Demo of OCR & Math QA using multi-capable VLMs like MonkeyOCR-pro-1.2B, R1-One-Vision, VisionaryR1, Vision Matters-7B, and VIGAL-7B, all running together with support for both image and video inference. 🪐
Because hackathons are often the starting point for many AI projects, I've created a Python backend template that incorporates my feedback from past events to streamline collaboration and time-pressured deployments 🏎️
Within a year, I had the opportunity to take part in hackathons organized by Mistral, OpenAI, and DeepMind, and this GitHub template is structured around the fundamental building blocks and recommendations I offer developers eager to participate in their first hackathon, whether on a team or solo. Its emphasis is on rapid setup and deployment through:
- uv as the package manager, with usage simplified by a series of pre-configured make commands.
- FastAPI for API management, structured in a modular architecture designed to minimize branch conflicts when merging to main (with minimal health-check and ping routes to verify that Docker runs correctly and that the backend is reachable on the local network); see the sketch after this list.
- Pydantic for validation and type handling, which simplifies debugging and makes data objects easier to understand.
- A set of custom instructions tailored for agents (Cline and GitHub Copilot), aimed at improving overall comprehension of the application and optimizing the vibe-coding experience.
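As an illustration of that modular layout, here is a minimal sketch of a health-check router built with FastAPI and Pydantic. The route paths, response model, and app title are my own assumptions, not necessarily what the template ships with:

```python
# Minimal sketch (illustrative names, not the template's actual code):
# a Pydantic response model plus /health and /ping routes that Docker and
# teammates can hit to confirm the backend is up on the local network.
from fastapi import APIRouter, FastAPI
from pydantic import BaseModel


class HealthResponse(BaseModel):
    status: str
    version: str


router = APIRouter()


@router.get("/health", response_model=HealthResponse)
def health() -> HealthResponse:
    return HealthResponse(status="ok", version="0.1.0")


@router.get("/ping")
def ping() -> dict[str, str]:
    return {"ping": "pong"}


# Each feature lives in its own router module, which keeps merge conflicts
# on main to a minimum; the app simply includes the routers.
app = FastAPI(title="hackathon-backend")
app.include_router(router)
```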
This template includes unit tests that all pass with full coverage, as well as a minimal CI file ensuring that the FastAPI application runs correctly. Merging code that breaks the server into production thus becomes impossible ⛔️
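As a sketch of the kind of test such a CI job can run (assuming the app object from the sketch above, exposed from a hypothetical main module; the template's own tests may differ):

```python
# Minimal sketch: exercising the health-check routes with FastAPI's TestClient.
# `main` is a hypothetical module path for the app defined in the sketch above.
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)


def test_health_returns_ok():
    resp = client.get("/health")
    assert resp.status_code == 200
    assert resp.json()["status"] == "ok"


def test_ping_returns_pong():
    assert client.get("/ping").json() == {"ping": "pong"}
```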
In general, I would reiterate an essential piece of advice: your two main adversaries are branch conflicts—particularly when the same file is modified concurrently within a brief period, especially if your architecture isn’t built for scalability—and deployment issues under urgent circumstances ⏱️
Multimodal OCR with ReportLab? On a Colab T4? (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B?) ... Yeah, it's possible. I've made a dedicated Colab notebook to experiment with these models (all built on top of Qwen2.5 VL). 🤗🚀
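For a rough idea of the loading pattern, here is a minimal sketch assuming these checkpoints follow the standard Qwen2.5-VL image-text-to-text path in transformers; the model id, prompt, and generation settings are illustrative, so check each model card for exact usage:

```python
# Minimal sketch: running a Qwen2.5-VL-based OCR checkpoint on a Colab T4.
# fp16 + device_map="auto" keeps the 3B-class models within a 16 GB T4.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "nanonets/Nanonets-OCR-s"  # swap in any of the Qwen2.5-VL-based OCR models
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("page.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text of this page as Markdown."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=1024)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```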
A batch of comparable demos for multimodal VLMs (excelling at OCR, cinematography understanding, spatial reasoning, etc.) is now up on the Hub 🤗, covering the most recent releases through June '25.
The demo for Camel-Doc-OCR-062825 (exp) is optimized for document retrieval and direct Markdown (.md) generation from images and PDFs. Additional demos include OCRFlux-3B (document OCR), VilaSR (spatial reasoning with visual drawing), and ShotVL (cinematic language understanding). 🐪
This Space runs on a community GPU grant from Hugging Face, special thanks to them. It supports image inference and video inference, with a Markdown result canvas and object detection/localization. 🤗🚀
To learn more, visit the model card of the respective model.
The demo covers DREX-062225-exp (Document Retrieval and Extraction eXpert ~ experimental), typhoon-ocr-3b (a bilingual document parsing model built specifically for real-world documents), VIREX-062225-exp (Video Information Retrieval and Extraction eXpert ~ experimental), and olmOCR-7B-0225-preview (a document parsing model based on Qwen2-VL). 🤗
Updated docscopeOCR-7B-050425-exp to DREX-062225-exp, with improved precision in table structure and line spacing in the Markdown rendered from the document page. Though still experimental, it's expected to perform well in the defined DREX use cases [ Document Retrieval and Extraction eXpert – experimental OCR ]. 💻
The demo for smoldocling / nanonets ocr / typhoon ocr / monkey ocr explores the document OCR capabilities of various newly released multimodal VLMs in a single space. If you're experimenting with or demoing long-document image OCR, kindly use the SmolDocling 256M preview [ SmolDocling is back in the demo here ]. 🤗
🌐 Clinical Trials Dataset now available on Hugging Face! 🧬
I’ve just released a comprehensive, ML-ready dataset featuring 500,000+ clinical trial records sourced directly from ClinicalTrials.gov for biomedical NLP, healthcare analytics, and clinical research applications 🤗
I wanted to produce the most complete and up-to-date dump with all raw data partially flattened to simplify extraction, self-querying and processing.
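As a rough sketch of how it could be consumed (the repo id and field names below are placeholders, not the actual dataset schema):

```python
# Minimal sketch: streaming a few records from the dump with 🤗 Datasets.
# "username/clinical-trials" is a placeholder repo id, and the field names
# are assumptions about the flattened ClinicalTrials.gov schema.
from datasets import load_dataset

ds = load_dataset("username/clinical-trials", split="train", streaming=True)
for record in ds.take(3):
    print(record.get("nct_id"), record.get("brief_title"))
```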
Do you have any ideas about what we can do with it? Using descriptions to enhance specialized embedding models?
The demo combines the MonkeyOCR Recognition model, which adopts a Structure-Recognition-Relation (SRR) triplet paradigm, Nanonets-OCR-s, a powerful state-of-the-art image-to-markdown OCR model that goes far beyond traditional text extraction, and other experimental document OCR models into a single space.
So at every bio/med/chem meeting I go to, I always get the same questions: "Why are you sharing a Google Drive link with me for this?" and "Do you have any plans to publish your model weights and datasets on Hugging Face?" Today I finally got a good answer that explains everything:
Basically, there is some kind of government censorship on this (USA, but I'm sure others too), and researchers are told they are not allowed to, as it is considered a "data leak", which is illegal!
This is terrible! But the good news is that we can do something about it!
OpenAI, Google, Hugging Face, and Anthropic have released guides and courses on building agents, prompting techniques, scaling AI use cases, and more. Below are 10+ minimalistic guides and courses that may help you along the way. 📖
Just made a demo for Cosmos-Reason1, a physical AI model that understands physical common sense and generates appropriate embodied decisions in natural language through long chain-of-thought reasoning. Also added video understanding support to it. 🤗🚀
Got access to Google's all-new Gemini Diffusion, a state-of-the-art text diffusion model. It delivers the performance of Gemini 2.0 Flash-Lite at 5x the speed, generating over 1,000 tokens in a fraction of a second and producing impressive results. Below are some initial outputs generated with the model. ♊🔥
More optimized explicit-content filters: lightweight 𝙜𝙪𝙖𝙧𝙙 models trained on siglip2 patch16 512 and vit patch16 224 for illustration and explicit-content classification, aimed at content moderation in social media, forums, and parental controls for safer browsing environments. This version fixes the issues in the previous release, which lacked sufficient resources. 🚀
Models for detecting images generated by diffusion models (Flux.1, SDXL, ...) are trained or fine-tuned as image classification models for content moderation, using datasets available on the Hub. For identifying AI-generated images or moderating visual content, the recommended model is OpenSDI-Flux.1-SigLIP2. 😺🧨
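For a quick idea of how such a classifier can be used, here is a minimal sketch via the transformers image-classification pipeline; the repo id is a placeholder (look up the full "<user>/OpenSDI-Flux.1-SigLIP2" id on the Hub), and the label names depend on the checkpoint:

```python
# Minimal sketch: scoring an image with a SigLIP2-based detector through the
# transformers image-classification pipeline. The repo id is a placeholder.
from PIL import Image
from transformers import pipeline

detector = pipeline(
    "image-classification",
    model="<user>/OpenSDI-Flux.1-SigLIP2",  # placeholder: replace with the full Hub id
)

image = Image.open("suspect.jpg")
for pred in detector(image):
    # label names (e.g. real vs. AI-generated) depend on the checkpoint
    print(pred["label"], round(pred["score"], 3))
```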
Dropping some image classification models for content moderation, along with classifiers trained on datasets available on the Hub. All are fine-tuned on the siglip2 backbone (competitions: AIOrNot, Imagenette, and Driver-Drowsiness). Models and datasets are listed below:
🔗 Collection: [Earlier models are also listed in the same collection, so you can find more models focused on image classification tasks there.]
Run an LLM locally using Docker, right inside your codebase (no GUI needed!)
In this project, I didn't use a supporting GUI like Open WebUI or LM Studio. The point of using standalone LLM models with Ollama is to show how you can call them from your own project/code instead of going through a third-party app. Everything is containerized with Docker, so the setup is clean and repeatable. It's just a fun side project so my connections can learn more about running models locally in their own projects.
Tech stack used:
🐋 Docker
🦙 LLaMA via Ollama
💻 HTML/CSS/JS
🐍 Python + FastAPI
🌐 NGINX
It's still early and a fun side project, but if you're into local model deployment, or just want to see how it works, check it out at the link provided!
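As a rough sketch of the kind of glue this setup needs, here is a minimal FastAPI route that forwards a prompt to Ollama's REST API; the service URL, model name, and route path are my own assumptions, so adjust them to the actual compose setup:

```python
# Minimal sketch: a FastAPI endpoint that proxies prompts to a local Ollama
# container over its /api/generate REST endpoint. Assumes a docker-compose
# service named "ollama" and a model (e.g. "llama3") already pulled.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
OLLAMA_URL = "http://ollama:11434/api/generate"  # assumption: compose service "ollama"


class PromptRequest(BaseModel):
    prompt: str
    model: str = "llama3"  # assumption: adjust to the model you pulled


@app.post("/generate")
async def generate(req: PromptRequest):
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            OLLAMA_URL,
            json={"model": req.model, "prompt": req.prompt, "stream": False},
        )
        resp.raise_for_status()
    return {"response": resp.json()["response"]}
```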