As part of our Summer of Workflows series, we are excited to release MCP Server — an MCP (Model Context Protocol) server that connects directly to your ApertureDB Cloud instance.
This workflow gives your Generative AI models and AI agents live, multimodal memory—enabling real-time access to images, text, video, embeddings, and more.
🔍 Why it matters: Static context limits what AI agents can do. With MCP + ApertureDB, your LLMs can now query fresh, contextual information as they reason, plan, and act.
✅ What’s included:
- A deployable MCP-compliant server (zero glue code needed)
- Works out-of-the-box with ApertureDB Cloud
- Built-in authentication for secure, production-ready deployment
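For a feel of how an agent would talk to the server, here is a minimal client sketch using the MCP Python SDK's stdio client. The launch command, environment variable, and discovered tools are assumptions for illustration, not the ApertureDB server's actual interface; check the workflow docs for the real connection details.

```python
# Minimal sketch: connect an MCP client to a locally launched MCP server.
# Uses the official `mcp` Python SDK; the server command and env var are hypothetical.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(
    command="aperturedb-mcp-server",           # hypothetical launch command
    args=[],
    env={"APERTUREDB_KEY": "<your-api-key>"},  # hypothetical auth variable
)

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # discover what the server exposes
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```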
Here is the fastest way to ingest a Croissant-formatted dataset into a high-performance multimodal database.
Workflow #1: Croissant Ingestion is live on ApertureDB Cloud — the first release in our Summer Workflows series.
Plug in any MLCommons Croissant-formatted dataset and this ready-to-run workflow will:
✅ Parse Croissant metadata
📥 Download all linked assets (images, text, video, etc.)
📦 Ingest them into ApertureDB, preserving structure and relationships
All with just a few lines of Python.
Whether you are working with public datasets from Hugging Face or prepping production-ready data pipelines — this is the ingestion flow you’ve been waiting for.
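To give a sense of those "few lines of Python", here is a minimal sketch that reads a Croissant dataset with the MLCommons mlcroissant library; the final ingestion call is a hypothetical placeholder, since the workflow's own entry point may differ.

```python
# Minimal sketch: parse a Croissant dataset and hand each record to an ingestion step.
# `mlcroissant` is the MLCommons Croissant library; the ingest helper below is hypothetical.
import mlcroissant as mlc

# Any Croissant JSON-LD works, e.g. the one Hugging Face publishes per dataset.
ds = mlc.Dataset(jsonld="https://huggingface.co/api/datasets/<dataset>/croissant")

record_set = ds.metadata.record_sets[0]
for record in ds.records(record_set=record_set.uuid):
    # In the real workflow, this is where linked assets are downloaded and written
    # to ApertureDB with their structure and relationships preserved.
    ingest_into_aperturedb(record)  # hypothetical helper, not the workflow's API
```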
We are happy to release the OpenPII English Anonymiser — the most powerful open-source tool for redacting sensitive information from English text.
A ModernBERT model fine-tuned on 5.7 million+ PII examples, it’s clocking 99%+ accuracy across emails, dates, social numbers, and more!
Why it’s a big deal:
✅ Top-tier precision: 100% for passport numbers, 99.96% for emails*.
✅ Totally free: MIT license for personal or commercial use.
✅ No secrets: Full metrics shared on Hugging Face.
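For reference, this is roughly how such a token-classification model is wired up with the transformers pipeline; the model ID below is a placeholder, so grab the real repository name and label set from the model card on Hugging Face.

```python
# Minimal sketch: redact PII spans detected by a token-classification model.
# The model ID is a placeholder; consult the model card for the real repository name.
from transformers import pipeline

redactor = pipeline(
    "token-classification",
    model="<org>/openpii-english-anonymiser",  # placeholder model ID
    aggregation_strategy="simple",             # merge sub-tokens into whole entities
)

text = "Contact Jane Doe at jane.doe@example.com before 2024-05-01."
entities = redactor(text)

# Replace each detected span with its entity label, working right-to-left
# so earlier character offsets stay valid.
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]

print(text)
```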
Excited to share insights about LinkedIn's innovative approach to content search, recently detailed in a groundbreaking paper by their Mountain View team. This advancement represents a significant shift from traditional keyword-based search to semantic understanding.
>> Technical Architecture
The new search engine employs a sophisticated two-layer architecture:
Retrieval Layer
- Token Based Retriever (TBR) for exact keyword matching
- Embedding Based Retriever (EBR) using a two-tower model with multilingual-e5 embeddings
- Pre-computed post embeddings stored in a dedicated embedding store for efficient retrieval
Multi-Stage Ranking
- L1 Stage: Initial filtering using a lightweight model
- L2 Stage: Advanced ranking with complex features, including:
  - Query-post semantic matching
  - Author reputation analysis
  - User engagement metrics
  - Content freshness evaluation
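To make the embedding-based retriever concrete, here is a minimal two-tower sketch with sentence-transformers and the public multilingual-e5 model; LinkedIn's production setup is of course far more involved. Note that e5 models expect "query:" and "passage:" prefixes on the two towers.

```python
# Minimal sketch of an embedding-based retriever (EBR) with a two-tower setup:
# posts are embedded offline, queries at request time, and matches scored by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

posts = [
    "Five tips for negotiating a salary increase with your manager",
    "Our team is hiring backend engineers in Dublin",
    "How I prepared for the conversation about a promotion",
]

# Offline tower: pre-compute and store post embeddings (e5 uses a "passage: " prefix).
post_emb = model.encode([f"passage: {p}" for p in posts], normalize_embeddings=True)

# Online tower: embed the incoming query (e5 uses a "query: " prefix).
query_emb = model.encode("query: how to ask for a raise?", normalize_embeddings=True)

scores = util.cos_sim(query_emb, post_emb)[0]
best = scores.argmax().item()
print(posts[best], float(scores[best]))
```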
>> Performance Improvements
The system has achieved remarkable results:
- 10%+ improvement in both on-topic rate and long-dwell metrics
- Enhanced ability to handle complex natural language queries
- Significant boost in sitewide engagement
This advancement enables LinkedIn to better serve complex queries like "how to ask for a raise?" while maintaining high performance at scale. The system intelligently balances between exact keyword matching and semantic understanding, ensuring optimal results for both navigational and conceptual searches.
What impresses me most is how the team solved the scale challenge - processing billions of posts efficiently using pre-computed embeddings and approximate nearest neighbor search. This is enterprise-scale AI at its finest.
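For a feel of that pattern, here is a toy sketch of pre-computed embeddings plus approximate nearest neighbor search, using a FAISS HNSW index as a stand-in; the paper does not say which ANN engine LinkedIn actually runs.

```python
# Toy sketch: approximate nearest neighbor search over pre-computed post embeddings.
# FAISS HNSW is used for illustration only; vectors here are random stand-ins.
import numpy as np
import faiss

dim = 768  # embedding dimension (e.g. multilingual-e5-base)
post_embeddings = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(post_embeddings)  # normalized vectors -> inner product == cosine

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # 32 HNSW neighbors per node
index.add(post_embeddings)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # top-10 candidate posts for L1/L2 ranking
print(ids[0], scores[0])
```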
🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.
Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.
🏷️ 200+ contributors used Argilla to label MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!
Thanks to this annotation process, the open dataset contains two subsets:
1. 🗽 Culturally Agnostic: no specific regional or cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural, or geographic knowledge to answer correctly.
Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.
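For anyone who wants to poke at the two subsets, here is a minimal loading sketch with the datasets library; the repository ID and the column that marks cultural sensitivity are assumptions, so check the dataset card for the exact names and label values.

```python
# Minimal sketch: load Global-MMLU and split it by cultural sensitivity.
# The repo ID and the label column/values are assumptions; see the dataset card.
from datasets import load_dataset

ds = load_dataset("CohereForAI/Global-MMLU", "en", split="test")  # one config per language

# Hypothetical column marking each question as culturally agnostic vs. sensitive.
agnostic = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CA")
sensitive = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CS")

print(len(agnostic), "culturally agnostic /", len(sensitive), "culturally sensitive")
```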
I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.
I'm currently on a push to expand the scope of image-based datasets on the Hub. There's certainly a lot already, but for anyone who's looked closely, there's not a whole lot of standardization. I aim to fix that: datasets under the timm and pixparse orgs will serve as canonical examples for various task/modality combinations and be usable without fuss in libraries like timm, OpenCLIP, and hopefully more.
I just uploaded the first multi-label dataset that I'll support with timm scripts soon: timm/plant-pathology-2021
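A quick sketch of pulling it straight from the Hub with the datasets library; the split name is an assumption and the column schema is whatever the dataset card says, so the code only peeks at the first example rather than guessing field names.

```python
# Minimal sketch: load the multi-label dataset from the Hub and peek at one example.
# The split name is assumed; check the dataset viewer for the real schema.
from datasets import load_dataset

ds = load_dataset("timm/plant-pathology-2021", split="train")

example = ds[0]
print(example.keys())  # e.g. an image column plus one or more label columns
print(example)
```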
Next up, object detection & segmentation! I've got an annotation spec sorted out and a lot of datasets ready to rip, and yeah, that means timm support for object detection, and eventually segmentation, is finally under development :O
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
💨 a vision language model with 9x fewer image tokens, super efficient
📖 aligned with DPO for reducing hallucinations
⚡️ Apache 2.0 license 🔥
In August, the XetHub team joined Hugging Face - https://huggingface.co/blog/xethub-joins-hf - and we’ve been rolling up our sleeves to bring the best of both worlds together. We started with a deep dive into the current state of files stored with Git LFS on the Hub.
Getting this information was no small feat. We had to:
* Analyze a complete database dump of all repositories and files stored in Git LFS across Hugging Face.
* Parse through metadata on file sizes and types to accurately map the storage breakdown across Spaces, Models, and Datasets.
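The analysis itself boils down to a big group-by over file metadata. Here is a toy sketch of the shape of that computation, with made-up column names and numbers standing in for the real database dump schema.

```python
# Toy sketch: map LFS storage by repo type and file extension from a metadata dump.
# Column names and sizes are invented for illustration; the real dump schema differs.
import pandas as pd

files = pd.DataFrame(
    {
        "repo_type": ["model", "model", "dataset", "space"],
        "extension": [".safetensors", ".bin", ".parquet", ".ckpt"],
        "size_bytes": [4_300_000_000, 2_100_000_000, 900_000_000, 650_000_000],
    }
)

breakdown = (
    files.groupby(["repo_type", "extension"])["size_bytes"]
    .sum()
    .sort_values(ascending=False)
)
print(breakdown / 1e9)  # storage per repo type and file type, in GB
```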