@as-cle-bert on Hugging Face: "Let's pipe some 𝗱𝗮𝘁𝗮 𝗳𝗿𝗼𝗺 𝘁𝗵𝗲 𝘄𝗲𝗯 into our vector database…"

as-cle-bert

posted an update May 11, 2025

Let's pipe some 𝗱𝗮𝘁𝗮 𝗳𝗿𝗼𝗺 𝘁𝗵𝗲 𝘄𝗲𝗯 into our vector database, shall we?🤠

With 𝐢𝐧𝐠𝐞𝐬𝐭-𝐚𝐧𝐲𝐭𝐡𝐢𝐧𝐠 𝐯𝟏.𝟑.𝟎 (https://github.com/AstraBert/ingest-anything) you can now scrape content simply starting from URLs, extract the text from it, chunk it and put it into your favorite LlamaIndex-compatible database!🕸️

You can do it thanks to 𝗰𝗿𝗮𝘄𝗹𝗲𝗲 by Apify, an open-source crawling library for Python and JavaScript that handles all the data flow from the web. ingest-anything then combines it with 𝗕𝗲𝗮𝘂𝘁𝗶𝗳𝘂𝗹𝗦𝗼𝘂𝗽, 𝗣𝗱𝗳𝗜𝘁𝗗𝗼𝘄𝗻 and 𝗣𝘆𝗠𝘂𝗣𝗗𝗙 to scrape HTML files, convert them to PDF and extract the text - hassle-free!😸
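
To make the flow concrete (URL → HTML → text → chunks), here is a minimal sketch using only the Python standard library. It is not ingest-anything's API: `fetch_html`, `html_to_text` and `chunk_words` are hypothetical helper names, `HTMLParser` stands in for BeautifulSoup, the PDF conversion step is skipped entirely, and the word-based chunker is just an assumption for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def fetch_html(url: str) -> str:
    """Hypothetical helper: download a page (a real crawler does far more)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def html_to_text(html: str) -> str:
    """Hypothetical helper: reduce an HTML document to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Hypothetical helper: split text into overlapping word windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

In a real setup, something like `chunk_words(html_to_text(fetch_html(url)))` would produce the chunks that get embedded and written to the vector database; ingest-anything takes care of those steps for you.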

Check the attached code snippet if you're curious about how to get started🎬

PS: Don't tell anybody, but this release also has another gem... It supports OpenAI models for agentic chunking, following the new releases of Chonkie🦛✨

If you don't want to miss out on the new features, leave us a little star on GitHub ➡️ https://github.com/AstraBert/ingest-anything
And join our Discord community! ➡️ https://discord.gg/kDqHNjks

ZennyKenny

May 11, 2025

Whoa. Reliable open-source crawling software is a big win. I'll take it for a spin, but I'm optimistic, as this is the kind of thing I (and every other AI builder) have been building for years to avoid paying FireCrawl.

jm02

May 15, 2025

I hope this will help me solve my issues

deleted

Jun 18, 2025

Why don't you combine this with a model? Furthermore, why don't you add a mechanism that controls the operating system for scraping, like Manus? I'm wondering about this. When the operating system is running, doesn't working with DOS seem a bit outdated?
