CultriX posted an update 7 days ago
# Announcing the RAG-Ready Content Scraper! 🚀

Supercharge your Retrieval Augmented Generation (RAG) pipelines with ease! I just finished working on the **RAG-Ready Content Scraper**, a combination of two very useful tools (RAG-Scraper and Repomix), now available as a Hugging Face Space!

## What can it do?

This intuitive application helps you effortlessly gather and process content from various sources:

* 🌐 **Webpages**: Scrape content from any URL (with RAG-Scraper). You can even control the scraping depth to fetch linked pages!
* 📂 **GitHub Repositories**: Process entire GitHub repos (using the power of Repomix) by simply providing a URL or username/repo ID.

## Various Output Formats

Convert the scraped content into a variety of RAG-friendly formats:

* **Markdown** (.md)
* **JSON** (.json)
* **CSV** (.csv)
* **Plain Text** (.txt)
* **PDF** (.pdf)

Perfect for building datasets, knowledge bases, and feeding your LLMs with high-quality, structured information.
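
To make that concrete, here's a tiny, illustrative sketch of how the scraped Markdown output could be chunked into a JSONL file ready for embedding. The file names and chunk size below are just example values, not something the Space produces or enforces:

```python
# Minimal sketch: split scraped Markdown into overlapping chunks and write a
# JSONL file that an embedding/indexing step can consume later.
# NOTE: "scraped_output.md", "rag_chunks.jsonl", and the chunk sizes are
# arbitrary example values, not outputs or defaults of the Space.
import json
from pathlib import Path

def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100):
    """Yield overlapping character-window chunks of the input text."""
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        yield text[start:end]
        if end == len(text):
            break
        start = end - overlap

source = Path("scraped_output.md")  # a Markdown file exported by the scraper
with open("rag_chunks.jsonl", "w", encoding="utf-8") as out:
    for i, chunk in enumerate(chunk_text(source.read_text(encoding="utf-8"))):
        out.write(json.dumps({"id": i, "source": source.name, "text": chunk}) + "\n")
```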

## Hope you enjoy!

Ready to streamline your RAG data preparation?

👉 **Visit the RAG-Ready Content Scraper on Hugging Face Spaces:** https://huggingface.co/spaces/CultriX/RAG-Scraper
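
If you'd rather script it than click through the web UI, Spaces built with Gradio can also be reached through the `gradio_client` library. A rough sketch is below; the endpoint name and argument order shown here are placeholders rather than the documented API, so check the Space's "Use via API" panel for the exact signature:

```python
# Minimal sketch: calling the Space programmatically with gradio_client.
# NOTE: the api_name and the argument order/values are placeholders -- see the
# Space's "Use via API" panel for the real endpoint signature.
from gradio_client import Client

client = Client("CultriX/RAG-Scraper")   # connect to the public Space
result = client.predict(
    "https://example.com/docs",  # source URL (placeholder argument)
    1,                           # scraping depth (placeholder argument)
    "Markdown",                  # output format (placeholder argument)
    api_name="/predict",         # placeholder endpoint name
)
print(result)
```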

---

Feedback and feature requests are welcome! Let's build better RAG together.


I've been thinking a lot about using small caches of embeddings for local RAG lately. Have you considered an HTTP caching proxy like Squid as a low-impact source? It already retrieves what a user is reading anyway, which tends to match their field of interest. A browser extension that signals some limited ingestion when a page is bookmarked might fit a lot of use cases too.
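
To make the idea a bit more concrete, here's a rough sketch of the Squid side. It assumes the default native access.log format and location (both may differ on your setup), and it deliberately ignores the fact that HTTPS traffic only shows up as CONNECT entries without full URLs unless you do SSL bumping:

```python
# Rough sketch of the idea: pull recently visited pages out of a Squid access
# log so a local RAG pipeline can decide what to (re-)ingest.
# ASSUMPTIONS: default native log format at /var/log/squid/access.log; plain
# HTTP GETs only (HTTPS is logged as CONNECT without the full URL).
import re
from pathlib import Path

LOG_PATH = Path("/var/log/squid/access.log")  # default location, adjust as needed
URL_RE = re.compile(r"\s(GET)\s+(https?://\S+)")

def recently_read_urls(log_path: Path = LOG_PATH) -> list[str]:
    """Return de-duplicated URLs of GET requests found in the access log."""
    seen, urls = set(), []
    for line in log_path.read_text(encoding="utf-8", errors="ignore").splitlines():
        match = URL_RE.search(line)
        if match:
            url = match.group(2)
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

if __name__ == "__main__":
    for url in recently_read_urls():
        print(url)  # candidate pages for embedding/ingestion
```

From there, the bookmark-triggered browser extension could simply append URLs to the same candidate list.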

For many reasons, smart management of context windows is my top priority in AI right now!

---

I know it's something very different from what you described, but have you read about AnythingLLM and their browser extension? I have been using it a lot and it works very well.

I've also been looking into MCP a lot lately (it seems very promising and, imo, is the next big thing happening right now), which could be used for this.

Finally, even though it's not technically RAG-related, I wanted to share this Python script that can turn pretty much any text data into an LLM dataset, just because I found it super useful (it's been a while since we talked, haha): https://www.reddit.com/r/LocalLLaMA/comments/1ai2gby/comment/korunem/?share_id=DFUUUr1ZD2ZCKFGXwccvF