Clelia Astra Bertelli
as-cle-bert
AI & ML interests
Biology + Artificial Intelligence = ❤️ | AI for sustainable development, sustainable development for AI | Researching on Machine Learning Enhancement | I love automation for everyday things | Blogger | Open Source
Recent Activity
posted an update 1 day ago
Let's pipe some data from the web into our vector database, shall we?
With the new ingest-anything release (https://github.com/AstraBert/ingest-anything) you can now scrape content starting simply from URLs, extract the text from it, chunk it and put it into your favorite LlamaIndex-compatible database! 🕸️
You can do it thanks to crawlee by Apify, an open-source crawling library for Python and JavaScript that handles all the data flow from the web: ingest-anything then combines it with BeautifulSoup, PdfItDown and PyMuPdf to scrape HTML files, convert them to PDF and extract the text, hassle-free!
Check the attached code snippet if you're curious to know how to get started.
PS: Don't tell anybody, but this release also has another gem... It supports OpenAI models for agentic chunking, following the new releases of Chonkie ✨
If you don't want to miss out on the new features, leave us a little star on GitHub ➡️ https://github.com/AstraBert/ingest-anything
And join our Discord community! ➡️ https://discord.gg/kDqHNjks
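For readers who just want a feel for the "extract the text, chunk it" part of the flow described in this post, here is a minimal sketch using only Python's standard library. It is not ingest-anything's API and it does not use crawlee, BeautifulSoup, PdfItDown or PyMuPdf: the names `TextExtractor`, `extract_text` and `chunk_words`, as well as the word-based chunking with overlap, are assumptions made purely for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping <script>/<style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>web</b> data</p><script>var x = 1;</script></body></html>"
    for chunk in chunk_words(extract_text(page), chunk_size=2, overlap=1):
        print(chunk)
```

In a real pipeline the resulting chunks would then be embedded and written to a vector database; that step is left out here because it depends on the backend you choose.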
posted an update 10 days ago
Hey there, ingest-anything v1.0.0 just dropped with major changes:
- Embeddings: now works with Sentence Transformers, Jina AI, Cohere, OpenAI, and Model2Vec. All powered via Chonkie's AutoEmbeddings. No more local-only limitations.
- Vector DBs: now supports all LlamaIndex-compatible backends. Think: Qdrant, Pinecone, Weaviate, Milvus, etc. No more bottlenecks.
- File parsing: now plugs into any LlamaIndex-compatible data loader. Using LlamaParse, Docling or your own setup? You're covered.
Curious to know more? Try it out! https://github.com/AstraBert/ingest-anything
posted an update 11 days ago
One of the biggest challenges I've faced since I started developing [PdfItDown](https://github.com/AstraBert/PdfItDown) has been correctly handling the conversion of files like Excel sheets and CSVs: table conversion was bad and messy, almost unusable for downstream tasks 🫣
That's why today I'm excited to introduce readers, the new feature of PdfItDown v1.4.0!
With readers, you can choose among three (for now) flavors of text extraction and conversion to PDF:
- Docling, which does a fantastic job with presentations, spreadsheets and Word documents
- LlamaParse by LlamaIndex, suitable for more complex and articulated documents, with a mixture of text, images and tables
- MarkItDown by Microsoft, not the best at handling highly structured documents, but extremely flexible in terms of input file format (it can even convert XML, JSON and ZIP files!)
You can use this new feature in your Python scripts (check the attached code snippet!) and in the command line interface as well!
Have fun and don't forget to star the repo on GitHub ➡️ https://github.com/AstraBert/PdfItDown
Organizations
as-cle-bert's activity
- Update requirements.txt · 1 · #1 opened 2 months ago by not-lain
- Librarian Bot: Add language metadata for dataset · #2 opened 4 months ago by librarian-bot
- Ideas! · 2 · #1 opened 6 months ago by davanstrien
- why · 1 · #1 opened 10 months ago by YaserDS-777
- Librarian Bot: Add language metadata for dataset · #2 opened about 1 year ago by librarian-bot
- [bot] Conversion to Parquet · #1 opened about 1 year ago by parquet-converter
- [ASSISTANTS] Community thread · 2 · 189 · #356 opened over 1 year ago by victor
- [bot] Conversion to Parquet · #1 opened about 1 year ago by parquet-converter
- [bot] Conversion to Parquet · #1 opened about 1 year ago by parquet-converter
- [bot] Conversion to Parquet · #1 opened about 1 year ago by parquet-converter
- [bot] Conversion to Parquet · #1 opened about 1 year ago by parquet-converter
- [bot] Conversion to Parquet · #1 opened about 1 year ago by parquet-converter