Clelia Astra Bertelli

as-cle-bert

AI & ML interests

Biology + Artificial Intelligence = ❤️ | AI for sustainable development, sustainable development for AI | Researching Machine Learning Enhancement | I love automation for everyday things | Blogger | Open Source

Recent Activity

posted an update 1 day ago
Let's pipe some data from the web into our vector database, shall we? 🤠

With ingest-anything v1.3.0 (https://github.com/AstraBert/ingest-anything) you can now scrape content starting from nothing but URLs, extract the text from it, chunk it and put it into your favorite LlamaIndex-compatible database! 🕸️

You can do it thanks to crawlee by Apify, an open-source crawling library for Python and JavaScript that handles all the data flow from the web: ingest-anything then combines it with BeautifulSoup, PdfItDown and PyMuPdf to scrape HTML files, convert them to PDF and extract the text, hassle-free! 😸

Check the attached code snippet if you're curious to know how to get started 🎬

PS: Don't tell anybody, but this release also has another gem... It supports OpenAI models for agentic chunking, following the new releases of Chonkie 🦛✨

If you don't want to miss out on the new features, leave us a little star on GitHub ➡️ https://github.com/AstraBert/ingest-anything

And join our Discord community! ➡️ https://discord.gg/kDqHNjks
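The extract-and-chunk steps described in this post can be illustrated with a minimal sketch that uses only the Python standard library. It is not ingest-anything's API (which this page does not show) and it swaps in `html.parser` and fixed-size word windows for the BeautifulSoup/PdfItDown/PyMuPdf and Chonkie steps the post mentions. The names `TextExtractor`, `html_to_text` and `chunk_words` are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

In a real pipeline, each chunk would then be embedded and written to the vector store; that step depends on the chosen backend and is left out of this sketch.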
posted an update 10 days ago
Hey there, ingest-anything v1.0.0 just dropped with major changes:

✅ Embeddings: now works with Sentence Transformers, Jina AI, Cohere, OpenAI, and Model2Vec, all powered via Chonkie's AutoEmbeddings. No more local-only limitations 🙌
✅ Vector DBs: now supports all LlamaIndex-compatible backends. Think: Qdrant, Pinecone, Weaviate, Milvus, etc. No more bottlenecks 🔓
✅ File parsing: now plugs into any LlamaIndex-compatible data loader. Using LlamaParse, Docling or your own setup? You're covered.

Curious to know more? Try it out! 👉 https://github.com/AstraBert/ingest-anything
posted an update 11 days ago
One of the biggest challenges I've been facing since I started developing [PdfItDown](https://github.com/AstraBert/PdfItDown) was correctly handling the conversion of files like Excel sheets and CSVs: table conversion was bad and messy, almost unusable for downstream tasks 🫣

That's why today I'm excited to introduce readers, the new feature of PdfItDown v1.4.0! 🎉

With readers, you can choose among three (for now 👀) flavors of text extraction and conversion to PDF:
- Docling, which does a fantastic job with presentations, spreadsheets and Word documents 🦆
- LlamaParse by LlamaIndex, suitable for more complex and articulated documents, with a mixture of text, images and tables 🦙
- MarkItDown by Microsoft, not the best at handling highly structured documents, but extremely flexible in terms of input file format (it can even convert XML, JSON and ZIP files!) ✒️

You can use this new feature in your Python scripts (check the attached code snippet! 😉) and in the command line interface as well!

Have fun and don't forget to star the repo on GitHub ➡️ https://github.com/AstraBert/PdfItDown

Organizations

Social Post Explorers, Hugging Face Discord Community, GreenFit AI

as-cle-bert's activity

New activity in as-cle-bert/pdfitdown 2 months ago

Update requirements.txt

#1 opened 2 months ago by not-lain
New activity in bluesky-community/README 6 months ago

Ideas!

#1 opened 6 months ago by davanstrien
New activity in as-cle-bert/Llama-3.1-405B-FP8 10 months ago

why

#1 opened 10 months ago by YaserDS-777
New activity in huggingchat/chat-ui about 1 year ago

[ASSISTANTS] Community thread

#356 opened over 1 year ago by victor
New activity in as-cle-bert/plastic-enzymes about 1 year ago
New activity in as-cle-bert/scerevisiae-transcripts-biotypes about 1 year ago
New activity in as-cle-bert/breastcancer-auto-objdetect about 1 year ago
New activity in as-cle-bert/genetics-arxiv-wiki about 1 year ago
New activity in as-cle-bert/VirBiCla-training about 1 year ago