Dominik Weckmüller

do-me

AI & ML interests

Making AI more accessible. Working on semantic search, embeddings and Geospatial AI applications. https://geo.rocks

Posts 6

Post
What are your favorite text chunkers/splitters?
Mine are:
- https://github.com/benbrandt/text-splitter (Rust/Python, battle-tested, Wasm version coming soon)
- https://github.com/umarbutler/semchunk (Python, very fast, but has some issues with huge docs)

I tried the huge Jina AI regex, but it failed on my (admittedly messy) documents, e.g. from EUR-Lex. Their free segmenter API is really cool, but unfortunately it times out on my huge docs (~100 pages): https://jina.ai/segmenter/

I also tried to write a vanilla JS chunker with a simple, adjustable hierarchical logic (inspired by the above). I think it does a decent job for so few lines of code: https://do-me.github.io/js-text-chunker/
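
To illustrate what "hierarchical logic" can mean here, this is a minimal TypeScript sketch of the general idea, not the actual js-text-chunker code. It tries coarse separators first (paragraphs), falls back to finer ones (lines, sentences, words) and finally hard-cuts. The names `SEPARATORS` and `chunkText`, the separator order and the character-based size limit are all assumptions made for this example.

```typescript
// Hypothetical separator hierarchy, coarsest first (an assumption for this sketch).
const SEPARATORS: string[] = ["\n\n", "\n", ". ", " "];

// Hypothetical helper: split `text` into chunks of at most `maxChars` characters.
function chunkText(text: string, maxChars: number, level: number = 0): string[] {
  if (maxChars < 1) throw new Error("maxChars must be at least 1");
  const trimmed = text.trim();
  if (trimmed.length === 0) return [];
  if (trimmed.length <= maxChars) return [trimmed];

  const sep = SEPARATORS[level];
  if (sep === undefined) {
    // No separators left: hard-cut into fixed-size slices.
    const slices: string[] = [];
    for (let i = 0; i < trimmed.length; i += maxChars) {
      slices.push(trimmed.slice(i, i + maxChars));
    }
    return slices;
  }

  const chunks: string[] = [];
  let buffer = "";
  for (const part of trimmed.split(sep)) {
    const candidate = buffer.length > 0 ? buffer + sep + part : part;
    if (candidate.length <= maxChars) {
      buffer = candidate; // still fits: keep merging neighbouring parts
      continue;
    }
    // Flush the buffer; if it is a single oversized part, the recursive call
    // splits it further with the next, finer separator.
    if (buffer.length > 0) chunks.push(...chunkText(buffer, maxChars, level + 1));
    buffer = part;
  }
  if (buffer.length > 0) chunks.push(...chunkText(buffer, maxChars, level + 1));
  return chunks;
}
```

Making it "adjustable" then comes down to changing `maxChars` or reordering the separator list. A token-based limit would be closer to what embedding models need, but needs a tokenizer, so the sketch sticks to characters.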

Happy to hear your thoughts!
Post
SemanticFinder now supports WebGPU thanks to @Xenova's efforts with transformers.js v3!
Expect massive performance gains: I ran inference on a whole book with 46k chunks in <5 min. If your device doesn't support #WebGPU, use the classic Wasm-based version:
- WebGPU: https://do-me.github.io/SemanticFinder/webgpu/
- Wasm: https://do-me.github.io/SemanticFinder/
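
If you want to choose between the two versions programmatically, a small feature check is enough. This is only a sketch: `pickDevice` and `semanticFinderUrl` are hypothetical helpers, and it assumes that the presence of `navigator.gpu` is a good enough signal for WebGPU support (a stricter check would also request an adapter).

```typescript
type Device = "webgpu" | "wasm";

// Hypothetical helper: takes a navigator-like object so it also runs outside a browser.
// Assumption: a non-null `gpu` property means WebGPU is available.
function pickDevice(nav: { gpu?: unknown } | undefined): Device {
  return nav !== undefined && nav.gpu != null ? "webgpu" : "wasm";
}

// Hypothetical helper: map the device to the matching SemanticFinder URL from the post.
function semanticFinderUrl(device: Device): string {
  return device === "webgpu"
    ? "https://do-me.github.io/SemanticFinder/webgpu/"
    : "https://do-me.github.io/SemanticFinder/";
}

// In a browser this could be called as:
//   const device = pickDevice(navigator as { gpu?: unknown });
//   window.location.href = semanticFinderUrl(device);
// In your own transformers.js v3 app, the same value can be passed as the
// `device` option when creating a pipeline.
```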

WebGPU harnesses the full power of your hardware instead of being restricted to just the CPU. The speedup is significant (4-60x) on all kinds of devices: consumer-grade laptops, heavy NVIDIA GPU setups or Apple Silicon. Measure the difference for your device here: Xenova/webgpu-embedding-benchmark
Chrome currently works out of the box; Firefox requires some tweaking.

WebGPU + transformers.js makes it possible to build amazing applications and make them accessible to everyone. For example, SemanticFinder could become a simple GUI for populating your (vector) DB of choice. See the pre-indexed community texts here: do-me/SemanticFinder
Happy to hear your ideas!