
Manuel Faysse

manu

AI & ML interests

NLP, Privacy, multi-modal DL

Recent Activity

updated a model 10 days ago
vidore/colqwen2.5-base
updated a model 10 days ago
vidore/colqwen2.5-v0.2

Organizations

Illuin Technology, Spaces-explorers, Blog-explorers, CroissantLLM, Social Post Explorers, Illuin Technology - Vidore, MICS NLP, Illuin Exploration, PDFPages, smol-explorers, Contextualized Document Embedding Benchmark, EuroBERT, SmolVEncoder Project, Smolvencoder

manu's activity

New activity in vidore/colqwen2.5-v0.2 4 days ago
upvoted an article 11 days ago

And I was telling Jack, I loved the CDE concept and it was a big part of the inspiration here! Initially, we wanted to be able to in-context learn the domain from disjoint small documents, concatenated and late chunked (as an alternative to what we do on long docs here, for when we had no long docs), but sadly this worsened performance on long docs and we didn't pursue it much further! This could have enabled doing CDE in a single stage; I still think there are things to do in this space!


Hey Tom! Thanks a lot!
Late Chunking actually doesn't change a thing compared to classical bi-encoders in terms of storage/inference cost! In both cases, the token embeddings are averaged over each chunk. In classic LC, you could decide to keep all token embeddings so you can truly chunk dynamically, but most often you already know how you are going to chunk, so you can use the same chunks standard bi-encoders would use and get better contextualization (leading to better robustness to bad chunking; see the ablations in Section 6).
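
To make the storage point concrete, here is a minimal late-chunking sketch in Python; the model name, the example document, and the chunk spans are placeholders for illustration, not the setup from the article. The full document is encoded once so every token sees the whole context, then token embeddings are mean-pooled over the predefined chunk spans, leaving exactly one vector per chunk, just as a standard bi-encoder would store.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder; any long-context encoder works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

document = "First chunk about topic A. Second chunk about topic B."
chunk_texts = ["First chunk about topic A.", "Second chunk about topic B."]  # boundaries known upfront

# Encode the *whole* document once so every token is contextualized by the full text.
enc = tokenizer(document, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]  # (seq_len, 2) character spans per token
with torch.no_grad():
    token_embs = model(**enc).last_hidden_state[0]  # (seq_len, dim)

# Mean-pool the contextualized token embeddings over each predefined chunk span.
chunk_embeddings, cursor = [], 0
for text in chunk_texts:
    start = document.index(text, cursor)
    end = cursor = start + len(text)
    in_span = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
    chunk_embeddings.append(token_embs[in_span].mean(dim=0))

chunk_embeddings = torch.stack(chunk_embeddings)  # one vector per chunk, same storage as a bi-encoder
```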

In our case, we have [sep] tokens between the chunks, and we find that chunk representations improve a lot when they learn to stay distinct from adjacent document chunks. This could not be done without the model knowing the chunk boundaries beforehand. After averaging the tokens between the [sep] tokens, we thus get exactly the same embedding sizes as a standard bi-encoder.
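
A rough sketch of that pooling step (the separator id and the helper below are assumptions for illustration, not the exact training recipe): mean-pool the contextualized token embeddings between consecutive separator tokens, which yields one fixed-size embedding per chunk.

```python
import torch

def pool_between_separators(token_embeddings: torch.Tensor,
                            input_ids: torch.Tensor,
                            sep_token_id: int) -> torch.Tensor:
    """Mean-pool contextualized token embeddings between consecutive separator tokens.

    token_embeddings: (seq_len, dim) encoder outputs over the full document
    input_ids:        (seq_len,) token ids, with sep_token_id marking chunk boundaries
    Returns a (num_chunks, dim) tensor: one embedding per chunk, the same size a
    standard bi-encoder would store.
    """
    sep_positions = (input_ids == sep_token_id).nonzero(as_tuple=True)[0].tolist()
    boundaries = [-1] + sep_positions + [input_ids.shape[0]]
    chunk_embeddings = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        span = token_embeddings[start + 1:end]  # tokens strictly between two separators
        if span.shape[0] > 0:
            chunk_embeddings.append(span.mean(dim=0))
    return torch.stack(chunk_embeddings)
```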

This is nice because it really is a drop-in replacement. Having said that, we believe the models we trained still need a bit of improvement for production use cases; we were mainly happy to show that the research direction looks very promising!

Cheers @tomaarsen, and thanks for the kind words as always @merve!

updated a Space 14 days ago
published an article 14 days ago

*Context Is Gold to Find the Gold Passage*: Evaluating and Training Contextual Document Embeddings

By manu and 1 other
published a Space 16 days ago