Question on max_seq_length
Does this model have the same max_seq_length as LaBSE (256) or can you go beyond this?
Thank you.
Hi, this model does not have a max_seq_length limit. It uses static embeddings, so you can process documents of arbitrary length with it. If you want to do this, please set max_length to None, e.g. embeddings = model.encode(["Example sentence"], max_length=None), and it will process input of whatever length you give it.
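For anyone else reading this, here is a minimal sketch of that call, assuming the model is loaded with the sentence-transformers SentenceTransformer class and that encode accepts max_length as described above (the model ID below is a placeholder):

```python
from sentence_transformers import SentenceTransformer

# Placeholder: substitute the actual model ID of this static-embedding model.
model = SentenceTransformer("your-static-embedding-model")

# A deliberately long document to illustrate that no truncation happens.
long_document = " ".join(["example sentence"] * 10_000)

# max_length=None, as suggested in the reply above, disables truncation,
# so the full document is embedded regardless of its token count.
embeddings = model.encode([long_document, "Example sentence"], max_length=None)
print(embeddings.shape)  # (2, embedding_dim)
```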
Thank you for the quick response.
Will this affect the quality of the embedding?
That's hard to say; we have not done extensive experiments on long documents yet. Most of our benchmarks (MTEB) used documents of fewer than 512 tokens. We do plan on experimenting with this in the future.
It does affect the quality. Very long input texts with millions of tokens lead to almost useless embeddings (as with normal models: the longer the input, the poorer the quality). I wrote a little about it in the comments section here: https://www.linkedin.com/posts/dominik-weckm%C3%BCller_from-days-to-seconds-creating-embeddings-activity-7255095750496321537-WwI2?utm_source=share&utm_medium=member_desktop. I will write about my findings soon.