Divergence with original model nearing max. sequence length?


Hello!

In my quick tests with the original model, it has some very specific rules around maximum sequence length and truncation that make it tricky to implement an AutoTokenizer that matches it 100%. My understanding is that this model also technically diverges from the original once you reach the maximum sequence length; is that true?

Granted, the maximum sequence length is really high: something like 32k tokens, presumably based on the position embeddings.

Also - congratulations on your BEI release with Baseten! It looks very solid.

  • Tom Aarsen

Hey Tom,

Agree! There is the following rule: the max_length of query + document is set to 8192 (or 32768, if none is set). https://github.com/mixedbread-ai/mxbai-rerank/blob/7592f2d37db7d2dcc9627ad6dd002e8f4d4cc82b/mxbai_rerank/mxbai_rerank_v2.py#L79

The system prompt is maybe ~100 tokens, the query is truncated to 6k tokens, and the rest of the budget goes to the document.
Also, the query and document are separated by a \n token, which is tokenized on its own and then appended (i.e., if you tokenize, detokenize, and tokenize again, the result would be different).
BAAI's LLM reranker has similar policies: https://github.com/FlagOpen/FlagEmbedding/blob/d5292b68758f41c7911fe85596cdd0329901a3a5/FlagEmbedding/inference/reranker/decoder_only/base.py#L385
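
Roughly, the split looks something like the sketch below. The constants, the placeholder system prompt, and the build_input_ids helper are just illustrative; the actual logic lives in mxbai_rerank_v2.py linked above.

```python
from transformers import AutoTokenizer

# Illustrative constants based on the description above; the real values and
# prompt template are defined in mxbai_rerank_v2.py and may differ.
MAX_LENGTH = 8192          # total budget for query + document
MAX_QUERY_TOKENS = 6000    # query is truncated first
SYSTEM_PROMPT = "..."      # placeholder; roughly ~100 tokens in the real model

tokenizer = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-rerank-base-v2")

def build_input_ids(query: str, document: str) -> list[int]:
    # Tokenize each part separately, without special tokens.
    system_ids = tokenizer.encode(SYSTEM_PROMPT, add_special_tokens=False)
    sep_ids = tokenizer.encode("\n", add_special_tokens=False)  # separator tokenized on its own

    query_ids = tokenizer.encode(query, add_special_tokens=False)[:MAX_QUERY_TOKENS]

    # Whatever budget remains after the system prompt, query, and separator goes to the document.
    doc_budget = max(0, MAX_LENGTH - len(system_ids) - len(query_ids) - len(sep_ids))
    doc_ids = tokenizer.encode(document, add_special_tokens=False)[:doc_budget]

    # Parts are concatenated at the token-id level, which is why
    # tokenize -> detokenize -> tokenize again can give a different result.
    return system_ids + query_ids + sep_ids + doc_ids
```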

Not sure if any of this helps, let me know!

Thanks!

Very interesting! Thanks for sharing. These LLM-style reranker formats are quite tricky in terms of tokenization.

  • Tom Aarsen

Yeah, at least the "ForSequenceClassification" variants should be quite easy to run, e.g. with vLLM, TEI, etc.
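
For reference, a ForSequenceClassification reranker can be scored with plain transformers along these lines; the model id below is just an example of such a cross-encoder, not this model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example cross-encoder with a sequence-classification head; any
# ForSequenceClassification reranker works the same way.
model_id = "BAAI/bge-reranker-v2-m3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

queries = ["what is a panda?"]
documents = ["The giant panda is a bear species endemic to China."]

with torch.no_grad():
    inputs = tokenizer(queries, documents, padding=True, truncation=True, return_tensors="pt")
    scores = model(**inputs).logits.squeeze(-1)  # one relevance score per query-document pair

print(scores)
```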
