Divergence with original model nearing max. sequence length?
Hello!
In my quick tests with the original model, it has some very specific rules when it comes to maximum sequence lengths and truncation that make it tricky to implement an AutoTokenizer that matches 100%. My understanding is that this model technically also differs when you reach the maximum sequence length, is that true?
Granted, the maximum sequence length is really high, something like 32k tokens presumably based on the position embeddings.
Also, congratulations on your BEI release with Baseten! It looks very solid.
- Tom Aarsen
Hey Tom,
Agree! There is the following rule: the max_length of query + document is set to 8192 (or 32768, if none is set). https://github.com/mixedbread-ai/mxbai-rerank/blob/7592f2d37db7d2dcc9627ad6dd002e8f4d4cc82b/mxbai_rerank/mxbai_rerank_v2.py#L79
The system prompt takes maybe ~100 tokens, the query is truncated to 6k tokens, and the rest of the budget goes to the document.
Also, query and document are separated by the \n token, which is tokenized on its own and then appended (i.e., if you tokenize, detokenize, and tokenize again, the result would be different).
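For concreteness, here is a minimal sketch of that budget split, not the exact mxbai-rerank code (the linked mxbai_rerank_v2.py is the authoritative version). The names `build_input_ids`, `system_prompt_budget`, and `query_budget` are hypothetical, and the ~100-token system prompt is simply reserved rather than actually prepended:

```python
# Minimal sketch of the truncation policy described above, under the stated
# assumptions; not the real mxbai-rerank implementation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-rerank-base-v2")  # example model id

def build_input_ids(query: str, document: str, max_length: int = 8192,
                    system_prompt_budget: int = 100, query_budget: int = 6000):
    # 1) The query gets its own cap (6k tokens).
    query_ids = tokenizer.encode(query, add_special_tokens=False)[:query_budget]
    # 2) The "\n" separator is tokenized on its own and appended, which is why
    #    tokenize -> detokenize -> tokenize is not guaranteed to round-trip.
    sep_ids = tokenizer.encode("\n", add_special_tokens=False)
    # 3) The document receives whatever budget remains after the reserved
    #    system prompt, the query, and the separator.
    doc_budget = max_length - system_prompt_budget - len(query_ids) - len(sep_ids)
    doc_ids = tokenizer.encode(document, add_special_tokens=False)[:doc_budget]
    return query_ids + sep_ids + doc_ids
```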
BAAI's LLM reranker has similar policies: https://github.com/FlagOpen/FlagEmbedding/blob/d5292b68758f41c7911fe85596cdd0329901a3a5/FlagEmbedding/inference/reranker/decoder_only/base.py#L385
Not sure if any of this helps, let me know!
Thanks!
Very interesting! Thanks for sharing. These LLM-style reranker formats are quite tricky in terms of tokenization.
- Tom Aarsen
Yeah, at least with the "ForSequenceClassification" variant they should be quite easy to run, e.g. with vLLM / TEI etc.
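For reference, a minimal sketch of what scoring a query/document pair with a plain transformers sequence-classification checkpoint might look like, assuming the export exposes a single relevance logit; the model id is only illustrative:

```python
# Minimal sketch under the stated assumptions; the model id may not be an
# actual ForSequenceClassification export.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "mixedbread-ai/mxbai-rerank-base-v2"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Encode query and document as a text pair and score their relevance.
inputs = tokenizer("Which planet is known as the Red Planet?",
                   "Mars is often called the Red Planet.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze()
print(score)
```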