Repro

#1 opened by pascalguldener

Would you mind sharing your eval setup?

LightOn AI org

Hello,
Sure, here are the BEIR one and the LongEmbed one.
Note that it requires the PLAID index we are currently merging (it is available in main now, but the official release is planned soon). Also, for BEIR, I made a small modification to the load_beir function so that it returns queries alongside their IDs, which lets us exclude the query itself from retrieval for FiQA. Otherwise, you can take inspiration from the boilerplate in the repository; the results will just be slightly different for FiQA.
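For reference, here is a minimal sketch along the lines of the boilerplate in the repository (this uses the unmodified load_beir, and the model path, batch sizes and k are placeholders rather than the exact settings of my runs):

```python
from pylate import evaluation, indexes, models, retrieve

# Placeholder: replace with the model being evaluated.
model = models.ColBERT(model_name_or_path="MODEL_NAME_OR_PATH")

# Load a BEIR dataset (documents, queries and relevance judgments).
documents, queries, qrels = evaluation.load_beir(dataset_name="scifact", split="test")

# Build the index (Voyager here; swap in the PLAID index once it is released).
index = indexes.Voyager(index_folder="pylate-index", index_name="scifact", override=True)
retriever = retrieve.ColBERT(index=index)

# Encode and index the corpus.
documents_embeddings = model.encode(
    [document["text"] for document in documents],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)
index.add_documents(
    documents_ids=[document["id"] for document in documents],
    documents_embeddings=documents_embeddings,
)

# Encode the queries and retrieve the top-k candidates.
queries_embeddings = model.encode(
    queries,
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)
scores = retriever.retrieve(queries_embeddings=queries_embeddings, k=100)

# Compute the BEIR metrics.
results = evaluation.evaluate(
    scores=scores,
    qrels=qrels,
    queries=queries,
    metrics=["map", "ndcg@10", "ndcg@100", "recall@10", "recall@100"],
)
print(results)
```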

Also note that for LongEmbed, the values reported for Needle/Passkey are the average of NDCG@1 over all the document lengths, as explained there.
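Concretely, that aggregation is just a plain mean over the per-length NDCG@1 scores, along these lines (the lengths and values below are made up for illustration):

```python
# Hypothetical NDCG@1 per document length for Passkey (values made up).
ndcg_at_1_per_length = {256: 1.0, 512: 1.0, 1024: 0.95, 2048: 0.90}

# The single reported Passkey number is the mean over all document lengths.
passkey_ndcg_at_1 = sum(ndcg_at_1_per_length.values()) / len(ndcg_at_1_per_length)
print(round(passkey_ndcg_at_1, 4))
```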
Finally, note that with the merging of PLAID, PyLate will very soon be supported out of the box by MTEB, allowing you to run all the benchmarks directly from there.

Hope it helps!

Between my request and your answer, I set up my own eval using Weaviate and the evaluation module from BEIR, and got even slightly better results than with your setup (using Voyager) on SciFact:
Yours with Voyager:
{'map': 0.6845755017313842, 'ndcg@10': 0.7227137813143577, 'ndcg@100': 0.7357195316890116, 'recall@10': 0.8382222222222222, 'recall@100': 0.8863333333333333}

Weaviate with the BEIR evaluator:

NDCG@1: 0.6033
NDCG@3: 0.6881
NDCG@5: 0.7111
NDCG@10: 0.7305
NDCG@100: 0.7513
NDCG@1000: 0.7558

MAP@1: 0.5738
MAP@3: 0.6562
MAP@5: 0.6735
MAP@10: 0.6826
MAP@100: 0.6876
MAP@1000: 0.6878

Recall@1: 0.5738
Recall@3: 0.7466
Recall@5: 0.8063
Recall@10: 0.8629
Recall@100: 0.9560
Recall@1000: 0.9900

P@1: 0.6033
P@3: 0.2744
P@5: 0.1800
P@10: 0.0970
P@100: 0.0108
P@1000: 0.0011
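For completeness, the evaluation step is just BEIR's EvaluateRetrieval applied to the run produced by Weaviate. A minimal sketch, assuming the search results are already collected in BEIR's standard run format (the retrieval part itself is omitted):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# Download and load SciFact in BEIR format.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# results maps query_id -> {doc_id: score}; it is built from the Weaviate
# search output (omitted here, empty placeholder instead).
results = {query_id: {} for query_id in queries}

# Standard BEIR metrics at the usual cutoffs.
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(
    qrels, results, k_values=[1, 3, 5, 10, 100, 1000]
)
print(ndcg, _map, recall, precision)
```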

LightOn AI org

Yeah, HNSW and PLAID are approximate search indexes (computing all the similarities for all query/document pairs would be too expensive for most datasets). They have tunable parameters that affect the search results (see here), and they can even show some non-deterministic behavior, which can explain the differences.
That is why in the model card, I also reported the results of ColBERT-small in my setup to enable a fair comparison.

That being said, both of your results seem pretty low compared to mine. Did you use a query_length of 48 for SciFact?
Maybe the HNSW parameters are also too loose to compete with PLAID.
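For clarity, this is what I mean by query_length; a minimal sketch (the model path is a placeholder, and the query below is only an example):

```python
from pylate import models

# Placeholder model path; query_length controls the (padded) query length
# used at encoding time. Assumes it can be set when loading the model.
model = models.ColBERT(
    model_name_or_path="MODEL_NAME_OR_PATH",
    query_length=48,
)

# Example query encoding (is_query=True applies the query-side settings).
queries_embeddings = model.encode(
    ["what is the effect of the intervention on outcome X?"],
    is_query=True,
)
```

On the Weaviate side, the HNSW knobs to look at would be efConstruction, ef (at search time) and maxConnections: the higher they are set, the closer the approximate search gets to exhaustive scoring, at the cost of speed.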
