code_your_own_ai makes a great vlog, mostly about LLM-related AI content.
As I watched the video below, I wondered about current best practices for LLM evaluation. We have benchmarks, we have state-of-the-art LLMs evaluating other LLMs, and we have tools that evaluate models via human comparison.
Often I hear: just play with the LLM for 15 minutes to form an opinion.
While I think this can yield a meaningful signal for a specific use case with clear expectations, I also see models being judged on a single prompt.
Benchmarks have their weaknesses and are by themselves not enough to judge model quality, but I still think systematic methods that try to reduce well-known sources of error should be the way forward, even for qualitative estimates.
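To illustrate what I mean by "systematic": arena-style tools aggregate blinded pairwise human votes into a ranking and report the uncertainty around it, rather than relying on a single impression. Below is a minimal, hypothetical sketch of that idea in plain Python (toy vote data, a Bradley-Terry fit, bootstrapped confidence intervals); it is not how any particular leaderboard is implemented.

```python
import random
from collections import defaultdict

# Hypothetical pairwise votes: (winner, loser) from blinded human comparisons.
votes = [
    ("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c"),
    ("model_c", "model_a"), ("model_a", "model_b"), ("model_b", "model_c"),
    ("model_a", "model_c"), ("model_a", "model_b"), ("model_c", "model_b"),
]

def bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths with the standard MM updates."""
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(float)    # total wins per model
    pairs = defaultdict(float)   # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                pairs[frozenset((i, j))] / (p[i] + p[j])
                for j in models
                if j != i and frozenset((i, j)) in pairs
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

def bootstrap_ci(votes, n_boot=500, alpha=0.05):
    """Percentile confidence intervals by resampling whole votes."""
    samples = defaultdict(list)
    for _ in range(n_boot):
        resampled = random.choices(votes, k=len(votes))
        for m, strength in bradley_terry(resampled).items():
            samples[m].append(strength)
    ci = {}
    for m, vals in samples.items():
        vals.sort()
        lo = vals[int(alpha / 2 * len(vals))]
        hi = vals[int((1 - alpha / 2) * len(vals)) - 1]
        ci[m] = (lo, hi)
    return ci

if __name__ == "__main__":
    strengths = bradley_terry(votes)
    intervals = bootstrap_ci(votes)
    for m in sorted(strengths, key=strengths.get, reverse=True):
        lo, hi = intervals.get(m, (float("nan"), float("nan")))
        print(f"{m}: strength={strengths[m]:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The point is not the particular model: it is that the ranking comes with confidence intervals, so you can tell when two models are statistically indistinguishable instead of over-reading a handful of votes, let alone a single prompt.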
What do you think? How can a public tool for judging models, like lmsys/chatbot-arena-leaderboard, leverage the measurement standards known from the social sciences?
https://www.youtube.com/watch?v=mWrivekFZMM