Thanks, although it's too verbose and prone to hallucinations.

#1
by phil111 - opened

Most responses are not only verbose, but flowery. Kinda like the story introductions to some recipes online (example below). I suspect many people prefer such responses, and the extra words are on topic and coherent, but it's an odd way to respond to blunt questions.

My real criticism is the high number of hallucinations about very popular knowledge. This is less of a problem when it comes to STEM, and academia in general, but is very pronounced when it comes to nearly everything else, such as the most popular movies, TV shows, music, games, sports...

I'll end with a couple examples (STEM & Pop culture).

  1. STEM example: "What is the phenomenon called that makes you more alert in the morning due to the natural rise in a hormone level?" - This is clearly asking about the CAR (Cortisol Awakening Response) specifically, not the circadian rhythm in general (note: the "dawn effect" is partially right, but specifically refers to the morning rise in blood sugar). Additionally, the model went on a long (somewhat amusing) description of the circadian rhythm, which is odd (humans don't respond this way to blunt questions).

"Ah, you're talking about the magical moment when the sun peeks over the horizon and suddenly you're wide awake, ready to take on the day! That's called the "circadian rhythm" or more specifically, the "dawn effect." It's a natural process where your body's internal clock, regulated by a part of your brain called the suprachiasmatic nucleus (SCN), starts to release hormones like cortisol and adrenaline, which help you feel more alert and energized.

But here's the fun part: it's not just about the hormones. Your brain also starts to produce more of a neurotransmitter called norepinephrine, which is like the "on" switch for your brain. It helps you focus, feel more alert, and even boosts your mood. So, it's like your brain is saying, "Hey, it's morning! Time to get up and go!"

Now, I know what you're thinking: "But Zamba, I'm not a morning person!" And that's okay! Not everyone is wired to be a morning lark. But understanding the science behind it can help you appreciate the natural rhythms of your body and maybe even find ways to make the most of your morning, even if you're not a morning person.

So, the next time you're feeling groggy in the morning, just remember: it's not you, it's your brain's natural circadian rhythm trying to wake you up!"

  2. Pop culture example: "Who portrayed Frasier’s ex-wife and mother of his son in the sitcom Frasier?" - The correct answer is Bebe Neuwirth as Lilith. Daphne, played by Jane Leeves, was Frasier's housekeeper and his father's physical therapist, so not only was the answer wrong, it was again verbose and interjected irrelevant opinions such as "talented," "lovable," and "beloved."

"In the sitcom Frasier, Frasier's ex-wife and mother of his son Frederick was portrayed by the talented actress Jane Leeves. She played the role of Daphne Moon, a quirky and lovable physical therapist who became a close friend and confidante to Frasier and his family. Jane Leeves is well-known for her role as Daphne and has been a beloved part of the Frasier cast."

In conclusion, this LLM is far too verbose and too free with positive opinions about everybody and everything for my taste, but since this doesn't cause it to stray off topic or break coherency, it's more of a personal preference than an issue. What is a problem is the high hallucination rate when trying to retrieve very popular knowledge, especially if it's not STEM.

Frankly, I wish the MMLU didn't exist. More and more open source LLM makers are trying to get 7-72b parameter LLMs to reach the MMLU scores of ~1,000b parameter proprietary models, and in the process those LLMs end up knowing disturbingly little about very popular things that don't appear on the MMLU. The proprietary LLMs can correctly answer ~100% of my pop culture questions, while open LLMs with comparable MMLU scores (e.g. Qwen2.5 72b) get almost all of them wrong.

phil111 changed discussion status to closed

Do you mind sharing your eval dataset? It seems very interesting and much more reliable than benchmarks.

@tranhoangnguyen03 Sure, I have some marked for sharing which I use as examples. The rest are more of the same but with slightly different wording & instructions. For example, in the following prompt you can replace Corner Gas with any TV show.

"What are the 6 main characters, and the actors who portrayed them, on the TV show Corner Gas? Don't add details, just list them. And what year did the show first air?"

Then I just give them fractional points for each right character name, actor name (as long as he/she was in the show), correct pairing, correct year... and deduct smaller fractions for errors like wrong spelling, wrong last name, not giving the year...
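For anyone curious how that rubric might be automated, here's a minimal scoring sketch. The weights, the credit split, and the `score_cast_answer` helper are all hypothetical, just illustrating the fractional-credit idea:

```python
# Hypothetical scoring sketch for the cast-listing prompt above.
# The weights and layout are illustrative, not the exact rubric.

def score_cast_answer(answer: str, expected_pairs: list, expected_year: int) -> float:
    """expected_pairs: list of (character, actor) tuples; answer: raw LLM output."""
    text = answer.lower()
    n = len(expected_pairs)
    score = 0.0
    for character, actor in expected_pairs:
        if character.lower() in text:
            score += 0.5 / n   # credit for each correct character name
        if actor.lower() in text:
            score += 0.5 / n   # credit for each correct actor name
    if str(expected_year) in text:
        score += 0.1           # small bonus for the correct first-air year
    # Deductions (misspellings, wrong last names, missing year) would go here.
    return round(score, 3)

# Example with the Corner Gas prompt (partial list, so partial credit):
pairs = [("Brent Leroy", "Brent Butt"), ("Hank Yarbo", "Fred Ewanuick")]
print(score_cast_answer("Brent Leroy was played by Brent Butt; it first aired in 2004.", pairs, 2004))
```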

This is a good way to quickly test the knowledge and hallucination rate of an LLM with a smaller number of prompts from most to least popular shows (e.g. Friends to Corner Gas). I do the same thing with movies (e.g. "What are the 6 main characters, and the actors who portrayed them, in the movie Pulp Fiction?"), with Four Rooms being the hardest movie (not a very popular movie and with a HUGE cast of minor characters, but with several major actors).

I also have more difficult to retrieve direct questions in each genre (e.g. "Which two actresses played the two ex-wives of Alan Harper in Two and a Half Men?").

These questions produce very interesting responses: the weakest LLMs just return random people who aren't even actors (or even women), better ones return popular actresses who weren't even in the show, even better ones return actresses from the show, then actresses with the same last name (e.g. the actress who played his mother, presumably guessed to be the wife since the last names matched), and so on. And this same pattern holds across the questions, so you can immediately tell which LLMs were trained on a broad corpus (e.g. Llama 3.1) and which were overtrained on STEM knowledge to boost their MMLU scores, Yi being the worst offender (the highest MMLU-to-pop-knowledge ratio).

Anyways, this is getting long, I'll paste different types of prompts in another comment.

Continued (test question examples)

To test STEM I write questions that unambiguously isolate a specific answer from a diverse spectrum of possible answers, so weak LLMs take the bait and just return the obvious nearest match. The specific answer is obvious to all human experts and all the large proprietary LLMs like Sonnet 3.5 & GPT4o. Here are 3 examples (a rough grading sketch follows them).

  1. "In astronomy, what's the name of the hypothetical object that forms when a neutron star merges with a red supergiant star, potentially forming a black hole without a supernova explosion?"

Thorne–Żytkow object (TZO)

2) "What is the phenomenon called that makes you more alert in the morning due to the natural rise in a hormone level?"

CAR (Cortisol Awakening Response) - Weak LLMs will just say circadian rhythm, cortisol, adrenaline... stronger LLMs get closer (e.g. dawn phenomenon/effect), and so on

  1. "What is the condition called when blood sugar briefly falls too far after eating?"

Reactive Hypoglycemia - Weak LLMs will just say hypoglycemia even though the question is specifically asking about after eating (postprandial or reactive hypoglycemia).
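As a rough illustration of how these could be graded automatically, here's a small sketch. The tiers (full credit, partial credit, bait) and the term lists are my own illustrative choices, not an exact rubric:

```python
# Sketch of grading the "specific answer vs. obvious nearest match" STEM questions.
# Question entries and credit tiers are hypothetical examples.

QUESTIONS = [
    {
        "prompt": "What is the phenomenon called that makes you more alert in the "
                  "morning due to the natural rise in a hormone level?",
        "full_credit": ["cortisol awakening response"],
        "partial_credit": ["dawn phenomenon", "dawn effect"],  # closer, but not quite it
        # Bait terms ("circadian rhythm", "cortisol", "adrenaline") earn nothing.
    },
]

def grade(answer: str, q: dict) -> float:
    text = answer.lower()
    if any(term in text for term in q["full_credit"]):
        return 1.0
    if any(term in text for term in q["partial_credit"]):
        return 0.5
    return 0.0

print(grade("That's the cortisol awakening response (CAR).", QUESTIONS[0]))  # 1.0
```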

I'll provide some more examples in another comment.

I removed all math, logic, and coding prompts because existing LLM evals cover those things well (e.g. GSM8K for math, ARC for logic, and HumanEval for coding), so I primarily focus on non-STEM popular knowledge.

I only recently added STEM questions once I realized small LLMs were getting high MMLU scores, yet were unable to retrieve the same information when asked directly.

That is, seeing the correct answer in a list of possibilities (a multiple choice list) may be enough to jog their memory, but they otherwise can't accurately retrieve said data. A perfect example is the tiny little Qwen2.5 3b vs Mixtral 8x7b. Qwen2.5 3b can only accurately answer a small fraction of the direct questions that Mixtral can easily answer, yet it has a higher MMLU Pro score (https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro).
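To make the recognition-vs-recall distinction concrete, here's roughly what the two formats look like; neither prompt is taken from MMLU Pro or any benchmark, they're just illustrative:

```python
# Recognition (multiple choice) vs. recall (direct question), using the TZO example above.
# Neither prompt is taken verbatim from any benchmark; they only show the two formats.

multiple_choice_prompt = """\
What is the hypothetical object formed when a neutron star merges with a red supergiant?
A) Thorne-Zytkow object
B) Magnetar
C) Quark star
D) Blue straggler
Answer with the letter only."""

direct_prompt = """\
In astronomy, what's the name of the hypothetical object that forms when a neutron star \
merges with a red supergiant star, potentially forming a black hole without a supernova \
explosion?"""

# A small model may pick "A" above (recognition) while failing the direct prompt (free recall).
```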

Anyways, beyond testing the most popular world knowledge, and now STEM, I also test what automated LLM evals can't.

Example: A grammar check where I add obvious spelling and grammar mistakes ("Repeat the following paragraph word for word while fixing any spelling and grammar errors you come across, then list all corrections you made when done...")

Example: Poem re-write ("The following limerick doesn’t rhyme. Re-write the limerick so it adheres to the standard AABBA rhyming scheme of a limerick...")

I do this for every major subjective use case overlooked by the standard automated LLM tests. Another example is as a thesaurus.

"Make a simple list of 9 single word synonyms for ..." The weaker LLMs will return multi-word phrases, repeat the same words...

Anyways, that's pretty much it, just more of the same. Thanks for asking.
