Very High English MMLU scores, Yet Extremely Low Broad English Knowledge
Falcon3 7b got an MMLU Pro score of ~35, roughly 5 points higher than both Llama 3.1 8b and Gemma2 9b (~30), yet has ORDERS OF MAGNITUDE less popular English knowledge across all popular domains (e.g. movies, music, sports, literature...) than either of them.
TII not only selectively trained on the small subset of popular English knowledge covered by the English MMLU, but clearly did so to an even greater degree than other culprits like Qwen2.5 7b.
I tested this model both offline and with the available spaces, so I'm fairly certain its extremely poor performance in English isn't due to something else, such as a configuration issue. Plus Falcon3 10b returns the same consistent pattern of hallucinations.
For example, Two and a Half Men is an unusually popular English-language TV show, Meghan Trainor is a very popular English-language singer, and Dear Future Husband is one of her biggest hits. Yet both Falcon3 7b & 10b responded with things like the following (after multiple attempts and at low temps). And this profound level of ignorance exists across all popular domains of English knowledge, including movies, sports, literature...
Prompt: List the 6 main characters, and the actors who played them, from Two and a Half Men. Don't add details, just list them. And what year did the show first air?
Response: Characters: Alan Harper, Charlie Harper, Jenny Harper, Gloria Haradowski, Jake Harper, Patricia "Pat" Harper
Actors: Charlie Sheen, Ashton Kutcher, Jon Favreau, Mila Kunis, Luke Macfarlane, Deborah Ann Woll
Show first aired: 2003
If you don't know anything about Two and a Half Men then please trust me when I say that this is an extremely bad response compared to models like Llama 3.1 8b & Gemma 2 9b. And this isn't cherry-picked. It's the norm across equally popular shows, movies, and other popular domains of English knowledge.
Prompt: Who sang the modern hit song Dear Future Husband? What album is it from? And what year was it released?
Response: The modern hit song "Dear Future Husband" was sung by American singer-songwriter Olivia Rodrigo. It is featured on her debut studio album "Sour," which was released on April 21, 2021.
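For anyone who wants to reproduce the offline half of this, here's a rough sketch of the kind of low-temperature check I mean (Python with transformers; the Hugging Face model ids, generation settings, and the chat-message pipeline behavior of recent transformers versions are my assumptions, so adjust to whatever you have locally):

```python
# Rough sketch of the low-temperature knowledge check described above.
# Assumes a recent transformers version whose text-generation pipeline accepts
# chat-style message lists, plus access to the (assumed) model ids below.
from transformers import pipeline

PROMPTS = [
    "List the 6 main characters, and the actors who played them, from Two and a Half Men. "
    "Don't add details, just list them. And what year did the show first air?",
    "Who sang the modern hit song Dear Future Husband? What album is it from? "
    "And what year was it released?",
]

def run_checks(model_id: str) -> None:
    # Low temperature keeps the output close to greedy decoding, so repeated runs
    # return essentially the same answer and hallucinations aren't a sampling fluke.
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    for prompt in PROMPTS:
        messages = [{"role": "user", "content": prompt}]
        result = generator(messages, max_new_tokens=200, do_sample=True, temperature=0.1)
        reply = result[0]["generated_text"][-1]["content"]  # last message = assistant reply
        print(f"\n=== {model_id} ===\n{prompt}\n---\n{reply}")

# Compare Falcon3 against a similarly sized baseline (ids assumed; Llama 3.1 is gated).
for model_id in ("tiiuae/Falcon3-7B-Instruct", "meta-llama/Llama-3.1-8B-Instruct"):
    run_checks(model_id)
```

Something along those lines, run a few times per model, is enough to see the pattern for yourself.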
In conclusion, the Falcon3 series selectively trained on the small subset of English knowledge covered by the MMLU, giving it a score of ~35 vs Llama 3.1 8b's and Gemma2 9b's ~30, yet when it came to other equally or more popular domains of English knowledge it didn't just perform much worse, but overwhelmingly worse.
Things are starting to get out of hand in the open source AI community. In their attempt to compete with the much larger and more powerful proprietary models, more and more open models (e.g. Falcon3, Qwen2.5, & EXAONE) are strongly favoring training on a tiny subset of humanity's most popular knowledge that boosts their test scores, leaving them vastly inferior overall yet able to achieve comparable scores on tests like the MMLU. So far Meta and Google are the last holdouts, and some recent Mistral models also didn't cheat (e.g. Mistral Small, but not Ministral).
Brief History: Yi cheated since day one with Yi-34b (it has a relatively high English MMLU score, yet extremely low broad English knowledge). Then Microsoft did the same with the Phi series. That is, it played heavy favorites with the training data because it was trying to maximize test scores with the fewest possible parameters. However, Microsoft not only admitted doing this, but detailed how they curated the data, which data they de-prioritized, and so on. Then Qwen started doing the same over time, starting with their very small models, but by Qwen2.5 even their largest model (72b) regressed drastically in broad general English knowledge compared to Qwen2, and then Qwen2.5 14 & 32b favored MMLU data to such an absurd degree that their English MMLU Pro scores were equal to or greater than Llama 3.1 70b's despite having only 14/32b parameters, being made primarily for the Chinese language, and having orders of magnitude less broad English knowledge.
Hi, well, that kind of knowledge about movies, celebrities, etc., is easily accessible through RAG systems. The question is this: could Falcon3 models be fine-tuned on these topics without any problem? Most likely, yes. The instruct models might not have been the best for your questions because we simply do not include such content during SFT. However, we definitely train on most world knowledge (though, of course, one can't have 14T tokens of purely math and STEM knowledge anyway).
To be clear, we provide well-trained base models according to benchmarks and perform SFT on STEM, code, and math to evaluate how the instruct models perform on the same benchmarks. Why? Because such knowledge can't be accessed through search or RAG; it needs to be learned. ChatGPT and the like are designed for general chat, so training on common knowledge at SFT time makes sense for them.
Well, you're in good company. Qwen2.5, EXAONE-3.5, and even Ministral are all now selectively overfitting on math & coding while being unable to answer very basic questions across all major domains of humanity's most popular knowledge.
And what could possibly be the point? If you're going to rely on the internet (RAG), then proprietary models like GPT-4o and Sonnet 3.5 are fast & affordable, and even offer free tiers. Plus their math and coding are not only superior, but they offer orders of magnitude more knowledge and broader abilities.
Plus RAG is awful, and will always be awful. It breaks the organic flow of many tasks like story writing or having a back and forth on a given topic.
In my mind open source AI is dying a self-inflicted death. While the proprietary models are training on more and more data, attempting to achieve a balance of broad abilities to serve a wide and diverse audience of users, most of the open source models (excluding Llama 3.1 & Gemma2) are training on progressively less diverse data so they can prioritize test scores and coding in order to appeal to a single demographic (coding nerds, since they make up >95% of early adopters). And in the end they can't even code as well as the proprietary models (e.g. Sonnet 3.5), so the target audience isn't even using them.
In conclusion, any functional AI model for the general population simply MUST be equitably trained on humanity's most popular knowledge because said knowledge is popular for a reason. Creating a hallucination generator for humanity's most popular knowledge in order to improve math and coding skills to a level that's still far short of what's needed for such precise skills to be functional, especially in 7b models, honestly seems absurd to me. What's the point?