Where's the knowledge?

#5
by phil111 - opened

This model has OK knowledge, but not for its size. A 456 BILLION total parameter model should have a much higher SimpleQA score than 18.5. Even the tiny, non-thinking Llama 3 70B posts a higher SimpleQA score.

Also, where's the "thinking"? If models like this one were actually thinking, their performance across all cognitive tasks, such as writing poems and jokes, would improve. Instead, the improvements show up almost exclusively in a handful of overfit domains, particularly coding and math.

If you had ended this model's training on trillions of poem and joke tokens instead of coding and math tokens, then this "thinking" model would produce far better poems and jokes that align more closely with users' prompts, while performing far worse on math and coding tests.

I get the sense that the entire AI industry is in 'fake it until you make it' mode: pretending to make gains by grossly overfitting a handful of tasks, especially coding, math, and STEM knowledge, so that scores like LiveCodeBench, MATH 500, and MMLU creep up while general knowledge and abilities regress. We've known since day one that ending training on trillions of tokens from a select domain boosts performance in that domain, but it also starts scrambling the previously trained weights, causing an across-the-board regression in general knowledge and abilities.

Profit makes it hard for people to resist the temptation to hype.


That's an insightful point. Another factor could be that while creative tasks like writing poems and jokes are more reflective of daily life, it's inherently difficult to establish quantitative benchmarks for a model's "wit", given the lack of universal standards in such subjective domains. Currently, models are most often evaluated on highly challenging academic contests. Perhaps agentic benchmarks like SWE-bench and tau-bench could encourage a more diversified evaluation approach.
