The problem with tests.
The problem with this kind of test is that it is "fine" only until someone builds a dataset with a million of them; then models will perform well on those exact tests but still fail simpler tasks.
I have a few unreleased logic problems, and most AIs fail them miserably.
I agree anyway that at the moment the best reasoning AI is Claude, and by a long shot.
It's even strange that the measured gap between Claude and the other big models is sometimes "small", because in my opinion Claude is a year ahead of everyone else in reasoning and coding ability.
@ZeroWw are you referring to Sonnet 3.5? Can you describe your logical tests at a high level? Maybe I can also come up with those tests for my personal use. Thanks!
My tests cover different types of logic, from lateral or out-of-the-box thinking down to very simple tests like this one:
For an empty bottle they give me back 50 cents, but for 10 empties they give me 6 euros instead of 5.
Assuming I buy the beers, drink them, and return the empties,
how many beers can I get in total without spending more money?
Small models can't solve this one.
Big models (GPT-4o, Claude, Mistral Large) answer "84".
Only GPT-4o and Mistral Large reach the best answer (85) after being given a hint.
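The bottle-return arithmetic behind this kind of puzzle can be sketched as a greedy simulation. Since the thread doesn't quote the beer price or the starting budget, the figures below (1-euro beers, 10-euro budget) are hypothetical placeholders, and the greedy policy shown is just the straightforward exchange logic, not whatever lateral-thinking step the hint unlocks:

```python
# Greedy simulation of the empty-bottle refund scheme from the puzzle.
# All amounts are in cents to avoid floating-point rounding.
# NOTE: the beer price and budget are NOT stated in the thread;
# the defaults here are hypothetical.

def max_beers(budget_cents, price=100, single_refund=50,
              batch_size=10, batch_refund=600):
    money, empties, total = budget_cents, 0, 0
    while True:
        # Redeem full batches first: 60 cents per bottle beats 50.
        while empties >= batch_size:
            empties -= batch_size
            money += batch_refund
        bought = money // price
        if bought == 0:
            # Cash in single empties only when that unlocks another beer.
            while empties > 0 and money < price:
                empties -= 1
                money += single_refund
            bought = money // price
            if bought == 0:
                break
        money -= bought * price
        empties += bought
        total += bought
    return total

print(max_beers(1000))  # 10-euro budget, 1-euro beers -> 21
```

With these placeholder numbers the greedy strategy yields 21 beers from 10 euros; the real puzzle's 84-vs-85 gap presumably comes from a final step the straightforward simulation misses.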
I also have other problems that no AI solves,
and they are not based on wordplay or wrongly stated premises.