Model fails the strawberry test

#16
by Dev68 - opened

Turns out the model crams the answers instead of thinking about them. How disappointing.

Conversation:

  • User: how many r's are in the word strawberry when it is falling from a height of 20m.

  • Assistant:

<Thought> Alright, I need to determine how many 'r's are in the word "strawberry" when it is falling from a height of 20 meters. Let's break this down step by step to ensure accuracy.

First, let's focus on the word "strawberry." To count the number of 'r's in this word, I'll examine each letter individually:

    S – Not an 'r'.
    T – Not an 'r'.
    R – This is the first 'r'.
    A – Not an 'r'.
    W – Not an 'r'.
    E – Not an 'r'.
    B – Not an 'r'.
    E – Not an 'r'.
    R – This is the second 'r'.
    Y – Not an 'r'.

So, in the word "strawberry," there are two 'r's.

Now, considering the height of 20 meters that the word is falling from, I need to understand the context. Is the height affecting the number of 'r's? If we're talking about the word physically falling, the height might represent time or distance, but since the number of letters in a word doesn't change with physical movement, the height doesn't influence the count of 'r's.

Therefore, regardless of the height from which the word is falling, the number of 'r's remains the same.
</Thought> <Output> There are 2 'r's in the word "strawberry." </Output>

Screenshot:

image.png
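For reference, a quick sanity check outside the thread: "strawberry" actually contains three 'r's, so both the spelling and the count in the model's reasoning above are wrong. A minimal Python check:

```python
# Count occurrences of 'r' in "strawberry" -- the correct answer is 3,
# and (as the thread notes) dropping it from 20 m changes nothing.
word = "strawberry"
print(word.count("r"))  # 3
```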

AIDC-AI org

Thank you for your attention.
We made 5 attempts; although the '20m' interference disrupts the model in most cases, there was still one instance where it produced the correct answer. We believe the model actually has the capability to output the correct answer, and these capabilities can be properly optimized in subsequent work, such as PRM-assisted decoding, DPO, PPO, etc.

However, this still demonstrates some limitations of our model at its current stage. Our work is ongoing, and we will continue to optimize and iterate in the future. Thank you again for your attention.

image.png

No, the model just happened to give a "right" answer, which doesn't necessarily mean that the model is capable of reasoning about the number of "r"s. Though the final answer is right, the reasoning process is wrong: the model spelled "strawberry" incorrectly. As for subsequent work, according to the Physics of Language Models line of work, finetuning, as well as techniques applied at even later stages such as sampling and decoding, might not bring as much improvement in reasoning as expected.

AIDC-AI org

Yes, we have found that most of the errors occur due to incorrect spelling of words or skipping the 'letter by letter' step.
The current work is not perfect, and we are striving to improve it. However, just like the move from ORM to PRM, we believe that this issue can be resolved in future work. Other directions, such as addressing model hallucinations and enhancing instruction-following capabilities, are also worth exploring in my opinion.

Likewise, according to the Physics of Language Models work, hallucinations are the result of LLMs learning language format faster than knowledge, so the solution to this problem might still lie in the pretraining stage. Methods such as mixed pretraining and data augmentation will be helpful. Thank you for your work and I'm looking forward to your latest updates 🚀

The challenge here is that models are not good at spelling. It's not what they are designed to do. They have to associate the token for a single letter with a token that represents multiple letters. They do not see a word the same way we do. To further compound the problem, this is a Chinese model, not an English one, even if it can translate. It translates from a token representing a Chinese word to a token representing an English word or part of a word.

So until there is stronger training on individual letters and their associations with each other and the final words, most models will struggle a lot with this.

Frankly, I would prefer they not waste the parameters of a 7B model on spelling. The embedding engine handles it fine.
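To make that concrete, here is a rough sketch of how a subword tokenizer sees the word (my own illustration; the GPT-2 tokenizer is used only because it is small and public, not because this model uses it):

```python
from transformers import AutoTokenizer

# Illustrative only: the exact split depends on the vocabulary, but any
# subword tokenizer turns "strawberry" into a handful of multi-character
# pieces rather than ten individual letters.
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("strawberry"))
```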


It's very interesting: when I asked it what 5X12 + 80X3/12 equals, it kept telling me it was 25 and wouldn't change. So I feel that 7B parameters are not enough for rigorous reasoning, and most of the inference results are unreliable.

We acknowledge that the 7B model has certain limitations in strict reasoning. However, it appears that it can correctly respond to both 5*12+80*3/12= and 5x12+80x3/12=.

image.png

image.png
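For reference (not part of the original exchange), with standard operator precedence the expression works out to 80, not 25:

```python
# 5*12 + 80*3/12 with standard precedence: 60 + 240/12 = 60 + 20 = 80
print(5 * 12 + 80 * 3 / 12)  # 80.0
```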

Reasoning and math-solving ability is related to model depth, i.e., the number of layers. Pretraining a deeper model will increase its math and reasoning ability, as long as it can converge. And 7B is large enough to store all English knowledge, as the Physics of Language Models work claims.

Well, I did check again, and it goes the right way now somehow... but I won't forget that little fight trying to convince it how to calculate correctly while it kept insisting I was wrong. It was a WTF moment. Anyone who wants to work with it, please just double-check the results...

微信图片_20241128174820.png

I may conclude that the model does NOT have the ability to perform math reasoning. According to saisai1053's log, the model failed to solve the DAG, i.e., it failed to determine the variables' dependencies. I suggest the model devs do some model probing to examine whether the model is solving the DAG correctly even before it outputs any token.
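As a toy illustration of the dependency-DAG framing (my own sketch, not from the thread or the model): the expression above only evaluates correctly if the intermediate values are computed in dependency order.

```python
from graphlib import TopologicalSorter

# Toy dependency DAG for 5*12 + 80*3/12: each node lists its predecessors,
# and a correct solver must respect this ordering before producing "ans".
deps = {"a": [], "b": [], "c": ["b"], "ans": ["a", "c"]}
exprs = {
    "a": lambda v: 5 * 12,         # a = 60
    "b": lambda v: 80 * 3,         # b = 240
    "c": lambda v: v["b"] / 12,    # c = 20
    "ans": lambda v: v["a"] + v["c"],
}
values = {}
for node in TopologicalSorter(deps).static_order():
    values[node] = exprs[node](values)
print(values["ans"])  # 80.0
```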

Oh, still something I forgot. @Sniper , may I ask what decoding strategy you are using, and what temperature and samplers?

> We made 5 attempts; although the '20m' interference disrupts the model in most cases, there was still one instance where it produced the correct answer. We believe the model actually has the capability to output the correct answer, and these capabilities can be properly optimized in subsequent work, such as PRM-assisted decoding, DPO, PPO, etc.

Adding randomness to inferencing doesn't prove that the model is capable of something.

Oh, that reminds me! I was using Koboldcpp and leaving the temperature at 0.8, and that error happened... but I think 0.8 is a reasonable enough value, or am I just wrong? I did reset the settings to the Godlike preset today, which puts it at 0.7, and now it goes right.
If it's my fault for using a wrong setting, then I'm sorry about all the mess.
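For context on what the temperature knob actually does (a generic sketch, not specific to Koboldcpp or this model): temperature rescales the logits before sampling, so 0.7 and 0.8 are both still random, just more or less sharply peaked.

```python
import numpy as np

def sampling_probs(logits, temperature):
    # Lower temperature sharpens the distribution (closer to greedy),
    # higher temperature flattens it (more random).
    z = np.array(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]
print(sampling_probs(logits, 0.7))
print(sampling_probs(logits, 0.8))  # slightly flatter than at 0.7
```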

AIDC-AI org

> Oh, still something I forgot. @Sniper , may I ask what decoding strategy you are using, and what temperature and samplers?
>
> We made 5 attempts; although the '20m' interference disrupts the model in most cases, there was still one instance where it produced the correct answer. We believe the model actually has the capability to output the correct answer, and these capabilities can be properly optimized in subsequent work, such as PRM-assisted decoding, DPO, PPO, etc.
>
> Adding randomness to inferencing doesn't prove that the model is capable of something.

Yes. I just randomly used a temperature of 0.7 without employing a specific decoding strategy.
I agree that using random sampling doesn't necessarily reflect true performance. Our initial intention for this attempt was to determine our model's test@k. We believe that good test@k performance will be beneficial for subsequent search and optimization.
And we are actively iterating on our model and believe that future versions will perform better on tasks like this.
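If test@k here means the usual pass@k metric (my assumption), the standard unbiased estimator is easy to compute from the 1-in-5 result reported earlier in the thread:

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: chance that at least one of k samples
    # drawn from n attempts is correct, given c correct attempts.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# The thread reports 1 correct answer out of 5 attempts on the 20m prompt:
print(pass_at_k(n=5, c=1, k=1))  # 0.2
print(pass_at_k(n=5, c=1, k=5))  # 1.0
```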

Sniper changed discussion status to closed
