matteogeniaccio/phi-4 · Notably better than Phi3.5 in many ways, but something is wrong.

Dec 16, 2024

Thanks for the sneak peek. This model is certainly more powerful than Phi3.5, but it's far too focused on things like coding to work as a general purpose LLM.

For example, it keeps answering basic questions with json formatting, such as the one below (although the answer is correct).

What is the condition called when blood sugar briefly falls too far after eating?

{
  "answer": "reactive hypoglycemia",
  "relevant_passages": [
    {
      "title": "Reactive Hypoglycemia",
      "score": 1,
      "passage_id": 2,
      "text": "Reactive hypoglycemia is a condition in which blood sugar levels drop too low after eating. It's also called postprandial hypoglycemia, or post-meal hypoglycemia."
    }
  ]
}

And while various tasks like creative story writing is improved, plus the alignment is less absurd (e.g. fewer denials and moralizations about perfectly legal, common, and harmless things), its general world knowledge is really bad for its size. Smaller models like Llama 3.1 8b and Gemma 9b can functionally respond to a much wider variety of prompts.

In short, Phi4, like Phi3.5, is just an AI tool, not a general purpose AI model like Llama 3.1, Gemma2 or ChatGPT. It's so overtrained on select tasks that its output format commonly makes no sense (e.g. json), and it can't function as an AI for the general population because extensive data filtering. That is, Microsoft acted as the gakekeepers of information, deciding which of humanity's most popular information to add to the corpus, leaving it almost completely ignorant about too many things that the general population cares most about. Again, that makes it an AI tool/agent (e.g. coder or math solver) and not a general purpose chat/instruct AI model.

nlpguy

Dec 16, 2024

•

edited Dec 16, 2024

Thanks for your review phil. I know you calculate exact scores for your models, I'd be interested to know what score phi 4 gets compared to other models in your knowledge test.

Its skewed distribution from heavy tuning probably makes it harder for the model to answer more general questions, but it might know more than it is willing to say. If it responds to basic questions in json format then the fine-tuning process, which primarily tunes the output format and style of responses, may be the reason for the skewed distribution, not the pretraining data itself, because I doubt the distribution resulting from pretraining could cause this response. So re-tuning it could maybe fix it? That's just my Idea though.

matteogeniaccio

Owner Dec 16, 2024

I think there is something wrong with your setup. It seems that phi4 is regurgitating some training data instead of answering properly.
The same prompt returns a properly formatted answer when I tried it.

Make sure that you are using the correct chat template. Phi4 is using a modified chatml format. Also try reducing the temperature settings

phil111

Dec 16, 2024

@nlpguy @matteogeniaccio You guys must be right. Something is clearly configured wrong on my end, and the chat template is the most likely culprit. But I do use low temps (0 and 0.3), plus minimize hallucinations further by using a high min-P, so there's clearly pockets of very popular knowledge missing from the corpus, regardless of any configuration issue.

Thanks for testing the question. About half of the answers to my questions are in json format, so if this was inherent to the model you would undoubtedly have noticed it to. Normally I would test odd outputs against a full float version online but can't find any (e.g. at LMsys).

But despite my configuration issue there's clearly something special about this model. It handled a complex, and frankly absurd, story prompt much better than previous Phi models. Plus it correctly answered some unusually difficult questions smaller models typically get wrong (e.g. Thorne–Żytkow Object and the more obscure literary references "it was the age of wisdom, it was the age of foolishness," rather than just the far more recognizable "It was the best of times, it was the worst of times" from Charles Dickens A Tale of Two Cities). I look forward to seeing how this performs once configured properly.

urtuuuu

Dec 17, 2024

Seems to be noticeably smarter than any model i tried up to 14b. Answered all my test questions correctly and never in json.

phil111

Dec 17, 2024

@urtuuu Thanks for checking. I clearly configured something wrong so I'm closing this discussion.

phil111 changed discussion status to closed Dec 17, 2024

phil111

Dec 23, 2024

Apparently I was right.

Phi4 14b was getting most of my simple questions wrong across popular domains of knowledge, and sure enough it only scored 3.0/100 on SimpleQA, which is the gold standard of world knowledge tests because it's extensive, diverse, hard, not multiple choice, and new, so not yet contaminated.

By comparison Phi3 14b, despite scoring notably lower than Phi4 14b on the MMLU, scored notably higher on SimpleQA (7.6). So Microsoft is not only heavily favoring the small subset of popular knowledge that maximizes tests scores, but is doing this to a progressively greater degree.

Qwen recently did the same, with Qwen2 72b scoring much higher on my general knowledge test than Qwen2.5 72b. Sure enough Qwen2.5 72b only scored 10 on SimpleQA, much lower than Llama 3.1 70b. And the smaller Qwen2.5's, including 34 & 14b scored as high or higher than Llama 3.1 70b on the English MMLU despite scoring only ~5 on SimpleQA.

I was hoping that the open source AI community would produce viable alternatives to proprietary models, but all most are doing is training for test scores and bragging rights. Qwen2.5, Falcon3, Cohere 7b, EXAONE3.5, Phi 1-4, Ministral... are all less than useless to >95% of the population. They're basically just overfit to the tests and a single demographic (autistic coding first adopters here on HF and LMsys).

Meta and Google are currently open source AI's best hope for quality general purpose AI. Mistral is starting to slip (Nemo & Ministral grossly overfit tests and have much less world knowledge than their previous and smaller Mistral 7b), and IBM granite is only OK, but shows potential. Pretty much all the rest are useless and just training for select tasks and tests.

qingy2024

Dec 28, 2024

@phil111 Interesting argument! I would say that for small ( < 15B parameter) models, the goal seems to be more focused on reasoning, coding, and goal-oriented tasks.
This is because small models just aren't inherently good at retaining knowledge within their limited amount of parameters, unlike large ( > 60B) models, which can "memorize" a lot more world knowledge.

Personally, I've found that Qwen2.5, EXAONE3.5, Llama 3.1 8B, and a couple others, are all great for simple tasks and instruction-following (which IMO is the most important thing for ~8B-12B models).

phil111

Dec 28, 2024

@qingy2024 Llama 3.1 8b & Gemma2 9b prove that you can fit several times more general English knowledge into ~7b models than other models like Qwen2.5 & EXAONE.

They aren't profoundly ignorant of English knowledge because they're ~7b; but rather, because they're selectively training on English knowledge that boosts English MMLU scores. And they did this to such an overwhelming degree that their English MMLU scores are ~5 points higher than the top English LLM Llama 3.1 8b despite having several times less overall English knowledge.

Simply put, foreign models like Qwen2.5 are training on broad Chinese knowledge, but in English, are heavily prioritizing knowledge that boost English test scores to free up more parameters and compute for their native languages (or small boosts in coding, math...), and this is why they're so profoundly ignorant of basic English knowledge. Not because they're only ~7b.

A simple way to illustrate this is by comparing SimpleQA to MMLU scores.

Take Qwen2.5 72b. It's Chinese SimpleQA (general Chinese knowledge) is commensurate with its Chinese MMLU (48 and 89.5, respectively), as is Llama 3.1 70b's (45 and 73.7, respectively).

But when it comes to English Qwen2.5 72b's English MMLU is even higher than Llama 3.1 70b (85 vs 80), yet it's English SimpleQA score is much lower (9 vs 20; which is a much bigger difference than it appears, and roughly equates to an MMLU score difference of ~60 to 85).

So yes, ~7b LLMs can store less knowledge than ~60b LLMs. But that doesn't change the fact that ~7b LLMs can easily hold several times more English knowledge. And to shamelessly achieve MMLU scores ~5 points higher than LLMs with several times more English knowledge (e.g. Llama 3.1 8b) by selectively training on the subset of English knowledge that overlaps the MMLU (which any idiot can do) is pitiful, especially if you don't do the same in your native language (e.g. Qwen2.5 in Chinese).