13B and 34B Pleeease!!! Most people cannot even run this.
This was sooo disappointing. 😢
They don't really care about us...
13B Llama 4 would be amazing. Even an 8B upgrade would be so nice.
@Doctor-Chad-PhD While the speed of 8B LLMs is great, 8B just isn't enough parameters to make a usable general-purpose AI model.
By far the most broadly capable and knowledgeable ~8B English LLMs are Llama 3.1 8B and Gemma 2 9B, yet they score below 10 on SimpleQA and only 69/100 on my easy popular English knowledge test. Plus their prompted stories are reliably filled with blatant contradictions of both the prompt and what they already wrote, even at lower temperatures.
So what Meta did here makes a lot of sense. 70B+ parameters are absolutely essential for a general-purpose multimodal AI model, and 17B active parameters can run at a reasonable 4+ tokens/second on entry-level 8-core AMD and Intel CPUs, and eventually GPUs will have more RAM. It's much cheaper and more power-efficient to increase RAM than compute.
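If you want to sanity-check that 4+ tokens/second figure, here's a rough back-of-the-envelope sketch. It assumes decode speed is memory-bandwidth bound (each token reads roughly the active weights once), and the numbers I plugged in (~50 GB/s for dual-channel DDR5, ~4.5 effective bits/weight at Q4) are my own assumptions, not Meta's figures:

```python
# Rough estimate of decode speed for a MoE vs. a dense model,
# assuming token generation is memory-bandwidth bound: each token
# requires reading (roughly) the active weights once from RAM.

def tokens_per_second(active_params_b, bits_per_weight, mem_bandwidth_gbs):
    """active_params_b: active parameters in billions
    bits_per_weight: effective bits after quantization (e.g. ~4.5 for Q4_K_M)
    mem_bandwidth_gbs: sustained memory bandwidth in GB/s"""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

# Assumed ~50 GB/s sustained bandwidth for an entry-level dual-channel DDR5 desktop.
print(tokens_per_second(17, 4.5, 50))  # 17B active (Scout-style MoE) -> ~5 tok/s
print(tokens_per_second(70, 4.5, 50))  # dense 70B                    -> ~1.3 tok/s
```

Same bandwidth, but the MoE only has to stream its 17B active parameters per token, which is the whole point.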
What's the alternative? Other model families tanked their general knowledge and abilities while trying to boost their coding, math, and other STEM scores. A perfect example is Qwen2.5 72B. Its predecessor Qwen2 72B scored 85.9/100 on my easy broad knowledge test and scored nearly as well as Llama 3.1 70B on SimpleQA, yet Qwen2.5 72B lost a full 3.5 generations of broad English knowledge (9-10 on SimpleQA and 68.4/100 on my test) in order to make small gains on coding, math, and other STEM tests that were barely discernible in real-world use cases. This may have fooled coding-obsessed early adopters into thinking Qwen2.5 was an improvement, but as general-purpose English AI models the Qwen2.5 family is astonishingly bad.
In short, going much smaller really isn't an option unless you're willing to trade a notable amount of broad knowledge and abilities for relatively small domain-specific gains (e.g. coding and math).
@UniversalLove333 You can easily run a GGUF version of this, and even when it spills into swap it will still be faster than dense models less than half its file size that don't spill to swap at all.
Stop claiming it won't run on most systems.
It's a MoE, don't forget that. Have you ever run a MoE and compared its speed to a dense model's?
Also, you're wrong about "most people cannot even run this". I could run a 2-trillion-parameter model on an Intel Celeron laptop if I wanted to, by running the LLM from disk/swap (an extreme example, I know, but it proves the point). So it's not the "runnability" that matters, it's the speed when you run it, and this is faster than dense LLMs less than half its size (Gemma 3 27B Q4 QAT [16GB] runs slower than Llama 4 Scout Q2_K_XL [42.6GB] on a system with 8GB VRAM and 32GB system RAM [40GB total memory]).
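For anyone who hasn't tried it, here's a minimal sketch of that setup using the llama-cpp-python bindings. The filename and parameter values are placeholders I picked for an 8GB-VRAM machine, not anything from this thread; the point is just that memory-mapped weights get paged in from disk on demand, and a MoE only touches its active experts per token:

```python
# Minimal sketch with llama-cpp-python: load a quantized GGUF that is
# larger than free RAM. With use_mmap=True (the default) the weights are
# memory-mapped, so the OS pages them in from disk as needed instead of
# failing outright; a MoE only reads its active experts per token,
# which is why this stays usable.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-4-Scout-Q2_K_XL.gguf",  # placeholder filename
    n_ctx=4096,        # context window
    n_gpu_layers=10,   # offload what fits in ~8GB of VRAM, keep the rest in RAM/swap
    use_mmap=True,     # let the OS page weights from disk on demand
)

out = llm("Explain why a MoE reads fewer weights per generated token.", max_tokens=128)
print(out["choices"][0]["text"])
```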