This model is not much better than qwen3 32b for writing code

#4
by xldistance - opened

I suspect the score comes from leaderboard gaming

In my case, the GGUF-quantized models fail some tasks that Qwen3 can complete.

From my experience having it write three small physics animations, I can't agree:

It 2-shot this one:
Write a Python program that shows a ball bouncing inside a spinning hexagon (use pygame). The ball should be realistically affected by gravity and friction, and wall bounces.

First output: everything working apart from the buttons
After some back and forth: everything working

It 3-shot this second attempt at the same prompt:
Write a Python program that shows a ball bouncing inside a spinning hexagon (use pygame). The ball should be realistically affected by gravity and friction, and wall bounces.
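For reference, the two geometric pieces that prompt really exercises are rotating the hexagon's corners each frame and reflecting the ball off a wall. A minimal sketch of my own (assuming counter-clockwise vertex order and ignoring the walls' own motion; this is not the model's output):

```python
import math

def hexagon_vertices(cx, cy, radius, angle):
    """Corners of a regular hexagon rotated by `angle` radians about (cx, cy)."""
    return [(cx + radius * math.cos(angle + k * math.pi / 3),
             cy + radius * math.sin(angle + k * math.pi / 3))
            for k in range(6)]

def bounce_off_edge(vx, vy, p1, p2, restitution=0.85):
    """Reflect a velocity off the edge p1 -> p2, assuming the polygon interior
    lies to the left of the directed edge (counter-clockwise vertex order)."""
    ex, ey = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(ex, ey)
    nx, ny = -ey / length, ex / length          # inward unit normal
    vn = vx * nx + vy * ny
    if vn >= 0:                                 # already moving back inside
        return vx, vy
    # Cancel the outward normal component, scaled by the restitution coefficient.
    return vx - (1 + restitution) * vn * nx, vy - (1 + restitution) * vn * ny

# Example: hexagon spun by 30 degrees, ball heading straight down toward one edge.
verts = hexagon_vertices(400, 300, 250, math.radians(30))
print(bounce_off_edge(0.0, 300.0, verts[0], verts[1]))
```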

And finally, it 3-shot this one; I was quite shocked by how appealing the interface was!
Write a single-file HTML/JS implementation of Conway's Game of Life that runs in the browser and visualizes the grid on an HTML5 canvas at 60 fps.

First output: everything rendered, but I couldn't start the simulation because the buttons weren't functioning
After some back and forth (I copy-pasted the console output): everything working
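For what it's worth, the Game of Life update rule itself is tiny; here is a rough numpy sketch of one generation with wrap-around edges (in Python rather than the HTML/JS the prompt asks for, just to show the rule; the 60 fps canvas rendering is the part the model actually had to get right):

```python
import numpy as np

def life_step(grid):
    """One Game of Life generation on a toroidal (wrap-around) 0/1 grid."""
    neighbors = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # A cell survives with 2 or 3 neighbors and is born with exactly 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)

# Example: a glider on a 20x20 grid, advanced 10 generations.
grid = np.zeros((20, 20), dtype=np.uint8)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(10):
    grid = life_step(grid)
print(grid.sum())  # a glider keeps exactly 5 live cells
```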


I didn't even have to tweak the various coefficient values; the defaults it provided were in the right range, and even realistic/well adapted!
All the simulations were already set to run at 60 fps, even the hexagon ones, for which I hadn't specified it (I only noticed afterward that I had recorded at 30 fps). And again, same as above: the step values etc. were perfectly consistent; I didn't need to tweak anything myself.
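To make "in the right range" concrete, a 60 fps pygame loop with such coefficients looks roughly like this; the gravity/friction/restitution numbers below are my own illustrative guesses, not the values the model produced, and I bounce off a plain floor instead of the hexagon walls:

```python
import pygame

GRAVITY = 900.0        # px/s^2 (illustrative guess)
AIR_FRICTION = 0.999   # per-frame velocity damping (illustrative guess)
RESTITUTION = 0.85     # fraction of speed kept on a bounce (illustrative guess)

pygame.init()
screen = pygame.display.set_mode((800, 600))
clock = pygame.time.Clock()
x, y, vx, vy = 400.0, 100.0, 150.0, 0.0

running = True
while running:
    dt = clock.tick(60) / 1000.0              # cap at 60 fps, dt in seconds
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    vy += GRAVITY * dt                        # gravity
    vx *= AIR_FRICTION                        # friction
    vy *= AIR_FRICTION
    x += vx * dt
    y += vy * dt
    if y > 580:                               # crude floor bounce
        y, vy = 580.0, -vy * RESTITUTION

    screen.fill((20, 20, 30))
    pygame.draw.circle(screen, (240, 180, 60), (int(x), int(y)), 12)
    pygame.display.flip()

pygame.quit()
```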

This was entirely vibe coded.

I even got 91 on the LiveBench code generation task on the 2024-11-25 dataset!

```
~/l/livebench ❯❯❯ python show_livebench_result.py --bench-name live_bench/coding/LCB_generation --model-list EXAONE-4.0-32B-Q4_K_M_56K_think --livebench-release-option 2024-11-25
Using release 2024-11-25
{'coding': ['LCB_generation']}
loaded  78  questions

########## All Tasks ##########
task                             LCB_generation
model
exaone-4.0-32b-q4_k_m_56k_think          91.026

########## All Groups ##########
category                         coding
model
exaone-4.0-32b-q4_k_m_56k_think    91.0
```

That said, I don't even understand how this is possible when the top-ranked model, gemini-2.5-pro-preview, sits at 86.
(screenshot: overall LiveBench coding leaderboard)

I don't know if I ran it correctly, because I also got 84.6 for Qwen3-30B-A3B, which seems far too high. I actually suspect contamination for both of them. Unfortunately, the newer datasets are not publicly released yet. And when I tried to run LiveCodeBench instead, I didn't manage to replicate any of the results from their leaderboard, so I might not be doing things correctly there; I tried tweaking the preprompt etc. but couldn't get the same results. That said, this was the case for all the models I tried, so it's not tied to EXAONE!

Also, I forgot to mention: all the tests in this discussion were done using llama.cpp with the official Q4_K_M quant, in thinking mode.

So either it is better than SOTA models, or it used the benchmark tests as fine-tuning material?
If you really want to know how good models are, you always need to use custom tests that are not publicly known.
Everyone cheats with AI performance tests, including Google and others.

Yes, you are right! That said, LiveCodeBench, for example, clearly marks in red which models might have seen the problems/solutions in their training. See https://livecodebench.github.io/leaderboard_v5.html
That gives an easier-to-visualize overview than digging through each model's info to find the cutoff dates. So, in general, we have a fairly foolproof way to assess whether a given model could or couldn't have been contaminated for a given benchmark dataset release.

BUT, here, I can't tell because:

EXAONE4: Knowledge cut-off Nov. 2024

and I used precisely the 2024-11-25 version of the LiveBench dataset, which is currently the latest public one (unlike LiveCodeBench, LiveBench delays the release of newer versions). So in this specific case, EXAONE could actually have been contaminated; we can't really know without asking them :D

Sorry, the screenshot I posted showed the general coding scores, not the specific LCB_generation task I ran (which is not the LiveCodeBench benchmark but part of LiveBench; they just used a confusing name). It would still be #1 there, though:

(screenshot: LiveBench LCB_generation scores)

I only used two test prompts collected online, run against the Q4_K_M and Q8_0 quantized models, i.e.:

Write a Python program that shows 20 balls bouncing inside a spinning heptagon:
- All balls have the same radius.
- All balls have a number on it from 1 to 20.
- All balls drop from the heptagon center when starting.
- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35
- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.
- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.
- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.
- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.
- The heptagon size should be large enough to contain all the balls.
- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.
- All codes should be put in a single Python file.

Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.

In my case, Q4_K_M passes the first collision test prompt with just one typo but completely fails the Flappy Bird test, whereas Q8_0 completely fails both test prompts.

On the other hand, Qwen3-32B-AWQ passes the Flappy Bird test every time and frequently passes the collision test.
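For context, the part of that first collision prompt that models most often fumble is the ball-to-ball response; with equal masses and radii it reduces to exchanging the velocity components along the line of centers. A rough sketch of my own (not taken from any model's output):

```python
import math

def collide_balls(p1, v1, p2, v2, radius, restitution=0.9):
    """Collision response for two equal-mass, equal-radius balls.
    p1/p2 are (x, y) positions, v1/v2 are (vx, vy) velocities."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dist = math.hypot(dx, dy)
    if dist == 0 or dist >= 2 * radius:
        return v1, v2                          # not overlapping
    nx, ny = dx / dist, dy / dist              # unit vector along the line of centers
    rel = (v1[0] - v2[0]) * nx + (v1[1] - v2[1]) * ny
    if rel <= 0:
        return v1, v2                          # already separating
    j = (1 + restitution) * rel / 2            # impulse per unit mass, equal masses
    return ((v1[0] - j * nx, v1[1] - j * ny),
            (v2[0] + j * nx, v2[1] + j * ny))

# Example: two balls of radius 20 heading straight at each other.
print(collide_balls((0, 0), (100, 0), (30, 0), (-100, 0), 20))
```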

I guess the best idea is to have them all ;)

Do you have any plans to distribute it on OpenRouter?
(Oops! wrong discussion)
