qwen3 thinking test

by shewin - opened about 14 hours ago

about 14 hours ago

W790E Sage + QYFS + 512G + RTX5090

IQ5_K:

llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 40915.49 MiB
llm_load_tensors: CPU buffer size = 41335.49 MiB
llm_load_tensors: CPU buffer size = 41443.49 MiB
llm_load_tensors: CPU buffer size = 41298.98 MiB
llm_load_tensors: CPU buffer size = 491.49 MiB
llm_load_tensors: CUDA0 buffer size = 6064.04 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 163840
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 15980.05 MiB
llama_new_context_with_model: KV self size = 15980.00 MiB, K (q8_0): 7990.00 MiB, V (q8_0): 7990.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 4224.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2624.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 190
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF

main: n_kv_max = 163840, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

PP	TG	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s
4096	1024	0	17.122	239.22	68.431	14.96
4096	1024	4096	17.384	235.62	69.513	14.73
4096	1024	8192	17.810	229.98	70.888	14.45
4096	1024	12288	17.789	230.25	72.412	14.14
4096	1024	16384	18.391	222.72	79.688	12.85

shewin

about 14 hours ago

llama-perplexity:

-f wiki.test.raw
--seed 1337
--ctx-size 131072
-fa
-ctk q8_0 -ctv q8_0
-b 4096 -ub 4096
-fmoe
--override-tensor exps=CPU
-ngl 99
--threads 101
--flash-attn
-op 27,0,29,0

llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 40915.49 MiB
llm_load_tensors: CPU buffer size = 41335.49 MiB
llm_load_tensors: CPU buffer size = 41443.49 MiB
llm_load_tensors: CPU buffer size = 41298.98 MiB
llm_load_tensors: CPU buffer size = 491.49 MiB
llm_load_tensors: CUDA0 buffer size = 6064.04 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 12784.05 MiB
llama_new_context_with_model: KV self size = 12784.00 MiB, K (q8_0): 6392.00 MiB, V (q8_0): 6392.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3456.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2112.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 190
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MUL_MAT_ID) = 0
XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MOE_FUSED_UP_GATE) = 0

llama_print_timings: load time = 6887.95 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 1450020.61 ms / 262144 tokens ( 5.53 ms per token, 180.79 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 1474456.19 ms / 262145 tokens

ubergarm

Owner about 13 hours ago

For llama-perplexity I would advise to use -c 512 and full f16 kv-cache if you want to have numbers for comparing with mine. Here is the command I'm using:

numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    -fa -fmoe \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa numactl \
    --threads 128 \
    --threads-batch 192 \
    --no-mmap

This is CPU-only, but your offload stuff is fine with CUDA.

If you change -c it changes the results a lot and so not comparable unless using the same context window.

Those are some good speeds though on the larger IQ5_K which is a great quality quant, fairly close to Q8_0!

Thanks for sharing your results!

shewin

about 11 hours ago

-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
--ctx-size 512 \
-ub 4096 -b 4096 \
--threads 128 \
--threads-batch 192 \
--no-mmap

llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 752.00 MiB
llama_new_context_with_model: KV self size = 752.00 MiB, K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 4.64 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3057.49 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 128.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 1319

system_info: n_threads = 128 (n_threads_batch = 192) / 112 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 823.796 ms
perplexity: calculating perplexity over 584 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 7.21 seconds per pass - ETA 8.77 minutes
[1]3.3073,[2]4.4579,[3]3.5789,[4]2.7655,[5]2.3961,[6]2.1569,[7]2.0090,[8]1.9374,[9]1.8818,[10]1.8197,[11]1.7732,[12]1.8214,[13]1.8418,[14]1.9151,[15]2.0453,[16]2.2018,[17]2.2292,[18]2.3646,[19]2.4495,[20]2.4674,[21]2.4848,[22]2.5805,[23]2.5505,[24]2.5105,[25]2.4932,[26]2.4781,[27]2.4545,[28]2.4436,[29]2.5090,[30]2.5308,[31]2.5638,[32]2.6135,[33]2.6273,[34]2.6704,[35]2.7262,[36]2.8011,[37]2.8701,[38]2.9234,[39]2.9582,[40]3.0107,[41]3.0563,[42]3.0883,[43]3.1337,[44]3.1558,[45]3.1841,[46]3.2163,[47]3.3145,[48]3.3809,[49]3.3758,[50]3.3178,[51]3.2758,[52]3.3069,[53]3.3544,[54]3.3907,[55]3.4159,[56]3.4551,[57]3.4655,[58]3.4960,[59]3.5183,[60]3.5425,[61]3.5727,[62]3.5985,[63]3.6457,[64]3.6794,[65]3.7310,[66]3.7721,[67]3.8264,[68]3.8344,[69]3.8626,[70]3.8737,[71]3.9016,[72]3.9436,[73]3.9705,[74]3.9984,[75]3.9901,[76]4.0015,[77]4.0277,[78]4.0452,[79]3.9985,[80]3.9466,[81]3.9051,[82]3.8814,[83]3.8449,[84]3.8212,[85]3.8085,[86]3.8501,[87]3.8202,[88]3.7889,[89]3.7682,[90]3.7456,[91]3.7366,[92]3.7058,[93]3.6892,[94]3.6622,[95]3.6334,[96]3.6120,[97]3.5897,[98]3.5626,[99]3.5365,[100]3.5404,[101]3.5561,[102]3.5388,[103]3.5192,[104]3.5088,[105]3.5119,[106]3.5243,[107]3.5516,[108]3.5807,[109]3.6068,[110]3.6447,[111]3.7013,[112]3.7150,[113]3.7124,[114]3.7562,[115]3.7797,[116]3.7568,[117]3.7259,[118]3.7107,[119]3.6983,[120]3.6898,[121]3.6836,[122]3.6790,[123]3.6670,[124]3.6535,[125]3.6395,[126]3.6343,[127]3.6166,[128]3.6118,[129]3.5993,[130]3.5857,[131]3.5727,[132]3.5615,[133]3.5614,[134]3.5583,[135]3.5617,[136]3.5708,[137]3.5647,[138]3.5660,[139]3.5816,[140]3.5750,[141]3.5787,[142]3.5825,[143]3.5902,[144]3.6117,[145]3.6060,[146]3.5996,[147]3.5961,[148]3.5911,[149]3.5934,[150]3.5864,[151]3.5862,[152]3.5891,[153]3.5963,[154]3.5911,[155]3.5977,[156]3.5924,[157]3.5900,[158]3.5839,[159]3.5808,[160]3.5748,[161]3.5760,[162]3.5808,[163]3.5772,[164]3.5835,[165]3.5869,[166]3.5907,[167]3.5992,[168]3.6137,[169]3.6222,[170]3.6370,[171]3.6464,[172]3.6656,[173]3.6937,[174]3.7071,[175]3.7355,[176]3.7527,[177]3.7775,[178]3.8014,[179]3.8019,[180]3.7858,[181]3.7680,[182]3.7607,[183]3.7421,[184]3.7294,[185]3.7114,[186]3.6917,[187]3.6747,[188]3.6786,[189]3.6928,[190]3.7125,[191]3.7269,[192]3.7397,[193]3.7469,[194]3.7643,[195]3.7792,[196]3.7917,[197]3.8010,[198]3.7993,[199]3.8050,[200]3.8062,[201]3.8140,[202]3.8248,[203]3.8350,[204]3.8502,[205]3.8615,[206]3.8609,[207]3.8766,[208]3.8731,[209]3.8801,[210]3.8822,[211]3.8868,[212]3.8960,[213]3.9024,[214]3.9037,[215]3.9040,[216]3.9084,[217]3.9157,[218]3.9201,[219]3.9171,[220]3.9100,[221]3.9100,[222]3.9118,[223]3.9147,[224]3.9234,[225]3.9186,[226]3.9207,[227]3.9208,[228]3.9151,[229]3.9076,[230]3.9065,[231]3.9027,[232]3.9010,[233]3.9035,[234]3.9100,[235]3.9143,[236]3.9030,[237]3.9012,[238]3.8988,[239]3.8938,[240]3.8986,[241]3.9046,[242]3.9123,[243]3.9099,[244]3.9212,[245]3.9245,[246]3.9338,[247]3.9414,[248]3.9475,[249]3.9552,[250]3.9601,[251]3.9708,[252]3.9818,[253]3.9960,[254]4.0080,[255]4.0113,[256]4.0204,[257]4.0278,[258]4.0261,[259]4.0186,[260]4.0133,[261]4.0060,[262]4.0042,[263]4.0055,[264]4.0068,[265]4.0154,[266]4.0205,[267]4.0236,[268]4.0235,[269]4.0279,[270]4.0296,[271]4.0296,[272]4.0300,[273]4.0302,[274]4.0308,[275]4.0301,[276]4.0264,[277]4.0300,[278]4.0317,[279]4.0258,[280]4.0252,[281]4.0252,[282]4.0254,[283]4.0183,[284]4.0089,[285]4.0113,[286]4.0011,[287]3.9960,[288]3.9928,[289]3.9938,[290]4.0064,[291]4.0130,[292]4.0142,[293]4.0197,[294]4.0285,[295]4.0361,[296]4.0415,[297]4.0536,[298]4.0520,[299]4.0518,[300]4.0557,[301]4.0559,[302]4.0562,[303]4.0535,[304]4.0653,[305]4.0698,[306]4.0703,[307]4.0749,[308]4.0794,[309]4.0817,[310]4.0876,[311]4.0915,[312]4.0911,[313]4.0902,[314]4.0950,[315]4.0913,[316]4.0941,[317]4.1060,[318]4.1134,[319]4.1130,[320]4.1081,[321]4.1031,[322]4.1064,[323]4.1117,[324]4.1188,[325]4.1318,[326]4.1353,[327]4.1332,[328]4.1370,[329]4.1346,[330]4.1342,[331]4.1338,[332]4.1381,[333]4.1396,[334]4.1399,[335]4.1387,[336]4.1392,[337]4.1438,[338]4.1510,[339]4.1488,[340]4.1494,[341]4.1480,[342]4.1507,[343]4.1546,[344]4.1599,[345]4.1638,[346]4.1634,[347]4.1604,[348]4.1624,[349]4.1617,[350]4.1613,[351]4.1630,[352]4.1681,[353]4.1686,[354]4.1663,[355]4.1759,[356]4.1842,[357]4.1905,[358]4.1819,[359]4.1766,[360]4.1744,[361]4.1774,[362]4.1695,[363]4.1668,[364]4.1681,[365]4.1768,[366]4.1903,[367]4.1993,[368]4.2145,[369]4.2238,[370]4.2336,[371]4.2439,[372]4.2559,[373]4.2591,[374]4.2657,[375]4.2763,[376]4.2849,[377]4.2921,[378]4.2962,[379]4.3007,[380]4.3116,[381]4.3217,[382]4.3273,[383]4.3338,[384]4.3411,[385]4.3550,[386]4.3634,[387]4.3654,[388]4.3678,[389]4.3733,[390]4.3869,[391]4.3997,[392]4.3978,[393]4.3985,[394]4.3959,[395]4.3977,[396]4.4051,[397]4.4112,[398]4.4146,[399]4.4188,[400]4.4265,[401]4.4296,[402]4.4317,[403]4.4299,[404]4.4162,[405]4.4035,[406]4.3923,[407]4.3904,[408]4.3785,[409]4.3676,[410]4.3542,[411]4.3427,[412]4.3314,[413]4.3180,[414]4.3060,[415]4.2940,[416]4.2798,[417]4.2663,[418]4.2554,[419]4.2419,[420]4.2289,[421]4.2163,[422]4.2045,[423]4.1942,[424]4.1837,[425]4.1738,[426]4.1718,[427]4.1722,[428]4.1728,[429]4.1738,[430]4.1702,[431]4.1647,[432]4.1611,[433]4.1575,[434]4.1531,[435]4.1514,[436]4.1424,[437]4.1364,[438]4.1355,[439]4.1299,[440]4.1319,[441]4.1268,[442]4.1208,[443]4.1135,[444]4.1125,[445]4.1109,[446]4.1054,[447]4.0961,[448]4.0884,[449]4.0807,[450]4.0735,[451]4.0739,[452]4.0705,[453]4.0686,[454]4.0693,[455]4.0709,[456]4.0726,[457]4.0782,[458]4.0871,[459]4.0853,[460]4.0858,[461]4.0854,[462]4.0871,[463]4.0944,[464]4.0956,[465]4.0981,[466]4.1007,[467]4.1065,[468]4.1096,[469]4.1115,[470]4.1164,[471]4.1155,[472]4.1200,[473]4.1176,[474]4.1203,[475]4.1259,[476]4.1255,[477]4.1247,[478]4.1210,[479]4.1244,[480]4.1308,[481]4.1354,[482]4.1307,[483]4.1372,[484]4.1435,[485]4.1483,[486]4.1508,[487]4.1561,[488]4.1560,[489]4.1544,[490]4.1564,[491]4.1560,[492]4.1578,[493]4.1559,[494]4.1567,[495]4.1563,[496]4.1567,[497]4.1640,[498]4.1694,[499]4.1671,[500]4.1693,[501]4.1706,[502]4.1702,[503]4.1790,[504]4.1833,[505]4.1877,[506]4.1881,[507]4.1887,[508]4.1935,[509]4.1939,[510]4.1963,[511]4.1994,[512]4.1973,[513]4.1990,[514]4.2010,[515]4.2021,[516]4.2050,[517]4.2052,[518]4.2041,[519]4.2061,[520]4.2089,[521]4.2111,[522]4.2088,[523]4.2109,[524]4.2126,[525]4.2166,[526]4.2210,[527]4.2246,[528]4.2260,[529]4.2220,[530]4.2215,[531]4.2237,[532]4.2226,[533]4.2225,[534]4.2248,[535]4.2269,[536]4.2236,[537]4.2240,[538]4.2298,[539]4.2267,[540]4.2339,[541]4.2351,[542]4.2342,[543]4.2351,[544]4.2406,[545]4.2411,[546]4.2397,[547]4.2393,[548]4.2346,[549]4.2353,[550]4.2310,[551]4.2283,[552]4.2263,[553]4.2171,[554]4.2134,[555]4.2107,[556]4.2123,[557]4.2146,[558]4.2170,[559]4.2216,[560]4.2249,[561]4.2310,[562]4.2364,[563]4.2422,[564]4.2431,[565]4.2491,[566]4.2511,[567]4.2444,[568]4.2365,[569]4.2280,[570]4.2205,[571]4.2132,[572]4.2054,[573]4.1974,[574]4.1927,[575]4.1865,[576]4.1873,[577]4.1863,[578]4.1901,[579]4.1950,[580]4.2011,[581]4.2029,[582]4.2095,[583]4.2040,[584]4.2019,
Final estimate: PPL = 4.2019 +/- 0.02375

llama_print_timings: load time = 70585.76 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 487292.20 ms / 299008 tokens ( 1.63 ms per token, 613.61 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 497789.02 ms / 299009 tokens

ubergarm

Owner about 3 hours ago

•

edited about 3 hours ago

@shewin

Cool thanks! I don't think your CPU has 192 physical cores though? You can adjust for your CPU core count, but shouldn't effect final result just the speed.

Final estimate: PPL = 4.2019 +/- 0.02375

Very nice! I have had other reports also suggesting that hybrid CUDA+CPU inference results in slightly lower perplexity that I am measuring with CPU-only measurements. Not sure why there is a slight offset, but seems in the correct ball park. Thanks for confirming these quants are likely quite good!

Also 8 minutes seems quite fast!

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

qwen3 thinking test

-f wiki.test.raw --seed 1337 --ctx-size 131072 -fa -ctk q8_0 -ctv q8_0 -b 4096 -ub 4096 -fmoe --override-tensor exps=CPU -ngl 99 --threads 101 --flash-attn -op 27,0,29,0

-f wiki.test.raw
--seed 1337
--ctx-size 131072
-fa
-ctk q8_0 -ctv q8_0
-b 4096 -ub 4096
-fmoe
--override-tensor exps=CPU
-ngl 99
--threads 101
--flash-attn
-op 27,0,29,0