qwen3 thinking test

#2 - opened by shewin

W790E Sage + QYFS + 512G + RTX5090

IQ5_K:

llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 40915.49 MiB
llm_load_tensors: CPU buffer size = 41335.49 MiB
llm_load_tensors: CPU buffer size = 41443.49 MiB
llm_load_tensors: CPU buffer size = 41298.98 MiB
llm_load_tensors: CPU buffer size = 491.49 MiB
llm_load_tensors: CUDA0 buffer size = 6064.04 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 163840
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 15980.05 MiB
llama_new_context_with_model: KV self size = 15980.00 MiB, K (q8_0): 7990.00 MiB, V (q8_0): 7990.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 4224.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2624.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 190
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
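
As a sanity check on those KV numbers (assuming the usual Qwen3-235B attention shape of 94 layers with 4 KV heads of dimension 128, and q8_0's 34 bytes per 32 values; none of this is stated in the log itself): 163840 tokens x 94 layers x 512 K values x 34/32 bytes ~ 8.38 GB = 7990 MiB, which matches the "K (q8_0): 7990.00 MiB" line exactly, with the same again for V.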

main: n_kv_max = 163840, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|------|------|-------|---------|----------|---------|----------|
| 4096 | 1024 |     0 |  17.122 |   239.22 |  68.431 |    14.96 |
| 4096 | 1024 |  4096 |  17.384 |   235.62 |  69.513 |    14.73 |
| 4096 | 1024 |  8192 |  17.810 |   229.98 |  70.888 |    14.45 |
| 4096 | 1024 | 12288 |  17.789 |   230.25 |  72.412 |    14.14 |
| 4096 | 1024 | 16384 |  18.391 |   222.72 |  79.688 |    12.85 |
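
A table in this format is what ik_llama.cpp's llama-sweep-bench prints: each PP/TG pair is re-measured at increasing KV depth (N_KV), which is why prefill and generation speeds taper off toward 16k tokens. A minimal sketch of an invocation that would produce a run like the one above, reconstructed from the context log (the binary path and "$model" placeholder are assumptions, not from the original post):

./build/bin/llama-sweep-bench \
    -m "$model" \
    -c 163840 -b 4096 -ub 4096 \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    --override-tensor exps=CPU \
    -ngl 99 --threads 101 \
    -op 27,0,29,0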

llama-perplexity:

-f wiki.test.raw
--seed 1337
--ctx-size 131072
-fa
-ctk q8_0 -ctv q8_0
-b 4096 -ub 4096
-fmoe
--override-tensor exps=CPU
-ngl 99
--threads 101
--flash-attn
-op 27,0,29,0
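
Assembled into a single invocation (the binary path and "$model" placeholder are assumptions; the flags are exactly the ones listed above, with the duplicate -fa/--flash-attn collapsed into one). The -op argument takes (op, 0/1) pairs; here ops 27 and 29 are switched off, which the "Setting offload policy ... to OFF" lines in the log below confirm are MUL_MAT_ID and MOE_FUSED_UP_GATE:

./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    --ctx-size 131072 \
    -fa -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 \
    -fmoe \
    --override-tensor exps=CPU \
    -ngl 99 \
    --threads 101 \
    -op 27,0,29,0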

llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 40915.49 MiB
llm_load_tensors: CPU buffer size = 41335.49 MiB
llm_load_tensors: CPU buffer size = 41443.49 MiB
llm_load_tensors: CPU buffer size = 41298.98 MiB
llm_load_tensors: CPU buffer size = 491.49 MiB
llm_load_tensors: CUDA0 buffer size = 6064.04 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 12784.05 MiB
llama_new_context_with_model: KV self size = 12784.00 MiB, K (q8_0): 6392.00 MiB, V (q8_0): 6392.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3456.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2112.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 190
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MUL_MAT_ID) = 0
XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MOE_FUSED_UP_GATE) = 0

system_info: n_threads = 101 / 112 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 826.388 ms
perplexity: calculating perplexity over 2 chunks, n_ctx=131072, batch_size=4096, n_seq=1
perplexity: 744.57 seconds per pass - ETA 24.82 minutes
[1]3.7869,[2]3.5233,
Final estimate: PPL = 3.5233 +/- 0.01936

llama_print_timings: load time = 6887.95 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 1450020.61 ms / 262144 tokens ( 5.53 ms per token, 180.79 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 1474456.19 ms / 262145 tokens

For llama-perplexity I would advise using -c 512 and the full f16 KV cache if you want numbers you can compare with mine. Here is the command I'm using:

numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    --seed 1337 \
    -fa -fmoe \
    --ctx-size 512 \
    -ub 4096 -b 4096 \
    --numa numactl \
    --threads 128 \
    --threads-batch 192 \
    --no-mmap
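
The numactl -N 1 -m 1 prefix pins execution to NUMA node 1's CPUs and memory allocation to node 1's memory. If you want to replicate that, a quick generic check of your node layout first (not part of the original command):

    numactl --hardware    # lists NUMA nodes with their CPUs and memory sizes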

This is CPU-only, but your CUDA offload setup is fine too.

Changing -c changes the results a lot, so numbers are not comparable unless they were measured with the same context window.
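
(For intuition: perplexity is the exponential of the mean per-token negative log-likelihood, PPL = exp((1/N) * sum_i NLL_i). With a larger -c each token is predicted from a longer preceding context, so the mean NLL drops, which is why the 131072-context run above lands near 3.52 while the 512-context run lands near 4.20.)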

Those are some good speeds though on the larger IQ5_K, which is a great-quality quant, fairly close to Q8_0!

Thanks for sharing your results!

-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
--ctx-size 512 \
-ub 4096 -b 4096 \
--threads 128 \
--threads-batch 192 \
--no-mmap

llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 752.00 MiB
llama_new_context_with_model: KV self size = 752.00 MiB, K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 4.64 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3057.49 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 128.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 1319

system_info: n_threads = 128 (n_threads_batch = 192) / 112 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 823.796 ms
perplexity: calculating perplexity over 584 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 7.21 seconds per pass - ETA 8.77 minutes
[1]3.3073,[2]4.4579,[3]3.5789,[4]2.7655,[5]2.3961,[6]2.1569,[7]2.0090,[8]1.9374,[9]1.8818,[10]1.8197,[11]1.7732,[12]1.8214,[13]1.8418,[14]1.9151,[15]2.0453,[16]2.2018,[17]2.2292,[18]2.3646,[19]2.4495,[20]2.4674,[21]2.4848,[22]2.5805,[23]2.5505,[24]2.5105,[25]2.4932,[26]2.4781,[27]2.4545,[28]2.4436,[29]2.5090,[30]2.5308,[31]2.5638,[32]2.6135,[33]2.6273,[34]2.6704,[35]2.7262,[36]2.8011,[37]2.8701,[38]2.9234,[39]2.9582,[40]3.0107,[41]3.0563,[42]3.0883,[43]3.1337,[44]3.1558,[45]3.1841,[46]3.2163,[47]3.3145,[48]3.3809,[49]3.3758,[50]3.3178,[51]3.2758,[52]3.3069,[53]3.3544,[54]3.3907,[55]3.4159,[56]3.4551,[57]3.4655,[58]3.4960,[59]3.5183,[60]3.5425,[61]3.5727,[62]3.5985,[63]3.6457,[64]3.6794,[65]3.7310,[66]3.7721,[67]3.8264,[68]3.8344,[69]3.8626,[70]3.8737,[71]3.9016,[72]3.9436,[73]3.9705,[74]3.9984,[75]3.9901,[76]4.0015,[77]4.0277,[78]4.0452,[79]3.9985,[80]3.9466,[81]3.9051,[82]3.8814,[83]3.8449,[84]3.8212,[85]3.8085,[86]3.8501,[87]3.8202,[88]3.7889,[89]3.7682,[90]3.7456,[91]3.7366,[92]3.7058,[93]3.6892,[94]3.6622,[95]3.6334,[96]3.6120,[97]3.5897,[98]3.5626,[99]3.5365,[100]3.5404,[101]3.5561,[102]3.5388,[103]3.5192,[104]3.5088,[105]3.5119,[106]3.5243,[107]3.5516,[108]3.5807,[109]3.6068,[110]3.6447,[111]3.7013,[112]3.7150,[113]3.7124,[114]3.7562,[115]3.7797,[116]3.7568,[117]3.7259,[118]3.7107,[119]3.6983,[120]3.6898,[121]3.6836,[122]3.6790,[123]3.6670,[124]3.6535,[125]3.6395,[126]3.6343,[127]3.6166,[128]3.6118,[129]3.5993,[130]3.5857,[131]3.5727,[132]3.5615,[133]3.5614,[134]3.5583,[135]3.5617,[136]3.5708,[137]3.5647,[138]3.5660,[139]3.5816,[140]3.5750,[141]3.5787,[142]3.5825,[143]3.5902,[144]3.6117,[145]3.6060,[146]3.5996,[147]3.5961,[148]3.5911,[149]3.5934,[150]3.5864,[151]3.5862,[152]3.5891,[153]3.5963,[154]3.5911,[155]3.5977,[156]3.5924,[157]3.5900,[158]3.5839,[159]3.5808,[160]3.5748,[161]3.5760,[162]3.5808,[163]3.5772,[164]3.5835,[165]3.5869,[166]3.5907,[167]3.5992,[168]3.6137,[169]3.6222,[170]3.6370,[171]3.6464,[172]3.6656,[173]3.6937,[174]3.7071,[175]3.7355,[176]3.7527,[177]3.7775,[178]3.8014,[179]3.8019,[180]3.7858,[181]3.7680,[182]3.7607,[183]3.7421,[184]3.7294,[185]3.7114,[186]3.6917,[187]3.6747,[188]3.6786,[189]3.6928,[190]3.7125,[191]3.7269,[192]3.7397,[193]3.7469,[194]3.7643,[195]3.7792,[196]3.7917,[197]3.8010,[198]3.7993,[199]3.8050,[200]3.8062,[201]3.8140,[202]3.8248,[203]3.8350,[204]3.8502,[205]3.8615,[206]3.8609,[207]3.8766,[208]3.8731,[209]3.8801,[210]3.8822,[211]3.8868,[212]3.8960,[213]3.9024,[214]3.9037,[215]3.9040,[216]3.9084,[217]3.9157,[218]3.9201,[219]3.9171,[220]3.9100,[221]3.9100,[222]3.9118,[223]3.9147,[224]3.9234,[225]3.9186,[226]3.9207,[227]3.9208,[228]3.9151,[229]3.9076,[230]3.9065,[231]3.9027,[232]3.9010,[233]3.9035,[234]3.9100,[235]3.9143,[236]3.9030,[237]3.9012,[238]3.8988,[239]3.8938,[240]3.8986,[241]3.9046,[242]3.9123,[243]3.9099,[244]3.9212,[245]3.9245,[246]3.9338,[247]3.9414,[248]3.9475,[249]3.9552,[250]3.9601,[251]3.9708,[252]3.9818,[253]3.9960,[254]4.0080,[255]4.0113,[256]4.0204,[257]4.0278,[258]4.0261,[259]4.0186,[260]4.0133,[261]4.0060,[262]4.0042,[263]4.0055,[264]4.0068,[265]4.0154,[266]4.0205,[267]4.0236,[268]4.0235,[269]4.0279,[270]4.0296,[271]4.0296,[272]4.0300,[273]4.0302,[274]4.0308,[275]4.0301,[276]4.0264,[277]4.0300,[278]4.0317,[279]4.0258,[280]4.0252,[281]4.0252,[282]4.0254,[283]4.0183,[284]4.0089,[285]4.0113,[286]4.0011,[287]3.9960,[288]3.9928,[289]3.9938,[290]4.0064,[291]4.0130,[292]4.0142,[293]4.0197,[294]4.0285,[295]4.0361,[296]4.0415,[297]4.0536,[298]4.0520,[299]4.0518,[300]4.0557,[301]4.0559,[302]4.0562,[303]4.0535,[304]4.0653,[305]4.0698,
[306]4.0703,[307]4.0749,[308]4.0794,[309]4.0817,[310]4.0876,[311]4.0915,[312]4.0911,[313]4.0902,[314]4.0950,[315]4.0913,[316]4.0941,[317]4.1060,[318]4.1134,[319]4.1130,[320]4.1081,[321]4.1031,[322]4.1064,[323]4.1117,[324]4.1188,[325]4.1318,[326]4.1353,[327]4.1332,[328]4.1370,[329]4.1346,[330]4.1342,[331]4.1338,[332]4.1381,[333]4.1396,[334]4.1399,[335]4.1387,[336]4.1392,[337]4.1438,[338]4.1510,[339]4.1488,[340]4.1494,[341]4.1480,[342]4.1507,[343]4.1546,[344]4.1599,[345]4.1638,[346]4.1634,[347]4.1604,[348]4.1624,[349]4.1617,[350]4.1613,[351]4.1630,[352]4.1681,[353]4.1686,[354]4.1663,[355]4.1759,[356]4.1842,[357]4.1905,[358]4.1819,[359]4.1766,[360]4.1744,[361]4.1774,[362]4.1695,[363]4.1668,[364]4.1681,[365]4.1768,[366]4.1903,[367]4.1993,[368]4.2145,[369]4.2238,[370]4.2336,[371]4.2439,[372]4.2559,[373]4.2591,[374]4.2657,[375]4.2763,[376]4.2849,[377]4.2921,[378]4.2962,[379]4.3007,[380]4.3116,[381]4.3217,[382]4.3273,[383]4.3338,[384]4.3411,[385]4.3550,[386]4.3634,[387]4.3654,[388]4.3678,[389]4.3733,[390]4.3869,[391]4.3997,[392]4.3978,[393]4.3985,[394]4.3959,[395]4.3977,[396]4.4051,[397]4.4112,[398]4.4146,[399]4.4188,[400]4.4265,[401]4.4296,[402]4.4317,[403]4.4299,[404]4.4162,[405]4.4035,[406]4.3923,[407]4.3904,[408]4.3785,[409]4.3676,[410]4.3542,[411]4.3427,[412]4.3314,[413]4.3180,[414]4.3060,[415]4.2940,[416]4.2798,[417]4.2663,[418]4.2554,[419]4.2419,[420]4.2289,[421]4.2163,[422]4.2045,[423]4.1942,[424]4.1837,[425]4.1738,[426]4.1718,[427]4.1722,[428]4.1728,[429]4.1738,[430]4.1702,[431]4.1647,[432]4.1611,[433]4.1575,[434]4.1531,[435]4.1514,[436]4.1424,[437]4.1364,[438]4.1355,[439]4.1299,[440]4.1319,[441]4.1268,[442]4.1208,[443]4.1135,[444]4.1125,[445]4.1109,[446]4.1054,[447]4.0961,[448]4.0884,[449]4.0807,[450]4.0735,[451]4.0739,[452]4.0705,[453]4.0686,[454]4.0693,[455]4.0709,[456]4.0726,[457]4.0782,[458]4.0871,[459]4.0853,[460]4.0858,[461]4.0854,[462]4.0871,[463]4.0944,[464]4.0956,[465]4.0981,[466]4.1007,[467]4.1065,[468]4.1096,[469]4.1115,[470]4.1164,[471]4.1155,[472]4.1200,[473]4.1176,[474]4.1203,[475]4.1259,[476]4.1255,[477]4.1247,[478]4.1210,[479]4.1244,[480]4.1308,[481]4.1354,[482]4.1307,[483]4.1372,[484]4.1435,[485]4.1483,[486]4.1508,[487]4.1561,[488]4.1560,[489]4.1544,[490]4.1564,[491]4.1560,[492]4.1578,[493]4.1559,[494]4.1567,[495]4.1563,[496]4.1567,[497]4.1640,[498]4.1694,[499]4.1671,[500]4.1693,[501]4.1706,[502]4.1702,[503]4.1790,[504]4.1833,[505]4.1877,[506]4.1881,[507]4.1887,[508]4.1935,[509]4.1939,[510]4.1963,[511]4.1994,[512]4.1973,[513]4.1990,[514]4.2010,[515]4.2021,[516]4.2050,[517]4.2052,[518]4.2041,[519]4.2061,[520]4.2089,[521]4.2111,[522]4.2088,[523]4.2109,[524]4.2126,[525]4.2166,[526]4.2210,[527]4.2246,[528]4.2260,[529]4.2220,[530]4.2215,[531]4.2237,[532]4.2226,[533]4.2225,[534]4.2248,[535]4.2269,[536]4.2236,[537]4.2240,[538]4.2298,[539]4.2267,[540]4.2339,[541]4.2351,[542]4.2342,[543]4.2351,[544]4.2406,[545]4.2411,[546]4.2397,[547]4.2393,[548]4.2346,[549]4.2353,[550]4.2310,[551]4.2283,[552]4.2263,[553]4.2171,[554]4.2134,[555]4.2107,[556]4.2123,[557]4.2146,[558]4.2170,[559]4.2216,[560]4.2249,[561]4.2310,[562]4.2364,[563]4.2422,[564]4.2431,[565]4.2491,[566]4.2511,[567]4.2444,[568]4.2365,[569]4.2280,[570]4.2205,[571]4.2132,[572]4.2054,[573]4.1974,[574]4.1927,[575]4.1865,[576]4.1873,[577]4.1863,[578]4.1901,[579]4.1950,[580]4.2011,[581]4.2029,[582]4.2095,[583]4.2040,[584]4.2019,
Final estimate: PPL = 4.2019 +/- 0.02375

llama_print_timings: load time = 70585.76 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 487292.20 ms / 299008 tokens ( 1.63 ms per token, 613.61 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 497789.02 ms / 299009 tokens

@shewin

Cool, thanks! I don't think your CPU has 192 physical cores though? You can adjust for your CPU core count; it shouldn't affect the final result, just the speed.
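
(A quick generic way to check, not from the original exchange; the system_info line above already shows only 112 logical CPUs:)

    nproc                                              # logical CPUs available
    lscpu | grep -E '^(Socket|Core|Thread|CPU)\(s\)'   # sockets / cores / threads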

Final estimate: PPL = 4.2019 +/- 0.02375

Very nice! I have had other reports also suggesting that hybrid CUDA+CPU inference results in slightly lower perplexity than what I measure CPU-only. Not sure why there is a slight offset, but it seems in the right ballpark. Thanks for confirming these quants are likely quite good!

Also, 8 minutes seems quite fast!
