Qwen3 thinking test
Hardware: W790E Sage + QYFS (Xeon engineering sample) + 512 GB RAM + RTX 5090
IQ5_K:
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 40915.49 MiB
llm_load_tensors: CPU buffer size = 41335.49 MiB
llm_load_tensors: CPU buffer size = 41443.49 MiB
llm_load_tensors: CPU buffer size = 41298.98 MiB
llm_load_tensors: CPU buffer size = 491.49 MiB
llm_load_tensors: CUDA0 buffer size = 6064.04 MiB
llama_new_context_with_model: n_ctx = 163840
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 15980.05 MiB
llama_new_context_with_model: KV self size = 15980.00 MiB, K (q8_0): 7990.00 MiB, V (q8_0): 7990.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 4224.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2624.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 190
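The q8_0 KV numbers above are internally consistent; a back-of-envelope check (assuming this is Qwen3-235B-A22B, i.e. the 94 repeating layers with 4 KV heads of head dim 128, and q8_0's 34-bytes-per-32-values layout):
# 94 layers * 4 kv_heads * 128 head_dim * 163840 ctx tokens * 17/16 bytes/element, in MiB
echo $(( 94 * 4 * 128 * 163840 * 17 / 16 / 1024 / 1024 ))   # -> 7990 MiB for K; V is identical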
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 163840, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0     | 17.122 | 239.22   | 68.431 | 14.96    |
| 4096 | 1024 | 4096  | 17.384 | 235.62   | 69.513 | 14.73    |
| 4096 | 1024 | 8192  | 17.810 | 229.98   | 70.888 | 14.45    |
| 4096 | 1024 | 12288 | 17.789 | 230.25   | 72.412 | 14.14    |
| 4096 | 1024 | 16384 | 18.391 | 222.72   | 79.688 | 12.85    |
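A table in this format is what ik_llama.cpp's llama-sweep-bench prints, so the run above was presumably invoked along these lines (a sketch only, reusing the model and offload flags from the perplexity command below; "$model" is a placeholder):
./build/bin/llama-sweep-bench \
    -m "$model" \
    -c 163840 -b 4096 -ub 4096 \
    -fa -ctk q8_0 -ctv q8_0 \
    -fmoe \
    --override-tensor exps=CPU \
    -ngl 99 --threads 101 \
    -op 27,0,29,0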
llama-perplexity command (-fa and --flash-attn are the same flag, so it is listed once here; -op 27,0,29,0 sets the offload policy for ops 27 and 29 to off, which the log below identifies as MUL_MAT_ID and MOE_FUSED_UP_GATE):
./build/bin/llama-perplexity \
    -f wiki.test.raw \
    --seed 1337 \
    --ctx-size 131072 \
    -fa -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 \
    -fmoe \
    --override-tensor exps=CPU \
    -ngl 99 \
    --threads 101 \
    -op 27,0,29,0
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
llm_load_tensors: CPU buffer size = 40915.49 MiB
llm_load_tensors: CPU buffer size = 41335.49 MiB
llm_load_tensors: CPU buffer size = 41443.49 MiB
llm_load_tensors: CPU buffer size = 41298.98 MiB
llm_load_tensors: CPU buffer size = 491.49 MiB
llm_load_tensors: CUDA0 buffer size = 6064.04 MiB
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 12784.05 MiB
llama_new_context_with_model: KV self size = 12784.00 MiB, K (q8_0): 6392.00 MiB, V (q8_0): 6392.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3456.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 2112.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 190
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MUL_MAT_ID) = 0
XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MOE_FUSED_UP_GATE) = 0
system_info: n_threads = 101 / 112 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 826.388 ms
perplexity: calculating perplexity over 2 chunks, n_ctx=131072, batch_size=4096, n_seq=1
perplexity: 744.57 seconds per pass - ETA 24.82 minutes
[1]3.7869,[2]3.5233,
Final estimate: PPL = 3.5233 +/- 0.01936
llama_print_timings: load time = 6887.95 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 1450020.61 ms / 262144 tokens ( 5.53 ms per token, 180.79 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 1474456.19 ms / 262145 tokens
For llama-perplexity I would advise using -c 512 and the full f16 KV cache if you want numbers you can compare with mine. Here is the command I'm using:
numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
--ctx-size 512 \
-ub 4096 -b 4096 \
--numa numactl \
--threads 128 \
--threads-batch 192 \
--no-mmap
This is CPU-only (numactl -N 1 -m 1 pins the run to NUMA node 1's cores and memory), but your offload setup is fine with CUDA.
If you change -c, the results change a lot, so they are not comparable unless the same context window is used.
Those are some good speeds, though, on the larger IQ5_K, which is a great-quality quant, fairly close to Q8_0!
Thanks for sharing your results!
Re-running with the suggested settings (without the NUMA pinning):
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
--ctx-size 512 \
-ub 4096 -b 4096 \
--threads 128 \
--threads-batch 192 \
--no-mmap
(The log below shows n_ctx = 4096 rather than 512 because llama-perplexity packs n_seq = 4096 / 512 = 8 sequences into each 4096-token batch; see the "n_seq=8" line further down.)
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 752.00 MiB
llama_new_context_with_model: KV self size = 752.00 MiB, K (f16): 376.00 MiB, V (f16): 376.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 4.64 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 3057.49 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 128.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 1319
system_info: n_threads = 128 (n_threads_batch = 192) / 112 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 823.796 ms
perplexity: calculating perplexity over 584 chunks, n_ctx=512, batch_size=4096, n_seq=8
perplexity: 7.21 seconds per pass - ETA 8.77 minutes
[1]3.3073,[2]4.4579,[3]3.5789,[4]2.7655,[5]2.3961,[6]2.1569,[7]2.0090,[8]1.9374,[9]1.8818,[10]1.8197, ... ,[578]4.1901,[579]4.1950,[580]4.2011,[581]4.2029,[582]4.2095,[583]4.2040,[584]4.2019,
Final estimate: PPL = 4.2019 +/- 0.02375
llama_print_timings: load time = 70585.76 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 487292.20 ms / 299008 tokens ( 1.63 ms per token, 613.61 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 497789.02 ms / 299009 tokens
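The token counts in the two timing summaries line up with the chunking arithmetic (the shorter context wastes less of wiki.test.raw to end-of-file truncation), e.g.:
echo $(( 2 * 131072 ))   # 262144 tokens for the --ctx-size 131072 run
echo $(( 584 * 512 ))    # 299008 tokens for the --ctx-size 512 run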
Cool, thanks! I don't think your CPU has 192 physical cores, though. You can adjust for your CPU core count; it shouldn't affect the final result, just the speed.
Final estimate: PPL = 4.2019 +/- 0.02375
Very nice! I have had other reports also suggesting that hybrid CUDA+CPU inference gives slightly lower perplexity than my CPU-only measurements. Not sure why there is a slight offset, but it seems in the right ballpark. Thanks for confirming these quants are likely quite good!
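To judge whether such an offset is statistically real, here is a rough sketch using the +/- standard errors llama-perplexity prints (the 4.25 / 0.024 CPU-only pair below is a made-up placeholder, not a measured value, and since both runs score the same text this z-score is only approximate):
awk 'BEGIN {
  a = 4.2019; sa = 0.02375   # hybrid CUDA+CPU run above
  b = 4.25;   sb = 0.024     # hypothetical CPU-only run (placeholder)
  # difference of the two estimates over the combined standard error
  z = (b - a) / sqrt(sa^2 + sb^2)
  printf "z = %.2f (|z| > 2 would suggest a real difference)\n", z
}'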
Also 8 minutes seems quite fast!