Please share feedback here!
If you’ve tested any of the initial GGUFs, we’d really appreciate your feedback! Let us know if you encountered any issues, what went wrong, or how things could be improved. Also, feel free to share your inference speed results!
Is it working for you?
Q8_0, Llama.cpp:
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_b.weight' has wrong shape; expected 1536, 73728, got 1536, 24576, 1, 1
llama_model_load_from_file_impl: failed to load model
Could you try updating llama.cpp to the latest version?
Yes, resolved, thank you!
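For anyone else hitting the same shape error on an older build: below is a minimal sketch of updating a source checkout of llama.cpp, scripted in Python for convenience. The checkout path and cmake flags are assumptions; adjust them for your setup (e.g. add -DGGML_CUDA=ON if you build with CUDA).

```python
# Minimal sketch: pull the latest llama.cpp and rebuild it.
# LLAMA_CPP_DIR is a placeholder; point it at your own checkout.
import subprocess
from pathlib import Path

LLAMA_CPP_DIR = Path.home() / "llama.cpp"  # hypothetical checkout location

def run(cmd):
    """Run a command inside the llama.cpp checkout and fail loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=LLAMA_CPP_DIR, check=True)

run(["git", "pull"])                          # fetch the latest commits
run(["cmake", "-B", "build"])                 # (re)configure the build
run(["cmake", "--build", "build", "--config", "Release", "-j", "8"])  # rebuild
```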
The system prompt is added between the BOS token and the user role token, right? It seems to work really well!
I suggest you state where the system prompt should be inserted in the prompt template, so that it's clear for text-completion users / users not going through something with an AutoTokenizer.
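For text-completion users, one quick way to check is to render the chat template yourself and look at where the system prompt lands; a minimal sketch, assuming a transformers-compatible tokenizer is available (the model id below is just a placeholder):

```python
# Render the chat template and inspect where the system prompt sits.
# The model id is a placeholder; use the tokenizer that matches your GGUF.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")  # placeholder

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# tokenize=False returns the raw prompt string, so you can see exactly where
# the system prompt ends up relative to the BOS and user role tokens.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```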
I've tested the UD-Q3_K_XL in llama.cpp (Ubuntu), and it works great. I'm testing with a context size of around 14000.
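If anyone wants to script a similar test, here's a rough sketch using llama-cpp-python; the model path and layer count are placeholders, adjust for your hardware:

```python
# Rough sketch: load a quant with a ~14k context via llama-cpp-python.
# model_path and n_gpu_layers are placeholders for your own setup.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-UD-Q3_K_XL.gguf",  # placeholder file name
    n_ctx=14000,       # roughly the context size mentioned above
    n_gpu_layers=40,   # offload as many layers as fit on your GPU(s)
)

out = llm("Explain what a K-quant is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```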
Add a Q1 quant (i.e. 1-bit) as well.
Yo, DeepSeek-V2-Lite 16B needs to be GGUF'ed!
I meant yo.
The Q1 quants are uploading.
They're up now!
Ran the original Unsloth DeepSeek R1 quant on 2x 3090s with 128 GB of RAM and didn't get much in terms of speed, around 2-3 tokens/s. Interested to see how the new Unsloth Dynamic 2.0 GGUFs stack up with their smarter layer-wise quantization.
If you're not on the ik_llama.cpp fork, you're missing out.
Why are these sizes substantially larger than the previous ones? For example, the original UD-Q3_K_XL vs this one: 273 GB vs 350 GB.