GGML_ASSERT(to_fp32_cl != nullptr) failed

#2
by FenixInDarkSolo - opened

I tried to run the model with koboldcpp 1.91, but it does not work.
I haven't seen this error before. Is there any way to run the model?

koboldcpp_191.exe --threads 12 --websearch --port 5002 --host 0.0.0.0 --contextsize 8192 --blasbatchsize 2048 --useclblast 0 0 --gpulayers 40
***
Welcome to KoboldCpp - Version 1.91
For command line arguments, please refer to --help
***
Loading Chat Completions Adapter: C:\Users\FENIX_~1\AppData\Local\Temp\_MEI114962\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
Unable to detect VRAM, please set layers manually.
Unable to determine GPU Memory
Initializing dynamic library: koboldcpp_clblast.dll

Namespace(admin=False, admindir='', adminpassword=None, analyze='', benchmark=None, blasbatchsize=2048, blasthreads=12, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=8192, debugmode=0, defaultgenamt=512, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel='', embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=0, foreground=False, gpulayers=40, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='0.0.0.0', ignoremissing=False, launch=False, lora=None, maxrequestsize=32, mmproj='', mmprojcpu=False, model=[], model_param='D:/program/koboldcpp/Dolphin3.0-L3.2-1B_RP_UNCENSORED.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', overridekv='', overridetensors='', password=None, port=5002, port_param=5001, preloadstory='', prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile='', sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdquant=False, sdt5xxl='', sdthreads=0, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=12, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=[0, 0], usecpu=False, usecublas=None, usemlock=False, usemmap=False, usevulkan=None, version=False, visionmaxres=1024, websearch=True, whispermodel='')

Loading Text Model: D:\program\koboldcpp\Dolphin3.0-L3.2-1B_RP_UNCENSORED.gguf

The reported GGUF Arch is: llama
Arch Category: 0


Identified as GGUF model.
Attempting to Load...

Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!

Platform:0 Device:0 - AMD Accelerated Parallel Processing with gfx1035
Platform:1 Device:0 - OpenCLOn12 with AMD Radeon(TM) Graphics
Platform:1 Device:1 - OpenCLOn12 with Microsoft Basic Render Driver

ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: selecting device: 'gfx1035'
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
llama_model_loader: loaded meta data with 81 key-value pairs and 147 tensors from D:\program\koboldcpp\Dolphin3.0-L3.2-1B_RP_UNCENSORED.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file type = unknown, may not work
print_info: file size = 701.25 MiB (4.76 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special tokens cache size = 258
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2048
print_info: n_layer = 16
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 8192
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 1B
print_info: model params = 1.24 B
print_info: general.name = Dolphin 3.0 Llama 3.2 1b
print_info: vocab type = BPE
print_info: n_vocab = 128258
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128256 '<|im_end|>'
print_info: EOT token = 128256 '<|im_end|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: PAD token = 128001 '<|end_of_text|>'
print_info: LF token = 198 '?'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: EOG token = 128256 '<|im_end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)

OpenCL GPU Offload Fallback...
load_tensors: relocated tensors: 163 of 163
load_tensors: offloading 16 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 17/17 layers to GPU
load_tensors: CPU model buffer size = 701.25 MiB
..............................................................
Automatic RoPE Scaling: Using (scale:1.000, base:500000.0).
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8312
llama_context: n_ctx_per_seq = 8312
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8312) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CPU output buffer size = 0.49 MiB
create_memory: n_ctx = 8320 (padded)
llama_kv_cache_unified: kv_size = 8320, type_k = 'f16', type_v = 'f16', n_layer = 16, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 260.00 MiB
llama_kv_cache_unified: KV self size = 260.00 MiB, K (f16): 130.00 MiB, V (f16): 130.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 2048, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 2048, n_seqs = 1
llama_context: CPU compute buffer size = 2209.02 MiB
llama_context: graph nodes = 550
llama_context: graph splits = 1

OpenCL: Unsupported Tensor Type Detected: 23
otherarch/ggml_v3b-opencl.cpp:1820: GGML_ASSERT(to_fp32_cl != nullptr) failed
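
For context: the assert means the legacy CLBlast/OpenCL backend could not find a dequantize-to-fp32 kernel for one of the model's tensor types. In ggml's type enumeration the reported value 23 corresponds to GGML_TYPE_IQ4_XS, one of the newer i-quants, and the old OpenCL path only ships kernels for the classic quant formats. Below is a minimal, self-contained C++ sketch of that lookup-and-abort pattern; the function and table are illustrative placeholders, not the actual otherarch/ggml_v3b-opencl.cpp source.

#include <cassert>
#include <cstdio>
#include <map>

// Signature of a "dequantize one row to fp32" kernel (illustrative only).
using dequant_fn = void (*)(const void * src, float * dst, int n);

// Stub kernels standing in for the real OpenCL dequantization routines.
static void dequantize_q4_0(const void *, float *, int) {}
static void dequantize_q8_0(const void *, float *, int) {}

// Hypothetical per-type kernel table: only the older quant types are covered,
// so newer types (e.g. 23 = GGML_TYPE_IQ4_XS) resolve to nullptr.
static dequant_fn get_to_fp32_cl(int ggml_type) {
    static const std::map<int, dequant_fn> kernels = {
        {2, dequantize_q4_0},   // GGML_TYPE_Q4_0
        {8, dequantize_q8_0},   // GGML_TYPE_Q8_0
    };
    auto it = kernels.find(ggml_type);
    return it == kernels.end() ? nullptr : it->second;
}

int main() {
    const int tensor_type = 23;  // GGML_TYPE_IQ4_XS, as reported in the log above
    dequant_fn to_fp32_cl = get_to_fp32_cl(tensor_type);
    if (to_fp32_cl == nullptr) {
        std::fprintf(stderr, "OpenCL: Unsupported Tensor Type Detected: %d\n", tensor_type);
    }
    assert(to_fp32_cl != nullptr);  // aborts here, like GGML_ASSERT in the backend
    return 0;
}

If that is indeed the cause, the same GGUF should still load on backends that do support the IQ quants (plain CPU, CUDA, or Vulkan builds); only this legacy OpenCL path is missing the kernels.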

Strange. I tested this model without problems here:
colab.research.google.com/github/LostRuins/koboldcpp/blob/concedo/colab.ipynb#scrollTo=uJS9i_Dltv8Y

Paste this address into the first box and press Play:
https://huggingface.co/Novaciano/DOLPHIN3.0-L3.2-1B_RP_UNCENSORED-GGUF/resolve/main/Dolphin3.0-L3.2-1B_RP_UNCENSORED.gguf
