Looking for a GGUF version of this model.
I was able to convert the GGML file to GGUF, but the resulting model won't load:
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_ctx = 3072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 64
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 5.0e-06
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 65B
llm_load_print_meta: model ftype = mostly Q4_K - Medium (guessed)
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = llama-2-70b-chat.ggmlv3.q4_K_M.bin
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla P100-PCIE-16GB) as main device
error loading model: create_tensor: tensor 'blk.0.attn_k.weight' has wrong shape; expected 8192, 8192, got 8192, 1024, 1, 1
llama_load_model_from_file: failed to load model
The grouped-query attention (GQA) factor should have been 8 for the 70B model. Since you converted from GGML, which does not store that value, the converter fell back to the default of 1, so the loader expected an attn_k weight of 8192 x 8192 instead of the 8192 x 1024 (8 KV heads x 128 head dim) actually stored in the file. You should reconvert and correctly pass the --gqa 8 parameter this time.
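For reference, the command would look something like the following (a sketch only; the GGML-to-GGUF converter in llama.cpp was named convert-llama-ggmlv3-to-gguf.py around that time, but the script name and file paths may differ in your checkout, so adjust them to match):

python convert-llama-ggmlv3-to-gguf.py --input llama-2-70b-chat.ggmlv3.q4_K_M.bin --output llama-2-70b-chat.Q4_K_M.gguf --gqa 8

With --gqa 8 the converter writes n_head_kv = 8 into the GGUF metadata, so the loader's expected shape for blk.*.attn_k.weight matches the 8192 x 1024 tensor in the file and the model should load.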