---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: deepcogito/cogito-v2-preview-deepseek-671B-MoE
license: mit
base_model_relation: quantized
tags:
  - mla
  - imatrix
  - conversational
  - deepseek_v3
  - ik_llama.cpp
---

WIP: This big one will take a while, so please be patient as it cooks and uploads!

ik_llama.cpp imatrix Quantizations of deepcogito/cogito-v2-preview-deepseek-671B-MoE

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks, helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
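For reproducibility, a sketch of how such numbers are typically produced with ik_llama.cpp's llama-perplexity tool (the model path here is a placeholder, and the -fa/-fmoe/-mla flags mirror the server invocation further below; exact settings for these runs are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: compute perplexity against wiki.test.raw.
# Model path and thread count are placeholders/assumptions.
model=./cogito-v2-preview-deepseek-671B-MoE-Q8_0.gguf

./build/bin/llama-perplexity \
    --model "$model" \
    -f wiki.test.raw \
    -fa -fmoe \
    -mla 3 \
    --threads 128
```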

Perplexity Chart

These first two are just test quants for baseline perplexity comparison:

  • Q8_0 665.301 GiB (8.504 BPW)
    • Final estimate: PPL = TODO
  • Q4_0 TODO GiB (TODO BPW)
    • Final estimate: PPL = TODO
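As a sanity check, file size and bits-per-weight are consistent with the 671B parameter count: size in bits divided by BPW recovers the weight count. A quick check using the Q8_0 numbers above:

```shell
# Sanity check: (GiB * 2^30 bytes * 8 bits) / bits-per-weight
# should recover roughly the 671B parameter count.
awk 'BEGIN { printf "%.0fB params\n", 665.301 * 2^30 * 8 / 8.504 / 1e9 }'
# → 672B params
```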

TODO

Quick Start

CPU-Only

Note: the chat template is currently auto-detected incorrectly, so explicitly set --chat-template deepseek3.

#!/usr/bin/env bash

model=/mnt/raid/models/ubergarm/cogito-v2-preview-deepseek-671B-MoE-GGUF/cogito-v2-preview-deepseek-671B-MoE-Q8_0.gguf

numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/cogito-v2-preview-deepseek-671B-MoE-Q8_0 \
    --chat-template deepseek3 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -fa -fmoe \
    -mla 3 \
    --parallel 1 \
    --threads 128 \
    --threads-batch 192 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap
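Once the server is up, you can smoke-test it over llama-server's OpenAI-compatible HTTP API. A minimal sketch (host and port match the script above; the prompt and max_tokens are just example values):

```shell
# Minimal smoke test against the running llama-server instance,
# using the OpenAI-compatible chat completions endpoint.
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/cogito-v2-preview-deepseek-671B-MoE-Q8_0",
          "messages": [{"role": "user", "content": "Hello, who are you?"}],
          "max_tokens": 64
        }'
```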

References