---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: deepcogito/cogito-v2-preview-deepseek-671B-MoE
license: mit
base_model_relation: quantized
tags:
  - mla
  - imatrix
  - conversational
  - deepseek_v3
  - ik_llama.cpp
---

WIP: This big one will take a while, so please be patient as it cooks and uploads!

ik_llama.cpp imatrix Quantizations of deepcogito/cogito-v2-preview-deepseek-671B-MoE

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks, helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.
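For reproducibility, a sketch of how such numbers are typically produced with ik_llama.cpp's llama-perplexity tool (the model path here is a placeholder, and the -fa/-fmoe/-mla flags mirror the server invocation further below; exact settings for these runs are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: compute perplexity against wiki.test.raw.
# Model path and thread count are placeholders/assumptions.
model=./cogito-v2-preview-deepseek-671B-MoE-Q8_0.gguf

./build/bin/llama-perplexity \
    --model "$model" \
    -f wiki.test.raw \
    -fa -fmoe \
    -mla 3 \
    --threads 128
```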

Perplexity Chart

These first two are just test quants for baseline perplexity comparison:

  • Q8_0 665.301 GiB (8.504 BPW)
    • Final estimate: PPL = TODO
  • Q4_0 TODO GiB (TODO BPW)
    • Final estimate: PPL = TODO
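As a sanity check, file size and bits-per-weight are consistent with the 671B parameter count: size in bits divided by BPW recovers the weight count. A quick check using the Q8_0 numbers above:

```shell
# Sanity check: (GiB * 2^30 bytes * 8 bits) / bits-per-weight
# should recover roughly the 671B parameter count.
awk 'BEGIN { printf "%.0fB params\n", 665.301 * 2^30 * 8 / 8.504 / 1e9 }'
# → 672B params
```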

TODO

Quick Start

CPU-Only

Note: the chat template is currently auto-detected incorrectly, so explicitly set --chat-template deepseek3.

#!/usr/bin/env bash

model=/mnt/raid/models/ubergarm/cogito-v2-preview-deepseek-671B-MoE-GGUF/cogito-v2-preview-deepseek-671B-MoE-Q8_0.gguf

numactl -N 0 -m 0 \
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/cogito-v2-preview-deepseek-671B-MoE-Q8_0 \
    --chat-template deepseek3 \
    --ctx-size 32768 \
    -ctk q8_0 \
    -fa -fmoe \
    -mla 3 \
    --parallel 1 \
    --threads 128 \
    --threads-batch 192 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap
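Once the server is up, you can smoke-test it over llama-server's OpenAI-compatible HTTP API. A minimal sketch (host and port match the script above; the prompt and max_tokens are just example values):

```shell
# Minimal smoke test against the running llama-server instance,
# using the OpenAI-compatible chat completions endpoint.
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/cogito-v2-preview-deepseek-671B-MoE-Q8_0",
          "messages": [{"role": "user", "content": "Hello, who are you?"}],
          "max_tokens": 64
        }'
```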

References