SnowDrogito-RpR-32B_IQ4-XS

SnowDrogito-RpR-32B Banner

Updates and Description of Files

  • Recently uploaded files use ArliAI RpR v3 instead of v1, as indicated in the filename.
  • All quantizations in this repo use IQ4_XS as a base with Q8 embedding and output tensors.
  • (Recommended) SnowDrogito-RpR3-32B_IQ4-XS+Enhanced_Tensors.gguf - the largest and highest-quality quant, roughly Q4_K_M size. Quantized with an imatrix recalibrated on Bartowski's dataset plus RP and Tao text at 8K context, and uses selective quantization (llama-quantize --tensor-type flags) to bump select FFN/self-attention tensors up to between Q6 and Q8, as described here (see the sketch after this list).
  • SnowDrogito-RpRv3-32B_IQ4-XS-Q8InOut-Q56Attn.gguf - Q6 and Q5 attention tensors. This quant and all quants uploaded before it used the imatrix from Snowdrop.
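
For reference, the selective-quantization step can be expressed with llama.cpp's llama-quantize; a minimal sketch, assuming a recent build that supports --tensor-type overrides (file names and the exact tensor/type pairs are illustrative, not the precise recipe used for the upload):

./llama-quantize --imatrix recalibrated.imatrix \
  --token-embedding-type q8_0 --output-tensor-type q8_0 \
  --tensor-type "ffn_down=q6_k" --tensor-type "attn_v=q8_0" \
  SnowDrogito-RpR3-32B-F16.gguf SnowDrogito-RpR3-32B_IQ4-XS+Enhanced_Tensors.gguf IQ4_XS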

MORE SPEED!

Improve inference speed by offloading individual tensors instead of whole layers, as referenced HERE. --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU" keeps roughly a third of the ffn_up tensors (those in odd-numbered layers up to 39) on the CPU, saving enough GPU memory to offload all layers on 24 GB and taking me from 3.9 t/s to 10.6 t/s. Example:

python koboldcpp.py --gpulayers 65 --quantkv 1 --overridetensors "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU" --threads 10 --usecublas --contextsize 40960 --flashattention --model ~/Downloads/SnowDrogito-RpR3-32B_IQ4-XS+Enhanced_Tensors.gguf

...obviously editing threads, file paths, etc. for your setup...
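
If you run llama.cpp directly instead of KoboldCpp, recent builds expose the same tensor-offload mechanism as -ot/--override-tensor; a minimal sketch (flag spellings vary a little between builds, so check --help):

./llama-server -m ~/Downloads/SnowDrogito-RpR3-32B_IQ4-XS+Enhanced_Tensors.gguf \
  -ngl 99 -c 40960 -fa -ctk q8_0 -ctv q8_0 \
  -ot "\.[13579]\.ffn_up|\.[1-3][13579]\.ffn_up=CPU"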

Overview

SnowDrogito-RpR-32B_IQ4-XS is my shot at an optimized imatrix quantization of my QwQ RP reasoning merge. The goal is to add smarts to the popular Snowdrop roleplay model, with a little ArliAI RpR and Deepcogito for the smarts. Built using the TIES merge method, it combines strengths from multiple fine-tuned QwQ-32B models and is quantized to IQ4_XS with Q8_0 embeddings and output layers, to plus the quality up just a bit. Uploading because the PPL was lower and I have been getting more varied, longer, and more creative responses with this, but maybe it lacks contextual awareness compared to Snowdrop? Not sure.

Setup for Reasoning and ChatML

  • ChatML Formatting: Use ChatML with <|im_start|>role\ncontent<|im_end|>\n (e.g., <|im_start|>user\nHello!<|im_end|>\n).
  • Reasoning Settings: Set "include names" to "never." Start reply with <think>\n to enable reasoning.
  • Sampler Settings: From Snowdrop: Try temperature 0.9, min_p 0.05, top_a 0.3, TFS 0.75, repetition_penalty 1.03, DRY if available.
  • My Settings: Response (tokens): 2048, Context (tokens): 40960, Temperature: 3.25, Top P: 0.98, Min P: 0.04, Top nSigma: 2.5, Repetition Penalty: 1.03, XTC Threshold: 0.3, XTC Probability: 0.3, DRY Multiplier: 0.8, DRY Base: 1.75, DRY Allowed Length: 4, DRY Penalty Range: 1024
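
Put together, a single turn as sent to the model looks like this (the system message content here is just an illustration):

<|im_start|>system
You are {{char}}. Stay in character.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
<think>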

Getting great reasoning results with ST's Start Reply With:

<think>
Chain-of-thought: Alright, what just happened is

For more details, see the setup guides and ST master import for Snowdrop, plus the additional info on ArliAI RpR.

Performance

  • Perplexity under identical conditions (IQ4_XS, 40,960 context, Q8_0 KV cache, on a 150K-token chat dataset):
  SnowDrogito-RpR-32B: 4.5597 ± 0.02554
  QwQ-32B-Snowdrop-v0: 4.6779 ± 0.02671
  • Fits 40,960 context in 24 GB VRAM using Q8 KV cache with full GPU offload.
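
The numbers above can be reproduced with llama.cpp's llama-perplexity tool; a sketch, assuming a current build (the chat dataset file is a placeholder):

./llama-perplexity -m SnowDrogito-RpR3-32B_IQ4-XS.gguf \
  -f chat-dataset.txt -c 40960 -ctk q8_0 -ctv q8_0 -fa -ngl 99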

Model Details

  • Base Model: Qwen/Qwen2.5-32B
  • Architecture: Qwen 2.5 (32B parameters)
  • Context Length: 40,960 tokens
  • Quantization: IQ4_XS with Q8_0 embeddings and output layers for better quality.
  • Imatrix: used the .imatrix file from Snowdrop (the newest Enhanced_Tensors quant uses a recalibrated imatrix; see Updates above).

Merge Configuration

This model was created using mergekit with the following TIES merge configuration:

models:
  - model: trashpanda-org/QwQ-32B-Snowdrop-v0
    parameters:
      weight: 0.75
      density: 0.5
  - model: deepcogito/cogito-v1-preview-qwen-32B
    parameters:
      weight: 0.15
      density: 0.5
  - model: ArliAI/QwQ-32B-ArliAI-RpR-v1
    parameters:
      weight: 0.1
      density: 0.5
merge_method: ties
base_model: Qwen/Qwen2.5-32B
parameters:
  weight: 0.9
  density: 0.9
  normalize: true
  int8_mask: true
tokenizer_source: Qwen/Qwen2.5-32B-Instruct
dtype: bfloat16
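
To reproduce the merge, save the block above as, say, snowdrogito.yaml and run it through mergekit's CLI (paths are placeholders):

pip install mergekit
mergekit-yaml snowdrogito.yaml ./SnowDrogito-RpR-32B --cuda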

Quantization Details

  • Primary Quantization: IQ4_XS (4-bit integer with extra-small blocks) using an importance matrix (trashpanda-org_QwQ-32B-Snowdrop-v0.imatrix) for high quality at reduced size.
  • Embeddings & Output Layers: Quantized to Q8_0 (8-bit) to preserve precision in token embeddings and final output weights, differing from the standard IQ4_XS body. This boosts quality with a modest size increase.
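
For the recalibrated quant, the importance matrix itself comes from llama.cpp's llama-imatrix tool; a minimal sketch, assuming a recent build (the calibration file combining Bartowski's dataset with RP and Tao text is a placeholder name):

./llama-imatrix -m SnowDrogito-RpR3-32B-F16.gguf \
  -f calibration-bartowski-rp-tao.txt -c 8192 -o recalibrated.imatrix -ngl 99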

Acknowledgments
