Qwen2.5 / Mistral MOE - 1X7B // 1X32B // 2X7B // 4X7B // 6X7B // 8X7B // 2X24B -- Power-CODER --- 13.3B // 60B // 19B // 30B // 42B // 53B // 44B - GGUF
This repo contains 34 different MOE (mixture of experts) models (and a few non-MOE models too), quanted at different parameter counts and with different numbers of experts, for coding and programming in multiple computer languages. All major (and many minor) coding languages are covered.
Parameter range: 13.3B to 60B, with specific select quants of each (until final testing and optimization are complete).
These are mostly "Qwen2.5" MOE models, with a few Mistral MOE Coders (2X22B // 2x24B).
Although these models can be used for other use cases, the primary use case is "coding."
Qwen2.5 MOEs require a "shared expert", hence the larger/odd parameter count sizes for some models.
All models, except 1x32B (Qwen 2.5 arch), are based on 7B Qwen 2.5 coder models/fine tunes.
Three models are MISTRAL MOEs: one Devstral with Instruct and one Devstral with Magistral (reasoning on/off), both at 2x24B, AND one 2x22B Codestral MOE. All are denoted by "Mistral" in the name - see the "Set #Mistral 1" info.
All models are carefully selected and tested "coder" instruct models (base, experts and shared expert).
Tool use is supported.
Models are fully stable, "Class 1" MOEs.
The 1x32B (with a shared expert) is in a class by itself, but still a Qwen2.5 type model.
In addition, some models in 6X7B and 8X7B versions ARE specialists.
All versions, except "Set 1" / "Set 1.5" / "Set 1.75", use the same base model for the MOE construction - at the moment. Different bases are being assessed for improving performance and/or creating specific-use-case models.
Please note the following:
- These are non-reasoning models at the moment (except the V2 Mistral: Devstral-Magistral MOE).
- More experts activated / more models in the MOE generate stronger and stronger code.
- As many quants here are below "full power" / not yet optimized, each model's performance will not be at peak... except "Set 1.25".
- Full repos, along with full quant sets, will be generated after the assessment/tuning/tweaking period.
Just added [in reverse order, in addition to the "sets" below]:
Qwen2.5-7B-Nemo-OpenCoder-bs20-2-Q6_K.gguf [12B] [not a moe model]
- 7B Opencoder Nemo with Brainstorm 20x added [automatic reasoning].
Qwen2.5-7B-MS-Next-Coder-bs20-2-Q6_K.gguf [12B] [not a moe model]
- 7B Microsoft Next-Coder with Brainstorm 20x added.
Qwen2.5-14B-MS-Next-Coder-bs20-2-Q5_K_S.gguf [20B] [not a moe model]
- 14B Microsoft Next-Coder with Brainstorm 20x added.
Qwen2.5-32B-MS-Next-Coder-bs20-2-Q4_K_S.gguf [40B] [not a moe model]
- 32B Microsoft Next-Coder with Brainstorm 20x added.
Qwen3-OpenCoder-Multi-exp5-11-Q6_K.gguf [8.5B]
- Brainstorm 5x Matrix Mind Series with Nemo Opencoder 7B [main] + MSCoder 7b + Olympic 7b + Qwen Coder Instruct 7B
- Reasoning instruct coder generation.
Qwen3-OpenCoder-Multi-exp5-11-v2-Q6_K.gguf [8.5B] [not a moe model]
- Brainstorm 5x Matrix Mind Series with Nemo Opencoder 7B [main] + MSCoder 7b [2x] + Qwen Coder Instruct 7B
- Reasoning instruct coder generation.
Qwen3-OpenCoder-Multi-14B-exp5-11-Q6_K.gguf [16B] [not a moe model]
- Brainstorm 5x Matrix Mind Series with Nemo Opencoder 14B [main] + MSCoder 7b [2x] + Qwen Coder Instruct 14B
- Reasoning instruct coder generation.
Qwen3-OpenCoder-Multi-32B-exp5-11-Q4_K_S.gguf [35B] [not a moe model]
- Brainstorm 5x Matrix Mind Series with Nemo Opencoder 32B [main] + MSCoder 32b + Olympic 32b + Qwen Coder Instruct 32B
- Reasoning instruct coder generation.
Qwen3-OpenCoder-Multi-32B-exp5-11-V2-Q4_K_S.gguf [35B] [not a moe model]
- Brainstorm 5x Matrix Mind Series with Nemo Opencoder 32B [main] + MSCoder 32b [2x] + Qwen Coder Instruct 32B
- Reasoning instruct coder generation.
Qwen2.5-3X7B-QC-MS-OPY-Coder-V1-Q4_K_S.gguf
- 3x7B MOE, with MS Next Coder, Qwen 2.5 Coder Instruct, and Olympic Coder.
- Source is here: https://huggingface.co/DavidAU/Qwen2.5-3X7B-CoderInstruct-OlympicCoder-MS-Next-Coder-25B-v1
Qwen2.5-2X7B-NemoTron-TwoNemos-Coder-V1-Q4_K_S.gguf
- 2x7B MOE: the two OpenCoder Nemo versions (1 and 1.1) in a MOE config.
Mistral-2X24B-Power-Coder-Devs1.1-Q4_K_S.gguf
- 2x24B MOE: Magistral with Devstral 1.1; a coder / all-purpose model with reasoning on/off.
- See "Set #Mistral 1" below for the system prompt to turn reasoning on/off.
Qwen2.5-2X7B-NemoTron-Coder-V2.9-0-Q6_K.gguf
- 2x7B MOE Opencoder Nemo Version 1.1 + additional coder model.
SET #1 - 5 versions - 19B - 2x7B MOE w shared expert:
Four versions at Q6, using various top Qwen 2.5 Coder Models in MOE config 2X7B + a shared expert => 19B.
Denoted by V1 to V4, in Q6 without "shared" in the filename.
(The shared expert, as well as the base model, differs for each of these 4 versions, which means each model is roughly 2.5x7B.)
SET #1.25 - 5 Versions - 19B - 2x7b w shared expert:
These are denoted by "Qwen2.5-2X7B-Power-Coder-V2-50-Q6_K" with shared expert at 50% power.
This test group is also assessing quant optimization and tensor optimizations.
There are also these variants of V2-50-Q6_K:
- regular, from the bf16 source - default Q6 settings, quantized directly from bf16.
- one made from a float32 source ("f-32").
- one made from a float32 source + output tensor at bf16 ("opti").
- one made from a float32 source + output tensor at bf16 + all shared expert tensors (3, all layers) at bf16 ("opti2").
- one made from a float32 source + output tensor at f16 + all shared expert tensors (3, all layers) at f16 ("OPTI2-f-16").
The last TWO are the most potent, and also the largest in size relative to the other models in this set.
"f-16" (float 16 bit) was used for faster performance on CPU/MAC/METAL and/or using the model with partial offload from GPU. Float 16 is more cpu/metal/offload supported, and faster.
(shared expert is different per model, as well as base model for these 4 versions which means the model is roughly "2.5X7B" in terms of performance - but this varies.)
SET #1.5 - 3 versions - 13.3B - 1x7B MOE w shared expert:
Two versions at Q6, using ONE Qwen 2.5 Coder Model in MOE config 1X7B + a shared expert (a different model) at 100% => 13.3B.
Denoted by "shared100" (and 1x7B) in the filename.
This unusual config pairs a master expert with 1 shared expert at 100%, creating a very strong coder with MOST OF the power of a 2X7B MOE in a package roughly 30% smaller (13.3B vs 19B).
A third model is a copy of model #2 with the shared expert set at 50%, to assess whether this is better / reduces possible "overcooking".
The config in these models means the model is roughly 1.5x7B in terms of performance - but this varies.
Denoted by "shared50" in the filename.
SET #1.75 - 2 versions - 60B - 1x32B MOE w shared expert:
One version at Q3KS, using ONE Qwen 2.5 32B Coder Model in MOE config 1X32B + a shared expert (a different model - 32B ) at 100% => 60B.
One version at Q3KS, using ONE Qwen 2.5 32B Coder Model in MOE config 1X32B + a shared expert (a different model - 32B ) at 50% => 60B.
Both versions contain the TWO strongest 32B Qwen 2.5 coder models.
Denoted by "shared100" (and 1x32B) in the filename. (v1)
Denoted by "shared50" (and 1x32B) in the filename. (v2)
This model is off the charts in terms of coding strength, detail, and other critical aspects of coding.
This unusual config pairs a master expert with 1 shared expert at 100% / 50%, creating a very strong coder with MOST OF the power of a 2X32B MOE in a package roughly 30% smaller.
The config in these models means the model is roughly 1.5x32B in terms of performance - but this varies.
NOTE:
- A 2x32B (plus a shared expert) would be roughly 75-80B parameters.
- This is a very low quant size (one step above Q2_K, the lowest); higher quants will produce better results.
- V1 - Tests are ongoing assessing "shared expert" settings VS performance; 100% may not be the best setting.
- V2 - 50% shared expert ; testing pending.
SET #2: 1 version - 30B - 4x7B MOE w shared expert
One version is 4X7B (in filename) + 1 shared expert rendered in quant q4_K_S:
30B [roughly; the shared expert adds more to it, however compression makes this model smaller than it otherwise would be, because the shared expert is common to the base.]
SET #3: 1 version - 42B - 6x7B MOE w shared expert
One version is 6X7B (in filename) + 1 shared expert rendered in quant q4_K_S:
42B [roughly; the shared expert adds more to it, however compression makes this model smaller than it otherwise would be, because the shared expert is common to the base.]
SET #4: 1 version - 53B - 8x7B MOE w shared expert
One version is 8X7B (in filename) + 1 shared expert rendered in quant q4_K_S:
53B [roughly; the shared expert adds more to it, however compression makes this model smaller than it otherwise would be, because the shared expert is common to the base.]
Set #Mistral 1: 3 versions - 44B, 2x24B MOES
2 versions with "mistral" in the name, using Devstral as a base with one version with Instruct AND one version with Magistral (reasoning on/off).
For the latter, you need a system prompt as follows for reasoning (see below) ; note model will work fine without reasoning too.
Max context is 128k/131072 ; for reasoning (v2) strongly suggest min 8k context window, if reasoning is on.
REASONING SYSTEM PROMPT for V2 (optional):
A user will ask you to solve a task. You should first draft your thinking process (inner monologue) until you have derived the final answer. Afterwards, write a self-contained summary of your thoughts (i.e. your summary should be succinct but contain all the critical steps you needed to reach the conclusion). You should use Markdown and Latex to format your response. Write both your thoughts and summary in the same language as the task posed by the user.
Your thinking process must follow the template below:
<think>
Your thoughts or/and draft, like working through an exercise on scratch paper. Be as casual and as long as you want until you are confident to generate a correct answer.
</think>
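If you are driving the model from code rather than a chat UI, the reasoning prompt is just a normal system message. Below is a minimal sketch, assuming llama-cpp-python as the runtime and one of the 2x24B GGUF files from this repo (adjust the path/filename to whatever you actually downloaded; any GGUF-capable app that lets you set a system prompt works the same way):

```python
from llama_cpp import Llama

# Paste the full reasoning system prompt above (including the <think> template)
# into this string; reasoning stays off if you omit the system message.
REASONING_SYSTEM_PROMPT = """<paste the reasoning system prompt from above here>"""

llm = Llama(
    model_path="Mistral-2X24B-Power-Coder-Devs1.1-Q4_K_S.gguf",
    n_ctx=8192,        # minimum 8k context suggested when reasoning is on
    n_gpu_layers=-1,   # full GPU offload if it fits in VRAM; lower this value otherwise
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": REASONING_SYSTEM_PROMPT},
        {"role": "user", "content": "Write a Python function that merges overlapping intervals."},
    ],
    temperature=0.6,
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```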
Additional "Codestral" model 2X22B at Q4_K_S, this one requires CHATML template and has 32k context.
Use the "qwen" settings below for parameters.
The first quants [2x24B] are Q2_K, so performance will be low relative to Q4, Q6 or Q8.
Additional [2x24B] V1 and V2 quants at Q4_K_S offer stronger performance.
Quants are unoptimized at the moment; optimizing will drastically improve performance.
GENERAL:
All versions default to 2 activated experts, except the "1x" models.
The number of active experts can be adjusted in LM Studio and other AI apps.
Experts can be set from 1 to 8 (depending on the model); better generations occur with more experts activated, due to the MOE config as well as the fact that these are all coder models.
Suggest 2-4 generations per prompt, especially if using 1 expert (all models).
The shared expert is activated at 10% for all these models EXCEPT "Set 1" / "Set 1.25" / "Set 1.5" / "Set 1.75".
Each version is different, not an "improvement" per se.
Models will accept simple prompts as well as very detailed instructions; however, for larger projects I suggest waiting for Q6/Q8 / optimized quants.
Evaluations are in progress.
Quants/models and structure(s) subject to change.
Full repo(s) per model(s) will be generated after eval period.
Feedback is welcome via "Community Tab".
Suggested Settings [Qwen 2.5 7B Coder default settings]:
- Temp .5 to .7 (or lower)
- Max Context is 32k (final versions will be 32k and 128k).
- top_k: 20, top_p: .8, min_p: .05
- rep pen: 1.1 (can be lower)
- Jinja Template (embedded) or CHATML template.
- A system prompt is not required (tests were run with a blank system prompt).
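For reference, here is the same set of settings mapped onto llama-cpp-python - a sketch only, assuming a recent llama-cpp-python build; the filename is one of the Q6 quants from this repo, so adjust it to the file you actually downloaded:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-2X7B-Power-Coder-V2-50-Q6_K.gguf",
    n_ctx=32768,      # current max context is 32k
    n_gpu_layers=-1,  # full offload if it fits in VRAM; reduce otherwise
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a thread-safe LRU cache in Python."}],
    temperature=0.6,      # suggested .5 to .7 (or lower)
    top_k=20,
    top_p=0.8,
    min_p=0.05,
    repeat_penalty=1.1,   # can be lower
    max_tokens=4096,
)
print(out["choices"][0]["message"]["content"])
```

The embedded Jinja chat template in the GGUF is applied automatically by create_chat_completion in recent llama-cpp-python versions, matching the "Jinja Template (embedded) or CHATML template" note above.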
Help, Adjustments, Samplers, Parameters and More
CHANGE THE NUMBER OF ACTIVE EXPERTS:
See this document:
https://huggingface.co/DavidAU/How-To-Set-and-Manage-MOE-Mix-of-Experts-Model-Activation-of-Experts
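Outside of LM Studio's expert setting, one way to change the active expert count is a GGUF metadata override at load time. This is a hedged sketch assuming llama-cpp-python's kv_overrides and assuming the Qwen2.5-MOE metadata key is "qwen2moe.expert_used_count" (the key name differs per architecture; check your file's GGUF header or the document linked above):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/one-of-the-4X7B-or-8X7B-quants.gguf",  # placeholder path - use your downloaded file
    n_ctx=32768,
    # Assumed key name for Qwen2.5 MOE GGUFs; verify against your file's metadata.
    kv_overrides={"qwen2moe.expert_used_count": 4},  # default is 2; more experts = stronger code, slower generation
)
```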
Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:
In "KoboldCpp" or "oobabooga/text-generation-webui" or "Silly Tavern" ;
Set the "Smoothing_factor" to 1.5
: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"
: in text-generation-webui -> parameters -> lower right.
: In Silly Tavern this is called: "Smoothing"
NOTE: For "text-generation-webui"
-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)
Source versions (and config files) of my models are here:
OTHER OPTIONS:
Increase rep pen to 1.1 to 1.15 (you don't need to do this if you use "smoothing_factor")
If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing") just make the adjustment as noted.
Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers
This a "Class 1" model:
For all settings used for this model (including specifics for its "class"), example generations, and the advanced settings guide (which often addresses model issues and covers methods to improve performance for all use cases, including chat and roleplay), please see:
You can see all parameters used for generation, in addition to advanced parameters and samplers to get the most out of this model, here: