Specialized quants (including "MXFP4_MOE.gguf") for OpenAI's new 20B MOE (Mixture of Experts) model, running at 80+ t/s.
Openai_gpt-oss-20b-CODER-NEO-CODE-DI-MATRIX-GGUF

These are DUAL NEO Imatrix / CODE Imatrix GGUFs, using the NEO dataset by DavidAU together with a "Code" dataset also compiled by DavidAU.
NEO dataset improves overall performance.
CODER dataset is specifically for coding performance.
Example generation provided at bottom of this page.
DUAL ("DI") -> Separate Imatrix datasets ("NEO" and "CODER", generated separately per model) are joined to create a new Imatrix dataset, which is then applied to the quants.
The model also passed a "hard" coding test (4 experts) with no issues (IQ4_NL).
TEST: Forcing the model to create code with no dependencies and limits on coding shortcuts, with multiple loops, running in real time with no blocking, in a language that does not support that natively.
(Example prompt and generation below.)
Due to quanting issues with this model (which result in oddball quant sizes / mixtures), only TESTED quants will be uploaded (at the moment).
Currently that means IQ4_NL and MXFP4_MOE (a new OpenAI-specific quant type) are available.
The NEO/Coder datasets' performance improvements will show most strongly in the IQ4_NL quants.
There are TWO "IQ4_NL" quants:
- OpenAI-20B-NEO-CODE-DIMAT-IQ4_NL.gguf : DI Imatrix applied.
- OpenAI-20B-NEO-CODE-DIMAT-2-IQ4_NL.gguf : DI Imatrix applied, including the output tensor (also imatrixed), with the embed tensor at IQ4_NL.
There are THREE NEO MXFP4_MOE quants:
- OpenAI-20B-NEO-CODE-DIMAT-MXFP4_MOE2.gguf : Output tensor Q5_1 (DI Imatrix applied)
- OpenAI-20B-NEO-CODE-DIMAT-MXFP4_MOE3.gguf : Output tensor IQ4_NL (DI Imatrix applied)
- OpenAI-20B-NEO-CODE-DIMAT-MXFP4_MOE4.gguf : Output tensor IQ4_NL (DI Imatrix applied) AND Embed at IQ4_NL - this makes this quant the smallest version.
MXFP4_MOE quants vastly outperform (at the moment) all other quants except IQ4_NL, Q5_1 and Q8_0, because OpenAI's 20B model has unusual "tensor" dimensions that cause issues when compressing it to other quant types.
( IQ4_NL, Q5_1 and Q8_0 quants are compatible with OpenAI's tensor structure as of this writing. )
The MXFP4_MOE quant type was specifically designed for (what appears to be) lossless quantization relative to the source model. The jury is still out on this.
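For background, MXFP4 is the OCP "microscaling" FP4 format that the MXFP4_MOE quants are built around: blocks of 32 four-bit floats (E2M1) share one power-of-two scale (E8M0). Below is a minimal JavaScript decoding sketch; the nibble order and on-disk layout are assumptions for illustration, not the actual GGUF file format:

```js
// Illustrative sketch of the OCP MXFP4 "microscaling" format: 32 FP4 (E2M1)
// values share one power-of-two scale stored as an E8M0 byte. The nibble
// packing below is an assumption, not the real GGUF layout.

// The 16 representable E2M1 values (1 sign, 2 exponent, 1 mantissa bit).
const FP4_VALUES = [
   0,  0.5,  1,  1.5,  2,  3,  4,  6,   // codes 0..7  (sign bit clear)
  -0, -0.5, -1, -1.5, -2, -3, -4, -6    // codes 8..15 (sign bit set)
];

// Decode one block: scaleByte is E8M0 (a biased power-of-two exponent),
// packedNibbles is a Uint8Array(16) holding 32 four-bit codes.
function decodeMxfp4Block(scaleByte, packedNibbles) {
  const scale = Math.pow(2, scaleByte - 127);   // E8M0 bias of 127
  const out = new Float32Array(32);
  for (let i = 0; i < 16; i++) {
    out[2 * i]     = FP4_VALUES[packedNibbles[i] & 0x0f] * scale; // low nibble
    out[2 * i + 1] = FP4_VALUES[packedNibbles[i] >> 4] * scale;   // high nibble
  }
  return out;
}

// scale 2^0 = 1, every byte 0x21 -> pairs of (0.5, 1.0)
console.log(decodeMxfp4Block(127, new Uint8Array(16).fill(0x21)).slice(0, 2));
```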
Suggested Settings (tested in LM Studio, Beta Branch 0.3.21; build 4):
- Context: 8k min.
- Temp 1.2+ for creative. Temp .6 (or so) for coding/general.
- Rep pen 1.1, topk 40, topp .95, min p 0.05
- Experts 4-8 depending on use case. (higher than 8 MAY lower quality)
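To make these concrete, here is a minimal sketch that passes the suggested coding/general values to a llama.cpp-style server; the URL/port and prompt are placeholders, while the sampler field names (`temperature`, `top_k`, `top_p`, `min_p`, `repeat_penalty`) follow llama.cpp's `/completion` API:

```js
// Hedged sketch: suggested "coding/general" settings sent to a
// llama.cpp-style /completion endpoint. URL/port and the prompt are
// placeholders; adapt to whatever runtime serves the GGUF.
(async () => {
  const res = await fetch('http://localhost:8080/completion', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      prompt: 'Write a quicksort in JavaScript.',
      temperature: 0.6,    // use 1.2+ instead for creative work
      top_k: 40,
      top_p: 0.95,
      min_p: 0.05,
      repeat_penalty: 1.1,
      n_predict: 1024      // cap on generated tokens
    })
  });
  const data = await res.json();
  console.log(data.content);
})();
```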
Model Supports:
- 128k context
- up to 24 experts
- Tool use, browsing, etc.
For my help docs, including SETTING NUMBER OF EXPERTS and more, see below.
See more about this model here:
https://huggingface.co/openai/gpt-oss-20b
Please refer to their model card, especially to control "thinking" levels.
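As a quick illustration of the "thinking" control: gpt-oss reads its reasoning effort (low / medium / high) from the system prompt. A minimal sketch against an OpenAI-compatible chat endpoint follows; the URL and model name are placeholders, and the exact `Reasoning:` syntax should be checked against the model card for your runtime:

```js
// Hedged sketch: selecting gpt-oss "thinking" depth via the system prompt.
// Endpoint URL and model name are placeholders.
(async () => {
  const res = await fetch('http://localhost:8080/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'gpt-oss-20b',
      messages: [
        // The reasoning level line is read from the system message.
        { role: 'system', content: 'Reasoning: high' },
        { role: 'user', content: 'Explain mutex vs. semaphore.' }
      ]
    })
  });
  const data = await res.json();
  console.log(data.choices[0].message.content);
})();
```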
Help, Adjustments, Samplers, Parameters and More
CHANGE THE NUMBER OF ACTIVE EXPERTS:
See this document:
https://huggingface.co/DavidAU/How-To-Set-and-Manage-MOE-Mix-of-Experts-Model-Activation-of-Experts
Settings: CHAT / ROLEPLAY and/or SMOOTHER operation of this model:
In "KoboldCpp" or "oobabooga/text-generation-webui" or "Silly Tavern" ;
Set the "Smoothing_factor" to 1.5
: in KoboldCpp -> Settings->Samplers->Advanced-> "Smooth_F"
: in text-generation-webui -> parameters -> lower right.
: In Silly Tavern this is called: "Smoothing"
NOTE: For "text-generation-webui"
-> if using GGUFs you need to use "llama_HF" (which involves downloading some config files from the SOURCE version of this model)
Source versions (and config files) of my models are here:
OTHER OPTIONS:
Increase rep pen to 1.1-1.15 (you don't need to do this if you use "smoothing_factor").
If the interface/program you are using to run AI MODELS supports "Quadratic Sampling" ("smoothing") just make the adjustment as noted.
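For intuition, below is a sketch of the quadratic ("smoothing") transform as I understand it; treat the exact formula as an assumption, since each front end may implement it slightly differently:

```js
// Assumed form of "quadratic sampling" (smoothing): every logit is remapped
// by a parabola centered on the maximum logit. Logits near the top are drawn
// closer together, while distant (unlikely) logits are pushed further away.
// The exact formula in KoboldCpp / text-generation-webui may differ.
function smoothLogits(logits, smoothingFactor = 1.5) {
  const max = Math.max(...logits);
  return logits.map(l => max - smoothingFactor * (max - l) ** 2);
}

console.log(smoothLogits([2.0, 1.5, 0.0], 1.5)); // [2, 1.625, -4]
```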
Highest Quality Settings / Optimal Operation Guide / Parameters and Samplers
This is a "Class 1" model:
For all settings used for this model (including specifics for its "class"), example generations, and the advanced settings guide (which often addresses model issues and covers methods to improve performance for all use cases, including chat and roleplay), please see:
You can see all parameters used for generation, plus advanced parameters and samplers to get the most out of this model, here:
Example Generation
Using default settings (coder), 4 experts and no system prompt.
Prompt:
Write a JavaScript for fetch api using 4 ais, all using streaming that does NOT use "promise.all", and processes all streams separately but in real time - that means all streams need to be processed at the same time and not interrupt each other.
Also include sending a payload to each of the 4 ais with content, system prompt, temp, topk, topp, minp, repeat penalty ; all of these parameters will be in an array for each of the four ais.
Record the entire response - chunks - from each AI in an array - text.
Also, send the chunks in real time to 4 separate text boxes on an HTML page so real time readout from each AI can be seen by the user.
IMPORTANT: All streams must be processed at the same time, not in sequence. IMPORTANT: Do not use any dependencies, nor import any module(s).
OUTPUT:
(note: Pretty print, color and some formatting lost on copy/paste)
Thought for 11.81 seconds
We need to write JavaScript code that uses fetch API with streaming responses (likely using ReadableStream from Response.body.getReader()). We have 4 AI endpoints; each will be called with a payload containing content, system prompt, temp, topk, topp, minp, repeat penalty. These parameters are in an array for each of the four AIs.
We need to process all streams concurrently, not sequentially, and not using Promise.all. So we can start fetch calls, get readers, then read from each reader in separate async functions that run concurrently (like starting them but not awaiting). We need to send chunks in real time to 4 separate text boxes on an HTML page.
We cannot use dependencies or import modules; just plain JS.
We need to record entire response chunks per AI into arrays. So we maintain array of strings for each AI, push each chunk as it arrives.
Also we need to send the payload to each AI with content, system prompt, temp, topk, topp, minp, repeat penalty. These parameters will be in an array for each of the four AIs. So maybe we have a config array like:
const aiConfigs = [ {url: '', params: {...}}, ... ];
But they said "parameters will be in an array for each of the four ais." So we can define arrays.
We need to send payload as JSON body. Use fetch with method POST, headers: Content-Type: application/json, body: JSON.stringify(payload).
Then handle streaming: Response.body is a ReadableStream; getReader(); read loop reading chunks (Uint8Array). Convert to string using TextDecoder. Append to array and update UI.
We need to run all four concurrently but not use Promise.all. We can just start them sequentially but not await each before starting next. For example:
function startAI(index) { ... } Then call for i=0..3: startAI(i); That will kick off fetch calls; inside we handle reading and UI updates.
We need to ensure they run concurrently, so no blocking.
Also record entire response chunks in an array per AI.
We also need to send the chunks in real time to 4 separate text boxes on an HTML page. So we need to have <textarea id="ai1"> etc. Then update innerText or value as new chunk arrives.
We must not use dependencies, so no external libs.
Also we need to handle errors.
Let's produce code snippet with explanation.
We'll create a function fetchAndStream(aiIndex) that returns nothing but handles reading and UI updates.
We also need to define arrays for each AI's parameters. For example:
const aiParams = [ {url: '', content: 'Hello', systemPrompt: 'You are ...', temp: 0.7, topk: 50, topp: 0.95, minp: 0.01, repeatPenalty: 2}, ... ];
But they said "parameters will be in an array for each of the four ais." So we can define arrays like:
const aiConfigs = [ {url: '', payload: {...}}, ... ];
We need to send content, system prompt, temp, topk, topp, minp, repeat penalty. So we can create payload object accordingly.
Also we need to record entire response chunks in an array per AI. We'll have const responses = [[], [], [], []]; Then push each chunk string into responses[aiIndex].
We also need to update UI: document.getElementById('output'+(i+1)).value += chunk; or innerText.
Let's produce final code with comments and explanation.
Also we need to ensure no Promise.all. So just call startAI for each index.
Also we need to handle fetch errors gracefully.
Also we need to use async/await inside each function but not block others.
We can do:
async function streamFromAi(index) { ... }
Then call streamFromAi(0); streamFromAi(1); etc. Since they are async functions, calling them without await will start them concurrently.
Alternatively we can wrap in IIFE: (async () => {...})(); But simpler: just call them.
We need to ensure that the fetch calls happen concurrently; but if we use await inside each function, it won't block others because each function runs independently. So it's fine.
Also we need to record entire response chunks from each AI in an array - we can push chunk string into responses[index].
Now produce final code snippet with HTML skeleton and JS script.
We also need to mention that the fetch API must support streaming; some browsers may not support reading body as stream? But modern ones do.
We'll provide fallback for older browsers maybe but not necessary.
Also we need to ensure no dependencies, so just plain JS.
Let's craft final answer: Provide explanation then code snippet.
OUTPUT:
Below is a stand-alone example that fulfils every requirement you listed:

- 4 independent AI calls.
- All four streams are read concurrently (no `Promise.all` or any other blocking construct).
- Each call receives its own payload (`content`, `systemPrompt`, `temp`, `topk`, `topp`, `minp`, `repeatPenalty`); the values are stored in an array of objects.
- Every chunk that arrives is pushed into a per-AI array and written immediately to a dedicated `<textarea>` so the user can see the output live.
- No external libraries or imports; pure vanilla JavaScript.
1️⃣ HTML skeleton

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Concurrent AI Streaming Demo</title>
<style>
textarea { width:100%; height:150px; margin-bottom:10px;}
</style>
</head>
<body>
<!-- Four output boxes -->
<textarea id="output0" placeholder="AI #1 …"></textarea>
<textarea id="output1" placeholder="AI #2 …"></textarea>
<textarea id="output2" placeholder="AI #3 …"></textarea>
<textarea id="output3" placeholder="AI #4 …"></textarea>
<!-- Start button -->
<button id="startBtn">Start All 4 AIs</button>
<script src="ai-stream.js"></script> <!-- the script below -->
</body>
</html>
```
2️⃣ JavaScript (`ai-stream.js`)

```js
/* -------------------------------------------------------------
 * 1. Configuration - one object per AI
* ------------------------------------------------------------- */
const aiConfigs = [
{
// URL of the first AI endpoint
url: 'https://api.example.com/v1/chat',
payload: {
content: "Hello, what can you do?",
systemPrompt: "You are a helpful assistant.",
temp: 0.7,
topk: 50,
topp: 0.95,
minp: 0.01,
repeatPenalty: 2
}
},
{
url: 'https://api.example.com/v1/chat',
payload: {
content: "Tell me a joke.",
systemPrompt: "You are a funny chatbot.",
temp: 0.8,
topk: 40,
topp: 0.9,
minp: 0.02,
repeatPenalty: 1
}
},
{
url: 'https://api.example.com/v1/chat',
payload: {
content: "Explain relativity.",
systemPrompt: "You are a physics teacher.",
temp: 0.6,
topk: 60,
topp: 0.97,
minp: 0.01,
repeatPenalty: 3
}
},
{
url: 'https://api.example.com/v1/chat',
payload: {
content: "Give me a recipe for pancakes.",
systemPrompt: "You are a cooking expert.",
temp: 0.5,
topk: 30,
topp: 0.85,
minp: 0.02,
repeatPenalty: 2
}
}
];
/* -------------------------------------------------------------
* 2. Storage for the full responses (array of strings)
* ------------------------------------------------------------- */
const allResponses = [[], [], [], []]; // one array per AI
/* -------------------------------------------------------------
 * 3. Helper - read a stream and push chunks to UI & storage
* ------------------------------------------------------------- */
async function streamFromAi(aiIndex) {
const cfg = aiConfigs[aiIndex];
const outputBox = document.getElementById(`output${aiIndex}`);
const decoder = new TextDecoder('utf-8');
try {
// 3.1 Send the request
const response = await fetch(cfg.url, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(cfg.payload)
});
if (!response.ok) {
throw new Error(`HTTP ${response.status} - ${response.statusText}`);
}
// 3.2 Get the reader
const reader = response.body.getReader();
// 3.3 Read loop - runs until stream ends
while (true) {
const { value, done } = await reader.read();
if (done) break; // no more data
// Convert Uint8Array to string
const chunkText = decoder.decode(value, { stream: true });
// Store the chunk in the array
allResponses[aiIndex].push(chunkText);
// Push it live into the textarea
outputBox.value += chunkText;
}
// Finalise - flush any remaining buffered text
outputBox.value += decoder.decode(); // empty decode to flush
} catch (err) {
console.error(`AI #${aiIndex + 1} error:`, err);
outputBox.value += `\n[Error] ${err.message}\n`;
}
}
/* -------------------------------------------------------------
 * 4. Kick-off - start all four streams concurrently
* ------------------------------------------------------------- */
function startAllAIs() {
// Reset UI and storage
for (let i = 0; i < 4; ++i) {
document.getElementById(`output${i}`).value = '';
allResponses[i] = [];
}
// Start each stream - the async function is called but not awaited,
// so they run in parallel.
for (let i = 0; i < 4; ++i) {
streamFromAi(i); // <-- no await
}
}
/* -------------------------------------------------------------
* 5. Wire up the button
* ------------------------------------------------------------- */
document.getElementById('startBtn').addEventListener('click', startAllAIs);
```
How it works

- Configuration: `aiConfigs` holds one object per AI, each with its own payload (`content`, `systemPrompt`, etc.).
- Storage: `allResponses` is an array of four arrays; every chunk that arrives for a given AI is pushed into the corresponding sub-array.
- Streaming: inside `streamFromAi()` we:
  - send a POST request with the JSON payload;
  - grab the `ReadableStream` from `response.body`;
  - use its `getReader()` and read in a loop (`reader.read()`);
  - decode each chunk (a `Uint8Array`) to text, append it to the per-AI array, and write it immediately into the dedicated `<textarea>`;
  - end the loop when `done` becomes true.
- Concurrent start: in `startAllAIs()` we simply call `streamFromAi(i)` for each index without awaiting them. Because each function is async, calling it starts its own fetch and read loop immediately; all four loops run in parallel, so the streams are processed simultaneously.
- No external dependencies: everything uses only standard browser APIs (`fetch`, `ReadableStream`, `TextDecoder`, DOM methods).
What you'll see

When you click Start All 4 AIs, each `<textarea>` will start filling up in real time as the corresponding AI streams its response. The full text of every AI is also kept in memory (the `allResponses` arrays) so you can later inspect or export it if needed.
Feel free to adapt the URLs, payloads, and UI layout to match your own environment. Happy streaming!