Output + Embedding

| Output + Embedding bits | Suffix |
|---|---|
| 2-bit  | AXL |
| 3-bit  | BXL |
| 4-bit  | CXL |
| 5-bit  | DXL |
| 6-bit  | EXL |
| 8-bit  | FXL |
| 16-bit | GXL |
| 32-bit | HXL |

Master Table

| Variant | Size (GB) | BPW | PPL | PPL error |
|---|---|---|---|---|
| IQ3_M_FXL | 2.06 | 4.08 | 1.9788 | 0.01061 |
| IQ3_M_GXL | 2.42 | 4.80 | 1.9785 | 0.01061 |
| IQ3_M_HXL | 3.20 | 6.35 | 1.9784 | 0.01061 |
| IQ4_XS_FXL | 2.73 | 4.69 | 1.9284 | 0.01018 |
| IQ4_XS_GXL | 2.36 | 5.42 | 1.9282 | 0.01018 |
| IQ4_XS_HXL | 3.51 | 6.96 | 1.9282 | 0.01018 |
| IQ4_NL_GXL | 2.84 | 5.64 | 1.9307 | 0.01024 |
| IQ4_NL_HXL | 3.62 | 7.18 | 1.9305 | 0.01023 |
| Q4_K_M_GXL | 2.96 | 5.87 | 1.9477 | 0.01047 |
| Q4_K_M_HXL | 3.73 | 7.41 | 1.9475 | 0.01047 |
| Q5_K_M_FXL | 2.98 | 5.92 | 1.9260 | 0.01024 |
| Q5_K_M_GXL | 3.35 | 6.65 | 1.9259 | 0.01023 |
| Q5_K_M_HXL | 4.13 | 8.19 | 1.9257 | 0.01023 |
| Q6_K_FXL | 3.40 | 6.75 | 1.9211 | 0.01018 |
| Q6_K_GXL | 3.77 | 7.48 | 1.9207 | 0.01018 |
| Q6_K_HXL | 4.54 | 9.02 | 1.9206 | 0.01017 |
| Q8_0_GXL | 4.65 | 9.23 | 1.9245 | 0.01026 |
| Q8_0_HXL | 5.42 | 10.77 | 1.9241 | 0.01025 |
| BF16 | 8.05 | 16.00 | 1.9233 | 0.01024 |
| BF16_HXL | 8.83 | 17.55 | 1.9231 | 0.01024 |
| F32 | 16.10 | 32.00 | 1.9232 | 0.01024 |
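
If you want to sanity-check how Size (GB) relates to BPW in the table above, here is a minimal Python sketch. The 4.02B parameter count and the BPW values are taken from this card; the real on-disk sizes can differ slightly because GGUF files also store metadata and non-weight tensors.

```python
# Rough sanity check: size_GB ~= BPW * n_params / 8 / 1e9
# n_params (4.02B) is taken from this card; small deviations are expected
# because GGUF files also store metadata and non-weight tensors.
N_PARAMS = 4.02e9

variants = {
    "IQ3_M_FXL": 4.08,   # BPW values copied from the master table
    "IQ4_XS_FXL": 4.69,
    "Q6_K_FXL": 6.75,
    "Q8_0_GXL": 9.23,
    "BF16": 16.00,
}

for name, bpw in variants.items():
    est_gb = bpw * N_PARAMS / 8 / 1e9
    print(f"{name:12s} {bpw:5.2f} bpw  ~ {est_gb:.2f} GB")
```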

Variant chooser (prefer FXL first)

(these are my personal notes to help you choose)

| Variant (preferred) | Size (GB) | Quality vs BF16 | Inference speed | Long-context headroom | My notes to you |
|---|---|---|---|---|---|
| IQ3_M_FXL | 2.06 | Low | Very fast | Excellent | I reach 76.33 tok/sec at 32k and 61.28 tok/sec at 64k. Use it when you must fit within very tight memory limits. |
| IQ4_XS_FXL | 2.73 | Very good | Fast | Very good | I like this as a small yet stable 4-bit. If you need more raw speed, try IQ4_XS_GXL. |
| Q5_K_M_FXL | 2.98 | Very good | Medium fast | Very good | I use this when I want sturdier outputs than 4-bit with almost no size penalty. |
| Q6_K_FXL | 3.40 | Excellent | Medium fast | Very good | I lean on this for balanced quality, speed, and long contexts. |
| Q8_0_GXL | 4.65 | Excellent | Medium | OK | In my tests it kept high quality: 54.21 tok/sec at 16k and 52.04 at 32k. |
| BF16 | 8.05 | Reference | Slow | Tight | I use BF16 when I want very high quality without going full F32. |

Quick picks by GPU VRAM

(again, these are personal notes from my RTX 3060 12 GB with 48 GB RAM)

| GPU VRAM | Pick | Why I recommend it |
|---|---|---|
| 16 GB | Q6_K_GXL or Q8_0_GXL; consider BF16 for near-best quality | I get near-BF16 quality with room for long context or batching. BF16 fits but leaves less headroom. |
| 12 GB | Q6_K_GXL for balance, or Q8_0_GXL for quality focus | On my 3060 12 GB these give strong quality and good 32k performance. |
| 8 GB | IQ4_XS_GXL for speed, Q5_K_M_GXL for sturdier outputs | Both leave comfortable KV-cache space for longer contexts in my runs. |
| 6 GB | Q5_K_M_FXL or IQ4_XS_FXL | I find these the safest balance when memory is tight. |
| 4 GB | IQ3_M_FXL first, IQ4_XS_FXL if your context still fits | IQ3_M_FXL gives me the best chance of running under strict limits. |
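
If one of the picks above fits your card, a minimal llama-cpp-python sketch for downloading and loading it could look like the following. The exact GGUF filename inside the repo is an assumption (check the repo's file listing), and the context size and full GPU offload are only starting points, not tuned settings.

```python
# Minimal sketch, assuming llama-cpp-python and huggingface_hub are installed.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="marcelone/Qwen3-4B-Instruct-2507-gguf",
    filename="Qwen3-4B-Instruct-2507-Q6_K_FXL.gguf",  # hypothetical filename; check the repo
)

llm = Llama(
    model_path=model_path,
    n_ctx=32768,       # long-context setting from the notes above; lower it on small cards
    n_gpu_layers=-1,   # offload all layers; reduce this if VRAM is tight
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in Italian."}]
)
print(out["choices"][0]["message"]["content"])
```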

PROMPTS

Lesson

Create a C1-level dialogue for language learning. First, provide an introduction to the dialogue in English, then the dialogue in Italian, then an English translation of the dialogue. Next, provide a vocabulary list of all the Italian words used in the text, including their gender, word class, and English meaning; point out which words are more frequent, as they should be memorized for mastering that level. The grammar section should explain the grammar used in the text and present grammar patterns that should be memorized because they are essential for mastering that level; focus on explaining for students. Finally, create a translation exercise section using sentences from the text for English-to-Italian translation.

Conversation Practice

I'd like to role-play. You'll act as an Argentine tourist asking me questions in simple Spanish about my city. Please keep the language at B1 level throughout.

You are "Ana", a Brazilian tourist. Always speak at B1 level, with sentences of up to 18 words, and simple punctuation. Do not invent facts about the user. If you are not sure, say "I do not know" and ask. Do not describe the weather, people’s clothing, or environmental details unless the user mentions them. Avoid ending the conversation with farewells unless the user ends it. Do not use em dashes; prefer commas, parentheses, or colons.
