OlympicCoder-32B GGUF Models
Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)
Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.
Benchmark Context
All tests conducted on Llama-3-8B-Instruct using:
- Standard perplexity evaluation pipeline
- 2048-token context window
- Same prompt set across all quantizations
Method
- Dynamic Precision Allocation:
- First/Last 25% of layers β IQ4_XS (selected layers)
- Middle 50% β IQ2_XXS/IQ3_S (increase efficiency)
- Critical Component Protection:
- Embeddings/output layers use Q5_K
- Reduces error propagation by 38% vs standard 1-2bit
Quantization Performance Comparison (Llama-3-8B)
Quantization | Standard PPL | DynamicGate PPL | Ξ PPL | Std Size | DG Size | Ξ Size | Std Speed | DG Speed |
---|---|---|---|---|---|---|---|---|
IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s |
IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s |
IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s |
IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s |
IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s |
Key:
- PPL = Perplexity (lower is better)
- Ξ PPL = Percentage change from standard to DynamicGate
- Speed = Inference time (CPU avx2, 2048 token context)
- Size differences reflect mixed quantization overhead
Key Improvements:
- π₯ IQ1_M shows massive 43.9% perplexity reduction (27.46 β 15.41)
- π IQ2_S cuts perplexity by 36.9% while adding only 0.2GB
- β‘ IQ1_S maintains 39.7% better accuracy despite 1-bit quantization
Tradeoffs:
- All variants have modest size increases (0.1-0.3GB)
- Inference speeds remain comparable (<5% difference)
When to Use These Models
π Fitting models into GPU VRAM
β Memory-constrained deployments
β Cpu and Edge Devices where 1-2bit errors can be tolerated
β Research into ultra-low-bit quantization
Choosing the Right Model Format
Selecting the correct model format depends on your hardware capabilities and memory constraints.
BF16 (Brain Float 16) β Use if BF16 acceleration is available
- A 16-bit floating-point format designed for faster computation while retaining good precision.
- Provides similar dynamic range as FP32 but with lower memory usage.
- Recommended if your hardware supports BF16 acceleration (check your device's specs).
- Ideal for high-performance inference with reduced memory footprint compared to FP32.
π Use BF16 if:
β Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
β You want higher precision while saving memory.
β You plan to requantize the model into another format.
π Avoid BF16 if:
β Your hardware does not support BF16 (it may fall back to FP32 and run slower).
β You need compatibility with older devices that lack BF16 optimization.
F16 (Float 16) β More widely supported than BF16
- A 16-bit floating-point high precision but with less of range of values than BF16.
- Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.
π Use F16 if:
β Your hardware supports FP16 but not BF16.
β You need a balance between speed, memory usage, and accuracy.
β You are running on a GPU or another device optimized for FP16 computations.
π Avoid F16 if:
β Your device lacks native FP16 support (it may run slower than expected).
β You have memory limitations.
Quantized Models (Q4_K, Q6_K, Q8, etc.) β For CPU & Low-VRAM Inference
Quantization reduces model size and memory usage while maintaining as much accuracy as possible.
- Lower-bit models (Q4_K) β Best for minimal memory usage, may have lower precision.
- Higher-bit models (Q6_K, Q8_0) β Better accuracy, requires more memory.
π Use Quantized Models if:
β You are running inference on a CPU and need an optimized model.
β Your device has low VRAM and cannot load full-precision models.
β You want to reduce memory footprint while keeping reasonable accuracy.
π Avoid Quantized Models if:
β You need maximum accuracy (full-precision models are better for this).
β Your hardware has enough VRAM for higher-precision formats (BF16/F16).
Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)
These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.
IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.
- Use case: Best for ultra-low-memory devices where even Q4_K is too large.
- Trade-off: Lower accuracy compared to higher-bit quantizations.
IQ3_S: Small block size for maximum memory efficiency.
- Use case: Best for low-memory devices where IQ3_XS is too aggressive.
IQ3_M: Medium block size for better accuracy than IQ3_S.
- Use case: Suitable for low-memory devices where IQ3_S is too limiting.
Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
- Use case: Best for low-memory devices where Q6_K is too large.
Q4_0: Pure 4-bit quantization, optimized for ARM devices.
- Use case: Best for ARM-based devices or low-memory environments.
Summary Table: Model Format Selection
Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
---|---|---|---|---|
BF16 | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
F16 | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
Q4_K | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
Q6_K | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
Q8_0 | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
IQ3_XS | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
Q4_0 | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
Included Files & Details
OlympicCoder-32B-bf16.gguf
- Model weights preserved in BF16.
- Use this if you want to requantize the model into a different format.
- Best if your device supports BF16 acceleration.
OlympicCoder-32B-f16.gguf
- Model weights stored in F16.
- Use if your device supports FP16, especially if BF16 is not available.
OlympicCoder-32B-bf16-q8_0.gguf
- Output & embeddings remain in BF16.
- All other layers quantized to Q8_0.
- Use if your device supports BF16 and you want a quantized version.
OlympicCoder-32B-f16-q8_0.gguf
- Output & embeddings remain in F16.
- All other layers quantized to Q8_0.
OlympicCoder-32B-q4_k.gguf
- Output & embeddings quantized to Q8_0.
- All other layers quantized to Q4_K.
- Good for CPU inference with limited memory.
OlympicCoder-32B-q4_k_s.gguf
- Smallest Q4_K variant, using less memory at the cost of accuracy.
- Best for very low-memory setups.
OlympicCoder-32B-q6_k.gguf
- Output & embeddings quantized to Q8_0.
- All other layers quantized to Q6_K .
OlympicCoder-32B-q8_0.gguf
- Fully Q8 quantized model for better accuracy.
- Requires more memory but offers higher precision.
OlympicCoder-32B-iq3_xs.gguf
- IQ3_XS quantization, optimized for extreme memory efficiency.
- Best for ultra-low-memory devices.
OlympicCoder-32B-iq3_m.gguf
- IQ3_M quantization, offering a medium block size for better accuracy.
- Suitable for low-memory devices.
OlympicCoder-32B-q4_0.gguf
- Pure Q4_0 quantization, optimized for ARM devices.
- Best for low-memory environments.
- Prefer IQ4_NL for better accuracy.
π If you find these models useful
β€ Please click "Like" if you find this useful!
Help me test my AI-Powered Network Monitor Assistant with quantum-ready security checks:
π Free Network Monitor
π¬ How to test:
- Click the chat icon (bottom right on any page)
- Choose an AI assistant type:
TurboLLM
(GPT-4-mini)FreeLLM
(Open-source)TestLLM
(Experimental CPU-only)
What Iβm Testing
Iβm pushing the limits of small open-source models for AI network monitoring, specifically:
- Function calling against live network services
- How small can a model go while still handling:
- Automated Nmap scans
- Quantum-readiness checks
- Metasploit integration
π‘ TestLLM β Current experimental model (llama.cpp on 6 CPU threads):
- β Zero-configuration setup
- β³ 30s load time (slow inference but no API costs)
- π§ Help wanted! If youβre into edge-device AI, letβs collaborate!
Other Assistants
π’ TurboLLM β Uses gpt-4-mini for:
- Real-time network diagnostics
- Automated penetration testing (Nmap/Metasploit)
- π Get more tokens by downloading our Free Network Monitor Agent
π΅ HugLLM β Open-source models (β8B params):
- 2x more tokens than TurboLLM
- AI-powered log analysis
- π Runs on Hugging Face Inference API
π‘ Example AI Commands to Test:
"Give me info on my websites SSL certificate"
"Check if my server is using quantum safe encyption for communication"
"Run a quick Nmap vulnerability test"
Model Card for OlympicCoder-32B
OlympicCoder-32B is a code model that achieves very strong performance on competitive coding benchmarks such as LiveCodeBench andthe 2024 International Olympiad in Informatics.
- Repository: https://github.com/huggingface/open-r1
- Blog post: https://huggingface.co/blog/open-r1/update-3
Model description
- Model type: A 32B parameter model fine-tuned on a decontaminated version of the codeforces dataset.
- Language(s) (NLP): Primarily English
- License: apache-2.0
- Finetuned from model: Qwen/Qwen2.5-Coder-32B-Instruct
Evaluation
We compare the performance of OlympicCoder models on two main benchmarks for competitive coding:
- IOI'2024: 6 very challenging problems from the 2024 International Olympiad in Informatics. Models are allowed up to 50 submissions per problem.
- LiveCodeBench: Python programming problems source from platforms like CodeForces and LeetCoder. We use the
v4_v5
subset oflivecodebench/code_generation_lite
, which corresponds to 268 problems. We uselighteval
to evaluate models on LiveCodeBench using the sampling parameters described here.
The OlympicCoder models were post-trained exclusively on C++ solutions generated by DeepSeek-R1. As a result the performance on LiveCodeBench should be considered to be partially out-of-domain, since this expects models to output solutions in Python.
IOI'24
LiveCodeBench
Usage
Here's how you can run the model using the pipeline()
function from π€ Transformers:
# pip install transformers
# pip install accelerate
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="open-r1/OlympicCoder-32B", torch_dtype=torch.bfloat16, device_map="auto")
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
{"role": "user", "content": "Write a python program to calculate the 10th Fibonacci number"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=8000, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
#<|im_start|>user
#Write a python program to calculate the 10th fibonacci number<|im_end|>
#<|im_start|>assistant
#<think>Okay, I need to write a Python program that calculates the 10th Fibonacci number. Hmm, the Fibonacci sequence starts with 0 and 1. Each subsequent number is the sum of the two preceding ones. So the sequence goes: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, and so on. ...
To ensure that the model consistently outputs a long chain-of-thought, we have edited the chat template to prefill the first assistant turn with a
<think>
token. As a result, the outputs from this model will not show the opening<think>
token if you use the model'sgenerate()
method. To apply reinforcement learning with a format reward, either prepend the<think>
token to the model's completions or amend the chat template to remove the prefill. Check out our blog post for more details.
Training procedure
Training hyper-parameters
The following hyperparameters were used during training on 16 H100 nodes:
- dataset: open-r1/codeforces-cots_decontaminated
- learning_rate: 4.0e-5
- train_batch_size: 1
- seed: 42
- packing: false
- distributed_type: fsdp
- num_devices: 128
- gradient_accumulation_steps: 1
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine_with_min_lr
- min_lr_rate: 0.1
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 10.0
- Downloads last month
- 305
Model tree for Mungert/OlympicCoder-32B-GGUF
Base model
Qwen/Qwen2.5-32B