--- license: apache-2.0 datasets: - open-r1/codeforces-cots language: - en base_model: - Qwen/Qwen2.5-Coder-7B-Instruct pipeline_tag: text-generation library_name: transformers --- # OlympicCoder-7B GGUF Models ## Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit) Our latest quantization method introduces **precision-adaptive quantization** for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on **Llama-3-8B**. This approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency. ### **Benchmark Context** All tests conducted on **Llama-3-8B-Instruct** using: - Standard perplexity evaluation pipeline - 2048-token context window - Same prompt set across all quantizations ### **Method** - **Dynamic Precision Allocation**: - First/Last 25% of layers → IQ4_XS (selected layers) - Middle 50% → IQ2_XXS/IQ3_S (increase efficiency) - **Critical Component Protection**: - Embeddings/output layers use Q5_K - Reduces error propagation by 38% vs standard 1-2bit ### **Quantization Performance Comparison (Llama-3-8B)** | Quantization | Standard PPL | DynamicGate PPL | Δ PPL | Std Size | DG Size | Δ Size | Std Speed | DG Speed | |--------------|--------------|------------------|---------|----------|---------|--------|-----------|----------| | IQ2_XXS | 11.30 | 9.84 | -12.9% | 2.5G | 2.6G | +0.1G | 234s | 246s | | IQ2_XS | 11.72 | 11.63 | -0.8% | 2.7G | 2.8G | +0.1G | 242s | 246s | | IQ2_S | 14.31 | 9.02 | -36.9% | 2.7G | 2.9G | +0.2G | 238s | 244s | | IQ1_M | 27.46 | 15.41 | -43.9% | 2.2G | 2.5G | +0.3G | 206s | 212s | | IQ1_S | 53.07 | 32.00 | -39.7% | 2.1G | 2.4G | +0.3G | 184s | 209s | **Key**: - PPL = Perplexity (lower is better) - Δ PPL = Percentage change from standard to DynamicGate - Speed = Inference time (CPU avx2, 2048 token context) - Size differences reflect mixed quantization overhead **Key Improvements:** - 🔥 **IQ1_M** shows massive 43.9% perplexity reduction (27.46 → 15.41) - 🚀 **IQ2_S** cuts perplexity by 36.9% while adding only 0.2GB - ⚡ **IQ1_S** maintains 39.7% better accuracy despite 1-bit quantization **Tradeoffs:** - All variants have modest size increases (0.1-0.3GB) - Inference speeds remain comparable (<5% difference) ### **When to Use These Models** 📌 **Fitting models into GPU VRAM** ✔ **Memory-constrained deployments** ✔ **Cpu and Edge Devices** where 1-2bit errors can be tolerated ✔ **Research** into ultra-low-bit quantization ## **Choosing the Right Model Format** Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**. ### **BF16 (Brain Float 16) – Use if BF16 acceleration is available** - A 16-bit floating-point format designed for **faster computation** while retaining good precision. - Provides **similar dynamic range** as FP32 but with **lower memory usage**. - Recommended if your hardware supports **BF16 acceleration** (check your device's specs). - Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32. 📌 **Use BF16 if:** ✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs). ✔ You want **higher precision** while saving memory. ✔ You plan to **requantize** the model into another format. 📌 **Avoid BF16 if:** ❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower). ❌ You need compatibility with older devices that lack BF16 optimization. --- ### **F16 (Float 16) – More widely supported than BF16** - A 16-bit floating-point **high precision** but with less of range of values than BF16. - Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs). - Slightly lower numerical precision than BF16 but generally sufficient for inference. 📌 **Use F16 if:** ✔ Your hardware supports **FP16** but **not BF16**. ✔ You need a **balance between speed, memory usage, and accuracy**. ✔ You are running on a **GPU** or another device optimized for FP16 computations. 📌 **Avoid F16 if:** ❌ Your device lacks **native FP16 support** (it may run slower than expected). ❌ You have memory limitations. --- ### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference** Quantization reduces model size and memory usage while maintaining as much accuracy as possible. - **Lower-bit models (Q4_K)** → **Best for minimal memory usage**, may have lower precision. - **Higher-bit models (Q6_K, Q8_0)** → **Better accuracy**, requires more memory. 📌 **Use Quantized Models if:** ✔ You are running inference on a **CPU** and need an optimized model. ✔ Your device has **low VRAM** and cannot load full-precision models. ✔ You want to reduce **memory footprint** while keeping reasonable accuracy. 📌 **Avoid Quantized Models if:** ❌ You need **maximum accuracy** (full-precision models are better for this). ❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16). --- ### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)** These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint. - **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**. - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large. - **Trade-off**: Lower accuracy compared to higher-bit quantizations. - **IQ3_S**: Small block size for **maximum memory efficiency**. - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive. - **IQ3_M**: Medium block size for better accuracy than **IQ3_S**. - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting. - **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy. - **Use case**: Best for **low-memory devices** where **Q6_K** is too large. - **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**. - **Use case**: Best for **ARM-based devices** or **low-memory environments**. --- ### **Summary Table: Model Format Selection** | Model Format | Precision | Memory Usage | Device Requirements | Best Use Case | |--------------|------------|---------------|----------------------|---------------| | **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory | | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available | | **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments | | **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized | | **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models | | **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy | | **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices | --- ## **Included Files & Details** ### `OlympicCoder-7B-bf16.gguf` - Model weights preserved in **BF16**. - Use this if you want to **requantize** the model into a different format. - Best if your device supports **BF16 acceleration**. ### `OlympicCoder-7B-f16.gguf` - Model weights stored in **F16**. - Use if your device supports **FP16**, especially if BF16 is not available. ### `OlympicCoder-7B-bf16-q8_0.gguf` - **Output & embeddings** remain in **BF16**. - All other layers quantized to **Q8_0**. - Use if your device supports **BF16** and you want a quantized version. ### `OlympicCoder-7B-f16-q8_0.gguf` - **Output & embeddings** remain in **F16**. - All other layers quantized to **Q8_0**. ### `OlympicCoder-7B-q4_k.gguf` - **Output & embeddings** quantized to **Q8_0**. - All other layers quantized to **Q4_K**. - Good for **CPU inference** with limited memory. ### `OlympicCoder-7B-q4_k_s.gguf` - Smallest **Q4_K** variant, using less memory at the cost of accuracy. - Best for **very low-memory setups**. ### `OlympicCoder-7B-q6_k.gguf` - **Output & embeddings** quantized to **Q8_0**. - All other layers quantized to **Q6_K** . ### `OlympicCoder-7B-q8_0.gguf` - Fully **Q8** quantized model for better accuracy. - Requires **more memory** but offers higher precision. ### `OlympicCoder-7B-iq3_xs.gguf` - **IQ3_XS** quantization, optimized for **extreme memory efficiency**. - Best for **ultra-low-memory devices**. ### `OlympicCoder-7B-iq3_m.gguf` - **IQ3_M** quantization, offering a **medium block size** for better accuracy. - Suitable for **low-memory devices**. ### `OlympicCoder-7B-q4_0.gguf` - Pure **Q4_0** quantization, optimized for **ARM devices**. - Best for **low-memory environments**. - Prefer IQ4_NL for better accuracy. # 🚀 If you find these models useful ❤ **Please click "Like" if you find this useful!** Help me test my **AI-Powered Network Monitor Assistant** with **quantum-ready security checks**: 👉 [Free Network Monitor](https://readyforquantum.com) 💬 **How to test**: 1. Click the **chat icon** (bottom right on any page) 2. Choose an **AI assistant type**: - `TurboLLM` (GPT-4-mini) - `FreeLLM` (Open-source) - `TestLLM` (Experimental CPU-only) ### **What I’m Testing** I’m pushing the limits of **small open-source models for AI network monitoring**, specifically: - **Function calling** against live network services - **How small can a model go** while still handling: - Automated **Nmap scans** - **Quantum-readiness checks** - **Metasploit integration** 🟡 **TestLLM** – Current experimental model (llama.cpp on 6 CPU threads): - ✅ **Zero-configuration setup** - ⏳ 30s load time (slow inference but **no API costs**) - 🔧 **Help wanted!** If you’re into **edge-device AI**, let’s collaborate! ### **Other Assistants** 🟢 **TurboLLM** – Uses **gpt-4-mini** for: - **Real-time network diagnostics** - **Automated penetration testing** (Nmap/Metasploit) - 🔑 Get more tokens by [downloading our Free Network Monitor Agent](https://readyforquantum.com/download) 🔵 **HugLLM** – Open-source models (≈8B params): - **2x more tokens** than TurboLLM - **AI-powered log analysis** - 🌐 Runs on Hugging Face Inference API ### 💡 **Example AI Commands to Test**: 1. `"Give me info on my websites SSL certificate"` 2. `"Check if my server is using quantum safe encyption for communication"` 3. `"Run a quick Nmap vulnerability test"` # Model Card for OlympicCoder-7B OlympicCoder-7B is a code model that achieves strong performance on competitive coding benchmarks such as LiveCodeBench and the 2024 International Olympiad in Informatics. * Repository: https://github.com/huggingface/open-r1 * Blog post: https://huggingface.co/blog/open-r1/update-3 ## Model description - **Model type:** A 7B parameter model fine-tuned on a decontaminated version of the codeforces dataset. - **Language(s) (NLP):** Primarily English - **License:** apache-2.0 - **Finetuned from model:** [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) ## Evaluation We compare the performance of OlympicCoder models on two main benchmarks for competitive coding: * **[IOI'2024:](https://github.com/huggingface/ioi)** 6 very challenging problems from the 2024 International Olympiad in Informatics. Models are allowed up to 50 submissions per problem. * **[LiveCodeBench:](https://livecodebench.github.io)** Python programming problems source from platforms like CodeForces and LeetCoder. We use the `v4_v5` subset of [`livecodebench/code_generation_lite`](https://huggingface.co/datasets/livecodebench/code_generation_lite), which corresponds to 268 problems. We use `lighteval` to evaluate models on LiveCodeBench using the sampling parameters described [here](https://github.com/huggingface/open-r1?tab=readme-ov-file#livecodebench). > [!NOTE] > The OlympicCoder models were post-trained exclusively on C++ solutions generated by DeepSeek-R1. As a result the performance on LiveCodeBench should be considered to be partially _out-of-domain_, since this expects models to output solutions in Python. ### IOI'24 ![](./ioi-evals.png) ### LiveCodeBench ![](./lcb-evals.png) ## Usage Here's how you can run the model using the `pipeline()` function from 🤗 Transformers: ```python # pip install transformers # pip install accelerate import torch from transformers import pipeline pipe = pipeline("text-generation", model="open-r1/OlympicCoder-7B", torch_dtype=torch.bfloat16, device_map="auto") # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating messages = [ {"role": "user", "content": "Write a python program to calculate the 10th Fibonacci number"}, ] prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) outputs = pipe(prompt, max_new_tokens=8000, do_sample=True, temperature=0.7, top_k=50, top_p=0.95) print(outputs[0]["generated_text"]) #<|im_start|>user #Write a python program to calculate the 10th fibonacci number<|im_end|> #<|im_start|>assistant #Okay, I need to write a Python program that calculates the 10th Fibonacci number. Hmm, the Fibonacci sequence starts with 0 and 1. Each subsequent number is the sum of the two preceding ones. So the sequence goes: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, and so on. ... ``` > [!WARNING] > To ensure that the model consistently outputs a long chain-of-thought, we have edited the chat template to prefill the first assistant turn with a `` token. As a result, the outputs from this model will not show the opening `` token if you use the model's `generate()` method. To apply reinforcement learning with a format reward, either prepend the `` token to the model's completions or amend the chat template to remove the prefill. ## Training procedure ### Training hyper-parameters The following hyperparameters were used during training: - dataset: open-r1/codeforces-cots - learning_rate: 4.0e-5 - train_batch_size: 2 - seed: 42 - packing: false - distributed_type: deepspeed-zero-3 - num_devices: 8 - gradient_accumulation_steps: 8 - total_train_batch_size: 16 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 - lr_scheduler_type: cosine_with_min_lr - min_lr_rate: 0.1 - lr_scheduler_warmup_ratio: 0.03 - num_epochs: 10.0