Nxcode-CQ-7B-orpo - SOTA GGUF

Model creator: NTQAI
Original model: Nxcode-CQ-7B-orpo

Description

This repo contains State Of The Art quantized GGUF format model files for Nxcode-CQ-7B-orpo.

Quantization was done with an importance matrix that was trained for ~1M tokens (256 batches of 4096 tokens) of answers from the CodeFeedback-Filtered-Instruction dataset.
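
As a quick check of the "~1M tokens" figure, the arithmetic from the batch count and batch size quoted above can be written out as follows (a minimal sketch; the variable names are illustrative only):

```python
# Token budget for the importance matrix, using the figures quoted above.
batches = 256
tokens_per_batch = 4096

total_tokens = batches * tokens_per_batch
print(f"{total_tokens:,} tokens")  # 1,048,576 tokens, i.e. roughly 1M
```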

NOTE: Because most tensors in Qwen2 models are oddly shaped, a substantial portion of the quantization fell back to IQ4_NL instead of the specified method. The resulting model files are significantly larger (and "smarter"; even IQ1_S is perfectly usable) than usual!

Prompt template: ChatML

<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
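
If you assemble the prompt by hand rather than relying on a chat API, the template above could be filled in roughly like this. This is a minimal sketch; `build_chatml_prompt` is a hypothetical helper name, not part of any library.

```python
def build_chatml_prompt(system_prompt: str, prompt: str) -> str:
    """Fill the ChatML template shown above and leave the assistant turn open."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


text = build_chatml_prompt("You are a helpful assistant.", "Write a binary search in Python.")
print(text)
```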

Compatibility

These quantised GGUFv3 files are compatible with llama.cpp from February 27th 2024 onwards, as of commit 0becb22

They are also compatible with many third party UIs and libraries provided they are built using a recent llama.cpp.

Explanation of quantisation methods


The new methods available are:

GGML_TYPE_IQ1_S - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.56 bits per weight (bpw)
GGML_TYPE_IQ1_M - 1-bit quantization in super-blocks with an importance matrix applied, effectively using 1.75 bpw
GGML_TYPE_IQ2_XXS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.06 bpw
GGML_TYPE_IQ2_XS - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.31 bpw
GGML_TYPE_IQ2_S - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.5 bpw
GGML_TYPE_IQ2_M - 2-bit quantization in super-blocks with an importance matrix applied, effectively using 2.7 bpw
GGML_TYPE_IQ3_XXS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.06 bpw
GGML_TYPE_IQ3_XS - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.3 bpw
GGML_TYPE_IQ3_S - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.44 bpw
GGML_TYPE_IQ3_M - 3-bit quantization in super-blocks with an importance matrix applied, effectively using 3.66 bpw
GGML_TYPE_IQ4_XS - 4-bit quantization in super-blocks with an importance matrix applied, effectively using 4.25 bpw
GGML_TYPE_IQ4_NL - 4-bit non-linearly mapped quantization with an importance matrix applied, effectively using 4.5 bpw

Refer to the Provided Files table below to see what files use which methods, and how.
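
To relate the bpw figures to file sizes, here is a rough sketch that computes the nominal size a file would have if every weight used the listed bits per weight. The weight count of 7e9 is an assumption taken loosely from the "7B" in the model name, not an exact figure. Because of the IQ4_NL fallback described in the NOTE above, the actual files for the low-bit quants are expected to be noticeably larger than these nominal values.

```python
# Effective bits per weight, copied from the list above.
BPW = {
    "IQ1_S": 1.56, "IQ1_M": 1.75,
    "IQ2_XXS": 2.06, "IQ2_XS": 2.31, "IQ2_S": 2.5, "IQ2_M": 2.7,
    "IQ3_XXS": 3.06, "IQ3_XS": 3.3, "IQ3_S": 3.44, "IQ3_M": 3.66,
    "IQ4_XS": 4.25, "IQ4_NL": 4.5,
}


def nominal_size_gb(n_params: float, bpw: float) -> float:
    """Size in GB (10**9 bytes) if every weight were stored at `bpw` bits."""
    return n_params * bpw / 8 / 1e9


# ASSUMPTION: approximate weight count; the exact number is not given in this card.
N_PARAMS = 7e9

for name, bpw in BPW.items():
    print(f"{name:8s} {nominal_size_gb(N_PARAMS, bpw):5.2f} GB")
```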

Provided files

Name	Quant method	Bits	Size	Max RAM required	Use case
Nxcode-CQ-7B-orpo.IQ1_S.gguf	IQ1_S	1	2.2 GB	2.4 GB	smallest, significant quality loss
Nxcode-CQ-7B-orpo.IQ1_M.gguf	IQ1_M	1	2.3 GB	2.5 GB	very small, significant quality loss
Nxcode-CQ-7B-orpo.IQ2_XXS.gguf	IQ2_XXS	2	2.5 GB	2.7 GB	very small, high quality loss
Nxcode-CQ-7B-orpo.IQ2_XS.gguf	IQ2_XS	2	2.6 GB	2.8 GB	very small, high quality loss
Nxcode-CQ-7B-orpo.IQ2_S.gguf	IQ2_S	2	2.7 GB	2.9 GB	small, substantial quality loss
Nxcode-CQ-7B-orpo.IQ2_M.gguf	IQ2_M	2	2.9 GB	3.1 GB	small, greater quality loss
Nxcode-CQ-7B-orpo.IQ3_XXS.gguf	IQ3_XXS	3	3.1 GB	3.3 GB	very small, high quality loss
Nxcode-CQ-7B-orpo.IQ3_XS.gguf	IQ3_XS	3	3.2 GB	3.4 GB	small, substantial quality loss
Nxcode-CQ-7B-orpo.IQ3_S.gguf	IQ3_S	3	3.3 GB	3.5 GB	small, greater quality loss
Nxcode-CQ-7B-orpo.IQ3_M.gguf	IQ3_M	3	3.4 GB	3.6 GB	medium, balanced quality - recommended
Nxcode-CQ-7B-orpo.IQ4_NL.gguf	IQ4_NL	4	4.0 GB	4.2 GB	small, substantial quality loss

Generated importance matrix file: Nxcode-CQ-7B-orpo.imatrix.dat

Note: the above RAM figures assume no GPU offloading with 4K context. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
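
As an illustration of how the table might be used, the sketch below picks the largest listed file whose "Max RAM required" figure fits a given budget. The RAM values are copied from the table above; `largest_quant_that_fits` is a hypothetical helper, and the figures only hold under the stated conditions (no GPU offloading, 4K context).

```python
from typing import Optional

# (quant method, max RAM required in GB), copied from the Provided files table.
MAX_RAM_GB = [
    ("IQ1_S", 2.4), ("IQ1_M", 2.5), ("IQ2_XXS", 2.7), ("IQ2_XS", 2.8),
    ("IQ2_S", 2.9), ("IQ2_M", 3.1), ("IQ3_XXS", 3.3), ("IQ3_XS", 3.4),
    ("IQ3_S", 3.5), ("IQ3_M", 3.6), ("IQ4_NL", 4.2),
]


def largest_quant_that_fits(free_ram_gb: float) -> Optional[str]:
    """Return the listed quant with the highest RAM figure that still fits, or None."""
    fitting = [(ram, name) for name, ram in MAX_RAM_GB if ram <= free_ram_gb]
    return max(fitting)[1] if fitting else None


print(largest_quant_that_fits(3.0))
```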

Example `llama.cpp` command

Make sure you are using llama.cpp from commit 0becb22 or later.

./main -ngl 33 -m Nxcode-CQ-7B-orpo.IQ2_XS.gguf --color -c 65536 --temp 1.0 --repeat-penalty 1.0 --top-p 0.95 -n -1 -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

Change -ngl 33 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

Change -c 65536 to the desired sequence length.

If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins

If you are low on V/RAM try quantizing the K-cache with -ctk q8_0 or even -ctk q4_0 for big memory savings (depending on context size). There is a similar option for V-cache (-ctv), however that is not working yet.
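
To get a feel for how much K-cache quantization can save, the sketch below estimates the K-cache size for each cache type. The bytes-per-element values follow the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The layer count, KV-head count and head dimension are illustrative assumptions, not values taken from this card; read the real ones from the model's metadata before relying on the numbers.

```python
# Bytes per cached element: f16 uses 2 bytes per value;
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def k_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate the K-cache size in bytes (the V-cache is not included)."""
    n_elements = n_ctx * n_layers * n_kv_heads * head_dim
    return n_elements * BYTES_PER_ELEMENT[cache_type]


# ASSUMPTION: hypothetical architecture numbers, used for illustration only.
shape = dict(n_layers=32, n_kv_heads=4, head_dim=128)

for cache_type in BYTES_PER_ELEMENT:
    mib = k_cache_bytes(65536, cache_type=cache_type, **shape) / 2**20
    print(f"-ctk {cache_type:5s} ~{mib:,.0f} MiB of K-cache at 65536 context")
```

The relative saving does not depend on the assumed shape: q8_0 needs about 53% of the f16 size and q4_0 about 28%.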

For other parameters and how to use them, please refer to the llama.cpp documentation

How to run from Python code

You can use GGUF models from Python using the llama-cpp-python module.

How to load this model in Python code, using llama-cpp-python

For full documentation, please see: llama-cpp-python docs.

First install the package

Run one of the following commands, according to your system:

# Prebuilt wheel with basic CPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
# Prebuilt wheel with NVidia CUDA acceleration (replace cu121 with cu122 etc. to match your CUDA version)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Prebuilt wheel with Metal GPU acceleration
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
# Build base version with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
# Or with Vulkan acceleration
CMAKE_ARGS="-DLLAMA_VULKAN=on" pip install llama-cpp-python
# Or with Kompute acceleration
CMAKE_ARGS="-DLLAMA_KOMPUTE=on" pip install llama-cpp-python
# Or with SYCL acceleration
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" pip install llama-cpp-python

# On Windows, set the CMAKE_ARGS variable in PowerShell using this format, e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUDA=on"
pip install llama-cpp-python

Simple llama-cpp-python example code

from llama_cpp import Llama

# Chat Completion API

llm = Llama(model_path="./Nxcode-CQ-7B-orpo.IQ2_XS.gguf", n_gpu_layers=33, n_ctx=65536)
print(llm.create_chat_completion(
    messages = [
        {"role": "system", "content": "You are an expert AI coding assistant."},
        {
            "role": "user",
            "content": "Pick a LeetCode challenge and solve it in Python."
        }
    ]
))
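
The example above prints the whole response dictionary. `create_chat_completion` returns an OpenAI-style structure, so the assistant's reply can be pulled out as sketched below. `reply_text` is a hypothetical helper name, and the sample dictionary is a hand-written stand-in for a real response, not actual model output.

```python
def reply_text(response: dict) -> str:
    """Extract the assistant message from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"]


# Hand-written stand-in with the same shape as a chat completion response.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "def two_sum(nums, target): ..."},
            "finish_reason": "stop",
        }
    ]
}

print(reply_text(sample_response))
```

With a loaded model, the same helper would be called as `reply_text(llm.create_chat_completion(messages=[...]))`.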

Introduction

Nxcode-CQ-7B-orpo is a Monolithic Preference Optimization without Reference Model (ORPO) fine-tune of Qwen/CodeQwen1.5-7B on 100k samples of high-quality ranking data.

Evalplus

EvalPlus	pass@1
HumanEval	86.6
HumanEval+	83.5
MBPP(v0.2.0)	82.3
MBPP+(v0.2.0)	70.4

We use a simple template to generate the solution for evalplus:

"Complete the following Python function:\n{prompt}"

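Applying the template is a plain string substitution. A minimal sketch follows; `build_eval_prompt` is a hypothetical helper, and the example problem is made up for illustration rather than taken from the benchmark.

```python
EVALPLUS_TEMPLATE = "Complete the following Python function:\n{prompt}"


def build_eval_prompt(problem_prompt: str) -> str:
    """Wrap a benchmark problem (signature plus docstring) in the template above."""
    return EVALPLUS_TEMPLATE.format(prompt=problem_prompt)


# Made-up problem text, for illustration only.
example_problem = 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n'
print(build_eval_prompt(example_problem))
```
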
Evalplus Leaderboard

Models	HumanEval	HumanEval+
GPT-4-Turbo (April 2024)	90.2	86.6
GPT-4 (May 2023)	88.4	81.17
GPT-4-Turbo (Nov 2023)	85.4	79.3
Nxcode-CQ-7B-orpo	83.5	78.7
claude-3-opus (Mar 2024)	82.9	76.8
DeepSeek-Coder-33B-instruct	81.1	75.0
WizardCoder-33B-V1.1	79.9	73.2
OpenCodeInterpreter-DS-33B	79.3	73.8
speechless-codellama-34B-v2.0	77.4	72
GPT-3.5-Turbo (Nov 2023)	76.8	70.7
Llama3-70B-instruct	76.2	70.7

Bigcode Leaderboard

09/05/2024

Top 1 average score.

Top 2 winrate.

Quickstart

The following code snippet uses apply_chat_template to show how to load the tokenizer and model and how to generate content. Upgrade transformers if you receive an error when loading the tokenizer.

from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" decides where the weights are placed
model = AutoModelForCausalLM.from_pretrained(
    "NTQAI/Nxcode-CQ-7B-orpo",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("NTQAI/Nxcode-CQ-7B-orpo")

# The outer string uses ''' so that the docstring's """ does not end it early
prompt = '''Complete the following Python function:
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
'''
messages = [
    {"role": "user", "content": prompt}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False), so sampling parameters such as top_k and top_p are not needed
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt
res = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print(res)
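
One way to sanity-check a generated completion for this prompt is to compare its behaviour with a hand-written reference. The sketch below is such a reference, written for this purpose; it is not the model's output and not part of the original card. It sorts the list so that only neighbouring values need to be compared.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two numbers are closer to each other than threshold."""
    ordered = sorted(numbers)
    # After sorting, the closest pair is always adjacent.
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))


# The two examples from the docstring in the prompt above.
print(has_close_elements([1.0, 2.0, 3.0], 0.5))
print(has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3))
```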

Contact information

For personal communication related to this project, please contact Nha Nguyen Van ([email protected]).
