4-bit quantization of the vicuna-13b-v1.1 model.

The delta was added to the original LLaMA weights using FastChat.
Quantization and inference were done with GPTQ-for-LLaMa (commit 58c8ab4).

Quantization args: $MODEL_DIRECTORY, c4, wbits 4, true-sequential, act-order, groupsize 128.
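
A minimal command sketch for these args, assuming the llama.py entry point from the GPTQ-for-LLaMa repo at that commit ($MODEL_DIRECTORY and $CHECKPOINT_FILE are placeholders):

    # quantize to 4-bit with the c4 calibration set and save the checkpoint
    python llama.py $MODEL_DIRECTORY c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save $CHECKPOINT_FILE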
Inference args: $MODEL_DIRECTORY, wbits 4, groupsize 128, load $CHECKPOINT_FILE.
Add device=0 when running inference on a GPU. You may have to adjust min_length and max_length to get better outputs.
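
A corresponding inference sketch, under the same assumption that llama_inference.py from the repo is used (the min_length/max_length values here are illustrative):

    # load the 4-bit checkpoint and generate; --device 0 selects the GPU
    python llama_inference.py $MODEL_DIRECTORY --wbits 4 --groupsize 128 --load $CHECKPOINT_FILE --device 0 --min_length 10 --max_length 200 --text "$PROMPT"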

The separator has been changed to </s>. A simple prompt is "Human: $REQUEST</s>Assistant:".
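
For example, a single question passed via --text would look like this (the question itself is illustrative):

    python llama_inference.py $MODEL_DIRECTORY --wbits 4 --groupsize 128 --load $CHECKPOINT_FILE --device 0 --text "Human: What is the capital of France?</s>Assistant:"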

Delta: https://huggingface.co/lmsys/vicuna-13b-delta-v1.1
FastChat: https://github.com/lm-sys/FastChat
GPTQ-for-LLaMa: https://github.com/qwopqwop200/GPTQ-for-LLaMa
