---
base_model:
- mistralai/Mistral-Large-Instruct-2407
pipeline_tag: text-generation
tags:
- mistral
- 3bit
---

This is a 3-bit AutoRound GPTQ version of Mistral-Large-Instruct-2407. The conversion used the original model-*.safetensors files.

This quantized model needs at least ~50 GB of VRAM for the weights plus ~5 GB for context. I quantized it so that it fits in 64 GB of VRAM.

Quantization script (the conversion takes around 520 GB of RAM and roughly 20 hours on a 48 GB A40 GPU):

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "mistralai/Mistral-Large-Instruct-2407"

# Load the full-precision model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3-bit symmetric quantization with group size 128
bits, group_size, sym = 3, 128, True
autoround = AutoRound(model, tokenizer, nsamples=256, iters=512, low_gpu_mem_usage=True,
                      batch_size=4, bits=bits, group_size=group_size, sym=sym, device='cuda')
autoround.quantize()

# Save in GPTQ format
output_dir = "./Mistral-Large-Instruct-2407-3bit"
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)
```

Evals were run with lm-eval-harness. Example command:

```
# !pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git auto-gptq optimum
m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
```

This model (3-bit AutoRound GPTQ):

hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2

| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4103|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.3290|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |4.5765|±  |   N/A|

vs. 3-bit VPTQ:

hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.4017|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.3211|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |4.4324|±  |   N/A|

vs. 4-bit GPTQ:

hf (pretrained=ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.3536|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.2777|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |3.7082|±  |   N/A|

vs. 4-bit VPTQ:

hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-65536-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks  |Version|Filter|n-shot|     Metric    |   |Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|-----:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  |0.3415|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  |1.2671|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |3.5463|±  |   N/A|

vs. EXL2 4.00 bpw (note: these numbers come from a different evaluation setup, so they are not directly comparable):

|             |Wikitext| C4  |FineWeb|Max VRAM|
|-------------|--------|-----|-------|--------|
|EXL2 4.00 bpw| 2.885  |6.484| 6.246 |60.07 GB|
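
For reference, here is a minimal inference sketch (not part of the original conversion workflow). It assumes transformers with auto-gptq and optimum installed, enough free VRAM (~55 GB including context), and uses the MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit repo id shown in the eval output above.

```
# Minimal inference sketch (assumptions: transformers + auto-gptq + optimum are
# installed, ~55 GB of VRAM is free; repo id taken from the eval output above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spread the quantized layers across available GPUs
    torch_dtype=torch.float16,
)

messages = [{"role": "user", "content": "Explain the trade-offs of 3-bit quantization."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```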