AI Model Name: Llama 3 70B

"Built with Meta Llama 3" (license: https://llama.meta.com/llama3/license/)

How to quantize the 70B model so it fits on 2x 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened). HQQ worked.

I rented a 4x GPU, 1 TB RAM instance ($19/hr) on RunPod with 1024 GB of container disk and 1024 GB of workspace disk. You probably only need 2x 80 GB GPUs and 512 GB+ of system RAM, so I likely overpaid.

Note: you need to fill in the request form to get access to the Meta 70B weights.

You can copy/paste this into the console and it will set everything up automatically:

```bash
apt update
apt install vim -y

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc

conda create -n hqq python=3.10 -y && conda activate hqq

git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq
pip install torch
pip install .

pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli login
```

Create the `quantize.py` file by copy/pasting this into the console:

```bash
cat << 'EOF' > quantize.py
import torch

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights with group size 64; offload quantization metadata to save VRAM
from hqq.core.quantize import *
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)

zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

# Load the full-precision model, quantize it, and save the quantized checkpoint
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model = HQQModelForCausalLM.from_pretrained(model_id)

from hqq.models.hf.base import AutoHQQHFModel
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype)
AutoHQQHFModel.save_quantized(model, save_dir)

# Reload the quantized checkpoint to verify it saved correctly
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()
EOF
```

Run the script:

```bash
python quantize.py
```
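
The quantize script already reloads the checkpoint at the end; if you also want to sanity-check generation on the target box, a minimal sketch like the one below should work. The prompt, tokenizer loading, and generation settings are my own additions rather than part of the original steps, and how `from_quantized` places the weights across multiple GPUs depends on your hqq version, so check the hqq docs for your release.

```python
# Minimal smoke test (assumptions noted above): load the saved HQQ checkpoint
# and generate a short reply with the Llama 3 chat template.
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel

save_dir = 'cat-llama-3-70b-hqq'                 # folder written by quantize.py
model = AutoHQQHFModel.from_quantized(save_dir)  # same call the quantize script uses
model.eval()

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-70B-Instruct')
messages = [{'role': 'user', 'content': 'Explain HQQ quantization in one sentence.'}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors='pt'
).to('cuda')

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=64, do_sample=False)

# Print only the newly generated tokens
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```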