How to quantize a 70B model so it fits on 2x 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM, and they all failed for different reasons (issues opened).

HQQ worked:

I rented a 4x GPU, 1TB RAM instance ($19/hr) on RunPod with 1024GB of container disk and 1024GB of workspace disk space.
I think you only need 2x GPUs with 80GB VRAM and 512GB+ of system RAM, so I probably overpaid.

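For a rough sense of why 4-bit is the target (my own back-of-the-envelope numbers, not from the HQQ docs): 70B parameters in bf16 is roughly 140GB of weights alone, while 4-bit weights plus per-group metadata come in around 35-40GB, which fits in the 48GB of combined VRAM on 2x 4090s.

```python
# Back-of-the-envelope VRAM sizing (my own estimate, not part of the original guide).
params = 70e9

bf16_gb = params * 2 / 1e9      # ~140 GB: full-precision weights alone blow past 2x 24 GB
q4_gb = params * 0.5 / 1e9      # ~35 GB: 4 bits per weight
q4_total_gb = q4_gb * 1.15      # assumption: ~15% extra for scales/zeros and group metadata

print(f"bf16: ~{bf16_gb:.0f} GB, 4-bit: ~{q4_total_gb:.0f} GB vs. 48 GB across 2x 4090s")
```
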
Note that you need to fill in the access form to get the 70B Meta weights.

You can copy/paste this into the console and it will set everything up automatically:

```bash
apt update
apt install vim -y

# Install Miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc

# Create and activate a clean environment for HQQ
conda create -n hqq python=3.10 -y && conda activate hqq

# Build HQQ from source
git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq

pip install torch
pip install .

# Faster model downloads from the Hugging Face Hub
pip install huggingface_hub[hf_transfer]
export HF_HUB_ENABLE_HF_TRANSFER=1

# Log in with a token that has access to the Meta Llama 3 weights
huggingface-cli login
```
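
Before quantizing, it can be worth a quick check that PyTorch actually sees the GPUs; this one-liner is my own addition, not part of the original setup:

```bash
# Should print the GPU count and the name of the first device
python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
```
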
Create the `quantize.py` file by copy/pasting this into the console:

```bash
echo "
import torch

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

from hqq.core.quantize import *

# 4-bit weights, group size 64, with quantization metadata (scales/zeros) offloaded to save VRAM
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer
model = HQQModelForCausalLM.from_pretrained(model_id)

from hqq.models.hf.base import AutoHQQHFModel
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)

AutoHQQHFModel.save_quantized(model, save_dir)
model = AutoHQQHFModel.from_quantized(save_dir)

model.eval()
" > quantize.py
```

Run the script:

```bash
python quantize.py
```
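
Once it finishes, the quantized weights are saved in `cat-llama-3-70b-hqq`. As a quick smoke test, something like the sketch below should load them back and generate a few tokens. This is my own addition and assumes the HQQ wrapper exposes the usual transformers `generate()` interface and places the weights on the GPUs by default; check the HQQ docs if the defaults differ.

```python
# smoke_test.py - sanity-check the saved HQQ model (my sketch, not part of the original guide)
import torch
from hqq.engine.hf import AutoTokenizer  # same re-export the quantize script uses
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoHQQHFModel.from_quantized(save_dir)  # assumption: default device placement puts weights on GPU
model.eval()

prompt = 'The key idea behind 4-bit quantization is'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)  # assumption: wrapper forwards generate() to the HF model

print(tokenizer.decode(out[0], skip_special_tokens=True))
```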