pszemraj committed on
Commit 0cbf47e · 1 Parent(s): e9213aa

Update README.md

Files changed (1):
  1. README.md +75 -0
README.md CHANGED

---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---

# stablelm-tuned-alpha-3b-gptq-4bit-128g

This is a quantized model saved with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ). At the time of writing, you cannot load quantized models directly from the hub; you will need to clone this repo and load the model locally.

See below for details.
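
For reference, a minimal sketch of cloning this repo for local loading (the repo URL is inferred from the model name above, so adjust it if needed):
```shell
# git-lfs is needed so the quantized weight files are actually downloaded
git lfs install
git clone https://huggingface.co/pszemraj/stablelm-tuned-alpha-3b-gptq-4bit-128g
```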

---

# Auto-GPTQ Quick Start

## Quick Installation

Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:
```shell
pip install auto-gptq
```

AutoGPTQ supports using `triton` to speed up inference, but triton currently **only supports Linux**. To install with triton support, use:
```shell
pip install auto-gptq[triton]
```

If you want to try the newly supported `llama` type models without updating 🤗 Transformers to the latest version, use:
```shell
pip install auto-gptq[llama]
```

By default, the CUDA extension will be built at installation time if CUDA and PyTorch are already installed.

To disable building the CUDA extension, use the following commands:

For Linux
```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```
For Windows
```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```

## Basic Usage
*The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py`*

The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
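
`BaseQuantizeConfig` holds the settings a model is (or was) quantized with. As a rough sketch, a configuration matching the 4-bit, group-size-128 setup in this repo's name might look like:
```python
# sketch only: values mirror the "4bit-128g" suffix of this repo's name
quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit precision
    group_size=128,  # how many weights share one set of quantization parameters
)
```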
### Load quantized model and do inference

Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model. Here, `quantized_model_dir` is assumed to point at the locally cloned repo.
```python
device = "cuda:0"
# path to the locally cloned repo (see above); adjust as needed
quantized_model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_triton=False, use_safetensors=True)
```
This will first read and load `quantize_config.json` from the quantized model directory, then, based on the values of `bits` and `group_size` in it, load the quantized model file (e.g. `gptq_model-4bit-128g.bin`, or the `.safetensors` equivalent when `use_safetensors=True`) onto the first GPU.

Then you can load a tokenizer, initialize 🤗 Transformers' `TextGenerationPipeline`, and do inference.
```python
from transformers import AutoTokenizer, TextGenerationPipeline

# load the tokenizer from the local model dir (or from the base model repo if it is not included)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```

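If you prefer not to use the pipeline, a rough equivalent that calls `model.generate` directly (a sketch; `max_new_tokens` is an arbitrary choice) is:
```python
# tokenize the prompt and move it to the same device as the model
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```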

## Conclusion
Congrats! You have learned how to quickly install `auto-gptq` and integrate with it. In the next chapter, you will learn about advanced loading strategies for pretrained and quantized models and some best practices for different situations.