pszemraj committed on
Commit 0cbf47e · 1 Parent(s): e9213aa

Update README.md

Files changed (1):
  1. README.md +75 -0
README.md CHANGED

---
license: cc-by-nc-sa-4.0
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- gptq
- auto-gptq
- quantized
---

# stablelm-tuned-alpha-3b-gptq-4bit-128g

This is a quantized model saved with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ). At the time of writing, you cannot load quantized models directly from the hub; you will need to clone this repo and load the model locally.

See below for details.
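
For reference, a minimal sketch of cloning this repo for local loading (the repo URL is inferred from the model name above, so adjust it if needed):
```shell
# git-lfs is needed so the quantized weight files are actually downloaded
git lfs install
git clone https://huggingface.co/pszemraj/stablelm-tuned-alpha-3b-gptq-4bit-128g
```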

---

# Auto-GPTQ Quick Start

## Quick Installation

Starting from v0.0.4, you can install `auto-gptq` directly from PyPI using `pip`:
```shell
pip install auto-gptq
```

AutoGPTQ supports using `triton` to speed up inference, but triton currently **only supports Linux**. To install with triton support, use:
```shell
pip install auto-gptq[triton]
```

If you want to try the newly supported `llama` type models without updating 🤗 Transformers to the latest version, use:
```shell
pip install auto-gptq[llama]
```

By default, the CUDA extension will be built at installation time if CUDA and PyTorch are already installed.

To disable building the CUDA extension, use the following commands:

For Linux
```shell
BUILD_CUDA_EXT=0 pip install auto-gptq
```
For Windows
```shell
set BUILD_CUDA_EXT=0 && pip install auto-gptq
```

## Basic Usage
*The full script for the basic usage demonstrated here is `examples/quantization/basic_usage.py`*

The two main classes currently used in AutoGPTQ are `AutoGPTQForCausalLM` and `BaseQuantizeConfig`.
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
```
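
`BaseQuantizeConfig` holds the settings a model is (or was) quantized with. As a rough sketch, a configuration matching the 4-bit, group-size-128 setup in this repo's name might look like:
```python
# sketch only: values mirror the "4bit-128g" suffix of this repo's name
quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4-bit precision
    group_size=128,  # how many weights share one set of quantization parameters
)
```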
### Load quantized model and do inference

Instead of `.from_pretrained`, you should use `.from_quantized` to load a quantized model. Here, `quantized_model_dir` is assumed to point at the locally cloned repo.
```python
device = "cuda:0"
# path to the locally cloned repo (see above); adjust as needed
quantized_model_dir = "stablelm-tuned-alpha-3b-gptq-4bit-128g"
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, use_triton=False, use_safetensors=True)
```
This will first read and load `quantize_config.json` from the quantized model directory, then, based on the values of `bits` and `group_size` in it, load the quantized model file (e.g. `gptq_model-4bit-128g.bin`, or the `.safetensors` equivalent when `use_safetensors=True`) onto the first GPU.

Then you can load a tokenizer, initialize 🤗 Transformers' `TextGenerationPipeline`, and do inference.
```python
from transformers import AutoTokenizer, TextGenerationPipeline

# load the tokenizer from the local model dir (or from the base model repo if it is not included)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
print(pipeline("auto-gptq is")[0]["generated_text"])
```

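If you prefer not to use the pipeline, a rough equivalent that calls `model.generate` directly (a sketch; `max_new_tokens` is an arbitrary choice) is:
```python
# tokenize the prompt and move it to the same device as the model
inputs = tokenizer("auto-gptq is", return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```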

## Conclusion
Congrats! You have learned how to quickly install `auto-gptq` and integrate with it. In the next chapter, you will learn about advanced loading strategies for pretrained and quantized models and some best practices for different situations.