Update README.md
README.md (CHANGED)

@@ -25,11 +25,17 @@ It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com

**This is an experimental new GPTQ which offers up to 8K context size**

The increased context is tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It has also been tested from Python code using AutoGPTQ, and `trust_remote_code=True`.

Code credits:
- Original concept and code for increasing context length: [kaiokendev](https://huggingface.co/kaiokendev)
- Updated Llama modelling code that includes this automatically via trust_remote_code: [emozilla](https://huggingface.co/emozilla)

Please read carefully below to see how to use it.

**NOTE**: Using the full 8K context on a 30B model will exceed 24GB VRAM.

GGML versions are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

@@ -40,7 +46,7 @@ GGML versions are not yet provided, as there is not yet support for SuperHOT in

GGML quants are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.

## How to easily download and use this model in text-generation-webui with ExLlama

Please make sure you're using the latest version of text-generation-webui

@@ -56,9 +62,76 @@ Please make sure you're using the latest version of text-generation-webui

10. The model will automatically load, and is now ready for use!
11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
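
If you prefer to fetch the model files outside the UI, a minimal sketch using `huggingface_hub` is shown below (the destination folder is an assumption; point it at your own text-generation-webui `models` directory):

```python
# Optional alternative to the in-UI download: pull the whole repo with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ",
    # Assumed destination -- adjust to wherever your text-generation-webui models folder lives.
    local_dir="models/TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ",
    local_dir_use_symlinks=False,
)
```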

## How to use this GPTQ model from Python code with AutoGPTQ

First make sure you have AutoGPTQ and Einops installed:

```
pip3 install einops auto-gptq
```

Then run the following code. Note that in order to get this to work, `config.json` has been hardcoded to a sequence length of 8192.

If you want to try 4096 instead to reduce VRAM usage, please manually edit `config.json` to set `max_position_embeddings` to the value you want.
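
For example, a minimal sketch of that edit, assuming you have a local copy of the model files (the path below is an assumption; adjust it to wherever the files were downloaded):

```python
import json

# Assumed local path to the downloaded model files -- adjust as needed.
config_path = "models/TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ/config.json"

with open(config_path) as f:
    config = json.load(f)

# Drop the context window from the hardcoded 8192 to 4096 to reduce VRAM usage.
config["max_position_embeddings"] = 4096

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

If you go that route, point `model_name_or_path` in the code below at that local folder so the edited `config.json` is the one that gets loaded.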
```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ"
model_basename = "wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# trust_remote_code=True pulls in the updated Llama modelling code that applies
# the SuperHOT RoPE scaling automatically.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device_map='auto',
        use_triton=use_triton,
        quantize_config=None)

# Match the sequence length hardcoded in config.json (use 4096 if you edited it).
model.seqlen = 8192

# Note: check the prompt template is correct for this model.
prompt = "Tell me about AI"
prompt_template = f'''USER: {prompt}
ASSISTANT:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline.

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ.
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```
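
Since the whole point of this GPTQ is the longer context, a quick way to sanity-check it from the same session is to tokenize a long input and compare it against `model.seqlen` before generating (an untested sketch, continuing from the variables above):

```python
# Sketch: make sure a long prompt actually fits in the extended context window.
long_document = "Some very long text... " * 400   # placeholder for your own document
long_prompt = f'''USER: Summarise the following.
{long_document}
ASSISTANT:'''

n_tokens = len(tokenizer(long_prompt).input_ids)
print(f"Prompt is {n_tokens} tokens; model.seqlen is {model.seqlen}")

if n_tokens + 512 > model.seqlen:
    raise ValueError("Prompt too long: trim the input or use a larger context setting")

input_ids = tokenizer(long_prompt, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[1]:]))
```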
## Using other UIs: monkey patch

Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.

It can theoretically be added to any Python UI or custom code to enable the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
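
For anyone who does want to try it, the general shape is to import the patch and apply it before the model is loaded. This is an untested sketch, and the function name below is an assumption; open `llama_rope_scaled_monkey_patch.py` and call whatever it actually defines:

```python
# Untested sketch: apply the RoPE-scaling monkey patch before the model is loaded.
# The function name below is an assumption -- check llama_rope_scaled_monkey_patch.py
# for the real one.
from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope

replace_llama_rope_with_scaled_rope()

# Then load the model exactly as in the AutoGPTQ example above; with the patch
# applied, trust_remote_code=True should no longer be needed.
```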

## Provided files