TheBloke committed
Commit 18c4a6e
1 Parent(s): 40de25c

Update README.md

Files changed (1)
  1. README.md +78 -5
README.md CHANGED
@@ -25,11 +25,17 @@ It is the result of quantising to 4bit using [GPTQ-for-LLaMa](https://github.com
 
 **This is an experimental new GPTQ which offers up to 8K context size**
 
- The increased context is currently only tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
+ The increased context is tested to work with [ExLlama](https://github.com/turboderp/exllama), via the latest release of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
+
+ It has also been tested from Python code using AutoGPTQ, and `trust_remote_code=True`.
+
+ Code credits:
+ - Original concept and code for increasing context length: [kaiokendev](https://huggingface.co/kaiokendev)
+ - Updated Llama modelling code that includes this automatically via trust_remote_code: [emozilla](https://huggingface.co/emozilla).
 
 Please read carefully below to see how to use it.
 
- **NOTE**: Using the full 8K context will exceed 24GB VRAM.
+ **NOTE**: Using the full 8K context on a 30B model will exceed 24GB VRAM.
 
 GGML versions are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.
 
@@ -40,7 +46,7 @@ GGML versions are not yet provided, as there is not yet support for SuperHOT in
 
 GGML quants are not yet provided, as there is not yet support for SuperHOT in llama.cpp. This is being investigated and will hopefully come soon.
 
- ## How to easily download and use this model in text-generation-webui
+ ## How to easily download and use this model in text-generation-webui with ExLlama
 
 Please make sure you're using the latest version of text-generation-webui
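For readers who want to drive ExLlama directly from Python rather than through text-generation-webui, here is a minimal sketch based on the example scripts bundled with the standalone exllama repo. The `ExLlamaConfig`/`ExLlamaGenerator` names, the `compress_pos_emb` setting, and the local model directory are assumptions drawn from those examples, not something this README documents, so verify them against the version of exllama you have installed.

```python
# Rough sketch: load this GPTQ model with the standalone exllama repo
# (https://github.com/turboderp/exllama). Assumes it is run from inside a
# checkout of that repo so its model/tokenizer/generator modules import.
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Placeholder path: wherever you downloaded the model files
model_directory = "models/TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
config.max_seq_len = 8192        # SuperHOT extended context
config.compress_pos_emb = 4.0    # 8192 / 2048 = 4x position interpolation

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

generator.settings.temperature = 0.7
generator.settings.top_p = 0.95

print(generator.generate_simple("USER: Tell me about AI\nASSISTANT:", max_new_tokens = 256))
```

The SuperHOT-specific settings are `max_seq_len = 8192` and `compress_pos_emb = 4`, mirroring the 8K context and the 4x interpolation of the original 2048-token positions.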
@@ -56,9 +62,76 @@ Please make sure you're using the latest version of text-generation-webui
 10. The model will automatically load, and is now ready for use!
 11. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
 
- ## How to use this GPTQ model from Python code - TBC
-
- Using this model with increased context from Python code is currently untested, so this section is removed for now.
+ ## How to use this GPTQ model from Python code with AutoGPTQ
+
+ First make sure you have AutoGPTQ and Einops installed:
+
+ ```
+ pip3 install einops auto-gptq
+ ```
+
+ Then run the following code. Note that in order to get this to work, `config.json` has been hardcoded to a sequence length of 8192.
+
+ If you want to try 4096 instead to reduce VRAM usage, please manually edit `config.json` to set `max_position_embeddings` to the value you want.
+
+ ```python
+ from transformers import AutoTokenizer, pipeline, logging
+ from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
+
+ model_name_or_path = "TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ"
+ model_basename = "wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order"
+
+ use_triton = False
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+
+ model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
+         model_basename=model_basename,
+         use_safetensors=True,
+         trust_remote_code=True,
+         device_map='auto',
+         use_triton=use_triton,
+         quantize_config=None)
+
+ model.seqlen = 8192
+
+ # Note: check the prompt template is correct for this model.
+ prompt = "Tell me about AI"
+ prompt_template = f'''USER: {prompt}
+ ASSISTANT:'''
+
+ print("\n\n*** Generate:")
+
+ input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+ output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+ print(tokenizer.decode(output[0]))
+
+ # Inference can also be done using transformers' pipeline
+
+ # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
+ logging.set_verbosity(logging.CRITICAL)
+
+ print("*** Pipeline:")
+ pipe = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     max_new_tokens=512,
+     temperature=0.7,
+     top_p=0.95,
+     repetition_penalty=1.15
+ )
+
+ print(pipe(prompt_template)[0]['generated_text'])
+ ```
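The example above assumes the shipped `config.json`, which hardcodes `max_position_embeddings` to 8192. If you would rather run at 4096 to reduce VRAM usage, the manual edit described above can be scripted in a few lines; the model directory below is a placeholder for wherever you downloaded the files.

```python
import json

# Placeholder path: point this at your local copy of the downloaded model
config_path = "models/TheBloke_WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ/config.json"

with open(config_path) as f:
    config = json.load(f)

# Halve the context window to reduce VRAM usage
config["max_position_embeddings"] = 4096

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```

If you do lower it, it seems sensible to also set `model.seqlen = 4096` in the example above so the two values stay consistent.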
+
+ ## Using other UIs: monkey patch
+
+ Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.
+
+ It can theoretically be added to any Python UI or custom code to enable the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
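For anyone who does want to experiment with the monkey patch route, here is a minimal, untested sketch. It assumes `llama_rope_scaled_monkey_patch.py` has been downloaded from this repo next to your script and exposes a `replace_llama_rope_with_scaled_rope()` helper, which is how kaiokendev's patch is typically structured; check the file for the exact function name. The patch must be applied before the model is loaded.

```python
# Sketch only: apply the RoPE-scaling monkey patch before loading, as an
# alternative to trust_remote_code=True. The helper name below is assumed from
# kaiokendev's patch -- verify it against the actual file in this repo.
from llama_rope_scaled_monkey_patch import replace_llama_rope_with_scaled_rope
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# The patch swaps transformers' Llama rotary embedding for the scaled version,
# so it must run before the model is instantiated.
replace_llama_rope_with_scaled_rope()

model_name_or_path = "TheBloke/WizardLM-33B-V1.0-Uncensored-SuperHOT-8K-GPTQ"
model_basename = "wizardlm-33b-v1.0-uncensored-superhot-8k-GPTQ-4bit--1g.act.order"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Note: this repo's config.json also references the custom modelling code, so the
# loader may still ask for trust_remote_code=True; the patch route is mainly of
# interest for UIs or code paths where that option is not available.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        device_map='auto',
        quantize_config=None)

model.seqlen = 8192
```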
 
 ## Provided files