TheBloke committed
Commit b20041c
1 Parent(s): cc97ea0

Initial merged FP16 model commit

Files changed (1): README.md +66 -1
README.md CHANGED
@@ -28,9 +28,74 @@ Note that `config.json` has been set to a sequence length of 8192. This can be m
  ## Repositories available

  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-GPTQ)
  * [Unquantised SuperHOT fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-fp16)
  * [Unquantised base fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Monero/Manticore-13b-Chat-Pyg-Guanaco)
  <!-- footer start -->
  ## Discord

@@ -53,7 +118,7 @@ Donaters will get priority support on any and all AI/LLM/model questions and req
  **Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

- **Patreon special mentions**: Pyrater, WelcomeToTheClub, Kalila, Mano Prime, Trenton Dambrowitz, Spiking Neurons AB, Pierre Kircher, Fen Risland, Kevin Schuppel, Luke, Rainer Wilmers, vamX, Gabriel Puliatti, Alex , Karl Bernard, Ajan Kanaga, Talal Aujan, Space Cruiser, ya boyyy, biorpg, Johann-Peter Hartmann, Asp the Wyvern, Ai Maven, Ghost , Preetika Verma, Nikolai Manek, trip7s trip, John Detwiler, Fred von Graf, Artur Olbinski, subjectnull, John Villwock, Junyu Yang, Rod A, Lone Striker, Chris McCloskey, Iucharbius , Matthew Berman, Illia Dulskyi, Khalefa Al-Ahmad, Imad Khwaja, chris gileta, Willem Michiel, Greatston Gnanesh, Derek Yates, K, Alps Aficionado, Oscar Rangel, David Flickinger, Luke Pendergrass, Deep Realms, Eugene Pentland, Cory Kujawski, terasurfer , Jonathan Leane, senxiiz, Joseph William Delisle, Sean Connelly, webtim, zynix , Nathan LeClaire.

  Thank you to all my generous patrons and donaters!
 
  ## Repositories available

  * [4-bit GPTQ models for GPU inference](https://huggingface.co/TheBloke/Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-GPTQ)
+ * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU inference](https://huggingface.co/TheBloke/Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-GGML)
  * [Unquantised SuperHOT fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-fp16)
  * [Unquantised base fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/Monero/Manticore-13b-Chat-Pyg-Guanaco)
 
+ ## How to use this model from Python code
+
+ First make sure you have Einops installed:
+
+ ```
+ pip3 install einops
+ ```
+
+ Then run the following code. `config.json` defaults to a sequence length of 8192, but you can also configure this in your Python code.
+
+ The provided modelling code, activated with `trust_remote_code=True`, will automatically set the `scale` parameter from the configured `max_position_embeddings`. E.g. for 8192, `scale` is set to `4`.
+
+ ```python
+ from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, pipeline
+
+ model_name_or_path = "TheBloke/Manticore-13B-Chat-Pyg-Guanaco-SuperHOT-8K-fp16"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
+
+ config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
+ # Change this to the sequence length you want
+ config.max_position_embeddings = 8192
+
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+                                              config=config,
+                                              trust_remote_code=True,
+                                              device_map='auto')
+
+ # Note: check to confirm that this prompt template is correct for this model!
+ prompt = "Tell me about AI"
+ prompt_template=f'''USER: {prompt}
+ ASSISTANT:'''
+
+ print("\n\n*** Generate:")
+
+ input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
+ output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
+ print(tokenizer.decode(output[0]))
+
+ # Inference can also be done using transformers' pipeline
+ print("*** Pipeline:")
+ pipe = pipeline(
+     "text-generation",
+     model=model,
+     tokenizer=tokenizer,
+     max_new_tokens=512,
+     temperature=0.7,
+     top_p=0.95,
+     repetition_penalty=1.15
+ )
+
+ print(pipe(prompt_template)[0]['generated_text'])
+ ```
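The relationship between `max_position_embeddings` and `scale` described above can be sketched as a one-line calculation. This is a minimal illustration, not the repo's actual modelling code: the names `BASE_CONTEXT` and `rope_scale` are my own, and it assumes LLaMA-13B's original training context of 2048 tokens.

```python
# Illustrative sketch (assumption: base LLaMA context is 2048 tokens).
BASE_CONTEXT = 2048

def rope_scale(max_position_embeddings: int) -> float:
    """Position-interpolation factor for a target context length."""
    return max_position_embeddings / BASE_CONTEXT

print(rope_scale(8192))  # 4.0, matching the example in the text
```

Under this assumption, setting `config.max_position_embeddings = 4096` instead would give a `scale` of `2`.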
+
+ ## Using other UIs: monkey patch
+
+ Provided in the repo is `llama_rope_scaled_monkey_patch.py`, written by @kaiokendev.
+
+ It can theoretically be added to any Python UI or custom code to enable the same result as `trust_remote_code=True`. I have not tested this, and it should be superseded by using `trust_remote_code=True`, but I include it for completeness and for interest.
+
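To give a sense of what such a RoPE-scaling patch does, here is a minimal sketch in the spirit of position interpolation. This is not the contents of `llama_rope_scaled_monkey_patch.py`: the class name, the hard-coded scale, and the patching target are illustrative assumptions only.

```python
# Illustrative sketch of RoPE position interpolation (not the actual patch file).
import torch

SCALE = 4.0  # assumed: 8192 target context / 2048 base context


class ScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=8192, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        # Divide positions by SCALE, squeezing the extended range of
        # positions into the span the model was originally trained on.
        t = torch.arange(max_position_embeddings).float() / SCALE
        freqs = torch.einsum("i,j->ij", t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())

    def forward(self, x, seq_len):
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]


# A monkey patch would then swap this class in before loading the model, e.g.:
# transformers.models.llama.modeling_llama.LlamaRotaryEmbedding = ScaledRotaryEmbedding
```

The design point is that scaling happens at embedding construction time, which is why the patch must be applied before the model is instantiated.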
  <!-- footer start -->
  ## Discord
 
 
  **Special thanks to**: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

+ **Patreon special mentions**: zynix , ya boyyy, Trenton Dambrowitz, Imad Khwaja, Alps Aficionado, chris gileta, John Detwiler, Willem Michiel, RoA, Mano Prime, Rainer Wilmers, Fred von Graf, Matthew Berman, Ghost , Nathan LeClaire, Iucharbius , Ai Maven, Illia Dulskyi, Joseph William Delisle, Space Cruiser, Lone Striker, Karl Bernard, Eugene Pentland, Greatston Gnanesh, Jonathan Leane, Randy H, Pierre Kircher, Willian Hasse, Stephen Murray, Alex , terasurfer , Edmond Seymore, Oscar Rangel, Luke Pendergrass, Asp the Wyvern, Junyu Yang, David Flickinger, Luke, Spiking Neurons AB, subjectnull, Pyrater, Nikolai Manek, senxiiz, Ajan Kanaga, Johann-Peter Hartmann, Artur Olbinski, Kevin Schuppel, Derek Yates, Kalila, K, Talal Aujan, Khalefa Al-Ahmad, Gabriel Puliatti, John Villwock, WelcomeToTheClub, Daniel P. Andersen, Preetika Verma, Deep Realms, Fen Risland, trip7s trip, webtim, Sean Connelly, Michael Levine, Chris McCloskey, biorpg, vamX, Viktor Bowallius, Cory Kujawski.

  Thank you to all my generous patrons and donaters!