3.0 bpw?

#1 opened by CulturedMan

I'm at 12 GB of VRAM and can't hit 8k context at 3.5 bpw. If you could upload a 3.0 bpw variant, that would be greatly appreciated.
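(For rough context on why 3.5 bpw is tight on a 12 GB card: Beyonder-4x7B-v3 is a 4x7B Mixtral-style MoE, so the total parameter count, assumed here to be roughly 24.2B, puts the weights alone close to the whole card before any KV cache. A back-of-envelope sketch, with the parameter count and overhead treated as rough assumptions:)

```python
# Back-of-envelope VRAM estimate for the quantized weights alone.
# Assumption: ~24.2e9 total parameters for a 4x7B Mixtral-style MoE.
PARAMS = 24.2e9

def weight_gib(bpw: float) -> float:
    """Approximate size of the quantized weights in GiB at a given bits-per-weight."""
    return PARAMS * bpw / 8 / 1024**3

for bpw in (3.5, 3.0):
    print(f"{bpw} bpw -> ~{weight_gib(bpw):.1f} GiB of weights")
# ~9.9 GiB at 3.5 bpw vs ~8.5 GiB at 3.0 bpw: on a 12 GiB card, that
# difference is roughly the headroom an 8k-token KV cache needs.
```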

oo fair, yeah i'll make one now!

Just started, should be up in ~20-30 min

it's up btw @CulturedMan

I'm using Sillytavern with the recommended settings, and with Mistral format. It was working for a while, but eventually it starts giving me a "cannot extract reply in 5 tries" message. I don't get the error with any other model I'm using. Just thought I'd pass that along. Is everything working on your end?

I assume that's not being generated by the model, but by SillyTavern? I didn't have any issues, but I also didn't go particularly in depth. Is it possible that it's just taking a long time and whatever you're using to host SillyTavern is dropping the connection while waiting for a response? I've encountered that before and had to raise my timeout.

The error is generated rather quickly. Within a few seconds it informs me of the 5 failed attempts, so it's not the timeout issue. Maybe it is Sillytavern related. I'll keep messing with it!

The weird thing is that it works perfectly for a while before the errors start.

what do you use as your backend for sillytavern?

Oobabooga / Text Generation Web UI.

I just loaded it up again. I'm looking at the logs, and it seems to be giving me an assertion error after the first response. The first response itself works perfectly, though.

Here's what I get after the first response:

Traceback (most recent call last):
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\modules\callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\modules\text_generation.py", line 397, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 1592, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2696, in sample
    outputs = self(
              ^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\modules\exllamav2_hf.py", line 127, in __call__
    self.ex_model.forward(seq_tensor[longest_prefix:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 553, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Total sequence length exceeds cache size in model.forward

I have max_seq_length set to 8192 in Oobabooga and Sillytavern.

According to Oobabooga, the total context was 2252 when it started bugging out. The 2048 context threshold may be the point of failure.
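For reference, that assertion comes from exllamav2 itself: the KV cache is allocated with its own max_seq_len, and model.forward() requires past_len + q_len to stay within it. Below is a minimal sketch of the plain exllamav2 loading pattern, not Oobabooga's ExLlamav2_HF wrapper; the paths are placeholders, and the guess here is that the wrapper's cache ended up smaller than the 8192 set in the UI:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

# Build the config and force the intended context length.
config = ExLlamaV2Config()
config.model_dir = "/path/to/Beyonder-4x7B-v3-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 8192

model = ExLlamaV2(config)

# The cache has its own max_seq_len; if it is allocated smaller than the
# prompts the frontend sends, forward() trips the assertion shown above.
cache = ExLlamaV2Cache(model, max_seq_len=config.max_seq_len, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```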

what's your max_prompt_len set to?

The truncation length is set to 8192.

I just tried setting the max context in Sillytavern to 2048, and it starts giving replies again normally. If I try going up to 3072, it gives me errors again. So, it does appear to be related to that threshold in some way.
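That boundary matches the assertion's arithmetic. A toy illustration of the same check, using the numbers from this thread (purely illustrative, not the wrapper's actual code path):

```python
def fits_in_cache(past_len: int, q_len: int, cache_max_seq_len: int) -> bool:
    # Mirrors the condition asserted in exllamav2's model.forward().
    return past_len + q_len <= cache_max_seq_len

# If the cache were effectively sized for 2048 tokens:
print(fits_in_cache(past_len=1900, q_len=100, cache_max_seq_len=2048))  # True  -> replies fine
print(fits_in_cache(past_len=2252, q_len=100, cache_max_seq_len=2048))  # False -> AssertionError above
```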

Changing the context length or not, I can't get this one to load at all.

21:04:16-528836 INFO Loading "Beyonder-4x7B-v3-exl2"
21:07:01-521544 ERROR Failed to load the model.
Traceback (most recent call last):
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\ui_model_menu.py", line 245, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\models.py", line 86, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\models.py", line 344, in ExLlamav2_loader
    model, tokenizer = Exllamav2Model.from_pretrained(model_name)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\exllamav2.py", line 70, in from_pretrained
    model.load_autosplit(cache)
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 349, in load_autosplit
    for item in f: x = item
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 438, in load_autosplit_gen
    module.load()
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\attn.py", line 239, in load
    self.o_proj.load()
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\linear.py", line 90, in load
    if w is None: w = self.load_weight()
                      ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\module.py", line 106, in load_weight
    qtensors = self.load_multi(key, ["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm", "bias"])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\module.py", line 86, in load_multi
    tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\fasttensors.py", line 204, in get_tensor
    tensor = f.get_tensor(key)
             ^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

That's the error you get when the PyTorch build (or a CUDA extension it loads) wasn't compiled for your GPU. This literally just happened to me on my P100, but I still need to look into how to fix it.
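One quick way to check that theory is to compare the GPU's compute capability against the CUDA architectures the installed PyTorch build ships kernels for (the exllamav2 wheel has its own, separate arch list). A small diagnostic sketch, assuming an otherwise working install:

```python
import torch

# Diagnostic: does this PyTorch build ship kernels for the installed GPU?
print("torch", torch.__version__, "| built against CUDA", torch.version.cuda)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), f"(sm_{major}{minor})")
    print("Arch list compiled into this build:", torch.cuda.get_arch_list())
    # If the GPU's sm_XY is missing from that list (or from the list the
    # exllamav2 extension was built with), no kernel exists for the card and
    # CUDA raises "no kernel image is available for execution on the device".
else:
    print("CUDA is not available to this PyTorch build at all.")
```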

I just wanted to report that the model is now working for me at full context. After the latest Sillytavern and Oobabooga updates, it just started working all of a sudden. It has quickly become one of my favorites. Cheers!
