3.0 bpw?

#1 opened by CulturedMan

I'm at 12 GB of VRAM and can't hit 8k context at 3.5 bpw. If you could upload a 3.0 bpw variant, that would be greatly appreciated.
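(For rough context on why 3.5 bpw is tight on a 12 GB card: Beyonder-4x7B-v3 is a 4x7B Mixtral-style MoE, so the total parameter count, assumed here to be roughly 24.2B, puts the weights alone close to the whole card before any KV cache. A back-of-envelope sketch, with the parameter count and overhead treated as rough assumptions:)

```python
# Back-of-envelope VRAM estimate for the quantized weights alone.
# Assumption: ~24.2e9 total parameters for a 4x7B Mixtral-style MoE.
PARAMS = 24.2e9

def weight_gib(bpw: float) -> float:
    """Approximate size of the quantized weights in GiB at a given bits-per-weight."""
    return PARAMS * bpw / 8 / 1024**3

for bpw in (3.5, 3.0):
    print(f"{bpw} bpw -> ~{weight_gib(bpw):.1f} GiB of weights")
# ~9.9 GiB at 3.5 bpw vs ~8.5 GiB at 3.0 bpw: on a 12 GiB card, that
# difference is roughly the headroom an 8k-token KV cache needs.
```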

oo fair, yeah i'll make one now!

Just started, should be up in ~20-30 min

it's up btw @CulturedMan

I'm using Sillytavern with the recommended settings, and with Mistral format. It was working for a while, but eventually it starts giving me a "cannot extract reply in 5 tries" message. I don't get the error with any other model I'm using. Just thought I'd pass that along. Is everything working on your end?

I assume that's not being generated by the model, but by SillyTavern? I didn't have any issues, but I also didn't go particularly in depth. Is it possible that it's just taking a long time and whatever you're using to host SillyTavern is dropping the connection while waiting for a response? I've encountered that before and had to raise my timeout.

The error is generated rather quickly. Within a few seconds it informs me of the 5 failed attempts, so it's not the timeout issue. Maybe it is Sillytavern related. I'll keep messing with it!

The weird thing is that it works perfectly for a while before the errors start.

what do you use as your backend for sillytavern?

Oobabooga / Text Generation Web UI.

I just loaded it up again. I'm looking at the logs, and it seems to be giving me an assertion error after the first response. The first response itself works perfectly, though.

Here's what I get after the first response:

Traceback (most recent call last):
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\modules\callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\modules\text_generation.py", line 397, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 1592, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\transformers\generation\utils.py", line 2696, in sample
    outputs = self(
              ^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\modules\exllamav2_hf.py", line 127, in __call__
    self.ex_model.forward(seq_tensor[longest_prefix:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Zugzwang\Desktop\test\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 553, in forward
    assert past_len + q_len <= cache.max_seq_len, "Total sequence length exceeds cache size in model.forward"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Total sequence length exceeds cache size in model.forward

I have max_seq_length set to 8192 in Oobabooga and Sillytavern.

According to Oobabooga, the total context was 2252 when it started bugging out. The 2048 context threshold may be the point of failure.
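For reference, that assertion comes from exllamav2 itself: the KV cache is allocated with its own max_seq_len, and model.forward() requires past_len + q_len to stay within it. Below is a minimal sketch of the plain exllamav2 loading pattern, not Oobabooga's ExLlamav2_HF wrapper; the paths are placeholders, and the guess here is that the wrapper's cache ended up smaller than the 8192 set in the UI:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

# Build the config and force the intended context length.
config = ExLlamaV2Config()
config.model_dir = "/path/to/Beyonder-4x7B-v3-exl2"  # placeholder path
config.prepare()
config.max_seq_len = 8192

model = ExLlamaV2(config)

# The cache has its own max_seq_len; if it is allocated smaller than the
# prompts the frontend sends, forward() trips the assertion shown above.
cache = ExLlamaV2Cache(model, max_seq_len=config.max_seq_len, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
```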

what's your max_prompt_len set to?

The truncation length is set to 8192.

I just tried setting the max context in Sillytavern to 2048, and it starts giving replies again normally. If I try going up to 3072, it gives me errors again. So, it does appear to be related to that threshold in some way.
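That boundary matches the assertion's arithmetic. A toy illustration of the same check, using the numbers from this thread (purely illustrative, not the wrapper's actual code path):

```python
def fits_in_cache(past_len: int, q_len: int, cache_max_seq_len: int) -> bool:
    # Mirrors the condition asserted in exllamav2's model.forward().
    return past_len + q_len <= cache_max_seq_len

# If the cache were effectively sized for 2048 tokens:
print(fits_in_cache(past_len=1900, q_len=100, cache_max_seq_len=2048))  # True  -> replies fine
print(fits_in_cache(past_len=2252, q_len=100, cache_max_seq_len=2048))  # False -> AssertionError above
```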

Changing the context length or not, I can't get this one to load at all.

21:04:16-528836 INFO Loading "Beyonder-4x7B-v3-exl2"
21:07:01-521544 ERROR Failed to load the model.
Traceback (most recent call last):
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\ui_model_menu.py", line 245, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\models.py", line 86, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\models.py", line 344, in ExLlamav2_loader
    model, tokenizer = Exllamav2Model.from_pretrained(model_name)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\modules\exllamav2.py", line 70, in from_pretrained
    model.load_autosplit(cache)
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 349, in load_autosplit
    for item in f: x = item
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\model.py", line 438, in load_autosplit_gen
    module.load()
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\attn.py", line 239, in load
    self.o_proj.load()
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\linear.py", line 90, in load
    if w is None: w = self.load_weight()
                      ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\module.py", line 106, in load_weight
    qtensors = self.load_multi(key, ["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm", "bias"])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\module.py", line 86, in load_multi
    tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tom_N\Desktop\text-generation-webui\installer_files\env\Lib\site-packages\exllamav2\fasttensors.py", line 204, in get_tensor
    tensor = f.get_tensor(key)
             ^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

That's the error you get when the PyTorch build (or a CUDA extension it loads) wasn't compiled for your GPU. This literally just happened to me on my P100, but I still need to look into how to fix it.
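One quick way to check that theory is to compare the GPU's compute capability against the CUDA architectures the installed PyTorch build ships kernels for (the exllamav2 wheel has its own, separate arch list). A small diagnostic sketch, assuming an otherwise working install:

```python
import torch

# Diagnostic: does this PyTorch build ship kernels for the installed GPU?
print("torch", torch.__version__, "| built against CUDA", torch.version.cuda)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print("GPU:", torch.cuda.get_device_name(0), f"(sm_{major}{minor})")
    print("Arch list compiled into this build:", torch.cuda.get_arch_list())
    # If the GPU's sm_XY is missing from that list (or from the list the
    # exllamav2 extension was built with), no kernel exists for the card and
    # CUDA raises "no kernel image is available for execution on the device".
else:
    print("CUDA is not available to this PyTorch build at all.")
```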

I just wanted to report that the model is now working for me at full context. After the latest Sillytavern and Oobabooga updates, it just started working all of a sudden. It has quickly become one of my favorites. Cheers!
