Weights broken?

#1
by Kerni - opened

Hello there,
I downloaded both the 4-bit 32g and 128g weights, and on my machine the model only spits out gibberish.

I used text-generation-webui as the backend and tested with both ExLlama v1 and ExLlama v2 using multiple parameters:
--model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21
--model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --alpha_value 2 --max_seq_len 8192
(SillyTavern as the front end.)

Other models work perfectly.
(Xwin 70B, for example.)

Can anyone confirm this or am I just an idiot >_< ?

Can you show me an example of the gibberish - is it one word repeated over and over?

--max_seq_len 8192 should be 32768, as that is the default for LongAlpaca, and no alpha_value.

exllama needs to manually scale max_seq_len based on alpha_value, e.g. if alpha_value=2, max_seq_len needs to be 32768*2.
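A minimal sketch of that rule of thumb, assuming LongAlpaca-70B's advertised 32768-token base context (the exact base length is the model's, not exllama's):

```python
# Rule of thumb from the comment above, not an exllama API: with alpha (NTK)
# scaling the usable context grows roughly in proportion, so max_seq_len has
# to be raised by hand to match the alpha_value you pass.
BASE_CTX = 32768  # LongAlpaca-70B's advertised context length

def max_seq_len_for_alpha(alpha_value: float) -> int:
    """e.g. alpha_value=2 -> pass --max_seq_len 65536"""
    return int(BASE_CTX * alpha_value)

print(max_seq_len_for_alpha(1))  # 32768 (no alpha flag needed)
print(max_seq_len_for_alpha(2))  # 65536
```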

Hello again, sorry for the late reply (I did some more testing after Yhyu13 posted his comment).
(I used the 128g 4-bit weights for testing this time.)

Can you show me an example of the gibberish - is it one word repeated over and over?

Yes, it is kind of like that. (This time I used: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --max_seq_len 16384)
grafik.png

--max_seq_len 8192 should be 32768, as that is the default for LongAlpaca, and no alpha_value.

exllama needs to manually scale max_seq_len based on alpha_value, e.g. if alpha_value=2, max_seq_len needs to be 32768*2.

I think you are right, but I cannot test it; my 2x 3090s do not want to load the weights with:
--model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 24,24 --max_seq_len 32768
grafik.png

I guess that one is on me >_<
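For what it's worth, a rough sketch of why a 24,24 split might leave no headroom: if --gpu-split reserves that much VRAM per card for the weights, the KV cache still has to fit somewhere. Assuming Llama-2-70B's usual GQA layout (80 layers, 8 KV heads, head dim 128) and an fp16 cache:

```python
# Back-of-the-envelope estimate, not a measurement: KV-cache size for a
# Llama-2-70B-style model (80 layers, 8 KV heads via GQA, head dim 128)
# with an fp16 cache, on top of the ~35-40 GB of 4-bit weights.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2

def kv_cache_gib(seq_len: int) -> float:
    # 2x for keys and values
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * seq_len / 1024**3

print(f"{kv_cache_gib(32768):.1f} GiB")  # ~10 GiB at a 32768-token context
```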

Yes, that's a sequence length issue as we thought

Can you try with --max_seq_len 8192 - and no alpha parameter specified

Okay, I was able to do inference with --max_seq_len 32768.
I used: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --max_seq_len 32768

But.. ahm....
grafik.png

Yes, that's a sequence length issue as we thought

Can you try with --max_seq_len 8192 - and no alpha parameter specified

Of course,
here is the result using: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --max_seq_len 8192

grafik.png

Not sure then, sorry - maybe it only works at 32768. I've not played around with sequence length in a UI like text-generation-webui in a while. I thought it was meant to also work at lower sequence lengths

What about if you use --compress_pos_emb 2 --max_seq_len 8192 - you'll need to check that's the correct name for compress_pos_emb, but it's something like that
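A minimal sketch of the pattern behind that suggestion, assuming compress_pos_emb is linear RoPE position interpolation over Llama 2's 4096-token pre-training context (treat the exact flag name and semantics as an assumption to verify in the webui docs):

```python
# Assumption: compress_pos_emb is linear RoPE scaling, so the factor is the
# target context divided by the base model's pre-training context
# (4096 tokens for Llama 2).
LLAMA2_CTX = 4096

def compress_pos_emb_for(max_seq_len: int) -> int:
    return max_seq_len // LLAMA2_CTX

for ctx in (8192, 16384, 32768):
    print(ctx, "->", compress_pos_emb_for(ctx))  # 8192->2, 16384->4, 32768->8
```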

Not sure then, sorry - maybe it only works at 32768. I've not played around with sequence length in a UI like text-generation-webui in a while. I thought it was meant to also work at lower sequence lengths

What about if you use --compress_pos_emb 2 --max_seq_len 8192 - you'll need to check that's the correct name for compress_pos_emb, but it's something like that

This was an excellent idea actually; I tested it just now with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 2 --max_seq_len 8192
And.. it is.. ahm.. kind of okay?

grafik.png

Then I tested it with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 2 --max_seq_len 16384

grafik.png

And with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 4 --max_seq_len 16384

grafik.png
Ahm.. okay.. didn't know that our tower was half a kilometer long.. O.o

And with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --compress_pos_emb 8 --max_seq_len 32768

grafik.png

I guess this is the way to go then; I initially thought that this model does not need compress_pos_emb or alpha_value to function.
So.. I guess we can close this one now?
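Before closing: one way to double-check what context and RoPE scaling the checkpoint itself advertises is to read its config.json. A minimal sketch, assuming the model folder name from the commands above and that the repo sets these keys at all:

```python
# Hypothetical check: the path uses the folder name from the commands above,
# and the keys only appear if the repo's config.json actually sets them.
import json
from pathlib import Path

cfg = json.loads(Path("models/TheBlokeLongAlpaca-70B-GPTQ/config.json").read_text())
print(cfg.get("max_position_embeddings"))  # 32768 would explain why 32K works best
print(cfg.get("rope_scaling"))             # e.g. {"type": "linear", "factor": 8.0}, if present
```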

I tested this and got good, coherent output at max_sequence_length 32768 and compress_pos_emb of 8 using exllama_hf (not exllamav2). Other sequence lengths produced less coherent but still kind of usable output. Seems important to set it to 32K.

I tested this and got good, coherent output at max_sequence_length 32768 and compress_pos_emb of 8 using exllama_hf (not exllamav2). Other sequence lengths produced less coherent but still kind of usable output. Seems important to set it to 32K.

Just note that exllama_hf uses the Hugging Face transformers implementation, which is much slower than exllama with flash attention on CUDA devices.

exllama_hf at most hit a 40% GPU usage rate for single-card inference with a 7B model on an RTX 3090, whereas exllama with flash attention easily achieves a >95% usage rate.

Thank you. I'll try the model out on my Ubuntu build, which has FA2 installed. The first run was on Windows.
