Weights broken?

#1
by Kerni - opened

Hello there,
I downloaded both the 4-bit 32g and 128g weights, and on my machine the model only spits out gibberish.

I used text-generation-webui as the backend and tested with both ExLlama v1 and ExLlama v2 using multiple parameters:
--model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21
--model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --alpha_value 2 --max_seq_len 8192
(SillyTavern as the front end.)

Other models work perfectly.
(Xwin 70B, for example.)

Can anyone confirm this or am I just an idiot >_< ?

Can you show me an example of the gibberish - is it one word repeated over and over?

--max_seq_len 8192 should be 32768, as that is the default for LongAlpaca, and no alpha_value.

exllama needs to manually scale max_seq_len based on alpha_value, e.g. if alpha_value=2, max_seq_len needs to be 32768*2.
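A minimal sketch of that rule of thumb, assuming LongAlpaca-70B's advertised 32768-token base context (the exact base length is the model's, not exllama's):

```python
# Rule of thumb from the comment above, not an exllama API: with alpha (NTK)
# scaling the usable context grows roughly in proportion, so max_seq_len has
# to be raised by hand to match the alpha_value you pass.
BASE_CTX = 32768  # LongAlpaca-70B's advertised context length

def max_seq_len_for_alpha(alpha_value: float) -> int:
    """e.g. alpha_value=2 -> pass --max_seq_len 65536"""
    return int(BASE_CTX * alpha_value)

print(max_seq_len_for_alpha(1))  # 32768 (no alpha flag needed)
print(max_seq_len_for_alpha(2))  # 65536
```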

Hello again, sorry for the late reply (I did some more testing after Yhyu13 posted his comment).
(I used the 128g 4-bit weights for testing this time.)

Can you show me an example of the gibberish - is it one word repeated over and over?

Yes, it is kind of like that. (This time I used: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --max_seq_len 16384)
grafik.png

--max_seq_len 8192 should be 32768, as that is the default for LongAlpaca, and no alpha_value.

exllama needs to manually scale max_seq_len based on alpha_value, e.g. if alpha_value=2, max_seq_len needs to be 32768*2.

I think you are right, but I cannot test it; my 2x 3090s do not want to load the weights with:
--model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 24,24 --max_seq_len 32768
grafik.png

I guess that one is on me >_<
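For what it's worth, a rough sketch of why a 24,24 split might leave no headroom: if --gpu-split reserves that much VRAM per card for the weights, the KV cache still has to fit somewhere. Assuming Llama-2-70B's usual GQA layout (80 layers, 8 KV heads, head dim 128) and an fp16 cache:

```python
# Back-of-the-envelope estimate, not a measurement: KV-cache size for a
# Llama-2-70B-style model (80 layers, 8 KV heads via GQA, head dim 128)
# with an fp16 cache, on top of the ~35-40 GB of 4-bit weights.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2

def kv_cache_gib(seq_len: int) -> float:
    # 2x for keys and values
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * seq_len / 1024**3

print(f"{kv_cache_gib(32768):.1f} GiB")  # ~10 GiB at a 32768-token context
```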

Yes, that's a sequence length issue as we thought

Can you try with --max_seq_len 8192 - and no alpha parameter specified

Okay, I was able to do inference with --max_seq_len 32768.
I used: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --max_seq_len 32768

But.. ahm....
grafik.png

Yes, that's a sequence length issue as we thought

Can you try with --max_seq_len 8192 - and no alpha parameter specified

Of course,
here is the result using: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --max_seq_len 8192

grafik.png

Not sure then, sorry - maybe it only works at 32768. I've not played around with sequence length in a UI like text-generation-webui in a while. I thought it was meant to also work at lower sequence lengths

What about if you use --compress_pos_emb 2 --max_seq_len 8192 - you'll need to check that's the correct name for compress_pos_emb, but it's something like that
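A minimal sketch of the pattern behind that suggestion, assuming compress_pos_emb is linear RoPE position interpolation over Llama 2's 4096-token pre-training context (treat the exact flag name and semantics as an assumption to verify in the webui docs):

```python
# Assumption: compress_pos_emb is linear RoPE scaling, so the factor is the
# target context divided by the base model's pre-training context
# (4096 tokens for Llama 2).
LLAMA2_CTX = 4096

def compress_pos_emb_for(max_seq_len: int) -> int:
    return max_seq_len // LLAMA2_CTX

for ctx in (8192, 16384, 32768):
    print(ctx, "->", compress_pos_emb_for(ctx))  # 8192->2, 16384->4, 32768->8
```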

Not sure then, sorry - maybe it only works at 32768. I've not played around with sequence length in a UI like text-generation-webui in a while. I thought it was meant to also work at lower sequence lengths

What about if you use --compress_pos_emb 2 --max_seq_len 8192 - you'll need to check that's the correct name for compress_pos_emb, but it's something like that

This was an excellent idea actually; I tested it just now with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 2 --max_seq_len 8192
And.. it is.. ahm.. kind of okay?

grafik.png

Then I tested it with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 2 --max_seq_len 16384

grafik.png

And with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 21,21 --compress_pos_emb 4 --max_seq_len 16384

grafik.png
Ahm.. okay.. didn't know that our tower was half a kilometer long.. O.o

And with: --model TheBlokeLongAlpaca-70B-GPTQ --loader exllamav2 --api --verbose --gpu-split 19,23 --compress_pos_emb 8 --max_seq_len 32768

grafik.png

I guess this is the way to go then; I initially thought that this model does not need compress_pos_emb or alpha_value to function.
So.. I guess we can close this one now?
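Before closing: one way to double-check what context and RoPE scaling the checkpoint itself advertises is to read its config.json. A minimal sketch, assuming the model folder name from the commands above and that the repo sets these keys at all:

```python
# Hypothetical check: the path uses the folder name from the commands above,
# and the keys only appear if the repo's config.json actually sets them.
import json
from pathlib import Path

cfg = json.loads(Path("models/TheBlokeLongAlpaca-70B-GPTQ/config.json").read_text())
print(cfg.get("max_position_embeddings"))  # 32768 would explain why 32K works best
print(cfg.get("rope_scaling"))             # e.g. {"type": "linear", "factor": 8.0}, if present
```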

I tested this and got good, coherent output at max_sequence_length 32768 and compress_pos_emb of 8 using exllama_hf (not exllamav2). Other sequence lengths produced less coherent but still kind of usable output. Seems important to set it to 32K.

I tested this and got good, coherent output at max_sequence_length 32768 and compress_pos_emb of 8 using exllama_hf (not exllamav2). Other sequence lengths produced less coherent but still kind of usable output. Seems important to set it to 32K.

Just note that exllama_hf uses the Hugging Face transformers implementation, which is much slower than exllama with flash attention on CUDA devices.

exllama_hf at most hit a 40% GPU usage rate for single-card inference with a 7B model on an RTX 3090, whereas exllama with flash attention easily achieves a >95% usage rate.

Thank you. I'll try the model out on my Ubuntu build, which has FA2 installed. The first run was on Windows.
