4bit version
I wonder if this thing can run on a beefy server cpu once quantized. @TheBloke What do you think, does this thing run on CPU in 4bit at useful speeds?
Ah, I know... @Camelidae you have quantized LLaMA 65B even down to 2-bit. What are your experiences, performance-wise? When does quantization do too much harm? On what hardware do you run the 3-bit and 4-bit versions?
The tag doesn't work... gonna go to your page: https://huggingface.co/camelids/llama-65b-ggml-q4_0
I'm having a look at it now, will let you know how I get on.
OK, here are GGML 4-bit and 2-bit quantised versions: https://huggingface.co/TheBloke/alpaca-lora-65B-GGML
I had some problems making GPTQs due to my attempts killing the Runpod host I was on :) I got one GPTQ made, then the host died and I couldn't get back in to access the file I'd saved. I didn't want to start from scratch, and they've promised to fix it for tomorrow, so I will continue from where I left off then.
@TheBloke Any suggestions on which project to use to get this 2-bit version working? Oobabooga doesn't seem to like it, and I'm doing this on a remote server as my local machine doesn't have an A100 or a decent connection. I only know how to get llama.cpp working locally, and it'll take me a day to download the model, so any hints would be appreciated. I've been trying to get it working since shortly after your upload and keep slamming my face into brick walls.
I believe that right now the only way to get the 2-bit version working is to use the q2q3 branch of the llama.cpp fork I listed in the README.
I imagine that sometime soon this will be merged into mainline llama.cpp. And once it's been merged, it'll likely only be a few days until it's supported by the various tools that interface with llama.cpp, like text-generation-webui and the Python bindings. If you're a developer then I suppose it may be possible to merge the new q2q3 code into one of the llama.cpp interfaces to enable it to work as a server, but that's not something I've looked at, or plan to look at.
But right now the only simple option I know of would be to run it on the command line using:
git clone https://github.com/sw/llama.cpp llama-q2q3
cd llama-q2q3
git checkout q2q3
make
./main ...
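For example, the invocation might look something like the line below. The model filename is just a placeholder here, swap in whichever q2_0 file you actually downloaded from the repo, and adjust the thread count to match your CPU:
./main -m ./models/alpaca-lora-65B.ggml.q2_0.bin -t 16 -n 256 -p "Your prompt here"
(-m is the model path, -t the number of threads, -n the number of tokens to generate, -p the prompt.)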
If that doesn't work for you then I'd suggest just waiting. Once the q2 code is merged, things will be easier. And it's quite possible that by that time a better 65B model will be available anyway :)