Good performance, but fails on long generation
Thanks for your work. I have compared a lot of LLMs and found that this one has good performance at low cost. However, when I try to generate 512 or 1024 tokens, both the 33b and 65b versions fail and my Linux system freezes.
I am currently using two 4090s with the newest driver (525). Any comments would be appreciated.
The following is my inference config:
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=nf4_config,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # additional option to lower RAM consumption
    device_map={"": 1},         # used for the 33b version
    # device_map="auto",        # used for the 65b version
)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

input_text = (
    "system: you are an AI model. Help the user and respond carefully with a long generation.\n"
    "user: what is Y-90 liver treatment? Respond with 1000 words.\n"
    "AI:"
)
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids, max_length=1024, do_sample=True,  # do_sample so top_k/temperature take effect
                         top_k=10, temperature=0.5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
I tried applying the guanaco-33b adapter to llama-30b locally on my machine. The same issue happened: when forcing a long generation, the system freezes.
The good news is that the issue can be prevented by using 8-bit quantization; avoiding NF4 quantization solves the problem. The bad news is that this means the 33b model cannot be deployed on a single 24GB GPU; 48GB would be recommended.
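For anyone who wants to try the same workaround, here is a minimal sketch of the 8-bit load (reusing model_path and the imports from my config above; the device_map choice reflects the memory point above, not a benchmark):

int8_config = BitsAndBytesConfig(load_in_8bit=True)  # 8-bit instead of NF4
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=int8_config,
    trust_remote_code=True,
    device_map="auto",  # 8-bit 33b weights exceed 24GB, so they are split across both GPUs
)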
Another test: 4-bit on vicuna-13b runs fast without this issue.
I have found the problem. The second 4090 has a hardware defect. 4-bit inference works well on a single 4090.
If your system hangs, freezes, or crashes during LLM inference or Stable Diffusion WebUI, you should consider the possibility that your 4090 is broken.
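In case it helps anyone isolate a bad card: a rough diagnostic sketch is to load the model on each GPU in turn via device_map and run a short generation on it. The prompt and token count below are arbitrary, and model_path is assumed to be the same as in my config above.

# Diagnostic sketch: pin the whole model to one GPU at a time and run a short generation.
# If only one card hangs or crashes, suspect that card's hardware rather than the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

for gpu in (0, 1):
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=nf4_config,
        trust_remote_code=True,
        device_map={"": gpu},  # load the entire model on this single GPU
    )
    ids = tokenizer("Describe Y-90 liver treatment briefly.", return_tensors="pt").input_ids.to(f"cuda:{gpu}")
    out = model.generate(ids, max_new_tokens=256)
    print(f"GPU {gpu} finished generation without hanging")
    del model
    torch.cuda.empty_cache()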