Error on num_kv_heads when using TGI

#5
by nsegev - opened

I'm trying to deploy the model with 4-bit quantization on SageMaker using the following configuration:
config = {
    'HF_MODEL_ID': 'tiiuae/falcon-180B-chat',
    'SM_NUM_GPUS': json.dumps(8),
    'MAX_TOTAL_TOKENS': json.dumps(2048),
    'MAX_INPUT_LENGTH': json.dumps(2048 - MAX_NEW_TOKENS),
    'HUGGING_FACE_HUB_TOKEN': '<your-hub-token>',
    'HF_MODEL_QUANTIZE': 'bitsandbytes-nf4',
}

I'm receiving an unexpected error: "NotImplementedError: Tensor Parallelism is not implemented for 14 not divisible by 8".
It looks like it comes from the FlashRWLargeAttention class, at the line "if self.num_groups % process_group.size() != 0".
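
For anyone else hitting this, here is a minimal sketch of the kind of divisibility check that the quoted line appears to perform. It is not TGI's actual code: the function name check_tensor_parallel and its parameters are hypothetical, and only the condition and the error message are taken from the post above.

```python
def check_tensor_parallel(num_groups: int, world_size: int) -> int:
    """Return how many KV groups each shard would hold.

    Hypothetical helper for illustration only. `num_groups` stands in for
    self.num_groups and `world_size` for process_group.size() in the
    line quoted above.
    """
    if num_groups % world_size != 0:
        # Same condition and message as the error reported in this thread.
        raise NotImplementedError(
            f"Tensor Parallelism is not implemented for {num_groups} "
            f"not divisible by {world_size}"
        )
    return num_groups // world_size


if __name__ == "__main__":
    # 8 groups across 8 GPUs split evenly.
    print(check_tensor_parallel(8, 8))
    # 14 groups across 8 GPUs do not, which reproduces the reported message.
    try:
        check_tensor_parallel(14, 8)
    except NotImplementedError as err:
        print(err)
```

So with SM_NUM_GPUS set to 8, any group count that is not a multiple of 8 trips the check; the open question is why the value read from the model is 14.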

As far as I can tell, 14 is the value of n_head_kv, but why is it 14?
Where does that number come from?

I'm getting the same error. Any idea why this is happening and how it can be solved?

This was fixed in TGI version 1.1.0 (recently released).
