tiiuae/falcon-180B-chat · Error on num_kv_heads when using TGI

Sep 14, 2023

I'm trying to deploy the model with 4-bit quantization on SageMaker using the following configuration:
config = {
    'HF_MODEL_ID': 'tiiuae/falcon-180B-chat',
    'SM_NUM_GPUS': json.dumps(8),
    'MAX_TOTAL_TOKENS': json.dumps(2048),
    'MAX_INPUT_LENGTH': json.dumps(2048 - MAX_NEW_TOKENS),
    'HUGGING_FACE_HUB_TOKEN': '<your Hugging Face token>',  # value omitted in the original post
    'HF_MODEL_QUANTIZE': 'bitsandbytes-nf4',
}

I'm receiving an unexpected error: "NotImplementedError: Tensor Parallelism is not implemented for 14 not divisible by 8".
It looks like it comes from the FlashRWLargeAttention class, from the line "if self.num_groups % process_group.size() != 0".

As far as I can understand, the number 14 is n_head_kv, but why is it 14?
Where is the number 14 coming from?
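
One possible explanation, offered as a sketch and not confirmed anywhere in this thread: if the attention class derives its group count as num_heads // (num_heads_kv * 2), then a config with 232 attention heads and 8 KV heads gives 14. Both the formula and the head counts below are assumptions; only the divisibility condition and the error message are taken from the post above. The helper names num_groups and check_tensor_parallel are hypothetical.

```python
def num_groups(num_heads: int, num_heads_kv: int) -> int:
    # ASSUMPTION: hypothetical reconstruction of how the group count
    # might be derived; this formula is not quoted in the thread.
    return num_heads // (num_heads_kv * 2)


def check_tensor_parallel(groups: int, world_size: int) -> None:
    # Mirrors the condition quoted in the post:
    # "if self.num_groups % process_group.size() != 0"
    if groups % world_size != 0:
        raise NotImplementedError(
            f"Tensor Parallelism is not implemented for {groups} "
            f"not divisible by {world_size}"
        )


# ASSUMPTION: head counts believed to match the model config; verify
# against the model's config.json before relying on them.
groups = num_groups(num_heads=232, num_heads_kv=8)
print(groups)

# GPU counts from 1 to 8 that would pass the check for this group count.
valid_world_sizes = [n for n in range(1, 9) if groups % n == 0]
print(valid_world_sizes)

try:
    check_tensor_parallel(groups, world_size=8)  # SM_NUM_GPUS from the post
except NotImplementedError as err:
    print(err)
```

Under these assumptions the value 14 would not be n_head_kv itself but a number derived from it, which would explain why it does not appear in the model config.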

Diogo-V

Sep 19, 2023

Getting the same error too. Any idea why this is happening and how can it be solved?

nsegev

Oct 1, 2023

> Getting the same error too. Any idea why this is happening and how can it be solved?

This was fixed in TGI version 1.1.0, which was recently released.
