Deploying Llama 3.1 to an NVIDIA T4 instance (SageMaker endpoints)
When I try to deploy meta-llama/Meta-Llama-3.1-8B-Instruct
to a g4dn.xlarge (NVIDIA T4) instance with quantization enabled, I get:
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
I am NOT able to use any newer GPU because of the region I am deploying to. Models like the unsloth variants SHOULD work, and with one of those I do get past the FlashAttention error, but I have been unable to use it for a different reason.
How can I make the FlashAttention error go away? My current configuration:
import json

from sagemaker.huggingface import HuggingFaceModel

# role, llm_image and sess are defined in earlier setup cells (not shown)
number_of_gpu = 1  # ml.g4dn.xlarge has a single T4 GPU

config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # model_id from hf.co/models
    "HUGGING_FACE_HUB_TOKEN": "test",  # replace with your Hugging Face Hub token (Llama 3.1 is gated)
    "SM_NUM_GPUS": json.dumps(number_of_gpu),  # number of GPUs used per replica
    "MAX_INPUT_LENGTH": "4096",  # max length of input text
    "MAX_TOTAL_TOKENS": "8192",  # max length of the generation (including input text)
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # limits the number of tokens processed in parallel during generation
    "MESSAGES_API_ENABLED": "true",  # enable the Messages API
    "HF_MODEL_QUANTIZE": "bitsandbytes",  # [possible values: awq, eetq, exl2, gptq, marlin, bitsandbytes, bitsandbytes-nf4, bitsandbytes-fp4, fp8]
}

# check if token is set
assert (
    config["HUGGING_FACE_HUB_TOKEN"] != "test"
), "Please set your Hugging Face Hub token"

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env=config,
    sagemaker_session=sess,
    transformers_version="4.43.3",
    pytorch_version="2.3.1",
)
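For reference, the deploy call for this model looks roughly like the following (the health-check timeout value is just an example, not anything specific):

# deploy to the T4 instance; the timeout value below is only an example
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    container_startup_health_check_timeout=900,
)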
+1, following this issue in case someone eventually finds a workaround.
@sumanthnall if you have access to an AWS rep, ask for access to more EC2 instance types than are in GA.
Alternatively, you can use a custom SageMaker image that uses vLLM (instead of TGI) if you want to customize the packages. A rough sketch of that route is below.
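This is only a sketch, assuming you have already built a vLLM serving image and pushed it to ECR; the image URI and the environment variable names are placeholders for whatever your custom container actually expects:

from sagemaker.model import Model

# hypothetical URI of a custom vLLM serving image in ECR
vllm_image = "<account-id>.dkr.ecr.<region>.amazonaws.com/vllm-serving:latest"

vllm_model = Model(
    image_uri=vllm_image,
    role=role,
    env={
        # these variable names are illustrative; use whatever your container reads
        "MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "HF_TOKEN": "<your Hugging Face Hub token>",
    },
    sagemaker_session=sess,
)

vllm_predictor = vllm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)

The main benefit is that you control the container, so you can pin the attention backend and quantization packages to versions that actually run on a T4.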
Try adding CUDA_GRAPHS: "0" to the config, and also USE_FLASH_ATTENTION: "false".
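Applied to the config from the question, that looks like this (SageMaker environment values have to be strings, so 0 becomes "0"):

config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "HUGGING_FACE_HUB_TOKEN": "<your Hugging Face Hub token>",
    "SM_NUM_GPUS": json.dumps(number_of_gpu),
    "MAX_INPUT_LENGTH": "4096",
    "MAX_TOTAL_TOKENS": "8192",
    "MAX_BATCH_TOTAL_TOKENS": "8192",
    "MESSAGES_API_ENABLED": "true",
    "HF_MODEL_QUANTIZE": "bitsandbytes",
    "CUDA_GRAPHS": "0",              # disable CUDA graph capture
    "USE_FLASH_ATTENTION": "false",  # tell TGI not to use FlashAttention on the T4
}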