[HELP] Slow and inaccurate inference using AWS SageMaker
Hello! I'm fairly new to this model as well as AWS SageMaker, so bear with me.
I have deployed this model (Qwen2-VL-7B-Instruct) to our AWS SageMaker account. I used the configuration shown on Hugging Face for AWS, with two additional parameters taken from this thread.
hub = {
    'HF_MODEL_ID': 'Qwen/Qwen2-VL-7B-Instruct',
    'SM_NUM_GPUS': json.dumps(1),
    'CUDA_GRAPHS': json.dumps(0),
    'MESSAGES_API_ENABLED': "true"
}
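For completeness, here's a quick standalone check (no AWS calls) that every hub value is a string, since as I understand it these are passed to the container as environment variables, which is why the numeric values go through json.dumps:

```python
import json

hub = {
    'HF_MODEL_ID': 'Qwen/Qwen2-VL-7B-Instruct',
    'SM_NUM_GPUS': json.dumps(1),       # -> the string "1"
    'CUDA_GRAPHS': json.dumps(0),       # -> the string "0"
    'MESSAGES_API_ENABLED': "true"
}

# Environment variables must be strings, so every value should be a str.
assert all(isinstance(v, str) for v in hub.values())
```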
Invoking the endpoint is relatively fast and produces good results when the prompt is plain text (e.g. "Tell me something about LLMs").
When I switch to OpenAI-style message prompting, however, latency skyrockets to over a minute and the output becomes incoherent. Here's an example:
INPUT:
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": "Tell me about AWS SageMaker. Make sure to end you response with 'THAT IS IT'."
    }
]
# Imports omitted earlier; `sess` is an existing sagemaker.Session()
import sagemaker
from sagemaker.predictor import Predictor

llm = Predictor(
    endpoint_name=endpoint_qwen,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)
llm.predict(
    {
        # "inputs": prompt,
        "messages": messages,
        "parameters": {
            "max_new_tokens": 2048,
            "top_p": 0.9,
            "temperature": 0.6,
        },
    }
)
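For reference, the raw JSON body that the JSONSerializer sends to the endpoint can be reproduced with just the standard library (no AWS calls; the messages and parameters match my snippet above), in case the request shape itself is the problem:

```python
import json

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",
     "content": "Tell me about AWS SageMaker. Make sure to end you response with 'THAT IS IT'."},
]

# Mirrors what sagemaker.serializers.JSONSerializer produces before the
# payload is POSTed to the SageMaker endpoint.
body = json.dumps({
    "messages": messages,
    "parameters": {"max_new_tokens": 2048, "top_p": 0.9, "temperature": 0.6},
})
print(body)
```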
OUTPUT:
{'object': 'chat.completion', 'id': '', 'created': 1741714418, 'model': 'Qwen/Qwen2-VL-7B-Instruct', 'system_fingerprint': '3.0.1-native', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "AWS SageMaker is a fully managed, high-performance service provided by Amazon Web Services that enables developers and businesses alike to build, train, and deploy machine learning models quickly and with no experience in writing software or deploying serverless systems required. SageMaker's empowered with an extensive set of tools required to run the lifecycle of a machine learning project.Platform is a multi-model, multi-debug usage go with the data as a manager as no combin that the system is a system for the target of the data and the team of the system to the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team"}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 39, 'completion_tokens': 243, 'total_tokens': 282}}
There's nothing in the AWS CloudWatch logs that points to what's wrong, so I'd appreciate it if someone could point me toward a few things to check.
Thanks!