Why is it so slow to first token?

#3
by kweg - opened

I've been using this model for a while: https://huggingface.co/lmstudio-community/Llama-4-Scout-17B-16E-MLX-text-4bit
For short prompts it's usually around 0.3 s to first token.

I tried switching to this model because it has vision support, and 99% of the time (for the same prompts) it takes about 5 s to first token. Same settings, same everything — except, of course, the vision support. Is vision support really that expensive?
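For reference, a minimal sketch of how time to first token (TTFT) can be measured around a streaming generator. The `fake_stream` generator here is a hypothetical stand-in for a real streaming call (e.g. `mlx_lm`'s `stream_generate`); it only simulates prefill latency with a sleep:

```python
import time

def time_to_first_token(stream):
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    for token in stream:
        return time.perf_counter() - start, token
    return None, None  # stream produced nothing

# Hypothetical stand-in for a real model stream such as mlx_lm's stream_generate:
def fake_stream():
    time.sleep(0.05)  # simulated prompt-processing (prefill) latency
    yield "Hello"
    yield " world"

ttft, first = time_to_first_token(fake_stream())
print(f"TTFT: {ttft:.3f}s, first token: {first!r}")
```

Timing the first yielded token like this isolates prompt processing (prefill) from per-token decode speed, which is where text-only and vision variants of a model tend to differ most.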
