Inference time not as good as the original model

#2 opened by xJohn

Hi,
I tested the GGUF model with this command:
llama-mtmd-cli -m typhoon-ocr-7b.Q4_K_S.gguf --mmproj typhoon-ocr-7b.mmproj-f16.gguf -p "extract this image to text" --image "test.png"
llama-mtmd-cli is running on a CUDA A10 GPU.
The inference time is very long.
Do you have any suggestions?

You are running the model on the CPU, so it is no wonder the experience is poor. Please add -ngl 999 to offload all layers to the GPU.
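
For example, assuming the same file names as in your original command, the invocation with GPU offload enabled would look like:

llama-mtmd-cli -m typhoon-ocr-7b.Q4_K_S.gguf --mmproj typhoon-ocr-7b.mmproj-f16.gguf -ngl 999 -p "extract this image to text" --image "test.png"

The startup log should then report how many layers were offloaded to the GPU, so you can confirm it is no longer running on the CPU.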
