Inference time is worse than with the original model
#2 · opened by xJohn
Hi,
I tested the GGUF model with this command:
llama-mtmd-cli -m typhoon-ocr-7b.Q4_K_S.gguf --mmproj typhoon-ocr-7b.mmproj-f16.gguf -p "extract this image to text" --image "test.png"
llama-mtmd-cli is running with CUDA on an A10 GPU.
The inference time is very long.
Do you have any suggestions?
You are running the model on the CPU; no wonder the experience is poor. Please add -ngl 999 so that all layers are offloaded to the GPU.
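For reference, the original command with GPU offloading enabled (assuming the model file is named typhoon-ocr-7b.Q4_K_S.gguf, as in the question) would look like:

llama-mtmd-cli -m typhoon-ocr-7b.Q4_K_S.gguf --mmproj typhoon-ocr-7b.mmproj-f16.gguf -ngl 999 -p "extract this image to text" --image "test.png"

Note that the binary must be built with CUDA support for the offload to take effect; otherwise the model will still run on the CPU.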