Why does it use much more GPU memory than the official model?

#7
by pootow - opened

With the official model I can use up to 10K context, but with this model only about 3.5K. Both are q4_k_m, my hardware is an RTX 3090, and I'm using LM Studio.

Another problem is that this model is very slow: even if I offload all layers onto the GPU, it only manages about 3 tokens/s and CPU usage is high.
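
For context on the memory side, the usual suspect is the KV cache: it grows linearly with context length, so if this quant's file happens to be somewhat larger than the official q4_k_m (dynamic quants keep some tensors at higher precision), the VRAM left over for the cache shrinks and the usable context drops. A rough sketch of the arithmetic, using made-up model dimensions rather than this model's real specs:

```python
# Back-of-the-envelope: KV-cache VRAM grows linearly with context length.
# The dimensions below are illustrative placeholders (a hypothetical
# 32-layer GQA model with 8 KV heads of dim 128), not this model's real specs.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes to cache K and V for one sequence (fp16 cache = 2 bytes/elem)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem  # 2x for K and V

for ctx in (3_500, 10_000):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6}-token context -> {gib:.2f} GiB of KV cache")
```

Under these assumed dimensions, going from 3.5K to 10K context costs on the order of an extra gigabyte of VRAM, so on a 24 GB card a slightly larger weight file can plausibly translate into several thousand fewer cacheable tokens.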

pootow changed discussion status to closed
Unsloth AI org

Hey, are you still having the issue?

On my 3090 I'm getting:
total duration: 44.7685654s
load duration: 17.8537ms
prompt eval count: 13 token(s)
prompt eval duration: 347ms
prompt eval rate: 37.46 tokens/s
eval count: 1450 token(s)
eval duration: 44.402s
eval rate: 32.66 tokens/s

using q4_k_m
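
(Those rates are just the token counts divided by the wall-clock durations; a quick check of the figures above:)

```python
# Sanity-check: the reported rates are tokens / wall-clock seconds.
prompt_tokens, prompt_secs = 13, 0.347
gen_tokens, gen_secs = 1450, 44.402

print(f"prompt eval rate: {prompt_tokens / prompt_secs:.2f} tokens/s")  # 37.46
print(f"eval rate: {gen_tokens / gen_secs:.2f} tokens/s")               # 32.66
```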

That is very impressive! My motherboard and CPU are pretty old; the best speed I can get with q4_k_m is around 23 tokens/s.
