Why does it use much more GPU memory than the official model?
With the official model I can use up to 10K context, but with this model only 3.5K. Both are q4_k_m, my hardware is an RTX 3090, and I use LM Studio.
Another problem is that this model is very slow: even if I offload all layers onto the GPU, it only gets 3 tokens/sec and CPU usage is high.
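For context, here's a rough back-of-the-envelope for how the KV cache grows with context length (a minimal sketch; the layer/head numbers are assumptions for a ~7B Llama-style model, not values read from either GGUF):

```python
# Rough KV-cache size estimate for a Llama-style GGUF model.
# All model dimensions below are ASSUMPTIONS for illustration;
# substitute the values from your model's metadata.

n_layers = 32          # transformer blocks (assumed)
n_kv_heads = 8         # KV heads with grouped-query attention (assumed)
head_dim = 128         # per-head dimension (assumed)
bytes_per_elem = 2     # fp16 K and V entries

def kv_cache_bytes(n_ctx: int) -> int:
    # 2x for K and V, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

for ctx in (3_500, 10_000):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

If this upload's metadata reports more KV heads than the official one (e.g. no grouped-query attention), the per-token cache grows several-fold, which could fit the 10K vs 3.5K gap.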
Hey are you still having the issue?
On my 3090 I'm getting:
total duration: 44.7685654s
load duration: 17.8537ms
prompt eval count: 13 token(s)
prompt eval duration: 347ms
prompt eval rate: 37.46 tokens/s
eval count: 1450 token(s)
eval duration: 44.402s
eval rate: 32.66 tokens/s
using q4_k_m
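If you're still around 3 tok/s with high CPU load, the layers probably aren't actually on the GPU. Here's a minimal sketch to verify outside LM Studio, assuming a CUDA build of llama-cpp-python (the model path is hypothetical):

```python
# Minimal sketch (assumed setup): load the GGUF with llama-cpp-python
# and force every layer onto the GPU. If generation stays slow and CPU
# usage is high, the CUDA backend likely isn't active and inference is
# falling back to CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=-1,                 # -1 = offload all layers
    n_ctx=4096,
)
out = llm("Hello", max_tokens=32)
print(out["choices"][0]["text"])
```

In LM Studio the equivalent is raising the GPU offload setting to all layers and checking the load log to confirm every layer was actually offloaded.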
That is very impressive! My motherboard and CPU are pretty old; the best speed I can get is around 23 tokens/s.