Why does it use much more GPU memory than the official model?
With the official model I can use up to 10K context, but with this model only 3.5K. Both are q4_k_m, my hardware is an RTX 3090, and I use LM Studio.
Another problem is that this model is very slow: even if I offload all layers onto the GPU, it only gets 3 tokens/sec and CPU usage is high.
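For context, here's a rough back-of-the-envelope for how the KV cache grows with context length (a minimal sketch; the layer/head numbers are assumptions for a ~7B Llama-style model, not values read from either GGUF):

```python
# Rough KV-cache size estimate for a Llama-style GGUF model.
# All model dimensions below are ASSUMPTIONS for illustration;
# substitute the values from your model's metadata.

n_layers = 32          # transformer blocks (assumed)
n_kv_heads = 8         # KV heads with grouped-query attention (assumed)
head_dim = 128         # per-head dimension (assumed)
bytes_per_elem = 2     # fp16 K and V entries

def kv_cache_bytes(n_ctx: int) -> int:
    # 2x for K and V, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx

for ctx in (3_500, 10_000):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

If this upload's metadata reports more KV heads than the official one (e.g. no grouped-query attention), the per-token cache grows several-fold, which could fit the 10K vs 3.5K gap.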
Hey are you still having the issue?
On my 3090 I'm getting:
total duration: 44.7685654s
load duration: 17.8537ms
prompt eval count: 13 token(s)
prompt eval duration: 347ms
prompt eval rate: 37.46 tokens/s
eval count: 1450 token(s)
eval duration: 44.402s
eval rate: 32.66 tokens/s
using q4_k_m
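If you're still around 3 tok/s with high CPU load, the layers probably aren't actually on the GPU. Here's a minimal sketch to verify outside LM Studio, assuming a CUDA build of llama-cpp-python (the model path is hypothetical):

```python
# Minimal sketch (assumed setup): load the GGUF with llama-cpp-python
# and force every layer onto the GPU. If generation stays slow and CPU
# usage is high, the CUDA backend likely isn't active and inference is
# falling back to CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=-1,                 # -1 = offload all layers
    n_ctx=4096,
)
out = llm("Hello", max_tokens=32)
print(out["choices"][0]["text"])
```

In LM Studio the equivalent is raising the GPU offload setting to all layers and checking the load log to confirm every layer was actually offloaded.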
That is very impressive! My motherboard and CPU are pretty old; the best speed I can get is around 23 tokens/s.