high memory use

#3
by electroglyph - opened

EXAONE uses a lot more memory for context compared to Qwen 2.5. Is this inherent to the model or is it something wrong with llama.cpp?

LG AI Research org
edited Dec 10, 2024

Hi, electroglyph.

Would you give us more information (e.g., gguf type and llama-cli parameters) for testing?
When we compared EXAONE-3.5-2.4B-Instruct-BF16.gguf and qwen2.5-3b-instruct-fp16.gguf with the same parameters (llama-cli -cnv -m '...' -p '...') on CPU, EXAONE used less memory.

I've tested using some of the GPU backends, e.g. SYCL and Vulkan. My context limit is around 50% of what it is with Qwen 2.5 3B. I've tested several versions of llama.cpp so far. I'm going to do some more testing and I'll be back with more detailed information.

...my context limit is somewhere around 60K with EXAONE 2.4B, but I can hit 120K with Qwen 2.5 3B (no quantization). These small models are great for running in parallel, so my actual context per task is the total divided by the number of parallel tasks I'm running. The lower context limit means I have to reduce how many I run in parallel.

After more testing, I can update this to say the context limit is almost exactly 50% of what it is with Qwen 2.5 3B.

I've opened an issue here if you want to weigh in:
https://github.com/ggerganov/llama.cpp/issues/10823

LG AI Research org

Hi, 0xDEADFED5

It is due to architectural differences between EXAONE 3.5 2.4B and Qwen 2.5 3B. Specifically, num_attention_heads and num_key_value_heads differ between the two models.
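As a rough illustration of why those two settings matter, the KV cache stores one key and one value vector per KV head, per layer, per token, so its size scales with num_key_value_heads times the head dimension. The sketch below is a back-of-the-envelope estimate, not a measurement from llama.cpp. The layer counts, hidden sizes, and head counts are assumptions recalled from the published config.json files and should be checked against them. It also assumes a 16-bit cache and a head dimension of hidden_size / num_attention_heads.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache bytes needed for one token of context.

    The leading factor of 2 covers the separate K and V tensors.
    bytes_per_elem=2 assumes an f16 cache (an assumption, not a measured setting).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# ASSUMED values for EXAONE 3.5 2.4B: verify against the model's config.json.
exaone = kv_cache_bytes_per_token(
    n_layers=30,
    n_kv_heads=8,
    head_dim=2560 // 32,  # hidden_size // num_attention_heads
)

# ASSUMED values for Qwen 2.5 3B: verify against the model's config.json.
qwen = kv_cache_bytes_per_token(
    n_layers=36,
    n_kv_heads=2,
    head_dim=2048 // 16,  # hidden_size // num_attention_heads
)

ratio = exaone / qwen

print(f"EXAONE 3.5 2.4B: {exaone} bytes per token")
print(f"Qwen 2.5 3B:     {qwen} bytes per token")
print(f"ratio:           {ratio:.2f}x")
```

Under these assumptions, EXAONE needs roughly twice as many KV cache bytes per token as Qwen 2.5 3B, which would be consistent with the reported context limit of about 50% for the same memory budget.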

Thank you.
