Failed to load model (with the latest version, 17 hours ago)

#3
by omnibookxp - opened

I just tried to use "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf" with LM Studio 0.2.28

Failed to load model

Error message:
"llama.cpp error: 'done_getting_tensors: wrong number of tensors; expected 292, got 291'"

Diagnostics info:
{
"memory": {
"ram_capacity": "32.00 GB",
"ram_unused": "10.00 GB"
},
"gpu": {
"gpu_names": [
"Apple Silicon"
],
"vram_recommended_capacity": "21.33 GB",
"vram_unused": "9.10 GB"
},
"os": {
"platform": "darwin",
"version": "14.5"
},
"app": {
"version": "0.2.28",
"downloadsDir": "/Users/maxm1/.cache/lm-studio/models"
},
"model": {}
}

Same here with Ollama.

ollama run Meta-Llama-3.1-8B-Instruct-Q8_0:latest
Error: llama runner process has terminated: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
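If you want to double-check the file itself, the gguf Python package (pip install gguf) can count the tensors stored in the GGUF; the path below is a placeholder, not taken from this thread:

# Count the tensors in the GGUF file (illustrative path)
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3.1-8B-Instruct-Q8_0.gguf")
print(len(reader.tensors))  # the loader error above complains about 292 vs. 291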

@omnibookxp

LM Studio just got updated to 0.2.29, which adds support for Llama 3.1 with the RoPE fixes. Go grab it :D

https://lmstudio.ai/

The error is fixed with the new version, LM Studio 0.2.29.

In my case (using the latest llama-server), the VRAM requirement for the Q8_0.gguf was unexpectedly large when the model was started with the 128k-token context window. For an 8-bit quant I expected the VRAM requirement to be similar to the GGUF file size (which was the case with Llama 3.0), but this model version required about four times more VRAM (32911MiB / 81920MiB on an otherwise empty A100 80GB GPU). It looks like a bug rather than a feature that increasing the context window 16x takes up 4x more VRAM; for other models like Qwen2 the jump in memory usage with a similar increase in context window wasn't that dramatic (double-digit, not triple-digit percentage increases). Reducing the context window to 8k tokens brings VRAM use back to the levels of the previous model version (Llama 3.0 8B: 9683MiB / 81920MiB for the 8-bit quant).

This is a feature, not a bug: 128k context is an enormous amount and needs a TON of memory allocated for it. In fact, I would have expected much more than a 4x increase in VRAM for a 16x increase in context.
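For a rough sense of the numbers: the KV cache grows linearly with context length, and at 128k tokens it dwarfs the weights themselves. A back-of-the-envelope sketch in Python, assuming the Llama 3.1 8B shape (32 layers, 8 KV heads via GQA, head dim 128) and llama.cpp's default fp16 cache:

# Rough KV-cache estimate; the model shape values are assumptions, not read from the GGUF.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # fp16 entries

def kv_cache_gib(n_ctx):
    # 2x because both K and V are cached per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1024**3

for n_ctx in (8192, 131072):
    print(f"{n_ctx:>6} tokens -> ~{kv_cache_gib(n_ctx):.1f} GiB KV cache")
# ~1 GiB at 8k vs ~16 GiB at 128k, on top of the ~8.5 GB of Q8_0 weights
# and llama.cpp's compute buffers (which also grow with context), which
# lands in the same ballpark as the ~10 GB vs ~33 GB reported above.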

I updated to the latest version, but the issue persists.

Which issue? Latest version of what?

I am having the same issue with llama-cpp-python. I tried updating it, as suggested on some forums, but no improvement.
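For reference, a minimal llama-cpp-python load call (the path and settings are illustrative, not taken from this thread). The tensor-count error comes from the bundled llama.cpp, so only a llama-cpp-python build that includes the Llama 3.1 changes fixes it, and keeping n_ctx small is what keeps the KV cache (and VRAM) in check:

# Minimal load sketch; requires a llama-cpp-python build whose bundled llama.cpp supports Llama 3.1.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # placeholder path
    n_ctx=8192,       # modest context window keeps the KV cache around ~1 GiB
    n_gpu_layers=-1,  # offload all layers to the GPU
)
out = llm("Q: What does RoPE scaling do? A:", max_tokens=64)
print(out["choices"][0]["text"])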

Which issue? Latest version of what?
LM Studio. It's fixed now. Thanks.
