Failed to load model (with the latest version, 17 hours ago)

#3
by omnibookxp - opened

I just tried to use "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf" with LM Studio 0.2.28

Failed to load model

Error message:
"llama.cpp error: 'done_getting_tensors: wrong number of tensors; expected 292, got 291'"

Diagnostics info:
{
"memory": {
"ram_capacity": "32.00 GB",
"ram_unused": "10.00 GB"
},
"gpu": {
"gpu_names": [
"Apple Silicon"
],
"vram_recommended_capacity": "21.33 GB",
"vram_unused": "9.10 GB"
},
"os": {
"platform": "darwin",
"version": "14.5"
},
"app": {
"version": "0.2.28",
"downloadsDir": "/Users/maxm1/.cache/lm-studio/models"
},
"model": {}
}

Same here with Ollama.

ollama run Meta-Llama-3.1-8B-Instruct-Q8_0:latest
Error: llama runner process has terminated: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
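If you want to double-check the file itself, the gguf Python package (pip install gguf) can count the tensors stored in the GGUF; the path below is a placeholder, not taken from this thread:

# Count the tensors in the GGUF file (illustrative path)
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3.1-8B-Instruct-Q8_0.gguf")
print(len(reader.tensors))  # the loader error above complains about 292 vs. 291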

@omnibookxp

LM Studio just got updated to 0.2.29, which adds support for Llama 3.1 with the RoPE fixes. Go grab it :D

https://lmstudio.ai/

The error is fixed with the new version, LM Studio 0.2.29.

In my case (using the latest llama-server), the VRAM requirement for the Q8_0.gguf was unexpectedly large when the model was started with the 128k-token context window. For an 8-bit quant I expected the VRAM requirement to be similar to the GGUF file size (which was the case with Llama 3.0), but this model version required about four times more VRAM (32911MiB / 81920MiB on an otherwise empty A100 80GB GPU). It looks like a bug rather than a feature that increasing the context window 16x takes up 4x more VRAM; for other models like Qwen2 the jump in memory usage with a similar increase in context window wasn't that dramatic (double-digit, not triple-digit percentage increases). Reducing the context window to 8k tokens brings VRAM use back to the levels of the previous model version (Llama 3.0 8B: 9683MiB / 81920MiB for the 8-bit quant).

This is a feature, not a bug: 128k context is an enormous amount and needs a TON of memory allocated for it. In fact, I would have expected much more than a 4x increase in VRAM for a 16x increase in context.
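For a rough sense of the numbers: the KV cache grows linearly with context length, and at 128k tokens it dwarfs the weights themselves. A back-of-the-envelope sketch in Python, assuming the Llama 3.1 8B shape (32 layers, 8 KV heads via GQA, head dim 128) and llama.cpp's default fp16 cache:

# Rough KV-cache estimate; the model shape values are assumptions, not read from the GGUF.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # fp16 entries

def kv_cache_gib(n_ctx):
    # 2x because both K and V are cached per layer per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx / 1024**3

for n_ctx in (8192, 131072):
    print(f"{n_ctx:>6} tokens -> ~{kv_cache_gib(n_ctx):.1f} GiB KV cache")
# ~1 GiB at 8k vs ~16 GiB at 128k, on top of the ~8.5 GB of Q8_0 weights
# and llama.cpp's compute buffers (which also grow with context), which
# lands in the same ballpark as the ~10 GB vs ~33 GB reported above.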

I updated to the latest version, but the issue persists.

Which issue? Latest version of what?

I am having the same issue with llama-cpp-python. I tried updating it, as suggested on some forums, but no improvement.
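For reference, a minimal llama-cpp-python load call (the path and settings are illustrative, not taken from this thread). The tensor-count error comes from the bundled llama.cpp, so only a llama-cpp-python build that includes the Llama 3.1 changes fixes it, and keeping n_ctx small is what keeps the KV cache (and VRAM) in check:

# Minimal load sketch; requires a llama-cpp-python build whose bundled llama.cpp supports Llama 3.1.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q8_0.gguf",  # placeholder path
    n_ctx=8192,       # modest context window keeps the KV cache around ~1 GiB
    n_gpu_layers=-1,  # offload all layers to the GPU
)
out = llm("Q: What does RoPE scaling do? A:", max_tokens=64)
print(out["choices"][0]["text"])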

Which issue? Latest version of what?
LM Studio. It's fixed now. Thanks.
