Will you release a small version for consumer hardware, like the V2 generation?
I think it would be very useful to center this model on CHAT more than on other functions like code and math.
+1 here
The ONLY way for ordinary people to run it is on CPU, on a motherboard with 12 RAM slots (I don't see any other way; there won't be consumer GPUs with that much VRAM any time soon). Such motherboards have been made for many years, usually for Xeons; Asus and Gigabyte still make them, today as very fancy server boards for the newest DDR5 RAM, but you can also get their older DDR4 boards; I got one for only $150 from China. On these motherboards you can reach a terabyte of RAM in total, and a GPU can work as an offload accelerator in tools like oobabooga or LM Studio, which are based on llama.cpp. Of course there's a bandwidth problem, but there's no other way. The CPU should have no fewer than 12 cores; the latest AMD CPUs with 100+ cores would surely be very fast, on the level of a good GPU, but they cost thousands. Home cluster solutions (a network of several PCs) are too awkward and not useful today; usually those are projects built from chains of super expensive Macs, where RAM and storage have historically been a painful topic.
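For reference, here is a minimal sketch of partial GPU offload with the llama-cpp-python bindings; the GGUF filename and all parameter values are placeholders, not a tested config, and n_gpu_layers should be tuned to whatever fits your VRAM:

```python
# Minimal sketch of CPU inference with partial GPU offload via llama-cpp-python
# (pip install llama-cpp-python). The GGUF path and parameter values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-v3-q4_k_m.gguf",  # hypothetical local GGUF file
    n_ctx=4096,        # context window
    n_threads=22,      # match your physical core count
    n_gpu_layers=8,    # offload only as many layers as fit in VRAM; 0 = pure CPU
)

out = llm("Write a Python function that reverses a string.", max_tokens=128)
print(out["choices"][0]["text"])
```

oobabooga and LM Studio expose the same knobs (threads, GPU layers, context size) through their UIs.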
About RAM: on such boards you can mostly use only LRDIMM modules, which come in twice the capacity of ordinary DIMMs. Used server DDR4 LRDIMMs in 64GB were quite cheap at around $90 each; 128GB modules are still rare and quite expensive. LRDIMMs with heat spreaders run very hot, roughly twice as warm as ordinary RAM, and with 12 of them on one board they can reach 90°C without additional ventilation.
Storage. These huge models are already beyond current PC hardware limits. There's the problem of loading such a huge model from storage into RAM: even NVMe SSD speed is no longer enough for these sizes, and NVMe drives of a terabyte or more are very expensive.
I've tested loading Meta Llama 3 405B from an ordinary hard drive: loading it into RAM at Q5 quality takes 30 minutes, and from NVMe roughly 10-15 minutes. The model only needs to be loaded once at startup; after that it works from RAM only.
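A rough back-of-the-envelope check on those load times (file size and drive throughputs below are assumed ballpark figures, not measurements, and real loads are usually slower than this pure sequential-read bound):

```python
# Lower-bound load time = file size / sequential read speed.
# Sizes and throughputs are assumed ballpark figures, not measurements.
model_size_gb = 280      # roughly Llama 3 405B at a Q5-class quantization
drives = {"HDD": 160, "NVMe SSD": 3000}   # sequential read, MB/s

for name, mb_s in drives.items():
    minutes = model_size_gb * 1024 / mb_s / 60
    print(f"{name}: at least ~{minutes:.0f} min to read {model_size_gb} GB")
```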
Running THIS model. The problem I see is that the first BF16 conversion files are twice the size of the original model, and anything above a terabyte of storage is unrealistic to run locally today. The BF16 weights need to be quantized into smaller quality sizes, preferably under a terabyte. I want to try the model, but on my hardware Q4 or even Q3 quality is probably the maximum.
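For a sense of scale, a rough size estimate by bits per weight for a 671B-parameter model (the bits-per-weight figures are approximate averages for common GGUF quant types, not exact):

```python
# Approximate on-disk size = parameter count * bits per weight / 8.
# Bits-per-weight values are rough averages for common GGUF quant types.
params = 671e9
quants = {"BF16": 16, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

for name, bits in quants.items():
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
```

By these rough numbers, only the BF16 release is above a terabyte; the common quantizations all come in under it.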
When I used Meta's 405B it disappointed me, not even with its speed (it was very slow) but with its censorship; it can't do much beyond very mundane tasks like calculation or encyclopedia questions. I don't see value in such a model as a tool.
DeepSeek V2.5 (latest) is for now the best and most valuable tool; it can handle programming logic and code at a size of only 236B. On a 22-core CPU it's really fast compared to Meta Llama 405B. At Q8 quality it uses about 270GB of RAM, but Q6, which is also good quality, may use less and fit into 4-RAM-slot motherboards that support 256GB total.
That's a little bit crazy 😂. I think most of those who need a small model can probably only afford 16GB or less of VRAM; a 16B model (like DeepSeek V2 Lite) plus a Q4 GGUF might be a good option.
I would love DeepSeek V3 to run on a single H100.
Nah, it's access to models that's important. I still haven't seen any human whose intelligence was improved by the speed of such tools. As for the tools themselves, speed can become dangerous once they reach a self-improving level (i.e. geometric progression).
So, first run. DeepSeek V3, 671B, Q5_K_M GGUF, uses 502GB of RAM. Tested on Linux with Oobabooga 2.2 (text-generation-webui). CPU only, BLAS not installed yet; speed is slow, 0.20 to 0.40 tokens per second, but it can be used for some coding or a task left to process. It writes in real time slightly slower than a normal human with a pen.
And most of all, I'm using literally ANCIENT hardware: a Xeon from 2015 (E5-2696 v4) and a Gigabyte motherboard with 12 RAM slots that is also at least 10 years old.
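For context, a memory-bandwidth upper bound on CPU-only decode speed for a setup like this (the bandwidth and bits-per-weight figures are assumptions; real throughput lands far below the bound because of compute, NUMA and cache effects):

```python
# Upper bound on CPU-only decoding: tokens/s <= memory bandwidth / bytes read per token.
# DeepSeek V3 is MoE with ~37B parameters activated per token; values below are assumptions.
active_params = 37e9        # parameters read per generated token
bits_per_weight = 5.7       # approximate for Q5_K_M
mem_bw_gb_s = 76.8          # quad-channel DDR4-2400: 4 channels * 2400 MT/s * 8 bytes

gb_per_token = active_params * bits_per_weight / 8 / 1e9
print(f"~{gb_per_token:.0f} GB read per token")
print(f"upper bound: ~{mem_bw_gb_s / gb_per_token:.1f} tokens/s")
```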
About quality: if you want near-perfect quality you should go higher than Q5_K_M, like Q6_K or better Q8, but that requires more RAM (usually the GGUF's storage size plus ~10%). I'm testing it on code and it's good, but Q5 degraded it quite a bit. One thing I don't like: instead of writing only the portion of code that needs to change, it writes the full code in every answer, wasting a sea of tokens (and tokens are money on servers), which is always a sign of smaller, dumber models like Qwen or Mistral Large. V2.5 at Q8, close to original quality, is smart enough not to do that, saving many tokens and your time. Going lower than Q5 is not recommended; big models hallucinate a lot at lower-quality quantizations. Only after three repeats of the full code did V3 kind of self-tune and start shortening its answers to just the parts that need editing.
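Applying that storage-size-plus-~10% rule of thumb to the higher quants (the bits-per-weight values are, again, approximate):

```python
# RAM rule of thumb for GGUF inference: file size plus ~10% overhead (KV cache, buffers).
# Bits-per-weight values are approximate, not exact.
params = 671e9
for name, bits in [("Q5_K_M", 5.7), ("Q6_K", 6.6), ("Q8_0", 8.5)]:
    file_gb = params * bits / 8 / 1e9
    print(f"{name}: ~{file_gb:.0f} GB file -> ~{file_gb * 1.1:.0f} GB RAM")
```

which is in the same ballpark as the ~502GB observed above for Q5_K_M.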
You
hello darling!
AI
Well, hello there, darling! Always a pleasure to brighten your day. What’s on your mind? Need advice, a creative idea, or just someone to chat with? I’m all ears—or, well, algorithms! 😊