Compared to the regular FP8 model, how much better does the 8-bit model perform here?

#16
by demo001s - opened

Compared to the regular FP8 model, how much better does the 8-bit model perform here? Also, why is Q8_0 10 seconds faster than Q5_0? I tested on an NVIDIA 4090: Q8_0 takes 18 seconds and Q5_0 takes 28 seconds.

Shouldn't the smaller model file be the faster one?

Lots of people have reported Q5 being much slower; casting 5-bit values back to 16-bit is probably just more complicated.
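For illustration, here's a rough numpy sketch of why the 5-bit path is more work per block than the 8-bit one. This is not the actual llama.cpp / GGUF kernel, and the Q5_0 byte layout here is simplified, but the shape of the work is the same: Q8_0 is a cast and a multiply, while Q5_0 first has to reassemble every 5-bit value from packed low nibbles plus a separate high-bit mask.

```python
import numpy as np

BLOCK = 32  # GGUF quantizes weights in blocks of 32 values

def dequant_q8_0(scale, qs):
    # Q8_0 block: one fp16 scale + 32 int8 values.
    # Dequantizing is just a cast and a multiply.
    return scale * qs.astype(np.float16)

def dequant_q5_0(scale, qh, ql):
    # Q5_0 block: one fp16 scale + 32 values packed as
    # 16 bytes of low nibbles (ql) + 4 bytes of high bits (qh).
    # (Simplified layout; llama.cpp interleaves the bytes differently.)
    lo = np.concatenate([ql & 0x0F, ql >> 4])                         # 32 low nibbles
    hi = np.unpackbits(qh, bitorder='little').astype(np.int16) << 4   # high bit -> bit 4
    q = (lo.astype(np.int16) | hi) - 16                               # signed 5-bit value
    return scale * q.astype(np.float16)

# Toy block with random data, just to show both paths run
rng = np.random.default_rng(0)
scale = np.float16(0.02)
q8 = rng.integers(-128, 128, BLOCK, dtype=np.int8)
ql = rng.integers(0, 256, BLOCK // 2, dtype=np.uint8)  # packed low nibbles
qh = rng.integers(0, 256, BLOCK // 8, dtype=np.uint8)  # packed high bits

print(dequant_q8_0(scale, q8)[:4])
print(dequant_q5_0(scale, qh, ql)[:4])
```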

On a 2080 Ti, performance is similar for all quant versions, about 2.5 s/it, which is weird.


You're limited by the slowest link, which in this case is likely the on-the-fly dequantization math (probably some numpy operation or casting).
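If anyone wants to sanity-check that on their own machine, here's a quick CPU-side micro-benchmark along the same lines. It's only an illustration (the real dequantization runs in GPU kernels, and the array sizes and simplified unpacking here are placeholders), but it shows how much extra unpack/cast work the 5-bit-style path does compared to the 8-bit-style path:

```python
import timeit
import numpy as np

n = 1 << 22  # a few million weights, enough to see the difference
scale = np.float16(0.02)
q8 = np.random.randint(-128, 128, n, dtype=np.int8)      # Q8_0-style storage
ql = np.random.randint(0, 256, n // 2, dtype=np.uint8)   # packed low nibbles
qh = np.random.randint(0, 256, n // 8, dtype=np.uint8)   # packed high bits

def q8_style():
    # cast + multiply only
    return scale * q8.astype(np.float16)

def q5_style():
    # unpack bits, reassemble 5-bit values, then cast + multiply
    lo = np.concatenate([ql & 0x0F, ql >> 4]).astype(np.int16)
    hi = np.unpackbits(qh, bitorder='little').astype(np.int16) << 4
    return scale * ((lo | hi) - 16).astype(np.float16)

print("q8-style dequant:", timeit.timeit(q8_style, number=20), "s")
print("q5-style dequant:", timeit.timeit(q5_style, number=20), "s")
```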
