Optimized quants

#2
by MikeRoz - opened

Thank you for adding support for the new architecture and uploading these quants.

Would you be willing to explain and/or provide a recipe for the optimized quants? I'm curious what's involved and how to replicate the process with my own quants.

They're made by quantizing the model at different bitrates and then replacing layers, one at a time, measuring the end-to-end impact on KL-divergence for each replacement. This gives a measure of how much the error on each layer propagates to the logits. Then you can make an optimized bitrate mix that assigns more bits to layers where errors are measured to be more important, instead of just trying to keep the bitrate as even as possible across the model (which is still a valid strategy for many models, just less so for MoEs).
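A minimal sketch of that ranking step (my own illustration, not the actual code in the repo): suppose you've already collected reference logits from the unquantized model, baseline logits from the uniformly quantized model, and one set of logits per single-layer swap to a higher-bpw source. Then you can rank layers by how much each swap reduces the mean KL divergence.

import torch
import torch.nn.functional as F

def mean_kl(ref_logits: torch.Tensor, test_logits: torch.Tensor) -> float:
    # KL(P_ref || P_test), averaged over token positions
    ref_lp = F.log_softmax(ref_logits.float(), dim=-1)
    test_lp = F.log_softmax(test_logits.float(), dim=-1)
    return F.kl_div(test_lp, ref_lp, log_target=True, reduction="batchmean").item()

def rank_layers(ref_logits, baseline_logits, swapped_logits_by_layer):
    # swapped_logits_by_layer: {layer_idx: logits with only that layer at higher bpw}
    baseline_kl = mean_kl(ref_logits, baseline_logits)
    gains = {
        layer: baseline_kl - mean_kl(ref_logits, logits)
        for layer, logits in swapped_logits_by_layer.items()
    }
    # Biggest KL reduction first: these are the layers worth spending extra bits on
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

Layers at the top of that ranking are the ones where extra bits pay off the most in the final bitrate mix.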

Also, since this is an MoE model where something like 95% of the weights are in expert layers, all attention layers are given higher precision, which is a very favorable tradeoff.
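To put a rough number on that tradeoff (illustrative figures only, not measured from this model): if attention holds at most around 5% of the weights, giving it an extra bit per weight costs almost nothing overall.

# Back-of-the-envelope sketch with illustrative numbers:
# the overall bpw increase from bumping a tensor group is just
# (fraction of total weights in that group) * (extra bpw for that group).
def overall_bpw(base_bpw: float, bumps: dict[str, tuple[float, float]]) -> float:
    # bumps: group name -> (fraction of total weights, extra bpw)
    return base_bpw + sum(frac * extra for frac, extra in bumps.values())

# If ~95% of the weights live in expert layers, attention is at most ~5%:
print(overall_bpw(3.0, {"self_attn": (0.05, 1.0)}))  # ~3.05 bpw overall for +1 bpw on attention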

I added some facilities to eval/model_diff.py to make this easier, and I added util/recompile.py to recompile a model from an arbitrary mix of tensors from two or more other models. Currently I'm optimizing tensor-parallel inference, but after that I plan to make a tool that automates the whole process, maybe using some kind of proxy measure prior to quantization so you can easily devise an optimized quantization strategy and finish the quantization pass in one step.

Would you be willing to upload the YAML files you used? I'm a big fan of model cards like this one from Ubergarm that goes into detail on how each quant was made if there's anything novel or non-standard going on. I'd like to adapt your strategy to making quants of the larger GLM-4.5.

I didn't keep the files for each one, but what I did was bump all *.self_attn.* tensors up by 1 bpw, then bump some of the other layers via model.layers.x.* overrides, with these priorities:

(image: layers ranked by measured KL-divergence impact)

I.e. you'd get the most improvement from layers 2, 43, 1, 29, and so on in that order. I didn't think about it at the time, but it would also make sense to do *.shared_experts.*, probably even at +2 bpw since that still only adds about 0.015 bpw to the overall model but should be a big improvement regardless. So it could look something like:

# Example for 3bpw model
sources:
  - id: 4
    model_dir: /path/to/4bpw_model
  - id: 5
    model_dir: /path/to/5bpw_model
overrides:
  - key: "*.self_attn.*"
    source: 5
  - key: "*.shared_experts.*"
    source: 5
  - key: "model.layers.2.*"
    source: 4
  - key: "model.layers.43.*"
    source: 4
  - key: "model.layers.1.*"
    source: 4
  - key: "model.layers.29.*"
    source: 4

Then: python util/recompile.py -i /path/to/3bpw_model -o /path/to/new_model -or overrides.yaml should give you a model that's slightly larger than 3bpw but significantly better.

It'll take some trial and error to really dial it in (until I have the tooling done, at least), but recompiling is quick. You can also run some tests without even recompiling with: python eval/model_diff.py -ma /path/to/test_model -mb /path/to/unquantized_model -r 10 -or overrides.yaml. That gives you KL-divergence and such in about a minute (depending on your storage bandwidth), so you can fine-tune the recipe that way.
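If you want to script that trial-and-error step, a hypothetical wrapper (not part of the repo; the candidate YAML filenames are placeholders) could just run model_diff.py once per candidate overrides file, reusing only the flags shown above:

# Hypothetical helper: run eval/model_diff.py for several candidate override files
# and read off the reported KL-divergence for each. Filenames are made up.
import subprocess

candidates = ["attn_only.yaml", "attn_plus_shared.yaml", "attn_shared_top_layers.yaml"]

for overrides in candidates:
    print(f"=== {overrides} ===")
    subprocess.run([
        "python", "eval/model_diff.py",
        "-ma", "/path/to/test_model",
        "-mb", "/path/to/unquantized_model",
        "-r", "10",
        "-or", overrides,
    ], check=True)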


Turboderp, will you re-quantize GLM 4.5 Air once the tooling for this optimization is done, to provide a bump in quality? How far out is that tooling? Thanks for all your work on this library!
