Why not FP8 with static and per-tensor quantization?
by wanzhenchn
Thanks a lot. I found that the config.json and recipe.yaml show dynamic FP8 quantization was used. I have the following questions:
- Why not static, per-tensor quantization?
- The ignore list in recipe.yaml is shown below. Why are the 're:.*self_attn', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj', and 're:.*feed_forward.down_proj' layers ignored?
ignore: ['re:.*lm_head', 're:.*self_attn', 're:.*router', 're:.*vision_model', 're:.*multi_modal_projector',
're:.*shared_expert', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj',
're:.*feed_forward.down_proj']
Could you share the code for using the llmcompressor toolkit to produce the FP8-dynamic/static model?
We are following the standard set by https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 for now. I do think per-channel (weight) and per-token (activation) scales are needed to preserve accuracy for this model. We are exploring more aggressive quantization ablations right now, but this is what we wanted to push first.
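For reference, here is a minimal sketch of how an FP8-dynamic checkpoint can be produced with llm-compressor (per-channel weight scales, per-token dynamic activation scales, no calibration data needed). This is not the exact script used for this repository: the model ID, model class, and save directory are placeholders, and import paths may differ slightly between llm-compressor versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # placeholder; substitute the model to quantize
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-dynamic"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC = per-channel static weight scales + per-token dynamic activation
# scales, so no calibration dataset is required. The ignore list mirrors the
# recipe.yaml above (lm_head, attention, router, vision tower, shared experts,
# and the dense feed_forward projections stay in the original precision).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "re:.*lm_head", "re:.*self_attn", "re:.*router", "re:.*vision_model",
        "re:.*multi_modal_projector", "re:.*shared_expert",
        "re:.*feed_forward.gate_proj", "re:.*feed_forward.up_proj",
        "re:.*feed_forward.down_proj",
    ],
)

oneshot(model=model, recipe=recipe)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

For a static, per-tensor variant, scheme="FP8" would be used instead; that scheme requires a calibration dataset to fix the activation scales ahead of time, and it is where the accuracy concerns mentioned above come into play.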