Why not FP8 with static and per-tensor quantization?

#2
by wanzhenchn - opened

Thanks a lot. I found that the config.json and recipe.yaml show this is dynamic FP8 quantization. I have the following questions:

  • Why not static and per-tensor?
  • The ignore list shown in recipe.yaml is as follows. Why are the 're:.*self_attn', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj', and 're:.*feed_forward.down_proj' layers ignored?
ignore: ['re:.*lm_head', 're:.*self_attn', 're:.*router', 're:.*vision_model', 're:.*multi_modal_projector',
        're:.*shared_expert', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj',
        're:.*feed_forward.down_proj']

Could you share the code for using the llmcompressor toolkit to produce the FP8-dynamic/static model?

Red Hat AI org

We are following the standard set by https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 for now. I do think per-channel (weight) and per-token (activation) scales are needed to preserve accuracy for this model. We are exploring more aggressive quantization ablations right now, but this is what we wanted to push first.
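
For reference, below is a minimal sketch of how an FP8-dynamic checkpoint like this could be produced with llm-compressor's QuantizationModifier. It is not the exact script used for this repo: the model ID and save directory are placeholders, the ignore list simply mirrors the recipe.yaml quoted above, and the oneshot import path has moved between llm-compressor releases (older versions expose it under llmcompressor.transformers).

    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "<base-model-id>"          # placeholder: the unquantized checkpoint
    SAVE_DIR = "<output-dir>-FP8-dynamic" # placeholder: where to write the compressed model

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # FP8_DYNAMIC: per-channel weight scales, per-token dynamic activation scales.
    # The ignore list below mirrors the recipe.yaml quoted earlier in this thread.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=[
            "re:.*lm_head", "re:.*self_attn", "re:.*router", "re:.*vision_model",
            "re:.*multi_modal_projector", "re:.*shared_expert",
            "re:.*feed_forward.gate_proj", "re:.*feed_forward.up_proj",
            "re:.*feed_forward.down_proj",
        ],
    )

    # The dynamic scheme needs no calibration data, so oneshot can run without a dataset.
    oneshot(model=model, recipe=recipe)

    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)

For a static per-tensor variant, the scheme would change to "FP8", which uses static per-tensor activation scales and therefore needs calibration data passed to oneshot (e.g. via its dataset and num_calibration_samples arguments). A multimodal checkpoint may also need the appropriate model class instead of AutoModelForCausalLM.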


Were you able to successfully compile this build yet, and did it run as expected?
