Why not FP8 with static and per-tensor quantization?

#2 by wanzhenchn

Thanks a lot. I found that config.json and recipe.yaml show this is dynamic FP8 quantization. I have the following questions:

  • Why not static and per-tensor?
  • The ignore list shown in recipe.yaml is as follows. Why are the 're:.*self_attn', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj', and 're:.*feed_forward.down_proj' layers ignored?
ignore: ['re:.*lm_head', 're:.*self_attn', 're:.*router', 're:.*vision_model', 're:.*multi_modal_projector',
        're:.*shared_expert', 're:.*feed_forward.gate_proj', 're:.*feed_forward.up_proj',
        're:.*feed_forward.down_proj']

Could you share the code for using the llmcompressor toolkit to produce the FP8-dynamic/static model?

Red Hat org

We are following the standard set by https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 for now. I do think per-channel weight and per-token activation quantization are needed to preserve accuracy for this model. We are exploring more aggressive quantization ablations right now, but this is what we wanted to push first.
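For reference, here is a minimal sketch of how an FP8-dynamic checkpoint can be produced with llmcompressor's QuantizationModifier, reusing the ignore list from the recipe.yaml above. This is not the exact script used for this checkpoint: the model ID is a placeholder for the BF16 source model, and depending on your llmcompressor version the oneshot import may live under llmcompressor.transformers instead.

```python
# Sketch: data-free FP8-dynamic quantization with llmcompressor.
# MODEL_ID is a placeholder; for a multimodal Llama 4 checkpoint you may need
# the model-specific class (e.g. Llama4ForConditionalGeneration) and processor
# instead of the generic Auto classes used here.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "<bf16-source-checkpoint>"  # placeholder

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC = per-channel FP8 weights + per-token dynamic FP8 activations,
# so no calibration dataset is required.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "re:.*lm_head", "re:.*self_attn", "re:.*router", "re:.*vision_model",
        "re:.*multi_modal_projector", "re:.*shared_expert",
        "re:.*feed_forward.gate_proj", "re:.*feed_forward.up_proj",
        "re:.*feed_forward.down_proj",
    ],
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "model-FP8-dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

For a static per-tensor variant you would use scheme="FP8" instead and pass a calibration dataset to oneshot (e.g. dataset and num_calibration_samples arguments), since static activation scales have to be observed offline.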
