Kernel assertion errors on 5090 using generation with MXfp4 (gpt-oss) - (stable on 4090)

by cmp-nct - opened 7 days ago

7 days ago

File "/root/.cache/huggingface/hub/models--kernels-community--triton_kernels/snapshots/1d2e9557ac0d4c651055a209055748d4db0fe65b/build/torch-universal/triton_kernels/matmul_ogs_details/opt_flags.py", line 214, in make_default_opt_flags_nvidia
assert num_stages >= 1

I had to comment out that assertion manually to get it running.
Otherwise, roughly 30% of batch size and prompt length combinations crash with an AssertionError when running gpt-oss-20b on my 5090.

On a 4090 I have no such problems.
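
For anyone trying to narrow down which shapes trigger the crash, here is a minimal sketch of a sweep harness. It assumes you supply your own `run_generation(batch_size, prompt_length)` callable that runs one generation with the model. That callable and the helper name `find_failing_shapes` are hypothetical and not part of any library.

```python
from itertools import product


def find_failing_shapes(run_generation, batch_sizes, prompt_lengths):
    """Return the (batch_size, prompt_length) pairs that raise AssertionError.

    `run_generation` is a user-supplied callable (hypothetical here) that
    performs one generation for the given batch size and prompt length.
    """
    failures = []
    for batch_size, prompt_length in product(batch_sizes, prompt_lengths):
        try:
            run_generation(batch_size, prompt_length)
        except AssertionError:
            # Record the shape and keep sweeping instead of stopping at the first crash.
            failures.append((batch_size, prompt_length))
    return failures
```

Posting the resulting list of failing pairs alongside the traceback should make it easier to see whether the failures cluster around particular sizes.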

marcsun13

kernels-community org 1 day ago

Thanks for this! Would you like to open a PR for that? Otherwise, I will sync the latest version in a few days, once it is a bit more stable on the triton kernels side.

cmp-nct

about 6 hours ago

Hi, I don't think commenting it out is the right final solution. I'm not experienced enough in Triton to fix the bug myself.

The problem is probably in the calculation that leads to num_stages being assigned 0. Someone with Triton experience needs to fix it.
The dev responsible for the function should take a look; maybe it's something obvious.
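
To illustrate the kind of calculation that could produce 0, here is a hedged sketch. It assumes, without having confirmed it against `opt_flags.py`, that the stage count is derived by dividing a shared-memory budget by the bytes one pipeline stage needs. The function names, the cap of 4 stages, and the byte counts used below are all made up for illustration and are not the library's actual code or real hardware figures.

```python
def stages_that_fit(smem_budget_bytes: int, bytes_per_stage: int, max_stages: int = 4) -> int:
    """Hypothetical: number of pipeline stages that fit in a shared-memory budget.

    Floor division returns 0 as soon as a single stage is larger than the budget.
    """
    return min(max_stages, smem_budget_bytes // bytes_per_stage)


def pick_num_stages(smem_budget_bytes: int, bytes_per_stage: int, max_stages: int = 4) -> int:
    """Hypothetical wrapper that fails with a descriptive error instead of a bare assert."""
    num_stages = stages_that_fit(smem_budget_bytes, bytes_per_stage, max_stages)
    if num_stages < 1:
        raise ValueError(
            f"one stage needs {bytes_per_stage} bytes but only "
            f"{smem_budget_bytes} bytes of shared memory are available"
        )
    return num_stages


# Made-up numbers: a stage larger than the budget yields 0 stages.
print(stages_that_fit(100_000, 120_000))
# A larger budget fits at least one stage.
print(stages_that_fit(200_000, 120_000))
```

If the real code follows a similar pattern, a proper fix would more likely shrink the block sizes until at least one stage fits than remove the assertion, but that is for someone who knows the function to confirm.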
