Very slow on T4 instance
I just tried this fp8 model on a T4 instance. It loads, but training runs very slowly.
steps:   1%|          | 7/800 [03:17<6:11:57, 28.14s/it, avr_loss=0.305]
Is that normal?
T4 doesn't support bf16. This model requires bf16 (or bf16 mixed precision) because fp16 produces NaNs, but on a GPU without native bf16 support every bf16 calculation gets converted to fp32, which is why it's so slow. Use an L4, which supports bf16.
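If you're not sure whether a given instance has native bf16 support, here's a minimal sketch (assuming PyTorch on a CUDA machine) you can run before starting training:

```python
import torch

# Check whether the GPU natively supports bf16 before picking a precision.
# T4 is compute capability 7.5 (no native bf16); L4 and L40S are 8.9 (bf16 OK).
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```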
Thanks, the fp8 model worked on an L4; the ETA is 50 minutes this time.
@rockerBOO I did another test on an L40S: the fp8 and fp16 models have similar completion times, 17 min vs 18 min. Is that normal? Should I expect a performance boost from the fp8 version?
It depends on whether you're using mixed precision. Usually you'd be starting from fp32 weights and using mixed precision to do the calculations in bf16 or fp16, so you get a performance increase on the compute. But with fp8 weights the calculations are still done in bf16, i.e., at a higher precision than the storage format, so fp8 mainly saves memory rather than time. You'd need to do mixed precision at fp8 to see a compute speedup, which is a little more involved and requires third-party libraries.
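A minimal sketch of why fp8 weights alone don't speed up the math, assuming PyTorch 2.1+ with the float8 dtypes available (the tensor names here are just for illustration): the weights live in fp8, but the matmul upcasts them, so the arithmetic still runs at bf16.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# fp8 here is only a *storage* format: the weight tensor is kept
# in float8_e4m3fn, which halves its memory footprint vs bf16.
w_fp8 = torch.randn(1024, 1024, device=device).to(torch.float8_e4m3fn)
x = torch.randn(8, 1024, device=device, dtype=torch.bfloat16)

# Every calculation upcasts the weight back to bf16 first, so the
# matmul itself runs at bf16 speed; fp8 saves memory, not compute time.
y = x @ w_fp8.to(torch.bfloat16)
```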