Very slow on T4 instance
I just tried this fp8 model on a T4 instance. It loads, but training runs very slowly.
steps:   1%|          | 7/800 [03:17<6:11:57, 28.14s/it, avr_loss=0.305]
Is that normal?
T4 doesn't support bf16. This model requires bf16 (or bf16 mixed precision) because fp16 produces NaNs, but on a GPU without native bf16 support every bf16 calculation gets converted to fp32, which is why it's so slow. Use an L4, which supports bf16.
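If you're not sure whether a given instance has native bf16 support, here's a minimal sketch (assuming PyTorch on a CUDA machine) you can run before starting training:

```python
import torch

# Check whether the GPU natively supports bf16 before picking a precision.
# T4 is compute capability 7.5 (no native bf16); L4 and L40S are 8.9 (bf16 OK).
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```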
Thanks, the fp8 model worked on an L4; the ETA is 50 minutes this time.
@rockerBOO I did another test on an L40S: the fp8 and fp16 models have similar completion times, 17 min vs 18 min. Is that normal? Should I expect a performance boost from the fp8 version?
It depends on whether you're using mixed precision. Usually you'd be starting from fp32 weights and using mixed precision to do the calculations in bf16 or fp16, so you get a performance increase on the compute. But with fp8 weights the calculations are still done in bf16, i.e., at a higher precision than the storage format, so fp8 mainly saves memory rather than time. You'd need to do mixed precision at fp8 to see a compute speedup, which is a little more involved and requires third-party libraries.
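A minimal sketch of why fp8 weights alone don't speed up the math, assuming PyTorch 2.1+ with the float8 dtypes available (the tensor names here are just for illustration): the weights live in fp8, but the matmul upcasts them, so the arithmetic still runs at bf16.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# fp8 here is only a *storage* format: the weight tensor is kept
# in float8_e4m3fn, which halves its memory footprint vs bf16.
w_fp8 = torch.randn(1024, 1024, device=device).to(torch.float8_e4m3fn)
x = torch.randn(8, 1024, device=device, dtype=torch.bfloat16)

# Every calculation upcasts the weight back to bf16 first, so the
# matmul itself runs at bf16 speed; fp8 saves memory, not compute time.
y = x @ w_fp8.to(torch.bfloat16)
```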