num_train_epochs = 3 but save_steps = 250?

#1
by DanielTTY - opened

Just trying to understand the qlora.py in your GitHub: how many steps are there in one epoch when setting num_train_epochs?

Depends on the number of elements in your training data, batch size, gradient accumulation steps, etc. Personally I think it's easier to go by epochs, and in this case I actually ended up using 5 (although with a higher learning rate perhaps 3 would be fine).

Without gradient accumulation, with a batch size of 6 on ~30k training rows, it's about 5k steps per epoch, so with save_steps = 250 I'm saving roughly every 5% of the way through the training data. With gradient accumulation steps set to 16, the steps per epoch is closer to 325.
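For anyone who wants to redo that arithmetic for their own setup, here is a rough sketch. The helper name `steps_per_epoch` is made up for illustration and is not from qlora.py. It assumes a single device, and the exact rounding depends on the trainer and library version, so treat the result as an estimate. With exactly 30,000 rows it lands a bit under the ~325 quoted above, since "~30k" is only approximate.

```python
import math

def steps_per_epoch(num_rows, batch_size, gradient_accumulation_steps=1):
    """Approximate optimizer steps in one pass over the data (single device).

    Hypothetical helper for illustration; real trainers may round differently.
    """
    # Number of batches the dataloader yields, keeping the last partial batch.
    batches = math.ceil(num_rows / batch_size)
    # One optimizer step per `gradient_accumulation_steps` batches.
    return max(1, batches // gradient_accumulation_steps)

rows = 30_000      # "~30k training rows" from the post, taken as exact here
save_steps = 250

no_accum = steps_per_epoch(rows, batch_size=6)
with_accum = steps_per_epoch(rows, batch_size=6, gradient_accumulation_steps=16)

print("steps/epoch without accumulation:", no_accum)
print("steps/epoch with accumulation of 16:", with_accum)
print("fraction of an epoch between saves (no accumulation):", save_steps / no_accum)
```

Multiply the per-epoch figure by num_train_epochs to get an estimate of total steps for the run.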

1 epoch = one pass through the data, but the number of steps varies a lot with config, so I just updated the code to take epochs instead. That keeps the number of passes through the training data consistent without having to calculate steps.

The save steps isn't as important IMO, but in the event of a crash, or if you want to be able to test earlier checkpoints, it can be useful.
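To get a feel for how many checkpoints a given save_steps produces, here is a small sketch. The helper `checkpoint_steps` is hypothetical, and the numbers assume 3 epochs at about 5k steps per epoch (batch size 6, no gradient accumulation). It only counts the periodic saves; whether a final save also happens at the end of training depends on the script.

```python
def checkpoint_steps(total_steps, save_steps):
    """Steps at which a periodic checkpoint would be written (illustrative only)."""
    return list(range(save_steps, total_steps + 1, save_steps))

total_steps = 3 * 5000   # assumed: 3 epochs at ~5k steps per epoch
saved = checkpoint_steps(total_steps, save_steps=250)

print("number of checkpoints:", len(saved))
print("first few checkpoint steps:", saved[:3])
```

With gradient accumulation the total step count shrinks, so the same save_steps yields far fewer checkpoints. Worth keeping in mind if you want several earlier checkpoints to test.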
