UNI input image size during training
Dear Mahmood Lab Team,
Thank you for making UNI available.
I would like to understand which input image size UNI was trained at.
From the UNI publication it appears to be 256x256 (for the majority of iterations) and 512x512 (for high-resolution fine-tuning at the end); however, the Hugging Face docs suggest an input image size of 224x224 pixels. Why is this the case?
Of course, ideally the ViT should be able to generalize to varying sequence lengths, but replicating the training input image size seems like the most straightforward approach.
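For what it's worth, a quick probe like the following (a minimal sketch assuming the timm-based loading suggested in the model card, and access to the gated MahmoodLab/uni weights) should show whether the position embeddings can be interpolated to non-224 grids:

```python
import timm
import torch

# Minimal sketch: load UNI via timm from the Hugging Face Hub, as
# suggested in the model card (requires access to the gated weights).
model = timm.create_model(
    "hf-hub:MahmoodLab/uni",
    pretrained=True,
    init_values=1e-5,        # layer-scale init from the model card
    dynamic_img_size=True,   # interpolate position embeddings as needed
)
model.eval()

# Probe the documented 224x224 size and the 256x256 / 512x512 sizes
# reported in the publication; with dynamic_img_size=True the ViT
# resizes its position-embedding grid on the fly for each input.
for size in (224, 256, 512):
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        feats = model(x)
    print(size, tuple(feats.shape))
```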
Looking forward to your reply.
Best,
Lydia
Hi all,
We have also run into the same confusion. Has there been any guidance on this? Does it mean a 224x224 resize is applied to the 256x256 and 512x512 images?
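If so, the preprocessing would presumably look something like the sketch below (not the official transform; the mean/std here are the ImageNet values commonly paired with timm ViTs and may not match the model card exactly):

```python
from torchvision import transforms

# Hypothetical sketch of a 224x224 resize applied to 256x256 (or
# 512x512) tiles; exact values may differ from the official model card.
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),  # ImageNet mean
                         std=(0.229, 0.224, 0.225)),  # ImageNet std
])
```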
Looking forward to any reply!
Thank you and Best,
Alex