Insight about Prithvi

#13
by AnuIdame - opened

We aim to understand the architecture of Prithvi-100M, including
a) How to deal with heterogeneous data. In this case, different resolutions/picture sizes and possibly different numbers of channels
b) How to deal with metadata, such as spatial positioning of the image
c) What goes into pre-training and what into fine-tuning
Can you help us understand these details?

IBM NASA Geospatial org

Hello, dealing with your questions in order:
a) The model is trained for a specific resolution, and deviating from that will likely result in poor performance. Changing the number of input channels is not possible without re-training.
b) The model itself is location-agnostic in that spatial metadata is not an input.
c) Pre-training is self-supervised and sees many more images than the fine-tuning procedure does. During pre-training, the model masks parts of the input images and then reconstructs them, learning relationships within the inputs. During fine-tuning, the model is provided fewer but labelled images and learns to predict those labels.
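The masked-reconstruction idea described in (c) can be sketched without any deep-learning framework. The numbers below (6 bands, 224x224 images, 16x16 patches, 75% mask ratio) are illustrative assumptions, not values taken from the model card, and the zero "prediction" stands in for what a real encoder-decoder would produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one 6-band image of 224x224 split into 16x16 patches,
# mimicking the masked-reconstruction pre-training described above.
bands, size, patch = 6, 224, 16
n_patches = (size // patch) ** 2          # 14 * 14 = 196 patches
image = rng.normal(size=(bands, size, size))

# Flatten each patch into a token vector (bands * patch * patch values).
tokens = image.reshape(bands, size // patch, patch, size // patch, patch)
tokens = tokens.transpose(1, 3, 0, 2, 4).reshape(n_patches, -1)

# Randomly mask 75% of the patches; only the visible 25% would be encoded.
mask_ratio = 0.75
n_masked = int(n_patches * mask_ratio)
perm = rng.permutation(n_patches)
masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

# A real model would encode the visible tokens and decode a prediction for
# the masked ones; here a stand-in "prediction" is simply zeros.
prediction = np.zeros_like(tokens)

# The self-supervised loss is computed only on the masked patches,
# so the model must infer missing content from what remains visible.
loss = np.mean((prediction[masked_idx] - tokens[masked_idx]) ** 2)
print(n_masked, len(visible_idx), round(float(loss), 3))
```

The key point the sketch shows is that no labels are needed: the image itself is the training target, which is why pre-training can consume far more data than fine-tuning.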

@CEPhillips Thank you very much for the explanation. We have two more questions as well.

  1. Does the pre-training dataset include data such as burn-scar and flood detection?
  2. It is stated that the pre-training images are time series where each input has three time stamps. What is the time gap between them? Do these three images have any spatial shift, or do they overlap perfectly (i.e., are they of the exact same location)?
IBM NASA Geospatial org

I'm glad that helps. The pre-training does not include burn-scar or flood detection tasks; those are handled during fine-tuning. Regarding the input data, HLS observations are tiled and there are no spatial shifts between times. It is my understanding that the times can vary somewhat due to irregular overpasses by the satellite.

carlosgomes98 changed discussion status to closed
AnuIdame changed discussion status to open

@CEPhillips The description states "The model can also handle static imagery which can be fed into the model with T=1." Do you feed T=1 images during pre-training? If so, how does the network handle T=1 and T=3 differently?

IBM NASA Geospatial org

Hi @AnuIdame ! For pretraining we always fed 3 images. However, as you can see from the downstream tasks we tackled, once you use the pretrained encoder you can customise it to the number of time steps (T) you want.
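One way to see why the encoder tolerates a different T: a ViT-style encoder treats the input as a sequence of space-time patch tokens, so changing T mainly changes the sequence length, not the weights. A rough sketch of the token count (the 16x16 patch size matches the model card; treating each time step as its own slice of tokens is an assumption about the patch embedding, not a confirmed detail):

```python
def num_tokens(t: int, h: int, w: int, patch: int = 16) -> int:
    """Number of patch tokens a ViT-style encoder sees for a T x H x W input,
    assuming each time step contributes its own (H/patch) x (W/patch) grid."""
    return t * (h // patch) * (w // patch)

# Pre-training uses three time steps; a static image is just a shorter sequence.
print(num_tokens(3, 224, 224))  # 588 tokens
print(num_tokens(1, 224, 224))  # 196 tokens
```

Since transformer weights are shared across sequence positions, the same encoder can process either sequence; only components tied to T (such as temporal position embeddings) need adjusting downstream.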

Hello, can someone help me fine-tune this model to predict something? How can I start? I want to do it on Google Colab but don't know where to begin: how do I import the model there, and what do I do after that? Any advice would be highly appreciated.
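The general recipe discussed earlier in this thread is: download the pretrained checkpoint (e.g. from the Hugging Face Hub), keep the encoder frozen or lightly tuned, and train a small task head on your labelled data. Below is a minimal, framework-free sketch of that idea only, where a random projection stands in for the frozen pretrained encoder and a logistic-regression head is trained on top; none of the names here come from the actual Prithvi code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a frozen pretrained encoder: it maps each
# input vector to a fixed feature vector (here just a random projection).
W_frozen = rng.normal(size=(32, 8))
def frozen_encoder(x):
    return np.tanh(x @ W_frozen)

# Small labelled dataset: fine-tuning is where the labels come in.
X = rng.normal(size=(200, 32))
y = (X[:, 0] > 0).astype(float)

# Trainable task head: logistic regression on the frozen features.
feats = frozen_encoder(X)
w, b = np.zeros(8), 0.0
lr = 0.5
for _ in range(300):
    p = 1 / (1 + np.exp(-(feats @ w + b)))   # sigmoid
    w -= lr * feats.T @ (p - y) / len(y)     # gradient step on the head only
    b -= lr * np.mean(p - y)                 # the encoder stays untouched

acc = float(np.mean((p > 0.5) == y))
print(round(acc, 2))
```

In practice you would replace the stand-in encoder with the real pretrained weights loaded in PyTorch and the logistic head with a segmentation or classification head, but the division of labour (frozen or slowly-tuned encoder, freshly trained head on labelled data) is the same.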
