MAESTRO: Masked AutoEncoders for Multimodal, Multitemporal, and Multispectral Earth Observation Data
Abstract
MAESTRO, an adapted Masked Autoencoder with optimized fusion strategies and spectral prior normalization, achieves state-of-the-art performance on multitemporal Earth observation tasks.
Self-supervised learning holds great promise for remote sensing, but standard self-supervised methods must be adapted to the unique characteristics of Earth observation data. We take a step in this direction by conducting a comprehensive benchmark of fusion strategies and reconstruction target normalization schemes for multimodal, multitemporal, and multispectral Earth observation data. Based on our findings, we propose MAESTRO, a novel adaptation of the Masked Autoencoder, featuring optimized fusion strategies and a tailored target normalization scheme that introduces a spectral prior as a self-supervisory signal. Evaluated on four Earth observation datasets, MAESTRO sets a new state-of-the-art on tasks that strongly rely on multitemporal dynamics, while remaining highly competitive on tasks dominated by a single mono-temporal modality. Code to reproduce all our experiments is available at https://github.com/ignf/maestro.
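As a rough illustration of the ingredients the abstract names, the sketch below shows generic Masked Autoencoder pretraining mechanics: random token masking, patch-wise normalization of reconstruction targets, and a loss computed on masked patches only. It is a minimal sketch of the standard MAE recipe, not MAESTRO's spectral-prior normalization scheme, whose details the abstract does not give. All function names (`random_mask`, `normalize_patch`, `masked_mse`) and the toy data are hypothetical and chosen for illustration.

```python
import random
import statistics

def random_mask(num_tokens, mask_ratio, rng):
    """Return (kept, masked) index lists; an encoder would only see `kept`."""
    num_masked = int(round(num_tokens * mask_ratio))
    order = list(range(num_tokens))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])
    return kept, masked

def normalize_patch(patch, eps=1e-6):
    """Generic patch-wise target normalization: zero mean, unit variance."""
    mean = statistics.fmean(patch)
    var = statistics.pvariance(patch, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in patch]

def masked_mse(predictions, targets, masked):
    """Mean squared error over masked patches only (assumes `masked` is non-empty)."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(predictions[i], targets[i]):
            total += (p - t) ** 2
            count += 1
    return total / count

# Toy data (assumption): 8 patches of 16 values each, standing in for image tokens.
rng = random.Random(0)
patches = [[rng.random() for _ in range(16)] for _ in range(8)]

kept, masked = random_mask(len(patches), mask_ratio=0.75, rng=rng)
targets = [normalize_patch(p) for p in patches]

# A trivial "decoder" that predicts zeros everywhere, used only to exercise the loss.
zero_prediction = [[0.0] * 16 for _ in patches]
loss = masked_mse(zero_prediction, targets, masked)
```

Because the targets are normalized to unit variance per patch, a predictor that always outputs zero gets a loss close to 1. A normalization scheme that changes which statistics are removed from the target therefore changes what the model must learn to reconstruct, which is the general lever the abstract refers to.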
Community
We present MAESTRO, a novel adaptation of the Masked Autoencoder for Earth observation. It introduces optimized multimodal fusion strategies and a target normalization scheme that uses a spectral prior as a self-supervisory signal. On four Earth observation datasets, MAESTRO achieves state-of-the-art results on tasks that rely strongly on multitemporal dynamics and remains highly competitive on tasks dominated by a single mono-temporal modality.
The datasets are already available on our organization page, the code is public, and the models will follow shortly.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images (2025)
- MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models (2025)
- SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing (2025)
- Asymmetric Dual Self-Distillation for 3D Self-Supervised Representation Learning (2025)
- TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis (2025)
- SpectralX: Parameter-efficient Domain Generalization for Spectral Remote Sensing Foundation Models (2025)
- TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras (2025)