An overview of our self-supervised multi-modal contrastive learning approach for Vision Transformers. We propose to use large datasets of unlabelled multi-modal remote sensing data (Sentinel-1 and Sentinel-2) for self-supervised pre-training of vision Transformers using a contrastive loss. After self-supervised training of the backbone (A), the model and task-specific head can be fine-tuned on much smaller labelled datasets for different downstream tasks (B), such as land-cover segmentation and classification.

Transformer models have recently approached or even surpassed the performance of ConvNets on computer vision tasks like classification and segmentation. To a large degree, these successes have been enabled by the use of large-scale labelled image datasets for supervised pre-training. This poses a significant challenge for the adaptation of vision Transformers to domains where datasets with millions of labelled samples are not available. In this work, we bridge the gap between ConvNets and Transformers for Earth observation by self-supervised pre-training on large-scale unlabelled remote sensing data. We show that self-supervised pre-training yields latent task-agnostic representations that can be utilized for both land cover classification and segmentation tasks, where they significantly outperform the fully supervised baselines. Additionally, we find that subsequent fine-tuning of Transformers for specific downstream tasks performs on par with commonly used ConvNet architectures. An ablation study further illustrates that the labelled dataset size can be reduced to one-tenth after self-supervised pre-training while still maintaining the performance of the fully supervised approach.

Our proposed approach is trained in two stages: self-supervised pre-training of the vision Transformer backbone on large datasets of unlabelled remote sensing data, followed by fine-tuning of the backbone and a task-specific head on much smaller labelled datasets for different downstream tasks. For pre-training, we utilize the SEN12MS (Schmitt et al., 2019) dataset, which contains co-located pairs of Sentinel-1 and Sentinel-2 patches, disregarding the available labels; for fine-tuning, we utilize the GRSS Data Fusion Contest 2020 (DFC2020) dataset, which comes with high-fidelity land use/land cover (LULC) segmentation labels.
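The two-stage procedure can be summarised in code. The sketch below is a minimal illustration under stated assumptions, not the authors' released implementation: the contrastive objective is assumed to be an NT-Xent / InfoNCE-style loss between the paired Sentinel-1 and Sentinel-2 embeddings, and `backbone`, `head` and the optimizer/data handling are hypothetical placeholders.

```python
# Minimal sketch of the two training stages (assumed NT-Xent-style contrastive
# objective; backbone, head and inputs are hypothetical placeholders).
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """Treat co-located S1/S2 embeddings as positive pairs and all other
    patches in the batch as negatives (InfoNCE / NT-Xent style)."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                  # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def pretrain_step(backbone, optimizer, s1, s2):
    """Stage A: self-supervised pre-training on unlabelled SEN12MS pairs."""
    z1, z2 = backbone(s1, s2)                           # one embedding per modality
    loss = contrastive_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(backbone, head, optimizer, s1, s2, mask):
    """Stage B: supervised fine-tuning of backbone + task head on DFC2020."""
    z1, z2 = backbone(s1, s2)
    logits = head(z1, z2)                               # e.g. per-pixel class scores
    loss = F.cross_entropy(logits, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the only difference between the two stages is the training signal: co-location of the two modalities in stage A, and the much smaller set of labels in stage B.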

Architecture of our task-agnostic backbone. For Sentinel-1 and Sentinel-2 input pairs, we pre-train a single backbone consisting of two Swin Transformer streams using a self-supervised contrastive loss.
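To make the two-stream design concrete, here is a hypothetical sketch of such a backbone built from timm Swin encoders. The model variant, the channel counts (2 Sentinel-1 polarisations, 13 Sentinel-2 bands) and the linear projection heads are assumptions for illustration and may differ from the configuration used in the paper.

```python
# Hypothetical two-stream Swin backbone: one encoder per modality plus a
# projection head, returning the pair (Z1, Z2) used by the contrastive loss.
import timm
import torch.nn as nn

class TwoStreamSwinBackbone(nn.Module):
    def __init__(self, embed_dim=128,
                 s1_channels=2,       # assumption: Sentinel-1 VV/VH
                 s2_channels=13):     # assumption: all Sentinel-2 bands
        super().__init__()
        # num_classes=0 -> timm returns pooled features instead of logits
        self.s1_encoder = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False,
            in_chans=s1_channels, num_classes=0)
        self.s2_encoder = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False,
            in_chans=s2_channels, num_classes=0)
        feat_dim = self.s1_encoder.num_features
        self.s1_proj = nn.Linear(feat_dim, embed_dim)
        self.s2_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, s1, s2):
        z1 = self.s1_proj(self.s1_encoder(s1))
        z2 = self.s2_proj(self.s2_encoder(s2))
        return z1, z2
```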
Architecture of classification and segmentation heads. For the supervised training of both tasks, the two outputs of the backbone (Z1, Z2) are fed into each head. Intermediate representations (Z1i, Z2i) are also used for the segmentation head. The final projection layer of the segmentation head consists of an up-sampling layer followed by a 1x1 convolutional layer.
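In the same spirit, the sketch below shows a strongly simplified segmentation head: it fuses feature maps from the two streams and applies the final projection described above (an up-sampling layer followed by a 1x1 convolution). The skip connections from the intermediate representations (Z1i, Z2i) used in the actual decoder are omitted for brevity, and all layer sizes are illustrative.

```python
# Simplified segmentation head: fuse the deepest S1/S2 feature maps and apply
# the final projection (up-sampling followed by a 1x1 convolution). The skip
# connections from intermediate representations are intentionally left out.
import torch
import torch.nn as nn

class SimpleSegmentationHead(nn.Module):
    def __init__(self, in_channels, num_classes, upsample_factor=32):
        super().__init__()
        self.fuse = nn.Conv2d(2 * in_channels, in_channels,
                              kernel_size=3, padding=1)
        self.upsample = nn.Upsample(scale_factor=upsample_factor,
                                    mode="bilinear", align_corners=False)
        self.classify = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, f1, f2):
        """f1, f2: (B, C, H, W) feature maps from the S1 and S2 streams."""
        x = torch.relu(self.fuse(torch.cat([f1, f2], dim=1)))
        return self.classify(self.upsample(x))
```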

Our results show that the latent representations derived through self-supervised pre-training are task-agnostic and can be utilized for both land cover classification and segmentation. They also show that self-supervised learning (SSL) in combination with vision Transformers or ConvNets can yield large performance gains (up to +30% over supervised baselines) across different downstream tasks when fine-tuned with labelled data.

Classification and segmentation results for training on different fractions of labelled data.

Moreover, we are able to show the label efficiency of pre-training with our approach: with only 10% of the labelled data, all self-supervised models outperform the best supervised baselines trained on the entire dataset. For all models, performance increases rapidly with the amount of available labelled training data.

Qualitative comparison of results for three different regions. From left to right: Sentinel-2 true color (RGB), DFC ground truth, UNet trained from scratch on the fusion of both inputs, SwinUNet trained from scratch on both inputs, SwinUNet fine-tuned on both inputs, and finally an ensemble of the UNet and SwinUNet models.

To conclude, our approach illustrates the utility of Transformer models for Earth observation applications without the need for large labelled datasets.

The results of this study were presented at the CVPR 2022 EarthVision workshop in New Orleans, USA, where the work received the Best Student Paper Award.