overview_figure
We propose a method to adapt pre-trained visual foundation models to unseen remote sensing modalities in three steps: First, we add additional parameters, called SLR Adapters, to the foundation model. Then, the new parameters are trained with a masked autoencoding objective on unlabeled data from the new domain. Finally, the adapter parameters are fine-tuned in supervised fashion for the task of interest.

Background

Pre-trained foundation models are widely used to solve computer vision problems, including with remote sensing imagery. The remote sensing field is characterized by widely varying imaging sensors and heterogeneous datasets. This limits the applicability of many visual foundation models, which are often pre-trained on a specific image modality. In this work, we present a self-supervised adaptation technique that makes it possible to efficiently adapt large, pre-trained foundation models to previously unseen imaging modalities.

Method

We propose a method that adapts existing foundation models to new modalities by introducing a small number of additional parameters and training the new parameters in self-supervised fashion on the target modality. These scaled low-rank (SLR) adapter parameters are added throughout the layers of the model to augment linear transforms in the transformer feed-forward blocks and QKV projections. Each adapter consists of learnable input and output scaling vectors s_{i,1} and s_{i,2} and low-rank matrices W_{i,1} and W_{i,2} that form a residual connection around the pre-trained linear transform W_i.

slr_figure
Transformer block (left) with SLR adapters (right). We add individual SLR adapters to linear transformations in the QKV projection and MLP layers of the transformer block. SLR freezes the original transform W_i and introduces trainable scaling parameters s_{i,1}, s_{i,2} and low-rank matrices W_{i,1}, W_{i,2}.
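
To make the adapter structure concrete, here is a minimal PyTorch sketch of one way such an adapter could be implemented. The class and argument names (SLRAdapter, rank) are our own, and the exact placement of the scaling vectors follows one plausible reading of the description above rather than the reference implementation.

```python
import torch
import torch.nn as nn

class SLRAdapter(nn.Module):
    """Sketch of a scaled low-rank (SLR) adapter wrapped around a frozen linear layer.

    The frozen pre-trained transform W_i is augmented with a residual branch made of
    an input scaling s_{i,1}, a low-rank bottleneck W_{i,2} W_{i,1}, and an output
    scaling s_{i,2}. This is an illustrative sketch, not the authors' code.
    """

    def __init__(self, base_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():              # freeze the pre-trained W_i
            p.requires_grad = False

        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.s_in = nn.Parameter(torch.ones(d_in))    # input scaling vector  s_{i,1}
        self.s_out = nn.Parameter(torch.ones(d_out))  # output scaling vector s_{i,2}
        self.W1 = nn.Linear(d_in, rank, bias=False)   # low-rank down-projection W_{i,1}
        self.W2 = nn.Linear(rank, d_out, bias=False)  # low-rank up-projection   W_{i,2}
        nn.init.zeros_(self.W2.weight)                # adapter branch starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen pre-trained transform plus the scaled low-rank residual branch
        return self.base(x) + self.s_out * self.W2(self.W1(self.s_in * x))
```

In a ViT backbone, each nn.Linear of the QKV projection and the MLP blocks would be replaced by such a wrapper, so that only the scaling vectors and low-rank matrices receive gradients.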

The adapters are trained with a masked-autoencoding approach on unlabeled data from the target domain. For instance, an RGB foundation model can be adapted to SAR data by adding the adapters and training them on a SAR dataset. In the final step, we fine-tune the adapters along with a task-specific model head (e.g., segmentation or regression) on a labeled dataset that can be much smaller.
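
The two stages differ only in the objective and the attached head; in both, the pre-trained backbone weights stay frozen and only the adapter parameters (plus the task head in the second stage) are updated. The sketch below illustrates this procedure under those assumptions; adapter_parameters, adapt_then_finetune, the MAE loss, and the data loaders are hypothetical placeholders, not the authors' training code.

```python
import torch

def adapter_parameters(model: torch.nn.Module):
    """Yield only the trainable SLR adapter parameters (scalings and low-rank matrices)."""
    for module in model.modules():
        if isinstance(module, SLRAdapter):  # SLRAdapter from the sketch above
            yield from (p for p in module.parameters() if p.requires_grad)

def adapt_then_finetune(backbone, mae_loss, task_head, criterion,
                        unlabeled_loader, labeled_loader, epochs=10, lr=1e-4):
    # Stage 1: masked autoencoding on unlabeled target-domain data (e.g., SAR);
    # only the adapter parameters are optimized.
    opt = torch.optim.AdamW(adapter_parameters(backbone), lr=lr)
    for _ in range(epochs):
        for batch in unlabeled_loader:
            loss = mae_loss(backbone, batch)  # placeholder MAE reconstruction objective
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: supervised fine-tuning of the adapters together with a task-specific
    # head on a (possibly much smaller) labeled dataset.
    params = list(adapter_parameters(backbone)) + list(task_head.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for images, labels in labeled_loader:
            loss = criterion(task_head(backbone(images)), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
```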

Results

avg_acc_figure
Average performance (colored bars) and number of trainable parameters (gray bars) for different visual foundation models across 8 remote sensing datasets. Our scaled low-rank (SLR) adapter method achieves significant performance improvements across datasets and models with no (SLR Linear) or as little as 1-2% (SLR fine-tuned) additional parameters.

We test our method across different remote sensing datasets and visual foundation models. Average linear evaluation accuracy across eight classification and segmentation tasks improves significantly after incorporating and training SLR adapters for the MAE, SatMAE, and Scale-MAE foundation models. Fine-tuning the SLR adapters (1-2% of model parameters) yields performance on par with supervised fine-tuning of the entire 300M-parameter ViT-L model.

seg_figure
Segmentation examples on Sentinel-1 SAR imagery from the BENGE-8k dataset. Predictions for 8 different land-cover classes obtained with SatMAE combined with an FCN segmentation head. Comparisons between full SatMAE fine-tuning and fine-tuning of SLR adapters.

Qualitatively, we find that self-supervised adaptation improves the segmentation results on SAR data for foundation models that were trained on optical RGB data.

SLR Adapters

In our experiments, SLR adapters outperform parameter-efficient fine-tuning methods such as BitFit, (IA)³, low-rank adaptation (LoRA), and tuning of normalization layers (see Tab. 5 in the paper).

Additionally, SLR adapters facilitate few-shot learning through their scaling vectors, which can be adjusted effectively on very small datasets. We also find that self-supervised training of the adapters before supervised fine-tuning improves downstream performance, even without access to additional unlabeled data.
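
As an illustration, restricting the optimizer to the scaling vectors could look roughly like this; the sketch reuses the hypothetical SLRAdapter class from above, and `backbone` stands in for the adapted model.

```python
import torch

def scaling_parameters(model: torch.nn.Module):
    """Yield only the scaling vectors s_{i,1} and s_{i,2} of the SLR adapters."""
    for module in model.modules():
        if isinstance(module, SLRAdapter):  # SLRAdapter from the sketch above
            yield module.s_in
            yield module.s_out

# Few-shot setting: adjust only the scaling vectors, a tiny fraction of all parameters.
optimizer = torch.optim.AdamW(scaling_parameters(backbone), lr=1e-3)
```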

Conclusion

We present a new method to adapt foundation models to unseen modalities while avoiding costly fine-tuning. This enables supervised training of large models with a limited number of labeled samples and improves downstream performance on new modalities.