Abstract

Aerial image segmentation is semantic segmentation of imagery captured from a top-down perspective. It poses several challenges: foreground-background imbalance, complex backgrounds, intra-class heterogeneity, inter-class homogeneity, and tiny objects.

AerialFormer unifies a Transformer that extracts multi-scale features in the contracting path with lightweight multi-dilated convolutional neural networks in the expanding path. The design combines local and global context for high-resolution segmentation.
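To make the idea of multi-dilated convolution concrete, the following is a minimal pure-Python sketch and not the AerialFormer implementation. It applies one 3x3 kernel at several dilation rates and stacks the results as channels. The function names, the dilation rates (1, 2, 3), and the kernel shared across branches are all assumptions made for illustration; a real decoder block would presumably learn separate weights per branch and run on a tensor library.

```python
from typing import List, Sequence

Grid = List[List[float]]


def dilated_conv2d(image: Grid, kernel: Grid, dilation: int) -> Grid:
    """Zero-padded, same-size 2D cross-correlation with a dilation rate.

    Assumes a square kernel with odd side length (hypothetical helper).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_dilated_block(image: Grid, kernel: Grid,
                        dilations: Sequence[int] = (1, 2, 3)) -> List[Grid]:
    """Run the same kernel at several dilation rates; one channel per rate.

    The rates (1, 2, 3) are illustrative assumptions.
    """
    return [dilated_conv2d(image, kernel, d) for d in dilations]


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Side length of the input window seen by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# A 7x7 image with a single bright pixel in the centre.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
kernel = [[1.0] * 3 for _ in range(3)]

channels = multi_dilated_block(image, kernel)
for d in (1, 2, 3):
    print(f"dilation={d}: receptive field {receptive_field(3, d)}x{receptive_field(3, d)}")
```

Small dilation rates respond to a pixel's immediate neighbourhood, while larger rates sample a wider window at the same parameter cost, which is one way a lightweight decoder can mix local and wider context.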