1 Introduction

Automatic aircraft detection and segmentation in remote sensing images (RSIs) is of great significance in military and civilian applications and has attracted increasing attention in recent years [3, 5, 22]. With the improvement of RSIs, computing power, big data and image processing, many aircraft detection and segmentation methods have been presented, which can be divided into traditional methods based on feature extraction and template matching [7, 16, 24] and deep learning based methods [2, 18]. Because the aircraft in RSIs are often small, with diverse sizes, arbitrary orientations, illumination changes, various scenes, complex backgrounds and a large amount of interference, as shown in Fig. 1, the traditional aircraft detection and segmentation methods still suffer from low recognition accuracy, high time consumption and poor generalization [14]. From Fig. 1, it is clear that RSIs have a low signal-to-noise ratio, irregular obstacles and complex backgrounds, and the shapes of aircraft are deformed, irregular or asymmetric, with different sizes. It is therefore time-consuming to design shape templates for every aircraft shape, it is difficult to extract rotation-invariant and scale-invariant features from RSIs for aircraft detection and segmentation, and general automatic aircraft detection methods often fail to detect small, low-resolution aircraft in RSIs [25]. Yang et al. [21] presented a multiple knowledge representation (MKR) framework and discussed its potential for developing big data artificial intelligence (AI) techniques with possible broader impacts across different AI areas. MKR is an advanced AI representation framework for intelligent multi-scale feature aggregation and multi-scale image segmentation.

Fig. 1. RSI examples with different scales, orientations and environments

In recent years, convolutional neural networks (CNNs) and their variants have achieved remarkable results in various RSI segmentation, detection and recognition tasks, including aircraft detection in RSIs [4, 9]. Zhang et al. [23] proposed an aircraft detection framework based on CNNs to detect multi-scale targets in extremely large and complicated scenes; they designed a constrained EdgeBoxes approach to generate a modest number of target candidates quickly and precisely, and constructed a modified GoogLeNet combined with Fast Region-based CNN (R-CNN) to extract useful features from RSIs for multi-scale aircraft detection. Zhong et al. [27] proposed an airplane detection method for RSIs based on deep learning and transfer learning, adopting a single deep CNN and limited training samples to implement end-to-end trainable airplane detection and ensure an optimal solution for the final stage. Yang et al. [20] developed a method for aircraft detection in RSIs based on a deep residual network (ResNet) and super-vector coding. They designed a variant of ResNet with fewer layers to increase the resolution of the feature map, integrated the multi-level convolutional features into an informative feature description for region proposal, and extracted the histogram of oriented gradients (HOG) with super-vector coding from each region of interest (ROI) to assist the convolutional features in object classification. Wang et al. [13] proposed a compact multi-scale dense CNN (MS-DenseNet) for aircraft detection in RSIs, combined a feature pyramid network (FPN) with DenseNet to learn multi-scale features, and designed three compact architectures for detecting small aircraft: MS-DenseNet-41, MS-DenseNet-65 and MS-DenseNet-77. Comparative experiments showed that the compact MS-DenseNet-65 is very effective, achieving state-of-the-art performance with a recall of 94%, an F1-score of 92.7% and less computational time. Wu et al. [17] constructed the WFA-1400 dataset and proposed an improved Mask R-CNN model to enhance detection in high-resolution RSIs containing dense targets and complex backgrounds; the model uses a modified Mask R-CNN based on the ResNet101 backbone to obtain more discriminative features and adds a set of dilated convolutions of a specific size to improve instance segmentation. Pan et al. [10] constructed a cascade CNN (CCNN) framework based on transfer learning and geometric feature constraints (GFC) for aircraft detection in RSIs, achieving accurate and efficient detection with relatively few samples. CCNN consists of an image classifier and an object detector: transfer learning is used to fine-tune pre-trained models with few samples, a GFC region proposal filtering method improves detection efficiency, and aircraft detection is completed by the CCNN.

Fully convolutional networks (FCN), SegNet and U-Net are three common and well-known backbone networks in complex image segmentation, tracking, detection and recognition [8, 12]. In general, the pooling operation loses the high-frequency components of the image, blurring image blocks and discarding positional information. U-Net uses skip connections to link the low-level and high-level feature maps, capturing both coarse and fine information at the deconvolutional layers. Because of its learnable upsampling, U-Net has many more parameters to learn and is comparatively slower to train than SegNet, while SegNet does not capture multi-scale information as effectively as U-Net. Owing to the excellent performance of U-Net, many image segmentation and detection methods are based on modified U-Nets.

Attention mechanisms have been widely added to deep learning models to filter out unimportant features. Wang et al. [15] designed a symbiotic attention with object-centric feature alignment framework (SAOA) and proposed a symbiotic attention mechanism to encourage mutual interaction between the two branches and select the most action-relevant candidates for classification. Xu et al. [19] introduced an attention mechanism into U-Net to effectively improve the prediction performance of the network; the attention module takes the upper-level feature maps of the upsampling path and the feature maps from the downsampling process as input. Zhao et al. [26] introduced spatial attention into U-Net and proposed a lightweight network model named Spatial Attention U-Net (SA-UNet) for gland segmentation; by introducing a small number of additional parameters, spatial attention enhances important features and suppresses unimportant ones, improving the representation ability of SA-UNet. Zhuang et al. [28] designed an improved U-Net, RDAU-NET, for ultrasound segmentation of breast tumors, in which the contraction path increases the receptive field of the network and an Attention Gate module replaces the original cropping and copying operation; although this network outperforms U-Net, it has shortcomings in multi-scale feature extraction. Ibtehaz et al. [6] demonstrated that the classical U-Net architecture is deficient in some aspects, constructed a modified U-Net model named MultiResUNet by introducing MultiRes blocks and Respaths into U-Net, and tested and compared it with the classical U-Net on a vast repertoire of multimodal medical images.

U-Net and its variants have increasingly been applied to aircraft detection and segmentation in recent years, but the features of small, low-resolution aircraft may be submerged by redundant top-level features, resulting in poor detection. To address these problems, exploiting the advantages of residual networks, U-Net, MultiResUNet, attention mechanisms and multi-scale convolution, a multi-scale residual U-Net with attention (MSRAU-Net) is constructed for automatic aircraft segmentation in RSIs. The main contributions of this paper are as follows:

  • Inception and attention modules are introduced into MSRAU-Net to extract multi-scale features, improving the standard U-Net architecture.

  • A Respath with an attention module is introduced into MSRAU-Net to effectively connect the features of the low-level and high-level layers.

  • Extensive experiments are conducted on the RSI dataset to evaluate MSRAU-Net.

The rest of this paper is arranged as follows. Section 2 introduces the related works, including U-Net, MultiResUNet, dilated convolution, residual networks and Inception. MSRAU-Net is described in detail in Section 3. Experiments and results are presented in Section 4. Lastly, conclusions are drawn and future work is discussed in Section 5.

2 Related works

2.1 U-Net

U-Net is a symmetric U-shaped network, as shown in Fig. 2, consisting of a contraction network on the left side, an expansion network on the right side, and skip connections in between.

Fig. 2. U-Net architecture

In U-Net, the downsampling encoder of the contraction network extracts spatial features from the image, the upsampling decoder of the expansion network constructs the segmentation map from the encoded features, and 4 skip connections concatenate the low-level and high-level feature maps across the encoding and decoding process.
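To make the data flow concrete, a minimal U-Net skeleton in TensorFlow/Keras (the framework used in Section 4) might look as follows; the depth and filter counts are illustrative assumptions, not the exact configuration of Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions, as in the standard U-Net encoder/decoder blocks
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def tiny_unet(input_shape=(512, 512, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    # Contraction (encoder): convolutions + max-pooling, channels doubled
    c1 = conv_block(inputs, 64)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, 128)
    p2 = layers.MaxPooling2D(2)(c2)
    # Bottleneck
    b = conv_block(p2, 256)
    # Expansion (decoder): upsampling + skip-connection concatenation
    u2 = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(b)
    c3 = conv_block(layers.concatenate([u2, c2]), 128)
    u1 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.concatenate([u1, c1]), 64)
    # 1x1 convolution + sigmoid for a binary segmentation map
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return tf.keras.Model(inputs, outputs)
```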

2.2 MultiResUNet

MultiResUNet was presented by Ibtehaz et al. [6] to enlarge the multi-resolution ability of U-Net. Its architecture is shown in Fig. 3.

Fig. 3. MultiResUNet architecture

Its encoders and decoders are replaced by MultiRes blocks, and the ordinary skip connections of U-Net are replaced by Respaths, in which convolution operations are applied while propagating the feature maps from the encoder stage to the decoder stage. All convolutional layers except the output layer are activated by ReLU; as in U-Net, the output layer is activated by a sigmoid function.
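As a rough sketch of the Respath idea (assuming the residual formulation of [6]; the filter count and path length are illustrative), the skip connection becomes a chain of residual units:

```python
from tensorflow.keras import layers

def respath(x, filters, length=4):
    # Chain of residual units along the skip connection: each unit adds a
    # 3x3 convolution to a 1x1 shortcut projection, so the encoder features
    # are processed further before being concatenated with the decoder.
    for _ in range(length):
        shortcut = layers.Conv2D(filters, 1, padding="same")(x)
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Activation("relu")(layers.Add()([shortcut, y]))
    return x
```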

2.3 Dilated convolution

Dilated convolution can enlarge the receptive field without increasing the network parameters or computation, so as to obtain multi-scale local features of the image while retaining the spatial location information of most pixels [11]. Three dilated convolution kernels of size 3 × 3 are shown in Fig. 4, where the 1-dilated convolution is identical to the ordinary convolution operation.

Fig. 4. Three dilated convolution kernel structures with three dilation rate parameters

From Fig. 4, it can be seen that stacked dilated convolutions can exponentially expand the receptive field, and the dilated convolution operator can apply the same kernel at different ranges using different dilation rates.
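For example, in TensorFlow/Keras the same 3 × 3 kernel can be applied at the three dilation rates of Fig. 4; stacking them grows the receptive field from 3 × 3 to 15 × 15 while each layer keeps the parameter count of an ordinary 3 × 3 convolution (the filter count of 32 is an arbitrary choice for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(512, 512, 1))
# A 3x3 kernel with dilation r covers a (2r+1) x (2r+1) window.
y = layers.Conv2D(32, 3, dilation_rate=1, padding="same", activation="relu")(x)  # 3x3 field
y = layers.Conv2D(32, 3, dilation_rate=2, padding="same", activation="relu")(y)  # 7x7 field
y = layers.Conv2D(32, 3, dilation_rate=4, padding="same", activation="relu")(y)  # 15x15 field
model = tf.keras.Model(x, y)
model.summary()  # parameter count matches three ordinary 3x3 convolutions
```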

2.4 Residual module

Increasing the number of network layers allows richer abstract features to be extracted at different levels: the deeper the network, the more abstract the features and the more semantic information is available. However, simply increasing the depth of the network easily leads to over-fitting, gradient vanishing and gradient explosion. A residual network (ResNet) enables a deep network to extract features containing more image information; these problems can be avoided by its shortcut connection mechanism, and network convergence is accelerated. The residual module structure is shown in Fig. 5, where x is the input, F(x) is the output of the residual module before the second-layer activation function, ‘⊕’ denotes addition, F(x) = W2σ(W1x), W1 and W2 are the weights of the first and second layers, and σ is the ReLU activation function. The final output is σ(F(x) + x); the shortcut path between the input and the output learns the residual of multi-level resolution features. A ResNet is constructed from several such residual modules.

Fig. 5. Residual module
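A direct transcription of Fig. 5 into TensorFlow/Keras could look as follows; this is a sketch that assumes the input already has `filters` channels so that the addition is well defined (otherwise a 1 × 1 projection on the shortcut is needed):

```python
from tensorflow.keras import layers

def residual_module(x, filters):
    # F(x) = W2 sigma(W1 x): two convolutions with ReLU in between
    fx = layers.Conv2D(filters, 3, padding="same")(x)
    fx = layers.Activation("relu")(fx)
    fx = layers.Conv2D(filters, 3, padding="same")(fx)
    # Shortcut connection and final activation: sigma(F(x) + x)
    out = layers.Add()([fx, x])
    return layers.Activation("relu")(out)
```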

2.5 Inception

Inception is a multi-scale convolution module in GoogLeNet consisting of four parallel network branches: 1 × 1, 3 × 3 and 5 × 5 convolution layers and a 3 × 3 max-pooling layer [1]. It uses receptive fields of different sizes to perform multiple convolution operations on the input in parallel and concatenates all feature maps from the different receptive fields. Its structure is shown in Fig. 6. In Inception, several 1 × 1 convolutions are used to reduce the number of parameters and improve the network's adaptability to multi-scale targets. The parallel convolution with three kinds of kernels (1 × 1, 3 × 3 and 5 × 5) effectively handles image-scale changes, and together the kernels form a local sparse structure, which greatly improves the computational efficiency of the model and thus accelerates convergence.

Fig. 6. Inception structure
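A sketch of such a module in TensorFlow/Keras is given below; the branch widths are illustrative assumptions rather than the GoogLeNet values:

```python
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fp):
    # Four parallel branches over the same input, concatenated on channels.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # 1x1 convolutions before the larger kernels reduce the parameter count
    b3 = layers.Conv2D(f3 // 2, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5 // 2, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.concatenate([b1, b3, b5, bp])
```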

3 Multi-scale residual U-Net with attention (MSRAU-Net)

The performance of U-Net is usually improved through three parts: the encoder, the decoder and the connection path. For aircraft segmentation in RSIs, where aircraft exhibit very wide diversity in size, pose, view, background and other visual characteristics, a multi-scale residual U-Net with attention (MSRAU-Net) is constructed to learn multi-scale deep features for detecting aircraft at various scales. Its architecture is shown in Fig. 7.

Fig. 7. MSRAU-Net architecture

The structure of MSRAU-Net is divided into a contraction network on the left and an expansion network on the right. The contraction network performs a series of downsampling operations of convolution and max-pooling, including 4 ResConv blocks. Each ResConv block is followed by a batch normalization layer and a rectified linear unit (ReLU), max-pooling downsampling uses a stride of 2, and the number of feature channels is doubled at each downsampling step. The expansion network consists of 4 blocks; each block doubles the size of the feature map and halves the number of channels by deconvolution, and then concatenates it with the feature map of the symmetric contraction network on the left. Two skip connections and two modified Respaths allow MSRAU-Net to retrieve the spatial information lost by the pooling operations. Since the dimensions of the feature maps of the contraction and expansion networks differ, the contraction network feature maps are normalized to the same size as those of the expansion network before feature fusion. In MSRAU-Net, an attention gate and a spatial attention module are introduced to improve the feature extraction ability, and dropout is applied after each convolutional layer to avoid overfitting and speed up training. At the final layer, a 1 × 1 convolution with Sigmoid activation produces the segmentation map.
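Of the attention components, the channel-weighting attention used in the modified Respaths (described in item (2) below: global average pooling, 1 × 1 convolution and Sigmoid) is the simplest to sketch. The following TensorFlow/Keras fragment is a minimal illustration under those assumptions; the function name and shape handling are not from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_weight_attention(x):
    # Global average pooling -> 1x1 convolution -> Sigmoid produces one
    # weight per channel, which rescales the incoming feature maps.
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)   # shape (batch, C)
    w = layers.Reshape((1, 1, c))(w)         # shape (batch, 1, 1, C)
    w = layers.Conv2D(c, 1, activation="sigmoid")(w)
    return layers.Multiply()([x, w])         # broadcast channel weights
```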

The main components of MSRAU-Net are described in detail as follows:

(1) ResConv block. In aircraft segmentation in RSIs, because aircraft shape, size and orientation vary, it is difficult to distinguish the aircraft area from the irregular background. A serial connection may extract the multi-scale features of multi-scale aircraft insufficiently, while a parallel connection may limit the network's ability to distinguish aircraft from background and consume considerable computing resources. To address both problems, a ResConv block is constructed from the multi-scale Inception and dilated convolution, based on the MultiRes block [6] and Inception-ResNet [1], as shown in Fig. 8A. With the convolution kernel size fixed, different numbers of convolution layers yield different receptive fields, and the output of each convolution layer in a path provides multi-scale fused features. Therefore, the ResConv block concatenates the outputs of the convolutional layers on each path, and adds and fuses the outputs of the ordinary convolutional path and the dilated convolutional path. In this way, the network can not only better distinguish background from target but also extract multi-scale features.

(2) Modified Respath. Different from U-Net and MultiResUNet, two modified Respaths are designed in MSRAU-Net to replace two of the skip connections of U-Net, connecting the contraction network and the expansion network and reducing the semantic gap between their feature layers. Each modified Respath is composed of 4 residual convolution blocks and an attention module, as shown in Fig. 8B, where the attention module consists of global average pooling, 1 × 1 convolution and Sigmoid activation, yielding a channel-weighted feature.

Fig. 8. The main components of MSRAU-Net

(3) Transpose convolution. The decoder consists of decoder blocks corresponding to the encoder modules. First, the feature maps of the encoder module are weighted by the attention module and stacked with the upsampled feature maps. Then, each decoder block applies a 1 × 1 convolution to halve the number of channels, performs batch normalization, and uses transpose convolution to upsample the feature maps. Images with only aircraft and non-aircraft labels are generated by the final convolution layer. The transpose convolution is designed as shown in Fig. 8C.

(4) Spatial attention. Spatial attention uses the spatial relationships between the encoder feature maps to produce a spatial attention map that retains the key information, and then transfers it to the decoder feature maps. Its structure is shown in Fig. 8D. The encoder feature maps are weighted by spatial attention to strengthen the important features of the aircraft region, weaken useless features, enhance the efficiency of feature utilization, and improve the segmentation of small aircraft.

(5) Loss function. The most common loss function in deep learning is the cross-entropy loss. For binary classification, the cross-entropy loss is calculated as,

$$L_{Cross}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log y_i^{\prime}+\left(1-y_i\right)\log\left(1-y_i^{\prime}\right)\right]$$
(1)

where yi and yi′ are the real pixel label value and the predicted pixel value, respectively, and N is the number of pixels.

The Dice coefficient is usually used as a loss function to measure the similarity of two samples, and it is more suitable when the samples are extremely imbalanced. The Dice coefficient loss function is calculated as

$$L_{Dice}=1-\frac{2\left|X\cap Y\right|}{\left|X\right|+\left|Y\right|}$$
(2)

where X and Y represent the generated prediction map and the real label, respectively, |X ∩ Y| is the intersection between the label and the prediction, and |X| and |Y| are the numbers of elements in the prediction and the label, respectively.

MSRAU-Net is an end-to-end deep learning network. MSRAU-Net based aircraft segmentation is a pixel-by-pixel binary classification problem: each pixel of an RSI is predicted as an aircraft pixel or a background pixel. In RSIs, aircraft pixels account for a small part of the overall image, in some cases less than 5%, so the cross-entropy loss alone is not the best choice for this task. To solve this problem, the Dice coefficient loss and the cross-entropy loss are combined as the loss function, calculated as follows,

$$L_{Loss}=L_{Cross}+L_{Dice}$$
(3)
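A minimal TensorFlow implementation of Eq. (3) is sketched below; the `smooth` term is a common numerical-stability addition and an implementation assumption, not part of Eqs. (1)–(3):

```python
import tensorflow as tf

def combined_loss(y_true, y_pred, smooth=1e-6):
    # L_Loss = L_Cross + L_Dice, following Eqs. (1)-(3)
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = 1.0 - (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return bce + dice
```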

In the experiments, a 5-FCV test is adopted to train all models, the initial training step is carried out automatically by default, and the Adam optimizer is used to optimize all models with the combined Dice and cross-entropy loss as the loss function. This process is repeated until the loss function satisfies the stopping criterion during training. At each iteration, the weights and biases are updated via back-propagation. When training is complete, the model checkpoints save the final parameter values, and these weights and biases are then used by the trained model on the test samples.

4 Experiments and analysis

MSRAU-Net is tested on the public RSI dataset and compared with 4 competitive methods: FCN [8], U-Net [12], Attention U-Net (AU-Net) [19] and MultiResUNet [6].

4.1 Dataset

Because the existing RSI datasets are relatively small, we constructed an RSI dataset of 680 RSIs containing aircraft, with 80 images from NWPU VHR-10 (https://hyper.ai/datasets/5422) and 600 images from UCAS-AOD-2015 (https://hyper.ai/datasets/5419). NWPU VHR-10 has 650 target images and 150 background images, a total of 800 images from 10 target types, of which 80 images contain aircraft. UCAS-AOD-2015 has 1000 images for aircraft and vehicle segmentation, divided into three subsets: CAR, PLANE and NEG, where PLANE contains 600 aircraft images with 3210 aircraft. Each aircraft in the RSIs was annotated independently by relevant professionals using annotation software. Some original aircraft RSIs are shown in Fig. 1, from which it can be seen that the RSIs in the dataset contain different numbers of aircraft, and the aircraft are very small, with various sizes, postures, orientations and illumination, low resolution and cluttered backgrounds. The sizes of the RSIs range from 556×556 to 1264×987 pixels. Each original image is resized to a uniform 512×512 pixels and augmented into 4 new images by rotation (20% rotation range) and horizontal flipping, which improves the training performance and robustness of the model and alleviates overfitting. The augmented dataset thus contains 3400 RSIs in total: 680 original RSIs and 2720 augmented RSIs.
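A sketch of this augmentation with Keras' ImageDataGenerator is shown below; interpreting the "20% rotation range" as rotation_range=20 degrees, and the array names, are assumptions for illustration:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation plus horizontal flipping, applied identically to images and
# masks by sharing the same random seed (train_images / train_masks are
# placeholder arrays of shape (N, 512, 512, C)).
datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
image_gen = datagen.flow(train_images, batch_size=40, seed=1)
mask_gen = datagen.flow(train_masks, batch_size=40, seed=1)
```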

4.2 Experimental setting

It is known that the U-Net based network architectures are similar. For fair comparison, we employ the same strategies for data augmentation, middle-layer supervision, parameter initialization and network training [15, 19, 26]. The end-to-end training strategy is used to train and test MSRAU-Net, with the combined loss function used to update the network weights. All experiments are implemented on Ubuntu 16.04 with an Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40 GHz, an NVIDIA GTX-1080 Ti GPU (12 GB memory), the open-source deep learning library TensorFlow 2.0.0 and Python 3.7. The network training hyper-parameters are set as follows: mini-batch size 40; learning rate 1 × 10⁻³, dropping to 1 × 10⁻⁴ after 2000 iterations; momentum 0.9; weight decay 0.0005; and 3000 iterations in total. After training each network model, test images are input into the trained model for aircraft segmentation.

K-fold cross-validation (K-FCV) is often employed to estimate the effectiveness of a deep learning model on an independent dataset, ensuring a balance between bias and variance. In a K-FCV test, the dataset D is randomly divided into k subsets {D1, D2, …, Dk} of equal or nearly equal size. Training is then carried out k times, each time taking one of the k splits as the test set and the rest as the training set. The best result on the test set over the total number of epochs is recorded in each run, and the results of all k runs are averaged as an overall estimate of the performance of MSRAU-Net. To evaluate the segmentation accuracy of MSRAU-Net, a 5-FCV test is performed.
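In code, this protocol is a short loop; a sketch with scikit-learn's KFold follows, where build_model, images, masks and epochs=50 are placeholders rather than values from the paper:

```python
import numpy as np
from sklearn.model_selection import KFold

# 5-FCV: each fold serves once as the test set, the rest for training.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kf.split(images):
    model = build_model()
    model.fit(images[train_idx], masks[train_idx], batch_size=40, epochs=50)
    fold_scores.append(model.evaluate(images[test_idx], masks[test_idx]))
# The overall estimate is the average over the 5 folds
print(np.mean(fold_scores, axis=0))
```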

To quantify the aircraft segmentation results, precision, recall and F1-score are adopted as evaluation criteria: precision measures the fraction of segmented aircraft pixels that are true positives, recall measures the fraction of ground-truth aircraft pixels that are segmented, and F1-score combines precision and recall into a single measure. They are calculated as follows,

$$Precision=\frac{B_{seg}}{B_{seg}+I_{wseg}},\qquad Recall=\frac{B_{seg}}{B_{seg}+I_{unseg}}$$
(4)
$$F_1\textrm{-}score=2\times\frac{Precision\times Recall}{Precision+Recall}$$
(5)

where Bseg, Iwseg and Iunseg denote true positives, false positives and false negatives, respectively. Specifically, Bseg is the number of pixels correctly segmented as aircraft in the segmentation results, Iunseg is the number of aircraft pixels in the RSI that are not segmented as aircraft, and Iwseg is the number of non-aircraft pixels segmented as aircraft.

In Eqs. (4) and (5), precision, recall and F1-score are computed from the labeled images; F1-score combines precision and recall into a single measure to comprehensively evaluate the quality of the aircraft segmentation model.
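The three criteria reduce to simple pixel counts; a sketch following the definitions of Bseg, Iwseg and Iunseg above:

```python
import numpy as np

def pixel_metrics(pred, truth):
    # pred and truth are binary masks (1 = aircraft pixel)
    b_seg = np.sum((pred == 1) & (truth == 1))    # true positives
    i_wseg = np.sum((pred == 1) & (truth == 0))   # false positives
    i_unseg = np.sum((pred == 0) & (truth == 1))  # false negatives
    precision = b_seg / (b_seg + i_wseg)
    recall = b_seg / (b_seg + i_unseg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```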

4.3 Model training

During the training process, the training RSIs are input into MSRAU-Net, and feature maps are gradually extracted through the convolutional and pooling layers of the contraction network. To examine the learned representations of MSRAU-Net, an RSI is used to visualize the convolutional feature maps of different convolutional layers of the contraction network, as shown in Fig. 9.

Fig. 9. The convolutional feature maps of different convolutional layers of the contraction network

The convolutional feature maps in Fig. 9 indicate that MSRAU-Net captures the fine details of RSIs: the low-level feature maps contain more detailed information about the aircraft, while the high-level feature maps contain the key information. It can be seen from Fig. 9D that the feature map has no obvious sharpened edges and gradually fades, because the attention mechanism leads the model to focus on the aircraft area rather than the edges of the aircraft image. As can be seen from Fig. 9E, the first feature map is close to the output layer, and the aircraft area is more concentrated.

4.4 Results and analysis

To test the contribution of the ResConv block, we use 4 ResConv blocks to replace the 4 convolutional modules of U-Net and the 4 MultiRes blocks of MultiResUNet, respectively, without changing any other part of U-Net or MultiResUNet. Through 5-FCV experiments, the precisions of U-Net, MultiResUNet and MSRAU-Net with ResConv blocks as convolution modules are shown in Table 1. The results in Table 1 indicate that the ResConv block improves the aircraft segmentation ability.

Table 1 The segmentation precisions obtained by replacing the convolution modules with ResConv blocks

The training processes of U-Net, MultiResUNet and MSRAU-Net are similar because they share the same overall architecture of an encoder network and a decoder network. To observe the convergence of MSRAU-Net, extensive experiments are conducted on the augmented RSI dataset, comparing it with U-Net [12] and MultiResUNet [6], which are most closely related to MSRAU-Net. Figure 10 shows the losses of the three models versus the training iterations. As seen in Fig. 10, the loss values of the three models decrease quickly before 1000 iterations and are nearly stable after 2500 iterations. On the whole, MSRAU-Net converges much faster and achieves better results after 2200 iterations; its memory footprint is 2.47 GB, training time 4.35 h and test time 1.64 s. The minimum loss values of MSRAU-Net, U-Net and MultiResUNet are about 0.085, 0.147 and 0.115, respectively, which shows the superior performance of MSRAU-Net over U-Net and MultiResUNet. Figure 10 indicates that MSRAU-Net has strong feature learning ability. From Table 1 and Fig. 10, the structure and parameters of MSRAU-Net can be determined.

Fig. 10. The losses of three models versus training iterations

To assess the overall performance of MSRAU-Net, it is compared with four related models on the augmented dataset: FCN [8], U-Net [12], AU-Net [19] and MultiResUNet [6]. For fair comparison, all trained models are taken at 3000 iterations. Some representative RSIs and the corresponding aircraft detected by the five trained models are shown in Fig. 11, covering multi-scale aircraft, small aircraft, and simple and complex backgrounds. In Fig. 11, the first column shows the original RSIs, the 2nd column the annotated aircraft, and the 3rd to 7th columns the aircraft detected by FCN, U-Net, AU-Net, MultiResUNet and MSRAU-Net, respectively.

Fig. 11. Some representative RSIs and the corresponding segmented aircraft

As can be seen from Fig. 11, FCN can detect aircraft in RSIs where there is obvious contrast between the aircraft and the background, but it misses some small aircraft. U-Net and AU-Net can basically detect the positions of aircraft, but there is a large difference between the detected aircraft and the annotation images. MultiResUNet can accurately segment the location and outline of aircraft, including small ones, but the edges of the aircraft are blurry. MSRAU-Net is obviously the best, and the shapes of its segmented results are most similar to the annotated images. In a comprehensive comparison, MSRAU-Net not only outperforms the other models in separating aircraft from background, but also produces segmented aircraft shapes closest to the labeled images. The reason may be that it introduces attention mechanisms into its convolution module and modified Respaths.

The trained MSRAU-Net is tested and compared with the other methods on the augmented dataset, including FCN, U-Net, AU-Net and MultiResUNet. The segmentation results are listed in Table 2.

Table 2 The segmentation results on four RSI categories by the five models

It can be observed from Fig. 11 and Table 2 that all models except MSRAU-Net mis-detect small aircraft, with FCN and U-Net the most serious; AU-Net is better than U-Net due to its attention mechanism, and MultiResUNet shows a milder mis-detection of small aircraft. MSRAU-Net is superior to the other methods. The main reason is that Respaths, multi-scale convolution and attention mechanisms are introduced into MSRAU-Net: in the feature extraction stage, the weighted features after convolution replace the original features for residual fusion, and the attention mechanisms reduce information loss and speed up network training.

4.5 Ablation experiments

As in U-Net and MultiResUNet, the connection mode between the contraction network and the expansion network in MSRAU-Net is important and affects the segmentation precision. To verify the effectiveness and robustness of MSRAU-Net, the design choice of its connection module and its impact on overall performance are evaluated.

To observe the effect of the connection mode in MSRAU-Net, we construct 10 different connection modes combining copy, Respath and modified Respath. To eliminate the influence of other factors, we change only the connection mode, leaving the other components and operations of MSRAU-Net unchanged. Through 5-FCV experiments, the precisions obtained by applying the different connection modes to the different feature fusion channels of MSRAU-Net are shown in Table 3.

Table 3 The segmentation precisions of the 10 connection modes

From Table 3, it can be seen that the connection mode of MSRAU-Net, namely 2 modified Respaths + 2 copies, achieves the highest precision; the modified Respath helps improve the precision of MSRAU-Net, while using 3 or 4 modified Respaths does not improve the precision but significantly decreases it. The reason may be that there is a large semantic gap between the contraction path feature layers and the expansion path feature layers corresponding to the first and second feature fusion channels in MSRAU-Net, so adding a modified Respath only to the first feature fusion channel has little effect. However, for the third and fourth feature fusion channels, the corresponding contraction and expansion path feature layers are close to each other in the network and the semantic gap is small; using a modified Respath there transmits redundant information and degrades the overall performance of the network, so keeping the copy mode of U-Net is more effective. Based on the above segmentation results and the example segmentation analysis, 2 modified Respaths + 2 copies is selected as the final connection mode.

The results in Figs. 10 and 11 and Tables 1, 2 and 3 confirm that MSRAU-Net is effective and robust for detecting multi-scale small aircraft, which can be attributed to the multi-scale convolutional Inception, the two kinds of attention modules and the Respaths.

5 Conclusion

Because the aircraft in RSIs exhibit wide diversity in size, view and other visual features, aircraft segmentation in RSIs is a challenging topic. U-Net and its variants have increasingly been applied to automatic aircraft segmentation in recent years. In this paper, a multi-scale residual U-Net with attention (MSRAU-Net) is constructed for multi-scale aircraft segmentation in RSIs. In MSRAU-Net, a multi-scale convolutional module, two modified Respaths and two kinds of attention modules are used to extract multi-scale features and make the connection from the contraction path to the expansion path more efficient. Experiments on the RSI dataset validate that MSRAU-Net outperforms the other networks, in particular for detecting small aircraft; the method is tractable to construct and implement and has high application value. Since superpixel-based clustering is significantly faster and more robust than state-of-the-art clustering algorithms for color image detection and segmentation, in the future we will try to obtain superpixel images with accurate contours as the input data of MSRAU-Net, to further improve multi-scale small aircraft segmentation.