1 Introduction

Automatic aircraft detection and segmentation in remote sensing images (RSIs) is of great significance in military and civilian applications and has attracted increasing attention in recent years [3, 5, 22]. With the improvement of RSIs, computing power, big data and image processing, many aircraft detection and segmentation methods have been presented, which can be divided into traditional methods based on feature extraction and template matching [7, 16, 24] and deep learning based methods [2, 18]. Because the aircraft in RSIs are often small, with diverse sizes, arbitrary orientations, illumination changes, various scenes, complex backgrounds and a large amount of interference, as shown in Fig. 1, the traditional aircraft detection and segmentation methods still suffer from low recognition accuracy, high time consumption and poor generalization [14]. From Fig. 1, it is clear that RSIs have a low signal-to-noise ratio, irregular obstacles and complex backgrounds, and the shapes of aircraft are deformed, irregular or asymmetric, with different sizes. It is therefore time-consuming to design shape templates for every aircraft shape, it is difficult to extract rotation-invariant and scale-invariant features from RSIs for aircraft detection and segmentation, and general automatic aircraft detection methods often fail to detect small, low-resolution aircraft in RSIs [25]. Yang et al. [21] presented a multiple knowledge representation (MKR) framework and discussed its potential for developing big data artificial intelligence (AI) techniques with possible broader impacts across different AI areas. MKR is an advanced AI representation framework for intelligent multi-scale feature aggregation and multi-scale image segmentation.

Fig. 1. RSI examples with different scales, orientations and environments

In recent years, convolutional neural networks (CNNs) and their variants have achieved remarkable results in various RSI segmentation, detection and recognition tasks, including aircraft detection in RSIs [4, 9]. Zhang et al. [23] proposed an aircraft detection framework based on CNNs to detect multi-scale targets in extremely large and complicated scenes; they designed a constrained EdgeBoxes approach to generate a modest number of target candidates quickly and precisely, and constructed a modified GoogLeNet combined with Fast Region-based CNN (R-CNN) to extract useful features from RSIs for multi-scale aircraft detection. Zhong et al. [27] proposed an airplane detection method for RSIs based on deep learning and transfer learning, adopting a single deep CNN and limited training samples to implement end-to-end trainable airplane detection and ensure an optimal solution for the final stage. Yang et al. [20] developed a method for aircraft detection in RSIs based on a deep residual network (ResNet) and super-vector coding. They designed a variant of ResNet with fewer layers to increase the resolution of the feature map, integrated the multi-level convolutional features into an informative feature description for region proposal, and extracted the histogram of oriented gradients (HOG) with super-vector coding from each region of interest (ROI) to assist the convolutional features in object classification. Wang et al. [13] proposed a compact multi-scale dense CNN (MS-DenseNet) for aircraft detection in RSIs, combined a feature pyramid network (FPN) with DenseNet to learn multi-scale features, and designed three compact architectures for detecting small aircraft: MS-DenseNet-41, MS-DenseNet-65 and MS-DenseNet-77. Comparative experiments showed that the compact MS-DenseNet-65 is very effective, achieving state-of-the-art performance with a recall of 94%, an F1-score of 92.7% and less computational time. Wu et al. [17] constructed the WFA-1400 dataset and proposed an improved Mask R-CNN model to enhance detection in high-resolution RSIs containing dense targets and complex backgrounds; the model uses a modified Mask R-CNN based on the ResNet101 backbone to obtain more discriminative features and adds a set of dilated convolutions of a specific size to improve instance segmentation. Pan et al. [10] constructed a cascade CNN (CCNN) framework based on transfer learning and geometric feature constraints (GFC) for aircraft detection in RSIs, achieving accurate and efficient detection with relatively few samples. CCNN consists of an image classifier and an object detector: transfer learning is used to fine-tune pre-trained models with few samples, a GFC region proposal filtering method improves detection efficiency, and aircraft detection is completed by the CCNN.

Fully convolutional networks (FCN), SegNet and U-Net are three common and well-known backbone networks in complex image segmentation, tracking, detection and recognition [8, 12]. In general, the pooling operation loses the high-frequency components of the image, blurring image blocks and discarding positional information. U-Net uses skip connections to link the low-level and high-level feature maps, capturing both coarse and fine information at the deconvolutional layers. Because of its learnable upsampling, U-Net has many more parameters to learn and is comparatively slower to train than SegNet, while SegNet does not capture multi-scale information as effectively as U-Net. Owing to the excellent performance of U-Net, many image segmentation and detection methods are based on modified U-Nets.

Attention mechanisms have been widely added to deep learning models to filter out unimportant features. Wang et al. [15] designed a symbiotic attention with object-centric feature alignment framework (SAOA) and proposed a symbiotic attention mechanism to encourage mutual interaction between the two branches and select the most action-relevant candidates for classification. Xu et al. [19] introduced an attention mechanism into U-Net to effectively improve the prediction performance of the network; the attention module takes the upper-level feature maps of the upsampling path and the feature maps from the downsampling process as input. Zhao et al. [26] introduced spatial attention into U-Net and proposed a lightweight network model named Spatial Attention U-Net (SA-UNet) for gland segmentation; by introducing a small number of additional parameters, spatial attention enhances important features and suppresses unimportant ones, improving the representation ability of SA-UNet. Zhuang et al. [28] designed an improved U-Net, RDAU-NET, for ultrasound segmentation of breast tumors, in which the contraction path increases the receptive field of the network and an Attention Gate module replaces the original cropping and copying operation; although this network outperforms U-Net, it has shortcomings in multi-scale feature extraction. Ibtehaz et al. [6] demonstrated that the classical U-Net architecture is deficient in some aspects, constructed a modified U-Net model named MultiResUNet by introducing MultiRes blocks and Respaths into U-Net, and tested and compared it with the classical U-Net on a vast repertoire of multimodal medical images.

U-Net and its variants have increasingly been applied to aircraft detection and segmentation in recent years, but the features of small, low-resolution aircraft may be submerged by redundant top-level features, resulting in poor detection. To address these problems, exploiting the advantages of residual networks, U-Net, MultiResUNet, attention mechanisms and multi-scale convolution, a multi-scale residual U-Net with attention (MSRAU-Net) is constructed for automatic aircraft segmentation in RSIs. The main contributions of this paper are as follows:

  • Inception and attention modules are introduced into MSRAU-Net to extract multi-scale features, improving the standard U-Net architecture.

  • A Respath with an attention module is introduced into MSRAU-Net to effectively connect the features of the low-level and high-level layers.

  • Extensive experiments are conducted on the RSI dataset to evaluate MSRAU-Net.

The rest of this paper is arranged as follows. Section 2 introduces the related works, including U-Net, MultiResUNet, dilated convolution, residual networks and Inception. MSRAU-Net is described in detail in Section 3. Experiments and results are presented in Section 4. Lastly, conclusions are drawn and future work is discussed in Section 5.

2 Related works

2.1 U-Net

U-Net is a symmetric U-shaped network, as shown in Fig. 2, consisting of a contraction network on the left side, an expansion network on the right side, and skip connections in between.

Fig. 2. U-Net architecture

In U-Net, the downsampling encoder of the contraction network extracts spatial features from the image, the upsampling decoder of the expansion network constructs the segmentation map from the encoded features, and 4 skip connections concatenate the low-level and high-level feature maps across the encoding and decoding process.
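To make the data flow concrete, a minimal U-Net skeleton in TensorFlow/Keras (the framework used in Section 4) might look as follows; the depth and filter counts are illustrative assumptions, not the exact configuration of Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions, as in the standard U-Net encoder/decoder blocks
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def tiny_unet(input_shape=(512, 512, 3)):
    inputs = tf.keras.Input(shape=input_shape)
    # Contraction (encoder): convolutions + max-pooling, channels doubled
    c1 = conv_block(inputs, 64)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, 128)
    p2 = layers.MaxPooling2D(2)(c2)
    # Bottleneck
    b = conv_block(p2, 256)
    # Expansion (decoder): upsampling + skip-connection concatenation
    u2 = layers.Conv2DTranspose(128, 2, strides=2, padding="same")(b)
    c3 = conv_block(layers.concatenate([u2, c2]), 128)
    u1 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.concatenate([u1, c1]), 64)
    # 1x1 convolution + sigmoid for a binary segmentation map
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return tf.keras.Model(inputs, outputs)
```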

2.2 MultiResUNet

MultiResUNet was presented by Ibtehaz et al. [6] to enlarge the multi-resolution ability of U-Net. Its architecture is shown in Fig. 3.

Fig. 3. MultiResUNet architecture

Its encoders and decoders are replaced by MultiRes blocks, and the ordinary skip connections of U-Net are replaced by Respaths, in which convolution operations are applied while propagating the feature maps from the encoder stage to the decoder stage. All convolutional layers except the output layer are activated by ReLU; as in U-Net, the output layer is activated by a sigmoid function.
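As a rough sketch of the Respath idea (assuming the residual formulation of [6]; the filter count and path length are illustrative), the skip connection becomes a chain of residual units:

```python
from tensorflow.keras import layers

def respath(x, filters, length=4):
    # Chain of residual units along the skip connection: each unit adds a
    # 3x3 convolution to a 1x1 shortcut projection, so the encoder features
    # are processed further before being concatenated with the decoder.
    for _ in range(length):
        shortcut = layers.Conv2D(filters, 1, padding="same")(x)
        y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Activation("relu")(layers.Add()([shortcut, y]))
    return x
```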

2.3 Dilated convolution

Dilated convolution can enlarge the receptive field without increasing the network parameters or computation, so as to obtain multi-scale local features of the image while retaining the spatial location information of most pixels [11]. Three dilated convolution kernels of size 3 × 3 are shown in Fig. 4, where the 1-dilated convolution is identical to the ordinary convolution operation.

Fig. 4. Three dilated convolution kernel structures with three dilation rate parameters

From Fig. 4, it can be seen that stacked dilated convolutions can exponentially expand the receptive field, and the dilated convolution operator can apply the same kernel at different ranges using different dilation rates.
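For example, in TensorFlow/Keras the same 3 × 3 kernel can be applied at the three dilation rates of Fig. 4; stacking them grows the receptive field from 3 × 3 to 15 × 15 while each layer keeps the parameter count of an ordinary 3 × 3 convolution (the filter count of 32 is an arbitrary choice for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(512, 512, 1))
# A 3x3 kernel with dilation r covers a (2r+1) x (2r+1) window.
y = layers.Conv2D(32, 3, dilation_rate=1, padding="same", activation="relu")(x)  # 3x3 field
y = layers.Conv2D(32, 3, dilation_rate=2, padding="same", activation="relu")(y)  # 7x7 field
y = layers.Conv2D(32, 3, dilation_rate=4, padding="same", activation="relu")(y)  # 15x15 field
model = tf.keras.Model(x, y)
model.summary()  # parameter count matches three ordinary 3x3 convolutions
```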

2.4 Residual module

Increasing the number of network layers allows richer abstract features to be extracted at different levels: the deeper the network, the more abstract the features and the more semantic information is available. However, simply increasing the depth of the network easily leads to over-fitting, gradient vanishing and gradient explosion. A residual network (ResNet) enables a deep network to extract features containing more image information; these problems can be avoided by its shortcut connection mechanism, and network convergence is accelerated. The residual module structure is shown in Fig. 5, where x is the input, F(x) is the output of the residual module before the second-layer activation function, ‘⊕’ denotes addition, F(x) = W2σ(W1x), W1 and W2 are the weights of the first and second layers, and σ is the ReLU activation function. The final output is σ(F(x) + x); the shortcut path between the input and the output learns the residual of multi-level resolution features. A ResNet is constructed from several such residual modules.

Fig. 5. Residual module
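A direct transcription of Fig. 5 into TensorFlow/Keras could look as follows; this is a sketch that assumes the input already has `filters` channels so that the addition is well defined (otherwise a 1 × 1 projection on the shortcut is needed):

```python
from tensorflow.keras import layers

def residual_module(x, filters):
    # F(x) = W2 sigma(W1 x): two convolutions with ReLU in between
    fx = layers.Conv2D(filters, 3, padding="same")(x)
    fx = layers.Activation("relu")(fx)
    fx = layers.Conv2D(filters, 3, padding="same")(fx)
    # Shortcut connection and final activation: sigma(F(x) + x)
    out = layers.Add()([fx, x])
    return layers.Activation("relu")(out)
```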

2.5 Inception

Inception is a multi-scale convolution module in GoogLeNet consisting of four parallel network branches: 1 × 1, 3 × 3 and 5 × 5 convolution layers and a 3 × 3 max-pooling layer [1]. It uses receptive fields of different sizes to perform multiple convolution operations on the input in parallel and concatenates all feature maps from the different receptive fields. Its structure is shown in Fig. 6. In Inception, several 1 × 1 convolutions are used to reduce the number of parameters and improve the network's adaptability to multi-scale targets. The parallel convolution with three kinds of kernels (1 × 1, 3 × 3 and 5 × 5) effectively handles image-scale changes, and together the kernels form a local sparse structure, which greatly improves the computational efficiency of the model and thus accelerates convergence.

Fig. 6. Inception structure
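A sketch of such a module in TensorFlow/Keras is given below; the branch widths are illustrative assumptions rather than the GoogLeNet values:

```python
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fp):
    # Four parallel branches over the same input, concatenated on channels.
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # 1x1 convolutions before the larger kernels reduce the parameter count
    b3 = layers.Conv2D(f3 // 2, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5 // 2, 1, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.concatenate([b1, b3, b5, bp])
```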

3 Multi-scale residual U-Net with attention (MSRAU-Net)

The performance of U-Net is usually improved through three parts: the encoder, the decoder and the connection path. For aircraft segmentation in RSIs, where aircraft exhibit very wide diversity in size, pose, view, background and other visual characteristics, a multi-scale residual U-Net with attention (MSRAU-Net) is constructed to learn multi-scale deep features for detecting aircraft at various scales. Its architecture is shown in Fig. 7.

Fig. 7. MSRAU-Net architecture

The structure of MSRAU-Net is divided into a contraction network on the left and an expansion network on the right. The contraction network performs a series of downsampling operations of convolution and max-pooling, including 4 ResConv blocks. Each ResConv block is followed by a batch normalization layer and a rectified linear unit (ReLU), max-pooling downsampling uses a stride of 2, and the number of feature channels is doubled at each downsampling step. The expansion network consists of 4 blocks; each block doubles the size of the feature map and halves the number of channels by deconvolution, and then concatenates it with the feature map of the symmetric contraction network on the left. Two skip connections and two modified Respaths allow MSRAU-Net to retrieve the spatial information lost by the pooling operations. Since the dimensions of the feature maps of the contraction and expansion networks differ, the contraction network feature maps are normalized to the same size as those of the expansion network before feature fusion. In MSRAU-Net, an attention gate and a spatial attention module are introduced to improve the feature extraction ability, and dropout is applied after each convolutional layer to avoid overfitting and speed up training. At the final layer, a 1 × 1 convolution with Sigmoid activation produces the segmentation map.
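Of the attention components, the channel-weighting attention used in the modified Respaths (described in item (2) below: global average pooling, 1 × 1 convolution and Sigmoid) is the simplest to sketch. The following TensorFlow/Keras fragment is a minimal illustration under those assumptions; the function name and shape handling are not from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_weight_attention(x):
    # Global average pooling -> 1x1 convolution -> Sigmoid produces one
    # weight per channel, which rescales the incoming feature maps.
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)   # shape (batch, C)
    w = layers.Reshape((1, 1, c))(w)         # shape (batch, 1, 1, C)
    w = layers.Conv2D(c, 1, activation="sigmoid")(w)
    return layers.Multiply()([x, w])         # broadcast channel weights
```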

The main components of MSRAU-Net are described in detail as follows:

(1) ResConv block. In aircraft segmentation in RSIs, because aircraft shape, size and orientation vary, it is difficult to distinguish the aircraft area from the irregular background. A serial connection may extract the multi-scale features of multi-scale aircraft insufficiently, while a parallel connection may limit the network's ability to distinguish aircraft from background and consume considerable computing resources. To address both problems, a ResConv block is constructed from the multi-scale Inception and dilated convolution, based on the MultiRes block [6] and Inception-ResNet [1], as shown in Fig. 8A. With the convolution kernel size fixed, different numbers of convolution layers yield different receptive fields, and the output of each convolution layer in a path provides multi-scale fused features. Therefore, the ResConv block concatenates the outputs of the convolutional layers on each path, and adds and fuses the outputs of the ordinary convolutional path and the dilated convolutional path. In this way, the network can not only better distinguish background from target but also extract multi-scale features.

(2) Modified Respath. Different from U-Net and MultiResUNet, two modified Respaths are designed in MSRAU-Net to replace two of the skip connections of U-Net, connecting the contraction network and the expansion network and reducing the semantic gap between their feature layers. Each modified Respath is composed of 4 residual convolution blocks and an attention module, as shown in Fig. 8B, where the attention module consists of global average pooling, 1 × 1 convolution and Sigmoid activation, yielding a channel-weighted feature.

Fig. 8. The main components of MSRAU-Net

(3) Transpose convolution. The decoder consists of decoder blocks corresponding to the encoder modules. First, the feature maps of the encoder module are weighted by the attention module and stacked with the upsampled feature maps. Then, each decoder block applies a 1 × 1 convolution to halve the number of channels, performs batch normalization, and uses transpose convolution to upsample the feature maps. Images with only aircraft and non-aircraft labels are generated by the final convolution layer. The transpose convolution is designed as shown in Fig. 8C.

(4) Spatial attention. Spatial attention uses the spatial relationships between the encoder feature maps to produce a spatial attention map that retains the key information, and then transfers it to the decoder feature maps. Its structure is shown in Fig. 8D. The encoder feature maps are weighted by spatial attention to strengthen the important features of the aircraft region, weaken useless features, enhance the efficiency of feature utilization, and improve the segmentation of small aircraft.

(5) Loss function. The most common loss function in deep learning is the cross-entropy loss. For binary classification, the cross-entropy loss is calculated as,

$$L_{Cross}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log y_i^{\prime}+\left(1-y_i\right)\log\left(1-y_i^{\prime}\right)\right]$$
(1)

where yi and yi′ are the real pixel label value and the predicted pixel value, respectively, and N is the number of pixels.

The Dice coefficient is usually used as a loss function to measure the similarity of two samples, and it is more suitable when the samples are extremely imbalanced. The Dice coefficient loss function is calculated as

$$L_{Dice}=1-\frac{2\left|X\cap Y\right|}{\left|X\right|+\left|Y\right|}$$
(2)

where X and Y represent the generated prediction map and the real label, respectively, |X ∩ Y| is the intersection between the label and the prediction, and |X| and |Y| are the numbers of elements in the prediction and the label, respectively.

MSRAU-Net is an end-to-end deep learning network. MSRAU-Net based aircraft segmentation is a pixel-by-pixel binary classification problem: each pixel of an RSI is predicted as an aircraft pixel or a background pixel. In RSIs, aircraft pixels account for a small part of the overall image, in some cases less than 5%, so the cross-entropy loss alone is not the best choice for this task. To solve this problem, the Dice coefficient loss and the cross-entropy loss are combined as the loss function, calculated as follows,

$$L_{Loss}=L_{Cross}+L_{Dice}$$
(3)
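A minimal TensorFlow implementation of Eq. (3) is sketched below; the `smooth` term is a common numerical-stability addition and an implementation assumption, not part of Eqs. (1)–(3):

```python
import tensorflow as tf

def combined_loss(y_true, y_pred, smooth=1e-6):
    # L_Loss = L_Cross + L_Dice, following Eqs. (1)-(3)
    bce = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = 1.0 - (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)
    return bce + dice
```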

In the experiments, a 5-FCV test is adopted to train all models, the initial training step is carried out automatically by default, and the Adam optimizer is used to optimize all models with the combined Dice and cross-entropy loss as the loss function. This process is repeated until the loss function satisfies the stopping criterion during training. At each iteration, the weights and biases are updated via back-propagation. When training is complete, the model checkpoints save the final parameter values, and these weights and biases are then used by the trained model on the test samples.

4 Experiments and analysis

MSRAU-Net is tested on the public RSI dataset and compared with 4 competitive methods: FCN [8], U-Net [12], Attention U-Net (AU-Net) [19] and MultiResUNet [6].

4.1 Dataset

Because the existing RSI datasets are relatively small, we constructed an RSI dataset of 680 RSIs containing aircraft, with 80 images from NWPU VHR-10 (https://hyper.ai/datasets/5422) and 600 images from UCAS-AOD-2015 (https://hyper.ai/datasets/5419). NWPU VHR-10 has 650 target images and 150 background images, a total of 800 images from 10 target types, of which 80 images contain aircraft. UCAS-AOD-2015 has 1000 images for aircraft and vehicle segmentation, divided into three subsets: CAR, PLANE and NEG, where PLANE contains 600 aircraft images with 3210 aircraft. Each aircraft in the RSIs was annotated independently by relevant professionals using annotation software. Some original aircraft RSIs are shown in Fig. 1, from which it can be seen that the RSIs in the dataset contain different numbers of aircraft, and the aircraft are very small, with various sizes, postures, orientations and illumination, low resolution and cluttered backgrounds. The sizes of the RSIs range from 556×556 to 1264×987 pixels. Each original image is resized to a uniform 512×512 pixels and augmented into 4 new images by rotation (20% rotation range) and horizontal flipping, which improves the training performance and robustness of the model and alleviates overfitting. The augmented dataset thus contains 3400 RSIs in total: 680 original RSIs and 2720 augmented RSIs.
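A sketch of this augmentation with Keras' ImageDataGenerator is shown below; interpreting the "20% rotation range" as rotation_range=20 degrees, and the array names, are assumptions for illustration:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation plus horizontal flipping, applied identically to images and
# masks by sharing the same random seed (train_images / train_masks are
# placeholder arrays of shape (N, 512, 512, C)).
datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
image_gen = datagen.flow(train_images, batch_size=40, seed=1)
mask_gen = datagen.flow(train_masks, batch_size=40, seed=1)
```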

4.2 Experimental setting

It is known that the U-Net based network architectures are similar. For fair comparison, we employ the same strategies for data augmentation, middle-layer supervision, parameter initialization and network training [15, 19, 26]. The end-to-end training strategy is used to train and test MSRAU-Net, with the combined loss function used to update the network weights. All experiments are implemented on Ubuntu 16.04 with an Intel(R) Xeon(R) CPU E3-1230 v5 @ 3.40 GHz, an NVIDIA GTX-1080 Ti GPU (12 GB memory), the open-source deep learning library TensorFlow 2.0.0 and Python 3.7. The network training hyper-parameters are set as follows: mini-batch size 40; learning rate 1 × 10⁻³, dropping to 1 × 10⁻⁴ after 2000 iterations; momentum 0.9; weight decay 0.0005; and 3000 iterations in total. After training each network model, test images are input into the trained model for aircraft segmentation.

K-fold cross-validation (K-FCV) is often employed to estimate the effectiveness of a deep learning model on an independent dataset, ensuring a balance between bias and variance. In a K-FCV test, the dataset D is randomly divided into k subsets {D1, D2, …, Dk} of equal or nearly equal size. Training is then carried out k times, each time taking one of the k splits as the test set and the rest as the training set. The best result on the test set over the total number of epochs is recorded in each run, and the results of all k runs are averaged as an overall estimate of the performance of MSRAU-Net. To evaluate the segmentation accuracy of MSRAU-Net, a 5-FCV test is performed.
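In code, this protocol is a short loop; a sketch with scikit-learn's KFold follows, where build_model, images, masks and epochs=50 are placeholders rather than values from the paper:

```python
import numpy as np
from sklearn.model_selection import KFold

# 5-FCV: each fold serves once as the test set, the rest for training.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = []
for train_idx, test_idx in kf.split(images):
    model = build_model()
    model.fit(images[train_idx], masks[train_idx], batch_size=40, epochs=50)
    fold_scores.append(model.evaluate(images[test_idx], masks[test_idx]))
# The overall estimate is the average over the 5 folds
print(np.mean(fold_scores, axis=0))
```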

To quantify the aircraft segmentation results, precision, recall and F1-score are adopted as evaluation criteria: precision measures the fraction of segmented aircraft pixels that are true positives, recall measures the fraction of ground-truth aircraft pixels that are segmented, and F1-score combines precision and recall into a single measure. They are calculated as follows,

$$Precision=\frac{B_{seg}}{B_{seg}+I_{wseg}},\qquad Recall=\frac{B_{seg}}{B_{seg}+I_{unseg}}$$
(4)
$$F_1\textrm{-}score=2\times\frac{Precision\times Recall}{Precision+Recall}$$
(5)

where Bseg, Iwseg and Iunseg denote true positives, false positives and false negatives, respectively. Specifically, Bseg is the number of pixels correctly segmented as aircraft in the segmentation results, Iunseg is the number of aircraft pixels in the RSI that are not segmented as aircraft, and Iwseg is the number of non-aircraft pixels segmented as aircraft.

In Eqs. (4) and (5), precision, recall and F1-score are computed from the labeled images; F1-score combines precision and recall into a single measure to comprehensively evaluate the quality of the aircraft segmentation model.
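The three criteria reduce to simple pixel counts; a sketch following the definitions of Bseg, Iwseg and Iunseg above:

```python
import numpy as np

def pixel_metrics(pred, truth):
    # pred and truth are binary masks (1 = aircraft pixel)
    b_seg = np.sum((pred == 1) & (truth == 1))    # true positives
    i_wseg = np.sum((pred == 1) & (truth == 0))   # false positives
    i_unseg = np.sum((pred == 0) & (truth == 1))  # false negatives
    precision = b_seg / (b_seg + i_wseg)
    recall = b_seg / (b_seg + i_unseg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```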

4.3 Model training

During the training process, the training RSIs are input into MSRAU-Net, and feature maps are gradually extracted through the convolutional and pooling layers of the contraction network. To examine the learned representations of MSRAU-Net, an RSI is used to visualize the convolutional feature maps of different convolutional layers of the contraction network, as shown in Fig. 9.

Fig. 9. The convolutional feature maps of different convolutional layers of the contraction network

The convolutional feature maps in Fig. 9 indicate that MSRAU-Net captures the fine details of RSIs: the low-level feature maps contain more detailed information about the aircraft, while the high-level feature maps contain the key information. It can be seen from Fig. 9D that the feature map has no obvious sharpened edges and gradually fades, because the attention mechanism leads the model to focus on the aircraft area rather than the edges of the aircraft image. As can be seen from Fig. 9E, the first feature map is close to the output layer, and the aircraft area is more concentrated.

4.4 Results and analysis

To test the contribution of the ResConv block, we use 4 ResConv blocks to replace the 4 convolutional modules of U-Net and the 4 MultiRes blocks of MultiResUNet, respectively, without changing any other part of U-Net or MultiResUNet. Through 5-FCV experiments, the precisions of U-Net, MultiResUNet and MSRAU-Net with ResConv blocks as convolution modules are shown in Table 1. The results in Table 1 indicate that the ResConv block improves the aircraft segmentation ability.

Table 1 The segmentation precisions obtained by replacing the convolution modules with ResConv blocks

The training processes of U-Net, MultiResUNet and MSRAU-Net are similar because they share the same overall architecture of an encoder network and a decoder network. To observe the convergence of MSRAU-Net, extensive experiments are conducted on the augmented RSI dataset, comparing it with U-Net [12] and MultiResUNet [6], which are most closely related to MSRAU-Net. Figure 10 shows the losses of the three models versus the training iterations. As seen in Fig. 10, the loss values of the three models decrease quickly before 1000 iterations and are nearly stable after 2500 iterations. On the whole, MSRAU-Net converges much faster and achieves better results after 2200 iterations; its memory footprint is 2.47 GB, training time 4.35 h and test time 1.64 s. The minimum loss values of MSRAU-Net, U-Net and MultiResUNet are about 0.085, 0.147 and 0.115, respectively, which shows the superior performance of MSRAU-Net over U-Net and MultiResUNet. Figure 10 indicates that MSRAU-Net has strong feature learning ability. From Table 1 and Fig. 10, the structure and parameters of MSRAU-Net can be determined.

Fig. 10. The losses of three models versus training iterations

To assess the overall performance of MSRAU-Net, it is compared with four related models on the augmented dataset: FCN [8], U-Net [12], AU-Net [19] and MultiResUNet [6]. For fair comparison, all trained models are taken at 3000 iterations. Some representative RSIs and the corresponding aircraft detected by the five trained models are shown in Fig. 11, covering multi-scale aircraft, small aircraft, and simple and complex backgrounds. In Fig. 11, the first column shows the original RSIs, the 2nd column the annotated aircraft, and the 3rd to 7th columns the aircraft detected by FCN, U-Net, AU-Net, MultiResUNet and MSRAU-Net, respectively.

Fig. 11. Some representative RSIs and the corresponding segmented aircraft

As can be seen from Fig. 11, FCN can detect aircraft in RSIs where there is obvious contrast between the aircraft and the background, but it misses some small aircraft. U-Net and AU-Net can basically detect the positions of aircraft, but there is a large difference between the detected aircraft and the annotation images. MultiResUNet can accurately segment the location and outline of aircraft, including small ones, but the edges of the aircraft are blurry. MSRAU-Net is obviously the best, and the shapes of its segmented results are most similar to the annotated images. In a comprehensive comparison, MSRAU-Net not only outperforms the other models in separating aircraft from background, but also produces segmented aircraft shapes closest to the labeled images. The reason may be that it introduces attention mechanisms into its convolution module and modified Respaths.

The trained MSRAU-Net is tested and compared with the other methods on the augmented dataset, including FCN, U-Net, AU-Net and MultiResUNet. The segmentation results are listed in Table 2.

Table 2 The segmentation results on four RSI categories by the five models

It can be observed from Fig. 11 and Table 2 that all models except MSRAU-Net mis-detect small aircraft, with FCN and U-Net the most serious; AU-Net is better than U-Net due to its attention mechanism, and MultiResUNet shows a milder mis-detection of small aircraft. MSRAU-Net is superior to the other methods. The main reason is that Respaths, multi-scale convolution and attention mechanisms are introduced into MSRAU-Net: in the feature extraction stage, the weighted features after convolution replace the original features for residual fusion, and the attention mechanisms reduce information loss and speed up network training.

4.5 Ablation experiments

As in U-Net and MultiResUNet, the connection mode between the contraction network and the expansion network in MSRAU-Net is important and affects the segmentation precision. To verify the effectiveness and robustness of MSRAU-Net, the design choice of its connection module and its impact on overall performance are evaluated.

To observe the effect of the connection mode in MSRAU-Net, we construct 10 different connection modes combining copy, Respath and modified Respath. To eliminate the influence of other factors, we change only the connection mode, leaving the other components and operations of MSRAU-Net unchanged. Through 5-FCV experiments, the precisions obtained by applying the different connection modes to the different feature fusion channels of MSRAU-Net are shown in Table 3.

Table 3 The segmentation precisions of the 10 connection modes

From Table 3, it can be seen that the connection mode of MSRAU-Net, namely 2 modified Respaths + 2 copies, achieves the highest precision; the modified Respath helps improve the precision of MSRAU-Net, while using 3 or 4 modified Respaths does not improve the precision but significantly decreases it. The reason may be that there is a large semantic gap between the contraction path feature layers and the expansion path feature layers corresponding to the first and second feature fusion channels in MSRAU-Net, so adding a modified Respath only to the first feature fusion channel has little effect. However, for the third and fourth feature fusion channels, the corresponding contraction and expansion path feature layers are close to each other in the network and the semantic gap is small; using a modified Respath there transmits redundant information and degrades the overall performance of the network, so keeping the copy mode of U-Net is more effective. Based on the above segmentation results and the example segmentation analysis, 2 modified Respaths + 2 copies is selected as the final connection mode.

The results in Figs. 10 and 11 and Tables 1, 2 and 3 confirm that MSRAU-Net is effective and robust for detecting multi-scale small aircraft, which can be attributed to the multi-scale convolutional Inception, the two kinds of attention modules and the Respaths.

5 Conclusion

Because the aircraft in RSIs exhibit wide diversity in size, view and other visual features, aircraft segmentation in RSIs is a challenging topic. U-Net and its variants have increasingly been applied to automatic aircraft segmentation in recent years. In this paper, a multi-scale residual U-Net with attention (MSRAU-Net) is constructed for multi-scale aircraft segmentation in RSIs. In MSRAU-Net, a multi-scale convolutional module, two modified Respaths and two kinds of attention modules are used to extract multi-scale features and make the connection from the contraction path to the expansion path more efficient. Experiments on the RSI dataset validate that MSRAU-Net outperforms the other networks, in particular for detecting small aircraft; the method is tractable to construct and implement and has high application value. Since superpixel-based clustering is significantly faster and more robust than state-of-the-art clustering algorithms for color image detection and segmentation, in the future we will try to obtain superpixel images with accurate contours as the input data of MSRAU-Net, to further improve multi-scale small aircraft segmentation.