1 Introduction

The changes in global sea ice volume, distribution, and movement reflect the interaction of the atmosphere, cryosphere, and hydrosphere as well as global climate change [30]. Sea ice research is also significant because sea ice poses safety concerns for marine navigation and transportation. Since the classification of sea ice and open water provides valuable information for safe navigation, sea ice classification and monitoring have drawn extensive attention [8, 37, 39]. Satellite remote sensing with instruments such as optical cameras, microwave radiometers, and synthetic aperture radar (SAR) has been the most effective way to monitor sea ice in the polar regions [21, 40]. SAR images have been the primary source for sea ice classification and monitoring due to their high spatial resolution, wide coverage, and ability to penetrate clouds [7].

A series of studies has been devoted to classifying sea ice and open water in SAR images, including threshold-based methods, expert systems, and machine learning methods. The Multi-Year Ice (MYI) Mapping System (MIMS) is a typical threshold-based model that can quickly map MYI in uncalibrated SAR images [13]. A representative expert system is the Advanced Reasoning using Knowledge for Typing Of Sea ice (ARKTOS) [38], which performs a fully automated analysis of SAR sea ice images by mimicking the reasoning process of sea ice experts. Among machine learning methods, regression models were an early exploration. Lundhaug and Maria [29] proposed a multivariate regression method to model the relationship between the mean and standard deviation of the backscattering coefficients and air temperatures on the one hand and sea ice types and water on the other. Their experiments showed that the correlation coefficients between predicted and actual values were higher than 0.90. Karvonen [20] developed a modified pulse-coupled neural network (PCNN) to classify sea ice in the Baltic Sea. Zhang et al. [47] proposed a k-means-based model that combines microwave scatterometer and radiometer data to classify sea ice types. Zakhvatkina [45] extracted textural features from the gray-level co-occurrence matrix (GLCM) and fed them into an artificial neural network (ANN)-based model to classify sea ice and open water. Similarly, researchers combined the GLCM with other machine learning algorithms, such as the Markov random field (MRF) [6] and the support vector machine (SVM) [25], to classify sea ice in SAR images.

Overall, the main drawback of the aforementioned traditional methods is that they need prior expert knowledge and sophisticated manual feature engineering to discriminate between sea ice and open water. This drawback is a common challenge faced by Earth system science in the era of big data [33].

Deep learning (DL) technology addresses this challenge [19]. A typical DL model consists of deep neural networks (DNNs), which accept input data in a raw format and automatically discover the required features [24]. In recent years, DL has been successfully applied in oceanography, geography, and remote sensing, which has helped to improve the process understanding of Earth system science problems [27, 32, 33, 34, 43]. A deep convolutional neural network (CNN) is a particular type of DNN composed of CNN layers. A CNN layer connects to local patches of the previous layer through convolution kernels to extract local spatial features [22]. Since CNN-based methods have achieved great success in image classification, researchers have employed CNNs to extract features automatically and thereby improve the accuracy and efficiency of sea ice classification. Yan and Scott [44] applied an early CNN-based model, AlexNet [2], together with transfer learning to classify sea ice and open water. Li et al. [26] proposed a CNN-based model to classify sea ice and open water in Chinese Gaofen-3 SAR images. Wang et al. [42] constructed a CNN model consisting of three CNN layers and two fully connected neural network layers to classify sea ice near the Bering Strait. The work in [16] integrated transfer learning and dense CNN blocks to form a transferred multilevel fusion network (MLFN). The MLFN outperformed the PCAKM [5], the NBRELM [15], and the GaborPCANet [12] in classifying sea ice and open water.

More and more researchers are trying to construct DL-based models to achieve end-to-end classification of sea ice and open water. Although the aforementioned DL-based models deliver excellent performance, several issues remain. First, classification accuracy needs to be further improved. Especially for medium-high resolution SAR images, fine-grained objects such as small floes, sinuous ice-water boundaries, and ice channels need to be well classified. Second, the information in SAR images, such as dual-polarization information and the incident angle (IA), is not fully utilized by most DL-based models. The benefit of fusing dual-polarized information has been demonstrated for a conventional method [25], and the IA affects the radar backscattering intensity. All this information should be considered to improve classification accuracy. Third, most of the existing models are validated on independent images, and their applicability to more challenging tasks, such as classifying a series of images from freezing to melting, remains to be verified.

To address the issues mentioned above, we propose a dual-attention U-Net model, DAU-Net, to classify sea ice and open water in SAR images. U-Net was initially developed for the semantic segmentation of biomedical images [35]. It is designed to work with fewer training samples while still yielding precise segmentations. The effectiveness of employing U-Net to solve classification or segmentation problems in geoscience has been demonstrated [11, 28, 46]. Therefore, we use the U-Net as the backbone of the classification model. The dual-polarized information and the IA of SAR images are used as the model inputs. To extract more characteristic features from the multiple inputs, we integrate the dual-attention mechanism [14] into the original U-Net. Finally, we use SAR images near the Bering Sea to train and evaluate the model. We validate the applicability of DAU-Net on a series of SAR images of the Bering Strait and compare the classification results with the sea ice products of the National Snow and Ice Data Center (NSIDC).

2 Data

2.1 Study Area

The study areas are the Bering Sea and the Bering Strait, which are located near the outer edge of the sea ice on the Pacific side of the Arctic (Fig. 1). The Bering Strait is the only channel for water exchange between the Pacific Ocean and the Arctic Ocean. It shows strong atmosphere-sea-ice interactions and supports one of the world’s most productive and valuable fisheries, with ever-increasing commercial vessel activity [9]. Therefore, sea ice detection and monitoring in this region are of great interest to scientific research communities and to the commercial fishing and transportation industries.

Fig. 1 The location of the study area

2.2 SAR Images

The SAR images are obtained from Sentinel-1A in the interferometric wide-swath (IW) mode with a swath width of 250 km. The images are the ground range detected (GRD) products with VV + VH (vertical emitting and vertical and horizontal receiving, respectively) polarizations. The IA is between 30.00-46.00 degrees. The range and azimuth resolutions are 5 and 20 m, respectively, with a sampling space of 10 m.

The data set consists of 34 SAR images as shown in Table 1, and is divided into three subsets: 1) the training set, 2) the testing set, and 3) the applicability validation set (Fig. 1). The model training set includes 15 images (No.1 – No.15 in Table 1). The testing set is the No.16 image in Table 1, and we used this image to evaluate the model performance by metrics. The applicability validation set is a series of images covering the Bering Strait. The series contains six images, each of which is mosaicked from three single Sentinel-1A images, a total of 18 Sentinel-1A images (No.17–No.34 in Table 1). The image series covers the whole ocean process from freezing to melting of the Bering Strait. Therefore, we could validate the applicability of the well-trained model by monitoring the entire cycle of sea ice in the Bering Strait.

Table 1 Information of the SAR images

2.3 NSIDC Sea Ice Products

The sea ice products of the NSIDC [41], named Multisensor Analyzed Sea Ice Extent - Northern Hemisphere (MASIE-NH), are employed as a reference for the applicability discussion. The product is based on the Interactive Multisensor Snow and Ice Mapping System (IMS) results produced by the National Ice Center (NIC). NIC utilizes visible imagery, passive microwave data, and NIC weekly analysis products to create their data product. MASIE-NH provides measurements of daily sea ice extent and sea ice edge boundary for the Northern Hemisphere and 16 Arctic regions in a polar stereographic projection at both 1 and 4 km grid cell sizes [41]. We choose the 1 km MASIE-NH products as the reference.

2.4 Data Preprocessing

We use SNAP 3.0 [10] to perform radiometric calibration and boxcar filtering on all SAR images. As the source SAR images are too large, we downscale each image to 1/3 of its original size, about 8,000\(\times \)5,000 pixels. Although the pixel spacing is thereby coarsened from 10 m to 30 m, the resolution is still much higher than that of the MASIE-NH products (1 km). It is also far more detailed than could be expected from existing manual or operational automatic classifiers [25]. We scale all pixel values to 0-1. All IA values are also scaled to 0-1, with reference to the range 0\(^{\circ }\)-90\(^{\circ }\).
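The two scaling steps can be illustrated with a minimal sketch. The text does not state how pixel values are mapped to 0-1, so the min-max scaling below is an assumption, as are the function names; the IA scaling simply divides by 90 degrees, which is one reading of "referred by 0-90 degrees".

```python
import numpy as np

def scale_backscatter(band: np.ndarray) -> np.ndarray:
    """Min-max scale one calibrated band (VV or VH) to [0, 1].

    Assumption: the paper only says pixel values are scaled to 0-1;
    min-max scaling is one plausible choice, not the confirmed method.
    """
    band = band.astype(np.float64)
    lo, hi = band.min(), band.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(band)
    return (band - lo) / (hi - lo)

def scale_incidence_angle(ia_deg: np.ndarray) -> np.ndarray:
    """Scale incidence angles given in degrees to [0, 1] relative to 0-90 degrees."""
    return np.asarray(ia_deg, dtype=np.float64) / 90.0
```

Under this reading, the stated IA range of 30 to 46 degrees maps to roughly 0.33 to 0.51.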

The SAR images are labeled into two classes, 1 for sea ice and 0 for open water, with the annotation tool LabelMe [36] to obtain the ground truth labels. As the resolutions of existing sea ice products are much lower than that of the Sentinel-1A images [25], the labeling process is based on visual interpretation. For regions that are difficult to distinguish, we refer to the 1 km MASIE-NH products to label them. In this way, most of the pixels in the SAR images can be labeled correctly. Due to SAR image noise and the limitations of manual labeling, there are inevitably a few mislabeled pixels, and some small sea ice objects cannot be accurately labeled. This is a common problem in supervised learning. For most classification tasks, such mislabeled pixels account for a small proportion of all pixels and do not affect the convergence of the model [17].

We divide all images (VV, VH, and IA) into 256\(\times \)256-pixel chips as the model inputs. Fig. 2 takes the VV channel as an example to show the SAR image chips and the corresponding ground truth labels.

Fig. 2 Image chips (VV channel) and the corresponding labels. a-h SAR image chips of 256\(\times \)256 pixels. i-p Labels corresponding to the SAR image chips in a-h
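The chipping step can be sketched as follows. The text does not say whether chips overlap or how image borders are handled, so this sketch assumes non-overlapping tiles and drops incomplete border tiles; the function name `to_chips` is hypothetical.

```python
import numpy as np

CHIP = 256  # chip edge length in pixels, as stated in the text

def to_chips(vv, vh, ia, chip=CHIP):
    """Stack VV, VH, and IA into (H, W, 3) and cut non-overlapping tiles.

    Assumptions: no overlap between tiles, and rows/columns that do not
    fill a complete tile are discarded.
    """
    stack = np.stack([vv, vh, ia], axis=-1)
    rows, cols = stack.shape[0] // chip, stack.shape[1] // chip
    stack = stack[: rows * chip, : cols * chip]
    # (rows, chip, cols, chip, 3) -> (rows, cols, chip, chip, 3)
    tiles = stack.reshape(rows, chip, cols, chip, 3).swapaxes(1, 2)
    return tiles.reshape(rows * cols, chip, chip, 3)
```

Each returned chip has shape 256 x 256 x 3, matching the three-channel input unit described for the model.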

3 Method

3.1 Overall Structure of DAU-Net

The backbone of the proposed DAU-Net is a U-Net model. The U-Net is named for its almost symmetric, U-shaped encoder-decoder architecture, and it is designed to work with fewer training samples while still yielding precise segmentations. The encoder extracts abstracted, downscaled high-level feature maps. The decoder restores the resolution of the high-level feature maps. The intermediate feature maps extracted by the encoder and decoder are connected to form multi-scale feature maps for pixel-level classification. The encoder can be a mature DNN model, such as VGG16, ResNet18, or ResNet34 [4, 23].

Discriminative feature representations are essential for improving classification accuracy. To achieve high-accuracy classification of sea ice and open water in medium-high resolution SAR images, we need more characteristic features to discriminate fine-grained objects such as small floes, sinuous ice-water boundaries, and ice channels. Therefore, we integrate a dual-attention mechanism into the original U-Net to form the DAU-Net model and improve the feature representations of sea ice and open water. The dual-attention mechanism comprises a position attention module (PAM) and a channel attention module (CAM), which capture long-range dependencies in the spatial and channel dimensions, respectively. It has been demonstrated to be effective in classical image segmentation [14].

The PAM captures long-range dependencies in the spatial dimension with a self-attention mechanism. For a feature map, the feature value at a specific position is updated by aggregating the feature values at all positions with a weighted summation. The weights are determined by the feature similarities between the corresponding two positions. Any two positions with similar features can contribute to mutual improvement regardless of their distance in the spatial dimension. Similarly, the CAM employs the self-attention mechanism to capture the channel dependencies between any two channel maps. Each channel map is updated by a weighted sum of all channel maps. Finally, the outputs of these two attention modules are fused to further enhance the feature representations.

Fig. 3 Model design. a The model’s input: VV, VH, and IA channels. b The model’s encoder. c The attention modules. d The model’s decoder. e The model’s output

Overall, as shown in Fig. 3, the DAU-Net consists of five parts: inputs, encoder, attention, decoder, and output. Each input unit consists of three channels of a 256\(\times \)256-pixel SAR image: VV, VH, and IA. The encoder is the ResNet-34, a mature model for image recognition, and it extracts abstracted, downscaled feature maps for accurate classification. The attention part performs position attention and channel attention on the extracted feature maps to capture long-range dependencies in spatial and channel dimensions. The outputs of the two attention modules are fused to form more characteristic features transmitted to the decoder. The decoder module rescales the downscaled feature maps to the original size. Skip connections link the encoder feature and decoder feature. Next, we will detail the encoder, attention, decoder, and output modules.

3.2 Encoder

He et al. [18] proposed the residual network (ResNet) to increase the number of hidden CNN layers to more than one hundred. The ResNet family includes ResNet-18, ResNet-34, ResNet-50, and ResNet-101, where the number represents the number of CNN layers. Larger numbers mean more CNN layers, more parameters, and higher training complexity. The ResNet family has been widely used in semantic segmentation and object detection. Considering the depth of the model, the number of trainable parameters, and the complexity of sea ice texture, we choose ResNet-34 as the encoder for DAU-Net. ResNet-34 is compared with other ResNet-based encoders in Section 4.6.

The encoder consists of the 33 CNN layers of the ResNet-34, organized into five stages. The first stage is one CNN layer with a 7\(\times \)7 kernel and 2\(\times \)2 strides. After the first stage, the original image is downscaled to 128\(\times \)128. The remaining four stages are composed of 3, 4, 6, and 3 ResNet blocks, for a total of 16 ResNet blocks (Fig. 3). Each ResNet block contains two stacked CNN layers with a shortcut connection linking the input of the block and the output of the second CNN layer [18]. The numbers of convolutional kernels in the five stages are 64, 64, 128, 256, and 512. The original ResNet-34 model uses four 2\(\times \)2 max-pooling layers that are stacked on the four ResNet stages to downscale the feature maps. Here, we discard the last max-pooling layer and retain the first three. The activation function of each CNN layer is ReLU [1]. After encoding, the original inputs are transformed into 512 feature maps of size 16\(\times \)16. These high-level features are then transmitted to the attention part.
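The layer count and feature-map sizes stated above can be checked with a short arithmetic sketch. It assumes that each of the three retained poolings halves the spatial size at stages two to four; the exact placement of the poolings within the stages is not given in the text, and the function names are hypothetical.

```python
def conv_layer_count(blocks=(3, 4, 6, 3), convs_per_block=2):
    """One stem convolution plus two convolutions per ResNet block."""
    return 1 + convs_per_block * sum(blocks)

def encoder_shapes(input_size=256):
    """Trace (spatial size, number of kernels) after each of the five stages."""
    kernels = [64, 64, 128, 256, 512]  # kernels per stage, from the text
    # Stride-2 stem, then three retained 2x2 poolings; the last pooling is dropped.
    # Assumption: one pooling is attached to each of stages two to four.
    downscale = [2, 2, 2, 2, 1]
    shapes, size = [], input_size
    for k, d in zip(kernels, downscale):
        size //= d
        shapes.append((size, k))
    return shapes
```

For a 256 x 256 input, the trace ends at 16 x 16 with 512 kernels, and the count of convolutional layers is 1 + 2 x 16 = 33, in agreement with the description.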

3.3 Attention

The 512 feature maps of size 16\(\times \)16 extracted by the encoder are fed into the PAM and CAM to capture spatial and channel dependencies. The outputs of these two attention modules are fused and transmitted to the decoder.

3.3.1 PAM

Since a CNN adopts local connections, the features it captures are local. For semantic segmentation, the local features generated by a fully convolutional network are not representative enough, which can lead to misclassifications [31]. The PAM addresses this issue. It updates the feature value at a specific position by aggregating the feature values at all positions with a weighted summation. Thus, the global spatial dependencies between any two positions can be captured. These global features are fused with local features to form more characteristic features. The calculation of the PAM is detailed below.

As shown in Fig. 4a, let H, W, and C represent the height, width, and number of channels, and let \(A \in \mathbb {R}^{H \times W \times C}\) be a local feature map extracted from the model inputs. The white/dark regions represent sea ice/water features. There are some inaccurate features in A, especially in the regions marked by the red rectangle. A is fed into three CNN layers to generate three feature maps \(B \in \mathbb {R}^{H \times W \times C}\), \(C \in \mathbb {R}^{H \times W \times C}\), and \(D \in \mathbb {R}^{H \times W \times C}\), as shown in Fig. 4b (the feature map C should not be confused with the channel count C in the exponents). B is reshaped to \(B^{1} \in \mathbb {R}^{N \times C}\), where N = H \(\times \) W is the number of pixels. C is reshaped and transposed to \(C^{1} \in \mathbb {R}^{C \times N}\). Then, matrix multiplication is performed between \({B^{1}}\) and \({C^{1}}\), and the result is activated by a softmax layer to calculate the spatial attention map \(S \in \mathbb {R}^{N \times N}\). The softmax activation [3] normalizes S by row so that each row sums to 1. The more similar the feature representations of two positions are, the higher the correlation between them and the larger the corresponding value in S.

Fig. 4 Flow of PAM. a Before PAM, some water pixels are misclassified. b PAM. c After PAM, the misclassified pixels are corrected

The global dependencies between any two positions in the feature map are modeled by S. D is reshaped to \(D^{1} \in \mathbb {R}^{N \times C}\). S is multiplied by \({D^{1}}\) to generate \(A^{s} \in \mathbb {R}^{N \times C}\):

$$\begin{aligned} a_{ij}^{s} = S_{i} \cdot D_{j}^{1}, j \in [1,C] \end{aligned}$$
(1)

where \(a_{ij}^{s}\) is an element of \(A^{s}\), \({S_{i}}\) is the \(i\)th row of S, and \(D_{j}^{1}\) is the \(j\)th column of \({D^{1}}\). \({A^{s}}\) is reshaped to \(A^{1} \in \mathbb {R}^{H \times W \times C}\). For each channel of \({A^{1}}\), the element at a position is the weighted sum of the elements across all positions in the corresponding channel of D, based on the weights in S. Therefore, \(A^{1}\) has a global contextual view and selectively aggregates contexts according to the spatial attention map. \(A^{1}\) is multiplied by a scale parameter \(\alpha \) and added element-wise to the input feature map A to obtain the output \(E \in \mathbb {R}^{H \times W \times C}\):

$$\begin{aligned} E = \alpha A^{1} + A \end{aligned}$$
(2)

where \(\alpha \) is initialized as 0 and gradually learns to assign more weight.

Each pixel value of the output feature map E is a weighted sum of the features across all pixels plus the original feature. E thus integrates the local features and the long-range global features. Similar semantic features achieve mutual gains, which improves intra-class compactness and semantic consistency. Intuitively, as shown in Fig. 4c, the inaccurate features in A are corrected by the PAM, which contributes to the final output.
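The PAM computation in Eqs. (1) and (2) can be sketched in NumPy. This is an illustration under assumptions rather than the authors' implementation: the three CNN layers that produce B, C, and D are omitted, so the feature maps are passed in directly, and the function names are hypothetical.

```python
import numpy as np

def softmax_rows(m):
    """Row-wise softmax, so that every row sums to 1."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def position_attention(A, B, C, D, alpha):
    """A is the input feature map; B, C, D stand in for the outputs of the
    three CNN layers applied to A. All arrays have shape (H, W, channels)."""
    H, W, ch = A.shape
    N = H * W
    B1 = B.reshape(N, ch)            # (N, C)
    C1 = C.reshape(N, ch).T          # (C, N)
    S = softmax_rows(B1 @ C1)        # spatial attention map, (N, N)
    D1 = D.reshape(N, ch)            # (N, C)
    As = S @ D1                      # Eq. (1): a_ij = S_i . D1_j
    A1 = As.reshape(H, W, ch)
    return alpha * A1 + A, S         # Eq. (2)
```

With alpha equal to 0, its stated initial value, the module returns A unchanged, so the attention branch only contributes once alpha has been learned away from 0. The N x N attention map is also the reason the module is practical on small maps such as 16 x 16, where N = 256.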

3.3.2 CAM

Each channel map of high-level features can be regarded as a class-specific response, and different semantic responses are associated with each other. The CAM updates the feature value at a position by aggregating feature values of all channels in the same position with a weighted sum. The interdependencies between channels of feature maps are captured, which improves the feature representation of specific semantics.

Fig. 5 The detailed calculation process of CAM in the DAU-Net. a Feature maps without CAM; some water pixels are inaccurately encoded as sea ice pixels, marked by the red rectangle. b The calculation process of CAM. c Feature maps after CAM; some inaccurate sea ice pixels are corrected to water pixels, improving the accuracy of the outputs

The structure of the CAM is illustrated in Fig. 5. As shown in Fig. 5a, let H, W, and C represent the height, width, and number of channels, and let \(A \in \mathbb {R}^{H \times W \times C}\) be a local feature map extracted from the model inputs. The channel attention map \(X \in \mathbb {R}^{C \times C}\) is calculated from the original features A (Fig. 5b). A is reshaped to \(A^{1} \in \mathbb {R}^{N \times C}\), and is also reshaped and transposed to \(A^{2} \in \mathbb {R}^{C \times N}\). Then, a matrix multiplication between \({A^{2}}\) and \({A^{1}}\) is performed, and a softmax layer is applied to obtain the channel attention map X. The more similar the feature representations of two channels are, the higher the correlation between them and the larger the corresponding value in X. The sum of each row in X is 1. \({A}^{1}\) is multiplied by the transpose of X to generate \(A^{x} \in \mathbb {R}^{N \times C}\):

$$\begin{aligned} a_{ij}^{x} = A_{i}^{1} \cdot X_{j},j\in [1,C] \end{aligned}$$
(3)

where \(a_{ij}^{x}\) is an element of \(A^{x}\), \(A_{i}^{1}\) is the \(i\)th row of \({A}^{1}\), and \({X_{j}}\) is the \(j\)th row of X (equivalently, the \(j\)th column of the transpose of X). \({A}^{x}\) is reshaped to \(A^{3} \in \mathbb {R}^{H \times W \times C}\). For each position of \({A^{3}}\), the element of a channel is the weighted sum of the elements across all channels at the corresponding position of A, based on the weights in X. Therefore, \(A^{3}\) has long-range contextual dependencies in the channel dimension. \({A^{3}}\) is multiplied by a scale parameter \(\beta \) and added element-wise to the input feature map A to obtain the output \(F \in \mathbb {R}^{H \times W \times C}\):

$$\begin{aligned} F = \beta A^{3} + A \end{aligned}$$
(4)

where \(\beta \) gradually learns a weight from 0.

The final feature of each channel is a weighted sum of the features of all channels plus the original features. The long-range semantic dependencies between different channels of the feature maps are modeled, which boosts feature discriminability. As shown in Fig. 5a, many open water regions are inaccurately represented as sea ice features in feature map A. After the channel attention procedure, most of the inaccurate regions in A are corrected (Fig. 5c). The output feature map F is more discriminative than A and helps to achieve a good classification result.
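The CAM computation in Eqs. (3) and (4) can be sketched in the same way. This is a minimal NumPy illustration with hypothetical function names, following the stated product of the reshaped features with the transpose of X.

```python
import numpy as np

def softmax_rows(m):
    """Row-wise softmax, so that every row sums to 1."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def channel_attention(A, beta):
    """A has shape (H, W, channels); returns the output F and the map X."""
    H, W, ch = A.shape
    N = H * W
    A1 = A.reshape(N, ch)            # (N, C)
    A2 = A1.T                        # (C, N)
    X = softmax_rows(A2 @ A1)        # channel attention map, (C, C)
    Ax = A1 @ X.T                    # Eq. (3): product with the transpose of X
    A3 = Ax.reshape(H, W, ch)
    return beta * A3 + A, X          # Eq. (4)
```

Unlike the PAM, no extra convolutions are applied before the attention map is computed, and the map has only C x C entries, so its cost depends on the number of channels rather than on the number of pixels.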

3.3.3 Fusion

The PAM output and CAM output are separately transformed by a CNN layer. An element-wise summation is performed on the two transformed results. A CNN layer executes convolutions on the summation to generate fusion features. Finally, the fusion features are transmitted to the decoding part.

3.4 Decoder

Five decoder modules are stacked upon the features output by the attention modules, and each decoder module is composed of one up-sampling layer and two stacked CNN layers. Each CNN layer is followed by a batch normalization layer and a ReLU activation layer. The numbers of convolutional kernels in the five decoder modules are 256, 128, 64, 32, and 16, respectively. Three concatenations fuse the features generated at the same level of the encoder and decoder. The kernel size of all CNN layers in the decoder modules is 3\(\times \)3. After decoding, the 16\(\times \)16 feature maps are rescaled to the same size as the input image, 256\(\times \)256.

3.5 Output

The feature maps output by the decoder are fed into the output module, which consists of one CNN layer with a single 1\(\times \)1 convolutional kernel. A sigmoid layer applies a non-linear activation to the convolutional outputs to predict the value of each pixel. The activation value lies in the range [0, 1]. If it is larger than 0.5, the pixel is classified as sea ice; otherwise, it is classified as open water. The loss function is binary cross-entropy.
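The output rule and the loss can be written out as a small NumPy sketch. The function names are hypothetical, and the clipping constant `eps` is an assumption added for numerical stability rather than something stated in the text.

```python
import numpy as np

def sigmoid(z):
    """Map raw convolution outputs to values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=np.float64)))

def classify(prob, threshold=0.5):
    """1 = sea ice if the activation is larger than the threshold, else 0 = open water."""
    return (np.asarray(prob) > threshold).astype(np.uint8)

def binary_cross_entropy(y_true, prob, eps=1e-7):
    """Mean binary cross-entropy over all pixels.

    eps is an assumed clipping constant that keeps log() finite.
    """
    y = np.asarray(y_true, dtype=np.float64)
    p = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

Note that an activation of exactly 0.5 falls on the open-water side under the "larger than 0.5" rule.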

4 Experiments

4.1 Experiments Setting

There are 4,684 SAR chips in the training set. We split off 30% of the training samples as the validation set. We choose a typical image with a rough sea surface and various sea ice textures as the testing image and divide it into 672 chips of 256\(\times \)256 pixels. The model runs on a GPU workstation with one NVIDIA TESLA V100 32 GB GPU. The batch size is 16, and the initial learning rate is 0.0001. We use Keras as the DL framework, and the ReduceLROnPlateau and early stopping strategies in Keras are employed to accelerate convergence and avoid overfitting.

4.2 Evaluation Metrics

Accuracy, precision, recall, and mean intersection over union (IoU) are employed to evaluate the performance of the classification methods. The definitions of these metrics are shown in Fig. 6. Accuracy refers to the proportion of correctly predicted pixels, both sea ice and water, among all predicted pixels. Precision refers to the proportion of pixels that are true sea ice and predicted as sea ice among all predicted sea ice pixels. A higher precision value means the model produces fewer false alarms. Recall refers to the proportion of pixels that are true sea ice and predicted as sea ice among all true sea ice pixels. A higher recall value means the model misses fewer sea ice pixels. IoU is the proportion of pixels that are true sea ice and predicted as sea ice relative to the union of true sea ice and predicted sea ice pixels. When the predicted sea ice pixels coincide completely with the true sea ice pixels, the IoU reaches its maximum value of 1.

Fig. 6 Definitions of accuracy ((TP+TN)/(TP+TN+FP+FN)), precision (TP/(TP+FP)), recall (TP/(TP+FN)), and IoU (TP/(TP+FP+FN))
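The four definitions can be computed directly from pixel-wise confusion counts. The sketch below treats sea ice (label 1) as the positive class, as in the definitions; the function names are hypothetical, and returning 0.0 for an empty denominator is an assumed convention.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with sea ice (1) as the positive class."""
    t = np.asarray(y_true).astype(bool).ravel()
    p = np.asarray(y_pred).astype(bool).ravel()
    tp = int(np.sum(t & p))
    tn = int(np.sum(~t & ~p))
    fp = int(np.sum(~t & p))
    fn = int(np.sum(t & ~p))
    return tp, tn, fp, fn

def _ratio(num, den):
    # Assumed convention: report 0.0 when the denominator is empty.
    return num / den if den else 0.0

def evaluate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "iou": _ratio(tp, tp + fp + fn),
    }
```

Because IoU adds both FP and FN to the denominator, it can never exceed precision or recall, which is why it is the strictest of the four measures.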

4.3 Comparison Experiments Against Other Models

To validate the performance of the proposed DAU-Net, two recently proposed DL-based sea ice classification models are selected for comparison: 1) CNNwang, the CNN-based detection model proposed by Wang et al. [42] in 2018, which consists of five CNN layers and three max-pooling layers; and 2) DenseNetFCN, which has a structure similar to the MLFN model proposed in 2019 [16]. To support pixel-level segmentation and make a fair comparison, DenseNetFCN replaces the fully connected layers in the original MLFN with fully convolutional layers and adds upsampling blocks, forming a U-shaped segmentation model.

We also compare our model against the classic U-Net model, which has the same structure as DAU-Net except that the CAM and PAM are removed. It is worth noting that the CNN layers after the two attention modules and the CNN layer of the fusion part are retained to ensure a fair comparison. U-NetCAM denotes the U-Net model with the CAM but no PAM, and U-NetPAM denotes the U-Net model with the PAM but no CAM. These CNN layers are likewise retained in both models. We tune the hyperparameters of all compared models and record the results with the best accuracy.

Table 2 Evaluation results of all compared models

The evaluation metrics of all models are shown in Table 2, and the corresponding classification results are shown in Fig. 7. The accuracy, IoU, and precision of CNNwang are lower than those of the other five models, whereas its recall is the highest. Its precision and recall are strongly unbalanced, which means CNNwang misses fewer sea ice pixels but misclassifies many open water pixels as sea ice (many false alarms). As shown in Fig. 7d, the classification results of CNNwang, for example at sea ice edges and ice blocks, are coarse-grained. Limited by its model complexity, CNNwang has difficulty extracting enough representative features to achieve fine-grained classification and thus generates many false alarms. Compared with CNNwang, DenseNetFCN clearly improves the accuracy, IoU, and precision, while its recall is lower. The gap between precision and recall is narrowed. Fig. 7e shows that its classification results are much more refined than those of CNNwang. However, there are still some false alarms in the region marked by the red rectangle. Although DenseNetFCN is more complex than CNNwang, it still cannot extract sufficiently characteristic features to accurately distinguish sea ice and water, especially in areas where sea ice and water are mixed under complex sea conditions.

Fig. 7 a-c Inputs of the test SAR image; the VV and VH channels are scaled to 0-255 for better visualization. d-i Classification results of the different models

The U-Net model outperforms CNNwang and DenseNetFCN in both accuracy and IoU. Its recall and precision are also more balanced. Fig. 7f shows that the U-Net clearly reduces the false alarms generated by DenseNetFCN (marked by the red rectangle). By introducing attention modules, U-NetCAM and U-NetPAM improve the accuracy and IoU. Their precision and recall do not show significant improvements. However, as shown in Fig. 7g-h, the classification results of U-NetCAM and U-NetPAM are more refined, and the boundary between sea ice and open water is smoother. Finally, the DAU-Net, which integrates both the CAM and the PAM, obtains the highest accuracy, IoU, and precision (Table 2). Compared with the original U-Net model, the accuracy, IoU, and precision of the DAU-Net increase by 0.50%, 1.00%, and 1.14%, respectively. The precision and recall are in balance. A comparison of Fig. 7i and f shows that the false alarms generated by U-Net are reduced significantly and that the classification results of DAU-Net are more refined. Fine-grained objects such as small floes, sinuous ice-water boundaries, and ice channels are classified more smoothly by DAU-Net. Therefore, the CAM and the PAM improve the representational ability of the extracted features and thereby the classification of sea ice and open water.

4.4 Effectiveness of IA

As the IA is ignored in existing DL-based models [16, 42], we design an experiment to evaluate the effectiveness of employing the IA of SAR images as one input. Table 3 shows the experimental results. DAU-Net is the model with IA, and DAU-NetNIA is the model without IA. The other experimental settings are unchanged. The accuracy and IoU of DAU-NetNIA are lower than those of the DAU-Net. Its precision is much larger than its recall, which means DAU-NetNIA misses many sea ice pixels. As shown in Fig. 8c, some sea ice pixels are misclassified as open water in the upper left part of the image. Thus, the IA is essential for obtaining better classification results.

Table 3 Evaluation results of using IA
Fig. 8 a and b VV channel and VH channel of the testing set. c-f Classification results of the model without IA, VH, VV as inputs, separately

4.5 Effectiveness of Dual-Polarization Information

We design an experiment to evaluate the effectiveness of dual-polarization inputs. DAU-Net uses the VV channel, the VH channel, and the IA as inputs. DAU-NetVV uses the VV channel and the IA as inputs, and DAU-NetVH uses the VH channel and the IA as inputs. The other experimental settings are unchanged. The results are shown in Table 4. The four metrics of DAU-NetVH are lower than those of the other two models. As shown in Fig. 8e, DAU-NetVH misclassifies many sea ice pixels as open water, mainly in the upper left part of the image. DAU-NetVV performs better than DAU-NetVH, but it still misses some sea ice pixels in the middle part of the image (Fig. 8d). Finally, by combining VV and VH as inputs, DAU-Net achieves the best performance. Thus, the dual-polarization information of SAR images helps to obtain better classification results.

Table 4 Evaluation results of using Dual-Polarization Information

4.6 Performances of Different ResNet-Based Encoders

The encoder in DAU-Net is ResNet-34. We design an experiment to evaluate the performance of two other ResNet-based encoders. DAU-Net18 is the model using ResNet-18 as the encoder, and DAU-Net50 is the model using ResNet-50 as the encoder. The other parts of these two models are the same as those of the DAU-Net. As shown in Table 5, the performances of the three models do not differ much. DAU-Net with ResNet-34 as the encoder slightly outperforms the models with the other two ResNet-based encoders. For our classification task, ResNet-34 is therefore a more suitable encoder than the other two ResNet models.

Table 5 Evaluation of different ResNet encoders

5 Discussions

To validate the robustness of the proposed model, we use DAU-Net to classify sea ice and open water in a series of SAR images of the Bering Strait and compare the classification results with the sea ice products provided by NSIDC. As DenseNetFCN represents the existing DL-based classification models for sea ice, we take its results as comparison targets. The image series consists of six images, each mosaicked from three Sentinel-1A images, for a total of 18 Sentinel-1A images. Their details are shown in Table 1. The image series covers the process from freezing to melting in the Bering Strait, including a variety of sea ice textures and sea surface conditions. As shown in Fig. 9a-f, sea ice partially appeared in the Bering Strait on Dec 13, 2018, and it covered the entire region until Mar 19, 2019. Then, on Mar 31, 2019, the sea ice started to melt, and by May 6, 2019, most of it had receded. The most recent data (generally from the previous day) of the 1 km MASIE-NH products appear in the archive at approximately 10:00 p.m. (Greenwich Mean Time, GMT). The 18 Sentinel-1A images of the Bering Strait were acquired around 06:00 p.m. (GMT). Because of this time difference, the date of each MASIE-NH product we use is one day later than the date of the corresponding Sentinel-1A image. The cell size of the DAU-Net results is 30 m, whereas that of the MASIE-NH products is 1 km. The spatial resolutions of the two datasets differ too much for a quantitative comparison of evaluation metrics to be reasonable. Here, we therefore discuss the performance of DAU-Net through a visual comparison of the classification results.

Fig. 9

Comparison between results of DAU-Net, MASIE-NH products, and results of DenseNetFCN for a time series (Dec 13, 2018 to May 6, 2019) in the Bering Strait. a-f SAR images, VV channel. g-l classification results of DAU-Net. m-r 1 km MASIE-NH products. s-x classification results of DenseNetFCN

Figure 9g-l shows the classification results of DAU-Net, and Fig. 9m-r shows the corresponding MASIE-NH products. Overall, the DAU-Net results are consistent with the MASIE-NH products. The sea surface in Fig. 9a, d, and f is very rough and bright and is mixed with the sea ice, especially in the regions marked by red rectangles. As shown in Fig. 9g, j, and l, DAU-Net classifies the sea ice and open water well, which demonstrates that the proposed model can deal with a complex sea surface. There are many water gaps, small sea ice floes, and sinuous ice-water boundaries in Fig. 9c and f, which are finely classified by DAU-Net, as shown in Fig. 9i and l. The separate water channels in Fig. 9e are also successfully classified by DAU-Net, as shown in Fig. 9k. As the spatial resolution of the MASIE-NH products is 33.3 times lower than that of the DAU-Net results, many fine-grained objects cannot be classified in the MASIE-NH products. As shown in Fig. 9i, k, and l, the classification results of DAU-Net are more consistent with the SAR images than the MASIE-NH products are, especially in the regions marked by the yellow rectangles in Fig. 9c, e, and f. Taking the region marked by the yellow rectangle in Fig. 9f as an example, we show a detailed comparison between the classification results of DAU-Net and the 1 km MASIE-NH products in Fig. 10. Our classification results show obvious advantages over the MASIE-NH products in spatial resolution (Fig. 10b-d).
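The effect of the resolution gap can be illustrated with a toy sketch. It assumes a simple majority vote per coarse cell, which is only a stand-in for a coarser grid and is not how the MASIE-NH products are generated; the function name `coarsen_majority` and the toy mask are hypothetical.

```python
def coarsen_majority(mask, factor):
    """Aggregate a fine binary mask (1 = sea ice, 0 = open water) into
    coarse cells of factor x factor pixels by majority vote."""
    rows, cols = len(mask), len(mask[0])
    if rows % factor or cols % factor:
        raise ValueError("mask size must be a multiple of factor")

    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            ice = sum(
                mask[i][j]
                for i in range(r, r + factor)
                for j in range(c, c + factor)
            )
            # A coarse cell is labeled ice if more than half of its pixels are ice.
            coarse_row.append(1 if 2 * ice > factor * factor else 0)
        coarse.append(coarse_row)
    return coarse


# Ratio between the two cell sizes quoted in the text: 1 km versus 30 m.
resolution_ratio = round(1000 / 30, 1)

# Toy 6 x 6 ice field with a one-pixel-wide water channel in the second column.
fine = [[1, 0, 1, 1, 1, 1] for _ in range(6)]
coarse = coarsen_majority(fine, 3)
```

With a coarsening factor of only 3, the water channel already disappears from the coarse grid; at a ratio of about 33.3 per axis, each 1 km cell spans roughly a thousand 30 m pixels, so narrow leads and small floes cannot be resolved.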

Fig. 10

A detailed comparison between results of DAU-Net and MASIE-NH products in a representative region marked in Fig. 9f. a The SAR image on May 6, 2019. b-d The detailed SAR image, classification results of DAU-Net, and 1 km MASIE-NH products corresponding to the marked region

However, DAU-Net does not perform well in some regions. As marked by the green rectangles in Fig. 9a, some sea ice pixels with dark textures are misclassified as open water, and some open water pixels with extremely rough surfaces are misclassified as sea ice. These misclassifications may be due to the lack of these two types of samples in the training set. They mainly occur in the SAR image from Dec 13, 2018, the early stage of sea ice formation in the Bering Strait, which contains some very dark sea ice textures. These textures are rare during the freezing and melting stages. In addition, extremely rough sea surfaces are also rare in the training set, resulting in misclassifications. As shown in Fig. 9s-x, the results of DenseNetFCN are generally consistent with the MASIE-NH products. However, DenseNetFCN performs worse than DAU-Net, especially in the regions marked by red circles, where some rough sea surface pixels are misclassified as sea ice pixels.

In summary, by validating the applicability of DAU-Net on a series of SAR images of the Bering Strait, we demonstrate that DAU-Net performs well in most sea conditions. The proposed model is capable of dealing with various sea ice textures. Owing to the advantages of SAR image resolution and model performance, the results of DAU-Net are more refined than the MASIE-NH products. DAU-Net also outperforms the existing DL-based sea ice classification model, DenseNetFCN. However, DAU-Net does not perform well on some unusual textures. To further improve the model applicability, we will collect more training samples to supplement the rare texture types.

6 Conclusions

This study proposes a DAU-Net model to classify sea ice and open water from SAR images. We combine ResNet-34 with U-Net to form the model backbone. SAR images are obtained from Sentinel-1A. The dual-polarized information and the IA of SAR images are used as the model inputs. We integrate the dual-attention mechanism, PAM and CAM, into the original U-Net model to extract more characteristic features, which helps to achieve more accurate classifications. We use 15 Sentinel-1A SAR images acquired near the Bering Sea to train the model. We evaluate the model performance on one SAR image and compare DAU-Net with typical DL-based ice classification models. Further, we use the well-trained model to classify a series of SAR images of the Bering Strait, which covers the process from freezing to melting. We compare the classification results of DAU-Net with the 1 km MASIE-NH products of NSIDC. Experiments show that: 1) the dual-attention mechanism enhances the representative ability of features and helps DAU-Net outperform the original U-Net and typical existing DL-based ice classification models, especially in the classification of fine-grained targets; 2) the three-channel inputs, dual-polarized information (VV and VH) and IA, contribute to high-accuracy classifications; and 3) DAU-Net is capable of dealing with complex sea state conditions from freezing to melting, showing good robustness and applicability.

In the future, to address the misclassifications on unusual sea ice textures, we will collect more training samples from a wide range of space and time. We will also explore the possibility of integrating few-shot learning to solve this problem. In addition, multi-category classification models that discriminate MYI, sea ice, and open water will be the subject of follow-up work.