Lightweight U-Net for cloud detection of visible and thermal infrared remote sensing images

Accurate and rapid cloud detection is critically important for improving the downlink efficiency of on-orbit data, especially for microsatellites with limited power and computational ability. However, slow inference and large model size limit the potential for on-orbit deployment of deep-learning-based cloud detection methods. In view of these problems, this paper proposes a lightweight network based on depthwise separable convolutions that reduces the model size and computational cost of pixel-wise cloud detection. The network achieves lightweight end-to-end cloud detection by extracting feature maps from the images and generating the cloud mask from them. On the visible and thermal infrared bands of the Landsat 8 Cloud Cover Assessment validation dataset, the experimental results show that the pixel accuracy of the proposed method is higher than 90%, the inference speed is about 5 times that of U-Net, and the model parameters and floating-point operations are reduced to 12.4% and 12.8% of those of U-Net, respectively.


U-Net based on depthwise separable convolutions
As shown in Fig. 1, the workflow of the proposed method consists of two parts: training and testing. In this paper, the L8 CCA dataset is divided into a training set, a validation set, and a test set. In the training phase, the images and the corresponding ground truth of the training and validation sets are fed into the network, and the model is trained with the backpropagation (BP) algorithm to obtain the parameters with minimal loss. In the testing phase, the model outputs a probability tensor for each test image, and the cloud detection result is obtained with the argmax function.
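As a minimal illustration of the testing step, the following sketch (tensor shapes and class indices are assumed for illustration, not taken from the paper) converts the output probability tensor into a per-pixel cloud mask with argmax:

```python
import torch

# Assumed shapes: the network outputs one score per class for every pixel
# of a 224 x 224 patch (batch, classes, height, width).
logits = torch.randn(16, 2, 224, 224)   # e.g. class 0 = clear, class 1 = cloud
probs = torch.softmax(logits, dim=1)    # per-pixel class probabilities
mask = torch.argmax(probs, dim=1)       # (16, 224, 224) label map
print(mask.shape, mask.unique())        # labels are 0 or 1
```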
The network used in this paper is a fully convolutional network based on an encoder-decoder architecture improved with depthwise separable convolutions (Howard et al. 2017), as shown in Fig. 2. The input data is processed by encoders based on depthwise separable convolutions to generate high-dimensional feature maps at different scales; the decoders then progressively upsample the feature maps and concatenate the encoder and decoder feature maps at the same scale; finally, the network outputs a tensor containing the classification probabilities of each pixel, and the class with the highest probability is taken as the label of the pixel.
U-Net (Ronneberger et al. 2015) was originally designed for the segmentation of biomedical images. Because the model is simple, easy to train, and well suited to small datasets, U-Net is now widely used in medical diagnosis, remote sensing, and other fields. The convolutional layers in U-Net use standard convolutions, the most important component of convolutional neural networks. The operation of a standard convolution can be divided into two parts: the input feature maps are first filtered, and the filtered results are then combined into the output feature maps. Figure 3a shows the process of standard convolution. Suppose a feature map of size H × W with M channels is input into a standard convolution and, after the calculation of N filters with a kernel size of K × K, a feature map of size H × W with N channels is output. The parameters of the standard convolution are then

$$P_{std} = K \times K \times M \times N \quad (1)$$

and its computational cost is

$$C_{std} = K \times K \times M \times N \times H \times W \quad (2)$$

Depthwise separable convolution splits the standard convolution into a depthwise convolution and a pointwise convolution, which perform the filtering and the channel combination, respectively (Howard et al. 2017). As shown in Fig. 3b, a depthwise convolution with a K × K kernel processes the input feature map of size H × W × M into a feature map with M channels, so the parameters of the depthwise convolution are

$$P_{dw} = K \times K \times M \quad (3)$$

and its computational cost is

$$C_{dw} = K \times K \times M \times H \times W \quad (4)$$

Both are 1/N of those of the standard convolution. Figure 3c shows the process of pointwise convolution. A pointwise convolution is a standard convolution with a kernel size of 1 × 1; it is mainly used to change the dimension of the output feature map of the depthwise convolution, completing the combination operation of the standard convolution. The parameters of the pointwise convolution are

$$P_{pw} = M \times N \quad (5)$$

and its computational cost is

$$C_{pw} = M \times N \times H \times W \quad (6)$$

The computational cost of the depthwise separable convolution is the sum of the costs of the depthwise and pointwise convolutions:

$$C_{dsc} = K \times K \times M \times H \times W + M \times N \times H \times W \quad (7)$$

Comparing this with the computational cost of the standard convolution in Eq. (2) gives

$$\frac{C_{dsc}}{C_{std}} = \frac{K \times K \times M \times H \times W + M \times N \times H \times W}{K \times K \times M \times N \times H \times W} = \frac{1}{N} + \frac{1}{K^{2}} \quad (8)$$

The convolution kernels used in this article are 3 × 3, so the computational cost of a depthwise separable convolution is about one-ninth to one-eighth of that of a standard convolution. By breaking the coupling between the number of output channels and the kernel size, depthwise separable convolution greatly reduces both the parameters and the computational cost.
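As a quick sanity check, a few lines of plain Python (with illustrative values for H, W, M, and N; these specific numbers are not from the paper) evaluate Eqs. (1)-(8) and confirm the predicted reduction:

```python
# Parameter and FLOP counts for a standard vs. a depthwise separable
# convolution, following Eqs. (1)-(8). H, W, M, N, K are illustrative values.
H, W, M, N, K = 224, 224, 64, 128, 3

std_params = K * K * M * N                # Eq. (1)
std_flops = K * K * M * N * H * W         # Eq. (2)
dsc_params = K * K * M + M * N            # Eqs. (3) + (5)
dsc_flops = (K * K * M + M * N) * H * W   # Eq. (7)

print(f"param ratio: {dsc_params / std_params:.4f}")  # ~0.1189
print(f"flop ratio:  {dsc_flops / std_flops:.4f}")    # equals 1/N + 1/K^2
assert abs(dsc_flops / std_flops - (1 / N + 1 / K**2)) < 1e-12  # Eq. (8)
```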
To prevent vanishing gradients during training, a batch normalization layer and a ReLU activation are added after each depthwise convolution and each pointwise convolution. The stride of the depthwise convolution can be 1 or 2, while the stride of the pointwise convolution is fixed at 1.
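A minimal PyTorch sketch of such a block follows; the module and argument names are ours, not the paper's:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 filtering followed by
    pointwise 1x1 channel combination, each with BatchNorm and ReLU."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch);
        # stride may be 1 or 2 (stride 2 is used for downsampling).
        self.depthwise = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                      padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
        )
        # Pointwise: 1x1 convolution with stride fixed at 1 combines channels.
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```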
The optimization of U-Net in this article covers two aspects: the replacement of the standard convolutions and the modification of the sampling methods. U-Net is composed of four encoders and four decoders with corresponding skip connections between them. The network includes 18 layers of 3 × 3 convolutions, so there is considerable room for improvement in model size and computational cost. As discussed above, depthwise separable convolutions reduce the parameters and computational cost of standard convolutions, so replacing the standard convolutions with depthwise separable convolutions reduces the model size and floating-point operations (FLOPs) of U-Net. The sampling methods include upsampling and downsampling. According to Springenberg et al. (2014), max-pooling does not always improve the performance of deep neural networks, and using strided convolutions instead of pooling helps reduce the information loss caused by downsampling. Therefore, this paper replaces the downsampling in the encoders of U-Net, changing the max-pooling layer to a depthwise separable convolution with a kernel size of 3 × 3 and a stride of 2. For upsampling, bilinear interpolation is used, which is easier to train than the deconvolution in U-Net. The comparison between the proposed method and U-Net is shown in Table 1: the parameters of the proposed method are 3.9 million and the FLOPs are 11,002.5 million, which are 12.4% and 12.8% of those of U-Net, respectively. The proposed network retains the encoder-decoder architecture, consisting of four encoders and four decoders. Each encoder consists of two depthwise separable convolutions with a kernel size of 3 × 3 and a stride of 1 followed by one depthwise separable convolution with a kernel size of 3 × 3 and a stride of 2. Each decoder consists of one bilinear interpolation layer and two depthwise separable convolutions with a kernel size of 3 × 3 and a stride of 1. The skip connections between the encoders and decoders enable the network to combine the high-level and low-level semantic features extracted from the image, improving the segmentation result (Long et al. 2015). The structure of the proposed method is shown in Fig. 2, and the specific parameters are listed in Table 2.
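Building on the block above, the following hedged sketch shows one encoder and one decoder stage as described in the text; the module names and channel handling are illustrative, and Table 2 gives the actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# Uses the DepthwiseSeparableConv class defined in the previous sketch.

class Encoder(nn.Module):
    """Two stride-1 depthwise separable convs followed by a stride-2 one,
    which replaces max-pooling as the downsampling operation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = DepthwiseSeparableConv(in_ch, out_ch, stride=1)
        self.conv2 = DepthwiseSeparableConv(out_ch, out_ch, stride=1)
        self.down = DepthwiseSeparableConv(out_ch, out_ch, stride=2)

    def forward(self, x):
        skip = self.conv2(self.conv1(x))  # feature map passed to the decoder
        return self.down(skip), skip

class Decoder(nn.Module):
    """Bilinear upsampling, concatenation with the encoder skip feature,
    then two stride-1 depthwise separable convs."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv1 = DepthwiseSeparableConv(in_ch + skip_ch, out_ch, stride=1)
        self.conv2 = DepthwiseSeparableConv(out_ch, out_ch, stride=1)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear",
                          align_corners=False)
        x = torch.cat([x, skip], dim=1)   # skip connection
        return self.conv2(self.conv1(x))
```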

Metrics
The metrics used in this paper are pixel accuracy (PA), mean pixel accuracy (mPA), and mean intersection over union (mIoU) (Garcia-Garcia et al. 2017). PA is the ratio of correctly classified pixels to all pixels; mPA is the mean of the per-class PA; and mIoU is the mean of the per-class IoU, where the IoU of a class is the ratio of the intersection to the union of the prediction and the ground truth. These metrics can be computed from the confusion matrix shown in Table 3, where TP, FP, FN, and TN stand for true positive, false positive, false negative, and true negative, respectively.
Therefore, the formulas of PA, mPA, and mIoU are as follows:

$$PA = \frac{TP + TN}{TP + TN + FP + FN} \quad (9)$$

$$mPA = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right) \quad (10)$$

$$mIoU = \frac{1}{2}\left(\frac{TP}{TP + FN + FP} + \frac{TN}{TN + FP + FN}\right) \quad (11)$$
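A small NumPy sketch (the function and variable names are ours) that evaluates Eqs. (9)-(11) from the entries of a binary confusion matrix:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Compute PA, mPA and mIoU for binary masks (1 = cloud, 0 = clear)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)     # cloud predicted as cloud
    fp = np.sum(pred & ~truth)    # clear predicted as cloud
    fn = np.sum(~pred & truth)    # cloud predicted as clear
    tn = np.sum(~pred & ~truth)   # clear predicted as clear

    pa = (tp + tn) / (tp + tn + fp + fn)                      # Eq. (9)
    mpa = 0.5 * (tp / (tp + fn) + tn / (tn + fp))             # Eq. (10)
    miou = 0.5 * (tp / (tp + fn + fp) + tn / (tn + fp + fn))  # Eq. (11)
    return pa, mpa, miou
```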

Experimental results and discussion

Experimental images and parameter setting
In this paper, the visible bands 4, 3, and 2 of the Landsat 8 Cloud Cover Assessment (L8 CCA) validation dataset (Foga et al. 2017) are used as the RGB channels to compose true-color images. 72 images from the L8 CCA dataset were selected and processed into 28,800 patches of size 224 × 224, of which 24,000 were used as the training set and 2,400 as the validation set. 6 images of size 4480 × 4480 were used as the test set. The same processing is applied to band 11 of the L8 CCA dataset to verify the applicability of the method in the thermal infrared band.
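A hedged sketch of how a scene could be tiled into 224 × 224 patches; the non-overlapping tiling scheme and array names are our assumptions, though they are consistent with a 4480 × 4480 scene yielding 20 × 20 = 400 patches:

```python
import numpy as np

def tile_patches(image, patch=224):
    """Split an (H, W, C) scene into non-overlapping patch x patch tiles,
    discarding any border remainder."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            patches.append(image[y:y + patch, x:x + patch])
    return np.stack(patches)

# Example: a 4480 x 4480 true-color scene gives 400 patches.
scene = np.zeros((4480, 4480, 3), dtype=np.uint8)
print(tile_patches(scene).shape)   # (400, 224, 224, 3)
```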
The experimental environment of this study is shown in Table 4. All methods are implemented in Python, and the deep learning framework PyTorch (Paszke et al. 2017) is used for model training and testing. The number of training epochs is 70, Adam (Kingma and Ba 2014) is used as the optimizer, and the batch size is set to 16. The initial learning rate is 0.001, and it is updated every 10 epochs with a decay factor of 0.5. As shown in Eq. (12), the loss function is the weighted average of the Dice loss (Milletari et al. 2016) and the binary cross entropy (Zhang and Sabuncu 2018), with a weight of 0.8 on the binary cross entropy:

$$L = 0.8\,L_{BCE} + 0.2\,L_{Dice} \quad (12)$$

The Dice loss and the binary cross entropy are given in Eqs. (13) and (14), respectively, where P is the prediction and T is the ground truth, p_i and t_i are their values at pixel i, and N_p is the number of pixels:

$$L_{Dice} = 1 - \frac{2\sum_{i} p_i t_i}{\sum_{i} p_i + \sum_{i} t_i} \quad (13)$$

$$L_{BCE} = -\frac{1}{N_p}\sum_{i}\left[t_i \log p_i + (1 - t_i)\log(1 - p_i)\right] \quad (14)$$
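A minimal PyTorch sketch of this combined loss; the class name and the smoothing constant eps are our additions, and the network output is assumed to be a sigmoid probability map:

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Weighted combination of binary cross entropy and Dice loss, Eq. (12)."""
    def __init__(self, bce_weight=0.8, eps=1e-6):
        super().__init__()
        self.bce = nn.BCELoss()        # expects probabilities in [0, 1]
        self.bce_weight = bce_weight
        self.eps = eps                 # avoids division by zero on empty masks

    def forward(self, pred, target):
        bce = self.bce(pred, target)                        # Eq. (14)
        inter = (pred * target).sum()
        dice = 1 - (2 * inter + self.eps) / (pred.sum()
                                             + target.sum() + self.eps)  # Eq. (13)
        return self.bce_weight * bce + (1 - self.bce_weight) * dice

# The training schedule described above maps onto the standard PyTorch API:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```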

Experimental results and analysis
The loss and metric curves of the proposed method over the 70 training epochs are shown in Fig. 4. Figure 5 shows the cloud detection results of the proposed method and U-Net. It can be seen that the overall results of the proposed method are better than those of U-Net. In the boxed areas of Fig. 5a, d, e, the detection results of the proposed method are more complete, while U-Net fails to detect the clouds in these areas. In the boxed area of Fig. 5c, the proposed method misses some broken clouds, while U-Net misclassifies a larger area. In the thin-cloud boxed area of Fig. 5b, the result of U-Net is better than that of the proposed method: although the result of the proposed method completely covers the ground truth in that area, the non-cloud pixels around the cloud are misclassified as cloud. The proposed method also shows a large area of misclassification in and around the boxed area of Fig. 5f. This problem may be due to underwater reefs present in this area of the original image, some features of which are similar to thin clouds; since the proposed method is more sensitive than U-Net in thin-cloud areas, it misclassifies these reefs as thin cloud.
To objectively evaluate the cloud detection results of the proposed method, the trained model is used to infer the test set images. Table 5 shows the average results over the test set and their comparison with U-Net, where the inference time is the time the algorithm takes to complete cloud detection of one 224 × 224 patch on the CPU. Table 6 shows the detailed results for each image in the test set. The experimental results show that the overall pixel accuracy of the proposed method is 90.54%, 2.37 percentage points higher than U-Net's 88.17%; the mPA is 91.36%, 7.5 percentage points higher than U-Net; and the mIoU is 81.53%, 6.09 percentage points higher than U-Net. The inference time of the proposed method is 2.18 s/patch, about 5 times faster than U-Net.
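The paper does not describe its timing protocol, so the following is only a plausible sketch of how the per-patch CPU inference time could be measured:

```python
import time
import torch

def time_inference(model, n_runs=20):
    """Average CPU inference time for one 224 x 224 RGB patch."""
    model.eval().cpu()
    patch = torch.randn(1, 3, 224, 224)  # placeholder input
    with torch.no_grad():
        model(patch)                      # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(patch)
    return (time.perf_counter() - start) / n_runs
```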
To verify the applicability of the method in the thermal infrared band, we apply it to band 11 of the L8 CCA dataset. Figure 6 shows the cloud detection results of the proposed method for band 11, and Table 7 shows the detailed results for each image in the test set. The experimental results show that for thermal infrared images the pixel accuracy of the proposed method reaches 90.99%. However, because the thermal infrared band of Landsat 8 has a resolution of 100 m and the input data is a single-channel image, the thermal infrared images carry less information than the visible ones, and the mPA and mIoU of the method drop to 87.36% and 80.16%, respectively.

Conclusion
In this paper, a lightweight deep learning network for cloud detection in visible and thermal infrared remote sensing images is proposed. Depthwise separable convolutions are applied to compress the U-Net model and accelerate the inference of the network. The proposed method is verified on Landsat 8 remote sensing images. Experimental results show that the proposed method can efficiently detect clouds in visible and thermal infrared images with high inference speed and low computational cost. Compared with U-Net, the cloud detection results of the proposed network in the visible bands are greatly improved, and although the mPA and mIoU decrease on the thermal infrared images, the PA remains at a high level. The parameters and FLOPs of the network are reduced to 12.4% and 12.8% of those of U-Net, respectively, and the inference speed is nearly 5 times higher. In future work, pruning, quantization, and knowledge distillation will be adopted to further compress the model and improve its effectiveness.