1 Introduction

Climate change is a real threat to the economic and existential aspects of human civilization. Both natural processes and human activities contribute to global warming, which changes weather patterns over long periods. According to NASA, “Global warming is the long-term heating of Earth’s climate system observed since the pre-industrial period (between 1850 and 1900) due to human activities, primarily fossil fuel burning, which increases heat-trapping greenhouse gas levels in Earth’s atmosphere” [1, 2]. An increase in the global average temperature accelerates the melting of glaciers and ice caps across the planet. One estimate suggests that people living in low-lying coastal areas could be impacted the most in the next 50 years. Over the last 800,000 years, there have been ice ages and warmer inter-glacial periods. After the last ice age 20,000 years ago, the average global temperature rose by about \(3^{\circ }\)C to \(8^{\circ }\)C over about 10,000 years. Scientists say that the Earth goes through natural cooling and warming cycles. However, industrialization in the last two centuries, massive fossil fuel consumption, and deforestation are prominent reasons for the acceleration in global warming. Nature fights back in different ways when humans, Earth’s dominant species, exploit it. The outbreaks of Ebola and COVID-19 are a few recent consequences of human transgressions over natural animal habitats. Irregularity in the monsoon impacts the overall crop cycles and eventually hampers the livelihood of millions of farmers. The environmental effects of climate change are broad and far-reaching, affecting oceans, ice, and weather. Changes may occur gradually or rapidly. Another consequence of global warming is either the shrinking of rivers and lakes due to increased heat or the expansion of water bodies due to glacial melt [3]. Rivers and lakes are usually the prime sources of freshwater.

It is observed that the size of a water body (that is, the area it covers) changes over long periods. Human experts can track such changes using satellite images. As the number of water bodies is massive, an Artificial Intelligence-based approach that automatically detects and monitors changes in a water body’s shape or size from satellite images is a well-suited and efficient solution. Specifically, the Detectron2 [4] Feature Pyramid Network (FPN) architecture with different backbone networks has been used to build the proposed model from Sentinel-2 satellite images of water bodies. Our main objective is to develop an approach that quantifies changes in a water body’s shape or size over time. It can be used to monitor natural changes or illegal encroachment around a water body. Besides, an ensemble (a fusion of models) has been implemented and compared against the baseline ResNet50 and ResNet101 backbone networks. The proposed approach provides a more reliable and consistent instance segmentation (that is, detection of a water body).

The paper is divided into six sections. A few recent and related research articles are discussed in Section 2. Section 3 presents the main deep learning architecture. Section 4 describes the dataset and explains the experiments. The analysis of results and the necessary visual presentation are given in Section 5. Finally, Section 6 concludes the paper.

2 Related work

Deep learning-based approaches are used for both semantic and instance segmentation of satellite images. Here, a few recent and relevant research articles are discussed in brief. In [5], the authors have implemented different data augmentation techniques to enlarge both the training and test datasets. A ResNet34 network has been used in the encoder part of a 2D UNet. The proposed model works well on the wildfire aerial datasets. An attention dilation-LinkNet (AD-LinkNet) neural network has also been proposed. It extracts contextual information from a satellite image for semantic segmentation by adopting an encoder-decoder structure, dilated convolution, and a channel-based attention mechanism. The results provide multi-scale features for objects of different scales in an image. The model outperforms D-LinkNet on the DeepGlobe road extraction competition dataset. In [6], a new hybrid framework has been proposed for semantic image segmentation. The authors have used the traditional k-means algorithm to segment an image into k regions of interest (ROIs) based on pixel similarity. Their SegNet encoder-decoder deep neural network then uses the ROIs for training and tuned feature-map extraction. Finally, a simple multi-layer perceptron network trained with the backpropagation algorithm is used as a classifier. Their approach has been examined on aerial images taken from the United States Geological Survey and DeepGlobe datasets. Automatic water-body segmentation has been discussed in [7]. The authors have proposed a restricted receptive field deconvolution network (RRF DeconvNet) to extract semantic segments of water bodies from aerial images. It gives better results than its VGG and ResNet18 variants. One recent research article [8] showcases Mask R-CNN-based instance segmentation for water-body detection from satellite images. The authors have discussed the purpose of such a study for automatic flood monitoring. Besides, Lu et al. have introduced a new approach, the CO-attention Siamese Network (COSNet), to address the object segmentation task [9]. In [10], the authors have proposed object segmentation with episodic graph memory networks. However, in both papers, the focus is on non-aerial objects in videos, mostly human activities, among others. In [11], the authors have proposed a real-time segmentation network, YOLACT. The same authors have also provided an improved version, YOLACT++ [12], in the subsequent year. However, these variants are faster in frames per second (FPS) but less accurate than the widely accepted Mask R-CNN.

This paper focuses on instance segmentation of static images rather than video object segmentation. Therefore, accuracy is more critical than FPS in our study. There is a fundamental difference between semantic and instance segmentation. Consider an image containing three cats. Semantic segmentation gives two segments: one containing all the cats, and the remaining portion labeled as background. Instance segmentation, on the other hand, segments the three cats separately, without the background. Here, the proposed models are applied to aerial images for water-body (single object) detection. Most of the images in the dataset contain only a single water body.
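The distinction can be illustrated with a toy sketch. It is not how Mask R-CNN separates instances; it only assumes that the objects do not touch, so that each 4-connected foreground blob in a binary semantic mask can receive its own instance id. The function name `label_instances` and the small grid are hypothetical and chosen purely for illustration.

```python
from collections import deque


def label_instances(semantic_mask):
    """Toy instance labelling: each 4-connected foreground blob gets its own id.

    semantic_mask is a list of rows holding 0 (background) or 1 (foreground).
    Returns (labels, count), where labels has the same shape as the input.
    """
    rows, cols = len(semantic_mask), len(semantic_mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if semantic_mask[r][c] != 1 or labels[r][c] != 0:
                continue
            # Start a new instance and flood-fill its connected pixels.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and semantic_mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Semantic view: one foreground class (1) versus background (0).
semantic = [
    [1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
]

# Instance view: the same foreground pixels, split into separate objects.
instances, instance_count = label_instances(semantic)
```

In the semantic view all foreground pixels share one label, whereas the instance view assigns a distinct id to each of the three separated blobs.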

The novelty of this paper is the high-accuracy segmentation of water bodies in high-resolution color satellite/aerial images using the state-of-the-art Detectron2 framework. It is already established that the FPN-based ResNet backbone network provides the best speed versus accuracy trade-off over the traditional RPN+FCN-based Mask R-CNN on the COCO dataset [13]. In addition, an ensemble instance segmentation framework has been proposed to analyze the change in a water body’s shape over a long period. It provides efficient and automatic monitoring of the shrinkage and expansion of lakes and wetlands across the globe. Thus, it can help administrators and policy-makers monitor water bodies in the real world without human intervention.

3 Background

The Detectron2 [14] framework has been used in this paper to implement Mask R-CNN with a Feature Pyramid Network (FPN), starting from pre-trained models [15]. Detectron2 is an open-source platform for object detection and segmentation from Facebook Artificial Intelligence Research (FAIR) [16, 17]. The complete package contains different algorithms for object detection with improved speed and scalability. It improves on its predecessor, Detectron, on multiple metrics. Distributed training on multiple GPUs is one of the major changes. Detectron2 is built on the open-source PyTorch deep learning framework. The main novelty of Detectron2 is panoptic segmentation [18, 19], which is a combination of semantic and instance segmentation. However, Mask R-CNN [20] with an FPN structure and variants of the ResNet backbone network has been used in our study for instance segmentation. The default configuration of Detectron2 is compared with our configured models. Detectron2 has three basic building blocks (refer to Fig. 1): (a) Backbone Network, (b) Region Proposal Network, and (c) Region of Interest (ROI) Head. The default framework uses Stochastic Gradient Descent (SGD) with a learning rate (lr) of \(1e-3\) and a momentum of 0.9. In our variation, the learning rate has been changed to \(1e-4\). Besides, the recently proposed AdaBelief optimization technique [21] has also been employed for comparison. AdaBelief uses lr=\(1e-4\), betas=(0.9, 0.999), eps=\(1e-16\) and a fixed weight decay. Non-maximum suppression (NMS) has been used to eliminate overlapping boxes [22, 23]. The model is run for 10000 iterations in all of our experiments.
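To make the NMS step concrete, the following is a minimal pure-Python sketch of the standard greedy procedure. It is not the Detectron2 implementation; the function names `box_iou` and `nms`, the IoU threshold of 0.5, and the example boxes and scores are assumptions made only for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much.

    Returns the indices of the kept boxes, ordered by decreasing score.
    The threshold value is an assumption, not a setting taken from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object and one separate detection.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU above the threshold and is suppressed, while the third box does not overlap and is retained.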

All experiments in this study have been performed on Google Colab, a free cloud-based Jupyter notebook environment, with a GPU and the dependencies required by the Detectron2 package installed. Three primary ResNet backbone networks, R50_FPN_1x, R50_FPN_3x, and R101_FPN_3x, have been used without any change. The other important modified configuration values are as follows:

figure a
Fig. 1 Block diagram of three major components of Detectron2 Architecture

4 Proposed Pipeline and Experiments

A straightforward pipeline for satellite image segmentation is given in Fig. 2. This study focuses on the third step of the pipeline, marked with a red bounding box.

Fig. 2 The diagram of the proposed pipeline for an efficient monitoring of water-bodies from the satellite images

4.1 Dataset

The dataset has a total of 2811 images, out of which 1087 images have been selected and annotated for water-body segmentation. The images have been taken from the Kaggle [24] Sentinel-2 Satellite Images of Water Bodies dataset. Image resolutions range from \(2699\times 3771\) (high) down to \(54\times 44\) (low). A single image contains a single water body. The raw input images are resized to \(448\times 448\) before being fed to a segmentation model. In practice, a proper image registration step needs to be implemented before image segmentation. However, the dataset used here already contains pre-processed satellite images, so the images can be used directly for model building (training). Besides this dataset, Google time-lapse videos are also used to examine the proposed Detectron2 instance segmentation model. The time-lapse videos include the Aral lake (Footnote 1). High-resolution frames containing different water bodies across the planet are captured from the publicly available Google time-lapse videos. Each frame corresponds to a single year from 1984 to 2016, that is, a total of 32 years of transformation.

The Microsoft open-source Visual Object Tagging Tool (VoTT; Footnote 6) has been used for image annotation, and cvat.org for converting the annotations to the COCO format compatible with Detectron2.

4.2 Experimental setup

The paper has three separate experiments related to a common problem of water-body instance segmentation.

Experiment-1: In this experiment, we have trained a Detectron2 architecture with different backbone networks using 1087 training images and 103 validation images. The mAP, i.e., the average precision (AP) [25] for a single class, has been computed at IoU thresholds of 0.50, 0.75 and 0.50:0.95.
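The role of the IoU thresholds can be sketched as follows. This is only an illustration of the thresholding step, under the assumption that masks are binary grids; a full AP computation additionally ranks predictions by confidence and integrates the precision-recall curve, which is omitted here. The name `mask_iou` and the toy masks are hypothetical; the ten thresholds from 0.50 to 0.95 in steps of 0.05 follow the COCO convention behind \(AP_{0.50:0.95}\).

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter / union if union else 0.0


# COCO-style thresholds 0.50, 0.55, ..., 0.95 used for AP_{0.50:0.95}.
IOU_THRESHOLDS = [(50 + 5 * k) / 100 for k in range(10)]

# Toy annotation (8 pixels) and toy prediction (7 pixels, 6 of them overlapping).
truth = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]

iou = mask_iou(pred, truth)
# A prediction counts as a true positive at a threshold if its IoU reaches it.
is_true_positive = {t: iou >= t for t in IOU_THRESHOLDS}
```

In this toy case the IoU is 6/9, so the prediction would count as a match for \(AP_{50}\) but not for \(AP_{75}\).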

Experiment-2: In this experiment, we have randomly selected four water-body images from the validation set to validate the true positive instance segmentation and its accuracy (%). The objective is to achieve a segmentation accuracy that is as high as possible (that is, the predicted pixels cover the actual annotated pixels).

Experiment-3: In this experiment, Google time-lapse images of five famous lakes for the periods 1984 to 2000 and 2000 to 2016 are used to automatically observe whether the area of each water body has increased or decreased over time. The objective is to decide whether the water body shrinks \(\downarrow\) or expands \(\uparrow\) over each 16-year period.
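The decision in Experiment-3 amounts to comparing segmented pixel counts between consecutive snapshots. A minimal sketch is given below, assuming that the per-year pixel count of the predicted mask is already available. The function name `area_trend` is hypothetical, and the pixel counts are made-up placeholders, not values from this study.

```python
def area_trend(pixels_by_year):
    """Label each consecutive pair of years as shrinking, expanding or unchanged.

    pixels_by_year maps a year to the number of pixels in the predicted mask.
    Returns a list of (start_year, end_year, label) tuples in chronological order.
    """
    years = sorted(pixels_by_year)
    trend = []
    for start, end in zip(years, years[1:]):
        delta = pixels_by_year[end] - pixels_by_year[start]
        if delta < 0:
            label = "shrinking"
        elif delta > 0:
            label = "expanding"
        else:
            label = "unchanged"
        trend.append((start, end, label))
    return trend


# Placeholder pixel counts for illustration only (not results from the paper).
example_counts = {1984: 52000, 2000: 31000, 2016: 9000}
example_trend = area_trend(example_counts)
```

With these placeholder counts, both 16-year periods are labelled as shrinking.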

The implementation pipeline (see Algorithm 1) can be described as follows:

  (i) It takes aerial/satellite images or video as input.
  (ii) A suitable pre-processing step, such as image registration, is required to stitch together different images of the same area captured from multiple views.
  (iii) A Detectron2 Mask R-CNN model with an FPN ResNet backbone is trained on the training dataset (annotated images).
  (iv) Once the model is built, it is used to generate a mask (object segment) for each image in the test dataset.
  (v) The generated masks are compared with the actual human-annotated polygons.
  (vi) Finally, the segmentation accuracy (%) and mAP are computed to evaluate the performance of a particular model.

figure b

Similarly, an ensemble pipeline (see Fig. 3) has also been proposed in this paper. The difference is that it uses multiple backbone networks to obtain separate masks for a single input image. It then combines the masks based on either their pixel-wise intersection or their pixel-wise union. It is observed that this pipeline is more appropriate for evaluating the year-wise shape changes of a water body.
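The pixel-wise combination can be sketched as follows, assuming each base model yields a binary mask of the same shape. The function names `combine_masks` and `mask_area` and the three toy masks are hypothetical and serve only as an illustration of the intersection and union rules.

```python
def combine_masks(masks, mode="intersection"):
    """Fuse binary masks pixel by pixel.

    intersection: a pixel is kept only if every mask marks it.
    union: a pixel is kept if at least one mask marks it.
    """
    if mode not in ("intersection", "union"):
        raise ValueError("mode must be 'intersection' or 'union'")
    vote = all if mode == "intersection" else any
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if vote(mask[r][c] for mask in masks) else 0 for c in range(cols)]
        for r in range(rows)
    ]


def mask_area(mask):
    """Number of foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)


# Three toy masks standing in for the outputs of three backbone networks.
mask_a = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
mask_b = [[1, 1, 1], [1, 1, 0], [0, 0, 0]]
mask_c = [[0, 1, 1], [1, 1, 0], [0, 1, 0]]

fused_intersection = combine_masks([mask_a, mask_b, mask_c], "intersection")
fused_union = combine_masks([mask_a, mask_b, mask_c], "union")
```

The intersection is the more conservative of the two rules, since a pixel survives only when all base models agree on it, whereas the union keeps every pixel proposed by any model.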

Fig. 3 Proposed ensemble model workflow diagram

5 Results and analysis

Detectron2 architecture with ResNet50 and ResNet101 backbone networks has been implemented in different combinations to examine instance segmentation performance. Two distinct optimizers, Stochastic Gradient Descent (SGD) and AdaBelief [26], have also been used to compare convergence speed in this study. The training loss has been recorded for each optimizer, once with 300 epochs and again with 1000 epochs. From Table 1, we observe that AdaBelief outperforms SGD in the early iterations, but SGD matches AdaBelief with a larger number of epochs (see Fig. 4).

Table 1 Results obtained from different combinations of backbones and optimizers (SGD and AdaBelief) in Detectron2 instance segmentation architecture
Fig. 4 Plots obtained from different combinations of backbones, total loss and optimizer (SGD and AdaBelief) in Detectron2 instance segmentation architecture for 300 and 1000 epochs

The results obtained from R50_FPN_1x, R50_FPN_3x and R101_FPN_3x on the 103 test images with different optimization techniques are given in Tables 2 and 3 for (default) SGD and AdaBelief, respectively. Both the bounding box and the segmentation (polygon) metrics have been computed to evaluate the model’s quality. The models have been trained for 10000 iterations. The difference between the SGD-based model and the AdaBelief-based model is insignificant for R50_FPN_3x; for the other backbones, SGD provides noticeable improvements over its AdaBelief counterparts. Another important piece of empirical evidence is the frames per second (FPS) comparison. The overall FPS for SGD-based Detectron2 is significantly higher than for the same architecture with the AdaBelief optimizer, irrespective of the backbone network.

Here, R101_FPN_3x has the higher average precision values (\(AP_{0.50:0.95}\) and \(AP_{50}\)) on the test images compared with the remaining models. However, R50_FPN_3x outperforms the others due to its highest FPS rate and smaller model size (313MB). On the other hand, the same backbone network outperforms even R101_FPN_3x in terms of \(AP_{0.50:0.95}\), \(AP_{50}\) and FPS. The empirical analysis leads to the conclusion that ResNet50 with a Feature Pyramid Network (3x) is the best performing backbone network among all the employed variants. Therefore, the rest of this paper is presented based on the R50_FPN_3x (SGD) model.

Table 2 Results obtained from Average Precision (AP), \(AP_{50}\) and \(AP_{75}\) using different backbones and SGD optimizer in Detectron2 instance segmentation architecture
Table 3 Results obtained from Average Precision (AP), \(AP_{50}\) and \(AP_{75}\) using different backbones and AdaBelief optimizer in Detectron2 instance segmentation architecture
Fig. 5 Visualization of water_body_3.jpg validation image and its predicted instance segmentation using Detectron2 ResNet50 FPN (3x) model
Fig. 6 Visualization of water_body_104.jpg validation image and its predicted instance segmentation using Detectron2 ResNet50 FPN (3x) model
Fig. 7 Visualization of water_body_1421.jpg validation image and its predicted instance segmentation using Detectron2 ResNet50 FPN (3x) model
Fig. 8 Visualization of water_body_7234.jpg validation image and its predicted instance segmentation using Detectron2 ResNet50 FPN (3x) model
Fig. 9 The output of Detectron2 ensemble instance segmentation of the Guozha Cuo Lake; Three models provide three different overlapping masks of pink, brown and green segmentation, respectively

Four validation images (wb_3.jpg, wb_104.jpg, wb_1421.jpg, and wb_7234.jpg) out of the 103 test images are shown in Figs. 5, 6, 7 and 8, each with its actual image, predicted segment, annotated mask, and predicted mask. The purpose of this visual representation is to demonstrate the model’s efficiency. The only observable difference lies in the edges of the annotated and the predicted segments. This is because the hand annotation is done using polygons (multiple straight lines), while the predicted edges are smoother due to the continuous loss function. Table 4 shows the instance segmentation accuracy (%) of the best performing Detectron2 architecture. One can observe that the segmentation accuracies are more than 85% for these randomly chosen test images. This has been validated with other test images as well. Here, it must be noted that the segmentation accuracy indicates the ratio between the number of actual pixels within the annotation and the predicted number of pixels within the same annotated area.
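One possible reading of this accuracy measure is sketched below: the share of annotated pixels that the predicted mask also covers, expressed as a percentage. This interpretation is an assumption based on the description above, and the function name `segmentation_accuracy` and the toy masks are hypothetical.

```python
def segmentation_accuracy(pred, annotation):
    """Percentage of annotated pixels that are also covered by the prediction.

    Both arguments are binary masks of equal shape. Predicted pixels that fall
    outside the annotated area are ignored by this measure.
    """
    annotated = 0
    covered = 0
    for pred_row, ann_row in zip(pred, annotation):
        for p, a in zip(pred_row, ann_row):
            if a:
                annotated += 1
                if p:
                    covered += 1
    if annotated == 0:
        raise ValueError("annotation mask contains no foreground pixels")
    return 100.0 * covered / annotated


# Toy masks: 8 annotated pixels, 7 of which the prediction also covers.
annotation = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
prediction = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]

accuracy = segmentation_accuracy(prediction, annotation)
```

Because predicted pixels outside the annotation do not lower this value, it is best read together with IoU-based metrics such as AP.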

Table 4 Results obtained from randomly selected four validation images (wb_3.jpg, wb_104.jpg, wb_1421.jpg and wb_7234.jpg) using Detectron2 ResNet50 FPN 3x instance segmentation model (values are the predicted number of pixels in a segment)
Table 5 Results obtained from Detectron2 instance segmentation architectures (values are the predicted number of pixels in a segment)
Fig. 10 Visualization of time lapsed images of Aral Lake (1984/2000/2016) and its predicted instance segmentation using Detectron2 ResNet50 FPN (3x) model (Shrinking \(\downarrow\))
Fig. 11 Visualization of time lapsed images of Tibet Lake (1984/2000/2016) and its predicted instance segmentation using Detectron2 ResNet50 FPN (3x) model (Expanding \(\uparrow\))
Fig. 12 Comparison of Detectron2 ensemble (intersection) instance segmentation over 32 years time-span (1984 to 2016) for a few popular lakes across the world (values are the predicted number of pixels (px.) in a segment)

An ensemble instance segmentation model has also been examined on the same dataset. In its intersection form, the ensemble considers a pixel valid if and only if all the base segmentation models predict it. An example of the proposed model applied to the Guozha Cuo Lake is shown in Fig. 9. The base instance segmentation models provide overlapping masks indicated by pink, brown, and green segmentation. Finally, all the primary and proposed ensemble models (both the union and the intersection models) have been tested on images of five popular lakes captured by ESA’s Sentinel-2 Satellite. These are compiled as Google Earth time-lapse videos covering 32 years, from 1984 to 2016. The detailed results of the study are provided in Table 5. It contains the total number of segmented pixels for each lake, by model and by year. This study aims to demonstrate the efficient use of a deep learning approach for automated monitoring of the shrinkage or expansion of a water body from satellite images. In Table 5, the area covered by these lakes has been shrinking, except for the Tibet lake, where the amount of water increases over the same period. Aral lake, Lake Mead, Lake Poopó, and South Dead Sea are drying out mostly due to climate change (natural cause) [27,28,29,30,31]. On the other hand, the Tibet lake has been expanding its boundary as it is situated in the Himalayan mountain range and benefits from glacial melting [32]. This paper plots two prominent examples (Aral Lake and Tibet Lake) with their actual images, segmented images, and predicted masks in Figs. 10 and 11, respectively, for the years 1984, 2000 and 2016 (that is, at 16-year intervals). All the models reflect the same trends of shrinkage or expansion, with the exception of Lake Poopó and the Tibet lake. However, the proposed intersection-based ensemble model gives consistently descending or ascending segmentation areas even for Lake Poopó and the Tibet lake, unlike the remaining models. A comparison bar plot is also given in Fig. 12 to visualize the shrinkage and expansion of all five lakes.

A survey of articles from NASA, National Geographic, the BBC and others has been carried out to validate our observations. The outcome of this paper agrees with their analysis and conclusions. For a detailed understanding, one can refer to these articles [33,34,35].

6 Conclusion

Global warming is a significant issue in the \(21^{st}\) century. Reckless and unplanned industrialization puts immense strain on nature. Some effects can be observed with the naked eye. However, sometimes the change happens over a longer time. Modern computer vision techniques can track these drastic changes based on satellite images of different geographical entities. Here, satellite images of water bodies have been used to build the prediction model using the proposed ensemble Detectron2 instance segmentation architecture. Finally, Google time-lapse satellite images of a few famous lakes across the planet have been used to understand the grim reality of global warming. It is observed that most of the studied water bodies have shrunk over 32 years. However, the water body in Tibet has grown over the same period, which suggests the melting of many Himalayan glaciers in that region. The best performing model out of all the variants is Mask R-CNN with a ResNet50+FPN backbone and the SGD optimizer. It achieves 99.50, 94.51 and \(\sim 10\) with SGD, against 97.99, 96.02 and \(\sim 7\) with the AdaBelief optimizer, for \(AP_{50}\), \(AP_{75}\) and FPS, respectively.

In the future, we can extend our work to multi-spectral or multi-modal satellite image segmentation. A real-world application could focus on a specific metropolitan area, where the proposed pipeline can be employed to automatically identify and monitor changes in different water bodies within a single fiscal year.