1 Introduction

Change detection aims to locate land cover variations between two multi-temporal images and identify their categories with pixel-wise boundaries [1, 2]; it is a crucial image interpretation task in various remote sensing applications [3]. In the literature, binary change detection (BCD) was first formulated as detecting the locations of changed pixels between the input multi-temporal images [4, 5]. However, BCD methods focus only on the changed locations and overlook the specific categories. This limits the applicability of BCD, since identifying change types is usually essential for diverse applications, e.g., urban planning [6] and natural resource management [7]. Therefore, in recent years, semantic change detection has become an active research topic in this field [1, 8, 9].

With the rapid development of deep learning techniques, deep semantic change detection methods have made great progress. In particular, by considering the correlation between land cover mapping (LCM) and semantic change detection (SCD), many representative Siamese network models with multi-task learning have been proposed [2, 10, 11]. However, as illustrated in Fig. 1, some important issues remain unaddressed. More specifically, owing to the complexity of real-world change processes, some change categories are difficult to distinguish from one another, e.g., “water-vegetation” vs. “water-tree”, which can be regarded as fine-grained differences. In addition, slight shifts in image registration and the difficulty of annotating object boundaries cause ambiguity among these categories, known as label ambiguity. Moreover, change is a continuous rather than a discrete process, which further contributes to label ambiguity. Both the fine-grained nature and the label ambiguity make SCD a challenging task, and neither has been fully investigated.

Figure 1

Samples of multi-temporal images in SCD. Its fine-grained nature and label ambiguity can be clearly observed, e.g., the pixel regions of “tree” vs. those of “vegetation”

In this paper, we propose a coarse-to-fine attention tree (CAT) model to address these challenges. The key difference from previous work is the coarse-to-fine idea illustrated in Fig. 2. Human visual recognition typically follows a coarse-to-fine procedure: we tend to first solve an easy task, i.e., determining whether a change belongs to a coarse-grained category, e.g., “water-land cover”, and then, within this candidate set of change types, solve the harder task of judging the fine-grained category, such as “water-vegetation” or “water-ground”. Motivated by this, we realize a coarse-to-fine attention tree to model the label hierarchy of change categories and to capture the discriminative pixel regions for better SCD predictions. To handle label ambiguity, we convert the original pixel-level labels in SCD/LCM into label distributions and use them as the ground truth to drive model training. By introducing a label distribution learning loss [12, 13], the proposed CAT model can be optimized in an end-to-end manner. Beyond that, as a unified model, CAT also adopts a popular encoder-decoder structure for representation learning in its early stage, as shown in Fig. 3. In experiments, the proposed CAT model is evaluated on the large-scale SECOND dataset for SCD [1]. Empirical results and ablation studies validate the effectiveness of the proposed CAT method and our design choices.

Figure 2

An illustration of coarse-to-fine hierarchical information as a crucial cue for the semantic change detection task

Figure 3

Overall framework of our CAT model, which consists of the encoder, the decoder and the coarse-to-fine attention tree. The inputs are two multi-temporal images, while the outputs are two land cover mapping predictions regarding the inputs, as well as a prediction of semantic change detection

The main contributions of this paper are listed as follows:

  1) The fine-grained nature and label ambiguity in the semantic change detection task are investigated, and a novel coarse-to-fine attention tree is proposed to deal with these challenges.

  2) A tree structure is developed to model the label hierarchy of change categories at different granularity levels and to capture discriminative pixel-level regions, and a label distribution learning loss is introduced to alleviate label ambiguity.

  3) Experiments and various ablation studies are conducted on a large-scale semantic change detection dataset, demonstrating superior results compared with state-of-the-art methods and baselines.

The rest of this paper is organized as follows. Section 2 reviews the related work of semantic change detection and hierarchical architecture. Section 3 elaborately presents the proposed CAT model, as well as its main components. In Sect. 4, we report the empirical settings and experimental results for evaluating the effectiveness of our CAT model, as well as the ablation studies for in-depth investigation. Finally, Sect. 5 draws the conclusions and discusses promising future work.

2 Related work

We briefly review the previous work in the literature from two related aspects, semantic change detection and hierarchical architecture.

2.1 Semantic change detection

Change detection using multi-temporal remote sensing images is an important technique for monitoring dynamic changes on the land surface [4, 5, 9]. With the rapid development of deep learning techniques, recent semantic change detection (SCD) methods based on deep learning have achieved great success [1, 14–17].

More specifically, Caye et al. [18] proposed an iterative learning method to train a fully convolutional neural network (CNN) for detecting changes from noisy data. Papadomanolaki et al. [19] combined a recurrent neural network (RNN) with a fully convolutional neural network, using multi-sequence high-resolution data for urban change detection. Siamese neural networks can evaluate the similarity between two images, which makes them effective for the SCD task. Liu et al. [14] designed a deep convolutional coupling network for detecting changes from two multi-temporal images. Daudt et al. [10] proposed two Siamese expansion modules for ordinary and multispectral images, and fused them into a fully convolutional neural network to perform SCD. Chen et al. [11] combined the advantages of CNNs and RNNs and proposed a deep Siamese convolutional recurrent neural network for change detection, which can also be applied to homogeneous and heterogeneous high-resolution remote sensing images. Du et al. [20] employed two symmetrical neural networks to extract features of dual time-series images, and then used slow feature analysis (SFA) to highlight the changing parts of the transformed features. Based on kernel principal component analysis convolutions, Wu et al. [21] proposed an unsupervised deep Siamese convolutional network for binary and multi-category change detection. Very recently, to better evaluate SCD methods, Yang et al. [1] constructed a well-annotated benchmark dataset, the SEmantic Change detectiON Dataset (SECOND), and proposed an asymmetric Siamese network. Later, Xia et al. [2] designed a deep Siamese post-classification fusion network to address the accumulation of misclassification errors.

2.2 Hierarchical architecture

Hierarchical architecture, which aims to capture information at different scales, is a common practice in computer vision and pattern recognition. Early on, multi-stage networks were developed for handwritten digit recognition [22]; in the deep learning era, they were extended to object recognition [23]. In particular, feature pyramid networks were employed to capture features at different scales in object detection and segmentation [24, 25]. Recently, many researchers have developed multi-level Transformer structures to boost the performance of various vision tasks [26, 27].

In addition to those general models, researchers have developed different modules for specific vision tasks, including fine-grained image analysis [28] and remote sensing [1, 29]. Specifically, hierarchical convolutions [30] and hierarchical bilinear pooling [31] were proposed to capture fine-grained cues in visual objects. Multi-level recurrent units along with attention modules were developed for vehicle re-identification [32]. On the other hand, in remote sensing, cascade detection frameworks were applied to high-resolution object detection [33], and random walk was adopted in a hierarchical fashion for the classification of hyperspectral and LiDAR data [34]. In contrast to previous work, in this paper, we introduce a coarse-to-fine hierarchical tree structure to model the dependence among categories at different granularities.

3 Methodology

In this section, we elaborate on the proposed coarse-to-fine attention tree (CAT) method from the following aspects: the overall architecture, the coarse-to-fine attention tree module, and the loss functions.

3.1 Notations and overall architecture

In general, the semantic change detection (SCD) task takes a pair of multi-temporal images as inputs, denoted as \({I}_{1}\) and \({I}_{2}\). Specifically, \(\Omega =\{0,1,\ldots ,H-1\}\times \{0,1,\ldots , W-1\}\) is the image grid of the multi-temporal image, where H and W are the height and width, respectively. Given a set \(\mathcal{L}=\{y_{1},\ldots ,y_{C}\}\) of C semantic categories, the SCD problem aims to learn a mapping function \(f_{I_{1}, I_{2}}: \Omega \rightarrow \mathcal{L}^{2}\) such that

$$ \forall \boldsymbol{p} \in \Omega ,\quad f_{I_{1}, I_{2}}( \boldsymbol{p})=\textstyle\begin{cases} (0, 0), & \text{if } \mathcal{C}_{I_{1}, I_{2}}(\boldsymbol{p})< \tau, \\ (l_{1},l_{2}), & \text{otherwise}, \end{cases} $$
(1)

where \(\mathcal{C}_{I_{1}, I_{2}}(\boldsymbol{p})\) measures the change probability of each pixel \(\boldsymbol{p} \in \Omega \), \(l_{1}, l_{2}\in \mathcal{L}\), and \((0,0)\) indicates the non-change class. τ is a scalar threshold on \(\mathcal{C}_{I_{1}, I_{2}}\). Therefore, \(f_{I_{1}, I_{2}}(\cdot )\) locates changed regions and identifies their categories simultaneously.
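To make the mapping in Equation (1) concrete, the following minimal sketch applies the thresholding rule at inference time. It assumes NumPy arrays, and all names (`apply_change_threshold`, `change_prob`, etc.) are illustrative rather than part of a released implementation.

```python
# A minimal sketch of Eq. (1) at inference time, assuming NumPy arrays.
import numpy as np

def apply_change_threshold(change_prob, labels_t1, labels_t2, tau=0.5):
    """Combine per-pixel change probability with semantic labels (Eq. 1).

    change_prob: (H, W) float array, the change probability C_{I1,I2}(p).
    labels_t1, labels_t2: (H, W) int arrays of category indices l1, l2.
    Returns two (H, W) maps where unchanged pixels are set to 0 (non-change).
    """
    changed = change_prob >= tau              # thresholding on C_{I1,I2}
    out_t1 = np.where(changed, labels_t1, 0)  # (0, 0) marks non-change
    out_t2 = np.where(changed, labels_t2, 0)
    return out_t1, out_t2
```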

As illustrated in Fig. 3, the framework of CAT consists of several components, including the encoder, the decoder, and the most important coarse-to-fine attention tree. Specifically, we follow the encoder-decoder structure in [2] for representation learning. The encoder \(\operatorname{enc}(\cdot )\) extracts the information from multi-temporal images separately to form the feature maps with multiple scales. For the i-th scale, \(\boldsymbol{x}^{i}_{1}\) and \(\boldsymbol{x}^{i}_{2}\) are the corresponding feature maps obtained via

$$\begin{aligned}& \boldsymbol{x}^{i}_{1}= \operatorname{enc}(I_{1}; i) \in \mathbb{R}^{h^{i}\times w^{i} \times d^{i}} , \end{aligned}$$
(2)
$$\begin{aligned}& \boldsymbol{x}^{i}_{2}= \operatorname{enc}(I_{2}; i) \in \mathbb{R}^{h^{i}\times w^{i} \times d^{i}} , \end{aligned}$$
(3)

where \(h^{i}\), \(w^{i}\) and \(d^{i}\) are the dimensions of the feature maps at the i-th scale. After encoding, in the decoding phase, feature maps are gradually decoded from small to large scales in a two-stream fashion until their resolution equals that of the original input images, which serves the subsequent LCM and SCD tasks. The two streams in the decoder \(\operatorname{dec}(\cdot )\) correspond to the two temporal land cover images, respectively. At the i-th scale of \(\operatorname{dec}(\cdot )\), we concatenate the feature maps of the same scale from \(\operatorname{enc}(\cdot )\) with the upsampled results from the \((i-1)\)-th scale of \(\operatorname{dec}(\cdot )\) to form the representation \(\boldsymbol{x}^{\prime i}_{1}/\boldsymbol{x}^{\prime i}_{2}\) of the i-th scale in \(\operatorname{dec}(\cdot )\). The upsampled results are obtained by applying a sub-network \(\operatorname{Net}(\cdot )\) to the concatenation of \(\boldsymbol{x}^{\prime i-1}_{1}\) and \(\boldsymbol{x}^{\prime i-1}_{2}\). The process is formulated as

$$ \boldsymbol{x}^{\prime i}_{1} = \bigl[\operatorname{Net} \bigl( \bigl[\boldsymbol{x}^{\prime i-1}_{1}; \boldsymbol{x}^{\prime i-1}_{2} \bigr] \bigr); \boldsymbol{x}^{i}_{1}; \boldsymbol{x}^{i}_{2} \bigr] , $$
(4)

where \([\cdot ;\ldots ;\cdot ]\) is the concatenation operation, and the implementation of \(\operatorname{Net}(\cdot )\) can be found in the experimental details in Sect. 4. As a special case, \(\boldsymbol{x}^{\prime 1}_{1}\) in \(\operatorname{dec}(\cdot )\) has no representation from a previous scale and reduces to \([\boldsymbol{x}^{1}_{1}; \boldsymbol{x}^{1}_{2} ]\). Eventually, we obtain the final representations of the two streams, i.e., \(\boldsymbol{x}_{\mathrm{LCM}1}\) and \(\boldsymbol{x}_{\mathrm{LCM}2}\), based on \(\boldsymbol{x}^{\prime 5}_{1}\) and \(\boldsymbol{x}^{\prime 5}_{2}\).
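For clarity, the sketch below instantiates one decoding step of Equation (4) in PyTorch. The text only gives the first stream explicitly, so the symmetric formation of \(\boldsymbol{x}^{\prime i}_{2}\) (with swapped encoder order) is our assumption, and all names are illustrative.

```python
# A minimal PyTorch sketch of one decoding step (Eq. (4)); `net` plays the
# role of Net(.) from Sect. 4.
import torch

def decode_step(net, x1_prev, x2_prev, x1_enc, x2_enc):
    # Net(.) upsamples and fuses the previous-scale two-stream features
    up = net(torch.cat([x1_prev, x2_prev], dim=1))
    # Eq. (4): concatenate with both encoder feature maps of the i-th scale
    x1_out = torch.cat([up, x1_enc, x2_enc], dim=1)
    # the second stream is assumed symmetric (encoder order swapped)
    x2_out = torch.cat([up, x2_enc, x1_enc], dim=1)
    return x1_out, x2_out
```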

After that, we calculate the differences between \(\boldsymbol{x}_{\mathrm{LCM}1}\) and \(\boldsymbol{x}_{\mathrm{LCM}2}\) as \(\boldsymbol{x}_{\mathrm{SCD}}\) for semantic change detection in Equation (5):

$$ \boldsymbol{x}_{\mathrm{SCD}} = \|\boldsymbol{x}_{\mathrm{LCM}1}- \boldsymbol{x}_{\mathrm{LCM}2}\| . $$
(5)

Since LCM and SCD can benefit each other, \(\boldsymbol{x}_{\mathrm{LCM}1}\), \(\boldsymbol{x}_{\mathrm{LCM}2}\) and \(\boldsymbol{x}_{\mathrm{SCD}}\) are aggregated as multiple modalities and fed into the coarse-to-fine attention tree for both LCM and SCD predictions. The details of the coarse-to-fine attention tree module and the loss functions of our CAT model are elaborated in the following subsections.

3.2 Coarse-to-fine attention tree

Inspired by the human identification process (cf. Figure 2), we first develop the coarse-to-fine hierarchical module: humans tend to make coarse-grained predictions first, e.g., “water-land cover”, and then fine-grained ones, e.g., “water-vegetation” or “water-building”. Meanwhile, considering the fine-grained nature of SCD, to better capture detailed changing regions while modeling the coarse-to-fine hierarchy, we combine the attention mechanism and the hierarchical structure into one and propose a coarse-to-fine attention tree.

As mentioned above, the input of the coarse-to-fine attention tree is the aggregated representations of \(\boldsymbol{x}_{\mathrm{LCM}1}\), \(\boldsymbol{x}_{\mathrm{LCM}2}\) and \(\boldsymbol{x}_{\mathrm{SCD}}\), which is denoted by

$$ \boldsymbol{x}_{\mathrm{AGG}}= [\boldsymbol{x}_{\mathrm{LCM}1}; \boldsymbol{x}_{\mathrm{LCM}2};\boldsymbol{x}_{\mathrm{SCD}} ] . $$
(6)

Since it aggregates tri-sourced information, we term it “tri-aggregation”; ablation studies in Table 2 validate its effectiveness. Furthermore, to better leverage the correlation between land cover mapping and semantic change detection, we formulate them into a multi-task learning framework, depicted on the right of Fig. 3, where each task corresponds to a coarse-to-fine attention tree, cf. Figure 4.
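As a concrete reading of Equations (5) and (6), the sketch below computes the difference branch and the tri-aggregation in PyTorch; interpreting \(\|\cdot \|\) as the element-wise absolute difference is our assumption.

```python
# A minimal sketch of Eqs. (5)-(6), assuming (N, C, H, W) PyTorch tensors.
import torch

def tri_aggregate(x_lcm1, x_lcm2):
    x_scd = torch.abs(x_lcm1 - x_lcm2)                 # Eq. (5), assumed |.|
    x_agg = torch.cat([x_lcm1, x_lcm2, x_scd], dim=1)  # Eq. (6)
    return x_agg, x_scd
```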

Figure 4

The proposed coarse-to-fine attention tree. The pink rounded rectangle represents the attention sub-modules, cf. Figure 6

In each tree, there are three levels of nodes: the root node in the first level is the input, the nodes in the second level are employed for coarse-grained predictions, and the leaf nodes in the third level are for fine-grained predictions. For a specific input to the tree, a branch routing strategy determines which branch it takes.

Branch routing

In the proposed CAT model, we design a strategy to divide the branches of the tree and determine the flow of data, cf. Figure 5. Concretely, \(\boldsymbol{x}_{\mathrm{AGG}}\) is first processed by a \(1\times 1\) convolution that aggregates the information into a unified representation. Then, we apply channel attention, as illustrated in Fig. 5, to capture discriminative information. Finally, several operations are applied in sequence to the resulting activation tensor, i.e., average pooling, signed square-root normalization, \(\ell _{2}\)-normalization, a \(1\times 1\) convolution, and the sigmoid function, yielding the routing parameter s. The routing strategy is: if s is larger than 0.5, the data flow goes to the right branch of the tree; otherwise, it goes to the left. Since the two branches have different attention processes (cf. Figure 4), they can capture patterns at different granularities, e.g., the left branch learns more fine-grained features thanks to its stacked attention modules. Note that the direction of the data flow is thus learned automatically in the CAT model.
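The following PyTorch sketch chains these routing operations. The paper does not detail the internal design of the channel attention, so a squeeze-and-excitation-style block is assumed here, and all layer hyper-parameters are illustrative.

```python
# A sketch of the branch routing (cf. Fig. 5), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Assumed SE-style channel attention (internal design not specified)."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(F.adaptive_avg_pool2d(x, 1).flatten(1))
        return x * w.view(x.size(0), -1, 1, 1)

class BranchRouting(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(ch, ch, kernel_size=1)  # 1x1 conv aggregation
        self.attn = ChannelAttention(ch)
        self.fc = nn.Conv2d(ch, 1, kernel_size=1)     # 1x1 conv to a scalar

    def forward(self, x_agg):
        x = self.attn(self.fuse(x_agg))
        v = F.adaptive_avg_pool2d(x, 1)                       # average pooling
        v = torch.sign(v) * torch.sqrt(torch.abs(v) + 1e-12)  # signed sqrt
        v = F.normalize(v.flatten(1), dim=1).view_as(v)       # l2 normalization
        return torch.sigmoid(self.fc(v))  # routing parameter s in (0, 1)
```

At inference, s > 0.5 routes the data to the right branch and s ≤ 0.5 to the left, as described above.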

Figure 5

Branch routing in the proposed CAT model

Attention sub-module in the tree

To introduce diversity and capacity into the model, we develop an asymmetric attention tree in the CAT model, cf. Figure 4. As illustrated, the left and right branches of the attention tree have two and one attention sub-modules, respectively. Different numbers of attention sub-modules bring different model complexities and, more importantly, allow the two branches to learn different patterns instead of redundant ones. The detailed attention sub-module is presented in Fig. 6. It consists of a channel attention and an atrous spatial pyramid pooling (ASPP) [35]. Specifically, the channel attention is the same as that in the branch routing. The ASPP has four parallel dilated convolutions with different dilation rates, i.e., 1, 6, 12, and 18. After that, we concatenate the results from the different dilated convolutions and apply a \(1\times 1\) convolution to aggregate the information as the final output of the attention sub-module. Beyond channel attention, we integrate ASPP [35] because it provides feature maps at different scales via dilated convolutions. This benefits fine-grained SCD predictions, especially pixel-level predictions [28, 35].
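A possible instantiation of this sub-module is sketched below in PyTorch. The dilation rates 1, 6, 12 and 18 follow the text, while the channel widths and the injected `channel_attn` module (e.g., the assumed SE-style block above) are illustrative assumptions.

```python
# A sketch of the attention sub-module: channel attention followed by ASPP.
import torch
import torch.nn as nn

class AttentionSubModule(nn.Module):
    def __init__(self, in_ch, out_ch, channel_attn: nn.Module):
        super().__init__()
        self.attn = channel_attn  # same design as in the branch routing
        self.aspp = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in (1, 6, 12, 18)            # four parallel dilated convs
        ])
        self.project = nn.Conv2d(4 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        x = self.attn(x)
        feats = torch.cat([branch(x) for branch in self.aspp], dim=1)
        return self.project(feats)             # 1x1 conv aggregation
```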

Figure 6

Attention sub-module in our coarse-to-fine attention tree

Label predictions on the nodes

After performing the attention sub-module of a branch, we upsample its output and apply batch normalization [36], followed by a \(3\times 3\) convolution, a ReLU activation function and a \(1\times 1\) convolution. The final prediction \(\hat{\boldsymbol{p}} \in \mathbb{R}^{C\times H\times W}\) is then obtained, where C is the number of categories in the SCD or LCM task. As shown in the attention tree in Fig. 4, the CAT model yields four fine-grained predictions from the leaf nodes, two coarse-grained predictions from the second-level nodes, and an ensemble prediction based on the four leaf predictions, for which a simple averaging strategy is used. We denote these predictions as \(\hat{\boldsymbol{p}}^{\mathrm{fine}}\), \(\hat{\boldsymbol{p}}^{\mathrm{coarse}}\), and \(\hat{\boldsymbol{p}}^{\mathrm{ensem}}\), respectively. During model training, we calculate the corresponding losses on these predictions and employ them to drive end-to-end optimization.
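As a concrete reading of this prediction head, the sketch below chains the described operations in PyTorch; the upsampling factor, channel widths and class counts are illustrative assumptions, not values fixed by the text.

```python
# A sketch of the per-node prediction head and the averaging ensemble.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_ch, num_classes, scale=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=scale, mode="bilinear",
                        align_corners=False),          # upsample the output
            nn.BatchNorm2d(in_ch),                     # batch normalization
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),  # 3x3 conv
            nn.ReLU(inplace=True),                     # ReLU activation
            nn.Conv2d(in_ch, num_classes, kernel_size=1),       # 1x1 conv
        )

    def forward(self, x):
        return self.head(x)  # prediction of shape (N, C, H, W)

def ensemble(leaf_preds):
    # simple average over the four leaf-node predictions
    return torch.stack(leaf_preds, dim=0).mean(dim=0)
```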

3.3 Loss functions

In SCD tasks, a slight shift in the remote sensing image position and the difficulty of annotating object boundaries cause label ambiguity, which traditional classification losses cannot handle. Different from the traditional classification loss functions used in LCM or SCD, we propose to use label distribution learning (LDL) [12, 13, 37] as the loss function to drive model training, which aims to relieve the label ambiguity in such a multi-temporal prediction problem.

Take \(\hat{\boldsymbol{p}}^{\mathrm{ensem}}\) as an example, whose corresponding ground truth is denoted by \(\boldsymbol{y}^{\mathrm{ensem}}\in \mathbb{R}^{C\times H\times W}\). Label distribution learning in SCD requires a distribution over labels as the ground truth. Therefore, following [13], we apply a \(5\times 5\) Gaussian kernel to all channels of \(\boldsymbol{y}^{\mathrm{ensem}}\) and obtain \(\boldsymbol{y}^{\prime \mathrm{ensem}}\). Then, along the category dimension, we conduct the following normalization to make it a legitimate distribution:

$$ {y}^{\prime\prime \mathrm{ensem}}_{c,i,j} = \frac{{y}^{\prime \mathrm{ensem}}_{c,i,j}}{\sum_{c'=0}^{C} {y}^{\prime \mathrm{ensem}}_{c',i,j}} . $$
(7)

Thus, \(\boldsymbol{y}^{\prime\prime \mathrm{ensem}} \in \mathbb{R}^{C\times H\times W}\) is the ground truth in the label distribution learning loss, and the loss function is calculated by

$$ \mathcal{L}^{\mathrm{ensem}}_{\mathrm{LDL}} \bigl(\boldsymbol{y}^{\prime\prime \mathrm{ensem}}, \hat{\boldsymbol{p}}^{\mathrm{ensem}} \bigr) = \sum_{i} \sum_{j} \sum_{c} y^{\prime\prime \mathrm{ensem}}_{c,i,j} \ln \frac{y^{\prime\prime \mathrm{ensem}}_{c,i,j}}{\hat{p}_{c,i,j}} , $$
(8)

which, after dropping the constant term \(\sum_{i}\sum_{j}\sum_{c} y^{\prime\prime \mathrm{ensem}}_{c,i,j} \ln y^{\prime\prime \mathrm{ensem}}_{c,i,j}\) (the negative entropy of the ground truth distribution, independent of the model), is equivalent to minimizing

$$ \mathcal{L}^{\mathrm{ensem}}_{\mathrm{LDL}} = - \sum_{i}\sum_{j} \sum_{c} y^{\prime\prime \mathrm{ensem}}_{c,i,j} \ln {\hat{p}_{c,i,j}} , $$
(9)

where \(\ln (\cdot )\) is the natural logarithm function. For the other predictions, i.e., \(\hat{\boldsymbol{p}}^{\mathrm{fine}}\) and \(\hat{\boldsymbol{p}}^{\mathrm{coarse}}\), we conduct a similar process to obtain their losses \(\mathcal{L}^{\mathrm{fine}}_{\mathrm{LDL}}\) and \(\mathcal{L}^{\mathrm{coarse}}_{\mathrm{LDL}}\). Thus, the final loss function of the CAT model is calculated in Equation (10):

$$ \mathcal{L} = \mathcal{L}^{\mathrm{fine}}_{\mathrm{LDL}} + \mathcal{L}^{\mathrm{coarse}}_{\mathrm{LDL}} + \mathcal{L}^{\mathrm{ensem}}_{\mathrm{LDL}} . $$
(10)

Note that the trade-off parameters among these terms are all set to 1, which reflects the robustness and practicality of our model.
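The following sketch realizes the label distribution construction of Equation (7) and the loss of Equation (9) in PyTorch. The Gaussian sigma and the use of a softmax to turn logits into a distribution \(\hat{\boldsymbol{p}}\) are our assumptions, as the text does not fix them.

```python
# A minimal sketch of Eq. (7) and Eq. (9), assuming PyTorch.
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2D Gaussian kernel; sigma is an assumption."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def make_label_distribution(y_onehot, kernel):
    """y_onehot: (N, C, H, W) one-hot labels -> smoothed label distribution."""
    c = y_onehot.shape[1]
    # smooth every channel independently with the same 5x5 Gaussian kernel
    y = F.conv2d(y_onehot, kernel.expand(c, -1, -1, -1).contiguous(),
                 padding=kernel.shape[-1] // 2, groups=c)
    return y / y.sum(dim=1, keepdim=True).clamp_min(1e-12)   # Eq. (7)

def ldl_loss(y_dist, logits):
    """Cross-entropy against the label distribution, as in Eq. (9)."""
    log_p = F.log_softmax(logits, dim=1)
    return -(y_dist * log_p).sum(dim=1).mean()
```

The total loss of Equation (10) is then simply the unweighted sum of the fine, coarse, and ensemble terms computed by `ldl_loss`.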

4 Experiments

In this section, we first introduce the dataset, experimental settings, and implementation details. Then, we report the main results. Finally, ablation studies are conducted to verify the efficacy of the main components of the CAT method.

4.1 Datasets and empirical settings

In the experiments, we utilize a large-scale semantic change detection dataset, i.e., SECOND [1], to evaluate the performance of the proposed CAT model. The original SECOND dataset is collected from several sensors/platforms across different cities (e.g., Hangzhou, Chengdu) in China and has 4662 pairs of images in 30 change categories. For fair comparisons with the state-of-the-art method [2], we follow its modified categories from SECOND. Therefore, a total of 14 fine-grained change types are used for empirically evaluating semantic change detection (SCD), i.e., “no-change”, “water-ground”, “water-vegetation”, “water-building”, “ground-vegetation”, “ground-water”, “ground-building”, “vegetation-ground”, “vegetation-water”, “vegetation-building”, “building-water”, “building-ground”, “building-vegetation”, and “building-building”. Accordingly, the coarse-grained semantic change categories are grouped into four meta categories: “no-change”, “water-land cover”, “land cover-water”, and “other types”. On the other hand, for the land cover mapping (LCM) task, the coarse-grained semantic categories are “water”, “land cover” and “background”, while the fine-grained categories for LCM are “water”, “ground”, “vegetation”, “building” and “background”. Other experimental setups also follow [2] for fair comparisons.
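The fine-to-coarse grouping can be written down as a simple lookup, as in the sketch below; assigning the “water-to-X” and “X-to-water” types to the two water-related meta categories follows the names above, while placing all remaining land-cover-to-land-cover transitions under “other types” is our reading of the text rather than an official mapping file.

```python
# Assumed fine-to-coarse category mapping for SCD on SECOND (illustrative).
COARSE_OF_FINE = {
    "no-change": "no-change",
    "water-ground": "water-land cover",
    "water-vegetation": "water-land cover",
    "water-building": "water-land cover",
    "ground-water": "land cover-water",
    "vegetation-water": "land cover-water",
    "building-water": "land cover-water",
    # remaining land cover-to-land cover transitions
    "ground-vegetation": "other types",
    "ground-building": "other types",
    "vegetation-ground": "other types",
    "vegetation-building": "other types",
    "building-ground": "other types",
    "building-vegetation": "other types",
    "building-building": "other types",
}
```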

Regarding the evaluation metrics, we employ the arithmetic mean of the per-class F1-score (F1) and the overall accuracy (OA) for the LCM task, as calculated in Equations (11) and (12):

$$\begin{aligned}& F1 = \frac{1}{C+1} \sum_{i=0}^{C} \frac{2\times {\mathrm{TP}}_{i}}{2\times {\mathrm{TP}_{i}} + {\mathrm{FN}_{i}} + {\mathrm{FP}_{i}}} , \end{aligned}$$
(11)
$$\begin{aligned}& {\mathrm{OA}} = \frac{\sum_{i=0}^{C} ({\mathrm{TP}_{i}}+{\mathrm{TN}_{i}})}{\sum_{i=0}^{C} ({\mathrm{TP}_{i}}+ {\mathrm{FN}_{i}}+{\mathrm{FP}_{i}}+{\mathrm{TN}_{i}})} , \end{aligned}$$
(12)

where \({\mathrm{TP}_{i}}\), \({\mathrm{FP}_{i}}\), \({\mathrm{FN}_{i}}\), and \({\mathrm{TN}_{i}}\) are the numbers of true positive, false positive, false negative, and true negative pixels for category i, respectively. C denotes the number of categories.

For the SCD task, by following [8], \(\mathrm{F}^{\mathrm{loc}}\), \(\mathrm{F}^{\mathrm{types}}\), and \(\mathrm{F}^{\mathrm{overall}}\) are used for evaluations, whose detailed computation is presented by

$$ {\mathrm{F}^{\mathrm{overall}}} = 0.3{\mathrm{F}^{\mathrm{loc}}} + 0.7{ \mathrm{F}^{\mathrm{types}}} , $$
(13)

where \({\mathrm{F}^{\mathrm{loc}}}\) is the F1-score of changed pixels, and \({\mathrm{F}^{\mathrm{types}}}\) is calculated by the arithmetic mean of per-class F1-scores. Additionally, OA is also utilized to evaluate the accuracy of the “from-to” change types.
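The metrics above can be computed directly from per-class confusion counts; the following sketch, with illustrative names, mirrors Equations (11)–(13).

```python
# A sketch of the evaluation metrics in Eqs. (11)-(13), assuming per-class
# confusion counts (true/false positives and negatives) as NumPy arrays.
import numpy as np

def mean_f1(tp, fp, fn):
    """Arithmetic mean of per-class F1 (Eq. 11); inputs are (C+1,) arrays."""
    f1 = 2 * tp / np.maximum(2 * tp + fn + fp, 1)  # guard against 0/0
    return f1.mean()

def overall_accuracy(tp, fp, fn, tn):
    """Overall accuracy over all categories (Eq. 12)."""
    return (tp + tn).sum() / np.maximum((tp + fn + fp + tn).sum(), 1)

def f_overall(f_loc, f_types):
    """Weighted SCD score (Eq. 13): F_loc is the binary F1 of changed
    pixels, F_types the mean per-class F1 over change types."""
    return 0.3 * f_loc + 0.7 * f_types
```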

4.2 Implementation details

For fair comparisons with previous work, e.g., [2], we fix the resolution of the input images as \(512\times 512\). Random cropping, flipping, rotation and blurring operations are also employed for data augmentation. Regarding the proposed CAT model, the encoder \(\operatorname{enc}(\cdot )\) consists of the layers before the average-pooling layer of ResNet-34 [38] (pre-trained on ImageNet [39]). The calculation of \(\operatorname{Net}(\cdot )\) in Equation (4) can be presented by

$$ \mathrm{ReLU} \bigl(\mathrm{BN} \bigl(\mathrm{conv} \bigl(\mathrm{ReLU} \bigl(\mathrm{BN} \bigl(\mathrm{conv} \bigl(\mathrm{upsample}(\cdot ); 3\times 3 \bigr) \bigr) \bigr); 3\times 3 \bigr) \bigr) \bigr) , $$
where \(\mathrm{upsample}(\cdot )\) is the upsampling operation that turns a feature map of a small scale into a larger one, \(\mathrm{conv}(\cdot ;3\times 3)\) is a convolution with a kernel size of \(3\times 3\), \(\mathrm{BN}(\cdot )\) is batch normalization [36], and \(\mathrm{ReLU}(\cdot )\) is the ReLU activation function. Regarding the optimization of CAT, stochastic gradient descent with a batch size of 8 is employed as the optimizer, and the learning rate is set to \(10^{-3}\). All experiments are conducted on 8 NVIDIA GeForce RTX 3090 Ti GPUs for 100 epochs.
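Under these definitions, \(\operatorname{Net}(\cdot )\) could be instantiated as follows; the bilinear upsampling mode and the channel widths are assumptions not fixed by the text.

```python
# A sketch of Net(.) as formulated above, assuming PyTorch.
import torch.nn as nn

def make_net(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),   # conv(.; 3x3)
        nn.BatchNorm2d(out_ch),                               # BN
        nn.ReLU(inplace=True),                                # ReLU
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),  # conv(.; 3x3)
        nn.BatchNorm2d(out_ch),                               # BN
        nn.ReLU(inplace=True),                                # ReLU
    )
```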

4.3 Main results and comparisons

In experiments, we compare the CAT model with baseline methods and competing state-of-the-art methods in SCD. A brief introduction of these methods is presented as follows.

  1) MTL-CD [40]: A deep multi-task learning method adopting the encoder-decoder architecture for change detection.

  2) DSIFN [41]: A deeply supervised image fusion network for change detection, which fuses multi-level deep features of raw images via attention modules for change map reconstruction.

  3) DDCNN [16]: An end-to-end change detection network, termed the difference-enhancement dense-attention convolutional neural network.

  4) HRSCD.srt1 [9]: A two-step independent post-classification comparison method based on convolutional neural networks.

  5) HRSCD.srt2 [9]: A direct classification SCD method based on a single encoder-decoder structure.

  6) HRSCD.srt3 [9]: A modified post-classification comparison network that introduces temporal correlation information by constructing a new change detection branch.

  7) HRSCD.srt4 [9]: A variant of HRSCD.srt3 that proposes a modified skip operation, connecting the Siamese encoders with the decoder of the change detection branch.

  8) ASN-Res [1]: A Siamese network for SCD using a locally asymmetric architecture with a residual encoder.

  9) PCFN-Hard [2]: A deep Siamese post-classification fusion network for semantic change detection, which employs the hard fusion strategy.

  10) PCFN [2]: A deep Siamese post-classification fusion network, which is the current state-of-the-art method in semantic change detection.

As reported in Table 1, compared with state-of-the-art methods and other baselines, our proposed CAT model achieves consistent improvements by a large margin on both the land cover mapping (LCM) and semantic change detection (SCD) tasks. More specifically, our model outperforms the competing PCFN method [2] by 0.57% and 0.54% in F1, as well as 2.35% and 2.03% in OA, for LCM on the two multi-temporal images. For SCD, our CAT model achieves significant improvements over PCFN, i.e., 7.35% in \(\mathrm{F^{\mathrm{loc}}}\), 19.01% in \(\mathrm{F^{\mathrm{types}}}\), 2.69% in OA, and 15.51% in \(\mathrm{F^{\mathrm{overall}}}\), respectively.

Table 1 Comparisons on the SECOND dataset with state-of-the-art methods

Additionally, we visualize the SCD/LCM predictions on several samples of the SECOND dataset in Fig. 7. As observed, our CAT model recovers more fine-grained details than PCFN, which leads to its significantly better performance in \(\mathrm{F^{\mathrm{loc}}}\), \(\mathrm{F^{\mathrm{types}}}\), OA, and \(\mathrm{F^{\mathrm{overall}}}\).

Figure 7

Visualization results of several samples in the SECOND dataset. We also visualize the results of state-of-the-art PCFN [2] and the ground truth for clear comparisons

4.4 Ablation studies and discussions

In this section, we conduct ablation studies on SECOND to characterize the proposed CAT method, especially for its main components.

We investigate the effects of 1) the attention tree, 2) tri-aggregation, 3) the proposed coarse-to-fine hierarchical structure, 4) the label distribution learning loss, and 5) the asymmetric attention of the tree. As reported in Table 2, comparing ♯1, ♯2, and ♯3, we find that even without coarse-grained predictions, our attention tree is effective for either the LCM or the SCD task. In particular, when the attention tree is equipped on both LCM and SCD simultaneously, the model achieves a significant improvement. When only applying attention trees with one branch (i.e., ♯4), although they are equipped on both LCM and SCD, the results are still unsatisfactory. Regarding tri-aggregation, removing it and relying merely on the information from each task's own LCM/SCD stream (comparing ♯5 and ♯3) causes a slight drop on LCM, but an obvious performance drop on SCD, e.g., 1.86% on \(\mathrm{F^{\mathrm{overall}}}\). These observations justify the necessity of tri-aggregation in our proposal. Furthermore, comparing ♯6 with ♯5 shows that the proposed coarse-to-fine hierarchical structure also brings improvements, e.g., 1.16% on \(\mathrm{F^{\mathrm{overall}}}\) of SCD. Meanwhile, for LCM, our coarse-to-fine structure achieves larger gains than its baseline. On the other hand, comparing ♯9 and ♯6 validates the effectiveness of the label distribution learning loss used in SCD, since we observe over 1% improvement on \(\mathrm{F^{\mathrm{overall}}}\) for SCD. In addition, we conduct experiments with a symmetric attention tree, i.e., each branch has one attention sub-module. As shown in Table 2, comparing ♯7 with ♯9, the symmetric structure causes a significant drop in both LCM and SCD performance. Moreover, we further validate the effectiveness of the attention in the CAT model by removing the attention tree while keeping the other mechanisms (♯8); a significant performance drop is observed compared with ♯9.

Table 2 Ablation studies on the SECOND dataset

Furthermore, we visualize the SCD results of these ablation studies in Fig. 8, where the qualitative results are consistent with the quantitative results in Table 2.

Figure 8

Visualization results of ablation studies. Note that the different settings of our CAT shown at the top of this figure correspond to those in Table 2

5 Conclusion

In this paper, we propose a coarse-to-fine attention tree (CAT) model to simultaneously address the semantic change detection (SCD) and land cover mapping (LCM) tasks. Specifically, motivated by the identification process of human vision, a coarse-to-fine hierarchical tree structure is developed to model the category hierarchy in SCD/LCM. Meanwhile, to better capture discriminative pixel regions, attention mechanisms are integrated into the tree, which returns both SCD and LCM predictions. To relieve the label ambiguity in these tasks, a label distribution learning loss is further employed. Extensive experiments on the large-scale SECOND dataset indicate that our CAT model achieves the best results on both SCD and LCM. In the future, we will attempt to embed the coarse-to-fine attention tree directly into the encoder-decoder architecture as a more compact SCD model.