1 Introduction

Change detection aims to locate land cover variations between two multi-temporal images and identify their categories with pixel-wise boundaries [1, 2]; it is a crucial image interpretation task in various remote sensing applications [3]. In the literature, binary change detection (BCD) was first formulated as detecting the locations of changed pixels between the input multi-temporal images [4, 5]. However, BCD methods focus only on the changed locations and overlook the specific categories. This limits the applicability of BCD, since identifying change types is usually essential for diverse applications, e.g., urban planning [6] and natural resource management [7]. Therefore, in recent years, semantic change detection has become an active research topic in this field [1, 8, 9].

With the rapid development of deep learning techniques, deep semantic change detection methods have made great progress. In particular, by considering the correlation between land cover mapping (LCM) and semantic change detection (SCD), many representative Siamese network models with multi-task learning have been proposed [2, 10, 11]. However, as illustrated in Fig. 1, some important issues remain unaddressed. More specifically, owing to the complexity of real-world change processes, some change categories are difficult to distinguish from one another, e.g., “water-vegetation” vs. “water-tree”, which can be regarded as fine-grained differences. In addition, slight shifts in image registration and the difficulty of annotating object boundaries cause ambiguity among these categories, known as label ambiguity. Moreover, change is a continuous rather than a discrete process, which further contributes to label ambiguity. Both the fine-grained nature and the label ambiguity make SCD a challenging task, and neither has been fully investigated.

Figure 1

Samples of multi-temporal images in SCD. Its fine-grained nature and label ambiguity can be clearly observed, e.g., the pixel regions of “tree” vs. those of “vegetation”

In this paper, we propose a coarse-to-fine attention tree (CAT) model to address these challenges. The key difference from previous work is the coarse-to-fine idea illustrated in Fig. 2. Human visual recognition typically follows a coarse-to-fine procedure: we tend to first solve an easy task, i.e., determining whether a change belongs to a coarse-grained category, e.g., “water-land cover”, and then, within this candidate set of change types, solve the harder task of judging the fine-grained category, such as “water-vegetation” or “water-ground”. Motivated by this, we realize a coarse-to-fine attention tree to model the label hierarchy of change categories and to capture the discriminative pixel regions for better SCD predictions. To handle label ambiguity, we convert the original pixel-level labels in SCD/LCM into label distributions and use them as the ground truth to drive model training. By introducing a label distribution learning loss [12, 13], the proposed CAT model can be optimized in an end-to-end manner. Beyond that, as a unified model, CAT also adopts a popular encoder-decoder structure for representation learning in its early stage, as shown in Fig. 3. In experiments, the proposed CAT model is evaluated on the large-scale SECOND dataset for SCD [1]. Empirical results and ablation studies validate the effectiveness of the proposed CAT method and our design choices.

Figure 2

An illustration of coarse-to-fine hierarchical information as a crucial cue for the semantic change detection task

Figure 3

Overall framework of our CAT model, which consists of the encoder, the decoder and the coarse-to-fine attention tree. The inputs are two multi-temporal images, while the outputs are two land cover mapping predictions regarding the inputs, as well as a prediction of semantic change detection

The main contributions of this paper are listed as follows:

  1) The fine-grained nature and label ambiguity in the semantic change detection task are investigated, and a novel coarse-to-fine attention tree is proposed to deal with these challenges.

  2) A tree structure is developed to model the label hierarchy of change categories at different granularity levels and to capture discriminative pixel-level regions, and a label distribution learning loss is introduced to alleviate label ambiguity.

  3) Experiments and various ablation studies are conducted on a large-scale semantic change detection dataset, demonstrating superior results compared with state-of-the-art methods and baselines.

The rest of this paper is organized as follows. Section 2 reviews the related work of semantic change detection and hierarchical architecture. Section 3 elaborately presents the proposed CAT model, as well as its main components. In Sect. 4, we report the empirical settings and experimental results for evaluating the effectiveness of our CAT model, as well as the ablation studies for in-depth investigation. Finally, Sect. 5 draws the conclusions and discusses promising future work.

2 Related work

We briefly review the previous work in the literature from two related aspects, semantic change detection and hierarchical architecture.

2.1 Semantic change detection

Change detection using multi-temporal remote sensing images is an important technique for monitoring dynamic changes on the land surface [4, 5, 9]. With the rapid development of deep learning techniques, recent semantic change detection (SCD) methods based on deep learning have achieved great success [1, 14–17].

More specifically, Caye et al. [18] proposed an iterative learning method to train a fully convolutional neural network (CNN) for detecting changes from noisy data. Papadomanolaki et al. [19] combined a recurrent neural network (RNN) with a fully convolutional neural network, using multi-sequence high-resolution data for urban change detection. Siamese neural networks can evaluate the similarity between two images, which makes them effective for the SCD task. Liu et al. [14] designed a deep convolutional coupling network for detecting changes from two multi-temporal images. Daudt et al. [10] proposed two Siamese expansion modules for ordinary and multispectral images, and fused them into a fully convolutional neural network to perform SCD. Chen et al. [11] combined the advantages of CNNs and RNNs and proposed a deep Siamese convolutional recurrent neural network for change detection, which can also be applied to homogeneous and heterogeneous high-resolution remote sensing images. Du et al. [20] employed two symmetrical neural networks to extract features of dual time-series images, and then used slow feature analysis (SFA) to highlight the changing parts of the transformed features. Based on kernel principal component analysis convolutions, Wu et al. [21] proposed an unsupervised deep Siamese convolutional network for binary and multi-category change detection. Very recently, to better evaluate SCD methods, Yang et al. [1] constructed a well-annotated benchmark dataset, the SEmantic Change detectiON Dataset (SECOND), and proposed an asymmetric Siamese network. Later, Xia et al. [2] designed a deep Siamese post-classification fusion network to address the accumulation of misclassification errors.

2.2 Hierarchical architecture

Hierarchical architecture, which aims to capture information at different scales, is a common practice in computer vision and pattern recognition. Early on, multi-stage networks were developed for handwritten digit recognition [22]; in the deep learning era, they were extended to object recognition [23]. In particular, feature pyramid networks were employed to capture features at different scales in object detection and segmentation [24, 25]. Recently, many researchers have developed multi-level Transformer structures to boost the performance of various vision tasks [26, 27].

In addition to those general models, researchers have developed different modules for specific vision tasks, including fine-grained image analysis [28] and remote sensing [1, 29]. Specifically, hierarchical convolutions [30] and hierarchical bilinear pooling [31] were proposed to capture fine-grained cues in visual objects. Multi-level recurrent units along with attention modules were developed for vehicle re-identification [32]. On the other hand, in remote sensing, cascade detection frameworks were applied to high-resolution object detection [33], and random walk was adopted in a hierarchical fashion for the classification of hyperspectral and LiDAR data [34]. In contrast to previous work, in this paper, we introduce a coarse-to-fine hierarchical tree structure to model the dependence among categories at different granularities.

3 Methodology

In this section, we elaborate on the proposed coarse-to-fine attention tree (CAT) method from the following aspects: the overall architecture, the coarse-to-fine attention tree module, and the loss functions.

3.1 Notations and overall architecture

In general, the semantic change detection (SCD) task takes a pair of multi-temporal images as inputs, denoted as \({I}_{1}\) and \({I}_{2}\). Specifically, \(\Omega =\{0,1,\ldots ,H-1\}\times \{0,1,\ldots , W-1\}\) is the image grid of the multi-temporal image, where H and W are the height and width, respectively. Given a set \(\mathcal{L}=\{y_{1},\ldots ,y_{C}\}\) of C semantic categories, the SCD problem aims to learn a mapping function \(f_{I_{1}, I_{2}}: \Omega \rightarrow \mathcal{L}^{2}\) such that

$$ \forall \boldsymbol{p} \in \Omega ,\quad f_{I_{1}, I_{2}}( \boldsymbol{p})=\textstyle\begin{cases} (0, 0), & \text{if } \mathcal{C}_{I_{1}, I_{2}}(\boldsymbol{p})< \tau, \\ (l_{1},l_{2}), & \text{otherwise}, \end{cases} $$
(1)

where \(\mathcal{C}_{I_{1}, I_{2}}(\boldsymbol{p})\) measures the change probability of each pixel \(\boldsymbol{p} \in \Omega \), \(l_{1}, l_{2}\in \mathcal{L}\), and \((0,0)\) indicates the non-change class. τ is a scalar threshold on \(\mathcal{C}_{I_{1}, I_{2}}\). Therefore, \(f_{I_{1}, I_{2}}(\cdot )\) locates changed regions and identifies their categories simultaneously.
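To make the mapping in Equation (1) concrete, the following minimal sketch applies the thresholding rule at inference time. It assumes NumPy arrays, and all names (`apply_change_threshold`, `change_prob`, etc.) are illustrative rather than part of a released implementation.

```python
# A minimal sketch of Eq. (1) at inference time, assuming NumPy arrays.
import numpy as np

def apply_change_threshold(change_prob, labels_t1, labels_t2, tau=0.5):
    """Combine per-pixel change probability with semantic labels (Eq. 1).

    change_prob: (H, W) float array, the change probability C_{I1,I2}(p).
    labels_t1, labels_t2: (H, W) int arrays of category indices l1, l2.
    Returns two (H, W) maps where unchanged pixels are set to 0 (non-change).
    """
    changed = change_prob >= tau              # thresholding on C_{I1,I2}
    out_t1 = np.where(changed, labels_t1, 0)  # (0, 0) marks non-change
    out_t2 = np.where(changed, labels_t2, 0)
    return out_t1, out_t2
```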

As illustrated in Fig. 3, the framework of CAT consists of several components, including the encoder, the decoder, and the most important coarse-to-fine attention tree. Specifically, we follow the encoder-decoder structure in [2] for representation learning. The encoder \(\operatorname{enc}(\cdot )\) extracts the information from multi-temporal images separately to form the feature maps with multiple scales. For the i-th scale, \(\boldsymbol{x}^{i}_{1}\) and \(\boldsymbol{x}^{i}_{2}\) are the corresponding feature maps obtained via

$$\begin{aligned}& \boldsymbol{x}^{i}_{1}= \operatorname{enc}(I_{1}; i) \in \mathbb{R}^{h^{i}\times w^{i} \times d^{i}} , \end{aligned}$$
(2)
$$\begin{aligned}& \boldsymbol{x}^{i}_{2}= \operatorname{enc}(I_{2}; i) \in \mathbb{R}^{h^{i}\times w^{i} \times d^{i}} , \end{aligned}$$
(3)

where \(h^{i}\), \(w^{i}\) and \(d^{i}\) are the dimensions of the feature maps at the i-th scale. After encoding, in the decoding phase, feature maps are gradually decoded from small to large scales in a two-stream fashion until their resolution equals that of the original input images, which serves the subsequent LCM and SCD tasks. The two streams in the decoder \(\operatorname{dec}(\cdot )\) correspond to the two temporal land cover images, respectively. At the i-th scale of \(\operatorname{dec}(\cdot )\), we concatenate the feature maps of the same scale from \(\operatorname{enc}(\cdot )\) with the upsampled results from the \((i-1)\)-th scale of \(\operatorname{dec}(\cdot )\) to form the representation \(\boldsymbol{x}^{\prime i}_{1}/\boldsymbol{x}^{\prime i}_{2}\) of the i-th scale in \(\operatorname{dec}(\cdot )\). The upsampled results are obtained by applying a sub-network \(\operatorname{Net}(\cdot )\) to the concatenation of \(\boldsymbol{x}^{\prime i-1}_{1}\) and \(\boldsymbol{x}^{\prime i-1}_{2}\). The process is formulated as

$$ \boldsymbol{x}^{\prime i}_{1} = \bigl[\operatorname{Net} \bigl( \bigl[\boldsymbol{x}^{\prime i-1}_{1}; \boldsymbol{x}^{\prime i-1}_{2} \bigr] \bigr); \boldsymbol{x}^{i}_{1}; \boldsymbol{x}^{i}_{2} \bigr] , $$
(4)

where \([\cdot ;\ldots ;\cdot ]\) is the concatenation operation, and the implementation of \(\operatorname{Net}(\cdot )\) can be found in the experimental details in Sect. 4. As a special case, \(\boldsymbol{x}^{\prime 1}_{1}\) in \(\operatorname{dec}(\cdot )\) has no representation from a previous scale and reduces to \([\boldsymbol{x}^{1}_{1}; \boldsymbol{x}^{1}_{2} ]\). Eventually, we obtain the final representations of the two streams, i.e., \(\boldsymbol{x}_{\mathrm{LCM}1}\) and \(\boldsymbol{x}_{\mathrm{LCM}2}\), based on \(\boldsymbol{x}^{\prime 5}_{1}\) and \(\boldsymbol{x}^{\prime 5}_{2}\).
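For clarity, the sketch below instantiates one decoding step of Equation (4) in PyTorch. The text only gives the first stream explicitly, so the symmetric formation of \(\boldsymbol{x}^{\prime i}_{2}\) (with swapped encoder order) is our assumption, and all names are illustrative.

```python
# A minimal PyTorch sketch of one decoding step (Eq. (4)); `net` plays the
# role of Net(.) from Sect. 4.
import torch

def decode_step(net, x1_prev, x2_prev, x1_enc, x2_enc):
    # Net(.) upsamples and fuses the previous-scale two-stream features
    up = net(torch.cat([x1_prev, x2_prev], dim=1))
    # Eq. (4): concatenate with both encoder feature maps of the i-th scale
    x1_out = torch.cat([up, x1_enc, x2_enc], dim=1)
    # the second stream is assumed symmetric (encoder order swapped)
    x2_out = torch.cat([up, x2_enc, x1_enc], dim=1)
    return x1_out, x2_out
```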

After that, we calculate the differences between \(\boldsymbol{x}_{\mathrm{LCM}1}\) and \(\boldsymbol{x}_{\mathrm{LCM}2}\) as \(\boldsymbol{x}_{\mathrm{SCD}}\) for semantic change detection in Equation (5):

$$ \boldsymbol{x}_{\mathrm{SCD}} = \|\boldsymbol{x}_{\mathrm{LCM}1}- \boldsymbol{x}_{\mathrm{LCM}2}\| . $$
(5)

Since LCM and SCD can benefit each other, \(\boldsymbol{x}_{\mathrm{LCM}1}\), \(\boldsymbol{x}_{\mathrm{LCM}2}\) and \(\boldsymbol{x}_{\mathrm{SCD}}\) are aggregated as multiple modalities and fed into the coarse-to-fine attention tree for both LCM and SCD predictions. The details of the coarse-to-fine attention tree module and the loss functions of our CAT model are elaborated in the following subsections.

3.2 Coarse-to-fine attention tree

Inspired by the human identification process (cf. Figure 2), we first develop the coarse-to-fine hierarchical module: humans tend to make coarse-grained predictions first, e.g., “water-land cover”, and then fine-grained ones, e.g., “water-vegetation” or “water-building”. Meanwhile, considering the fine-grained nature of SCD, to better capture detailed changing regions while modeling the coarse-to-fine hierarchy, we combine the attention mechanism and the hierarchical structure into one and propose a coarse-to-fine attention tree.

As mentioned above, the input of the coarse-to-fine attention tree is the aggregated representations of \(\boldsymbol{x}_{\mathrm{LCM}1}\), \(\boldsymbol{x}_{\mathrm{LCM}2}\) and \(\boldsymbol{x}_{\mathrm{SCD}}\), which is denoted by

$$ \boldsymbol{x}_{\mathrm{AGG}}= [\boldsymbol{x}_{\mathrm{LCM}1}; \boldsymbol{x}_{\mathrm{LCM}2};\boldsymbol{x}_{\mathrm{SCD}} ] . $$
(6)

Since it aggregates tri-sourced information, we term it “tri-aggregation”; ablation studies in Table 2 validate its effectiveness. Furthermore, to better leverage the correlation between land cover mapping and semantic change detection, we formulate them into a multi-task learning framework, depicted on the right of Fig. 3, where each task corresponds to a coarse-to-fine attention tree, cf. Figure 4.
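As a concrete reading of Equations (5) and (6), the sketch below computes the difference branch and the tri-aggregation in PyTorch; interpreting \(\|\cdot \|\) as the element-wise absolute difference is our assumption.

```python
# A minimal sketch of Eqs. (5)-(6), assuming (N, C, H, W) PyTorch tensors.
import torch

def tri_aggregate(x_lcm1, x_lcm2):
    x_scd = torch.abs(x_lcm1 - x_lcm2)                 # Eq. (5), assumed |.|
    x_agg = torch.cat([x_lcm1, x_lcm2, x_scd], dim=1)  # Eq. (6)
    return x_agg, x_scd
```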

Figure 4

The proposed coarse-to-fine attention tree. The pink rounded rectangle represents the attention sub-modules, cf. Figure 6

In each tree, there are three levels of nodes: the root node in the first level is the input, the nodes in the second level are employed for coarse-grained predictions, and the leaf nodes in the third level are for fine-grained predictions. For a specific input to the tree, a branch routing strategy determines which branch it takes.

Branch routing

In the proposed CAT model, we design a strategy to divide the branches of the tree and determine the flow of data, cf. Figure 5. Concretely, \(\boldsymbol{x}_{\mathrm{AGG}}\) is first processed by a \(1\times 1\) convolution that aggregates the information into a unified representation. Then, we apply channel attention, as illustrated in Fig. 5, to capture discriminative information. Finally, several operations are applied in sequence to the resulting activation tensor, i.e., average pooling, signed square-root normalization, \(\ell _{2}\)-normalization, a \(1\times 1\) convolution, and the sigmoid function, yielding the routing parameter s. The routing strategy is: if s is larger than 0.5, the data flow goes to the right branch of the tree; otherwise, it goes to the left. Since the two branches have different attention processes (cf. Figure 4), they can capture patterns at different granularities, e.g., the left branch learns more fine-grained features thanks to its stacked attention modules. Note that the direction of the data flow is thus learned automatically in the CAT model.
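The following PyTorch sketch chains these routing operations. The paper does not detail the internal design of the channel attention, so a squeeze-and-excitation-style block is assumed here, and all layer hyper-parameters are illustrative.

```python
# A sketch of the branch routing (cf. Fig. 5), assuming PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Assumed SE-style channel attention (internal design not specified)."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(F.adaptive_avg_pool2d(x, 1).flatten(1))
        return x * w.view(x.size(0), -1, 1, 1)

class BranchRouting(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(ch, ch, kernel_size=1)  # 1x1 conv aggregation
        self.attn = ChannelAttention(ch)
        self.fc = nn.Conv2d(ch, 1, kernel_size=1)     # 1x1 conv to a scalar

    def forward(self, x_agg):
        x = self.attn(self.fuse(x_agg))
        v = F.adaptive_avg_pool2d(x, 1)                       # average pooling
        v = torch.sign(v) * torch.sqrt(torch.abs(v) + 1e-12)  # signed sqrt
        v = F.normalize(v.flatten(1), dim=1).view_as(v)       # l2 normalization
        return torch.sigmoid(self.fc(v))  # routing parameter s in (0, 1)
```

At inference, s > 0.5 routes the data to the right branch and s ≤ 0.5 to the left, as described above.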

Figure 5

Branch routing in the proposed CAT model

Attention sub-module in the tree

To introduce diversity and capacity into the model, we develop an asymmetric attention tree in the CAT model, cf. Figure 4. As illustrated, the left and right branches of the attention tree have two and one attention sub-modules, respectively. Different numbers of attention sub-modules bring different model complexities and, more importantly, allow the two branches to learn different patterns instead of redundant ones. The detailed attention sub-module is presented in Fig. 6. It consists of a channel attention and an atrous spatial pyramid pooling (ASPP) [35]. Specifically, the channel attention is the same as that in the branch routing. The ASPP has four parallel dilated convolutions with different dilation rates, i.e., 1, 6, 12, and 18. After that, we concatenate the results from the different dilated convolutions and apply a \(1\times 1\) convolution to aggregate the information as the final output of the attention sub-module. Beyond channel attention, we integrate ASPP [35] because it provides feature maps at different scales via dilated convolutions. This benefits fine-grained SCD predictions, especially pixel-level predictions [28, 35].
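A possible instantiation of this sub-module is sketched below in PyTorch. The dilation rates 1, 6, 12 and 18 follow the text, while the channel widths and the injected `channel_attn` module (e.g., the assumed SE-style block above) are illustrative assumptions.

```python
# A sketch of the attention sub-module: channel attention followed by ASPP.
import torch
import torch.nn as nn

class AttentionSubModule(nn.Module):
    def __init__(self, in_ch, out_ch, channel_attn: nn.Module):
        super().__init__()
        self.attn = channel_attn  # same design as in the branch routing
        self.aspp = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in (1, 6, 12, 18)            # four parallel dilated convs
        ])
        self.project = nn.Conv2d(4 * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        x = self.attn(x)
        feats = torch.cat([branch(x) for branch in self.aspp], dim=1)
        return self.project(feats)             # 1x1 conv aggregation
```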

Figure 6

Attention sub-module in our coarse-to-fine attention tree

Label predictions on the nodes

After performing the attention sub-module of a branch, we upsample its output and apply batch normalization [36], followed by a \(3\times 3\) convolution, a ReLU activation function and a \(1\times 1\) convolution. The final prediction \(\hat{\boldsymbol{p}} \in \mathbb{R}^{C\times H\times W}\) is then obtained, where C is the number of categories in the SCD or LCM task. As shown in the attention tree in Fig. 4, the CAT model yields four fine-grained predictions from the leaf nodes, two coarse-grained predictions from the second-level nodes, and an ensemble prediction based on the four leaf predictions, for which a simple averaging strategy is used. We denote these predictions as \(\hat{\boldsymbol{p}}^{\mathrm{fine}}\), \(\hat{\boldsymbol{p}}^{\mathrm{coarse}}\), and \(\hat{\boldsymbol{p}}^{\mathrm{ensem}}\), respectively. During model training, we calculate the corresponding losses on these predictions and employ them to drive end-to-end optimization.
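As a concrete reading of this prediction head, the sketch below chains the described operations in PyTorch; the upsampling factor, channel widths and class counts are illustrative assumptions, not values fixed by the text.

```python
# A sketch of the per-node prediction head and the averaging ensemble.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_ch, num_classes, scale=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=scale, mode="bilinear",
                        align_corners=False),          # upsample the output
            nn.BatchNorm2d(in_ch),                     # batch normalization
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),  # 3x3 conv
            nn.ReLU(inplace=True),                     # ReLU activation
            nn.Conv2d(in_ch, num_classes, kernel_size=1),       # 1x1 conv
        )

    def forward(self, x):
        return self.head(x)  # prediction of shape (N, C, H, W)

def ensemble(leaf_preds):
    # simple average over the four leaf-node predictions
    return torch.stack(leaf_preds, dim=0).mean(dim=0)
```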

3.3 Loss functions

In SCD tasks, a slight shift in the remote sensing image position and the difficulty of annotating object boundaries cause label ambiguity, which traditional classification losses cannot handle. Different from the traditional classification loss functions used in LCM or SCD, we propose to use label distribution learning (LDL) [12, 13, 37] as the loss function to drive model training, which aims to relieve the label ambiguity in such a multi-temporal prediction problem.

Take \(\hat{\boldsymbol{p}}^{\mathrm{ensem}}\) as an example, whose corresponding ground truth is denoted by \(\boldsymbol{y}^{\mathrm{ensem}}\in \mathbb{R}^{C\times H\times W}\). Label distribution learning in SCD requires a distribution over labels as the ground truth. Therefore, following [13], we apply a \(5\times 5\) Gaussian kernel to all channels of \(\boldsymbol{y}^{\mathrm{ensem}}\) and obtain \(\boldsymbol{y}^{\prime \mathrm{ensem}}\). Then, along the category dimension, we conduct the following normalization to make it a legitimate distribution:

$$ {y}^{\prime\prime \mathrm{ensem}}_{c,i,j} = \frac{{y}^{\prime \mathrm{ensem}}_{c,i,j}}{\sum_{c'=0}^{C} {y}^{\prime \mathrm{ensem}}_{c',i,j}} . $$
(7)

Thus, \(\boldsymbol{y}^{\prime\prime \mathrm{ensem}} \in \mathbb{R}^{C\times H\times W}\) is the ground truth in the label distribution learning loss, and the loss function is calculated by

$$ \mathcal{L}^{\mathrm{ensem}}_{\mathrm{LDL}} \bigl(\boldsymbol{y}^{\prime\prime \mathrm{ensem}}, \hat{\boldsymbol{p}}^{\mathrm{ensem}} \bigr) = \sum_{i} \sum_{j} \sum_{c} y^{\prime\prime \mathrm{ensem}}_{c,i,j} \ln \frac{y^{\prime\prime \mathrm{ensem}}_{c,i,j}}{\hat{p}_{c,i,j}} , $$
(8)

which, after dropping the constant term \(\sum_{i}\sum_{j}\sum_{c} y^{\prime\prime \mathrm{ensem}}_{c,i,j} \ln y^{\prime\prime \mathrm{ensem}}_{c,i,j}\) (the negative entropy of the ground truth distribution, independent of the model), is equivalent to minimizing

$$ \mathcal{L}^{\mathrm{ensem}}_{\mathrm{LDL}} = - \sum_{i}\sum_{j} \sum_{c} y^{\prime\prime \mathrm{ensem}}_{c,i,j} \ln {\hat{p}_{c,i,j}} , $$
(9)

where \(\ln (\cdot )\) is the natural logarithm function. For the other predictions, i.e., \(\hat{\boldsymbol{p}}^{\mathrm{fine}}\) and \(\hat{\boldsymbol{p}}^{\mathrm{coarse}}\), we conduct a similar process to obtain their losses \(\mathcal{L}^{\mathrm{fine}}_{\mathrm{LDL}}\) and \(\mathcal{L}^{\mathrm{coarse}}_{\mathrm{LDL}}\). Thus, the final loss function of the CAT model is calculated in Equation (10):

$$ \mathcal{L} = \mathcal{L}^{\mathrm{fine}}_{\mathrm{LDL}} + \mathcal{L}^{\mathrm{coarse}}_{\mathrm{LDL}} + \mathcal{L}^{\mathrm{ensem}}_{\mathrm{LDL}} . $$
(10)

Note that the trade-off parameters among these terms are all set to 1, which reflects the robustness and practicality of our model.
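The following sketch realizes the label distribution construction of Equation (7) and the loss of Equation (9) in PyTorch. The Gaussian sigma and the use of a softmax to turn logits into a distribution \(\hat{\boldsymbol{p}}\) are our assumptions, as the text does not fix them.

```python
# A minimal sketch of Eq. (7) and Eq. (9), assuming PyTorch.
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.0):
    """Build a normalized 2D Gaussian kernel; sigma is an assumption."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, size, size)

def make_label_distribution(y_onehot, kernel):
    """y_onehot: (N, C, H, W) one-hot labels -> smoothed label distribution."""
    c = y_onehot.shape[1]
    # smooth every channel independently with the same 5x5 Gaussian kernel
    y = F.conv2d(y_onehot, kernel.expand(c, -1, -1, -1).contiguous(),
                 padding=kernel.shape[-1] // 2, groups=c)
    return y / y.sum(dim=1, keepdim=True).clamp_min(1e-12)   # Eq. (7)

def ldl_loss(y_dist, logits):
    """Cross-entropy against the label distribution, as in Eq. (9)."""
    log_p = F.log_softmax(logits, dim=1)
    return -(y_dist * log_p).sum(dim=1).mean()
```

The total loss of Equation (10) is then simply the unweighted sum of the fine, coarse, and ensemble terms computed by `ldl_loss`.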

4 Experiments

In this section, we first introduce the dataset, experimental settings, and implementation details. Then, we report the main results. Finally, ablation studies are conducted to verify the efficacy of the main components of the CAT method.

4.1 Datasets and empirical settings

In the experiments, we utilize a large-scale semantic change detection dataset, i.e., SECOND [1], to evaluate the performance of the proposed CAT model. The original SECOND dataset is collected from several sensors/platforms across different cities (e.g., Hangzhou, Chengdu) in China and has 4662 pairs of images in 30 change categories. For fair comparisons with the state-of-the-art method [2], we follow its modified categories from SECOND. Therefore, a total of 14 fine-grained change types are used for empirically evaluating semantic change detection (SCD), i.e., “no-change”, “water-ground”, “water-vegetation”, “water-building”, “ground-vegetation”, “ground-water”, “ground-building”, “vegetation-ground”, “vegetation-water”, “vegetation-building”, “building-water”, “building-ground”, “building-vegetation”, and “building-building”. Accordingly, the coarse-grained semantic change categories are grouped into four meta categories: “no-change”, “water-land cover”, “land cover-water”, and “other types”. On the other hand, for the land cover mapping (LCM) task, the coarse-grained semantic categories are “water”, “land cover” and “background”, while the fine-grained categories for LCM are “water”, “ground”, “vegetation”, “building” and “background”. Other experimental setups also follow [2] for fair comparisons.
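The fine-to-coarse grouping can be written down as a simple lookup, as in the sketch below; assigning the “water-to-X” and “X-to-water” types to the two water-related meta categories follows the names above, while placing all remaining land-cover-to-land-cover transitions under “other types” is our reading of the text rather than an official mapping file.

```python
# Assumed fine-to-coarse category mapping for SCD on SECOND (illustrative).
COARSE_OF_FINE = {
    "no-change": "no-change",
    "water-ground": "water-land cover",
    "water-vegetation": "water-land cover",
    "water-building": "water-land cover",
    "ground-water": "land cover-water",
    "vegetation-water": "land cover-water",
    "building-water": "land cover-water",
    # remaining land cover-to-land cover transitions
    "ground-vegetation": "other types",
    "ground-building": "other types",
    "vegetation-ground": "other types",
    "vegetation-building": "other types",
    "building-ground": "other types",
    "building-vegetation": "other types",
    "building-building": "other types",
}
```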

Regarding the evaluation metrics, we employ the arithmetic mean of the per-class F1-score (F1) and the overall accuracy (OA) for the LCM task, as calculated in Equations (11) and (12):

$$\begin{aligned}& F1 = \frac{1}{C+1} \sum_{i=0}^{C} \frac{2\times {\mathrm{TP}}_{i}}{2\times {\mathrm{TP}_{i}} + {\mathrm{FN}_{i}} + {\mathrm{FP}_{i}}} , \end{aligned}$$
(11)
$$\begin{aligned}& {\mathrm{OA}} = \frac{\sum_{i=0}^{C} ({\mathrm{TP}_{i}}+{\mathrm{TN}_{i}})}{\sum_{i=0}^{C} ({\mathrm{TP}_{i}}+ {\mathrm{FN}_{i}}+{\mathrm{FP}_{i}}+{\mathrm{TN}_{i}})} , \end{aligned}$$
(12)

where \({\mathrm{TP}_{i}}\), \({\mathrm{FP}_{i}}\), \({\mathrm{FN}_{i}}\), and \({\mathrm{TN}_{i}}\) are the numbers of true positive, false positive, false negative, and true negative pixels for category i, respectively. C denotes the number of categories.

For the SCD task, by following [8], \(\mathrm{F}^{\mathrm{loc}}\), \(\mathrm{F}^{\mathrm{types}}\), and \(\mathrm{F}^{\mathrm{overall}}\) are used for evaluations, whose detailed computation is presented by

$$ {\mathrm{F}^{\mathrm{overall}}} = 0.3{\mathrm{F}^{\mathrm{loc}}} + 0.7{ \mathrm{F}^{\mathrm{types}}} , $$
(13)

where \({\mathrm{F}^{\mathrm{loc}}}\) is the F1-score of changed pixels, and \({\mathrm{F}^{\mathrm{types}}}\) is calculated by the arithmetic mean of per-class F1-scores. Additionally, OA is also utilized to evaluate the accuracy of the “from-to” change types.
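The metrics above can be computed directly from per-class confusion counts; the following sketch, with illustrative names, mirrors Equations (11)–(13).

```python
# A sketch of the evaluation metrics in Eqs. (11)-(13), assuming per-class
# confusion counts (true/false positives and negatives) as NumPy arrays.
import numpy as np

def mean_f1(tp, fp, fn):
    """Arithmetic mean of per-class F1 (Eq. 11); inputs are (C+1,) arrays."""
    f1 = 2 * tp / np.maximum(2 * tp + fn + fp, 1)  # guard against 0/0
    return f1.mean()

def overall_accuracy(tp, fp, fn, tn):
    """Overall accuracy over all categories (Eq. 12)."""
    return (tp + tn).sum() / np.maximum((tp + fn + fp + tn).sum(), 1)

def f_overall(f_loc, f_types):
    """Weighted SCD score (Eq. 13): F_loc is the binary F1 of changed
    pixels, F_types the mean per-class F1 over change types."""
    return 0.3 * f_loc + 0.7 * f_types
```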

4.2 Implementation details

For fair comparisons with previous work, e.g., [2], we fix the resolution of the input images as \(512\times 512\). Random cropping, flipping, rotation and blurring operations are also employed for data augmentation. Regarding the proposed CAT model, the encoder \(\operatorname{enc}(\cdot )\) consists of the layers before the average-pooling layer of ResNet-34 [38] (pre-trained on ImageNet [39]). The calculation of \(\operatorname{Net}(\cdot )\) in Equation (4) can be presented by

$$ \mathrm{ReLU} \bigl(\mathrm{BN} \bigl(\mathrm{conv} \bigl(\mathrm{ReLU} \bigl(\mathrm{BN} \bigl(\mathrm{conv} \bigl(\mathrm{upsample}(\cdot ); 3\times 3 \bigr) \bigr) \bigr); 3\times 3 \bigr) \bigr) \bigr) , $$
where \(\mathrm{upsample}(\cdot )\) is the upsampling operation that turns a feature map of a small scale into a larger one, \(\mathrm{conv}(\cdot ;3\times 3)\) is a convolution with a kernel size of \(3\times 3\), \(\mathrm{BN}(\cdot )\) is batch normalization [36], and \(\mathrm{ReLU}(\cdot )\) is the ReLU activation function. Regarding the optimization of CAT, stochastic gradient descent with a batch size of 8 is employed as the optimizer, and the learning rate is set to \(10^{-3}\). All experiments are conducted on 8 NVIDIA GeForce RTX 3090 Ti GPUs for 100 epochs.
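Under these definitions, \(\operatorname{Net}(\cdot )\) could be instantiated as follows; the bilinear upsampling mode and the channel widths are assumptions not fixed by the text.

```python
# A sketch of Net(.) as formulated above, assuming PyTorch.
import torch.nn as nn

def make_net(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),   # conv(.; 3x3)
        nn.BatchNorm2d(out_ch),                               # BN
        nn.ReLU(inplace=True),                                # ReLU
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),  # conv(.; 3x3)
        nn.BatchNorm2d(out_ch),                               # BN
        nn.ReLU(inplace=True),                                # ReLU
    )
```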

4.3 Main results and comparisons

In experiments, we compare the CAT model with baseline methods and competing state-of-the-art methods in SCD. A brief introduction of these methods is presented as follows.

  1) MTL-CD [40]: A deep multi-task learning method adopting the encoder-decoder architecture for change detection.

  2) DSIFN [41]: A deeply supervised image fusion network for change detection, which fuses multi-level deep features of raw images via attention modules for change map reconstruction.

  3) DDCNN [16]: An end-to-end change detection network, termed the difference-enhancement dense-attention convolutional neural network.

  4) HRSCD.srt1 [9]: A two-step independent post-classification comparison method based on convolutional neural networks.

  5) HRSCD.srt2 [9]: A direct classification SCD method based on a single encoder-decoder structure.

  6) HRSCD.srt3 [9]: A modified post-classification comparison network that introduces temporal correlation information by constructing a new change detection branch.

  7) HRSCD.srt4 [9]: A variant of HRSCD.srt3 that proposes a modified skip operation, connecting the Siamese encoders with the decoder of the change detection branch.

  8) ASN-Res [1]: A Siamese network for SCD using a locally asymmetric architecture with a residual encoder.

  9) PCFN-Hard [2]: A deep Siamese post-classification fusion network for semantic change detection, which employs the hard fusion strategy.

  10) PCFN [2]: A deep Siamese post-classification fusion network, which is the current state-of-the-art method in semantic change detection.

As reported in Table 1, compared with state-of-the-art methods and other baselines, our proposed CAT model achieves consistent improvements by a large margin on both the land cover mapping (LCM) and semantic change detection (SCD) tasks. More specifically, our model outperforms the competing PCFN method [2] by 0.57% and 0.54% in F1, as well as 2.35% and 2.03% in OA, for LCM on the two multi-temporal images. For SCD, our CAT model achieves significant improvements over PCFN, i.e., 7.35% in \(\mathrm{F^{\mathrm{loc}}}\), 19.01% in \(\mathrm{F^{\mathrm{types}}}\), 2.69% in OA, and 15.51% in \(\mathrm{F^{\mathrm{overall}}}\), respectively.

Table 1 Comparisons on the SECOND dataset with state-of-the-art methods

Additionally, we visualize the SCD/LCM predictions on several samples of the SECOND dataset in Fig. 7. As observed, our CAT model recovers more fine-grained details than PCFN, which leads to its significantly better performance in \(\mathrm{F^{\mathrm{loc}}}\), \(\mathrm{F^{\mathrm{types}}}\), OA, and \(\mathrm{F^{\mathrm{overall}}}\).

Figure 7

Visualization results of several samples in the SECOND dataset. We also visualize the results of state-of-the-art PCFN [2] and the ground truth for clear comparisons

4.4 Ablation studies and discussions

In this section, we conduct ablation studies on SECOND to characterize the proposed CAT method, especially for its main components.

We investigate the effects of 1) the attention tree, 2) tri-aggregation, 3) the proposed coarse-to-fine hierarchical structure, 4) the label distribution learning loss, and 5) the asymmetric attention of the tree. As reported in Table 2, comparing ♯1, ♯2, and ♯3, we find that even without coarse-grained predictions, our attention tree is effective for either the LCM or the SCD task. In particular, when the attention tree is equipped on both LCM and SCD simultaneously, the model achieves a significant improvement. When only applying attention trees with one branch (i.e., ♯4), although they are equipped on both LCM and SCD, the results are still unsatisfactory. Regarding tri-aggregation, removing it and relying merely on the information from each task's own LCM/SCD stream (comparing ♯5 and ♯3) causes a slight drop on LCM, but an obvious performance drop on SCD, e.g., 1.86% on \(\mathrm{F^{\mathrm{overall}}}\). These observations justify the necessity of tri-aggregation in our proposal. Furthermore, comparing ♯6 with ♯5 shows that the proposed coarse-to-fine hierarchical structure also brings improvements, e.g., 1.16% on \(\mathrm{F^{\mathrm{overall}}}\) of SCD. Meanwhile, for LCM, our coarse-to-fine structure achieves larger gains than its baseline. On the other hand, comparing ♯9 and ♯6 validates the effectiveness of the label distribution learning loss used in SCD, since we observe over 1% improvement on \(\mathrm{F^{\mathrm{overall}}}\) for SCD. In addition, we conduct experiments with a symmetric attention tree, i.e., each branch has one attention sub-module. As shown in Table 2, comparing ♯7 with ♯9, the symmetric structure causes a significant drop in both LCM and SCD performance. Moreover, we further validate the effectiveness of the attention in the CAT model by removing the attention tree while keeping the other mechanisms (♯8); a significant performance drop is observed compared with ♯9.

Table 2 Ablation studies on the SECOND dataset

Furthermore, we visualize the SCD results of these ablation studies in Fig. 8, where the qualitative results are consistent with the quantitative results in Table 2.

Figure 8

Visualization results of ablation studies. Note that the different settings of our CAT shown at the top of this figure correspond to those in Table 2

5 Conclusion

In this paper, we propose a coarse-to-fine attention tree (CAT) model to simultaneously address the semantic change detection (SCD) and land cover mapping (LCM) tasks. Specifically, motivated by the identification process of human vision, a coarse-to-fine hierarchical tree structure is developed to model the category hierarchy in SCD/LCM. Meanwhile, to better capture discriminative pixel regions, attention mechanisms are integrated into the tree, which returns both SCD and LCM predictions. To relieve the label ambiguity in these tasks, a label distribution learning loss is further employed. Extensive experiments on the large-scale SECOND dataset indicate that our CAT model achieves the best results on both SCD and LCM. In the future, we will attempt to embed the coarse-to-fine attention tree directly into the encoder-decoder architecture as a more compact SCD model.