D2ANet: Difference-aware attention network for multi-level change detection from satellite imagery

Recognizing dynamic variations on the ground, especially changes caused by natural disasters, is critical for assessing the severity of damage and directing the disaster response. However, current workflows for disaster assessment usually require human analysts to observe and identify damaged buildings, which is labor-intensive and unsuitable for large-scale disaster areas. In this paper, we propose a difference-aware attention network (D2ANet) for simultaneous building localization and multi-level change detection from dual-temporal satellite imagery. Considering that different channels of the pre- and post-disaster image features carry different information, we develop a dual-temporal aggregation module that uses the paired features to excite change-sensitive channels and learn the global change pattern. Since building damage caused by disasters is diverse in complex environments, we design a difference-attention module to exploit local correlations among the multi-level changes, which improves the ability to identify damage at different scales. Extensive experiments on the large-scale building damage assessment dataset xBD demonstrate that our approach achieves new state-of-the-art results. Source code is publicly available at https://github.com/mj129/D2ANet.


Introduction
Change detection is a technique used to identify changes in objects over time [1]; it has been studied in numerous application domains, such as environmental monitoring [2], anomaly detection [3], disaster assessment [4], malware detection [5], and health care [6]. One of the broad applications of change detection is to analyze the dynamic variation of land cover. Satellite imagery not only offers multi-temporal images covering the same areas but also is relatively easy to obtain, which provides essential support for change detection. When a natural disaster occurs, assessing the severity and extent of the damage is critical for assisting the affected population and allocating disaster relief resources. Since disaster areas are usually dangerous and difficult to access, satellite imagery has become a valuable tool for evaluating the impact of disasters. However, most existing assessment methods require human observers to analyze the dual-temporal satellite images, i.e., the pre-disaster and post-disaster images. These approaches are labor-intensive and unsuitable for disasters over large-scale areas. To alleviate the manual effort and accelerate the assessment process, automated change detection methods that leverage dual-temporal satellite images have been developed in recent years.
Traditional studies usually detect changes in images from multiple periods based on pixel differences [7][8][9][10]. However, misregistration errors and illumination variations between the dual-temporal images of the same area cause difficulties for change detection algorithms. These errors are often caused by variations in satellite imaging parameters and are difficult to avoid. Pixel-difference based methods are typically designed for specific data and cannot effectively deal with these problems [11]. With the development of deep learning, convolutional neural networks (CNNs) have been proposed and proven to be effective for semantic segmentation [12][13][14]. CNNs with an encoder-decoder architecture have been used [15,16] to accomplish change detection based on segmentation. Considering the significance of damage assessment, some studies [17][18][19] focus on the changes caused by disasters. These methods combine pre- and post-disaster satellite images to detect damaged buildings, but they are usually designed to identify only the changes caused by a single type of disaster. Recently, a large-scale dataset named xBD was presented by Gupta et al. [20] for the advancement of change detection and building damage assessment, which contains dual-temporal satellite images at different damage scales from 19 natural disasters. Based on xBD, Ref. [20] further introduces a U-Net architecture for building localization and ResNet-50 for the damage classification task. Wu et al. [21] introduced a method that contains two networks to complete the two tasks. However, these methods usually divide building localization and damage assessment into two separate stages, which prevents the networks from benefiting from multi-task learning and requires complicated training steps.
This paper develops a difference-aware attention network (D2ANet) for simultaneous building localization and multi-level change detection from dual-temporal satellite images. We first apply a CNN architecture as the encoder to extract features from the dual-temporal images. The pre-disaster features from the encoder are fed into one decoder to perform the building segmentation task. For the multi-level change detection task, we develop a difference-aware attention (D2A) block to explore the relationship between different changes based on the pre- and post-disaster features. The D2A block contains a dual-temporal aggregation (DTA) module and a difference-attention (DA) module. The different channels in the features of the pre- and post-disaster images may present diverse information, which is worth exploring. The DTA module operates on the paired features to excite the change-sensitive channels and learn the global change information. Furthermore, a natural disaster often causes different scales of damage to buildings. As illustrated in Fig. 1, exploiting the correlations among the multi-level changes helps to identify different damage scales. We develop a DA module to capture long-range dependencies among any positions and any channels of one cube group, where the small cubes in each group have the potential to represent multi-level differences. Extensive experiments on the large-scale building damage assessment dataset xBD demonstrate the superiority of our D2ANet compared with several state-of-the-art methods.
Our contributions are summarized as follows:
- We propose a difference-aware attention network (D2ANet) for simultaneous building localization and multi-level change detection from dual-temporal satellite imagery.
- We develop a dual-temporal aggregation (DTA) module that uses the paired pre- and post-disaster features to excite change-sensitive channels and learn the global change pattern.
- We design a difference-attention (DA) module to exploit local correlations among the different changes. Compared with existing methods, it can efficiently learn the similarities between multi-level differences by splitting small feature cubes.
- Extensive experiments on the large-scale building damage assessment dataset xBD demonstrate that our D2ANet achieves new state-of-the-art results.

Change detection
Change detection from satellite imagery has drawn more and more attention in recent years [4,22,23]. Traditional approaches usually applied pixel-difference based models and defined a specific threshold to identify changes. Early studies utilized threshold-based comparison methods [7]. Later, Markov random fields [9], support vector machines (SVMs) [24], and conditional random fields [25] were also adopted for change detection. Nemmour and Chibani [26] presented a combined framework using an SVM with fuzzy integral and attractor dynamics for land cover change detection. Im and Jensen [8] introduced a three-channel neighborhood correlation image using multi-date high spatial resolution images. Gueguen and Hamid [10] proposed a semi-supervised framework to detect changes. However, these approaches are only suitable for processing small-scale areas.
With the development of deep learning, CNN based models have been proposed for change detection [16,27]. Caye Daudt et al. [28] introduced three fully convolutional neural network architectures with two Siamese extensions to implement change detection in pairs of co-registered remote sensing images. Daudt et al. [29] proposed a guided anisotropic diffusion algorithm with an iterative training scheme for change detection from noisy data. Papadomanolaki et al. [30] adopted recurrent networks for pixel-wise change detection using dual-temporal high-resolution data. Liu et al. [15] introduced a Siamese network that integrated a dual attention module to accomplish both change detection and building segmentation. Wang et al. [16] developed a deep Siamese network combined with a hybrid feature extraction module, which uses a change decision model to detect change based on feature differences. A novel kernel principal component analysis convolution was proposed in Ref. [31] for change detection in very-high-resolution (VHR) images taken at multiple times.
There are also some temporal anomaly detection works. Li et al. [3] proposed a smoothness-inducing sequential variational auto-encoder (SISVAE) model for robust estimation and anomaly detection from multidimensional correlated time-series data. For contaminated multivariate time-series, Li et al. [32] further presented a robust deep state space model (RDSSM) by bridging the gap between deep state space models and robust statistics.
Some works focus on changes caused by natural disasters and conduct damage assessment at the same time. Xu et al. [17] concatenated pre- and post-disaster satellite images to detect buildings damaged in an earthquake. Zhu et al. [18] proposed a multi-level instance segmentation network to detect building damage from aerial videos. Ji et al. [19] utilized a CNN to detect collapsed buildings from post-earthquake satellite imagery. Duarte et al. [33] combined satellite and drone images of a disaster to classify building damage. A new fusion method was proposed in Ref. [34] for flood damage segmentation based on multi-resolution, multi-sensor, and multi-temporal satellite imagery. These methods are usually designed to detect changes caused by a single type of disaster, but natural disasters can be more complicated, so the above methods may not be applicable. Recently, Gupta et al. [20] presented a large-scale dataset named xBD for change detection and building damage assessment. It contains satellite images from 19 different natural disasters and introduces a four-level joint damage scale. Based on the xBD dataset, they further introduced a baseline model, which uses a U-Net architecture [35] for the building localization task and ResNet-50 [36] for the damage classification task. Weber and Kané [37] adopted Mask R-CNN [38] augmented with a feature pyramid network module [39] and a semantic segmentation head for both building detection and damage level assessment. Shen et al. [40] proposed a two-stage CNN-based framework, which adopts a U-Net for building segmentation and a two-branch U-Net for damage assessment. However, these methods utilize two separate stages for the building localization and damage assessment tasks, which usually require complicated steps and a long time to train.

Building extraction
Traditional works usually extract buildings from satellite imagery using specific cues such as length, shape, color, and texture. Rüther et al. [41] introduced a snake algorithm to extract building shapes using edge information in aerial photographs. Tsai [42] presented several invariant color spaces for building shadow detection and compensation in urban aerial images. Sirmacek and Unsalan [43] extracted areas of interest by utilizing invariant color features and used edge and shadow information to detect buildings. However, these methods are designed for specific data and are rarely used for general building extraction.
In recent years, CNN-based methods [13,[44][45][46][47] have been developed and shown to be effective in semantic segmentation and object detection tasks. Yang et al. [48] proposed a novel instance-level denoising module for the feature map and an IoU-smooth L1 loss for rotation detection in aerial images. To solve the boundary problem in rotation detection, Yang and Yan [49] designed a classification-based rotation detection paradigm, which contains a circular smooth label to address angle periodicity and a densely coded label to overcome the problem of excessive model parameters. With the release of several large-scale datasets, such as SpaceNet [50] and DeepGlobe [51], some studies treated building extraction as a segmentation task and utilized CNN-based methods. Hamaguchi and Hikosaka [52] introduced a multi-task U-Net model, which learns multiple detectors, each focusing on a specific building size. Golovanov et al. [53] presented a modified version of LinkNet [54] and utilized several loss functions for building segmentation. Yuan [55] designed a simple network structure that combines activations from multiple layers to predict the pixel-wise output. A gated neural network was proposed in Ref. [56], which integrates contextual features and local details to improve the performance of building segmentation. Pan et al. [57] introduced a generative adversarial network combined with spatial and channel attention modules for building segmentation from remote sensing images. Zhao et al. [58] combined Mask R-CNN [38] and building-boundary regularization to obtain building instance segmentation with regular building shapes. However, these methods only learn a single task and are not well suited to handling complicated data from natural disasters.

Self-attention in vision
Self-attention is an attention mechanism to compute the relationship between different positions in a single sequence, which has been used successfully in machine translation and natural language processing [59].
Recently, some studies have adopted self-attention approaches for image recognition. The non-local block was proposed in Ref. [60] to capture long-range dependencies between any positions, but it has high memory and computational costs. The criss-cross network [61] further improved efficiency by capturing the contextual information of all pixels through a criss-cross path. Mei et al. [62] adopted self-attention and proposed a slice grouped non-local module for pulmonary nodule detection. Bello et al. [63] proposed a two-dimensional relative self-attention mechanism, which is combined with convolutional operators to obtain better results. Ramachandran et al. [64] developed a stand-alone attention layer to construct a fully attentional vision model with local self-attention units. Zhao et al. [65] explored two forms of self-attention module, pairwise self-attention and patchwise self-attention, which are shown to be effective for constructing image recognition models. A position-sensitive axial-attention design was proposed in Ref. [66], which combines 1D self-attention and position-sensitive self-attention. Wang et al. [67] proposed a multi-scale attention network to integrate low-layer texture features and high-level semantic features. Guo et al. [68] provided a systematic review of visual attention methods. Vision Transformer (ViT) [69] applied a standard transformer to linear embeddings of image patches for image classification. Xu et al. [70] surveyed visual transformer methods in low-level vision. Differing from the above approaches, this work introduces the group idea into the self-attention module to explore the relationships between multi-level changes and reduce the computational cost.

Method
Considering the complexity of actual environments and the different kinds of natural disasters, the variations between pre- and post-disaster images may not be the same in different areas. To explore the relationships between different changes, we develop a difference-aware attention network (D2ANet) for building localization and multi-level damage detection from dual-temporal satellite imagery, as illustrated in Fig. 2. In D2ANet, a dual-temporal aggregation (DTA) module is proposed to learn the global change pattern, and a difference-attention (DA) module is further introduced to capture local dependencies among the multi-level changes.

Network architecture
We adopt ResNet-101 [36] as the encoder in our D2ANet due to its outstanding feature extraction ability. Since atrous convolution enlarges the field of view of filters and extracts denser features without introducing additional parameters, following Ref. [13], we apply atrous convolution to the last two blocks of the encoder with rate = 2 and rate = 4, respectively. We further employ an atrous spatial pyramid pooling (ASPP) module [14] to increase the receptive fields of feature points and effectively learn multi-scale features. ASPP applies four parallel atrous convolutions with different atrous rates to capture multi-scale information, which is then concatenated with the feature from a global average pooling layer. The input to our network is a pair of dual-temporal images, i.e., pre- and post-disaster images. We obtain two groups of features from the convolutional blocks of the encoder.
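A minimal PyTorch sketch of such a dilated backbone is shown below. It is only an illustration of the idea, not the released implementation: it relies on torchvision's replace_stride_with_dilation option (which dilates the last two ResNet stages), and the input size and resulting output stride of 8 are assumptions based on the description above.

```python
import torch
import torchvision

# Sketch (assumption): a ResNet-101 backbone whose last two stages use atrous
# (dilated) convolution instead of striding, giving an output stride of 8.
backbone = torchvision.models.resnet101(
    weights=None, replace_stride_with_dilation=[False, True, True]
)

def extract_features(x: torch.Tensor) -> torch.Tensor:
    """Run the convolutional stages of the backbone and return the last feature map."""
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    x = backbone.layer1(x)
    x = backbone.layer2(x)
    x = backbone.layer3(x)   # dilated stage (rate 2)
    x = backbone.layer4(x)   # dilated stage (rate 4)
    return x

pre = torch.randn(1, 3, 512, 512)    # pre-disaster image (assumed crop size)
post = torch.randn(1, 3, 512, 512)   # post-disaster image
feat_pre, feat_post = extract_features(pre), extract_features(post)
print(feat_pre.shape)  # torch.Size([1, 2048, 64, 64]) with output stride 8
```

The two feature groups (feat_pre, feat_post) would then feed the localization decoder and the D2A block, respectively.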
In order to effectively cope with the building localization and change detection tasks, we adopt two decoders. Each decoder contains five blocks, each with an up-sampling layer and a convolutional layer. The pre-disaster features are fed into one decoder to generate a binary building mask and accomplish the building localization task. For the change detection task, we develop a D2A block to process the two groups of features, considering both global and local change information between the pair of images taken at different times. The D2A block contains a dual-temporal aggregation module and a difference-attention module, which are introduced in detail in Section 3.2 and Section 3.3, respectively. As shown in Fig. 2, before being fed into the decoder, the paired features are processed by the D2A block to excite the global change pattern between the dual-temporal satellite images and capture the rich dependencies among elements across the multi-level changes.
For the two tasks, we use the combo loss [71] for building localization and the cross-entropy loss for change detection. The overall loss function of our model can be defined as
$$\mathcal{L} = \mathcal{L}_{\mathrm{combo}} + \lambda_1 \mathcal{L}_{\mathrm{ce}} \tag{1}$$
where $\lambda_1$ is a constant. The combo loss is defined as
$$\mathcal{L}_{\mathrm{combo}} = \mathcal{L}_{\mathrm{focal}} + \lambda_2 \mathcal{L}_{\mathrm{dice}} \tag{2}$$
where $\lambda_2$ is a constant used to balance the corresponding term. The combo loss is a weighted sum of the focal loss [72] and the dice loss [73], which are formulated as Eqs. (3) and (4):
$$\mathcal{L}_{\mathrm{focal}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\alpha\, y_i (1-\hat{y}_i)^{\gamma}\log \hat{y}_i + (1-\alpha)(1-y_i)\,\hat{y}_i^{\gamma}\log (1-\hat{y}_i)\right] \tag{3}$$
$$\mathcal{L}_{\mathrm{dice}} = 1-\frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N}\hat{y}_i} \tag{4}$$
where $y_i$ is the ground truth indicating building or background for the pixel at position $i$, $\hat{y}_i$ is the corresponding predicted probability of our approach for the building localization task, and $N$ is the number of pixels in the feature map. In the focal loss, $\alpha$ is a weighting factor and $\gamma$ is an adjustable focusing parameter, which are adopted to deal with the problem of class imbalance.
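A minimal PyTorch sketch of this objective is given below, assuming the standard binary focal and soft-Dice formulations; the constants lam1, lam2, alpha, and gamma are placeholders rather than the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def combo_loss(logits: torch.Tensor, target: torch.Tensor,
               lam2: float = 1.0, alpha: float = 0.25, gamma: float = 2.0,
               eps: float = 1e-6) -> torch.Tensor:
    """Combo loss sketch for building localization: focal loss + lam2 * Dice loss.
    `logits` and `target` have shape (B, 1, H, W); lam2/alpha/gamma are assumed values."""
    prob = torch.sigmoid(logits)
    # Focal term: down-weight easy pixels via the (1 - p_t)^gamma factor.
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    focal = (-alpha_t * (1 - p_t).pow(gamma) * torch.log(p_t.clamp(min=eps))).mean()
    # Dice term: 1 - soft Dice coefficient over all pixels.
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return focal + lam2 * dice

def overall_loss(loc_logits, loc_target, dmg_logits, dmg_target, lam1: float = 1.0):
    """Overall objective: combo loss for localization plus lam1-weighted cross-entropy
    for multi-level change detection (dmg_target holds per-pixel class indices)."""
    ce = F.cross_entropy(dmg_logits, dmg_target)
    return combo_loss(loc_logits, loc_target) + lam1 * ce
```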

Dual-temporal aggregation
When detecting changes between pre- and post-disaster satellite images, the feature-level differences between the features of the dual-temporal images are worth exploring. Among the dual-temporal features, different channels may present diverse information. Specifically, some channels mainly model change patterns reflecting the differences, while other channels may tend to describe background-related information. We develop a dual-temporal aggregation (DTA) module to discover the change-sensitive channels and capture the global change pattern. The architecture of the DTA module is illustrated in Fig. 3. Let $X_b$ and $X_a$ denote a pair of input features before and after the disaster, respectively. The paired features are first fed into 1 × 1 convolution layers to reduce the number of feature channels for high efficiency, and we obtain the output features $X'_b$ and $X'_a$. Then a 3 × 3 convolution layer is used to perform a channel-wise transformation on the features, and the transformed features are utilized to calculate the temporal difference:
$$X_d = W_{\mathrm{trans}} * X'_a - W_{\mathrm{trans}} * X'_b \tag{5}$$
where $W_{\mathrm{trans}}$ is the weight matrix of a 3 × 3 channel-wise convolution layer that performs the transformation for each channel. $X_d$ denotes the difference between the dual-temporal feature maps, which helps us to learn the global change information. The subtraction operation suppresses the information of the background scene, which is also beneficial for building recognition and change detection. We concatenate the difference $X_d$ with $X'_b$ and $X'_a$ along the channel dimension, which enhances the global change pattern and preserves the scene information. It is defined as Eq. (6):
$$X_c = \mathrm{concatenate}(X'_b,\, X_d,\, X'_a) \tag{6}$$
Finally, we adopt a squeeze-excitation (SE) block [74] to excite the change-sensitive channels. It recalibrates the concatenated feature $X_c$ using the channel attention mechanism. In particular, there are two fully connected layers and a sigmoid function in the SE block. The concatenated feature $X_c$ is fed into this block after a global average pooling, and the output vector, with values in (0, 1), is multiplied by the corresponding channels of the input $X_c$.
Fig. 3 The dual-temporal aggregation (DTA) module fuses the pre-disaster image features $X_b$ and the post-disaster image features $X_a$. It utilizes the temporal difference to discover the change-sensitive channels and excite the global change pattern. "1 × 1" denotes a 1 × 1 convolution layer, "3 × 3" denotes a 3 × 3 channel-wise convolution layer, and "FC" is a fully connected layer.
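The following is a minimal PyTorch sketch of the DTA computation described above. The reduced channel count and the SE reduction ratio are assumed values, and fully connected layers are written as 1 × 1 convolutions, so this is an illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class DTA(nn.Module):
    """Sketch of the dual-temporal aggregation module (channel counts are assumptions)."""
    def __init__(self, in_ch: int, mid_ch: int, reduction: int = 16):
        super().__init__()
        self.reduce_b = nn.Conv2d(in_ch, mid_ch, kernel_size=1)   # 1x1 reduce, pre-disaster
        self.reduce_a = nn.Conv2d(in_ch, mid_ch, kernel_size=1)   # 1x1 reduce, post-disaster
        # 3x3 channel-wise (depthwise) convolution shared by both temporal features.
        self.trans = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, groups=mid_ch)
        cat_ch = mid_ch * 3
        # SE block: global average pooling, two FC layers (as 1x1 convs), sigmoid gate.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(cat_ch, cat_ch // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(cat_ch // reduction, cat_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x_b: torch.Tensor, x_a: torch.Tensor) -> torch.Tensor:
        xb, xa = self.reduce_b(x_b), self.reduce_a(x_a)
        x_d = self.trans(xa) - self.trans(xb)          # temporal difference, Eq. (5)
        x_c = torch.cat([xb, x_d, xa], dim=1)          # concatenation, Eq. (6)
        return x_c * self.se(x_c)                      # excite change-sensitive channels
```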

Difference-attention module
A natural disaster usually causes varying degrees of damage to buildings. That is, there are multiple levels of change in the post-disaster satellite images compared to the pre-disaster images. Assessing the severity of damage is a prerequisite for disaster response and resource allocation, and exploring the relationship among multi-level changes can improve the ability to identify different damage scales. Thus, we develop a difference-attention (DA) module to learn local dependencies among the different changes, as shown in Fig. 4.
Let $X \in \mathbb{R}^{H\times W\times C}$ denote the input feature map of the DA module, where $H$, $W$, and $C$ represent the height, width, and number of channels, respectively. The feature map $X$ is first split into $p \times p$ cubes along the height and width dimensions. The shape of each small feature cube is $H' \times W' \times C$, where $H' = H/p$ and $W' = W/p$. The parameter $p$ is the number of splits, and we set $p = 8$ in our experiments. Each cube is supposed to contain one change between the pre- and post-disaster satellite images. All cubes are rearranged to form $X' \in \mathbb{R}^{D\times H'\times W'\times C}$, where $D = p^2$ denotes the number of small feature cubes.
We propose a grouped self-attention mechanism based on the non-local module in Ref. [60] to effectively learn the similarities between any positions and any channels across the multiple feature cubes. $X'$ is first fed into three 1 × 1 × 1 convolution layers to learn the transformations, which are defined as
$$\theta(X') = W_\theta * X', \quad \phi(X') = W_\phi * X', \quad g(X') = W_g * X' \tag{7}$$
where $W_\theta$, $W_\phi$, and $W_g$ are the weight matrices of the convolution layers, and $*$ denotes the convolution operation. Directly performing the attention operation on $\theta(X')$, $\phi(X')$, and $g(X')$ is infeasible since their shapes are $D \times H' \times W' \times C$ and the computational complexity would be very high. Recently, the idea of dividing channels into groups has been explored in some studies and proven to be effective for increasing the capacity of CNNs, such as Xception [75], ResNeXt [76], and the group normalization approach [77]. We introduce the group idea into our DA module and group the dimension $D$ into $G$ groups. The tensors after grouping have shape $D' \times H' \times W' \times C$, where $D' = D/G$. Each group is processed independently by the matrix multiplication in Eq. (8) to compute $Z$:
$$Z = f\left(\mathrm{vec}(\theta(X')),\, \mathrm{vec}(\phi(X'))\right)\, \mathrm{vec}(g(X')) \tag{8}$$
where vec denotes a vector after the grouping and reshaping operations, and $\mathrm{vec}(\theta(X'))$, $\mathrm{vec}(\phi(X'))$, $\mathrm{vec}(g(X')) \in \mathbb{R}^{D'H'W'C\times 1}$. The pairwise function $f(\cdot,\cdot)$ computes the affinity among all positions and all channels in one group. As described in Ref. [60], the dot-product is probably the simplest choice, i.e.,
$$f\left(\mathrm{vec}(\theta(X')),\, \mathrm{vec}(\phi(X'))\right) = \mathrm{vec}(\theta(X'))\, \mathrm{vec}(\phi(X'))^{\mathrm{T}} \tag{9}$$
Then we define Eq. (10) to obtain the output $Y \in \mathbb{R}^{D\times H'\times W'\times C}$ of the DA module:
$$Y = W_y * \mathrm{concatenate}(Z_1, Z_2, \ldots, Z_G) \tag{10}$$
where "concatenate" denotes that all groups are concatenated along the dimension $D'$, and $W_y$ denotes the weight of a 1 × 1 × 1 group-wise convolution layer. $Y$ is then rearranged so that its size is the same as that of the input feature map $X$.
Fig. 4 Illustration of the difference-attention (DA) module. The input feature map $X \in \mathbb{R}^{H\times W\times C}$ is first split along the height and width dimensions to obtain small feature cubes with shape $H' \times W' \times C$. All cubes are then rearranged to form $X' \in \mathbb{R}^{D\times H'\times W'\times C}$. The dotted box represents the matrix multiplication for each group, i.e., Eq. (8). $G$ is the number of groups, and "×G" denotes that the operation in Eq. (8) is repeated $G$ times.
In the DA module, we set the number of groups G = 16 to make the grouped attention module capture dependencies among any positions and any channels in one group, where each group contains 4 small cubes. Since each small cube after being split has the potential to represent one difference, the DA module will learn the similarity between multi-level differences and augment discrimination of multiple changes between the dual-temporal images. Besides, we provide further analysis in the experiments of parameter settings for the DA module, including the number of cubes D and the number of groups G.
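A simplified PyTorch sketch of this grouped attention over feature cubes is shown below. For readability, it treats each cube as a single token within its group rather than flattening positions and channels exactly as in Eqs. (7)-(10), and the residual connection follows common non-local practice; it therefore illustrates the grouping idea rather than reproducing the paper's exact operator. H and W are assumed to be divisible by p.

```python
import torch
import torch.nn as nn

class DA(nn.Module):
    """Simplified sketch of grouped attention over p x p feature cubes."""
    def __init__(self, channels: int, p: int = 8, groups: int = 16):
        super().__init__()
        self.p, self.groups = p, groups
        self.theta = nn.Conv2d(channels, channels, kernel_size=1)
        self.phi = nn.Conv2d(channels, channels, kernel_size=1)
        self.g = nn.Conv2d(channels, channels, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)

    def _to_cubes(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, G, D', H'*W'*C): split into p*p cubes, then group them.
        b, c, h, w = x.shape
        hp, wp = h // self.p, w // self.p
        x = x.reshape(b, c, self.p, hp, self.p, wp).permute(0, 2, 4, 3, 5, 1)
        x = x.reshape(b, self.p * self.p, hp * wp * c)      # (B, D, H'*W'*C)
        return x.reshape(b, self.groups, -1, hp * wp * c)   # (B, G, D', H'*W'*C)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self._to_cubes(self.theta(x))                        # (B, G, D', L)
        k = self._to_cubes(self.phi(x))
        v = self._to_cubes(self.g(x))
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)    # (B, G, D', D') cube affinities
        y = attn @ v                                             # (B, G, D', L)
        hp, wp = h // self.p, w // self.p
        # Rearrange cubes back to the (B, C, H, W) layout.
        y = y.reshape(b, self.p, self.p, hp, wp, c).permute(0, 5, 1, 3, 2, 4)
        y = y.reshape(b, c, h, w)
        return self.out(y) + x                                   # residual, as in non-local blocks

# Usage sketch: da = DA(channels=256); y = da(torch.randn(2, 256, 64, 64))
```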

Experiments
In this section, we introduce the xBD dataset and evaluation metric utilized in experiments, and then present detailed results of comparisons and ablation studies.

Dataset
In our experiments, we adopt the large-scale xBD dataset [20] for change detection and building damage assessment. It provides pre- and post-disaster satellite imagery across 19 different natural disasters, including earthquakes, wildfires, tsunamis, and volcanic eruptions. The xBD dataset contains 22,068 satellite images and geographical positions with 850,736 building annotations, and covers a total of 45,362 km². The satellite images have a size of 1024 × 1024 pixels and a spatial resolution of 0.3 m. xBD presents the Joint Damage Scale as a unified assessment scale for building damage from satellite imagery across multiple disaster types, with four levels: no damage, minor damage, major damage, and destroyed. xBD provides training and testing splits with 5598 and 1866 satellite images, respectively, which are used in our experiments.

Evaluation metric
To validate the performance of our D2ANet for multi-level change detection, we adopt the evaluation metric used in the xView2 challenge [20]. It is defined as Eqs. (11) and (12):
$$S_{\mathrm{xView2}} = 0.3 \times F1_{\mathrm{loc}} + 0.7 \times F1_{\mathrm{damage}} \tag{11}$$
$$F1_{\mathrm{damage}} = \frac{n}{1/F1_{\mathrm{cls}_1} + \cdots + 1/F1_{\mathrm{cls}_n}} \tag{12}$$
where $F1_{\mathrm{loc}}$ is the F1 score of building localization, measuring the agreement between the per-pixel predictions and the ground truth in the pre-disaster image. $F1_{\mathrm{damage}}$ is the F1 score for change detection, which measures the consistency between the predictions over building pixels and the ground truth from the post-disaster image. $F1_{\mathrm{cls}_1}, \ldots, F1_{\mathrm{cls}_n}$ denote the change detection F1 scores for the $n$ damage scales. Note that $S_{\mathrm{xView2}}$ is a weighted average of the localization F1 score and the harmonic mean of the multi-level change detection F1 scores. The harmonic mean penalizes over-fitting to categories with a large number of building polygons, and since the distribution of damage scales in xBD is heavily imbalanced, this evaluation metric is very challenging.
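A small Python sketch of this score is given below; the 0.3/0.7 weighting follows the public xView2 challenge and is taken as an assumption here.

```python
from statistics import harmonic_mean

def xview2_score(f1_loc: float, per_class_f1: list[float],
                 loc_weight: float = 0.3, dmg_weight: float = 0.7) -> float:
    """Weighted average of the localization F1 and the harmonic mean of the
    per-damage-class F1 scores (Eq. 12); weights are the assumed xView2 values."""
    f1_damage = harmonic_mean(per_class_f1)   # heavily penalizes any weak class
    return loc_weight * f1_loc + dmg_weight * f1_damage

# Example: strong localization, but one weak damage class drags the score down.
print(xview2_score(0.86, [0.92, 0.55, 0.70, 0.80]))
```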

Implementation details
D2ANet adopts the stochastic gradient descent (SGD) optimizer with a batch size of 8. The learning rate is initially set to 0.01, and the weight decay and momentum coefficients are set to 5 × 10⁻⁴ and 0.9, respectively. Our method is trained for 150 epochs, and the "poly" learning rate policy is adopted during training, which reduces the learning rate by multiplying it by $(1 - \mathrm{iter}/\mathrm{max\_iter})^{\mathrm{power}}$ with power = 3. We use the deep learning framework PyTorch [78] to implement our D2ANet, and the experiments are conducted on 4 NVIDIA RTX TITAN GPUs with 24 GB memory. During training, data augmentation techniques including random rotation, rescaling, horizontal flipping, and Gaussian blurring are applied to improve the generalization ability of our model. The images in the xBD dataset are randomly cropped to a fixed size of 512 × 512 for training.
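A minimal PyTorch sketch of this optimizer setup and "poly" schedule is given below; the number of iterations per epoch and the stand-in model are placeholders.

```python
import torch

def poly_lr(base_lr: float, cur_iter: int, max_iter: int, power: float = 3.0) -> float:
    """'Poly' policy: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

model = torch.nn.Conv2d(3, 1, 3)   # stand-in for D2ANet
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)

# Inside the training loop, every parameter group's learning rate is updated:
cur_iter, max_iter = 1000, 150 * 700   # max_iter = epochs * iters/epoch (assumed value)
for group in optimizer.param_groups:
    group["lr"] = poly_lr(0.01, cur_iter, max_iter, power=3.0)
```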

Comparison with state-of-the-art methods
In this section, we compare the performance of our D2ANet with several state-of-the-art methods for building localization and multi-level change detection on the xBD dataset [20], including Baseline [20], Siamese-UNet(ResNext50) [28], Siamese-UNet(DPN92) [28], Dual-HRNet [79], Dual-Temporal Fusion [37], and RescueNet [80]. The Baseline in Ref. [20] introduces an altered U-Net [35] architecture for building localization and a ResNet-50 [36] pretrained on ImageNet [81] for the damage classification task. In the area of change detection, Siamese-UNet is a widely used architecture: a U-Net is trained on pre-disaster images for building localization, and a Siamese-UNet with weights shared from the localization model is applied to the post-disaster images for change detection. Based on two different backbones, ResNext50 [76] and DPN92 [82], we implement two Siamese-UNet architectures. The Dual-HRNet [79] contains two HRNets and a fusion block for the two tasks based on the dual-temporal images. The Dual-Temporal Fusion in Ref. [37] adopts Mask R-CNN [38] with a feature pyramid network module [39] and a semantic segmentation head to complete the two tasks. RescueNet [80] employs multi-scale temporal features to identify damage in the segmented buildings and can simultaneously segment buildings and classify damage levels. For a fair comparison, we re-run these methods and conduct experiments on the same dataset.

Quantitative comparison
The quantitative comparison on the xBD dataset is listed in Table 1.
Table 1 Quantitative comparison (%) of our proposed D2ANet with some state-of-the-art methods on the xBD dataset. "F1 score" denotes the overall F1 score, i.e., $S_{\mathrm{xView2}}$ in Eq. (11). "Loc F1" is the F1 score for building localization. "Damage F1" represents the F1 score for damage detection.
We also report experimental results of our D2ANet using ResNet-50 [36], VAN-B2 [83], and VAN-B3 [83], which are better than those of the other methods. D2ANet with ResNet-50 improves on Dual-Temporal Fusion [37] by 0.88% in overall F1 score and 1.96% in localization F1 score. D2ANet with VAN-B2 [83] and VAN-B3 [83] outperforms D2ANet with ResNet-50 by 0.31% and 1.46% in overall F1 score, respectively. We notice that some participants in the xView2 challenge achieved impressive results. However, these participants integrate several segmentation models, and the two tasks are trained separately, which is more complicated and less efficient. Our approach obtains better performance with a single network than the widely used Siamese-UNet model in the challenge. D2ANet's new state-of-the-art results for building localization and multi-level change detection make it more suitable for disaster assessment.
We report the number of parameters and the running time of the change detection methods on the xBD dataset, as listed in Table 2. The comparative experiments of all approaches are conducted on a workstation with 4 NVIDIA RTX TITAN GPUs. For a fair comparison, we report the inference time for one image of size 1024 × 1024. Since the Baseline [20] and Dual-Temporal Fusion [37] are trained with a ResNet-50 backbone, we also list the results of our D2ANet with ResNet-50 for fairness. The number of parameters of D2ANet with a ResNet-50 backbone is 43.6 M, which is the smallest among all methods and 0.3 M fewer than that of Dual-Temporal Fusion [37]. Nevertheless, the overall F1 score of our D2ANet with ResNet-50 is better than that of the other methods. Furthermore, our D2ANet with ResNet-50 achieves the fastest inference time.
When we adopt a ResNet-101 backbone, the overall F1 score is improved by 2.03% compared to the result using ResNet-50, as shown in Table 1. The number of parameters of D2ANet with ResNet-101 is still less than Siamese-UNet(ResNext50) [28] and Siamese-UNet(DPN92) [28]. Siamese-UNet adopts two networks for building localization and change detection, which requires more parameters.

Qualitative comparison
Qualitative results of our D2ANet and the other methods are shown in Fig. 5, which contains seven examples from different disasters, including Hurricane Harvey, Hurricane Florence, an earthquake, flooding, a tsunami, a wildfire, and a fire. It is noted that the building segmentation results of our approach are more consistent with the ground truth. Besides, the classification of building damage scales by D2ANet is more accurate. For example, in the first row of Fig. 5, several buildings are damaged by a hurricane. D2ANet achieves the best qualitative result, which is closest to the ground truth, whereas the other methods cannot correctly identify the types of building damage, especially Baseline [20] and Dual-Temporal Fusion [37]. In addition, methods like Siamese-UNet(ResNext50) [28] and Siamese-UNet(DPN92) [28] fail to recognize complete buildings. These visualized results on the xBD dataset [20] demonstrate the superior performance of our proposed D2ANet for building localization and multi-level change detection from dual-temporal satellite images.

Effectiveness of proposed modules
To model the global change pattern and capture local correlations among the multi-level changes, we develop the DTA and DA modules in D2ANet. Experimental results verifying the effectiveness of the two modules are shown in Table 3. No. 1 is the segmentation model with the encoder-decoder architecture alone. Adding the DTA module and the DA module improves the overall F1 score by 65% and 3.86%, respectively, where the latter validates that the DA module can augment the discrimination of multi-level changes between pre-disaster and post-disaster images. After combining the two modules, the overall F1 score improves by 3.96% and the damage F1 score improves by 5.22%. It is noted that adding only the DA module obtains better localization F1 and destroyed F1 scores than the full version, while its overall F1 and damage F1 scores are lower.
Since utilizing the two modules helps our proposed model effectively learn global and local multi-level change information, the full version D2ANet has advantages in identifying different changes.

Effect of different configurations in DA
In the DA module, there are two important parameters: the number of cubes D and the number of groups G. We analyze the influence of different parameter settings on the experimental results. The analysis of the parameter D is listed in Table 4. It is noted that D = 64 is the best setting for the DA module, while fewer or more cubes hamper the performance improvements. For example, the overall F1 score is 75.43% when the number of cubes D = 64, which exceeds the other two settings D = 16 and D = 256 by 4.01% and 8.17%, respectively. Cubes are obtained by splitting the feature map, where each cube is considered to contain only one change between the dual-temporal images. If there are fewer cubes, each cube may contain multiple changes, which hinders learning the similarity between multi-level differences. When too many cubes are split, each cube contains few pixels, which may restrict feature learning. Therefore, the experimental results are in line with our expectations.
The group idea is another important strategy in our DA module. As shown in Table 5, we achieve the best overall F1 score when the number of groups G = 16, which exceeds the other two settings G = 8 and G = 32 by 4.17% and 2.57%, respectively. Besides, the localization F1 score and damage F1 score are also the best when G = 16. These results are expected since the affinity in the DA module considers the points across different changes. If there are few groups, each group contains many small cubes, resulting in restricted optimization. However, adopting too many groups limits the capturing of rich dependencies between elements across the multi-level changes. If we do not adopt the group idea, i.e., G = 1, the performance is the worst, as shown in Table 5.
According to the above experiments, we set the number of cubes D = 64 and the number of groups G = 16 in our D2ANet.

Effect of different configurations in DTA
In the DTA module, we propose several operations to explore the change-sensitive channels and capture the global change pattern, including the temporal difference, the channel-wise convolution, and the SE block [74]. As listed in Table 6, we provide experimental results for the DTA module under different configurations to validate the effectiveness of the three operations.
Table 6 Ablation study (%) for the proposed DTA module with different configurations for capturing the global change pattern between dual-temporal satellite images. "No temporal difference" denotes that the temporal difference is not applied and $X_c$ is the concatenation of the pre-disaster feature $X'_b$ and the post-disaster feature $X'_a$. "No channel-wise convolution" indicates that the 3 × 3 channel-wise convolution layers are not adopted and the temporal difference is $X_d = X'_a - X'_b$. "No SE block" denotes that the SE block is not applied in the DTA module.
When the temporal difference is not applied and $X_c$ is the concatenation of the pre-disaster feature $X'_b$ and the post-disaster feature $X'_a$, the overall F1 score is lower than that of our D2ANet by 4.12%.
If we do not adopt the 3 × 3 channel-wise convolution layers, i.e., the temporal difference is $X_d = X'_a - X'_b$, the overall F1 score is lower than that of D2ANet by 3.05%. The SE block increases the overall F1 score by 1.20%. The above results show that these operations improve the performance of change detection and further verify the rationality and effectiveness of the DTA module.

Analysis of failure cases
Our method achieves new state-of-the-art performance on the xBD dataset [20], as shown in the above experiments, but there are still some failure cases. As shown in Fig. 6, we present three samples. In Fig. 6(a), there are illumination variations and cloud occlusions between the dual-temporal images, and our D2ANet fails to identify the damage scale in the areas affected by cloud occlusions. Besides, D2ANet identifies greenhouses as buildings in Fig. 6(b), while it fails to detect buildings in Fig. 6(c) since the color and texture of the buildings are similar to those of the surrounding farmland. In the future, we will consider combining other satellite imagery, such as hyperspectral imagery, to detect damaged buildings in these challenging areas.
Fig. 6 Visualization of some cases in which our D2ANet fails. The first to fourth rows are pre-disaster satellite imagery, post-disaster satellite imagery, the ground truth, and the change detection results of our D2ANet, respectively. The colors green, blue, orange-red, and red denote no damage, minor damage, major damage, and destroyed, respectively. (a)-(c) Three samples from the xBD dataset.

Conclusions
This paper presents a difference-aware attention network (D2ANet) for simultaneous building localization and multi-level change detection from dual-temporal satellite images. We first adopt a CNN architecture as the encoder to extract features from the dual-temporal images. The pre-disaster features from the encoder are fed into one decoder to accomplish building segmentation. For multi-level damage detection, we develop a difference-aware attention block that contains a dual-temporal aggregation (DTA) module and a difference-attention (DA) module. The DTA module is designed to learn the global change pattern and excite the change-sensitive channels. The DA module exploits local correlations among any positions and any channels in one group, where the small feature cubes have the potential to represent multi-level differences. Extensive experiments demonstrate the superiority of our D2ANet, and ablation studies validate the effectiveness of the two proposed modules. In the future, we plan to adjust the self-attention mechanism in the DA module to improve its computational efficiency. Besides, other tasks in which the difference-attention module could be used, such as video-based action recognition and object tracking, are worth exploring.