1 Introduction

As global urbanization has accelerated in recent decades, the environmental, traffic, and demographic issues that arise in the urbanization process have become of great concern to the international community (Grimm et al., 2008). Accurate judgment of urban development modes and in-depth analysis of current urban problems can be considered as one way to achieve sustainable urban development. Multi-temporal classification and change detection based on remote sensing imagery is a commonly used method in urban development analysis. As high-resolution (HR) Earth observation technology develops rapidly, a large quantity of HR images can now be obtained. The plentiful contextual information and spatial details provided by HR imagery can help us to obtain richer land surface distribution and change information (Bouziani et al., 2010) in specific urban tasks. There has been a long history of pixel-based (Polykretis et al., 2020; Johnson and Kasischke, 2010) and object-oriented (Shi et al., 2020; Hussain et al., 2013) multi-temporal classification and change detection research. However, what is needed in practical urban applications is urban land-use pattern and change information, such as the transition from farmland to residential land and the expansion of industrial areas. Due to the diversity of objects and the complex spatial distribution, the traditional pixel-based and object-oriented methods are unable to obtain the high-level semantic information expressed by HR images from the low-level data features, which poses a great challenge for practical urban analysis.

With the aim of better solving the problems of the traditional methods and achieving an understanding of the high-level semantic information, the concept of the “scene” has been proposed (Zhao et al., 2013). Urban scenes refer to the basic urban units with clear geographical boundaries and socio-economic attributes, which can express high-level semantic land-use information. The spatial aggregation of different urban land-use patterns can create different urban scenes, which can further reflect the urban functional distribution behind the scenes (Zhao et al., 2013). On account of the abundant and detailed structural and spatial information provided by HR imagery (Zhu et al., 2016), multi-temporal scene classification and change detection can allow us to obtain multi-temporal urban land-use distribution and change information at a higher semantic level (Ni and Wu, 2019). This can help us to accurately grasp the spatial patterns and rules of urban land expansion, reasonably analyze urban development patterns, and correctly guide the sustainable use of urban land-use resources.

Based on research in the field of natural image processing, the methods of remote sensing scene classification have continuously developed from low-level feature based methods (Oliva and Torralba, 2001) and mid-level feature based methods (Zhu et al., 2016) to deep learning based methods (Yong Wang et al., 2019). However, the methods of the first two stages are based on manually designed features. Therefore, these methods can obtain good application results only in certain cities, but cannot achieve satisfactory feature adaptability. Differing from the methods based on manually designed features, as an end-to-end framework, convolutional neural networks (CNNs) can realize automatic deep feature extraction and scene class prediction (Wu et al., 2017). Due to the data-driven characteristic of deep learning, the CNN-based multi-temporal methods all require a large amount of data labeling for the model training on multi-temporal images, which is time-consuming and does not fully utilize the prior knowledge of the existing labeled datasets (Ru et al., 2021; Hu et al., 2015). Deep learning models also rely on the assumption that the training and test sets are independently and identically distributed. Therefore, if we do not perform additional multi-temporal data labeling and use a model trained on one time phase for multi-temporal scene classification, the model generalization ability will be insufficient, due to the differences in data distribution and the diverse distribution patterns within the scenes among the multi-temporal images. In addition, the existing methods of multi-temporal urban scene classification and change detection all use a uniform grid for scene patch segmentation, which causes a serious mosaic phenomenon in the mapping results and cannot match the needs of actual urban applications.

For the purpose of making the most of the previously labeled data while ensuring a satisfactory accuracy for the model in the multi-temporal scene classification and change detection tasks, transfer learning can be introduced to reduce the workload of multi-temporal data labeling. The idea of transfer learning is defined as the ability to transfer the knowledge obtained in a previous task (source domain) to a new task (target domain), enabling satisfactory learning results for the target domain (Mansour et al., 2009; Sinno Jialin Pan and Yang, 2010; Jiang and Zhai, 2007; Audebert et al., 2017; Gopalan et al., 2011). Domain adaptation is a commonly used transfer learning strategy (Matasci et al., 2015). So far, cross-temporal classification and change detection based on domain adaptation have not been carried out at the scene level, although there is some research basis at the pixel level (Hu et al., 2015; Demir et al., 2013).

In this paper, a multi-temporal scene classification framework based on a class-weighted deep adaptation network (CWDAN) is proposed. In this framework, multi-temporal OpenStreetMap (OSM) road networks and uniform grid sampling are introduced for mapping at the land parcel level and model training at the patch level, respectively. In order to reduce the workload of multi-temporal data labeling, a domain adversarial neural network (DANN) with a gradient reversal layer (GRL) is introduced, so that the labeled source scene patches and the unlabeled target scene patches are used as the network inputs for cross-temporal scene classification, to obtain deep features that are invariant to the shift between the domains. The problem of the large scale differences between parcels is addressed by the area-weighted voting (AWV) strategy, instead of the traditional equal-weighted voting (EWV) strategy. A class-weighted (CW) fully connected layer is used to solve the problem of the unbalanced proportions of the different urban scene classes in the different time phases. Post-classification is finally applied to obtain the multi-temporal scene change information.

The main contributions of this paper are as follows:

  • 1) A multi-temporal scene classification framework based on a class-weighted deep adaptation network (CWDAN) is proposed to reduce the multi-temporal data labeling workload. In the proposed framework, the idea of domain adaptation is introduced. The training data of only one time phase are labeled, and cross-temporal scene classification is achieved by feature alignment. Compared with the existing multi-temporal scene classification and change detection frameworks, the proposed framework can drastically reduce the multi-temporal data labeling workload, as well as improve the model training efficiency. Tri-temporal scene classification and change detection mapping in study areas in the Wuhan urban center and the Wan Chai area of Hong Kong was undertaken to verify the effectiveness of the proposed CWDAN framework.

  • 2) As urban areas develop, the proportion of the urban scene classes contained in the cross-temporal images will change. When the source domain contains a different proportion of categories than the target domain, this has a negative impact on the model training. Thus, the original DANN network is improved by adding the calculation of class weights in the fully connected layer, which means that, for a category with a low proportion in the target domain, its weight will be increased to balance the classification effect, and vice versa. The CWDAN model can solve the negative transfer problem caused by the category imbalance and can significantly improve the cross-temporal classification accuracy.

  • 3) Multi-temporal land parcel level scene classification and change detection maps are acquired. To solve the problem of the mosaic phenomenon caused by uniform patch-based scene segmentation, OSM road network data are added to achieve land parcel level scene segmentation. The uniform patch-level scene segmentation is also performed as the input of the model training and prediction, which is finally mapped to the parcel level through the area-weighted voting (AWV) strategy for scene classification mapping and post-classification change detection. The results of multi-temporal scene classification and change detection based on land parcels can better reflect the urban layout and urban change information, thus providing favorable support for practical urban application analysis.

1.1 Related work

1.1.1 Scene classification and change detection based on bi-temporal labels

Scene classification methods based on low- and mid-level features have a history of more than ten years, and a number of change detection methods based on multi-temporal scene classification have since come into being. A bag-of-visual-words (BOVW) model based bi-temporal scene classification and change detection framework was proposed by Wu et al. (2016). In this method, the BOVW model is used to encode different kinds of low-level features. A support vector machine (SVM) is then applied to classify the bi-temporal images separately, and post-classification change detection is finally performed to obtain the scene change information. This work laid a good foundation for the subsequent research on multi-temporal scene classification and change detection. However, this method does not consider the temporal correlation in multi-temporal cases. To solve this problem, a scene change detection framework introducing kernel slow feature analysis (KSFA) was proposed by Wu et al. (2017). This method also applies the BOVW model to perform the scene classification, but a histogram intersection kernel (HIK) based KSFA module is added in the classification process to obtain the change probabilities of the bi-temporal scene images. After obtaining the bi-temporal classification results and the temporal correlation established by KSFA, post-classification with Bayesian theory is utilized to determine the “from-to” change information of the scene pairs for the change detection. This framework integrates the temporal correlation of the bi-temporal images, and can achieve a superior classification and change detection accuracy. However, both of the above methods are based on manually designed mid-level features in the feature extraction stage.

In recent years, CNNs have been widely used and have been successfully applied in many areas of remote sensing, including image classification (Shin et al., 2016), target detection (Gopalan et al., 2011), semantic segmentation (Girshick et al., 2013; Chen et al., 2014), and scene classification tasks (Cheng et al., 2018; Limin Wang et al., 2016). At the same time, a scene classification and change detection framework based on a CNN has also been developed by Yong Wang et al. (2019). This framework is based on bi-temporal images, and the CNN is applied to extract spectral-spatial features. Nonlinear high-dimensional transformation is then conducted based on the deep features of bi-temporal images, and deep canonical correlation analysis (DCCA) is applied to extract the change information. Ru et al. (2021) proposed the CorrFusion module to perform the bi-temporal scene classification and scene change detection task. This method also uses a CNN to extract the bi-temporal scene features, while the temporal correlation is calculated through the CorrFusion module. Cross-temporal fusion is then finally performed based on the extracted features and the temporal correlation calculation results. All the existing methods require bi-temporal classification labels to train models for multi-temporal classification and change detection, and there is a large demand for manual labeling.

1.1.2 Scene classification and change detection based on single-temporal labels

Domain adaptation has been widely applied in remote sensing image classification (Long et al., 2015; Long Liu et al., 2018) and change detection (Pan et al., 2011). For example, Wei Liu et al. (2019) applied an unsupervised land-use change detection method, which detects the unchanged regions and transfers the unchanged samples of the source domain (historical data) to the target domain (new data), to implement land-use change updating. Demir et al. (2013) and Hu et al. (2015) both applied a deep CNN to realize transfer between different classification datasets. However, multi-temporal classification and change detection methods based on domain adaptation have not been developed at the scene level up to now.

2 Methods

In order to reduce the workload of multi-temporal data labeling and to address the needs of urban development analysis in practical city planning, the CWDAN framework is proposed. Firstly, tri-temporal OSM road network data and uniform grid sampling are used to divide the scene units into land parcels and patches for the final multi-temporal scene classification and change detection mapping and the model training, respectively. To reduce the cost of tri-temporal data labeling, the image of only one time phase is labeled to construct the training dataset, and the proposed CWDAN framework is then used to obtain the cross-temporal scene classification results. After obtaining the tri-temporal scene results for the image patches, area-weighted voting (AWV) is applied to generate the scene classification maps at the land parcel level. Finally, post-classification is applied to obtain the final multi-temporal scene change maps.

2.1 Training datasets and multi-level scene segmentation

2.1.1 Class system for cross-temporal scene classification

For the urban scene classification standards and mapping rules used in this study, we synthesized the “Urban Land Classification and Planning Construction Land Standard (GB 50137-2011)” to form the parcel-level scene classification system of Residential area (Re), Educational and research area (Er), Sports area (Sa), Commercial service facilities (Cs), Industrial area (Ia), Transportation facilities (Tf), Park (Pa), Square (Sq), Green buffer (Gb), Water (Wa), Agricultural and forest (Af), Vacant area (Va), Other construction land (Oc) and Mixed land (Ml). In this study, functional areas of urban scenes that cannot be directly identified by remote sensing images, including “Oc” and “Ml”, were labeled for future research, but were not involved in the accuracy assessment of this study.

2.2 Multi-temporal road network processing

For the original downloaded OSM road network data, there were relatively serious data redundancy and topological errors. Therefore, it was necessary to manually correct the original road network data. Due to the unavailability of historical OSM road network data, it was also necessary to produce the historical road network data based on the obtained current road network data. Firstly, for the original road network data, the topological errors were corrected by extending or cropping to eliminate the hanging nodes in the road networks. Some overlapping redundant road segments were also manually removed. Secondly, based on the corrected road networks and the historical HR image data, road network correction was executed by manually adding or removing road network data to obtain the historical road networks. For a multi-temporal study, the road networks change over time. For example, some minor roads may disappear, major roads may increase in size, and some road sections may be merged. In order to ensure that the numbers of land parcels obtained from each temporal division were consistent and the accuracy of the road network was guaranteed, a “union” operation combining all the multi-temporal road network data was performed, so that the intersecting parts of the multi-temporal road networks independently form multiple component elements; that is, a parcel may be divided into multiple parcels due to partial changes in its contents. Finally, the complete road network was obtained for the multi-temporal research.
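The “union” operation can be reproduced with standard GIS tooling. The following is a minimal sketch, assuming GeoPandas and hypothetical file names for the three corrected road networks; it illustrates the merging step rather than reproducing the exact processing chain used in this study.

```python
# A minimal sketch (hypothetical file names, GeoPandas workflow) of the "union"
# operation that merges the corrected tri-temporal road networks into one network,
# so that every parcel boundary present in any phase is kept in all phases.
import geopandas as gpd
import pandas as pd

roads = [gpd.read_file(p) for p in ["roads_2000.shp", "roads_2010.shp", "roads_2017.shp"]]

# Stack all line features from the three phases into a single layer.
merged = gpd.GeoDataFrame(pd.concat(roads, ignore_index=True), crs=roads[0].crs)

# Dissolve duplicated/overlapping segments and split the result back into
# individual line elements, giving the complete multi-temporal road network.
union_network = merged.dissolve().explode(index_parts=False).reset_index(drop=True)
union_network.to_file("roads_union.shp")
```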

In accordance with the attributes detailed in the OSM road network dataset and the road widths specified in the “Code for Transport Planning on Urban Road (GB50220-95),” which was effective until 2019, we established various buffer radii for the distinct road classes (Table 1), to generate the buffer results for the final road networks. To address the misalignment issues caused by image registration errors, the buffer radii were reduced by 10 meters to ensure that the acquired parcels are complete. The road network buffers were then erased from the overall surface region of the study areas, to obtain the parcel-level scene segmentation results.
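As an illustration of the buffering and erasing steps, the following hedged sketch (again assuming GeoPandas, a hypothetical `highway` attribute, and placeholder radii rather than the exact values of Table 1) buffers each road by a class-dependent radius and subtracts the buffers from the study-area polygon to obtain the parcels.

```python
# A hedged sketch of parcel generation: each road is buffered by a class-dependent
# radius (placeholder values standing in for Table 1, already reduced by 10 m),
# and the buffers are erased from the study-area polygon to obtain the parcels.
import geopandas as gpd

roads = gpd.read_file("roads_union.shp")       # merged multi-temporal road network
study_area = gpd.read_file("study_area.shp")   # single polygon covering the study area

# Hypothetical buffer radii (in metres) per OSM road class.
buffer_radius = {"motorway": 30, "primary": 20, "secondary": 12, "residential": 5}
radii = roads["highway"].map(buffer_radius).fillna(5)   # 'highway' column assumed

buffers = gpd.GeoDataFrame(geometry=roads.buffer(radii), crs=roads.crs)

# Erase the road buffers from the study area and split the remainder into parcels.
parcels = gpd.overlay(study_area, buffers.dissolve(), how="difference")
parcels = parcels.explode(index_parts=False).reset_index(drop=True)
parcels.to_file("parcels.shp")
```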

Table 1 Buffer zone settings for the different road classes

2.3 Cross-temporal scene classification and change detection based on a class-weighted deep adaptation network (CWDAN)

In the cross-temporal scene classification task, we used the labeled data of only one time phase (source domain) to train the network and implement the classification task for the remaining time phases (target domain). However, as a result of the different illumination conditions, sensor conditions, and radiometric characteristics, there is a discrepancy between the data distributions of the different time phase images. Furthermore, the imbalance between the category proportions of the target domain and the source domain can also reduce the classification accuracy.

To solve the above-mentioned problems, a class-weighted DANN-based cross-temporal scene classification model is proposed in this paper. We use X to represent the scene patch images segmented by uniform grid sampling, and Y to represent their labels. The source time phase images can then be described as \(S = \{X^{s}_{l}, Y^{s}_{l}\}\), and the target time phase images can be described as \(T = \left\{X^{t}_{l}, Y^{t}_{l}; X^{t}_{u}\right\}\). The superscripts of X and Y are s or t, indicating the source domain or target domain, respectively; and the subscripts of X and Y are l or u, indicating labeled or unlabeled data, respectively. It should be noted that the target domain description consists of two parts, where \(T_{u} = \{X^{t}_{u}\}\) is the unlabeled patch data to be processed by the proposed model to obtain the corresponding image labels, and \(T_{l} = \{X^{t}_{l}, Y^{t}_{l}\}\) is the partially labeled patch data used for the evaluation of the algorithm accuracy at the patch level. Finally, the multi-temporal scene classification framework based on a class-weighted deep adaptation network (CWDAN) is summarized in Fig. 1.

Fig. 1
figure 1

Flowchart of the urban scene analysis framework based on the proposed CWDAN

The tri-temporal OSM road network data and uniform grid sampling are respectively used to achieve parcel-level and patch-level scene segmentation. For the parcel level, all of the time phase images are labeled for the final scene classification mapping accuracy evaluation. For the patch level, the CWDAN is trained by the labeled scene patches of the first time phase. The trained model is then used to obtain the scene labels for the remaining time phase images. After obtaining the multi-temporal scene classification results at the patch level, parcel-patch mapping with the AWV strategy is applied to obtain the final parcel-level scene classification results. Post-classification change detection is then used to acquire the final change information. The overall accuracy of the scene classification and change detection is evaluated by the number of land parcels correctly classified as a percentage of the total number of land parcels.

2.4 The class-weighted DANN

To avoid the redundant data labeling task of multi-temporal data classification, we use the idea of transfer learning and adopt the DANN as the benchmark network. The DANN is tasked with accomplishing the following objectives. 1) Accurate source domain classification. The first task involves achieving precise classification for the source domain dataset while minimizing the classification error on the existing images. 2) Domain confusion and adaptation. The second task focuses on inducing confusion between the source domain dataset and the target domain dataset. This aims to maximize the error of the domain classification, thereby facilitating the interchange of characteristics between these domains. Moreover, the process extends to confusing the target domain dataset with the source domain dataset to further enhance the adaptation. In the training process, the image classifier and the domain classifier confront each other through a GRL to achieve the balance between the image classification loss and the domain classification loss. The loss of the traditional DANN is defined in Eq. (1).

$${L}_{TD}={L}_{cls}\left(F\left(x^{s}_{l}\right), y^{s}_{l}\right)+\frac{1}{2}\left(L_{adv}\left(Dd\left(grl\left(F\left(x^{s}_{l}\right)\right)\right),1\right)+ L_{adv}\left(Dd\left(grl\left(F\left(x^{t}_{u}\right) \right)\right),0\right)\right)$$
(1)

where \({L_{cls}}\) represents the classification loss of the source domain and is calculated by cross-entropy; \({L_{adv}}\) is the adversarial domain confusion loss, also calculated by cross-entropy; F is the feature extractor; Dd is the domain discriminator; and grl is the gradient reversal layer.
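For reference, the GRL can be written in a few lines of PyTorch: the forward pass is the identity, while the backward pass reverses (and optionally scales) the gradient flowing back to the feature extractor. This is a generic sketch of the standard DANN component, not the authors' implementation; the class and function names are illustrative.

```python
# A generic PyTorch sketch of the gradient reversal layer (GRL): the forward pass
# is the identity, and the backward pass multiplies the gradient by -lambda, so the
# feature extractor is pushed to confuse the domain discriminator.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the feature extractor;
        # the second return value is the (non-existent) gradient w.r.t. lambd.
        return -ctx.lambd * grad_output, None

def grl(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)
```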

When directly classifying images of a new time phase (target domain), the different proportions of the scene categories in the source domain and the target domain can have a great impact on the classification accuracy. Therefore, to solve this problem, a new weight parameter \({w}_{a}\), as shown in Eq. (2), is introduced, where \({\hat{y}_{i}}\) is the predicted probability for the scene patch images with label a, and \(m_{a}^{t}\) represents the number of scene patch images labeled as a in the target domain. For a category that has a high proportion in both the source domain and the target domain, the probability of being correctly predicted in the target domain is high, and the number of correct predictions increases accordingly. However, for a category that has a low proportion in the source domain, the probability of being correctly predicted in the target domain is low, due to the difficulty of model learning. Thus, the weights for the categories with low proportions are set higher, to balance the model's attention to these categories and improve the classification accuracy for the target domain.

$$w_{a} = 1\left/\left(\mathrm{ln}\left(\frac{1}{m^{t}_{a}} \sum\limits^{m^{t}_{a}}_{i=1} \hat{y}_{i} + 1\right)\right)\right.$$
(2)
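A hedged sketch of how the weight in Eq. (2) could be computed is given below, assuming a tensor of predicted class probabilities for the target-domain patches and their predicted (pseudo) labels; the tensor and function names are assumptions for illustration.

```python
# A hedged sketch of Eq. (2): for each class a, average the predicted probabilities
# of the target-domain patches assigned to that class and take 1 / ln(mean + 1).
# Classes with low mean confidence (typically low proportions) receive larger weights.
import torch

def class_weights(probs: torch.Tensor, pseudo_labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # probs: (N, C) softmax outputs on target patches; pseudo_labels: (N,) argmax labels.
    weights = torch.ones(num_classes)
    for a in range(num_classes):
        mask = pseudo_labels == a
        if mask.any():
            mean_prob = probs[mask][:, a].mean()        # (1 / m_a^t) * sum of y_hat_i
            weights[a] = 1.0 / torch.log(mean_prob + 1.0)
    return weights
```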

Equation (3) shows the final loss function of the proposed CWDAN model, which consists of two parts, namely, the class-weighted classification loss and the domain-adversarial loss. By minimizing the classification loss and the domain-adversarial loss, optimal classification results can be obtained. The key structure of the proposed CWDAN model is shown in Fig. 2.

Fig. 2
figure 2

The structure of the proposed CWDAN model

$${L} =w_{a} {L}_{cls}\left(F\left(x^{s}_{l}\right), y^{s}_{l}\right) +\frac{1}{2}\left(L_{adv}\left(Dd\left(grl\left(F\left(x^{s}_{l}\right)\right)\right),1\right)+ L_{adv}\left(Dd\left(grl\left(F\left(x^{t}_{u}\right) \right)\right),0\right)\right)$$
(3)
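The following sketch assembles Eq. (3) from the class-weighted cross-entropy term and the adversarial domain loss, reusing the `grl` helper sketched above. The function and argument names are assumptions; the discriminator details follow common DANN practice rather than the exact implementation.

```python
# A sketch of the CWDAN loss in Eq. (3): a class-weighted cross-entropy term on the
# labeled source patches plus the adversarial domain-confusion term, with the domain
# discriminator trained to output 1 for source and 0 for target features.
import torch
import torch.nn.functional as F_nn

def cwdan_loss(feat_s, labels_s, feat_t, classifier, discriminator, w, lambd=1.0):
    # Class-weighted classification loss on the labeled source-domain patches.
    loss_cls = F_nn.cross_entropy(classifier(feat_s), labels_s, weight=w)

    # Adversarial domain-confusion loss through the gradient reversal layer.
    d_s = discriminator(grl(feat_s, lambd))
    d_t = discriminator(grl(feat_t, lambd))
    loss_adv = 0.5 * (
        F_nn.binary_cross_entropy_with_logits(d_s, torch.ones_like(d_s))
        + F_nn.binary_cross_entropy_with_logits(d_t, torch.zeros_like(d_t))
    )
    return loss_cls + loss_adv
```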

2.5 Multi-level parcel-patch category mapping system

In order to allow the multi-temporal data to be input into the deep learning network for training and classification, we performed scene segmentation on the original images using uniform grid sampling to obtain the patches to be input into the network, and the network then output the scene category labels for each patch of the target images. Utilizing patches derived from uniform grid sampling for model prediction and mapping can induce a mosaic phenomenon. Therefore, to acquire parcel-level mapping results that contain true geographical boundaries, it is imperative to establish a mapping relationship between the patch level and the parcel level. Traditional approaches employ an EWV strategy. However, when the parcel shape is elongated or irregular, portions of the patches belonging to other parcels may increase, interfering with the voting results of the current parcel. Consequently, an area-weighted voting strategy is introduced in this study, realizing a multi-level parcel-patch category mapping system to ameliorate this issue.

As shown in Fig. 3, in order to form the final mapping results at the parcel level, AWV effectively considers the scale difference and shape diversity of the different parcels by voting according to the parcel area to which each patch belongs, so as to obtain a more robust mapping accuracy. The association between parcels and patches is managed through a one-to-many voting mechanism: only those patches that overlap with the current parcel participate in the voting process for the corresponding parcel, and these patches collectively cover all areas of the current parcel.
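A minimal sketch of the AWV step is given below, assuming GeoPandas parcel and patch polygons with a hypothetical `pred` column holding the patch-level labels: each overlapping patch votes for its predicted class with a weight equal to its overlap area with the parcel.

```python
# A hedged sketch of area-weighted voting (AWV): every patch overlapping a parcel
# votes for its predicted class with a weight equal to the overlap area, and the
# parcel takes the class with the largest accumulated area. The 'pred' column is
# an assumed attribute holding the patch-level scene labels.
import geopandas as gpd
import numpy as np

def area_weighted_voting(parcels: gpd.GeoDataFrame, patches: gpd.GeoDataFrame,
                         num_classes: int) -> np.ndarray:
    parcel_labels = np.zeros(len(parcels), dtype=int)
    for i, (_, parcel) in enumerate(parcels.iterrows()):
        votes = np.zeros(num_classes)
        # Only patches intersecting the current parcel take part in its vote.
        hits = patches[patches.intersects(parcel["geometry"])]
        for _, patch in hits.iterrows():
            overlap = patch["geometry"].intersection(parcel["geometry"]).area
            votes[int(patch["pred"])] += overlap
        if votes.sum() > 0:
            parcel_labels[i] = int(votes.argmax())
    return parcel_labels
```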

Fig. 3
figure 3

The AWV strategy

2.6 Scene change detection based on post-classification comparison

After obtaining the parcel-level multi-temporal classification results, post-classification is finally applied to obtain the multi-class scene change information. Differing from the traditional change detection accuracy evaluation method of using pixel-based indices, we treat the multi-class change information as different categories, and the accuracy is evaluated by the number of parcels that obtain the correct change information.
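The post-classification comparison and the parcel-based accuracy can be expressed compactly, as in the following sketch under assumed array names: the “from-to” change type of a parcel is the pair of its labels in two phases, and the accuracy is the share of parcels whose predicted pair matches the reference pair.

```python
# A minimal sketch of the parcel-level post-classification comparison: the change
# type of a parcel is the (from, to) pair of its labels in two phases, and the
# accuracy is the fraction of parcels whose predicted pair matches the reference pair.
import numpy as np

def change_pairs(labels_t1: np.ndarray, labels_t2: np.ndarray):
    # Identical labels indicate an unchanged parcel; different labels a "from-to" change.
    return list(zip(labels_t1.tolist(), labels_t2.tolist()))

def change_accuracy(pred_t1, pred_t2, ref_t1, ref_t2) -> float:
    pred = np.stack([pred_t1, pred_t2], axis=1)
    ref = np.stack([ref_t1, ref_t2], axis=1)
    return float(np.all(pred == ref, axis=1).mean())
```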

Finally, through the multi-temporal scene classification and change detection mapping, the time-series urban land-use distribution and change pattern analysis is obtained by calculating the parcel number of each category in the different time phases and combining this with the change detection results.

3 Experiments and results

In this section, we describe: 1) the construction of the datasets used to validate the proposed method; 2) the accuracy evaluation and mapping for the multi-temporal scene classification; and 3) the accuracy evaluation and mapping for the multi-temporal scene change detection.

3.1 Data collection

In this study, Chinese urban areas with diverse urban structures, city scales, and development models were chosen as the study areas, focusing on Wuhan City and the Hong Kong Special Administrative Region. Wuhan is the capital of Hubei province, and plays a core role in the Yangtze River Economic Belt. It is also a strategic pivot point for the “Rise of Central China” plan (Wenmei Li et al., 2020). Hong Kong is a highly prosperous special administrative region of China, which is located in southern China to the east of the mouth of the Pearl River. Hong Kong is the third-largest financial center in the world, and is ranked as the third-largest first-tier city in the world by the GaWC Research Network (Shang and Cha, 2013). By selecting the above study areas, the differences in urbanization development between different cities can be revealed.

This study was carried out based on tri-temporal HR images. For Wuhan city center, the image acquisition dates were June 2000, June 2010, and May 2017. The image size of these three images is 15424 × 12834 pixels and the spatial resolution is 0.5 m. For the Wan Chai area of Hong Kong, the image acquisition dates were June 2000, June 2010, and January 2017. The image size of these three images is 8192 × 7680 pixels and the spatial resolution is 0.5 m. The multi-temporal OSM road network data of the two study areas were obtained from both the OSM website (the road network for 2017) and by manual annotation and correction (the road networks for 2000 and 2010). The tri-temporal images and ground-truth maps at the parcel level are presented in Fig. 4.

Fig. 4
figure 4

The ground-truth maps at the parcel level for the three HR image datasets. In the classification maps, the numbers within each subtitle designate its mapping year. For the change maps, the first four numbers represent the years before and after the changes, and the last two numbers represent the year of the parcels shown in the current change map. a Multi-temporal HR images. b Scene maps. c Change maps of T1 & T2. d Change maps of T2 & T3. e The legend of scene classes

In this study, the first time phase image of both datasets was used as the source domain image. The patches obtained by the uniform grid sampling were used to construct the training datasets for the model training. The datasets used in this study were sliced using a 300 × 300 uniform grid with a 150-pixel overlap. The number of patches obtained by the uniform grid sampling was 8555 for Wuhan center and 1784 for Wan Chai, respectively. In order to simulate the situation in which open-source data generally cannot cover the whole study area, 80% of the patches of each class in the source domain images were labeled to form the final training data, and 20% of the patches of each class in the images of the three time phases were labeled for the accuracy assessment at the patch level.
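The uniform grid sampling described above can be sketched as a simple sliding window with a 300 × 300 window and a 150-pixel stride; the function below is an illustrative NumPy version, with the patch offsets retained for the later parcel-patch mapping.

```python
# An illustrative sketch of the 300 x 300 uniform grid sampling with a 150-pixel
# stride; the (top, left) offsets are kept so that each patch can later be related
# to the parcels for the area-weighted voting.
import numpy as np

def uniform_grid_patches(image: np.ndarray, size: int = 300, stride: int = 150):
    h, w = image.shape[:2]
    patches, offsets = [], []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append(image[top:top + size, left:left + size])
            offsets.append((top, left))
    return patches, offsets
```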

To express the multi-class change types in a simple way, the scene categories of the before and after phases are presented separately for the changed areas in two individual figures. As can be seen from the ground-truth maps, the changes in Wuhan center mainly focus on the transition between residential and commercial land. In 2017, a large number of built-up areas in Wuhan center were redeveloped. Meanwhile, the Wan Chai area of Hong Kong has seen little change, with some construction activity appearing at the harbors.

3.2 Multi-temporal scene classification

The constructed tri-temporal datasets were used to carry out the related experiments to evaluate the effectiveness of the CWDAN framework for the cross-temporal scene classification and change detection tasks. In the training procedure, the batch size and the number of epochs were set as 16 and 100, respectively. 80% of the patches of the first time phase were labeled for the model training, to simulate the incomplete coverage of historical products in practical applications. The ResNet-50 network was used as the backbone network and was pretrained on ImageNet. The experimental results obtained with other domain adaptation methods are also exhibited for comparison, i.e., DANN (Ganin and Lempitsky, 2015), deep adaptation networks (DAN) (Long et al., 2015), joint adaptation networks (JAN) (Long et al., 2017), adversarial discriminative domain adaptation (ADDA) (Tzeng et al., 2017), conditional domain adversarial networks (CDAN) (M. S. Long et al., 2018), maximum classifier discrepancy for unsupervised domain adaptation (MCD) (Saito et al., 2018), margin disparity discrepancy (MDD) (Zhang et al., 2019), and minimum class confusion (MCC) (Jin et al., 2020).
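For illustration, the training configuration described above might look as follows in PyTorch, using an ImageNet-pretrained ResNet-50 backbone, a batch size of 16, and 100 epochs; the optimizer, learning rate, class count, and loader names are assumptions rather than reported settings, and the loss reuses the `cwdan_loss` sketch from Section 2.4.

```python
# An illustrative PyTorch sketch of the training setup: an ImageNet-pretrained
# ResNet-50 backbone with the classification head removed, a linear scene classifier,
# and a linear domain discriminator, trained for 100 epochs with batch size 16.
# The optimizer, learning rate, class count, and data loaders are assumptions.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()           # keep the 2048-d deep features
classifier = torch.nn.Linear(2048, 12)      # illustrative number of evaluated classes
discriminator = torch.nn.Linear(2048, 1)    # domain discriminator (source vs. target)

optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(classifier.parameters())
    + list(discriminator.parameters()),
    lr=1e-3, momentum=0.9)

# for epoch in range(100):
#     for (x_s, y_s), (x_t,) in zip(source_loader, target_loader):   # batch size 16
#         loss = cwdan_loss(backbone(x_s), y_s, backbone(x_t),
#                           classifier, discriminator, w)
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
```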

We quantitatively analyze the results of the final multi-temporal scene classification at the parcel level in Table 2. Compared to the other methods, the proposed CWDAN framework obtains the best classification accuracy on all the new time phases for both datasets, which shows that the proposed CWDAN has a good transfer effect. The overall accuracy (OA) for the first time phase is not optimal, because the small amount of labeled data brings the problem of overfitting, so a small decrease in accuracy is acceptable. In addition, compared with the DANN, the proposed CWDAN shows a significant improvement in temporal consistency. The proposed AWV strategy works for most of the methods on most of the time phases of both datasets. In addition, the images of both the Wuhan center dataset and the Wan Chai dataset show different degrees of image stitching. However, the results obtained by the proposed framework show that these problems do not have a great impact on the final cross-temporal classification results, indicating that the proposed cross-temporal scene classification framework is insensitive to the problems of cross-sensor data and image stitching.

Table 2 Accuracy of the multi-temporal scene classification at the parcel level with different comparison methods

As can be seen from the multi-temporal mapping results (Fig. 5), the large areas of water and farmland regions are correctly classified in both datasets. The classification mapping results for the Wuhan center dataset show that, from 2000 to 2017, Wuhan city center was dominated by the residential and water categories, along with some commercial and green land categories. For the Wan Chai dataset, the Wan Chai area of Hong Kong was dominated by the categories of commercial, residential, agricultural and forestry land, and water from 2000 to 2017. Among the different categories, agricultural and forestry land is mainly distributed in the southern part, while water is mainly distributed in the northern part. The residential and commercial areas are concentrated in the central region. The mapping results of the CWDAN-based multi-temporal scene classification can provide time-series urban land-use information, thus providing useful support for realistic urban development analysis.

Fig. 5
figure 5

The CWDAN-based multi-temporal scene classification mapping results for the HR image datasets. a Scene maps for Wuhan center. b Scene maps for Wan Chai

Figure 6 presents the confusion matrices for the scene classification results obtained using the CWDAN framework, where the columns show the classification results and the rows show the ground truth. For the Wuhan center dataset and the Wan Chai dataset, the classification accuracy is highest for the residential and commercial categories, respectively. However, for the categories of sports, square, and transportation, the classification accuracy is relatively low. Due to the actual situation of urban areas in general, the distributions of the different categories show great variance, which leads to the problem of sample imbalance in the training dataset. For example, in Wuhan city center and the Wan Chai area of Hong Kong, the residential and commercial categories cover a larger area, so their sample numbers in the training dataset are sufficient. In contrast, the distribution of the sports, square, and transportation categories is limited, resulting in small sample sizes for these scene categories. Thus, the categories with more training samples can obtain a better classification accuracy, while the categories with few samples achieve a poorer accuracy, due to the limited training data. However, with regard to the overall classification results, the proposed CWDAN alleviates the negative effects of the category imbalance to a certain extent and maintains a good accuracy.

Fig. 6
figure 6

The CWDAN-based multi-temporal scene classification confusion matrices for the two datasets

There is also a misclassification phenomenon in both datasets. For the Wuhan center dataset and the Wan Chai dataset, the decrease in the classification accuracy can be attributed to the misclassification between the residential, commercial, and educational and research categories. The commercial and residential categories are combinations of uniformly arranged buildings, while the educational and research category also combines uniformly arranged buildings with green land, sports facilities, and so on. In addition, the dual use of commercial and residential buildings is also common. Thus, the similarity of the composition within these scene categories results in the misclassification problem.

3.3 Multi-temporal scene change detection

Finally, the multi-temporal scene change detection results were obtained by post-classification comparison. The overall accuracies are listed in Table 3. As presented, the CWDAN framework obtains the best and most robust change detection results by using the AWV strategy and the spatio-temporal alignment constraint. For the Wan Chai dataset, the area was more stable during the study period, and the proportion of unchanged pairs is larger. Therefore, a better change detection accuracy can be achieved than for the Wuhan center dataset.

Table 3 Accuracy of the multi-temporal scene change detection at the parcel level using different comparison methods with AWV

Figure 7 shows the final change maps. The main changes in Wuhan center were concentrated on the transitions between commercial, residential, and vacant land, which reveals the changes within Wuhan center brought about by urbanization. Due to the accumulation of misclassification errors, the change results for Wuhan center also include water changes, which may not match the actual situation. The Wan Chai area of Hong Kong has remained largely unchanged. The main changes in Wan Chai are focused on the water areas, due to the construction of harbors, which were detected correctly by the CWDAN. These findings are consistent with the development status of the two cities.

Fig. 7
figure 7

The CWDAN-based multi-temporal scene change detection mapping. a Change maps for Wuhan center. b Change maps for Wan Chai

4 Discussion

4.1 Methods robustness analysis

We conducted a further robustness analysis of the proposed CWDAN, which is an enhancement of the DANN. The CW strategy and AWV strategy proposed in this paper improve the accuracy of the cross-temporal classification and change detection, and show a higher classification accuracy for the third time phase (Table 4). As illustrated in Fig. 8, the viewing angle of the image in the second time phase is very different from that of the source domain image, while the viewing angle of the image in the third time phase is closer to that of the source domain image, so the classification effect for the third time phase is better. Better classification results can also lead to more reliable change results.

Table 4 Ablation experiments of multi-temporal urban scene classification at the parcel level
Fig. 8
figure 8

Example of different imaging angles in different phases

4.2 OSM land-use product comparison

In order to verify the accuracy of the land-use mapping obtained in this study, we compared the results of the CWDAN-based land-use mapping with the OSM land-use product (Fig. 9). Since OSM only provides a historical land-use product from 2014 onwards, we only compared the 2017 mapping results for the Wuhan center dataset and the Wan Chai dataset. From the visual comparison, for Wuhan center, the OSM land-use product only covers 20% to 30% of the study area, while our mapping results provide full coverage. In addition, a direct comparison of the areas covered by both OSM and our results shows that the results obtained by the CWDAN framework and the OSM product show a high degree of agreement. For the Wan Chai area of Hong Kong, the OSM land-use product covers a larger area. By comparing the mapping result and the land-use product, it is apparent that the main land-use categories, such as farmland, commercial, and residential, are accurately represented in our mapping results. Therefore, compared with the OSM land-use product, the CWDAN framework is able to obtain historical land-use mapping results, to compensate for the shortcomings of the OSM historical land-use product, as well as maintaining a satisfactory accuracy.

Fig. 9
figure 9

Comparison of the OSM land-use product and the CWDAN-based land-use maps. a OSM land-use product. b Land-use map obtained with the CWDAN framework

5 Conclusion

In this paper, a multi-temporal scene classification framework based on a class-weighted deep adaptation network has been proposed. This framework first uses OSM road network data and uniform grid sampling to segment the original multi-temporal HR images at the parcel level and patch level. The patch-level data are then input into the newly proposed CWDAN model, to carry out the cross-temporal scene classification task. The scene labels of the patch data are then mapped to the parcel level according to the AWV strategy. Finally, post-classification is applied to obtain the multi-temporal change information. This work applies the idea of domain adaptation to multi-temporal scene classification, which can significantly reduce the data labeling workload. The newly introduced class-weighted fully connected layer of the network solves the problem of category imbalance within multi-temporal images to some degree. At the same time, the combination of the OSM road network and uniform grid sampling not only provides meaningful geographic boundaries for the scene units, but also ensures that the data can be input into the deep learning network for efficient feature learning.

Land-use change is a visual representation of the urbanization process. Only single-phase annotation is used in this study to realize the continuation of the classification task over a long temporal sequence, which reduces the training burden of the scene classification task. Multi-temporal urban scene mapping can help to verify the consistency of the real urban construction and the urban planning, explain the urban development patterns, provide historical products for rational urban planning, and assist the healthy and sustainable development of cities.

However, this work is still at the preliminary exploration stage of domain adaptation for multi-temporal urban scene classification and change detection. The framework proposed in this paper mainly focuses on the categories with imbalanced proportions to enhance the robustness and transferability of the deep features, yet the transferability in the other areas of the whole study area could be further improved, and the generalization ability of the model needs to be further enhanced. Moreover, this work only utilizes remote sensing imagery. In the future, it will be possible to combine multi-modal data to strengthen the discriminative ability between easily confused categories such as residential and commercial areas. Lastly, the trend analysis of the multi-temporal changes will be further researched, to explore the relationship between the time span and the types and intensities of the urban scene changes.