1 Introduction

The Earth’s surface is constantly changing due to anthropogenic and natural causes, such as desertification, deforestation, glacier movements, fires and earthquakes (Alberti et al., 2003). Monitoring these changes over time can provide valuable information on the transformation of the Earth’s environment, paving the way for better policy decisions that minimise the risk of disasters (Michel et al., 2012). In particular, the rapid development of multispectral (MS) and hyperspectral (HS) technology has recently unleashed the potential of change detection (CD) methods in a wide range of remote sensing applications, including urban planning, environmental monitoring, agriculture investigation, disaster assessment and map revision (Kwan, 2019).

MS and HS sensors, mounted on space-borne systems, allow a frequent revisit time of the same Earth scene, acquiring observation data with high spectral and spatial resolution while keeping the acquisition characteristics approximately constant (e.g. the same sun illumination and incidence angle when platforms are placed in a sun-synchronous orbit). Both technologies record reflected light in a spectrum of narrow frequency bands covering the visible, near-infrared and shortwave infrared ranges (pixel spectral data). The difference between MS and HS technology lies in the number of bands and how narrow they are: MS technology commonly refers to a small number of bands, i.e. from 3 to 10, sensed by a radiometer, while HS technology can provide hundreds or thousands of bands from a spectrometer. In any case, independently of the specific spectral resolution, both MS and HS sensors have made available unprecedented optical information (Hoye & Fridman, 2013; Mouroulis et al., 2000), compared to traditional RGB cameras, for machine learning in Earth observation. In general, processing MS/HS imagery data is nowadays a cornerstone of new CD methods in remote sensing.

Existing MS/HS CD methods mainly leverage machine learning (Shi et al., 2020) to compare the spectral data of each couple of images of a scene and learn patterns that delineate changes at either the pixel or the object level of the observed scene (Im et al., 2008). These methods are classified into supervised and unsupervised methods according to the learning paradigm they adopt (Shi et al., 2020). Supervised CD methods (e.g., Larabi et al., 2019; Seydi & Hasanlou, 2017; Wu et al., 2017; Yuan et al., 2005) rely on prior information about the ground changes. Therefore, their accuracy strongly depends on the availability and quality of the ground truth, which is commonly based on human intervention and tends to be generated object-wise, rather than pixel-by-pixel, since producing it is costly in terms of time and effort. A poor-quality ground truth map may prevent even a good supervised learning method from showing its worth, by producing contradictory results.

Due to the limitations of supervised CD, a significant research effort is devoted to performing CD analysis in an unsupervised manner. In the unsupervised machine learning paradigm (Bruzzone & Prieto, 2000; Hussain et al., 2013), changes are commonly detected by resorting to the Change Vector Analysis (CVA) strategy, which relies on a reliable measure of distance (or similarity) computed between the two images. In this strategy, a threshold is then determined to separate the changed pixels from the unchanged background.

Following this mainstream of research in CD, we propose a CVA method, named ORCHESTRA (autOencodeR-based CHange dEtection in hyperSpecTRAl/multispectral images), to analyse bi-temporal, co-registered MS/HS images of an Earth scene, which are denoted as the primary image and the secondary image, respectively. The proposed method extends the traditional CVA strategy by taking advantage of autoencoder information. An autoencoder is an artificial neural network (ANN) architecture consisting of an encoder function, mapping the input to a hidden code, and a decoder function, producing a reconstruction of the input; both are learned by minimizing a loss function (Goodfellow et al., 2016). As the hidden code commonly reduces the size of the data, autoencoders are mostly used by saving the output of the encoder function for dimensionality reduction (Ferreira et al., 2019; Hu et al., 2014; Shone et al., 2018; Wang et al., 2014; Wang et al., 2015). However, a few recent studies learn autoencoders for purposes that go beyond dimensionality reduction, e.g., considering the restored output of the decoder function for data denoising (Andresini et al., 2020; Zheng & Peng, 2018) or the loss (residual error) for anomaly detection (An & Cho, 2015; Andresini et al., 2019; Oh & Yun, 2018; Sarafijanovic-Djukic & Davis, 2019).

In this study, we also consider autoencoders for data restoration. In particular, we use the output of the decoder function of the primary image-specific autoencoder to restore both the primary image and the secondary image. Note that this is not aimed at data denoising alone. In principle, the autoencoder trained on the primary samples can recover denoised samples of the pixels in the primary image, as well as denoised samples of the unchanged pixels in the secondary image, but it should see the changed samples of the secondary image as anomalies and thus reconstruct them badly. So, the idea is to exploit autoencoders to disclose patterns that better delineate the pixels of the sensed scene where a change has occurred over time. We take advantage of these patterns by completing the CVA on the restored data (rather than on the original, acquired data). In particular, we compute the Spectral Angle Mapper (SAM) distance pixel-by-pixel between the restored spectral data vectors of the primary image and the secondary image, respectively. This distance quantifies the spectral change range at each pixel of the scene. Otsu’s algorithm (Otsu, 1972) is, finally, adopted to separate the foreground regions, where a change occurred, from the unchanged background.

We evaluate the proposed method by performing the CD analysis of several benchmark, bi-temporal, co-registered HS images collected in both urban and rural scenarios. As the change information is available for these datasets, the empirical study can verify the accuracy of the proposed CD method. In addition, we evaluate the viability of the proposed method in delineating the burnt area of bi-temporal, co-registered MS images acquired with Sentinel-2 in the area of the Majella National Park (Italy).

The paper is organised as follows. The related works are presented in Section 2. The basic concepts are introduced in Section 3, while the proposed CD method is illustrated in Section 4 and the implementation is described in Section 5. The findings in the evaluation of the proposed strategy with benchmark HS data are discussed in Section 6, while the achievements in the analysis of the MS data of the burnt area in the Majella National Park (Italy) are illustrated in Section 7. Finally, Section 8 draws conclusions and proposes future developments.

2 Related work

Since obtaining a large number of labelled samples for supervised training is usually time-consuming and labour-intensive, remote sensing research devotes significant effort to the formulation of CD methods in the unsupervised machine learning setting.

In the unsupervised machine learning paradigm (Hussain et al., 2013; Bruzzone & Prieto, 2000), changes are commonly detected by resorting to the CVA strategy, which computes a measure of similarity (or distance) between co-located pixels of a couple of images and uses a threshold-based approach to identify a distance threshold that separates changed pixels from the unchanged background. Various similarity (or distance) measures have been investigated for CVA methods (e.g., Appice et al., 2020; Falini et al., 2020; Seydi & Hasanlou, 2017; Yang & Mueller, 2007). The threshold to detect the changes is estimated from the spectral data (i.e. in a data-driven manner) (Lu et al., 2010; Najafi et al., 2017; Penglin et al., 2012) by leveraging probabilistic information extracted from the distribution of the (distance or similarity) measure among the pixels. A well-known approach commonly used for the threshold determination is Otsu’s method (Otsu, 1972; Sahoo et al., 1988). In López-Fandiño et al. (2019), Otsu’s algorithm is evaluated in combination with SAM and the watershed algorithm. Alternatively, clustering algorithms are adopted (Appice et al., 2019; Appice et al., 2020) in order to separate the distances (or similarities) of changed pixels from the unchanged background.

Algebra-based methods, similarity-based methods and distance-based methods belong to the threshold-based family of CD approaches. In particular, algebra-based CD methods use mathematical operations (such as image differencing or image ratioing) on images taken at different times to generate a change matrix output (Ilsever & Unsalan, 2012). Similarity-based CD methods resort to the computation of a similarity measure (e.g., a correlation measure) between a pair of spectral vectors (Choi et al., 2010). Distance-based CD methods are founded on a spectral distance measure (e.g. SAM, Z-score Information Divergence) computed between the spectral vectors at corresponding pixels (Choi et al., 2010). A recent study (Appice et al., 2020) adopts a spectral-spatial distance for CVA. In addition, it introduces an iterative upgrade of the traditional distance-based approach by accounting for a representation of the possible change that is iteratively learned through classification. Due to the lack of ground change information, the classification is supervised with pseudo-labels derived from the spectral-spatial distance information via clustering.

A different unsupervised perspective performs the change detection on an image combination or transformation. In Deng et al. (2008), Principal Component Analysis (PCA) is used to extract the difference between two images, suppressing correlated information and highlighting the variance in multi-temporal data. The change is identified in the second component, while the first component is assumed to be the sum of the common information. In Gao et al. (2016), PCA is used as a convolutional filter to determine representative neighbourhood features for each pixel and generate change matrices with fewer noise spots. Gabor wavelets and fuzzy c-means are utilized to select bi-temporal pixels of interest that have a high probability of being changed or unchanged. Then, new image patches centred at the pixels of interest are generated, and a PCANet model—a deep learning network whose convolution filter banks are chosen from PCA filters—is trained using these patches. Finally, the pixels of the bi-temporal images are classified by the trained PCANet model.

In Appice et al. (2019), an autoencoder architecture is trained on the pixelwise difference computed between the spectral data of two images. By assuming that the spectral differences compressed at the bottom encoder layer preserve the hidden information needed to separate changed pixels from the unchanged background, the encoded differences are coupled with distance information and processed through clustering to perform this separation. The data compression ability of the encoder layer of an autoencoder is also investigated in Kalinicheva et al. (2018). Each image is encoded to an equal-sized compressed feature representation, and the output of the subtraction operation between the encoded images is analysed for change detection. In Kalinicheva et al. (2019), a convolutional autoencoder is trained on the patches of a time series of images. The reconstruction error of each patch is analysed to discriminate changed pixels from the background.

Supervised CD methods (Khanday, 2016) are based on the availability of ground change information (often acquired by human intervention) and use a classification framework, in which the ground truth is used to learn a classifier. The spectral information, the spatial information or a proper combination of the two is used to build a measure able to detect the change and aid in the classifier decision. ANN (Clifton, 2003; Helmy and El-Taweel, 2010) and Expectation Maximization (EM) (Ming et al., 2014) algorithms fall into this category. Although ANN and EM are based on different concepts (basically, the former is based on nonlinear regression, the latter on maximum likelihood with unknown parameters), both provide a binary decision like a classifier. In Seydi and Hasanlou (2017), a supervised method is illustrated. It acquires a sample of change labels on a scene, in order to determine the optimal threshold for determining the remaining labels with a distance-based change detection approach. In Wu et al. (2017), changes are identified by using a trained classifier to directly classify data from multiple periods (i.e., multi-date classification or direct classification) and by comparing multiple classification maps (i.e. post-classification comparison). In Planinšič and Gleich (2018), a logistic regression layer is trained to perform supervised fine-tuning and classification on the autoencoder-denoised representation of image time series features extracted within a tunable Q discrete wavelet transform. Finally, transfer learning-based structures have recently been investigated to alleviate the lack of training samples and optimize the training process in a semi-supervised scenario. Transfer learning uses training in one domain to enable better results in another domain; specifically, the lower- to mid-level features learned in the original domain can be transferred as useful features to the new domain by performing fine-tuning on a few labelled samples (Kerner et al., 2019; Larabi et al., 2019).

3 Preliminary concepts

An MS/HS sensor records reflected light in tens (MS)/hundreds (HS) of narrow frequency bands covering the visible, near-infrared and shortwave infrared wavelengths. The recorded spectrum λ is an M-dimensional feature vector (spectral feature vector), so that λ is spanned by M numeric spectral features (bands) λ1, λ2, …, λM.

Let X and Y be two co-registered MS/HS images—digital images of an observed Earth scene, produced at different time points using the same MS/HS sensor mounted on an aircraft or satellite. Note that if the searched changes concern vegetation, the images should be acquired in the same season, to avoid focusing on changes related to different phenological stages of the vegetation. X is denoted as the primary image, while Y is denoted as the secondary image. Every image (see Fig. 1) is a hyper-cube of size U × V × M, which represents a collection of spectral vectors measured on an M-dimensional spectrum λ over a grid of U × V pixels. Every pixel (u, v) covers a region of a few square metres of the Earth’s surface, depending on the sensor’s spatial resolution. X(u, v) and Y(u, v) are the one-dimensional, real-valued spectral sections of hyper-cubes X and Y, respectively, indexed by the spatial coordinates u and v.

Fig. 1 MS/HS imagery data

Every pixel of a scene for which bi-temporal MS/HS data are acquired can, in principle, be labelled according to an unknown binary target function, whose range is a finite set of two distinct labels, i.e. “changed” and “unchanged”. According to this function, a change matrix C can be associated with the bi-temporal image couple X and Y. In particular, C is a two-dimensional set of U × V change values with every value C(u, v) representing the change label of the pixel indexed by the spatial coordinates u and v. A CD method takes as input images X and Y to learn C.
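
To make the notation concrete, the following minimal sketch (with hypothetical placeholder arrays; the Hermiston dimensions of Section 6.1 are used only as an example) shows how the hyper-cubes and the change matrix can be represented:

```python
import numpy as np

# Illustrative placeholders only: U x V x M hyper-cubes for the bi-temporal
# images and a U x V binary change matrix (Hermiston-sized, see Section 6.1).
U, V, M = 390, 200, 242
X = np.zeros((U, V, M), dtype=np.float32)  # primary image
Y = np.zeros((U, V, M), dtype=np.float32)  # secondary image
C = np.zeros((U, V), dtype=bool)           # change matrix: True = "changed"

spectrum = X[10, 20]                       # M-dimensional spectral vector of pixel (10, 20)
```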

In this paper, an unsupervised CVA method based on autoencoders is proposed. An autoencoder is a deep learning ANN trained to attempt to copy its input to its output (Goodfellow et al., 2016). It can be viewed as being composed of two functions: an encoder f—mapping the input vector x to a hidden representation h via a deterministic mapping h = f(x), parameterized by 𝜃f—and a decoder g—mapping the resulting hidden representation h back to a reconstructed vector in the input space \({\mathbf x^{\prime }}=g(\mathbf h)\), parameterized by 𝜃g. The functions g and f correspond to two different ANNs combined into a single one, whose parameters {𝜃f,𝜃g} are simultaneously learned by minimizing a loss function \(\mathcal L(\mathbf x, g(f(\mathbf x))) = \mathcal L(\mathbf x, {\mathbf x}^{\prime })\) penalising x for being dissimilar from \({\mathbf x}^{\prime }\), such as the squared error \(\mathcal L_{\text {se}}(\mathbf x, {\mathbf x}^{\prime }) = || \mathbf x - {\mathbf x}^{\prime } ||^{2}\).

4 The proposed methodology

In this section we describe ORCHESTRA—an unsupervised CVA method enhanced with autoencoder information. The method takes as input the bi-temporal images, X (primary image) and Y (secondary image), and learns a change matrix C. Figure 2 shows the block diagram of ORCHESTRA.

Fig. 2 ORCHESTRA methodology

Initially, we train the autoencoder architecture g∘f on the pixel spectral vectors acquired with the primary image X. Since the activation produced by the top layer of the decoder network g is a reconstructed feature vector in the same M-dimensional spectral input space of the autoencoder, we consider this output feature vector as a new set of learned features of the spectrum λ. The CVA of X and Y is then completed in this new feature space. According to these premises, g∘f is used to restore the pixel spectral vectors of both X and Y and build the image reconstructions \(\mathbf {X}^{\prime }\) and \(\mathbf {Y}^{\prime }\) so that, for each pixel (u, v), \(\mathbf {X}^{\prime }(u,v)=g(f(\mathbf {X}(u,v)))\) and \(\mathbf {Y}^{\prime }(u,v)=g(f(\mathbf {Y}(u,v)))\), respectively.
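
As a sketch, assuming `autoencoder` is the trained Keras model implementing g∘f (see Section 5) and the images are stored as U × V × M NumPy arrays, the restoration step can be written as:

```python
import numpy as np

def restore(image, autoencoder):
    """Restore a U x V x M image by passing every pixel spectrum through
    the trained autoencoder g(f(.)) and reshaping the result back."""
    U, V, M = image.shape
    spectra = image.reshape(-1, M)           # one row per pixel
    restored = autoencoder.predict(spectra)  # g(f(x)) for each pixel spectrum
    return restored.reshape(U, V, M)

# X_prime = restore(X, autoencoder)  # expected to denoise the primary image
# Y_prime = restore(Y, autoencoder)  # changed pixels should be restored badly
```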

Some remarks can be formulated on the reconstructions \(\mathbf {X}^{\prime }\) and \(\mathbf {Y}^{\prime }\). As the autoencoder g∘f is trained on the pixel spectral vectors of X, we expect that \(\mathbf {X}^{\prime }\) reconstructs X well, as it mainly performs a denoising transformation of X. We also expect that g∘f reconstructs well the spectral vectors of Y associated with the unchanged pixels, while it poorly reconstructs the spectral vectors of Y associated with the changed pixels. As a consequence, the spectral vector reconstructions of unchanged pixels in \(\mathbf {Y}^{\prime }\) should be more similar to the corresponding spectral vector reconstructions restored in \(\mathbf {X}^{\prime }\) than the reconstructions associated with changed pixels. This conjecture (which is experimentally verified in Section 6) inspires the idea of computing the distance between the proposed autoencoder transformations of the original images, in order to better disentangle the differences between changed and unchanged pixels.

Then we compute pixelwise the distance between \(\mathbf {X}^{\prime }\) and \(\mathbf {Y}^{\prime }\) by resorting to the SAM measure, which is commonly used in CVA methods (e.g. Appice et al., 2019, 2020; Lopez-Fandino et al., 2018). As pointed out in Seydi and Hasanlou (2017), the computation of SAM is independent of the number of spectral bands and insensitive to sun illumination. Given a pixel (u, v), SAM(u, v) measures the angle between the bi-temporal reconstructed spectral vectors associated with (u, v) in \(\mathbf {X}^{\prime }\) and \(\mathbf {Y}^{\prime }\). This angle is computed as follows:

$$ SAM(u,v)= \arccos{\frac{ \mathbf{X}^{\prime}(u,v)\cdot \mathbf{Y}^{\prime}(u,v) }{\vert\vert \mathbf{X}^{\prime}(u,v)\vert\vert \ \vert\vert \mathbf{Y}^{\prime}(u,v)\vert\vert}}. $$
(1)
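
A vectorized NumPy sketch of Eq. 1 (the small `eps` guard against zero-norm vectors is our addition):

```python
import numpy as np

def sam_map(X_prime, Y_prime, eps=1e-12):
    """Pixelwise Spectral Angle Mapper (Eq. 1) between two U x V x M cubes;
    returns a U x V matrix of angles in radians."""
    dot = np.sum(X_prime * Y_prime, axis=-1)
    norms = np.linalg.norm(X_prime, axis=-1) * np.linalg.norm(Y_prime, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)  # guard against rounding errors
    return np.arccos(cos)
```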

Subsequently, we apply Otsu’s algorithm to automatically determine the upper threshold 𝜃otsu of the SAM distances, separating the pixels of the study scene into background (“unchanged” pixels with a low SAM range) and foreground (“changed” pixels with a high SAM range). In particular, we assign the pixels (u, v) with SAM(u, v) higher than 𝜃otsu to the label “changed”, while we assign the remaining pixels to the label “unchanged”.

Otsu’s algorithm is an adaptive, non-parametric and unsupervised threshold algorithm introduced in Otsu (1972). It is commonly used in image binarization problems to return a single intensity threshold that separates pixels into two classes. The threshold is determined by minimising the intra-class intensity variance, defined as a weighted sum of the variances of the two classes. In this paper, we assume that the SAM distances, computed pixelwise in the study scene, are represented in a histogram with L equal-width bins (levels) denoted as [1,…,L]. Let ηi be the number of pixels at level i, so that \(\displaystyle \sum\limits_{i=1}^{L}{\eta _{i}}\) corresponds to the total number of pixels in the scene, i.e. \(\displaystyle \sum\limits_{i=1}^{L}{\eta _{i}}=UV\). Based upon these premises, the probability of each level i is computed as \(p_{i}=\frac {\eta _{i}}{UV}\). Otsu’s algorithm identifies the optimal threshold level 𝜃otsu, in order to divide the pixels of the processed scene into the background class C1, spanned over the SAM levels [1,2,…,𝜃otsu], and the foreground class C2, spanned over the SAM levels [𝜃otsu + 1,…,L], respectively. The optimal 𝜃otsu is searched for by minimizing the intra-class variance:

$$ \theta_{otsu} =\arg\min_{1\leqslant\theta\leqslant L}{\left( w_{1}(\theta) {\sigma_{1}^{2}}(\theta)+w_{2}(\theta) {\sigma_{2}^{2}}(\theta)\right)}, $$
(2)

where \({\sigma _{1}^{2}}(\theta )\) and \({\sigma _{2}^{2}}(\theta )\) are the variances computed on the two classes separated by 𝜃. The weights w1(𝜃) and w2(𝜃) are the probabilities of the two classes, which are computed as follows:

$$ w_{1}(\theta)=\displaystyle \sum\limits_{i=1}^{\theta}{p_{i}} \quad \text{and} \quad w_{2}(\theta)=\displaystyle \sum\limits_{i=\theta+1}^{L}{p_{i}}. $$
(3)
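
For illustration, a from-scratch sketch of the search in Eqs. 2-3 over L equal-width levels (the implementation in Section 5 relies on skimage instead):

```python
import numpy as np

def otsu_threshold(sam, L=256):
    """Exhaustive search of Eq. 2: minimise the weighted intra-class variance
    over L equal-width histogram levels of the SAM distances (Eq. 3 weights)."""
    counts, edges = np.histogram(sam.ravel(), bins=L)
    p = counts / sam.size                          # level probabilities p_i
    centers = (edges[:-1] + edges[1:]) / 2.0
    best_t, best_var = 1, np.inf
    for t in range(1, L):                          # candidate threshold levels
        w1, w2 = p[:t].sum(), p[t:].sum()          # Eq. 3 class weights
        if w1 == 0.0 or w2 == 0.0:
            continue
        mu1 = (p[:t] * centers[:t]).sum() / w1
        mu2 = (p[t:] * centers[t:]).sum() / w2
        var1 = (p[:t] * (centers[:t] - mu1) ** 2).sum() / w1
        var2 = (p[t:] * (centers[t:] - mu2) ** 2).sum() / w2
        intra = w1 * var1 + w2 * var2              # Eq. 2 objective
        if intra < best_var:
            best_var, best_t = intra, t
    return centers[best_t]                         # threshold in SAM units

# changed = sam > otsu_threshold(sam)              # foreground/background split
```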

A further consideration concerns the fact that the direct application of Otsu’s algorithm for change labelling neglects the spatial arrangement of pixels. It may occasionally yield spurious assignments of pixels to classes. To avoid this issue, we may apply the principle of local auto-correlation congruence of objects (Appice et al., 2016, 2017; Du et al., 2012; Wang et al., 2015), according to which detected clusters comprising changed objects generally expand across contiguous areas (Appice et al., 2015). Based on this principle, we may decide to change the assignment of pixels that strongly disagree with the surrounding assignments. This corresponds to performing a spatial-aware correction of the change assignment defined with Otsu’s threshold. The correction assigns each pixel the label that originally groups the majority of its neighbouring pixels (see Fig. 3).

Fig. 3 Majella Park bi-temporal scene: the separation of the scene into black changed pixels and white unchanged background, computed via Otsu’s algorithm (Fig. 3a); the spatial correction of the assigned change labels (Fig. 3b)

Formally,

$$ label(u,v)=\begin{cases} changed & \text{if } \sharp c(u,v)\geq \sharp u(u,v) \\ unchanged & \text{otherwise} \end{cases}, $$
(4)

where ♯c(u, v) and ♯u(u, v) count how many pixels falling in the neighbourhood 𝜖(u, v) are labelled as “changed” and “unchanged”, respectively, by the Otsu threshold. The neighbourhood 𝜖(u, v) is a set of pixels surrounding (u, v) in the study scene. As in Appice et al. (2019, 2020), Appice and Malerba (2019) and Guccione et al. (2015), we consider a square-shaped neighbourhood. Let R be a positive, integer-valued radius; the square-shaped neighbourhood 𝜖(u, v) of pixel (u, v) is defined as follows:

$$ \epsilon(u,v)=\displaystyle\bigcup_{I=-R}^{+R}{\bigcup_{J=-R}^{+R}{\{(u+I,v+J)\}}}. $$
(5)
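
A sketch of the correction of Eqs. 4-5, using a mean filter to count the “changed” neighbours (boundary handling is not specified in the text; replicating border pixels is our assumption):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spatial_correction(changed, R):
    """Majority vote over the (2R+1) x (2R+1) square neighbourhood of Eq. 5:
    a pixel is relabelled "changed" iff at least half of its neighbours
    (itself included) are labelled "changed" by the Otsu threshold (Eq. 4)."""
    frac = uniform_filter(changed.astype(float), size=2 * R + 1, mode="nearest")
    return frac >= 0.5
```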

Finally, we analyse the time complexity of the proposed method. The time cost of the autoencoder layers is \(O\left (\sum \limits _{l=1}^{d_{A}}{n_{l-1} n_{l}}\right )\) (İrsoy and Alpaydın, 2017), where dA is the number of layers in the autoencoder, l is the index of a layer and nl is the number of nodes in layer l. The time complexity of the distance computation is O(UVM). The time complexity of Otsu’s algorithm is O(UV), while the time complexity of the spatial correction operation is O(UVR2). In general, most of the time cost is spent training the autoencoder ANN.

5 Implementation details

ORCHESTRA is implemented in Python 3.8. A pre-processing step is performed to scale the spectral data into the range [0,1], so that the spectral bands are processed with values in comparable ranges.
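
A sketch of this pre-processing (the paper does not state whether the scaling is global or per band; per-band min-max scaling is our assumption):

```python
import numpy as np

def minmax_scale(image, eps=1e-12):
    """Scale each spectral band of a U x V x M image into [0, 1]."""
    mn = image.min(axis=(0, 1), keepdims=True)
    mx = image.max(axis=(0, 1), keepdims=True)
    return (image - mn) / (mx - mn + eps)
```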

The autoencoder is developed in Keras 2.4.3 with TensorFlow as the back-end. The set-up of the learning rate and batch size is decided by resorting to the tree-structured Parzen estimator algorithm, as implemented in the Hyperopt library (Bergstra et al., 2013). This hyper-parameter optimization is done by using 20% of the entire training collection as a validation set. Therefore, we automatically choose the configuration of learning rate and batch size that achieves the best validation loss in training the autoencoder. The values of learning rate and batch size explored with the tree-structured Parzen estimator are defined as follows: the learning rate varies in the range [0.00001, 0.01] and the batch size ranges over 32, 64, 128, 256 and 512. The autoencoder architecture comprises 5 fully-connected (FC) layers of 128 × 64 × 32 × 64 × 128 neurons when trained with HS data and 3 fully-connected (FC) layers of 8 × 4 × 8 neurons when trained with MS data. Both architectures comprise a dropout layer to prevent overfitting. The mean squared error (mse) is used as the loss function. The classical rectified linear unit (ReLU) (Glorot et al., 2011) is selected as the activation function for each hidden layer, while a linear activation function is used for the last layer. The number of epochs is set to 150, retaining the best model achieving the lowest loss on the validation set.
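
A minimal Keras sketch of the HS architecture described above (the dropout rate, its placement after the bottleneck and the Adam optimizer are our assumptions, as they are not reported; the Hyperopt search over learning rate and batch size is omitted):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hs_autoencoder(n_bands, learning_rate=1e-3):
    """128-64-32-64-128 fully-connected autoencoder with ReLU hidden layers,
    a linear output layer and an mse reconstruction loss."""
    inp = layers.Input(shape=(n_bands,))
    h = layers.Dense(128, activation="relu")(inp)
    h = layers.Dense(64, activation="relu")(h)
    h = layers.Dense(32, activation="relu")(h)   # bottleneck code
    h = layers.Dropout(0.2)(h)                   # rate and placement: assumption
    h = layers.Dense(64, activation="relu")(h)
    h = layers.Dense(128, activation="relu")(h)
    out = layers.Dense(n_bands, activation="linear")(h)
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model

# autoencoder = build_hs_autoencoder(n_bands=224)
# autoencoder.fit(spectra, spectra, epochs=150, batch_size=128,
#                 validation_split=0.2)          # input = target for autoencoders
```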

For the autoencoder architectures, both the number of layers and the number of neurons per layer are selected by taking into account the size of the spectral feature vector of each imagery dataset. In particular, the HS images are spanned on a spectral feature vector with either 224 or 242 spectral bands (see Table 1), while the MS images are spanned on a spectral feature vector with 13 spectral bands. As the MS data are simpler than the HS data, the autoencoder architecture adopted to process the MS images is simpler than the architecture adopted to process the HS images. On the other hand, we also account for the principle that a high number of layers may increase the computational effort without a corresponding gain in accuracy (Uzair & Jamil, 2020). With regard to the number of neurons, we follow the guidelines reported in Vanhoucke et al. (2011) and select the number of neurons in each hidden layer as a power of two, in order to improve the speed of the neural network computation. In fact, most of the computation time spent training an ANN is devoted to matrix multiplication. This is computed as a SIMD (single instruction, multiple data) operation on CPUs by using a batch size that is a power of 2.

Finally, the threshold-based step is performed using the implementation of Otsu’s algorithm from skimage.filters.threshold_otsu, with the number of levels L = 256.
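
In code, this step reduces to:

```python
from skimage.filters import threshold_otsu

theta_otsu = threshold_otsu(sam.ravel(), nbins=256)  # L = 256 levels
change_map = sam > theta_otsu                        # True = "changed"
```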

Table 1 Data scenario description: scene size (column 2), number of spectral bands (column 3), number of changed pixels in the ground truth (GT) change matrix (column 4), number of unchanged pixels in the change matrix (column 5) and number of pixels with an unknown label in the change matrix (column 6)

6 Experimental evaluation and discussion

In this study we consider three co-registered, bi-temporal HS datasets (see Section 6.1) acquired in both rural and urban environments. For these datasets, the ground truth change information is available to validate the accuracy of ORCHESTRA. In particular, the accuracy performance is evaluated with the Overall Accuracy (OA), the number of Missed Alarms (MA – changed pixels assigned to the unchanged background) and the number of False Alarms (FA – unchanged pixels labelled as changed). These metrics are commonly considered in remote sensing for the evaluation of change detection methods. In addition, we measure the residual error of the autoencoders (mean squared error on the restored HS data) on both the primary image and the secondary image, to explore the ability of the autoencoder in HS data reconstruction. The results achieved on each dataset are discussed in Section 6.2.
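
As a sketch, given a predicted binary change map and the ground truth, the three metrics can be computed as:

```python
import numpy as np

def cd_metrics(pred, gt):
    """OA, Missed Alarms and False Alarms for binary change maps
    (True = "changed"); pixels with an unknown label are assumed to be
    removed from both arrays beforehand."""
    ma = int(np.sum(gt & ~pred))     # changed pixels labelled unchanged
    fa = int(np.sum(~gt & pred))     # unchanged pixels labelled changed
    oa = float(np.mean(pred == gt))  # fraction of correctly labelled pixels
    return oa, ma, fa
```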

6.1 HS data

We consider three publicly available datasets—Hermiston, Santa Barbara and Bay Area. Each dataset comprises a couple of co-registered HS images of a scene, as well as ground truth information on the changes that occurred in the sensed scene. A brief description of the datasets is reported in Table 1.

In Hermiston, the study area covers an irrigated agricultural field. This area provides a benchmark agricultural scene, which has been frequently used in the evaluation of the accuracy of HS CD methods (e.g. Appice et al., 2019, 2020; Lopez-Fandino et al., 2017, 2018). The land-cover types are soil, irrigated fields, rivers, buildings and types of cultivated land and grassland. In this dataset, the bi-temporal HS images were acquired with the HYPERION sensor, a space-borne system carried on the EO-1 satellite, which includes 242 spectral bands covering wavelengths between 400 nm and 2.5 μm. The spectral range is divided into two intervals: the VNIR range (70 bands with wavelengths ranging from 356 to 1058 nm) and the SWIR range (172 bands with wavelengths between 852 and 2577 nm). The spectral and spatial resolution of this sensor are about 10 nm and 30 m, respectively, over a 7.5-km strip. The Hermiston scene was monitored in the years 2004 and 2007 with the sensor over Hermiston City, Umatilla County, Oregon, USA. Each HS image of the dataset consists of 390 × 200 pixels acquired across 242 spectral bands.

In both Santa Barbara and Bay Area, the study areas cover an urban suburb in California. Both datasets have already been used in the evaluation of the HS CD methods illustrated in Appice et al. (2019, 2020). In both datasets, the images were acquired with the AVIRIS sensor, an optical sensor that delivers calibrated images of the upwelling spectral radiance in 224 contiguous spectral bands with wavelengths from 400 to 2500 nm. The spectral and spatial resolution of this sensor are about 10 nm and 4 m, respectively. The Santa Barbara scene was monitored in the years 2013 and 2014 with the sensor over the Santa Barbara region (California). It consists of 984 × 740 pixels and includes 224 spectral bands. The Bay Area scene was monitored in the years 2013 and 2015 with the sensor surrounding the city of Patterson (California). It consists of 600 × 500 pixels and includes 224 spectral bands.

6.2 Results

We start by evaluating how the autoencoder trained on the primary image discloses knowledge that may contribute to separating changed pixels from unchanged pixels. To this aim, we explore how the autoencoder g∘f trained on the image of the couple assigned to the primary role can accurately reconstruct the primary image, while badly reconstructing the changed pixels of the secondary image. We evaluate two configurations, defined by assigning the role of the primary image to (1) the oldest image and (2) the newest image of the couple, respectively.

Table 2 reports the mean squared error (mse) computed by comparing pixelwise each image to its reconstruction restored through the trained autoencoder. In both configurations, the autoencoder trained on the primary image reconstructs the secondary image worse, yielding a poor restoration of the spectral vectors of changed pixels. This can be seen in Fig. 4a, b and c, which depict the maps of the squared errors computed pixelwise on the reconstructions of the images of the Hermiston dataset. The reconstructions are done with autoencoder configuration (1), trained considering the oldest image, acquired in 2004, as the primary image. These maps highlight that the changed area is already delineated by the poorly reconstructed pixels in the secondary image acquired in 2007. This supports our hypothesis that the autoencoder transformation can disclose a representation of the spectral data that contributes to better disentangling the change.

Table 2 Autoencoder configurations: mean squared error
Fig. 4 Hermiston dataset: change ground truth (GT), mse\((\mathbf {X},\mathbf {X}^{\prime })\) and mse\((\mathbf {Y},\mathbf {Y}^{\prime })\), with the image acquired in 2004 as the primary image X and the image acquired in 2007 as the secondary image Y

We proceed by measuring how the autoencoder actually improves the accuracy of the CVA strategy. Table 3 reports the accuracy metrics of both ORCHESTRA and its baseline (CVA), which is defined by implementing the basic CVA with SAM and Otsu’s algorithm on the original data (i.e. without the autoencoder architecture). The results show that both configurations of ORCHESTRA—(1) and (2)—outperform CVA. Interestingly, the highest accuracy (OA) is always achieved with the configuration of ORCHESTRA that maximizes the ratio of the mse computed on the reconstruction of the secondary image to the mse computed on the reconstruction of the primary image (\(\frac {mse(\mathbf {Y},\mathbf {Y}^{\prime })}{ mse(\mathbf {X},\mathbf {X}^{\prime })}\), reported in Table 2). This defines a promising criterion to automatically select the best configuration of ORCHESTRA in an unsupervised manner. A final consideration concerns the spatial correction, which is beneficial on all datasets except Hermiston.

Table 3 Accuracy performance (OA, FA and MA) of ORCHESTRA and CVA

Finally, we analyse the accuracy of a few CVA methods that have been defined in the recent literature and evaluated on the same datasets. Table 4 reports the OA results. The compared methods also use SAM and spatial information for the final label assignment. In addition, Lopez-Fandino et al. (2018) and López-Fandiño et al. (2019) introduce the watershed analysis, Appice et al. (2019) resorts to an autoencoder for dimensionality reduction, while Appice et al. (2020) uses an iterative combination of clustering and classification. ORCHESTRA performs closely to its competitors on Hermiston. It outperforms Appice et al. (2019, 2020) and López-Fandiño et al. (2019) on both Santa Barbara and Bay Area. On the other hand, the iterative procedure defined in Appice et al. (2020) may be considered for a future upgrade of ORCHESTRA.

Table 4 Compared competitors (OA). Results of the competitors as reported in the reference papers

Upon the completion of this comparative analysis, we perform the Friedman-Nemenyi statistical test (Demšar, 2006) on Hermiston, Santa Barbara and Bay Area. This test ranks the compared CVA methods for each dataset separately, so that the best-performing method is given a rank of 1, the second best a rank of 2 and so on (Demšar, 2006). Figure 5 ranks the CVA methods according to the result of the Friedman-Nemenyi statistical test performed on OA. The results of the test confirm that ORCHESTRA builds the change matrix achieving the highest OA, with Appice et al. (2020) as runner-up.

Fig. 5 Comparison based on the Friedman-Nemenyi test of the OA computed on the change matrices built using ORCHESTRA and the related methods (Lopez-Fandino et al., 2018; López-Fandiño et al., 2019; Appice et al., 2019, 2020). The authors of Lopez-Fandino et al. (2018) and López-Fandiño et al. (2019) test the same CVA approach

7 Majella National Park analysis

Wildfires generate significant and complex environmental changes, such as physical and chemical variations of soils, structural changes of vegetation, and changes in ecological processes and ecosystem services (Meng and Zhao, 2017). Satellite MS data are traditionally exploited for monitoring burnt areas and wildfire effects. In this paper, we analyse the ability of ORCHESTRA to detect such environmental changes in MS images. In particular, we process two co-registered Sentinel-2 L1C images acquired on August 16, 2017 (Fig. 6a) and September 15, 2017 (Fig. 6b), in the area of the Morrone Mountain (within the Majella National Park, Italy). This area was burnt in a wildfire that started on August 19, 2017 and lasted 25 days, burning more than 2,000 ha of an inaccessible area covered by coniferous forest and gorse. The processed MS images are composed of 1494 × 1338 pixels, with a pixel resolution of 10 m/pixel and an MS resolution of 13 spectral bands (Aiello et al., 2019).

Fig. 6 Colour-based version of the Sentinel-2 L1C images acquired on August 16, 2017 (a) and September 15, 2017 (b), in the area of the Morrone Mountain (within the Majella National Park, Italy)

We perform a preliminary analysis by calculating the Normalized Burn Ratio (NBR) index on both the pre-fire and post-fire MS images. This index is commonly used to highlight burnt areas. Formally,

$$ NBR=\frac{NIR-SWIR}{NIR+SWIR}, $$
(6)

where the reflectance in the shortwave infrared band (SWIR), which is sensitive to the water content of both soil and vegetation, increases after a fire. On the other hand, the near-infrared band (NIR) declines in reflectance after a fire, due to the decrease of the phytomass chlorophyll content. So, following the conclusions drawn in Key and Benson (2006), we can assess the fire severity in a study area by measuring the difference between the NBR index calculated on the pre-fire and post-fire satellite images:

$$ dNBR=NBR_{pre\_fire} - NBR_{post\_fire}. $$
(7)

In fact, this difference is correlated with the magnitude of the changes caused by fires on the vegetation (Key & Benson, 2006). Assuming that the unburnt areas have a similar spectral behaviour in two satellite images acquired before and after a fire, dNBR takes values around zero in unburnt areas, while it takes positive values in burnt areas. Figure 7 delineates the fire borders (red line) detected in the study area with the dNBR analysis conducted as described in Key and Benson (2006).
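
A sketch of the dNBR computation of Eqs. 6-7 on the 13-band Sentinel-2 L1C cubes (the band positions of B8/NIR and B12/SWIR in the band ordering are our assumptions and must match the actual product layout):

```python
import numpy as np

def dnbr(pre_fire, post_fire, nir=7, swir=12, eps=1e-12):
    """dNBR (Eq. 7) from two U x V x 13 Sentinel-2 L1C cubes; indices assume
    the order B1..B8, B8A, B9..B12, i.e. B8 (NIR) at 7 and B12 (SWIR) at 12."""
    def nbr(img):                            # Eq. 6
        n, s = img[..., nir], img[..., swir]
        return (n - s) / (n + s + eps)
    return nbr(pre_fire) - nbr(post_fire)    # ~0 in unburnt, > 0 in burnt areas
```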

Fig. 7 Burnt area detected by dNBR (red line) and ORCHESTRA (yellow zone), respectively. Newly burnt areas detected (blue circles), false-alarm areas (orange circle) and burnt areas detected regardless of clouds (green circle) by ORCHESTRA

Although dNBR is one of the best-performing indexes for the detection of burnt areas over large fire zones with open forests and woodlands (Tran et al., 2018), it suffers from a few limitations. It is influenced by the fact that unburnt areas do not remain static over time, but naturally undergo changes, shifting between more and less dry/humid conditions. The parameters of the dNBR analysis need to be reviewed in each scenario based on several factors, e.g. the seasonality of the images and the closeness of the image acquisition to the fire event. In addition, the dNBR computation is sensitive to variations in soil brightness (Epting et al., 2005a), the type of vegetation (Epting et al., 2005b) and the density of the vegetation (Lentile et al., 2009). Finally, both clouds and their shadows can worsen the scenario when the dNBR analysis is done on large areas.

In this study, we explore which limitations of the dNBR analysis may be overcome by performing the CVA strategy with ORCHESTRA. To this aim, we consider the configuration of ORCHESTRA that handles the pre-fire image as X and the post-fire image as Y. This configuration trains the autoencoder that maximizes the ratio of the mse values computed on the reconstructed images. We apply the spatial correction with R = 10. In particular, we focus on: (1) the correctness of the detected fire borders; (2) the ability to correctly detect any new burnt area, as well as the presence of false-alarm areas; (3) the robustness of the performance to possible clouds. To reduce the computational effort, we use the Corine Land Cover 2018 classification and analyse only the pixels belonging to “Forest and semi-natural areas”.

Figure 7 highlights the advantages achieved with the CVA completed with ORCHESTRA. The blue circles show that ORCHESTRA is able to detect newly burnt areas that are undetected by the dNBR. The green circle is a zoom-in showing the capability of ORCHESTRA to discard changes that are due to the presence of clouds. Finally, we note that only one polygon (orange circle) is detected as a false alarm. We can conclude that, also for this particular dataset, ORCHESTRA shows good potential for a more effective identification of burnt areas.

8 Conclusion

This paper describes a CVA method for analysing a couple of optical satellite images (i.e., a primary image and a secondary image) acquired over time on the same scene, in order to separate the pixels where a change occurs from the unchanged background of the scene. In particular, the proposed method takes advantage of autoencoders to identify spectral patterns that may aid in better disentangling changed pixels from unchanged ones. First, an autoencoder is trained on the primary image and used to restore both the primary and the secondary image. Then, the SAM distance is computed pixel-by-pixel between the restored images as a measure of the spectral change. Finally, Otsu’s algorithm is used on the computed distances to isolate the changed pixels, i.e. the pixels that exhibit the highest distances.

The novelty of the proposed CVA method is the specific use of an autoencoder architecture to transform the spectral data to compare, in order to enhance the spectral changes in the processed data. This is different from the common use of autoencoders for data dimensionality reduction. Specifically, we rely on the consideration that the autoencoder trained on the primary image should accurately restore both the pixels of the primary image and the unchanged pixels of the secondary image. Instead, it should see the changed pixels of the secondary image as anomalies and reconstruct them badly. Therefore, computing a distance between the restored spectral data measured at the same pixel aids in better delineating possible changes in the scene.

The experiments are performed by processing three couples of satellite HS images, collected either in a benchmark agricultural scene or in urban scenes. These experiments show that the autoencoder component of the methodology contributes to the gain in detection accuracy. They also reveal that the proposed method provides competitive accuracy compared to recent state-of-the-art CVA methods (comprising recent methods with autoencoders). With the encouraging performance of the proposed method, precise land-use and land-cover (or cropping pattern) changes may be identified. In addition, the method supplies promising results in the analysis of a couple of satellite MS images of a burnt area in the Majella National Park (Italy).

Some directions for further work remain to be explored. For example, appropriate classification algorithms may be studied to discriminate among different change types. The performance of various distance measures may be considered for the CVA. In addition, we plan to study the performance of autoencoder-enhanced distance measures within a deep metric learning framework (e.g. a Siamese network or a Triplet network). Finally, we intend to investigate different autoencoder architectures, e.g. convolutional autoencoders, for the spectral data reconstruction.