1 Introduction

Advances in imaging technology have produced extremely large datasets. However, learning useful information from these datasets, particularly with modern deep learning architectures, requires a large amount of annotation. Although initiatives such as the ImageNet challenge and those related to autonomous vehicles provide such annotated data, they are limited to street-level imagery. In many areas, such as remote sensing, there is a dearth of annotated datasets [6]. Thus, there is a pressing need for a method that allows unsupervised learning of features that are distinctive, possess reconstruction capability and are compact.

Fig. 1. DAE-DR framework. The upper part of the figure explains the feature learning and reduction step, while the lower part presents the process of discriminating the reduced feature set.

To cultivate distinctiveness among unsupervised features, we adopted a discriminative autoencoder network inspired by Generative Adversarial Networks (GANs) [4] and Siamese networks [8] in our previous work [11]. However, the learned features are high-dimensional and have a large memory footprint, which demands considerable storage capacity in big data applications such as remote sensing image retrieval (RSIR).

Dimensionality reduction is one possible solution. It can be applied through feature aggregation (using global sum-pooling, max-pooling, or scaled sum-pooling) or through selection of kernels from the activations of the learned network [5, 12]. However, these methods have two important limitations. Firstly, they perform poorly on features learned through unsupervised approaches. Secondly, they require an open-ended set of experiments and still do not guarantee a compact feature representation.

In our previous work we proposed a Discriminative Autoencoder (DAE) architecture that takes high-dimensional features from the depth layer of an autoencoder as input and projects them onto a space that separates similar images from dissimilar ones (see Fig. 1) [11]. This work demonstrates a step-wise procedure for reducing the features acquired through a deep autoencoder network without significantly affecting their discriminative and regenerative characteristics.

Our approach leverages the fact that autoencoders with linear activations are mathematically equivalent to linear Principal Component Analysis (PCA), while those with non-linear activations (such as the sigmoid) are analogous to non-linear PCA. To demonstrate its efficacy, we evaluated our approach on the RSIR problem using two benchmark datasets: University of California Merced Land Use/Land Cover (LandUse) [13] and High-resolution Satellite Scene (SatScene) [3], containing 2100 and 1050 images, respectively.

2 Preliminaries

2.1 Discriminative Autoencoder (DAE)

For a dataset \(\mathrm {X}\) containing N images such that \(\mathrm {X} = \{x_1, x_2, \cdots ,x_N\}\), our network maps each input image \(x_i\) onto the feature space, generating the feature \(f_i\) through the deep network \(f_i = h_{\theta }(x_i) = r(W{x}_i+b)\).

Similarly, it reconstructs the output image \(x^{'}_i\) from the feature \(f_i\) using \({x^{'}}_{i} = {g_{\theta ^{'}}}(f_i) = t({W^{'}}f_i+b^{'})\). Here, h and g are the encoder and decoder functions, respectively, \(\theta =\{W,b\}\) are the encoder parameters, \({\theta }^{'}= \{W^{'},b^{'}\}\) are the decoder parameters, and r and t are non-linear activation functions. Using the squared error \(L({x}_i, {x^{'}}_i) = {\Vert x_i-{{x}^{'}}_{i}\Vert }_2^2\) as the loss function, we optimize the parameters \(\theta \) and \({\theta }^{'}\) as follows:

$$\begin{aligned} \theta ^{*}, \theta ^{'*} = \arg \min _{\theta , \theta ^{'} } {\frac{1}{N} \sum _{i=1}^{N} L(x_i, x_i^{'}) } \end{aligned}$$
(1)
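The objective in Eq. (1) can be illustrated with a minimal NumPy sketch. It assumes a single dense encoder layer and a single dense decoder layer with sigmoid activations, and takes L as the squared Euclidean distance; the actual network is a deep convolutional autoencoder, so the layer sizes, the random weights and the helper names (`encode`, `decode`, `reconstruction_objective`) are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """f = r(W x + b), with r assumed to be a sigmoid."""
    return sigmoid(W @ x + b)

def decode(f, W_dec, b_dec):
    """x' = t(W' f + b'), with t assumed to be a sigmoid."""
    return sigmoid(W_dec @ f + b_dec)

def reconstruction_objective(X, W, b, W_dec, b_dec):
    """Average of L(x_i, x'_i) over the N columns of X, as in Eq. (1)."""
    losses = []
    for x in X.T:
        x_rec = decode(encode(x, W, b), W_dec, b_dec)
        losses.append(np.sum((x - x_rec) ** 2))
    return float(np.mean(losses))

# Toy sizes (assumptions): 16-dimensional inputs, 4-dimensional features, 10 samples.
rng = np.random.default_rng(0)
d, k, N = 16, 4, 10
X = rng.random((d, N))
W, b = rng.normal(scale=0.1, size=(k, d)), np.zeros(k)
W_dec, b_dec = rng.normal(scale=0.1, size=(d, k)), np.zeros(d)
objective = reconstruction_objective(X, W, b, W_dec, b_dec)
```

Training would then adjust `W`, `b`, `W_dec` and `b_dec` to reduce `objective`, for example by gradient descent.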

A pair of these image features \((f_q, f_t)\) is then concatenated and passed to the discriminator network \(y^{'} = d((f_q, f_t), \theta _{d})\) to compute the Bernoulli probabilities (matched or unmatched), where d is the discriminator model and \(y^{'}\) is the classification probability. The parameters of d are optimized using the cross-entropy loss \(L_d (y, y^{'}) = - \sum _{q,t} y \log y^{'}\), as given in Eq. (2).

$$\begin{aligned} \theta _{d}^{*}= \arg \min _{\theta _{d} } {\sum _{q,t} \left[ - y \log d\left( (h_\theta (x_q), h_\theta (x_t)), \theta _{d}\right) \right] } \end{aligned}$$
(2)
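The pair-wise discrimination step can be sketched as follows. This is a minimal illustration that assumes the discriminator is a single logistic unit applied to the concatenated pair; the real discriminator is a deeper network. The text states only the \(-y \log y^{'}\) term, and the sketch additionally includes the complementary term for unmatched pairs, which is the standard binary cross-entropy and is an assumption here.

```python
import numpy as np

def discriminator(f_q, f_t, w_d, b_d):
    """y' = d((f_q, f_t), theta_d): match probability for a feature pair."""
    pair = np.concatenate([f_q, f_t])
    return 1.0 / (1.0 + np.exp(-(w_d @ pair + b_d)))

def cross_entropy(y, y_pred, eps=1e-12):
    """Binary cross-entropy: -y log y' plus the assumed term for y = 0."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)))

# Toy features and weights (assumptions, for illustration only).
rng = np.random.default_rng(0)
f_q, f_t = rng.random(8), rng.random(8)
w_d, b_d = rng.normal(size=16), 0.0
y_pred = discriminator(f_q, f_t, w_d, b_d)
loss_match = cross_entropy(1, y_pred)      # label y = 1: the pair matches
loss_nonmatch = cross_entropy(0, y_pred)   # label y = 0: the pair does not match
```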

In our previous work [11], we demonstrated that the features f learned using a residual autoencoder coupled with the discriminative metric learning scheme outperform approaches based on supervised features. However, these features are prone to the curse of dimensionality.

2.2 Autoencoder vs PCA Relationship

Fig. 2. Visualization of reconstructed images from (a) Input (b) \(8\times 8\times 512\) dimensional features of DAE (c) 1063 PCA basis (d) \(1\times 1\times 1024\) dimensional encoder features (DAE-DR 1D) (e) \(8\times 8\times 20\) dimensional encoder features (DAE-DR 2D).

We aim to obtain a transformation \(\varPhi \) that maps f to a subspace representation \(\tilde{f}\) as \(\tilde{f} = \varPhi _{\tilde{\theta }}(f,\tilde{W})\). From \(\tilde{f}\) we then reconstruct the output \(\tilde{x}^{'}\) as \(\tilde{x}^{'} = g_{\tilde{\theta }}(\tilde{f}) = t(\tilde{W}\tilde{f}+\tilde{b})\) and compute the similarity as \(\tilde{y}^{'} = d(\{\tilde{f}_q, \tilde{f}_t\}, \tilde{\theta }_{d})\). The representation \(\tilde{f}\) should be compact without any significant loss of the information in f. To conserve energy, the transform should also be unitary, i.e.

$$\begin{aligned} \Vert \tilde{f}\Vert ^{2} =\tilde{f}^H\tilde{f} = \varPhi _{\tilde{\theta }}(f,\tilde{W})^H\varPhi _{\tilde{\theta }}(f,\tilde{W}) = \Vert {f}\Vert ^{2} \end{aligned}$$
(3)

where \(f^H\) is the Hermitian conjugate of f. One such unitary transform is the eigenvector matrix of the autocorrelation \(R_{ff} = ff^H\), which forms the basis of Principal Component Analysis (PCA). To learn the optimal feature vector, we exploit the relationship between PCA and the autoencoder basis. Mathematically, a linear autoencoder is defined as:

$$\begin{aligned} f_1 = W_1 \times X + b_1 \end{aligned}$$
(4a)
$$\begin{aligned} \tilde{X} = W_2 \times f_1 + b_2 \end{aligned}$$
(4b)

where \(W_1\) and \(W_2\) are weights, X is the input and \(\tilde{X}\) is the reconstructed output. When the mean squared cost function (Eq. 5) is minimized with respect to \(W_1, W_2, b_1, b_2\), the problem reduces to an optimization with respect to \(W_2\) only, as given in Eq. (6).

$$\begin{aligned} \min _{W_1, W_2, b_1, b_2} \Vert {X - (W_2 (W_1 X + b_1 ) + b_2)}\Vert ^2 \end{aligned}$$
(5)
$$\begin{aligned} \min _{W_2} \Vert X^{*} - W_2 W_2^{\dagger } X^{*} \Vert ^2 \end{aligned}$$
(6)

where \(X^{*}\) is obtained by subtracting the mean image from each image in the data, \(X^{*} = X - \bar{x}1^T_N\). Thus, by singular value decomposition of \(W_2\), it can be shown that the singular vectors of \(W_2\) are the principal components of X. Consequently, PCA is equivalent to a linear autoencoder, whereas a typical deep neural network autoencoder with non-linear activation functions is analogous to a non-linear version of PCA [2]. Therefore, the deep CNN autoencoder can learn a better feature space than PCA, while PCA helps us compute the optimal dimension of that space.
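The link between Eq. (6) and PCA can be checked numerically. The sketch below uses synthetic data and assumed toy dimensions: it sets \(W_2\) to the top-k principal directions of \(X^{*}\), evaluates the cost of Eq. (6), and compares it with the best achievable rank-k residual (the sum of the discarded squared singular values, by the Eckart-Young theorem). It also checks the energy conservation property of Eq. (3) for the full orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, k = 6, 200, 2                                      # toy sizes (assumptions)
X = rng.normal(size=(d, d)) @ rng.normal(size=(d, N))    # correlated synthetic data
X_centered = X - X.mean(axis=1, keepdims=True)           # X* = X - x_bar 1^T

# Principal directions: left singular vectors of X*.
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
W2 = U[:, :k]

def eq6_cost(W2, X_centered):
    """|| X* - W2 W2^+ X* ||^2 from Eq. (6), with W2^+ the pseudo-inverse."""
    residual = X_centered - W2 @ np.linalg.pinv(W2) @ X_centered
    return float(np.sum(residual ** 2))

pca_cost = eq6_cost(W2, X_centered)
optimal_cost = float(np.sum(s[k:] ** 2))                 # Eckart-Young lower bound
random_cost = eq6_cost(rng.normal(size=(d, k)), X_centered)

# Eq. (3): projecting onto the full orthonormal basis preserves the norm.
f = X_centered[:, 0]
energy_in = float(f @ f)
energy_out = float((U.T @ f) @ (U.T @ f))
```

Here `pca_cost` matches `optimal_cost`, while a randomly chosen decoder of the same width gives a cost at least as large.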

3 Methodology

3.1 Dimensionality Reduction in DAE via PCA

PCA helps to find the optimal dimension of the space spanned by the data, but computing the autocorrelation matrix is computationally expensive. So, instead of computing f as \(f = \varPhi ^Hx\), where \(\varPhi = \xi (R_{XX})\) and \(\xi (.)\) returns the eigenvectors, we compute \(\varPhi \) as \(\varPhi = F\xi (R_{F^H F})\) (using the method of Sirovich and Kirby [10]), i.e. by computing the eigenvectors of the inner product of the depth features instead of the raw images, where \(F = \{f_1, f_2, \cdots \}\). \(\tilde{f}\) is then computed as:

$$\begin{aligned} \tilde{f} = \varPhi ^Hf \end{aligned}$$
(7)

We then compute the autocorrelation \(R_{\tilde{F}^{H}\tilde{F}}\) of \(\tilde{f}\) to identify the basis vectors that contain the maximum amount of energy. Applying PCA to the DAE features (32768 dimensions) on the LandUse dataset shows that \(95\%\) of the information lies in only 1063 principal components.
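The inner-product trick and the energy criterion described above can be sketched as follows. This is an illustrative NumPy version on synthetic features, not the code used for the reported experiments; the helper names `snapshot_pca` and `components_for_energy`, the toy sizes and the numerical tolerance are assumptions. Following the text, the features are not mean-centred, since the autocorrelation rather than the covariance is used.

```python
import numpy as np

def snapshot_pca(F):
    """Basis for the columns of F (d x n, d >> n) via the n x n matrix F^H F."""
    gram = F.conj().T @ F                        # n x n instead of d x d
    eigvals, eigvecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-10 * eigvals[0]          # drop numerically null directions
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    Phi = (F @ eigvecs) / np.sqrt(eigvals)       # Phi = F xi(F^H F), unit-norm columns
    return Phi, eigvals

def components_for_energy(eigvals, fraction=0.95):
    """Smallest number of leading components holding `fraction` of the energy."""
    cumulative = np.cumsum(eigvals) / np.sum(eigvals)
    return int(min(np.searchsorted(cumulative, fraction) + 1, len(eigvals)))

# Synthetic features: 40 samples of dimension 500 that are close to rank 5.
rng = np.random.default_rng(0)
d, n = 500, 40
F = rng.normal(size=(d, 5)) @ rng.normal(size=(5, n)) + 0.01 * rng.normal(size=(d, n))
Phi, eigvals = snapshot_pca(F)
k = components_for_energy(eigvals, 0.95)
F_reduced = Phi[:, :k].conj().T @ F              # Eq. (7): f~ = Phi^H f
```

The eigen-decomposition is performed on an n x n matrix rather than a d x d one, which is what makes the approach practical when the feature dimension (32768 here) far exceeds the number of images.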

3.2 Dimensionality Reduction via DAE-DR Network

We modified our existing DAE architecture into the DAE-DR network to learn features with compact dimensions. We explored the following three ways of converting DAE features to DAE-DR features.

Pruning Spatial Dimensions of Filters. By introducing 3 additional residual blocks in the autoencoder, the spatial dimension is reduced to \(1\,\times \,1\) while the number of filters is increased to 1024, resulting in a 1D fine-grained feature vector (DAE-DR 1D). However, compared to PCA, neither the regeneration nor the retrieval scores were encouraging. Fig. 2(d) shows that reducing the spatial dimension of the activations results in a loss of structural information and a degraded reconstructed image, confirming the idea presented in [11].

Table 1. Regeneration loss: averaged MSE on the test set, where the training hyper-parameters were the same for all models. PSNR is averaged over the 21 classes of LandUse.

Fig. 3. Qualitative evaluation of RSIR for the harbour, building, intersection, and forest classes from LandUse, with the query image (left-most) and its first ten retrieved images in each row. It also includes some misclassification results for the intersection class.

Pruning Temporal Dimensions of Filters. Filters could also be pruned by adopting the “try and learn” approach [7], converting DAE to DAE-DR. However, this method requires a long training time, which increases exponentially with the complexity of the network. Another way is to introduce a stack of layers that reduces the dimensions depth-wise while keeping the spatial dimensions unchanged throughout. This technique ensures that the structural information is retained in the spatial dimensions. However, the added depth in the network architecture produces a blurred regeneration.

Modification of Existing DAE Network. Another way is to modify the hidden layers of the original autoencoder network by adjusting the number of filters to produce features of the desired dimension. This approach yields compact 2D features (DAE-DR 2D) with a significant improvement in reconstruction, as illustrated in Fig. 2(e).

The discriminator network for each of the three scenarios mentioned above has been modified in such a way that it accommodates the input feature dimension, preserving the overall architecture of the network.
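The feature sizes implied by the three configurations can be worked out directly. The short calculation below uses the tensor shapes given in the caption of Fig. 2; the dictionary layout is only an illustration.

```python
from math import prod

# Feature tensor shapes (height, width, filters) as listed in the Fig. 2 caption.
feature_shapes = {
    "DAE": (8, 8, 512),
    "DAE-DR 1D": (1, 1, 1024),
    "DAE-DR 2D": (8, 8, 20),
}

sizes = {name: prod(shape) for name, shape in feature_shapes.items()}
# sizes == {"DAE": 32768, "DAE-DR 1D": 1024, "DAE-DR 2D": 1280}

reduction = {name: sizes["DAE"] / size for name, size in sizes.items()}
# reduction["DAE-DR 1D"] == 32.0, reduction["DAE-DR 2D"] == 25.6
```

DAE-DR 2D keeps the \(8\times 8\) spatial grid and shrinks only the filter dimension, giving a feature roughly 25 times smaller than the 32768-dimensional DAE feature.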

Table 2. Comparative evaluation of our proposed approach for feature dimension reduction. Despite the smaller feature size, our approach outperforms hand-crafted features and is comparable with supervised deep features.

4 Results and Discussion

4.1 Training and Evaluation

To compare the performance of the reduced features with our previous results, all training hyper-parameters were kept as in [11]. Data augmentation made the discriminator network robust to variations in scale, illumination and translation. To evaluate all the approaches proposed in Sect. 3, we computed the standard metrics discussed in [9] for remote sensing image matching; a brief analysis is provided in this section.

4.2 Analysis of Image Reconstruction

We trained three variants of autoencoder networks and compared the regenerated images with those of [11]. The qualitative results in Fig. 2 show that the reconstruction from DAE-DR 2D features is smoother than the reconstruction from PCA basis vectors. Moreover, the spatial compression of features results in a loss of structural information, which degrades the reconstruction. For quantitative evaluation, we compare the reconstruction MSE loss and the Peak Signal-to-Noise Ratio (PSNR). Table 1 shows that the MSE loss of the DAE-DR 1D features is almost 20 times higher than that of DAE. It is also clear that the quality of the reconstructed images deteriorates as the feature dimension decreases. Hence, local spatial information is crucial for effective image reconstruction.
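For reference, the two reconstruction metrics can be computed as in the sketch below. It assumes 8-bit images with a peak value of 255; the text does not state the peak value or any normalisation used for Table 1, so `max_value` is an assumption to be adjusted to the actual pixel range.

```python
import numpy as np

def mse(original, reconstructed):
    """Mean squared error between two images of the same shape."""
    diff = original.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original, reconstructed, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB; max_value is the assumed pixel peak."""
    error = mse(original, reconstructed)
    if error == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / error))

# Toy example: every pixel is off by one grey level.
original = np.full((4, 4), 10, dtype=np.uint8)
reconstructed = np.full((4, 4), 11, dtype=np.uint8)
example_mse = mse(original, reconstructed)
example_psnr = psnr(original, reconstructed)
```

Because PSNR depends on the logarithm of the MSE, a 20-fold increase in MSE corresponds to a drop of about 13 dB.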

Fig. 4. Comparison between different feature sizes and their mAP scores for LandUse dataset.

Fig. 5. Comparison between different feature sizes and their mAP scores for RS SatScene dataset.

4.3 Analysis of Remote Sensing Image Matching

To evaluate the performance of the proposed approach, we provide both quantitative and qualitative evaluation. Fig. 3 shows that the top 10 retrieved images mostly belong to the same class as the query; however, the retrieved images are sometimes confused with visually similar images of other classes, e.g. forest with rivers and over-head with highway. For quantitative evaluation, we computed the Average Normalized Modified Retrieval Rank (ANMRR) and Mean Average Precision (mAP) [9], the metrics used for remote sensing image matching. The previously proposed unsupervised features outperform supervised features in terms of lower ANMRR and higher mAP, as is evident from Table 2 and the comparison of features in Fig. 4 and Fig. 5. In our case, even with a 25-fold reduction in feature size, the performance is still comparable to hand-crafted approaches and competitive with supervised approaches such as NetVLAD [1]. As shown in Table 2, the ANMRR of DAE is better than that of the DAE-DR unsupervised feature approaches on the LandUse dataset. Furthermore, our approach outperforms hand-crafted approaches in terms of ANMRR and feature size. These differences in metric values demonstrate the effectiveness of our proposed feature size for the RSIR problem using unsupervised features. By exploiting local spatial and global semantic information, the proposed feature length outperforms the baseline sizes.
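The two retrieval metrics can be sketched as below. The exact definitions used in [9] are not reproduced in the text, so this sketch assumes the common MPEG-7 style formulation of the normalized modified retrieval rank and the usual definition of average precision over a ranked list covering the whole database. mAP and ANMRR are then the means of the per-query values.

```python
def average_precision(relevance):
    """relevance: 0/1 flags for the ranked results of one query (full ranking)."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def nmrr(relevant_ranks, num_relevant, max_num_relevant):
    """Normalized modified retrieval rank for one query (0 is best, 1 is worst).

    relevant_ranks: 1-based ranks at which ground-truth images were retrieved.
    num_relevant: number of ground-truth images for this query.
    max_num_relevant: largest ground-truth set size over all queries.
    """
    K = min(4 * num_relevant, 2 * max_num_relevant)
    # Ranks beyond K, and ground-truth images never retrieved, are penalised.
    penalised = [r if r <= K else 1.25 * K for r in relevant_ranks]
    penalised += [1.25 * K] * (num_relevant - len(relevant_ranks))
    average_rank = sum(penalised) / num_relevant
    modified_rank = average_rank - 0.5 * (1 + num_relevant)
    return modified_rank / (1.25 * K - 0.5 * (1 + num_relevant))

def mean(values):
    return sum(values) / len(values)

# Toy queries (assumptions): two ranked lists and their ground-truth ranks.
map_score = mean([average_precision([1, 0, 1, 0]), average_precision([1, 1, 0, 0])])
anmrr_score = mean([nmrr([1, 3], 2, 2), nmrr([1, 2], 2, 2)])
```

Lower ANMRR and higher mAP both indicate better retrieval, which is why the two metrics move in opposite directions in the comparison.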

5 Conclusion

This paper introduces a novel unsupervised dimensionality reduction network, developed after a thorough study of systematic methods for reducing unsupervised feature dimension, including PCA. Through experiments we have shown that our proposed network, DAE-DR 2D, achieves comparable content-based image retrieval results with a significantly smaller feature vector. While a larger number of feature maps is required to obtain accurate retrieval results, we show that by retaining the spatial information and discarding redundant filters, it is possible to produce an optimally sized image descriptor using a discriminative autoencoder.