# A Hybrid Deep Learning Architecture for Latent Topic-based Image Retrieval

## Abstract

Learning effective feature descriptors that bridge the semantic gap between low-level visual features extracted directly from image pixels and the corresponding high-level semantics perceived by humans is a challenging task in image retrieval. This paper proposes a hybrid deep learning architecture (HDLA) that generates a sparse latent topic-based representation with the objective of minimizing the semantic gap in image retrieval. HDLA has a deep network structure with a constrained Replicated Softmax Model in the lower layer and constrained Restricted Boltzmann Machines in the upper layers. The advantage of HDLA is that nonnegativity restrictions are imposed on the model weights together with \(\ell _1\)-sparsity enforced over the activations of the hidden layer nodes of the network. This, in turn, enhances the modeling power of the network and leads to a sparse, parts-based latent topic representation of images. Experimental results on various benchmark datasets show that the proposed model exhibits better generalization ability and that the resulting high-level abstraction yields better retrieval performance than state-of-the-art latent topic-based image representation schemes.

## Keywords

Image retrieval, Deep learning, Latent topics

## 1 Introduction

The rapid expansion of digital image repositories poses numerous challenges to computer vision research. Among them, the most important is the development of accurate and efficient mechanisms to search and retrieve desired images from various digital image repositories. Making use of feature vectors automatically extracted from image pixels together with a suitable similarity measure, Content-Based Image Retrieval (CBIR) systems enable the search and retrieval of images from large repositories that are similar to the given query image. In the CBIR domain, the state-of-the-art approaches are based on the Bag of Visual Words (BoVW) model, where images are represented as histograms of visual words. Even though the effectiveness of the BoVW model in image retrieval has been demonstrated by many researchers, it still suffers from a major drawback: the resulting image representation is not as discriminative and descriptive as desired. This is mainly due to the loss of semantic information of visual words at each processing step of the BoVW model. Therefore, the semantic loss associated with the BoVW-based image representation has to be minimized for better retrieval performance.

As the clustering operation in BoVW model often fails to take semantic information into account, there is a high probability that the generated visual dictionary contains many ambiguous visual words. These ambiguous visual words hinder the discriminative power of the BoVW-based image representation. The semantic loss in BoVW model can be reduced to a great extent by automatically grouping semantically similar visual words and then encoding images using these newly identified semantic structures. The work presented in this paper follows the above stated principle to derive a low dimensional but highly discriminative feature vector from the original BoVW-based representation for the task of image retrieval.

It has been observed that visual polysemy and visual synonymy are the root causes behind the induction of ambiguous visual words in the traditional BoVW model. In general, polysemy and synonymy can be regarded as the representational uncertainty of visual information. Polysemy is the property of a visual word that corresponds to two or more semantic concepts, while synonymy is the property of two or more visual words that correspond to the same semantic concept. Polysemy originates as a consequence of the visual appearance diversity of different semantic concepts, and it often leads to low inter-semantic discrimination. On the other hand, synonymy arises due to the appearance-based diversity within a particular semantic concept. Thus, if two semantically dissimilar images share a set of polysemous visual words, then they are closer to each other in the visual word-based feature space. Similarly, synonymous visual words may cause images with the same semantics to be far apart in the visual word-based feature space. Therefore, to minimize semantic loss and thus improve the overall retrieval performance of BoVW-based image retrieval, the issues of polysemy and synonymy need to be effectively tackled.

A set of latent topics **h** = \(\{h_{1},h_{2},\ldots ,h_{N} \}\) is defined such that a visual word can belong to none, one or several latent topics. Figure 1 depicts this notion of latent topics in detail. In the end, images are characterized by the proportion of latent topics, and this representation is found to be more reliable than the BoVW-based feature when calculating the similarity between images.

As latent topics are learned in a completely unsupervised manner, it is not possible to precisely associate a particular semantic concept with each latent topic. However, images with identical latent topic representations are assumed to contain the same semantic concepts and are treated as semantically similar while measuring image similarity. Hence, the notion of latent topics considerably reduces the semantic loss associated with the BoVW model and thus increases the discriminative power of the resulting image representation.

Numerous latent topic-based image retrieval frameworks are available in the literature, and the majority of these approaches are based on graphical models. Approaches based on graphical models try to maximize the joint distribution of visual words and the latent topics to effectively capture the latent topic structures present in the visual word collection. In general, the joint distribution of visual words and latent topics is modeled using a graphical structure. Graphical model-based latent topic frameworks for image retrieval fall into two fundamental categories: (i) directed topic models and (ii) undirected topic models. The former category involves models based on directed graphs, and the most successful approaches in this direction are Probabilistic Latent Semantic Analysis (PLSA) [1], Latent Dirichlet Allocation (LDA) [2], Correlated Topic Models (CTM) [3] and the Pachinko Allocation Model (PAM) [4]. In contrast, undirected topic modeling frameworks encode the joint distribution by means of undirected graphs. Recently, several undirected topic models have been proposed for image retrieval. The most popular among them are the Rate Adapting Poisson model (RAP) [5] and the Replicated Softmax Model (RSM) [6].

The major drawback of directed topic modeling schemes is that exact inference is intractable, so they have to rely on approximation algorithms to compute the posterior distribution of latent topics. Another notable limitation is the disjunctive coding principle of directed topic models where they assume a visual word comes from a single latent topic resulting in a suboptimal representation of images. A more accurate latent topic-based image characterization can be obtained with undirected topic models. In general, undirected models are subjected to conjunctive coding principle, and they assume that a visual word always comes from a distribution influenced by all the latent topics. Moreover, accurate and efficient inference techniques have also been developed for undirected models. For these reasons, undirected topic models achieved state-of-the-art performance on large-scale image retrieval as compared to their directed counterparts.

The main contributions of this paper are as follows:

1. A hybrid deep learning architecture that is able to model the higher-order correlations among visual words by employing multiple levels of nonlinear transformations.
2. A compact but discriminative image representation, well suited for the retrieval task, obtained by directly imposing nonnegativity constraints on the network weights and an \(\ell _1\)-sparseness constraint on the hidden layer activations.

## 2 Related Work

Topic models which automatically analyze and discover latent semantic structures from large image collections have been widely explored in image retrieval domain over the past few years. The basic idea behind topic modeling is the mapping of high-dimensional representation of images in the form of BoVW to a much lower-dimensional space defined by the latent topics. Loosely speaking, a latent topic can be viewed as a set of semantically related visual words. Thus, an image containing a large number of visual words can be concisely modeled using a smaller number of latent topics. This permits the easy estimation of semantic image similarity and consequently helps us to improve the overall retrieval effectiveness. A brief review of the most influential topic modeling schemes in image retrieval research is presented in the rest of this section.

Latent Semantic Analysis (LSA) [7] is regarded as the most primitive topic modeling scheme for semantics-based image retrieval. Pecenovic [8] introduced an LSA-based image modeling framework in which a visual word co-occurrence matrix is initially generated by accumulating the BoVW representation of all the images in the given collection. It is then decomposed into a set of orthogonal factors using Singular Value Decomposition (SVD), and the singular vectors corresponding to the *k* largest singular values constitute the latent topics that represent the relevant semantic structures. When a query image is presented to the system, it is projected into the latent topic space, and the cosine similarity between the query and each indexed image is computed to obtain a ranked retrieval list. Although LSA is a competent approach, it is computationally intensive: the singular value decomposition of the visual word co-occurrence matrix is not practically feasible for large-scale image databases.
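The query step described above can be illustrated with a minimal sketch. It assumes that a projection basis (one row per latent direction) is already available; the `basis`, the toy BoVW vectors and the helper names are hypothetical and are not taken from [8], and no SVD is computed here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def project(bovw, basis):
    """Project a K-dimensional BoVW vector onto the latent directions (rows of basis)."""
    return [sum(w * x for w, x in zip(row, bovw)) for row in basis]

def rank_by_cosine(query_bovw, indexed_bovw, basis):
    """Return indices of the indexed images, most similar to the query first."""
    q = project(query_bovw, basis)
    scores = [cosine_similarity(q, project(v, basis)) for v in indexed_bovw]
    return sorted(range(len(indexed_bovw)), key=lambda i: scores[i], reverse=True)

# Hypothetical toy data: 4 visual words, 2 latent directions (assumed, not learned).
basis = [[0.5, 0.5, 0.0, 0.0],
         [0.0, 0.0, 0.5, 0.5]]
indexed = [[4, 3, 0, 1], [0, 1, 5, 4], [2, 2, 2, 2]]
ranking = rank_by_cosine([3, 5, 1, 0], indexed, basis)
```

In an actual LSA pipeline the rows of `basis` would be the leading singular vectors of the co-occurrence matrix; the ranking logic stays the same.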

Directed topic models have been developed to overcome the above-mentioned limitation of LSA. These models are based on the assumption that each image is a mixture of latent topics and each latent topic, in turn, is a distribution over the visual words. Directed topic models are generally represented with graphical structures comprising a set of random variables. The graphical representation mostly involves two different types of random variables: visible and hidden ones. The visible variables represent visual word count extracted from the given image collection, and the hidden variables capture the semantic structures (latent topics) embedded in these visual words. Then, the directed topic models find an optimal set of latent topics that best explains the visual words found in the given images. Comprehensive evaluation of various directed topic modeling schemes on large-scale image data sets has shown promising results in terms of retrieval precision and recall.

The last decade has witnessed the emergence of a number of directed topic modeling schemes. The earliest effort in this direction is Probabilistic Latent Semantic Analysis (PLSA) [1]. Using PLSA, Zhang et al. [9] encoded an image by a probability distribution over latent topics, with only a few of them assigned high probability values. The PLSA model treats each image as a mixture of a finite number of latent topics. Model fitting then involves the estimation of topic-specific visual word distributions and image-specific latent topic distributions from the given database using Maximum Likelihood Estimation (MLE). Experimental results demonstrated that PLSA-based image modeling schemes perform remarkably well in large-scale image mining operations.

In order to capture more accurate semantic structures, several research attempts have been made to enhance various aspects of the original PLSA model. With this objective, Lienhart et al. [10] proposed a multilayer PLSA architecture that incorporates not just a single layer of hidden variables, but multiple layers with a hierarchy of variables. Hence, information from various modalities can be efficiently integrated to form more meaningful abstractions. On the other hand, Li et al. [11] introduced correlated PLSA (c-PLSA), which merges inter-image correlations into the basic PLSA formulation, and reported promising results in image retrieval tasks. Later on, Chiang et al. [12] proposed the Probabilistic Semantic Component Descriptor (PSCD), whereby the latent topics associated with local image regions are first identified and these regional semantics are then integrated to form a final image descriptor.

However, in PLSA-based image modeling, it is not clear how to infer the topic proportions for an unseen image. That is, the entire model needs to be re-estimated when an image from outside the training dataset is presented as the query. Therefore, the PLSA model and its variants are not scalable. Moreover, the number of parameters to be estimated grows linearly with the size of the image dataset, and hence the learned model tends to overfit the training samples as the number of images in the collection increases.

Later on, Blei et al. [2] formulated a more sophisticated directed topic modeling scheme called Latent Dirichlet Allocation (LDA). Similar to PLSA, LDA assumes that each image is represented by a mixture of a fixed number of latent topics and each topic is a mixture over the set of all visual words in the dictionary. In contrast to PLSA, LDA further assumes that these mixture distributions are Dirichlet-distributed random variables whose parameters have to be estimated from the training data. Therefore, once the parameters of the Dirichlet distributions are learned, the topic proportions for an unseen image can be predicted easily, which is not the case with PLSA-based models. Moreover, the Dirichlet prior on the per-image topic distribution significantly reduces overfitting. Horster et al. [13] investigated the applicability of LDA in the context of semantic image modeling and demonstrated its effectiveness in query-by-example image retrieval settings.

Due to its good scalability, the LDA model has been further extended by many researchers. One such simple extension is the Correlated Topic Model (CTM) [14]. It is similar to LDA except that it draws topic mixture proportions from a logistic normal distribution instead of a Dirichlet distribution. Thus, the parameters of CTM involve a covariance matrix whose entries represent the correlation between all pairs of latent topics. Greif et al. [14] adopted CTM to explicitly model topic correlation and derive a lower-dimensional latent topic vector, which was found to be superior to LDA. As CTM models the pairwise correlations of latent topics, the number of parameters in the covariance matrix grows as the square of the number of latent topics. Recently, the Pachinko Allocation Model (PAM) [15] emerged as a flexible alternative to CTM. PAM efficiently models the nested correlation among latent topics by extending the concept of latent topics to be distributions not only over the visual words but also over other latent topics. Using image data from large-scale databases, Boulemden and Tilli [4] reported improved performance of PAM-based latent topic representation in image retrieval.

It should be noted that inferring posterior distribution of latent topics in directed topic modeling schemes such as LDA and its extensions is typically intractable. In general, approximate inference techniques such as variational methods [16], expectation propagation [17] and Gibbs sampling [18] are utilized to solve this problem. However, these inference algorithms are computationally expensive and time-consuming especially for larger datasets.

Another alternative for topic modeling is the construction of undirected graphical models. As stated before, the visible nodes of the undirected graph accept the BoVW representation of input images, and the hidden nodes indicate the latent topics learned from the given images. The nodes in undirected topic models are arranged in layers, with the visible nodes constituting the first layer and the hidden nodes forming the second layer. This layered architecture has the important characteristic that the nodes in one layer are conditionally independent given the values of the nodes in the opposite layer. With this type of architecture, the mapping from the input space (i.e., visual words) to latent topics can be done by a simple matrix multiplication. As a result, the overall retrieval performance, where speed is a primary concern, can be significantly improved. Additionally, undirected models generate a distributed latent topic representation, which has been shown to be superior to the representations obtained with directed topic models for the task of image retrieval.

To date, only a handful of cases have been reported in the image retrieval literature using undirected topic models. The Rate Adapting Poisson model (RAP) [5] is one of the earlier works in this direction. In this model, it is assumed that the distribution of the hidden nodes is Binomial and that of the visible nodes is Poisson. Even though the RAP-based image retrieval framework performs well in terms of retrieval accuracy, its parameter learning process is difficult and unstable. Recently, there has been great interest in using the Replicated Softmax Model (RSM) [6] for large-scale image retrieval. It is basically a generalization of the Restricted Boltzmann Machine (RBM) [19]. The advantage of using RSM over RAP for deriving high-level image abstractions is that parameter estimation is faster and more stable. The Replicated Softmax Model is trained using a fairly efficient learning procedure known as the contrastive divergence algorithm [20]. More importantly, the generalization ability of RSM for unseen images is far better than that of other models, and this in turn considerably enhances the overall retrieval performance.

More recently, high-level abstraction of text documents learned using a Deep Boltzmann Machine (DBM)-based formulation called Over Replicated Softmax Model (ORSM) [21] demonstrated promising results for the task of text document classification and retrieval. It has been observed that the high-level abstraction obtained with ORSM has better generalization performance on unseen data as compared to other topic modeling schemes. Encouraged by the recent success of ORSM in modeling text documents, this paper investigates the applicability of an undirected deep learning architecture for extracting efficient latent topic-based representations of images.

To summarize, the effectiveness of topic modeling schemes depends entirely on the quality of the latent topics discovered. It turns out that the majority of the above-mentioned models still generate latent topics of inferior quality. This leads to a poor semantic characterization of images and hence degrades the overall retrieval performance. It has been observed that deep network models with many layers of latent topic variables can mitigate this shortcoming. However, selecting an optimum value for the number of latent topics in each hidden layer is not a straightforward task in such deep models. That is, it should be large enough to fit the characteristics of the image data at hand and at the same time small enough to filter out the irrelevant representational details. In this scenario, a sparse feature representation [22], where only a few latent topics describe the relevant information, is an effective solution. Therefore, this paper investigates a hybrid deep learning architecture that generates a sparse, parts-based characterization of images using latent topics and is found to be suitable for large-scale image retrieval.

## 3 Preliminaries

Table 1 Symbols used in this paper

Symbol | Meaning |
---|---|
*K* | Visual dictionary size |
\({\mathbf{v}}_\mathrm{test}\) | BoVW representation of test image |
*L* | Number of hidden layers |
\(T_L\) | Number of nodes in the *L*-th hidden layer |
**u** | Visible layer nodes |
**h** | Hidden layer nodes |
**b** | Visible layer bias |
**a** | Hidden layer bias |
**W** | Weight matrix |
\(\eta\) | Learning rate |
\(\sigma (.)\) | Sigmoid function |
\(\mathbb {U}\) | Visible layer nodes after weight sharing |
\(\mathcal{M}\) | Number of epochs |

Let us first introduce the main notations used in this paper. Some of them are used in this section, and the rest are used in subsequent section where the formulation of the proposed HDLA model is described. All these notations are summarized in Table 1.

### 3.1 Restricted Boltzmann Machine

A Restricted Boltzmann Machine (RBM) consists of two layers of nodes: the visible layer **u** = \([u_1 , u_2, \ldots , u_K ]\) and the hidden layer **h** = \([h_1,h_2, \ldots , h_T]\). The visible layer nodes correspond to observed data, and the nodes in the hidden layer capture the dependencies among the observed data. Each node in the visible layer is connected to all the nodes in the hidden layer and vice versa. There is no link between the nodes within the same layer. In its standard form, the visible and hidden layer units of an RBM are binary-valued. That is, the space of visible vectors for a binary RBM is **u** \(\in \{0,1\}^K\), while the space of hidden unit vectors is **h** \(\in \{0,1\}^T\). Associated with each node in the visible and the hidden layers there is a bias unit, and the corresponding bias offsets are represented by **b** = \([b_1,b_2,\ldots , b_K]\) and **a** = \([a_1,a_2, \ldots , a_T]\).

The connection strength between a visible layer node *i* and a hidden layer node *j* is quantified by a real-valued weight \(w_{ij}\). The pairwise weights between all the elements of **u** and **h** are generally summarized by a symmetric weight matrix **W**. It is important to note that RBMs are special cases of Energy-Based Models (EBM), in which the relationships among variables are modeled by assigning energy values to each of their joint configurations. Then, the model parameters of the RBM are learned by minimizing the energy of all the desirable configurations of the state space vectors. The following function computes the energy value for the joint configuration of visible and hidden layer nodes (**u**, **h**):

\[E({\mathbf{u}},{\mathbf{h}}) = -\sum_{i=1}^{K}\sum_{j=1}^{T} w_{ij} u_i h_j - \sum_{i=1}^{K} b_i u_i - \sum_{j=1}^{T} a_j h_j\]

Based on this energy function, the model assigns a probability to a visible vector **u** in the following fashion:

\[p({\mathbf{u}}) = \frac{1}{Z}\sum_{{\mathbf{h}}} \exp \big(-E({\mathbf{u}},{\mathbf{h}})\big), \qquad Z = \sum_{{\mathbf{u}}}\sum_{{\mathbf{h}}} \exp \big(-E({\mathbf{u}},{\mathbf{h}})\big)\]

The conditional distributions over visible units **u** and hidden units **h** can be easily derived from Eq. (2) and are given by:

\[p(h_j = 1 \mid {\mathbf{u}}) = \sigma \Big(a_j + \sum_{i=1}^{K} w_{ij} u_i\Big), \qquad p(u_i = 1 \mid {\mathbf{h}}) = \sigma \Big(b_i + \sum_{j=1}^{T} w_{ij} h_j\Big)\]

Thus, the RBM is a powerful generative model capable of capturing the covariance structure present in the given input observations in a completely unsupervised fashion. This helps to group semantically similar visual words into a relatively small number of latent topics, and thus a more efficient latent topic-based image characterization can be derived with RBM-based image modeling. The next section provides a detailed description of the training procedure used to learn the model parameters of the RBM.
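As a small illustration of the binary RBM described in this section, the following sketch evaluates the energy of one joint configuration and the factorized conditional activation probabilities. It assumes the standard binary RBM form, with **W** stored as a \(K \times T\) nested list, **b** the visible bias and **a** the hidden bias; the toy parameter values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(u, h, W, b, a):
    """E(u, h) = -sum_ij w_ij u_i h_j - sum_i b_i u_i - sum_j a_j h_j."""
    interaction = sum(W[i][j] * u[i] * h[j]
                      for i in range(len(u)) for j in range(len(h)))
    visible_term = sum(bi * ui for bi, ui in zip(b, u))
    hidden_term = sum(aj * hj for aj, hj in zip(a, h))
    return -interaction - visible_term - hidden_term

def p_h_given_u(u, W, a):
    """Activation probability of every hidden unit given the visible vector u."""
    return [sigmoid(a[j] + sum(W[i][j] * u[i] for i in range(len(u))))
            for j in range(len(a))]

def p_u_given_h(h, W, b):
    """Activation probability of every visible unit given the hidden vector h."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

# Hypothetical toy model with K = 3 visible and T = 2 hidden units.
W = [[1.0, -1.0], [0.5, 0.0], [0.0, 2.0]]
b = [0.0, 0.1, -0.1]
a = [0.0, -1.0]
u, h = [1, 0, 1], [1, 0]
e = energy(u, h, W, b, a)
hidden = p_h_given_u(u, W, a)
```

Because there are no intra-layer connections, each conditional is a product of independent per-unit sigmoids, which is why both functions return one probability per unit.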

#### 3.1.1 Contrastive Divergence Algorithm

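Let \(\mathcal {S} = \{{\mathbf{u}} _{1}, {\mathbf{u}} _{2}, \ldots , {\mathbf{u}} _{N} \}\) be the set of independent and identically distributed training samples, then the log-likelihood of \(\mathcal {S}\) is given by:

\[\ln \mathcal {L}(\varTheta \mid \mathcal {S}) = \sum_{s=1}^{N} \ln p({\mathbf{u}} _s \mid \varTheta )\]

The parameters are updated iteratively by gradient ascent on this log-likelihood:

\[\varTheta ^{(m+1)} = \varTheta ^{(m)} + \varDelta \varTheta ^{(m)}\]

where *m* is the epoch index, an epoch being one presentation of the full training set to the learning algorithm, and \(\varDelta \varTheta\) is the change in the parameter vector \(\varTheta\). In each epoch, \(\varDelta \varTheta\) is initialized to zero and subsequently changed in a direction that minimizes the negative log-likelihood as shown below:

\[\varDelta \varTheta \leftarrow \varDelta \varTheta + \eta \, \frac{\partial \ln \mathcal {L}(\varTheta \mid \mathcal {S})}{\partial \varTheta }\]

For an RBM, the gradient of the log-likelihood of a training sample with respect to the weights **W**, visible layer bias **b** and hidden layer bias **a** becomes:

\[\frac{\partial \ln p({\mathbf{u}} )}{\partial w_{ij}} = \mathbb {E}_\mathrm{data}[u_i h_j] - \mathbb {E}_\mathrm{model}[u_i h_j], \qquad \frac{\partial \ln p({\mathbf{u}} )}{\partial b_i} = \mathbb {E}_\mathrm{data}[u_i] - \mathbb {E}_\mathrm{model}[u_i], \qquad \frac{\partial \ln p({\mathbf{u}} )}{\partial a_j} = \mathbb {E}_\mathrm{data}[h_j] - \mathbb {E}_\mathrm{model}[h_j]\]

Here \(\mathbb {E}_\mathrm{data}\) is the data-dependent expectation, taken under \(p({\mathbf{h}} \mid {\mathbf{u}} )\) for the observed **u**, and \(\mathbb {E}_\mathrm{model}\) is the expectation under the joint distribution \(p({\mathbf{u}} ,{\mathbf{h}} )\) defined by the model. The latter requires summing over all the \(2^K\) elements of **u** as well as the \(2^T\) elements of **h**. Therefore, exact computation of the model expectation is intractable because its complexity is exponential in the number of visible and hidden layer nodes. To avoid this computational burden, the model expectation can be approximated by generating a finite number of samples from the joint distribution \(p({\mathbf{u}} ,{\mathbf{h}} )\) using the Markov Chain Monte Carlo (MCMC) [24] technique.

Gibbs sampling is a simple MCMC algorithm for producing samples from the joint distribution of multiple random variables. Given a joint distribution of *c* random variables, Gibbs sampling performs a sequence of *r* sampling steps of the form \(\mathcal {y}_i \sim P(\mathcal {y}_i \mid \mathcal {y}_{-i} )\), where \(\mathcal {y}_{-i}\) represents the ensemble of the \((c-1)\) random variables other than \(\mathcal {y}_{i}\). Since an RBM consists of conditionally independent visible and hidden units, Gibbs sampling can be easily applied to get samples from the joint distribution \(p({\mathbf{u}} ,{\mathbf{h}} )\). The variables in the hidden layer units are sampled simultaneously given fixed values for the variables in the visible layer. Similarly, visible layer variables are sampled simultaneously given the hidden layer variables. Thus, step (*t*) of the Gibbs sampling process for the RBM defined in Eq. (2) has the following two phases:

\[h_j^{(t)} \sim p(h_j \mid {\mathbf{u}} ^{(t-1)}), \qquad u_i^{(t)} \sim p(u_i \mid {\mathbf{h}} ^{(t)})\]

where \({\mathbf{h}} ^{(t)}\) and \({\mathbf{u}} ^{(t-1)}\) denote the states of the hidden and visible layers at steps (*t*) and \((t-1)\) of the Gibbs sampling procedure. Similarly, \(h_j^{(t)}\), \(u_i^{(t)}\) are the *j*-th hidden layer unit and the *i*-th visible layer unit of the model at step (*t*) of the Gibbs sampling procedure. As \(t \rightarrow \infty\), Gibbs sampling is guaranteed to generate accurate samples of \(p({\mathbf{u}} ,{\mathbf{h}} )\).

Running the Gibbs chain until convergence is, however, expensive. The \(\mathcal {k}\)-step Contrastive Divergence (CD\(_\mathcal {k}\)) algorithm [20] instead initializes the chain at a training sample and runs it for only \(\mathcal {k}\) steps. Each step *t* of CD\(_\mathcal {k}\) consists of sampling \({\mathbf{h}} ^{(t)}\) from \(p({\mathbf{h}} \mid {\mathbf{u}} ^{(t-1)})\) and then sampling \({\mathbf{u}} ^{(t)}\) from \(p({\mathbf{u}} \mid {\mathbf{h}} ^{(t)})\). Finally, the gradient in Eq. (19) can be written as:

\[\frac{\partial \ln p({\mathbf{u}} ^{(0)})}{\partial w_{ij}} \approx p(h_j = 1 \mid {\mathbf{u}} ^{(0)})\, u_i^{(0)} - p(h_j = 1 \mid {\mathbf{u}} ^{(\mathcal {k})})\, u_i^{(\mathcal {k})}\]

For a sample **u**\(_s\) of the training set (i.e., \({\mathbf{u}} ^{(0)}={\mathbf{u}} _s\)), the following rules are used by the \(\mathcal {k}\)-step Contrastive Divergence algorithm to update the weights and biases of the model.

\[w_{ij} \leftarrow w_{ij} + \eta \big( p(h_j = 1 \mid {\mathbf{u}} ^{(0)})\, u_i^{(0)} - p(h_j = 1 \mid {\mathbf{u}} ^{(\mathcal {k})})\, u_i^{(\mathcal {k})} \big)\]

\[b_i \leftarrow b_i + \eta \big( u_i^{(0)} - u_i^{(\mathcal {k})} \big)\]

\[a_j \leftarrow a_j + \eta \big( p(h_j = 1 \mid {\mathbf{u}} ^{(0)}) - p(h_j = 1 \mid {\mathbf{u}} ^{(\mathcal {k})}) \big)\]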

Once the unknown parameters are estimated, RBM generates a *T*-dimensional latent topic-based representation \(p({\mathbf{h}} \mid {\mathbf{u}} _\mathrm{new})\) for an unseen input \({\mathbf{u}} _\mathrm{new}\) supplied to the model. The newly generated feature vector provides a quantitative description of the latent topic structure associated with the unseen input \({\mathbf{u}} _\mathrm{new}\). Moreover, the dimensionality of the obtained representation is considerably lower than that of the actual input. All these characteristics make RBM an ideal tool for latent topic-based image modeling.
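The training and inference steps of this section can be sketched in a few lines. This is a minimal illustration of \(\mathcal {k}\)-step contrastive divergence for a binary RBM under stated assumptions (per-sample updates, no momentum or weight decay, a fixed learning rate `eta`); the function names, toy data and hyperparameters are hypothetical and do not reproduce Algorithm 1.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(u, W, a):
    return [sigmoid(a[j] + sum(W[i][j] * u[i] for i in range(len(u))))
            for j in range(len(a))]

def visible_probs(h, W, b):
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]

def sample(probs, rng):
    """Draw one binary state per unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd_k_update(u0, W, b, a, eta=0.1, k=1, rng=random):
    """Apply one in-place CD-k update for a single binary training vector u0."""
    ph0 = hidden_probs(u0, W, a)              # statistics at the data
    uk = list(u0)
    for _ in range(k):                        # k alternating Gibbs steps
        hk = sample(hidden_probs(uk, W, a), rng)
        uk = sample(visible_probs(hk, W, b), rng)
    phk = hidden_probs(uk, W, a)              # statistics after k steps
    for i in range(len(b)):
        for j in range(len(a)):
            W[i][j] += eta * (u0[i] * ph0[j] - uk[i] * phk[j])
        b[i] += eta * (u0[i] - uk[i])
    for j in range(len(a)):
        a[j] += eta * (ph0[j] - phk[j])

# Hypothetical toy problem: K = 4 visible units, T = 2 hidden units.
rng = random.Random(0)
K, T = 4, 2
W = [[rng.gauss(0.0, 0.01) for _ in range(T)] for _ in range(K)]
b = [0.0] * K
a = [0.0] * T
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
for epoch in range(50):
    for u in data:
        cd_k_update(u, W, b, a, eta=0.1, k=1, rng=rng)

# T-dimensional representation p(h | u_new) for an input vector.
topics = hidden_probs([1, 1, 0, 0], W, a)
```

The last line corresponds to the inference step: once the parameters are fixed, the representation of an input is obtained with a single deterministic pass, without any sampling.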

### 3.2 Replicated Softmax Model

The standard RBM is restricted to binary (*Bernoulli*) input units. Therefore, Salakhutdinov and Hinton [6] proposed the Replicated Softmax Model (RSM) as a variant of RBM to model visual word-count data. The nodes in the visible layer are modeled as Softmax units and can have one of many different states. A graphical representation of the RSM framework is depicted in Fig. 3a. Let *K* be the size of the visual dictionary learned from a set of training images and *N* be the number of interest points detected in the given image, then the input data to the RSM model is an \(N \times K\) binary matrix **U** with \(U_{ni}\) = 1 if and only if the *n*-th interest point in the given image is assigned to the *i*-th visual word. The energy of the joint configuration \(\{{\mathbf{U}} ,{\mathbf{h}} \}\) is given by:

\[E({\mathbf{U}} ,{\mathbf{h}} ) = -\sum_{n=1}^{N}\sum_{i=1}^{K}\sum_{j=1}^{T} W_{ijn}\, h_j\, U_{ni} - \sum_{n=1}^{N}\sum_{i=1}^{K} U_{ni}\, b_{ni} - \sum_{j=1}^{T} h_j\, a_j\]

where **W** = \([W_{ijn}]\) denotes the connection strength between the *i*-th visible layer unit corresponding to the *n*-th interest point in the given image and the *j*-th hidden layer unit. **b** = \([b_{ni}]\) is the bias associated with the *i*-th visible unit of the *n*-th interest point in the given image and **a** is the bias of the hidden layer **h**.

If the *i*-th visible unit of the *n*-th local image descriptor is forced to share its weight with the *i*-th visible unit of all other local descriptors, then \(W_{ijn}\) can be simply redefined as \(W_{ij}\). This procedure is illustrated in Fig. 3b. With this modification, the input binary matrix **U** of the RSM framework can be replaced with *K* visible layer nodes \(\mathbb {U} = [ \mathbb {u}_1, \mathbb {u}_2, \ldots , \mathbb {u}_K ]\), each of which corresponds to a distinct visual word in the learned dictionary. The nodes in the visible layer \(\mathbb {U}\) are shown using concentric circles to indicate replication, i.e. the number of times each visual word occurs in the given image. The weight sharing operation brings down the total number of parameters to be learned from \((N\times T \times K)\) to \((T \times K)\), and it helps RSM to model images with a varying number of visual words. The energy of the configuration \(\{\mathbb {U},{\mathbf{h}} \}\) after weight sharing is then defined as:

\[E(\mathbb {U},{\mathbf{h}} ) = -\sum_{i=1}^{K}\sum_{j=1}^{T} W_{ij}\, h_j\, \mathbb {u}_i - \sum_{i=1}^{K} \mathbb {u}_i\, b_i - N \sum_{j=1}^{T} h_j\, a_j\]

where \(\mathbb {u}_i = \sum_{n=1}^{N} U_{ni}\) is the number of times the *i*-th visual word appears in the given image. It should be noted that the bias term for the hidden unit is scaled by the number of interest points *N*. This scaling is crucial as it allows hidden units to behave sensibly when dealing with images containing different numbers of interest points. Then, the probability that the model assigns to the visible layer \(\mathbb {U}\) is given by:

\[p(\mathbb {U}) = \frac{1}{Z}\sum_{{\mathbf{h}} } \exp \big(-E(\mathbb {U},{\mathbf{h}} )\big), \qquad Z = \sum_{\mathbb {U}}\sum_{{\mathbf{h}} } \exp \big(-E(\mathbb {U},{\mathbf{h}} )\big)\]

### 3.3 Deep Boltzmann Machine

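A Deep Boltzmann Machine (DBM) extends the RBM with multiple hidden layers stacked on top of one another. The architecture of a DBM with *L* hidden layers is shown in Fig. 4. There are connections only between adjacent hidden layer units as well as between units in the visible layer and the first hidden layer. Because of the deep hierarchical structure, DBM has greater flexibility and good representation power while modeling complex data distributions. That is, DBM can generate more structured and abstract representations of input observations. Consider a Deep Boltzmann Machine with one input layer \({\mathbf{u}} =\{u_1,u_2,\ldots ,u_K\} \in \{0,1\}^K\) and a series of *L* hidden layers \({\mathbf{h}} = \{ {\mathbf{h}} ^{(1)} \in \{0,1\}^{T_1}, {\mathbf{h}} ^{(2)} \in \{0,1\}^{T_2}, \ldots , {\mathbf{h}} ^{(L)} \in \{0,1\}^{T_L} \}\). Then, the energy of the joint configuration \(\{ {\mathbf{u}} ,{\mathbf{h}} \}\) is defined as:

\[E({\mathbf{u}} ,{\mathbf{h}} ;\varTheta ) = -\sum_{i=1}^{K} b_i u_i - \sum_{i=1}^{K}\sum_{j=1}^{T_1} w^{(1)}_{ij} u_i h^{(1)}_j - \sum_{{\ell }=1}^{L-1}\sum_{j=1}^{T_{\ell }}\sum_{k=1}^{T_{{\ell }+1}} w^{({\ell }+1)}_{jk} h^{({\ell })}_j h^{({\ell }+1)}_k - \sum_{{\ell }=1}^{L}\sum_{j=1}^{T_{\ell }} a^{({\ell })}_j h^{({\ell })}_j\]

where \(b_i\) is the bias associated with the *i*-th visible layer node \(u_i\). \({\mathbf{W}} ^{(1)}=[w^{(1)}_{ij}]\) is the weight between \(u_i\) and the *j*-th node in the first hidden layer, and \({\mathbf{W}} ^{({\ell }+1)}=[w^{({\ell }+1)}_{jk}]\), where \(1 \le {\ell } \le L-1\), is the weight between the *j*-th node in the hidden layer **h**\(^{({\ell })}\) and the *k*-th node in the hidden layer \({\mathbf{h}} ^{({\ell }+1)}\). \(a_j^{({\ell })}\) are the bias terms associated with the *j*-th node in the hidden layer **h**\(^{({\ell })}\). All these model parameters are represented by the vector \(\varTheta\).

Each intermediate hidden layer receives input from both of its adjacent layers (with \({\mathbf{h}} ^{(0)} = {\mathbf{u}}\)):

\[p(h^{({\ell })}_j = 1 \mid {\mathbf{h}} ^{({\ell }-1)}, {\mathbf{h}} ^{({\ell }+1)}) = \sigma \Big(a^{({\ell })}_j + \sum_{i} w^{({\ell })}_{ij} h^{({\ell }-1)}_i + \sum_{k} w^{({\ell }+1)}_{jk} h^{({\ell }+1)}_k\Big), \qquad 1 \le {\ell } < L\]

The conditional distribution over the units of the topmost hidden layer **h**\(^{(L)}\) is defined as:

\[p(h^{(L)}_k = 1 \mid {\mathbf{h}} ^{(L-1)}) = \sigma \Big(a^{(L)}_k + \sum_{j} w^{(L)}_{jk} h^{(L-1)}_j\Big)\]

Similarly, the conditional distribution over the visible layer **u** given the first hidden layer **h**\(^{(1)}\) is given by:

\[p(u_i = 1 \mid {\mathbf{h}} ^{(1)}) = \sigma \Big(b_i + \sum_{j} w^{(1)}_{ij} h^{(1)}_j\Big)\]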

#### 3.3.1 The Layer-Wise Training Strategy for DBM

Parameter learning in DBM is performed using an unsupervised layer-wise training procedure. In this approach, the layers of DBM are grouped pairwise to form a sequence of RBMs. Then, the RBMs in the stack are trained independently in a bottom-up fashion such that successive RBMs use the samples drawn from the joint distribution of the visible and hidden layers of the previous RBM in the hierarchy as their input data. The entire learning procedure for a DBM with *L* hidden layers is summarized in Algorithm 2.

In the layer-by-layer training procedure, the first RBM in the hierarchy is trained to model the given input observations. That is, the visible layer **u** of the first RBM accepts the input observations, and the RBM models them using the \(\mathcal {k}\)-step contrastive divergence algorithm. After training the first RBM, a sufficiently large number of samples are generated from the joint distribution p(**u** \(\mid\) **h**) as the input data for the next RBM in the hierarchy (step 3 of Algorithm 2).

While training the remaining portion of the DBM, only two layers **h**\(^{({\ell }-1)}\) and **h**\(^{({\ell })}\) of the network are considered at a time with the assumption that **h**\(^{({\ell }-1)}\) is known and fixed. Then, the joint distribution p(**h**\(^{({\ell }-1)}\), **h**\(^{({\ell })}\)) of these two layers is approximated as if they constitute an isolated Restricted Boltzmann Machine, and its parameters are learned by maximizing the likelihood p(**h**\(^{({\ell }-1)}\)). The \(\mathcal {k}\)-step contrastive divergence learning procedure mentioned in Algorithm 1 is used for this purpose.

Since all the edges are undirected, each hidden layer node except those in the last hidden layer of the DBM accepts signals from both the upper and the lower layer nodes, as indicated in Eq. (32). Hence, the training algorithm must account for the top-down and the bottom-up interaction terms while learning the parameters of the DBM. With this objective, Salakhutdinov and Hinton [25] modified the structure of the RBMs in the entire stack before the actual training begins. For instance, the following changes are made to the structure of the RBMs while training a DBM with three hidden layers, as shown in Fig. 5b. Initially, the first layer RBM is altered to have two copies of the visible layer nodes along with tied weights. The newly added visible layer nodes compensate for the lack of top-down interaction terms from the second layer. Similarly, the structure of the third layer RBM is modified in such a way that it involves two copies of the hidden layer units **h**\(^{(3)}\) and the respective weight matrix **W**\(^{(3)}\) to compensate for the lack of bottom-up interactions from RBM-2. For the intermediate layer, the RBM is restructured such that only the connection strengths **W**\(^{(2)}\) are doubled. Salakhutdinov and Hinton [25] were able to show that the layer-wise training of DBM with this type of structural modification is guaranteed to yield optimal values for the model parameters.
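One way to read the structural modifications above is as a pair of scale factors applied to the weights of each RBM in the stack during pretraining: duplicating the visible layer with tied weights doubles the bottom-up input, duplicating the hidden layer doubles the top-down input, and the intermediate RBMs use doubled weights in both directions. The sketch below encodes only this bookkeeping, under that interpretation; `pretraining_scales` is a hypothetical helper name.

```python
def pretraining_scales(layer, num_layers):
    """Return (bottom_up_scale, top_down_scale) for the RBM at a 1-based position.

    First RBM: two tied copies of the visible layer, so bottom-up input is doubled.
    Last RBM: two tied copies of the hidden layer, so top-down input is doubled.
    Intermediate RBMs: weights doubled in both directions.
    """
    if num_layers < 2:
        raise ValueError("the scheme is described for stacks of at least two RBMs")
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must lie in 1..num_layers")
    if layer == 1:
        return (2.0, 1.0)
    if layer == num_layers:
        return (1.0, 2.0)
    return (2.0, 2.0)

# Scale factors for a DBM with three hidden layers (RBM-1, RBM-2, RBM-3).
scales = [pretraining_scales(l, 3) for l in (1, 2, 3)]
```

During pretraining, the bottom-up scale would multiply the weights when computing hidden activations and the top-down scale would multiply them when reconstructing the lower layer.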

## 4 The Proposed Image Retrieval Framework

The proposed HDLA model for latent topic-based image retrieval involves two main processing steps. The first step is fitting the HDLA model to the entire set of training images. In this step, the parameters of the HDLA model are learned from the training images in three stages: (i) visual dictionary learning, (ii) generating the Bag of Visual Words (BoVW) representation of the training images, and (iii) layer-by-layer training of the HDLA model in an unsupervised fashion. The second processing step is testing the learned HDLA model and thereby inferring the latent topic-based representation of previously unseen images for the task of CBIR.

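The visual dictionary is learned by grouping the local descriptors of the training images into a predefined number (*K*) of clusters using the *K*-means algorithm. Each resulting cluster center is termed a visual word, and the set of all visual words thus obtained is termed the visual dictionary.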

The BoVW representation of the images in the training collection is generated by decomposing each image into local patches, which are then represented by means of scattering transform coefficients. Each local image descriptor thus obtained is mapped to the nearest visual word in the previously constructed visual dictionary. Finally, the number of occurrences of each visual word over the entire image is counted to form a *K*-dimensional feature vector, popularly known as the BoVW representation.
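The quantization and counting steps described above can be sketched as follows. This is only a minimal illustration: the function names, the use of squared Euclidean distance, and the toy two-dimensional descriptors are assumptions made for the example, and the scattering transform that produces the real descriptors is not reproduced.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bovw_histogram(descriptors: Sequence[Sequence[float]],
                   dictionary: Sequence[Sequence[float]]) -> List[int]:
    """Count how many local descriptors are assigned to each visual word."""
    counts = [0] * len(dictionary)
    for descriptor in descriptors:
        # Index of the closest visual word (cluster center).
        nearest = min(range(len(dictionary)),
                      key=lambda k: squared_distance(descriptor, dictionary[k]))
        counts[nearest] += 1
    return counts


# Hypothetical toy data: K = 3 visual words and four 2-D local descriptors.
toy_dictionary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_descriptors = [(0.1, -0.2), (0.9, 1.2), (4.8, 5.1), (1.1, 0.8)]
toy_histogram = bovw_histogram(toy_descriptors, toy_dictionary)
print(toy_histogram)
```

The resulting list has one entry per visual word and its entries sum to the number of local descriptors extracted from the image.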

The HDLA model has a layered hierarchical structure in which the processing elements are called nodes. One layer of visible nodes and multiple layers of hidden nodes are stacked on top of one another to constitute the HDLA model. The nodes of any two adjacent layers are bidirectionally connected through weights, which serve as the model parameters. Each layer of the HDLA model generates activation probabilities conditioned on its inputs, and these depend mainly on the model weights.

As the visible layer accepts the visual word counts in the form of the BoVW representation of training images, the lowest level in the HDLA model is an RSM with additional constraints on its weights and activation probabilities. The upper hidden layers of the HDLA model are paired together to form a hierarchy of Restricted Boltzmann Machines. The hidden layer nodes in HDLA capture the higher-order correlations among visual words and thereby group semantically identical visual words together to form latent topics. The output of the topmost hidden layer is the latent topic distribution of the given image and is employed for the task of image retrieval.

We use a greedy layer-wise training strategy to learn the parameters of the proposed HDLA model, which leads to iterative update rules for the parameters of individual layers. The basic idea of the layer-wise training strategy is to train the HDLA model one layer at a time, starting from the first layer. The principle of maximum likelihood is employed to learn the parameters of each individual layer in the HDLA model. Thus, for a given collection of training images, the parameters of individual layers are learned so as to assign the highest possible probability to the given training data.

Given a previously unseen image (\(I_\mathrm{test}\)) in the testing phase of the proposed HDLA model, its BoVW representation (\({\mathbf{v}} _\mathrm{test}\)) is obtained based on the initially created visual dictionary, and this BoVW representation is then presented as input to the visible layer of the HDLA model. The latent topic distribution of the test image is then computed as the activation probability \(p({\mathbf{h}} ^{L} \mid {\mathbf{v}} _\mathrm{test})\) of the topmost hidden layer in the HDLA model conditioned on the BoVW representation of the given test image. A ranked list of database images is then prepared on the basis of these latent topic features. Figure 6 shows graphically the process for both training and testing the proposed HDLA for the task of image retrieval. The rest of this section provides the implementation details of the proposed HDLA model.

### 4.1 The Hybrid Deep Learning Architecture

As mentioned earlier, the latent topic representation obtained with a Deep Boltzmann Machine-based architecture possesses good generalization ability. A Deep Boltzmann Machine has multiple layers of processing modules stacked on top of one another, and each unsupervised module in this hierarchy is provided with the representation vectors from the lower-level module. Thus, the latent topic vectors in the upper layers capture the high-level dependencies among input variables and thereby improve the ability of the system to learn complex distributions present in the input data.

However, the fully distributed representation yielded by DBM often fails to capture the constituent parts or factors of the input observations. In other words, the high-level abstraction generated by DBM often lacks the inherent meaning of adding parts to form a whole. In fact, a “part-based” representation [27] ensures non-subtractive combinations of components to form the given input. Therefore, restricting the network weights of DBM to nonnegative values yields a “part-based” representation of the input data and possibly enhances the expressive power of the basic DBM model. Another possibility for improving the performance of DBM is the incorporation of sparsity into the learned representation. In sparse feature coding [28], the final representation is forced to have only a few non-zero components, while the remaining entries are zero. Hence, sparsity is an effective constraint for performance enhancement when there is no prior indication of the number of hidden layers in DBM, or of the number of hidden units in successive layers, needed to create an optimal deep network that efficiently discovers interesting structures embedded in the input data.

The following subsections provide a detailed description of the Constrained Replicated Softmax Model (CRSM) and the Constrained Restricted Boltzmann Machine (CRBM) which add up to form the proposed HDLA model to infer latent topic-based image representation applicable for the retrieval operation.

#### 4.1.1 Constrained Replicated Softmax Model

**h**\(^{(1)}\) = \((h^{(1)}_1,h^{(1)}_2,\ldots ,h^{(1)}_{T_1}) \in \{0,1\}^{T_1}\) indicates the set of hidden nodes of CRSM. The input to the visible units of CRSM is the visual word-count vector. To learn an optimum fitting distribution for any given set of *m* data samples \(\{ \mathbb {U}_{1}, \mathbb {U}_{2}, \ldots , \mathbb {U}_{m} \}\), CRSM attempts to solve the following minimization problem.

**h**\(^{(1)}\) and visible layer \(\mathbb {U}\), respectively, and \(\mathbb {W}=[\mathbb {w}_{ij}]\) denotes the weight between the *i*-th visible layer node and the *j*-th hidden layer unit. *ln*\([p(\mathbb {U}_{s};\theta _1)]\) is the log-likelihood of the training sample \(\mathbb {U}_{s}\) and is computed by taking the logarithm of the probability value defined in Eq. (25). \(f(\mathbb {w}_{ij})\) is the quadratic barrier function, which enforces the nonnegativity restriction on the model weights, and \(f \Big (p({\mathbf{h}} ^{(1)} \mid \mathbb {U}_{s}) \Big )\) is the \(\ell _1\)-regularization term, which enforces sparsity on the latent topic representation learned by CRSM. \(\beta _1\) and \(\gamma _1\) are the weight penalty term and the sparsity hyper-parameter of CRSM. They, respectively, control the level of nonnegativity of the connection weight matrix \(\mathbb {W}\) and the sparsity of the hidden layer activation \(p({\mathbf{h}} ^{(1)} \mid \mathbb {U}_{s})\).
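To make the two constraint terms concrete, the sketch below evaluates a penalty of the kind described above. It rests on assumptions that should be checked against the original formulation: the quadratic barrier is taken to be \(f(w)=w^2\) for \(w<0\) and 0 otherwise (a common choice for nonnegativity-constrained RBMs), the \(\ell _1\) term is taken to be the sum of absolute activation probabilities, and the two terms are simply weighted by \(\beta _1\) and \(\gamma _1\) and added. The function names are illustrative.

```python
from typing import Sequence


def quadratic_barrier(w: float) -> float:
    """Assumed barrier: penalizes only negative weights, quadratically."""
    return w * w if w < 0.0 else 0.0


def quadratic_barrier_gradient(w: float) -> float:
    """Derivative of the assumed barrier with respect to the weight."""
    return 2.0 * w if w < 0.0 else 0.0


def l1_sparsity(activations: Sequence[float]) -> float:
    """l1 norm of the hidden activation probabilities."""
    return sum(abs(p) for p in activations)


def constraint_penalty(weights: Sequence[Sequence[float]],
                       activations: Sequence[float],
                       beta: float, gamma: float) -> float:
    """Weighted sum of the nonnegativity and sparsity penalties (assumed form)."""
    barrier = sum(quadratic_barrier(w) for row in weights for w in row)
    return beta * barrier + gamma * l1_sparsity(activations)


# Hypothetical values: two negative weights and two hidden activations.
toy_penalty = constraint_penalty([[1.0, -1.0], [0.5, -2.0]], [0.2, 0.3],
                                 beta=0.5, gamma=2.0)
print(toy_penalty)
```

Under these assumptions, positive weights contribute nothing to the penalty, so the barrier only pushes negative weights back toward zero, while larger \(\gamma _1\) favors hidden activations close to zero.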

#### 4.1.2 Constrained Restricted Boltzmann Machine

*L* CRBM modules in the proposed HDLA model. This section explains the formulation of the \({\ell }\)-th CRBM (i.e., CRBM-\({\ell }\)), where \(1 \le {\ell } \le L\); the basic theory remains the same for all other CRBMs in the hierarchy. More formally, CRBM-\({\ell }\) involves two sets of binary stochastic hidden layers **h**\(^{({\ell })}=(h_1^{({\ell })},h_2^{({\ell })},\ldots ,h_{T_{{\ell }}}^{({\ell })})\) and **h**\(^{({\ell }+1)}=(h_1^{({\ell }+1)},h_2^{({\ell }+1)},\ldots ,h_{T_{{\ell }+1}}^{({\ell }+1)})\). Then, CRBM-\({\ell }\) can model any distribution on \(\{0,1\}^{T_{{\ell }}}\) by learning appropriate model parameter values that solve the following minimization problem for a given set of *m* training samples \(\{\)**h**\(_{1}^{({\ell })}\), **h**\(_{2}^{({\ell })}\), \(\ldots\), **h**\(_{m}^{({\ell })} \}\):

*i*-th unit in the hidden layer **h**\(^{({\ell })}\) and *j*-th unit in the hidden layer **h**\(^{({\ell }+1)}\), **a**\(^{({\ell })}\) is the bias associated with hidden layer units in **h**\(^{({\ell }+1)}\). *ln*\([p({\mathbf{h}} ^{({\ell })}_s;\varTheta _{\ell })]\) is the log-likelihood of the given sample \({\mathbf{h}} ^{({\ell })}_s\) and is expressed as the logarithm of the probability value defined in Eq. (25). \(f(w_{ij}^{({\ell })})\) is the quadratic barrier function that ensures the nonnegativity restriction on the network weights of CRBM-\({\ell }\). \(f \Big (p({\mathbf{h}} ^{({\ell }+1)} \mid {\mathbf{h}} _{s}^{({\ell })}) \Big )\) is the \(\ell _1\)-regularization term for the sparse activation of the output hidden layer units of CRBM-\({\ell }\). \(\beta _{\ell }\) and \(\gamma _{\ell }\) are the weight penalty term and the sparsity hyper-parameter of CRBM-\({\ell }\). These parameters are defined in the same way as for CRSM.

**h**\(_s^{({\ell })}\) is given by:

**h**\(_s^{({\ell })}\) from the training set (i.e., **h**\(^0 ={\mathbf{h}} _s^{({\ell })}\)), the parameter update rules of CRBM-\({\ell }\) become:

#### 4.1.3 HDLA Model Training

The layer-wise learning procedure already mentioned in Algorithm 2 is extended to learn the parameters of the proposed HDLA model. By using the layer-wise strategy, the learning process of the proposed HDLA model is broken down into a number of related sub-tasks such that all of them can be completed in a stage-by-stage fashion. The basic idea here is to gradually present input observations to the HDLA model so that at the early stages of training the coarse-scale properties of input observations are captured while the fine-scale characteristics are learned in later stages. After training each layer, its output is considered as the input for training the next layer. That is, the output of each layer serves as a prior for learning the parameters of the next higher layer. The entire procedure for training the proposed HDLA model is summarized in Algorithm 3.

Initially, the parameters of the CRSM module, which takes the BoVW representation of each training image as input, are optimized using the one-step contrastive divergence algorithm with the update rules specified in Eqs. (40)–(42). Then, the obtained parameters of CRSM are frozen and its hidden layer configuration for the given input observations is inferred. These inferred values then act as the input data for CRBM-1 in the next higher level of the hybrid deep learning architecture. Again, the one-step contrastive divergence algorithm with the value \({\ell }=1\) and the modified update rules specified in Eqs. (45) and (46) is used to derive the parameters of CRBM-1. This procedure is repeated until the parameters of CRBM-*L* in the hierarchy are learned. To account for the top-down and bottom-up interaction terms, the structure of the HDLA model is altered during training according to the strategy already illustrated in Sect. 3.3.1. Finally, these parameters are composed together to form the required HDLA model.
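The greedy procedure described above (train one layer, freeze it, feed its hidden activations to the next layer) can be sketched as follows. The sketch is deliberately simplified and is not the authors' Algorithm 3: every layer is treated as a binary RBM trained with generic one-step contrastive divergence, so the multinomial visible layer of CRSM, the \(\ell _1\) sparsity term and the weight-doubling of Sect. 3.3.1 are all omitted. Only the nonnegativity barrier is included, using the assumed form \(f(w)=w^2\) for \(w<0\). All names and hyper-parameter values are illustrative.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector, Vector]  # (weights, hidden bias, visible bias)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def up(v: Vector, W: Matrix, a: Vector) -> Vector:
    """Hidden activation probabilities given the layer below."""
    return [sigmoid(a[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(a))]


def down(h: Vector, W: Matrix, b: Vector) -> Vector:
    """Reconstruction probabilities of the layer below given hidden states."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def train_layer(data: List[Vector], n_hidden: int, epochs: int, lr: float,
                beta: float, rng: random.Random) -> Layer:
    """One-step contrastive divergence with a nonnegativity barrier on W."""
    n_visible = len(data[0])
    W = [[rng.uniform(0.0, 0.1) for _ in range(n_hidden)] for _ in range(n_visible)]
    a = [0.0] * n_hidden
    b = [0.0] * n_visible
    for _ in range(epochs):
        for v0 in data:
            p_h0 = up(v0, W, a)                      # positive phase
            h0 = [1.0 if rng.random() < p else 0.0 for p in p_h0]
            v1 = down(h0, W, b)                      # one Gibbs step back
            p_h1 = up(v1, W, a)                      # negative phase
            for i in range(n_visible):
                for j in range(n_hidden):
                    cd = v0[i] * p_h0[j] - v1[i] * p_h1[j]
                    barrier_grad = 2.0 * W[i][j] if W[i][j] < 0.0 else 0.0
                    W[i][j] += lr * (cd - beta * barrier_grad)
            a = [a[j] + lr * (p_h0[j] - p_h1[j]) for j in range(n_hidden)]
            b = [b[i] + lr * (v0[i] - v1[i]) for i in range(n_visible)]
    return W, a, b


def train_stack(data: List[Vector], layer_sizes: List[int], epochs: int = 5,
                lr: float = 0.2, beta: float = 1.0, seed: int = 0) -> List[Layer]:
    """Greedy layer-wise training: each layer is fit to the output of the last."""
    rng = random.Random(seed)
    layers: List[Layer] = []
    inputs = data
    for n_hidden in layer_sizes:
        layer = train_layer(inputs, n_hidden, epochs, lr, beta, rng)
        layers.append(layer)
        W, a, _ = layer
        inputs = [up(v, W, a) for v in inputs]       # frozen layer feeds the next
    return layers


def encode(v: Vector, layers: List[Layer]) -> Vector:
    """Propagate an input bottom-up to the topmost hidden layer."""
    for W, a, _ in layers:
        v = up(v, W, a)
    return v


# Hypothetical toy data: four binary inputs of length 6, three stacked layers.
toy_data = [[1.0, 1.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
            [1.0, 0.0, 0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]]
toy_layers = train_stack(toy_data, layer_sizes=[4, 3, 2])
toy_code = encode(toy_data[0], toy_layers)
print(toy_code)
```

The design point the sketch is meant to show is the data flow: parameters of a layer never change once the next layer starts training, and `encode` reuses the same bottom-up pass to obtain the topmost activation for a new input.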

## 5 HDLA-Based Image Representation

This section describes how to learn a latent topic-based representation suitable for image retrieval from the trained HDLA model. Furthermore, the distance metric used to estimate the semantic similarity between images is also discussed.

### 5.1 Encoding of Previously Unseen Images

Once the model parameters of HDLA are learned from an appropriate set of training samples, the given query and the database images can be mapped into the latent topic space for the purpose of image retrieval. The conceived HDLA model with *L* hidden layers generates a latent topic-based representation \(p({\mathbf{h}} ^L \mid {\mathbf{v}} _\mathrm{test})\) for every input image whose BoVW representation is \({\mathbf{v}} _\mathrm{test}\). The activation \(p({\mathbf{h}} ^L \mid {\mathbf{v}} _\mathrm{test})\) of the topmost hidden layer of HDLA denotes the latent topic structure of the given image and is taken as the feature vector for the desired retrieval operation.

### 5.2 Image Similarity Measure

*K*-dimensional latent topic-based representation obtained with the proposed HDLA model is denoted by \(\mathbb {f}_q\) and \(\mathbb {f}_d\). Then, the Jensen–Shannon divergence similarity measure \(JS (\mathbb {f}_q, \mathbb {f}_d)\) for estimating the similarity between the two latent topic-based distributions \(\mathbb {f}_q\) and \(\mathbb {f}_d\) is formally defined as follows:

*i*-th bin of the feature vectors \(\mathbb {f}_q\) and \(\mathbb {f}_d\).
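For illustration, the Jensen–Shannon divergence between two feature vectors can be computed as in the sketch below. It uses the standard textbook definition (the average of the two Kullback–Leibler divergences to the midpoint distribution, with natural logarithms), which may differ from the paper's expression by a constant factor. Normalizing the vectors to sum to one before comparison is an assumption added here, since hidden layer activation probabilities need not sum to one.

```python
import math
from typing import List, Sequence


def normalize(x: Sequence[float]) -> List[float]:
    """Scale a nonnegative vector so that its entries sum to one."""
    total = sum(x)
    if total <= 0.0:
        raise ValueError("feature vector must have a positive sum")
    return [xi / total for xi in x]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence; bins with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(f_q: Sequence[float], f_d: Sequence[float]) -> float:
    """Jensen-Shannon divergence between a query and a database feature vector."""
    p, q = normalize(f_q), normalize(f_d)
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical 3-bin latent topic vectors for a query and a database image.
print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.3, 0.6]))
```

A value of zero means the two distributions coincide, so database images are ranked in increasing order of divergence from the query; with natural logarithms the value is bounded above by \(\ln 2\).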

## 6 Performance Evaluation and Discussion

The experimental validation of the formulated model is demonstrated in this section. Firstly, a short description of the datasets used for evaluation is provided. Then, the quantitative evaluation of the proposed HDLA model in terms of its generalization ability is carried out. Finally, the retrieval efficiency of the latent topic-based image representation obtained with the proposed HDLA model is compared with state-of-the-art approaches.

### 6.1 Datasets Used

In the past, a number of benchmark datasets having ground truth images for a set of predefined queries have been introduced for evaluating different CBIR frameworks. Among them, six image collections with contrasting characteristics are selected to use in our retrieval experiments, and this section provides a detailed description of all these image collections.

*INRIA Holiday dataset* [30] It comprises 1491 high-resolution images of various places around the world. Images in this collection have a resolution of either 570\(\times\)760 or 1020\(\times\)760, and they mainly depict natural scenes. Among them, 500 images are reserved as queries, and there is a predefined retrieval list for each query.

*Scene-15 dataset* [31] This collection contains 4485 images grouped into 15 concept categories. Each category contains between 210 and 410 images, all of which have a fixed resolution of 250\(\times\)300 pixels. Most of the images in the Scene-15 collection have a distinguishing background and foreground context. Therefore, this image collection serves as a good choice for evaluating context-aware semantic image modeling schemes for the task of CBIR.

*Oxford dataset* [32] This benchmark dataset comprises 5062 images of buildings at 11 different landmarks in the city of Oxford, and it is difficult to distinguish similar building facades from one another. All images in the collection have a fixed resolution of 1020\(\times\)760. The ground truth includes five images from each of the 11 landmarks and their corresponding search results. That is, 55 queries are available to evaluate the effectiveness of any retrieval system.

*GHIM-10K dataset* [33] There are 10,000 images in the GHIM-10K dataset, spread over 20 concept categories. Each category contains 500 color images in JPEG format with a resolution of 300\(\times\)400 or 400\(\times\)300. Images in the search result that belong to the same semantic category as the given query are judged as relevant. That is, a randomly selected image from any of these 20 concept classes can act as the query, and there are exactly 499 relevant images in the collection.

*IAPR TC-12 dataset* [34] Another widely used image collection selected for retrieval evaluation is the IAPR TC-12 dataset. It involves 20,000 images collected from various locations around the globe comprising different types of natural scene images. All images in this collection are in JPEG format with a fixed size of 360\(\times\)480 pixels. An interesting property of this image collection is that there are many images having identical visual content; however, they differ in background, lighting conditions and viewing position.

*MIRFLICKR-40K dataset* [35] The final image collection selected for evaluation is the MIRFLICKR-40K dataset, which is a subset of the MIRFLICKR-1M collection. This dataset comprises 40,000 images, all of which have a fixed resolution of 720\(\times\)480. The notable characteristic of this image collection is its semantic diversity: the images belong to multiple categories and vary in appearance. Thus, owing to its moderate size and heterogeneous content, the MIRFLICKR-40K dataset permits an in-depth analysis of any image retrieval framework.

### 6.2 Quantitative Evaluation of the HDLA Model

An ideal topic modeling scheme should adequately model the given data samples and, at the same time, have the potential to yield semantically coherent latent topics. Therefore, it is necessary to analyze these two aspects of the proposed model while judging its competence. To do so, two sets of experiments are carried out using the proposed model. The first is a generalization test on unseen data samples, and the second is the evaluation of reconstruction error on a standard handwritten image collection. In all these experiments, the performance of the proposed model is compared with the following baseline approaches: Over Replicated Softmax Model (ORSM) [21], Replicated Softmax Model (RSM) [6] and Rate Adapting Poisson model (RAP) [5].

#### 6.2.1 Experimental Setup

The hardware platform for simulating the proposed HDLA model is an Intel Core i7-4570 machine equipped with a 3.4 GHz CPU and 16 GB of RAM. The HDLA model is coded in the MATLAB R2016b (9.1) environment under a Unix operating system. For all the experiments presented in this paper, the proposed HDLA model is trained for 200 epochs with a learning rate \(\eta\) = 0.2. The visible and hidden layer biases are initialized with small random values, and the model weights are randomly chosen from positive values in the range [0,1]. It is found that \(\mathcal {k}\)=1 is sufficient for the contrastive divergence algorithm to generate good latent topic-based features.

#### 6.2.2 Generalization Performance on Unseen Samples

Since topic models are trained in a completely unsupervised fashion, it is difficult to evaluate the competence of one model over another. In practice, the performance of topic models is evaluated using their generalization ability on unseen data samples. More specifically, estimating the likelihood of a held-out data set provides a clear, interpretable metric for evaluating the performance of a topic model relative to other existing models.

*n* samples drawn from \(P({\mathbf{v}}_\mathrm{test}, {\mathbf{h}} )\) by means of Gibbs sampling. Then, the average test perplexity value is computed as:

|*D*| is the number of images in the collection \(\mathcal{J}_\mathrm{test}\), and \(N_i\) and \({\mathbf{v}}^ {(i)}_\mathrm{test}\), respectively, denote the number of interest points and the visual word-count vector for the *i*-th image in the collection \(\mathcal{J}_\mathrm{test}\). From this definition, one can see that a lower perplexity score indicates better generalization performance.
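As an illustration of how the quantities named above combine, the sketch below computes an average test perplexity from per-image log-probabilities. It assumes the form commonly used for Replicated Softmax-type models, namely the exponential of the negative mean of the per-interest-point log-probability \(\frac{1}{N_i}\log p({\mathbf{v}}^{(i)}_\mathrm{test})\), with natural logarithms. The estimation of \(\log p({\mathbf{v}}^{(i)}_\mathrm{test})\) itself by Gibbs sampling is not reproduced, and the function name is illustrative.

```python
import math
from typing import Sequence


def average_perplexity(log_probs: Sequence[float],
                       n_points: Sequence[int]) -> float:
    """exp(-(1/|D|) * sum_i log p(v_i) / N_i) over the |D| test images.

    log_probs[i] is the (estimated) natural-log probability of image i and
    n_points[i] is its number of interest points N_i.
    """
    if len(log_probs) != len(n_points) or not log_probs:
        raise ValueError("need one log-probability and one N_i per image")
    per_point = sum(lp / n for lp, n in zip(log_probs, n_points))
    return math.exp(-per_point / len(log_probs))


# Hypothetical check: if every interest point has probability 1/2,
# the perplexity is 2 regardless of the image sizes.
toy_log_probs = [10 * math.log(0.5), 20 * math.log(0.5)]
toy_perplexity = average_perplexity(toy_log_probs, [10, 20])
print(toy_perplexity)
```

Dividing each log-probability by \(N_i\) keeps images with many interest points from dominating the average, which is why the score is comparable across images of different sizes.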

Table 2 Quantitative evaluation of the proposed HDLA model based on total log-probability (\(\sum \log p({\mathbf{v}}_\mathrm{test})\)) scores calculated over the test images of individual data sets

Dictionary size (*K*) | \(T_\mathrm{L=3}\) | Holiday HDLA | Holiday ORSM | Holiday RSM | Holiday RAP | Scene-15 HDLA | Scene-15 ORSM | Scene-15 RSM | Scene-15 RAP | Oxford HDLA | Oxford ORSM | Oxford RSM | Oxford RAP
---|---|---|---|---|---|---|---|---|---|---|---|---|---
250 | 50 | \(-\) | \(-\)84.20 | \(-\)89.25 | \(-\)94.02 | \(-\) | \(-\)78.25 | \(-\)82.78 | \(-\)86.73 | \(-\) | \(-\)100.02 | \(-\)110.52 | \(-\)116.35
250 | 75 | \(-\) | \(-\)73.26 | \(-\)78.84 | \(-\)93.85 | \(-\) | \(-\)73.38 | \(-\)77.83 | \(-\)82.16 | \(-\) | \(-\)87.49 | \(-\)96.38 | \(-\)102.58
250 | 100 | \(-\) | \(-\)60.69 | \(-\)68.32 | \(-\)74.82 | \(-\) | \(-\)68.49 | \(-\)73.48 | \(-\)78.83 | \(-\) | \(-\)74.96 | \(-\)83.23 | \(-\)89.38
250 | 125 | \(-\) | \(-\)55.42 | \(-\)61.09 | \(-\)66.38 | \(-\) | \(-\)65.31 | \(-\)69.80 | \(-\)74.56 | \(-\) | \(-\)71.67 | \(-\)81.49 | \(-\)87.63
500 | 100 | \(-\) | \(-\)74.86 | \(-\)79.46 | \(-\)84.68 | \(-\) | \(-\)69.29 | \(-\)75.98 | \(-\)81.46 | \(-\) | \(-\)93.16 | \(-\)101.28 | \(-\)108.58
500 | 125 | \(-\) | \(-\)65.99 | \(-\)70.43 | \(-\)75.18 | \(-\) | \(-\)65.44 | \(-\)71.84 | \(-\)76.61 | \(-\) | \(-\)79.26 | \(-\)87.82 | \(-\)95.68
500 | 150 | \(-\) | \(-\)57.81 | \(-\)62.33 | \(-\)68.49 | \(-\) | \(-\)61.02 | \(-\)67.18 | \(-\)72.29 | \(-\) | \(-\)68.49 | \(-\)76.19 | \(-\)84.42
500 | 175 | \(-\) | \(-\)53.27 | \(-\)58.09 | \(-\)73.14 | \(-\) | \(-\)57.48 | \(-\)63.49 | \(-\)69.18 | \(-\) | \(-\)65.20 | \(-\)72.39 | \(-\)78.56
750 | 150 | \(-\) | \(-\)67.18 | \(-\)73.51 | \(-\)78.32 | \(-\) | \(-\)64.32 | \(-\)69.44 | \(-\)74.14 | \(-\) | \(-\)81.46 | \(-\)88.13 | \(-\)94.88
750 | 175 | \(-\) | \(-\)59.47 | \(-\)64.36 | \(-\)69.48 | \(-\) | \(-\)60.44 | \(-\)65.90 | \(-\)71.41 | \(-\) | \(-\)73.43 | \(-\)80.38 | \(-\)86.36
750 | 200 | \(-\) | \(-\)54.32 | \(-\)59.07 | \(-\)64.54 | \(-\) | \(-\)56.82 | \(-\)62.76 | \(-\)68.79 | \(-\) | \(-\)65.38 | \(-\)72.19 | \(-\)78.37
750 | 225 | \(-\) | \(-\)50.71 | \(-\)55.86 | \(-\)60.48 | \(-\) | \(-\)52.28 | \(-\)59.34 | \(-\)65.43 | \(-\) | \(-\)62.29 | \(-\)69.93 | \(-\)74.86

Dictionary size (*K*) | \(T_\mathrm{L=3}\) | GHIM-10K HDLA | GHIM-10K ORSM | GHIM-10K RSM | GHIM-10K RAP | IAPR TC-12 HDLA | IAPR TC-12 ORSM | IAPR TC-12 RSM | IAPR TC-12 RAP | MIRFLICKR-40K HDLA | MIRFLICKR-40K ORSM | MIRFLICKR-40K RSM | MIRFLICKR-40K RAP
---|---|---|---|---|---|---|---|---|---|---|---|---|---
250 | 50 | \(-\) | \(-\)92.55 | \(-\)96.41 | \(-\)101.23 | \(-\) | \(-\)87.51 | \(-\)93.38 | \(-\)98.14 | \(-\) | \(-\)72.93 | \(-\)77.02 | \(-\)82.54
250 | 75 | \(-\) | \(-\)81.82 | \(-\)85.29 | \(-\)90.28 | \(-\) | \(-\)75.80 | \(-\)81.16 | \(-\)86.28 | \(-\) | \(-\)60.15 | \(-\)65.47 | \(-\)71.41
250 | 100 | \(-\) | \(-\)73.62 | \(-\)79.39 | \(-\)84.08 | \(-\) | \(-\)62.67 | \(-\)68.33 | \(-\)73.45 | \(-\) | \(-\)56.23 | \(-\)60.31 | \(-\)65.38
250 | 125 | \(-\) | \(-\)70.27 | \(-\)76.38 | \(-\)81.22 | \(-\) | \(-\)57.23 | \(-\)73.49 | \(-\)79.07 | \(-\) | \(-\)54.13 | \(-\)58.22 | \(-\)64.14
500 | 100 | \(-\) | \(-\)82.48 | \(-\)87.69 | \(-\)92.49 | \(-\) | \(-\)73.43 | \(-\)79.28 | \(-\)85.66 | \(-\) | \(-\)65.24 | \(-\)69.06 | \(-\)73.38
500 | 125 | \(-\) | \(-\)76.31 | \(-\)80.88 | \(-\)85.59 | \(-\) | \(-\)70.56 | \(-\)76.22 | \(-\)82.72 | \(-\) | \(-\)55.18 | \(-\)59.73 | \(-\)64.18
500 | 150 | \(-\) | \(-\)70.11 | \(-\)74.29 | \(-\)79.71 | \(-\) | \(-\)59.85 | \(-\)65.20 | \(-\)70.15 | \(-\) | \(-\)51.49 | \(-\)55.12 | \(-\)60.66
500 | 175 | \(-\) | \(-\)67.48 | \(-\)71.29 | \(-\)76.38 | \(-\) | \(-\)55.77 | \(-\)62.54 | \(-\)68.16 | \(-\) | \(-\)48.64 | \(-\)52.23 | \(-\)56.11
750 | 150 | \(-\) | \(-\)75.36 | \(-\)80.34 | \(-\)85.43 | \(-\) | \(-\)60.42 | \(-\)67.53 | \(-\)72.34 | \(-\) | \(-\)56.14 | \(-\)61.22 | \(-\)66.18
750 | 175 | \(-\) | \(-\)71.82 | \(-\)76.29 | \(-\)81.81 | \(-\) | \(-\)62.17 | \(-\)68.34 | \(-\)74.04 | \(-\) | \(-\)51.87 | \(-\)56.23 | \(-\)60.63
750 | 200 | \(-\) | \(-\)68.53 | \(-\)74.62 | \(-\)79.33 | \(-\) | \(-\)56.26 | \(-\)63.52 | \(-\)68.85 | \(-\) | \(-\)46.88 | \(-\)50.03 | \(-\)55.18
750 | 225 | \(-\) | \(-\)63.34 | \(-\)69.93 | \(-\)74.19 | \(-\) | \(-\)52.65 | \(-\)59.47 | \(-\)65.13 | \(-\) | \(-\)44.33 | \(-\)48.12 | \(-\)53.74

We conducted log-likelihood and perplexity analyses on all six data sets considered for evaluation. An HDLA model with three hidden layers (i.e., *L* = 3) is used in this experiment. The visible layer of the proposed model accepts the BoVW-based representation of input images and then maps the input to the latent topic space. The log-likelihood and perplexity values are calculated by running the Gibbs sampler three times, each with 1000 iterations, and then averaging the three scores. Tenfold cross-validation is performed on all six datasets. That is, the images in each dataset are grouped into ten folds of approximately equal size. Special care has been taken to ensure that there is no overlap between the images belonging to different folds. Then, in each run of the experiment, nine folds are used for model training, and the remaining fold is used for testing the model. For different sizes (*K*) of the visual dictionary, the total log-likelihood values obtained for each of the compared models by varying the number of latent topics (\(T_\mathrm{L=3}\)) are summarized in Table 2. From these results, it can be concluded that the proposed model outperforms the other existing models in terms of its generalization performance.
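The fold construction described above can be sketched as follows. This is an illustrative sketch only: the shuffling seed, the round-robin assignment of images to folds and the function name are assumptions, not details taken from the experiments.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_items: int, n_folds: int = 10,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs with non-overlapping folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment gives folds whose sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# Example with the size of the Holiday collection (1491 images).
holiday_splits = list(k_fold_splits(1491))
print([len(test) for _, test in holiday_splits])
```

Each image index appears in exactly one test fold, so every image is used for testing once and for training nine times.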

*K* and \(T_\mathrm{L=3}\) values are selected so as to give better generalization performance. The obtained results reveal that the perplexity values of the proposed model decrease consistently over successive iterations and that it converges faster than the other models.

In conclusion, the effectiveness of a given topic modeling scheme depends entirely on its generalization ability, which in turn is directly related to the number of training iterations. There is always an upper limit beyond which an increase in the number of iterations has no effect on the model’s generalization power. It is evident from the above results that the generalization power of the existing models is not up to the mark even for a substantially large number of training iterations. However, the proposed HDLA model outperforms the widely used baseline models in terms of its generalization ability and convergence rate. That is, the HDLA model attains better generalization power within fewer training iterations. Therefore, the HDLA-based formulation is capable of yielding a semantic-based image representation having more discriminative power.

#### 6.2.3 Reconstruction Performance

Table 3 Evaluation of the reconstruction performance of the proposed HDLA model

Number of RBM units | No. of training samples | Model configuration | Reconstruction error (\(\%\)), ORSM | Reconstruction error (\(\%\)), HDLA
---|---|---|---|---
3 | 30,000 | (784-500-150) | 21.38 | 14.16
3 | 30,000 | (784-300-100) | 19.66 | 11.48
3 | 30,000 | (784-200-50) | 17.43 | 10.61
3 | 60,000 | (784-500-150) | 18.49 | 11.82
3 | 60,000 | (784-300-100) | 15.57 | 8.94
3 | 60,000 | (784-200-50) | 14.22 | 7.42
4 | 30,000 | (784-550-350-200) | 20.86 | 13.92
4 | 30,000 | (784-450-250-150) | 17.79 | 10.29
4 | 30,000 | (784-350-150-75) | 16.84 | 9.26
4 | 60,000 | (784-550-350-200) | 17.53 | 10.56
4 | 60,000 | (784-450-250-150) | 14.19 | 7.71
4 | 60,000 | (784-350-150-75) | 13.63 | 6.18

Initially, the pixel values (0-255) of all input images are mapped to 0 or 1. For this, a threshold value of 30 is selected: pixel values greater than or equal to 30 are set to 1, while values less than 30 are set to 0. A given image in its vectorized binary form is reconstructed by sampling the topmost hidden layer vector from the latent model under evaluation, followed by sampling the visible vector based on the generated hidden vector. The resulting visible vector is multiplied by 255 and is then binarized by the same procedure described above. To deal with binary inputs, the RSM unit in the first layer of the proposed HDLA model shown in Fig. 7 is replaced with an RBM unit.
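The thresholding step, and one possible way of scoring a reconstruction, can be sketched as follows. The binarization follows the rule stated above. The error measure, the percentage of pixels whose binary value differs between the original and the reconstruction, is an assumption made for illustration, since the exact definition of the reconstruction error is not spelled out here.

```python
from typing import List, Sequence


def binarize(pixels: Sequence[float], threshold: float = 30) -> List[int]:
    """Map values >= threshold to 1 and values < threshold to 0."""
    return [1 if p >= threshold else 0 for p in pixels]


def reconstruction_error_percent(original: Sequence[int],
                                 reconstructed: Sequence[int]) -> float:
    """Assumed measure: percentage of binary pixels that differ."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("images must be non-empty and of equal length")
    mismatches = sum(o != r for o, r in zip(original, reconstructed))
    return 100.0 * mismatches / len(original)


# Hypothetical 4-pixel example: a grey-level input and a sampled visible
# vector with entries in [0, 1], rescaled to 0-255 before thresholding.
original = binarize([0, 29, 30, 255])
reconstructed = binarize([255 * v for v in [0.0, 1.0, 1.0, 0.0]])
print(original, reconstructed, reconstruction_error_percent(original, reconstructed))
```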

In our experiments, different configurations of the proposed HDLA model are trained for the purpose of reconstructing MNIST handwritten digit images. The performance of the proposed HDLA model is then evaluated in comparison with the Over Replicated Softmax Model (ORSM). Instead of directly using the actual training and test sets, the entire data set is pooled and divided into ten equal-sized subsets. One of these subsets is then used for model evaluation, and the remaining nine subsets are used for model training. This process is repeated ten times, rotating through all the subsets, which yields tenfold cross-validation results. The obtained values are summarized in Table 3. From these results, it is evident that HDLA is a good generative model and that it yields a substantially lower reconstruction error than the ORSM-based formulation. Another factor to take into account is the impact of the number of training samples on the performance of HDLA and ORSM. Therefore, experiments are conducted by varying the number of training samples for each configuration of HDLA and ORSM. In all such cases, the proposed HDLA framework exhibits better reconstruction performance and appears less sensitive to the size of the training set than ORSM.

### 6.3 Evaluation of HDLA-Based Image Search

This section evaluates the retrieval effectiveness of the proposed HDLA model in comparison with other latent topic-based approaches. The following subsections delineate the performance measures employed to judge the retrieval results, the procedure used to select appropriate values for the model parameters of HDLA in connection with effective image retrieval and the search results of the retrieval experiments carried out in various datasets.

#### 6.3.1 Evaluation Metrics

*k* images from the given dataset in response to a submitted query. The rank of an image is determined by its relevance to the query at hand. To be able to compare various image retrieval models, a set of performance measures must first be identified. When the ground truth of the data set is available, the system’s performance is generally measured in terms of quantitative metrics such as precision and recall. The precision of a retrieval system measures the percentage of relevant images in the ranked retrieval list, and the recall denotes the percentage of relevant images retrieved by the system. These two metrics are defined as follows:

*k* (p@*k*) and R-precision are introduced. p@*k* is the value of precision calculated using the first *k* documents in the retrieval list. Similarly, *R*-Precision for a given query is defined to be the precision after retrieving *R* images from the image database and is expressed as:

*R* is the total number of relevant images in the database for the given query and Rel(*j*) is an indicator function which returns the value 1 when the image present at the *j*-th location of the retrieval list is relevant with respect to the given query.
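The rank-based measures described above can be illustrated with a short sketch. The ranked list is represented by the values Rel(*j*) in rank order, which is a representation chosen here for convenience. Precision, recall and p@*k* follow their usual definitions, and R-precision is taken to be the precision over the first *R* positions, as stated in the text. The function names are illustrative.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the first k retrieved images that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def recall_at_k(relevance: Sequence[int], k: int, n_relevant: int) -> float:
    """Fraction of all relevant images found among the first k results."""
    if n_relevant <= 0:
        raise ValueError("the query must have at least one relevant image")
    return sum(relevance[:k]) / n_relevant


def r_precision(relevance: Sequence[int], n_relevant: int) -> float:
    """Precision after retrieving R images, R = number of relevant images."""
    return precision_at_k(relevance, n_relevant)


# Hypothetical retrieval list: Rel(j) for ranks 1..5, with R = 4 relevant
# images in the database (one of them is not retrieved in the top 5).
toy_relevance = [1, 0, 1, 1, 0]
print(precision_at_k(toy_relevance, 3),
      recall_at_k(toy_relevance, 5, 4),
      r_precision(toy_relevance, 4))
```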

*r* between \(r_i\) and \(r_{i+1}\):

*m* query images, the Mean Average Precision is defined as:

*q*) is the average precision for a given query *q* and is defined as the ratio of the sum of precision values from rank positions where a relevant image is found in the retrieval result to the total number of relevant images in the database.
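The average precision described above, and its mean over a set of queries, can be sketched as follows. The representation of each query as a pair (ranked relevance list, number of relevant images in the database) and the function names are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def average_precision(relevance: Sequence[int], n_relevant: int) -> float:
    """Sum of precision values at the ranks of relevant images, divided by
    the total number of relevant images in the database."""
    if n_relevant <= 0:
        raise ValueError("the query must have at least one relevant image")
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / n_relevant


def mean_average_precision(queries: List[Tuple[Sequence[int], int]]) -> float:
    """Mean of the average precision over all queries."""
    if not queries:
        raise ValueError("at least one query is required")
    return sum(average_precision(rel, r) for rel, r in queries) / len(queries)


# Hypothetical results for two queries.
toy_queries = [([1, 0, 1], 2), ([0, 1], 1)]
print(mean_average_precision(toy_queries))
```

For the first toy query the relevant images sit at ranks 1 and 3, giving precisions 1 and 2/3 and an average precision of 5/6; the second query has its single relevant image at rank 2, giving 1/2.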

*q*) is the retrieval rate for a single query *q* and is calculated as:

*i* in the retrieval list varies within the range 0–3 according to user judgement, where 0 corresponds to irrelevant images and 3 corresponds to the most relevant image. Based on the correctness score, the usefulness or gain of each image with respect to its position *p* in the retrieval list is estimated and is then accumulated to compute the nDCG value as follows:

*p* and are, respectively, defined as follows:

*p*. The logarithmic factor in the denominator is a penalty term that discounts the correctness value of highly relevant images appearing at lower positions of the search result. Finally, the nDCG values of all the queries are averaged to obtain the overall performance of the retrieval system.
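A small sketch may help to show how the discount and the normalization interact. Several variants of DCG are in common use, and the one below is an assumption: the gain at rank *i* is the correctness score divided by \(\log _2(i+1)\), and the ideal ordering is obtained by sorting the scores of the retrieved list itself in decreasing order. The paper's exact expressions may differ.

```python
import math
from typing import Sequence


def dcg(scores: Sequence[float], p: int) -> float:
    """Discounted cumulative gain over the first p positions (assumed variant)."""
    return sum(score / math.log2(rank + 1)
               for rank, score in enumerate(scores[:p], start=1))


def ndcg(scores: Sequence[float], p: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(scores, reverse=True), p)
    return dcg(scores, p) / ideal if ideal > 0.0 else 0.0


# Hypothetical correctness scores (0-3) for the first six retrieved images.
toy_scores = [3, 2, 3, 0, 1, 2]
print(ndcg(toy_scores, 6))
```

A list that is already sorted by decreasing correctness scores attains an nDCG of 1, and moving a highly relevant image toward the bottom lowers the score because its gain is divided by a larger logarithm.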

#### 6.3.2 Parameter Selection

In the context of image retrieval, it is important to select appropriate values for the parameters of the HDLA model. More specifically, parameters such as the visual dictionary size (*K*), the number of hidden layers and the number of nodes in each hidden layer need to be tuned for good retrieval performance. For each image collection, this is done by calculating the average retrieval rates for each query set while varying the visual dictionary size and the number of nodes in each hidden layer of HDLA. Figure 9 depicts the average retrieval rates obtained for the different image collections while changing the number of hidden layer units along with the visual dictionary size. Reasonable values for the model parameters can be fixed by analyzing the results shown in Fig. 9. Once proper estimates of these parameters have been obtained, they can be frozen and used for subsequent retrieval experiments. To avoid computational bottlenecks, an HDLA model with three layers of hidden units is considered in our retrieval experiments. It is empirically found that an HDLA model with three layers of hidden units is sufficient to generate a latent topic-based image representation having more discriminative power and retrieval accuracy than the existing topic modeling schemes. The next subsection summarizes the comparative evaluation of various image retrieval experiments.

#### 6.3.3 Retrieval Results and Discussion

This section evaluates the retrieval effectiveness of the proposed scheme in comparison with state-of-the-art models. The following retrieval frameworks have been selected for comparison: Over Replicated Softmax Model (ORSM) [21], Replicated Softmax Model (RSM) [6], Rate Adapting Poisson model (RAP) [5], Pachinko Allocation Model (PAM) [15] and Latent Dirichlet Allocation (LDA) [2].

The retrieval effectiveness of the proposed HDLA model is initially evaluated on the basis of mAP, average R-Precision and nDCG\(_{p=10}\) values. The comparison of the proposed model with the existing methods is provided in Table 4. On average, the HDLA model achieves a 6% improvement in mAP, average R-Precision and nDCG\(_{p=10}\) over the best performing approach in the literature. These statistics indicate that the proposed HDLA model gives better retrieval results than state-of-the-art methods.
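For reference, the two precision-based measures named above can be sketched as follows, assuming their standard definitions: average precision is the mean of the precision values at the ranks where relevant images occur, divided by the number of relevant images, and R-precision is the precision at rank R, where R is the number of relevant images for the query. The function names and the boolean relevance-flag input are illustrative assumptions.

```python
from typing import Sequence


def average_precision(relevant_flags: Sequence[bool], n_relevant: int) -> float:
    """Average precision for one ranked list.

    relevant_flags[i] is True when the image at rank i + 1 is relevant;
    n_relevant is the number of relevant images for the query.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_relevant if n_relevant else 0.0


def r_precision(relevant_flags: Sequence[bool], n_relevant: int) -> float:
    """Precision over the top R results, with R = n_relevant."""
    if n_relevant == 0:
        return 0.0
    return sum(relevant_flags[:n_relevant]) / n_relevant


def mean_over_queries(values: Sequence[float]) -> float:
    """mAP or average R-precision: the mean of the per-query values."""
    return sum(values) / len(values) if values else 0.0
```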

Table 4 Comparative evaluation of the proposed HDLA model based on mean average precision (*mAP*), average R-precision and normalized discounted cumulative gain (*nDCG*) calculated at rank position *p* = 10

Metric | Dataset used | HDLA | ORSM | RSM | RAP | PAM | LDA |
---|---|---|---|---|---|---|---|
mAP | Holiday dataset | | 0.695 | 0.663 | 0.631 | 0.613 | 0.597 |
mAP | Scene-15 dataset | | 0.668 | 0.635 | 0.607 | 0.584 | 0.564 |
mAP | Oxford dataset | | 0.647 | 0.613 | 0.586 | 0.561 | 0.543 |
mAP | GHIM-10K dataset | | 0.640 | 0.611 | 0.579 | 0.565 | 0.544 |
mAP | IAPR TC-12 dataset | | 0.681 | 0.654 | 0.621 | 0.606 | 0.581 |
mAP | MIRFLICKR-40K dataset | | 0.702 | 0.678 | 0.644 | 0.626 | 0.602 |
Average R-Precision | Holiday dataset | | 0.702 | 0.672 | 0.644 | 0.627 | 0.602 |
Average R-Precision | Scene-15 dataset | | 0.693 | 0.661 | 0.647 | 0.625 | 0.600 |
Average R-Precision | Oxford dataset | | 0.668 | 0.664 | 0.632 | 0.614 | 0.596 |
Average R-Precision | GHIM-10K dataset | | 0.679 | 0.647 | 0.619 | 0.595 | 0.575 |
Average R-Precision | IAPR TC-12 dataset | | 0.694 | 0.665 | 0.660 | 0.639 | 0.619 |
Average R-Precision | MIRFLICKR-40K dataset | | 0.727 | 0.694 | 0.665 | 0.642 | 0.626 |
nDCG\(_{p=10}\) | Holiday dataset | | 0.776 | 0.748 | 0.716 | 0.694 | 0.675 |
nDCG\(_{p=10}\) | Scene-15 dataset | | 0.751 | 0.729 | 0.695 | 0.672 | 0.653 |
nDCG\(_{p=10}\) | Oxford dataset | | 0.717 | 0.687 | 0.653 | 0.634 | 0.610 |
nDCG\(_{p=10}\) | GHIM-10K dataset | | 0.734 | 0.706 | 0.675 | 0.653 | 0.632 |
nDCG\(_{p=10}\) | IAPR TC-12 dataset | | 0.763 | 0.737 | 0.702 | 0.688 | 0.663 |
nDCG\(_{p=10}\) | MIRFLICKR-40K dataset | | 0.794 | 0.760 | 0.733 | 0.700 | 0.683 |

To further validate the effectiveness of the proposed HDLA model, its performance is compared with the other existing models in terms of the average precision values at selected rank thresholds of 10, 20 and 30 (i.e., p@10, p@20 and p@30). The average precision values of the retrieval experiments carried out on all the benchmark datasets are presented in Table 5. When an end user is interested in viewing only the top 10, 20 or 30 results returned by the retrieval model, the proposed HDLA-based formulation achieves an average improvement of 6%.
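The cut-off measure used here can be sketched in a few lines, assuming the usual definition of precision at rank k as the fraction of the top k returned images that are relevant. The function name and the boolean relevance flags are illustrative assumptions. Positions beyond the end of a short result list are counted as non-relevant.

```python
from typing import Sequence


def precision_at_k(relevant_flags: Sequence[bool], k: int) -> float:
    """Fraction of the top-k results that are relevant (p@k) for one query."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Slicing past the end of the list is safe; missing ranks add no hits.
    return sum(relevant_flags[:k]) / k
```

Averaging `precision_at_k` over all queries of a dataset for k = 10, 20 and 30 gives values of the kind reported as p@10, p@20 and p@30.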

Table 5 Comparative evaluation of the proposed HDLA model based on precision values calculated at cut-off levels 10, 20 and 30

Dataset | Evaluation metric | HDLA | ORSM | RSM | RAP | PAM | LDA |
---|---|---|---|---|---|---|---|
Holiday dataset | p@10 | | 0.807 | 0.765 | 0.742 | 0.733 | 0.721 |
Holiday dataset | p@20 | | 0.785 | 0.733 | 0.716 | 0.702 | 0.696 |
Holiday dataset | p@30 | | 0.757 | 0.717 | 0.684 | 0.687 | 0.675 |
Scene-15 dataset | p@10 | | 0.774 | 0.732 | 0.710 | 0.703 | 0.698 |
Scene-15 dataset | p@20 | | 0.746 | 0.701 | 0.687 | 0.675 | 0.664 |
Scene-15 dataset | p@30 | | 0.712 | 0.677 | 0.653 | 0.642 | 0.633 |
Oxford dataset | p@10 | | 0.743 | 0.716 | 0.693 | 0.684 | 0.671 |
Oxford dataset | p@20 | | 0.712 | 0.687 | 0.661 | 0.653 | 0.646 |
Oxford dataset | p@30 | | 0.682 | 0.654 | 0.637 | 0.625 | 0.611 |
GHIM-10K dataset | p@10 | | 0.754 | 0.711 | 0.697 | 0.680 | 0.678 |
GHIM-10K dataset | p@20 | | 0.716 | 0.689 | 0.664 | 0.655 | 0.647 |
GHIM-10K dataset | p@30 | | 0.674 | 0.644 | 0.623 | 0.616 | 0.605 |
IAPR TC-12 dataset | p@10 | | 0.795 | 0.760 | 0.748 | 0.733 | 0.726 |
IAPR TC-12 dataset | p@20 | | 0.766 | 0.735 | 0.711 | 0.700 | 0.692 |
IAPR TC-12 dataset | p@30 | | 0.738 | 0.704 | 0.687 | 0.675 | 0.664 |
MIRFLICKR-40K dataset | p@10 | | 0.812 | 0.686 | 0.642 | 0.633 | 0.620 |
MIRFLICKR-40K dataset | p@20 | | 0.787 | 0.651 | 0.623 | 0.614 | 0.601 |
MIRFLICKR-40K dataset | p@30 | | 0.764 | 0.637 | 0.605 | 0.598 | 0.587 |

## 7 Conclusion

In this paper, a new class of topic modeling scheme called hybrid deep learning architecture is proposed for semantic image modeling and retrieval. The proposed architecture is a composite of a Replicated Softmax Model and Restricted Boltzmann Machines with a nonnegativity restriction on the network weights and an \(\ell _1\)-sparseness constraint on the hidden layer activations. As part of image modeling, the formulated architecture infers, in a completely unsupervised fashion, a hierarchical nonlinear mapping function that projects the original BoVW-based representation onto a latent topic-based semantic concept space. Thus, the hybrid deep learning architecture can capture the semantic correlation among visual words and consequently minimizes the semantic loss associated with BoVW-based image retrieval. Based on the experimental evaluations, it can be concluded that the image representation yielded by the proposed HDLA model significantly improves the retrieval performance as compared to state-of-the-art latent topic-based image retrieval systems.

## Notes

## Compliance with Ethical Standards

## Conflict of interest

The authors declare that they have no competing interests.

## References

- 1. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1):177
- 2. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3(January):993
- 3. Blei DM, Lafferty JD (2005) Correlated topic models. In: Proceedings of the 18th international conference on neural information processing systems, MIT Press, Cambridge, MA, pp 147–154
- 4. Boulemden A, Tlili Y (2012) Image indexing and retrieval with pachinko allocation model: application on local and global features. In: Proceedings of the 12th pacific rim conference on knowledge management and acquisition for intelligent systems. Springer, Berlin, pp 140–146
- 5. Gehler PV, Holub AD, Welling M (2006) The rate adapting poisson model for information retrieval and object recognition. In: Proceedings of the 23rd international conference on machine learning. ACM, New York, pp 337–344
- 6. Salakhutdinov R, Hinton G (2009) Replicated softmax: an undirected topic model. In: Proceedings of the 22nd international conference on neural information processing systems. Curran Associates Inc., USA, pp 1607–1614
- 7. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391
- 8. Pecenovic Z (1997) Intelligent image retrieval using latent semantic indexing. Master’s thesis, Swiss Federal Institute of Technology
- 9. Zhang R, Zhang Z (2007) Effective image retrieval based on hidden concept discovery in image database. IEEE Trans Image Process 16(2):562
- 10. Lienhart R, Romberg S, Hörster E (2009) Multilayer pLSA for multimodal image retrieval. In: Proceedings of the ACM international conference on image and video retrieval. ACM, New York
- 11. Li P, Cheng J, Li Z, Lu H (2011) Correlated PLSA for image clustering. In: Advances in multimedia modeling, pp 307–316
- 12. Chiang CC, Wu JW, Lee GC (2012) Probabilistic semantic component descriptor. Multimed Tools Appl 59(2):629
- 13. Hörster E, Lienhart R, Slaney M (2007) In: Proceedings of the 6th ACM international conference on image and video retrieval. ACM, New York, pp 17–24
- 14. Greif T, Hörster E, Lienhart R (2008) Correlated topic models for image retrieval. University of Augsburg, Germany, July, Tech. rep
- 15. Li W, McCallum A (2006) Pachinko allocation: DAG-structured mixture models of topic correlations. In: Proceedings of the 23rd international conference on machine learning, ACM, New York, pp 577–584
- 16. Andrieu C, De Freitas N, Doucet A, Jordan MI (2003) An introduction to MCMC for machine learning. Mach Learn 50(1–2):5
- 17. Minka T, Lafferty J (2002) Expectation-propagation for the generative aspect model. In: Proceedings of the eighteenth conference on uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc., pp 352–359
- 18. Casella G, George EI (1992) Explaining the Gibbs sampler. Am Stat 46(3):167
- 19. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504
- 20. Hinton G (2010) A practical guide to training restricted Boltzmann machines. Momentum 9(1):926
- 21. Srivastava N, Salakhutdinov R, Hinton G (2013) Modeling documents with a deep boltzmann machine. In: Proceedings of the twenty-ninth conference on uncertainty in artificial intelligence. AUAI Press, Arlington, Virginia, pp 616–624
- 22. Olshausen BA, Field DJ (2004) Sparse coding of sensory inputs. Curr Opin Neurobiol 14(4):481
- 23. Salakhutdinov R, Hinton G (2009) Deep boltzmann machines. In: Proceedings of the twelfth international conference on artificial intelligence and statistics, Clearwater Beach, Florida, pp 448–455
- 24. Brooks S, Gelman A, Jones GL, Meng XL (2011) Handbook of markov chain monte carlo. CRC Press, Boca Raton
- 25. Hinton GE, Salakhutdinov RR (2012) A better way to pretrain deep boltzmann machines. In: Proceedings of the 26th annual conference on neural information processing systems. Lake Tahoe, Nevada, pp 2447–2455
- 26. Bruna J, Mallat S (2013) Invariant scattering convolution networks. IEEE Trans Pattern Anal Machine Intelligence 35(8):1872
- 27. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788
- 28. Poggio T, Girosi F (1998) A sparse representation for function approximation. Neural Comput 10(6):1445
- 29. Nguyen TD, Tran T, Phung DQ, Venkatesh S (2013) Learning parts-based representations with nonnegative restricted boltzmann machine. In: Proceedings of the Asian conference on machine learning. ACT, Canberra, pp 133–148
- 30. Jegou H, Douze M, Schmid C (2008) Hamming embedding and weak geometric consistency for large scale image search. In: Proceedings of the 10th European conference on computer vision: Part I. Springer, Berlin, pp 304–317
- 31. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition. Vol 2, IEEE Computer Society, Washington, DC, pp 2169–2178
- 32. Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2007) Object retrieval with large vocabularies and fast spatial matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, New York, pp 1–8
- 33. Liu GH, Yang JY, Li Z (2015) Content-based image retrieval using computational visual attention model. Pattern Recogn 48(8):2554
- 34. Grubinger M, Clough P, Müller H, Deselaers T (2006) The iapr tc-12 benchmark: a new evaluation resource for visual information systems. In: Proceedings of international conference on language resources and evaluation. vol 5, ELRA, p 10
- 35. Huiskes MJ, Thomee B, Lew MS (2010) New Trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In: Proceedings of international conference on multimedia information retrieval. ACM, New York, pp 527–536
- 36. Deng L (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process Mag 29(6):141

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.