In this section, we elaborate on the proposed Latent Structure Preserving Hashing algorithm.
Preserving Data Structure with NMF
NMF is an unsupervised learning algorithm that can learn a parts-based representation. In theory, the low-dimensional data V produced by NMF is expected to retain the locality structure of the high-dimensional data X. However, in real-world applications, NMF cannot discover the intrinsic geometrical and discriminative structure of the data space. Therefore, to preserve as much of the significant structure of the high-dimensional data as possible, we propose to minimize the Kullback-Leibler divergence (Xie et al. 2011) between the joint probability distribution in the high-dimensional space and a heavy-tailed joint probability distribution in the low-dimensional space:
$$\begin{aligned} C= \lambda KL(P\Vert Q). \end{aligned}$$
(4)
In Eq. (4), P is the joint probability distribution in the high-dimensional space, with entries denoted \(p_{ij}\); Q is the joint probability distribution in the low-dimensional space, with entries \(q_{ij}\); and \(\lambda \) controls the smoothness of the new representation. The probability \(p_{ij}\) measures the similarity between data points \(\mathbf {x}_{i}\) and \(\mathbf {x}_{j}\), where \(\mathbf {x}_{j}\) is picked in proportion to its probability density under a Gaussian centered at \(\mathbf {x}_{i}\). Since we are only interested in modeling pairwise similarities, we set \(p_{ii}\) and \(q_{ii}\) to zero. Moreover, \(p_{ij} = p_{ji}\) and \(q_{ij} = q_{ji}\) for all \(i,j\). The pairwise similarities in the high-dimensional space \(p_{ij}\) are defined as:
$$\begin{aligned} p_{ij}= \frac{\exp \left( -\Vert \mathbf {x}_i -\mathbf {x}_j\Vert ^2/ 2\sigma _i^2\right) }{\sum _{k \ne l} \exp \left( -\Vert \mathbf {x}_k - \mathbf {x}_l\Vert ^2/2\sigma _k^2\right) }, \end{aligned}$$
(5)
where \(\sigma _{i}\) is the variance of the Gaussian distribution centered on data point \(\mathbf {x}_{i}\). Each data point \(\mathbf {x}_{i}\) makes a significant contribution to the cost function. In the low-dimensional map, using a heavy-tailed probability distribution, the joint probabilities \(q_{ij}\) can be defined as:
$$\begin{aligned} q_{ij}= \frac{\left( 1+\Vert \mathbf {v}_i-\mathbf {v}_j\Vert ^2\right) ^{-1}}{\sum _{k \ne l} \left( 1+ \Vert \mathbf {v}_k - \mathbf {v}_l \Vert ^2\right) ^{-1}}. \end{aligned}$$
(6)
This heavy-tailed distribution is an infinite mixture of Gaussians, under which evaluating the density of a point is much faster than under a single Gaussian, since it does not involve an exponential. This representation also makes the map invariant to changes of scale for embedded points that are far apart. Thus, the cost function based on the Kullback-Leibler divergence can effectively measure the significance of the data distribution. The degree to which \(q_{ij}\) models \(p_{ij}\) is given by
$$\begin{aligned} \begin{aligned} G= KL(P\Vert Q) =\sum _{i}\sum _{j}p_{ij}\log {p_{ij}}-p_{ij}\log {q_{ij}}. \end{aligned} \end{aligned}$$
(7)
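To make Eqs. (5)–(7) concrete, we give a small NumPy sketch that computes P, Q, and the divergence G. The function names, the explicit symmetrization of P, and the treatment of the per-point bandwidths \(\sigma _i\) (assumed given here; in practice they are typically tuned) are illustrative choices rather than part of the proposed method.

```python
# A minimal sketch of Eqs. (5)-(7): Gaussian joint probabilities P in the original
# space, heavy-tailed joint probabilities Q in the low-dimensional space, and the
# Kullback-Leibler divergence G between them. X: M x N data, sigma: N per-point
# bandwidths (assumed given), V: D x N low-dimensional representation.
import numpy as np

def joint_p(X, sigma):
    D2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)   # N x N squared distances
    W = np.exp(-D2 / (2.0 * sigma[:, None] ** 2))
    np.fill_diagonal(W, 0.0)                                     # p_ii = 0
    P = W / W.sum()                                              # Eq. (5)
    return (P + P.T) / 2.0                                       # symmetrize so p_ij = p_ji

def joint_q(V):
    D2 = np.sum((V[:, :, None] - V[:, None, :]) ** 2, axis=0)
    W = 1.0 / (1.0 + D2)
    np.fill_diagonal(W, 0.0)                                     # q_ii = 0
    return W / W.sum()                                           # Eq. (6)

def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))  # Eq. (7)
```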
For simplicity, we define two auxiliary variables \(d_{ij}\) and Z to make the derivation clearer:
$$\begin{aligned} d_{ij}=\Vert \mathbf {v}_i-\mathbf {v}_j\Vert ~\text {and}~ Z={\sum _{k \ne l} \left( 1+ d_{kl}^2\right) ^{-1}}. \end{aligned}$$
(8)
Therefore, the gradient of function G with respect to \(\mathbf {v}_i\) can be given by
$$\begin{aligned} \frac{\partial G}{\partial \mathbf {v}_i} = 2\sum _{j=1}^N \frac{\partial G}{\partial {d}_{ij}} \left( \mathbf {v}_i-\mathbf {v}_j\right) . \end{aligned}$$
(9)
Then \(\frac{\partial G}{\partial {d}_{ij}}\) can be calculated from the Kullback-Leibler divergence in Eq. (7):
$$\begin{aligned} \frac{\partial G}{\partial {d}_{ij}}=-\sum _{k \ne l}p_{kl}\left( \frac{1}{q_{kl}Z}\frac{\partial \left( \left( 1+ d_{kl}^2\right) ^{-1}\right) }{\partial {d}_{ij}}-\frac{1}{Z}\frac{\partial Z}{\partial {d}_{ij}} \right) .\nonumber \\ \end{aligned}$$
(10)
Since \(\frac{\partial ((1+ d_{kl}^2)^{-1})}{\partial {d}_{ij}}\) is nonzero if and only if \(k=i\) and \(l=j\), and \(\sum _{k \ne l}p_{kl}=1\), the gradient function can be expressed as
$$\begin{aligned} \frac{\partial G}{\partial {d}_{ij}}=2 \left( p_{ij}-q_{ij}\right) \left( 1+d_{ij}^2\right) ^{-1}. \end{aligned}$$
(11)
Substituting Eq. (11) into Eq. (9), the gradient of the Kullback-Leibler divergence between P and Q becomes
$$\begin{aligned} \frac{\partial G}{\partial \mathbf {v}_i} = 4 \sum _{j=1}^N (p_{ij}-q_{ij}) (\mathbf {v}_i - \mathbf {v}_j) \left( 1 + \Vert \mathbf {v}_i-\mathbf {v}_j\Vert ^2\right) ^{-1}.\nonumber \\ \end{aligned}$$
(12)
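As a sanity check on this derivation, the gradient of Eq. (12) can be vectorized in a few lines of NumPy; P and Q are the \(N \times N\) joint probability matrices from Eqs. (5)–(6), V holds the low-dimensional points column-wise, and the vectorized form below is our own.

```python
# A minimal sketch of the gradient in Eq. (12), assuming P and Q are N x N joint
# probability matrices and V is the D x N low-dimensional representation.
import numpy as np

def kl_gradient(P, Q, V):
    D2 = np.sum((V[:, :, None] - V[:, None, :]) ** 2, axis=0)  # ||v_i - v_j||^2
    W = (P - Q) / (1.0 + D2)                                   # (p_ij - q_ij)(1 + d_ij^2)^{-1}
    # Column i equals 4 * sum_j W_ij (v_i - v_j), i.e., Eq. (12)
    return 4.0 * (V * W.sum(axis=1) - V @ W.T)
```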
Therefore, by combining the structure-preserving term in Eq. (4) with the NMF technique, we obtain the following objective function:
$$\begin{aligned} O_f=\Vert X-UV\Vert ^2+\lambda KL(P\Vert Q), \end{aligned}$$
(13)
where \(V\in \{0,1\}^{D\times N}\), \(X,U,V\geqslant 0\), \(U\in {\mathbb {R}}^{M\times D}\), \(X\in {\mathbb {R}}^{M\times N}\), and \(\lambda \) controls the smoothness of the new representation.
In most circumstances, the low-dimensional data obtained from NMF alone is not effective and meaningful enough for realistic applications. Thus, we introduce the term \(\lambda KL(P\Vert Q)\) to preserve the structure of the original data, which leads to better results in information retrieval.
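For reference, a minimal NumPy sketch for evaluating the objective in Eq. (13), e.g., to monitor convergence during optimization, is given below; the function name, the eps guard, and the inline recomputation of Q are our own illustrative choices.

```python
# A minimal sketch of evaluating Eq. (13): reconstruction error plus the weighted KL term.
# X: M x N nonnegative data, U: M x D bases, V: D x N representation,
# P: fixed N x N joint probabilities from Eq. (5), lam: the weight lambda.
import numpy as np

def lsph_objective(X, U, V, P, lam, eps=1e-12):
    # Eq. (6): heavy-tailed joint probabilities Q from the current V
    D2 = np.sum((V[:, :, None] - V[:, None, :]) ** 2, axis=0)
    W = 1.0 / (1.0 + D2)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    # Eq. (4)/(7): KL(P || Q), summed over nonzero entries of P
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
    return np.linalg.norm(X - U @ V) ** 2 + lam * kl
```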
Relaxation and Optimization
Since the discreteness condition \(V\in \{0,1\}^{D\times N}\) in Eq. (13) cannot be handled directly in the optimization procedure, motivated by Weiss et al. (2009), we first relax \(V\) from \(\{0,1\}^{D\times N}\) to the real-valued domain \(V\in {\mathbb {R}}^{D\times N}\). The Lagrangian of our problem is then:
$$\begin{aligned} {\mathcal {L}}= & {} \Vert X-UV\Vert ^2+\lambda KL(P\Vert Q) + tr \left( {\varPhi }U^T\right) \nonumber \\&+\, tr ({\varPsi }V^T), \end{aligned}$$
(14)
where \({\varPhi }\) and \({\varPsi }\) are two Lagrangian multiplier matrices. Since the gradient of \(C = \lambda G\) is
$$\begin{aligned} \frac{\partial C}{\partial \mathbf {v}_i} = 4\lambda \sum _{j=1}^N \left( p_{ij}-q_{ij}\right) \left( \mathbf {v}_i - \mathbf {v}_j\right) \left( 1+\Vert \mathbf {v}_i-\mathbf {v}_j\Vert ^2\right) ^{-1},\nonumber \\ \end{aligned}$$
(15)
we set the gradients of \({\mathcal {L}}\) to zero to minimize \(O_f\):
$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial V}= & {} 2\left( -U^TX + U^TUV\right) + \frac{\partial C}{\partial \mathbf {v}_i} + {\varPsi }= \mathbf {0}, \end{aligned}$$
(16)
$$\begin{aligned} \frac{\partial {\mathcal {L}}}{\partial U}= & {} 2\left( -XV^T + UVV^T\right) + {\varPhi }= \mathbf {0}. \end{aligned}$$
(17)
In addition, we have the KKT conditions \({\varPhi }_{ij} U_{ij} = 0\) and \({\varPsi }_{ij} V_{ij} = 0\) for all \(i,j\). Then, multiplying both sides of Eqs. (16) and (17) element-wise by \(V_{ij}\) and \(U_{ij}\), respectively, we obtain
$$\begin{aligned} \left( 2 \left( -U^TX + U^TUV\right) +\frac{\partial C}{\partial \mathbf {v}_i}\right) _{ij} V_{ij}= & {} 0, \end{aligned}$$
(18)
$$\begin{aligned} 2\left( -XV^T + UVV^T\right) _{ij} U_{ij}= & {} 0. \end{aligned}$$
(19)
Note that
$$\begin{aligned} \begin{aligned} \left( \frac{\partial C}{\partial \mathbf {v}_j}\right) _i&= \left( 4\lambda \sum _{k=1}^N \frac{p_{jk} \mathbf {v}_j - q_{jk}\mathbf {v}_j - p_{jk} \mathbf {v}_k + q_{jk}\mathbf {v}_k}{1+\Vert \mathbf {v}_j-\mathbf {v}_k\Vert ^2}\right) _i \\&= 4\lambda \sum _{k=1}^N \frac{p_{jk} V_{ij} - q_{jk} V_{ij} - p_{jk} V_{ik} + q_{jk} V_{ik}}{1+\Vert \mathbf {v}_j-\mathbf {v}_k\Vert ^2}. \end{aligned} \end{aligned}$$
Therefore, we have the following update rules for any i, j:
$$\begin{aligned} V_{ij}\leftarrow & {} \frac{\left( U^TX\right) _{ij} + 2 \lambda \sum \limits _{k=1}^N \frac{p_{jk} V_{ik} + q_{jk} V_{ij}}{1+\Vert \mathbf {v}_j-\mathbf {v}_k\Vert ^2}}{\left( U^TUV\right) _{ij} + 2 \lambda \sum \limits _{k=1}^N \frac{p_{jk} V_{ij} + q_{jk} V_{ik}}{1+\Vert \mathbf {v}_j-\mathbf {v}_k\Vert ^2}} V_{ij}, \end{aligned}$$
(20)
$$\begin{aligned} U_{ij}\leftarrow & {} \frac{\left( X V^T\right) _{ij}}{\left( UVV^T\right) _{ij}} U_{ij}. \end{aligned}$$
(21)
From these update rules, all elements of U and V are guaranteed to remain nonnegative. Lee and Seung (2000) proved that the objective function is monotonically non-increasing after each update of U or V. The convergence proofs for U and V are similar to those in Zheng et al. (2011) and Cai et al. (2011).
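For concreteness, one round of the multiplicative updates in Eqs. (20)–(21) can be written in NumPy as follows; the vectorized formulation, the recomputation of Q inside the update, and the small constant eps added to the denominators to avoid division by zero are implementation choices of ours, not part of the derivation.

```python
# A minimal sketch of the multiplicative updates in Eqs. (20)-(21).
# X: M x N nonnegative data, U: M x D bases, V: D x N representation,
# P: fixed joint probabilities from Eq. (5), lam: the weight lambda.
import numpy as np

def lsph_updates(X, U, V, P, lam, eps=1e-10):
    # Student-t kernel (1 + ||v_j - v_k||^2)^{-1} and the joint probabilities Q of Eq. (6)
    D2 = np.sum((V[:, :, None] - V[:, None, :]) ** 2, axis=0)
    Wt = 1.0 / (1.0 + D2)
    Q = Wt.copy()
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    A = P * Wt           # p_jk / (1 + d_jk^2)
    B = Q * Wt           # q_jk / (1 + d_jk^2)

    # Eq. (20): multiplicative update for V
    num_V = U.T @ X + 2.0 * lam * (V @ A.T + V * B.sum(axis=1))
    den_V = U.T @ U @ V + 2.0 * lam * (V * A.sum(axis=1) + V @ B.T) + eps
    V = V * num_V / den_V

    # Eq. (21): multiplicative update for U
    U = U * (X @ V.T) / (U @ V @ V.T + eps)
    return U, V
```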
Once the above algorithm has converged, we can obtain the real-valued low-dimensional representation through a linear projection matrix. Since our algorithm is based on general NMF rather than Projective NMF (PNMF) (Yuan and Oja 2005; Guan et al. 2013), a direct projection does not exist for data embedding. Thus, in this paper, inspired by Cai et al. (2007), we use linear regression to compute our projection matrix. In particular, we make the projection orthogonal by solving the Orthogonal Procrustes problem (Schönemann 1966) as follows:
$$\begin{aligned} \min _{{\mathcal {P}}} \Vert {\mathcal {P}} X - V\Vert , ~\text {s.t.}~ {{\mathcal {P}} {\mathcal {P}}^T = I}, \end{aligned}$$
(22)
where \({\mathcal {P}}\) is the orthogonal projection. The optimal solution can be obtained by the following procedure: (1) compute the singular value decomposition \(X V^T = A {\varSigma }B^T\); (2) calculate \({\mathcal {P}} = B{\varOmega }A^T\), where \({\varOmega }=[I,\mathbf{0 }]\in {\mathbb {R}}^{D\times M}\) is a connection matrix and \({\mathbf{0 }}\) denotes an all-zero matrix. Given a datum \(\mathbf {x} \in {\mathbb {R}}^{M \times 1}\), its low-dimensional representation is \(\mathbf {v} = {\mathcal {P}} \mathbf {x}\). According to Zhang et al. (2015), using an orthogonal projection has three advantages: first, it preserves the Euclidean distance between two points; second, it distributes the variance more evenly across the dimensions; third, it learns maximally uncorrelated dimensions, which leads to more compact representations.
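The two-step procedure above can be sketched compactly in NumPy; using the thin (economy) SVD is our shortcut, since multiplying by the connection matrix \({\varOmega }\) simply selects the first D left singular vectors of \(X V^T\).

```python
# A minimal sketch of the Orthogonal Procrustes solution to Eq. (22).
# X: M x N training data, V: D x N low-dimensional representation (D << M).
import numpy as np

def procrustes_projection(X, V):
    A, _, Bt = np.linalg.svd(X @ V.T, full_matrices=False)  # thin SVD: X V^T = A Sigma B^T
    # B Omega A^T with Omega = [I, 0] keeps only the first D left singular vectors,
    # which is exactly what the thin SVD returns, so P = B A^T here.
    P = Bt.T @ A.T
    return P                                                 # D x M, with P P^T = I

# Usage: a sample x (M x 1) is embedded as v = procrustes_projection(X, V) @ x.
```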
Hash Function Generation
The low-dimensional representations \(V\in {\mathbb {R}}^{D\times N}\) and the bases \(U\in {\mathbb {R}}^{M\times D}\), where \(D \ll M\), can be obtained from Eqs. (20) and (21), respectively. We then need to convert the real-valued representations \(V=[\mathbf{v }_{1},\cdots ,\mathbf{v }_{N}]\) into binary codes via thresholding: if the d-th element of \(\mathbf{v }_{n}\) is larger than a specified threshold, it is represented as 1; otherwise it is 0, where \(d=1,\cdots ,D\) and \(n=1,\cdots ,N\).
In addition, a well-designed semantic hashing scheme should also be entropy maximizing to ensure its efficiency (Baluja and Covell 2008). From information theory, a source alphabet reaches maximal entropy when it has a uniform probability distribution. Specifically, if the entropy of the codes over the corpus is small, the documents are mapped to only a small number of codes (hash bins). In this paper, the threshold for the elements of \(\mathbf {v}_n\) is set to the median value of \(\mathbf {v}_n\), which satisfies entropy maximization: half of the bits are 1 and the other half are 0. In this way, the real-valued codes are converted into binary codes (Yu et al. 2014).
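The thresholding step can be written in a few lines of NumPy; binarizing each column of V against its own median follows the description above, while the function name and the uint8 output type are illustrative choices.

```python
# A minimal sketch of median thresholding: each column v_n of the D x N real-valued
# code matrix V is binarized against its own median, so about half of its bits are 1.
import numpy as np

def binarize_by_median(V):
    med = np.median(V, axis=0, keepdims=True)   # one threshold per sample (column)
    return (V > med).astype(np.uint8)           # D x N binary codes in {0, 1}
```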
However, the above procedure only yields the binary codes of the data in the training set; given a new sample, there is no hash function that can encode it directly. To solve this “out-of-sample” problem, inspired by the “self-taught” binary coding scheme (Zhang et al. 2010), we use logistic regression (Hosmer and Lemeshow 2004), a probabilistic classification model, to compute the hash codes for unseen test data. Specifically, we learn a square projection matrix via logistic regression, which can be regarded as a rotation of V. This kind of transformation makes the codes more balanced (Gong et al. 2013; Liu et al. 2012) and leads to better performance than directly binarizing V with the median value computed from the training data. To make this more convincing, we also report the performance difference in a later section. Before deriving the logistic regression cost function, we denote the binary codes as \({\hat{V}}=[\hat{\mathbf{v }}_{1},\cdots ,\hat{\mathbf{v }}_{N}]\), where \(\hat{\mathbf{v }}_{n}\in \{0,1\}^{D}\) and \(n=1,\cdots ,N\). Therefore, the training set can be regarded as \(\{(\mathbf{v }_{1}, \hat{\mathbf{v }}_{1}), (\mathbf{v }_{2}, \hat{\mathbf{v }}_{2}), \cdots , (\mathbf{v }_{N}, \hat{\mathbf{v }}_{N})\}\). The vector-valued regression function based on the corresponding regression matrix \({\varTheta }\in {\mathbb {R}}^{D\times D}\) can be represented as
$$\begin{aligned} h_{{\varTheta }}\left( \mathbf{v }_{n}\right) = \left( \frac{1}{1+e^{- \left( {\varTheta }^{T}\mathbf{v }_{n}\right) _i}}\right) ^T_{i = 1, \cdots , D}. \end{aligned}$$
(23)
Therefore, with the maximum log-likelihood criterion for the Bernoulli-distributed data, our cost function for the corresponding regression matrix can be defined as:
$$\begin{aligned} \begin{aligned} J ({\varTheta }) =&-\frac{1}{N} \Big ( \sum _{n=1}^N \Big (\hat{\mathbf{v }}_{n}^T \mathbf{log }(h_{{\varTheta }}(\mathbf{v }_{n}))\\&+ (\mathbf{1 }-\hat{\mathbf{v }}_{n})^T \mathbf{log }(\mathbf{1 }-h_{{\varTheta }}(\mathbf{v }_{n}))\Big ) + \delta \Vert {\varTheta }\Vert ^{2} \Big ), \end{aligned} \end{aligned}$$
(24)
where \(\mathbf{log }(\cdot )\) is the element-wise logarithm function and \(\mathbf{1 }\) is a \(D \times 1\) all-ones vector. We use \(\delta \Vert {\varTheta }\Vert ^{2}\) as the regularization term in the logistic regression to avoid overfitting.
To find the matrix \({\varTheta }\) that minimizes \(J ({\varTheta })\), we use gradient descent and repeatedly update the parameters with a learning rate \(\alpha \). The update rule is as follows:
$$\begin{aligned} {{\varTheta }}^{(t+1)} = {{\varTheta }}^{(t)} - \frac{\alpha }{N}\sum _{i=1}^N \left( h_{{\varTheta }^{(t)}} \left( \mathbf{v }_{i}\right) -\hat{\mathbf{v }}_{i}\right) \mathbf{v }_{i}^T - \frac{\alpha \delta }{N}{{\varTheta }}^{(t)}.\nonumber \\ \end{aligned}$$
(25)
The iteration stops when the squared norm of the difference between \({{\varTheta }}^{(t+1)}\) and \({{\varTheta }}^{(t)}\), i.e., \(\Vert {{\varTheta }}^{(t+1)}-{{\varTheta }}^{(t)}\Vert ^2\), falls below a small threshold, which yields the regression matrix \({\varTheta }\). For a new test sample \(X_{new} \in {\mathbb {R}}^{M \times 1}\), its low-dimensional representation is \(V_{new} = {\mathcal {P}} X_{new}\). Since each entry of \(h_{{\varTheta }}\) is a sigmoid function, the hash code of \(X_{new}\) can be represented as:
$$\begin{aligned} {\hat{V}}_{new}=\lfloor h_{\varTheta }({\mathcal {P}} X_{new})\rceil , \end{aligned}$$
(26)
where \(\lfloor \cdot \rceil \) denotes the nearest-integer function applied to each entry of \(h_{{\varTheta }}\). Specifically, since the output of the logistic regression, i.e., \(h_{\varTheta }({\mathcal {P}} X_{new})\), indicates the probability of “1” for each entry, \(\lfloor \cdot \rceil \) is equivalent to binarizing each bit at the threshold 0.5. Thus, if the probability of a bit from \(h_{\varTheta }({\mathcal {P}} X_{new})\) is larger than 0.5, it is set to 1, and to 0 otherwise. For example, through Eq. (26), the vector \(h_{\varTheta }({\mathcal {P}} X_{new}) =[0.17, 0.37, 0.42, 0.79, 0.03, 0.92, \cdots ]\) is expressed as \([0, 0, 0, 1, 0, 1, \cdots ]\). We can now obtain the Latent Structure Preserving Hashing codes for both the training and test data. The procedure of LSPH is summarized in Algorithm 1.
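To tie the out-of-sample stage together, the sketch below trains \({\varTheta }\) with the gradient step of Eq. (25) and encodes a new sample via Eq. (26). The zero initialization of \({\varTheta }\), the learning rate, the regularization weight, the iteration cap, and the tolerance are illustrative settings rather than values prescribed by the paper.

```python
# A minimal sketch of the out-of-sample stage: train Theta with Eq. (25), then hash a
# new sample with Eq. (26). V: D x N real-valued training codes, V_hat: D x N binary
# codes, Proj: the D x M projection from Eq. (22). Hyperparameters are illustrative.
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def train_theta(V, V_hat, alpha=0.1, delta=0.01, max_iter=1000, tol=1e-6):
    D, N = V.shape
    Theta = np.zeros((D, D))                                # illustrative initialization
    for _ in range(max_iter):
        H = sigmoid(Theta.T @ V)                            # Eq. (23), applied column-wise
        grad = (H - V_hat) @ V.T                            # sum_i (h(v_i) - v_hat_i) v_i^T
        Theta_new = Theta - (alpha / N) * grad - (alpha * delta / N) * Theta  # Eq. (25)
        if np.linalg.norm(Theta_new - Theta) ** 2 < tol:    # stopping rule from the text
            return Theta_new
        Theta = Theta_new
    return Theta

def hash_new_sample(x_new, Proj, Theta):
    v_new = Proj @ x_new                                    # low-dimensional representation
    return (sigmoid(Theta.T @ v_new) > 0.5).astype(np.uint8)  # Eq. (26): round each entry
```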