Abstract
Recently, matrix factorization-based hashing has gained wide attention because of its strong subspace-learning ability and high search efficiency. However, several problems remain to be addressed. First, although collective matrix factorization can generate uniform hash codes, it often incurs a serious loss that degrades hash-code quality. Second, most methods preserve only the absolute similarity in hash codes, failing to capture the inherent semantic affinity among training data. To overcome these obstacles, we propose Discrete Multi-similarity Consistent Matrix Factorization Hashing (DMCMFH). Specifically, an individual subspace is first learned for each modality by matrix factorization with multi-similarity consistency. The subspaces are then aligned through a shared semantic space to generate homogeneous hash codes. Finally, an iterative discrete optimization scheme is presented to reduce the quantization loss. We conduct quantitative experiments on three datasets: MS-COCO, MIRFlickr-25K and NUS-WIDE. Compared with supervised baseline methods, DMCMFH achieves increases of \(0.22\%\), \(3.00\%\) and \(0.79\%\) on the image-query-text task and increases of \(0.21\%\), \(1.62\%\) and \(0.50\%\) on the text-query-image task on the three datasets, respectively.
Introduction
With the arrival of 5G networks, a tremendous quantity of multimedia data, e.g., texts, images and videos, is generated on social networks. Faced with such volumes of data, searching effectively and efficiently becomes a problem that must be solved [1,2,3,4,5,6,7]. Researchers have proposed approximate nearest neighbor methods to satisfy the requirements of large-scale retrieval. Hashing, one of the most popular such techniques, has received increasing attention in recent years [8,9,10,11,12,13,14]. The core idea of hashing is to project data from the original space into a binary space; the similarities between data points can then be measured by Hamming distances, which are computed rapidly with the XOR operation. Because of its efficiency in computation and memory cost, hashing has attracted wide attention in the multimedia retrieval field.
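As a brief illustration of the XOR-based Hamming distance computation mentioned above (the variable names and toy codes are ours, for illustration only), a minimal sketch in Python:

```python
import numpy as np

# Integer form: two 8-bit hash codes compared with XOR, then popcount.
code_a = 0b10110010
code_b = 0b10011010
dist_int = bin(code_a ^ code_b).count("1")  # -> 2 differing bits

# Vector form for {0,1} code arrays, as used for n-bit codes.
a = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
b = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=np.uint8)
dist_vec = int(np.count_nonzero(a ^ b))  # same distance, 2
```

Both forms avoid floating-point arithmetic entirely, which is why Hamming ranking over binary codes is so fast.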
Early hashing methods were widely applied to single-modality retrieval. However, tremendous multimedia data are generated on the Internet, which makes the retrieval task more challenging. The main target of cross-modal retrieval is to build relationships between different modalities [12, 15,16,17,18]. Specifically, when a user submits a query, the retrieval system returns similar objects in other modalities. However, there is a pervasive semantic gap among different modalities; hence, preserving the semantic correlation among heterogeneous data has become an important goal.
Matrix factorization, which can capture the intrinsic data structure hidden in the original data, is a powerful tool for subspace learning. Accordingly, several matrix factorization-based methods have been designed to address the cross-modal retrieval task and achieve acceptable retrieval performance [19,20,21,22,23,24]. However, these methods generally suffer from the following shortcomings. First, most models learn consistent representations for pairwise data points by keeping the inter-modal similarities using collective matrix factorization (CMF) [19, 21, 22, 25]. However, data points from different modalities lie in totally different feature spaces, so directly generating consistent representations for heterogeneous pairwise data points incurs a large cost in the training procedure and consequent performance degradation. Figure 1 illustrates the differences between the proposed method and CMF-based methods: (a) shows the formulation of the proposed DMCMFH; (b) shows the formulation of CMF-based methods. Different shapes represent different modalities, and different border colors with the same inner color represent different data points from the same category. Most models, such as that in Fig. 1(b), directly use CMF to generate consistent representations for heterogeneous pairwise data points, which usually results in large training costs and degraded performance. In contrast, DMCMFH generates independent subspaces through matrix factorization and multi-similarity-matrix-based embedding, which makes the learned subspaces more discriminative and reduces the distance between different samples of the same category. A semantic space built from class tags then bridges the heterogeneous gap.
Even though heterogeneous data points lie in different feature spaces, the distance between those from the same category should be small. Second, to generate more discriminative hash codes, most models maintain the intra-modal similarities in the hash learning procedure [21, 26]. However, they only attempt to maintain the local data structure or the class-label-based semantic structure, which may not be enough to capture the intrinsic structure of the training data. Third, most of them first learn a real-valued subspace and then quantize the real-valued representations into discrete hash codes for simplicity [21, 26,27,28]. However, this quantization procedure leads to a large quantization loss and lower retrieval performance.
To overcome the issues above, we propose a novel hashing method, named Discrete Multi-similarity Consistent Matrix Factorization Hashing (DMCMFH), which incorporates matrix factorization and multi-similarity consistency into a unified framework. Specifically, an individual subspace is first generated for each modality by matrix factorization and multi-similarity-matrix-based embedding, which makes the learned subspaces more discriminative. A semantic space is then constructed from class labels to bridge the heterogeneous gap. The flowchart of DMCMFH is shown in Fig. 2.
The contributions are summarized as follows:

DMCMFH leverages a semantic space derived from class labels to establish relationships between the individual subspaces generated by matrix factorization. Therefore, the inter-modal similarity can be well preserved in the learned common semantic space, which improves the discrimination of the shared space and yields more discriminative hash codes.

We design a multi-similarity matrix that not only combines the affinity information across different modalities but also better captures the latent semantic affinity among the data points.

Experimental results indicate the superior performance of DMCMFH in various respects compared with several existing methods.
The rest of this paper is organized as follows. Related work is reviewed in Section Related work. The proposed DMCMFH model is presented in Section Discrete multi-similarity consistent matrix factorization hashing. Comprehensive experiments and the corresponding analysis are presented in Section Experiments. Finally, Section Conclusion concludes the paper.
Related work
In this section, we briefly introduce unsupervised and supervised crossmodal hashing methods.
Unsupervised cross-modal hashing methods focus on learning hash codes by preserving feature-based similarity without supervised information. CMFH [21] first utilizes CMF to learn a common space for heterogeneous data. DJSRH [29] constructs an affinity matrix that captures the latent intrinsic semantic affinity. JIMFH [27] learns unified hash codes to maintain common properties and individual hash codes to retain modality-specific properties. Unsupervised methods generally cannot produce high-quality hash codes because class labels are not available to reduce the heterogeneous gap.
In contrast, supervised methods can make use of both the data characteristics and the semantic information to supervise hash learning, and they play an increasingly significant role in cross-modal retrieval tasks. SMFH [26] applies the CMF strategy to generate common representations for pairwise data points and maintains the intra-modal similarities by graph regularization; it first generates a real-valued common semantic subspace for the heterogeneous modalities and then quantizes it to obtain hash codes, which results in a large quantization loss. SCRATCH [19] kernelizes the original features and leverages CMF to project the kernelized features into a semantic space. EDSH [22] learns hash codes based on semantic embedding and class-label-driven CMF. MTFH [20] proposes a flexible and generalized cross-modal hashing framework.
Note that Joint and Individual Matrix Factorization Hashing (JIMFH) [27], Online Collective Matrix Factorization Hashing (OCMFH) [28], Collective Matrix Factorization Hashing (CMFH) [21], Supervised Matrix Factorization for cross-modality Hashing (SMFH) [26], Efficient Discrete Supervised Hashing (EDSH) [22] and Scalable Discrete Matrix Factorization Hashing (SCRATCH) [19] are all matrix factorization-based cross-modal hashing methods. Among them, JIMFH [27], OCMFH [28] and CMFH [21] are unsupervised and cannot utilize class labels to improve hash-code quality, while SCRATCH [19], EDSH [22] and SMFH [26] are supervised. Most of them learn a common space by CMF [19, 21, 22, 26,27,28]; however, this generally produces a large loss due to the complex correlation between modalities. Moreover, most of them use a coarse similarity matrix to preserve the local structure of each modality in the hash codes, which is not enough to describe the intrinsic local structure of the training data points [20, 22, 26]. Hence, we leverage a semantic space derived from class labels to establish relationships between the individual subspaces generated by matrix factorization, and design a multi-similarity matrix that preserves both absolute and relative similarity among data points.
Discrete multi-similarity consistent matrix factorization hashing
Notations
Assume that we have n data points described by \(O=\left\{ O^I,O^T\right\} \), where \({O^I}\in R^{{d_I}\times n}\) and \({O^T}\in R^{{d_T}\times n}\), and \(d_I\) and \(d_T\) are the feature dimensions of the two modalities. \(O^{t}=\left\{ o_1^{t},o_2^{t},...,o_n^{t}\right\} , t\in \left\{ I,T \right\} \) denotes the data from the t-th modality, where \(o_i^{t}\) is a feature vector from the image or text modality. Without loss of generality, the data of each modality are zero-centered, i.e., \(\sum _{i=1}^n o_i^{t}=0\). \(Y=\{y_i\}_{i=1}^n\in R^{c\times n}\) is the class label matrix and \(y_i\) is the i-th class label vector, where c is the total number of categories. \(y_{iq}=1\) implies that the i-th training instance contains the q-th semantic concept; otherwise \(y_{iq}=0\). \(B\in R^{k\times n}\) is the common hash code matrix, where k is the code length.
Formulation
To better capture the nonlinear structure of the original data, an RBF kernel function is adopted to map the original data points into the kernel space.
where \(t\in \left\{ I,T \right\} \) and \(\left\{ O_1,O_2,...,O_m\right\} \) denote the m anchors randomly selected from the training data points. \(\sigma \) is set to the mean Euclidean distance between the training samples. Let \(X^I\) and \(X^T\) denote the kernelized data from the image and text modalities, respectively.
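Since the kernelization equation itself is not reproduced above, the following is a minimal sketch of the standard RBF anchor mapping the text describes, with \(\sigma \) set to the mean pairwise Euclidean distance between training samples; all variable names and shapes are our own assumptions:

```python
import numpy as np

def rbf_features(O, anchors, sigma):
    """Map d x n data O to m x n kernel features phi(o)_j = exp(-||o - anchor_j||^2 / (2 sigma^2))."""
    # Squared Euclidean distances between every anchor (column) and every sample (column).
    sq_dists = ((anchors[:, :, None] - O[:, None, :]) ** 2).sum(axis=0)  # shape (m, n)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
O = rng.normal(size=(8, 50))                            # d = 8 features, n = 50 samples
anchors = O[:, rng.choice(50, size=5, replace=False)]   # m = 5 anchors picked from the data
# sigma: mean pairwise Euclidean distance among the training samples, as in the text.
diffs = O[:, :, None] - O[:, None, :]
sigma = np.sqrt((diffs ** 2).sum(axis=0)).mean()
X = rbf_features(O, anchors, sigma)                     # kernelized features, shape (5, 50)
```

Each column of `X` is the kernelized representation of one sample, so `X` plays the role of \(X^t\) in the subsequent factorization.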
Individual subspace learning
To bridge the correlations among heterogeneous data points, we propose to first learn individual subspaces by matrix factorization for both modalities:
where \(U^t\in R^{{d_t}\times k}(t=\left\{ I,T \right\} )\) is a mapping matrix, and \(V^t\in R^{k\times n}(t=\left\{ I,T \right\} )\) is an individual subspace. \(\alpha \) is the combining coefficient and \(\alpha \in (0,1)\) controls the weight of the image modality. \({\parallel \cdot \parallel }_F^2\) is the Frobenius norm.
Class labels embedding
Class labels contain highlevel semantic information, and we propose embedding class labels into common space learning to improve the retrieval performance:
where Z is a projection matrix and \(\beta _t\) is a weighted parameter.
We also introduce a linear mapping function \(W^t\) for each modality, which can address the outofsample issue for each modality:
where \(\mu \) is the weight parameter of the linear projection regular term.
Multi-similarity consistency embedding
To further improve the retrieval performance, most studies utilize a coarse affinity matrix to preserve the intramodal similarity. However, the coarse affinity matrix is not sufficient for describing the intrinsic local structure in training data points.
To better bridge the heterogeneous gap, we design a multi-similarity matrix that preserves both absolute similarity and relative similarity among data points. After normalizing \(X^T\), \(X^I\) and Y to \(\hat{X^T}\), \(\hat{X^I}\) and \(\hat{Y}\), each of which has unit \(l_2\)-norm rows, we can calculate the similarity matrix of each modality and of the class labels by:
and
Then, the absolute similarity matrix can be obtained:
However, considering only the absolute similarity of the samples is not enough to describe the intrinsic local structure of the training data points, so the to-be-learned hash codes cannot maintain the appropriate similarities. It is therefore necessary to introduce additional similarity information to produce better hash codes. To this end, we define the relative similarity as follows:
where \(\sigma \) and \(\varepsilon \) are the tradeoff parameters.
Based on the definitions of the absolute and relative similarity, we define the multi-similarity matrix as:
While Eq. (8) uses only low-order neighborhood information, Eq. (10) embeds class labels into the affinity matrix and combines the high-order affinity information across different modalities. Therefore, the learned multi-similarity matrix can capture the latent semantic structure among the input instances.
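The equations referenced above are not reproduced here, but the cosine-similarity construction the text describes (normalized features, similarities as inner products) can be sketched as follows; the equal-weight fusion at the end is our placeholder assumption, not the paper's exact Eq. (8)/(10) weighting:

```python
import numpy as np

def cosine_sim(X):
    """Sample-wise cosine similarity of d x n features: normalize columns, then S = X_hat^T X_hat."""
    Xh = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    return Xh.T @ Xh

rng = np.random.default_rng(2)
XI, XT = rng.normal(size=(8, 30)), rng.normal(size=(6, 30))  # kernelized image/text features
Y = (rng.random(size=(5, 30)) > 0.7).astype(float)           # toy multi-label matrix
SI, ST, SL = cosine_sim(XI), cosine_sim(XT), cosine_sim(Y)
# Placeholder fusion into one "absolute" similarity matrix (uniform weights assumed).
S_abs = (SI + ST + SL) / 3.0
```

Each of `SI`, `ST`, `SL` is an n x n symmetric matrix; fusing them pools affinity evidence from both modalities and the class labels.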
To preserve the multi-similarity-matrix-based similarities in the hash codes, the objective loss function is formulated as:
where \(L^t=D^tS^t\), \(D^t\) is a diagonal matrix and \(D^t_{ii}=\sum _{j=1}^n(S^t_{ij})\), \(\eta \) is a weighted parameter.
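To make the role of the Laplacian \(L^t=D^t-S^t\) concrete, here is a small sketch (our notation) verifying that \(\mathrm{tr}(VLV^{\top })\) equals the pairwise smoothness penalty \(\frac{1}{2}\sum _{ij}S_{ij}\Vert v_i-v_j\Vert ^2\), which is why minimizing it pulls similar samples' representations together:

```python
import numpy as np

def graph_reg(V, S):
    """tr(V L V^T) with L = D - S and D_ii = sum_j S_ij."""
    L = np.diag(S.sum(axis=1)) - S
    return np.trace(V @ L @ V.T)

rng = np.random.default_rng(3)
S = rng.random(size=(12, 12))
S = (S + S.T) / 2.0              # symmetric affinity matrix
V = rng.normal(size=(4, 12))     # k x n subspace representation

reg = graph_reg(V, S)
# Sanity check against the equivalent pairwise form.
pairwise = 0.5 * sum(S[i, j] * np.sum((V[:, i] - V[:, j]) ** 2)
                     for i in range(12) for j in range(12))
```

For a symmetric nonnegative `S` the Laplacian is positive semidefinite, so the regularizer is always nonnegative.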
Quantization loss
To generate high quality hash codes, a rotation procedure is adopted to minimize the quantization loss:
where \(R\in R^{k \times k}\) is an orthogonal rotation matrix.
Overall objective function
In conclusion, the final objective function is formed as:
where \(l_{6}=\gamma ({{\parallel Z\parallel }_F^2}+{\sum _{t=1}^m({\parallel W^t\parallel }_F^2+{\parallel U^t\parallel }_F^2 } { +{\parallel V^t\parallel }_F^2)})\) denotes the regularization term for avoiding overfitting.
Optimization method
We design an iterative optimization scheme to address Eq. (13), which is outlined in Algorithm 1. Specifically, each step updates one variable while fixing the others. The detailed procedure is summarized as follows:
\(\varvec{U^T}\)-step: By fixing all variables but \(U^T\), we have:
we obtain:
\(\varvec{U^I}\)-step: By fixing all variables but \(U^I\), we have:
we obtain:
\(\varvec{V^I}\)-step: By fixing all variables but \(V^I\), we have:
we obtain:
\(\varvec{V^T}\)-step: By fixing all variables but \(V^T\), we have:
we obtain:
\(\varvec{W^I}\)-step: By fixing all variables but \(W^I\), we have:
we obtain:
\(\varvec{W^T}\)-step: By fixing all variables but \(W^T\), we have:
we obtain:
\(\varvec{Z}\)-step: By fixing all variables but Z, we have:
we obtain:
\(\varvec{B}\)-step: By fixing all variables but B, we have:
we obtain:
\(\varvec{R}\)-step: By fixing all variables but R, we have:
Specifically, the singular value decomposition of \(B Y^{\top } Z^{\top }\) is computed as \(B Y^{\top } Z^{\top }=H \Omega \widehat{H}^{\top }\), and the orthogonal rotation matrix can then be updated using \(R=H \widehat{H}^{\top }\).
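The R-step above is an orthogonal Procrustes update and can be sketched directly; the shapes of B, Y and Z below follow our reading of the notation section (B is \(k\times n\), Y is \(c\times n\), Z is assumed \(k\times c\)):

```python
import numpy as np

def update_rotation(B, Y, Z):
    """R-step: SVD of B Y^T Z^T = H Omega Hhat^T, then R = H Hhat^T."""
    H, _, Hhat_T = np.linalg.svd(B @ Y.T @ Z.T)  # svd returns U, singular values, V^T
    return H @ Hhat_T

rng = np.random.default_rng(4)
k, c, n = 16, 5, 40
B = np.sign(rng.normal(size=(k, n)))               # k x n binary codes
Y = (rng.random(size=(c, n)) > 0.6).astype(float)  # c x n label matrix
Z = rng.normal(size=(k, c))                        # assumed k x c label projection
R = update_rotation(B, Y, Z)
```

Discarding the singular values and keeping only the singular vectors yields the closest orthogonal matrix, which is exactly what a rotation update requires.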
Complexity analysis
In this section, we analyze the time complexity of DMCMFH. Solving Eq. (14) costs \(O(d_Tkn)\); Eq. (16), \(O(d_Ikn)\); Eq. (18), \(O(k^2d_I+k^2n+kcn+kd_In)\); Eq. (20), \(O(k^2d_T+k^2n+kcn+kd_Tn)\); Eq. (22), \(O(kd_In+kn^2)\); Eq. (24), \(O(kd_Tn+kn^2)\); Eq. (26), \(O(kcn+k^2c+k^3)\); Eq. (28), \(O(k^2c+kcn)\); and updating R costs \(O(k^3+k^2n)\). Thus, the computational complexity of each iteration is \(O(n^2)\).
We also analyze the space complexity of DMCMFH. Storing \(O^I\), \(O^T\), \(X^I\) and \(X^T\) requires T(\(d^tn\)); Y requires T(cn); B, \(V^I\) and \(V^T\) require T(kn); \(U^I\), \(U^T\) and \(W^t\) require T(\(d^tk\)); Z requires T(kc); R requires T(\(k^2\)); and \(S^L\), \(S^I\), \(S^T\), \(S_a\), \(S_b\) and \(D^t\) require T(\(n^2\)). Thus, the space complexity is T(\(n^2\)).
Outofsample extension
After the hash codes of the training data points are obtained, given a query \(o^{(t)}\) from the t-th modality, its hash code can be calculated as:
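Since the query-hashing equation is not reproduced above, the following is a hypothetical sketch consistent with the learned components: the query is kernelized, projected by the modality's linear map \(W^t\), rotated by R, and binarized with the sign function. All names and shapes are our assumptions:

```python
import numpy as np

def hash_query(phi_o, W, R):
    """Hash a kernelized query: project with the modality's W, rotate with R, then binarize."""
    b = np.sign(R @ (W @ phi_o))
    b[b == 0] = 1  # break sign ties toward +1
    return b

rng = np.random.default_rng(5)
m, k = 20, 16
W = rng.normal(size=(k, m))                    # hypothetical k x m hash projection
R = np.linalg.qr(rng.normal(size=(k, k)))[0]   # an orthogonal rotation
b = hash_query(rng.normal(size=m), W, R)       # k-bit code in {-1, +1}
```

Because the projection and rotation are both linear, hashing an unseen query costs only two matrix-vector products plus a sign, which is what makes the out-of-sample extension cheap.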
Variable summary
Table 1 shows the frequently used notations in this paper.
Experiments
Datasets
MS-COCO: This dataset consists of 123,588 image–sentence pairs from 80 object categories. Each image is associated with 4 short sentences describing its content. The text information is represented as a 2000-dimensional bag-of-words vector.
MIRFlickr-25K: This dataset contains 25,000 image–text pairs. In our experiments, textual tags appearing fewer than 20 times are dropped, and data points without textual labels or class labels are removed, leaving 20,015 data points.
NUS-WIDE: This dataset is composed of 269,648 image–text pairs annotated with 81 concepts. We select the 10 most frequent labels, and the corresponding 186,577 pairs are retained in our experiments.
The details of the three datasets are summarized in Table 2.
Baseline Methods and Implementation Details
We conduct comparison experiments with several baseline methods, including CMFH [21], SMFH [26], SCM [30], JIMFH [27], EDSH [22], SCRATCH [19], MTFH [20], LCMFH [31] and BATCH [32].

CMFH [21] utilizes CMF on the different modalities of one instance to learn unified hash codes.

SMFH [26] embeds class labels to CMF to maintain semantic information in hash codes.

SCM [30] rebuilds the similarity matrix by the learned hash codes.

SCRATCH [19] first applies CMF to the original features and then incorporates class labels to learn hash codes in a discrete manner.

JIMFH [27] learns unified hash codes to maintain the common properties of multimodal data and individual hash codes to keep the specific characteristics of each modality.

MTFH [20] learns modality-specific hash codes with varying length settings while synchronously learning two semantic correlation matrices that semantically correlate the different hash representations for heterogeneous data comparison.

EDSH [22] learns a shared semantic space by CMF to reduce the heterogeneous gap.

LCMFH [31] directly uses semantic labels to guide learning hash codes.

BATCH [32] leverages CMF to learn a latent semantic space.
Due to the high time complexity of some methods, it is intractable to train hash functions using all training data on the NUS-WIDE dataset. Hence, we select 10,000 instances for the training procedure for a fair comparison. There are six parameters in DMCMFH; we set \(\alpha =0.6\), \(\beta =11\), \(\mu =5\), \(\gamma =0.001\), \(\xi =0.1\) and \(\eta =0.00001\) in our experiments.
Evaluation criteria
In this paper, we choose several widely used metrics to evaluate the performance of DMCMFH.

MAP is the mean of the average precision (AP) over all queries, where AP can be defined as:
$$\begin{aligned} AP = \frac{1}{R}\sum _{k=1}^N(P(k)\times {rel}_k) \end{aligned}$$where N denotes the number of retrieved instances, R the number of relevant instances, P(k) the precision of the result ranked at k, and \({rel}_k=1\) if the instance ranked at k is relevant and 0 otherwise.

top-N precision reflects the variation in precision as the number of returned instances varies.

The precision–recall (PR) curve reflects the change in precision at different recall ratios.

Precision@k: the proportion of relevant results among the top k returned results.

F1 and NDCG are usually used to measure and evaluate retrieval results.
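The AP definition above can be checked with a short sketch, taking R as the number of relevant instances in the ranked list:

```python
import numpy as np

def average_precision(rel):
    """AP = (1/R) * sum_k P(k) * rel_k over a ranked 0/1 relevance list."""
    rel = np.asarray(rel, dtype=float)
    R = rel.sum()
    if R == 0:
        return 0.0
    p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)  # precision at each rank
    return float((p_at_k * rel).sum() / R)

# Ranked list [relevant, irrelevant, relevant]: P(1)=1, P(3)=2/3, R=2 -> AP = 5/6.
ap = average_precision([1, 0, 1])
```

MAP is then simply the mean of this quantity over the query set.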
Comparison with baselines
We conduct Image-to-Text (IT) and Text-to-Image (TI) retrieval tasks to evaluate the performance of DMCMFH.
The MAP values of DMCMFH and all baselines are illustrated in Table 3. It can be seen that:

DMCMFH outperforms the other baselines in most cases on both tasks. To better bridge the individual subspaces derived from matrix factorization via a shared semantic space and multi-similarity consistency, matrix factorization and class-label embedding are integrated into one framework to improve the discrimination of the common semantic space. Furthermore, a multi-similarity matrix is designed to capture the intrinsic local structure and improve the quality of the hash codes.

It can be observed that our DMCMFH obtains the best results, especially for long bits. This is because our method exploits more semantic information in the hash learning procedure. Furthermore, as the length of hash codes increases, more semantic information is embedded into hash codes.

Compared with the unsupervised baseline methods, DMCMFH achieves increases of \(17.86\%\) and \(54.54\%\) on the IT and TI tasks for the MS-COCO dataset, respectively, and achieves increases of \(25.32\%\) and \(15.40\%\) on the two retrieval tasks for the MIRFlickr-25K dataset. For the NUS-WIDE dataset, DMCMFH achieves increases of \(14.44\%\) on the IT task and \(15.36\%\) on the TI task. Compared with the supervised baseline methods, DMCMFH achieves increases of \(0.22\%\) and \(0.21\%\) on the IT and TI tasks for the MS-COCO dataset, respectively, and achieves increases of \(3.00\%\) and \(1.62\%\) on the two retrieval tasks for the MIRFlickr-25K dataset. For the NUS-WIDE dataset, we obtain increases of \(0.79\%\) on the IT task and \(0.50\%\) on the TI task.
Additionally, we report MAP@K (K = 10, 20) to further evaluate the performance of DMCMFH.
The MAP values of DMCMFH and all baselines are illustrated in Tables 4 and 5; DMCMFH outperforms the other baselines in most cases on both tasks.
In Table 4, compared with the unsupervised baseline methods, DMCMFH achieves increases of \(28.60\%\) and \(30.60\%\) on the IT and TI tasks for the MIRFlickr-25K dataset, respectively, and achieves increases of \(21.60\%\) and \(29.60\%\) on the two retrieval tasks for the NUS-WIDE dataset. For the MS-COCO dataset, our method achieves increases of \(12.67\%\) on the IT task and \(55.67\%\) on the TI task. Compared with the supervised baseline methods, DMCMFH achieves increases of \(0.99\%\) and \(0.49\%\) on the IT and TI tasks for the MIRFlickr-25K dataset, respectively, and achieves increases of \(1.60\%\) and \(1.01\%\) on the two retrieval tasks for the NUS-WIDE dataset. For the MS-COCO dataset, we obtain increases of \(0.16\%\) on the IT task and \(0.95\%\) on the TI task.
In Table 5, compared with the unsupervised baseline methods, DMCMFH achieves increases of \(29.98\%\) and \(24.00\%\) on the IT and TI tasks for the MIRFlickr-25K dataset, respectively, and achieves increases of \(25.06\%\) and \(33.06\%\) on the two retrieval tasks for the NUS-WIDE dataset. For the MS-COCO dataset, our method achieves increases of \(17.73\%\) on the IT task and \(52.73\%\) on the TI task. Compared with the supervised baseline methods, DMCMFH achieves increases of \(0.55\%\) and \(0.41\%\) on the IT and TI tasks for the MIRFlickr-25K dataset, respectively, and achieves increases of \(3.32\%\) and \(2.32\%\) on the two retrieval tasks for the NUS-WIDE dataset. For the MS-COCO dataset, we obtain increases of \(0.87\%\) on the IT task and \(0.19\%\) on the TI task.
The top-N precision curves are shown in Fig. 3. Note that since the implementation of NSDH [33] has not been released, we directly copy its MAP results from the original paper and omit its precision–recall and top-N precision curves. It can be seen that DMCMFH outperforms the baseline methods in most cases; notably, DMCMFH performs better than the baseline methods on the top retrieved instances.
Figure 4 shows the precision–recall curves with a 32-bit code length on the three datasets. It can be seen that DMCMFH outperforms the others on the IT task in most cases and performs better than the others on the top retrieved instances on the TI task in most cases. Generally, users are more concerned with the front instances of the retrieved list. In this sense, DMCMFH achieves superior performance on both retrieval tasks in most cases.
The NDCG values and F1 values of DMCMFH and all baselines are illustrated in Tables 6 and 7; DMCMFH outperforms the other baselines in most cases on both tasks. In Table 6, compared with the unsupervised baseline methods, the proposed DMCMFH achieves increases of \(18.16\%\) and \(13.52\%\) on the IT and TI tasks for the MIRFlickr-25K dataset, respectively, and achieves increases of \(15.77\%\) and \(25.05\%\) on the two retrieval tasks for the NUS-WIDE dataset. For the MS-COCO dataset, our method achieves increases of \(0.0967\%\) on the IT task and \(47.34\%\) on the TI task. Compared with the supervised baseline methods, the proposed DMCMFH achieves increases of \(0.36\%\) and \(0.50\%\) in average MAP over different bits on the IT and TI tasks for the MIRFlickr-25K dataset, respectively, and achieves increases of \(4.10\%\) and \(3.93\%\) on the two retrieval tasks for the NUS-WIDE dataset. For the MS-COCO dataset, we obtain increases of \(2.39\%\) on the IT task and \(0.68\%\) on the TI task.
The precision values of DMCMFH and all baselines are illustrated in Tables 8 and 9. It can be seen that DMCMFH outperforms others on the IT task in most cases and performs better than others on top retrieved instances on the TI task in most cases. In this sense, DMCMFH achieves superior performance on both retrieval tasks in most cases.
The above experimental results verify the superiority of DMCMFH. The performance improvement comes from the proposed multi-similarity matrix, which preserves absolute and relative similarity simultaneously to capture the inter-modal and intra-modal correlations.
Comprehensive analysis
Convergence analysis
Since the optimal solutions of DMCMFH are obtained by iterative updating rules, its training time is closely related to the number of iterations in the training phase. In this section, its convergence is validated experimentally, and the results are shown in Fig. 5. The objective values decrease monotonically, and the proposed optimization scheme converges within twenty iterations, which validates the effectiveness of the proposed optimization algorithm.
Sensitivity to parameters
To further investigate the sensitivity of the parameters, we also conduct experiments under different values of \(\alpha \), \(\gamma \), \(\mu \), \(\beta \), \(\eta \) and \(\xi \) in the case of 32 bits. There are six main parameters in DMCMFH: the weights of the text and image modalities are controlled by \(\alpha \); the contributions of the correlation-matching terms are controlled by \(\mu \), \(\beta \), \(\eta \) and \(\xi \); and the contribution of the regularization term is controlled by \(\gamma \). Figure 6 shows the results for \(\xi \) on the three datasets and illustrates that DMCMFH performs better when \(\xi \) is less than 0.0001. As \(\xi \) increases from 0.0001 to 10, the MAP performance decreases quickly. This shows that the text modality plays a more significant role than the image modality in the performance of DMCMFH.
Comparison with deep neural networks
Deep hashing retrieval has been widely studied [9, 29, 34,35,36,37,38,39,40]. Since neural networks have achieved strong performance in modeling nonlinear correlations, many deep cross-modal hashing approaches have been proposed that integrate feature learning and hash-code learning into a unified framework. We conduct comparison experiments with DCMH [37], DJSRH [29] and NSDH [33] on the MIRFlickr-25K dataset. Given the high feature extraction ability of convolutional neural networks, we use CNN features to implement the proposed DMCMFH for a fair comparison.
The MAP values of DMCMFH and the deep learning methods are illustrated in Table 10. DMCMFH outperforms the other baselines on both tasks, which demonstrates the effectiveness of our main proposals; it achieves increases of \(12.42\%\) and \(6.46\%\) on the IT and TI tasks for the MIRFlickr-25K dataset, respectively. DMCMFH incorporates matrix factorization and multi-similarity consistency into a unified framework; therefore, the inter-modal similarity can be well preserved in the learned common semantic space, which improves the discrimination of the shared space and yields more discriminative hash codes.
Conclusion
In this article, a cross-modal hashing method named DMCMFH is proposed. Data from each modality are first mapped to an individual subspace, and a semantic subspace generated from the class labels is then constructed to align the semantic information of the different modalities. To capture the intrinsic local structure of the training data points, we propose a multi-similarity matrix that not only better captures the latent semantic affinity but also combines the affinity information across different modalities, so the semantic correlation between modalities is well maintained. Extensive experiments indicate the superiority of DMCMFH over the baseline models.
Recently, deep hashing has been widely studied, not only to capture the intra-modality and inter-modality semantic associations of images and texts, but also to obtain fine-grained semantic information from them. In future work, we will design a deep learning-based matrix factorization model to further improve the retrieval performance.
Data Availability
Data are openly available in public repositories. The MS-COCO dataset is available at http://mscoco.org/home/, the MIRFlickr-25K dataset at http://press.liacs.nl/mirflickr/, and the NUS-WIDE dataset at http://lms.comp.nus.edu.sg/research/NUSWIDE.htm.
References
Shen HT, Liu L, Yang Y et al (2020) Exploiting subspace relation in semantic labels for cross-modal hashing[J]. IEEE Trans Knowl Data Eng 33(10):3351–3365
Hu M, Yang Y, Shen F et al (2018) Collective reconstructive embeddings for cross-modal hashing[J]. IEEE Trans Image Process 28(6):2770–2784
Hu Y, Liu M, Su X et al (2021) Video moment localization via deep cross-modal hashing[J]. IEEE Trans Image Process 30:4667–4677
Wang Y, Chen ZD, Luo X et al (2021) Fast cross-modal hashing with global and local similarity embedding[J]. IEEE Trans Cybern PP(99):1–14
Liu X, Nie X, Zeng W et al (2018) Fast discrete cross-modal hashing with regressing from semantic labels[C]. In: Proceedings of the 26th ACM International Conference on Multimedia, pp 1662–1669
Zou X, Wang X, Bakker EM et al (2021) Multi-label semantics preserving based deep cross-modal hashing[J]. Signal Process Image Commun 93:116131
Zhan YW, Wang Y, Sun Y et al (2022) Discrete online cross-modal hashing[J]. Pattern Recognit 122:108262
Gadamsetty S, Ch R, Ch A et al (2022) Hash-based deep learning approach for remote sensing satellite imagery detection[J]. Water 14(5):707
Shen Y, Gadekallu TR (2022) Resource search method of mobile intelligent education system based on distributed hash table[J]. Mobile Netw Appl 1–10
Wang W, Xu H, Alazab M et al (2021) Blockchain-based reliable and efficient certificateless signature for IIoT devices[J]. IEEE Trans Ind Inform
Deebak BD, Memon FH, Dev K et al (2022) TAB-SAPP: a trust-aware blockchain-based seamless authentication for massive IoT-enabled industrial applications[J]. IEEE Trans Ind Inform
Deebak BD, Memon FH, Khowaja SA et al (2022) Lightweight blockchain based remote mutual authentication for AI-empowered IoT sustainable computing systems[J]. IEEE Internet Things J
Hu H, Xie L, Hong R et al (2020) Creating something from nothing: unsupervised knowledge distillation for cross-modal hashing[C]. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3123–3132
Fang Y, Li B, Li X et al (2021) Unsupervised cross-modal similarity via latent structure discrete hashing factorization[J]. Knowl-Based Syst 218:106857
Luo X, Yin XY, Nie L et al (2018) SDMCH: supervised discrete manifold-embedded cross-modal hashing[C]. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp 2518–2524
Liu H, Feng Y, Zhou M et al (2021) Semantic ranking structure preserving for cross-modal retrieval[J]. Appl Intell 51(3):1802–1812
Fang Y, Ren Y, Park JH (2020) Semantic-enhanced discrete matrix factorization hashing for heterogeneous modal matching[J]. Knowl-Based Syst 192:105381
Fang Y, Ren Y (2020) Supervised discrete cross-modal hashing based on kernel discriminant analysis[J]. Pattern Recognit 98:107062
Li CX, Chen ZD, Zhang PF et al (2018) SCRATCH: a scalable discrete matrix factorization hashing for cross-modal retrieval[C]. In: Proceedings of the 26th ACM International Conference on Multimedia, pp 1–9
Liu X, Hu Z, Ling H et al (2021) MTFH: a matrix tri-factorization hashing framework for efficient cross-modal retrieval[J]. IEEE Trans Pattern Anal Mach Intell 43(3):964–981
Ding G, Guo Y, Zhou J (2014) Collective matrix factorization hashing for multimodal data[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2075–2082
Yao T, Han Y, Wang R et al (2020) Efficient discrete supervised hashing for large-scale cross-modal retrieval[J]. Neurocomputing 385:358–367
Yao T, Li Y, Guan W et al (2021) Discrete robust matrix factorization hashing for large-scale cross-media retrieval[J]. IEEE Trans Knowl Data Eng 01:1–1
Zou X, Wu S, Zhang N et al (2022) Multi-label modality enhanced attention based self-supervised deep cross-modal hashing[J]. Knowl-Based Syst 239:107927
Yao T, Kong X, Fu H et al (2019) Discrete semantic alignment hashing for cross-media retrieval[J]. IEEE Trans Cybern 50(12):4896–4907
Liu H, Ji R, Wu Y et al (2016) Supervised matrix factorization for cross-modality hashing[C]. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp 1767–1773
Wang D, Wang Q, He L et al (2020) Joint and individual matrix factorization hashing for large-scale cross-modal retrieval[J]. Pattern Recognit 107:107479
Wang D, Wang Q, An Y et al (2020) Online collective matrix factorization hashing for large-scale cross-media retrieval[C]. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 1409–1418
Su S, Zhong Z, Zhang C (2019) Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval[C]. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 3027–3035
Zhang D, Li WJ (2014) Large-scale supervised multimodal hashing with semantic correlation maximization[C]. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pp 2177–2183
Wang D, Gao X, Wang X et al (2018) Label consistent matrix factorization hashing for large-scale cross-modal similarity search[J]. IEEE Trans Pattern Anal Mach Intell 41(10):2466–2479
Wang Y, Luo X, Nie L et al (2020) BATCH: a scalable asymmetric discrete cross-modal hashing[J]. IEEE Trans Knowl Data Eng 33(11):3507–3519
Yang Z, Yang L, Raymond OI et al (2021) NSDH: a nonlinear supervised discrete hashing framework for large-scale cross-modal retrieval[J]. Knowl-Based Syst 217:106818
Deng C, Yang E, Liu T et al (2019) Two-stream deep hashing with class-specific centers for supervised image search[J]. IEEE Trans Neural Netw Learn Syst 31(6):2189–2201
Tang C, Zhu X, Liu X et al (2019) Cross-view local structure preserved diversity and consensus learning for multi-view unsupervised feature selection[C]. In: Proceedings of the AAAI Conference on Artificial Intelligence 33(01):5101–5108
Cui H, Zhu L, Li J et al (2019) Scalable deep hashing for large-scale social image retrieval[J]. IEEE Trans Image Process 29:1271–1284
Jiang QY, Li WJ (2017) Deep cross-modal hashing[C]. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3232–3240
Zhu L, Song J, Yang Z et al (2021) DAP\(^2\)CMH: deep adversarial privacy-preserving cross-modal hashing[J]. Neural Process Lett 1–21
Yu E, Ma J, Sun J et al (2022) Deep discrete cross-modal hashing with multiple supervision[J]. Neurocomputing 486:215–224
Li M, Li Q, Ma Y et al (2022) Semantic-guided autoencoder adversarial hashing for large-scale cross-modal retrieval[J]. Complex Intell Syst 8(2):1603–1617
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos. 61872170, 62102186, 62076052, 61873117) and the Natural Science Foundation of Jiangsu Province under Grant BK20200725.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Y., Hu, P., Li, Y. et al. Discrete matrix factorization cross-modal hashing with multi-similarity consistency. Complex Intell. Syst. 9, 4195–4212 (2023). https://doi.org/10.1007/s40747-022-00950-z