Discrete matrix factorization cross-modal hashing with multi-similarity consistency

Recently, matrix factorization-based hashing has gained wide attention because of its strong subspace learning ability and high search efficiency. However, some problems still need to be addressed. First, although unified hash codes can be generated by collective matrix factorization, this often incurs a serious loss that degrades the quality of the hash codes. Second, most methods simply preserve the absolute similarity in the hash codes, failing to capture the inherent semantic affinity among the training data. To overcome these obstacles, we propose Discrete Multi-similarity Consistent Matrix Factorization Hashing (DMCMFH). Specifically, an individual subspace is first learned by matrix factorization and multi-similarity consistency for each modality. Then, the subspaces are aligned by a shared semantic space to generate homogeneous hash codes. Finally, an iterative discrete optimization scheme is presented to reduce the quantization loss. We conduct quantitative experiments on three datasets: MSCOCO, Mirflickr25K and NUS-WIDE.
Compared with supervised baseline methods, DMCMFH achieves increases of 0.22%, 3.00% and 0.79% on the image-query-text tasks for the three datasets respectively, and achieves increases of 0.21%, 1.62% and 0.50% on the text-query-image tasks for the three datasets respectively.


Introduction
With the arrival of 5G networks, a tremendous quantity of multimedia data, e.g., texts, images and videos, is generated on social networks. Faced with such a large quantity of data, how to search it effectively and efficiently becomes a problem that needs to be solved [1-7]. Some researchers have proposed approximate nearest neighbor methods to satisfy the requirements of large-scale retrieval. Hashing, one of the most popular such techniques, has received increasing attention in recent years [8-14]. The key idea of hashing is to project data from the original space into a binary space; the similarities between data points can then be measured by Hamming distances, which can be computed rapidly with the XOR operation. Because of its efficiency in computation and memory cost, hashing has attracted wide attention in the multimedia retrieval field.

Early hashing techniques were widely applied to single-modality retrieval. However, tremendous multimedia data are generated on the Internet, which makes the retrieval task more challenging. The main target of cross-modal retrieval is to build relationships between different modalities [12, 15-18]: when a user submits a query, the retrieval system returns similar objects from other modalities. However, there is a pervasive semantic gap among different modalities. Hence, preserving the semantic correlation among heterogeneous data has become one of the important targets.
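As a generic illustration of this point (not specific to any method in this paper), the Hamming distance between two binary codes reduces to an XOR followed by a popcount:

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary hash codes packed as ints.

    XOR sets exactly the bits where the codes differ; counting the
    set bits (popcount) gives the number of differing positions.
    """
    return bin(a ^ b).count("1")

# Two 8-bit hash codes differing in two bit positions.
print(hamming_distance(0b10110010, 0b10011010))  # 2
```

This is why binary codes support sub-linear or constant-factor-fast scans over millions of items.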
Matrix factorization, which can capture the intrinsic data structure hidden in the original data, is a powerful tool for subspace learning. Accordingly, several matrix factorization-based methods have been designed to address the cross-modal retrieval task and achieve acceptable retrieval performance [19-24]. However, these methods generally suffer from the following shortcomings. First, most models learn consistent representations for pairwise data points by keeping the inter-modal similarities using collective matrix factorization (CMF) [19, 21, 22, 25]. However, data points from different modalities lie in totally different feature spaces; directly generating consistent representations for heterogeneous pairwise data points incurs a large cost in the training procedure and consequently degrades performance. Figure 1 illustrates the difference between the proposed method and CMF-based methods: (a) shows the formulation of the proposed DMCMFH; (b) shows the formulation of collective matrix factorization-based methods. Different shapes represent different modalities, and different border colors with the same fill color represent different data points from the same category. Most models, such as that in Fig. 1(b), directly use CMF to generate consistent representations for heterogeneous pairwise data points, which usually results in a large training cost and degraded performance. DMCMFH instead generates independent subspaces through matrix factorization and multi-similarity matrix-based embedding, which makes the learned subspaces more discriminative and reduces the distance between different samples from the same category. A semantic space is then built from the class labels to bridge the heterogeneous gap.
Even though heterogeneous data points lie in different feature spaces, the distance between their representations should be small when they come from the same category. Second, to generate more discriminative hash codes, most models maintain the intra-modal similarities in the hash learning procedure [21, 26]. However, they only attempt to maintain the local data structure or the class label-based semantic structure, which may not be enough to capture the intrinsic structure of the training data. Third, most of them first learn a subspace and then quantize the real-valued representations into discrete hash codes for simplicity [21, 26-28]. However, this quantization procedure leads to a large quantization loss and lower retrieval performance.
To overcome the issues referred to above, we propose a novel hashing method, named Discrete Multi-similarity Consistent Matrix Factorization Hashing (DMCMFH), which incorporates matrix factorization and multi-similarity consistency into a unified framework. Specifically, an individual subspace is first generated by matrix factorization and multisimilarity matrix-based embedding, which makes the learned subspaces more discriminative. A semantic space is then constructed by class labels to bridge the heterogeneous gap. The flowchart of our DMCMFH is described in Fig. 2.
The contributions are summarized as follows:
• DMCMFH leverages a semantic space derived from class labels to establish the relationships between the individual subspaces generated by matrix factorization. In the learned common semantic space, the inter-modal similarity can therefore be well preserved, which improves the discrimination of the shared space and consequently yields more discriminative hash codes.
• A multi-similarity matrix is designed that not only combines the affinity information across different modalities but also better captures the latent semantic affinity among the data points.
• Experimental results indicate the superior performance of DMCMFH in various respects compared with several existing methods.
The rest of this paper is organized as follows. The related work is presented in Section Related work. The proposed DMCMFH model is presented in Section Discrete multi-similarity consistent matrix factorization hashing. Comprehensive experiments and corresponding analysis are presented in Section Experiments. Finally, Section Conclusion concludes.

Related work
In this section, we briefly introduce unsupervised and supervised cross-modal hashing methods.
Unsupervised cross-modal hashing methods focus on learning hash codes by preserving feature-based similarity without supervised information. CMFH [21] first utilizes CMF to learn a common space for heterogeneous data. The method in [29] constructs an affinity matrix that can capture the latent intrinsic semantic affinity. The method in [27] learns unified hash codes to maintain common properties and individual hash codes to retain modality-specific properties. Unsupervised methods generally cannot produce high-quality hash codes because class labels cannot be used to reduce the heterogeneous gap. In contrast, supervised methods can exploit both the data characteristics and the semantic information to supervise hash learning, and they play an increasingly significant role in cross-modal retrieval tasks. SMFH [26] uses the CMF strategy to generate common representations for pairwise data points and maintains the intra-modal similarities by graph regularization; it first generates a real-valued common semantic subspace for the heterogeneous modalities and then quantizes it to obtain hash codes, which results in a large quantization loss. SCRATCH [19] kernelizes the original features and leverages CMF to map the kernelized features into a semantic space. EDSH [22] learns hash codes based on semantic embedding and class label-based CMF. MTFH [20] proposes a flexible and generalized cross-modal hashing framework.
Note that Joint and Individual Matrix Factorization Hashing (JIMFH) [27], Online Collective Matrix Factorization Hashing (OCMFH) [28], Collective Matrix Factorization Hashing (CMFH) [21], Supervised Matrix Factorization for cross-modality Hashing (SMFH) [26], Efficient Discrete Supervised Hashing (EDSH) [22] and Scalable Discrete Matrix Factorization Hashing (SCRATCH) [19] are matrix factorization-based cross-modal hashing methods. Among them, JIMFH [27], OCMFH [28] and CMFH [21] are unsupervised methods that cannot utilize class labels to improve hash code quality, while SCRATCH [19], EDSH [22] and SMFH [26] are supervised methods. Most of them learn a common space by CMF [19, 21, 22, 26-28]. However, this procedure generally produces a large loss due to the complex correlation between different modalities. Moreover, most of them utilize a coarse similarity matrix to preserve the local structure of each modality in the hash codes, which is not enough to describe the intrinsic local structure in the training data [20, 22, 26]. Hence, we leverage a semantic space derived from class labels to establish the relationships between the individual subspaces generated by matrix factorization, and we design a multi-similarity matrix that preserves both the absolute similarity and the relative similarity among data points.

Discrete multi-similarity consistent matrix factorization hashing
Y ∈ R^{c×n} is the class label matrix and y_i is the class label vector of the i-th instance, where c is the total number of categories. y_{iq} = 1 implies that the i-th training instance contains the q-th semantic concept; otherwise y_{iq} = 0. B = [b_1, ..., b_n] ∈ R^{k×n} is the common hash code matrix, where k is the code length.

Formulation
To better capture the nonlinear structure of the original data, the RBF kernel function is adopted to map the original data points into a kernel space.
where t ∈ {I, T} and {O_1, O_2, ..., O_m} denotes the set of m anchors randomly selected from the training data points. σ is set to the mean Euclidean distance between the training samples. In the following, X_I and X_T denote the kernelized data from the image and text modalities, respectively.
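A minimal sketch of such an anchor-based kernel mapping is given below; the helper name and the anchor-distance-based default for σ are illustrative assumptions (the paper takes σ as the mean Euclidean distance between training samples):

```python
import numpy as np

def rbf_kernel_features(X, anchors, sigma=None):
    """Map raw features X (n x d) to kernel features (n x m) against
    m anchor points: phi_j(x) = exp(-||x - o_j||^2 / (2 * sigma^2))."""
    # Squared Euclidean distance from every sample to every anchor.
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=-1)
    if sigma is None:
        # Illustrative default: mean sample-anchor distance.
        sigma = np.sqrt(d2).mean()
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                         # 6 samples, 4-dim
anchors = X[rng.choice(6, size=3, replace=False)]   # m = 3 anchors
K = rbf_kernel_features(X, anchors)
print(K.shape)  # (6, 3)
```

Each sample is thus represented by its m kernel responses, turning a nonlinear structure into a fixed-length feature vector.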

Individual subspace learning
To bridge the correlations among heterogeneous data points, we propose to first learn individual subspaces by matrix factorization for both modalities:

min_{U_t, V_t} α ‖X_I − U_I V_I‖²_F + (1 − α) ‖X_T − U_T V_T‖²_F,

where U_t ∈ R^{d_t×k} (t ∈ {I, T}) is a mapping matrix and V_t ∈ R^{k×n} is the individual subspace of modality t. The combining coefficient α ∈ (0, 1) controls the weight of the image modality, and ‖·‖_F denotes the Frobenius norm.
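The weighted factorization term can be evaluated as below; this is a sketch reconstructed from the definitions in the text (the full DMCMFH objective adds the label-embedding, similarity and quantization terms):

```python
import numpy as np

def mf_objective(X_I, X_T, U_I, V_I, U_T, V_T, alpha=0.6):
    """alpha * ||X_I - U_I V_I||_F^2 + (1 - alpha) * ||X_T - U_T V_T||_F^2

    X_t: d_t x n kernelized features, U_t: d_t x k mapping,
    V_t: k x n individual subspace, alpha in (0, 1).
    """
    loss_img = np.linalg.norm(X_I - U_I @ V_I, "fro") ** 2
    loss_txt = np.linalg.norm(X_T - U_T @ V_T, "fro") ** 2
    return alpha * loss_img + (1.0 - alpha) * loss_txt
```

A perfect factorization drives the term to zero, which is a convenient sanity check when implementing the alternating updates.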

Class labels embedding
Class labels contain high-level semantic information, so we embed them into the common-space learning to improve the retrieval performance:

min_{Z, V_t} Σ_t β_t ‖V_t − Z Y‖²_F,

where Z is a projection matrix that maps the label space into the shared semantic space and β_t is a weighting parameter. We also introduce a linear mapping W_t for each modality, which addresses the out-of-sample issue:

min_{W_t} ‖V_t − W_t X_t‖²_F + μ ‖W_t‖²_F,

where μ is the weight of the linear-projection regularization term.

Multi-similarity consistency embedding
To further improve the retrieval performance, most studies utilize a coarse affinity matrix to preserve the intra-modal similarity. However, the coarse affinity matrix is not sufficient for describing the intrinsic local structure in training data points.
To better bridge the heterogeneous gap, we design a multi-similarity matrix that preserves both the absolute similarity and the relative similarity among data points. After normalizing X_T, X_I and Y into X̂_T, X̂_I and Ŷ, each with unit l2-norm per sample, the similarity matrices of the two modalities and of the class labels are computed as S_I = X̂_I^⊤ X̂_I, S_T = X̂_T^⊤ X̂_T and S_L = Ŷ^⊤ Ŷ. The absolute similarity matrix S_a is then obtained by fusing these matrices. However, considering only the absolute similarity of the samples is not enough to describe the intrinsic local structure in the training data, so the to-be-learned hash codes cannot maintain the appropriate similarities. We therefore introduce additional similarity information and define a relative similarity matrix S_b in terms of S_a and two trade-off parameters σ and ε.
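The per-modality cosine-similarity computation can be sketched as follows (samples are taken as rows here for convenience; how S_I, S_T and S_L are fused into S_a and S_b follows the paper's equations):

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Normalize each sample (row) of X to unit l2 norm and return the
    n x n matrix of pairwise cosine similarities S = X_hat @ X_hat.T."""
    Xh = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xh @ Xh.T

rng = np.random.default_rng(0)
S_I = cosine_similarity_matrix(rng.normal(size=(5, 8)))  # image features
print(S_I.shape)  # (5, 5)
```

The diagonal is identically one and the matrix is symmetric, properties worth asserting before fusing the modality-specific similarities.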
Based on the definitions of the absolute and relative similarity, we construct the multi-similarity matrix. Whereas Eq. (8) only uses low-order neighborhood information, Eq. (10) embeds the class labels into the affinity-matrix construction and combines the high-order affinity information across different modalities. The learned multi-similarity matrix therefore possesses the capacity to capture the latent semantic structure among the input instances.
To preserve the multi-similarity matrix-based similarities in the hash codes, a graph-embedding loss is added to the objective, where D_t denotes the corresponding diagonal degree matrix.

Quantization loss
To generate high-quality hash codes, a rotation procedure is adopted to minimize the quantization loss between the binary codes and the rotated real-valued representations, where R ∈ R^{k×k} is an orthogonal rotation matrix.
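Assuming the standard ITQ-style formulation min_R ‖B − R V‖²_F subject to R^⊤R = I (the exact rotated quantity in DMCMFH follows its overall objective), the update for R is an orthogonal Procrustes problem solved by one SVD; a sketch:

```python
import numpy as np

def update_rotation(B, V):
    """Solve min_R ||B - R V||_F^2 over orthogonal R in R^{k x k}.

    Minimizing is equivalent to maximizing tr(R V B^T); with the SVD
    B @ V.T = H S Hhat^T, the optimum is R = H @ Hhat^T.
    """
    H, _, Hhat_T = np.linalg.svd(B @ V.T)
    return H @ Hhat_T
```

If B happens to equal Q V for some orthogonal Q, the update recovers Q exactly, which makes a useful unit test for this step.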

Overall objective function
In conclusion, the final objective function combines the above terms together with a regularization term that avoids overfitting.

Optimization method
We develop an iterative optimization scheme to solve Eq. (13), briefly outlined in Algorithm 1. Each step updates one variable while fixing the others:
• U_I-step and U_T-step: fixing all variables but U_t, setting the derivative of the objective with respect to U_t to zero yields a closed-form update.
• V_I-step and V_T-step: fixing all variables but V_t, a closed-form update for V_t is obtained in the same way.
• W_I-step and W_T-step: fixing all variables but W_t, W_t is updated by solving the resulting regularized least-squares problem.
• Z-step: fixing all variables but Z, a closed-form update for Z is obtained.
• B-step: fixing all variables but B, the binary codes are updated discretely.
• R-step: fixing all variables but R, the problem reduces to an orthogonal Procrustes problem. Specifically, the singular value decomposition of B(ZY)^⊤ is computed as B(ZY)^⊤ = HΣĤ^⊤, and the orthogonal rotation matrix is then updated as R = HĤ^⊤.

Algorithm 1 Discrete Multi-similarity Consistent Matrix Factorization Hashing.
Require: feature matrices X_I and X_T, class label matrix Y, hash code length k, parameters α, β, μ, ξ, γ and η, and the total number of iterations Iter.

Complexity analysis
In this section, we analyze the complexity of DMCMFH. The time complexity is dominated by solving Eq. (14). For the space complexity: storing O_I, O_T, X_I and X_T costs O(d_t n); Y costs O(cn); B, V_I and V_T cost O(kn); U_I, U_T and W_t cost O(d_t k); Z costs O(kc); R costs O(k²); and S_L, S_I, S_T, S_a, S_b and D_t cost O(n²). Thus, the overall space complexity is O(n²).

Out-of-sample extension
After the hash codes of the training data points are obtained, the hash code of a query o^(t) from the t-th modality is computed by applying the learned linear mapping and sign thresholding.

Variable summary

Table 1 shows the frequently used notations in this paper.
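Under the assumption that the out-of-sample encoding composes the modality-specific mapping W_t, the rotation R and sign thresholding (the names mirror the notation above; the exact composition is the one learned by the model), query encoding can be sketched as:

```python
import numpy as np

def encode_query(x_kernelized, W_t, R):
    """Encode one kernelized query vector into a k-bit code in {-1, +1}:
    project with W_t (k x d_t), rotate with R (k x k), then threshold."""
    h = R @ (W_t @ x_kernelized)
    return np.where(h >= 0, 1, -1)

rng = np.random.default_rng(0)
W_t = rng.normal(size=(4, 6))       # maps 6-dim kernel features to k = 4
code = encode_query(rng.normal(size=6), W_t, np.eye(4))
print(code.shape)  # (4,)
```

Since only one matrix-vector product and a thresholding are needed, queries can be encoded in milliseconds regardless of database size.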

Experiments

Datasets
MSCOCO: This dataset consists of 123,588 image-sentence pairs from 80 object categories. Each image is associated with 4 short sentences describing its content. The text information is represented as a 2000-dimensional bag-of-words vector.
Mirflickr25K: This dataset contains 25,000 image-text pairs. Textual tags appearing fewer than 20 times are dropped, and data points without textual tags or class labels are removed. Finally, 20,015 data points remain in our experiments.
NUS-WIDE: It is composed of 269,684 image-text pairs in 81 concepts. We select the top 10 most frequent labels, and the corresponding 186,577 pairs are retained in our experiments.
The details of the three datasets are summarized in Table 2.
• CMFH [21] utilizes CMF on the different modalities of one instance to learn unified hash codes.
• SMFH [26] embeds class labels into CMF to maintain semantic information in the hash codes.
• SCM [30] rebuilds the similarity matrix from the learned hash codes.
• SCRATCH [19] first applies CMF to the original features, and then incorporates class labels to learn hash codes in a discrete manner.
• JIMFH [27] learns unified hash codes to maintain the common properties of multimodal data and individual hash codes to keep the specific characteristics of each modality.
• MTFH [20] learns modality-specific hash codes with varying length settings while simultaneously learning two semantic correlation matrices that semantically correlate the different hash representations for heterogeneous data comparison.
• EDSH [22] learns a shared semantic space by CMF to reduce the heterogeneous gap.
• LCMFH [31] directly uses semantic labels to guide hash code learning.
• BATCH [32] leverages CMF to learn a latent semantic space.
Due to the high time complexity of some methods, it is impractical to train hash functions on all training data of the NUS-WIDE dataset. Hence, we select 10,000 instances for the training procedure for a fair comparison. There are six parameters in DMCMFH; we set α = 0.6, β = 11, μ = 5, γ = 0.001, ξ = 0.1 and η = 0.00001 in our experiments.

Evaluation criteria
In this paper, we choose several widely used metrics to evaluate the performance of DMCMFH.
• MAP is the mean of the average precision (AP) over a set of queries, where AP = (1/R) Σ_{k=1}^{N} P(k) · rel_k, N denotes the size of the dataset, R = Σ_k rel_k is the number of relevant instances, P(k) denotes the precision of the result ranked at position k, and rel_k = 1 if the instance ranked at position k is relevant and 0 otherwise.
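The MAP criterion described above can be computed directly; a standard implementation with rel_k ∈ {0, 1}:

```python
import numpy as np

def average_precision(relevance):
    """AP for one ranked list: average of precision@k over relevant ranks."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(relevance_lists):
    """MAP: mean AP over a set of queries."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Relevant results at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
print(average_precision([1, 0, 1, 0]))  # 0.8333...
```

Truncating the relevance list at the top K positions yields the MAP@K variant used in the later tables.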

Comparison with baselines
We conduct Image-to-Text (I-T) and Text-to-Image (T-I) tasks to evaluate the performance of DMCMFH.
The MAP values of DMCMFH and all baselines are reported in Table 3. It can be seen that:
• DMCMFH outperforms the other baselines in most cases of both tasks. To better bridge the individual subspaces derived from matrix factorization by a shared semantic space and multi-similarity consistency, matrix factorization and class label embedding are integrated into one framework to improve the discrimination of the common semantic space. Furthermore, a multi-similarity matrix is designed to capture the intrinsic local structure, improving the quality of the hash codes.
• DMCMFH obtains the best results, especially for long code lengths. This is because our method exploits more semantic information in the hash learning procedure; as the hash code length increases, more semantic information is embedded into the codes.
• Compared with the unsupervised baseline methods, DMCMFH achieves increases of 17.86% and 54.54% on the I-T and T-I tasks for the MSCOCO dataset, respectively, and achieves increases of 25.32% and 15.40% on the two retrieval tasks for the Mirflickr25K dataset. For the NUS-WIDE dataset, DMCMFH achieves increases of 14.44% on the I-T task and 15.36% on the T-I task.
Compared with supervised baseline methods, DMCMFH achieves increases of 0.22% and 0.21% on the I-T and T-I tasks for the MSCOCO dataset respectively, and achieves increases of 3.00% and 1.62% on the two retrieval tasks for the Mirflickr25K dataset. For the NUS-WIDE dataset, we obtain increases of 0.79% on the I-T task and 0.50% on the T-I task.
Additionally, we evaluate DMCMFH under different MAP@K settings (K = 10, 20).
The MAP values of DMCMFH and all baselines are illustrated in Tables 4 and 5. We can see that: DMCMFH outperforms the other baselines in most cases of both tasks.
In Table 4, compared with the unsupervised baseline methods, DMCMFH achieves increases of 28.60% and 30.60% on the I-T and T-I tasks for the Mirflickr25K dataset, respectively, and achieves increases of 21.60% and 29.60% on the two retrieval tasks for the NUS-WIDE dataset. For the MSCOCO dataset, our method achieves increases of 12.67% on the I-T task and 55.67% on the T-I task. Compared with supervised baseline methods, DMCMFH achieves increases of 0.99% and 0.49% on the I-T and T-I tasks for the Mirflickr25K dataset respectively, and achieves increases of 1.60% and 1.01% on the two retrieval tasks for the NUS-WIDE dataset. For the MSCOCO dataset, we obtain increases of 0.16% on the I-T task and 0.95% on the T-I task.
In Table 5, compared with the unsupervised baseline methods, DMCMFH achieves increases of 29.98% and 24.00% on the I-T and T-I tasks for the Mirflickr25K dataset, respectively, and achieves increases of 25.06% and 33.06% on the two retrieval tasks for the NUS-WIDE dataset. For the MSCOCO dataset, our method achieves increases of 17.73% on the I-T task and 52.73% on the T-I task. Compared with supervised baseline methods, DMCMFH achieves increases of 0.55% and 0.41% on the I-T and T-I tasks for the Mirflickr25K dataset respectively, and achieves increases of 3.32% and 2.32% on the two retrieval tasks for the NUS-WIDE dataset. For the MSCOCO dataset, we obtain increases of 0.87% on the I-T task and 0.19% on the T-I task.
The topN-precision curves are shown in Fig. 3. Note that since the implementation of NSDH [33] has not been released, we directly copy its MAP results from the original paper and omit its precision-recall and topN-precision curves. DMCMFH outperforms the baseline methods in most cases, and it is worth noting that DMCMFH performs better than the baselines on the top retrieved instances. Figure 4 shows the precision-recall curves with a 32-bit code length on the three datasets. DMCMFH outperforms the others on the I-T task in most cases and performs better on the top retrieved instances on the T-I task in most cases. Generally, users are most concerned with the front instances of the retrieved list. In this sense, DMCMFH achieves superior performance on both retrieval tasks in most cases.
The NDCG and F1 values of DMCMFH and all baselines are reported in Tables 6 and 7. DMCMFH outperforms the other baselines in most cases of both tasks. In Table 6, compared with the unsupervised baseline methods, the proposed DMCMFH achieves increases of 18.16% and 13.52% on the Image-Text and Text-Image tasks for the Mirflickr25K dataset, respectively, and achieves increases of 15.77% and 25.05% on the two retrieval tasks for the NUS-WIDE dataset, with gains on the MSCOCO dataset as well. The precision values of DMCMFH and all baselines are reported in Tables 8 and 9. DMCMFH outperforms the others on the I-T task in most cases and performs better on the top retrieved instances on the T-I task in most cases. In this sense, DMCMFH achieves superior performance on both retrieval tasks in most cases.
The above experimental results verify the superiority of DMCMFH. The performance improvement in DMCMFH comes from the proposed multi-similarity matrix that preserves absolute similarity and relative similarity simultaneously to capture the inter-modal and the intra-modal correlation.

Convergency analysis
Since the optimal solutions of DMCMFH are obtained by iterative updating rules, its training time is closely related to the number of iterations in the training phase. In this section, its convergence is validated experimentally, and the results are shown in Fig. 5. The objective values decrease monotonically, and the proposed optimization scheme converges quickly, in fewer than twenty iterations.

Sensitivity to parameters
To further investigate the sensitivity of the parameters, we conduct experiments under different values of α, γ, μ, β, η and ξ with 32-bit codes. There are six main parameters in DMCMFH: the weights of the text and image modalities are controlled by α; the contributions of the correlation-matching terms are controlled by μ, β, η and ξ; and the contribution of the regularization term is controlled by γ. Figure 6 shows the results for ξ on the three datasets: DMCMFH performs better when ξ is less than 0.0001, and as ξ increases from 0.0001 to 10, the MAP performance drops quickly. This indicates that the text modality plays a more significant role than the image modality in the performance of DMCMFH.
Comparison with deep learning-based methods

Since neural networks have achieved strong performance in modeling nonlinear correlations, many deep cross-modal hashing approaches have been proposed that integrate feature learning and hash code learning into a unified framework. We conduct comparison experiments with DCMH [37], DJSRH [29] and NSDH [33] on the Mirflickr25K dataset. Owing to the strong feature extraction ability of convolutional neural networks, we use CNN features for the proposed DMCMFH for a fair comparison.
The MAP values of DMCMFH and the deep learning-based methods are reported in Table 10. DMCMFH outperforms these baselines on both tasks, which demonstrates the effectiveness of our main proposals. The proposed DMCMFH achieves increases of 12.42% and 6.46% on the I-T and T-I tasks for the Mirflickr25K dataset, respectively. DMCMFH incorporates matrix factorization and multi-similarity consistency into a unified framework; therefore, in the learned common semantic space, the inter-modal similarity can be well preserved, which improves the discrimination of the shared space and consequently yields more discriminative hash codes.

Conclusion
In this article, a cross-modal hashing method named DMCMFH is proposed. Data from each modality are first mapped to an individual subspace, and a semantic subspace generated from the class labels is then constructed to align the semantic information of the different modalities. To capture the intrinsic local structure in the training data, we propose a multi-similarity matrix that not only better captures the latent semantic affinity but also combines the affinity information across different modalities, so that the semantic correlation between modalities is well maintained. Extensive experiments indicate the superiority of DMCMFH over other baseline models.
Recently, deep hashing has been widely studied, not only to capture the intra-modality and inter-modality semantic associations of images and texts, but also to obtain fine-grained semantic information within them. In future work, we will design a deep learning-based matrix factorization model to further improve the retrieval performance.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.