Effective deep learning-based multi-modal retrieval
DOI: 10.1007/s00778-015-0391-4
- Cite this article as: Wang, W., Yang, X., Ooi, B.C. et al. The VLDB Journal (2016) 25: 79.
Abstract
Multi-modal retrieval is emerging as a new search paradigm that enables seamless information retrieval from various types of media. For example, users can simply snap a movie poster to search for relevant reviews and trailers. The mainstream solution to the problem is to learn a set of mapping functions that project data from different modalities into a common metric space in which conventional indexing schemes for high-dimensional spaces can be applied. Since the effectiveness of the mapping functions plays an essential role in improving search quality, in this paper, we exploit deep learning techniques to learn effective mapping functions. In particular, we first propose a general learning objective that effectively captures both intramodal and intermodal semantic relationships of data from heterogeneous sources. Given the general objective, we propose two learning algorithms to realize it: (1) an unsupervised approach that uses stacked auto-encoders and requires minimal prior knowledge of the training data and (2) a supervised approach using a deep convolutional neural network and a neural language model. Our training algorithms are memory efficient with respect to the data volume. Given a large training dataset, we split it into mini-batches and adjust the mapping functions continuously for each batch. Experimental results on three real datasets demonstrate that our proposed methods achieve significant improvements in search accuracy over the state-of-the-art solutions.
Keywords
Deep learning · Multi-modal retrieval · Hashing · Auto-encoders · Deep convolutional neural network · Neural language model

1 Introduction
The prevalence of social networking has significantly increased the volume and velocity of information shared on the Internet. A tremendous amount of data in various media types is generated every day in social networking systems, and images and videos constitute the bulk of it. For instance, Twitter recently reported that over 340 million tweets were sent each day,^{1} while Facebook reported that around 300 million photographs were created each day.^{2} These data, together with other domain-specific data, such as medical, surveillance and sensory data, are big data that can be exploited for insights and contextual observations. However, effective retrieval of such huge amounts of media from heterogeneous sources remains a big challenge.
- 1. Intramodal search has been extensively studied and widely used in commercial systems. Examples include web document retrieval via keyword queries and content-based image retrieval.
- 2. Cross-modal search enables users to explore relevant resources from different modalities. For example, a user can use a tweet to retrieve relevant photographs and videos from other heterogeneous data sources, or search for relevant textual descriptions or videos by submitting an interesting image as a query.
We propose a general learning objective that effectively captures both intramodal and intermodal semantic relationships of data from heterogeneous sources. In particular, we differentiate modalities in terms of their representations’ ability to capture semantic information and robustness when noisy data are involved. Modalities with better representations are assigned larger weights for the sake of learning more effective mapping functions. Based on the objective function, we design an unsupervised algorithm using stacked auto-encoders (SAEs). The SAE is a deep learning model that has been widely applied in many unsupervised feature learning and classification tasks [13, 31, 34, 38]. If the media are annotated with semantic labels, we design a supervised algorithm to realize the learning objective. The supervised approach uses a deep convolutional neural network (DCNN) and a neural language model (NLM). It exploits the label information and can thus learn mapping functions that are robust against noisy input data. DCNNs and NLMs have shown great success in learning image features [8, 10, 20] and text features [28, 33], respectively.
Compared with existing solutions for multi-modal retrieval, our approaches exhibit three major advantages. First, our mapping functions are nonlinear and thus more expressive than the linear projections used in IMH [36] and CVH [21]. The deep structures of our models can capture more abstract concepts at higher layers, which is very useful in modeling categorical information of data for effective retrieval. Second, we require minimal prior knowledge for training. Our unsupervised approach only needs relevant data pairs from different modalities as the training input. The supervised approach requires additional labels for the media objects. In contrast, MLBE [43] and IMH [36] require a large intramodal similarity matrix for each modality. LSCMR [25] uses training examples, each of which consists of a list of objects ranked according to their relevance (based on manual labels) to the first one. Third, our training process is memory efficient because we split the training dataset into mini-batches and iteratively load and train each mini-batch in memory. In contrast, many existing works (e.g., CVH, IMH) have to load the whole training dataset into memory, which is infeasible when the training dataset is too large.
We propose a general learning objective for learning mapping functions to project data from different modalities into a common latent space for multi-modal retrieval. The learning objective differentiates modalities in terms of their input features’ quality of capturing semantics.
We realize the general learning objective by one unsupervised approach and one supervised approach based on deep learning techniques.
We conduct extensive experiments on three real datasets to evaluate our proposed mapping mechanisms. Experimental results show that the performance of our method is superior to state-of-the-art methods.
2 Problem statements
In our data model, the database \(\mathbb {D}\) consists of objects from multiple modalities. For ease of presentation, we use images and text as two sample modalities to explain our idea, i.e., we assume that \(\mathbb {D}=\mathbb {D}_I \bigcup \mathbb {D}_T\). An image (resp. a text document) is represented by a feature vector \(x \in \mathbb {D}_I\) (resp. \(y \in \mathbb {D}_T\)). To conduct multi-modal retrieval, we need a relevance measurement for a query and a database object. However, since the database consists of objects from different modalities, there is no widely accepted measurement. A common approach is to learn a set of mapping functions that project the original feature vectors into a common latent space such that semantically relevant objects (e.g., an image and its tags) are located close to each other. Consequently, our problem includes the following two sub-problems.
Definition 1
Common Latent Space Mapping
Given an image \(x\in \mathbb {D}_I\) and a text document \(y\in \mathbb {D}_T\), find two mapping functions \(f_I : \mathbb {D}_I\rightarrow \mathbb {Z}\), and \(f_T : \mathbb {D}_T\rightarrow \mathbb {Z}\), such that if x and y are semantically relevant, the distance between \(f_I(x)\) and \(f_T(y)\) in the common latent space \(\mathbb {Z}\), denoted by \(dist_{\mathbb {Z}}(f_I(x),f_T(y))\), is small.
The common latent space mapping provides a unified approach to measuring the distance between objects from different modalities. As long as all objects can be mapped into the same latent space, they become comparable. Once the mapping functions \(f_I\) and \(f_T\) have been determined, the multi-modal search can be transformed into the classic kNN problem, defined as follows.
Definition 2
Multi-Modal Search
Given a query object \(Q\in \mathbb {D}_q\) and a target domain \(\mathbb {D}_t\) \((q,t\in \{I,T\})\), find a set \(O \subset \mathbb {D}_t\) with k objects such that \(\forall o\in O\) and \(o'\in \mathbb {D}_t{\setminus } O\), \(dist_{\mathbb {Z}}(f_q(Q), f_t(o')) \ge dist_{\mathbb {Z}}(f_q(Q), f_t(o))\).
Since both q and t have two choices, four types of queries can be derived, namely \(\mathbb {Q}_{q\rightarrow t}\), where \(q,t\in \{I,T\}\). For instance, \(\mathbb {Q}_{I\rightarrow T}\) searches relevant text in \(\mathbb {D}_T\) given an image from \(\mathbb {D}_I\). By mapping objects from different high-dimensional feature spaces into a low-dimensional latent space, queries can be efficiently processed using existing multi-dimensional indexes [16, 40]. Our goal is then to learn a set of effective mapping functions which preserve well both intramodal semantics (i.e., semantic relationships within each modality) and intermodal semantics (i.e., semantic relationships across modalities) in the latent space. The effectiveness of the mapping functions is measured by the accuracy of multi-modal retrieval using latent features.
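Once the mapping functions are learned, a query is answered by mapping it into \(\mathbb {Z}\) and running kNN there. The following is a minimal sketch; the identity mapping and toy vectors are purely illustrative stand-ins for the learned \(f_q\) and the pre-mapped target features:

```python
import numpy as np

def knn_search(query, f_q, targets_z, k=5):
    """Map a query into the latent space with its modality's mapping
    function f_q, then return indices of the k nearest target objects.
    targets_z holds already-mapped latent features of the target domain."""
    z = f_q(query)
    dists = np.linalg.norm(targets_z - z, axis=1)  # Euclidean distance in Z
    return np.argsort(dists)[:k]

# Toy illustration: identity mapping over a shared 2-D latent space.
targets = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(knn_search(np.array([0.9, 1.1]), lambda q: q, targets, k=2))
```

The same routine serves all four query types \(\mathbb {Q}_{q\rightarrow t}\); only the mapping function and the target set change.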
3 Overview of multi-modal retrieval
4 Unsupervised approach: MSAE
4.1 Background: auto-encoder and stacked auto-encoder
4.2 Realization of the learning objective in MSAE
4.2.1 Modeling intramodal semantics of data
We extend SAEs to model the intramodal losses in the general learning objective (Eq. 1). Specifically, \(\mathcal {L}_I\) and \(\mathcal {L}_T\) are modeled as the reconstruction errors for the image SAE and the text SAE, respectively. Intuitively, if the two reconstruction errors are small, the latent features generated by the top auto-encoder would be able to reconstruct the original input well and, consequently, capture the regularities of the input data well. This implies that with small reconstruction error, two objects from the same modality that are similar in the original space would also be close in the latent space. In this way, we are able to capture the intramodal semantics of data by minimizing \(\mathcal {L}_I\) and \(\mathcal {L}_T\), respectively. But to use the SAEs we have to design the decoders of the bottom auto-encoders carefully to handle different input features.
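The role of the reconstruction error as an intramodal loss can be sketched for a single auto-encoder layer. All sizes, activations and random weights below are illustrative, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_error(x, W1, b1, W2, b2):
    """One auto-encoder layer: encode the input to a latent code h,
    decode it back, and return the squared reconstruction error
    ||x' - x||^2, which plays the role of the intramodal loss
    (L_I or L_T) for this modality."""
    h = np.tanh(x @ W1 + b1)      # encoder
    x_rec = np.tanh(h @ W2 + b2)  # decoder
    return np.sum((x_rec - x) ** 2)

# Hypothetical sizes: an 8-D input compressed to a 3-D latent code.
x = rng.standard_normal(8)
W1, b1 = rng.standard_normal((8, 3)) * 0.1, np.zeros(3)
W2, b2 = rng.standard_normal((3, 8)) * 0.1, np.zeros(8)
print(reconstruction_error(x, W1, b1, W2, b2))
```

Minimizing this quantity over the training data is what forces the latent code to capture the regularities of the input, as argued above.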
4.2.2 Modeling intermodal semantics of data
4.3 Training
Following the training flow shown in Fig. 2, in stage I we train an SAE for the image modality and an SAE for the text modality separately. Back-propagation [22] (see “Appendix”) is used to calculate the gradients of the objective loss, i.e., \(\mathcal {L}_I\) or \(\mathcal {L}_T\), w.r.t. the parameters. Then the parameters are updated according to mini-batch stochastic gradient descent (SGD) (see “Appendix”), which averages the gradients contributed by a mini-batch of training records (images or text documents) and then adjusts the parameters. The learned image and text SAEs are fine-tuned in stage II by back-propagation and mini-batch SGD with the objective of finding the optimal parameters that minimize the learning objective (Eq. 1). In our experiments, we observe that the training is more stable if we alternately adjust one SAE with the other SAE fixed.
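The alternating stage-II schedule can be mirrored by a toy mini-batch SGD loop. This is only a sketch: the scalar parameters and quadratic "intermodal" loss below stand in for the SAE weights and Eq. 1, which are actually updated via back-propagation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for stage II: alternately adjust one modality's
# parameters while the other is fixed, minimizing an "intermodal"
# distance ||p_I - p_T||^2 with mini-batch SGD steps.
p_I, p_T = rng.standard_normal(4), rng.standard_normal(4)
lr, batches_per_epoch = 0.1, 10

for epoch in range(50):
    for _ in range(batches_per_epoch):
        if epoch % 2 == 0:       # adjust the image side, text side fixed
            p_I -= lr * 2 * (p_I - p_T)
        else:                    # adjust the text side, image side fixed
            p_T -= lr * 2 * (p_T - p_I)

print(np.linalg.norm(p_I - p_T))  # the gap shrinks toward 0
```

Each step contracts the gap by a constant factor, which illustrates why fixing one side while adjusting the other still converges, and tends to be more stable than moving both at once.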
Setting \(\beta _I\) and \(\beta _T\): \(\beta _I\) and \(\beta _T\) are the weights of the reconstruction errors of the image and text SAEs, respectively, in the objective function (Eq. 1). As mentioned in Sect. 3, they are set based on the quality of each modality’s raw (input) feature. We use an example to illustrate the intuition. Consider a relevant object pair \((x_0\), \(y_0)\) from modalities x and y. Assume x’s feature is of low quality in capturing semantics (e.g., due to noise), while y’s feature is of high quality. If \(x_h\) and \(y_h\) are the latent features generated by minimizing the reconstruction error, then \(y_h\) preserves the semantics well, while \(x_h\) is not as meaningful due to the low quality of \(x_0\). To solve this problem, we include the intermodal distance between \(x_h\) and \(y_h\) in the learning objective function and assign a smaller weight to the reconstruction error of \(x_0\). This is equivalent to increasing the weight of the intermodal distance from \(x_h\) to \(y_h\). As a result, the training algorithm moves \(x_h\) toward \(y_h\) to make their distance smaller. In this way, the semantics of the low-quality \(x_h\) can be enhanced by the high-quality feature \(y_h\).
In the experiment, we evaluate the quality of each modality’s raw feature on a validation dataset by performing intramodal search against the latent features learned in single-modal training. The modality with worse search performance is assigned a smaller weight. Notice that because the dimensions of the latent space and the original space are usually of different orders of magnitude, the scales of \(\mathcal {L}_I\), \(\mathcal {L}_T\) and \(\mathcal {L}_{I,T}\) are different. In the experiment, we also scale \(\beta _I\) and \(\beta _T\) to make the losses comparable, i.e., within an order of magnitude.
5 Supervised approach: MDNN
5.1 Background: deep convolutional neural network and neural language model
Deep convolutional neural network (DCNN) DCNNs have shown great success in computer vision tasks [8, 10] since the first DCNN (called AlexNet) was proposed by Krizhevsky et al. [20]. A DCNN has a specialized connectivity structure, which usually consists of multiple convolutional layers followed by fully connected layers. These layers form stacked, multi-stage feature extractors, with higher layers generating more abstract features from lower ones. On top of the feature extractor layers, there is a classification layer. Please refer to [20] for a more comprehensive review of DCNNs.
The learned dense vectors can be used to construct a dense vector for one sentence or document (e.g., by averaging) or to calculate the similarity of two words, e.g., using the cosine similarity function.
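Both uses of the learned dense vectors can be sketched directly. The 3-D embeddings below are hand-set toys standing in for learned NLM vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word/document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def doc_vector(words, embeddings):
    """Build one dense vector for a sentence or document by
    averaging the dense vectors of its words."""
    return np.mean([embeddings[w] for w in words], axis=0)

# Hypothetical embeddings; real NLM vectors are learned, not hand-set.
emb = {"cat": np.array([1.0, 0.2, 0.0]),
       "dog": np.array([0.9, 0.3, 0.1]),
       "car": np.array([0.0, 0.1, 1.0])}
print(cosine(emb["cat"], emb["dog"]))                    # similar words: near 1
print(cosine(doc_vector(["cat", "dog"], emb), emb["car"]))  # dissimilar: near 0
```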
5.2 Realization of the learning objective in MDNN
5.2.1 Modeling intramodal semantics of data
Having witnessed the outstanding performance of DCNNs in learning features for visual data [8, 10] and NLMs in learning features for text data [33], we extend one instance of DCNN, AlexNet [20], and one instance of NLM, the Skip-Gram model (SGM) [28], to model the intramodal semantics of images and text, respectively.
5.2.2 Modeling intermodal semantics
After extending the AlexNet and skip-gram model to preserve the intramodal semantics for images and text, respectively, we jointly learn the latent features for image and text to preserve the intermodal semantics. We follow the general learning objective in Eq. 1 and realize \(\mathcal {L}_I\) and \(\mathcal {L}_T\) using Eqs. 12 and 15, respectively. Euclidean distance is used to measure the difference of the latent features for an image–text pair, i.e., \(\mathcal {L}_{I,T}\) is defined similarly as in Eq. 8. By minimizing the distance between the latent features of an image–text pair, we draw them closer in the latent space. In this way, the intermodal semantics are preserved.
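The shape of the combined objective can be sketched as follows; `joint_loss`, the scalar intramodal losses and all numeric values here are illustrative, not the actual Eqs. 12 and 15:

```python
import numpy as np

def joint_loss(x_lat, y_lat, loss_I, loss_T, beta_I, beta_T):
    """Combined objective in the spirit of Eq. 1: weighted intramodal
    losses plus the Euclidean intermodal distance between the latent
    features of one image-text pair."""
    dist_IT = np.sum((x_lat - y_lat) ** 2)  # intermodal term (cf. Eq. 8)
    return beta_I * loss_I + beta_T * loss_T + dist_IT

# Example with equal weights, as in the supervised setting (Sect. 5.3).
print(joint_loss(np.array([0.1, 0.9]), np.array([0.2, 0.8]),
                 loss_I=0.5, loss_T=0.4, beta_I=1.0, beta_T=1.0))
```

Gradients of this sum w.r.t. the network parameters are what back-propagation computes during the joint training below.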
5.3 Training
Similar to the training of MSAE, the training of MDNN consists of two steps. The first step trains the extended AlexNet and the extended NLM (i.e., MLP+Skip-Gram) separately.^{6} The learned parameters are used to initialize the joint model. All training is conducted by back-propagation using mini-batch SGD (see “Appendix”) to minimize the objective loss (Eq. 1).
Setting \(\beta _I\) and \(\beta _T\): In the unsupervised training, we assign a larger \(\beta _I\) to bias the training toward preserving the intramodal semantics of images if the input image feature is of higher quality than the input text feature, and vice versa. For supervised training, since the intramodal semantics are preserved based on reliable labels, we do not distinguish the image modality from the text one in the joint training. Hence, \(\beta _I\) and \(\beta _T\) are set to the same value. In our experiment, to keep the three losses within one order of magnitude, we scale the intermodal distance by 0.01.
6 Query processing
After the unsupervised (or supervised) training, each modality has a mapping function. Given a set of heterogeneous data sources, high-dimensional raw features (e.g., bag-of-visual-words or RGB features for images) are extracted from each source and mapped into a common latent space using the learned mapping functions. In MSAE, we use the image (resp. text) SAE to project image (resp. text) input features into the latent space. In MDNN, we use the extended DCNN (resp. extended NLM) to map the image (resp. text) input features into the common latent space.
To further improve the search efficiency, we convert the real-valued latent features into binary features, and search based on Hamming distance. The conversion is conducted using existing hash methods that preserve the neighborhood relationship. For example, in our experiment (Sect. 8.2), we use Spectral Hashing [41], which converts real-valued vectors (data points) into binary codes with the objective to minimize the Hamming distance of data points that are close in the original Euclidean space. Other hashing approaches like [12, 35] are also applicable.
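The binary search stage can be sketched as below. For brevity, a simple per-dimension thresholding stands in for Spectral Hashing, which additionally learns the projection so that nearby points receive nearby codes:

```python
import numpy as np

def binarize(latent, thresholds):
    """Threshold each latent dimension to get a binary code.
    (Illustrative stand-in for Spectral Hashing's learned codes.)"""
    return (latent > thresholds).astype(np.uint8)

def hamming_knn(query_code, db_codes, k=2):
    """Return indices of the k database codes closest to the
    query code under Hamming distance."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]

db = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.1], [0.8, 0.2, 0.9]])
th = np.array([0.5, 0.5, 0.5])
codes = binarize(db, th)
print(hamming_knn(binarize(np.array([0.7, 0.3, 0.6]), th), codes))
```

Hamming distances reduce to XOR-and-popcount operations, which is the source of the efficiency gain discussed next.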
The conversion from real-valued features to binary features trades off effectiveness for efficiency. Since there is information loss when real-valued data are converted into binary codes, the conversion affects retrieval performance. We study the trade-off between efficiency and effectiveness of binary features and real-valued features in the experiment section.
7 Related work
The key problem of multi-modal retrieval is to find an effective mapping mechanism, which maps data from different modalities into a common latent space. An effective mapping mechanism would preserve both intramodal semantics and intermodal semantics well in the latent space and thus generates good retrieval performance.
Linear projection has been studied to solve this problem [21, 36, 44]. The main idea is to find a linear projection matrix for each modality that maps semantic relevant data into similar latent vectors. However, when the distribution of the original data is nonlinear, it would be hard to find a set of effective projection matrices. CVH [21] extends the Spectral Hashing [41] to multi-modal data by finding a linear projection for each modality that minimizes the Euclidean distance of relevant data in the latent space. Similarity matrices for both intermodal data and intramodal data are required to learn a set of good mapping functions. IMH [36] learns the latent features of all training data first before it finds a hash function to fit the input data and output latent features, which could be computationally expensive. LCMH [44] exploits the intramodal correlations by representing data from each modality using its distance to cluster centroids of the training data. Projection matrices are then learned to minimize the distance of relevant data (e.g., image and tags) from different modalities.
Other recent works include CMSSH [4], MLBE [43] and LSCMR [25]. CMSSH uses a boosting method to learn the projection function for each dimension of the latent space. However, it requires prior knowledge such as semantically relevant and irrelevant pairs. MLBE explores correlations of data (both intermodal and intramodal similarity matrices) to learn latent features of training data using a probabilistic graphical model. Given a query, it is converted into the latent space based on its correlation with the training data. Such correlation is determined by labels associated with the query. However, labels of queries are usually not available in practice, which makes it hard to obtain their correlation with the training data. LSCMR [25] learns the mapping functions with the objective of optimizing ranking criteria (e.g., MAP). Ranking examples (a ranking example is a query and its ranking list) are needed for training. In our algorithm, we use simple relevant pairs (e.g., an image and its tags) as training input; thus, no prior knowledge such as irrelevant pairs, similarity matrices, ranking examples or labels of queries is needed.
Multi-modal deep learning [29, 37] extends deep learning to the multi-modal scenario. [37] combines two Deep Boltzmann Machines (DBM) (one for image and one for text) with a common latent layer to construct a multi-modal DBM. [29] constructs a bimodal deep auto-encoder with two deep auto-encoders (one for audio and one for video). Both models aim to improve the classification accuracy of objects with features from multiple modalities. They combine different features to learn a (high-dimensional) latent feature. In this paper, we aim to represent data with low-dimensional latent features to enable effective and efficient multi-modal retrieval, where both queries and database objects may have features from only one modality. DeViSE [9] from Google shares a similar idea with our supervised training algorithm. It embeds image features into the text space, and the embeddings are then used to retrieve similar text features for zero-shot learning. Notice that the text features used in DeViSE to learn the embedding function are generated from high-quality labels. However, in multi-modal retrieval, queries usually do not come with labels, and text features are generated from noisy tags. This makes DeViSE less effective in learning latent features that are robust against noisy input.
8 Experimental study
This section provides an extensive performance study of our solution in comparison with the state-of-the-art methods. We examine both efficiency and effectiveness of our method including training overhead, query processing time and accuracy. Visualization of the training process is also provided to help understand the algorithms. All experiments are conducted on CentOS 6.4 using CUDA 5.5 with NVIDIA GPU (GeForce GTX TITAN). The size of main memory is 64GB and the size of GPU memory is 6GB. The code and hyper-parameter settings are available online.^{7} In the rest of this section, we first give our evaluation metrics and then study the performance of unsupervised approach and supervised approach, respectively.
8.1 Evaluation metrics
Besides effectiveness, we also evaluate the training overhead in terms of training time and memory consumption. In addition, we report query processing time.
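Effectiveness is reported below as mean average precision (MAP), which can be computed as follows over ranked result lists with binary relevance judgments (the two query lists here are hypothetical):

```python
import numpy as np

def average_precision(relevance):
    """AP of one ranked result list: the mean of precision@i over
    the positions i that hold relevant items (1 = relevant)."""
    rel = np.asarray(relevance)
    hits = np.flatnonzero(rel)
    if hits.size == 0:
        return 0.0
    precisions = [(i + 1) / (pos + 1) for i, pos in enumerate(hits)]
    return float(np.mean(precisions))

def mean_average_precision(result_lists):
    """MAP: the mean of AP over all queries."""
    return float(np.mean([average_precision(r) for r in result_lists]))

# Two hypothetical queries with binary relevance judgments.
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]]))
```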
8.2 Experimental study of unsupervised approach
First, we describe the datasets used for unsupervised training. Second, an analysis of the training process by visualization is presented. Last, comparisons with previous works, including CVH [21], CMSSH [4] and LCMH [44], are provided.^{8}
8.2.1 Datasets
Unsupervised training requires relevant image–text pairs, which are easy to collect. We use three datasets to evaluate the performance: NUS-WIDE [5], Wiki [30] and Flickr1M [17].
Table 1 Statistics of datasets for unsupervised training
Dataset | NUS-WIDE | Wiki | Flickr1M |
---|---|---|---|
Total size | 190,421 | 2866 | 1,000,000 |
Training set | 60,000 | 2000 | 975,000 |
Validation set | 10,000 | 366 | 6000 |
Test set | 120,421 | 500 | 6000 |
Average text length | 6 | 131 | 5 |
Wiki This dataset contains 2,866 image–text pairs from Wikipedia’s featured articles. An article in Wikipedia contains multiple sections. The text and its associated image in one section are treated as an image–text pair. Every image–text pair has a label inherited from the article’s category (there are 10 categories in total). We randomly split the dataset into three subsets as shown in Table 1. For validation (resp. testing), we randomly select 50 (resp. 100) pairs from the validation (resp. test) set as the query set. Images are represented by 128-dimensional bag-of-visual-words vectors based on SIFT features. For text, we construct a vocabulary of the 1,000 most frequent words excluding stop words and represent one text section by a 1,000-dimensional word count vector as in [25]. The average number of words in one section is 131, which is much higher than that in NUS-WIDE. To avoid overflow in Eq. 6 and to smooth the text input, we normalize each unit x as \(\log (x+1)\) [32].
Flickr1M This dataset contains 1 million images associated with tags from Flickr, 25,000 of which are annotated with labels (there are 38 labels in total). The image feature is a 3,857-dimensional vector concatenating SIFT features, color histograms, etc. [37]. As for NUS-WIDE, the text feature is represented by a tag occurrence vector with 2,000 dimensions. All the image–text pairs without annotations are used for training. For validation and testing, we randomly select 6,000 annotated pairs each, among which 1,000 pairs are used as queries.
Before training, we use ZCA whitening [19] to normalize each dimension of the image feature to have zero mean and unit variance.
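A sketch of ZCA whitening: center the data, then rotate, rescale by the inverse square roots of the covariance eigenvalues, and rotate back, so that each dimension ends up with approximately zero mean and unit variance while staying close to the original feature axes (`eps` and the toy data below are illustrative):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA whitening: X is (n_samples, n_features); returns the
    centered, whitened data with ~identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    vals, vecs = np.linalg.eigh(cov)          # eigendecomposition of covariance
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W

rng = np.random.default_rng(2)
# Toy data with very different per-dimension scales.
Xw = zca_whiten(rng.standard_normal((500, 4)) * np.array([3.0, 1.0, 0.5, 2.0]))
print(np.round(np.cov(Xw, rowvar=False), 2))  # approximately the identity matrix
```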
8.2.2 Training visualization
In this section, we visualize the training process of MSAE using the NUS-WIDE dataset as an example to help understand the intuition of the training algorithm and the setting of the weight parameters, i.e., \(\beta _I\) and \(\beta _T\). Our goal is to learn a set of mapping functions such that the mapped latent features capture both intramodal semantics and intermodal semantics well. Generally, the intermodal semantics is preserved by minimizing the distance of the latent features of relevant intermodal pairs. The intramodal semantics is preserved by minimizing the reconstruction error of each SAE and through intermodal semantics (see Sect. 4 for details).
8.2.3 Evaluation of model effectiveness on NUS-WIDE dataset
We first examine the mean average precision (MAP) of our method using Euclidean distance on real-valued features. Let L be the dimension of the latent space. Our MSAE is configured with three layers, where the image features are mapped from 500 dimensions to 128 and finally to L. Similarly, the dimension of text features is reduced from \(1000\rightarrow 128\rightarrow L\) by the text SAE. \(\beta _I\) and \(\beta _T\) are set to 0 and 0.01, respectively, according to Sect. 8.2.2. We test L with values 16, 24 and 32. The results compared with other methods are reported in Table 2. Our MSAE achieves the best performance for all four search tasks. It demonstrates an average improvement of 17, 27, 21 and 26 % for \(\mathbb {Q}_{I\rightarrow I}\), \(\mathbb {Q}_{T\rightarrow T}\), \(\mathbb {Q}_{I\rightarrow T}\), and \(\mathbb {Q}_{T\rightarrow I}\), respectively. CVH and CMSSH prefer smaller L for queries \(\mathbb {Q}_{I\rightarrow T}\) and \(\mathbb {Q}_{T\rightarrow I}\). The reason is that they need to train far more parameters in higher dimensions, and the learned models are farther from the optimal solutions. Our method is less sensitive to the value of L. This is probably because, with multiple layers, MSAE has stronger representation power and thus is more robust under different L.
Table 2 Mean average precision on NUS-WIDE dataset
Task | \(\mathbb {Q}_{I\rightarrow I}\) | \(\mathbb {Q}_{T\rightarrow T}\) | \(\mathbb {Q}_{I\rightarrow T}\) | \(\mathbb {Q}_{T\rightarrow I}\) | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Algorithm | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | |
Dimension of latent space L | 16 | 0.353 | 0.355 | 0.365 | 0.417 | 0.373 | 0.400 | 0.374 | 0.498 | 0.328 | 0.391 | 0.359 | 0.447 | 0.331 | 0.337 | 0.368 | 0.432 |
24 | 0.343 | 0.356 | 0.358 | 0.412 | 0.373 | 0.402 | 0.364 | 0.480 | 0.333 | 0.388 | 0.351 | 0.444 | 0.323 | 0.336 | 0.360 | 0.427 | |
32 | 0.343 | 0.357 | 0.354 | 0.413 | 0.374 | 0.403 | 0.357 | 0.470 | 0.333 | 0.382 | 0.345 | 0.402 | 0.324 | 0.335 | 0.355 | 0.435 |
Table 3 Mean average precision on NUS-WIDE dataset (using binary latent features)
Task | \(\mathbb {Q}_{I\rightarrow I}\) | \(\mathbb {Q}_{T\rightarrow T}\) | \(\mathbb {Q}_{I\rightarrow T}\) | \(\mathbb {Q}_{T\rightarrow I}\) | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Algorithm | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | |
Dimension of latent space L | 16 | 0.353 | 0.357 | 0.352 | 0.376 | 0.387 | 0.391 | 0.379 | 0.397 | 0.328 | 0.339 | 0.359 | 0.364 | 0.325 | 0.346 | 0.359 | 0.392 |
24 | 0.347 | 0.358 | 0.346 | 0.368 | 0.392 | 0.396 | 0.372 | 0.412 | 0.333 | 0.346 | 0.353 | 0.371 | 0.324 | 0.352 | 0.353 | 0.380 | |
32 | 0.345 | 0.358 | 0.343 | 0.359 | 0.395 | 0.397 | 0.365 | 0.434 | 0.320 | 0.340 | 0.348 | 0.373 | 0.318 | 0.347 | 0.348 | 0.372 |
8.2.4 Evaluation of model effectiveness on Wiki dataset
We conduct similar evaluations on Wiki dataset as on NUS-WIDE. For MSAE with latent feature of dimension L, the structure of its image SAE is \(128\rightarrow 128\rightarrow L\), and the structure of its text SAE is \(1000\rightarrow 128\rightarrow L\). Similar to the settings on NUS-WIDE, \(\beta _I\) is set to 0 and \(\beta _T\) is set to 0.01.
Table 4 Mean average precision on Wiki dataset
Task | \(\mathbb {Q}_{I\rightarrow I}\) | \(\mathbb {Q}_{T\rightarrow T}\) | \(\mathbb {Q}_{I\rightarrow T}\) | \(\mathbb {Q}_{T\rightarrow I}\) | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Algorithm | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | LCMH | CMSSH | CVH | MSAE | |
Dimension of latent space L | 16 | 0.146 | 0.148 | 0.147 | 0.162 | 0.359 | 0.318 | 0.153 | 0.462 | 0.133 | 0.138 | 0.126 | 0.182 | 0.117 | 0.140 | 0.122 | 0.179 |
24 | 0.149 | 0.151 | 0.150 | 0.161 | 0.345 | 0.320 | 0.151 | 0.437 | 0.129 | 0.135 | 0.123 | 0.176 | 0.124 | 0.138 | 0.123 | 0.168 | |
32 | 0.147 | 0.149 | 0.148 | 0.162 | 0.333 | 0.312 | 0.152 | 0.453 | 0.137 | 0.133 | 0.128 | 0.187 | 0.119 | 0.137 | 0.123 | 0.179 |
8.2.5 Evaluation of model effectiveness on Flickr1M dataset
We configure a four-layer image SAE as \(3857\rightarrow 1000\rightarrow 128\rightarrow L\) and a four-layer text SAE as \(2000\rightarrow 1000\rightarrow 128\rightarrow L\) for this dataset. Different from the other two datasets, the original image feature of Flickr1M is of higher quality as it consists of both local and global features. For intramodal search, the image latent feature performs as well as the text latent feature. Therefore, we set both \(\beta _I\) and \(\beta _T\) to 0.01.
Table 5 Mean average precision on Flickr1M dataset
Task | \(\mathbb {Q}_{I\rightarrow I}\) | \(\mathbb {Q}_{T\rightarrow T}\) | \(\mathbb {Q}_{I\rightarrow T}\) | \(\mathbb {Q}_{T\rightarrow I}\) | |||||
---|---|---|---|---|---|---|---|---|---|
Algorithm | CVH | MSAE | CVH | MSAE | CVH | MSAE | CVH | MSAE | |
L | 16 | 0.622 | 0.621 | 0.610 | 0.624 | 0.610 | 0.632 | 0.616 | 0.608 |
24 | 0.616 | 0.619 | 0.604 | 0.629 | 0.605 | 0.628 | 0.612 | 0.612 | |
32 | 0.603 | 0.622 | 0.587 | 0.630 | 0.588 | 0.632 | 0.598 | 0.614 |
8.2.6 Evaluation of training cost
Figure 13b shows the memory usage of the training process. Given a training dataset, MSAE splits it into mini-batches and conducts the training batch by batch. It stores the model parameters and one mini-batch in memory, both of which are independent of the training dataset size. Hence, the memory usage stays constant when the size of the training dataset increases. The actual minimum memory usage for MSAE is smaller than 10GB. In our experiments, we allocate more space to load multiple mini-batches into memory to reduce disk reads. CVH has to load all training data into memory for matrix operations. Therefore, its memory usage increases with the size of the training dataset.
8.2.7 Evaluation of query processing efficiency
By taking into account the results from the effectiveness evaluations, we can see that there is a trade-off between efficiency and effectiveness in feature representation. The binary encoding greatly improves efficiency at the expense of accuracy degradation (Table 3).
8.3 Experimental study of supervised approach
8.3.1 Datasets
Supervised training requires the input image–text pairs to be associated with additional semantic labels. Since Flickr1M has no labels and the Wiki dataset has too few labels, which are not discriminative enough, we use the NUS-WIDE dataset to evaluate the performance of supervised training. We extract 203,400 labeled pairs, among which 150,000 are used for training. The remaining pairs are evenly partitioned into two sets for validation and testing. From both sets, we randomly select 2,000 pairs as queries. This labeled dataset is named NUS-WIDE-a.
Statistics of datasets for supervised training
Dataset | NUS-WIDE-a | NUS-WIDE-b |
---|---|---|
Total size | 203,400 | 76,000 |
Training set | 150,000 | 60,000 |
Validation set | 26,700 | 80,000 |
Test set | 26,700 | 80,000 |
8.3.2 Visualization of training process
The MAPs for all types of searches using supervised training model are shown in Fig. 15b. As can be seen, the MAPs first gradually increase and then become stable in the last few iterations. It is worth noting that the MAPs are much higher than the results of unsupervised training (MSAE) in Fig. 11. There are two reasons for the superiority. First, the supervised training algorithm (MDNN) exploits DCNN and NLM to learn better visual and text features, respectively. Second, labels bring in more semantics and make latent features more robust to noises in input data (e.g., visual irrelevant tags).
8.3.3 Evaluation of model effectiveness on NUS-WIDE dataset
Mean average precision using real-valued latent feature
Task | \(\mathbb {Q}_{I\rightarrow I}\) | \(\mathbb {Q}_{T\rightarrow T}\) | \(\mathbb {Q}_{I\rightarrow T}\) | \(\mathbb {Q}_{T\rightarrow I}\) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Algorithm | MDNN | DeViSE-L | DeViSE-T | MDNN | DeViSE-L | DeViSE-T | MDNN | DeViSE-L | DeViSE-T | MDNN | DeViSE-L | DeViSE-T | |
Dataset | NUS-WIDE-a | 0.669 | 0.5619 | 0.5399 | 0.541 | 0.468 | 0.464 | 0.587 | 0.483 | 0.517 | 0.612 | 0.502 | 0.515 |
NUS-WIDE-b | 0.556 | 0.432 | 0.419 | 0.466 | 0.367 | 0.385 | 0.497 | 0.270 | 0.399 | 0.495 | 0.222 | 0.406 |
8.3.4 Evaluation of training cost
8.3.5 Comparison with unsupervised approach
By comparing Tables 7 and 2, we can see that the supervised approach—MDNN, performs better than the unsupervised approach—MSAE. This is not surprising because MDNN consumes more information than MSAE. Although the two methods share the same general training objective, the exploitation of label semantics helps MDNN learn better features in capturing the semantic relevance of the data from different modalities. For memory consumption, MDNN and MSAE perform similarly (Fig. 18b).
9 Conclusion
In this paper, we have proposed a general framework (objective) for learning mapping functions for effective multi-modal retrieval. Both intramodal and intermodal semantic relationships of data from heterogeneous sources are captured in the general learning objective function. Given this general objective, we have implemented one unsupervised training algorithm and one supervised training algorithm separately to learn the mapping functions based on deep learning techniques. The unsupervised algorithm uses stacked auto-encoders as the mapping functions for the image modality and the text modality. It only requires simple image–text pairs for training. The supervised algorithm uses an extend DCNN as the mapping function for images and an extend NLM as the mapping function for text data. Label information is integrated in the training to learn robust mapping functions against noisy input data. The results of experiment confirm the improvements of our method over previous works in search accuracy. Based on the processing strategies outlined in this paper, we have built a distributed training platform (called SINGA) to enable efficient deep learning training that supports training large-scale deep learning models. We shall report the system architecture and its performance in a future work.
We tried both the Sigmoid function and ReLU activation function for s(). ReLU offers better performance.
Notice that in our model, we fix the word vectors learned by SGM. It can also be fine-tuned by integrating the objective of SGM (Eq. 11) into 15.
In our experiment, we use the parameters trained by Caffe [18] to initialize the AlexNet to accelerate the training. We use Gensim (http://radimrehurek.com/gensim/) to train the skip-gram model with the dimension of word vectors being 100.
The code and parameter configurations for CVH and CMSSH are available online at http://www.cse.ust.hk/~dyyeung/code/mlbe.zip. The code for LCMH is provided by the authors. Parameters are set according to the suggestions provided in the paper.
The last layer with two units is for visualization purpose, such that the latent features could be showed in a 2D space.
Acknowledgments
This work is supported by A*STAR Project 1321202073. Xiaoyan Yang is supported by Human-Centered Cyber-physical Systems (HCCS) programme by A*STAR in Singapore.