Abstract
Crossmodal similarity query has become a prominent research topic for managing multimodal datasets such as images and texts. Existing research generally focuses on query accuracy by designing complex deep neural network models and rarely considers query efficiency and interpretability at the same time, although both are vital properties of a crossmodal semantic query processing system on large-scale datasets. In this work, we investigate multigrained common semantic embedding representations of images and texts and integrate an interpretable query index into the deep neural network by developing a novel Multigrained Crossmodal Query with Interpretability (MCQI) framework. The main contributions are as follows: (1) By integrating coarse-grained and fine-grained semantic learning models, a multigrained crossmodal query processing architecture is proposed to ensure the adaptability and generality of query processing. (2) In order to capture the latent semantic relation between images and texts, the framework combines LSTM and an attention model, which enhances query accuracy for the crossmodal query and lays the foundation for interpretable query processing. (3) An index structure and a corresponding nearest neighbor query algorithm are proposed to boost the efficiency of interpretable queries. (4) A distributed query algorithm is proposed to improve the scalability of our framework. Compared with state-of-the-art methods on widely used crossmodal datasets, the experimental results show the effectiveness of our MCQI approach.
Introduction
With the rapid development of computer science and technology, multimedia data including images and texts have been emerging on the Internet and have become the main way in which people learn about the world. Consequently, crossmodal similarity query has become an essential technique with wide applications, such as search engines and multimedia data management. Crossmodal similarity query [1] is a query paradigm in which users obtain results of one type by submitting a query of another type. In this work, we mainly focus on queries between images and texts. For instance, when a user submits a textual description of a football game, the most relevant images in the dataset can be fetched, and vice versa. Because crossmodal similarity query must discover latent semantic relationships among different types, it has attracted great interest from researchers.
Owing to the significant advantage of deep neural networks (DNN) in feature extraction, DNN models are widely used for crossmodal similarity query [2]. Their complex structure and high-dimensional feature maps give deep neural networks considerable power to learn nonlinear relationships; at the same time, complex models introduce some drawbacks. First, the numerous parameters of deep neural networks make the query process and results difficult to explain. That is, these models have weak interpretability, which is an important property for a general and reliable crossmodal query system. Second, in order to find the most similar data objects, the cosine similarity between the high-dimensional feature vector of the query object and that of each object in the whole dataset typically has to be computed. Hence, for a large-scale dataset, the computation cost is so high that the query response time becomes unacceptable. Existing research tends to focus on designing complicated composite models to enhance query accuracy and rarely takes query interpretability, efficiency and scalability into account at the same time.
Interpretability of the query framework improves the credibility of query results. Query efficiency keeps the response time of queries acceptable. Query scalability enhances the adaptability of query methods, especially when faced with large-scale data. Hence, it is necessary to develop a crossmodal similarity query framework with interpretability, efficiency and scalability. There are three challenges in achieving this goal. The first challenge is how to bridge the semantic gap among different modalities, which needs a sophisticated model to capture the common semantics at both coarse and fine granularity. The second challenge is how to enhance the interpretability of a query framework with a complex structure and millions of parameters. The third challenge is how to make the query model scalable so that it can process large-scale data, which are ubiquitous nowadays.
Our core insight is that we can leverage a deep neural network model to capture multigrained crossmodal common semantics and build an efficient hybrid index with interpretability and scalability. Hence, in this work, we propose a novel, efficient and effective Multigrained Crossmodal Query framework with Interpretability (MCQI). To ensure the adaptability and generality of our framework, when training common feature vectors for different types we first capture coarse-grained and fine-grained semantic information with different networks and then combine them. To discover the latent semantic relations between images and texts, we integrate an LSTM model and an attention model; this also constructs the data foundation of crossmodal correlative information. In addition, for the sake of query efficiency, we build an index supporting interpretable queries. Further, to enhance the scalability of our framework, a distributed query algorithm is proposed. Finally, to confirm the efficiency and effectiveness of our approach, we systematically evaluate its performance by comparing it with 8 state-of-the-art methods on five widely used multimodal datasets. Concretely, our contributions are as follows:

(1) By integrating coarse-grained and fine-grained semantic learning models, a multigrained crossmodal query processing architecture is proposed to ensure the adaptability and generality of query processing.
(2) In order to capture the latent semantic relation between images and texts, the framework combines LSTM and an attention model, which enhances query accuracy for the crossmodal query and lays the foundation for interpretable query processing.
(3) An index structure and a corresponding nearest neighbor query algorithm are proposed to boost the efficiency of interpretable queries.
(4) A distributed query algorithm is proposed to improve the scalability of our framework.
The remainder of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces the problem definitions, and Sect. 4 describes our MCQI framework and a kNN query algorithm in detail. Section 5 gives a distributed query algorithm to enhance the scalability of our framework. Section 6 provides experimental results and analysis on five datasets, and we conclude in Sect. 7.
Related Work
In this section, we briefly review research related to crossmodal query, including crossmodal retrieval, latent semantic alignment, crossmodal hashing and distributed similarity query.
Crossmodal Retrieval
Traditional methods mainly learn linear projections for different data types. Canonical correlation analysis (CCA) [3] learns a crossmodal common representation by maximizing the pairwise correlation and is a classical baseline method for crossmodal measurement. Beyond pairwise correlation, joint representation learning (JRL) [4] makes use of semi-supervised regularization and semantic information and can jointly learn common representation projections for up to five data types. S2UPG [5] further improves JRL by constructing a unified hypergraph to learn the common space using fine-grained information. In recent years, DNN-based crossmodal retrieval has become an active research topic. Deep canonical correlation analysis (DCCA) [6] combines DNN with CCA to maximize the correlation on top of two subnetworks. UCAL [7] is an unsupervised crossmodal retrieval method based on adversarial learning, which takes a modality classifier as a discriminator to distinguish the modality of learned features. The DADN [8] approach addresses the problem of zero-shot crossmedia retrieval by learning common embeddings with category semantic information. These methods mainly focus on query accuracy rather than query efficiency and interpretability.
Latent Semantic Alignment
Latent semantic alignment is the foundation for interpretable queries. [9] embeds patches of images and dependency tree relations of sentences in a common embedding space and explicitly reasons about their latent, intermodal correspondences. Adding a generation step, [10] proposes a model which learns to score sentence and image similarity as a function of RCNN object detections and the outputs of a bidirectional RNN. By incorporating attention into neural networks for vision-related tasks, [11, 12] investigate models that can attend to salient parts of an image while generating its caption. These methods inspire ideas for achieving interpretable crossmodal queries, but neglect the issues of query granularity and efficiency.
Crossmodal Hashing
Deep crossmodal hashing (DCMH) [13] combines hashing learning and deep feature learning by preserving the semantic similarity between modalities. Correlation autoencoder hashing (CAH) [14] embeds the maximum crossmodal similarity into hash codes using nonlinear deep autoencoders. Correlation hashing network (CHN) [15] jointly learns image and text representations tailored to hash coding and formally controls the quantization error. Pairwise relationship guided deep hashing (PRDH) [16] jointly uses two types of pairwise constraints, from intramodality and intermodality, to preserve the semantic similarity of the learned hash codes. [17] proposes a generative adversarial network to model crossmodal hashing in an unsupervised fashion and a correlation-graph-based learning approach to capture the underlying manifold structure across different modalities. For large high-dimensional data, hashing is a common tool that can achieve sublinear time complexity for data retrieval. However, once a hash index has been constructed on Hamming space, it is difficult to obtain flexible query granularity and reasonable interpretability.
Distributed Similarity Query
Existing methods for distributed similarity queries in metric spaces can be partitioned into two categories [18]. The first category utilizes basic metric partitioning principles to distribute the data over the underlying network. [19] proposes a distributed index, the GHT* index, which can exploit parallelism in a dynamic network of computers by putting a part of the index structure in every network node. [20] proposes a mapping mechanism that makes it possible to store the data in well-established structures such as the B-tree. The second category utilizes index integration techniques to distribute the data. [21] integrates the R-tree and the CAN overlay to process multidimensional data in a cloud system. [22] combines the B-tree and the BATON overlay to provide a distributed index that has high scalability but incurs low maintenance cost. Both choose a part of the local index nodes to build global index nodes by computing a cost model. [23] integrates a quadtree index with the Chord overlay to enable more powerful accesses to data in P2P networks. However, these methods are not sufficient for our purpose. In particular, they are not query sensitive; that is, they cannot adjust the distribution of data to different query loads and therefore cannot keep the load balanced, which is also important in a distributed environment. In this paper, we adopt a pivot-mapping-based method.
Problem Description
For crossmodal similarity query, given a query object of one type, most similar objects of the other type in the dataset should be returned. The formal definition is shown below.
The multimodal dataset consists of two modalities with m texts and n images, which is denoted as \(D = \{ D^{t}, D^{i} \}\). The texts are originally encoded as one-hot codes, and in the set D the data of the text modality are denoted as \(D^{t} = \left\{ x_{k}^{t} \right\}_{k = 1}^{m}\), where the kth text object is defined as \(x_{k}^{t} \in R^{l_{k} \times c}\) with sentence length \(l_{k}\) and vocabulary size c. The data of the image modality are denoted as \(D^{i} = \left\{ x_{k}^{i} \right\}_{k = 1}^{n}\), where the kth image instance is defined as \(x_{k}^{i} \in R^{w \times h \times c^{\prime}}\) with image resolution \(w \times h\) and number of color channels \(c^{\prime}\). Besides, the pairwise correspondence is denoted as \((x_{k}^{t}, x_{k}^{i})\), which means that the two instances of different types are strongly semantically relevant. Crossmodal similarity query means that, given one query object, the task is to find objects of the other modality that share relevant semantics with it. The kNN query is a classical type of similarity query, and its definition is given as follows.
Definition 1
(kNN Query). Given an object q, an integer k ≥ 1, a dataset D and a similarity function SIM, the k nearest neighbors query kNN computes a size-k subset S ⊆ D, s.t. \(\forall o_{i} \in S, o_{j} \in D \setminus S: SIM\left( q, o_{i} \right) \ge SIM\left( q, o_{j} \right).\) In this work, we use cosine similarity as the similarity function.
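Definition 1 can be made concrete with an exhaustive scan. The following is a minimal sketch rather than the paper's algorithm: the function names `cosine` and `knn` and the toy 2-D vectors are illustrative assumptions, and the scan is exactly the pairwise computation that the index in Sect. 4 is meant to avoid.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn(q, dataset, k, sim=cosine):
    """Return the k objects of `dataset` most similar to q by exhaustive scan."""
    ranked = sorted(dataset, key=lambda o: sim(q, o), reverse=True)
    return ranked[:k]

# Hypothetical 2-D embeddings, for illustration only.
D = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
q = (1.0, 0.05)
S = knn(q, D, k=2)
```

Every object in S is at least as similar to q as every object outside S, which is the condition stated in the definition.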
Table 1 lists the notations used throughout this paper. It mainly covers notations that are used far from their definitions.
Proposed Model
In this section, we describe the proposed MCQI framework in detail. As shown in Fig. 1, the MCQI framework consists of two stages. The first stage is the learning stage, which models the common embedding representation of multimodal data by fusing coarse-grained and fine-grained semantic information. The second stage is the index construction stage, in which an M-tree index and an inverted index are integrated to process queries efficiently and interpretably. In the following, we introduce the framework in terms of embedding representations of multimodal data and interpretable query processing.
Embedding Representations of Multimodal Data
In the first stage, MCQI learns the embedding representation of multimodal data by fusing coarse-grained and fine-grained semantic information.
Fine-grained Embedding Learning with Local Semantics
Different methods are used to extract local semantic features for texts and images. For texts, EmbedRank [24] is utilized to extract keyphrases. Then, the pretrained model Sent2vec [25] is used to compute the embedding of each keyphrase. Finally, a three-layer fully connected neural network maps each keyphrase into the common embedding space with dimension d_l; the result is denoted as ts_{pq}, the embedding representation of the qth keyphrase of the pth text description.
For images, a Region Convolutional Neural Network (RCNN) [26] is utilized to detect objects in images. We use the top detected locations of the entire image as local semantic features. For each bounding box, a pretrained convolutional neural network computes a feature vector from the visual matrix in the box, and a transition matrix transforms this vector into the common space with dimension d_l. The result is is_{uv}, the embedding representation of the vth bounding box of the uth image.
Typically, for a matched pair of text and image, at least one keyphrase in the text is semantically relevant to a certain bounding box in the image; that is, at least one common embedding vector of the text instance is close to a certain common embedding vector of the image instance. Based on this intuition, and following the hinge rank loss function, we set the original objective of fine-grained embedding learning as follows:
Here, Pnum is the number of matched pairs in the training sample set and Anum is the size of the training sample set. The factor \(\left( \alpha \frac{Pnum}{Anum} \right)^{I\left( p = u \right)} \left( 1 - \alpha \frac{Pnum}{Anum} \right)^{I\left( p \ne u \right)}\) is used to balance positive and negative samples. \(\frac{ts_{pq} \cdot is_{uv}}{\left| ts_{pq} \right| \cdot \left| is_{uv} \right|}\) is the cosine similarity of the two embedding vectors. M is the margin constant, which defines the tolerance for true positives and true negatives. The closer M is to 1, the stricter the model is in semantically recognizing true positives and true negatives.
The cost C_{b} is computed over all pairs of local features between text instances and image instances. However, in many cases only a few local features of two semantically relevant instances actually match, so similar pairs are difficult to identify by computing over all pairs. To address this problem, following MILE [9], we make a multiple instance learning extension of formula (1), as shown in formula (2). For each text instance, local features of the matched image are put into a positive bag, while local features of other images are treated as negative samples.
Here, \(B_{q}\) is the positive bag of the qth feature vector and \(k_{qv}\) is the correlation index, which indicates whether the corresponding text instance and image instance are matched. It is worth noting that each feature vector \(is_{uv}\) and the corresponding bounding box are stored in the storage system for processing interpretable queries.
Coarse-grained Embedding Learning with Global Semantics
The coarse-grained embedding network tries to capture global common semantics between texts and images. For texts, the Universal Sentence Encoder [27] is utilized to extract feature vectors of texts, and fully connected layers transform these feature vectors into the global common embedding space with dimension d_g.
For images, inspired by [11], a pretrained LSTM with a soft attention model is integrated to translate images into sequential representations. For an image, the feature maps before classification in a pretrained RCNN network and the whole image's feature maps before the fully connected layers in a pretrained CNN network are combined into feature vectors, denoted as \(a = \left\{ a_{i} \right\}_{i = 1}^{LV}\), where LV is the number of feature vectors.
Our implementation of LSTM with soft attention is based on [11]. Here \(a_{i}\) is the input, and y and \(\alpha_{ti}\) are the outputs: y is the generated sequential text and \(\alpha_{ti}\) represents the importance of feature vector \(a_{i}\) when generating the tth word. Please note that each word y_{t} has an attention weight \(\alpha_{ti}\) for each feature vector \(a_{i}\), and each tuple tu_{t} = < y_{t}, imageID, \(\alpha_{ti}\), xloc_{i}, xloc1_{i}, xloc2_{i}, yloc1_{i}, yloc2_{i} > is stored for answering future queries, where imageID is the image's unique identifier and xloc_{i}, xloc1_{i}, xloc2_{i}, yloc1_{i}, yloc2_{i} are the corresponding coordinate positions of \(a_{i}\) in the image. We collect all tuples as the set TU = {tu_{t}}.
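To illustrate how the stored tuples can later explain a match, the sketch below keeps TU in a small inverted index keyed by the generated word. It is a hypothetical layout, not the storage format of MCQI: the class `AttentionTuple`, the functions `build_inverted_index` and `explain`, and all sample values are assumptions, and the coordinates are kept as an opaque tuple rather than interpreted.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AttentionTuple:
    word: str        # y_t, the t-th generated word
    image_id: str    # imageID
    weight: float    # alpha_ti, attention of word t on feature vector a_i
    region: tuple    # coordinate positions of a_i in the image, stored as given

def build_inverted_index(tuples):
    """Map each generated word to the tuples in which it occurs."""
    index = defaultdict(list)
    for tu in tuples:
        index[tu.word].append(tu)
    return index

def explain(index, word, image_id):
    """Tuples linking `word` to regions of `image_id`, strongest attention first."""
    hits = [tu for tu in index.get(word, []) if tu.image_id == image_id]
    return sorted(hits, key=lambda tu: tu.weight, reverse=True)

# Made-up tuples, for illustration only.
TU = [
    AttentionTuple("ball", "img1", 0.7, (10, 40, 12, 38)),
    AttentionTuple("ball", "img1", 0.2, (0, 100, 0, 80)),
    AttentionTuple("player", "img1", 0.6, (30, 90, 5, 75)),
    AttentionTuple("ball", "img2", 0.5, (3, 9, 4, 8)),
]
index = build_inverted_index(TU)
top = explain(index, "ball", "img1")
```

Looking up a query keyword then yields the image regions that received the most attention when that word was generated, which is the kind of evidence an interpretable query can return.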
For the generated sequential text \(y\), the Universal Sentence Encoder is utilized to generate the coarse-grained representative vector of \(y\), denoted as GIV, while the coarse-grained representative vector of the original paired training text, also produced by the Universal Sentence Encoder, is denoted as OTV.
Intuitively, the global training objective function is as follows.
Multigrained Objective Function
We are now ready to formulate the multigrained objective function. The objective function is designed according to two criteria. First, matched pairs of images and texts are likely to have similar patches, which is captured by C_{P}. Second, matched pairs of images and texts probably have similar global semantics, which is captured by C_{G}. By integrating C_{P} and C_{G}, the objective function is defined as follows.
where θ is a shorthand for the parameters of our model and \(\alpha , \beta , \gamma\) are hyperparameters, which are selected by cross-validation. \(\left\| \theta \right\|_{2}^{2}\) is the regularization term.
Optimization
The proposed model consists of two branches, which are designed for common fine-grained semantics and coarse-grained semantics, respectively. Naturally, the training process is divided into two stages, i.e., branch training and joint training. Both training stages use stochastic gradient descent (SGD) with a batch size of 32, a momentum of 0.9 and a weight decay of 0.00005.
Stage 1: In this stage, the branches for common fine-grained semantics and coarse-grained semantics are trained in turn, taking formula (2) and formula (3) as loss functions, respectively. In the fine-grained branch, the pretrained Sent2Vec model and RCNN model are utilized, while in the coarse-grained branch, the pretrained Universal Sentence Encoder model and LSTM model are utilized. The default parameters of those pretrained models are used, and they are kept fixed at this stage. The other parameters of our model, including those of the attention mechanism, are initialized with the Xavier algorithm [28].
Stage 2: After all branch networks are trained, we jointly fine-tune the parameters of the entire model by combining the loss terms over all granularities in formula (4).
Interpretable Query Processing
In the MCQI framework, images and texts are represented by high-dimensional feature vectors, which include fine-grained and coarse-grained semantic features. Denote by IFV_{i} the feature vectors of the ith instance Ins_{i}; then IFV_{i} = {CFVF_{i}, CFVC_{i}}, where CFVF_{i} and CFVC_{i} are the common fine-grained semantic feature and the coarse-grained semantic feature of Ins_{i}, respectively. Given a query instance, i.e., an image or text instance, in order to find the matched crossmodal instances, i.e., the most relevant text or image instances, the similarity between two crossmodal instances is computed from cosine similarities as shown in formula (5).
Here, Ins_{i} and Ins_{j} are two crossmodal instances, \(\delta\) is the weight factor, Cosine is the cosine similarity function.
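The combined similarity can be sketched as a δ-weighted sum of the fine-grained and coarse-grained cosine similarities, which is the form that appears inside Observation 1. This is an illustrative reading, not a transcription of formula (5): each instance is assumed to carry exactly one fine-grained and one coarse-grained vector, and the function names and the default `delta=0.5` are arbitrary choices.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sim(ins_i, ins_j, delta=0.5):
    """delta * Cosine(CFVF_i, CFVF_j) + (1 - delta) * Cosine(CFVC_i, CFVC_j).

    Each instance is a pair (CFVF, CFVC); delta is the weight factor.
    """
    (fine_i, coarse_i), (fine_j, coarse_j) = ins_i, ins_j
    return delta * cosine(fine_i, fine_j) + (1 - delta) * cosine(coarse_i, coarse_j)

# Toy instances: identical fine-grained vectors, orthogonal coarse-grained vectors.
text = ((1.0, 0.0), (0.0, 1.0))
image = ((1.0, 0.0), (1.0, 0.0))
score = sim(text, image, delta=0.3)
```

With a fine-grained cosine of 1 and a coarse-grained cosine of 0, the score equals δ, which shows how the weight factor shifts emphasis between local and global semantics.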
A naive method to obtain the matched crossmodal instances is pairwise computation; however, this method is inefficient. Particularly when the dataset is large and the dimension of the vectors is high, the computation is expensive. To address this, an inverted index and an M-tree index are integrated into the MCQI model. The M-tree index increases the efficiency of queries and the inverted index enhances their interpretability. Index construction and the query processing method based on the indices are discussed separately as follows.
Index Construction
As formula (5) shows, the similarity between two instances is mainly calculated from the cosine similarity of two types of feature vectors. By assuming that the variables obey a uniform distribution, we get Observation 1 below, which shows that the cosine similarity between the whole feature vectors of Ins_{i} and Ins_{j} is close to SIM(Ins_{i}, Ins_{j}).
Observation 1
For a random variable \(\delta \in \left[ 0.2, 0.8 \right]\), \(\exists \varepsilon , \sigma \in \left[ 0, 1 \right],\) s.t. \(P\left( \left| \left( \delta \frac{CFVF_{i} \cdot CFVF_{j}}{\left| CFVF_{i} \right| * \left| CFVF_{j} \right|} + \left( 1 - \delta \right) \frac{CFVC_{i} \cdot CFVC_{j}}{\left| CFVC_{i} \right| * \left| CFVC_{j} \right|} \right) - \frac{IFV_{i} \cdot IFV_{j}}{\left| IFV_{i} \right| * \left| IFV_{j} \right|} \right| < \varepsilon \right) > \sigma ,\) i.e., \(P\left( \left| SIM\left( Ins_{i}, Ins_{j} \right) - Cosine\left( Ins_{i}, Ins_{j} \right) \right| < \varepsilon \right) > \sigma\).
This observation is obtained by statistical hypothesis testing, which will be illustrated in the experiments. By setting DIF = \(\left| SIM\left( Ins_{i}, Ins_{j} \right) - Cosine\left( Ins_{i}, Ins_{j} \right) \right|\), we get P(DIF < \(\varepsilon\)) > \(\sigma\). In experiments, when \(\varepsilon\) = 0.05, we have \(\sigma\) = 0.9, and when \(\varepsilon\) = 0.1, we have \(\sigma\) = 0.98.
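The quantity DIF can be estimated empirically in a few lines. The sketch below is only a toy experiment on random unit vectors and says nothing about the values of ε and σ reported for the real datasets. It assumes that the whole feature vector IFV is the concatenation of CFVF and CFVC and that δ is drawn uniformly from [0.2, 0.8]; all function names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def unit(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def dif(ins_i, ins_j, delta):
    """|SIM - Cosine|, where Cosine is taken over the concatenated vectors."""
    (fine_i, coarse_i), (fine_j, coarse_j) = ins_i, ins_j
    weighted = delta * cosine(fine_i, fine_j) + (1 - delta) * cosine(coarse_i, coarse_j)
    whole = cosine(list(fine_i) + list(coarse_i), list(fine_j) + list(coarse_j))
    return abs(weighted - whole)

def estimate_probability(pairs, epsilon, rng):
    """Fraction of pairs with DIF < epsilon, with delta ~ U[0.2, 0.8] per pair."""
    hits = sum(dif(a, b, rng.uniform(0.2, 0.8)) < epsilon for a, b in pairs)
    return hits / len(pairs)

rng = random.Random(0)

def random_instance(dim=8):
    """Random instance with unit-length fine-grained and coarse-grained vectors."""
    return (unit([rng.gauss(0, 1) for _ in range(dim)]),
            unit([rng.gauss(0, 1) for _ in range(dim)]))

pairs = [(random_instance(), random_instance()) for _ in range(200)]
p = estimate_probability(pairs, epsilon=0.1, rng=rng)
```

One useful sanity check: when both parts are unit vectors and δ = 0.5, the two similarities coincide exactly, so DIF only measures how far δ and the vector norms move away from that balanced case.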
It is known that the M-tree is an efficient structure for NN queries in metric spaces. In order to use the M-tree index, the cosine distance should be transformed into angular similarity (AS), which is metric. The angular similarity between Ins_{i} and Ins_{j} is defined in formula (6) below.
Lemma 1.
For any instance q, the nearest neighbor of q by angular similarity is also the nearest neighbor of q by cosine similarity.
Lemma 1 can be easily proved by contradiction; the proof is omitted for simplicity.
Based on Lemma 1 and formula (6), an M-tree is constructed on the dataset of feature vectors. The M-tree is then augmented with an inverted index over the semantic relationship tuple set TU, which is introduced in Sect. 4.1.
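The intuition behind Lemma 1 is that the angle is a strictly decreasing function of the cosine, so both orderings select the same nearest neighbor. The sketch below checks this on toy data. It assumes the common normalized form arccos(cos)/π for the angular measure, which may differ from formula (6) in scaling; the function names and vectors are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def angular_distance(u, v):
    """Normalized angle between u and v in [0, 1] (assumed form, see lead-in)."""
    c = max(-1.0, min(1.0, cosine(u, v)))  # clamp against rounding error
    return math.acos(c) / math.pi

# Toy vectors with pairwise distinct cosines to the query.
data = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0), (-1.0, 0.2)]
q = (0.8, 0.3)

nn_by_cosine = max(data, key=lambda o: cosine(q, o))
nn_by_angle = min(data, key=lambda o: angular_distance(q, o))
order_by_cosine = sorted(data, key=lambda o: cosine(q, o), reverse=True)
order_by_angle = sorted(data, key=lambda o: angular_distance(q, o))
```

Because arccos is monotone, the whole ranking is preserved and not only the first neighbor, which is what allows a metric index to stand in for cosine similarity during candidate generation.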
Interpretable kNN Query
To process similarity queries efficiently, we adopt a filter-and-refine model. Our method first obtains candidate matched objects from the M-tree and then verifies the candidates to identify the final answers.
The M-tree inherently supports range queries, denoted as Range(Ins_{i}, r), where Ins_{i} is the query instance and r is the query range. In our algorithm, the kNN candidates are obtained efficiently by two range queries on the M-tree. To verify the candidates, formula (5) is utilized, and for the verified objects the inverted index is accessed to explain why the objects are relevant to the query. The detailed query processing is shown in Algorithm 1. Specifically, we first use the range query Range(Ins_{i}, 0) to find the closest index node and read all the objects in that node (line 2). If the number of objects is less than k, we read its sibling nodes through its parent node, recursively, until we obtain k objects (line 3). We then take the distance r between the query instance and the kth nearest of these objects, issue the second range query with range r and get the candidates. Finally, we utilize formula (5) to verify the candidates, and each matched pair is augmented with the relationship interpretation through the inverted index (lines 6–8).
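The two range queries and the verification step can be sketched without a real M-tree. In the following simplified illustration, a list of leaf buckets and a linear scan stand in for the index, the index metric is assumed to be the normalized angle over the concatenated vector, and `explanations` is a hypothetical lookup standing in for the inverted index. It mirrors the flow described above but is not Algorithm 1 itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def whole(o):
    """Concatenate the fine-grained and coarse-grained vectors of an object."""
    return list(o["fine"]) + list(o["coarse"])

def index_distance(a, b):
    """Normalized angle on the whole vector (assumed form of the index metric)."""
    c = max(-1.0, min(1.0, cosine(whole(a), whole(b))))
    return math.acos(c) / math.pi

def sim(a, b, delta):
    """Weighted fine-grained plus coarse-grained cosine similarity."""
    return (delta * cosine(a["fine"], b["fine"])
            + (1 - delta) * cosine(a["coarse"], b["coarse"]))

def interpretable_knn(leaves, q, k, explanations, delta=0.5):
    """leaves: non-empty lists of objects standing in for M-tree leaf nodes.

    Assumes the leaves hold at least k objects in total.
    """
    # Filter, step 1: read the leaf closest to q, then further leaves until k objects.
    seeds = []
    for leaf in sorted(leaves, key=lambda lf: index_distance(q, lf[0])):
        seeds.extend(leaf)
        if len(seeds) >= k:
            break
    # Filter, step 2: range query whose radius is the k-th smallest seed distance.
    r = sorted(index_distance(q, o) for o in seeds)[k - 1]
    candidates = [o for leaf in leaves for o in leaf if index_distance(q, o) <= r]
    # Refine: rank candidates by SIM and attach the stored explanation.
    best = sorted(candidates, key=lambda o: sim(q, o, delta), reverse=True)[:k]
    return [(o["id"], explanations.get(o["id"], [])) for o in best]

def make(obj_id, a, b):
    """Hypothetical object whose two feature vectors are unit vectors at angles a, b."""
    return {"id": obj_id, "fine": (math.cos(a), math.sin(a)),
            "coarse": (math.cos(b), math.sin(b))}

leaves = [[make("c", 1.0, 1.2), make("d", 2.0, 2.5)],
          [make("a", 0.0, 0.0), make("f", 3.0, 3.0)],
          [make("b", 0.3, 0.2), make("e", 0.5, 0.4)]]
explanations = {"a": ["ball"], "b": ["player"]}  # made-up explanation terms
result = interpretable_knn(leaves, make("q", 0.1, 0.05), k=2,
                           explanations=explanations)
```

The radius r taken from the seed objects is an upper bound on the true kth-neighbor distance under the index metric, so the second range query cannot miss any of the k nearest objects under that metric; the refinement then reorders the candidates by SIM.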
As for complexity, the first range query has range zero, so its cost is O(H), where H is the height of the M-tree. For the second range query, let se be the selectivity of the range query; the cost at each level of index nodes can then be approximated as a geometric sequence with common ratio cr*se, where cr is the capacity of an index node. Hence, the average cost is:
As for query accuracy, by Observation 1 and Lemma 1, we can get Observation 2 as follows.
Observation 2.
Algorithm 1 obtains the kNN instances of the query instance with probability greater than \(\sigma\).
Proof.
Assume that o^{∗} is the actual kth NN of the query but is not returned. Let dis denote the distance between the returned kth NN result and the query. By Lemma 1, the distance between o^{∗} and the query is less than dis + DIF. Setting \(\varepsilon\) = dis, by Observation 1, DIF is less than dis with probability \(\sigma\) or more. So Algorithm 1 returns o^{∗}, which contradicts the assumption. Hence Observation 2 is proved.
Distributed Algorithm
When the dataset is relatively large, the computational cost of the algorithm is relatively high, as shown in formula (7). Therefore, to process large-scale datasets effectively, this section extends the framework to a distributed environment and proposes a distributed kNN algorithm.
The distributed algorithm is based on the idea of divide and conquer. Each computing node in a P2P distributed system is independent and autonomous. Let C be the number of computing nodes and PV the pivot set of the dataset, PV = {pv_{i}}, where 1 ≤ i ≤ pn and pn is the number of pivot points. PV is stored on each computing node as global information. Each computing node is responsible for one or more pivot points. Data objects are assigned to computing nodes according to the distance between the data object and the pivot points. Then, each computing node builds an M-tree and an inverted index locally. When a computing node receives a similarity query q with query range R, it acts as the coordinator and calculates the relevant pivot points by formula (8).
where maxd_{i} is the largest distance between pv_{i} and the data objects it maintains. Then, the coordinator forwards the query to the computing nodes where the relevant pivot points are located; each of these nodes calculates the query result through its local indices and returns it to the coordinator. Finally, the coordinator collects the intermediate query results and returns the final result to the user. Clearly, the selection of pivot points and the query algorithm are key to query performance, and they are discussed in detail below.
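The coordinator's pruning step can be illustrated with the standard triangle-inequality test for pivot-based partitioning. The sketch assumes that a pivot is relevant when its distance to q is at most maxd_{i} + R, which agrees with the radius-0 condition given later for 1NN queries but may differ in form from formula (8). The function name and the 1-D toy metric are hypothetical.

```python
def relevant_pivots(q, R, pivots, maxd, dist):
    """Pivots whose region may intersect the query ball of radius R around q.

    If object o is maintained by pivot p, then dist(o, p) <= maxd[p]. By the
    triangle inequality, dist(q, o) <= R is only possible when
    dist(q, p) <= maxd[p] + R, so all other pivots can be skipped.
    """
    return [p for p in pivots if dist(q, p) <= maxd[p] + R]

# Toy 1-D metric space, for illustration only.
def dist(a, b):
    return abs(a - b)

pivots = [0, 10, 20]
maxd = {0: 3, 10: 2, 20: 4}   # farthest maintained object per pivot (made up)

narrow = relevant_pivots(6, 2, pivots, maxd, dist)
wide = relevant_pivots(6, 3, pivots, maxd, dist)
```

Only the computing nodes that own a returned pivot need to receive the query, so a small R contacts few nodes while a larger R widens the set. The argument relies on `dist` satisfying the triangle inequality.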
Selection of Pivot Points
The main function of the pivot points is to filter out irrelevant data objects during query processing, so the criterion for selecting pivot points is to maximize the filtering ability of the query. In metric spaces, pivot points that are far from each other are generally selected. Based on this heuristic, we propose a pivot point selection scheme similar to [29]. First, randomly select a data object o from the sample dataset and put the data object in the sample dataset that is farthest from o into the pivot point set PV. Then add to PV the data object in the sample dataset with the largest average distance to the pivot points already selected, and repeat this step until |PV| = pn.
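The selection heuristic can be sketched as a farthest-first loop. This is one reading of the description above, assuming that each new pivot maximizes the average distance to the pivots chosen so far; the function name, the optional `rng` argument and the 1-D sample are illustrative.

```python
import random

def select_pivots(sample, pn, dist, rng=None):
    """Pick pn pivots from `sample` (needs at least pn distinct objects)."""
    rng = rng or random.Random()
    start = rng.choice(sample)
    # The first pivot is the object farthest from a randomly chosen object.
    pivots = [max(sample, key=lambda o: dist(o, start))]
    while len(pivots) < pn:
        remaining = [o for o in sample if o not in pivots]
        # Next pivot: largest average distance to the pivots selected so far.
        pivots.append(max(remaining,
                          key=lambda o: sum(dist(o, p) for p in pivots) / len(pivots)))
    return pivots

# Toy 1-D sample, for illustration only.
sample = [0, 1, 2, 10, 11, 20]
PV = select_pivots(sample, 3, lambda a, b: abs(a - b), rng=random.Random(0))
```

On this sample the two extreme points 0 and 20 are always among the pivots regardless of the random start, which reflects the goal of spreading pivots far apart to improve filtering.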
QuerySensitive Load Balancing
In the distributed environment, consistent hashing is used to maintain and manage the pivot points. That is, the pivot points are mapped into the domain [0, 2^{max}], which is divided into multiple intervals (tokens); each computing node is responsible for one or more intervals, and queries are routed through the distributed hash table of the system. With a hash method such as SHA1, the pivot points are divided evenly among the computing nodes. However, the query load is not always evenly distributed, and its distribution changes dynamically. Therefore, to achieve load balance in the system, a query-aware adjustment method is needed. First, set the threshold t for load balance. If the load of a computing node (ComputerA) exceeds t times the average load, that is, ComputerA becomes a query bottleneck, then ComputerA communicates with the computing node adjacent to its responsible area (ComputerB), reduces the area that ComputerA is responsible for while increasing the area that ComputerB is responsible for, and moves the corresponding pivot points from ComputerA to ComputerB. After this step, if other computing nodes have a load balance problem, the process is repeated for those nodes until the load balance of the system is achieved. Note that, to avoid thrashing, load adjustments should always be performed in the same direction.
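A token ring with a one-directional boundary shift can be sketched in a few lines. The following illustrates the idea only and is not the system's implementation: the 32-bit token domain, the class `Ring`, the methods `owner` and `shrink`, and the helper `overloaded` are all assumptions. Load is always handed to the successor node, which is one way to keep adjustments in a single direction.

```python
import bisect
import hashlib

MAX_BITS = 32  # hypothetical size of the token domain [0, 2**MAX_BITS)

def token(key):
    """Map a pivot identifier to a token using SHA-1."""
    digest = hashlib.sha1(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** MAX_BITS)

class Ring:
    def __init__(self, boundaries):
        # boundaries: sorted (upper_token, node) pairs; a node owns the tokens
        # above the previous boundary up to its own upper token. Tokens above
        # the last boundary wrap around to the first node.
        self.uppers = [u for u, _ in boundaries]
        self.nodes = [n for _, n in boundaries]

    def owner(self, key):
        i = bisect.bisect_left(self.uppers, token(key)) % len(self.nodes)
        return self.nodes[i]

    def shrink(self, node, fraction=0.5):
        """Hand the upper `fraction` of node's interval to its successor."""
        i = self.nodes.index(node)
        lower = self.uppers[i - 1] if i > 0 else 0
        width = self.uppers[i] - lower
        self.uppers[i] = lower + int(width * (1 - fraction))

def overloaded(load, t):
    """Nodes whose load exceeds t times the average load."""
    avg = sum(load.values()) / len(load)
    return [n for n, l in load.items() if l > t * avg]

ring = Ring([(2 ** 30, "A"), (2 ** 31, "B"), (2 ** 32 - 1, "C")])
keys = [f"pv{i}" for i in range(50)]
before = {k: ring.owner(k) for k in keys}
for node in overloaded({"A": 90, "B": 10, "C": 20}, t=1.5):
    ring.shrink(node)          # here only node A is over the threshold
after = {k: ring.owner(k) for k in keys}
```

After the shift, every pivot either stays where it was or moves from the overloaded node to its successor, so only the two neighboring nodes exchange data.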
Computation of pn
The execution time of query processing can be divided into two parts: the time gt for computing the relevant computing nodes based on the pivot points, and the time lt for each computing node to perform a local query. Therefore, the computing time for query processing is ct = gt + lt. The time gt is proportional to the number of pivot points pn, that is, gt = α*pn, where α is a coefficient related to the processing capability of the computing nodes. The local query time lt decreases as pn increases; on average, lt = \(\beta \frac{r^{\left( \log_{m} \frac{N}{pn} \right)}}{r - 1} - 1\) = \(\beta \frac{\left( \frac{N}{pn} \right)^{\left( \frac{1}{\log_{r} m} \right)}}{r - 1} - 1\), where r is the average selection degree of the child nodes of the index tree by the query, m is the out-degree of the index tree, N is the size of the dataset and β is a coefficient determined by the average processing capability of the computing nodes. Therefore, the formula for ct can be obtained:
Taking the derivative of ct with respect to pn and setting it to zero gives the value of pn at which ct takes its minimum (Formula 10). By Formula 10, the number of pivot points can be obtained.
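As a worked check of this step, write c = \(\log_{m} r\) = 1/\(\log_{r} m\). Setting the derivative of ct = α·pn + lt with respect to pn to zero then yields pn = \(\left( \frac{\beta c N^{c}}{\alpha (r - 1)} \right)^{\frac{1}{1 + c}}\), assuming r > 1 and m > 1. This closed form is re-derived here from the definitions of gt and lt above, so its presentation may differ from the paper's Formula 10. The Python sketch below evaluates it and compares it with a brute-force search; the function names and the sample values of α and β are illustrative assumptions.

```python
import math


def query_cost(pn, n, m, r, alpha, beta):
    """ct = gt + lt with gt = alpha * pn and lt as defined in the text."""
    c = math.log(r, m)  # c = log_m r = 1 / log_r m
    return alpha * pn + beta * (n / pn) ** c / (r - 1) - 1


def optimal_pivot_count(n, m, r, alpha, beta):
    """Stationary point of query_cost in pn (requires r > 1 and m > 1)."""
    c = math.log(r, m)
    return (beta * c * n ** c / (alpha * (r - 1))) ** (1 / (1 + c))


if __name__ == "__main__":
    # Hypothetical setting: one million objects, out-degree 50, selection degree 3.
    n, m, r, alpha, beta = 1_000_000, 50, 3.0, 1.0, 100.0
    pn_star = optimal_pivot_count(n, m, r, alpha, beta)
    best = min(range(1, 5001), key=lambda pn: query_cost(pn, n, m, r, alpha, beta))
    print(f"closed form: {pn_star:.1f}, brute force: {best}")
```

Since ct is convex in pn for r > 1, the best integer value is one of the two integers next to the closed-form result, so in practice pn can simply be rounded.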
Distributed kNN Query Algorithm
As mentioned at the beginning of this section, range queries are easy to handle by Formula 8. In this section, we discuss the distributed nearest neighbor query algorithm.
Consider first the simple case k = 1, i.e., a 1NN query. When a computing node receives a 1NN query q, it acts as the scheduler and first initializes a range query with query object q and query radius 0. It then calculates the relevant pivot points, that is, the pivot point set PS = {pn_{i} | SIM(pn_{i}, q) \(\ge\) 1 − maxd_{i}}, where maxd_{i} is the distance between the pivot point pn_{i} and the farthest data object it maintains. Next, the scheduler forwards the query to the computing nodes (denoted as CS) responsible for the pivot points in PS, and each of these computing nodes calculates its local NN of q and returns it to the scheduler. After receiving all candidate nearest neighbors, the scheduler selects the candidate with the smallest distance to q and lets mind be that distance. After that, the scheduler uses q as the query object and mind as the query range to calculate the relevant pivot points again and forwards the NN query to the computing nodes responsible for these pivot points, excluding those in CS. Finally, the scheduler collects the candidate data objects, calculates the NN and returns it to the user.
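The two rounds of the 1NN query can be simulated in one process, as in the sketch below. It is an illustration under stated assumptions rather than the paper's algorithm: objects are 2-D points in the unit square, the distance is Euclidean distance scaled into [0, 1] so that SIM = 1 − distance and the triangle inequality holds, and each pivot stands for one computing node. All function names are hypothetical.

```python
import math
import random

SCALE = math.sqrt(2.0)  # diameter of the unit square, keeps distances in [0, 1]


def dist(a, b):
    return math.dist(a, b) / SCALE


def sim(a, b):
    """Similarity as used in the text: SIM = 1 - distance."""
    return 1.0 - dist(a, b)


def build_partitions(points, pivots):
    """Assign every object to its most similar pivot and record maxd_i."""
    parts = [[] for _ in pivots]
    for p in points:
        parts[max(range(len(pivots)), key=lambda j: sim(p, pivots[j]))].append(p)
    maxd = [max((dist(o, pv) for o in objs), default=0.0)
            for pv, objs in zip(pivots, parts)]
    return parts, maxd


def relevant_pivots(q, pivots, maxd, radius):
    """PS = {i | SIM(pn_i, q) >= 1 - maxd_i - radius}."""
    return {i for i, pv in enumerate(pivots)
            if sim(pv, q) >= 1.0 - maxd[i] - radius}


def distributed_1nn(q, pivots, parts, maxd):
    def local_nn(nodes):
        # Each contacted node returns the nearest object of its own partition.
        return [min(parts[i], key=lambda o: dist(o, q)) for i in nodes if parts[i]]

    first = relevant_pivots(q, pivots, maxd, 0.0)            # round 1: radius 0
    candidates = local_nn(first)
    mind = min((dist(o, q) for o in candidates), default=math.inf)
    second = relevant_pivots(q, pivots, maxd, mind) - first  # round 2: radius mind
    candidates += local_nn(second)
    best = min(candidates, key=lambda o: dist(o, q))
    return best, len(first | second)


if __name__ == "__main__":
    random.seed(7)
    points = [(random.random(), random.random()) for _ in range(400)]
    pivots = random.sample(points, 8)
    parts, maxd = build_partitions(points, pivots)
    nn, contacted = distributed_1nn((0.5, 0.5), pivots, parts, maxd)
    print(nn, "found after contacting", contacted, "of", len(pivots), "nodes")
```

If no pivot region contains q in the first round, mind stays infinite and the second round contacts every node. The text does not describe this corner case, so this fallback is a choice made for the sketch.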
The kNN query algorithm is discussed as follows, and the specific process is shown in Algorithm 2. First, the initial query distance initR is estimated according to the statistical histogram, and the relevant computing nodes are calculated (line 2). For a kNN query with query object q, the distance between q and each pivot point is qdist_{i} = 1 − SIM(pn_{i}, q). initR is then chosen from these distances and the histogram counts, where the function NumHist(r, dmin, dmax) returns the number of data objects in the index with root r whose distance lies between dmin and dmax. Then, the query is forwarded to the relevant computing node set CS_{1}. Each computing node calculates its local kNN data objects as a candidate set and returns it to the scheduler. The scheduler computes the kNN distance mind over all returned candidates (the distance to the k-th nearest candidate) and calculates the relevant computing node set CS_{2} for the query range mind (lines 3–7). It then forwards the request to the computing nodes in CS_{2} that were not already queried in CS_{1}, collects the local kNN candidate data objects of each computing node, calculates the final kNN result and returns it (lines 8–11).
The kNN query algorithm is also implemented using two range queries. The main difference is that, in the first range query, kNN uses the histogram information summarized at the pivot points to predict a better query range initR. A well-estimated initR is close to the distance of the k nearest neighbor data objects and thereby effectively reduces the cost of the second range query.
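The kNN variant can be sketched in the same way. The rule used below for initR is an assumption, since the exact formula is not reproduced here: it picks the smallest tested radius R for which the histogram counts NumHist(i, qdist_i − R, qdist_i + R), summed over all pivots, reach k. Euclidean distance stands in for 1 − SIM, each pivot plays the role of one computing node, and all names (`build`, `num_hist`, `estimate_init_r`, `distributed_knn`) are hypothetical.

```python
import heapq
import math
import random


def build(points, pivots, buckets=16):
    """Assign objects to their nearest pivot; keep maxd_i and a distance histogram."""
    parts = [[] for _ in pivots]
    for p in points:
        parts[min(range(len(pivots)), key=lambda j: math.dist(p, pivots[j]))].append(p)
    maxd, hists = [], []
    for pv, objs in zip(pivots, parts):
        dists = [math.dist(o, pv) for o in objs]
        top = max(dists, default=0.0)
        hist = [0] * buckets
        for d in dists:
            b = min(int(d / top * buckets), buckets - 1) if top > 0 else 0
            hist[b] += 1
        maxd.append(top)
        hists.append(hist)
    return parts, maxd, hists


def num_hist(hist, top, dmin, dmax):
    """Objects of one pivot whose histogram bucket overlaps [dmin, dmax]."""
    if top == 0:
        return sum(hist) if dmin <= 0.0 <= dmax else 0
    width = top / len(hist)
    return sum(n for b, n in enumerate(hist)
               if b * width <= dmax and (b + 1) * width >= dmin)


def estimate_init_r(q, k, pivots, maxd, hists, steps=32):
    """Assumed rule: smallest tested radius whose histogram count reaches k."""
    qdist = [math.dist(pv, q) for pv in pivots]
    limit = max(d + m for d, m in zip(qdist, maxd))  # radius covering every object
    for s in range(1, steps + 1):
        r = limit * s / steps
        count = sum(num_hist(h, m, d - r, d + r)
                    for h, m, d in zip(hists, maxd, qdist))
        if count >= k:
            return r
    return limit


def distributed_knn(q, k, pivots, parts, maxd, hists):
    def relevant(radius):
        return {i for i, pv in enumerate(pivots)
                if math.dist(pv, q) <= maxd[i] + radius}

    def local_knn(nodes):
        # Each contacted node returns the k nearest objects of its own partition.
        return [o for i in nodes
                for o in heapq.nsmallest(k, parts[i], key=lambda o: math.dist(o, q))]

    cs1 = relevant(estimate_init_r(q, k, pivots, maxd, hists))  # round 1: initR
    candidates = local_knn(cs1)
    dists = sorted(math.dist(o, q) for o in candidates)
    mind = dists[k - 1] if len(dists) >= k else math.inf
    cs2 = relevant(mind)                                        # round 2: mind
    candidates += local_knn(cs2 - cs1)
    result = heapq.nsmallest(k, candidates, key=lambda o: math.dist(o, q))
    return result, len(cs1 | cs2)
```

The histogram count is only an upper bound on the number of objects within R of q, so the first round may return fewer than k useful candidates. The second round with radius mind corrects this. The closer initR is to the true kNN distance, the fewer extra nodes the second round has to contact.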
Experiment
Experiment Setup
We evaluate our crossmodal query performance on Flickr8K [30], Flickr30K [31], NUS-WIDE [32], MSCOCO [33] and a synthetic dataset, Synthetic9K. Flickr8K consists of 8096 images from the Flickr.com website, and each image is annotated with 5 sentences through Amazon Mechanical Turk. Flickr30K is also a crossmodal dataset, with 31,784 images and their corresponding descriptive sentences. NUS-WIDE is a web image dataset for media search, which consists of about 270,000 images with their tags; each image together with its corresponding tags is viewed as an image/text pair. MSCOCO contains 123,287 images, and each image is also annotated with five independent sentences provided through Amazon Mechanical Turk. By extracting 2000 image/text pairs from each of the above datasets, we obtain a hybrid dataset, denoted as Synthetic9K. For each dataset, 10% of the data are used as the testing and validation sets, and the rest are used as the training set.
We compare our approach with eight state-of-the-art crossmodal retrieval methods: CCL [34], HSE [35], DADN [8], SCAN [36], DCMH [13], LGCFL [37], JRL [4] and KCCA [38]. CCL learns crossmodal correlation with a hierarchical network in two stages: first, separate representations are learned by jointly optimizing intra-modality and inter-modality correlation, and then multi-task learning is adopted to fully exploit the intrinsic relevance between them. HSE proposes a uniform deep model to learn the common representations for four types of media simultaneously by considering a classification constraint, a center constraint and a ranking constraint. DADN proposes a dual adversarial distribution network which combines zero-shot learning and correlation learning in a unified framework to generate common embeddings for crossmodal retrieval. SCAN considers the latent alignments between image regions and text words to learn the image-text similarity. DCMH combines hashing learning and deep feature learning by preserving the semantic similarity between modalities. LGCFL uses a local group-based prior to exploit popular block-based features and jointly learns basis matrices for different modalities. JRL applies semi-supervised regularization and sparse regularization to learn the common representations. KCCA follows the idea of projecting the data into a higher-dimensional feature space and then performing CCA. Some compared methods, such as CCL and HSE, rely on category information for common representation learning; however, the datasets have no label annotations available. Therefore, in our experiments, keywords are first extracted from the text descriptions by the TF-IDF method and used as labels for the corresponding images.
For distributed query processing, our algorithms are compared with the two most related methods. One is a naive method (SimD), in which data objects are scattered randomly and each query is compared with the objects in a pairwise way. The other is a state-of-the-art method [39] (DistMP), which is a general framework based on MapReduce.
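The TF-IDF labeling step used for the category-dependent baselines can be illustrated as follows. The paper does not give the tokenization, weighting variant or number of keywords per image, so the lowercase regular-expression tokenizer, the plain tf × log(N/df) weighting and the parameter `top_n` are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens (assumed tokenization, no stop-word list)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_labels(captions, top_n=1):
    """Return, for each caption, its top_n terms ranked by TF-IDF."""
    docs = [Counter(tokenize(c)) for c in captions]
    df = Counter(term for doc in docs for term in doc)  # document frequency
    n = len(docs)
    labels = []
    for doc in docs:
        total = sum(doc.values())
        scores = {t: (cnt / total) * math.log(n / df[t]) for t, cnt in doc.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties: alphabetical
        labels.append(ranked[:top_n])
    return labels


if __name__ == "__main__":
    captions = ["a dog a dog", "a cat"]
    print(tfidf_labels(captions))
```

Terms that occur in every caption, such as "a" in the example, receive a score of zero and are therefore never preferred over a distinctive term.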
Following [34], we apply the mean average precision (MAP) score to evaluate the crossmodal query performance. We first calculate the average precision (AP) score for each query and then take the mean value over all queries as the MAP score. AP is calculated as
\[ {\text{AP}} = \frac{1}{R}\sum\limits_{{\text{i}} = 1}^{{\text{k}}} {\text{p}}_{\text{i}} \times {\text{rel}}_{\text{i}} \]
where R is the number of ground-truth relevant instances, \({\text{k}}\) is from the kNN query, \({\text{p}}_{\text{i}}\) denotes the precision of the top i results and \({\text{rel}}_{\text{i}}\) indicates whether the i-th result is relevant.
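A short Python sketch of this metric follows, using the symbols defined above: R as `num_relevant`, and the 0/1 relevance flags rel_i of the top-k results as a list in rank order. The function names and the convention of returning 0 when R = 0 are assumptions of the example.

```python
def average_precision(relevance, num_relevant):
    """AP = (1 / R) * sum_i p_i * rel_i over the top-k results.

    relevance: 0/1 flags of the returned results, in rank order.
    num_relevant: R, the number of ground-truth relevant instances.
    """
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i  # p_i, the precision of the top i results
    return total / num_relevant


def mean_average_precision(queries):
    """queries: iterable of (relevance, num_relevant) pairs, one per query."""
    queries = list(queries)
    return sum(average_precision(rel, r) for rel, r in queries) / len(queries)
```

For example, a query whose first and third results are relevant, with R = 2, has AP = (1/1 + 2/3) / 2 = 5/6.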
We adopt TensorFlow [40] to implement our MCQI approach. In the first stage, we take the 4096-dimensional feature extracted by R-CNN from the image region inside a given bounding box. For the non-linear transformation model, we use three fully connected layers with 1,024 dimensions and set the dimensions of the common embedding spaces, d_l and d_g, to 1024. Sent2vec, used for fine-grained semantics, has 700 dimensions and is pre-trained on Wikipedia, and Universal Sentence Encoder, used for coarse-grained semantics, has 512 dimensions. Experiments for the centralized algorithms are conducted on a server with an Intel E5-2650 v3 CPU, 256 GB RAM, an NVIDIA V100 GPU and Ubuntu 16.04, while experiments for the distributed algorithms are run on a cluster of 30 computer nodes, each with an Intel Core i5-10210 (1.6 GHz × 4) CPU and 8 GB memory.
Verification of Observation 1
Figures 2 and 3 show the accuracy of DIF < 0.05 and DIF < 0.1, respectively, with different sample sizes. \(\delta\) is randomly generated from three different ranges, i.e., [0.2, 0.8], [0.3, 0.7] and [0.4, 0.6]; across these ranges, the accuracy is higher when \(\delta\) is closer to 0.5. For DIF < 0.05, the accuracy stays above 0.9 as the sample size increases, and for DIF < 0.1 it stays above 0.99. Without loss of generality, following the statistical hypothesis testing method, for \(\delta \in\) [0.2, 0.8] we assume DIF < 0.05 with a significance level of 0.1. In our experiments with sample size 100,000, the mean value of DIF is 0.021 and the sample variance is 0.00045; because the population standard deviation is unknown, the t-distribution is used. The test statistic is −0.63 and, at significance level 0.1, the critical quantile is −1.28. Because −0.63 > −1.28, the assumption is accepted.
Performance of Query Accuracy
We present the query accuracy of our MCQI approach and all the compared methods in this part. Table 2 shows the MAP scores for 30NN queries. As shown in the table, the accuracies of DNN-based methods such as DADN and CCL are higher than those of traditional methods on average. Owing to the fusion of multigrained semantic features and transfer learning embedding, the MCQI approach consistently achieves the best query accuracy. Synthetic9K contains more data categories than the other datasets, and learning common semantic embeddings depends comparatively more on the quantity of training data; therefore, under the same conditions, the accuracy on Synthetic9K is relatively affected.
Performance of Query Time
As shown in Fig. 4, we measure the query time of our proposed MCQI approach and two representative methods on the 5 datasets. CCL is a DNN-based method and DCMH is a hash-based method. For CCL, pairwise computation is needed to get the kNN result. For DCMH, data can be transformed into binary codes, so obtaining the 1NN is fast, while for kNN with varying k the query time is affected. Intuitively, query times are proportional to the size of the datasets. As CCL and DCMH are not very sensitive to the k of kNN queries, we show their query times for 30NN queries only on each dataset. From 30NN queries to 5NN queries, the filtering effect of the M-tree index strengthens and, consequently, the query times of MCQI decrease. In all cases, MCQI is the fastest among the methods. In particular, for 5NN queries, MCQI is on average about 13 times faster than CCL and 20 times faster than DCMH, i.e., our approach outperforms CCL and DCMH by an order of magnitude on average.
Performance of Distributed Algorithm
To show the scalability of our framework, Fig. 5 presents the running time of the methods with varying k of the kNN query on the NUS-WIDE and MSCOCO datasets. In terms of running time, MCQI is nearly three times as fast as DistMP, which in turn is one order of magnitude faster than SimD. SimD, as a pairwise method, causes enormous communication cost in a distributed environment, while DistMP, which utilizes the metric distance to filter unrelated data, can save computation cost. However, the lack of an efficient index leads DistMP to worse query performance than MCQI. In essence, MCQI needs only two rounds of NN queries, and it is significantly better than SimD and DistMP.
Query Interpretability
Figure 6 shows some examples of crossmodal similarity query results. MCQI not only contains the latent semantic common embedding of the two types but also has explicit alignment information. As shown in Fig. 6, for kNN queries MCQI returns similar objects in the datasets and further gives a reason why those objects are semantically related, which is very important for serious applications.
Conclusion
In this paper, we proposed a novel framework for Multigrained Crossmodal Similarity Query with Interpretability (MCQI), which leverages coarse-grained and fine-grained semantic information to achieve effective and interpretable crossmodal queries. MCQI integrates deep neural network embedding with a high-dimensional query index and introduces an efficient kNN similarity query algorithm with theoretical support. Experimental results on widely used datasets demonstrate the effectiveness of MCQI. In future work, we will study reinforcement learning-based crossmodal query approaches to reduce the dependence on large training data from a specific domain.
Data availability
Open data are used in this work and are publicly available (references are provided in the paper).
References
[1] Peng Y, Huang X, Zhao Y (2018) An overview of crossmedia retrieval: Concepts, methodologies, benchmarks and challenges. IEEE Trans Circuits Syst Video Technol 28(9):2372–2385
[2] He X, Peng Y, Xi L (2019) A new benchmark and approach for fine-grained crossmedia retrieval. In: 27th ACM international conference on multimedia, ACM. pp 1740–1748
[3] Rasiwasia N, Pereira J, Coviello E et al (2010) A new approach to crossmodal multimedia retrieval. In: 18th international conference on multimedia, ACM. pp 251–260
[4] Zhai X, Peng Y, Xiao J (2014) Learning crossmedia joint representation with sparse and semi-supervised regularization. IEEE Trans Circuits Syst Video Technol 24(6):965–978
[5] Peng Y, Zhai X, Zhao Y, Huang X (2016) Semi-supervised crossmedia feature learning with unified patch graph regularization. IEEE Trans Circuits Syst Video Technol 26(3):583–596
[6] Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: IEEE conference on computer vision and pattern recognition, IEEE. pp 3441–3450
[7] He L, Xu X, Lu H et al (2017) Unsupervised crossmodal retrieval through adversarial learning. In: IEEE international conference on multimedia and expo, IEEE. pp 1153–1158
[8] Chi J, Peng Y (2020) Zero-shot crossmedia embedding learning with dual adversarial distribution network. IEEE Trans Circuits Syst Video Technol 30(4):1173–1187
[9] Andrej K, Armand J, Li F (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: 27th international conference on neural information processing systems, ACM. pp 1889–1897
[10] Andrej K, Li F (2017) Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans Pattern Anal Mach Intell 39(4):664–676
[11] Xu K, Ba J, Kiros R et al (2015) Show, attend and tell: neural image caption generation with visual attention. In: 2015 international conference on machine learning, IEEE. pp 2048–2057
[12] Wang X, Wang Y, Wan W (2018) Watch, listen and describe: globally and locally aligned crossmodal attentions for video captioning. In: Proceedings of 2018 conference of the North American chapter of the association for computational linguistics, ACL. pp 795–801
[13] Jiang Q, Li W (2017) Deep crossmodal hashing. In: 2017 IEEE conference on computer vision and pattern recognition, IEEE. pp 3270–3278
[14] Cao Y, Long M, Wang J et al (2016) Correlation autoencoder hashing for supervised crossmodal search. In: international conference on multimedia retrieval, ACM. pp 197–204
[15] Cao Y, Long M, Wang J (2017) Correlation hashing network for efficient crossmodal retrieval. In: 28th British machine vision conference, BMVA. pp 1–12
[16] Yang E, Deng C, Liu W et al (2017) Pairwise relationship guided deep hashing for crossmodal retrieval. In: 31st conference on artificial intelligence, AAAI. pp 1618–1625
[17] Zhang J, Peng Y, Yuan M et al (2018) Unsupervised generative adversarial crossmodal hashing. In: 32nd conference on artificial intelligence, AAAI. pp 539–546
[18] Yang K, Ding X, Zhang Y et al (2019) Distributed similarity queries in metric spaces. Data Science and Engineering 4(4):1–16
[19] Batko M (2004) Distributed and scalable similarity searching in metric spaces. In: 9th EDBT, ACM. pp 44–153
[20] Novak D, Batko M, Zezula P (2011) Metric index: An efficient and scalable solution for precise and approximate similarity search. Inf Syst 36(4):721–733
[21] Wang J, Wu S, Gao H et al (2010) Indexing multidimensional data in a cloud system. In: SIGMOD, ACM. pp 591–602
[22] Wu S, Jiang D, Ooi B, Wu K (2010) Efficient B-tree based indexing for cloud data processing. In: 36th VLDB, ACM. pp 1207–1218
[23] Tanin E, Harwood A, Samet H (2007) Using a distributed quadtree index in peer-to-peer networks. VLDB J 16(2):165–178
[24] Bennani-Smires K, Musat C, Hossmann A et al (2018) Simple Unsupervised Keyphrase Extraction using Sentence Embeddings. In: conference on computational natural language learning, ACL. pp 221–229
[25] Shen Y, He X, Gao J et al (2014) A latent semantic model with convolutional-pooling structure for information retrieval. In: conference on information and knowledge management, ACM. pp 101–110
[26] Cheng B, Wei Y, Shi H et al (2018) Revisiting R-CNN: On awakening the classification power of faster R-CNN. In: European conference on computer vision, Springer. pp 473–490
[27] Cer D, Yang Y, Kong S et al (2018) Universal Sentence Encoder. arXiv: Computation and Language. https://arxiv.org/abs/1803.11175v2. Accessed 12 April 2018
[28] Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: 13th international conference on artificial intelligence and statistics, JMLR. pp 249–256
[29] Zhu M, Xu L, Shen D et al (2018) Methods for similarity query on uncertain data with cosine similarity constraints. Journal of Frontiers of Computer Science and Technology 12(1):49–64
[30] Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47(1):853–899
[31] Young P, Lai A, Hodosh M et al (2014) From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 7(2):67–78
[32] Chua T, Tan J, Hong R et al (2009) NUS-WIDE: a real-world web image database from national university of Singapore. In: 8th conference on image and video retrieval, ACM. pp 1–9
[33] Lin T, Maire M, Belongie S (2014) Microsoft coco: Common objects in context. In: 13th European conference on Computer Vision (ECCV), Springer. pp 740–755
[34] Peng Y, Qi J, Huang X et al (2018) CCL: Crossmodal correlation learning with multigrained fusion by hierarchical network. IEEE Trans Multimedia 20(2):405–420
[35] Chen T, Wu W, Gao Y et al (2018) Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In: 26th ACM multimedia, ACM. pp 2023–2031
[36] Lee K, Chen X, Hua G et al (2018) Stacked cross attention for image-text matching. In: European conference on computer vision, Springer. pp 212–228
[37] Kang C, Xiang S, Liao S et al (2015) Learning Consistent Feature Representation for Cross-Modal Multimedia Retrieval. IEEE Trans Multimedia 17(3):370–381
[38] Hardoon D, Szedmak S, Shawe-Taylor J et al (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639–2664
[39] Akdogan A, Demiryurek U, Kashani FB et al (2010) Voronoi-based geospatial query processing with mapreduce. In: 2nd international conference of cloud Computing (CloudCom), IEEE. pp 9–16
[40] Abadi M, Barham P, Chen J et al (2016) TensorFlow: A system for large-scale machine learning. In: 12th USENIX conference on operating systems design and implementation, ACM. pp 265–283
Acknowledgements
We would like to thank our selfless friends and the professional reviewers for all their insightful advice. The preliminary version of this article was published in APWeb-WAIM 2020 [https://doi.org/10.1007/978-3-030-60290-1_26].
Funding
This work is supported by the National Natural Science Foundation of China (61802116, 62072157), the Training Plan of Young Backbone Teachers in Universities of Henan Province (2020GGJS263), the Natural Science Foundation of Henan Province (202300410102), the Science and Technology Plan of Henan Province (192102210113, 192102210248) and the Key Scientific Research Project of Henan Universities (19B520005).
Author information
Contributions
Mingdong Zhu is responsible for providing the idea and designing the model and experimental methods. Derong Shen gives overall guidance. Lixin Xu implements the experiments, and Xianfang Wang gives guidance on the theoretical proofs.
Ethics declarations
Conflict of interest
Avoid reviewers from Henan Institute of Technology and Northeastern University.
Consent to participate
All authors consent to participate in this work.
Consent for publication
All authors consent to publish the paper.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhu, M., Shen, D., Xu, L. et al. Scalable Multigrained Crossmodal Similarity Query with Interpretability. Data Sci. Eng. 6, 280–293 (2021). https://doi.org/10.1007/s41019-021-00162-4
Keywords
Crossmodal
Interpretability
Multigrained
Similarity query
Scalability