1 Introduction

With the rapid development of computer science and technology, multimedia data such as images and texts have been emerging on the Internet and have become a primary medium through which people perceive the world. Consequently, cross-modal similarity query has become an essential technique with wide applications such as search engines and multimedia data management. Cross-modal similarity query [1] is a query paradigm in which users submit a query of one type and obtain results of another type. In this work, we mainly focus on queries between images and texts. For instance, when a user submits a textual description of a football game, the most relevant images in the dataset can be fetched, and vice versa. Because cross-modal similarity query must discover latent semantic relationships across different data types, it has attracted great interest from researchers.

Due to the significant advantage of deep neural networks (DNNs) in feature extraction, DNN models are widely utilized for cross-modal similarity query [2]. The complex structure and high-dimensional feature maps equip deep neural networks with considerable power for learning nonlinear relationships; at the same time, however, complex models introduce some drawbacks. First, the numerous parameters of deep neural networks make the query process and its results difficult to explain. That is, such models have weak interpretability, which is an important property for a general and reliable cross-modal query system. Second, in order to find the most similar data objects, the cosine similarity between the high-dimensional feature vector of the query object and that of each object in the whole dataset typically must be computed. Hence, for a large-scale dataset, the computation cost is so high that the query response time becomes prohibitive. Existing research tends to focus on designing complicated composite models to enhance query accuracy, and hardly takes query interpretability, efficiency and scalability into account at the same time.

Query interpretability can improve the credibility of query results. Query efficiency can ensure the timeliness of query results. And query scalability can enhance the adaptability of query methods, especially when faced with large-scale data. Hence, it is necessary to develop a cross-modal similarity query framework with interpretability, efficiency and scalability. There are three challenges in achieving this goal. The first is how to bridge the semantic gap among different modalities, which requires a sophisticated model to capture the common semantics at both coarse and fine granularity. The second challenge is how to enhance the interpretability of a query framework with complex structure and millions of parameters. The third challenge is how to equip the query model with scalability so that it can process large-scale data, which are ubiquitous nowadays.

Our core insight is that we can leverage deep neural network models to capture multi-grained cross-modal common semantics and build an efficient hybrid index with interpretability and scalability. Hence, in this work, we propose a novel efficient and effective Multi-grained Cross-modal Query framework with Interpretability (MCQI). To ensure the adaptability and generality of our framework, when training common feature vectors for different types we first capture coarse-grained and fine-grained semantic information with separately designed networks and then combine them. To discover the latent semantic relations between images and texts, we integrate an LSTM model with an attention model, which also constructs the data foundation of cross-modal correlative information. In addition, for the sake of query efficiency, we build an index supporting interpretable queries. Further, to enhance the scalability of our framework, a distributed query algorithm is proposed. At last, to confirm the efficiency and effectiveness of our approach, we systematically evaluate its performance against 8 state-of-the-art methods on five widely used multimodal datasets. Concretely, our contributions are as follows:

  • By integrating coarse-grained and fine-grained semantic learning models, a multi-grained cross-modal query processing architecture is proposed to ensure the adaptability and generality of query processing.

  • In order to capture the latent semantic relation between images and texts, the framework combines LSTM and attention models, which enhances accuracy for cross-modal queries and constructs the foundation for interpretable query processing.

  • Index structure and corresponding nearest neighbor query algorithm are proposed to boost the efficiency of interpretable queries.

  • A distributed query algorithm is proposed to improve the scalability of our framework.

The remainder of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces the problem definitions, and Sect. 4 describes our MCQI framework and a kNN query algorithm in detail. Section 5 gives a distributed query algorithm to enhance the scalability of our framework. Section 6 provides experimental results and analysis on five datasets, and we conclude in Sect. 7.

2 Related Work

In this section, we briefly review related research on cross-modal query, including cross-modal retrieval, latent semantic alignment and cross-modal hashing.

2.1 Cross-modal Retrieval

Traditional methods mainly learn linear projections for different data types. Canonical correlation analysis (CCA) [3] learns a cross-modal common representation by maximizing pairwise correlation and is a classical baseline for cross-modal measurement. Beyond pairwise correlation, joint representation learning (JRL) [4] makes use of semi-supervised regularization and semantic information and can jointly learn common representation projections for up to five data types. S2UPG [5] further improves JRL by constructing a unified hypergraph that learns the common space utilizing fine-grained information. In recent years, DNN-based cross-modal retrieval has become an active research topic. Deep canonical correlation analysis (DCCA) [6] uses two subnetworks and combines DNN with CCA to maximize the correlation on top of the subnetworks. UCAL [7] is an unsupervised cross-modal retrieval method based on adversarial learning, which takes a modality classifier as a discriminator to distinguish the modality of learned features. The DADN [8] approach addresses zero-shot cross-media retrieval by learning common embeddings with category semantic information. These methods mainly focus on query accuracy rather than query efficiency and interpretability.

2.2 Latent Semantic Alignment

Latent semantic alignment is the foundation of interpretable query. [9] embeds patches of images and dependency tree relations of sentences in a common embedding space and explicitly reasons about their latent, intermodal correspondences. Adding a generation step, [10] proposes a model that learns to score sentence-image similarity as a function of R-CNN object detections and the outputs of a bidirectional RNN. By incorporating attention into neural networks for vision-related tasks, [11, 12] investigate models that can attend to salient parts of an image while generating its caption. These methods inspire ideas for achieving interpretable cross-modal query but neglect issues of query granularity and efficiency.

2.3 Cross-modal Hashing

Deep cross-modal hashing (DCMH) [13] combines hashing learning and deep feature learning by preserving the semantic similarity between modalities. Correlation auto-encoder hashing (CAH) [14] embeds the maximum cross-modal similarity into hash codes using nonlinear deep autoencoders. Correlation hashing network (CHN) [15] jointly learns image and text representations tailored to hash coding and formally controls the quantization error. Pairwise relationship guided deep hashing (PRDH) [16] jointly uses two types of pairwise constraints, from intra-modality and intermodality, to preserve the semantic similarity of the learned hash codes. [17] proposes a generative adversarial network to model cross-modal hashing in an unsupervised fashion and a correlation graph-based learning approach to capture the underlying manifold structure across different modalities. For large high-dimensional data, hashing is a common tool that can achieve sublinear time complexity for retrieval. However, once a hash index is built over Hamming space, it is difficult to obtain flexible query granularity and reasonable interpretability.

2.4 Distributed Similarity Query

Existing methods for distributed similarity queries in metric spaces can be partitioned into two categories [18]. The first category utilizes basic metric partitioning principles to distribute the data over the underlying network. [19] proposes a distributed index, the GHT* index, which exploits parallelism in a dynamic network of computers by placing a part of the index structure on every network node. [20] proposes a mapping mechanism that enables the data to be stored in well-established structures such as the B-tree. The second category utilizes index integration techniques to distribute the data. Paper [21] integrates the R-tree and the CAN overlay to process multi-dimensional data in a cloud system. Paper [22] combines the B-tree and the BATON overlay to provide a distributed index with high scalability and low maintenance cost. Both choose a subset of local index nodes to build global index nodes by computing a cost model. [23] integrates a quadtree index with the Chord overlay to enable more powerful access to data in P2P networks. In this paper, we adopt a pivot-mapping-based method because the above methods are query insensitive; that is, they cannot adjust the distribution of data for different query loads and therefore cannot keep the load balanced, which is also important in a distributed environment.

3 Problem Description

For a cross-modal similarity query, given a query object of one type, the most similar objects of the other type in the dataset should be returned. The formal definition is given below.

The multimodal dataset consists of two modalities with m texts and n images, denoted as D = {Dt, Di}. The texts are originally encoded as one-hot codes, and in D the data of the text modality are denoted as \(D^{t} = \left\{ {x_{k}^{t} } \right\}_{k = 1}^{m}\), where the kth text object is defined as \(x_{k}^{t} \in R^{{l_{k} \times c}}\) with sentence length lk and vocabulary size c. The data of the image modality are denoted as \(D^{i} = \left\{ {x_{k}^{i} } \right\}_{k = 1}^{n}\), where the kth image instance is defined as \(x_{k}^{i} \in R^{{w \times h \times c^{\prime}}}\) with image resolution w × h and number of color channels c′. Besides, a pairwise correspondence is denoted as (\({\text{x}}_{\text{k}}^{\text{t}}\), \({\text{x}}_{\text{k}}^{\text{i}}\)), which means that the two instances of different types are strongly semantically relevant. Cross-modal similarity query means that, given one query object, we find objects of the other modality that share relevant semantics with the given one. kNN query is a classical type of similarity query, and its definition is given as follows.

Definition 1

(kNN Query). Given an object q, an integer k ≥ 1, a dataset D and a similarity function SIM, the k nearest neighbor query kNN computes a size-k subset S ⊆ D, s.t. \({ }\forall o_{i} \in S,o_{j} \in D - S:SIM\left( {q,o_{i} } \right) \ge SIM\left( {q,o_{j} } \right).\) In this work, we use cosine similarity as the similarity function.
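To make the definition concrete, the following minimal Python sketch computes a kNN query by brute force under cosine similarity; the random 1024-dimensional embeddings are illustrative assumptions rather than MCQI outputs.

```python
import numpy as np

def cosine_knn(q, dataset, k):
    """Return the indices of the k objects in `dataset` most similar to q."""
    q = q / np.linalg.norm(q)                       # normalize: dot = cosine
    norms = np.linalg.norm(dataset, axis=1, keepdims=True)
    sims = (dataset / norms) @ q                    # SIM(q, o) for every o in D
    return np.argsort(-sims)[:k]                    # top-k by descending similarity

# Usage: 1000 objects with 1024-d common embeddings (d_l in this paper).
rng = np.random.default_rng(0)
D, q = rng.normal(size=(1000, 1024)), rng.normal(size=1024)
print(cosine_knn(q, D, k=5))
```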

Table 1 lists the notations used throughout this paper, mainly those that are referenced far from where they are defined.

Table 1 Used notations

4 Proposed Model

In this section, we describe the proposed MCQI framework in detail. As shown in Fig. 1, the MCQI framework consists of two stages. The first is the learning stage, which models a common embedding representation of multimodal data by fusing coarse-grained and fine-grained semantic information. The second is the index construction stage, in which an M-tree index and an inverted index are integrated to process efficient and interpretable queries. In the following, we introduce the framework in terms of embedding representations of multimodal data and interpretable query processing.

Fig. 1 The framework of MCQI

4.1 Embedding Representations of Multimodal Data

In the first stage, MCQI learns the embedding representation of multimodal data by fusing coarse-grained and fine-grained semantic information.

4.2 Fine-grained Embedding Learning with Local Semantics

Different methods are used to extract local semantic features from texts and images. For texts, EmbedRank [24] is utilized to extract keyphrases, and a pretrained Sent2vec model [25] computes the embedding of each keyphrase. Then, through a three-layer fully connected neural network, each keyphrase is mapped into the common embedding space of dimension d_l; the result is denoted as tspq, the embedding representation of the qth keyphrase of the pth text description.

For images, a Region Convolutional Neural Network (RCNN) [26] is utilized to detect objects. We use the top detected locations of the entire image as local semantic features, compute embedding vectors from the visual content in each bounding box with a pretrained convolutional neural network, and transform the vectors into the common space of dimension d_l through a transition matrix. We thus obtain isuv, the embedding representation of the vth bounding box of the uth image.

Typically, for a matched text-image pair, at least one keyphrase in the text is semantically relevant to a certain bounding box in the image instance; that is, at least one common embedding vector of the text instance is close to a certain common embedding vector of the image instance. Based on this intuition and following the hinge rank loss, we set the original objective of fine-grained embedding learning as follows:

$$C_{b} \left( {ts_{pq}, is_{uv} } \right) = \mathop \sum \limits_{q} \mathop \sum \limits_{v} \left( {\frac{Pnum}{{Anum}}} \right)^{{I\left( {p \ne u} \right)}} \left( {1 - \frac{Pnum}{{Anum}}} \right)^{{I\left( {p = u} \right)}} \max \left( {0,\;M - \left( { - 1} \right)^{{I\left( {p \ne u} \right)}} \frac{{ts_{pq} \cdot is_{uv} }}{{\left| {ts_{pq} } \right| \cdot \left| {is_{uv} } \right|}}} \right)$$
(1)

Here, Pnum is the number of matched pairs in the training sample set and Anum is the total number of training sample pairs. The factor \(\left( {\frac{Pnum}{{Anum}}} \right)^{{I\left( {p \ne u} \right)}} \left( {1 - \frac{Pnum}{{Anum}}} \right)^{{I\left( {p = u} \right)}}\) is utilized to balance positive and negative samples. \(\frac{{ts_{pq} \cdot is_{uv} }}{{\left| {ts_{pq} } \right| \cdot \left| {is_{uv} } \right|}}\) is the cosine similarity of two embedding vectors. M is the margin constant, which defines the tolerance for true positives and true negatives: the closer M is to 1, the stricter the semantic criterion for recognizing true positives and true negatives.
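The following numpy sketch illustrates one evaluation of this balanced hinge rank loss for a single text-image pair; the tensor shapes, the toy inputs and the margin value are assumptions for illustration only.

```python
import numpy as np

def hinge_rank_loss(ts, im, p, u, Pnum, Anum, M=0.5):
    """ts: (Q, d) keyphrase embeddings of text p; im: (V, d) bounding-box
    embeddings of image u; sums the hinge terms over all (q, v) pairs."""
    cos = (ts @ im.T) / (np.linalg.norm(ts, axis=1, keepdims=True)
                         * np.linalg.norm(im, axis=1))   # (Q, V) cosine matrix
    if p == u:               # matched pair: push similarity above the margin
        w, sign = 1.0 - Pnum / Anum, 1.0
    else:                    # unmatched pair: push similarity below -M
        w, sign = Pnum / Anum, -1.0
    return float(np.sum(w * np.maximum(0.0, M - sign * cos)))

rng = np.random.default_rng(0)
ts, im = rng.normal(size=(3, 1024)), rng.normal(size=(5, 1024))
print(hinge_rank_loss(ts, im, p=0, u=0, Pnum=100, Anum=10000))
```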

The Cb cost is computed over all pairs of local features between text instances and image instances. However, in many cases only a few local features of two semantically relevant instances are matched, and similar pairs are difficult to acquire by computing over all pairs. To address this problem, following the multiple instance learning extension in [9], we extend formula (1) to formula (2). For each text instance, the local features of the matched image are put into a positive bag, while local features of other images are treated as negative samples.

$$\begin{gathered} C_{P} = \min_{{k_{qv} }} \sum\nolimits_{q} {\sum\nolimits_{v} {\left( {\frac{Pnum}{{Anum}}} \right)^{{I\left( {p \ne u} \right)}} \left( {1 - \frac{Pnum}{{Anum}}} \right)^{{I\left( {p = u} \right)}} \max \left( {0,\;M - k_{qv} \frac{{ts_{pq} \cdot is_{uv} }}{{\left| {ts_{pq} } \right| \cdot \left| {is_{uv} } \right|}}} \right)} } \hfill \\ {\text{s}}{\text{.t}}{.}\;\mathop \sum \limits_{{v \in B_{q} }} \left( {k_{qv} + 1} \right) \ge 2\quad \forall q,\quad k_{qv} = \left\{ \begin{gathered} 1,\;p = u \hfill \\ - 1,\;p \ne u \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$
(2)

Here, \({\text{B}}_{\text{q}}\) is the positive bag of the qth feature vector, and \({\text{k}}_{\text{qv}}\) is the correlation indicator that encodes whether the corresponding text instance and image instance are matched. It is worth noting that each feature vector \({\text{is}}_{\text{uv}}\) and its corresponding bounding box are stored in the storage system for processing interpretable queries.

4.3 Coarse-grained Embedding Learning with Global Semantics

The coarse-grained embedding network tries to capture global common semantics between texts and images. For texts, the Universal Sentence Encoder [27] is utilized to extract feature vectors, which are then transformed by fully connected layers into the global common embedding space of dimension d_g.

For images, inspired by [11], a pretrained LSTM with a soft attention model is integrated to translate images into sequential representations. For an image, the feature maps before classification in a pretrained R-CNN network and the whole image's feature maps before the fully connected layers of a pretrained CNN are combined into feature vectors, denoted as \(a = \left\{ {a_{i} } \right\}_{i = 1}^{LV}\), where LV is the number of feature vectors.

Our implementation of LSTM with soft attention follows [11]. ai is the input; y and \(\alpha_{ti}\) are outputs, where y is the generated sequential text and \(\alpha_{ti}\) represents the importance of feature vector \({\text{a}}_{\text{i}}\) when generating the tth word. Please note that each word yt has an attention weight \(\alpha_{ti}\) for each feature vector \({\text{a}}_{\text{i}}\), and each tuple tut =  < yt, imageID, \(\alpha_{ti}\), xloc1i, xloc2i, yloc1i, yloc2i > is stored for answering future queries, where imageID is the image's unique identifier and xloc1i, xloc2i, yloc1i, yloc2i are the coordinates of the region of \({\text{a}}_{\text{i}}\) in the image. We collect all tuples into the set TU = {tut}.

For the generated sequential text \({\text{y}}\), the Universal Sentence Encoder is utilized to generate its coarse-grained representative vector, denoted as GIV, while the coarse-grained representative vector of the original paired training text, also produced by the Universal Sentence Encoder, is denoted as OTV.

Intuitively, the global training objective pulls GIV toward OTV and is defined as follows.

$$C_{G} = \left| {GIV - OTV} \right|$$
(3)

4.4 Multi-grained Objective Function

We are now ready to formulate the multi-grained objective function, which is designed by two criteria. First, matched pairs of images and texts are likely to have similar patches, which is captured by CP. Second, matched pairs of images and texts probably have similar global semantics, which is captured by CG. By integrating CP and CG, the objective function is defined as follows.

$$C\left( \theta \right) = \alpha C_{P} \left( \theta \right) + \beta C_{G} \left( \theta \right) + \gamma \left| \theta \right|_{2}^{2} ,$$
(4)

where θ is shorthand for the parameters of our model and \(\alpha , \beta , \gamma\) are hyperparameters computed by cross-validation. \({|\theta |}_{2}^{2}\) is the regularization term.

4.5 Optimization

The proposed model consists of two branches, designed for common fine-grained semantics and coarse-grained semantics, respectively. Naturally, the training process is divided into two stages, i.e., branch training and joint training. Both are based on stochastic gradient descent (SGD) with a batch size of 32, a momentum of 0.9 and a weight decay of 0.00005.

Stage 1: In this stage, the branches for common fine-grained and coarse-grained semantics are trained in turn, taking formula (2) and formula (3) as loss functions, respectively. In the fine-grained branch, the pretrained Sent2vec and RCNN models are utilized, while in the coarse-grained branch, the pretrained Universal Sentence Encoder and LSTM models are utilized. The default parameters of these pretrained models are used and kept fixed at this stage. The other parameters of our model, including the attention mechanism, are initialized with the Xavier algorithm [28].

Stage 2: After all branch networks are trained, we jointly fine-tune the entire model parameters by combining the loss terms over all granularities in formula (4).
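The following self-contained TensorFlow sketch illustrates one joint training step in the spirit of formula (4) and the SGD setting above; the stand-in network, the placeholder losses c_p and c_g, and the hyperparameter values are hypothetical simplifications of the actual branches.

```python
import tensorflow as tf

# Stand-in two-layer branch; the real model fuses the fine- and
# coarse-grained branches of Sects. 4.2 and 4.3.
model = tf.keras.Sequential([tf.keras.layers.Dense(1024, activation="relu"),
                             tf.keras.layers.Dense(1024)])
opt = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # as in Sect. 4.5
ALPHA, BETA, GAMMA = 1.0, 1.0, 5e-5     # formula (4) weights (assumed values;
                                        # GAMMA plays the weight-decay role)

def c_p(y):  # placeholder for the fine-grained MIL hinge loss of formula (2)
    return tf.reduce_mean(tf.nn.relu(0.5 - y))

def c_g(y):  # placeholder for the coarse-grained objective of formula (3)
    return tf.reduce_mean(tf.square(y))

x = tf.random.normal((32, 700))         # one batch of 32 Sent2vec-sized inputs
with tf.GradientTape() as tape:
    y = model(x)
    reg = tf.add_n([tf.nn.l2_loss(v) for v in model.trainable_variables])
    loss = ALPHA * c_p(y) + BETA * c_g(y) + GAMMA * reg   # formula (4)
grads = tape.gradient(loss, model.trainable_variables)
opt.apply_gradients(zip(grads, model.trainable_variables))
```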

4.6 Interpretable Query Processing

In the MCQI framework, images and texts can be represented by high-dimensional feature vectors that include fine-grained and coarse-grained semantic features. Denote by IFVi the feature vector of the ith instance Insi; then IFVi = {CFVFi, CFVCi}, where CFVFi and CFVCi are the common fine-grained semantic feature and the coarse-grained semantic feature of Insi, respectively. Given a query instance, i.e., an image or text instance, in order to find the matched cross-modal instance, i.e., the most relevant text or image instance, the similarity between two cross-modal instances can be computed by the cosine similarity shown in formula (5):

$$\begin{gathered} {\text{SIM}}\left( {Ins_{i}, Ins_{j} } \right) = \delta \frac{{CFVF_{i} \cdot CFVF_{j} }}{{\left| {CFVF_{i} } \right| \cdot \left| {CFVF_{j} } \right|}} + \left( {1 - \delta } \right)\frac{{CFVC_{i} \cdot CFVC_{j} }}{{\left| {CFVC_{i} } \right| \cdot \left| {CFVC_{j} } \right|}} \hfill \\ = \delta \,{\text{Cosine}}\left( {CFVF_{i}, CFVF_{j} } \right) + \left( {1 - \delta } \right){\text{Cosine}}\left( {CFVC_{i}, CFVC_{j} } \right) \hfill \\ \end{gathered}$$
(5)

Here, Insi and Insj are two cross-modal instances, \(\delta\) is the weight factor and Cosine is the cosine similarity function.
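A minimal Python sketch of formula (5), assuming each instance is represented as a pair (CFVF, CFVC); the default δ = 0.5 and the vector dimensions are illustrative choices.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sim(ins_i, ins_j, delta=0.5):
    """ins = (CFVF, CFVC): fine- and coarse-grained common feature vectors."""
    return (delta * cosine(ins_i[0], ins_j[0])
            + (1 - delta) * cosine(ins_i[1], ins_j[1]))

rng = np.random.default_rng(0)
a = (rng.normal(size=1024), rng.normal(size=512))   # (fine, coarse) features
b = (rng.normal(size=1024), rng.normal(size=512))
print(sim(a, b, delta=0.5))
```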

A naive method to obtain the matched cross-modal instances is pairwise computation; however, this method is inefficient, particularly when the dataset is large and the dimension of the vectors is high. To address this, an inverted index and an M-tree index are integrated into the MCQI model. The M-tree index increases the efficiency of queries, and the inverted index enhances their interpretability. Index construction and the corresponding query processing method are discussed as follows.

4.7 Index Construction

As shown in formula (5), the similarity between two instances is mainly calculated from the cosine similarities of two types of feature vectors. By assuming that the variables obey a uniform distribution, we obtain Observation 1 below, which shows that the cosine similarity between the whole feature vectors of Insi and Insj is close to SIM(Insi, Insj).

Observation 1

For random variable \(\delta \in \left[ {0.2,0.8} \right]\), \(\exists \varepsilon ,\sigma \in \left[ {0,1} \right]\), s.t. \(P\left( {\left| {\left( {\delta \frac{{CFVF_{i} \cdot CFVF_{j} }}{{\left| {CFVF_{i} } \right| \cdot \left| {CFVF_{j} } \right|}} + \left( {1 - \delta } \right)\frac{{CFVC_{i} \cdot CFVC_{j} }}{{\left| {CFVC_{i} } \right| \cdot \left| {CFVC_{j} } \right|}}} \right) - \frac{{IFV_{i} \cdot IFV_{j} }}{{\left| {IFV_{i} } \right| \cdot \left| {IFV_{j} } \right|}}} \right| < \varepsilon } \right) > \sigma\), i.e., \(P\left( {\left| {SIM\left( {Ins_{i}, Ins_{j} } \right) - {\text{Cosine}}\left( {Ins_{i}, Ins_{j} } \right)} \right| < \varepsilon } \right) > \sigma\).

This observation is obtained by the statistical hypothesis testing method, which is illustrated in the experiments. Setting DIF = \(\left|{\text{SIM}}({\text{Ins}}_{\text{i}}, {\text{Ins}}_{\text{j}})-\text{Cosine(}{\text{Ins}}_{\text{i}}, {\text{Ins}}_{\text{j}}\text{)}\right|\), we get P(DIF < \(\varepsilon\)) > \(\sigma\). In our experiments, when \(\varepsilon\) = 0.05 we have \(\sigma\) = 0.9, and when \(\varepsilon\) = 0.1 we have \(\sigma\) = 0.98.

It is known that the M-tree is an efficient structure for NN queries in metric spaces. In order to use an M-tree index, the cosine distance should be transformed into the angular similarity (AS), which is metric. The angular similarity between Insi and Insj is defined in formula (6):

$${\text{AS}}\left( {Ins_{i}, Ins_{j} } \right) = \frac{{2\arccos \left( {{\text{Cosine}}\left( {Ins_{i}, Ins_{j} } \right)} \right)}}{\pi }$$
(6)
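Formula (6) in code: a small sketch mapping cosine similarity to the angular value used as the M-tree metric (note that smaller AS values mean more similar instances); the clip call is a guard against floating-point drift.

```python
import numpy as np

def angular_similarity(cos_sim):
    """AS of formula (6); np.clip guards against floating-point drift."""
    return 2.0 * np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi

# arccos is monotonically decreasing, so a larger cosine similarity always
# yields a smaller AS value, which is the ordering Lemma 1 below relies on.
assert angular_similarity(0.9) < angular_similarity(0.1)
```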

Lemma 1.

For any instance q, the nearest neighbor of q by angular similarity is also the nearest neighbor of q by cosine similarity.

Lemma 1 can be easily proved by contradiction, and the full proof is omitted for brevity; intuitively, since arccos is monotonically decreasing, the instance with the largest cosine similarity to q is exactly the one with the smallest angular value.

Based on Lemma 1 and formula (6), an M-tree is constructed on the data set of feature vectors. The M-tree is then augmented with an inverted index over the semantic relationship tuple set TU, which is mentioned in Sect. 4.1.

4.8 Interpretable kNN Query

For processing similarity queries efficiently, we adopt a filter-and-refine model. Our method first obtains candidate matched objects via the M-tree and then verifies the candidates to identify the final answers.

The M-tree inherently supports range queries, denoted Range(Insi, r), where Insi is the query instance and r is the query range. In our algorithm, the kNN candidates can be efficiently obtained by two range queries on the M-tree. To verify the candidates, formula (5) is utilized, and for the verified objects the inverted index is accessed to give the reasons why the objects are relevant to the query. The detailed procedure is shown in Algorithm 1. Specifically, we first use the range query Range(Insi, 0) to find the closest index node and read all the objects in that node (line 2). If the number of objects is less than k, we read its sibling nodes through its parent node, recursively, until we obtain k objects (line 3). We then take the distance r from the query instance to the kth farthest object obtained so far and issue the second range query with range r to get the candidates. Finally, we utilize formula (5) to verify the candidates, and each matched pair is augmented with a relationship interpretation through the inverted index (lines 6-8).

Algorithm 1 Interpretable kNN query processing (pseudocode figure)
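The sketch below mimics Algorithm 1's two-phase strategy; a brute-force range_query stands in for the real M-tree traversal, and all names and data are illustrative assumptions.

```python
import numpy as np

def range_query(q, r, data):
    """Objects within angular distance r of q (brute-force M-tree stand-in)."""
    d = 2.0 * np.arccos(np.clip(data @ q, -1.0, 1.0)) / np.pi
    return np.where(d <= r)[0], d

def knn(q, data, k):
    _, dist = range_query(q, 0.0, data)       # phase 1: probe at radius zero
    # Widen until k objects are covered (the paper reads sibling nodes via the
    # parent; with this stand-in we simply take the kth smallest distance).
    r = np.partition(dist, k - 1)[k - 1]
    cand, dist = range_query(q, r, data)      # phase 2: range query with radius r
    return cand[np.argsort(dist[cand])][:k]   # rank candidates (the paper then
                                              # verifies them with formula (5))

rng = np.random.default_rng(1)
D = rng.normal(size=(500, 64))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit vectors: dot = cosine
print(knn(D[0], D, k=5))
```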

As for complexity, for the first range query with range zero, the cost is O(H), where H is the height of the M-tree. For the second range query, let se be the selectivity of a range query; the cost at each level of index nodes can then be approximated by a geometric sequence with common ratio cr·se, where cr is the capacity of an index node. Hence, the average cost is:

$$\frac{{cr \cdot se \cdot \left( {1 - \left( {cr \cdot se} \right)^{H} } \right)}}{{1 - cr \cdot se}}$$
(7)

As for query accuracy, by Observation 1 and Lemma 1, we can get Observation 2 as follows.

Observation 2.

Algorithm 1 can obtain the kNN instances of the query instance with probability more than \(\sigma\).

Proof.

Suppose o is the actual kth NN result but is not returned. Let dis denote the distance between the returned kth NN result and the query. By Lemma 1, the distance between o and the query is less than dis + DIF. Setting \(\varepsilon\) = dis, by Observation 1, DIF is less than dis with probability at least \(\sigma\). Hence, with probability at least \(\sigma\), Algorithm 1 would return o, which contradicts the assumption. Observation 2 is thus proved.

5 Distributed Algorithm

When the data set is relatively large, the computational complexity of the algorithm becomes high, as shown in formula (7). Therefore, in order to process large-scale data sets effectively, this section extends the framework to a distributed environment and proposes a distributed kNN algorithm.

The distributed algorithm is based on the idea of divide and conquer. Each computing node in a P2P distributed system is independent and autonomous. Let C be the number of computing nodes and PV the pivot set of the data set, PV = {pvi}, where 1 ≤ i ≤ pn and pn is the number of pivot points. PV is stored on each computing node as global information. Each computing node is responsible for one or more pivot points. Data objects are divided among the computing nodes according to their distances to the pivot points. Then, each computing node builds an M-tree and an inverted index locally. When a computing node receives a similarity query q with query range R, it acts as the coordinator and calculates the relevant pivot points by formula (8):

$${\text{SIM}}(pv_{i}, q) \ge 1 - (R + maxd_{i} ),$$
(8)

where maxdi is the largest distance from pvi among the data objects it maintains. The coordinator then forwards the query to the computing nodes where the relevant pivot points are located, and each such node calculates its query result through the local indices and returns it to the coordinator. Finally, the coordinator collects the intermediate query results and returns the final result to the user. Obviously, the selection of pivot points and the query algorithms are the key determinants of query performance, and these two parts are discussed in detail below.

5.1 Selection of Pivot Points

The main function of a pivot point is to filter irrelevant data objects during query processing, so the criterion for selecting pivot points is to increase the filtering ability of queries as much as possible. In metric spaces, mutually distant pivot points are generally preferred. Based on this heuristic, we propose a pivot point selection scheme similar to [29], sketched in code below. First, randomly select a data object o from the sample data set and put the data object farthest from o into the pivot set PV. Then, repeatedly add to PV the data object of the sample data set with the largest average distance to the pivots already chosen, until |PV| = pn.
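A sketch of this farthest-first selection, assuming a precomputed sample matrix and a metric distance function; the angular metric of formula (6) and the toy data in the usage example are assumptions.

```python
import numpy as np

def select_pivots(sample, pn, dist):
    """Greedy farthest-first choice of pn pivots from a sample of objects."""
    rng = np.random.default_rng(0)
    o = sample[rng.integers(len(sample))]              # random start object
    first = int(np.argmax([dist(x, o) for x in sample]))
    pivots = [first]                                   # farthest object from o
    sum_d = np.array([dist(x, sample[first]) for x in sample])
    sum_d[first] = -np.inf                             # never re-pick a pivot
    while len(pivots) < pn:
        nxt = int(np.argmax(sum_d))  # largest average distance to chosen pivots
        pivots.append(nxt)
        sum_d += np.array([dist(x, sample[nxt]) for x in sample])
        sum_d[nxt] = -np.inf
    return pivots

# Usage with the angular metric of formula (6) on toy unit vectors.
rng = np.random.default_rng(1)
pts = rng.normal(size=(100, 8))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
ang = lambda a, b: 2 * np.arccos(np.clip(a @ b, -1, 1)) / np.pi
print(select_pivots(pts, pn=4, dist=ang))
```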

5.1.1 Query-Sensitive Load Balancing

In a distributed environment, consistent hashing is used to maintain and manage the pivot points; that is, the pivot points are mapped into the domain [0, 2^max], which is divided into multiple intervals (tokens), each computing node is responsible for one or more intervals and queries are routed through the system's distributed hash table. With a hash method such as SHA1, the pivot points are divided evenly among the computing nodes. However, the query load is not always evenly distributed, and its distribution changes dynamically. Therefore, in order to achieve system load balance, a query-aware adjustment method is needed. First, set a threshold t for load balance. If a computing node (ComputerA) exceeds t times the average load, i.e., ComputerA becomes a load bottleneck, then ComputerA communicates with the computing node (ComputerB) adjacent to its responsible area; the area ComputerA is responsible for is reduced while that of ComputerB is increased, and the corresponding pivot points are moved from ComputerA to ComputerB. If other computing nodes then have load balance problems, this process is repeated for each of them until the system load is balanced. Note that, in order to avoid thrashing, the load adjustments should be performed in the same direction.

5.1.2 Computation of pn

The execution time of query processing can be divided into two parts: the time gt for computing the relevant computing nodes based on the pivot points, and the time lt for each computing node to perform the local query. Therefore, the computing time for query processing is ct = gt + lt. Obviously, gt is proportional to the number of pivot points pn, that is, gt = α·pn, where the coefficient α is related to the processing capability of the computing nodes. The local query time lt decreases as pn grows; on average, lt = \(\beta \frac{{r^{{\log_{m} \frac{N}{pn}}} - 1}}{r - 1}\) = \(\beta \frac{{\left( \frac{N}{pn} \right)^{{\frac{1}{{\log_{r} m}}}} - 1}}{r - 1}\), where r is the average selection degree of the child nodes of the index tree by the query, m is the out-degree of the index tree, N is the size of the data set and the coefficient β is determined by the average processing capability of the computing nodes. Therefore, ct can be computed as:

$${\text{ct}} = \alpha \cdot pn + \beta \frac{{\left( {\frac{N}{pn}} \right)^{{\frac{1}{{\log_{r} m}}}} - 1}}{r - 1}$$
(9)

By taking the derivative and solving for the extreme value, it is easy to show that ct is minimized when

$$pn = \left( {\frac{{\beta N^{{\frac{1}{{\log_{r} m}}}} }}{{\alpha \left( {r - 1} \right)\log_{r} m}}} \right)^{{\frac{{\log_{r} m}}{{\log_{r} m + 1}}}} ,$$
(10)

ct takes the minimum. By formula (10), the number of pivot points can be obtained.
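Formula (10) in code: a small sketch computing the optimal pivot count; the parameter values in the example are illustrative assumptions.

```python
import math

def optimal_pn(N, m, r, alpha, beta):
    """Pivot count minimizing ct per formula (10)."""
    log_rm = math.log(m, r)                      # log_r m
    base = beta * N ** (1.0 / log_rm) / (alpha * (r - 1) * log_rm)
    return base ** (log_rm / (log_rm + 1))

# e.g., 10 million objects, index fan-out 100, selection degree 2:
print(optimal_pn(N=1e7, m=100, r=2, alpha=1.0, beta=1.0))
```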

5.2 Distributed kNN Query Algorithm

As mentioned at the beginning of this section, range queries are easy to handle via formula (8). We now discuss the distributed nearest neighbor query algorithm.

Consider the simple case k = 1, i.e., a 1NN query. When a computing node receives a 1NN query q, the node, acting as the scheduler, first issues the query object q with query radius 0 and calculates the relevant pivot set PS = {pvi | SIM(pvi, q) \(\ge\) 1 − maxdi}, where maxdi is the distance between the pivot pvi and the farthest data object it maintains. The scheduler then forwards the query to the computing nodes (denoted CS) responsible for the data objects in PS, and each such node calculates the local NN of q and returns it to the scheduler. After receiving all candidate nearest neighbors, the scheduler determines the candidate with the smallest distance to q; let mind be that distance. After that, the scheduler uses q as the query object and mind as the query range to calculate the relevant pivot points and forwards the NN query to the computing nodes responsible for these pivots, except those in CS. Finally, the scheduler collects the candidate data objects, computes the NN and returns it to the user.

The kNN query algorithm is discussed as follows, and the specific process is shown in Algorithm 2. First, the initial query distance initR is estimated from the statistical histograms, and the relevant computing nodes are calculated (line 2). For a kNN query with query object q, the distance between q and each pivot point is qdisti = 1 − SIM(pvi, q), and we let

$$initR = \mathop {{\text{argmin}}}\limits_{r} \mathop \sum \limits_{i = 1}^{pn} {\text{NumHist}}(pv_{i}, qdist_{i} - r, qdist_{i} + r) \ge k$$
(11)

where the function NumHist(pv, dmin, dmax) returns the number of data objects in the index rooted at pivot pv whose distances to pv lie between dmin and dmax. The query is then forwarded to the relevant computing node set CS1; each node calculates its local kNN data objects as a candidate set and returns it to the scheduler. The scheduler computes the distance mind of the current kth nearest candidate over all candidates and calculates the relevant computing node set CS2 for query range mind (lines 3-7). The request is forwarded to the computing nodes in CS2 − CS1, the local kNN candidates of each node are collected, and the final kNN result is computed and returned (lines 8-11).

The kNN query algorithm is also implemented with two range queries; the main difference is that in the first range query kNN uses the histogram information summarized at the pivot points to predict a better initial query range initR. Since initR gives a good estimate of the range covering the k nearest data objects, it effectively reduces the cost of the second range query.

Algorithm 2 Distributed kNN query processing (pseudocode figure)
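The following self-contained sketch emulates the coordinator-side logic of Algorithm 2 on simulated nodes; for brevity the histogram estimate initR of formula (11) is replaced by a radius-0 first probe, as in the 1NN case, and every name here is an illustrative assumption.

```python
import numpy as np

def ang(a, b):                                   # angular metric, formula (6)
    return 2 * np.arccos(np.clip(a @ b, -1, 1)) / np.pi

class Node:                                      # stand-in for a remote computing node
    def __init__(self, objs): self.objs = objs
    def local_knn(self, q, k):                   # would use the local M-tree in practice
        order = np.argsort([ang(q, o) for o in self.objs])
        return [self.objs[i] for i in order[:k]]

def distributed_knn(q, k, pivots, nodes):
    maxd = [max(ang(p, o) for o in n.objs) for p, n in zip(pivots, nodes)]
    qdist = [ang(q, p) for p in pivots]
    # round 1: probe nodes relevant for radius 0 (formula (8) with R = 0)
    cs1 = [i for i in range(len(nodes)) if qdist[i] <= maxd[i]]
    cand = [o for i in cs1 for o in nodes[i].local_knn(q, k)]
    mind = sorted(ang(q, o) for o in cand)[k - 1]    # current kth NN distance
    # round 2: forward to the remaining relevant nodes with query range mind
    cs2 = [i for i in range(len(nodes))
           if i not in cs1 and qdist[i] <= mind + maxd[i]]
    cand += [o for i in cs2 for o in nodes[i].local_knn(q, k)]
    return sorted(cand, key=lambda o: ang(q, o))[:k]

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 16))
data /= np.linalg.norm(data, axis=1, keepdims=True)  # unit vectors: dot = cosine
pivots = data[:4]
home = np.argmin([[ang(p, o) for p in pivots] for o in data], axis=1)
nodes = [Node(data[home == i]) for i in range(4)]
print(distributed_knn(data[0], 3, pivots, nodes))
```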

6 Experiment

6.1 Experiment Setup

We evaluate our cross-modal query performance on Flickr8K [30], Flickr30K [31], NUS-WIDE [32], MS-COCO [33] and a synthetic dataset Synthetic9K. Flickr8K consists of 8096 images from the Flickr.com website, each annotated with 5 sentences via Amazon Mechanical Turk. Flickr30K is also a cross-modal dataset with 31,784 images and corresponding descriptive sentences. The NUS-WIDE dataset is a web image dataset for media search consisting of about 270,000 images with their tags, where each image together with its corresponding tags is viewed as an image/text pair. MS-COCO contains 123,287 images, each also annotated with five independent sentences provided by Amazon Mechanical Turk. By extracting 2000 image/text pairs from each of the above datasets, we obtain a hybrid dataset denoted Synthetic9K. For each dataset, 10% of the data are used as the testing and validation sets, while the rest form the training set.

We compare our approach with eight state-of-the-art cross-modal retrieval methods: CCL [34], HSE [35], DADN [8], SCAN [36], DCMH [13], LGCFL [37], JRL [4] and KCCA [38]. CCL learns cross-modal correlation with a hierarchical network in two stages: separate representations are first learned by jointly optimizing intra-modality and intermodality correlation, and multi-task learning is then adopted to fully exploit the intrinsic relevance between them. HSE proposes a uniform deep model to learn common representations for four types of media simultaneously by considering classification, center and ranking constraints. DADN proposes a dual adversarial distribution network that combines zero-shot learning and correlation learning in a unified framework to generate common embeddings for cross-modal retrieval. SCAN considers the latent alignments between image regions and text words to learn image-text similarity. DCMH combines hashing learning and deep feature learning by preserving the semantic similarity between modalities. LGCFL uses a local group-based prior to exploit popular block-based features and jointly learns basis matrices for different modalities. JRL applies semi-supervised regularization and sparse regularization to learn common representations. KCCA follows the idea of projecting the data into a higher-dimensional feature space and then performing CCA. Some compared methods, such as CCL and HSE, rely on category information for common representation learning; however, the datasets have no label annotations available. So, in our experiments, keywords are first extracted from the text descriptions by the TF-IDF method and treated as labels for the corresponding images. For distributed query processing, our algorithms are compared with the two most related methods. One is a naive method (SimD): data objects are scattered randomly, and when a query arrives, objects are compared with the query pairwise. The other is a state-of-the-art method (DistMP) [39], a general framework based on MapReduce.

Following [34], we apply the mean average precision (MAP) score to evaluate cross-modal query performance. We first calculate the average precision (AP) score for each query by formula (12) and then take the mean of these scores as the MAP score.

$$AP = \frac{1}{\left| R \right|}\sum\nolimits_{i = 1}^{k} {p_{i} * rel_{i} }$$
(12)

where |R| is the number of ground-truth relevant instances, \({\text{k}}\) is the k of the kNN query, \({\text{p}}_{\text{i}}\) denotes the precision of the top i results and \({\text{rel}}_{\text{i}}\) indicates whether the ith result is relevant.
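A small sketch of formula (12) and the resulting MAP; the relevance lists are toy assumptions, and |R| should be the number of ground-truth relevant instances for each query.

```python
import numpy as np

def average_precision(rel, R):
    """rel[i] = 1 if the (i+1)-th returned result is relevant; R = |R|."""
    rel = np.asarray(rel, dtype=float)
    hits = np.cumsum(rel)                        # relevant results so far
    p_at_i = hits / (np.arange(len(rel)) + 1)    # precision p_i of the top i
    return float(np.sum(p_at_i * rel) / R)

# MAP is the mean AP over all queries; here R is taken as the number of
# relevant items in each toy list, assuming all of them were retrieved.
queries = [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
print(np.mean([average_precision(r, R=sum(r)) for r in queries]))
```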

We adopt TensorFlow [40] to implement our MCQI approach. In the first stage, we take the 4096-dimensional feature extracted by RCNN from the image inside a given bounding box. For the nonlinear transformation model, we use three fully connected layers of 1024 dimensions and set the dimensions of the common embedding spaces d_l and d_g to 1024. Sent2vec for fine-grained semantics has 700 dimensions and is pretrained on Wikipedia, and the Universal Sentence Encoder for coarse-grained semantics has 512 dimensions. Experiments for the centralized algorithms are conducted on a server with an Intel E5-2650v3 CPU, 256 GB RAM, an NVIDIA V100 GPU and Ubuntu 16.04, while experiments for the distributed algorithms run on a cluster of 30 computing nodes, each with a 4-core 1.6 GHz Intel Core i5-10210 CPU and 8 GB memory.

6.2 Verification of Observation 1

Figures 2 and 3 show the accuracy of DIF < 0.05 and DIF < 0.1, respectively, for different sample sizes. \(\delta\) is randomly generated from three ranges, i.e., [0.2, 0.8], [0.3, 0.7] and [0.4, 0.6]; across these ranges, the closer \(\delta\) is to 0.5, the higher the accuracy. For DIF < 0.05, as the sample size increases the accuracy stays steadily above 0.9, and for DIF < 0.1 it stays steadily above 0.99. Without loss of generality, following the statistical hypothesis testing method, for \(\delta \in\) [0.2, 0.8] we test the hypothesis DIF < 0.05 at significance level 0.1. In our experiments with sample size 100,000, the mean value of DIF is 0.021 and the sample variance is 0.00045; because the standard deviation is unknown, the t-distribution is used. The test statistic is −0.63, and at significance level 0.1 the critical quantile is −1.28. Because −0.63 > −1.28, the hypothesis is accepted.

Fig. 2 Accuracy of DIF < 0.05

Fig. 3 Accuracy of DIF < 0.1

6.3 Performance of Query Accuracy

We present the query accuracy of our MCQI approach and all compared methods. Table 2 shows the MAP scores for 30NN queries. As shown in the table, the accuracies of DNN-based methods like DADN and CCL are higher than those of traditional methods on average. Due to the fusion of multi-grained semantic features and transfer learning embeddings, the MCQI approach steadily achieves the best query accuracy. The number of data categories in Synthetic9K is larger than in the other datasets, and learning common semantic embeddings depends comparatively more on the quantity of training data per category; so, under the same conditions, the accuracy on Synthetic9K is relatively affected.

Table 2 MAP scores of MCQI and compared methods for 30NN query

6.4 Performance of Query Time

As shown in Fig. 4, we measure the query time of our proposed MCQI approach and two representative methods on the 5 datasets: CCL is a DNN-based method, and DCMH is a hash-based method. CCL needs pairwise computation to obtain kNN results. DCMH transforms the data into binary codes and obtains the 1NN quickly, but for kNN with varying k its query time is affected. Intuitively, query times are proportional to the sizes of the datasets. As CCL and DCMH are not very sensitive to the k of kNN queries, we show the query time of only 30NN queries on each dataset. From 30NN queries to 5NN queries, the filtering effect of the M-tree index strengthens, and consequently query times decrease. In all cases, MCQI is the fastest method. In particular, for 5NN the average running time of MCQI is about 13 times faster than that of CCL and 20 times faster than that of DCMH, i.e., our approach on average outperforms CCL and DCMH by an order of magnitude.

Fig. 4 Performance of query time

6.5 Performance of Distributed Algorithm

To show the scalability of our framework, Fig. 5 presents the running times of the methods with varying k of kNN queries on the NUS-WIDE and MS-COCO datasets, respectively. In terms of running time, MCQI is nearly three times faster than DistMP, which in turn is one order of magnitude faster than SimD. SimD, as a pairwise method, incurs enormous communication cost in a distributed environment, while DistMP utilizes the metric distance to filter unrelated data and thus saves computation cost. However, the lack of an efficient index gives DistMP worse query performance than MCQI. In essence, MCQI consists of two rounds of NN query groups, and it is easy to see that MCQI is significantly better than SimD and DistMP.

Fig. 5 Effect of distributed query on NUS-WIDE (left) and on MS-COCO (right)

6.6 Query Interpretability

Figure 6 shows some examples of cross-modal similarity query results. MCQI not only learns the latent semantic common embeddings of the two modalities but also retains explicit alignment information. Hence, for kNN queries, MCQI can return similar objects in the datasets and further give a reason why those objects are semantically related, which is very important for serious applications.

Fig. 6 Examples of processing cross-modal similarity queries by MCQI

7 Conclusion

In this paper, we proposed a novel framework for Multi-grained Cross-modal Similarity Query with Interpretability (MCQI), which leverages coarse-grained and fine-grained semantic information to achieve effective, interpretable cross-modal queries. MCQI integrates deep neural network embedding with a high-dimensional query index and introduces an efficient kNN similarity query algorithm with theoretical support. Experimental results on widely used datasets demonstrate the effectiveness of MCQI. In future work, we will study reinforcement learning-based cross-modal query approaches to reduce the dependence on large domain-specific training data.