The influence of image descriptors’ dimensions’ value cardinalities on large-scale similarity search

  • Theodoros Semertzidis
  • Dimitrios Rafailidis
  • Michael Gerassimos Strintzis
  • Petros Daras
Regular Paper

Abstract

In this empirical study, we evaluate the impact of the dimensions’ value cardinality (DVC) of image descriptors on the performance of large-scale similarity search. DVCs are inherent characteristics of image descriptors, defined for each dimension as the number of distinct values that the descriptors take in that dimension, thus expressing the dimension’s discriminative power. In our experiments, on six publicly available datasets of image descriptors of different dimensionality (64–5,000 dim) and size (240 K–1 M), (a) we show that DVC varies, due to the existence of several extraction methods using different quantization and normalization techniques; (b) we also show that image descriptor extraction strategies tend to follow the same DVC distribution function family; therefore, similarity search strategies can exploit image descriptors’ DVCs, irrespective of the sizes of the datasets; (c) based on a canonical correlation analysis, we demonstrate that image descriptors’ DVCs have a significant impact on the performance of the baseline LSH method [8] and three state-of-the-art hashing methods, SKLSH [28], PCA-ITQ [10] and SPH [12], as well as on the performance of the MSIDX method [34], which exploits the DVC information; (d) we experimentally demonstrate the influence of DVCs on both sequential search and the aforementioned similarity search methods, and discuss the advantages of our findings. We hope that our work will motivate researchers to consider DVC analysis as a tool for the design of similarity search strategies in image databases.

Keywords

Dimensions’ value cardinalities · Indexing · Content-based image retrieval · Approximate similarity search

1 Introduction

This work presents an empirical study of dimensions’ value cardinality (DVC), defined as the number of distinct values in each dimension of image descriptor vectors. Through our analysis and experiments, we examine the influence of DVCs on the performance of approximate similarity search algorithms as well as on the sequential search in image databases.

Many hashing techniques [8, 12, 13, 19, 20, 28, 31, 38, 39] have been proposed to provide efficient methods for high-dimensional indexing of low-level descriptor vectors of multimedia, such as video or still images. The mapping of low-level descriptor vectors into the Hamming space [5] using appropriate hashing functions ensures scalability of similarity search algorithms to large-scale datasets, due to the compactness of the data and the fast Hamming distance computations. The goal of the hashing functions is to map similar (i.e. adjacent in the Euclidean space) high-dimensional image descriptors to neighboring binary codes in the Hamming space. Similarity search is then performed by comparing the binary codes. However, hashing methods often fail to keep neighboring vectors adjacent in the Hamming space and thus have low accuracy. The performance of similarity search methods is usually measured in terms of mean Average Precision (mAP), expressing how well the methods preserve the Euclidean neighbors of sequential search. In particular, when the hashing functions are selected independently from the data, or when a short binary code length is selected, hashing methods have limited mAP. Moreover, for long binary code lengths, a significant preprocessing time is required and the speedup factor (SF) of similarity search is highly reduced. Hashing methods are categorized as data-dependent or data-independent, based on the method followed to generate the hashing functions. Efficiency improvements of data-dependent methods over data-independent ones have been shown in several studies [19, 39] for the case where limited hash code sizes are employed; this happens due to the increase of independence between the hash functions as their number increases.
For example, spectral hashing [39] outperforms many data-independent methods for small code sizes, but it is outperformed by the data-independent method of shift-invariant kernel hashing [28] for sizes over 64 bits. Moreover, all data-dependent hashing methods often incur a significant preprocessing cost for learning from the selected training dataset and for generating the binary codes. While in most hashing methods the usual technique for assigning the binary codes is to partition the metric space of the projected image descriptor data points with appropriate hyperplanes and assign a different code to each side, in the recent approach of spherical hashing (SPH) [12], the partitioning of data points for computing the binary codes is based on hyperspheres. According to the experimental evaluation of [12], SPH outperforms other state-of-the-art hashing methods.

In the work of [2], the authors developed an analytic model to describe the operation of hash table-based multimedia fingerprint databases. In their analysis, they show that their model can predict the performance of a search through a hash-based database as a function of both the statistical distribution of the fingerprints and the actual values of the database design parameters. The main idea is to exploit the notion of “weak” bits. When extracting the fingerprint of the query multimedia object, each bit of the query fingerprint is assigned a probability value, which describes the likelihood that the respective bit would change if the query object were modified. The bits that are assigned a high probability of change are called “weak” bits. In their algorithm, a stability score is assigned to each bit. Less stable (weaker) bits are toggled to generate multiple pseudo-queries from a single query. The results from all the generated queries are then aggregated. However, the algorithm requires that the stable bits of the query be correctly identified, otherwise it fails. Moreover, the aforementioned algorithm is limited to multimedia fingerprint databases.

Apart from hashing strategies, the recently proposed MSIDX method [34] exploits the correlation between (a) the value cardinality of each dimension of the descriptor vector and (b) the discriminative power of the specific dimension, assuming that dimensions with high value cardinalities have more discriminative power. The key idea of MSIDX is to reorder the storage positions of image descriptors according to the value cardinalities of their dimensions, by performing a multiple-sort algorithm. This sorting approach aims to increase the probability that two similar images lie in storage positions that do not differ by more than a specific global constant range, which is calculated as a percentage of the dataset size and denoted by the parameter \(w\). As was experimentally shown, MSIDX outperforms current state-of-the-art hashing methods in terms of both mAP and SF.
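As an illustrative sketch (not the authors’ released implementation, which is available at [15]), the MSIDX reordering idea can be expressed with NumPy: compute each dimension’s DVC, rank the dimensions by decreasing DVC, and sort the descriptor storage positions lexicographically in that dimension order. The function names here (`msidx_order`) are ours:

```python
import numpy as np

def msidx_order(X, decimals=8):
    """Sketch of the MSIDX storage reordering: sort descriptors
    lexicographically over dimensions ranked by decreasing DVC."""
    Xq = np.round(X, decimals)
    # DVC: number of distinct values per dimension
    dvc = np.array([np.unique(Xq[:, j]).size for j in range(Xq.shape[1])])
    # dimensions sorted by decreasing value cardinality
    dim_order = np.argsort(-dvc)
    # np.lexsort treats the LAST key as primary, so reverse the order
    keys = tuple(Xq[:, j] for j in reversed(dim_order))
    return np.lexsort(keys)  # new storage positions

# toy example: 5 descriptors, 3 dimensions with DVCs 2, 3 and 4
X = np.array([[1, 3, 7],
              [1, 1, 2],
              [2, 3, 9],
              [1, 5, 4],
              [2, 1, 7]], dtype=float)
order = msidx_order(X)  # primary sort key is the highest-DVC dimension
```

In this toy case the last dimension has the highest DVC (4 distinct values), so it becomes the primary sort key, with ties broken by the lower-DVC dimensions.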

1.1 Contribution and Layout

The high performance of MSIDX raises several important questions about large-scale image similarity search: (a) why do descriptor extraction techniques produce different DVCs? (b) Can similarity search strategies exploit descriptors’ DVCs, irrespective of the dataset size? And finally, (c) is there a correlation between DVCs and the performance of hashing methods, in terms of mAP and SF? The contribution of this paper is summarized as follows:
  1. (C1)

    In six publicly available datasets, it is experimentally shown that image descriptors’ DVCs vary due to the existence of different extraction techniques, and that dimensions with relatively high DVCs have relatively higher discriminative power.

     
  2. (C2)

    It is shown that each descriptor extraction method tends to produce similar DVC distributions for different dataset sizes. Thus, similarity search strategies that exploit image descriptors’ DVCs can scale, since the DVC distributions over the dimensions are preserved, irrespective of the dataset sizes.

     
  3. (C3)

    It is verified that the values of the image descriptors’ DVCs have a strong impact on the similarity search performance both in terms of mAP and SF. The correlations of a set of variables describing DVCs for each image descriptor dataset and a set of variables describing the performance of the similarity search strategy were calculated using canonical correlation analysis (CCA). The CCA approach considers the performance variables, mAP and SF, as a set and not separately, since both mAP and SF play a crucial role in similarity search.

     
The remainder of the paper is organized as follows. In Sect. 2, we present our case study of six publicly available datasets of image descriptors of different dimensionality and size, on which we performed our DVC analysis. In Sect. 3, we describe the examined similarity search strategies, whereas in Sect. 4, we present our CCA for evaluating the impact of DVCs of image descriptors on the similarity search strategies’ performance. In Sect. 5, we present and discuss the results of three different sets of experiments, aiming to examine the notion of DVC from different perspectives. Finally, in Sect. 6, we draw the conclusions of our study, we provide a practical guide and discuss possible future work, extending our analysis to vantage indexing, dimensionality reduction and data co-reduction methods. Such methods are also suitable for large-scale similarity search in image databases and other similar problems.

2 Analysis of image descriptors’ DVC

2.1 Impact of images’ descriptor extraction strategies on DVC and search performance

A wide set of factors is known to affect the retrieval performance of image descriptor extraction algorithms. The actual image characteristics that each algorithm selects to identify and encode are tightly correlated with the semantic definition of similarity adopted by the algorithm designer. The ability of the algorithm to correctly model image characteristics such as color, texture, illumination and resolution variations plays a very important role in the final retrieval performance of the image descriptor. However, given the retrieval performance of an image descriptor algorithm, the aim of our analysis is to study how the different dimensions of a descriptor vector contribute to the overall performance in the case of hashing and other approximate similarity search strategies. In this section, image descriptor extraction techniques are discussed to identify how different methodologies influence DVC values.

For each dimension, DVC is the number of distinct values that can be found in this dimension throughout a dataset of image descriptors. Descriptor vectors are integer or real-valued vector representations of the characteristics of either a part of an image (i.e. local descriptors) or the whole image (i.e. global descriptors). These are typically histograms or other vector representations of image characteristics such as color, texture, edges, illumination and their spatial distribution in the examined area. A parameter that varies among descriptor extraction techniques is the number of dimensions of the descriptor vector. In the case of local image descriptor vectors, the number of dimensions depends on the selected number of attributes or the binning resolution chosen for producing the histograms of the local attributes. In the case of generating global image descriptors from local ones, the typical procedure is a “bag-of-words” technique, where local descriptors are assigned, either by soft or hard assignment, to a predefined number of centroids; a histogram of these assignments is then constructed. The number of local descriptor vectors used to extract the global descriptor may vary across images due to different image sizes, sampling strategies (e.g. dense grid, interest points, pyramidal decompositions) or the selected extraction density [34].

According to [34], such variations lead the global descriptor vectors to low performance; thus, a post-processing phase to normalize the values in each dimension is required [23, 36, 41]. However, this step renders the descriptor values real numbers, which further increases the algorithm’s complexity, leading to high processing time and storage requirements, especially for large-scale datasets. The typical approach to address this drawback is quantization of the values in each dimension, which, however, is generally a lossy process and thus introduces a trade-off between retrieval accuracy and computational cost. In practice, though, a very limited number of dimensions reach the quantization bounds, while most of them are highly repetitive and thus restricted to a lower DVC bound.

2.2 Calculation of image descriptors’ DVC

The value cardinality of each dimension is the number of distinct values that exist in this dimension throughout the dataset [34]. In the case of integer values, this is well defined.

In the case of real values, a finite number of decimal digits should be selected. However, real values are calculated either by normalizing integers to the \([0,1] \subset \mathbb {R}\) range, and are thus already restricted to the original discrete values, or are real numbers with restricted decimal accuracy, due to the memory and time bounds that the algorithms and current computers introduce. As a result, the descriptor vectors in all datasets of the experiments have a limited number of decimals, usually not exceeding 8 decimal digits. Following [34], we should also note that in our experiments no value quantization was applied to the examined datasets.
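The DVC computation described above can be sketched in a few lines, assuming NumPy and a descriptor matrix with one row per image (real values rounded to a fixed decimal accuracy, as discussed); this is an illustration, not the authors’ code:

```python
import numpy as np

def dvc(X, decimals=8):
    """Per-dimension value cardinality: the number of distinct values
    in each dimension of descriptor matrix X (one row per image)."""
    Xq = np.round(X, decimals)  # real values: fixed decimal accuracy
    return np.array([np.unique(Xq[:, j]).size for j in range(X.shape[1])])

# toy integer-valued dataset: 4 descriptors, 3 dimensions
X = np.array([[0, 5, 1],
              [0, 3, 1],
              [1, 5, 1],
              [0, 2, 1]])
cards = dvc(X)  # [2 3 1]: the middle dimension is the most discriminative
```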

2.3 Evaluation datasets

The evaluation datasets used in our experiments are the datasets used in [34], with the additional C-SIFT dataset. The collection of datasets contains both local and global descriptors from different collections\(^{1,2,3}\) and has not been subjected to any additional preprocessing steps.

From the ImageClef image collection, we derived the CIME 64d-240K, CEDD 144d-240K and SURF 5000d-240K datasets. The CIME 64d-240K dataset features 64-dimensional CIME descriptors [32] with integer values \(\in \{0,\ldots,63\}\). The CEDD 144d-240K dataset contains 144-dimensional global CEDD descriptors [4] with integer values \(\in \{0,\ldots,7\}\), and the SURF 5000d-240K dataset uses a 5,000-dimensional codebook to extract global vectors from the local SURF descriptors [3], with normalized real values \(\in [0,1]\).

From the TEXMEX collection, we used the SIFT 128d-1M and GIST 960d-1M datasets, featuring 1 million image descriptors each. The SIFT 128d-1M dataset consists of 128-dimensional local SIFT descriptors [23] with integer values \(\in \{0,\ldots,255\}\), while the GIST 960d-1M dataset holds global GIST descriptors [27] with real values \(\in [0, 1.0929]\).

Finally, the C-SIFT 1019d-700K dataset features 738,418 (\(N\) = 700 K) images, crawled through Flickr’s Web Services by posing 50 random queries. Local C-SIFT descriptors [36] were extracted and a codebook of 1,019 dimensions was computed by clustering the local descriptors. Next, the typical “bag-of-words” approach was followed to compute, for each image, a global C-SIFT descriptor vector of normalized real values \(\in [0,1]\) from the local vectors.

Figure 1 (taken from [34] with the additional C-SIFT 1019d-700K dataset) presents the DVCs of the image descriptors for the six evaluation datasets, to support our first contribution (C1). DVCs vary due to the different descriptor extraction strategies that CIME, CEDD, SIFT, GIST, C-SIFT and SURF follow.
Fig. 1

DVC per dimension in the evaluation datasets. DVCs vary due to the different descriptor extraction strategies that CIME, SIFT, CEDD, GIST, C-SIFT and SURF follow. The figure is based on [34] with the additional C-SIFT-1019d-700K dataset

2.4 DVC in evolving datasets’ sizes

Our goal in this section is to evaluate the correlation of DVCs with the performance of similarity search strategies and to confirm that DVC characteristics can be exploited in the design of scalable similarity search strategies, irrespective of the datasets’ sizes.

Therefore, we down-sampled the evaluation datasets from 100 to 20 % of their sizes \(N\), with a step of 20 %, and generated the empirical cumulative distribution for each down-sampled dataset:
$$\begin{aligned} F(x)=P(X \le x),\; \text { with } x=\mathrm{DVC} \end{aligned}$$
(1)
where \(F(x)\) expresses the probability that a dimension has a DVC value less than or equal to \(x\). The experiments were repeated ten times and the average results are reported in Fig. 2. The \(F(x)\) values are calculated as follows: according to (1), the number of dimensions with DVC values less than or equal to \(x=\mathrm{DVC}\) is counted. For example, in the top-left subfigure of Fig. 2, i.e. the case of the CIME dataset, the 20 % curve for \(x=\mathrm{DVC}=30\) gives \(F(30) \thickapprox 0.2\). This can be interpreted as follows: 20 % (0.2) of the dimensions in the CIME dataset have DVC less than or equal to 30. Please note that Fig. 2 presents the empirical cumulative distribution functions and not the cumulative distribution functions of the DVC. Thus, a stair-like effect is expected, due to the distinct values of the examined samples. Especially in the case of the CEDD 144d-240K dataset the stair-like effect is very strong, due to the very small number of distinct values per dimension.
Fig. 2

Empirical cumulative distribution function \(F(x)\), with \(x=\mathrm{DVC}\), for different sizes (%\(N\)) of the evaluation datasets. The descriptor extraction techniques tend to produce similar DVC distributions from the same distribution family. Thus, similarity search strategies can exploit image descriptors’ DVCs, irrespective of the \(N\) datasets’ sizes
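The empirical cumulative distribution of (1) can be computed directly from the per-dimension DVCs. A minimal NumPy sketch, using toy DVC values rather than the evaluation data (the name `dvc_ecdf` is ours):

```python
import numpy as np

def dvc_ecdf(dvcs, x):
    """Empirical F(x) = P(DVC <= x): the fraction of dimensions whose
    DVC does not exceed x, as in Eq. (1)."""
    dvcs = np.asarray(dvcs)
    return np.count_nonzero(dvcs <= x) / dvcs.size

dvcs = [12, 30, 30, 45, 60]  # toy per-dimension DVCs
f30 = dvc_ecdf(dvcs, 30)     # 0.6: 60 % of dimensions have DVC <= 30
```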

For each evaluation dataset, the Kolmogorov–Smirnov test [25] was performed between all possible pairs of the DVC distributions of the different dataset sizes, and it was found that, for each descriptor extraction methodology, the DVC distributions of the different dataset sizes come from the same distribution function (\(p<0.01\)). This means that each descriptor extraction technique tends to produce the same distribution function family. Since, for each descriptor extraction strategy, the cumulative distributions of DVC come from the same distribution family, the relative differences between the DVCs of the datasets are preserved, irrespective of the datasets’ sizes. Based on this finding, similarity search strategies can exploit image descriptors’ DVCs irrespective of the datasets’ sizes (contribution C2 in Sect. 1.1).
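A sketch of such a pairwise test using SciPy’s two-sample Kolmogorov–Smirnov implementation, run here on surrogate DVC samples rather than the actual evaluation data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# surrogate per-dimension DVC samples for the full dataset and a
# 20 % down-sample (toy values drawn from the same distribution)
dvc_full = rng.integers(1, 1000, size=960)
dvc_sub = rng.integers(1, 1000, size=960)

stat, p = ks_2samp(dvc_full, dvc_sub)
# the KS statistic measures the maximum distance between the two
# empirical CDFs; a large p-value gives no evidence that the samples
# come from different distribution functions
```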

In addition, based on the experimental results of Fig. 2, we observe that in the case of the 20 % down-sampled high-dimensional dataset of SURF, the majority of the DVCs are in the range of [1 1,000], with \(F(1{,}000)\thickapprox 0.8\) (80 %). According to (1), this means that 80 % of the dimensions have DVCs lower than or equal to 1,000, whereas the remaining 20 % of the dimensions have DVCs in the range of [1,000 5,000]. Analogously, the same happens for the high-dimensional datasets of GIST ([2,000 6,000]) and C-SIFT ([1 16,000]), with the majority (80 %) of dimensions having DVCs lower than or equal to 3,000 and 4,000, whereas the remaining 20 % of dimensions have DVCs in the ranges of [3,000 6,000] and [4,000 16,000], respectively. However, in the case of the low-dimensional datasets of CIME ([10 60]), SIFT ([140 220]) and CEDD ([1 8]), the majority (80 %) of the dimensions have DVCs lower than or equal to 50, 170 and 8, respectively. The remaining 20 % of dimensions have DVCs in the ranges of [50 60] and [170 220], or equal to 8. This can be attributed to the fact that high-dimensional descriptors tend to produce high DVCs for fewer dimensions, in comparison with low-dimensional image descriptors. In the high-dimensional evaluation datasets of SURF 5000d-240K, C-SIFT 1019d-700K and GIST 960d-1M there are few dimensions with high discriminative power, specifically those that have high DVCs, while the rest have low DVCs. This effect is attributed to the fact that high-dimensional descriptors tend to be more sparse, with some dimensions frequently holding zeros or highly repetitive values.

3 Similarity search strategies

3.1 Algorithms

In our analysis, we examined the performance of the following similarity search strategies:
  1. 1.

    Locality-sensitive hashing (LSH) [8] is the baseline hashing method used in our experiments. LSH projects the data onto randomly generated directions, drawn from a Gaussian distribution, and thresholds the projections to generate the binary codes. The source code of LSH is publicly available at [30].

     
  2. 2.

    Shift-invariant kernel hashing (SKLSH) [28] is based on random projections, constructed in such a way that the Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel between the vectors. The source code of SKLSH is publicly available at [16].

     
  3. 3.

    Iterative quantization for learning binary codes (PCA-ITQ) [10] minimizes the quantization error by rotating zero-centered PCA projected data. The PCA-ITQ method generates the binary codes in two steps: (a) PCA dimensionality reduction and (b) iterative quantization. In the first step, PCA is performed and a projection matrix is computed for dimensionality reduction. The aim is to produce efficient binary codes, in which the variance of each bit is maximized and the bits are pairwise uncorrelated. Next, in the iterative quantization step, a rotation matrix is computed for the training set so as to minimize the quantization error by preserving the locality structure of the projected data. The source code of PCA-ITQ is publicly available at [16].

     
  4. 4.

    Spherical hashing (SPH) [12] is a hashing method based on a hypersphere binary embedding technique (see Sect. 1). Based on the experimental results of [12], in our experiments, we used the spherical Hamming distance approach, since it achieves higher mAP than the baseline Hamming distance. However, the SF of SPH is comparable to the baseline LSH method. The source code of SPH is publicly available at [30].

     
  5. 5.

    Multi-sort indexing based on DVC (MSIDX) [34] performs a multiple-sort algorithm based on image descriptors’ DVCs (see Sect. 1). The source code of MSIDX is publicly available at [15].

     
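To make the baseline of the list above concrete, the following is a minimal LSH sketch (random Gaussian projections thresholded at zero, in the spirit of [8]); it is an illustration with our own names (`lsh_codes`, `hamming`), not the public source code at [30]:

```python
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    """Minimal LSH sketch: project descriptors onto random Gaussian
    directions and threshold at zero to obtain binary codes."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))  # random hyperplanes
    return (X @ W > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary code vectors."""
    return int(np.count_nonzero(a != b))

X = np.random.default_rng(1).standard_normal((6, 64))  # 6 toy descriptors
B = lsh_codes(X, n_bits=32)
d = hamming(B[0], B[1])  # 0 <= d <= 32
```

Similarity search then ranks database items by the Hamming distance of their codes to the query code, which is far cheaper than Euclidean distances on the raw vectors.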

3.2 Evaluation benchmark

In our experiments, we tested all examined similarity search strategies on the six evaluation datasets of image descriptors. Following the evaluation protocol of [12] and [34], in all our experiments, we performed 1,000 test queries, which were randomly chosen and did not participate in the training/preprocessing phase. For each query, the NN search accuracy is measured in terms of mAP, according to the following ratio:
$$\begin{aligned} mAP = \frac{|\mathbf {R}_\mathrm{seq} \cap \mathbf {R}_\mathrm{ind}|}{k} \end{aligned}$$
(2)
where, \(\mathbf {R}_\mathrm{seq}\) is the set of the top-\(k\) results (Euclidean neighbors) retrieved by the sequential search based on the Euclidean distance (\(L_2\)), and \(\mathbf {R}_\mathrm{ind}\) is the set of the top-\(k\) results retrieved by the examined similarity search method. The final performance of each method is measured by averaging the mAP variable over the 1,000 performed queries. In our experiments, we set \(k=100\). The SF variable is measured as follows:
$$\begin{aligned} \mathrm{SF} = \frac{T_\mathrm{seq}}{T_\mathrm{ind}} \end{aligned}$$
(3)
where \(T_\mathrm{seq}\) is the sequential search time and \(T_\mathrm{ind}\) is the search time of the examined similarity search strategy.
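Equations (2) and (3) can be computed directly; a small sketch with toy result lists and timings (not actual measurements):

```python
def map_at_k(r_seq, r_ind, k=100):
    """NN search accuracy of Eq. (2): fraction of the top-k Euclidean
    neighbors (sequential search) also returned by the examined index."""
    return len(set(r_seq[:k]) & set(r_ind[:k])) / k

def speedup_factor(t_seq, t_ind):
    """Speedup factor of Eq. (3): sequential over index search time."""
    return t_seq / t_ind

# toy example: the index recovers 60 of the top-100 sequential results
r_seq = list(range(100))
r_ind = list(range(60)) + list(range(1000, 1040))
acc = map_at_k(r_seq, r_ind, k=100)  # 0.6
sf = speedup_factor(12.0, 1.5)       # 8.0
```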

3.3 Parameter settings and performance

Several experiments were conducted with different configurations for the number of bits and hash tables for the examined methods. To make a fair comparison between the hashing methods, a condition\(^4\) of \(\mathrm{SF}\ge 1\) was applied, since large #bits and #hash tables increase mAP while reducing SF. For the relatively low-dimensionality datasets (CIME 64d-240K, SIFT 128d-1M and CEDD 144d-240K), we varied the #bits from 1 to 1,024 with a step of 4 bits, to increase the number of observations that satisfy the condition, and fixed the #hash tables = 1. For the high-dimensional datasets (GIST 960d-1M, C-SIFT 1019d-700K, SURF 5000d-240K), we varied the number of bits in the set \(\left\{ {64, 128, 256, 512, 1{,}024}\right\} \). In the high-dimensional datasets of GIST 960d-1M and C-SIFT 1019d-700K the maximum #hash tables was 5, whereas in the extremely high-dimensional dataset of SURF 5000d-240K the maximum #hash tables was 15.

Summarizing the results of our experiments, we observed that for the same SF, SPH outperforms the LSH, SKLSH and PCA-ITQ hashing methods in terms of mAP. Following the aforementioned parameter settings, in Fig. 3 we present the experimental results of LSH and SPH on the high-dimensional datasets of GIST 960d-1M and SURF 5000d-240K, these being the hashing methods of lowest and highest mAP, respectively. In particular, in the left panel of Fig. 3 we present the SF values for both the LSH and SPH methods. At this point, we must mention that for the same parameter settings, i.e. number of bits and number of hash tables, LSH and SPH produce comparable SF. By increasing the number of hash tables and the number of bits, SF is significantly reduced. The red line indicates the condition of \(\mathrm{SF}\ge 1\). In the middle and right panels of Fig. 3, we show the results of LSH and SPH, respectively, in terms of NN search accuracy (mAP), with SPH significantly outperforming LSH.
Fig. 3

Performance of LSH and SPH methods of the lowest and highest mAP in the high-dimensional datasets of a GIST 960d-1M and b SURF 5000d-240K. The red line indicates \(\mathrm{SF}=1\). For the same settings (#bits and #hash tables = HT), LSH and SPH have comparable SF. mAP is presented for those settings that satisfy the \(\mathrm{SF}\ge 1\) condition

Following the parameter setup of [34], in the MSIDX method we varied the parameter \(w\) as a percentage of the dataset size (%\(N\)) from 5 to 25 %, with a step of 5 %. Our aim in selecting these values for \(w\) was to produce experiments with SF values directly comparable to SPH’s. For the same SF, we observed that MSIDX outperforms the four hashing methods in terms of mAP, by considering the discriminative power of image descriptors’ DVCs. In all evaluation datasets, SPH achieves \(\mathrm{mAP}\le 0.5\), whereas MSIDX achieves \(\mathrm{mAP} \le 0.8\) in the same SF range of [1, 25], as presented in Fig. 4.
Fig. 4

Performance of MSIDX on the six evaluation datasets in terms of a SF and b mAP by varying the \(w\) parameter, as a percentage of the datasets’ sizes (%\(N\))

4 Canonical correlation analysis (CCA): impact of DVC on similarity search strategies

4.1 Preliminaries of CCA

Canonical correlation analysis (CCA) [14] has been applied to many machine learning methods. CCA generates a multivariate statistical model facilitating the study of interrelationships among sets of multiple dependent variables and multiple independent variables. According to [14], the goal of CCA is to maximize \(R_c\), the linear correlation between two sets of metric/categorical variables. If the generated model is not statistically significant, as measured by Wilks’ \(\Lambda \) statistic [24], then the two sets of variables are not linearly correlated; for such cases, non-linear CCA or kernel CCA have been proposed [21].

Consider two sets of input data, from which we draw independent and identically distributed (iid) samples to form a pair of input vectors, \(\mathbf {x_1}\) and \(\mathbf {x_2}\). The goal is to find the linear combination of the variables which gives the maximum \(R_c\) between the combinations, i.e. to find \(\mathbf {w_1}\) and \(\mathbf {w_2}\), so as to maximize the correlation between \(y_1\) and \(y_2\). Let
$$\begin{aligned} y_1=\mathbf {w_1}^T\mathbf {x_1}= \sum _{j} w_{1j}x_{1j} \end{aligned}$$
(4)
$$\begin{aligned} y_2=\mathbf {w_2}^T\mathbf {x_2}= \sum _{j} w_{2j}x_{2j} \end{aligned}$$
(5)
The solutions are constrained to ensure a finite solution by setting the variance of \(y_1\) and \(y_2\) to 1. Let \(\mu _{11}\) and \(\mu _{12}\) be the means of \(\mathbf {x_1}\) and \(\mathbf {x_2}\), respectively. Then, the standard statistical method [24] starts by defining
$$\begin{aligned} \Sigma _{11}=E\{(\mathbf {x_1}-\mu _{11}){(\mathbf {x_1}-\mu _{11})}^T\} \end{aligned}$$
(6)
$$\begin{aligned} \Sigma _{22}=E\{(\mathbf {x_2}-\mu _{12}){(\mathbf {x_2}-\mu _{12})}^T\} \end{aligned}$$
(7)
$$\begin{aligned} \Sigma _{12}=E\{(\mathbf {x_1}-\mu _{11}){(\mathbf {x_2}-\mu _{12})}^T\} \end{aligned}$$
(8)
$$\begin{aligned} K=\Sigma _{11}^{-\frac{1}{2}}\Sigma _{12}\Sigma _{22}^{-\frac{1}{2}} \end{aligned}$$
(9)
where \(T\) denotes the transpose of a vector. The singular value decomposition (SVD) of matrix \(K\) results in:
$$\begin{aligned} K=(\alpha _1,\alpha _2,\ldots ,\alpha _k)D{(\beta _1,\beta _2,\ldots ,\beta _k)}^T \end{aligned}$$
(10)
where \(\alpha _i\) and \(\beta _i\) are the standardized eigenvectors of \(KK^T\) and \(K^TK\), respectively, and \(D\) is the diagonal matrix of the \(\lambda _i\) eigenvalues. The overlapping variance between the canonical variate pairs is represented by the \(\lambda _i\) eigenvalues. According to [14], if there is a linear correlation between the \(\mathbf {x_1}\) and \(\mathbf {x_2}\) sets of variables, then the first eigenvalue, \(\lambda _1\), is the most important. This means that the weight vectors of the first canonical variates (those with the greatest correlation) are given by
$$\begin{aligned} \mathbf {w_1}= \Sigma _{11}^{-\frac{1}{2}}\alpha _1 \end{aligned}$$
(11)
$$\begin{aligned} \mathbf {w_2}= \Sigma _{22}^{-\frac{1}{2}}\beta _1 \end{aligned}$$
(12)

4.2 DVC and performance sets

To evaluate the impact of the DVCs of the image descriptors on the performance of the five examined similarity search strategies (described in Sect. 3.1), we performed CCA on two sets of variables: the DVC set, as the set of independent variables and the Performance set as the set of dependent variables.

The DVC set\(=\)\(\{ \mu _2, \mu _3, \mu _4\}\) consists of the second, third and fourth central moments\(^5\) about the mean \(\mu \) of each DVC distribution, expressing each DVC distribution’s variance, asymmetry and “peakedness”, respectively. The values of the \(\mu _2, \mu _3, \mu _4\) variables for the six evaluation datasets are presented in Table 1.
Table 1

The three central moments \(\mu _2\), \(\mu _3\), \(\mu _4\) of DVC in the six evaluation datasets

 

Dataset              \(\mu _2\)      \(\mu _3\)       \(\mu _4\)
CIME 64d-240K        81.826          \(-\)660.770     17,691
SIFT 128d-1M         189.090         3,316.6          1.619e+05
CEDD 144d-240K       1.006           \(-\)0.959       2.607
GIST 960d-1M         2.973e+05       1.718e+08        3.515e+11
C-SIFT 1019d-700K    4.673e+06       7.922e+09        8.176e+13
SURF 5000d-240K      1.451e+05       1.666e+09        3.250e+13

The Performance set\(=\)\(\{\mathrm{mAP, SF} \}\) consists of the mAP and SF variables of the examined LSH, SKLSH, PCA-ITQ, SPH and MSIDX similarity search strategies, expressing how well each similarity search strategy preserves the Euclidean neighbors of sequential search and the method’s speedup factor compared to the linear time of sequential search, calculated according to (2) and (3), respectively.
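The central moments \(\mu _2, \mu _3, \mu _4\) that populate the DVC set are plain moments about the mean; a minimal sketch with toy per-dimension DVC values (the helper name `central_moment` is ours):

```python
import numpy as np

def central_moment(x, k):
    """k-th central moment about the mean of a DVC sample."""
    x = np.asarray(x, dtype=float)
    return float(np.mean((x - x.mean()) ** k))

dvcs = np.array([12, 30, 30, 45, 60])  # toy per-dimension DVCs
mu2, mu3, mu4 = (central_moment(dvcs, k) for k in (2, 3, 4))
# mu2: variance, mu3: asymmetry, mu4: "peakedness" of the DVC distribution
```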

4.3 Canonical correlations

For each examined similarity search method (LSH, SKLSH, PCA-ITQ, SPH, MSIDX), we generated a statistical model based on CCA. In particular, for each similarity search method, we created tuples (samples) in the form \(\{\mu _2,\mu _3,\mu _4,SF,mAP\}\), where the \(\mu _2,\mu _3,\mu _4 \) central moments are calculated by the DVC distribution of the evaluation datasets, where each examined similarity search method is tested. SF and mAP are the respective performance variables of each similarity search method for a setting of parameters, i.e. #bits, #hash tables in the hashing methods and the \(w\) parameter for MSIDX. For example, in the LSH method for all possible different settings of #bits, #hash tables (see Sect. 3) on the six evaluation datasets we generated the respective \(\{\mu _2,\mu _3,\mu _4,SF,mAP\}\) tuples. An overview of CCA is presented in Fig. 5. The calculated statistics of our CCA for each examined similarity search method are the following.
Fig. 5

CCA overview

The canonical correlation coefficient \(R_c\) measures the strength of the overall relationship between the linear composites (canonical variates) of the \(\{\mathrm{mAP, SF}\}\) dependent and the \(\{ \mu _2, \mu _3, \mu _4\}\) independent variables. In effect, it is the bivariate correlation between the two canonical variates and, according to [14], equals the square root of the first eigenvalue of the matrix \(K\) defined in (9), \(R_c=\sqrt{\lambda _1}\), which captures a large amount of the variance of the examined variables.

The canonical variates \(\mathrm{DVC}_s\) and \(\mathrm{Perf}_s\) are synthetic variables, formed as linear combinations (weighted sums) of two or more observed variables; they can be defined for either the dependent or the independent set. In our case, we have two variates: \(\mathrm{DVC}_s\) for the DVC set and \(\mathrm{Perf}_s\) for the Performance set, respectively.

Canonical loadings \(r\) measure the simple linear correlation between each of the independent \(\{ \mu _2,\mu _3, \mu _4\}\) or dependent \(\{\mathrm{mAP, SF}\}\) variables and its corresponding canonical variate (i.e. \(\mathrm{DVC}_s\) or \(\mathrm{Perf}_s\)). For example, the canonical loading \(r_\mathrm{mAP}\) is the correlation of the mAP variable with the \(\mathrm{Perf}_s\) canonical variate, of which it is a member. The larger the loading, the more important the variable is in deriving the canonical variate.

Canonical cross-loadings \(rc\) measure the correlation of each observed independent or dependent variable with the opposite canonical variate. For example, \({rc}_{\mu _2-\mathrm{Perf}_s}\) is the correlation of the independent \(\mu _2\) variable with the dependent \(\mathrm{Perf}_s\) canonical variate.
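The statistics above can be computed with a few lines of linear algebra. A sketch of the sample-covariance CCA formulation in NumPy, computed via an SVD of the whitened cross-covariance, which is equivalent to \(R_c=\sqrt{\lambda _1}\) of the matrix \(K\) in (9) (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """First canonical correlation R_c between variable sets X (e.g. the DVC
    set) and Y (e.g. the Performance set), plus the loadings and
    cross-loadings of the X variables."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    Sxx, Syy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T   # whitened cross-cov
    U, s, Vt = np.linalg.svd(M)
    Rc = s[0]                                  # = sqrt(lambda_1) of K
    a = np.linalg.solve(Lx.T, U[:, 0])         # canonical weights for X
    b = np.linalg.solve(Ly.T, Vt[0])           # canonical weights for Y
    u, v = Xc @ a, Yc @ b                      # the canonical variates
    loadings = [np.corrcoef(Xc[:, j], u)[0, 1] for j in range(X.shape[1])]
    cross = [np.corrcoef(Xc[:, j], v)[0, 1] for j in range(X.shape[1])]
    return Rc, loadings, cross
```

In the paper's setting, each row of X would hold one \(\{\mu _2,\mu _3,\mu _4\}\) tuple and the corresponding row of Y the \(\{\mathrm{mAP, SF}\}\) values of one parameter configuration of a method.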

5 Experimental results

5.1 Roadmap

In this section, we examine the notion of DVC from different perspectives. In Sect. 5.2 the results of our CCA are discussed; Sect. 5.3 presents the impact of eliminating the low-DVC dimensions; and finally, Sect. 5.4 presents an energy-based study of DVC that further supports our findings.

5.2 CCA results

In Table 2 we present the experimental results of CCA for the LSH, SKLSH, PCA-ITQ, SPH and MSIDX similarity search strategies, including Wilks' \(\Lambda \) statistic [24], the canonical loadings \(r\), the cross-loadings \(rc\) and the \(R_c\) correlation coefficient between the \(\mathrm{DVC}_s\) and \(\mathrm{Perf}_s\) canonical variates. Summarizing our results:
1. The generated CCA-based statistical models of the LSH, SKLSH, PCA-ITQ, SPH and MSIDX similarity search strategies were statistically significant according to Wilks' \(\Lambda \) statistic [24]. Accordingly, we can reject the null hypothesis that there is no relationship between the two variable sets in all five models. This result, which answers contribution C3, can be interpreted as follows: there is a linear correlation between the image descriptors' DVC and the performance of the similarity search strategies, expressed by the \(R_c\) canonical correlation coefficient between the \(\mathrm{DVC}_s\) and \(\mathrm{Perf}_s\) variates.
2. In all five statistical models, the \(\mu _2,\mu _3,\mu _4\) variables are highly correlated with the \(\mathrm{DVC}_s\) and \(\mathrm{Perf}_s\) variates, denoted by \(r_{\mu _2}\), \(r_{\mu _3}\), \(r_{\mu _4}>0\) and \({rc}_{\mu _2-\mathrm{Perf}_{s}}\), \({rc}_{\mu _3-\mathrm{Perf}_{s}}\), \({rc}_{\mu _4-\mathrm{Perf}_{s}}>0\). This observation reflects the high contribution of the \(\mu _2\), \(\mu _3\), \(\mu _4\) central moments to the overall performance of each similarity search strategy.
3. The SF and mAP variables are correlated with the \(\mathrm{Perf}_s\) variate with opposite signs, i.e. \(r_\mathrm{mAP}>0\) and \(r_\mathrm{SF}<0\) for the hashing methods and \(r_\mathrm{mAP}<0\) and \(r_\mathrm{SF}>0\) for MSIDX, expressing the trade-off between SF and mAP in the similarity search strategies.
4. For all five similarity search strategies, the \(|rc_{\mathrm{SF-DVC}_s}|\) canonical cross-loading is low, which means that there is a weak correlation between the SF variable and the \(\mathrm{DVC}_s\) variate. This happens because the choices of (a) #bits and #hash tables for the hashing methods and (b) the parameter \(w\) for MSIDX solely determine the SF of the similarity search strategies, and these choices are not affected by the image descriptors' DVC. However, since SF is strongly correlated with the mAP variable, it is important to retain SF in our analysis.
5. For LSH, SKLSH, PCA-ITQ and SPH, it holds that \(rc_{\mathrm{mAP-DVC}_s}>0\), which means that the NN search accuracy (mAP) of the examined hashing methods is positively correlated with the DVC set of variables. This happens because the efficient encoding of image data into binary codes through hash functions benefits from the high variance, asymmetry and "peakedness" of the image descriptors' DVC.
6. For MSIDX, the \(rc_{\mathrm{mAP-DVC}_s}=-0.476\) canonical cross-loading indicates a negative correlation between MSIDX's mAP and the DVC set of variables. Based on this observation, we conclude that MSIDX's mAP increases when the DVCs of image descriptors have low variance, asymmetry and "peakedness", which occurs in datasets with very high DVC values in all dimensions. This happens because MSIDX does not project the image descriptors' dimensions into a lower dimensional space, as the hashing methods do, and thus all dimensions contribute to the final performance.
Table 2

Experimental results of CCA for evaluating the performance of LSH, SKLSH, PCA-ITQ, SPH and MSIDX on the six evaluation datasets

| | LSH | SKLSH | PCA-ITQ | SPH | MSIDX |
| --- | --- | --- | --- | --- | --- |
| Wilks' \(\Lambda \) | 0.121 (\(p<0.001\)) | 0.018 (\(p<0.001\)) | 0.187 (\(p<0.001\)) | 0.822 (\(p<0.004\)) | 0.525 (\(p<0.001\)) |
| Can. load. \(r\), \(\mathrm{DVC}_s\) | | | | | |
| \(r_{\mu _2}\) | 0.475 | 0.431 | 0.558 | 0.656 | 0.440 |
| \(r_{\mu _3}\) | 0.306 | 0.502 | 0.196 | 0.867 | 0.595 |
| \(r_{\mu _4}\) | 0.061 | 0.036 | 0.167 | 0.965 | 0.732 |
| Can. cross-load. \(rc\), \(\mathrm{DVC}_s\)-\(\mathrm{Perf}_s\) | | | | | |
| \({rc}_{\mu _2-\mathrm{Perf}_s}\) | 0.445 | 0.427 | 0.503 | 0.239 | 0.303 |
| \({rc}_{\mu _3-\mathrm{Perf}_s}\) | 0.286 | 0.497 | 0.177 | 0.316 | 0.410 |
| \({rc}_{\mu _4-\mathrm{Perf}_s}\) | 0.057 | 0.036 | 0.150 | 0.351 | 0.504 |
| Can. load. \(r\), \(\mathrm{Perf}_s\) | | | | | |
| \(r_\mathrm{mAP}\) | 0.973 | 0.996 | 0.946 | 0.894 | -0.678 |
| \(r_\mathrm{SF}\) | -0.074 | -0.014 | -0.072 | -0.182 | 0.018 |
| Can. cross-load. \(rc\), \(\mathrm{Perf}_s\)-\(\mathrm{DVC}_s\) | | | | | |
| \(rc_{\mathrm{mAP-DVC}_s}\) | 0.911 | 0.987 | 0.852 | 0.325 | -0.476 |
| \({rc}_{\mathrm{SF-DVC}_s}\) | -0.069 | -0.014 | -0.065 | -0.066 | 0.0120 |
| Can. correlation coefficient | | | | | |
| \(R_c\) | 0.936 | 0.991 | 0.900 | 0.364 | 0.688 |

Italic values indicate the canonical loadings and cross-loadings of the canonical correlation analysis

5.3 Elimination of low-DVC dimensions

In this set of experiments we show that the discriminative power (i.e. high mAP) of a descriptor, in the context of large-scale similarity search, stems mostly from its high-DVC dimensions. To support this statement, the following experiments were conducted: for each dataset, we eliminated a percentage of the dimensions, from 5 to 50 % in steps of 5 %, selecting the low-DVC, the high-DVC or randomly chosen dimensions, respectively. For each step we computed the mAP of sequential search on the reduced dataset with respect to sequential search on the full dataset. Figure 6 presents the results of eliminating low-DVC dimensions, grouped into low-dimensional and high-dimensional datasets, while Fig. 7 compares the elimination of low-DVC, high-DVC and randomly selected dimensions per dataset. The figures demonstrate that: (a) low-DVC dimensions contribute much less (compared to high-DVC or randomly selected dimensions) to the performance of similarity search in terms of mAP and (b) eliminating approximately 20 % of a descriptor's low-DVC dimensions preserves at least 80 % of the sequential search performance in terms of mAP. Moreover, even when eliminating 50 % of the low-DVC dimensions, the performance of sequential search is preserved above 70 % mAP in 5 out of the 6 datasets and above 50 % mAP in the SIFT 128d-1M dataset. In Fig. 7, for the majority of the datasets the random-elimination curves lie between the low-DVC and high-DVC curves, meaning that the high-DVC dimensions contribute the most in terms of mAP. However, for CEDD 144d-240K and SURF 5000d-240K the random-elimination curves fall below the high-DVC ones. This observation implies that the dimensions between the high- and the low-DVC ones hold much of the information of the dataset and contribute to the overall performance.
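The elimination procedure itself is straightforward; a sketch in NumPy (the helper name and toy data are illustrative, not from the paper's code):

```python
import numpy as np

def eliminate_low_dvc(X, fraction):
    """Drop the `fraction` of dimensions with the lowest DVC, keeping the
    surviving dimensions in their original order."""
    dvc = np.array([len(np.unique(X[:, j])) for j in range(X.shape[1])])
    n_drop = int(round(fraction * X.shape[1]))
    drop = np.argsort(dvc)[:n_drop]                  # lowest-DVC dimensions
    keep = np.setdiff1d(np.arange(X.shape[1]), drop)
    return X[:, keep], keep

# toy example: per-dimension DVCs are [1, 4, 2, 3]
X = np.array([[7, 0, 0, 0],
              [7, 1, 0, 1],
              [7, 2, 1, 2],
              [7, 3, 1, 2]])
Xr, keep = eliminate_low_dvc(X, 0.5)
print(keep)   # -> [1 3]  (the two highest-DVC dimensions survive)
```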
Fig. 6

Performance of sequential search of descriptors, after eliminating a percentage of low-DVC dimensions. The results are grouped based on the datasets’ dimensionality in a low and b high dimensional

Fig. 7

Performance of sequential search of descriptors, after eliminating a percentage of low-, high- or randomly selected DVC dimensions per dataset

To assess the influence of the low-DVC eliminations on the examined similarity search strategies, we selected for each dataset a percentage of elimination that preserves the mAP of sequential search above 90 %, based on the experiments of Fig. 6. The dashed red line in Fig. 6 marks the 90 % mAP threshold, which results in eliminating 20 % of the dimensions for CIME 64d-240K, 5 % for SIFT 128d-1M, 30 % for CEDD 144d-240K, 20 % for GIST 960d-1M, 20 % for C-SIFT 1019d-700K and 15 % for SURF 5000d-240K. Next, for each pair of full and reduced datasets and for each similarity search method, we computed the mAP for 1,000 top-\(K\) queries with \(K=100\). The impact of discarding the low-DVC dimensions on the performance of each similarity search method was evaluated by computing the mAP difference, denoted by mAPDif, between the original and the reduced datasets:
$$\begin{aligned} \mathrm{mAPDif} = \mathrm{mAP}_\mathrm{original} - \mathrm{mAP}_\mathrm{eliminated} \end{aligned}$$
(13)
The results of our experiments are presented in Table 3 for LSH, SKLSH, PCA-ITQ and SPH and in Table 4 for MSIDX. The bold values in the tables mark the largest absolute mAP difference of each dataset across all bit or \(w\) settings. The results clearly demonstrate the preservation of mAP, with maximum differences that do not rise above 2.5 % for the hashing methods and 4 % for MSIDX. By eliminating the dimensions carrying less information, the hashing methods encode the available information of the high-DVC dimensions more efficiently. Moreover, in some cases the reduced datasets perform better than the original ones, since the low-DVC dimensions induce noise in the projections of the hashing methods. In MSIDX, the multiple sortings of the descriptors' dimensions from high- to low-DVC explain why mAPDif increases with the \(w\) parameter (Table 4).
Table 3

mAPDif (%) for the hashing methods, by eliminating a number of low-DVC dimensions (per dataset) that preserves the sequential search above 90 % mAP (dashed line in Fig. 6)

| Method / dataset | 8 bits | 16 bits | 32 bits | 64 bits | 128 bits | 256 bits | 512 bits | 1,024 bits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LSH | | | | | | | | |
| CIME 64d-240K | 0.76 | -0.51 | 0.68 | -1.19 | 0.38 | 0.73 | **1.31** | 0.65 |
| SIFT 128d-1M | 0.10 | -0.49 | -1.26 | **2.02** | 0.19 | 0.08 | 0.48 | 0.74 |
| CEDD 144d-240K | 0.30 | 1.11 | -0.29 | -0.05 | **-1.27** | 0.41 | -0.25 | -0.03 |
| GIST 960d-1M | 0.07 | -0.13 | -0.14 | -0.34 | 0.16 | -0.29 | 0.16 | **0.39** |
| C-SIFT 1019d-700K | 0.14 | 0.30 | -0.13 | 0.10 | **0.67** | -0.11 | -0.09 | 0.40 |
| SURF 5000d-240K | 0.10 | -0.20 | 0.46 | -0.03 | 0.28 | 0.10 | **-0.60** | -0.36 |
| SKLSH | | | | | | | | |
| CIME 64d-240K | **0.44** | -0.13 | -0.08 | -0.39 | -0.09 | 0.07 | 0.01 | -0.03 |
| SIFT 128d-1M | 0.04 | 0.00 | **-0.07** | -0.03 | 0.00 | 0.01 | -0.03 | 0.05 |
| CEDD 144d-240K | 0.09 | 0.02 | -0.01 | 0.03 | 0.03 | -0.01 | 0.00 | **-0.10** |
| GIST 960d-1M | 0.25 | -0.10 | 0.64 | 0.22 | -0.47 | -0.97 | **-1.58** | -0.29 |
| C-SIFT 1019d-700K | 0.05 | 0.30 | -0.13 | 0.06 | 0.32 | 1.76 | **-2.45** | 0.48 |
| SURF 5000d-240K | 0.16 | -0.08 | 0.15 | 0.02 | 0.04 | 0.23 | -0.21 | **0.47** |
| PCA-ITQ | | | | | | | | |
| CIME 64d-240K | 0.02 | 0.56 | **0.74** | | | | | |
| SIFT 128d-1M | -0.01 | 0.24 | **0.54** | -0.45 | | | | |
| CEDD 144d-240K | -0.34 | **0.73** | -0.39 | 0.06 | | | | |
| GIST 960d-1M | -0.10 | 0.11 | 0.04 | 0.19 | 0.51 | 0.72 | **1.03** | |
| C-SIFT 1019d-700K | -0.45 | -0.10 | 0.36 | 0.99 | 0.94 | 1.01 | **1.14** | |
| SURF 5000d-240K | 0.04 | -0.04 | -0.01 | **-0.13** | 0.10 | 0.05 | -0.06 | 0.00 |
| SPH | | | | | | | | |
| CIME 64d-240K | -0.10 | -0.22 | 0.25 | 0.62 | -0.03 | 1.15 | **1.68** | 1.48 |
| SIFT 128d-1M | 0.19 | -0.29 | 0.78 | 0.25 | -0.15 | **1.39** | 0.67 | 0.80 |
| CEDD 144d-240K | -0.19 | 0.64 | -0.18 | 0.11 | -0.39 | 0.80 | 0.89 | **0.99** |
| GIST 960d-1M | 0.41 | 0.16 | 0.44 | 0.40 | **2.50** | 0.30 | 0.08 | 0.58 |
| C-SIFT 1019d-700K | 0.41 | -0.16 | -0.66 | 1.02 | 0.46 | **1.98** | 1.07 | 0.96 |
| SURF 5000d-240K | -0.35 | **-0.99** | 0.18 | 0.14 | -0.54 | -0.70 | 0.34 | 0.05 |

The negative values correspond to experiments where the mAP was higher for the reduced datasets

Bold values indicate the largest absolute mAP difference of each dataset across all bit settings

Table 4

mAPDif (%) for MSIDX, by eliminating a number of low-DVC dimensions (per dataset) that preserves the sequential search above 90 % mAP

| Dataset | 2.5W | 5W | 7.5W | 10W | 12.5W | 15W | 17.5W | 20W | 22.5W | 25W |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIME 64d-240K | -0.08 | 0.10 | 0.02 | 0.19 | 0.34 | 0.56 | 0.92 | 1.29 | 1.80 | **2.46** |
| SIFT 128d-1M | 0.01 | 0.02 | 0.06 | 0.17 | 0.34 | 0.65 | 1.15 | 1.79 | 2.56 | **3.34** |
| CEDD 144d-240K | 0.00 | 0.03 | 0.20 | 0.40 | 0.56 | 0.62 | 0.94 | 1.21 | 1.74 | **2.74** |
| GIST 960d-1M | 0.00 | 0.02 | 0.07 | 0.17 | 0.42 | 0.86 | 1.31 | 1.82 | 2.30 | **2.88** |
| C-SIFT 1019d-700K | 0.04 | 0.09 | 0.16 | 0.24 | 0.34 | 0.45 | 0.58 | 0.74 | 0.94 | **1.21** |
| SURF 5000d-240K | 0.44 | 0.49 | 0.51 | 0.51 | 0.56 | 0.73 | 0.90 | 1.14 | 1.41 | **1.74** |

Bold values indicate the largest absolute mAP difference of each dataset across all \(w\) settings

As a stress test of our rationale, we performed a different experiment: we eliminated 50 % of the dimensions (the low-DVC half) of every dataset and computed the respective mAPDif values (Tables 5 and 6). With 50 % of the dimensions eliminated, the majority of the configurations showed mAP drops in the range of 1–6 %, with only a few outliers that reached a 19 % mAP drop for the lower-dimensional datasets (i.e. CIME, SIFT) in the hashing methods. In the case of MSIDX, the mAP drop was greater, since the algorithm considers more low-DVC dimensions as \(w\) increases. However, the maximum mAP drop for the hashing methods was less than 4.5 % for the relatively higher-dimensional datasets (i.e. GIST, C-SIFT and SURF). An interesting observation is that for some configurations the hashing methods performed slightly better on the reduced datasets, denoted by the negative values. Even though part of the information is lost by eliminating the low-DVC dimensions, the hashing methods encode the information of the high-DVC dimensions more efficiently.
Table 5

mAPDif (%) for the hashing methods, by eliminating 50 % of the low-DVC dimensions

| Method / dataset | 8 bits | 16 bits | 32 bits | 64 bits | 128 bits | 256 bits | 512 bits | 1,024 bits |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LSH | | | | | | | | |
| CIME 64d-240K | -0.07 | 1.77 | -0.89 | -0.14 | 0.50 | 3.72 | 5.26 | **6.64** |
| SIFT 128d-1M | -0.91 | -0.53 | -1.00 | 2.58 | 4.52 | 9.44 | 14.94 | **19.37** |
| CEDD 144d-240K | -0.74 | -0.48 | -0.94 | -2.13 | 0.25 | -0.02 | 1.60 | **2.40** |
| GIST 960d-1M | 0.17 | -0.16 | 0.05 | 0.28 | 0.57 | 0.60 | 1.33 | **2.34** |
| C-SIFT 1019d-700K | -0.18 | -0.39 | -0.43 | -0.18 | -0.35 | 0.32 | 0.69 | **1.09** |
| SURF 5000d-240K | 0.02 | -0.22 | -0.36 | -0.44 | -0.68 | -0.90 | **-0.99** | -0.82 |
| SKLSH | | | | | | | | |
| CIME 64d-240K | **-0.50** | 0.24 | 0.05 | 0.10 | 0.10 | 0.08 | 0.00 | -0.20 |
| SIFT 128d-1M | 0.01 | 0.02 | 0.01 | **0.03** | 0.02 | -0.03 | 0.01 | 0.01 |
| CEDD 144d-240K | -0.02 | -0.03 | -0.07 | -0.15 | -0.23 | -0.53 | -0.74 | **-1.16** |
| GIST 960d-1M | **1.96** | -0.16 | 0.39 | -0.65 | -0.93 | -1.02 | -1.61 | 0.83 |
| C-SIFT 1019d-700K | 0.09 | 0.07 | 0.03 | -1.25 | -0.02 | 0.58 | 1.10 | **2.71** |
| SURF 5000d-240K | -0.01 | 0.07 | -0.17 | 0.05 | **-0.53** | 0.17 | -0.32 | -0.05 |
| PCA-ITQ | | | | | | | | |
| CIME 64d-240K | 0.31 | 0.82 | **1.61** | | | | | |
| SIFT 128d-1M | 0.51 | 1.40 | 3.81 | **6.02** | | | | |
| CEDD 144d-240K | -0.13 | 0.43 | 0.35 | **1.08** | | | | |
| GIST 960d-1M | -0.06 | 0.40 | 0.61 | 1.05 | **1.61** | 1.56 | | |
| C-SIFT 1019d-700K | -0.24 | 0.51 | 0.74 | 1.15 | **2.97** | 2.88 | | |
| SURF 5000d-240K | -0.06 | 0.02 | 0.22 | -0.03 | 0.64 | 0.88 | 0.77 | **2.35** |
| SPH | | | | | | | | |
| CIME 64d-240K | -0.23 | 1.37 | 1.21 | 2.74 | 4.65 | 6.08 | 8.14 | **9.48** |
| SIFT 128d-1M | -0.13 | 0.13 | 1.26 | 3.51 | 6.32 | 9.40 | 12.77 | **13.08** |
| CEDD 144d-240K | 0.22 | 0.76 | 0.29 | 1.63 | 2.02 | 3.48 | 2.93 | **3.99** |
| GIST 960d-1M | 0.12 | 0.17 | 0.07 | -0.20 | -0.22 | 0.00 | -0.40 | **-0.59** |
| C-SIFT 1019d-700K | 0.19 | 1.11 | 1.01 | 1.84 | 2.58 | 2.91 | 3.60 | **4.18** |
| SURF 5000d-240K | 0.13 | 0.12 | 0.29 | 0.37 | -0.37 | -0.22 | **1.09** | 0.12 |

Bold values indicate the largest absolute mAP difference of each dataset across all bit settings

Table 6

mAPDif (%) for MSIDX, by eliminating 50 % of the low-DVC dimensions

| Dataset | 2.5W | 5W | 7.5W | 10W | 12.5W | 15W | 17.5W | 20W | 22.5W | 25W |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIME 64d-240K | 0.28 | 0.74 | 1.67 | 3.10 | 5.08 | 7.01 | 9.24 | 11.47 | 14.32 | **16.88** |
| SIFT 128d-1M | 0.61 | 3.37 | 7.84 | 12.92 | 18.17 | 22.99 | 27.30 | 31.11 | 34.17 | **36.87** |
| CEDD 144d-240K | 0.47 | 1.48 | 2.54 | 3.50 | 4.55 | 5.63 | 6.64 | 7.55 | 8.55 | **9.63** |
| GIST 960d-1M | 0.07 | 0.56 | 1.62 | 2.94 | 4.50 | 6.10 | 7.38 | 8.43 | 9.79 | **11.00** |
| C-SIFT 1019d-700K | 0.06 | 0.19 | 0.43 | 0.85 | 1.45 | 2.27 | 3.28 | 4.46 | 5.67 | **7.03** |
| SURF 5000d-240K | 1.81 | 2.31 | 3.18 | 4.40 | 5.99 | 7.70 | 9.89 | 11.77 | 13.81 | **15.79** |

Bold values indicate the largest absolute mAP difference of each dataset across all \(w\) settings

Finally, we calculated the Pearson correlation between DVC and variance and found that they are positively correlated (\(p<0.05\)) in the six evaluation datasets, which means that dimensions with high/low DVC also have high/low variance, respectively. To exclude the possibility that the low mAPDif values of the hashing methods in Table 5 (the case of eliminating 50 % of the low-DVC dimensions) were caused by the low variance of these dimensions rather than by the low DVC itself, we repeated the experiments on two additional synthetic datasets. In particular, to decorrelate DVC and variance we generated two synthetic datasets of 100K vectors with 512 (SYNTH 512d-100K) and 1,024 dimensions (SYNTH 1,024d-100K), preserving similar variance for all dimensions while monotonically increasing the DVC values across dimensions, i.e. \(\forall \) dimensions \(i>j\), \(\mathrm{DVC}_i>\mathrm{DVC}_j\). The two synthetic datasets were generated as follows. The possible values in each dimension of the synthetic vectors were limited to the range \([0, 1]\). Then, equal-width binning was performed for each dimension, with different dimensions having different numbers of bins; the number of bins of the \(i\)-th dimension was set equal to the DVC value of the \(i\)-th dimension. Finally, the 100K values in both synthetic datasets were assigned to the bins (\(=\) DVC) of each dimension using a uniform distribution, thus ensuring that the dimensions have similar variance and different DVC. In doing so, DVC and variance were not correlated at the 0.05 level in either synthetic dataset. In both datasets, the variance was in the range [0.080–0.086], whereas the DVC was in the range [73–584] in SYNTH 512d-100K and [53–1,076] in SYNTH 1,024d-100K. In both synthetic datasets, we repeated the experiment of Table 5 by eliminating 50 % of the low-DVC dimensions and observed that mAP was preserved, i.e. small mAPDif values, similar to those reported in Table 5. For instance, in the case of LSH with {8, 16, 32, 64, 128, 256, 512, 1,024} bits, the mAPDif values were (\(-\)0.06, 0.03, 0.05, 0.11, 0.40, 1.05, 2.85, 6.32 %) and (\(-\)0.02, 0.03, 0.03, \(-\)0.05, 0.18, 0.30, 1.05, 2.61 %) in the SYNTH 512d-100K and SYNTH 1,024d-100K datasets, respectively. According to the results of this experiment, we conclude that even when DVC and variance are decorrelated, dimensions with low DVC contribute little to the overall performance of the hashing methods.
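The synthetic-data construction described above can be sketched as follows (parameter values and the generator name are illustrative; the paper's exact generator is not public):

```python
import numpy as np

def make_synth(n, d, dvc_min, dvc_max, seed=0):
    """Vectors in [0, 1] whose j-th dimension takes values on an equal-width
    grid of DVC_j levels, with DVC increasing across dimensions. Sampling the
    levels uniformly gives every dimension a similar variance (close to 1/12,
    the variance of a uniform variable on [0, 1]) while the DVCs differ."""
    rng = np.random.default_rng(seed)
    dvcs = np.linspace(dvc_min, dvc_max, d).astype(int)   # target DVC per dim
    X = np.empty((n, d))
    for j, c in enumerate(dvcs):
        levels = np.linspace(0.0, 1.0, c)   # c equally spaced values (bins)
        X[:, j] = rng.choice(levels, size=n)
    return X, dvcs
```

For example, `make_synth(100_000, 512, 73, 584)` would roughly mirror the SYNTH 512d-100K setting reported above.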

The advantages of the elimination approach are the following:
1. Storage: Current multimedia similarity search systems are required to store and handle huge volumes of content. A decrease of, e.g., 50 % of the descriptor dimensions (see Tables 3, 4, 5, 6) may reduce the dataset size by up to 50 % and thus increase the capacity of the system.
2. Preprocessing: Preprocessing high volumes of multi-dimensional data requires extremely high processing power. Moreover, in the case of periodic retraining of similarity search methods (as may occur due to a significant increase of the dataset), the preprocessing time becomes prohibitive. In our experiments, we observed that eliminating 50 % of the SURF low-DVC dimensions reduced SPH's preprocessing time to approximately 50 % of the cost that SPH required for the initial SURF dataset. This happens because the elimination of low-DVC dimensions results in smaller datasets, avoiding unnecessary computations.
3. Speedup: For the hashing methods, the query-time SF is preserved, since the size of the binary codes is maintained. However, this is not the case for MSIDX. MSIDX performs distance computations among the \(2\times w\) candidate image descriptors, and thus reducing the dimensionality of the descriptors significantly reduces the computation time in terms of SF.

5.4 Energy-based study of DVC

Aiming at an analytical study of the DVC elimination, we followed the concept of the Orthogonal Centroid Feature Selection algorithm [40, 43], used in text categorization applications. A score function for each dimension (i.e. feature) is computed according to:
$$\begin{aligned} S(w_{l})&= \lambda \sum \limits _{j=1}^{k} \frac{n_{j}}{n}(w_{l}^T(m_{j}-m))^2\nonumber \\&+ (1-\lambda ) \sum \limits _{i=1}^{n}(w_{l}^T(x_{i}-m))^2 \end{aligned}$$
(14)
which combines the objective functions of supervised, semi-supervised and unsupervised methods in a unified framework, with \(w_{l}\) the element of the projection matrix that selects dimension \(l\), \(n\) the number of records, \(m\) the mean of each dimension, and \(n_{j}\) and \(m_{j}\) the number of records and the mean of each class in the supervised methods. Parameter \(\lambda \) takes values in \(\{0,1,2\}\) for the unsupervised, supervised and semi-supervised approaches, respectively. In our analysis we followed the unsupervised approach; thus, for \(\lambda =0\), (14) reduces to:
$$\begin{aligned} S(w_{l}) = \sum \limits _{i=1}^{n}(w_{l}^T(x_{i}-m))^2 \end{aligned}$$
(15)
which gives the (scaled) variance of each variable \(l\), i.e. each dimension, of the dataset \(X \in \mathbb {R}^{n \times d}\). Recall that for a positive semidefinite matrix \(A\in \mathbb {R}^{d \times d}\), the energy of the matrix may be defined as the sum of its eigenvalues \(\lambda _{1},\lambda _{2},\ldots ,\lambda _{d}\), which also equals the trace of \(A\):
$$\begin{aligned} E_\mathrm{total} = \sum \limits _{i=1}^{d}\lambda _{i} = tr\{A\} \end{aligned}$$
(16)
According to (16), the energy of the covariance matrix of the full dataset \(X\) equals the sum of the scores of (15) over all dimensions.
Finally, based on [40, 43], the energy function after the selection of \(p\) dimensions (features), sorted by their scores in descending order, is defined as:
$$\begin{aligned} E(p) = \frac{\sum \nolimits _{i=1}^{p}S(d_{i})}{E_\mathrm{total}} \end{aligned}$$
(17)
Given the above formulations, and to evaluate our statement that low-DVC dimensions hold much less information than high-DVC or randomly selected ones, we computed the energy function (17) for each dataset while eliminating dimensions from 5 to 50 % in steps of 5 %, as in the experiments of Sect. 5.3. However, the sorting of scores in (17) followed the DVC sorting of each experiment. The results, presented in Fig. 8, make clear that by eliminating only low-DVC dimensions, the preserved energy is much higher than when eliminating randomly selected or high-DVC dimensions.
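Under the unsupervised setting, \(S(w_l)\) is just the (scaled) per-dimension variance, so the energy curve \(E(p)\) of (17) reduces to a normalized cumulative sum; a short NumPy sketch (function name and toy data are ours):

```python
import numpy as np

def energy_curve(X, order):
    """E(p) of eq. (17): fraction of the total energy E_total = tr(cov(X))
    retained by the first p dimensions of `order` (e.g. dimensions sorted
    by DVC in descending order), for p = 1..d."""
    var = X.var(axis=0)           # per-dimension score S(w_l) of eq. (15), scaled by 1/n
    return np.cumsum(var[order]) / var.sum()

# toy example: the two dimensions have variances 5.0 and 1.25
X = np.array([[0., 0.], [2., 1.], [4., 2.], [6., 3.]])
print(energy_curve(X, np.array([0, 1])))   # -> [0.8 1. ]
```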
Fig. 8

Energy preservation (as a percentage of the total energy) after eliminating a percentage of low-, high- or randomly selected DVC dimensions

The results of Fig. 8 explain the outcomes of our previous experiments, since the higher the energy, the more information is preserved. An interesting observation is that the curves for randomly eliminated dimensions always fall between the low-DVC and high-DVC curves, with the high-DVC elimination curves preserving the least energy. One may interpret this as follows: by removing the high-DVC dimensions from the dataset, the most informative features (dimensions) are lost.\(^{6}\)

6 Conclusions and discussion

Our aim in this paper was to introduce DVC to the multimedia community and to motivate researchers to consider DVC in the design and evaluation of large-scale similarity search strategies, due to the following highly desirable characteristics: (a) descriptor extraction methods tend to produce DVC distributions from the same distribution family irrespective of the datasets' sizes, and thus similarity search strategies that exploit DVC can scale; (b) as experimentally shown by our CCA, the DVCs of image descriptors have a strong impact on similarity search strategies; and (c) eliminating the low-DVC dimensions of a descriptor vector has a minor impact on the mAP performance of similarity search strategies.

6.1 A practical guide

As a general, practical guide for interested researchers, the following steps are recommended. Initially, a training set or the full multi-dimensional dataset is required to compute the DVC of each dimension, as explained in Sect. 2.2. The output of this first stage is a vector of the same length as the descriptor vectors. Then, the entries of the DVC vector should be sorted in descending order, and the sorting index can be used as a priority index for each dimension.

If a DVC elimination approach is selected, following [43], the user can specify a threshold \(T\) for the energy to be preserved after eliminating a set of low-DVC dimensions, as in:
$$\begin{aligned} p = \arg \min {E(p)}\text {, subject to }E(p) \ge T \end{aligned}$$
(18)
note that the smaller the threshold \(T\) (e.g. 90 %), the more dimensions will be eliminated.
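The two steps of the practical guide, i.e. building the DVC priority index and selecting \(p\) via the energy threshold of (18), can be sketched together (function name and default threshold are illustrative):

```python
import numpy as np

def dvc_priority_and_selection(X, T=0.90):
    """Rank dimensions by DVC (descending) and choose the smallest p whose
    retained energy E(p) reaches the threshold T, as in eq. (18)."""
    dvc = np.array([len(np.unique(X[:, j])) for j in range(X.shape[1])])
    priority = np.argsort(-dvc)                # high-DVC dimensions first
    var = X.var(axis=0)
    energy = np.cumsum(var[priority]) / var.sum()
    p = int(np.searchsorted(energy, T) + 1)    # smallest p with E(p) >= T
    return priority, p
```

The returned `priority` index can also drive the weighting or quantization schemes discussed next, instead of hard elimination.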

Otherwise, since most similarity search techniques follow an approximate approach, the dimensions with high priority, i.e. high DVC values, should either be weighted more heavily or quantized with extra bits; in general, they should preserve more information than the low-DVC dimensions. The level of weighting between high- and low-DVC dimensions is an open research issue that depends on each specific case.

6.2 The use of DVC in other applications of image descriptors

Apart from similarity search, image descriptors are used in a wide range of applications such as image clustering [1, 9, 26, 42], image annotation [22, 29, 44], image registration [7, 11, 33] and object recognition [6, 35, 45]. Examining the impact of DVCs in such applications requires further in-depth analysis. Here, we present some key considerations for future research.

All of the aforementioned applications use a distance or similarity metric to determine whether two or more image descriptors represent the same object, belong to the same cluster or are otherwise related. As such, the DVC characteristics of the selected image descriptor, and specifically of each of its dimensions, contribute to the measured distances and thus to the overall performance of the application. The contribution of each dimension to the final performance depends on the processing steps that each method follows.

Applications of image descriptors may be classified as exact or approximate. For applications that compute exact distances on the descriptors, without any indexing technique or other approach that prunes the search space, there is no direct benefit from knowing the DVC characteristics. On the other hand, for applications that follow approximate approaches, such as indexing with hashing methods, spectral clustering or vector quantization, the exploitation of DVCs may have clear benefits. As such, the influence of DVCs on such applications is worth examining.

6.3 Future work

Apart from the hashing methods and MSIDX, many other strategies have been proposed in the literature for efficient large-scale similarity search in image databases:

Vantage indexing for large-scale similarity search, such as the work of [37], aims to increase SF by selecting a small set of reference/vantage multimedia objects, such as images, against which the remaining objects are compared in order to retrieve the \(k\) most similar results. By avoiding the all-to-all comparison, the search space is pruned, increasing SF. To achieve high mAP, the key idea is the accurate definition of criteria to assess the quality of the selected vantage objects. However, these criteria have not yet considered the DVCs of image descriptors.

Dimensionality reduction methods, such as the work of [18], map the original data into a much lower dimensional subspace. An index can then be built on the subspace to further facilitate image similarity search. The main idea is to transform data from a high-dimensional space to a lower dimensional one without losing much information. Many dimensionality reduction methods have been proposed, both global and local. Global methods map the dataset as a whole down to a suitable lower dimensional subspace. Local methods first divide the whole dataset into correlated clusters, each of which is then reduced to its respective subspace by classical PCA or other methods. Dimensionality reduction removes insignificant dimensions; since the reduced descriptors do not preserve complete information, nearest-neighbor search accuracy may be compromised. The demand for high mAP often limits the number of dimensions that can be removed, thus limiting the performance gain. A possible future direction for dimensionality reduction methods is to consider the DVCs of image descriptors, assuming that high-DVC dimensions have more discriminative power and thus contain more valuable information.

Data co-reduction methods, such as the work of [17], achieve simultaneous reduction of both the data size and the image descriptors' dimensionality. This is possible by assuming that a subset of dimensions may have very close values for a subset of image descriptors and, similarly, that a subset of image descriptors may have very similar values along their dimensions. However, the data co-reduction methods still omit the impact of image descriptors' DVCs.

We hope that our analysis will help researchers design large-scale similarity search strategies for image and other multimedia databases, as well as other applications of image descriptors.

Footnotes

  1.
  2.
  3.
  4. In the PCA-ITQ method, due to the PCA's eigen-decomposition, we also satisfied the condition #bits \(< d\), where \(d\) is the dimensionality of each evaluation dataset.
  5. The first central moment \(\mu _1\) of the mean \(\mu \) is discarded in our analysis because, by definition, it is always equal to 0; thus, based on Wilks' \(\Lambda \) statistic [24], \(\mu _1\) generates a statistically insignificant CCA model in the examined methods.
  6. We calculated the Pearson correlation between mAP and energy (Figs. 7, 8) and found that, for all datasets, mAP and energy are correlated with a coefficient above 0.985 (\(p<0.005\)).


Acknowledgments

This work was partially supported by the EC FP7 funded project CUBRIK, ICT-287704 (http://www.cubrikproject.eu).

References

  1. Agrawal R, Wu C, Grosky WI, Fotouhi F (2007) Image clustering using visual and text keywords. In: International symposium on computational intelligence in robotics and automation (CIRA 2007), pp 49–54, 20–23 June 2007
  2. Bauer C, Radhakrishnan R, Jiang W (2010) Optimal configuration of hash table based multimedia fingerprint databases using weak bits. In: Proceedings of IEEE international conference on multimedia and expo (ICME), pp 1672–1667
  3. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) SURF: speeded up robust features. Comput Vis Image Underst (CVIU) 110(3):346–359
  4. Chatzichristofis SA, Boutalis YS (2008) CEDD: color and edge directivity descriptor: a compact descriptor for image indexing and retrieval. In: ICVS, vol 5008 of Lecture notes in computer science. Springer, pp 312–322
  5. Daintith J, Wright E (2008) Hamming space. In: A dictionary of computing. Oxford University Press. Retrieved 30 Oct 2014, from http://www.oxfordreference.com/view/10.1093/acref/9780199234004.001.0001/acref-9780199234004-e-2303
  6. Due Trier Ø, Jain AK, Taxt T (1996) Feature extraction methods for character recognition: a survey. Pattern Recognit 29(4):641–662
  7. Fan B, Wu F, Hu Z (2012) Rotationally invariant descriptors using intensity order pooling. IEEE Trans Pattern Anal Mach Intell 34(10):2031–2045
  8. Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: Proceedings of international conference on very large data bases (VLDB), pp 518–529
  9. Goldberger J, Gordon S, Greenspan H (2006) Unsupervised image-set clustering using an information theoretic framework. IEEE Trans Image Process 15(2):449–458
  10. Gong Y, Lazebnik S, Gordo A, Perronnin F (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans PAMI 35(12):2916–2929
  11. Griffith EJ, Yuan C, Jump M, Ralph JF (2013) Equivalence of BRISK descriptors for the registration of variable bit-depth aerial imagery. In: 2013 IEEE international conference on systems, man, and cybernetics (SMC), pp 2587–2592, 13–16 Oct 2013
  12. Heo JP, Lee Y, He J, Chang S, Yoon S (2012) Spherical hashing. In: Proceedings of CVPR, pp 2957–2964
  13. He J, Radhakrishnan R, Chang S-F, Bauer C (2011) Compact hashing with joint optimization of search accuracy and time. In: Proceedings of CVPR, pp 753–760
  14. Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
  15.
  16.
  17. Huang Z, Shen HT, Liu J, Zhou X (2011) Effective data co-reduction for multimedia similarity search. In: Proceedings of ACM SIGMOD, pp 1021–1032
  18. Huang Z, Shen HT, Shao J, Ruger SM, Zhou X (2008) Locality condensation: a new dimensionality reduction method for image retrieval. In: Proceedings of ACM multimedia, pp 219–228
  19. Jegou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans PAMI 33(1):117–128
  20. Joly A, Buisson O (2011) Random maximum margin hashing. In: Proceedings of CVPR 2011. IEEE, Colorado Springs, US, pp 873–880
  21. Lai PL, Fyfe C (2000) Kernel and nonlinear canonical correlation analysis. Int J Neural Syst 10(5):365–377
  22. Liu C, Yuen J, Torralba A (2009) Nonparametric scene parsing: label transfer via dense scene alignment. In: Proceedings of CVPR 2009. IEEE, Miami, US, pp 1972–1979
  23. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110
  24. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press
  25. Massey FJ (1951) The Kolmogorov–Smirnov test for goodness of fit. J Am Stat Assoc 46(253):68–78
  26. Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Proceedings of NIPS
  27. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. Int J Comput Vis 42(3):145–175
  28. Raginsky M, Lazebnik S (2009) Locality-sensitive binary codes from shift-invariant kernels. In: Proceedings of NIPS, pp 1509–1517
  29. Russell BC, Torralba A, Liu C, Fergus R, Freeman WT (2007) Object recognition by scene alignment. In: NIPS
  30. sglab.kaist.ac.kr\_Hashing/
  31. Song J, Yang Y, Huang Z, Shen H-T, Hong R (2011) Multiple feature hashing for real-time large scale near-duplicate video retrieval. In: Proceedings of the 19th ACM international conference on multimedia (MM ’11). ACM, New York, NY, USA, pp 423–432
  32. Stehling RO, Nascimento MA, Falcao AX (2002) A compact and efficient image retrieval approach based on border/interior pixel classification. In: Proceedings of CIKM
  33. Szeliski R (2006) Image alignment and stitching: a tutorial. Found Trends Comput Graph Comput Vis 2(1)
  34. Tiakas E, Rafailidis D, Dimou A, Daras P (2013) MSIDX: multi-sort indexing for efficient content-based image search and retrieval. IEEE Trans Multimed 15(6):1415–1430
  35. Uijlings JRR, van de Sande KEA, Gevers T, Smeulders AWM (2013) Selective search for object recognition. Int J Comput Vis 104(2):154–171
  36. Van De Sande KEA, Gevers T, Snoek CGM (2010) Evaluating color descriptors for object and scene recognition. IEEE Trans PAMI 32(9):1582–1596
  37. Van Leuken RH, Veltkamp RC (2011) Selecting vantage objects for similarity indexing. ACM TOMCCAP 7(3):16
  38. Wang J, Kumar S, Chang S-F (2010) Semisupervised hashing for scalable image retrieval. In: Proceedings of CVPR, pp 3424–3431
  39. Weiss Y, Torralba A, Fergus R (2008) Spectral hashing. In: Proceedings of NIPS, pp 1753–1760
  40. Yan J, Liu N, Yan S, Yang Q, Fan W, Wei W, Chen Z (2011) Trace-oriented feature analysis for large-scale text data dimension reduction. IEEE Trans Knowl Data Eng 23(7):1103–1117
  41. Yang J, Jiang YG, Hauptmann AG, Ngo CW (2007) Evaluating bag-of-visual-words representations in scene classification. In: Proceedings of ACM MIR, pp 197–206
  42. Yan D, Huang L, Jordan MI (2009) Fast approximate spectral clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’09). ACM, New York, NY, USA, pp 907–916
  43. Yan J, Liu N, Zhang B, Yan S, Chen Z, Cheng Q, Fan W, Ma W-Y (2005) OCFS: optimal orthogonal centroid feature selection for text categorization. In: Proceedings of the 28th annual international ACM SIGIR ’05. ACM, New York, NY, USA, pp 122–129
  44. Zhang D, Islam MM, Lu G (2012) A review on automatic image annotation techniques. Pattern Recognit 45(1):346–362. http://dx.doi.org/10.1016/j.patcog.2011.05.013
  45. Zitová B, Flusser J (2003) Image registration methods: a survey. Image Vis Comput 21(11):977–1000

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Theodoros Semertzidis 1, 2
  • Dimitrios Rafailidis 2
  • Michael Gerassimos Strintzis 1, 2
  • Petros Daras 2

  1. Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, Greece
  2. Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
