# Consensus hashing

DOI: 10.1007/s10994-015-5496-x

- Cite this article as:
- Leng, C. & Cheng, J. Mach Learn (2015) 100: 379. doi:10.1007/s10994-015-5496-x

## Abstract

Hashing techniques have been widely used in many machine learning applications because of their efficiency in both computation and storage. Although a variety of hashing methods have been proposed, most of them make some implicit assumptions about the statistical or geometrical structure of data. In fact, few hashing algorithms can adequately handle all kinds of data with different structures. When considering hybrid structure datasets, different hashing algorithms might produce different and possibly inconsistent binary codes. Inspired by the successes of classifier combination and clustering ensembles, in this paper, we present a novel combination strategy for multiple hashing results, named consensus hashing. By defining the measure of consensus of two hashing results, we put forward a simple yet effective model to learn consensus hash functions which generate binary codes consistent with the existing ones. Extensive experiments on several large scale benchmarks demonstrate the overall superiority of the proposed method compared with state-of-the-art hashing algorithms.

## 1 Introduction

Nearest neighbor (NN) search is a fundamental building block in many machine learning and computer vision applications, such as image recognition, image matching and collaborative filtering. Exhaustive linear search becomes infeasible as the data grows rapidly in size and dimensionality. To overcome the high time complexity of NN search, many approximate nearest neighbor (ANN) search methods have been proposed (Arya et al. 1998; Friedman et al. 1977; Silpa-Anan and Hartley 2008; Indyk and Motwani 1998). Among these techniques, hashing based ANN search has attracted much attention recently (Indyk and Motwani 1998; Datar et al. 2004). Hashing algorithms encode nearby points in the original space into similar codes in a discrete binary space, i.e., Hamming space. This enables large efficiency gains in computation speed for similarity search. For example, the Hamming distance between two 64-bit codes can be computed with a single XOR followed by a population count, which takes only a few CPU cycles on modern hardware.
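As a small illustration of why Hamming distance is cheap, the sketch below packs each binary code into a Python integer and counts the differing bits. The function name `hamming_distance` and the example codes are illustrative and do not come from the paper.

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary codes packed into integers."""
    # XOR leaves a 1 exactly where the two codes disagree;
    # counting those 1s gives the distance.
    return bin(a ^ b).count("1")


code_a = 0b10110010
code_b = 0b10010110
print(hamming_distance(code_a, code_b))  # the codes differ in two positions
```

On hardware with a native population-count instruction, the same XOR-then-count pattern maps to a handful of machine instructions per 64-bit word.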

Numerous hashing methods have been proposed over the last decade, which can be roughly categorized into *data independent* and *data dependent* schemes. Data independent hashing methods, such as the well known locality sensitive hashing (LSH) (Indyk and Motwani 1998; Charikar 2002) and its variants (Kulis and Grauman 2009; Datar et al. 2004; Jain et al. 2008), construct hash functions based on random projections without considering the input data. Rather than random projection, data dependent methods try to learn data-aware hash functions from a training set. Representative approaches include spectral hashing (SH) (Weiss et al. 2008), anchor graph hashing (AGH) (Liu et al. 2011), binary reconstructive embedding (BRE) (Kulis and Darrell 2009), iterative quantization (ITQ) (Gong et al. 2013), spherical hashing (SPH) (Heo et al. 2012), K-means Hashing (KMH) (He et al. 2013), etc. Supervised information is also incorporated into newly proposed hashing methods (Wang et al. 2012; Norouzi and Fleet 2011; Liu et al. 2012; Lin et al. 2013), and it has been demonstrated that side information is conducive to preserving the semantic affinity.
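To make the data independent case concrete, here is a minimal sketch of sign-random-projection LSH in the spirit of Charikar (2002). The helper `make_lsh` and its parameters are hypothetical names chosen for this example. Each bit is the sign of a projection onto a random Gaussian direction, and the training data is never consulted.

```python
import random


def make_lsh(dim: int, n_bits: int, seed: int):
    """Draw n_bits random Gaussian hyperplanes and return an encoder."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def encode(x):
        # One bit per hyperplane: +1 on the non-negative side, -1 otherwise.
        return [1 if sum(w * v for w, v in zip(p, x)) >= 0 else -1
                for p in planes]

    return encode


x = [0.3, -1.2, 0.8, 0.1]
encode_a = make_lsh(dim=4, n_bits=8, seed=0)
encode_b = make_lsh(dim=4, n_bits=8, seed=1)
print(encode_a(x))
print(encode_b(x))
```

Two encoders built with different seeds will typically assign different codes to the same point, which is the kind of run-to-run variation that a combination strategy can exploit.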

Although a variety of hashing methods have been proposed, most of them make some assumptions about the statistical or geometrical structure of data. For instance, SH assumes that the data is uniformly distributed (Weiss et al. 2008), while AGH assumes that the data lies on a manifold (Liu et al. 2011). In fact, few hashing algorithms can adequately handle all kinds of data with different structures. When considering real-world datasets of hybrid structure, different hashing methods produce different and possibly inconsistent hashing results. Two data points might be mapped to nearby codes by one method but to distant codes by another. Even multiple runs of the same algorithm, such as LSH, yield different hashing codes because of the algorithm's inner randomness. Moreover, in many visual applications the data can be represented by different kinds of features, and different hashing codes will clearly be obtained when one algorithm is applied to different features. In all these cases, the diverse outcomes usually contain complementary information which could be useful to improve the performance of hashing. Inspired by the success of classifier combination (Breiman 1996; Schapire 1990) and clustering ensembles (Fred and Jain 2005) in practice, we believe the idea of combining this complementary information to obtain better binary codes is interesting and worth investigating.

In this paper, we first study the problem of combining multiple hashing results and put forward a novel strategy, named *Consensus Hashing* (CH). Generally the problem we focus on here is finding a combined hashing result based on multiple existing ones so as to provide better search performance. There exist two major challenges for this problem. First, how to measure the consensus of two hashing results? The measurement of consensus in hashing is more difficult than that in classifier combination (Breiman 1996; Schapire 1990) because two completely consistent hashing results may seem different in appearance. Second, how to learn consensus hash functions? Finding consensus codes for the training data is not sufficient because the algorithm should also be able to generate hashing codes for the unseen queries. It is necessary to learn new consensus hash functions to solve the problem of out-of-sample extension. To overcome these challenges, we first propose a simple approach to measure the consensus of two hashing results based on connectivity matrix. With this measurement, an effective model is established to learn consensus hash functions which generate binary code consistent with the existing ones. We find that the optimization of the model can be relaxed into two subproblems: a multidimensional scaling (MDS) (Kruskal and Wish 1978) problem and a basic linear regression problem, and closed-form solutions exist for both subproblems. By taking advantage of the Nyström technique (Williams and Seeger 2001), our strategy has linear training time with respect to the size of the training set. One interesting characteristic of our method is that it is inherently qualified for multi-view hashing by combining the diverse hashing codes generated by different features. The effectiveness and efficiency of the proposed CH are validated by extensive experiments, both single-view and multi-view, carried out on several benchmarks.

## 2 Background

The idea of combining diverse information from multiple outcomes lies in the context of ensemble learning. The core idea of ensemble learning, a state-of-the-art learning approach, is to train multiple learners and then combine them. Notable works include Bagging (Breiman 1996), Random Forest (Breiman 2001) and Boosting (Schapire 1990). It is well known that an ensemble is usually more accurate than a single learner, and ensemble methods have already achieved great success in many real-world tasks (Zhou 2012). Abundant theoretical results have been established for these three ensemble learning methods. For example, Breiman proved that Bagging and Random Forest converge to a steady error rate as the ensemble size grows (Breiman 1996, 2001). Freund and Schapire proved that the error of the final combined learner in AdaBoost is upper bounded by the errors of the base learners (Freund and Schapire 1996).

Consensus learning is another strategy within ensemble learning. This learning scheme was originally designed for *unsupervised* tasks such as clustering, i.e., consensus clustering (Monti et al. 2003; Li et al. 2007; Fred and Jain 2005). Consensus clustering is a kind of ensemble whose base learners are clusterings generated by clustering methods. Combining clusterings is considerably more difficult than combining classifiers, because the cluster labels produced by different clustering methods may not correspond. During the past decades, a great number of consensus clustering approaches have been proposed. Representative works include similarity based methods (Fred and Jain 2005), graph based methods (Fern and Brodley 2004) and non-negative matrix factorization (NMF) based methods (Li et al. 2007). Similarity based methods express the base clustering results as similarity matrices and generate the final clustering result based on the consensus similarity matrix (Fred and Jain 2005). The basic idea of graph based methods is to construct an undirected graph to model the base clustering results, and then derive the ensemble clustering via graph partitioning (Fern and Brodley 2004). Li et al. (2007) found that consensus clustering can be formulated within the framework of NMF and proposed a simple iterative algorithm. Recently, weighted consensus clustering (Li and Ding 2008) has been considered, in which a weight for each base clustering is introduced to model its importance. Although these consensus clustering methods have shown the ability to improve clustering quality and robustness in practice, finding theoretical guarantees for the consensus learning strategy is still an open question, unlike for well-established ensemble learning strategies such as Bagging and Boosting.

Our work is largely inspired by work in ensemble learning, especially consensus clustering. Unlike consensus clustering, the base learners in our work are the diverse binary codes generated by hashing algorithms. Our work aims to find a combined hashing result based on the existing ones in order to achieve better performance. More specifically, we establish a complete learning system which is equipped with the ability to measure the consensus of binary codes (Sect. 3), generate hashing codes for the unseen samples (Sect. 4.1), and overcome the large scale learning issue (Sect. 4.2).

## 3 Measure of consensus in hashing

As with any ensemble learning approach, it is crucial for consensus learning to measure the consensus of different base learners. For classifier combination this is not a problem, because the outcome of each weak classifier is explicit, i.e., positive or negative, and outcomes can be compared directly. Such a direct comparison is not suitable for hashing combination. As we will see later, two consistent hashing results may be entirely different in appearance. It is therefore necessary to seek a new metric to measure the consensus in hashing.

We first introduce some notation. Let \(X = [x_{1},x_{2},\ldots ,x_{n}] \in \mathbb {R}^{d \times n}\) denote a set of \(n\) data points. Suppose there exists a set of \(L\) component hashing results \(\mathcal {H} = \{H^{(1)},H^{(2)},\ldots ,H^{(L)}\}\), where each \(H^{(l)}= \left[ code^{l}(x_{1}), \ldots , code^{l}(x_{n}) \right] ^{T}\in \{-1,1\}^{n\times r}\) denotes a code matrix of \(X\), corresponding to the \(l\)-th component. Each row of \(H^{(l)}\) represents the \(r\)-bit hashing code of a sample.

### 3.1 Connectivity matrix

*connectivity matrix*.

### 3.2 Measure of consensus

*Frobenius norm*. The disagreement between two hashing results can be measured by \(d(M^{(p)},M^{(q)})\). When \(d(M^{(p)},M^{(q)})=0\), we consider that two hashing results are completely consistent.
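The defining equations are not reproduced in this section, so the following sketch rests on two labeled assumptions: that the connectivity matrix of a code matrix \(H\) is the normalized inner product \(M = HH^{T}/r\), and that \(d\) is the squared Frobenius norm of the difference of two such matrices. Under those assumptions it illustrates how two hashing results can look entirely different yet be completely consistent. The names `connectivity` and `disagreement` are illustrative.

```python
def connectivity(H):
    """Assumed form: M[i][j] = <code_i, code_j> / r for codes in {-1, +1}."""
    r = len(H[0])
    return [[sum(a * b for a, b in zip(hi, hj)) / r for hj in H] for hi in H]


def disagreement(Mp, Mq):
    """Assumed form: squared Frobenius norm of Mp - Mq."""
    return sum((a - b) ** 2
               for row_p, row_q in zip(Mp, Mq)
               for a, b in zip(row_p, row_q))


H1 = [[1, 1, -1], [1, -1, -1], [-1, -1, 1]]
# Same result in disguise: bits reordered and one bit flipped for every sample.
H2 = [[-row[2], row[0], row[1]] for row in H1]
# A genuinely different hashing result.
H3 = [[1, 1, 1], [-1, -1, -1], [1, -1, 1]]

print(disagreement(connectivity(H1), connectivity(H2)))
print(disagreement(connectivity(H1), connectivity(H3)))
```

Reordering bits or flipping a bit for all samples leaves every pairwise inner product unchanged, so the first disagreement is zero even though `H1` and `H2` differ entry by entry, while the second is strictly positive.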

## 4 Consensus hash functions learning

In this section, we give the details of the proposed Consensus hashing framework. A simple model associated with an efficient two-step optimization method is proposed for consensus hash functions learning.

### 4.1 Objective function

### 4.2 Optimization

Because the relaxed variable \(Y\) in (12a) is real-valued, (12a) becomes an MDS problem (Kruskal and Wish 1978). In addition, it will be shown that (12b) can be decomposed into a basic linear regression problem and a classical Orthogonal Procrustes (OP) problem (Schönemann 1966). Therefore, the complicated objective in Eq. (9) can be optimized by solving the two relatively simpler problems (12a) and (12b).
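The alternating treatment of (12b) can be sketched as follows. Since the equations are not reproduced here, the sketch assumes an objective of the form \(\min _{W,R}\Vert YR - X^{T}W\Vert _{F}^{2}\) with \(R^{T}R = I\), which is consistent with the \((XX^{T})^{-1}\) term and the \(r \times r\) SVD mentioned in the complexity analysis but is not confirmed by the text. All function names and the random data are illustrative.

```python
import numpy as np


def update_W(X, Y, R):
    """Least-squares fit of X^T W to the rotated targets Y R (R fixed)."""
    # Solves the normal equations (X X^T) W = X (Y R) without an explicit inverse.
    return np.linalg.solve(X @ X.T, X @ (Y @ R))


def update_R(X, Y, W):
    """Orthogonal Procrustes: orthogonal R minimizing ||Y R - X^T W||_F (W fixed)."""
    U, _, Vt = np.linalg.svd(Y.T @ (X.T @ W))  # SVD of an r x r matrix
    return U @ Vt


def objective(X, Y, W, R):
    return np.linalg.norm(Y @ R - X.T @ W) ** 2


rng = np.random.default_rng(0)
d, n, r = 5, 200, 3
X = rng.standard_normal((d, n))   # data, one column per sample
Y = rng.standard_normal((n, r))   # stand-in for the relaxed codes from step one
R = np.eye(r)
losses = []
for _ in range(5):
    W = update_W(X, Y, R)
    R = update_R(X, Y, W)
    losses.append(objective(X, Y, W, R))
print(losses)
```

Each update is the exact minimizer of its block with the other block held fixed, so the recorded objective values cannot increase from one iteration to the next.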

#### 4.2.1 First step

#### 4.2.2 Second step

**(1) Update \(W\) with fixed \(R\):**

**(2) Update \(R\) with fixed \(W\):**

### 4.3 Complexity analysis

The complexity of eigendecomposition for an \(m\times m\) matrix \(U_{m,m}\) is \(O(m^{3})\). To obtain \(Y^{*}\) in Eq. (16), we need to multiply the \(n \times m\) matrix \(U_{n,m}\) by the \(r\) eigenvectors, which costs \(O(rmn)\). Therefore, the time complexity of the first step is \(O(m^{3} + rmn)\). In Eq. (18), obtaining \((XX^{T})^{-1}\) costs \(O(drn + d^{3})\), and the following matrix multiplication costs \(O(drn + dr^{2} + d^{2}r)\), so the time complexity for obtaining \(W\) is \(O(2drn + dr^{2} + d^{2}r + d^{3})\). To obtain \(R\) in Eq. (19), SVD is applied to an \(r \times r\) matrix, whose time complexity is \(O(r^{3})\). Including the \(O(r^{2}n)\) cost of forming this matrix, the whole cost for obtaining \(R\) is \(O(r^{3}+r^{2}n)\). Given that \(n \gg r\) and \(n \gg d\) for large scale datasets, the overall time complexity of the training process is \(O(r^{2}n + drn)\), which is linear in the size of the training set.
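The \(O(m^{3} + rmn)\) cost of the first step matches the usual shape of a Nyström extension: eigendecompose a small \(m \times m\) block, then extend the top \(r\) eigenvectors to all \(n\) points. The sketch below is a generic Nyström embedding for a positive semidefinite matrix. It is not necessarily the exact form of Eq. (16), and the names `nystrom_embedding`, `K_nm` and `K_mm` are assumptions made for this example.

```python
import numpy as np


def nystrom_embedding(K_nm, K_mm, r):
    """Approximate rank-r embedding Y (with Y Y^T ~ K) from m sampled columns.

    K_nm: n x m block of the full matrix (all rows, sampled columns).
    K_mm: m x m block restricted to the sampled points.
    Assumes the top-r eigenvalues of K_mm are strictly positive.
    """
    evals, evecs = np.linalg.eigh(K_mm)        # O(m^3)
    top = np.argsort(evals)[::-1][:r]          # indices of the r largest eigenvalues
    lam, V = evals[top], evecs[:, top]
    return K_nm @ V / np.sqrt(lam)             # O(r m n) extension to all n points


rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3))
K = Z @ Z.T                                    # exactly rank-3 PSD matrix
idx = np.arange(10)                            # m = 10 sampled points
Y = nystrom_embedding(K[:, idx], K[np.ix_(idx, idx)], r=3)
print(np.allclose(Y @ Y.T, K))
```

In this toy case the matrix has exact rank \(r\), so the extension recovers it up to rounding; for a general matrix the result is only an approximation whose quality depends on the sampled columns.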

## 5 Experiments

As we have discussed in the introduction, many possible circumstances can lead to different and inconsistent binary codes of the data, including (1) applying different hashing algorithms to the data, (2) the dependency to the initialization or inner randomness of some algorithms, and (3) applying one algorithm to different features of the data. In this section, we conduct extensive experiments to test the effectiveness and efficiency of the proposed CH in all these situations.

### 5.1 Experiments on three large scale datasets

#### 5.1.1 Datasets

Three large scale datasets are used in the experiments. **Tiny-100K**: it consists of 100,000 tiny images of \(32 \times 32\) pixels randomly sampled from the original 80 million tiny images.^{1} Each image is represented by a 384 dimensional GIST descriptor (Oliva and Torralba 2001). **CIFAR10**:^{2} it consists of 60,000 color images and we extract a 512 dimensional GIST descriptor to represent each image. **GIST-1M**:^{3} it contains a set of 960 dimensional, one million GIST descriptors extracted from random images.

All these data are mean-centered, as required by many methods (Gong et al. 2013; Kong and Li 2012; Kulis and Darrell 2009). For each dataset, we randomly select 1000 data points as queries and use the remaining points as both the gallery database and the training set. The ground truth neighbors, i.e., the exact \(K\)-nearest neighbors of each query, are computed by linear scan, comparing each query with all the points in the database using the raw features. Following Wang et al. (2012), the top 2 percentile nearest neighbors in Euclidean space are taken as ground truth. All the reported results are averaged over ten random test/training partitions.
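The ground truth construction can be sketched as a plain linear scan. The sketch interprets "top 2 percentile" as the 2% of database points closest to the query, which is an assumption about the protocol, and the function name `ground_truth` is illustrative.

```python
import math


def ground_truth(query, database, percentile=2.0):
    """Indices of the closest `percentile` percent of database points (linear scan)."""
    dists = [math.dist(query, p) for p in database]
    k = max(1, int(len(database) * percentile / 100))
    order = sorted(range(len(database)), key=dists.__getitem__)
    return set(order[:k])


# Toy database of 100 one-dimensional points, so 2% corresponds to 2 neighbors.
database = [[float(i)] for i in range(100)]
neighbors = ground_truth([0.2], database)
print(sorted(neighbors))
```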

#### 5.1.2 Compared methods and protocol

We compare the performance of our approach, CH, against several state-of-the-art hashing methods, including LSH (Charikar 2002), AGH (Liu et al. 2011), ITQ (Gong et al. 2013), Isotropic Hashing (IsoH) (Kong and Li 2012) and K-means Hashing (KMH) (He et al. 2013). All the results were obtained with the source codes generously provided by the authors and by following their instructions to tune the algorithm parameters.

For a fair evaluation, we adopt the Hamming ranking search strategy commonly used in the literature (Gong et al. 2013; Wang et al. 2012; He et al. 2013; Kong and Li 2012). All points in the database are ranked according to their Hamming distance to the query. The Hamming ranking performance is measured with three metrics widely used in information retrieval: mean average precision (MAP), precision-recall curves and precision curves. It is worth noting that Hamming ranking is an exhaustive linear search method, but it is usually very fast in practice because Hamming distance can be computed efficiently. Hamming ranking can be further sped up by the method of Norouzi et al. (2014), which provides a parallel method for fast search in Hamming space and can also be applied to our work. Specifically, the codes generated by our approach can be divided into substrings and used for parallel exact k-nearest neighbor search in Hamming space with the approach proposed in Norouzi et al. (2014).
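A minimal sketch of Hamming ranking and the average precision of a single query is given below. Codes are packed into integers, and the names `hamming_rank` and `average_precision` as well as the toy data are illustrative. MAP is then the mean of the average precision over all queries.

```python
def hamming_rank(query_code, db_codes):
    """Database indices sorted by Hamming distance to the query (ties keep index order)."""
    return sorted(range(len(db_codes)),
                  key=lambda i: bin(query_code ^ db_codes[i]).count("1"))


def average_precision(ranking, relevant):
    """Mean of the precision values measured at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


db = [0b0000, 0b0001, 0b0111, 0b1111]
ranking = hamming_rank(0b0000, db)
ap = average_precision(ranking, relevant={0, 2})
print(ranking, ap)
```

Here the relevant items appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.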

For the proposed consensus hashing, we present two different implementations in the comparisons. The first implementation, denoted as CH\(^{1}\), combines the hashing codes generated by 20 independent runs of LSH to learn consensus hash functions. The second implementation, denoted as CH\(^{2}\), combines the hashing results of the adopted baselines, i.e., LSH (Charikar 2002), AGH (Liu et al. 2011), ITQ (Gong et al. 2013), IsoH (Kong and Li 2012) and KMH (He et al. 2013).

CH\(^{1}\) combines the binary codes generated by multiple independent runs of LSH because only LSH, owing to its inner randomness, can guarantee enough diversity among the outputs of multiple executions. For other base learners such as ITQ, the codes are of better quality but the outputs of different runs are almost the same, which makes them unsuitable for CH\(^{1}\). CH\(^{2}\) aggregates the results generated by various base learners. Ideally, if the diversity of the combined learners is sufficient, higher precision of the base learners will lead to better performance of CH\(^{2}\). In practice, however, higher precision often means less diversity. This is also why weak classifiers are often used as base learners in other ensemble learning methods such as boosting. In summary, the selection of base methods is a trade-off between quality and diversity.

#### 5.1.3 Results and analysis

*Mean average precision* The MAP score is one of the most comprehensive criteria for evaluating retrieval performance in the literature (Gong et al. 2013; Kong and Li 2012; Heo et al. 2012; He et al. 2013). Figure 1 shows the MAP scores for different methods with different code lengths on the three datasets. We observe that CH\(^{1}\) and CH\(^{2}\) achieve the highest MAP scores in most cases on all these datasets.

Comparing CH\(^{2}\) with the baselines it combines, i.e., LSH, AGH, ITQ, IsoH and KMH, we find that CH\(^{2}\) outperforms all of them by a large margin on these datasets. These results imply that the consensus strategy can collect the advantages of the other methods and consequently achieve further improvement over any single baseline. In addition, CH\(^{2}\) consistently outperforms CH\(^{1}\). This is natural, as the information provided by the combined methods is more diverse in CH\(^{2}\); e.g., AGH considers the manifold structure of the data and KMH considers its cluster structure. Intuitively, this diverse information is conducive to learning consensus hash functions that better capture the structure of the data.

Some interesting phenomena about the baselines can also be observed from the MAP results. AGH works relatively well for small code sizes and substantially outperforms LSH. However, as the code size increases to 64 bits, the performance of LSH rises rapidly and surpasses that of AGH. This is because, in AGH, most of the information is captured by the top eigenvectors (hash functions), while the remaining ones are usually noisy. In data-independent methods such as LSH, by contrast, it is theoretically guaranteed that two similar samples will be embedded into close codes with higher probability when more bits are assigned.

*Precision-recall curves and precision curves* Figures 2–4a, b show the complete precision-recall curves on the three datasets, respectively. As a complementary evaluation, the precision curves on the three datasets are given in Figs. 2–4c, d. Results with 64 bits and 128 bits are reported; the comparisons with other code lengths show similar trends. These results demonstrate the overall performance improvement of the proposed CH over other methods. In particular, we can make two observations. First, these detailed results are consistent with the trends observed in Fig. 1, namely, our methods perform the best and the runners-up are ITQ and IsoH. In addition, CH\(^{2}\) performs better than CH\(^{1}\) because it combines more comprehensive baselines. Second, we find that CH\(^{2}\) with 64 bits performs similarly to, or even better than, other approaches with 128 bits. Consequently, our method typically provides a binary representation about twice as compact as that of other methods for the same precision target.

#### 5.1.4 Parametric sensitivity

#### 5.1.5 Computational cost

Comparison of training time (seconds) on Tiny-100K

| Method | 16 bits | 32 bits | 48 bits | 64 bits | 96 bits | 128 bits |
|---|---|---|---|---|---|---|
| LSH | 0.01 | 0.01 | 0.02 | 0.02 | 0.03 | 0.05 |
| AGH | 65.81 | 65.72 | 65.69 | 63.52 | 66.06 | 65.98 |
| ITQ | 2.20 | 3.71 | 5.50 | 7.36 | 11.51 | 16.38 |
| IsoH | 1.00 | 0.62 | 0.79 | 1.04 | 0.99 | 1.18 |
| KMH | 428.05 | 424.20 | 457.09 | 487.92 | 529.97 | 571.74 |
| CH | 2.19 | 3.60 | 4.74 | 6.87 | 12.37 | 20.33 |

### 5.2 Experiments on multi-view hashing

#### 5.2.1 Multi-view hashing

In most of the real world visual problems, data are collected from diverse domains or obtained from various feature extractors and exhibit heterogeneous properties. Most of the conventional hashing methods, such as all the algorithms we mentioned above, usually adopt a single modality to learn hash functions without exploiting the complementary information contained in different modalities. From this perspective, all these methods can be seen as *single-view hashing*.

Some recently proposed *multi-view hashing* methods try to fuse multiple information sources to get more efficient and effective hashing codes. Song et al. (2011) presented a Multiple Feature Hashing (MFH) to tackle the near-duplicate video retrieval problems. MFH establishes one graph for each view to preserve the local structure information and also globally consider the local structures for all views. Liu et al. (2014) used kernel trick to capture the similarity affinity of different sources (MFKH). In MFKH, by concatenating different features in kernel space, multi-view hashing is formulated as a similarity preserving problem with linearly combined multiple kernels. Other representative works include Zhang et al. (2011) and Xia et al. (2012).

As pointed out in the introduction, applying the same hashing algorithm on different modalities of the data will result in different hashing codes. These diverse hashing results usually contain complementary information which can be combined with our CH strategy. In this sense, the proposed CH method provides an alternative for multi-view hashing, which is from a completely different perspective of consensus learning compared with the previous methods. In this subsection, we explore the effectiveness of this alternative in multi-view hashing.

#### 5.2.2 Dataset and protocol

We conduct experiments on the NUSWIDE^{4} dataset, which consists of 269,648 images with 81 concept tags from Flickr. Five kinds of features are provided and used in our experiments, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture and 225-D block-wise color moments. To obtain diverse hashing results, we run LSH on each feature ten times. Both MFH and our method need to concatenate all features into a long vector to learn hash functions. For evaluation, we use the 1000 images with the largest number of tags as queries and the rest as the database. The source codes are provided by the authors.

*at least* one common tag.

#### 5.2.3 Results and analysis

Figure 6 displays the experimental results on NUSWIDE, including the performance of Hamming ranking and hash lookup. First, Fig. 6a shows the MAP scores of different methods with various code lengths. By combining the hashing results on different features, our CH gives much better performance than the competitors MFH and MFKH. In Fig. 6b, we present the precision of the top 50 returned samples with different code lengths. Again, significant performance gaps are observed between CH and the two baseline algorithms. We also observe that, as the number of bits increases, the retrieval precision of MFKH decreases mildly and that of MFH does not increase noticeably either. By comparison, our CH achieves remarkably better precision as the code length increases and consistently outperforms the other baselines in all cases.

To evaluate the performance of our algorithm in hash lookup, in Fig. 6c we plot the precision of points within Hamming radius 2, i.e., with Hamming distance \(\le 2\) to the query, using 16, 32 and 48 bits. Note that we follow Liu et al. (2012) and Wang et al. (2012) in treating a failure to find any potential neighbors for a query as zero precision. Because the Hamming space becomes sparser as the code length increases, precision drops rapidly when longer codes are used. CH achieves superior accuracy at all code lengths compared with the other methods. We conclude that our method with compact codes can retrieve more semantically related images than the other baselines when using hash lookup. All these results verify that the proposed CH is very effective for multi-view hashing.
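The hash lookup metric can be sketched as follows, assuming the usual convention that radius 2 includes codes at distance exactly 2. A real lookup would probe hash table buckets rather than scan the database; the linear filter below only reproduces the precision computation, including the zero-precision rule for queries that retrieve nothing. The name `lookup_precision` and the toy codes are illustrative.

```python
def lookup_precision(query_code, db_codes, relevant, radius=2):
    """Precision over database items whose code lies within the Hamming radius."""
    retrieved = [i for i, code in enumerate(db_codes)
                 if bin(query_code ^ code).count("1") <= radius]
    if not retrieved:
        # No candidate found for this query: counted as zero precision.
        return 0.0
    return sum(i in relevant for i in retrieved) / len(retrieved)


db = [0b0000, 0b0011, 0b0111, 0b1111]
p_hit = lookup_precision(0b0000, db, relevant={0, 2})
p_empty = lookup_precision(0b1111, [0b0000], relevant={0})
print(p_hit, p_empty)
```

With longer codes fewer database items fall inside the fixed radius, so more queries hit the empty case, which is one way to see why precision under this protocol drops as the code length grows.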

## 6 Conclusion

In this paper, we proposed a novel CH algorithm based on an ensemble learning strategy. First, we defined a measure of consensus between hashing results. With this definition, we presented a simple model to learn consensus hash functions. A two-step optimization method was also proposed for efficient training. The complexity analysis showed that the training time of the proposed method is linear in the size of the training set. Extensive experiments on several large scale benchmarks demonstrated the effectiveness and efficiency of our method.

To the best of our knowledge, this is the first attempt to introduce the idea of consensus learning into hash functions learning. Our work can be viewed as an application of consensus learning in hashing. As another example of consensus learning, consensus clustering is a well studied topic in machine learning community and many interesting approaches have been proposed. Plenty of instructive ideas in consensus clustering are worth studying for hashing. From this point of view, our preliminary work might stimulate other researchers to move their attention to this topic, and finally propose better methods for consensus hashing.

The major limitation of our method is that all the input component hashing codes are treated equally. In real-world cases, however, the component hashing results might be of different importance and some of them might be redundant. Accordingly, controlling the contributions of these results is crucial. In future work, we intend to extend the consensus hashing model to a weighted scheme, in which different weights are assigned to different component hashing results and these weights are learned automatically.

## Acknowledgments

This work was supported in part by 863 Program (Grant No. 2014AA015100), National Natural Science Foundation of China (Grant No. 61170127, 61332016). The authors would like to thank Prof. Hanqing Lu and Dr. Xi Zhang for their constructive suggestions.