# Accelerated Kmeans Clustering Using Binary Random Projection

## Abstract

Codebooks have been widely used for image retrieval and image indexing, which are core elements of mobile visual search. Building a vocabulary tree is carried out offline, because clustering a large amount of training data takes a long time. Recently proposed adaptive vocabulary trees do not require offline training, but suffer from the burden of online computation. The need to cluster high dimensional large data thus arises in both offline and online training. In this paper, we present a novel clustering method that reduces the computational burden without losing accuracy. Feature selection is used to reduce the computational complexity of high dimensional data, and an ensemble learning model is used to improve efficiency with a large number of data points. We demonstrate that the proposed method outperforms state-of-the-art approaches in terms of computational complexity on various synthetic and real datasets.

## Keywords

Feature Selection, Ensemble Model, Normalized Mutual Information, Random Projection, Image Indexing

## 1 Introduction

Image to image matching is one of the important tasks in mobile visual searching. A vocabulary tree based image search is commonly used due to its simplicity and high performance [1, 2, 3, 4, 5, 6]. The original vocabulary tree method [1] cannot grow and adapt with new images and environments, and it takes a long time to build a vocabulary tree through clustering. An incremental vocabulary tree was introduced to overcome limitations such as adaptation in dynamic environments [4, 5]. It does not require heavy clustering for the offline training process due to the use of a distributed online process. The clustering time of the incremental vocabulary tree is the chief burden in the case of realtime application. Thus, an efficient clustering method is required.

Lloyd's kmeans [7] is the standard method. However, this algorithm is not suitable for high dimensional large data, because its computational complexity is proportional to both the number and the dimension of the data points. Various approaches have been proposed to accelerate the clustering and reduce its complexity. One widely used approach applies geometric knowledge to avoid unnecessary computations. Elkan's algorithm [8] is the representative example; it skips the computation of unnecessary distances between points and centers. Two additional strategies for accelerating kmeans are refining the initial data and finding good initial clusters. The approach of Bradley and Fayyad [9] refines initial clusters toward data close to the modes of the joint probability density. If initial clusters are selected near the modes, true clusters are found more often, and the algorithm iterates fewer times. Arthur's kmeans++ [10] is a representative algorithm that chooses good initial clusters for fast convergence: it randomly selects the first center, and then determines each subsequent center with probability proportional to the squared distance from the closest existing center.

The aforementioned approaches, however, are not suited to high dimensional large data, with the exception of Elkan's algorithm. This type of data contains a high degree of irrelevant and redundant information [11]. Also, owing to the sparsity of the data, it is difficult to find hidden structure in a high dimensional space. Some researchers have thus recently addressed the high dimensional problem by decreasing the dimensionality [12, 13]. Others have proposed clustering the original data in a low dimensional subspace rather than directly in the high dimensional space [14, 15, 16]. Two basic types of dimensionality reduction have been investigated: feature selection [14] and feature transformation [15, 16]. One feature selection method, random projection [17], has received attention due to its simplicity and computational efficiency.

Ensemble learning is mainly used for classification and detection. Fred [18] first introduced ensemble learning to the clustering community in the form of an ensemble combination method. The ensemble approach to clustering is robust and efficient for high dimensional large data, because distributed processing is possible and diversity is preserved. In detail, the ensemble approach consists of a generation step and a combination step. Robustness and efficiency are obtained through various models in the generation step [19]. To produce a final model, the multiple models are properly combined in the combination step [20, 21].

In this paper, we show that kmeans clustering can be formulated by feature selection and an ensemble learning approach. We propose a two-stage algorithm, following a coarse-to-fine strategy: the first stage obtains sub-optimal clusters, and the second stage obtains the optimal clusters. We employ a proposed binary random matrix, which is learned by each ensemble model; using this simple matrix, the computational complexity is reduced. Thanks to the first ensemble stage, our method chooses initial points near the sub-optimal clusters in the second stage. The refined data taken from the ensemble method are sufficiently representative because they are sub-optimal. Also, our method avoids unnecessary distance calculations via the triangle inequality and distance bounds. As will be seen in Sect. 3, we show good performance with a binary random matrix, demonstrating that the proposed random matrix is suitable for finding independent bases.

This paper is organized as follows. In Sect. 2, the proposed algorithm to solve the accelerated clustering problem with high dimensional large data is described. Section 3 presents various experimental results on object classification, image retrieval, and loop detection.

## 2 Proposed Algorithm

In summary, this paper approximates kmeans clustering by both Eqs. (6) and (2). This approach yields an efficient kmeans clustering method that capitalizes on the randomness and sparseness of the projection matrix for dimension reduction of high dimensional large data.

As mentioned above, our algorithm is composed of two stages combining Eqs. (3) and (2). In the first stage, our approach builds multiple models from small sub-samples of the dataset. Each separated dataset is clustered with kmeans, which randomly selects arbitrary attribute features in every iteration. As we compute the minimization error in every iteration, we only require sub-dimensional data. The approximated centroids can be obtained in fewer iterations than one-stage clustering. The refined data from the first stage are used as the input of the next step. The second stage consists of a single kmeans optimizer that merges the distributed results. Our algorithm adopts a coarse-to-fine strategy, so the product of the first stage is well suited to achieving fast convergence. The procedure is delineated in Algorithm 1.
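The two-stage procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's Algorithm 1: the function names, the disjoint sub-sampling scheme, and the use of plain Lloyd iterations in both stages are assumptions, and the per-iteration random feature selection of Sect. 2.1 is omitted for brevity.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd k-means; returns the final centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the means.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def two_stage_kmeans(X, k, T=5, seed=0):
    """Stage 1: cluster T disjoint sub-samples to get T*k coarse centroids
    (the 'refined data'). Stage 2: a single k-means merges them."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    refined = np.vstack([kmeans(X[part], k, seed=seed)
                         for part in np.array_split(idx, T)])
    return kmeans(refined, k, seed=seed)
```

Because stage 2 starts from centroids that are already near sub-optimal clusters, it typically converges in very few iterations.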

### 2.1 Feature Selection in Single Model

**Random Projection.** Principal component analysis (PCA) is a widely used method for reducing the dimensionality of data. Unfortunately, it is quite expensive to compute for high dimensional data. It is thus desirable to derive a dimension reduction method that is computationally simple without introducing significant distortion.

As an alternative, random projection (RP) has been found to be computationally efficient yet sufficiently accurate for the dimension reduction of high dimensional data. In random projection, the \(d\)-dimensional data in the original space are projected onto a \(d'\)-dimensional subspace. The projection uses a matrix \(A_{d' \times d}\) whose columns have unit length, and the projection passes through the origin. In matrix notation, \(X_{d' \times N}^{RP}=A_{d' \times d}X_{d \times N}\). If the projection matrix \(A\) is not orthogonal, it causes significant distortion in the dataset; thus, orthogonality should be considered when designing \(A\).
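In code, the projection \(X^{RP} = AX\) is a single matrix product. The sketch below uses a Gaussian random matrix; the \(1/\sqrt{d'}\) scaling is a common convention for approximately preserving norms, not something prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime, N = 512, 32, 1000

X = rng.standard_normal((d, N))      # original data, one column per point
# Gaussian RP matrix; the 1/sqrt(d') scaling is a common convention.
A = rng.standard_normal((d_prime, d)) / np.sqrt(d_prime)
X_rp = A @ X                         # projected data, shape (d', N)
```

All downstream distance computations then run in \(d'\) dimensions instead of \(d\).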

**Random Projection Matrix.** Matrix \(A\) of Eq. (7) is generally called a random projection matrix, and its choice is one of the key points of interest. According to [22], the elements of \(A\) are Gaussian distributed (GRP). Achlioptas [14] has shown that the Gaussian distribution can be replaced by a simpler one, such as a sparse matrix (SRP). In this paper, we propose the binary random projection (BRP) matrix, whose elements \(a_{ij}\) take the value zero or one, as delineated in Eq. (8).
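One plausible construction of such a 0/1 matrix is sketched below. The exact distribution of \(a_{ij}\) is given by Eq. (8), which is not reproduced here, so the fixed number of ones per row is an illustrative assumption.

```python
import numpy as np

def binary_random_matrix(d_prime, d, ones_per_row=8, seed=0):
    """A 0/1 projection matrix: each row selects a random subset of input
    coordinates. Illustrative only; Eq. (8) in the paper defines the
    actual distribution of the elements a_ij."""
    rng = np.random.default_rng(seed)
    A = np.zeros((d_prime, d))
    for i in range(d_prime):
        A[i, rng.choice(d, ones_per_row, replace=False)] = 1.0
    return A
```

Because each row only sums a handful of coordinates, projecting with such a matrix needs additions rather than multiplications, which is the source of the computational saving over GRP.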

**Distance Bound and Triangle Inequality.** Factors that can cause kmeans to be slow include processing large amounts of data, computing many point-center distances, and requiring many iterations to converge. A primary strategy for accelerating kmeans is applying geometric knowledge to avoid redundant distance computations. For example, Elkan's kmeans [8] employs the triangle inequality, efficiently updating upper and lower bounds on point-center distances to skip unnecessary calculations. The proposed method projects high dimensional data onto a lower dimensional subspace using the BRP matrix. Individual points in the lower dimensional subspace cannot guarantee exact geometric relations between the original data. However, distances are approximately preserved by the Johnson-Lindenstrauss lemma [25]: if points in a vector space are projected onto a randomly selected subspace of suitably high dimension, the distances between the points are approximately preserved. Our algorithm can therefore impose distance bounds to reduce the computational complexity.
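The distance-bound idea can be illustrated with the simplest form of Elkan-style pruning: if \(2\,d(x,c) \le d(c,c')\), the triangle inequality gives \(d(x,c') \ge d(c,c') - d(x,c) \ge d(x,c)\), so \(c'\) cannot be closer and \(d(x,c')\) need not be computed. This is a sketch; Elkan's full algorithm additionally maintains per-point upper and lower bounds across iterations.

```python
import numpy as np

def assign_with_pruning(X, centers):
    """Assign points to nearest centers, skipping distances ruled out by
    the triangle inequality: if 2*d(x, c_best) <= d(c_best, c_j), center
    c_j cannot beat c_best."""
    # Precompute all center-to-center distances once.
    cc = np.linalg.norm(centers[:, None] - centers[None], axis=-1)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best, best_d = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            if 2 * best_d <= cc[best, j]:   # pruned: c_j cannot be closer
                skipped += 1
                continue
            dj = np.linalg.norm(x - centers[j])
            if dj < best_d:
                best, best_d = j, dj
        labels[i] = best
    return labels, skipped
```

The assignments are identical to a brute-force nearest-center search; only the number of evaluated distances changes.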

### 2.2 Bootstrap Sampling and Ensemble Learning

Our approach adopts an ensemble learning model for statistical reasons and because of the large volume of data. The statistical reason is that averaging the outputs of several models may reduce the risk of an unfortunate feature selection. Learning with such a vast amount of data at once is usually not practical, so we use a partitioning method that separates the entire dataset into several small subsets and learn each model on disjoint sub-data. By adapting the ensemble approach to our work, we obtain model diversity and decrease the correlation between ensemble models. The results of Eq. (5) are thus more stable and comparable to the results of Eq. (4).

To reduce the risk of an unfortunate feature selection, the diversity of the ensemble models should be guaranteed. Diversity can generally be achieved in two ways: the most popular is to employ a different dataset in each model, and the other is to use different learning algorithms. We choose the first strategy, and the bootstrap is used for pre-processing of the feature selection. We empirically show that our method produces sufficient diversity, even when the number of ensembles is limited.
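The data-partitioning step can be sketched as follows: disjoint subsets provide the different-dataset-per-model diversity, and each model also draws a random feature subset as the pre-processing for feature selection. The function name and the `feat_frac` ratio are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def make_ensemble_inputs(X, T=5, feat_frac=0.3, seed=0):
    """Split the rows of X (n x d) into T disjoint subsets, and draw a
    random feature subset per model. Returns (sub_data, row_idx, feat_idx)
    triples, one per ensemble model."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), T)
    d = X.shape[1]
    feats = [rng.choice(d, max(1, int(feat_frac * d)), replace=False)
             for _ in range(T)]
    return [(X[p][:, f], p, f) for p, f in zip(parts, feats)]
```

Each triple can then be clustered independently, which is what makes the generation stage distributable.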

As multiple candidate clusters are combined, our algorithm considers the compatibility with variants of kmeans methods and efficiency of the execution time. Our method simply combines multiple candidate clusters using the conventional kmeans algorithm to guarantee fast convergence. Finally, it affords \(K\) clusters by minimizing errors in Eq. (2) using the refined products of the generation stage, as mentioned above.

### 2.3 Time Complexity

The time complexity of the three accelerated algorithms is described in Table 1. We use lower case letters \(n\), \(d\), and \(k\) instead of \(N\), \(D\), and \(K\) for readability. The total time is the sum of the elapsed time over all kmeans iterations, excluding the initialization step. For the proposed method in Table 1, the first part of the "or" statement is the total time without geometric knowledge for avoiding redundant distance computations, while the second part is the total time with it. Our algorithm has the lowest complexity, since the \(\alpha \beta T\) term is much smaller than 1.

**Table 1.** The asymptotic total time for each examined algorithm.

| Algorithm | Total time |
|---|---|
| Kmeans | \( O(ndk) \cdot iter \) |
| Elkan | \( O(\underline{n}dk+dk^{2}) \cdot iter \) |
| Proposed | \( \alpha \beta T \cdot O(ndk) \cdot iter \) or \( T \cdot O(\widetilde{\underline{n}}d'k+d'k^{2}) \cdot iter \) |
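Plugging representative values into the Table 1 expressions shows why the \(\alpha \beta T\) factor dominates the comparison. Constants and the geometric-pruning variant are ignored; the parameter values are taken from the ranges reported in Sect. 3.

```python
# Rough operation counts from Table 1, ignoring constants and the
# geometric-pruning variant of the proposed method.
n, d, k, iters = 100_000, 128, 100, 20
alpha, beta, T = 0.2, 0.2, 5          # within the ranges used in Sect. 3

lloyd = n * d * k * iters             # O(ndk) * iter
proposed = alpha * beta * T * n * d * k * iters
print(proposed / lloyd)               # equals alpha * beta * T
```

With \(\alpha = \beta = 0.2\) and \(T = 5\), the proposed first stage performs roughly a fifth of Lloyd's point-center distance work before any pruning is applied.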

## 3 Experiments

We extensively evaluate the performances on various datasets. Synthetic and real datasets are used for the explicit clustering evaluation in terms of accuracy and elapsed time. We also show offline and online training efficiency for building a vocabulary tree [1] and incremental vocabulary tree [5]. As mentioned earlier, the incremental vocabulary tree does not need heavy clustering in the offline training process due to the distributed online process. Thus, strict elapsed time is more important for online clustering.

Our algorithm has three parameters: \(\alpha \), \(\beta \), and \(T\). Their default values were determined through several experiments: \(\alpha \) and \(\beta \) are set within [0.1, 0.3], and \(T\) is selected from [5, 7]. These values are kept fixed throughout the experiments.

### 3.1 Data Sets

**Synthetic data.** We use synthetic datasets based on a standard cluster model with a multivariate normal distribution. A synthetic data generation tool is available on the website^{1}. This generator produces two kinds of datasets: Gaussian and elliptical cluster data. To evaluate the performance of the algorithm over various numbers of data points (N), dimensions (D), and numbers of clusters (K), we generated datasets with N = 100 K, K \(\in \{3, 5, 10, 100, 500\}\), and D \(\in \{8, 32, 128\}\).

**Tiny Images.** We use the CIFAR-10 dataset, which is composed of labelled subsets of the 80 million tiny images collection [26]. CIFAR-10 consists of 10 categories with 6000 images per category. Each image is represented as a 384-dimensional GIST feature.

**RGBD Images.** We collect object images from the RGBD dataset [27]. RGBD images are randomly sampled with category information. We use a 384-dimensional GIST feature to represent each image.

**Caltech101.** This dataset contains images of objects from 101 categories, gathered from the internet, and is mainly used to benchmark classification methods. We extract dense multi-scale SIFT features for each image and randomly sample 1M features to form this dataset.

**UKbench.** This dataset is from the Recognition Benchmark introduced in [1]. It consists of 10200 images split into four-image groups, each containing the same scene/object taken from different viewpoints. The features of the dataset and the ground truth are publicly available.

**Indoor/Outdoor.** One indoor and two outdoor datasets are used to demonstrate the efficiency of our approach. The indoor images are captured by a mobile robot that moves twice along a similar path in a building; this dataset has 5890 images, and SURF features are used to represent each image. The outdoor datasets are captured by a moving vehicle; we refer to them as the small and large outdoor datasets for convenience. The vehicle moves twice along the same path in the small outdoor dataset. In the large outdoor dataset, the vehicle travels about 13 km while making many loops. This large dataset consists of 23812 images, and we use sub-sampled images for testing.

### 3.2 Evaluation Metric

We use three metrics to evaluate the performance of the various clustering algorithms: elapsed time, the within-cluster sum of squared distortions (WCSSD), and the normalized mutual information (NMI) [28]. NMI is widely used for clustering evaluation and measures how close clustering results are to the latent classes. NMI requires the ground-truth cluster assignments X for the points in the dataset. Given clustering results Y, NMI is defined as NMI(X,Y) = \(\frac{MI(X,Y)}{\sqrt{H(X)H(Y)}}\), where MI(X,Y) is the mutual information of X and Y and \(H(\cdot )\) is the entropy.
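The NMI definition above translates directly into code. This is a sketch using natural-log entropies, with plain label lists standing in for the assignments X and Y.

```python
import numpy as np
from collections import Counter

def nmi(x, y):
    """NMI(X, Y) = MI(X, Y) / sqrt(H(X) * H(Y)), from two label lists."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Entropies of the two labelings.
    hx = -sum(c / n * np.log(c / n) for c in px.values())
    hy = -sum(c / n * np.log(c / n) for c in py.values())
    # Mutual information from the joint label counts.
    mi = sum(c / n * np.log((c / n) / (px[a] / n * py[b] / n))
             for (a, b), c in pxy.items())
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0
```

NMI is invariant to label permutations, so a clustering that matches the ground truth up to relabeling scores 1.0, while an uncorrelated clustering scores near 0.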

### 3.3 Clustering Performance

We compare our proposed clustering algorithm with three alternatives: Lloyd's kmeans, Arthur's kmeans, and Elkan's kmeans. All algorithms are run on a 3.0 GHz, 8 GB desktop PC using a single thread, and are mainly implemented in C with some routines implemented in Matlab. We use the public releases of Arthur's kmeans and Elkan's kmeans. The time costs for both initialization and clustering are included in the comparison.

The results in Figs. 1 and 2 are shown for various data dimensions and various numbers of clusters, respectively. The proposed algorithm is faster than Lloyd's algorithm and consistently outperforms the other kmeans variants on high dimensional large datasets. Also, our approach performs best regardless of \(K\). However, the clustering accuracy is not maintained on low dimensional datasets: Hecht-Nielsen's theory [30] does not hold in low dimensional spaces, because vectors with random directions might not be close to orthogonal.

## 4 Applications

### 4.1 Evaluation Using Object Recognition

We compare the efficiency and quality of visual codebooks, which are respectively generated by flat and hierarchical clustering methods. A hierarchical clustering method such as HKM [1] is suitable for large data applications.

The classification and identification accuracies are similar, and therefore we only present results in terms of elapsed time as the number of visual words, i.e., the vocabulary size, increases (Fig. 5). We perform the experiments on the Caltech101 dataset, which contains 0.1M randomly sampled features. Following [29], we run the clustering algorithms used to build a visual codebook, and test only the codebook generation process in the image classification. Results on the Caltech101 dataset are obtained with 0.3 K, 0.6 K, and 1 K codebooks, and a \(\chi ^2\)-SVM on top of \(4\times 4\) spatial histograms. From Fig. 5a, we see that for the same vocabulary size, our method is more efficient than the other approaches, while the accuracy of each algorithm is similar. For example, with 1 K codebooks, the mAP of our approach is 0.641 and that of the other approach is 0.643.

In the experiment on the UKbench dataset, we use a subset of database images and 760 K local features. We evaluate the clustering time and the performance of image retrieval with various vocabulary sizes from 200 K to 500 K. Figure 5b shows that our method runs faster than the conventional approach with a similar mAP, about 0.75.

### 4.2 Evaluation Using Image Indexing

In Fig. 8, we use the precision-recall curve instead of a similarity matrix. The tendency of both results is similar to that seen above.

The two left images of each row in Fig. 9 correspond to one dataset: the first row belongs to the indoor dataset, the second to the small outdoor dataset, and the third to the large outdoor dataset. Images of the third column show the total localization results. There are three circles: green, the robot position; yellow, an added image position; and red, a matched scene position. To prevent confusion, we note that the trajectories of the real position (green line) are slightly rotated so the results can be viewed clearly.

## 5 Conclusions

In this paper, we have introduced an accelerated kmeans clustering algorithm that uses binary random projection. The clustering problem is formulated as a feature selection and solved by minimization of distance errors between original data and refined data. The proposed method enables efficient clustering of high dimensional large data. Our algorithm shows better performance on the simulated datasets and real datasets than conventional approaches. We demonstrate that our accelerated algorithm is applicable to an incremental vocabulary tree for object recognition and image indexing.


### Acknowledgement

We would like to thank Greg Hamerly and Yudeog Han for their support. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. 2010-0028680).

## References

- 1. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: International Conference on Computer Vision and Pattern Recognition, pp. 2161–2168 (2006)
- 2. Tsai, S.S., Chen, D., Takacs, G., Chandrasekhar, V., Singh, J.P., Girod, B.: Location coding for mobile image retrieval. In: Proceedings of the 5th International ICST Mobile Multimedia Communications Conference (2009)
- 3. Straub, J., Hilsenbeck, S., Schroth, G., Huitl, R., Möller, A., Steinbach, E.: Fast relocalization for visual odometry using binary features. In: IEEE International Conference on Image Processing (ICIP), Melbourne, Australia (2013)
- 4. Nicosevici, T., Garcia, R.: Automatic visual bag-of-words for online robot navigation and mapping. Trans. Robot. **99**, 1–13 (2012)
- 5. Yeh, T., Lee, J.J., Darrell, T.: Adaptive vocabulary forests for dynamic indexing and category learning. In: Proceedings of the International Conference on Computer Vision, pp. 1–8 (2007)
- 6. Kim, J., Park, C., Kweon, I.S.: Vision-based navigation with efficient scene recognition. J. Intell. Serv. Robot. **4**, 191–202 (2011)
- 7. Lloyd, S.P.: Least squares quantization in PCM. Trans. Inf. Theory **28**, 129–137 (1982)
- 8. Elkan, C.: Using the triangle inequality to accelerate k-means. In: International Conference on Machine Learning, pp. 147–153 (2003)
- 9. Bradley, P.S., Fayyad, U.M.: Refining initial points for k-means clustering. In: International Conference on Machine Learning (1998)
- 10. Arthur, D., Vassilvitskii, S.: K-means++: the advantages of careful seeding. In: ACM-SIAM Symposium on Discrete Algorithms (2007)
- 11. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newslett. **6**, 90–105 (2004)
- 12. Khalilian, M., Mustapha, N., Suliman, N., Mamat, A.: A novel k-means based clustering algorithm for high dimensional data sets. In: International Multiconference of Engineers and Computer Scientists, pp. 17–19 (2010)
- 13. Moise, G., Sander, J.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: International Conference on Knowledge Discovery and Data Mining (2008)
- 14. Achlioptas, D.: Database-friendly random projections. In: ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 274–281 (2001)
- 15. Ding, C., He, X., Zha, H., Simon, H.D.: Adaptive dimension reduction for clustering high dimensional data. In: International Conference on Data Mining, pp. 147–154 (2002)
- 16. Hinneburg, A., Keim, D.A.: Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: International Conference on Very Large Data Bases (1999)
- 17. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. In: International Conference on Knowledge Discovery and Data Mining (2001)
- 18. Fred, A.L.N., Jain, A.K.: Combining multiple clusterings using evidence accumulation. Trans. Pattern Anal. Mach. Intell. **27**(6), 835–850 (2005)
- 19. Polikar, R.: Ensemble based systems in decision making. Circ. Syst. Mag. **6**(3), 21–45 (2006)
- 20. Fern, X.Z., Brodley, C.E.: Random projection for high dimensional data clustering: a cluster ensemble approach. In: International Conference on Machine Learning, pp. 186–193 (2003)
- 21. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. **97**, 273–324 (1997)
- 22. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: ACM Symposium on Theory of Computing, pp. 604–613 (1998)
- 23. Elhamifar, E., Vidal, R.: Sparse subspace clustering. In: International Conference on Computer Vision and Pattern Recognition (2009)
- 24. Elhamifar, E., Vidal, R.: Sparse manifold clustering and embedding. Neural Inf. Process. Syst. **24**, 55–63 (2011)
- 25. Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. In: International Conference in Modern Analysis and Probability, vol. 26, pp. 189–206 (1984)
- 26. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report (2009)
- 27. Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D object dataset. In: International Conference on Robotics and Automation, pp. 1817–1824 (2012)
- 28. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. **3**, 583–617 (2003)
- 29. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: International Conference on Computer Vision and Pattern Recognition, pp. 2169–2178 (2006)
- 30. Hecht-Nielsen, R.: Context vectors: general purpose approximate meaning representations self-organized from raw data. In: Zurada, J.M., Marks II, R.J., Robinson, C.J. (eds.) Computational Intelligence: Imitating Life, pp. 43–56. IEEE Press, Cambridge (1994)