Abstract
Clustering is an unsupervised learning technique in which data objects are grouped into sets based on a similarity measure. Most clustering algorithms assume that main memory is large enough to hold the complete set of patterns. In practice, many applications produce pattern sets that do not fit in main memory, so much of the data resides in secondary memory, and disk input/output (I/O) becomes the major bottleneck in designing efficient clustering algorithms for large data sets. Various design techniques have been used to build clustering algorithms for large data sets; external memory algorithms are one such class. These algorithms exploit the hierarchical memory structure of computers by incorporating locality of reference directly into the algorithm. This paper contributes to the design of clustering algorithms in the external memory model (proposed by Aggarwal and Vitter) to make them scalable. It shows that the shared near neighbors algorithm is not I/O efficient, because its I/O complexity is the same as its computational complexity and both are high. The algorithm is redesigned in the external memory model, reducing its I/O complexity without changing its computational complexity. We substantiate the theoretical analysis by implementing the algorithms with the STXXL library and comparing their performance against their traditional counterparts.
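For readers unfamiliar with the shared near neighbors idea named in the abstract, the following is a minimal in-memory sketch of the Jarvis and Patrick criterion: two points are merged when each appears in the other's k-nearest-neighbor list and the two lists share at least kt members. It is an illustration only, not the external-memory redesign described in this chapter. The function names knn_lists and snn_clusters, the parameters k and kt, and the brute-force neighbor search are all assumptions made for the example.

```python
import math
from itertools import combinations


def knn_lists(points, k):
    """Brute-force k nearest neighbours of every point.

    Hypothetical helper: it evaluates all pairwise distances, which is
    exactly the kind of quadratic access pattern that becomes costly
    once the data no longer fits in main memory.
    """
    neighbours = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        order = sorted(others, key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(order[:k]))
    return neighbours


def snn_clusters(points, k, kt):
    """Return one cluster label per point (simplified shared near neighbors)."""
    nbrs = knn_lists(points, k)
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(points)), 2):
        mutual = j in nbrs[i] and i in nbrs[j]
        if mutual and len(nbrs[i] & nbrs[j]) >= kt:
            parent[find(i)] = find(j)  # merge the two clusters

    return [find(i) for i in range(len(points))]
```

Calling snn_clusters on two well-separated groups of points with, say, k=3 and kt=1 gives every point in a group the same label and the two groups different labels. The labels themselves are arbitrary point indices, so only equality between labels is meaningful.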
References
Abello, J., Pardalos, P.M., Resende, M.G.: Handbook of Massive Data Sets. Springer (2002)
Aggarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
Arge, L., Procopiuc, O., Vitter, J.: Implementing I/O-efficient data structures using TPIE. In: Algorithms ESA 2002, pp. 88–100. Springer (2002)
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable k-means++. Proc. VLDB Endow. 5(7), 622–633 (2012)
Ball, G.H., Hall, D.J.: A clustering technique for summarizing multivariate data. Behav. Sci. 12(2), 153–155 (1967)
Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic information retrieval. In: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. pp. 626–635. ACM (1997)
Crauser, A., Mehlhorn, K.: LEDA-SM: Extending LEDA to secondary memory. In: Algorithm Engineering, pp. 228–242. Springer (1999)
Dementiev, R., Kettner, L., Sanders, P.: STXXL: standard template library for XXL data sets. Softw. Pract. Exp. 38(6), 589–637 (2008)
Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD Record, vol. 27, pp. 73–84. ACM (1998)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, Southeast Asia Edition (2006)
Januzaj, E., Kriegel, H.P., Pfeifle, M.: DBDC: Density based distributed clustering. In: Advances in Database Technology—EDBT 2004, Lecture Notes in Computer Science, vol. 2992, pp. 88–105 (2004)
Jarvis, R.A., Patrick, E.A.: Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. C-22(11), 1025–1034 (1973)
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for k-means clustering. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pp. 10–18. ACM (2002)
Kim, W.: Parallel clustering algorithms: survey (2009). http://www.cs.gsu.edu/~wkim/indexfiles/SurveyParallelClustering.pdf
Liu, Y., Guo, Q., Yang, L., Li, Y.: Research on incremental clustering. In: 2nd International Conference on Consumer Electronics, Communications and Networks (CECNet), 2012, pp. 2803–2806, April 2012
Moreira, G., Santos, M.Y., Moura-Pires, J.: SNN Input Parameters: how are they related? In: International Conference on Parallel and Distributed Systems (ICPADS), pp. 492–497. IEEE (2013)
Musser, D.R., Derge, G.J., Saini, A.: STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. Addison-Wesley Professional (2009)
Ng, R.T., Han, J.: CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)
Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21(8), 1313–1325 (1995)
Rokach, L., Maimon, O.: Clustering methods. In: Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer (2005)
Wikipedia: Approximation algorithm (2015). Accessed June 2015
Xu, X., Ester, M., Kriegel, H.P., Sander, J.: A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of 14th International Conference on Data Engineering, 1998, pp. 324–331. IEEE (1998)
Yadav, P.K., Pandey, S., Mohanty, S.K.: Nearest neighbor based clustering algorithm for large data sets. arXiv:1505.05962
Zaïane, O.R., Foss, A., Lee, C.H., Wang, W.: On data clustering analysis: Scalability, constraints, and validation. In: Advances in Knowledge Discovery and Data Mining, pp. 28–39. Springer (2002)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–114. ACM (1996)
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yadav, P.K., Pandey, S., Samal, M., Mohanty, S.K. (2019). Nearest Neighbor-Based Clustering Algorithm for Large Data Sets. In: Bhatia, S., Tiwari, S., Mishra, K., Trivedi, M. (eds) Advances in Computer Communication and Computational Sciences. Advances in Intelligent Systems and Computing, vol 760. Springer, Singapore. https://doi.org/10.1007/978-981-13-0344-9_6
DOI: https://doi.org/10.1007/978-981-13-0344-9_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0343-2
Online ISBN: 978-981-13-0344-9
eBook Packages: Intelligent Technologies and Robotics (R0)