Nearest Neighbor-Based Clustering Algorithm for Large Data Sets

Pankaj Kumar, Yadav; Pandey, Sriniwas; Samal, Mamata; Sraban Kumar, Mohanty

doi:10.1007/978-981-13-0344-9_6

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 760)

Abstract

Clustering is an unsupervised learning technique in which data objects are grouped into sets based on a similarity measure. Most clustering algorithms assume that main memory is unlimited and can hold the complete set of patterns. In practice, many applications produce pattern sets that do not fit in main memory, so much of the data must reside in secondary memory. Disk input/output (I/O) then becomes the major bottleneck in designing efficient clustering algorithms for large data sets. Several design techniques have been used to build clustering algorithms for large data sets; external memory algorithms are one such class. These algorithms exploit the hierarchical memory structure of computers by incorporating locality of reference directly into the algorithm. This paper contributes to the design of clustering algorithms in the external memory model (proposed by Aggarwal and Vitter) in order to make them scalable. It shows that the shared near neighbors algorithm is not I/O efficient, since its I/O complexity is the same as its computational complexity, and both are high. The algorithm is redesigned in the external memory model, reducing its I/O complexity without changing its computational complexity. We substantiate the theoretical analysis by implementing both the redesigned algorithm and its traditional counterpart with the STXXL library and comparing their performance.
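For orientation, the shared near neighbors idea that the abstract refers to (Jarvis and Patrick) can be sketched in a few lines. This is only a minimal in-memory illustration, not the external memory redesign described in the paper. It assumes the common formulation in which two points are merged when each appears in the other's k-nearest-neighbor list and the two lists share at least `k_t` entries. The function names and the parameters `k` and `k_t` are hypothetical and chosen for this example.

```python
from math import dist


def knn_lists(points, k):
    """Brute-force k-nearest-neighbour sets (the point itself is excluded)."""
    lists = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        lists.append(set(others[:k]))
    return lists


def shared_near_neighbors(points, k, k_t):
    """Return one cluster label per point (assumed Jarvis-Patrick style rule)."""
    nn = knn_lists(points, k)
    parent = list(range(len(points)))  # union-find forest over point indices

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(points)):
        for j in nn[i]:
            # Merge only mutual neighbours that share at least k_t neighbours.
            if i < j and i in nn[j] and len(nn[i] & nn[j]) >= k_t:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    print(shared_near_neighbors(pts, k=3, k_t=2))
```

The sketch holds every point and every neighbor list in main memory and computes all pairwise distances, which is exactly the setting that stops being practical once the data set outgrows main memory.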

Notes

1. https://en.wikipedia.org/?title=Binary_prefix.

References

Abello, J., Pardalos, P.M., Resende, M.G.: Handbook of Massive Data Sets. Springer (2002)
Aggarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
Arge, L., Procopiuc, O., Vitter, J.: Implementing I/O-efficient data structures using TPIE. In: Algorithms ESA 2002, pp. 88–100. Springer (2002)
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable k-means++. Proc. VLDB Endow. 5(7), 622–633 (2012)
Ball, G.H., Hall, D.J.: A clustering technique for summarizing multivariate data. Behav. Sci. 12(2), 153–155 (1967)
Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic information retrieval. In: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. pp. 626–635. ACM (1997)
Crauser, A., Mehlhorn, K.: LEDA-SM: Extending LEDA to secondary memory. In: Algorithm Engineering, pp. 228–242. Springer (1999)
Dementiev, R., Kettner, L., Sanders, P.: STXXL: standard template library for XXL data sets. Softw. Pract. Exp. 38(6), 589–637 (2008)
Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD Record, vol. 27, pp. 73–84. ACM (1998)
Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, Southeast Asia Edition (2006)
Januzaj, E., Kriegel, H.P., Pfeifle, M.: DBDC: Density based distributed clustering. In: Advances in Database Technology - EDBT 2004, Lecture Notes in Computer Science, vol. 2992, pp. 88–105 (2004)
Jarvis, R.A., Patrick, E.A.: Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. C-22(11), 1025–1034 (1973)
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for k-means clustering. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pp. 10–18. ACM (2002)
Kim, W.: Parallel clustering algorithms: survey (2009). http://www.cs.gsu.edu/~wkim/indexfiles/SurveyParallelClustering.pdf
Liu, Y., Guo, Q., Yang, L., Li, Y.: Research on incremental clustering. In: 2nd International Conference on Consumer Electronics, Communications and Networks (CECNet), 2012, pp. 2803–2806, April 2012
Moreira, G., Santos, M.Y., Moura-Pires, J.: SNN Input Parameters: how are they related? In: International Conference on Parallel and Distributed Systems (ICPADS), pp. 492–497. IEEE (2013)
Musser, D.R., Derge, G.J., Saini, A.: STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. Addison-Wesley Professional (2009)
Ng, R.T., Han, J.: CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)
Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21(8), 1313–1325 (1995)
Rokach, L., Maimon, O.: Clustering methods. In: Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer (2005)
Wikipedia: Approximation algorithm (2015). Accessed June 2015
Xu, X., Ester, M., Kriegel, H.P., Sander, J.: A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of 14th International Conference on Data Engineering, 1998, pp. 324–331. IEEE (1998)
Yadav, P.K., Pandey, S., Mohanty, S.K.: Nearest neighbor based clustering algorithm for large data sets. arXiv:1505.05962
Zaïane, O.R., Foss, A., Lee, C.H., Wang, W.: On data clustering analysis: Scalability, constraints, and validation. In: Advances in Knowledge Discovery and Data Mining, pp. 28–39. Springer (2002)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–114. ACM (1996)

Author information

Authors and Affiliations

PDPM Indian Institute of Information Technology, Design and Manufacturing, Jabalpur, Jabalpur, India
Yadav Pankaj Kumar, Sriniwas Pandey & Mohanty Sraban Kumar
Indian Institute of Technology Guwahati, Guwahati, India
Mamata Samal

Corresponding authors

Correspondence to Sriniwas Pandey or Mohanty Sraban Kumar.

Editor information

Editors and Affiliations

Department of Mathematics & Computer Science, University of Missouri, St. Louis, Missouri, USA
Sanjiv K. Bhatia
Computer Science Engineering Department, ABES Engineering College, Ghaziabad, India
Shailesh Tiwari
Department of Computer Science & Engineering, Motilal Nehru National Institute of Technology, Allahabad, Uttar Pradesh, India
Krishn K. Mishra
Department of Computer Science & Engineering, ABES Engineering College, Ghaziabad, India
Munesh C. Trivedi

About this paper

Cite this paper

Pankaj Kumar, Y., Pandey, S., Samal, M., Sraban Kumar, M. (2019). Nearest Neighbor-Based Clustering Algorithm for Large Data Sets. In: Bhatia, S., Tiwari, S., Mishra, K., Trivedi, M. (eds) Advances in Computer Communication and Computational Sciences. Advances in Intelligent Systems and Computing, vol 760. Springer, Singapore. https://doi.org/10.1007/978-981-13-0344-9_6

DOI: https://doi.org/10.1007/978-981-13-0344-9_6
Published: 19 August 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0343-2
Online ISBN: 978-981-13-0344-9
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)