Nearest Neighbor-Based Clustering Algorithm for Large Data Sets

  • Conference paper
Advances in Computer Communication and Computational Sciences

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 760)

Abstract

Clustering is an unsupervised learning technique in which data objects are grouped into sets according to a similarity measure. Most clustering algorithms assume that main memory is unbounded and can hold the complete set of patterns. In practice, many applications produce pattern sets that do not fit in main memory, so much of the data resides in secondary memory. Disk input/output (I/O) operations are then the major bottleneck in designing efficient clustering algorithms for large data sets. Several design techniques have been used to build clustering algorithms for such data. External memory algorithms are one class of algorithms suited to large data sets: they exploit the hierarchical memory structure of computers by incorporating locality of reference directly into the algorithm. This paper contributes to the design of clustering algorithms in the external memory model (proposed by Aggarwal and Vitter) to make them scalable. We show that the shared near neighbors algorithm is not I/O efficient, because its I/O complexity is the same as its computational complexity and both are high. We redesign the algorithm in the external memory model, reducing its I/O complexity without changing its computational complexity. We substantiate the theoretical analysis by implementing the algorithms with the STXXL library and comparing their performance with that of their traditional counterparts.
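
For readers unfamiliar with the shared near neighbors idea, the following is a minimal in-memory sketch of a Jarvis-Patrick style rule, not the external memory redesign described in the abstract. It assumes two points join the same cluster when each appears in the other's k-nearest-neighbor list and the two lists share at least `kt` entries. The function names `k_nearest` and `snn_clusters`, the parameters `k` and `kt`, and the brute-force neighbor search are all illustrative choices rather than details taken from the paper.

```python
from itertools import combinations
from math import dist


def k_nearest(points, k):
    """Return, for each point index, the set of its k nearest other indices."""
    neighbors = []
    for i, p in enumerate(points):
        # Brute-force search: rank every other point by Euclidean distance.
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors


def snn_clusters(points, k, kt):
    """Label points using a shared near neighbor rule (illustrative sketch)."""
    nn = k_nearest(points, k)
    parent = list(range(len(points)))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(points)), 2):
        # Merge only if i and j list each other and share at least kt neighbors.
        mutual = i in nn[j] and j in nn[i]
        if mutual and len(nn[i] & nn[j]) >= kt:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    print(snn_clusters(pts, k=3, kt=2))
```

Both loops touch neighbor lists for arbitrary pairs of points. If those lists lived on disk and were fetched one at a time, each access could plausibly cost a separate I/O, which is one way to read the abstract's remark that the I/O cost tracks the computational cost.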

Notes

  1. https://en.wikipedia.org/?title=Binary_prefix.

References

  1. Abello, J., Pardalos, P.M., Resende, M.G.: Handbook of Massive Data Sets. Springer (2002)

  2. Aggarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)

  3. Arge, L., Procopiuc, O., Vitter, J.: Implementing I/O-efficient data structures using TPIE. In: Algorithms ESA 2002, pp. 88–100. Springer (2002)

  4. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable k-means++. Proc. VLDB Endow. 5(7), 622–633 (2012)

  5. Ball, G.H., Hall, D.J.: A clustering technique for summarizing multivariate data. Behav. Sci. 12(2), 153–155 (1967)

  6. Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic information retrieval. In: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. pp. 626–635. ACM (1997)

  7. Crauser, A., Mehlhorn, K.: LEDA-SM: Extending LEDA to secondary memory. In: Algorithm Engineering, pp. 228–242. Springer (1999)

  8. Dementiev, R., Kettner, L., Sanders, P.: STXXL: standard template library for XXL data sets. Softw. Pract. Exp. 38(6), 589–637 (2008)

  9. Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD Record, vol. 27, pp. 73–84. ACM (1998)

  10. Han, J., Kamber, M.: Data Mining. Concepts and Techniques. Morgan Kaufmann, Southeast Asia Edition (2006)

  11. Januzaj, E., Kriegel, H.P., Pfeifle, M.: Dbdc: Density based distributed clustering. In: Advances in Database Technology—EDBT 2004, Lecture Notes in Computer Science, vol. 2992, pp. 88–105 (2004)

  12. Jarvis, R.A., Patrick, E.A.: Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. C-22(11), 1025–1034 (1973)

  13. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for k-means clustering. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pp. 10–18. ACM (2002)

  14. Kim, W.: Parallel clustering algorithms: survey (2009). http://www.cs.gsu.edu/~wkim/indexfiles/SurveyParallelClustering.pdf

  15. Liu, Y., Guo, Q., Yang, L., Li, Y.: Research on incremental clustering. In: 2nd International Conference on Consumer Electronics, Communications and Networks (CECNet), 2012, pp. 2803–2806, April 2012

  16. Moreira, G., Santos, M.Y., Moura-Pires, J.: SNN Input Parameters: how are they related? In: International Conference on Parallel and Distributed Systems (ICPADS), pp. 492–497. IEEE (2013)

  17. Musser, D.R., Derge, G.J., Saini, A.: STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. Addison-Wesley Professional (2009)

  18. Ng, R.T., Han, J.: CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)

  19. Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21(8), 1313–1325 (1995)

  20. Rokach, L., Maimon, O.: Clustering methods. In: Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer (2005)

  21. Wikipedia: Approximation algorithm (2015). Accessed June 2015

  22. Xu, X., Ester, M., Kriegel, H.P., Sander, J.: A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of 14th International Conference on Data Engineering, 1998, pp. 324–331. IEEE (1998)

  23. Yadav, P.K., Pandey, S., Mohanty, S.K.: Nearest neighbor based clustering algorithm for large data sets. arXiv:1505.05962

  24. Zaïane, O.R., Foss, A., Lee, C.H., Wang, W.: On data clustering analysis: Scalability, constraints, and validation. In: Advances in Knowledge Discovery and Data Mining, pp. 28–39. Springer (2002)

  25. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–114. ACM (1996)

Author information

Corresponding authors

Correspondence to Sriniwas Pandey or Mohanty Sraban Kumar.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yadav, P.K., Pandey, S., Samal, M., Mohanty, S.K. (2019). Nearest Neighbor-Based Clustering Algorithm for Large Data Sets. In: Bhatia, S., Tiwari, S., Mishra, K., Trivedi, M. (eds) Advances in Computer Communication and Computational Sciences. Advances in Intelligent Systems and Computing, vol 760. Springer, Singapore. https://doi.org/10.1007/978-981-13-0344-9_6
