Knowledge and Information Systems, Volume 50, Issue 3, pp 851–881

DASC: data aware algorithm for scalable clustering

  • Vasudha Bhatnagar
  • Sharanjit Kaur
  • Rakhi Saxena
  • Dhriti Khanna
Regular Paper

Abstract

The emergence of the MapReduce (MR) framework for scaling data mining and machine learning algorithms addresses Volume, whereas the handling of Variety and Velocity must be carefully designed into the algorithms themselves. So far, scalable clustering algorithms have focused solely on Volume, taking advantage of the MR framework. In this paper we present a MapReduce algorithm, data aware scalable clustering (DASC), which handles the 3 Vs of big data by being (i) single scan and distributed to handle Volume, (ii) incremental to cope with Velocity and (iii) versatile in handling numeric and categorical data to accommodate Variety. The DASC algorithm incrementally processes an infinitely growing data set stored on a distributed file system and delivers a quality clustering scheme while ensuring recency of patterns. The algorithm maintains an up-to-date synopsis of the data seen so far; each new data increment is processed and merged with the synopsis. Since the synopsis itself may grow very large, the algorithm stores it as a file, which makes DASC truly scalable. Exclusive clusters are obtained on demand by applying a connected component analysis (CCA) algorithm over the synopsis. CCA presents a subtle roadblock to effective parallelism during clustering. This problem is overcome by accomplishing the task in two stages. In the first stage, hyperclusters are identified based on prevailing data characteristics. The second stage uses this knowledge to determine the degree of parallelism, thereby making DASC data aware. Hyperclusters are distributed over the available compute nodes so that the clusters embedded in them are discovered in parallel. The staged approach to clustering yields the dual advantage of improved parallelism and the desired complexity in the \(\mathcal{MRC}^0\) class. The DASC algorithm is empirically compared with the incremental K-means and Scalable K-means++ algorithms. Experimentation on real-world and synthetic data with approximately 1.2 billion data points demonstrates the effectiveness of the DASC algorithm. Empirical observations of DASC execution are consistent with the theoretical analysis with respect to stability in resource utilization and execution time.
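
The abstract does not specify how the synopsis is structured or how CCA is run on it, so the following is only a toy, single-machine sketch of the general idea of merging increments into a synopsis and extracting exclusive clusters on demand. It assumes, purely for illustration, that the synopsis is a table of point counts per grid cell over numeric data; the names `summarize`, `merge` and `connected_components` and the parameters `CELL_WIDTH` and `MIN_COUNT` are hypothetical and are not taken from the paper. It does not attempt the MapReduce distribution, the hypercluster stage or categorical attributes.

```python
from collections import Counter, deque
from itertools import product

CELL_WIDTH = 1.0  # hypothetical grid resolution (assumption, not from the paper)
MIN_COUNT = 2     # hypothetical density threshold (assumption, not from the paper)


def summarize(points, width=CELL_WIDTH):
    """Map each numeric point to a grid cell and count the points per cell."""
    return Counter(tuple(int(x // width) for x in p) for p in points)


def merge(synopsis, increment):
    """Fold the cell counts of a new increment into the running synopsis."""
    merged = Counter(synopsis)
    merged.update(increment)
    return merged


def connected_components(synopsis, min_count=MIN_COUNT):
    """Group dense cells that touch (including diagonally) into clusters."""
    dense = {cell for cell, n in synopsis.items() if n >= min_count}
    seen, clusters = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search over adjacent dense cells
            cell = queue.popleft()
            component.append(cell)
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(a + b for a, b in zip(cell, offset))
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Two increments arrive one after the other; only the synopsis is retained.
batch1 = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.3), (1.7, 0.8)]
batch2 = [(5.1, 5.5), (5.9, 5.2), (9.5, 9.5)]
synopsis = merge(summarize(batch1), summarize(batch2))
clusters = connected_components(synopsis)
```

In this sketch each raw point is read once and then discarded, and each cluster is a set of cells, so no cell belongs to two clusters. The breadth-first search is inherently sequential within a component, which gives some intuition for why the abstract describes CCA as an obstacle to parallelism.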

Keywords

MapReduce · Synopsis · Incremental clustering · Scalable clustering · Complexity class MRC

Notes

Acknowledgments

We thank authors of [17] for making BigCross data set available to us. We also thank Cluster Innovation Centre, Delhi University, for permitting us to use its computing facility.

References

  1. Cordeiro FRL, Traina Junior C, Machado Traina AJ, López J, Kang U, Faloutsos C (2011) Clustering very large multi-dimensional datasets with MapReduce. In: Proceedings of the 17th ACM SIGKDD, New York. ACM, pp 690–698
  2. Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the seventeenth international conference on knowledge discovery and data mining. ACM, pp 681–689
  3. Zhou P, Lei J, Ye W (2011) Large-scale datasets clustering based on MapReduce and Hadoop. J Comput Inf Syst 7(16):5956–5963
  4. Aggarwal CC (ed) (2007) Data streams: models and algorithms. Springer, New York
  5. Barbara D (2002) Requirements for clustering data streams. SIGKDD Explor 3(2):23
  6. Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):13
  7. Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of international conference on very large data bases, pp 81–92
  8. Amini A, Teh YW, Saboohi H (2014) On density-based data streams clustering algorithms: a survey. J Comput Sci Technol 29(1):116–141
  9. Bhatnagar V, Kaur S, Chakravarthy S (2014) Clustering data streams using grid-based synopsis. Knowl Inf Syst 41:127–152
  10. Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the thirteenth International conference on knowledge discovery and data mining. ACM
  11. Forestiero A, Pizzuti C, Spezzano G (2013) A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Min Knowl Discov 26(1):1–26
  12. Lin J, Lin H (2011) A density-based clustering over evolving heterogeneous data stream. Int J Digit Content Technol Its Appl 5:325–330
  13. Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining, pp 326–337
  14. Orlowska ME, Sun X, Li X (2006) Can exclusive clustering on streaming data be achieved? SIGKDD Explor 8(2):102–108
  15. Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Record 33:32–37
  16. Akioka S (2013) Task graphs for stream mining algorithms. In: Proceedings of first international workshop on big dynamic distributed data. ACM, pp 55–60
  17. Hadian A, Shahrivari S (2014) High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs. J Supercomput 69(2):845–863
  18. Lv Z, Hu Y, Zhong H, Wu J, Li B, Zhao H (2010) Parallel K-means clustering of remote sensing images based on MapReduce. In: Proceedings of the 2010 international conference on web information systems and mining, WISM’10. Springer, Berlin, pp 162–170
  19. Wang S, Dutta H (2011) PARABLE: a parallel random-partition based hierarchical clustering algorithm for the MapReduce framework. http://hdl.handle.net/10022/AC:P:11821
  20. Zhanquan S (2013) A parallel clustering method study based on MapReduce. In: International workshop on cloud computing and information security. Atlantis Press
  21. Zhao W, Ma H, He Q (2009) Parallel K-means clustering based on MapReduce. In: Proceedings of the 1st international conference on cloud computing. Springer, pp 674–679
  22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. ACM Commun 51(1):107–113
  23. The Apache Software Foundation (1999). http://hadoop.apache.org/, http://hadoop.apache.org/hdfs/
  24. Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: Proceedings of the 19th ACM symposium on operating systems principles. ACM, pp 29–43
  25. Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc VLDB Endow 5(7):622–633
  26. Li Q, Wang P, Wang W, Hu H, Li Z, Li J (2014) An efficient K-means clustering algorithm on MapReduce. In: Proceedings of the 19th international conference on database systems for advanced applications, pp 357–371
  27. He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J (2011) MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of the 17th international conference on parallel and distributed systems. IEEE, pp 473–480
  28. Kim Y, Shim K, Kim M-S, Lee JS (2014) DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce. Inf Syst 42(0):15–35. ISSN 0306-4379
  29. Ganglia (2000) High Performance Monitoring Tool. University of California, Berkeley, http://ganglia.sourceforge.net/
  30. UCI KDD Archive (1999) KDD CUP 99 Intrusion Data. http://kdd.ics.uci.edu//databases/kddcup99
  31. Cardoso Margarida GMS (2014) Wholesale customers data. http://archive.ics.uci.edu/ml/datasets/Wholesale+customers
  32. Asuncion A, Newman DJ (2007) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Covertype
  33. Bhatt R, Dhall A (2012) Skin segmentation data. http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
  34. Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. ACM J Exp Algorithmics 17(1):327–338
  35. Tan P-N, Steinbach M, Kumar V (2014) Introduction to data mining, 2nd edn. Pearson Education, Limited, New York City
  36. Karloff H, Suri S, Vassilvitskii S (2010) A model of computation for MapReduce. In: Proceedings of the twenty-first annual ACM-SIAM symposium on discrete algorithms, SODA ’10, pp 938–948, Philadelphia. Society for Industrial and Applied Mathematics. http://dl.acm.org/citation.cfm?id=1873601.1873677

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Vasudha Bhatnagar (1)
  • Sharanjit Kaur (2)
  • Rakhi Saxena (3)
  • Dhriti Khanna (4)

  1. Department of Computer Science, University of Delhi, New Delhi, India
  2. Acharya Narendra Dev College, University of Delhi, New Delhi, India
  3. Deshbandhu College, University of Delhi, New Delhi, India
  4. Indraprastha Institute of Information Technology, New Delhi, India
