GoSCAN: Decentralized scalable data clustering

Abstract

Identifying clusters is an important aspect of analyzing large datasets. Clustering algorithms classically require access to the complete dataset. However, as huge amounts of data increasingly originate from multiple, dispersed sources in distributed systems, alternative solutions are required. Furthermore, data and network dynamics in a distributed setting demand adaptable clustering solutions that deliver accurate clustering models at a reasonable pace. In this paper, we propose GoSCAN, a fully decentralized density-based clustering algorithm that can cluster dynamic and distributed datasets without requiring central control or message flooding. We identify two major tasks, finding the core data points and forming the actual clusters, which we execute in parallel using gossip-based communication. This approach is efficient because it gives each peer enough authority to discover the clusters it is interested in. Our algorithm imposes no extra burden of overlay formation on the network, while providing high scalability. We also offer several optimizations to the basic clustering algorithm that reduce communication overhead and processing costs. Dynamic data is handled by introducing an age factor, which gradually detects dataset changes and enables clustering updates. Our experimental evaluation shows that GoSCAN discovers the clusters efficiently with scalable transmission cost.
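
The first of the two tasks, finding core data points through gossip, can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes a DBSCAN-style definition in which a point is a core point when at least `min_pts` points (itself included) lie within distance `eps`, and it assumes that each peer can pick a uniformly random partner in every round. The names `Peer`, `gossip_round`, `core_points`, `eps`, and `min_pts` are hypothetical and chosen only for this illustration. The optimizations and the age factor mentioned in the abstract are not modeled.

```python
import math
import random


class Peer:
    """A peer that holds local points and learns about nearby remote points."""

    def __init__(self, name, points):
        self.name = name
        self.points = list(points)
        # For each local point: the set of points within eps discovered so far.
        # Assumption: points are distinct tuples, so a set avoids double counting.
        self.seen = {p: set() for p in self.points}


def gossip_round(peers, eps, rng):
    """One round: every peer contacts one random partner (possibly itself)
    and records which of the partner's points fall within eps of its own."""
    for peer in peers:
        partner = rng.choice(peers)
        for p in peer.points:
            for q in partner.points:
                if math.dist(p, q) <= eps:
                    peer.seen[p].add(q)


def core_points(peer, min_pts):
    """Local points whose discovered eps-neighborhood has at least min_pts points."""
    return {p for p, found in peer.seen.items() if len(found) >= min_pts}


# Four points forming a unit square are split across two peers;
# a third peer holds an isolated point.
rng = random.Random(0)
peers = [
    Peer("a", [(0, 0), (0, 1)]),
    Peer("b", [(1, 0), (1, 1)]),
    Peer("c", [(10, 10)]),
]
for _ in range(100):
    gossip_round(peers, eps=1.5, rng=rng)

for peer in peers:
    print(peer.name, sorted(core_points(peer, min_pts=4)))
```

In this sketch no peer ever holds the complete dataset, and no peer floods the network: each one only accumulates neighborhood information for its own points, one random contact at a time. A point's core status can therefore only change from non-core to core as rounds proceed, which is why the estimate improves gradually.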

Author information

Corresponding author

Correspondence to Hoda Mashayekhi.

Additional information

Hoda Mashayekhi's research was supported by the Research Institute for ICT under grant no. T/500/5197.

About this article

Cite this article

Mashayekhi, H., Habibi, J., Voulgaris, S. et al. GoSCAN: Decentralized scalable data clustering. Computing 95, 759–784 (2013). https://doi.org/10.1007/s00607-012-0264-2

  • DOI: https://doi.org/10.1007/s00607-012-0264-2
