Abstract
Identifying clusters is an important aspect of analyzing large datasets. Clustering algorithms traditionally require access to the complete dataset. However, as huge amounts of data increasingly originate from multiple, dispersed sources in distributed systems, alternative solutions are required. Furthermore, data and network dynamics in a distributed setting demand adaptable clustering solutions that deliver accurate clustering models at a reasonable pace. In this paper, we propose GoSCAN, a fully decentralized density-based clustering algorithm capable of clustering dynamic and distributed datasets without requiring central control or message flooding. We identify two major tasks, finding the core data points and forming the actual clusters, which we execute in parallel using gossip-based communication. This approach is efficient because it gives each peer enough authority to discover the clusters it is interested in. Our algorithm imposes no extra burden of overlay formation on the network, while providing high scalability. We also offer several optimizations to the basic clustering algorithm that reduce communication overhead and processing costs. Dynamic data is handled by introducing an age factor, which gradually detects dataset changes and enables clustering updates. Our experimental evaluation shows that GoSCAN can discover the clusters efficiently with scalable transmission cost.
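The two tasks named in the abstract can be illustrated with a minimal, centralized sketch, assuming the usual density-based definitions: a point is a core point when at least `min_pts` points (itself included) lie within distance `eps` of it, and clusters grow by linking core points that lie within `eps` of each other. The names `core_points`, `form_clusters`, `eps`, and `min_pts` are illustrative assumptions, and the sketch runs on a single machine; it is not GoSCAN's gossip-based protocol, which distributes these steps across peers.

```python
from math import dist


def core_points(points, eps, min_pts):
    """Task 1 (assumed definition): indices of points that have at least
    min_pts points, themselves included, within distance eps."""
    cores = set()
    for i, p in enumerate(points):
        neighbours = sum(1 for q in points if dist(p, q) <= eps)
        if neighbours >= min_pts:
            cores.add(i)
    return cores


def form_clusters(points, eps, min_pts):
    """Task 2 (assumed definition): grow one cluster per connected group of
    core points. Returns {point index: cluster id}; indices missing from
    the result are treated as noise."""
    cores = core_points(points, eps, min_pts)
    labels = {}
    next_id = 0
    for seed in sorted(cores):
        if seed in labels:
            continue  # already absorbed into an earlier cluster
        labels[seed] = next_id
        stack = [seed]
        while stack:
            current = stack.pop()
            for j, q in enumerate(points):
                if j not in labels and dist(points[current], q) <= eps:
                    labels[j] = next_id
                    # Only core points keep expanding the cluster;
                    # border points are labelled but not expanded.
                    if j in cores:
                        stack.append(j)
        next_id += 1
    return labels
```

In a decentralized setting, no single peer holds `points` in full, so the neighbour counts and the links between core points would have to be assembled from information exchanged between peers rather than from the local loops used here.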
Additional information
Hoda Mashayekhi's research was supported by the Research Institute for ICT under grant No. T/500/5197.
About this article
Cite this article
Mashayekhi, H., Habibi, J., Voulgaris, S. et al. GoSCAN: Decentralized scalable data clustering. Computing 95, 759–784 (2013). https://doi.org/10.1007/s00607-012-0264-2
DOI: https://doi.org/10.1007/s00607-012-0264-2