Abstract
Clustering is a machine learning task that groups similar objects into coherent sets, so that the objects within a cluster exhibit similar behavior. With the exponential increase in data volume, robust approaches are required to process the data and extract clusters. In addition to their large volume, datasets may contain uncertainties due to the heterogeneity of the data sources, together giving rise to Big Data. Modern machine learning approaches and algorithms widely use probability theory to model data uncertainty. Such large, uncertain data can be transformed into a probabilistic graph-based representation. This work presents an approach for density-based clustering of big probabilistic graphs. The proposed approach clusters large probabilistic graphs using the graph’s density, where the clustering process is guided by the nodes’ degree and the neighborhood information. The approach is evaluated using seven real-world benchmark datasets, namely protein-to-protein interaction, yahoo, movie-lens, core, last.fm, delicious social bookmarking system, and epinions. These datasets are first transformed into a graph-based representation before the proposed clustering algorithm is applied. The obtained results are evaluated using three cluster validation indices, namely the Davies–Bouldin index, the Dunn index, and the Silhouette coefficient. The proposal is also compared with four state-of-the-art approaches for clustering large probabilistic graphs. The results obtained on the seven datasets with the three cluster validity indices suggest better performance of the proposed approach.
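To make the notion of node degree in a probabilistic graph concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that each edge carries an independent existence probability, in which case the expected degree of a node is the sum of the probabilities of its incident edges. The function names `expected_degrees` and `highest_degree_node`, and the toy edge list, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def expected_degrees(edges):
    """Return the expected degree of every node.

    `edges` is an iterable of (u, v, p) triples, where p is the assumed
    existence probability of the undirected edge u-v. Under the assumption
    of independent edges, the expected degree of a node is the sum of the
    probabilities of its incident edges.
    """
    degree = defaultdict(float)
    for u, v, p in edges:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"edge probability out of range: {p}")
        degree[u] += p
        degree[v] += p
    return dict(degree)


def highest_degree_node(edges):
    """Pick the node with the largest expected degree.

    Illustrative assumption: such a node could serve as a starting point
    when looking for a dense region of the graph.
    """
    degree = expected_degrees(edges)
    return max(degree, key=degree.get)


# Hypothetical toy graph: four nodes, four uncertain edges.
toy_edges = [
    ("a", "b", 0.9),
    ("a", "c", 0.8),
    ("b", "c", 0.7),
    ("c", "d", 0.5),
]

if __name__ == "__main__":
    print(expected_degrees(toy_edges))
    print(highest_degree_node(toy_edges))
```

In this toy graph, node `c` has the largest expected degree because it touches three edges with comparatively high probabilities. How the proposed approach actually combines degree and neighborhood information is described in the full article, not in this sketch.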
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful and constructive comments, which greatly contributed to improving the quality of the manuscript.
Cite this article
Halim, Z., Khattak, J.H. Density-based clustering of big probabilistic graphs. Evolving Systems 10, 333–350 (2019). https://doi.org/10.1007/s12530-018-9223-2
DOI: https://doi.org/10.1007/s12530-018-9223-2