Evolving Systems, Volume 10, Issue 3, pp 333–350

Density-based clustering of big probabilistic graphs

  • Zahid Halim
  • Jamal Hussain Khattak
Original Paper


Clustering is a machine learning task that groups similar objects into coherent sets, such that the objects within a cluster exhibit similar behavior. With the exponential increase in data volume, robust approaches are required to process data and extract clusters. In addition to being large, datasets may contain uncertainty due to the heterogeneity of the data sources, which is characteristic of Big Data. Modern machine learning approaches and algorithms widely use probability theory to quantify data uncertainty. Such large uncertain data can be transformed into a probabilistic graph-based representation. This work presents an approach for density-based clustering of big probabilistic graphs. The proposed approach clusters large probabilistic graphs using the graph's density, where the clustering process is guided by the nodes' degree and the neighborhood information. The proposed approach is evaluated using seven real-world benchmark datasets, namely protein-to-protein interaction, Yahoo, MovieLens, core,, Delicious social bookmarking system, and Epinions. These datasets are first transformed into a graph-based representation before the proposed clustering algorithm is applied. The obtained results are evaluated using three cluster validation indices, namely the Davies–Bouldin index, the Dunn index, and the Silhouette coefficient. The proposal is also compared with four state-of-the-art approaches for clustering large probabilistic graphs. The results obtained using the seven datasets and three cluster validity indices suggest better performance of the proposed approach.
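As a rough illustration of the quantities the abstract refers to (node degree and density in a probabilistic graph), the following minimal sketch may help. It is not the authors' algorithm. It assumes a probabilistic graph is stored as a mapping from undirected edges to independent existence probabilities, and the function names `expected_degrees` and `expected_density` are hypothetical, introduced here only for illustration. By linearity of expectation, a node's expected degree is the sum of the probabilities of its incident edges.

```python
from collections import defaultdict


def expected_degrees(edges):
    """Expected degree of each node.

    `edges` maps an undirected edge (u, v) to its existence probability.
    This representation is an assumption made for the sketch.
    """
    degree = defaultdict(float)
    for (u, v), p in edges.items():
        degree[u] += p
        degree[v] += p
    return dict(degree)


def expected_density(edges, nodes):
    """Expected number of edges inside `nodes`, divided by the number of node pairs."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    inside = sum(p for (u, v), p in edges.items() if u in nodes and v in nodes)
    return inside / (n * (n - 1) / 2)


# Toy graph: a, b, c are strongly connected; d hangs off c with a weak edge.
toy_edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.7, ("c", "d"): 0.1}

degrees = expected_degrees(toy_edges)
dense_group = expected_density(toy_edges, {"a", "b", "c"})
whole_graph = expected_density(toy_edges, {"a", "b", "c", "d"})
```

In this toy graph the group {a, b, c} has a much higher expected density than the whole graph, which is the kind of signal a density-based method could use to separate d from the rest.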


Keywords: Clustering graphs · Machine learning · Big graphs · Clustering · Community detection



The authors would like to thank the anonymous reviewers for their helpful and constructive comments, which greatly contributed to improving the quality of the manuscript.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan
  2. Business Solutions and Development, Information Technology Group, Allied Bank Limited, Lahore, Pakistan
