Density-based clustering of big probabilistic graphs

Abstract

Clustering is a machine learning task that groups similar objects into coherent sets. Objects within a cluster exhibit similar behavior. With the exponential increase in data volume, robust approaches are required to process the data and extract clusters. In addition to their large volume, datasets may contain uncertainty arising from the heterogeneity of the data sources; both properties are characteristic of Big Data. Modern machine learning approaches and algorithms widely use probability theory to quantify data uncertainty. Such large uncertain datasets can be transformed into a probabilistic graph-based representation. This work presents an approach for density-based clustering of big probabilistic graphs. The approach clusters large probabilistic graphs using the graph's density, with the clustering process guided by node degree and neighborhood information. It is evaluated using seven real-world benchmark datasets, namely protein-to-protein interaction, Yahoo, MovieLens, core, Last.fm, the Delicious social bookmarking system, and Epinions. These datasets are first transformed into a graph-based representation before the proposed clustering algorithm is applied. The obtained results are evaluated using three cluster validation indices, namely the Davies–Bouldin index, the Dunn index, and the Silhouette coefficient. The proposed approach is also compared with four state-of-the-art approaches for clustering large probabilistic graphs. Across the seven datasets and three cluster validity indices, the results suggest that the proposed approach performs better than these four approaches.
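
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea of degree- and neighborhood-guided clustering on a probabilistic graph, not the authors' method. It assumes the graph is given as undirected edges `(u, v, p)` where `p` is the edge-existence probability, uses the expected degree (sum of incident edge probabilities) as the density signal, and applies a hypothetical threshold `min_prob` to decide which neighbors join a cluster. The function names `expected_degrees` and `density_clusters` are illustrative.

```python
from collections import defaultdict


def expected_degrees(edges):
    """Expected degree of each node: the sum of its incident edge probabilities."""
    deg = defaultdict(float)
    for u, v, p in edges:
        deg[u] += p
        deg[v] += p
    return dict(deg)


def density_clusters(edges, min_prob=0.5):
    """Greedy sketch (an assumption, not the published algorithm).

    Nodes are visited in order of decreasing expected degree. Each node that
    is still unassigned seeds a new cluster and absorbs its unassigned
    neighbors whose connecting edge has probability >= min_prob.
    """
    # Build an undirected adjacency map: node -> {neighbor: probability}.
    adj = defaultdict(dict)
    for u, v, p in edges:
        adj[u][v] = p
        adj[v][u] = p

    deg = expected_degrees(edges)
    assigned = set()
    clusters = []
    # Ties in expected degree are broken by node name for determinism.
    for seed in sorted(deg, key=lambda n: (-deg[n], str(n))):
        if seed in assigned:
            continue
        members = [seed]
        assigned.add(seed)
        for nbr, p in adj[seed].items():
            if nbr not in assigned and p >= min_prob:
                assigned.add(nbr)
                members.append(nbr)
        clusters.append(sorted(members))
    return clusters


if __name__ == "__main__":
    # Toy graph: two dense triangles joined by one unlikely edge (c-d).
    toy_edges = [
        ("a", "b", 0.9), ("a", "c", 0.8), ("b", "c", 0.7),
        ("c", "d", 0.1),
        ("d", "e", 0.9), ("d", "f", 0.85), ("e", "f", 0.6),
    ]
    print(density_clusters(toy_edges))
```

On the toy graph, node `d` has the highest expected degree and seeds the first cluster, so the sketch returns the two triangles as separate clusters because the low-probability edge between `c` and `d` falls below `min_prob`. A real implementation for big graphs would also need to handle multi-hop neighborhoods and scale, which this sketch deliberately leaves out.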

Notes

  1. http://llama.mshri.on.ca/cgi/ClusterJudge/cluster_judge.pl

  2. https://thebiogrid.org/

  3. https://webscope.sandbox.yahoo.com/

  4. http://www.epinions.com/

  5. https://grouplens.org/datasets/movielens/

  6. http://ir.ii.uam.es/hetrec2011/datasets.html

  7. https://labrosa.ee.columbia.edu/millionsong/lastfm

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful and constructive comments, which greatly contributed to improving the quality of the manuscript.

Author information

Corresponding author

Correspondence to Zahid Halim.

About this article

Cite this article

Halim, Z., Khattak, J.H. Density-based clustering of big probabilistic graphs. Evolving Systems 10, 333–350 (2019). https://doi.org/10.1007/s12530-018-9223-2

Keywords

  • Clustering graphs
  • Machine learning
  • Big graphs
  • Clustering
  • Community detection