Density-based clustering of big probabilistic graphs

Abstract

Clustering is a machine learning task that groups similar objects into coherent sets, such that objects within a cluster exhibit similar behavior. As data volumes grow exponentially, robust approaches are required to process the data and extract clusters. In addition to their size, datasets may contain uncertainty arising from the heterogeneity of the data sources, a combination characteristic of Big Data. Modern machine learning approaches and algorithms widely use probability theory to quantify this uncertainty. Such large, uncertain data can be transformed into a probabilistic graph-based representation. This work presents an approach for density-based clustering of big probabilistic graphs. The proposed approach clusters large probabilistic graphs using the graph's density, with the clustering process guided by the nodes' degree and neighborhood information. It is evaluated on seven real-world benchmark datasets: protein–protein interaction, Yahoo, MovieLens, core, Last.fm, the Delicious social bookmarking system, and Epinions. These datasets are first transformed into a graph-based representation before the proposed clustering algorithm is applied. The resulting clusters are evaluated using three cluster validity indices: the Davies–Bouldin index, the Dunn index, and the Silhouette coefficient. The proposal is also compared with four state-of-the-art approaches for clustering large probabilistic graphs. Across the seven datasets and three cluster validity indices, the results suggest that the proposed approach performs better than the compared approaches.
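
The abstract describes clustering guided by node degree and neighborhood information but does not give the algorithm itself. The following is only a minimal sketch of the general idea, not the authors' method: it assumes an undirected probabilistic graph stored as a dictionary of edge probabilities, uses the expected degree (the sum of incident edge probabilities) as a density measure, and groups nodes in a DBSCAN-like manner. The function names and the thresholds `min_degree` and `min_prob` are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


def expected_degrees(edges):
    """Expected degree of each node: the sum of its incident edge probabilities."""
    deg = defaultdict(float)
    for (u, v), p in edges.items():
        deg[u] += p
        deg[v] += p
    return dict(deg)


def density_clusters(edges, min_degree=1.5, min_prob=0.5):
    """DBSCAN-style grouping on an undirected probabilistic graph.

    min_degree and min_prob are hypothetical thresholds (assumptions),
    not values taken from the paper.
    """
    deg = expected_degrees(edges)

    # Keep only edges that are likely to exist.
    adj = defaultdict(dict)
    for (u, v), p in edges.items():
        if p >= min_prob:
            adj[u][v] = p
            adj[v][u] = p

    # Core nodes are those whose expected degree reaches the density threshold.
    cores = {n for n, d in deg.items() if d >= min_degree}

    labels = {}
    next_label = 0
    # Grow one cluster per connected group of core nodes.
    for seed in sorted(cores):
        if seed in labels:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr in cores and nbr not in labels:
                    labels[nbr] = next_label
                    stack.append(nbr)
        next_label += 1

    # Attach each non-core node to the core neighbor reached by the most
    # probable edge; nodes with no such neighbor are marked as noise (-1).
    for node in sorted(deg):
        if node in labels:
            continue
        core_nbrs = {n: p for n, p in adj[node].items() if n in cores}
        if core_nbrs:
            labels[node] = labels[max(core_nbrs, key=core_nbrs.get)]
        else:
            labels[node] = -1
    return labels


# Toy graph: two dense triangles joined by an unlikely edge, plus a weakly attached node.
edges = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7,
    ("c", "d"): 0.2,
    ("d", "e"): 0.9, ("d", "f"): 0.8, ("e", "f"): 0.7,
    ("f", "g"): 0.3,
}
labels = density_clusters(edges)
print(labels)
```

On this toy graph the two triangles form separate clusters because the edge joining them is unlikely to exist, and the weakly attached node is treated as noise. A real implementation for big graphs would also need to consider scalability and how the thresholds are selected, which this sketch does not address.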

Notes

  1. http://llama.mshri.on.ca/cgi/ClusterJudge/cluster_judge.pl.

  2. https://thebiogrid.org/.

  3. https://webscope.sandbox.yahoo.com/.

  4. http://www.epinions.com/.

  5. https://grouplens.org/datasets/movielens/.

  6. http://ir.ii.uam.es/hetrec2011/datasets.html.

  7. https://labrosa.ee.columbia.edu/millionsong/lastfm.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful and constructive comments that greatly contributed to improve the manuscript quality.

Author information

Corresponding author

Correspondence to Zahid Halim.

About this article

Cite this article

Halim, Z., Khattak, J.H. Density-based clustering of big probabilistic graphs. Evolving Systems 10, 333–350 (2019). https://doi.org/10.1007/s12530-018-9223-2

  • DOI: https://doi.org/10.1007/s12530-018-9223-2

Keywords

  • Clustering graphs
  • Machine learning
  • Big graphs
  • Clustering
  • Community detection