Skip to main content

Ensemble-based clustering of large probabilistic graphs using neighborhood and distance metric learning

Abstract

Graphs are commonly used to express the communication of various data. Faced with uncertain data, we have probabilistic graphs. As a fundamental problem of such graphs, clustering has many applications in analyzing uncertain data. In this paper, we propose a novel method based on ensemble clustering for large probabilistic graphs. To generate ensemble clusters, we develop a set of probable possible worlds of the initial probabilistic graph. Then, we present a probabilistic co-association matrix as a consensus function to integrate base clustering results. It relies on co-occurrences of node pairs based on the probability of the corresponding common cluster graphs. Also, we apply two improvements in the steps before and after of ensembles generation. In the before step, we append neighborhood information based on node features to the initial graph to achieve a more accurate estimation of the probability between the nodes. In the after step, we use supervised metric learning-based Mahalanobis distance to automatically learn a metric from ensemble clusters. It aims to gain crucial features of the base clustering results. We evaluate our work using five real-world datasets and three clustering evaluation metrics, namely the Dunn index, Davies–Bouldin index, and Silhouette coefficient. The results show the impressive performance of clustering large probabilistic graphs.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Notes

  1. 1.

    https://vlado.fmf.uni-lj.si/pub/networks/data/dic/roget/Roget.htm.

  2. 2.

    https://vlado.fmf.uni-lj.si/pub/networks/data/collab/geom.htm.

  3. 3.

    https://www.cs.cornell.edu/courses/cs685/2002fa/data/gr0.California.

  4. 4.

    https://snap.stanford.edu/data/ego-Facebook.html.

References

  1. 1.

    Zou Z, Li J, Gao H et al (2010) Mining frequent subgraph patterns from uncertain graph data. IEEE Trans Knowl Data Eng 22:1203–1218

    Article  Google Scholar 

  2. 2.

    Papapetrou O, Ioannou E, Skoutas D (2011) Efficient discovery of frequent subgraph patterns in uncertain graph databases. In: EDBT/ICDT’11, pp 355–366

  3. 3.

    Potamias M, Bonchi F, Gionis A et al (2010) k-nearest neighbors in uncertain graphs. Proc VLDB Endow 3(1):997–1008

    Article  Google Scholar 

  4. 4.

    Strehl A, Ghosh J (2003) Cluster ensembles—A knowledge reuse framework for combining partitions. J Mach Learn Res 3:583–617

    MathSciNet  MATH  Google Scholar 

  5. 5.

    Topchy A, Jain AK, Punch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12):1866–1881

    Article  Google Scholar 

  6. 6.

    Li F, Qian Y, Wang J et al (2019) Clustering ensemble based on sample’s stability. Artif Intell 273:37–55

    MathSciNet  Article  Google Scholar 

  7. 7.

    Boongoen T, Iam-On N (2018) Cluster ensembles: a survey of approaches with recent extensions and applications. Comput Sci Rev 28:1–25

    MathSciNet  Article  Google Scholar 

  8. 8.

    Alqurashi T, Wang W (2019) Clustering ensemble method. Int J Mach Learn Cyb 10:1227–1246

    Article  Google Scholar 

  9. 9.

    Vega-Pons S, Ruiz-Shulcloper J (2011) A survey of clustering ensemble algorithms. Int J Pattern Recogn Artif Intell 25(03):337–372

    MathSciNet  Article  Google Scholar 

  10. 10.

    Kollios G, Potamias M, Terzi E (2013) Clustering large probabilistic graphs. IEEE Trans Knowl Data Eng 25(2):325–336

    Article  Google Scholar 

  11. 11.

    Ailon N, Charikar M, Newman A (2005) Aggregating Inconsistent Information: Ranking and Clustering. In: Proceedings of the ACM Symposium on Theory of Computing (STOC), pp 684–693

  12. 12.

    Halim Z, Waqas M, Hussain SF (2015) Clustering large probabilistic graphs using multi-population evolutionary algorithm. Inf Sci 317:78–95

    Article  Google Scholar 

  13. 13.

    Gu Y, Gao C, Cong G et al (2014) Effective and efficient clustering methods for correlated probabilistic graphs. IEEE Trans Knowl Data Eng 26(5):1117–1130

    Article  Google Scholar 

  14. 14.

    Ceccarello M, Fantozzi C, Pietracaprina A et al (2017) Clustering uncertain graphs. Proc VLDB Endowment 11(4):472–544

    Article  Google Scholar 

  15. 15.

    Halim Z, Khattak JH (2019) Density-based clustering of big probabilistic graphs. Evolv Syst 10(3):333–350

    Article  Google Scholar 

  16. 16.

    Qiu YX, Li RH, Li J, Qiao S et al (2018) Efficient structural clustering on probabilistic graphs. IEEE Trans Knowl Data Eng 31(10):1954–1968

    Article  Google Scholar 

  17. 17.

    Iam-On N, Boongoen T (2015) Comparative study of matrix refinement approaches for ensemble clustering. Mach Learn 98(1–2):269–300

    MathSciNet  Article  Google Scholar 

  18. 18.

    Huang D, Wang CD, Wu JS et al (2019) Ultra-scalable spectral clustering and ensemble clustering. IEEE TKDE 32(6):1212–1226

    Google Scholar 

  19. 19.

    Iam-On N, Boongoen T, Garrett S et al (2011) A link-based approach to the cluster ensemble problem. IEEE Trans Pattern Anal Mach Intell 33(12):2396–2409

    Article  Google Scholar 

  20. 20.

    Yi J, Yang T, Jin R et al (2012) Robust ensemble clustering by matrix completion. In: Proceedings of IEEE International Conference on Data Mining (ICDM)

  21. 21.

    Fred AN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850

    Article  Google Scholar 

  22. 22.

    Lourenço A, Bulò SR, Rebagliati N et al (2015) Probabilistic consensus clustering using evidence accumulation. Mach Learn 98(1–2):331–357

    MathSciNet  Article  Google Scholar 

  23. 23.

    Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Proceedings of International Conference on Machine Learning (ICML)

  24. 24.

    Huang D, Lai JH, Wang CD (2016) Robust ensemble clustering using probability trajectories. IEEE Trans Knowl Data Eng 28(5):1312–1326

    Article  Google Scholar 

  25. 25.

    Huang D, Wang CD, Lai JH (2018) Locally weighted ensemble clustering. IEEE Trans Cybern 48(5):1460–1473

    Article  Google Scholar 

  26. 26.

    Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392

    MathSciNet  Article  Google Scholar 

  27. 27.

    Huang D, Lai J, Wang CD (2016) Ensemble clustering using factor graph. Pattern Recogn 50:131–142

    Article  Google Scholar 

  28. 28.

    Franek L, Jiang X (2014) Ensemble clustering by means of clustering embedding in vector spaces. Pattern Recogn 47(2):833–842

    Article  Google Scholar 

  29. 29.

    Weiszfeld E, Plastria F (2009) On the point for which the sum of the distances to n given points is minimum. Ann Oper Res 167(1):7–41

    MathSciNet  Article  Google Scholar 

  30. 30.

    Benjelloun O, Sarma AD, Halevy A et al (2006) ULDBs: databases with uncertainty and lineage. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB), pp 953–964

  31. 31.

    Dalvi NN, Suciu D (2004) Efficient Query Evaluation on Probabilistic Databases. In: Proceedings of the 30th International Conference on Very Large Databases, Toronto, Canada.

  32. 32.

    Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. ACM Trans Knowl Discov Data (TKDD) 1(1):1–30

    Article  Google Scholar 

  33. 33.

    Han K, Gui F, Xiao X et al (2019) Efficient and effective algorithms for clustering uncertain graphs. Proc VLDB Endow 12(6):667–680

    Article  Google Scholar 

  34. 34.

    Shamir R, Sharan R, Tsur D (2004) Cluster graph modification problems. Discrete Appl Math 144(1–2):173–182

    MathSciNet  Article  Google Scholar 

  35. 35.

    Bian W, Tao D (2011) Learning a distance metric by empirical loss minimization. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, pp 1186–1191, Barcelona, Spain

  36. 36.

    Luo Y, Wen Y, Tao D (2016) On combining side information and unlabeled data for heterogeneous multi-task metric learning. In: The 25th International Joint Conference on Artificial Intelligence, pp 1809–1815, New York

  37. 37.

    Xiang S, Nie F, Zhang C (2008) Learning a mahalanobis distance metric for data clustering and classification. Pattern Recogn 41(12):3600–3612

    Article  Google Scholar 

  38. 38.

    Xing EP, Ng AY, Jordan MI et al (2003) Distance metric learning with application to clustering with side-information. In: Proceedings of the 15th International Conference on Neural Information Processing Systems, pp 521–528. Cambridge

  39. 39.

    Law MT, Yu Y, Cord M et al (2016) Closed-form training of mahalanobis distance for supervised clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3909–3917, Las Vegas, NV

  40. 40.

    McFee B, Lanckriet GR (2010) Metric learning to rank. In: Proceedings of the 27th International Conference on Machine Learning, pp 775–782, Haifa, Israel.

  41. 41.

    Mahalanobis PC (1936) On the generalized distance in statistics. In: Proceedings of the National Institute of Science, Calcutta, India

  42. 42.

    Bellet A, Habrard A, Sebban M (2015) Metric learning. Synthesis lectures on artificial intelligence and machine learning. Morgan & Claypool Publishers, San Rafael

    MATH  Google Scholar 

  43. 43.

    Kulis B (2012) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364

    MathSciNet  Article  Google Scholar 

  44. 44.

    Krogan NJ et al (2006) Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature 440(7084):637–643

    Article  Google Scholar 

  45. 45.

    Wu X, Ma T, Cao J et al (2018) A comparative study of clustering ensemble algorithms. Comput & Electr Eng 68:603–615

    Article  Google Scholar 

  46. 46.

    Leutbecher M (2018) Ensemble size: how suboptimal is less than infinity? Q J R Meteorol Soc 145:107–128

    Article  Google Scholar 

  47. 47.

    Buizza R, Palmer TN (1998) Impact of ensemble size on ensemble prediction. Mon Weather Rev 126:2503–2518

    Article  Google Scholar 

Download references

Author information

Affiliations

Authors

Corresponding author

Correspondence to Morteza Dorrigiv.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Danesh, M., Dorrigiv, M. & Yaghmaee, F. Ensemble-based clustering of large probabilistic graphs using neighborhood and distance metric learning. J Supercomput 77, 4107–4134 (2021). https://doi.org/10.1007/s11227-020-03429-1

Download citation

Keywords

  • Probabilistic graph
  • Ensemble clustering
  • Distance metric learning
  • Neighborhood