Intra graph clustering using collaborative similarity measure

Abstract

A graph is an extremely versatile data structure: its expressiveness and flexibility let it model a wide range of real-life phenomena. Networks such as social networks, sensor networks, and computer networks are represented and stored as graphs. The analysis of such graphs has long been of great importance, and it is performed from various perspectives to get the most out of these multifaceted information repositories. When the analysis aims to find groups of vertices based on their similarity, clustering is the most obvious choice. Earlier graph clustering approaches focus either on topological structure or on attribute likeness; only a few recent methods consider both aspects simultaneously. Because similarity estimation requires enormous computation, these methods often suffer from scalability issues. To overcome this limitation, we introduce a collaborative similarity measure (CSM) for intra-graph clustering. CSM uses a shortest-path strategy, instead of all paths, to define structural and semantic relevance among vertices. First, we calculate the pairwise similarity among vertices using CSM. Second, vertices are grouped based on the calculated similarity under the k-Medoid framework. Empirical analysis based on density and entropy demonstrates the efficacy of CSM over existing measures. Moreover, CSM is a potential candidate for medium-scale graph analysis because it requires an order of magnitude fewer computations.

Notes

  1. \({<}10^4\) nodes.

  2. \({<}10^6\) nodes.

  3. http://math.nist.gov/javanumerics/jama.

  4. The source code is available at https://github.com/WNawaz/CSM.

  5. http://www-personal.umich.edu/mejn/netdata.

  6. The detailed conference list is DB: SIGMOD, VLDB, PODS, ICDE, EDBT; DM: KDD, ICDM, SDM, PAKDD, PKDD; IR: SIGIR, CIKM, ECIR, WWW; AI: IJCAI, AAAI, UAI, NIPS.

  7. \(10^6-10^9\) nodes.

Acknowledgments

We are thankful to the anonymous reviewers for valuable comments and suggestions. This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2014-(H0301-14-1003)).

Author information

Corresponding author

Correspondence to Young-Koo Lee.

Appendix

The pseudocode of the proposed method is given in the following algorithm. Similarity estimation and clustering are the two main, independent steps. Initialization and similarity calculation are done once, whereas the clustering step is iterative.
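The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the exact CSM formula is not stated here, so `edge_similarity` (a blend of edge presence and attribute Jaccard weighted by a hypothetical `alpha`), the use of `-log(similarity)` as the shortest-path cost, and all function names are assumptions made only for illustration.

```python
import heapq
import math


def jaccard(a, b):
    """Jaccard coefficient of two attribute sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def edge_similarity(u, v, attrs, alpha=0.5):
    """Assumed direct similarity: edge presence (1.0) blended with the
    attribute Jaccard coefficient by the hypothetical weight alpha."""
    return alpha * 1.0 + (1 - alpha) * jaccard(attrs[u], attrs[v])


def pairwise_similarity(adj, attrs, alpha=0.5):
    """Step 1 (done once): similarity of every vertex pair.

    Dijkstra from each vertex with edge cost -log(similarity), so the
    shortest path is the one with the largest product of edge similarities.
    Unreachable pairs keep similarity 0.0.
    """
    sim = {u: {v: 0.0 for v in adj} for u in adj}
    for src in adj:
        dist = {src: 0.0}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, math.inf):
                continue  # stale queue entry
            for v in adj[u]:
                s = edge_similarity(u, v, attrs, alpha)
                if s <= 0.0:
                    continue  # zero similarity: edge is unusable
                nd = d - math.log(s)
                if nd < dist.get(v, math.inf):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        for v, d in dist.items():
            sim[src][v] = math.exp(-d)
    return sim


def k_medoids(sim, medoids, max_iter=100):
    """Step 2 (iterative): alternate assignment and medoid update."""
    medoids = list(medoids)
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for v in sim:
            # A medoid always stays in its own cluster; every other vertex
            # joins the medoid it is most similar to.
            best = v if v in clusters else max(medoids, key=lambda m: sim[v][m])
            clusters[best].append(v)
        # New medoid: the member most similar to the rest of its cluster.
        new_medoids = [
            max(members, key=lambda c: sum(sim[c][x] for x in members))
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return clusters


# Toy graph: two triangles joined by the bridge c-d, one attribute per vertex.
adj = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
attrs = {
    "a": {"db"}, "b": {"db"}, "c": {"db"},
    "d": {"ai"}, "e": {"ai"}, "f": {"ai"},
}
sim = pairwise_similarity(adj, attrs)
clusters = k_medoids(sim, medoids=["a", "f"])
print(sorted(sorted(members) for members in clusters.values()))
```

Because the similarity matrix is computed once and only read afterwards, the clustering loop can be rerun with different initial medoids or a different number of clusters without repeating the shortest-path computation.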

Cite this article

Nawaz, W., Khan, KU., Lee, YK. et al. Intra graph clustering using collaborative similarity measure. Distrib Parallel Databases 33, 583–603 (2015). https://doi.org/10.1007/s10619-014-7170-x

Keywords

  • Graph clustering
  • Collaborative similarity
  • k-Medoid clustering
  • Entropy
  • Density
  • Jaccard similarity coefficient