Abstract
In recent years, many information networks have become available for analysis, including social networks, road networks, sensor networks, and biological networks. Graph clustering has shown its effectiveness in analyzing and visualizing large networks. The goal of graph clustering is to partition the vertices of a large graph into clusters based on criteria such as vertex connectivity or neighborhood similarity. Many existing graph clustering methods focus mainly on topological structure and largely ignore vertex properties, which are often heterogeneous. Recently, a new graph clustering algorithm, SA-Cluster, has been proposed, which combines structural and attribute similarities through a unified distance measure. SA-Cluster performs matrix multiplication to calculate the random walk distances between graph vertices. As part of the clustering refinement, the graph edge weights are iteratively adjusted to balance the relative importance of structural and attribute similarities. As a consequence, matrix multiplication is repeated in each iteration of the clustering process to recalculate the random walk distances, which are affected by the edge weight update. To improve the efficiency and scalability of SA-Cluster, in this paper we propose an efficient algorithm, Inc-Cluster, to incrementally update the random walk distances given the edge weight increments. Complexity analysis is provided to estimate how much runtime cost Inc-Cluster can save. We further design parallel matrix computation techniques on a multicore architecture. Experimental results demonstrate that Inc-Cluster achieves significant speedup over SA-Cluster on large graphs, while achieving exactly the same clustering quality in terms of intra-cluster structural cohesiveness and attribute value homogeneity.
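To make the cost described in the abstract concrete, the following is a minimal sketch of a random walk distance computed by repeated matrix multiplication. It is not the Inc-Cluster algorithm, and the abstract does not give the formula: the truncated sum with a restart probability, the values `restart=0.2` and `max_len=3`, the toy graph, and all function names are assumptions made for illustration only.

```python
def normalize_rows(weights):
    """Turn a non-negative weight matrix into a row-stochastic transition matrix."""
    result = []
    for row in weights:
        total = sum(row)
        result.append([w / total if total > 0 else 0.0 for w in row])
    return result


def matmul(a, b):
    """Plain O(n^3) product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_walk_matrix(weights, restart=0.2, max_len=3):
    """Assumed form: sum of restart * (1 - restart)**l * P**l for l = 1..max_len."""
    p = normalize_rows(weights)
    n = len(p)
    power = [row[:] for row in p]  # holds P**l, starting at l = 1
    total = [[0.0] * n for _ in range(n)]
    for length in range(1, max_len + 1):
        coeff = restart * (1.0 - restart) ** length
        for i in range(n):
            for j in range(n):
                total[i][j] += coeff * power[i][j]
        if length < max_len:
            power = matmul(power, p)  # one full matrix multiplication per step
    return total


# Hypothetical toy graph: vertex 0 is linked to vertices 1 and 2.
weights = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
before = random_walk_matrix(weights)

# Raising one edge weight changes the transition matrix, so this naive version
# recomputes every matrix power from scratch after the update.
weights[0][1] = weights[1][0] = 2.0
after = random_walk_matrix(weights)
```

In this naive form every edge weight adjustment triggers `max_len - 1` full matrix products, which is the repeated cost the abstract says an incremental update is meant to avoid.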
Additional information
Responsible editor: Fei Wang, Hanghang Tong, Phillip Yu, Charu Aggarwal.
About this article
Cite this article
Cheng, H., Zhou, Y., Huang, X. et al. Clustering large attributed information networks: an efficient incremental computing approach. Data Min Knowl Disc 25, 450–477 (2012). https://doi.org/10.1007/s10618-012-0263-0
DOI: https://doi.org/10.1007/s10618-012-0263-0