Clustering large attributed information networks: an efficient incremental computing approach

Data Mining and Knowledge Discovery

Abstract

In recent years, many information networks have become available for analysis, including social networks, road networks, sensor networks, and biological networks. Graph clustering has proven effective for analyzing and visualizing large networks. Its goal is to partition the vertices of a large graph into clusters based on criteria such as vertex connectivity or neighborhood similarity. Most existing graph clustering methods focus on topological structure and largely ignore vertex properties, which are often heterogeneous. A recently proposed graph clustering algorithm, SA-Cluster, combines structural and attribute similarities through a unified distance measure. SA-Cluster uses matrix multiplication to calculate the random walk distances between graph vertices. As part of clustering refinement, the graph edge weights are iteratively adjusted to balance the relative importance of structural and attribute similarities. Consequently, the matrix multiplication is repeated in every iteration of the clustering process to recalculate the random walk distances affected by the edge weight update. To improve the efficiency and scalability of SA-Cluster, in this paper we propose an efficient algorithm, Inc-Cluster, that incrementally updates the random walk distances given the edge weight increments. We provide a complexity analysis to estimate how much runtime cost Inc-Cluster can save, and we further design parallel matrix computation techniques for a multicore architecture. Experimental results demonstrate that Inc-Cluster achieves significant speedup over SA-Cluster on large graphs while achieving exactly the same clustering quality in terms of intra-cluster structural cohesiveness and attribute value homogeneity.
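The idea of updating random walk distances from an edge weight increment, rather than recomputing them, can be illustrated with a small sketch. It is not the Inc-Cluster algorithm itself. It assumes that the random walk distance has the truncated form R = sum over l = 1..L of c(1 - c)^l P^l for a transition matrix P and restart probability c, which the abstract does not state, and every function name below is hypothetical. The sketch relies only on the identity D_l = P^(l-1) ΔP + D_(l-1) (P + ΔP), where D_l = (P + ΔP)^l - P^l. With dense matrices this identity saves no work; any saving would have to come from structure or sparsity in ΔP, which this sketch does not exploit.

```python
def matmul(a, b):
    """Dense product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matadd(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


def powers(p, max_len):
    """Return [P^1, ..., P^max_len] by repeated multiplication."""
    out = [p]
    for _ in range(max_len - 1):
        out.append(matmul(out[-1], p))
    return out


def walk_distance(pows, c):
    """Assumed distance form: R = sum_{l=1..L} c * (1 - c)^l * P^l."""
    n = len(pows[0])
    r = [[0.0] * n for _ in range(n)]
    for l, pl in enumerate(pows, start=1):
        r = matadd(r, scale(pl, c * (1 - c) ** l))
    return r


def incremental_powers(old_pows, delta):
    """Turn cached powers of P into powers of P + delta.

    Uses D_1 = delta and D_l = P^(l-1) * delta + D_(l-1) * (P + delta),
    where D_l = (P + delta)^l - P^l.
    """
    p_new = matadd(old_pows[0], delta)
    d = delta
    new_pows = [p_new]
    for l in range(2, len(old_pows) + 1):
        d = matadd(matmul(old_pows[l - 2], delta), matmul(d, p_new))
        new_pows.append(matadd(old_pows[l - 1], d))
    return new_pows


# Toy 3-vertex example: shift some weight on vertex 0's outgoing edges.
P = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
delta = [[0.0, 0.1, -0.1], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
updated = incremental_powers(powers(P, 4), delta)
R_updated = walk_distance(updated, c=0.2)
```

Because the identity is exact, `updated` matches the powers obtained by recomputing from `P + delta` up to floating-point rounding. This mirrors the abstract's claim that the incremental approach yields the same clustering quality as full recomputation.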

Author information

Correspondence to Hong Cheng.

Additional information

Responsible editor: Fei Wang, Hanghang Tong, Phillip Yu, Charu Aggarwal.

About this article

Cite this article

Cheng, H., Zhou, Y., Huang, X. et al. Clustering large attributed information networks: an efficient incremental computing approach. Data Min Knowl Disc 25, 450–477 (2012). https://doi.org/10.1007/s10618-012-0263-0

  • DOI: https://doi.org/10.1007/s10618-012-0263-0
