
Geographically Organized Small Communities and the Hardness of Clustering Social Networks

  • Miklós Kurucz
  • András A. Benczúr
Chapter
Part of the Annals of Information Systems book series (AOIS, volume 12)

Abstract

Spectral clustering, while perhaps the most efficient heuristic for graph partitioning, has recently gained a bad reputation for failing on large-scale power law graphs. In this chapter we identify the abundance of small communities connected by long tentacles as the major obstacle for spectral clustering. These subgraphs hide the higher-level structure and result in a highly degenerate adjacency matrix with several hundred eigenvalues very close to 1. Our results on clustering social networks, telephone call graphs, and Web graphs are twofold. (1) We show that graphs generated by existing social network models are not as difficult to cluster as real-world graphs. To this end we give a new combined model that yields degenerate adjacency matrices and hard-to-partition graphs. (2) We give heuristics for spectral clustering of large-scale real-world social networks that handle tentacles and small dense communities. Our algorithm outperforms all previous methods for power law graph partitioning in both speed and cluster quality. By combining heuristics for the contraction of tentacles and the removal of community cores, which involve the recent SCAN (Structural Clustering Algorithm for Networks) algorithm, we are able to efficiently find balanced partitions of power law graphs with over 10 million edges. In particular, our heuristics promise similar or better performance than semidefinite relaxation with orders of magnitude lower running time.
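
The abstract does not spell out how tentacles are contracted. As a minimal sketch, assuming a tentacle can be approximated by a tree-like chain hanging off the rest of the graph, the Python function below repeatedly strips degree-1 vertices and records the surviving vertex that each stripped vertex folds into. The name `strip_tentacles` and this working definition of a tentacle are illustrative assumptions, not the authors' algorithm.

```python
from collections import defaultdict


def strip_tentacles(edges):
    """Illustrative sketch (assumed definition): peel off degree-1 vertices.

    Returns (core_edges, owner), where core_edges is a sorted list of the
    undirected edges that survive and owner maps every stripped vertex to
    the surviving vertex its chain was attached to.
    """
    # Build an undirected adjacency structure, ignoring self-loops.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    parent = {}  # stripped vertex -> neighbour it was folded into
    leaves = [v for v in adj if len(adj[v]) == 1]
    while leaves:
        v = leaves.pop()
        # Skip vertices already removed or no longer of degree 1.
        if v not in adj or len(adj[v]) != 1:
            continue
        (u,) = adj[v]
        parent[v] = u
        adj[u].discard(v)
        del adj[v]
        # Removing v may turn its neighbour into a new leaf.
        if len(adj[u]) == 1:
            leaves.append(u)

    def root(v):
        # Follow the chain of folds until a surviving vertex is reached.
        while v in parent:
            v = parent[v]
        return v

    owner = {v: root(v) for v in parent}
    core_edges = sorted({(min(u, v), max(u, v)) for u in adj for v in adj[u]})
    return core_edges, owner


if __name__ == "__main__":
    # A triangle (1, 2, 3) with a chain 3-4-5-6 hanging off vertex 3.
    core, owner = strip_tentacles([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6)])
    print(core)
    print(owner)
```

A spectral method could then be run on the smaller core graph, with stripped vertices later assigned to the cluster of their owner; whether the chapter proceeds this way is not stated in the abstract.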

Keywords

Singular Value Decomposition, Spectral Cluster, Community Core, Large Social Network, Semidefinite Relaxation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgments

We would like to thank Jon Kleinberg and Lars Backstrom for providing us with the LiveJournal friends and communities data used in [3]. Thanks to Zoltán Gyöngyi for providing us with the host graph with labels from the Open Directory top hierarchy for the UK2007-WEBSPAM crawl of the UbiCrawler [6]. This work was supported by grants OTKA NK 72845 and NKFP-07-A2 TEXTREND.

References

  1. Alpert, C.J. and Kahng, A.B. Recent directions in netlist partitioning: A survey. Integration, the VLSI Journal, 19(1–2):1–81, 1995.
  2. Alpert, C.J. and Yao, S.-Z. Spectral partitioning: the more eigenvectors, the better. In DAC ’95: Proceedings of the 32nd ACM/IEEE Conference on Design Automation, New York, NY: ACM Press, pp. 195–200, 1995.
  3. Backstrom, L., Huttenlocher, D., Kleinberg, J., and Lan, X. Group formation in large social networks: Membership, growth, and evolution. In KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY: ACM Press, pp. 44–54, 2006.
  4. Barabási, A.-L., Albert, R., and Jeong, H. Scale-free characteristics of random networks: The topology of the world-wide web. Physica A, 281:69–77, 2000.
  5. Berry, M.W. SVDPACK: A Fortran-77 software library for the sparse singular value decomposition. Technical report, University of Tennessee, Knoxville, TN, 1992.
  6. Boldi, P., Codenotti, B., Santini, M., and Vigna, S. Ubicrawler: A scalable fully distributed web crawler. Software: Practice & Experience, 34(8):721–726, 2004.
  7. Borodin, A., Roberts, G.O., Rosenthal, J.S., and Tsaparas, P. Finding authorities and hubs from link structures on the world wide web. In Proceedings of the 10th World Wide Web Conference (WWW), pp. 415–429, 2001.
  8. Broder, A.Z. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences (SEQUENCES’97), pp. 21–29, 1997.
  9. Burer, S. and Monteiro, R.D.C. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.
  10. Chan, P.K., Schlag, M.D.F., and Zien, J.Y. Spectral k-way ratio-cut partitioning and clustering. In DAC ’93: Proceedings of the 30th International Conference on Design Automation, pp. 749–754, New York, NY: ACM Press, 1993.
  11. Cheng, D., Vempala, S., Kannan, R., and Wang, G. A divide-and-merge methodology for clustering. In PODS ’05: Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 196–205, New York, NY: ACM Press, 2005.
  12. Ding, C.H.Q., He, X., Zha, H., Gu, M., and Simon, H.D. A minmax cut algorithm for graph partitioning and data clustering. In ICDM ’01: Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 107–114, Washington, DC: IEEE Computer Society, 2001.
  13. Drineas, P., Mahoney, M.W., and Kannan, R. Fast Monte Carlo algorithms for matrices II: Computing a low rank approximation to a matrix. SIAM Journal on Computing, 36:158–183, 2006.
  14. Fiedler, M. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98), 1973.
  15. Flake, G., Lawrence, S., and Giles, C.L. Efficient identification of web communities. In Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–160, Boston, MA, August 20–23, 2000.
  16. Flake, G.W., Tarjan, R.E., and Tsioutsiouliklis, K. Graph clustering and minimum cut trees. Internet Mathematics, 1(4):385–408, 2003.
  17. Girvan, M. and Newman, M.E. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the USA, 99(12):7821–7826, June 2002.
  18. Gorny, E. Russian LiveJournal. The Impact of Cultural Identity on the Development of a Virtual Community. In H. Schmidt, K. Teubener, and N. Konradova (eds), Control and Shift: Public and Private Usages of the Russian Internet, pp. 73–90, 2006.
  19. Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. Web content categorization using link information. Technical report, Stanford University, 2006–2007.
  20. Hopcroft, J., Khan, O., Kulis, B., and Selman, B. Natural communities in large linked networks. In KDD ’03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 541–546, New York, NY: ACM Press, 2003.
  21. Kleinberg, J. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
  22. Kleinberg, J. The small-world phenomenon: An algorithmic perspective. In Proceedings of the 32nd ACM Symposium on Theory of Computing, 2000.
  23. Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A., and Upfal, E. Stochastic models for the web graph. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science (FOCS), pp. 1–10, 2000.
  24. Kurucz, M., Benczúr, A.A., and Pereszlényi, A. Large-scale principal component analysis on live journal friends network. In Workshop on Social Network Mining and Analysis Held in Conjunction with the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), 2008.
  25. Kurucz, M., Benczúr, A.A., and Csalogány, K. Methods for large scale SVD with missing values. In KDD Cup and Workshop in Conjunction with KDD 2007, 2007.
  26. Kurucz, M., Benczúr, A.A., Csalogány, K., and Lukács, L. Spectral clustering in telephone call graphs. In WebKDD/SNAKDD Workshop 2007 in Conjunction with KDD 2007, 2007.
  27. Kurucz, M., Siklósi, D., Lukács, L., Benczúr, A.A., Csalogány, K., and Lukács, A. Telephone call network data mining: A survey with experiments. In Handbook of Large-Scale Random Networks, to be published by Springer Verlag in conjunction with the Bolyai Mathematical Society of Budapest, 2008.
  28. Lang, K. Fixing two weaknesses of the spectral method. In NIPS ’05: Advances in Neural Information Processing Systems, volume 18, Vancouver, BC, 2005.
  29. Lempel, R. and Moran, S. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1–6):387–401, 2000.
  30. McGlohon, M., Leskovec, J., Faloutsos, C., Hurst, M., and Glance, N. Finding patterns in blog shapes and blog evolution. In Proceedings International Conference on Weblogs and Social Media (ICWSM-2007), 2007.
  31. Newman, M. Detecting community structure in networks. The European Physical Journal B – Condensed Matter, 38(2):321–330, March 2004.
  32. Newman, M.E.J. and Girvan, M. Finding and evaluating community structure in networks. Physical Review E, 69(2):26113, 2004.
  33. Ng, A.Y., Zheng, A.X., and Jordan, M.I. Link analysis, eigenvectors and stability. In Proceedings International Joint Conference on Artificial Intelligence, Seattle, WA, August 2001.
  34. Open Directory Project (ODP). http://www.dmoz.org.
  35. Pennock, D.M., Giles, C.L., Flake, G.W., Lawrence, S., and Glover, E. Winners don’t take all: A model of web link accumulation. Proceedings of the National Academy of Sciences, 99:5207–5211, April 2000.
  36. Richardson, M. and Domingos, P. Mining knowledge-sharing sites for viral marketing. In KDD ’02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 61–70, New York, NY: ACM Press, 2002.
  37. Sarlós, T. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th IEEE Symposium on Foundations of Computer Science (FOCS), 2006.
  38. Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2000.
  39. Shiga, M., Takigawa, I., and Mamitsuka, H. A spectral clustering approach to optimally combining numerical vectors with a modular network. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 647–656, New York, NY: ACM Press, 2007.
  40. Xu, X., Yuruk, N., Feng, Z., and Schweiger, T.A.J. Scan: A structural clustering algorithm for networks. In KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 824–833, New York, NY: ACM Press, 2007.
  41. Zakharov, P. Structure of LiveJournal social network. In Proceedings of SPIE Volume 6601, Noise and Stochastics in Complex Systems and Finance, 2007.
  42. Zha, H., He, X., Ding, C.H.Q., Gu, M., and Simon, H.D. Spectral relaxation for k-means clustering. In T.G. Dietterich, S. Becker, and Z. Ghahramani (eds), NIPS, pp. 1057–1064. Cambridge, MA: MIT Press, 2001.

Copyright information

© Springer US 2010

Authors and Affiliations

  1. Data Mining and Web search Research Group, Informatics Laboratory, Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary
