Knowledge and Information Systems, Volume 14, Issue 1, pp 101–139

Cluster ranking with an application to mining mailbox networks

  • Ziv Bar-Yossef
  • Ido Guy
  • Ronny Lempel
  • Yoëlle S. Maarek
  • Vladimir Soroka
Regular Paper

Abstract

We initiate the study of a new clustering framework, called cluster ranking. Rather than simply partitioning a network into clusters, a cluster ranking algorithm also orders the clusters by their strength. To this end, we introduce a novel strength measure for clusters, the integrated cohesion, which is applicable to arbitrary weighted networks. We then present a new cluster ranking algorithm, called C-Rank. We provide extensive theoretical and empirical analysis of C-Rank and show that it is likely to have high precision and recall. A main component of C-Rank is a heuristic algorithm for finding sparse vertex separators. At the core of this algorithm is a new connection between vertex betweenness and multicommodity flow. Our experiments focus on mining mailbox networks. A mailbox network is an egocentric social network, consisting of contacts with whom an individual exchanges email. Edges between contacts represent the frequency of their co-occurrence in message headers. C-Rank is well suited to mining such networks, since they abound in overlapping communities of highly variable strengths. We demonstrate the effectiveness of C-Rank on the Enron data set, consisting of 130 mailbox networks.
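
The mailbox network described above can be illustrated with a minimal sketch. It is not the authors' construction: the function name `build_mailbox_network`, the raw co-occurrence count used as the edge weight, the lowercasing of addresses, and the exclusion of the mailbox owner from the node set are all assumptions made for illustration, and the paper may normalize or filter differently.

```python
from collections import Counter
from itertools import combinations


def build_mailbox_network(messages, owner):
    """Return {(a, b): weight} for contacts that co-occur in message headers.

    `messages` is an iterable of address collections, one per message header
    (sender plus recipients). `owner` is the mailbox owner's address.
    Assumptions (not from the paper): the owner is left out of the network,
    and the weight is a plain count of headers shared by the two contacts.
    """
    weights = Counter()
    for header in messages:
        # Deduplicate addresses within one header and drop the owner.
        contacts = sorted({addr.lower() for addr in header} - {owner.lower()})
        # Every unordered pair of contacts in the header co-occurs once.
        for a, b in combinations(contacts, 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical toy mailbox with three messages.
messages = [
    ["me@example.com", "ann@example.com", "bob@example.com"],
    ["ann@example.com", "me@example.com", "bob@example.com", "cy@example.com"],
    ["cy@example.com", "me@example.com"],
]
network = build_mailbox_network(messages, owner="me@example.com")
```

In this toy mailbox, ann and bob share two headers, so their edge gets the largest weight. A cluster ranking algorithm would take a weighted network of this kind as input.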

Keywords

Clustering, Ranking, Communities, Social networks, Social network analysis, Graph algorithms

Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  • Ziv Bar-Yossef (1, 4)
  • Ido Guy (2, 3)
  • Ronny Lempel (3)
  • Yoëlle S. Maarek (4)
  • Vladimir Soroka (3)

  1. Department of Electrical Engineering, Technion, Haifa, Israel
  2. Department of Computer Science, Technion, Haifa, Israel
  3. IBM Research Lab in Haifa, Haifa, Israel
  4. Google, Haifa Engineering Center, Haifa, Israel
