Integrating Clustering with Ranking in Heterogeneous Information Networks Analysis

Chapter

Abstract

Heterogeneous information networks, i.e., logical networks of multi-typed, interconnected objects, are ubiquitous. For example, a bibliographic information network contains nodes such as authors, conferences, terms, and papers, and links corresponding to the relations existing between these objects. Extracting knowledge from information networks has become an important task. Both ranking and clustering provide overall views of information network data, and each has been a hot topic in its own right. However, ranking objects globally without considering which clusters they belong to often leads to meaningless results; e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) into one huge cluster without distinction is uninformative as well. In contrast, a good cluster can lead to a meaningful ranking of the objects in that cluster, and the ranking distributions of these objects can serve as good features to help clustering. Two ranking-based clustering algorithms, RankClus and NetClus, are therefore proposed. RankClus aims at clustering target objects using the attribute objects in the remaining network, while NetClus generates net-clusters containing multiple types of objects that follow the same schema as the original network. The basic idea of these algorithms is that the ranking distributions of objects in different clusters should be quite different from each other; they can therefore serve as features of the clusters, and new measures of objects can be calculated accordingly. In turn, better clustering results lead to better ranking results. Ranking and clustering are thus mutually enhancing: ranking provides a better measure space, and clustering provides more reasonable ranking distributions. Moreover, clusters obtained in this way are more informative than those produced by other methods, given the ranking distribution of the objects in each cluster.
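
The mutual enhancement of ranking and clustering can be illustrated with a small sketch. This is not the published RankClus or NetClus procedure: it assumes a bipartite network stored as a weight matrix (rows are target objects such as conferences, columns are attribute objects such as authors), uses a simple link-weight share as the ranking function, and reassigns each target object to the cluster whose ranking distribution best explains its links. The function names, the smoothing constant, and the toy data are all hypothetical.

```python
import math


def simple_ranking(weights, members, smoothing=0.01):
    """Rank attribute objects in one cluster by their share of link weight.

    Assumption: a plain normalized weight sum stands in for the ranking
    function; smoothing keeps every probability strictly positive.
    """
    totals = [smoothing] * len(weights[0])
    for i in members:
        for j, w in enumerate(weights[i]):
            totals[j] += w
    norm = sum(totals)
    return [t / norm for t in totals]


def rank_based_clustering(weights, init_labels, k, max_iter=20):
    """Alternate between per-cluster ranking and reassignment of targets."""
    labels = list(init_labels)
    rankings = []
    for _ in range(max_iter):
        # Ranking step: one ranking distribution per current cluster.
        rankings = [
            simple_ranking(weights, [i for i, lab in enumerate(labels) if lab == c])
            for c in range(k)
        ]
        # Clustering step: score each target object against every ranking
        # distribution (weighted log-likelihood) and pick the best cluster.
        new_labels = []
        for row in weights:
            scores = [
                sum(w * math.log(r[j]) for j, w in enumerate(row))
                for r in rankings
            ]
            new_labels.append(max(range(k), key=lambda c: scores[c]))
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, rankings


# Toy data (hypothetical): 4 conferences x 6 authors.
# Rows 0-1 link mostly to authors 0-2, rows 2-3 to authors 3-5.
weights = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 1],
]
labels, rankings = rank_based_clustering(weights, init_labels=[0, 0, 0, 1], k=2)
print(labels)
```

On this toy input the initial assignment wrongly places the third conference in the first cluster; after one round of ranking and reassignment it moves to the second cluster, and the per-cluster rankings then concentrate on the matching group of authors. The result depends on the initial assignment, so a poor start can leave the loop in a poor stable state.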

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. University of Illinois at Urbana-Champaign, Urbana, USA
  2. UIUC, Urbana, USA
