Integrating Clustering with Ranking in Heterogeneous Information Networks Analysis
Heterogeneous information networks, i.e., logical networks involving multi-typed, interconnected objects, are ubiquitous. For example, a bibliographic information network contains nodes such as authors, conferences, terms, and papers, and links corresponding to the relations existing between these objects. Extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views of information network data, and each has been a hot topic in its own right. However, ranking objects globally without considering which clusters they belong to often leads to meaningless results; e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) into one huge cluster without distinction is uninformative as well. In contrast, a good cluster can lead to a meaningful ranking of the objects in that cluster, and the ranking distributions of these objects can serve as good features to help clustering. Two ranking-based clustering algorithms, RankClus and NetClus, are thus proposed. RankClus aims at clustering target objects using the attribute objects in the remaining network, while NetClus generates net-clusters containing multiple types of objects following the same schema as the original network. The basic idea of these algorithms is that the ranking distributions of objects in different clusters should differ markedly from one another, so they can serve as features of the clusters, and new measures of objects can be calculated accordingly. In turn, better clustering results lead to better ranking results. Ranking and clustering thus mutually enhance each other: ranking provides a better measure space, and clustering provides a more reasonable ranking distribution. Moreover, clusters obtained in this way are more informative than those produced by other methods, given the ranking distribution for objects in each cluster.
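The mutual reinforcement between ranking and clustering described above can be illustrated with a minimal sketch. This is not the RankClus or NetClus algorithm itself: it assumes a bipartite network given as a weight matrix `W` (rows are target objects such as conferences, columns are attribute objects such as authors), uses a simple link-share ranking within each cluster, and reassigns each target object to the cluster whose ranking distribution is most similar to its own link pattern under cosine similarity. The function names, the toy matrix, and the initial labels are all hypothetical.

```python
import numpy as np

def simple_rank(W, members):
    """Rank attribute objects inside one cluster by their share of link weight.

    `members` is a boolean mask over the rows (target objects) of W.
    An empty cluster falls back to a uniform distribution.
    """
    totals = W[members].sum(axis=0)
    s = totals.sum()
    if s > 0:
        return totals / s
    return np.full(W.shape[1], 1.0 / W.shape[1])

def rank_based_clustering(W, init_labels, k, max_iter=20):
    """Alternate between per-cluster ranking and reassignment of target objects.

    Simplifying assumption: the cluster ranking distributions are used
    directly as cluster features, compared by cosine similarity.
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(init_labels).copy()
    # Normalize each target object's link vector (guard against zero rows).
    Wn = W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    for _ in range(max_iter):
        # Step 1: ranking conditioned on the current clusters.
        ranks = np.vstack([simple_rank(W, labels == c) for c in range(k)])
        # Step 2: reassign targets using the ranking distributions as features.
        Rn = ranks / np.maximum(np.linalg.norm(ranks, axis=1, keepdims=True), 1e-12)
        new_labels = (Wn @ Rn.T).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Recompute the rankings so they match the final cluster assignment.
    ranks = np.vstack([simple_rank(W, labels == c) for c in range(k)])
    return labels, ranks

# Toy network: 4 conferences x 4 authors with two visible communities.
W_demo = np.array([[5, 4, 0, 0],
                   [4, 5, 1, 0],
                   [0, 1, 5, 4],
                   [0, 0, 4, 5]])
# Deliberately poor starting partition that mixes the two communities.
demo_labels, demo_ranks = rank_based_clustering(W_demo, [0, 1, 0, 1], k=2)
print(demo_labels.tolist())
print(demo_ranks.round(3).tolist())
```

Starting from a partition that mixes the two communities, each round of ranking sharpens the cluster features, and each reassignment in turn yields cleaner within-cluster rankings, which is the feedback loop the abstract describes. The returned `ranks` rows give a ranking distribution per cluster, which is what makes the resulting clusters more informative than bare cluster labels.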