Integrating Clustering with Ranking in Heterogeneous Information Networks Analysis

Sun, Yizhou; Han, Jiawei

doi:10.1007/978-1-4419-6515-8_17

Chapter
First Online: 01 January 2010

Abstract

Heterogeneous information networks, i.e., logical networks involving multi-typed, interconnected objects, are ubiquitous. For example, a bibliographic information network contains nodes such as authors, conferences, terms, and papers, and links corresponding to the relations existing between these objects. Extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views of information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to meaningless results; e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) into one huge cluster without distinction is uninformative as well. In contrast, a good cluster can lead to a meaningful ranking of the objects in that cluster, and the ranking distributions of these objects can serve as good features to help clustering. Two ranking-based clustering algorithms, RankClus and NetClus, are therefore proposed. RankClus aims at clustering target objects using the attribute objects in the remaining network, while NetClus is able to generate net-clusters containing multiple types of objects following the same schema as the original network. The basic idea of both algorithms is that the ranking distributions of objects in different clusters should be quite different from each other; these distributions can serve as features of the clusters, and new measures of objects can be calculated accordingly. In turn, better clustering results lead to better ranking results. Ranking and clustering are thus mutually enhancing: ranking provides a better measure space, and clustering provides a more reasonable ranking distribution. Moreover, clusters obtained in this way are more informative than those produced by other methods, because each cluster comes with the ranking distribution of its objects.
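
The mutual-enhancement loop described in the abstract can be illustrated with a toy sketch. This is not the RankClus or NetClus algorithm: the abstract does not give the ranking functions or the estimation steps, so the degree-based ranking, the log-likelihood reassignment, and all names below (`rank_attributes`, `rank_based_clustering`, the matrix `W`, the `smoothing` value) are assumptions made only for illustration. Target objects (e.g., conferences) are rows of `W`, attribute objects (e.g., authors) are columns, and entries are link weights.

```python
import math

def rank_attributes(W, members, n_attr, smoothing=0.01):
    """Assumed simple ranking: each attribute's share of link weight in one cluster."""
    totals = [smoothing] * n_attr          # smoothing keeps every score positive
    for t in members:
        for a, w in enumerate(W[t]):
            totals[a] += w
    norm = sum(totals)
    return [x / norm for x in totals]      # a distribution over attribute objects

def rank_based_clustering(W, init_labels, k, max_iter=20):
    """Alternate between per-cluster ranking and reassignment of target objects."""
    n_attr = len(W[0])
    labels = list(init_labels)
    rankings = []
    for _ in range(max_iter):
        # Step 1: clustering -> ranking. One ranking distribution per cluster.
        rankings = [
            rank_attributes(W, [t for t, lab in enumerate(labels) if lab == c], n_attr)
            for c in range(k)
        ]
        # Step 2: ranking -> clustering. Each target object moves to the cluster
        # whose ranking distribution best explains its links (log-likelihood).
        new_labels = []
        for row in W:
            scores = [sum(w * math.log(r[a]) for a, w in enumerate(row))
                      for r in rankings]
            new_labels.append(max(range(k), key=lambda c: scores[c]))
        if new_labels == labels:           # assignments are stable: stop
            break
        labels = new_labels
    return labels, rankings

# Hypothetical toy network: 4 conferences (rows) x 4 authors (columns).
W = [
    [5, 4, 0, 0],
    [4, 6, 1, 0],
    [0, 1, 5, 3],
    [0, 0, 4, 6],
]
labels, rankings = rank_based_clustering(W, init_labels=[0, 1, 1, 1], k=2)
print(labels)
for c, r in enumerate(rankings):
    print(c, [round(x, 3) for x in r])
```

In this toy input the second conference starts in the wrong cluster and is reassigned after the first pass, once the per-cluster ranking distributions separate the two groups of authors. The returned `rankings` also show why such clusters are more informative: each cluster comes with a ranked view of its attribute objects.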

Notes

1. www.informatik.uni-trier.de/~ley/db/
2. For example, a statistician may want to change the rules referring to conferences so that they refer to journals, whereas a bibliographic database that collects papers from all the bogus conferences may need even more sophisticated rules (extracted from domain knowledge) to guard the ranking quality.
3. The initial absolute posterior probability of the background is sensitive to the prior \(\lambda_{\mathrm{P}}\): the higher \(\lambda_{\mathrm{P}}\), the larger the value. However, the final posterior probability is not significantly affected by \(\lambda_{\mathrm{P}}\).
4. Actually, the extremely poor quality when \(\lambda_{\mathrm{P}}\) is very small is partially caused by the accuracy measure being improper in those cases. When the prior is not big enough to attract the papers from the correct cluster, the generated clusters do not necessarily have the same cluster labels as the priors.
5. http://vivisimo.com

Author information

Authors and Affiliations

University of Illinois at Urbana-Champaign, Urbana, IL, USA
Yizhou Sun
UIUC, Urbana, IL, USA
Jiawei Han

Corresponding author

Correspondence to Yizhou Sun.

Editor information

Editors and Affiliations

Dept. Computer Science, University of Illinois, Chicago, S. Morgan St. 851, Chicago, 60607-7053, Illinois, USA
Philip S. Yu
Dept. Computer Science, University of Illinois, Urbana-Champaign, N. Goodwin Ave. 201, Urbana, 61801, Illinois, USA
Jiawei Han
School of Computer Science, Carnegie Mellon University, Forbes Ave. 5000, Pittsburgh, 15213, Pennsylvania, USA
Christos Faloutsos

About this chapter

Cite this chapter

Sun, Y., Han, J. (2010). Integrating Clustering with Ranking in Heterogeneous Information Networks Analysis. In: Yu, P., Han, J., Faloutsos, C. (eds) Link Mining: Models, Algorithms, and Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-6515-8_17

DOI: https://doi.org/10.1007/978-1-4419-6515-8_17
Published: 13 August 2010
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-6514-1
Online ISBN: 978-1-4419-6515-8
eBook Packages: Biomedical and Life Sciences (R0)
