Integrating Clustering with Ranking in Heterogeneous Information Networks Analysis

Abstract

Heterogeneous information networks, i.e., logical networks involving multi-typed, interconnected objects, are ubiquitous. For example, a bibliographic information network contains nodes such as authors, conferences, terms, and papers, and links corresponding to the relations existing between these objects. Extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views of information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to meaningless results; e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) into one huge cluster without distinction is uninformative as well. In contrast, a good cluster can lead to a meaningful ranking for the objects in that cluster, and the ranking distributions of these objects can serve as good features to help clustering. Two ranking-based clustering algorithms, RankClus and NetClus, are thus proposed. RankClus aims at clustering target objects using the attribute objects in the remaining network, while NetClus is able to generate net-clusters containing multiple types of objects following the same schema as the original network. The basic idea of such algorithms is that the ranking distributions of objects in different clusters should be quite different from each other; these distributions can serve as features of the clusters, and new measures of objects can be calculated accordingly. In turn, better clustering results can achieve better ranking results. Ranking and clustering can thus be mutually enhanced: ranking provides a better measure space, and clustering provides a more reasonable ranking distribution. Moreover, clusters obtained in this way are more informative than those from other methods, given the ranking distribution for objects in each cluster.
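
The mutual enhancement of ranking and clustering described in the abstract can be illustrated with a toy loop. The sketch below is not RankClus or NetClus themselves, whose details are not given in this preview; it is a simplified stand-in written under several assumptions: the network is reduced to a weight matrix between target objects (rows, e.g. conferences) and attribute objects (columns, e.g. authors), the within-cluster ranking is a smoothed normalized link count, and each target is reassigned to the cluster whose ranking distribution explains most of its link weight. All function names, the `smooth` parameter, and the example data are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def rank_within_clusters(links: Matrix, labels: List[int], k: int,
                         smooth: float = 1e-3) -> Matrix:
    """Return one ranking distribution over attribute objects per cluster.

    Assumption: an attribute's rank in a cluster is its smoothed share of
    the link weight coming from the targets currently in that cluster.
    """
    n_attr = len(links[0])
    totals = [[smooth] * n_attr for _ in range(k)]
    for row, c in zip(links, labels):
        for j, w in enumerate(row):
            totals[c][j] += w
    return [[t / sum(ts) for t in ts] for ts in totals]


def cluster_features(row: List[float], rankings: Matrix) -> List[float]:
    """Map one target object to a K-dimensional feature vector.

    Component c is the share of the target's link weight that the ranking
    distribution of cluster c accounts for.
    """
    raw = [sum(w * r[j] for j, w in enumerate(row)) for r in rankings]
    total = sum(raw)
    if total == 0:  # target with no links: no evidence for any cluster
        return [1.0 / len(rankings)] * len(rankings)
    return [x / total for x in raw]


def rank_cluster(links: Matrix, init_labels: List[int], k: int,
                 max_iters: int = 20) -> Tuple[List[int], Matrix]:
    """Alternate ranking and reassignment until the labels stop changing."""
    labels = list(init_labels)
    for _ in range(max_iters):
        rankings = rank_within_clusters(links, labels, k)
        new_labels = []
        for row in links:
            feats = cluster_features(row, rankings)
            new_labels.append(max(range(k), key=lambda c: feats[c]))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, rank_within_clusters(links, labels, k)


# Hypothetical data: 4 conferences (rows) linked to 4 authors (columns).
example_links = [
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [0, 0, 4, 5],
]
example_labels, example_rankings = rank_cluster(example_links, [0, 1, 0, 1], k=2)
```

Starting from a deliberately poor initial assignment, the ranking computed inside each cluster sharpens as the clusters become purer, and the sharper rankings in turn separate the targets more cleanly. That feedback is the point of the sketch; a real implementation would need a more careful ranking function and assignment rule.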

Notes

  1. www.informatik.uni-trier.de/~ley/db/

  2. For example, a statistician may want to change the rules referring to conferences into rules referring to journals, whereas a bibliographic database that collects papers from all the bogus conferences may need even more sophisticated rules (extracted from domain knowledge) to guard the ranking quality.

  3. The initial absolute posterior probability of the background is sensitive to the prior \(\lambda_{\mathrm{P}}\): the higher \(\lambda_{\mathrm{P}}\), the larger the value. However, the final posterior probability is not significantly affected by \(\lambda_{\mathrm{P}}\).

  4. Actually, the extremely poor quality when \(\lambda_{\mathrm{P}}\) is very small is partially caused by the accuracy measure being improper in those cases. When the prior is not big enough to attract the papers from the correct cluster, the generated clusters do not necessarily have the same cluster labels as the priors.

  5. http://vivisimo.com

Author information

Corresponding author

Correspondence to Yizhou Sun.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Sun, Y., Han, J. (2010). Integrating Clustering with Ranking in Heterogeneous Information Networks Analysis. In: Yu, P., Han, J., Faloutsos, C. (eds) Link Mining: Models, Algorithms, and Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-6515-8_17
