
Data Mining and Knowledge Discovery, Volume 12, Issue 2–3, pp 229–248

Reinforcing Web-object Categorization Through Interrelationships

  • GUI-RONG XUE (Email author)
  • YONG YU
  • DOU SHEN
  • QIANG YANG
  • HUA-JUN ZENG
  • ZHENG CHEN
Original Paper

Abstract

Existing categorization algorithms deal with homogeneous Web objects and, when they take interrelationships with other types of objects into account, treat the interrelated objects merely as additional features. However, focusing on any single aspect of the inter-object relationship is not sufficient to fully reveal the true categories of Web objects. In this paper, we propose a novel categorization algorithm, called the Iterative Reinforcement Categorization Algorithm (IRC), to exploit the full interrelationship between different types of Web objects, including Web pages and queries. IRC classifies the interrelated Web objects by iteratively reinforcing the individual classification results of different types of objects via their interrelationship. Experiments on a clickthrough-log dataset from the MSN search engine show that, in terms of the F1 measure, IRC achieves a 26.4% improvement over a pure content-based classification method. It also achieves a 21% improvement over a query-metadata-based method and a 16.4% improvement over the well-known virtual document-based method. Our experiments also show that IRC converges fast enough to be applicable to real-world applications.
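The idea of iterative reinforcement can be illustrated with a minimal sketch. The abstract does not give IRC's update rule, so everything below is an assumption made for illustration: the function names, the mixing weight `alpha`, the row-normalized click matrix, and the stopping tolerance are all hypothetical, and the code should not be read as the authors' algorithm. It only shows the general pattern of two object types (pages and queries) repeatedly refining each other's class scores through their clickthrough links.

```python
import numpy as np


def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows are left as zeros."""
    matrix = np.asarray(matrix, dtype=float)
    sums = matrix.sum(axis=1, keepdims=True)
    return np.divide(matrix, sums, out=np.zeros_like(matrix), where=sums > 0)


def iterative_reinforcement(page_scores, query_scores, clicks,
                            alpha=0.5, tol=1e-6, max_iter=100):
    """Illustrative sketch only (assumed update rule, not the paper's).

    page_scores:  (pages x classes) initial class scores, e.g. from a content classifier
    query_scores: (queries x classes) initial class scores for queries
    clicks:       (pages x queries) clickthrough counts linking the two object types
    alpha:        hypothetical weight given to evidence from the other object type
    """
    page_scores = np.asarray(page_scores, dtype=float)
    query_scores = np.asarray(query_scores, dtype=float)
    clicks = np.asarray(clicks, dtype=float)

    page_to_query = row_normalize(clicks)    # each page averages over its queries
    query_to_page = row_normalize(clicks.T)  # each query averages over its pages

    p, q = page_scores.copy(), query_scores.copy()
    for iteration in range(1, max_iter + 1):
        # Mix each object's own initial scores with the scores of its neighbors.
        p_new = (1 - alpha) * page_scores + alpha * (page_to_query @ q)
        q_new = (1 - alpha) * query_scores + alpha * (query_to_page @ p_new)
        delta = max(np.abs(p_new - p).max(), np.abs(q_new - q).max())
        p, q = p_new, q_new
        if delta < tol:  # scores have stopped changing
            break
    return p.argmax(axis=1), q.argmax(axis=1), iteration


# Toy data (made up): 3 pages, 2 queries, 2 classes.
pages = [[0.9, 0.1], [0.45, 0.55], [0.1, 0.9]]
queries = [[0.8, 0.2], [0.2, 0.8]]
clicks = [[5, 0], [4, 0], [0, 6]]

page_labels, query_labels, n_iter = iterative_reinforcement(pages, queries, clicks)
print(page_labels, query_labels, n_iter)
```

In this toy data the second page is weakly assigned to class 1 by its own scores, but it is clicked only from a query that leans toward class 0, so the reinforcement step moves it to class 0. With `alpha` below 1 the update is a contraction, which is why this sketch settles after a small number of iterations; the convergence behavior reported in the paper refers to IRC itself, not to this simplification.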

Keywords

categorization, interrelated Web objects, iterative reinforcement, clickthrough data

Notes

Acknowledgement

Gui-Rong Xue, Hua-Jun Zeng and Zheng Chen are supported by Microsoft Research Asia. Yong Yu is supported by a grant from the National Natural Science Foundation of China (No. 60473122). Dou Shen and Qiang Yang are supported by a grant from the Hong Kong Government (RGC central allocation grant CA03/04.EG01).

Copyright information

© Springer Science+Business Media, Inc. 2005

Authors and Affiliations

  • GUI-RONG XUE (1), Email author
  • YONG YU (1)
  • DOU SHEN (2)
  • QIANG YANG (2)
  • HUA-JUN ZENG (3)
  • ZHENG CHEN (3)

  1. Computer Science and Engineering, Shanghai Jiao-Tong University, Shanghai, P.R. China
  2. Hong Kong University of Science and Technology, Kowloon, Hong Kong
  3. Microsoft Research Asia, 5F, Sigma Center, Beijing, P.R. China
