A Cross-Lingual Framework for Web News Taxonomy Integration

  • Cheng-Zen Yang
  • Che-Min Chen
  • Ing-Xiang Chen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4182)


Many news sites now publish articles online, and many Web news portals have emerged that cluster news into categories so that users can browse related reports and understand news events in depth. However, to the best of our knowledge, most Web news portals provide only monolingual news clustering services. In this paper, we study the cross-lingual Web news taxonomy integration problem, in which news articles about the same news event reported in different languages are to be integrated into one category. Our approach builds on results from cross-lingual classification research and on the cross-training concept to construct SVM-based classifiers for cross-lingual Web news taxonomy integration. We have conducted several experiments using news articles from Google News as the experimental data sets. The experimental results show that the proposed cross-training classifiers outperform traditional SVM classifiers in all respects. We believe that the proposed framework can be applied to different bilingual environments.
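The cross-training idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify features, kernels, or the translation step, so everything below is an assumption made for illustration: documents from both languages are taken to be already mapped into one shared three-term vocabulary (for example through a bilingual dictionary), the categories are reduced to one binary label, and a small linear SVM trained by subgradient descent stands in for whatever SVM implementation the authors used. The names `train_linear_svm`, `augment`, and `predict` are hypothetical.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Binary linear SVM (labels +1/-1) trained by subgradient
    descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            violated = label * (dot(w, x) + b) < 1.0
            # Regularization step: shrink the weights slightly.
            w = [wi - lr * lam * wi for wi in w]
            if violated:
                # Hinge-loss step: move toward the violated example.
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b

def decision(model, x):
    """Signed score of x under a trained (w, b) model."""
    w, b = model
    return dot(w, x) + b

def augment(src_model, x):
    """Cross-training step: append the source classifier's score
    to the document's own features."""
    return x + [decision(src_model, x)]

# Hypothetical shared features: counts of ("match", "election", "report").
# Source taxonomy (e.g. English news); +1 = sports, -1 = politics.
src_X = [[3, 0, 1], [2, 0, 1], [0, 3, 1], [0, 2, 1]]
src_y = [1, 1, -1, -1]

# Target taxonomy (e.g. Chinese news mapped into the shared vocabulary).
tgt_X = [[2, 0, 1], [1, 0, 2], [0, 2, 1], [0, 1, 2]]
tgt_y = [1, 1, -1, -1]

# 1. Train on the source taxonomy.
src_model = train_linear_svm(src_X, src_y)
# 2. Add the source classifier's score as an extra feature of each target document.
tgt_aug = [augment(src_model, x) for x in tgt_X]
# 3. Train the target classifier on the augmented features.
tgt_model = train_linear_svm(tgt_aug, tgt_y)

def predict(x):
    """Classify a target-side document with the cross-trained model."""
    return 1 if decision(tgt_model, augment(src_model, x)) >= 0 else -1

print([predict(x) for x in tgt_X], predict([2, 0, 2]))
```

The point of the sketch is the second step: the target classifier sees not only the document's own terms but also how the source taxonomy's classifier would have scored it, which is one way implicit category information from one taxonomy can inform the other.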


News Article; Name Entity Recognition; News Event; Bilingual Dictionary; Taxonomy Integration
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Cheng-Zen Yang (1)
  • Che-Min Chen (1)
  • Ing-Xiang Chen (1)

  1. Department of Computer Science and Engineering, Yuan Ze University, Taiwan, R.O.C.
