
Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval

Web Intelligence

Abstract

The expansion of the Web increases the demand for Cross-Language Information Retrieval (CLIR), in which query translation is the key problem. Previous studies have shown that queries can be translated by exploiting a large set of parallel texts. However, large parallel corpora are unavailable for many languages. In this chapter, we describe a mining system that automatically discovers parallel pages on the Web. The system exploits existing search engines and the common characteristics in the organization of Web pages, and it has been used to construct several large text corpora. The chapter describes the mining process as well as experimental results for English-French and English-Chinese CLIR. Our experiments show that query translation using the mined corpora can be as good as that obtained with high-quality machine-translation systems. This study shows the feasibility of automatically building a query-translation system for all the active languages on the Web.




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Nie, JY., Chen, J. (2003). Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval. In: Zhong, N., Liu, J., Yao, Y. (eds) Web Intelligence. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-05320-1_11


  • DOI: https://doi.org/10.1007/978-3-662-05320-1_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-07936-8

  • Online ISBN: 978-3-662-05320-1

  • eBook Packages: Springer Book Archive
