Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval

Nie, Jian-Yun; Chen, Jiang

doi:10.1007/978-3-662-05320-1_11


Abstract

The expansion of the Web creates more demand for Cross-Language Information Retrieval (CLIR), in which query translation is the key problem. Previous studies have shown that query translation can be done by exploiting a large set of parallel texts. However, large parallel corpora are unavailable for many languages. In this chapter, we describe a mining system that automatically discovers parallel pages on the Web. The system exploits existing search engines and common characteristics in the organization of Web pages. Several large text corpora have been constructed using this system. The chapter describes the mining process as well as experimental results for English-French and English-Chinese CLIR. Our experiments show that query translation using the mined corpora can be as good as translation obtained from high-quality machine-translation systems. This study shows the feasibility of automatically building a query-translation system for all the active languages on the Web.
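
The abstract does not say which characteristics of Web page organization the mining system relies on. As a rough illustration only, the sketch below assumes one plausible cue: two versions of a page often have URLs that differ only by a language marker such as "en" versus "fr". The marker table, the function name candidate_pairs, and the example.org URLs are all hypothetical and are not taken from the chapter.

```python
import re

# Hypothetical English -> French markers; the chapter's actual cues are not given here.
EN_TO_FR = {"en": "fr", "eng": "fra", "english": "french"}


def candidate_pairs(urls, markers=EN_TO_FR):
    """Return sorted (english_url, french_url) pairs whose URLs differ
    only by one language marker delimited by non-alphanumeric characters."""
    known = set(urls)
    pairs = set()
    for url in urls:
        for en, fr in markers.items():
            # Accept the marker only as a standalone token, e.g. "_en." or "/en/",
            # so that the "en" inside a word like "content" is ignored.
            pattern = re.compile(
                r"(?<![A-Za-z0-9])" + re.escape(en) + r"(?![A-Za-z0-9])"
            )
            for match in pattern.finditer(url):
                other = url[:match.start()] + fr + url[match.end():]
                if other != url and other in known:
                    pairs.add((url, other))
    return sorted(pairs)


if __name__ == "__main__":
    crawled = [
        "http://example.org/index_en.html",
        "http://example.org/index_fr.html",
        "http://example.org/en/about.html",
        "http://example.org/fr/about.html",
        "http://example.org/contact.html",
    ]
    for en_url, fr_url in candidate_pairs(crawled):
        print(en_url, "<->", fr_url)
```

A pairing of this kind would only yield candidates. Any real pipeline would still need to verify that the two pages are translations of each other, for example by checking language and comparing text length or structure, before aligning them into a parallel corpus.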



Author information

Authors and Affiliations

DIRO, Université de Montréal, Succursale Centre-Ville, CP. 6128, Montreal, Quebec, H3C 3J7, Canada
Jian-Yun Nie & Jiang Chen


Editor information

Editors and Affiliations

Knowledge Information Systems Lab., Dept. of Systems and Information Eng., Maebashi Institute of Technology, 460-1 Kamisadori-Cho, 371-0816, Maebashi-City, Japan
Ning Zhong
Dept. of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
Jiming Liu
Dept. of Computer Science, University of Regina, S4S 0A2, Regina, Saskatchewan, Canada
Yiyu Yao

About this chapter

Cite this chapter

Nie, JY., Chen, J. (2003). Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval. In: Zhong, N., Liu, J., Yao, Y. (eds) Web Intelligence. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-05320-1_11

DOI: https://doi.org/10.1007/978-3-662-05320-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07936-8
Online ISBN: 978-3-662-05320-1
eBook Packages: Springer Book Archive
