Abstract
The expansion of the Web increases the need for Cross-Language Information Retrieval (CLIR), in which query translation is the key problem. Previous studies have shown that query translation can be performed by exploiting a large set of parallel texts. However, large parallel corpora are unavailable for many languages. In this chapter, we describe a mining system that automatically discovers parallel pages on the Web. The system exploits existing search engines and common characteristics in the organization of Web pages. Several large text corpora have been constructed using this system. This chapter describes the mining process as well as experimental results for English-French and English-Chinese CLIR. Our experiments show that query translation using the mined corpora can be as effective as translation by high-quality machine-translation systems. This study shows the feasibility of automatically building a query-translation system for all the active languages on the Web.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Nie, JY., Chen, J. (2003). Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval. In: Zhong, N., Liu, J., Yao, Y. (eds) Web Intelligence. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-05320-1_11
DOI: https://doi.org/10.1007/978-3-662-05320-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-07936-8
Online ISBN: 978-3-662-05320-1
eBook Packages: Springer Book Archive