Abstract
This work has three aspects. The first is to describe a probabilistic approach to term translation for cross-lingual information retrieval (CLIR). We will show that such an approach, when used with a probabilistic retrieval model, can produce better retrieval performance than non-probabilistic techniques such as structural query translation (Pirkola, 1998) and Machine Translation. We will also show that parallel corpora and manual lexicons are complementary and that their combination is essential to high-performance CLIR. The second aspect is to empirically measure CLIR performance as a function of the sizes of the bilingual resources available for estimating translation probabilities. Such a measurement is useful for two reasons. First, it can help to predict CLIR performance for a new language pair. Second, it can serve as guidance on how much more data to acquire if existing resources cannot meet a target performance level. The third aspect is to describe a technique that can potentially reduce the cost of manually creating a parallel corpus. Such a technique will be useful for language pairs with little or no parallel text.
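To make the idea of combining term translation probabilities with a probabilistic retrieval model concrete, the following is a minimal sketch, not the chapter's actual model. It assumes a simple mixture: each query term is generated either from a background distribution or by translating a term drawn from the document. The function name `query_likelihood`, the weight `alpha`, and all probability values are hypothetical and chosen only for illustration.

```python
import math


def query_likelihood(query_terms, doc_terms, trans_prob, background, alpha=0.3):
    """Log-likelihood of a source-language query given a target-language document.

    Assumed (hypothetical) mixture model per query term q:
        P(q | doc) = alpha * P(q | background)
                     + (1 - alpha) * sum_c P(c | doc) * P(q | c)
    where c ranges over document terms and P(q | c) is a translation probability.
    """
    doc_len = len(doc_terms)
    counts = {}
    for term in doc_terms:
        counts[term] = counts.get(term, 0) + 1

    log_p = 0.0
    for q in query_terms:
        # Probability of producing q by translating a term drawn from the document.
        translated = sum(
            (n / doc_len) * trans_prob.get(c, {}).get(q, 0.0)
            for c, n in counts.items()
        )
        # Small floor for query terms missing from the background table (an assumption).
        p = alpha * background.get(q, 1e-9) + (1 - alpha) * translated
        log_p += math.log(p)
    return log_p


# Toy data: translation probabilities P(query term | document term).
trans_prob = {"banco": {"bank": 0.7, "bench": 0.3}, "rio": {"river": 1.0}}
background = {"bank": 0.01, "river": 0.01}
doc_a = ["banco", "banco", "rio"]
doc_b = ["rio", "rio", "rio"]

score_a = query_likelihood(["bank"], doc_a, trans_prob, background)
score_b = query_likelihood(["bank"], doc_b, trans_prob, background)
```

In this toy setup `doc_a` scores higher than `doc_b` for the query "bank", because only `doc_a` contains a term with a nonzero probability of translating to it. Every translation alternative contributes in proportion to its probability, which is the general property that distinguishes a probabilistic treatment from picking a single translation.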
References
Allan, J., Callan, J., Feng, F., and Malin, D. (2000). INQUERY at TREC8. In TREC8 Proceedings. NIST.
Ballesteros, L. and Croft, W. (1998). Resolving Ambiguity for Cross-language Retrieval. In Proceedings of ACM SIGIR 1998 Conference, pages 64–71.
Berger, A. and Lafferty, J. (1999). Information Retrieval as Statistical Translation. In Proceedings of ACM SIGIR 1999 Conference.
Brown, P., Pietra, S. D., Pietra, V. D., and Mercer, R. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, pages 263–311.
Hiemstra, D. and de Jong, F. (1999). Disambiguation Strategies for Cross-language Information Retrieval. In Proceedings of the third European Conference on Research and Advanced Technology for Digital Libraries, pages 274–293.
Hull, D. (1997). Using Structured Queries for Disambiguation in Cross-language Information Retrieval. In AAAI Symposium on Cross-Language Text and Speech Retrieval.
Klavans, J. and Hovy, E. (1999). Multilingual (or Cross-lingual) Information Retrieval. In Hovy, E., editor, Multilingual Information Management, current levels and future abilities.
Kwok, K. L. (1997). Comparing Representations in Chinese Information Retrieval. In Proceedings of ACM SIGIR 1997 Conference.
Lafferty, J. (1999). Personal Communications.
McCarley, J. (1999). Should We Translate the Documents or the Queries in Cross-language Information Retrieval. In Proceedings of ACL 1999, pages 208–214.
Miller, D., Leek, T., and Schwartz, R. (1999). A Hidden Markov Model Information Retrieval System. In Proceedings of ACM SIGIR 1999 Conference, pages 214–221.
Oard, D. (1998). A Comparative Study of Query and Document Translation for Cross-language Information Retrieval. In Third Conference of the Association for Machine Translation in the Americas.
Pirkola, A. (1998). The Effects of Query Structure and Dictionary Setups in Dictionary-based Cross-language Information Retrieval. In Proceedings of ACM SIGIR 1998 Conference, pages 55–63.
Ponte, J. (1998). A Language Modeling Approach to Information Retrieval. In Proceedings of ACM SIGIR 1998 Conference, pages 275–281.
Porter, M. (1980). An Algorithm for Suffix Stripping. Program, 14(3):130–137.
Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of IEEE 77, pages 257–286.
Resnik, P. (1998). Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text. In Third Conference of the Association for Machine Translation in the Americas.
Singhal, A., Buckley, C., and Mitra, M. (1996). Pivoted Document Length Normalization. In Proceedings of ACM SIGIR 1996 Conference.
Sperer, R. and Oard, D. (2000). Structured Query Translation for Cross-language Information Retrieval. In Proceedings of ACM SIGIR 2000 Conference.
Voorhees, E. and Harman, D., editors (1997). TREC5 Proceedings. NIST.
Voorhees, E. and Harman, D., editors (1998). TREC6 Proceedings. NIST.
Voorhees, E. and Harman, D., editors (2001). TREC9 Proceedings. NIST.
Xu, J. and Croft, W. (1998). Corpus-based Stemming Using Co-occurrence of Word Variants. ACM TOIS, 18(1):79–112.
Copyright information
© 2003 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Xu, J., Weischedel, R. (2003). A Probabilistic Approach to Term Translation for Cross-Lingual Retrieval. In: Croft, W.B., Lafferty, J. (eds) Language Modeling for Information Retrieval. The Springer International Series on Information Retrieval, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-0171-6_6
DOI: https://doi.org/10.1007/978-94-017-0171-6_6
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-6263-5
Online ISBN: 978-94-017-0171-6
eBook Packages: Springer Book Archive