Statistical Models for Monolingual and Bilingual Information Retrieval

Bertoldi, Nicola; Federico, Marcello

doi:10.1023/B:INRT.0000009440.64411.ad

Published: January 2004

Information Retrieval, Volume 7, pages 53–72 (2004)

Abstract

This work reviews the information retrieval systems developed at ITC-irst and evaluated in several CLEF tracks over the last three years. The presentation follows the progress made over time in developing new statistical models, first for monolingual information retrieval and then for cross-language information retrieval. Besides describing the underlying theory, it reports the performance of the monolingual and bilingual models on the Italian monolingual and Italian-English bilingual tracks of CLEF, respectively. The monolingual systems by ITC-irst performed consistently well in all the official evaluations, while the bilingual system ranked in CLEF 2002 just behind competitors using commercial machine translation engines. However, an experimental comparison of our statistical topic translation model against a state-of-the-art commercial system, run on a larger set of queries, found no statistically significant difference in retrieval performance.

Author information

Authors and Affiliations

ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, I-38050, Povo, Italy
Nicola Bertoldi & Marcello Federico

About this article

Cite this article

Bertoldi, N., Federico, M. Statistical Models for Monolingual and Bilingual Information Retrieval. Information Retrieval 7, 53–72 (2004). https://doi.org/10.1023/B:INRT.0000009440.64411.ad

Issue Date: January 2004
DOI: https://doi.org/10.1023/B:INRT.0000009440.64411.ad
