Abstract
This work reviews information retrieval systems developed at ITC-irst that were evaluated in several CLEF tracks over the last three years. The presentation follows the progress made over time in developing new statistical models, first for monolingual information retrieval and then for cross-language information retrieval. Besides describing the underlying theory, the paper reports the performance of the monolingual and bilingual information retrieval models on the Italian monolingual tracks and the Italian-English bilingual tracks of CLEF, respectively. Monolingual systems by ITC-irst performed consistently well in all the official evaluations, while the bilingual system ranked in CLEF 2002 just behind competitors using commercial machine translation engines. However, when our statistical topic translation model was experimentally compared against a state-of-the-art commercial system on a larger set of queries, no statistically significant difference in retrieval performance could be measured.
Cite this article
Bertoldi, N., Federico, M. Statistical Models for Monolingual and Bilingual Information Retrieval. Information Retrieval 7, 53–72 (2004). https://doi.org/10.1023/B:INRT.0000009440.64411.ad