Information Retrieval, Volume 7, Issue 1–2, pp 33–52

Monolingual Document Retrieval for European Languages

  • Vera Hollink
  • Jaap Kamps
  • Christof Monz
  • Maarten de Rijke

Abstract

Recent years have witnessed considerable advances in information retrieval for European languages other than English. We give an overview of commonly used techniques and we analyze them with respect to their impact on retrieval effectiveness. The techniques considered range from linguistically motivated techniques, such as morphological normalization and compound splitting, to knowledge-free approaches, such as n-gram indexing. Evaluations are carried out against data from the CLEF campaign, covering eight European languages. Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language-independent techniques.

cross-lingual information retrieval, monolingual document retrieval, European languages, morphological normalization, tokenization

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Vera Hollink (1)
  • Jaap Kamps (1)
  • Christof Monz (1)
  • Maarten de Rijke (1)
  1. Language & Inference Technology Group, ILLC, University of Amsterdam, Amsterdam, The Netherlands
