Information Retrieval, Volume 7, Issue 1–2, pp 73–97

Character N-Gram Tokenization for European Language Text Retrieval

  • Paul McNamee
  • James Mayfield

Abstract

The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.

cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages
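
The overlapping character n-gram tokenization named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `char_ngrams`, the lowercasing, the whitespace collapsing, the choice to let n-grams span word boundaries, and the handling of inputs shorter than n are all assumptions made for illustration. Only the default n = 4 comes from the abstract.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumed normalization: lowercase and collapse whitespace runs to one space.
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Assumption: a short input is kept as a single token rather than dropped.
        return [normalized] if normalized else []
    # Slide a window of width n one character at a time, so consecutive
    # n-grams overlap in n - 1 characters and may span word boundaries.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("Text retrieval"))
```

Under these assumptions, the 14-character string "Text retrieval" yields 11 overlapping 4-grams where word tokenization would yield 2 terms, which gives a rough sense of why the abstract reports increased storage and time requirements.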

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Paul McNamee (1)
  • James Mayfield (1)
  1. Applied Physics Laboratory, Johns Hopkins University, Laurel, USA