Text Retrieval through Corrupted Queries

  • Juan Otero
  • Jesús Vilares
  • Manuel Vilares
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5290)

Abstract

Our work concerns the design and evaluation of experimental information retrieval systems able to cope with misspellings in queries. In contrast to previous proposals, which are commonly based on spelling correction strategies and a word language model, we also report on the use of character n-grams as indexing support.
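To illustrate why character n-grams can tolerate misspelled queries, the following is a minimal sketch, not the authors' system. It assumes overlapping 4-grams taken within word boundaries and a plain Jaccard overlap as the matching score; the function names `char_ngrams` and `overlap` are hypothetical and chosen only for this example.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of each word in text.

    Words no longer than n characters are kept whole (an assumption
    made for this sketch).
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def overlap(query, document, n=4):
    """Jaccard overlap between the n-gram sets of a query and a document."""
    q = set(char_ngrams(query, n))
    d = set(char_ngrams(document, n))
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


if __name__ == "__main__":
    misspelled = "text retreival"   # transposition error in the second word
    indexed = "text retrieval"

    # Word-level matching only finds "text"; "retreival" matches nothing.
    print(set(misspelled.split()) & set(indexed.split()))

    # At n-gram level the misspelled word still shares "retr" with the index.
    print(set(char_ngrams("retreival")) & set(char_ngrams("retrieval")))
    print(overlap(misspelled, indexed))
```

With word-based indexing the misspelled term contributes nothing to the match, whereas at the n-gram level part of it survives the error, so no explicit correction step is required before retrieval.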

Keywords

Degraded text information retrieval 

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Juan Otero (1)
  • Jesús Vilares (2)
  • Manuel Vilares (1)
  1. Department of Computer Science, University of Vigo, Ourense, Spain
  2. Department of Computer Science, University of A Coruña, A Coruña, Spain
