Revisiting N-Gram Based Models for Retrieval in Degraded Large Collections

  • Javier Parapar
  • Ana Freire
  • Álvaro Barreiro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5478)

Abstract

Traditional retrieval models based on term matching are not effective on collections of degraded documents (for instance, the output of OCR or ASR systems). This paper presents an n-gram based distributed model for retrieval on large collections of degraded text. Evaluation was carried out on both the TREC Confusion Track and Legal Track collections, showing that the presented approach is more effective than the classical term-centred approach and most of the participating systems in the TREC Confusion Track.
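
To illustrate why exact term matching breaks down on degraded text while n-gram matching degrades more gracefully, here is a minimal sketch. It is not the authors' model: the function names, the trigram size, the underscore padding, and the simulated OCR errors are all assumptions chosen for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams of each word in text.

    Words are lowercased and padded with '_' so that word boundaries
    contribute their own n-grams (an illustrative choice, not from the paper).
    """
    grams = set()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def ngram_overlap(query, document, n=3):
    """Fraction of the query's n-grams that also occur in the document."""
    q = char_ngrams(query, n)
    d = char_ngrams(document, n)
    return len(q & d) / len(q) if q else 0.0


clean = "information retrieval"
# Hypothetical OCR output: 'm' misread as 'rn', final 'l' misread as '1'.
degraded = "inforrnation retrieva1"

# Exact term matching: count query terms that appear verbatim in the document.
term_matches = len(set(clean.split()) & set(degraded.split()))

# N-gram matching: share of query trigrams that survive the degradation.
overlap = ngram_overlap(clean, degraded)

print(term_matches, overlap)
```

In this toy case no query term matches the degraded text exactly, yet most of the query's trigrams are still present, which is the intuition behind indexing n-grams rather than whole terms.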

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Javier Parapar (1)
  • Ana Freire (1)
  • Álvaro Barreiro (1)
  1. IRLab, Computer Science Department, University of A Coruña, Spain
