Improved Stable Retrieval in Noisy Collections

  • Conference paper
Advances in Information Retrieval Theory (ICTIR 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6931)

Abstract

We consider the problem of retrieval on noisy text collections. This problem is of paramount importance for retrieval on new social media collections such as Twitter, where typical messages are short while the vocabulary is very large, full of emoticon variants, abbreviations, and other user jargon. In particular, we propose a new methodology that combines several effective techniques, some of them proposed in the OCR information retrieval literature, such as n-gram tokenization and approximate string matching algorithms, which need to be plugged into suitable IR models for retrieval and query reformulation. To evaluate the methodology we use the OCR-degraded collections of the TREC Confusion track. Unlike the solutions proposed by the TREC participants, which were tuned for specific collections and thus exhibit highly variable performance across the different degradation levels, our model is highly stable. In fact, with the same tuned parameters, it reaches the best or nearly best performance simultaneously on all three Confusion collections (Original, Degrade 5% and Degrade 20%), with a 33% improvement in average MAP. It is therefore a good candidate for a universal high-precision strategy to be used when there is no a priori knowledge of the specific domain. Moreover, our parameters can be tuned specifically to obtain the best retrieval performance reported to date at all levels of collection degradation, and even on the clean collection, that is, the original collection without OCR errors.
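As an illustration of two of the building blocks named in the abstract, the following minimal Python sketch shows character n-gram tokenization and approximate string matching via Levenshtein edit distance. It is not the authors' implementation: the function names, the default n-gram length of 3, the edit-distance threshold, and the toy vocabulary are all assumptions chosen only to make the ideas concrete.

```python
def char_ngrams(term: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a term.

    Terms no longer than n are returned whole (an assumption of this sketch).
    """
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def expand_query_term(term: str, vocabulary: set[str], max_dist: int = 1) -> list[str]:
    """Hypothetical query reformulation step: collect every vocabulary
    term within max_dist edits of the query term."""
    return sorted(v for v in vocabulary if edit_distance(term, v) <= max_dist)


# Toy vocabulary containing OCR-style corruptions of "retrieval".
vocab = {"retrieval", "retrieva1", "rctrieval", "system"}
print(char_ngrams("retrieval"))
print(expand_query_term("retrieval", vocab))
```

In a sketch like this, n-grams let a corrupted token still share most of its index units with the clean form, while the edit-distance expansion adds near-miss spellings to the query. How the paper actually combines and weights these components is not stated in the abstract.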

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Amati, G., Celi, A., Di Nicola, C., Flammini, M., Pavone, D. (2011). Improved Stable Retrieval in Noisy Collections. In: Amati, G., Crestani, F. (eds) Advances in Information Retrieval Theory. ICTIR 2011. Lecture Notes in Computer Science, vol 6931. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23318-0_35

  • DOI: https://doi.org/10.1007/978-3-642-23318-0_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23317-3

  • Online ISBN: 978-3-642-23318-0

  • eBook Packages: Computer Science, Computer Science (R0)
