Improved Stable Retrieval in Noisy Collections

Amati, Gianni; Celi, Alessandro; Di Nicola, Cesidio; Flammini, Michele; Pavone, Daniela

doi:10.1007/978-3-642-23318-0_35

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6931)

Included in the following conference series:

Conference on the Theory of Information Retrieval

Abstract

We consider the problem of retrieval on noisy text collections. This problem is especially important for new social media collections such as Twitter, where typical messages are short while the dictionary is very large, full of emoticon variants, term shortcuts and every other kind of user jargon. We propose a new methodology that combines several effective techniques, some of them drawn from the OCR information retrieval literature, such as n-gram tokenization and approximate string matching algorithms, which need to be plugged into suitable IR models for retrieval and query reformulation. To evaluate the methodology we use the OCR-degraded collections of the TREC Confusion track. Unlike the solutions proposed by the TREC participants, which were tuned for specific collections and therefore show highly variable performance across the different degradation levels, our model is highly stable. With the same tuned parameters, it reaches the best or nearly best performance simultaneously on all three Confusion collections (Original, Degrade 5% and Degrade 20%), with a 33% improvement on the average MAP measure. It is therefore a good candidate for a universal high-precision strategy when there is no a priori knowledge of the specific domain. Moreover, our parameters can be tuned specifically to obtain the best retrieval performance reported to date at every level of collection degradation, and even on the clean collection, that is, the original collection without the OCR errors.
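The two building blocks named in the abstract, n-gram tokenization and approximate string matching, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`char_ngrams`, `edit_distance`, `expand_query_term`), the choice of Levenshtein distance, the threshold of one edit, and the toy vocabulary with its invented OCR-style misspellings are all assumptions made for illustration.

```python
def char_ngrams(term: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of a term."""
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def expand_query_term(term: str, vocabulary: set[str], max_distance: int = 1) -> list[str]:
    """Hypothetical query reformulation step: collect the index terms
    that lie within max_distance edits of the query term."""
    return sorted(v for v in vocabulary if edit_distance(term, v) <= max_distance)


# Toy vocabulary containing invented OCR-style corruptions of "retrieval".
vocabulary = {"retrieval", "retrleval", "retneval", "model"}
trigrams = char_ngrams("retrieval", 3)
variants = expand_query_term("retrieval", vocabulary)
```

In this sketch a corrupted index term can still be reached either through the n-grams it shares with the clean query term or through explicit expansion of the query with near matches; how such components are weighted inside an IR model is the subject of the paper and is not reproduced here.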

Author information

Authors and Affiliations

Fondazione Ugo Bordoni, Rome, Italy
Gianni Amati, Alessandro Celi, Cesidio Di Nicola, Michele Flammini & Daniela Pavone
Department of Computer Science, University of L’Aquila, L’Aquila, Italy
Gianni Amati, Alessandro Celi, Cesidio Di Nicola, Michele Flammini & Daniela Pavone

Editor information

Editors and Affiliations

Fondazione Ugo Bordoni, Viale del Policlinico 147, 00161, Rome, Italy
Giambattista Amati
Faculty of Informatics, University of Lugano, 6900, Lugano, Switzerland
Fabio Crestani

About this paper

Cite this paper

Amati, G., Celi, A., Di Nicola, C., Flammini, M., Pavone, D. (2011). Improved Stable Retrieval in Noisy Collections. In: Amati, G., Crestani, F. (eds) Advances in Information Retrieval Theory. ICTIR 2011. Lecture Notes in Computer Science, vol 6931. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23318-0_35

DOI: https://doi.org/10.1007/978-3-642-23318-0_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23317-3
Online ISBN: 978-3-642-23318-0
eBook Packages: Computer Science, Computer Science (R0)
