Abstract
This paper presents our work in the RISOT track of FIRE 2011. We describe an error modeling technique for OCR errors in an Indic script. Based on the error model, we apply a two-stage error correction method to the OCRed corpus. First, we correct the corpus itself using two approaches: correction with full confidence and correction without full confidence. Second, we use query expansion for error correction. The retrieval results are significantly better than the baseline, and the difference between our best result and the run on the original text is not significant.
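As a rough illustration of the query expansion step, the sketch below expands each query term with variants that an OCR engine might have produced in its place. It is not the method used in the paper: the confusion table, the function names `expand_term` and `expand_query`, and the `max_variants` cap are all assumptions made for illustration, and Latin characters stand in for the Indic script.

```python
from itertools import product

# Hypothetical confusion table (an assumption, not taken from the paper):
# each character maps to strings an OCR engine might output in its place.
CONFUSIONS = {
    "l": ["1"],
    "o": ["0"],
    "m": ["rn"],
}

def expand_term(term, confusions=CONFUSIONS, max_variants=50):
    """Return the term followed by its possible OCR-corrupted variants."""
    # For each character, list the character itself plus its confusions.
    options = [[ch] + confusions.get(ch, []) for ch in term]
    variants = []
    for combo in product(*options):
        variants.append("".join(combo))
        # Cap the output so long terms do not explode combinatorially.
        if len(variants) >= max_variants:
            break
    return variants

def expand_query(query, confusions=CONFUSIONS):
    """Map every whitespace-separated query term to its variant list."""
    return {term: expand_term(term, confusions) for term in query.split()}
```

A retrieval system could then match a document if it contains any variant of a query term. In practice the confusion table would be estimated from the error model rather than written by hand.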
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ghosh, K., Parui, S.K. (2013). Retrieval from OCR Text: RISOT Track. In: Majumder, P., Mitra, M., Bhattacharyya, P., Subramaniam, L.V., Contractor, D., Rosso, P. (eds) Multilingual Information Access in South Asian Languages. Lecture Notes in Computer Science, vol 7536. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40087-2_21
DOI: https://doi.org/10.1007/978-3-642-40087-2_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40086-5
Online ISBN: 978-3-642-40087-2
eBook Packages: Computer Science, Computer Science (R0)