Abstract.
In this paper, we study the effects of automatic zoning on retrieval and ranking variability. We show that OCR-generated text from automatic zoning, followed by postprocessing, produces retrieval results equivalent to those obtained with OCR-generated text from manual zoning. We further show that there is a strong linear association between the ranked query results obtained from these two methods of zoning.
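As an illustration of what a linear association between two ranked result lists can mean, the sketch below computes Pearson's correlation coefficient over the rank positions of documents returned by two runs. The abstract does not state which statistic the authors used, so the choice of Pearson's r, the function names, and the sample document identifiers are all assumptions made for illustration only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_positions(ranking):
    """Map each document id to its 1-based position in a ranked list."""
    return {doc_id: pos for pos, doc_id in enumerate(ranking, start=1)}


def ranking_association(run_a, run_b):
    """Correlate rank positions of the documents that both runs returned."""
    pos_a = rank_positions(run_a)
    pos_b = rank_positions(run_b)
    shared = [doc_id for doc_id in run_a if doc_id in pos_b]
    return pearson_r([pos_a[d] for d in shared], [pos_b[d] for d in shared])


# Hypothetical results for one query (document ids are made up).
manual_zoning_run = ["d1", "d2", "d3", "d4", "d5"]
automatic_zoning_run = ["d1", "d3", "d2", "d4", "d5"]

association = ranking_association(manual_zoning_run, automatic_zoning_run)
print(round(association, 3))
```

Because the inputs here are rank positions without ties, this value coincides with Spearman's rank correlation. A value close to 1 indicates that the two runs order their shared documents almost identically; applying `pearson_r` to retrieval scores instead of positions would be an alternative reading of "linear association".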
Additional information
Received: 17 July 2003, Accepted: 18 October 2003, Published online: 6 February 2004
Information Science Research Institute: e-mail isri@isri.unlv.edu
About this article
Cite this article
Taghva, K., Borsack, J., Lumos, S. et al. A comparison of automatic and manual zoning. IJDAR 6, 230–235 (2003). https://doi.org/10.1007/s10032-003-0116-x
DOI: https://doi.org/10.1007/s10032-003-0116-x