Probabilistic Automaton Model for Fuzzy English-Text Retrieval
Optical character reader (OCR) misrecognition is a serious problem when searching OCR-scanned documents in databases such as digital libraries. This paper proposes fuzzy retrieval methods for English text that tolerate errors in the recognized text without requiring manual correction, thereby reducing costs. The proposed methods generate multiple search terms for each input query term, based on probabilistic automata that reflect both error-occurrence probabilities and character-connection probabilities. Experimental results of test-set retrieval indicate that, with 20 expanded search terms, one of the proposed methods improves the recall rate from 95.56% to 97.88% at the cost of a decrease in the precision rate from 100.00% to 95.52%.
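To make the idea of query-term expansion concrete, the following is a minimal sketch, not the paper's actual automaton construction. It assumes a hypothetical confusion table (`CONFUSION`) standing in for the error-occurrence probabilities and a hypothetical bigram table (`bigram`) standing in for the character-connection probabilities; all numeric values are illustrative rather than measured. A simple beam search keeps the `k` most probable misrecognized variants of a query term.

```python
import heapq
import math

# Hypothetical error-occurrence table: P(OCR output | true character).
# The values are illustrative assumptions, not taken from the paper.
CONFUSION = {
    "l": {"l": 0.90, "1": 0.06, "I": 0.04},
    "m": {"m": 0.92, "rn": 0.08},
    "o": {"o": 0.95, "0": 0.05},
}


def emissions(ch):
    """Possible OCR outputs for a true character.

    Characters missing from the table are assumed to be recognized correctly.
    """
    return CONFUSION.get(ch, {ch: 1.0})


def connection(prev, nxt, bigram, floor=0.01):
    """Character-connection probability, with a floor for unseen pairs."""
    return bigram.get((prev, nxt), floor)


def expand(term, bigram, k=20):
    """Return up to k (variant, log_score) pairs, best first.

    Each step consumes one character of the query term, emits one of its
    possible OCR outputs, and scores the result by the error-occurrence
    probability times the connection probability of every emitted character.
    """
    beam = [(0.0, "")]  # (log score, partial variant)
    for ch in term:
        candidates = []
        for log_score, partial in beam:
            for out, p_err in emissions(ch).items():
                score = log_score + math.log(p_err)
                prev = partial[-1] if partial else "^"  # "^" marks word start
                for c in out:
                    score += math.log(connection(prev, c, bigram))
                    prev = c
                candidates.append((score, partial + out))
        beam = heapq.nlargest(k, candidates)  # prune to the k best variants
    return [(variant, score) for score, variant in beam]


if __name__ == "__main__":
    # An empty bigram table makes every connection fall back to the floor.
    for variant, score in expand("model", bigram={}, k=3):
        print(f"{variant}\t{score:.3f}")
```

In this sketch the expanded terms would be issued alongside the original query term, so a document containing a misrecognized form such as `mode1` can still be matched. Raising `k` admits less probable variants, which is the kind of trade-off between recall and precision that the reported figures reflect.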