Hairetes: A Search Engine for OCR Documents
This paper reports on the architecture and preliminary implementation of our search engine, Hairetes. The engine is based on an extension of Retrieval by General Logical Imaging (RbGLI) in which word similarity measures are computed using the Expected Mutual Information Measure (EMIM) and Bayes’ theorem.
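To make the similarity measure concrete, the following is a minimal sketch of EMIM computed between two terms from their document-level co-occurrence. It is an illustration under stated assumptions, not the Hairetes implementation: the function name `emim`, the representation of documents as sets of terms, and the toy collection are all hypothetical, and the way the paper combines EMIM with Bayes’ theorem is not specified in this abstract and is not modeled here.

```python
import math
from itertools import product


def emim(docs, term_a, term_b):
    """Expected mutual information between the presence/absence of two terms.

    `docs` is a list of sets of terms (an assumed, simplified representation).
    Probabilities are maximum-likelihood estimates from document counts.
    """
    n = len(docs)
    if n == 0:
        return 0.0

    # Count documents in each of the four presence/absence combinations.
    joint = {cell: 0 for cell in product((0, 1), repeat=2)}
    for doc in docs:
        joint[(int(term_a in doc), int(term_b in doc))] += 1

    score = 0.0
    for (a, b), count in joint.items():
        if count == 0:
            continue  # the term 0 * log(0) is taken to be 0
        p_ab = count / n
        p_a = (joint[(a, 0)] + joint[(a, 1)]) / n  # marginal for term_a
        p_b = (joint[(0, b)] + joint[(1, b)]) / n  # marginal for term_b
        score += p_ab * math.log(p_ab / (p_a * p_b))
    return score


# Hypothetical toy collection, for illustration only.
docs = [
    {"ocr", "error", "retrieval"},
    {"ocr", "error"},
    {"retrieval", "ranking"},
    {"ranking", "feedback"},
]

print(emim(docs, "ocr", "error"))      # terms that always co-occur
print(emim(docs, "ocr", "retrieval"))  # statistically independent terms
```

In this toy collection, "ocr" and "error" always occur together and receive the highest score, while "ocr" and "retrieval" occur independently and score zero. A real OCR collection would likely need smoothing of the probability estimates, which this sketch omits.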