Hairetes: A Search Engine for OCR Documents

Part of the Lecture Notes in Computer Science book series (LNCS, volume 2423)

Abstract

In this paper, we report on the architecture and preliminary implementation of our search engine, Hairetes. This engine is based on an extended concept of Retrieval by General Logical Imaging (RbGLI). In this extension, word similarity measures are computed by EMIM and Bayes’ theorem.
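As an illustration of the kind of word similarity measure the abstract refers to, the following is a minimal sketch of the expected mutual information measure (EMIM) between two terms, estimated from document-level co-occurrence counts. It is not the Hairetes implementation: the function name `emim`, the representation of documents as sets of terms, and the maximum-likelihood probability estimates are all assumptions made for this example, and the Bayes' theorem step mentioned in the abstract is not shown.

```python
import math
from itertools import product


def emim(docs, term_a, term_b):
    """Expected mutual information between the presence/absence of two terms.

    `docs` is a list of sets of terms (a hypothetical representation chosen
    for this sketch). Probabilities are plain relative frequencies.
    """
    n = len(docs)
    if n == 0:
        return 0.0

    # Count documents for each of the four presence/absence combinations.
    counts = {(x, y): 0 for x, y in product((0, 1), repeat=2)}
    for doc in docs:
        counts[(int(term_a in doc), int(term_b in doc))] += 1

    score = 0.0
    for (x, y), c in counts.items():
        if c == 0:
            continue  # convention: 0 * log(0) contributes nothing
        p_xy = c / n
        p_x = (counts[(x, 0)] + counts[(x, 1)]) / n  # marginal for term_a
        p_y = (counts[(0, y)] + counts[(1, y)]) / n  # marginal for term_b
        score += p_xy * math.log(p_xy / (p_x * p_y))
    return score


# Two terms that always appear together score higher than two terms
# whose occurrences are statistically independent.
always_together = [{"ocr", "scan"}, {"ocr", "scan"}, {"query"}, {"query"}]
independent = [{"ocr", "scan"}, {"ocr"}, {"scan"}, set()]
print(emim(always_together, "ocr", "scan"))
print(emim(independent, "ocr", "scan"))
```

In this sketch a higher score means that knowing whether one term occurs in a document tells more about whether the other occurs, which is the sense in which EMIM can serve as a term-term similarity.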

Keywords

  • Information Retrieval
  • Query Processing
  • Latent Semantic Indexing
  • Unary Code
  • Spelling Correction

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Taghva, K., Coombs, J. (2002). Hairetes: A Search Engine for OCR Documents. In: Lopresti, D., Hu, J., Kashi, R. (eds) Document Analysis Systems V. DAS 2002. Lecture Notes in Computer Science, vol 2423. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45869-7_45

  • DOI: https://doi.org/10.1007/3-540-45869-7_45

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44068-0

  • Online ISBN: 978-3-540-45869-2

  • eBook Packages: Springer Book Archive