Hairetes: A Search Engine for OCR Documents

  • Kazem Taghva
  • Jeffrey Coombs
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2423)


In this paper, we report on the architecture and preliminary implementation of our search engine, Hairetes. The engine is based on an extension of Retrieval by General Logical Imaging (RbGLI) in which word similarity measures are computed using the Expected Mutual Information Measure (EMIM) and Bayes’ theorem.


Keywords (added by machine, not by the authors): Information Retrieval; Query Processing; Latent Semantic Indexing; Unary Code; Spelling Correction



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Kazem Taghva (1)
  • Jeffrey Coombs (1)
  1. Information Science Research Institute, University of Nevada, Las Vegas
