CMIC at INEX 2007: Book Search Track

  • Conference paper
Focused Access to XML Documents (INEX 2007)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4862))

Abstract

With massive book digitization efforts underway, effective retrieval of books and of pages within books has become an important problem. This paper describes our submissions to the INEX 2007 Book Search track. We explored book-specific features, such as table-of-contents pages, index pages, and headers, alongside non-book-specific features. Our results show that indexing the entire contents of books together with headers was the most effective retrieval strategy.

Editor information

Norbert Fuhr Jaap Kamps Mounia Lalmas Andrew Trotman

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Magdy, W., Darwish, K. (2008). CMIC at INEX 2007: Book Search Track. In: Fuhr, N., Kamps, J., Lalmas, M., Trotman, A. (eds) Focused Access to XML Documents. INEX 2007. Lecture Notes in Computer Science, vol 4862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85902-4_16

  • DOI: https://doi.org/10.1007/978-3-540-85902-4_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85901-7

  • Online ISBN: 978-3-540-85902-4

  • eBook Packages: Computer Science, Computer Science (R0)
