Image Based Retrieval and Keyword Spotting in Documents

  • Reference work entry
Handbook of Document Image Processing and Recognition

Abstract

The attempt to move towards paperless offices has led to the digitization of large quantities of printed documents for storage in image databases. Thanks to advances in computer and network technology, it is possible to generate and transmit huge numbers of document images efficiently. An ensuing and pressing issue is how to provide highly reliable and efficient retrieval over these document images, which come from a wide variety of information sources. Optical character recognition (OCR) is one powerful tool for retrieval tasks, but there is an ongoing debate over the trade-off between OCR-based and OCR-free retrieval, because OCR introduces recognition errors and converting an entire collection into text is time-consuming. Image-based retrieval using document image similarity measures is a much more economical alternative. To date, many methods have been proposed for different sub-tasks, all of which contribute to the final retrieval performance. This chapter presents different methods for representing word images, describes the preprocessing steps applied before similarity measurement or before training and testing, and discusses different algorithms and models for keyword spotting and document image retrieval.
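To make the idea of OCR-free matching concrete, the following is a minimal sketch of one common family of techniques: each binary word image is reduced to a column-wise ink profile, and two profiles are compared with dynamic time warping. It is an illustration under stated assumptions, not the chapter's own method; the names `column_profile`, `dtw_distance`, and `spot_keyword`, the choice of feature, and the length normalization are all hypothetical.

```python
from typing import List, Sequence

# Assumption: a word image is a binary grid, rows of 0 (background) / 1 (ink).
Image = List[List[int]]


def column_profile(img: Image) -> List[float]:
    """Fraction of ink pixels in each column (a simple per-column feature)."""
    height = len(img)
    width = len(img[0])
    return [sum(row[x] for row in img) / height for x in range(width)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two 1-D sequences.

    The accumulated cost is divided by len(a) + len(b) so that long and
    short words are roughly comparable (an illustrative choice).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def spot_keyword(query: Image, candidates: List[Image]) -> List[int]:
    """Return candidate indices ranked from most to least similar to the query."""
    q = column_profile(query)
    scores = [dtw_distance(q, column_profile(c)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda k: scores[k])


if __name__ == "__main__":
    query = [[1, 0, 1],
             [1, 0, 0],
             [1, 1, 1]]
    # The same shape stretched horizontally (every column doubled).
    stretched = [[1, 1, 0, 0, 1, 1],
                 [1, 1, 0, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1]]
    other = [[0, 0, 0],
             [0, 1, 0],
             [0, 0, 0]]
    print(spot_keyword(query, [other, stretched]))
```

Because the warping path may match one query column to several candidate columns, a horizontally stretched copy of the query still aligns at zero cost, which is why this kind of elastic matching tolerates variation in character width. Practical systems typically use richer per-column features and the preprocessing steps the chapter discusses.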

Author information

Correspondence to Chew Lim Tan or Xi Zhang.

Copyright information

© 2014 Springer-Verlag London

Cite this entry

Tan, C.L., Zhang, X., Li, L. (2014). Image Based Retrieval and Keyword Spotting in Documents. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_27
