Abstract
The move towards paperless offices has led to the digitization of large quantities of printed documents for storage in image databases. Thanks to advances in computer and network technology, huge numbers of document images can now be generated and transmitted efficiently. A pressing issue that follows is how to provide reliable and efficient retrieval over these document images, which come from a wide variety of information sources. Optical Character Recognition (OCR) is one powerful tool for retrieval tasks, but there is an ongoing debate over the trade-off between OCR-based and OCR-free retrieval, because OCR introduces recognition errors and converting an entire collection to text is time-consuming. Image-based retrieval using a document image similarity measure is a much more economical alternative. To date, many methods have been proposed for the different sub-tasks, all of which contribute to the final retrieval performance. This chapter presents different methods for representing word images, describes the preprocessing steps applied before similarity measurement or before training and testing, and discusses different algorithms and models for keyword spotting and document image retrieval.
© 2014 Springer-Verlag London
About this entry
Cite this entry
Tan, C.L., Zhang, X., Li, L. (2014). Image Based Retrieval and Keyword Spotting in Documents. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-859-1_27
DOI: https://doi.org/10.1007/978-0-85729-859-1_27
Publisher Name: Springer, London
Print ISBN: 978-0-85729-858-4
Online ISBN: 978-0-85729-859-1
eBook Packages: Computer Science; Reference Module Computer Science and Engineering