Searching and Browsing in Historical Documents—State of the Art and Novel Approaches for Template-Based Keyword Spotting

Business Information Systems and Technology 4.0

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 141)

Abstract

In many public and private institutions, the digitalization of handwritten documents has progressed greatly in recent decades. As a consequence, the number of handwritten documents that are available digitally is constantly increasing. However, accessibility to these documents in terms of browsing and searching is still an issue as automatic full transcriptions are often not feasible. To bridge this gap, Keyword Spotting (KWS) has been proposed as a flexible and error-tolerant alternative to full transcriptions. KWS provides unconstrained retrievals of keywords in handwritten documents that are acquired either online or offline. In general, offline KWS is regarded as the more difficult task when compared to online KWS where temporal information on the writing process is also available. The focus of this chapter is on handwritten historical documents and thus on offline KWS. In particular, we review and compare different state-of-the-art as well as novel approaches for template-based KWS. In contrast to learning-based KWS, template-based KWS can be applied to documents without any a priori learning of a model and is thus regarded as the more flexible approach.

Notes

  1. George Washington Papers at the Library of Congress, 1741–1799: Series 2, Letterbook 1, pp. 270–279 & 300–309, http://memory.loc.gov/ammem/gwhtml/gwseries2.html.

  2. Parzival at IAM historical document database, http://www.fki.inf.unibe.ch/databases/iam-historical-document-database/parzival-database.

Acknowledgements

This work has been supported by the Hasler Foundation Switzerland.

Author information

Corresponding author

Correspondence to Michael Stauffer.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Stauffer, M., Fischer, A., Riesen, K. (2018). Searching and Browsing in Historical Documents—State of the Art and Novel Approaches for Template-Based Keyword Spotting. In: Dornberger, R. (eds) Business Information Systems and Technology 4.0. Studies in Systems, Decision and Control, vol 141. Springer, Cham. https://doi.org/10.1007/978-3-319-74322-6_13

  • DOI: https://doi.org/10.1007/978-3-319-74322-6_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-74321-9

  • Online ISBN: 978-3-319-74322-6

  • eBook Packages: Engineering, Engineering (R0)
