Automatic Keyword Extraction from Historical Document Images

  • Kengo Terasawa
  • Takeshi Nagasaki
  • Toshio Kawashima
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3872)

Abstract

This paper presents a method for automatically extracting keywords from historical document images. The method is language independent because it is purely appearance based: it requires neither lexical information nor any statistical language model. Moreover, since it does not need word segmentation, it can be applied to Eastern languages that do not put clear spacing between words. The first half of the paper describes an algorithm that retrieves document image regions whose appearance is similar to a given query image. Evaluated in terms of recall and precision, the algorithm achieved an average precision of over 80–90%. The second half of the paper describes a keyword extraction method that works even when no query word is explicitly specified. Because efficient pruning techniques reduce the computational cost, the system was able to extract keywords successfully from relatively large documents.
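
The abstract reports retrieval quality as average precision. As a rough illustration only, the sketch below computes the standard (non-interpolated) average precision for one query from a ranked list of retrieved regions. The abstract does not say which variant of the measure the authors used, so the function name `average_precision`, its parameters, and the choice of the non-interpolated form are assumptions made for this example.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Non-interpolated average precision for a single query.

    ranked_relevance: list of booleans, best match first; True means the
        retrieved region really is an occurrence of the query word.
    num_relevant: total number of true occurrences in the collection.
        Defaults to the number of True entries in the ranked list
        (i.e. it assumes every true occurrence was retrieved).
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each correct match.
            precision_sum += hits / rank
    return precision_sum / num_relevant


# Hypothetical ranked result: correct matches at ranks 1 and 3.
example = [True, False, True, False]
print(round(average_precision(example), 4))
```

With correct matches at ranks 1 and 3, the precisions at those ranks are 1/1 and 2/3, so the score is their mean, about 0.83. Passing `num_relevant` explicitly penalises occurrences that the system never retrieved.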

Keywords

Query Image, Document Image, Word Segmentation, Matching Cost, Query Word
These keywords were added by machine and not by the authors.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Kengo Terasawa (1)
  • Takeshi Nagasaki (1)
  • Toshio Kawashima (1)
  1. School of Systems Information Science, Future University-Hakodate, Hokkaido, Japan
