Sample-Based Collection and Adjustment Algorithm for Metadata Extraction Parameter of Flexible Format Document

  • Toshiko Matsumoto
  • Mitsuharu Oba
  • Takashi Onoyama
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6114)


We propose an algorithm for automatically generating metadata extraction parameters. It first enumerates candidates on the basis of metadata occurrence in training documents, and then examines these candidates to avoid side effects and to maximize effectiveness. This two-stage approach avoids an exponential explosion of computation while still allowing detailed optimization. An experiment on Japanese business documents shows that an automatically generated parameter enables metadata extraction as accurately as a manually adjusted one.
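The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not say what form an extraction parameter takes, so everything concrete here is an assumption made for illustration: a parameter is modeled as a single cue token that directly precedes the metadata value, documents are pre-tokenized lists, and the function names (`enumerate_candidates`, `extract`, `evaluate`, `select_parameters`) and the sample data are hypothetical. It is not the authors' algorithm, only one simple instance of "enumerate from observed occurrences, then filter by side effects and rank by effectiveness".

```python
from collections import Counter

def enumerate_candidates(docs):
    """Stage 1 (assumed form): collect cue tokens that directly precede
    a known metadata value in the training documents."""
    candidates = Counter()
    for tokens, value in docs:
        for i in range(1, len(tokens)):
            if tokens[i] == value:
                candidates[tokens[i - 1]] += 1
    return candidates

def extract(tokens, cue):
    """Return the token following the first occurrence of cue, or None."""
    for i in range(len(tokens) - 1):
        if tokens[i] == cue:
            return tokens[i + 1]
    return None

def evaluate(cue, docs):
    """Count correct extractions (hits) and wrong ones (side effects)."""
    hits = misses = 0
    for tokens, value in docs:
        got = extract(tokens, cue)
        if got is None:
            continue  # cue absent: no extraction, no side effect
        if got == value:
            hits += 1
        else:
            misses += 1
    return hits, misses

def select_parameters(docs):
    """Stage 2 (assumed form): drop candidates with any side effect,
    then rank the rest by the number of correct extractions."""
    selected = []
    for cue in enumerate_candidates(docs):
        hits, misses = evaluate(cue, docs)
        if misses == 0 and hits > 0:
            selected.append((cue, hits))
    selected.sort(key=lambda p: (-p[1], p[0]))
    return [cue for cue, _ in selected]

# Hypothetical training set: (tokens, true value of the date field).
training_docs = [
    (["Invoice", "Date:", "2010-05-01", "Total:", "100"], "2010-05-01"),
    (["Issued", "2010-06-02", "Date:", "n/a"], "2010-06-02"),
    (["Date", "Issued", "2010-07-03"], "2010-07-03"),
]

print(select_parameters(training_docs))
```

In this toy data the cue `Date:` is enumerated in stage 1 but rejected in stage 2, because in the second document it would extract `n/a` instead of the date. Because candidates come only from positions where the metadata actually occurs, the candidate set in this sketch grows with the size of the training data rather than with the number of possible parameter combinations.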


logical structure analysis, metadata extraction, keyword extraction, layout characteristics





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Toshiko Matsumoto (1)
  • Mitsuharu Oba (1)
  • Takashi Onoyama (1)
  1. Research and Development Division, Hitachi Software Engineering Co., Ltd., Tokyo, Japan
