Abstract
We propose an algorithm for automatically generating metadata extraction parameters. It first enumerates candidate parameters based on where metadata occurs in training documents, and then examines these candidates to avoid side effects and to maximize effectiveness. This two-stage approach avoids an exponential explosion in computation while still allowing detailed optimization. An experiment on Japanese business documents shows that an automatically generated parameter extracts metadata as accurately as a manually adjusted one.
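The abstract does not specify the form of an extraction parameter, so the following is only a minimal sketch of the two-stage idea under stated assumptions. It assumes that a parameter is a keyword token that immediately precedes the metadata value, that each training document is a token list paired with its labeled value, and that a "side effect" is a wrong extraction on any training document. The function names and sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical training set: (tokens, labeled metadata value).
DOCS = [
    (["Invoice", "dated", "2009-04-01"], "2009-04-01"),
    (["Issued", "on", "2009-05-02", "Date:", "n/a"], "2009-05-02"),
    (["Issued", "on", "2009-06-03"], "2009-06-03"),
    (["Date:", "2009-07-04"], "2009-07-04"),
]

def enumerate_candidates(docs):
    """Stage 1 (assumed form): count tokens that directly precede a labeled value."""
    counts = Counter()
    for tokens, target in docs:
        for i in range(1, len(tokens)):
            if tokens[i] == target:
                counts[tokens[i - 1]] += 1
    return counts

def extract(tokens, keyword):
    """Return the token after the first occurrence of keyword, or None."""
    for i in range(len(tokens) - 1):
        if tokens[i] == keyword:
            return tokens[i + 1]
    return None

def select_parameters(docs, candidates):
    """Stage 2 (assumed form): drop candidates with side effects, then pick greedily."""
    # A candidate has a side effect if it extracts a wrong value from any document.
    safe = [
        kw for kw in candidates
        if all(extract(tokens, kw) in (None, target) for tokens, target in docs)
    ]
    chosen, covered = [], set()
    while True:
        best, best_gain = None, set()
        for kw in safe:
            # Documents this keyword would newly extract correctly.
            gain = {
                i for i, (tokens, target) in enumerate(docs)
                if i not in covered and extract(tokens, kw) == target
            }
            if len(gain) > len(best_gain):
                best, best_gain = kw, gain
        if best is None:  # no remaining candidate adds coverage
            return chosen
        chosen.append(best)
        covered |= best_gain

if __name__ == "__main__":
    candidates = enumerate_candidates(DOCS)
    print(select_parameters(DOCS, candidates))
```

In this sketch the candidate set is limited to keywords actually observed next to labeled values, so the second stage evaluates a short list rather than every possible combination of parameters. The greedy selection is a stand-in for whatever optimization the paper actually uses.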
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Matsumoto, T., Oba, M., Onoyama, T. (2010). Sample-Based Collection and Adjustment Algorithm for Metadata Extraction Parameter of Flexible Format Document. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artifical Intelligence and Soft Computing. ICAISC 2010. Lecture Notes in Computer Science, vol 6114. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13232-2_69
DOI: https://doi.org/10.1007/978-3-642-13232-2_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13231-5
Online ISBN: 978-3-642-13232-2
eBook Packages: Computer Science (R0)