Sample-Based Collection and Adjustment Algorithm for Metadata Extraction Parameter of Flexible Format Document

Matsumoto, Toshiko; Oba, Mitsuharu; Onoyama, Takashi

doi:10.1007/978-3-642-13232-2_69


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6114)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing


Abstract

We propose an algorithm for automatically generating metadata extraction parameters. It first enumerates candidate parameters on the basis of where metadata occurs in training documents, and then examines these candidates to avoid side effects and to maximize effectiveness. This two-stage approach avoids an exponential explosion in computation while still allowing detailed optimization. An experiment on Japanese business documents shows that an automatically generated parameter enables metadata extraction as accurately as a manually adjusted one.
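
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes, purely for illustration, that a "parameter" is a single keyword token that directly precedes the metadata value, that documents are pre-tokenized lists, and that a "side effect" is a wrong extraction on any training document. The function names and the sample documents are hypothetical.

```python
from collections import Counter


def enumerate_candidates(docs):
    """Stage 1 (assumed form): count every token that directly precedes
    the labelled metadata value in a training document."""
    counts = Counter()
    for tokens, value in docs:
        for i in range(1, len(tokens)):
            if tokens[i] == value:
                counts[tokens[i - 1]] += 1
    return counts


def extract(tokens, keyword):
    """Return the token after the first occurrence of keyword, or None."""
    for i in range(len(tokens) - 1):
        if tokens[i] == keyword:
            return tokens[i + 1]
    return None


def adjust(docs, candidates):
    """Stage 2 (assumed form): apply each candidate to all documents,
    discard candidates that extract a wrong value anywhere, and rank
    the rest by the number of correct extractions."""
    kept = []
    for keyword in candidates:
        hits = wrong = 0
        for tokens, value in docs:
            got = extract(tokens, keyword)
            if got == value:
                hits += 1
            elif got is not None:
                wrong += 1
        if wrong == 0:
            kept.append((keyword, hits))
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept


# Hypothetical training set: (tokens, labelled metadata value).
docs = [
    (["Date:", "2010-06-13", "Total:", "500"], "2010-06-13"),
    (["Invoice", "2010-06-14", "Date:", "2010-06-14"], "2010-06-14"),
    (["Invoice", "B-02", "Date:", "2010-06-15"], "2010-06-15"),
]

candidates = enumerate_candidates(docs)
selected = adjust(docs, candidates)
print(selected)
```

Only keywords actually observed next to a labelled value are ever evaluated, so the second stage scores a short candidate list rather than every possible parameter combination. In this toy data, "Invoice" is enumerated as a candidate but rejected because it extracts "B-02" from the third document.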



Author information

Authors and Affiliations

Research and Development Division, Hitachi Software Engineering Co., Ltd., 4-12-7, Higashishinagawa, Shinagawa-ku, Tokyo, 140-0002, Japan
Toshiko Matsumoto, Mitsuharu Oba & Takashi Onoyama


Editor information

Editors and Affiliations

Department of Artificial Intelligence, Academy of Humanities and Economics, Poland
Leszek Rutkowski
Academy of Humanities and Economics in Łódź, ul. Rewolucji 1905 nr 64, Łódź, Poland
Rafał Scherer
Institute of Automatics, AGH University of Science and Technology, Al. Mickiewicza 30, PL-30-059, Kraków, Poland
Ryszard Tadeusiewicz
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley Initiative in Soft Computing (BISC), 94720-1776, CA
Lotfi A. Zadeh
Computational Intelligence Laboratory Department of Electrical and Computer Engineering, University of Louisville, 40292, Louisville, KY
Jacek M. Zurada


About this paper

Cite this paper

Matsumoto, T., Oba, M., Onoyama, T. (2010). Sample-Based Collection and Adjustment Algorithm for Metadata Extraction Parameter of Flexible Format Document. In: Rutkowski, L., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2010. Lecture Notes in Computer Science, vol 6114. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13232-2_69


DOI: https://doi.org/10.1007/978-3-642-13232-2_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13231-5
Online ISBN: 978-3-642-13232-2
eBook Packages: Computer Science, Computer Science (R0)
