Abstract
Automatic information extraction from scanned business documents is especially valuable in document management and archiving. Although current solutions for document classification and extraction work well, they still require considerable on-site configuration by domain experts or administrators. Small office/home office (SOHO) users and private individuals in particular often avoid such systems because of the configuration effort and the long training periods needed to reach acceptable extraction rates. We therefore present a solution for information extraction from scanned business documents that fits the requirements of these users. Our approach is highly adaptable to new document types and index fields and needs only a minimum of training documents to reach extraction rates comparable to related work and to manual document indexing. By providing a cooperative extraction system that allows participants to share extraction knowledge, we furthermore aim to minimize the amount of user feedback required and to increase the acceptance of such a system.
A first evaluation of our solution on a set of 12,500 documents with 10 commonly used fields shows competitive results above 85% F1-measure. Results above 75% F1-measure are already reached with a minimal training set of only one document per template.
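For readers unfamiliar with the metric, the following minimal sketch shows one common way an F1-measure can be computed for field extraction. It is an illustration only, not the evaluation code used in the paper: the function names `count_fields` and `f1_measure`, the example field names, and the convention that a wrongly extracted value counts as both a false positive and a false negative are all assumptions.

```python
from typing import Dict, Tuple


def count_fields(predicted: Dict[str, str],
                 expected: Dict[str, str]) -> Tuple[int, int, int]:
    """Return (true_pos, false_pos, false_neg) for one document.

    Assumed convention: an extracted value that differs from the
    ground truth is both a false positive (wrong output) and a
    false negative (the correct value was missed).
    """
    true_pos = sum(1 for field, value in predicted.items()
                   if expected.get(field) == value)
    false_pos = len(predicted) - true_pos
    false_neg = len(expected) - true_pos
    return true_pos, false_pos, false_neg


def f1_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 if nothing is correct."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical document: two fields correct, one wrong, one missing.
predicted = {"date": "2013-01-05", "amount": "100.00", "sender": "ACME"}
expected = {"date": "2013-01-05", "amount": "120.00",
            "sender": "ACME", "invoice_no": "42"}

tp, fp, fn = count_fields(predicted, expected)
print(f1_measure(tp, fp, fn))  # 4/7, about 0.571
```

In this example precision is 2/3 and recall is 2/4, so the F1-measure is 4/7. Summing the three counts over all documents before calling `f1_measure` would give a micro-averaged score; whether the paper averages this way is not stated in the abstract.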
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Esser, D. (2013). Cooperative and Fast-Learning Information Extraction from Business Documents for Document Archiving. In: Demey, Y.T., Panetto, H. (eds) On the Move to Meaningful Internet Systems: OTM 2013 Workshops. OTM 2013. Lecture Notes in Computer Science, vol 8186. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41033-8_4
DOI: https://doi.org/10.1007/978-3-642-41033-8_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41032-1
Online ISBN: 978-3-642-41033-8
eBook Packages: Computer Science, Computer Science (R0)