Machine Learning of Generalized Document Templates for Data Extraction

  • Janusz Wnek
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2423)


The purpose of this research is to reverse engineer the process of encoding data in structured documents and then to automate its extraction. We consider a broad category of structured documents that goes beyond fixed forms: the documents may have flexible layouts and may span multiple pages, with the page count varying from document to document. The data extraction method (DataX) employs general templates generated by the Inductive Template Generator (InTeGen). InTeGen learns these templates inductively from example documents in which the data elements have been identified. Both methods achieve a high degree of automation with minimal user input.
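The abstract does not describe how InTeGen or DataX work internally, so the following is only a minimal sketch of the general idea of inducing a template from labeled examples and then applying it to a new document. Everything in it is an assumption made for illustration: the `Word` record, the `learn_template` and `extract` functions, the use of context words with a stable positional offset as anchors, and the tolerance `tol` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Word:
    """Hypothetical OCR token: text plus page coordinates."""
    text: str
    x: float
    y: float


def learn_template(examples, tol=5.0):
    """Induce, per field, the context words whose offset to the value is stable.

    examples: list of (words, labels); labels maps a field name to the index
    of the word holding that field's value. Illustrative only.
    """
    fields = set.intersection(*(set(labels) for _, labels in examples))
    template = {}
    for field in fields:
        offsets = None  # anchor text -> list of (dx, dy), one per example
        for words, labels in examples:
            value = words[labels[field]]
            seen = {}
            for i, w in enumerate(words):
                if i != labels[field] and w.text not in seen:
                    seen[w.text] = (value.x - w.x, value.y - w.y)
            if offsets is None:
                offsets = {t: [o] for t, o in seen.items()}
            else:
                # Generalize: keep only anchors present in every example so far.
                offsets = {t: offsets[t] + [seen[t]] for t in offsets if t in seen}
        stable = {}
        for text, offs in offsets.items():
            dxs = [o[0] for o in offs]
            dys = [o[1] for o in offs]
            if max(dxs) - min(dxs) <= tol and max(dys) - min(dys) <= tol:
                stable[text] = (mean(dxs), mean(dys))
        template[field] = stable
    return template


def extract(template, words):
    """Apply a learned template: locate anchors, predict and read each value."""
    result = {}
    for field, anchors in template.items():
        votes = [(w.x + anchors[w.text][0], w.y + anchors[w.text][1])
                 for w in words if w.text in anchors]
        candidates = [w for w in words if w.text not in anchors]
        if not votes or not candidates:
            continue  # omission: no anchor found for this field
        px = mean(v[0] for v in votes)
        py = mean(v[1] for v in votes)
        best = min(candidates, key=lambda w: (w.x - px) ** 2 + (w.y - py) ** 2)
        result[field] = best.text
    return result


# Two labeled examples whose layouts differ vertically (a flexible layout).
doc1 = [Word("Invoice", 10, 10), Word("No:", 100, 10), Word("A123", 150, 10),
        Word("Total:", 100, 200), Word("45.00", 150, 200)]
doc2 = [Word("Invoice", 10, 10), Word("No:", 100, 30), Word("B456", 150, 30),
        Word("Total:", 100, 300), Word("99.10", 150, 300)]
labels = {"invoice_no": 2, "total": 4}
template = learn_template([(doc1, labels), (doc2, labels)])

# A new document with yet another layout and an extra, unlabeled line.
doc3 = [Word("Invoice", 10, 10), Word("No:", 100, 50), Word("C789", 150, 50),
        Word("Date:", 100, 80), Word("2002", 150, 80),
        Word("Total:", 100, 400), Word("12.34", 150, 400)]
result = extract(template, doc3)
```

In this sketch the only user input is the labeling of data elements in a few example documents. The anchors that survive generalization are the ones whose position relative to the value stays fixed even though the absolute layout shifts.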


Keywords: Data Element; Reverse Engineering; Document Image; Optical Character Recognition; Omission Error
These keywords were added by machine and not by the authors.



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Janusz Wnek (1)
  1. Science Applications International Corporation, Vienna, USA
