A probabilistic approach to printed document understanding

Abstract

We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document that contain the information to be extracted. Our approach is probabilistic: we derive a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of that class. All these parameters depend only on block properties that can be extracted automatically from the operator's actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal on 807 multi-page printed documents from different domains (invoices, patents, datasheets), obtaining very good results: for example, a success rate often greater than 90% even for classes with just two samples.
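The two steps named in the abstract, maximum likelihood estimation from a few operator-labelled samples followed by a search for the most probable block, can be illustrated with a minimal sketch. It is not the model from the paper: it assumes, purely for illustration, that each block is described only by a normalized (x, y) position, that the two coordinates are independent Gaussians, and that a single block (rather than a sequence of blocks) is selected. The names `fit_gaussian`, `fit_class`, `best_block`, and the variance floor `min_var` are hypothetical.

```python
import math
from statistics import fmean


def fit_gaussian(values, min_var=1e-4):
    """Maximum likelihood estimate of a Gaussian's mean and variance.

    The variance is floored at min_var (an assumption of this sketch) so that
    classes with very few, nearly identical samples do not yield zero variance.
    """
    mu = fmean(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, max(var, min_var)


def log_gaussian(x, mu, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def fit_class(samples):
    """Estimate class parameters from the (x, y) positions of clicked blocks."""
    xs, ys = zip(*samples)
    return {"x": fit_gaussian(xs), "y": fit_gaussian(ys)}


def best_block(blocks, params):
    """Return the index of the candidate block with the highest log-probability."""
    def score(block):
        x, y = block
        return log_gaussian(x, *params["x"]) + log_gaussian(y, *params["y"])

    return max(range(len(blocks)), key=lambda i: score(blocks[i]))


# Hypothetical data: positions of the block clicked in two sample documents.
clicked = [(0.80, 0.10), (0.82, 0.12)]
params = fit_class(clicked)

# Hypothetical OCR blocks of a new document of the same class.
candidates = [(0.10, 0.10), (0.81, 0.11), (0.50, 0.90)]
print(best_block(candidates, params))  # prints 1
```

With only two samples the estimated variances are small, so the candidate closest to the clicked positions dominates. A fuller model would score sequences of blocks and use richer block properties, as the abstract describes.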

Author information

Corresponding author

Correspondence to Eric Medvet.

About this article

Cite this article

Medvet, E., Bartoli, A. & Davanzo, G. A probabilistic approach to printed document understanding. IJDAR 14, 335–347 (2011). https://doi.org/10.1007/s10032-010-0137-1

Keywords

  • Document understanding
  • Automatic model upgrading
  • Invoice analysis
  • Maximum likelihood