Machine Learning for Document Structure Recognition

Part of the Studies in Computational Intelligence book series (SCI, volume 370)


The backbone of the information age is digital information, which can be searched, accessed, and transferred instantaneously. The digitization of paper documents is therefore of great interest. This chapter describes approaches to document structure recognition that detect the hierarchy of physical components in document images, such as pages, paragraphs, and figures, and transform it into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving the information contained in documents. First, we present a rule-based system that segments the document image and estimates the logical role of the resulting zones. It is used extensively for processing newspaper collections, where it shows world-class performance. In the second part, we introduce several machine learning approaches that explore large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources but perform better than simpler alternatives, and they might be used in the future.
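To illustrate the "linear sequence" model mentioned in the abstract, here is a minimal sketch of Viterbi decoding over a chain of text zones, the inference step used by linear-chain models such as conditional random fields. It is not the chapter's system: the label set, the function name `viterbi`, and all scores are hypothetical and hand-set for illustration, whereas a real model would learn them from layout and text features.

```python
from typing import Dict, List, Tuple

# Hypothetical logical roles; a real system would use a richer label set.
LABELS = ["title", "author", "body"]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence for a linear chain of zones.

    emissions[i][y] is the score of label y for zone i.
    transitions[(a, b)] is the score of label a being followed by label b.
    Missing entries count as 0.0.
    """
    if not emissions:
        return []
    # best[y]: best score of any label path that ends in y at the current zone
    best = {y: emissions[0].get(y, 0.0) for y in LABELS}
    backptrs: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for y in LABELS:
            # Pick the predecessor label that maximizes the path score into y.
            prev = max(LABELS,
                       key=lambda p: best[p] + transitions.get((p, y), 0.0))
            new_best[y] = (best[prev] + transitions.get((prev, y), 0.0)
                           + scores.get(y, 0.0))
            ptr[y] = prev
        best = new_best
        backptrs.append(ptr)
    # Backtrack from the best final label to recover the full sequence.
    path = [max(LABELS, key=lambda y: best[y])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Hand-set scores for three zones read top to bottom (illustrative only).
emissions = [
    {"title": 2.0, "author": 0.5, "body": 0.5},
    {"title": 0.8, "author": 1.0, "body": 0.9},
    {"title": 0.2, "author": 0.3, "body": 1.5},
]
transitions = {
    ("title", "author"): 1.0,
    ("author", "body"): 1.0,
    ("body", "body"): 0.5,
    ("title", "body"): 0.2,
}
print(viterbi(emissions, transitions))
```

The transition scores let the label of one zone influence its neighbours, which is what distinguishes a sequence model from classifying each zone in isolation. A general graph model, as also mentioned in the abstract, would need approximate inference instead of this exact dynamic program.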


Document Image, Text Line, Conditional Random Field, Parse Tree, Text Region
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  1. Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS), Sankt Augustin, Germany
