Abstract
The backbone of the information age is digital information, which can be searched, accessed, and transferred instantaneously. Digitizing paper documents is therefore of great interest. This chapter describes approaches to document structure recognition, which detects the hierarchy of physical components in document images, such as pages, paragraphs, and figures, and transforms it into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving the information contained in documents. First, we present a rule-based system that segments the document image and estimates the logical role of the resulting zones. It is used extensively for processing newspaper collections, where it shows world-class performance. In the second part, we introduce several machine learning approaches that explore large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources than simpler alternatives but perform better, and they may be used in the future.
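To illustrate the "linear sequence" case mentioned above, the following minimal sketch labels the text lines of a page with logical roles using Viterbi decoding over a linear chain. It is not the chapter's system: the label set, the scores in `EMISSIONS` and `TRANSITIONS`, and the function name `viterbi` are all hypothetical, hand-picked values for illustration. In a conditional random field these scores would be computed from learned feature weights.

```python
# Hypothetical logical roles for the text lines of a page.
LABELS = ["title", "author", "body"]

# Made-up per-line scores (e.g. derived from font size or position).
# Line 3 is locally ambiguous: "title" scores slightly higher than "body".
EMISSIONS = [
    {"title": 2.0, "author": 0.0, "body": -1.0},
    {"title": 0.0, "author": 1.0, "body": 0.5},
    {"title": -1.0, "author": 0.0, "body": 1.5},
    {"title": 0.5, "author": -1.0, "body": 0.4},
]

# Made-up scores for label b directly following label a.
TRANSITIONS = {
    ("title", "title"): 0.0, ("title", "author"): 1.0, ("title", "body"): 0.0,
    ("author", "title"): -2.0, ("author", "author"): 0.0, ("author", "body"): 1.0,
    ("body", "title"): -2.0, ("body", "author"): -2.0, ("body", "body"): 1.0,
}


def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear chain."""
    # best[y]: score of the best path that ends in label y at the current line
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for y in labels:
            # Choose the best predecessor label for y.
            prev = max(labels, key=lambda p: best[p] + transitions[(p, y)])
            new_best[y] = best[prev] + transitions[(prev, y)] + scores[y]
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path back from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Labeling each line independently ignores the neighbouring lines.
independent = [max(LABELS, key=lambda y: s[y]) for s in EMISSIONS]
# Joint decoding also takes the transition scores into account.
decoded = viterbi(EMISSIONS, TRANSITIONS, LABELS)
print(independent)
print(decoded)
```

With these toy scores, independent labeling assigns "title" to the last line, while joint decoding assigns "body" because a title directly after body text is penalized. A general graph model would replace the chain with arbitrary neighbourhood relations between zones, at which point exact decoding of this kind is typically no longer feasible and approximate inference is needed.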
Keywords
- Document Image
- Text Line
- Conditional Random Field
- Parse Tree
- Text Region
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Paaß, G., Konya, I. (2011). Machine Learning for Document Structure Recognition. In: Mehler, A., Kühnberger, KU., Lobin, H., Lüngen, H., Storrer, A., Witt, A. (eds) Modeling, Learning, and Processing of Text Technological Data Structures. Studies in Computational Intelligence, vol 370. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22613-7_12
DOI: https://doi.org/10.1007/978-3-642-22613-7_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22612-0
Online ISBN: 978-3-642-22613-7
eBook Packages: Engineering, Engineering (R0)