Abstract
One of the main distinguishing features of a document is its layout, as determined by the organization of, and reciprocal relationships among, the single components that make it up. For many tasks, one can afford to work at the level of single pages, since the various pages in multi-page documents are usually sufficiently unrelated to be processed separately. This chapter discusses the processing steps that lead from the original document to the identification of its class and of the role played by its single components according to their geometrical aspect: digitization (if any), low-level pre-processing for documents in the form of images or expressed in term of very elementary layout components, optical character recognition, layout analysis and document image understanding. This results in two distinct but related structures for a document (the layout and the logical one), for which suitable representation techniques are introduced as well.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
A binary relationship is an ordering relationship if it is reflexive, antisymmetric and transitive.
- 2.
- 3.
Another model, called SAX (Simple API for XML), processes the documents linewise. This avoids the need to completely load them into memory, but bounds processing to proceed forward only (once an item is passed over, it can be accessed again only by restarting processing from the beginning).
- 4.
A minimum resolution for allowing significant processing is 300 dpi, but thanks to the recently improved performance of computers the current digitization standards are becoming 400 dpi, with a trend towards 600 dpi.
- 5.
Switching from the color space to a logic one, they can be interpreted as denoting a pixel being important, in terms of True or False.
- 6.
Code available at http://code.google.com/p/tesseract-ocr/.
- 7.
The source code of Tesseract is structured in several directories:
- ccmain :
-
main program
- training :
-
training functionalities
- display :
-
a utility to view and operate on the internal structures
- testing :
-
test scripts (also contains execution results and errors)
- wordrec :
-
lexical recognition
- textord :
-
organization of text in words and lines
- classify :
-
character recognition
- ccstruct :
-
structures for representing page information
- viewer :
-
client-side interface for viewing the system (no server side is yet available)
- image :
-
images and image processing functionalities
- dict :
-
language models (including extension by addition of new models)
- cutil :
-
management of file I/O and data structures in C
- ccutil :
-
C++ code for dynamic memory allocation and data structures.
- 8.
Code available at http://vietocr.sourceforge.net/.
- 9.
Version 0.9.2 of JTOCR, the latest available at the time of writing, has a bug in the crop operation (method jMenuItemOCRActionPerformed of class gui.java). Let us represent a rectangle having top-left corner at coordinates (x,y), width w and height h as a 4-tuple (x,y,w,h). Given a selection (x S ,y S ,w S ,h S ) of the original image at zoom ratio p, the excerpt is identified as \((\frac{x_{S}}{p},\frac{y_{S}}{p},\frac {w_{S}}{p},\frac{h_{S}}{p})\). Clearly, no displacement is taken into account. To fix the problem, the offset \((x_{o},y_{o}) = (\frac {x_{P}-x_{I}}{2},\frac{y_{P}-y_{I}}{2})\) between the container panel (x P ,y P ,w P ,h P ) and the image (x I ,y I ,w I ,h I ) has to be considered, and hence the correct formula is \((\frac{x_{S}-x_{o}}{p},\frac {y_{S}-y_{o}}{p},\frac{w_{S}}{p},\frac{h_{S}}{p})\).
- 10.
A tree where the offspring of a node can be partitioned into groups such that the elements in the same group are considered in AND and groups are considered in OR.
- 11.
A tree built from a graph using all of its nodes and just a subset of its edges (spanning tree) such that the sum of weights of the subset of edges chosen to make up the tree is minimum with respect to all possible such trees. In Kruskal’s algorithm [48], it is built by progressively adding the next unused edge with minimum weight, skipping those that yield cycles, until a tree is obtained.
- 12.
A simplified profile of the ISO 8601 standard, that combines in different patterns groups of digits YYYY, MM, DD, HH, MM, SS expressing year, month, day, hour, minute and second, respectively.
- 13.
Internet Media Type, formerly MIME types.
- 14.
The typical two-character language codes, optionally followed by a two-character country code (e.g., it for Italian, en-uk for English used in the United Kingdom).
- 15.
Thesaurus of Geographic Names.
References
Document Object Model (DOM) Level 1 Specification—version 1.0. Tech. rep. REC-DOM-Level-1-19981001, W3C (1998)
Document Object Model (DOM) Level 2 Core Specification. Tech. rep. 1.0, W3C (2000)
Dublin Core metadata element set version 1.1. Tech. rep. 15836, International Standards Organization (2009)
Altamura, O., Esposito, F., Malerba, D.: Transforming paper documents into XML format with WISDOM++. International Journal on Document Analysis and Recognition 4, 2–17 (2001)
Baird, H.S.: The skew angle of printed documents. In: Proceedings of the Conference of the Society of Photographic Scientists and Engineers, pp. 14–21 (1987)
Baird, H.S.: Background structure in document images. In: Advances in Structural and Syntactic Pattern Recognition, pp. 17–34. World Scientific, Singapore (1992)
Baird, H.S.: Document image defect models. In: Baird, H.S., Bunke, H., Yamamoto, K. (eds.) Structured Document Image Analysis, pp. 546–556. Springer, Berlin (1992)
Baird, H.S., Jones, S., Fortune, S.: Image segmentation by shape-directed covers. In: Proceedings of the 10th International Conference on Pattern Recognition (ICPR), pp. 820–825 (1990)
Berkhin, P.: Survey of clustering Data Mining techniques. Tech. rep., Accrue Software, San Jose, CA (2002)
Breuel, T.M.: Two geometric algorithms for layout analysis. In: Proceedings of the 5th International Workshop on Document Analysis Systems (DAS). Lecture Notes in Computer Science, vol. 2423, pp. 188–199. Springer, Berlin (2002)
Cao, H., Prasad, R., Natarajan, P., MacRostie, E.: Robust page segmentation based on smearing and error correction unifying top-down and bottom-up approaches. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 392–396. IEEE Computer Society, Los Alamitos (2007)
Cesarini, F., Marinai, S., Soda, G., Gori, M.: Structured document segmentation and representation by the Modified X–Y tree. In: Proceedings of the 5th International Conference on Document Analysis and Recognition (ICDAR), pp. 563–566. IEEE Computer Society, Los Alamitos (1999)
Chaudhuri, B.: Digital Document Processing—Major Directions and Recent Advances. Springer, Berlin (2007)
Chen, Q.: Evaluation of OCR algorithms for images with different spatial resolution and noise. Ph.D. thesis, University of Ottawa, Canada (2003)
Ciardiello, G., Scafuro, G., Degrandi, M., Spada, M., Roccotelli, M.: An experimental system for office document handling and text recognition. In: Proceedings of the 9th International Conference on Pattern Recognition (ICPR), pp. 739–743 (1988)
Egenhofer, M.J.: Reasoning about binary topological relations. In: Gunther, O., Schek, H.J. (eds.) 2nd Symposium on Large Spatial Databases. Lecture Notes in Computer Science, vol. 525, pp. 143–160. Springer, Berlin (1991)
Egenhofer, M.J., Herring, J.R.: A mathematical framework for the definition of topological relationships. In: Proceedings of the 4th International Symposium on Spatial Data Handling, pp. 803–813 (1990)
Egenhofer, M.J., Sharma, J., Mark, D.M.: A critical comparison of the 4-intersection and 9-intersection models for spatial relations: Formal analysis. In: Proceedings of the 11th International Symposium on Computer-Assisted Cartography (Auto-Carto) (1993)
Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N.: Machine Learning for digital document processing: from layout analysis to metadata extraction. In: Marinai, S., Fujisawa, H. (eds.) Machine learning in Document Analysis and Recognition. Studies in Computational Intelligence, vol. 90, pp. 105–138. Springer, Berlin (2008)
Esposito, F., Ferilli, S., Fanizzi, N., Basile, T.M., Di Mauro, N.: Incremental multistrategy learning for document processing. Applied Artificial Intelligence: An International Journal 17(8/9), 859–883 (2003)
Fateman, R.J., Tokuyasu, T.: A suite of lisp programs for document image analysis and structuring. Tech. rep., Computer Science Division, EECS Department—University of California at Berkeley (1994)
Ferilli, S., Basile, T.M.A., Esposito, F.: A histogram-based technique for automatic threshold assessment in a Run Length Smoothing-based algorithm. In: Proceedings of the 9th International Workshop on Document Analysis Systems (DAS). ACM International Conference Proceedings, pp. 349–356 (2010)
Ferilli, S., Biba, M., Esposito, F., Basile, T.M.A.: A distance-based technique for non-Manhattan layout analysis. In: Proceedings of the 10th International Conference on Document Analysis Recognition (ICDAR), pp. 231–235 (2009)
Frank, A.U.: Qualitative spatial reasoning: Cardinal directions as an example. International Journal of Geographical Information Systems 10(3), 269–290 (1996)
Gatos, B., Pratikakis, I., Ntirogiannis, K.: Segmentation based recovery of arbitrarily warped document images. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), pp. 989–993 (2007)
Impedovo, S., Ottaviano, L., Occhinegro, S.: Optical character recognition—a survey. International Journal on Pattern Recognition and Artificial Intelligence 5(1–2), 1–24 (1991)
Kainz, W., Egenhofer, M.J., Greasley, I.: Modeling spatial relations and operations with partially ordered sets. International Journal of Geographical Information Systems 7(3), 215–229 (1993)
Kakas, A.C., Mancarella, P.: On the relation of truth maintenance and abduction. In: Proceedings of the 1st Pacific Rim International Conference on Artificial Intelligence (PRICAI), pp. 438–443 (1990)
Kise, K., Sato, A., Iwata, M.: Segmentation of page images using the area Voronoi diagram. Computer Vision Image Understanding 70(3), 370–382 (1998)
Michalski, R.S.: Inferential theory of learning. Developing foundations for multistrategy learning. In: Michalski, R., Tecuci, G. (eds.) Machine Learning. A Multistrategy Approach, vol. IV, pp. 3–61. Morgan Kaufmann, San Mateo (1994)
Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
Mori, S., Suen, C.Y., Yamamoto, K.: Historical review of OCR research and development. Proceedings of the IEEE 80(7), 1029–1058 (1992)
Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 38–62 (2000)
Nagy, G., Kanai, J., Krishnamoorthy, M.: Two complementary techniques for digitized document analysis. In: ACM Conference on Document Processing Systems (1988)
Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992)
Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents. In: Proceedings of the 7th International Conference on Pattern Recognition (ICPR), pp. 347–349. IEEE Computer Society Press, Los Alamitos (1984)
Nienhuys-Cheng, S.H., de Wolf, R. (eds.): Foundations of Inductive Logic Programming. Lecture Notes in Computer Science, vol. 1228. Springer, Berlin (1997)
O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993)
O’Gorman, L., Kasturi, R.: Document Image Analysis. IEEE Computer Society, Los Alamitos (1995)
Papadias, D., Theodoridis, Y.: Spatial relations, minimum bounding rectangles, and spatial data structures. International Journal of Geographical Information Science 11(2), 111–138 (1997)
Papamarkos, N., Tzortzakis, J., Gatos, B.: Determination of run-length smoothing values for document segmentation. In: Proceedings of the International Conference on Electronic Circuits and Systems (ICECS), vol. 2, pp. 684–687 (1996)
Pavlidis, T., Zhou, J.: Page segmentation by white streams. In: Proceedings of the 1st International Conference on Document Analysis and Recognition (ICDAR), pp. 945–953 (1991)
Rice, S.V., Jenkins, F.R., Nartker, T.A.: The fourth annual test of OCR accuracy. Tech. rep. 95-03, Information Science Research Institute, University of Nevada, Las Vegas (1995)
Salembier, P., Marques, F.: Region-based representations of image and video: Segmentation tools for multimedia services. IEEE Transactions on Circuits and Systems for Video Technology 9(8), 1147–1169 (1999)
Shafait, F., Smith, R.: Table detection in heterogeneous documents. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS). ACM International Conference Proceedings, pp. 65–72 (2010)
Shih, F., Chen, S.S.: Adaptive document block segmentation and classification. IEEE Transactions on Systems, Man, and Cybernetics—Part B 26(5), 797–802 (1996)
Simon, A., Pret, J.C., Johnson, A.P.: A fast algorithm for bottom-up document layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(3), 273–277 (1997)
Skiena, S.S.: The Algorithm Design Manual, 2nd edn. Springer, Berlin (2008)
Smith, R.: A simple and efficient skew detection algorithm via text row accumulation. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition (ICDAR), pp. 1145–1148, IEEE Computer Society, Los Alamitos (1995)
Smith, R.: An overview of the Tesseract OCR engine. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), pp. 629–633. IEEE Computer Society, Los Alamitos (2007)
Smith, R.: Hybrid page layout analysis via tab-stop detection. In: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 241–245. IEEE Computer Society, Los Alamitos (2009)
Sun, H.M.: Page segmentation for Manhattan and non-Manhattan layout documents via selective CRLA. In: Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR), pp. 116–120. IEEE Computer Society, Los Alamitos (2005)
Wahl, F., Wong, K., Casey, R.: Block segmentation and text extraction in mixed text/image documents. Graphical Models and Image Processing 20, 375–390 (1982)
Wang, D., Srihari, S.N.: Classification of newspaper image blocks using texture analysis. Computer Vision, Graphics, and Image Processing 47, 327–352 (1989)
Wong, K.Y., Casey, R., Wahl, F.M.: Document analysis system. IBM Journal of Research and Development 26, 647–656 (1982)
Zucker, J.D.: Semantic abstraction for concept representation and learning. In: Proceedings of the 4th International Workshop on Multistrategy Learning (MSL), pp. 157–164 (1998)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Copyright information
© 2011 Springer-Verlag London Limited
About this chapter
Cite this chapter
Ferilli, S. (2011). Document Image Analysis. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_5
Download citation
DOI: https://doi.org/10.1007/978-0-85729-198-1_5
Publisher Name: Springer, London
Print ISBN: 978-0-85729-197-4
Online ISBN: 978-0-85729-198-1
eBook Packages: Computer ScienceComputer Science (R0)