Document Image Analysis

Ferilli, Stefano

doi:10.1007/978-0-85729-198-1_5

Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

One of the main distinguishing features of a document is its layout, as determined by the organization of, and reciprocal relationships among, the single components that make it up. For many tasks, one can afford to work at the level of single pages, since the various pages in multi-page documents are usually sufficiently unrelated to be processed separately. This chapter discusses the processing steps that lead from the original document to the identification of its class and of the role played by its single components according to their geometrical aspect: digitization (if any), low-level pre-processing for documents in the form of images or expressed in terms of very elementary layout components, optical character recognition, layout analysis and document image understanding. This results in two distinct but related structures for a document (the layout and the logical one), for which suitable representation techniques are introduced as well.

Notes

1.
A binary relationship is an ordering relationship if it is reflexive, antisymmetric and transitive.
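
The three properties named in this note can be checked mechanically on a finite set. The following is a minimal sketch, assuming the relation is given explicitly as a set of pairs; the function name `is_ordering` and the sample relations are illustrative and not taken from the chapter.

```python
from itertools import product


def is_ordering(elements, relation):
    """Return True if `relation` (a set of (a, b) pairs) is reflexive,
    antisymmetric and transitive over `elements`."""
    elements = set(elements)
    reflexive = all((a, a) in relation for a in elements)
    antisymmetric = all(
        a == b or not ((a, b) in relation and (b, a) in relation)
        for a, b in product(elements, repeat=2)
    )
    transitive = all(
        (a, c) in relation
        for (a, b) in relation
        for (b2, c) in relation
        if b == b2
    )
    return reflexive and antisymmetric and transitive


nums = {1, 2, 3, 6}
# "a divides b" satisfies all three properties on this set.
divides = {(a, b) for a, b in product(nums, repeat=2) if b % a == 0}
# "differs by at most 1" is reflexive but not antisymmetric: (1, 2) and (2, 1).
close = {(a, b) for a, b in product(nums, repeat=2) if abs(a - b) <= 1}
```

Here `is_ordering(nums, divides)` holds, while `is_ordering(nums, close)` does not.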
2.
An evolution of this model that also considers the complement of the region (and hence a 3×3 matrix), called the 9-Intersection model, was subsequently developed [16, 18].
3.
Another model, called SAX (Simple API for XML), processes documents sequentially, reporting each item as it is read. This avoids the need to load them completely into memory, but restricts processing to a single forward pass (once an item has been passed over, it can be accessed again only by restarting processing from the beginning).
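
The forward-only behaviour of SAX can be illustrated with the `xml.sax` module of the Python standard library. This is a sketch under assumptions: the handler class `TitleCollector`, the helper `collect_titles` and the sample document with its `title` element are made up for the example.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element while the parser streams."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


def collect_titles(xml_bytes):
    """Parse the document once, front to back, and return the titles found."""
    handler = TitleCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles


SAMPLE = b"<doc><title>Layout</title><p>text</p><title>Logic</title></doc>"
```

The handler keeps only the state it needs (the current buffer and the results so far); nothing that has already been reported by the parser can be revisited without parsing again.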
4.
A minimum resolution for allowing significant processing is 300 dpi, but thanks to the improved performance of computers, current digitization standards are moving to 400 dpi, with a trend towards 600 dpi.
5.
Switching from the color space to a logical one, they can be interpreted as denoting a pixel being important, in terms of True or False.
6.
Code available at http://code.google.com/p/tesseract-ocr/.
7.
The source code of Tesseract is structured in several directories:
ccmain: main program
training: training functionalities
display: a utility to view and operate on the internal structures
testing: test scripts (also contains execution results and errors)
wordrec: lexical recognition
textord: organization of text in words and lines
classify: character recognition
ccstruct: structures for representing page information
viewer: client-side interface for viewing the system (no server side is yet available)
image: images and image processing functionalities
dict: language models (including extension by addition of new models)
cutil: management of file I/O and data structures in C
ccutil: C++ code for dynamic memory allocation and data structures.
8.
Code available at http://vietocr.sourceforge.net/.
9.
Version 0.9.2 of JTOCR, the latest available at the time of writing, has a bug in the crop operation (method jMenuItemOCRActionPerformed of class gui.java). Let us represent a rectangle having its top-left corner at coordinates \((x,y)\), width \(w\) and height \(h\) as a 4-tuple \((x,y,w,h)\). Given a selection \((x_{S},y_{S},w_{S},h_{S})\) of the original image at zoom ratio \(p\), the excerpt is identified as \((\frac{x_{S}}{p},\frac{y_{S}}{p},\frac{w_{S}}{p},\frac{h_{S}}{p})\). Clearly, no displacement is taken into account. To fix the problem, the offset \((x_{o},y_{o}) = (\frac{x_{P}-x_{I}}{2},\frac{y_{P}-y_{I}}{2})\) between the container panel \((x_{P},y_{P},w_{P},h_{P})\) and the image \((x_{I},y_{I},w_{I},h_{I})\) has to be considered, and hence the correct formula is \((\frac{x_{S}-x_{o}}{p},\frac{y_{S}-y_{o}}{p},\frac{w_{S}}{p},\frac{h_{S}}{p})\).
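
The difference between the two formulas in this note can be seen in a small numeric sketch. This is not JTOCR code: the function `selection_to_image_rect` and the sample values are hypothetical, and the offset is passed in directly rather than derived from the panel and image rectangles.

```python
def selection_to_image_rect(selection, zoom, offset=(0.0, 0.0)):
    """Map a selection (x, y, w, h) drawn on the zoomed view to a rectangle
    in the original image.

    zoom is the ratio p; offset is (x_o, y_o), the displacement of the
    displayed image inside its container panel.
    """
    x_s, y_s, w_s, h_s = selection
    x_o, y_o = offset
    return ((x_s - x_o) / zoom, (y_s - y_o) / zoom, w_s / zoom, h_s / zoom)


selection = (120, 80, 200, 100)

# Formula without displacement: the offset defaults to (0, 0).
buggy = selection_to_image_rect(selection, zoom=2.0)

# Corrected formula with an assumed offset of (20, 10).
fixed = selection_to_image_rect(selection, zoom=2.0, offset=(20, 10))
```

With these values the two rectangles have the same size but different top-left corners, which is exactly the displacement the note describes.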
10.
A tree in which the children of each node can be partitioned into groups such that the elements of a group are combined in AND, while the groups are combined in OR.
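
One way to make this definition concrete is to evaluate such a tree. The sketch below assumes a representation chosen only for illustration: a leaf is a Boolean, and an inner node is a list of groups, each group being a list of child nodes.

```python
def holds(node):
    """Evaluate an AND/OR tree.

    A leaf is a bool. An inner node is a list of groups, each group a list
    of child nodes: children within a group are ANDed, groups are ORed.
    """
    if isinstance(node, bool):
        return node
    return any(all(holds(child) for child in group) for group in node)


tree = [
    [True, False],               # group 1: True AND False
    [True, [[True], [False]]],   # group 2: True AND (True OR False)
]
```

In this example the first group fails and the second succeeds, so the root holds.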
11.
A tree built from a graph using all of its nodes and just a subset of its edges (spanning tree) such that the sum of weights of the subset of edges chosen to make up the tree is minimum with respect to all possible such trees. In Kruskal’s algorithm [48], it is built by progressively adding the next unused edge with minimum weight, skipping those that yield cycles, until a tree is obtained.
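
The procedure described in this note can be sketched as follows. The cycle check uses a union-find structure, which is a common implementation choice and an assumption here rather than something the note prescribes; the edge format `(weight, u, v)` and the sample graph are also illustrative.

```python
def kruskal(num_nodes, edges):
    """Return (total_weight, chosen_edges) of a minimum spanning tree.

    edges is an iterable of (weight, u, v) with nodes numbered
    0..num_nodes-1. The graph is assumed to be connected.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen, total = [], 0
    for weight, u, v in sorted(edges):  # next unused edge of minimum weight
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # this edge would close a cycle, skip it
        parent[root_u] = root_v
        chosen.append((u, v, weight))
        total += weight
        if len(chosen) == num_nodes - 1:
            break  # a spanning tree has exactly n - 1 edges
    return total, chosen


edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3)]
total, tree = kruskal(4, edges)
```

On this four-node graph the edges of weight 4 and 5 are never needed, because the three lightest edges already connect all nodes.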
12.
A simplified profile of the ISO 8601 standard that combines, in different patterns, the digit groups YYYY, MM, DD, hh, mm and ss, expressing year, month, day, hour, minute and second, respectively.
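
As an illustration of how such patterns nest, the sketch below validates strings against one possible reading of the profile. It assumes the profile meant is the W3C date and time note (year, year-month, full date, and full date with time plus a mandatory time zone designator); the name `W3CDTF` and the helper `is_valid` are hypothetical, and the check is purely syntactic.

```python
import re

# Each optional group extends the previous, coarser pattern.
W3CDTF = re.compile(
    r"\d{4}"                                # YYYY
    r"(?:-(?:0[1-9]|1[0-2])"                # -MM
    r"(?:-(?:0[1-9]|[12]\d|3[01])"          # -DD
    r"(?:T(?:[01]\d|2[0-3]):[0-5]\d"        # Thh:mm
    r"(?::[0-5]\d(?:\.\d+)?)?"              # :ss and optional fraction
    r"(?:Z|[+-](?:[01]\d|2[0-3]):[0-5]\d)"  # time zone designator
    r")?)?)?"
)


def is_valid(text):
    """Syntactic check only: it does not verify that the day exists
    in the given month (for example, 02-31 is accepted)."""
    return W3CDTF.fullmatch(text) is not None
```

Under this reading, `2011`, `2011-03`, `2011-03-15` and `2011-03-15T14:30:05+01:00` are accepted, while `2011-13` and a time without a zone designator are rejected.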
13.
Internet media types, formerly known as MIME types.
14.
The typical two-character language codes, optionally followed by a two-character country code (e.g., it for Italian, en-GB for English as used in the United Kingdom).
15.
Thesaurus of Geographic Names.

References

Document Object Model (DOM) Level 1 Specification—version 1.0. Tech. rep. REC-DOM-Level-1-19981001, W3C (1998)
Document Object Model (DOM) Level 2 Core Specification. Tech. rep. 1.0, W3C (2000)
Dublin Core metadata element set version 1.1. Tech. rep. 15836, International Organization for Standardization (2009)
Altamura, O., Esposito, F., Malerba, D.: Transforming paper documents into XML format with WISDOM++. International Journal on Document Analysis and Recognition 4, 2–17 (2001)
Baird, H.S.: The skew angle of printed documents. In: Proceedings of the Conference of the Society of Photographic Scientists and Engineers, pp. 14–21 (1987)
Baird, H.S.: Background structure in document images. In: Advances in Structural and Syntactic Pattern Recognition, pp. 17–34. World Scientific, Singapore (1992)
Baird, H.S.: Document image defect models. In: Baird, H.S., Bunke, H., Yamamoto, K. (eds.) Structured Document Image Analysis, pp. 546–556. Springer, Berlin (1992)
Baird, H.S., Jones, S., Fortune, S.: Image segmentation by shape-directed covers. In: Proceedings of the 10th International Conference on Pattern Recognition (ICPR), pp. 820–825 (1990)
Berkhin, P.: Survey of clustering Data Mining techniques. Tech. rep., Accrue Software, San Jose, CA (2002)
Breuel, T.M.: Two geometric algorithms for layout analysis. In: Proceedings of the 5th International Workshop on Document Analysis Systems (DAS). Lecture Notes in Computer Science, vol. 2423, pp. 188–199. Springer, Berlin (2002)
Cao, H., Prasad, R., Natarajan, P., MacRostie, E.: Robust page segmentation based on smearing and error correction unifying top-down and bottom-up approaches. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 392–396. IEEE Computer Society, Los Alamitos (2007)
Cesarini, F., Marinai, S., Soda, G., Gori, M.: Structured document segmentation and representation by the Modified X–Y tree. In: Proceedings of the 5th International Conference on Document Analysis and Recognition (ICDAR), pp. 563–566. IEEE Computer Society, Los Alamitos (1999)
Chaudhuri, B.: Digital Document Processing—Major Directions and Recent Advances. Springer, Berlin (2007)
Chen, Q.: Evaluation of OCR algorithms for images with different spatial resolution and noise. Ph.D. thesis, University of Ottawa, Canada (2003)
Ciardiello, G., Scafuro, G., Degrandi, M., Spada, M., Roccotelli, M.: An experimental system for office document handling and text recognition. In: Proceedings of the 9th International Conference on Pattern Recognition (ICPR), pp. 739–743 (1988)
Egenhofer, M.J.: Reasoning about binary topological relations. In: Gunther, O., Schek, H.J. (eds.) 2nd Symposium on Large Spatial Databases. Lecture Notes in Computer Science, vol. 525, pp. 143–160. Springer, Berlin (1991)
Egenhofer, M.J., Herring, J.R.: A mathematical framework for the definition of topological relationships. In: Proceedings of the 4th International Symposium on Spatial Data Handling, pp. 803–813 (1990)
Egenhofer, M.J., Sharma, J., Mark, D.M.: A critical comparison of the 4-intersection and 9-intersection models for spatial relations: Formal analysis. In: Proceedings of the 11th International Symposium on Computer-Assisted Cartography (Auto-Carto) (1993)
Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N.: Machine Learning for digital document processing: from layout analysis to metadata extraction. In: Marinai, S., Fujisawa, H. (eds.) Machine learning in Document Analysis and Recognition. Studies in Computational Intelligence, vol. 90, pp. 105–138. Springer, Berlin (2008)
Esposito, F., Ferilli, S., Fanizzi, N., Basile, T.M., Di Mauro, N.: Incremental multistrategy learning for document processing. Applied Artificial Intelligence: An International Journal 17(8/9), 859–883 (2003)
Fateman, R.J., Tokuyasu, T.: A suite of lisp programs for document image analysis and structuring. Tech. rep., Computer Science Division, EECS Department—University of California at Berkeley (1994)
Ferilli, S., Basile, T.M.A., Esposito, F.: A histogram-based technique for automatic threshold assessment in a Run Length Smoothing-based algorithm. In: Proceedings of the 9th International Workshop on Document Analysis Systems (DAS). ACM International Conference Proceedings, pp. 349–356 (2010)
Ferilli, S., Biba, M., Esposito, F., Basile, T.M.A.: A distance-based technique for non-Manhattan layout analysis. In: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 231–235 (2009)
Frank, A.U.: Qualitative spatial reasoning: Cardinal directions as an example. International Journal of Geographical Information Systems 10(3), 269–290 (1996)
Gatos, B., Pratikakis, I., Ntirogiannis, K.: Segmentation based recovery of arbitrarily warped document images. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), pp. 989–993 (2007)
Impedovo, S., Ottaviano, L., Occhinegro, S.: Optical character recognition—a survey. International Journal on Pattern Recognition and Artificial Intelligence 5(1–2), 1–24 (1991)
Kainz, W., Egenhofer, M.J., Greasley, I.: Modeling spatial relations and operations with partially ordered sets. International Journal of Geographical Information Systems 7(3), 215–229 (1993)
Kakas, A.C., Mancarella, P.: On the relation of truth maintenance and abduction. In: Proceedings of the 1st Pacific Rim International Conference on Artificial Intelligence (PRICAI), pp. 438–443 (1990)
Kise, K., Sato, A., Iwata, M.: Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding 70(3), 370–382 (1998)
Michalski, R.S.: Inferential theory of learning. Developing foundations for multistrategy learning. In: Michalski, R., Tecuci, G. (eds.) Machine Learning. A Multistrategy Approach, vol. IV, pp. 3–61. Morgan Kaufmann, San Mateo (1994)
Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
Mori, S., Suen, C.Y., Yamamoto, K.: Historical review of OCR research and development. Proceedings of the IEEE 80(7), 1029–1058 (1992)
Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 38–62 (2000)
Nagy, G., Kanai, J., Krishnamoorthy, M.: Two complementary techniques for digitized document analysis. In: ACM Conference on Document Processing Systems (1988)
Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992)
Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents. In: Proceedings of the 7th International Conference on Pattern Recognition (ICPR), pp. 347–349. IEEE Computer Society Press, Los Alamitos (1984)
Nienhuys-Cheng, S.H., de Wolf, R. (eds.): Foundations of Inductive Logic Programming. Lecture Notes in Computer Science, vol. 1228. Springer, Berlin (1997)
O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993)
O’Gorman, L., Kasturi, R.: Document Image Analysis. IEEE Computer Society, Los Alamitos (1995)
Papadias, D., Theodoridis, Y.: Spatial relations, minimum bounding rectangles, and spatial data structures. International Journal of Geographical Information Science 11(2), 111–138 (1997)
Papamarkos, N., Tzortzakis, J., Gatos, B.: Determination of run-length smoothing values for document segmentation. In: Proceedings of the International Conference on Electronic Circuits and Systems (ICECS), vol. 2, pp. 684–687 (1996)
Pavlidis, T., Zhou, J.: Page segmentation by white streams. In: Proceedings of the 1st International Conference on Document Analysis and Recognition (ICDAR), pp. 945–953 (1991)
Rice, S.V., Jenkins, F.R., Nartker, T.A.: The fourth annual test of OCR accuracy. Tech. rep. 95-03, Information Science Research Institute, University of Nevada, Las Vegas (1995)
Salembier, P., Marques, F.: Region-based representations of image and video: Segmentation tools for multimedia services. IEEE Transactions on Circuits and Systems for Video Technology 9(8), 1147–1169 (1999)
Shafait, F., Smith, R.: Table detection in heterogeneous documents. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS). ACM International Conference Proceedings, pp. 65–72 (2010)
Shih, F., Chen, S.S.: Adaptive document block segmentation and classification. IEEE Transactions on Systems, Man, and Cybernetics—Part B 26(5), 797–802 (1996)
Simon, A., Pret, J.C., Johnson, A.P.: A fast algorithm for bottom-up document layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(3), 273–277 (1997)
Skiena, S.S.: The Algorithm Design Manual, 2nd edn. Springer, Berlin (2008)
Smith, R.: A simple and efficient skew detection algorithm via text row accumulation. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition (ICDAR), pp. 1145–1148, IEEE Computer Society, Los Alamitos (1995)
Smith, R.: An overview of the Tesseract OCR engine. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), pp. 629–633. IEEE Computer Society, Los Alamitos (2007)
Smith, R.: Hybrid page layout analysis via tab-stop detection. In: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 241–245. IEEE Computer Society, Los Alamitos (2009)
Sun, H.M.: Page segmentation for Manhattan and non-Manhattan layout documents via selective CRLA. In: Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR), pp. 116–120. IEEE Computer Society, Los Alamitos (2005)
Wahl, F., Wong, K., Casey, R.: Block segmentation and text extraction in mixed text/image documents. Graphical Models and Image Processing 20, 375–390 (1982)
Wang, D., Srihari, S.N.: Classification of newspaper image blocks using texture analysis. Computer Vision, Graphics, and Image Processing 47, 327–352 (1989)
Wong, K.Y., Casey, R., Wahl, F.M.: Document analysis system. IBM Journal of Research and Development 26, 647–656 (1982)
Zucker, J.D.: Semantic abstraction for concept representation and learning. In: Proceedings of the 4th International Workshop on Multistrategy Learning (MSL), pp. 157–164 (1998)

Author information

Authors and Affiliations

Dipartimento di Informatica, Università di Bari, Via E. Orabona 4, 70126, Bari, Italy
Stefano Ferilli

Corresponding author

Correspondence to Stefano Ferilli.

About this chapter

Cite this chapter

Ferilli, S. (2011). Document Image Analysis. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_5

DOI: https://doi.org/10.1007/978-0-85729-198-1_5
Publisher Name: Springer, London
Print ISBN: 978-0-85729-197-4
Online ISBN: 978-0-85729-198-1
eBook Packages: Computer Science, Computer Science (R0)
