Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

One of the main distinguishing features of a document is its layout, as determined by the organization of, and reciprocal relationships among, the single components that make it up. For many tasks, one can afford to work at the level of single pages, since the pages of a multi-page document are usually sufficiently unrelated to be processed separately. This chapter discusses the processing steps that lead from the original document to the identification of its class and of the role played by its single components according to their geometrical aspect: digitization (if any), low-level pre-processing for documents in the form of images or expressed in terms of very elementary layout components, optical character recognition, layout analysis and document image understanding. This results in two distinct but related structures for a document (the layout structure and the logical structure), for which suitable representation techniques are introduced as well.

Notes

  1. A binary relationship is an ordering relationship if it is reflexive, antisymmetric and transitive.

  2. An evolution of this model that also considers the complement of the region (and hence uses a 3×3 matrix), called the 9-Intersection model, was subsequently developed [16, 18].

  3. Another model, called SAX (Simple API for XML), processes documents sequentially, reporting each item as it is read. This avoids the need to load them completely into memory, but forces processing to proceed forward only (once an item has been passed over, it can be accessed again only by restarting processing from the beginning).

  4. The minimum resolution that allows significant processing is 300 dpi but, thanks to recent improvements in computer performance, current digitization standards are moving to 400 dpi, with a trend towards 600 dpi.

  5. Switching from the color space to a logical one, they can be interpreted as denoting whether a pixel is important, in terms of True or False.

  6. Code available at http://code.google.com/p/tesseract-ocr/.

  7. The source code of Tesseract is structured in several directories:
    • ccmain: main program
    • training: training functionalities
    • display: a utility to view and operate on the internal structures
    • testing: test scripts (also contains execution results and errors)
    • wordrec: lexical recognition
    • textord: organization of text in words and lines
    • classify: character recognition
    • ccstruct: structures for representing page information
    • viewer: client-side interface for viewing the system (no server side is yet available)
    • image: images and image processing functionalities
    • dict: language models (including extension by addition of new models)
    • cutil: management of file I/O and data structures in C
    • ccutil: C++ code for dynamic memory allocation and data structures

  8. Code available at http://vietocr.sourceforge.net/.

  9. Version 0.9.2 of JTOCR, the latest available at the time of writing, has a bug in the crop operation (method jMenuItemOCRActionPerformed of class gui.java). Let us represent a rectangle having its top-left corner at coordinates \((x,y)\), width \(w\) and height \(h\) as a 4-tuple \((x,y,w,h)\). Given a selection \((x_{S},y_{S},w_{S},h_{S})\) of the original image at zoom ratio \(p\), the excerpt is identified as \((\frac{x_{S}}{p},\frac{y_{S}}{p},\frac{w_{S}}{p},\frac{h_{S}}{p})\). Clearly, no displacement is taken into account. To fix the problem, the offset \((x_{o},y_{o}) = (\frac{x_{P}-x_{I}}{2},\frac{y_{P}-y_{I}}{2})\) between the container panel \((x_{P},y_{P},w_{P},h_{P})\) and the image \((x_{I},y_{I},w_{I},h_{I})\) has to be considered, and hence the correct formula is \((\frac{x_{S}-x_{o}}{p},\frac{y_{S}-y_{o}}{p},\frac{w_{S}}{p},\frac{h_{S}}{p})\).

  10. A tree in which the offspring of a node can be partitioned into groups such that the elements in the same group are considered in AND and the groups are considered in OR.

  11. A tree built from a graph using all of its nodes and just a subset of its edges (a spanning tree), such that the sum of the weights of the chosen edges is minimum with respect to all possible such trees. In Kruskal’s algorithm [48], it is built by progressively adding the next unused edge with minimum weight, skipping those that would yield cycles, until a tree is obtained.

  12. A simplified profile of the ISO 8601 standard that combines, in different patterns, the digit groups YYYY, MM, DD, hh, mm, ss expressing year, month, day, hour, minute and second, respectively.

  13. Internet Media Types, formerly known as MIME types.

  14. The typical two-character language codes, optionally followed by a two-character country code (e.g., it for Italian, en-GB for English as used in the United Kingdom).

  15. Thesaurus of Geographic Names.
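
The definition in note 1 can be checked mechanically on a finite set. The following is a minimal sketch, assuming the relationship is given as a set of pairs; the function name and the "divides" example are illustrative and not taken from the chapter.

```python
from itertools import product


def is_ordering(elements, relation):
    """Return True if `relation` (a set of pairs over `elements`) is
    reflexive, antisymmetric and transitive."""
    rel = set(relation)
    reflexive = all((a, a) in rel for a in elements)
    # Antisymmetry: (a, b) and (b, a) may both hold only when a == b.
    antisymmetric = all(a == b or (b, a) not in rel for (a, b) in rel)
    # Transitivity: (a, b) and (b, d) together imply (a, d).
    transitive = all(
        (a, d) in rel for (a, b) in rel for (c, d) in rel if b == c
    )
    return reflexive and antisymmetric and transitive


# Illustrative example: "a divides b" on the divisors of 6.
ELEMENTS = [1, 2, 3, 6]
DIVIDES = {(a, b) for a, b in product(ELEMENTS, repeat=2) if b % a == 0}

print(is_ordering(ELEMENTS, DIVIDES))  # True
```

A relationship containing both (1, 2) and (2, 1) would fail the antisymmetry test, and one lacking (1, 1) would fail reflexivity.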
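
For reference, the 3×3 matrix mentioned in note 2 is commonly written as below. The notation is an assumption rather than the chapter's own: \(A^{\circ}\), \(\partial A\) and \(A^{-}\) stand for the interior, boundary and complement (exterior) of region \(A\), and each entry is evaluated only as empty or non-empty.

```latex
\[
I_{9}(A,B) =
\begin{pmatrix}
A^{\circ} \cap B^{\circ}  & A^{\circ} \cap \partial B  & A^{\circ} \cap B^{-} \\
\partial A \cap B^{\circ} & \partial A \cap \partial B & \partial A \cap B^{-} \\
A^{-} \cap B^{\circ}      & A^{-} \cap \partial B      & A^{-} \cap B^{-}
\end{pmatrix}
\]
```

The upper-left 2×2 block involves interiors and boundaries only; the third row and column are what the complement adds.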
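
The forward-only processing described in note 3 can be illustrated with the SAX interface in Python's standard library (`xml.sax`). This is a minimal sketch: the handler class, the sample document and the choice of collecting `title` elements are all hypothetical.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element while the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # not None only while inside a <title>

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so it is accumulated.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


# Hypothetical sample document.
SAMPLE = b"<doc><title>Layout</title><body>text</body><title>Logic</title></doc>"


def collect_titles(xml_bytes):
    """Run one forward pass over the document and return the titles found."""
    handler = TitleCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles


print(collect_titles(SAMPLE))  # ['Layout', 'Logic']
```

The handler never sees the document as a whole: anything it needs later must be saved when the corresponding event fires, which is the forward-only constraint the note describes.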
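
The correction in note 9 can be worked through numerically. The sketch below applies the note's two formulas literally, including the offset exactly as the note writes it. The function names and the sample rectangles are hypothetical and are not JTOCR code.

```python
def naive_crop(selection, zoom):
    """Crop rectangle as computed by the buggy formula: no offset is applied."""
    x_s, y_s, w_s, h_s = selection
    return (x_s / zoom, y_s / zoom, w_s / zoom, h_s / zoom)


def corrected_crop(selection, panel, image, zoom):
    """Crop rectangle with the panel/image offset taken into account.

    All rectangles are (x, y, w, h) tuples; `zoom` is the zoom ratio p.
    """
    x_s, y_s, w_s, h_s = selection
    x_p, y_p, _, _ = panel
    x_i, y_i, _, _ = image
    # Offset between container panel and image, as given in the note.
    x_o = (x_p - x_i) / 2
    y_o = (y_p - y_i) / 2
    return ((x_s - x_o) / zoom, (y_s - y_o) / zoom, w_s / zoom, h_s / zoom)


# Arbitrary sample values, chosen only to make the arithmetic easy to follow.
SELECTION = (140, 90, 200, 100)
PANEL = (100, 60, 800, 600)
IMAGE = (20, 20, 400, 300)
ZOOM = 2

print(naive_crop(SELECTION, ZOOM))                     # (70.0, 45.0, 100.0, 50.0)
print(corrected_crop(SELECTION, PANEL, IMAGE, ZOOM))   # (50.0, 35.0, 100.0, 50.0)
```

With these values the offset is (40, 20), so the corrected top-left corner moves by (20, 10) in image coordinates, while width and height are unchanged.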
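
The structure defined in note 10 can be represented by letting each node hold a list of groups of children. The sketch below is one possible encoding. The `Node` class and the notion of a node being "solved" are assumptions added for illustration, since the note defines only the AND/OR grouping.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    solved: bool = False  # meaningful for leaves only
    # Children partitioned into groups: AND within a group, OR across groups.
    groups: List[List["Node"]] = field(default_factory=list)


def is_solved(node):
    """A leaf is solved if flagged so; an inner node is solved if at least
    one of its groups has all of its members solved."""
    if not node.groups:
        return node.solved
    return any(all(is_solved(child) for child in group) for group in node.groups)


# Hypothetical example: root = (a AND b) OR (c)
a = Node("a", solved=True)
b = Node("b", solved=False)
c = Node("c", solved=True)
root = Node("root", groups=[[a, b], [c]])

print(is_solved(root))  # True, through the group containing only c
```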
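
The procedure outlined in note 11 for Kruskal's algorithm can be sketched as follows. The note does not say how cycles are detected. The union-find structure used here is a common choice and is an assumption of this sketch, as are the function name and the sample graph.

```python
def kruskal(num_nodes, edges):
    """Minimum spanning tree of a connected graph.

    `edges` is an iterable of (weight, u, v) with nodes numbered
    0 .. num_nodes - 1. Returns (total_weight, [(u, v, weight), ...]).
    """
    parent = list(range(num_nodes))  # union-find forest, one set per node

    def find(x):
        # Follow parent links to the set representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree, total = [], 0
    for weight, u, v in sorted(edges):  # next unused edge of minimum weight
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # both ends already connected: the edge would yield a cycle
        parent[root_u] = root_v
        tree.append((u, v, weight))
        total += weight
        if len(tree) == num_nodes - 1:
            break  # a spanning tree has been obtained
    return total, tree


# Hypothetical sample graph with 4 nodes.
EDGES = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]

print(kruskal(4, EDGES))  # (6, [(0, 1, 1), (1, 3, 2), (1, 2, 3)])
```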
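
The patterns mentioned in note 12 can be produced with Python's `datetime` module. This is a minimal sketch under several assumptions: the separators follow the ISO 8601 extended format, times are expressed in UTC with a trailing `Z`, and the four granularities shown are illustrative rather than the full set the profile allows.

```python
from datetime import datetime, timezone


def date_patterns(moment):
    """Render a timezone-aware datetime at increasing levels of detail."""
    utc = moment.astimezone(timezone.utc)
    return {
        "year": utc.strftime("%Y"),                                  # YYYY
        "year-month": utc.strftime("%Y-%m"),                         # YYYY-MM
        "date": utc.strftime("%Y-%m-%d"),                            # YYYY-MM-DD
        "date-time": utc.strftime("%Y-%m-%dT%H:%M:%S") + "Z",        # ...Thh:mm:ssZ
    }


# Arbitrary sample instant.
SAMPLE_MOMENT = datetime(2011, 3, 7, 14, 5, 9, tzinfo=timezone.utc)

for label, text in date_patterns(SAMPLE_MOMENT).items():
    print(label, text)
```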

References

  1. Document Object Model (DOM) Level 1 Specification—version 1.0. Tech. rep. REC-DOM-Level-1-19981001, W3C (1998)

  2. Document Object Model (DOM) Level 2 Core Specification. Tech. rep. 1.0, W3C (2000)

  3. Dublin Core metadata element set version 1.1. Tech. rep. 15836, International Organization for Standardization (2009)

  4. Altamura, O., Esposito, F., Malerba, D.: Transforming paper documents into XML format with WISDOM++. International Journal on Document Analysis and Recognition 4, 2–17 (2001)

  5. Baird, H.S.: The skew angle of printed documents. In: Proceedings of the Conference of the Society of Photographic Scientists and Engineers, pp. 14–21 (1987)

  6. Baird, H.S.: Background structure in document images. In: Advances in Structural and Syntactic Pattern Recognition, pp. 17–34. World Scientific, Singapore (1992)

  7. Baird, H.S.: Document image defect models. In: Baird, H.S., Bunke, H., Yamamoto, K. (eds.) Structured Document Image Analysis, pp. 546–556. Springer, Berlin (1992)

  8. Baird, H.S., Jones, S., Fortune, S.: Image segmentation by shape-directed covers. In: Proceedings of the 10th International Conference on Pattern Recognition (ICPR), pp. 820–825 (1990)

  9. Berkhin, P.: Survey of clustering Data Mining techniques. Tech. rep., Accrue Software, San Jose, CA (2002)

  10. Breuel, T.M.: Two geometric algorithms for layout analysis. In: Proceedings of the 5th International Workshop on Document Analysis Systems (DAS). Lecture Notes in Computer Science, vol. 2423, pp. 188–199. Springer, Berlin (2002)

  11. Cao, H., Prasad, R., Natarajan, P., MacRostie, E.: Robust page segmentation based on smearing and error correction unifying top-down and bottom-up approaches. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 392–396. IEEE Computer Society, Los Alamitos (2007)

  12. Cesarini, F., Marinai, S., Soda, G., Gori, M.: Structured document segmentation and representation by the Modified X–Y tree. In: Proceedings of the 5th International Conference on Document Analysis and Recognition (ICDAR), pp. 563–566. IEEE Computer Society, Los Alamitos (1999)

  13. Chaudhuri, B.: Digital Document Processing—Major Directions and Recent Advances. Springer, Berlin (2007)

  14. Chen, Q.: Evaluation of OCR algorithms for images with different spatial resolution and noise. Ph.D. thesis, University of Ottawa, Canada (2003)

  15. Ciardiello, G., Scafuro, G., Degrandi, M., Spada, M., Roccotelli, M.: An experimental system for office document handling and text recognition. In: Proceedings of the 9th International Conference on Pattern Recognition (ICPR), pp. 739–743 (1988)

  16. Egenhofer, M.J.: Reasoning about binary topological relations. In: Gunther, O., Schek, H.J. (eds.) 2nd Symposium on Large Spatial Databases. Lecture Notes in Computer Science, vol. 525, pp. 143–160. Springer, Berlin (1991)

  17. Egenhofer, M.J., Herring, J.R.: A mathematical framework for the definition of topological relationships. In: Proceedings of the 4th International Symposium on Spatial Data Handling, pp. 803–813 (1990)

  18. Egenhofer, M.J., Sharma, J., Mark, D.M.: A critical comparison of the 4-intersection and 9-intersection models for spatial relations: Formal analysis. In: Proceedings of the 11th International Symposium on Computer-Assisted Cartography (Auto-Carto) (1993)

  19. Esposito, F., Ferilli, S., Basile, T.M.A., Di Mauro, N.: Machine Learning for digital document processing: from layout analysis to metadata extraction. In: Marinai, S., Fujisawa, H. (eds.) Machine learning in Document Analysis and Recognition. Studies in Computational Intelligence, vol. 90, pp. 105–138. Springer, Berlin (2008)

  20. Esposito, F., Ferilli, S., Fanizzi, N., Basile, T.M., Di Mauro, N.: Incremental multistrategy learning for document processing. Applied Artificial Intelligence: An International Journal 17(8/9), 859–883 (2003)

  21. Fateman, R.J., Tokuyasu, T.: A suite of lisp programs for document image analysis and structuring. Tech. rep., Computer Science Division, EECS Department—University of California at Berkeley (1994)

  22. Ferilli, S., Basile, T.M.A., Esposito, F.: A histogram-based technique for automatic threshold assessment in a Run Length Smoothing-based algorithm. In: Proceedings of the 9th International Workshop on Document Analysis Systems (DAS). ACM International Conference Proceedings, pp. 349–356 (2010)

  23. Ferilli, S., Biba, M., Esposito, F., Basile, T.M.A.: A distance-based technique for non-Manhattan layout analysis. In: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 231–235 (2009)

  24. Frank, A.U.: Qualitative spatial reasoning: Cardinal directions as an example. International Journal of Geographical Information Systems 10(3), 269–290 (1996)

  25. Gatos, B., Pratikakis, I., Ntirogiannis, K.: Segmentation based recovery of arbitrarily warped document images. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), pp. 989–993 (2007)

  26. Impedovo, S., Ottaviano, L., Occhinegro, S.: Optical character recognition—a survey. International Journal on Pattern Recognition and Artificial Intelligence 5(1–2), 1–24 (1991)

  27. Kainz, W., Egenhofer, M.J., Greasley, I.: Modeling spatial relations and operations with partially ordered sets. International Journal of Geographical Information Systems 7(3), 215–229 (1993)

  28. Kakas, A.C., Mancarella, P.: On the relation of truth maintenance and abduction. In: Proceedings of the 1st Pacific Rim International Conference on Artificial Intelligence (PRICAI), pp. 438–443 (1990)

  29. Kise, K., Sato, A., Iwata, M.: Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding 70(3), 370–382 (1998)

  30. Michalski, R.S.: Inferential theory of learning. Developing foundations for multistrategy learning. In: Michalski, R., Tecuci, G. (eds.) Machine Learning. A Multistrategy Approach, vol. IV, pp. 3–61. Morgan Kaufmann, San Mateo (1994)

  31. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)

  32. Mori, S., Suen, C.Y., Yamamoto, K.: Historical review of OCR research and development. Proceedings of the IEEE 80(7), 1029–1058 (1992)

  33. Nagy, G.: Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(1), 38–62 (2000)

  34. Nagy, G., Kanai, J., Krishnamoorthy, M.: Two complementary techniques for digitized document analysis. In: ACM Conference on Document Processing Systems (1988)

  35. Nagy, G., Seth, S., Viswanathan, M.: A prototype document image analysis system for technical journals. Computer 25(7), 10–22 (1992)

  36. Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents. In: Proceedings of the 7th International Conference on Pattern Recognition (ICPR), pp. 347–349. IEEE Computer Society Press, Los Alamitos (1984)

  37. Nienhuys-Cheng, S.H., de Wolf, R. (eds.): Foundations of Inductive Logic Programming. Lecture Notes in Computer Science, vol. 1228. Springer, Berlin (1997)

  38. O’Gorman, L.: The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1162–1173 (1993)

  39. O’Gorman, L., Kasturi, R.: Document Image Analysis. IEEE Computer Society, Los Alamitos (1995)

  40. Papadias, D., Theodoridis, Y.: Spatial relations, minimum bounding rectangles, and spatial data structures. International Journal of Geographical Information Science 11(2), 111–138 (1997)

  41. Papamarkos, N., Tzortzakis, J., Gatos, B.: Determination of run-length smoothing values for document segmentation. In: Proceedings of the International Conference on Electronic Circuits and Systems (ICECS), vol. 2, pp. 684–687 (1996)

  42. Pavlidis, T., Zhou, J.: Page segmentation by white streams. In: Proceedings of the 1st International Conference on Document Analysis and Recognition (ICDAR), pp. 945–953 (1991)

  43. Rice, S.V., Jenkins, F.R., Nartker, T.A.: The fourth annual test of OCR accuracy. Tech. rep. 95-03, Information Science Research Institute, University of Nevada, Las Vegas (1995)

  44. Salembier, P., Marques, F.: Region-based representations of image and video: Segmentation tools for multimedia services. IEEE Transactions on Circuits and Systems for Video Technology 9(8), 1147–1169 (1999)

  45. Shafait, F., Smith, R.: Table detection in heterogeneous documents. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS). ACM International Conference Proceedings, pp. 65–72 (2010)

  46. Shih, F., Chen, S.S.: Adaptive document block segmentation and classification. IEEE Transactions on Systems, Man, and Cybernetics—Part B 26(5), 797–802 (1996)

  47. Simon, A., Pret, J.C., Johnson, A.P.: A fast algorithm for bottom-up document layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(3), 273–277 (1997)

  48. Skiena, S.S.: The Algorithm Design Manual, 2nd edn. Springer, Berlin (2008)

  49. Smith, R.: A simple and efficient skew detection algorithm via text row accumulation. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition (ICDAR), pp. 1145–1148. IEEE Computer Society, Los Alamitos (1995)

  50. Smith, R.: An overview of the Tesseract OCR engine. In: Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR), pp. 629–633. IEEE Computer Society, Los Alamitos (2007)

  51. Smith, R.: Hybrid page layout analysis via tab-stop detection. In: Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR), pp. 241–245. IEEE Computer Society, Los Alamitos (2009)

  52. Sun, H.M.: Page segmentation for Manhattan and non-Manhattan layout documents via selective CRLA. In: Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR), pp. 116–120. IEEE Computer Society, Los Alamitos (2005)

  53. Wahl, F., Wong, K., Casey, R.: Block segmentation and text extraction in mixed text/image documents. Computer Graphics and Image Processing 20, 375–390 (1982)

  54. Wang, D., Srihari, S.N.: Classification of newspaper image blocks using texture analysis. Computer Vision, Graphics, and Image Processing 47, 327–352 (1989)

  55. Wong, K.Y., Casey, R., Wahl, F.M.: Document analysis system. IBM Journal of Research and Development 26, 647–656 (1982)

  56. Zucker, J.D.: Semantic abstraction for concept representation and learning. In: Proceedings of the 4th International Workshop on Multistrategy Learning (MSL), pp. 157–164 (1998)

Author information

Corresponding author

Correspondence to Stefano Ferilli.

Copyright information

© 2011 Springer-Verlag London Limited

About this chapter

Cite this chapter

Ferilli, S. (2011). Document Image Analysis. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_5

  • DOI: https://doi.org/10.1007/978-0-85729-198-1_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-0-85729-197-4

  • Online ISBN: 978-0-85729-198-1

  • eBook Packages: Computer Science (R0)
