A Study on the Document Zone Content Classification Problem

Wang, Yalin; Phillips, Ihsin T.; Haralick, Robert M.

doi:10.1007/3-540-45869-7_25

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2423)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for determining the content type of a given zone within an input document image. In our zone classification algorithm, zones are represented as feature vectors. Each feature vector consists of a set of 25 measurements of pre-defined properties. A probabilistic model, a decision tree, is used to classify each zone on the basis of its feature vector. Two methods are used to optimize the decision tree classifier to eliminate the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We also model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The data set used for training, pruning, and testing the algorithm includes 1,600 images drawn from the UWCDROM-III document image database. With a total of 24,177 zones within the data set, the cross-validation method was used in the performance evaluation of the classifier. The classifier is able to classify each given scientific and technical document zone into one of nine classes: two text classes (font sizes 4–18 pt and 19–32 pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm achieves an accuracy of 98.45% with a mean false alarm rate of 0.50%.
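
The context step mentioned in the abstract (modeling zone class sequences as a Hidden Markov Model and decoding them with the Viterbi algorithm) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `viterbi`, the three-class label set, and all probability values are hypothetical and chosen only for illustration, and the sketch assumes that per-zone class probabilities (for instance, from a decision tree) serve as the emission scores.

```python
import math


def viterbi(init, trans, emit):
    """Return the most probable class index sequence for a run of zones.

    init[k]     : probability that the first zone has class k
    trans[j][k] : probability that class k follows class j
    emit[t][k]  : score of class k for zone t (e.g. a classifier output)
    """
    n_zones, n_classes = len(emit), len(init)

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    # score[k] = best log-probability of any path ending in class k
    score = [log(init[k]) + log(emit[0][k]) for k in range(n_classes)]
    back = []  # back[t - 1][k] = best predecessor of class k at zone t

    for t in range(1, n_zones):
        new_score, pointers = [], []
        for k in range(n_classes):
            best_j = max(range(n_classes),
                         key=lambda j: score[j] + log(trans[j][k]))
            pointers.append(best_j)
            new_score.append(score[best_j] + log(trans[best_j][k])
                             + log(emit[t][k]))
        score = new_score
        back.append(pointers)

    # Trace back from the best final class.
    path = [max(range(n_classes), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative (made-up) values: three classes, three consecutive zones.
labels = ["text", "math", "table"]
init = [0.8, 0.1, 0.1]
trans = [[0.8, 0.1, 0.1],
         [0.6, 0.3, 0.1],
         [0.6, 0.1, 0.3]]
emit = [[0.90, 0.05, 0.05],
        [0.40, 0.50, 0.10],   # taken alone, this zone looks like math
        [0.90, 0.05, 0.05]]

print([labels[k] for k in viterbi(init, trans, emit)])
# prints ['text', 'text', 'text']
```

In this toy example the middle zone scores slightly higher as math when considered in isolation, but the transition probabilities favor text following text, so the decoded sequence labels it as text. With uniform transition probabilities the context carries no information and the per-zone decision is returned unchanged.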

Author information

Authors and Affiliations

Dept. of Elect. Eng., Univ. of Washington, 98195, Seattle, WA, US
Yalin Wang
Dept. of Comp. Science, Queens College, City Univ. of New York, 11367, Flushing, NY, US
Ihsin T. Phillips
The Graduate School, City Univ. of New York, 10016, New York, NY, US
Robert M. Haralick

Editor information

Editors and Affiliations

Bell Labs, Lucent Technologies, 600 Mountain Avenue, 07974, Murray Hill, NJ, USA
Daniel Lopresti
Avaya Labs Research, 233 Mount Airy Road, 07920, Basking Ridge, NJ, USA
Jianying Hu & Ramanujan Kashi

About this paper

Cite this paper

Wang, Y., Phillips, I.T., Haralick, R.M. (2002). A Study on the Document Zone Content Classification Problem. In: Lopresti, D., Hu, J., Kashi, R. (eds) Document Analysis Systems V. DAS 2002. Lecture Notes in Computer Science, vol 2423. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45869-7_25

DOI: https://doi.org/10.1007/3-540-45869-7_25
Published: 09 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44068-0
Online ISBN: 978-3-540-45869-2
eBook Packages: Springer Book Archive
