
Consensus-based table form recognition of low-quality historical documents

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

Given a set of low-quality, line-delimited tabular documents of the same layout, we present a robust zoning algorithm that exploits both intra- and inter-document consensus to extract the structure of the table. The structure is captured in the form of a document template that can then be snapped to a new document to perform automated “cookie cutter” data extraction. We also report a companion consensus-based algorithm for classifying zone content as machine print, handwriting, or empty. Using scanned Census records from 1841 to 1881, the template is recovered with an efficiency of 0.076 (range [0, 1)). Using consensus over about 10 documents from each data set, this error was reduced to 0.0076, or by 90%, which amounts to two missing line segments and one false positive. Similarly, the error for coverage was reduced from 0.098 to 0.016, or by 83%. Use of consensus also resulted in machine print classification accuracy of 100% for two of the three data sets. The classification error for handwriting averaged 0.1225 per document. By exploiting consensus within and between documents, automated zoning and labeling are greatly improved, providing field-level indexing of document content.
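The idea of inter-document consensus can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes each document has already yielded a list of detected line coordinates registered to a common frame, and the names `consensus_lines`, `tol`, and `min_support` are hypothetical. A line is kept only if enough documents agree on it. The helper `relative_reduction` simply reproduces the percentage reductions quoted in the abstract.

```python
from typing import List


def consensus_lines(detections: List[List[float]], tol: float = 3.0,
                    min_support: float = 0.5) -> List[float]:
    """Keep line positions found in at least `min_support` of the documents.

    detections[i] holds the line coordinates detected in document i, assumed
    to be already registered to a common frame (a hypothetical earlier step).
    """
    # Pool every detection, tagged with the document it came from.
    pooled = sorted((pos, doc) for doc, lines in enumerate(detections)
                    for pos in lines)

    clusters = []  # each cluster is a list of (position, document) pairs
    for pos, doc in pooled:
        # Extend the current cluster while the gap to the previous
        # detection is within tol; otherwise start a new cluster.
        if clusters and pos - clusters[-1][-1][0] <= tol:
            clusters[-1].append((pos, doc))
        else:
            clusters.append([(pos, doc)])

    needed = min_support * len(detections)
    template = []
    for cluster in clusters:
        voters = {doc for _, doc in cluster}  # each document votes once
        if len(voters) >= needed:
            template.append(sum(p for p, _ in cluster) / len(cluster))
    return template


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error measure drops (0.9 means a 90% drop)."""
    return (before - after) / before


docs = [
    [100.0, 200.0, 300.0],
    [101.0, 199.0],                 # missed the line near 300
    [99.0, 201.0, 300.0, 450.0],    # 450 is a spurious detection
]
print(consensus_lines(docs))        # [100.0, 200.0, 300.0]
print(round(relative_reduction(0.076, 0.0076), 3))
print(round(relative_reduction(0.098, 0.016), 3))
```

In this toy input the missed line is recovered because two of three documents support it, and the spurious line is dropped because only one does, which mirrors the two failure types the abstract mentions (missing segments and false positives).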



Author information

Corresponding author

Correspondence to H. Nielson.

Additional information

Heath Nielson received his B.S. in 1998 and his M.S. in 2003, both in computer science, from Brigham Young University, Provo, Utah. He now works on microfilm scanning technology at the Church of Jesus Christ of Latter-day Saints in Salt Lake City, Utah.

William Barrett received his Ph.D. (1978) in medical biophysics and computing and his undergraduate degree in mathematics from the University of Utah. He was a research fellow at the National Institutes of Health in the Division of Computer Research and Technology, where he worked with the National Heart, Lung, and Blood Institute. He is also a member of IEEE and ACM and has over 60 refereed publications. He is currently at BYU and heads an active research group that works in the areas of computer vision, pattern recognition, and image processing.


About this article

Cite this article

Nielson, H., Barrett, W. Consensus-based table form recognition of low-quality historical documents. IJDAR 8, 183–200 (2006). https://doi.org/10.1007/s10032-005-0002-9

  • DOI: https://doi.org/10.1007/s10032-005-0002-9