Abstract
Given a set of low-quality line-delimited tabular documents of the same layout, we present a robust zoning algorithm which exploits both intra- and inter-document consensus to extract the structure of the table. The structure is captured in the form of a document template that can then be snapped to a new document to perform automated “cookie cutter” data extraction. We also report a companion consensus-based algorithm for the classification of zone content as either machine print, handwriting or empty. Using scanned Census records from 1841 to 1881, the template is recovered with an efficiency error of 0.076 on a scale of [0, 1). Using consensus over about 10 documents from each data set, this error was reduced to 0.0076, or by 90%, which amounts to two missing line segments and one false positive. Similarly, the error for coverage was reduced from 0.098 to 0.016, or by 83%. Use of consensus also resulted in machine print classification accuracy of 100% for two of the three data sets. The classification error for handwriting averaged 0.1225 per document. By exploiting consensus within and between documents, automated zoning and labeling are greatly improved, providing field-level indexing of document content.
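The idea of inter-document consensus can be illustrated with a minimal sketch. This is not the authors' algorithm, only one simple way to vote on line positions under stated assumptions: each document contributes a list of detected line coordinates in pixels, all documents are already registered to a common frame, and a line is kept if more than half of the documents support it. The function name `consensus_lines` and the parameters `tolerance` and `min_fraction` are hypothetical.

```python
from statistics import median


def consensus_lines(detections, tolerance=3.0, min_fraction=0.5):
    """Return line positions supported by more than min_fraction of documents.

    detections: one list of detected line positions (pixels) per document.
    tolerance and min_fraction are illustrative assumptions, not values
    taken from the paper.
    """
    # Pool every detection together with the index of its source document.
    pooled = sorted(
        (pos, doc)
        for doc, positions in enumerate(detections)
        for pos in positions
    )

    # Greedy 1-D clustering: a detection joins the current cluster while it
    # lies within `tolerance` of the first position in that cluster.
    clusters = []
    for pos, doc in pooled:
        if clusters and pos - clusters[-1][0][0] <= tolerance:
            clusters[-1].append((pos, doc))
        else:
            clusters.append([(pos, doc)])

    # Keep clusters backed by enough distinct documents; report their median.
    needed = min_fraction * len(detections)
    template = []
    for cluster in clusters:
        supporters = {doc for _, doc in cluster}
        if len(supporters) > needed:
            template.append(median(pos for pos, _ in cluster))
    return template


if __name__ == "__main__":
    docs = [
        [100, 200, 300],
        [101, 199, 301, 250],  # 250 is a spurious detection
        [99, 300],             # the line near 200 was missed here
        [100, 201, 299],
    ]
    print(consensus_lines(docs))
```

In this toy input, the spurious detection at 250 is dropped because only one document supports it, while the line near 200 survives even though one document missed it. That is the kind of false-positive and missing-segment correction the abstract attributes to consensus.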
Author information
Additional information
Heath Nielson received his B.S. (1998) and M.S. (2003) degrees in computer science from Brigham Young University, Provo, Utah. He now works on microfilm scanning technology at the Church of Jesus Christ of Latter-day Saints in Salt Lake City, Utah.
William Barrett received his Ph.D. (1978) in medical biophysics and computing and his undergraduate degree in mathematics from the University of Utah. He was a research fellow at the National Institutes of Health in the Division of Computer Research and Technology, where he worked with the National Heart, Lung, and Blood Institute. He is also a member of IEEE and ACM and has over 60 refereed publications. He is currently at BYU and heads an active research group that works in the areas of computer vision, pattern recognition, and image processing.
About this article
Cite this article
Nielson, H., Barrett, W. Consensus-based table form recognition of low-quality historical documents. IJDAR 8, 183–200 (2006). https://doi.org/10.1007/s10032-005-0002-9
DOI: https://doi.org/10.1007/s10032-005-0002-9