Abstract
SGML is a language for defining the layout structure of a document. Various attempts at generating SGML from a document image have not been successful. We focus on extracting some of the important layout elements by using a flexible matching strategy and easy model generation. Our proposed approach treats each extracted element as if it were independent. Segmented areas such as “title” or “author” are defined locally, which makes the system robust to shifting and noise. The system is also easy to operate. Because it is not fully automatic, the user must supply a typical model of each component. Our GUI presents the attributes of each segmented area alongside the original bitmap image, and the color-coded attributes make it easy to edit the extracted components. In experiments with 288 pages of test images, the proposed method was 95.6% correct over a wide range of documents. Using 145 pages of documents as a learning set, the system recognized 99.2% of feature sets from 148 unknown documents of various types.
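The abstract does not give the matching procedure, so the following is only a minimal sketch of what matching each element independently against a locally defined, user-supplied model could look like. The names `Area`, `ElementModel`, `match_score`, and `extract_elements`, the bounding-box features, the shift tolerance, and the threshold are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Area:
    """A segmented area: bounding box in page pixels (hypothetical features)."""
    x: float
    y: float
    w: float
    h: float


@dataclass
class ElementModel:
    """A user-supplied model of one element, e.g. "title" (hypothetical)."""
    label: str
    x: float
    y: float
    w: float
    h: float
    tolerance: float  # assumed maximum shift, in pixels, still accepted


def match_score(area: Area, model: ElementModel) -> float:
    """Score one area against one model; 0.0 if it is shifted too far."""
    dx = abs(area.x - model.x)
    dy = abs(area.y - model.y)
    if dx > model.tolerance or dy > model.tolerance:
        return 0.0
    # Penalize relative differences in width and height.
    size_penalty = abs(area.w - model.w) / model.w + abs(area.h - model.h) / model.h
    return max(0.0, 1.0 - 0.5 * size_penalty)


def extract_elements(areas, models, threshold=0.5):
    """Match every model on its own, ignoring the rest of the page layout."""
    found = {}
    for model in models:
        best = max(areas, key=lambda a: match_score(a, model), default=None)
        if best is not None and match_score(best, model) >= threshold:
            found[model.label] = best
    return found


# Example page: the whole layout is shifted about 10 pixels from the models.
areas = [Area(100, 50, 400, 40), Area(120, 110, 300, 20), Area(60, 300, 500, 600)]
models = [
    ElementModel("title", 110, 60, 400, 40, tolerance=30),
    ElementModel("author", 110, 100, 300, 20, tolerance=30),
]
found = extract_elements(areas, models)
```

In this sketch each model is evaluated without reference to any other element, so a missing or noisy area elsewhere on the page does not affect whether “title” or “author” is found.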
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
Cite this paper
Kochi, T., Saitoh, T. (1999). A Layout-Free Method for Extracting Elements from Document Images. In: Lee, SW., Nakano, Y. (eds) Document Analysis Systems: Theory and Practice. DAS 1998. Lecture Notes in Computer Science, vol 1655. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48172-9_18
DOI: https://doi.org/10.1007/3-540-48172-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66507-6
Online ISBN: 978-3-540-48172-0
eBook Packages: Springer Book Archive