Abstract
Portable Document Format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and lack basic high-level logical structure information, which makes them difficult to reuse or modify. We developed techniques that identify the logical components on a PDF document page. The outlines, style attributes, and contents of these logical components are extracted and expressed in an XML format. These techniques could facilitate the reuse and modification of the layout and content of a PDF document page.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chao, H., Fan, J. (2004). Layout and Content Extraction for PDF Documents. In: Marinai, S., Dengel, A.R. (eds) Document Analysis Systems VI. DAS 2004. Lecture Notes in Computer Science, vol 3163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28640-0_20
DOI: https://doi.org/10.1007/978-3-540-28640-0_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23060-1
Online ISBN: 978-3-540-28640-0
eBook Packages: Springer Book Archive