Layout and Content Extraction for PDF Documents

Chao, Hui; Fan, Jian

doi:10.1007/978-3-540-28640-0_20

Hui Chao & Jian Fan

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3163)

Abstract

Portable Document Format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and lack basic high-level information about their logical structure, which makes them difficult to reuse or modify. We developed techniques that identify the logical components on a PDF document page. The outlines, style attributes, and contents of these logical components are extracted and expressed in an XML format. These techniques could facilitate the reuse and modification of the layout and content of a PDF document page.
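
To illustrate the kind of output the abstract describes, here is a minimal sketch of expressing one logical component (outline, style attributes, content) as XML. The element and attribute names (`page`, `component`, `outline`, `point`, `style`, `content`), the helper `component_to_xml`, and the sample values are all assumptions made for illustration; the abstract does not specify the authors' actual schema.

```python
import xml.etree.ElementTree as ET


def component_to_xml(kind, outline, style, content):
    """Serialize one logical page component into an XML element.

    kind    -- hypothetical label such as "text", "image", or "graphics"
    outline -- list of (x, y) polygon vertices in page coordinates
    style   -- dict of style attributes, e.g. font name and size
    content -- extracted text (empty string for non-text components)
    """
    comp = ET.Element("component", {"type": kind})

    # The outline is stored as an ordered list of polygon vertices.
    outline_el = ET.SubElement(comp, "outline")
    for x, y in outline:
        ET.SubElement(outline_el, "point", {"x": str(x), "y": str(y)})

    # Style attributes become XML attributes on a single <style> element.
    ET.SubElement(comp, "style", {k: str(v) for k, v in style.items()})

    # The extracted content is kept as element text.
    ET.SubElement(comp, "content").text = content
    return comp


# Hypothetical sample: one rectangular text block on page 1.
page = ET.Element("page", {"number": "1"})
page.append(component_to_xml(
    "text",
    [(72, 700), (540, 700), (540, 650), (72, 650)],
    {"font": "Times-Roman", "size": 10},
    "Portable document format (PDF) is a common output format.",
))

xml_text = ET.tostring(page, encoding="unicode")
print(xml_text)
```

Keeping the outline as a vertex list rather than a bounding box lets the same structure describe non-rectangular components, though whether the original work does so is not stated in the abstract.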

Author information

Authors and Affiliations

Hewlett-Packard Laboratories, 1501 Page Mill Road, ms 1203, Palo Alto, CA, 95129, USA
Hui Chao & Jian Fan

Editor information

Editors and Affiliations

Dipartimento di Sistemi e Informatica, Università di Firenze, Via di Santa Marta 3, 50139, Firenze, Italy
Simone Marinai
Knowledge Management Department, German Research Center for Artificial Intelligence (DFKI) GmbH, Kaiserslautern, Germany
Andreas R. Dengel

About this paper

Cite this paper

Chao, H., Fan, J. (2004). Layout and Content Extraction for PDF Documents. In: Marinai, S., Dengel, A.R. (eds) Document Analysis Systems VI. DAS 2004. Lecture Notes in Computer Science, vol 3163. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-28640-0_20

DOI: https://doi.org/10.1007/978-3-540-28640-0_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23060-1
Online ISBN: 978-3-540-28640-0
eBook Packages: Springer Book Archive
