Layout and Content Extraction for PDF Documents

  • Hui Chao
  • Jian Fan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3163)


Portable Document Format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and lack basic high-level logical structure information, which makes them difficult to reuse or modify. We developed techniques that identify the logical components on a PDF document page. The outlines, style attributes, and contents of the logical components are extracted and expressed in an XML format. These techniques could facilitate the reuse and modification of the layout and content of a PDF document page.
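
As a rough illustration of what expressing logical components in XML could look like, the sketch below serializes a few components with an outline, style attributes, and content. The abstract does not specify the schema, so the `Component` class, the `page_to_xml` function, and every element and attribute name here are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Component:
    """A hypothetical logical component found on a page (not the authors' data model)."""
    kind: str                                   # e.g. "text", "image", "graphics"
    outline: list                               # polygon vertices as (x, y) tuples
    style: dict = field(default_factory=dict)   # e.g. font name and size
    content: str = ""                           # extracted text; empty for non-text


def page_to_xml(page_number, components):
    """Serialize components to an XML string using an illustrative schema."""
    page = ET.Element("page", number=str(page_number))
    for comp in components:
        node = ET.SubElement(page, "component", type=comp.kind)
        outline = ET.SubElement(node, "outline")
        for x, y in comp.outline:
            ET.SubElement(outline, "point", x=str(x), y=str(y))
        if comp.style:
            # Style attributes become XML attributes on a <style> element.
            ET.SubElement(node, "style", {k: str(v) for k, v in comp.style.items()})
        if comp.content:
            ET.SubElement(node, "content").text = comp.content
    return ET.tostring(page, encoding="unicode")


# Usage: one text block described by a rectangular outline in page coordinates.
block = Component(
    kind="text",
    outline=[(72, 700), (300, 700), (300, 650), (72, 650)],
    style={"font": "Times-Roman", "size": 10},
    content="Sample paragraph text",
)
xml_text = page_to_xml(1, [block])
```

Storing the outline as a list of points rather than a single bounding box is a design choice for this sketch: it leaves room for non-rectangular components.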


Keywords: Document Image, Text Line, Text Segment, Text Block, Portable Document Format


Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Hui Chao (1)
  • Jian Fan (1)
  1. Hewlett-Packard Laboratories, Palo Alto, USA
