Integration of Text Information and Graphic Composite for PDF Document Analysis

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 333)

Abstract

Large-scale digitization has motivated research on processing PDF documents that carry little structural information. Challenging problems, such as segmenting graphics that are interleaved with text, remain unsolved and stand in the way of practical PDF layout analysis. To cope with such documents, a hybrid method that combines text information with graphic composites is proposed to segment pages that traditional methods handle poorly. Specifically, text information is derived accurately from born-digital documents, which embed low-level structural elements in explicit form. The text elements of each page are then clustered with a graph-based method according to proximity and feature similarity. Meanwhile, graphic components are extracted by means of texture and morphological analysis. Integrating the clustered text elements with the image-based graphic components yields the graphic segmentation used for layout analysis. Experimental results on pages of PDF books show satisfactory performance.
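
The abstract does not spell out the graph-based clustering step. As a rough illustration only, the sketch below groups text elements with a greedy merge criterion in the style of Felzenszwalb and Huttenlocher's graph-based segmentation. It is not the authors' implementation: the `TextElement` fields, the edge weight (gap between boxes plus a font-size penalty), and the parameters `font_penalty` and `k` are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TextElement:
    """Hypothetical text element: bounding box plus one feature (font size)."""
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float


def box_gap(a: TextElement, b: TextElement) -> float:
    """Euclidean gap between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def edge_weight(a: TextElement, b: TextElement, font_penalty: float = 10.0) -> float:
    """Assumed dissimilarity: spatial proximity plus feature (font size) difference."""
    return box_gap(a, b) + font_penalty * abs(a.font_size - b.font_size)


def cluster_text_elements(elements, k: float = 30.0):
    """Return one cluster label per element.

    Edges are visited in order of increasing weight. Two clusters are merged
    when the edge weight does not exceed either cluster's internal difference
    plus a size-dependent tolerance k / |cluster|.
    """
    n = len(elements)
    parent = list(range(n))   # union-find forest
    size = [1] * n            # cluster sizes, valid at roots
    internal = [0.0] * n      # largest merged edge weight, valid at roots

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Fully connected graph; fine for a sketch, quadratic in the element count.
    edges = sorted(
        (edge_weight(elements[i], elements[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        threshold = min(internal[ri] + k / size[ri], internal[rj] + k / size[rj])
        if w <= threshold:
            parent[rj] = ri
            size[ri] += size[rj]
            internal[ri] = max(internal[ri], internal[rj], w)
    return [find(i) for i in range(n)]
```

For instance, three closely stacked lines set in the same font size and a distant caption in a smaller size would end up in two clusters under these default parameters: the lines together, the caption on its own. A real pipeline would likely restrict edges to spatial neighbours rather than connect every pair.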

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Xu, C., Tang, Z., Tao, X., Shi, C. (2012). Integration of Text Information and Graphic Composite for PDF Document Analysis. In: Zhou, M., Zhou, G., Zhao, D., Liu, Q., Zou, L. (eds) Natural Language Processing and Chinese Computing. NLPCC 2012. Communications in Computer and Information Science, vol 333. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34456-5_2

  • DOI: https://doi.org/10.1007/978-3-642-34456-5_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34455-8

  • Online ISBN: 978-3-642-34456-5

  • eBook Packages: Computer Science, Computer Science (R0)
