Integration of Text Information and Graphic Composite for PDF Document Analysis

Xu, Canhui; Tang, Zhi; Tao, Xin; Shi, Cao

doi:10.1007/978-3-642-34456-5_2

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 333)

Abstract

Large-scale digitization has motivated research on processing PDF documents that carry little structural information. Challenging problems, such as segmenting graphics that are interleaved with text, remain unsolved and limit the practical application of PDF layout analysis. To handle such documents, a hybrid method combining text information and graphic composites is proposed for segmenting pages that are difficult for traditional methods. Specifically, text information is extracted accurately from born-digital documents, in which low-level structure elements are embedded in explicit form. The page text elements are then clustered with a graph-based method according to proximity and feature similarity. Meanwhile, graphic components are extracted by means of texture and morphological analysis. By integrating the clustered text elements with the image-based graphic components, the graphics are segmented for layout analysis. Experimental results on pages of PDF books show satisfactory performance.
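
The abstract does not spell out the clustering step, so the following is only a minimal sketch of what graph-based clustering of text elements by proximity and feature similarity could look like. It is not the authors' algorithm. The `TextElement` schema, the use of font size as the only feature, the `font_penalty` weight, and the single `max_weight` merge threshold are all assumptions made for illustration. Each pair of elements gets an edge whose weight combines the spatial gap between their bounding boxes with the difference in font size; edges are then processed in increasing order and merged with a union-find structure.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class TextElement:
    """A text fragment with bounding box and font size (hypothetical schema)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float


def box_gap(a, b):
    """Euclidean gap between two axis-aligned boxes (0 if they overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return (dx * dx + dy * dy) ** 0.5


def edge_weight(a, b, font_penalty=10.0):
    """Dissimilarity: spatial gap plus an assumed penalty on font-size difference."""
    return box_gap(a, b) + font_penalty * abs(a.font_size - b.font_size)


def cluster_text_elements(elements, max_weight=15.0):
    """Group element indices whose connecting edges weigh at most max_weight."""
    parent = list(range(len(elements)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Complete graph over all element pairs, cheapest edges first.
    edges = sorted(
        (edge_weight(elements[i], elements[j]), i, j)
        for i, j in combinations(range(len(elements)), 2)
    )
    for weight, i, j in edges:
        if weight > max_weight:
            break  # every remaining edge is at least this heavy
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    clusters = {}
    for idx in range(len(elements)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Made-up page content: a caption, a body line, and a small axis label.
elements = [
    TextElement("Figure", 100, 100, 140, 110, 9.0),
    TextElement("1:", 143, 100, 155, 110, 9.0),
    TextElement("Body", 100, 300, 130, 312, 11.0),
    TextElement("text", 134, 300, 160, 312, 11.0),
    TextElement("x-axis", 100, 118, 130, 124, 6.0),
]
print(cluster_text_elements(elements))
```

In this toy input the axis label sits close to the caption but has a much smaller font, so it stays in its own cluster. Such isolated small-font clusters are the kind of text that might plausibly be associated with a graphic region rather than body text, although how the paper performs that integration is not stated in the abstract.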

Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, China
Canhui Xu, Zhi Tang, Xin Tao & Cao Shi
Postdoctoral Workstation of the Zhongguancun Haidian Science Park and Peking University Founder Group Co. Ltd, Beijing, China
Canhui Xu
State Key Laboratory of Digital Publishing Technology, Beijing, China
Canhui Xu, Zhi Tang & Xin Tao

Editor information

Editors and Affiliations

Microsoft Research Asia, Beijing, China
Ming Zhou
Soochow University, 215006, Suzhou, China
Guodong Zhou
Institute of Computer Science & Technology, Peking University, 100871, Beijing, China
Dongyan Zhao & Lei Zou
Institute of Computing Technology, Chinese Academy of Sciences, No.6 Kexueyuan South Road, Haidian District, 100190, Beijing, China
Qun Liu

Cite this paper

Xu, C., Tang, Z., Tao, X., Shi, C. (2012). Integration of Text Information and Graphic Composite for PDF Document Analysis. In: Zhou, M., Zhou, G., Zhao, D., Liu, Q., Zou, L. (eds) Natural Language Processing and Chinese Computing. NLPCC 2012. Communications in Computer and Information Science, vol 333. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34456-5_2

DOI: https://doi.org/10.1007/978-3-642-34456-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34455-8
Online ISBN: 978-3-642-34456-5
eBook Packages: Computer Science, Computer Science (R0)
