Extracting Mathematical Components Directly from PDF Documents for Mathematical Expression Recognition and Retrieval

Yu, Botao; Tian, Xuedong; Luo, Wenjie

doi:10.1007/978-3-319-11897-0_20

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8795)

Included in the following conference series:

International Conference in Swarm Intelligence

Abstract

The PDF format is widely used for information storage and exchange. As more and more documents, especially scientific documents, become available in PDF format, extracting mathematical expressions from PDF documents has become an important issue in the field of mathematical expression recognition and retrieval. In this paper, we propose a method for extracting mathematical components directly from PDF documents, rather than working indirectly on images converted from the PDF files. Compared with traditional image-based methods, the proposed method makes full use of the internal information of PDF documents, such as font size, baseline, and glyph bounding box, to extract the mathematical characters and their geometric information. The experimental results show that the method meets the needs of subsequent processing of mathematical expressions, such as formula structural analysis, reconstruction, and retrieval, and is more efficient than traditional image-based approaches.
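
To make the role of this geometric information concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each extracted character is stored with the attributes named in the abstract (font size, baseline, glyph bounding box). The `Glyph` record, the `vertical_relation` helper, the `tol` threshold, and all coordinate values are hypothetical and chosen only for illustration. It shows how a later structural-analysis step could separate superscripts and subscripts using the baseline offset alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical record for one extracted character and its geometry."""
    char: str
    font_size: float  # in points
    baseline: float   # y coordinate of the baseline (PDF y axis points up)
    bbox: tuple       # (x0, y0, x1, y1) glyph bounding box


def vertical_relation(base: Glyph, other: Glyph, tol: float = 0.15) -> str:
    """Classify `other` relative to `base` from the baseline offset.

    `tol` is an assumed threshold, expressed as a fraction of the
    base glyph's font size; it is not taken from the paper.
    """
    shift = (other.baseline - base.baseline) / base.font_size
    if shift > tol:
        return "superscript"
    if shift < -tol:
        return "subscript"
    return "inline"


# Hand-written glyphs for "x^2 + y_i"; the values are illustrative,
# not extracted from a real PDF file.
x = Glyph("x", 10.0, 100.0, (50.0, 100.0, 55.0, 104.5))
two = Glyph("2", 7.0, 103.6, (55.5, 103.6, 59.0, 108.3))
plus = Glyph("+", 10.0, 100.0, (61.0, 100.5, 68.0, 106.5))
y = Glyph("y", 10.0, 100.0, (70.0, 98.0, 75.0, 104.5))
i = Glyph("i", 7.0, 97.0, (75.5, 97.0, 77.5, 101.6))

relations = [
    vertical_relation(x, two),
    vertical_relation(x, plus),
    vertical_relation(y, i),
]
```

An image-based pipeline would have to estimate the baseline and character size from pixels, whereas in this sketch they are read directly from the stored record. That difference is what the abstract means by using the internal information of the PDF document.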

This work is supported by the National Natural Science Foundation of China (Grant No. 61375075) and the Natural Science Foundation of Hebei Province (Grant No. F2012201020).

Author information

Authors and Affiliations

College of Mathematics and Computer, Hebei University, Baoding, Hebei, China
Botao Yu, Xuedong Tian & Wenjie Luo
Hebei Key Laboratory of Machine Learning and Computational Intelligence, Baoding, China
Botao Yu, Xuedong Tian & Wenjie Luo

Editor information

Editors and Affiliations

Key Laboratory of Machine Perception (MOE), Peking University, 100871, Beijing, China
Ying Tan
Department of Electrical & Electronic Engineering, Xi’an Jiaotong-Liverpool University, Suzhou, China
Yuhui Shi
Computer Science Department, CINVESTAV-IPN, Mexico
Carlos A. Coello Coello

About this paper

Cite this paper

Yu, B., Tian, X., Luo, W. (2014). Extracting Mathematical Components Directly from PDF Documents for Mathematical Expression Recognition and Retrieval. In: Tan, Y., Shi, Y., Coello, C.A.C. (eds) Advances in Swarm Intelligence. ICSI 2014. Lecture Notes in Computer Science, vol 8795. Springer, Cham. https://doi.org/10.1007/978-3-319-11897-0_20

DOI: https://doi.org/10.1007/978-3-319-11897-0_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11896-3
Online ISBN: 978-3-319-11897-0
eBook Packages: Computer Science, Computer Science (R0)
