Extracting Mathematical Components Directly from PDF Documents for Mathematical Expression Recognition and Retrieval

  • Conference paper
Advances in Swarm Intelligence (ICSI 2014)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8795)

Abstract

The PDF format has become popular for information storage and exchange. As more and more documents, especially scientific documents, become available in PDF format, extracting mathematical expressions from PDF documents has become an important issue in the field of mathematical expression recognition and retrieval. In this paper, we propose a method for extracting mathematical components directly from PDF documents, rather than working indirectly with images converted from the PDF files. Compared with the traditional image-based approach, the proposed method makes full use of the internal information of PDF documents, such as font size, baseline, and glyph bounding box, to extract the mathematical characters and their geometric information. The experimental results show that the method meets the needs of subsequent processing of mathematical expressions, such as formula structural analysis, reconstruction, and retrieval, and is more efficient than traditional image-based approaches.
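
To illustrate the kind of geometric information the abstract mentions, the following is a minimal sketch of how a glyph's page-space bounding box could be derived from its font size, baseline position, and font-unit bounding box. It is not the authors' implementation. The `Glyph` record, the `page_bbox` function, and the sample values are hypothetical, and the sketch assumes 1000 font units per em, horizontal unrotated text, and no additional text-matrix scaling.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One character as positioned on the page (hypothetical record)."""
    char: str
    font_size: float   # font size in points
    origin_x: float    # x of the glyph origin on the baseline, page units
    baseline_y: float  # y of the baseline, page units
    font_bbox: tuple   # (llx, lly, urx, ury) in font units


def page_bbox(g, units_per_em=1000.0):
    """Scale the font-unit box by the font size and shift it to the origin.

    Assumes horizontal, unrotated text with no extra scaling applied.
    """
    scale = g.font_size / units_per_em
    llx, lly, urx, ury = g.font_bbox
    return (
        g.origin_x + llx * scale,
        g.baseline_y + lly * scale,
        g.origin_x + urx * scale,
        g.baseline_y + ury * scale,
    )


# Illustrative values only: a base character and a smaller, raised one.
base = Glyph("x", 10.0, 100.0, 500.0, (0, 0, 500, 450))
raised = Glyph("2", 7.0, 105.5, 504.0, (0, 0, 500, 650))

for g in (base, raised):
    print(g.char, page_bbox(g))
```

A later structural-analysis stage could compare such boxes and baselines, for example treating a smaller glyph whose baseline sits above its neighbour's as a superscript candidate. That rule is an illustrative assumption, not a result stated in the abstract.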

This work is supported by the National Natural Science Foundation of China (Grant No. 61375075) and the Natural Science Foundation of Hebei Province (Grant No. F2012201020).




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Yu, B., Tian, X., Luo, W. (2014). Extracting Mathematical Components Directly from PDF Documents for Mathematical Expression Recognition and Retrieval. In: Tan, Y., Shi, Y., Coello, C.A.C. (eds) Advances in Swarm Intelligence. ICSI 2014. Lecture Notes in Computer Science, vol 8795. Springer, Cham. https://doi.org/10.1007/978-3-319-11897-0_20

  • DOI: https://doi.org/10.1007/978-3-319-11897-0_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11896-3

  • Online ISBN: 978-3-319-11897-0

  • eBook Packages: Computer Science, Computer Science (R0)
