A Linear Grammar Approach to Mathematical Formula Recognition from PDF

  • Conference paper
Intelligent Computer Mathematics (CICM 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5625)

Abstract

Many approaches have been proposed over the years for the recognition of mathematical formulae from scanned documents. More recently a need has arisen to recognise formulae from PDF documents. Here we can avoid ambiguities introduced by traditional OCR approaches and instead extract perfect knowledge of the characters used in formulae directly from the document. This can be exploited by formula recognition techniques to achieve correct results and high performance.

In this paper we revisit an old grammatical approach to formula recognition, that of Anderson from 1968, and assess its applicability with respect to data extracted from PDF documents. We identify some problems of the original method when applied to common mathematical expressions and show how they can be overcome. The simplicity of the original method leads to a very efficient recognition technique that not only is very simple to implement but also yields results of high accuracy for the recognition of mathematical formulae from PDF documents.

References

  1. Adobe Systems: PDF Reference, fifth edition: Adobe Portable Document Format Version 1.6 (2004)

  2. Aly, W., Uchida, S., Suzuki, M.: Identifying subscripts, superscripts in mathematical documents. Mathematics in Computer Science (2008)

  3. Anderson, R.H.: Syntax-Directed Recognition of Hand-Printed Two-dimensional Mathematics. PhD thesis, Harvard University, Cambridge, MA (January 1968)

  4. Anjwierden, A.: Aidas: Incremental logical structure discovery in PDF documents. In: Proc. of ICDAR 2001, p. 374. IEEE Computer Society, Los Alamitos (2001)

  5. Baker, J., Sexton, A.P., Sorge, V.: Extracting precise data on the mathematical content of PDF documents. In: Proc. of DML 2008. Masaryk University Press (2008)

  6. Blostein, D., Grbavec, A.: Handbook on Optical Character Recognition and Document Image Analysis, Recognition of Mathematical Notation. World Scientific, Singapore (1996)

  7. Chan, K., Yeung, D.: Mathematical expression recognition: a survey. International Journal on Document Analysis and Recognition (2000)

  8. Judson, T.: Abstract algebra — theory and applications (February 2009), http://abstract.ups.edu/download.html

  9. Kanahori, T., Suzuki, M.: A recognition method of matrices by using variable block pattern elements generating rectangular areas. In: Blostein, D., Kwon, Y.-B. (eds.) GREC 2001. LNCS, vol. 2390, pp. 320–329. Springer, Heidelberg (2002)

  10. Kanahori, T., Suzuki, M.: Detection of matrices and segmentation of matrix elements in scanned images of scientific documents. In: ICDAR 2003, pp. 433–437 (2003)

  11. Phelps, T.: Multivalent, http://multivalent.sourceforge.net/

  12. Roberts, T.: LaTeX mathematics examples (May 2004), http://www.sci.usq.edu.au/staff/aroberts/LaTeX/Src/maths.pdf

  13. Sexton, A., Sorge, V.: Database-driven mathematical character recognition. In: Liu, W., Lladós, J. (eds.) GREC 2005. LNCS, vol. 3926, pp. 218–230. Springer, Heidelberg (2006)

  14. Sternberg, S.: Semi-riemann geometry and general relativity (September 2003), http://www.math.harvard.edu/~shlomo/docs/semi_riemannian_geometry.pdf

  15. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: Infty — an integrated OCR system for mathematical documents. In: Proceedings of ACM Symposium on Document Engineering, pp. 95–104. ACM Press, New York (2003)

  16. Yang, M., Fateman, R.: Extracting mathematical expressions from postscript documents. In: Proc. of ISSAC 2004, pp. 305–311. ACM Press, New York (2004)

  17. Yuan, F., Liu, B.: A new method of information extraction from PDF files. In: Proc. of Machine Learning and Cybernetics, pp. 1738–1742. IEEE Computer Society, Los Alamitos (2005)

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baker, J.B., Sexton, A.P., Sorge, V. (2009). A Linear Grammar Approach to Mathematical Formula Recognition from PDF. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds) Intelligent Computer Mathematics. CICM 2009. Lecture Notes in Computer Science, vol 5625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02614-0_19

  • DOI: https://doi.org/10.1007/978-3-642-02614-0_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02613-3

  • Online ISBN: 978-3-642-02614-0

  • eBook Packages: Computer Science, Computer Science (R0)
