A Linear Grammar Approach to Mathematical Formula Recognition from PDF

  • Josef B. Baker
  • Alan P. Sexton
  • Volker Sorge
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5625)

Abstract

Many approaches have been proposed over the years for the recognition of mathematical formulae from scanned documents. More recently a need has arisen to recognise formulae from PDF documents. Here we can avoid ambiguities introduced by traditional OCR approaches and instead extract perfect knowledge of the characters used in formulae directly from the document. This can be exploited by formula recognition techniques to achieve correct results and high performance.

In this paper we revisit an old grammatical approach to formula recognition, that of Anderson from 1968, and assess its applicability with respect to data extracted from PDF documents. We identify some problems of the original method when applied to common mathematical expressions and show how they can be overcome. The simplicity of the original method leads to a very efficient recognition technique that not only is very simple to implement but also yields results of high accuracy for the recognition of mathematical formulae from PDF documents.

Keywords

Mathematical Expression, Parse Tree, Syntax Tree, Terminal Symbol, Grammatical Approach
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Adobe Systems: PDF Reference, fifth edition: Adobe Portable Document Format Version 1.6 (2004)
  2. Aly, W., Uchida, S., Suzuki, M.: Identifying subscripts, superscripts in mathematical documents. Mathematics in Computer Science (2008)
  3. Anderson, R.H.: Syntax-Directed Recognition of Hand-Printed Two-dimensional Mathematics. PhD thesis, Harvard University, Cambridge, MA (January 1968)
  4. Anjwierden, A.: Aidas: Incremental logical structure discovery in PDF documents. In: Proc. of ICDAR 2001, p. 374. IEEE Computer Society, Los Alamitos (2001)
  5. Baker, J., Sexton, A.P., Sorge, V.: Extracting precise data on the mathematical content of PDF documents. In: Proc. of DML 2008. Masaryk University Press (2008)
  6. Blostein, D., Grbavec, A.: Handbook on Optical Character Recognition and Document Image Analysis, Recognition of Mathematical Notation. World Scientific, Singapore (1996)
  7. Chan, K., Yeung, D.: Mathematical expression recognition: a survey. International Journal on Document Analysis and Recognition (2000)
  8. Judson, T.: Abstract algebra: theory and applications (February 2009), http://abstract.ups.edu/download.html
  9. Kanahori, T., Suzuki, M.: A recognition method of matrices by using variable block pattern elements generating rectangular areas. In: Blostein, D., Kwon, Y.-B. (eds.) GREC 2001. LNCS, vol. 2390, pp. 320–329. Springer, Heidelberg (2002)
  10. Kanahori, T., Suzuki, M.: Detection of matrices and segmentation of matrix elements in scanned images of scientific documents. In: ICDAR 2003, pp. 433–437 (2003)
  11. Phelps, T.: Multivalent, http://multivalent.sourceforge.net/
  12. Roberts, T.: LaTeX mathematics examples (May 2004), http://www.sci.usq.edu.au/staff/aroberts/LaTeX/Src/maths.pdf
  13. Sexton, A., Sorge, V.: Database-driven mathematical character recognition. In: Liu, W., Lladós, J. (eds.) GREC 2005. LNCS, vol. 3926, pp. 218–230. Springer, Heidelberg (2006)
  14. Sternberg, S.: Semi-riemann geometry and general relativity (September 2003), http://www.math.harvard.edu/~shlomo/docs/semi_riemannian_geometry.pdf
  15. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: Infty: an integrated OCR system for mathematical documents. In: Proceedings of ACM Symposium on Document Engineering, pp. 95–104. ACM Press, New York (2003)
  16. Yang, M., Fateman, R.: Extracting mathematical expressions from postscript documents. In: Proc. of ISSAC 2004, pp. 305–311. ACM Press, New York (2004)
  17. Yuan, F., Liu, B.: A new method of information extraction from PDF files. In: Proc. of Machine Learning and Cybernetics, pp. 1738–1742. IEEE Computer Society, Los Alamitos (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Josef B. Baker (1)
  • Alan P. Sexton (1)
  • Volker Sorge (1)
  1. School of Computer Science, University of Birmingham, UK
