Abstract
Many approaches have been proposed over the years for recognising mathematical formulae in scanned documents. More recently, a need has arisen to recognise formulae in PDF documents. Here we can avoid the ambiguities introduced by traditional OCR approaches and instead extract perfect knowledge of the characters used in formulae directly from the document. Formula recognition techniques can exploit this knowledge to achieve correct results and high performance.
In this paper we revisit an old grammatical approach to formula recognition, that of Anderson from 1968, and assess its applicability to data extracted from PDF documents. We identify some problems that the original method has with common mathematical expressions and show how they can be overcome. The simplicity of the original method leads to an efficient recognition technique that is easy to implement and yields highly accurate results for mathematical formulae in PDF documents.
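To give a feel for why exact character data helps, the following is a minimal sketch. It is not Anderson's grammar and not the method of this paper. It assumes each character comes with a left edge, a baseline and a font size, with y growing upward as in PDF's default user space. The names `Glyph`, `linearise` and `tol` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str        # character, assumed to be known exactly from the PDF
    x: float         # left edge of the glyph
    baseline: float  # baseline y-coordinate (larger = higher on the page)
    size: float      # font size in points


def linearise(glyphs: List[Glyph], tol: float = 0.2) -> str:
    """Turn a single-line run of glyphs into a LaTeX-like string.

    A glyph that is smaller than the current base glyph and whose baseline
    is shifted by more than tol * base size is treated as a superscript
    (shifted up) or a subscript (shifted down). Nested scripts, fractions
    and other two-dimensional layouts are deliberately not handled.
    """
    out: List[str] = []
    base = None
    mode = ""  # "" for the main line, "^" or "_" while inside a script
    for g in sorted(glyphs, key=lambda gl: gl.x):
        if base is None:
            base = g
            out.append(g.char)
            continue
        shift = g.baseline - base.baseline
        if g.size < base.size and shift > tol * base.size:
            new_mode = "^"
        elif g.size < base.size and shift < -tol * base.size:
            new_mode = "_"
        else:
            new_mode = ""
        if new_mode != mode:
            if mode:
                out.append("}")             # close the script we were in
            if new_mode:
                out.append(new_mode + "{")  # open the new script
            mode = new_mode
        out.append(g.char)
        if not new_mode:
            base = g                        # main-line glyph becomes the new base
    if mode:
        out.append("}")
    return "".join(out)


example = [
    Glyph("x", 0, 100, 10), Glyph("2", 6, 104, 7),
    Glyph("+", 12, 100, 10),
    Glyph("y", 20, 100, 10), Glyph("i", 26, 97, 7),
]
print(linearise(example))
```

With scanned input, the characters, sizes and baselines would first have to be estimated by OCR; here they are taken as given, so only the spatial relationships remain to be analysed.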
References
Adobe Systems: PDF Reference, fifth edition: Adobe Portable Document Format Version 1.6 (2004)
Aly, W., Uchida, S., Suzuki, M.: Identifying subscripts, superscripts in mathematical documents. Mathematics in Computer Science (2008)
Anderson, R.H.: Syntax-Directed Recognition of Hand-Printed Two-dimensional Mathematics. PhD thesis, Harvard University, Cambridge, MA (January 1968)
Anjwierden, A.: Aidas: Incremental logical structure discovery in PDF documents. In: Proc. of ICDAR 2001, p. 374. IEEE Computer Society, Los Alamitos (2001)
Baker, J., Sexton, A.P., Sorge, V.: Extracting precise data on the mathematical content of PDF documents. In: Proc. of DML 2008. Masaryk University Press (2008)
Blostein, D., Grbavec, A.: Handbook on Optical Character Recognition and Document Image Analysis, Recognition of Mathematical Notation. World Scientific, Singapore (1996)
Chan, K., Yeung, D.: Mathematical expression recognition: a survey. International Journal on Document Analysis and Recognition (2000)
Judson, T.: Abstract algebra — theory and applications (February 2009), http://abstract.ups.edu/download.html
Kanahori, T., Suzuki, M.: A recognition method of matrices by using variable block pattern elements generating rectangular areas. In: Blostein, D., Kwon, Y.-B. (eds.) GREC 2001. LNCS, vol. 2390, pp. 320–329. Springer, Heidelberg (2002)
Kanahori, T., Suzuki, M.: Detection of matrices and segmentation of matrix elements in scanned images of scientific documents. In: ICDAR 2003, pp. 433–437 (2003)
Phelps, T.: Multivalent, http://multivalent.sourceforge.net/
Roberts, T.: LaTeX mathematics examples (May 2004), http://www.sci.usq.edu.au/staff/aroberts/LaTeX/Src/maths.pdf
Sexton, A., Sorge, V.: Database-driven mathematical character recognition. In: Liu, W., Lladós, J. (eds.) GREC 2005. LNCS, vol. 3926, pp. 218–230. Springer, Heidelberg (2006)
Sternberg, S.: Semi-riemann geometry and general relativity (September 2003), http://www.math.harvard.edu/~shlomo/docs/semi_riemannian_geometry.pdf
Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: Infty — an integrated OCR system for mathematical documents. In: Proceedings of ACM Symposium on Document Engineering, pp. 95–104. ACM Press, New York (2003)
Yang, M., Fateman, R.: Extracting mathematical expressions from postscript documents. In: Proc. of ISSAC 2004, pp. 305–311. ACM Press, New York (2004)
Yuan, F., Liu, B.: A new method of information extraction from PDF files. In: Proc. of Machine Learning and Cybernetics, pp. 1738–1742. IEEE Computer Society, Los Alamitos (2005)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Baker, J.B., Sexton, A.P., Sorge, V. (2009). A Linear Grammar Approach to Mathematical Formula Recognition from PDF. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds) Intelligent Computer Mathematics. CICM 2009. Lecture Notes in Computer Science, vol 5625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02614-0_19
DOI: https://doi.org/10.1007/978-3-642-02614-0_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02613-3
Online ISBN: 978-3-642-02614-0
eBook Packages: Computer Science, Computer Science (R0)