A Linear Grammar Approach to Mathematical Formula Recognition from PDF

Baker, Josef B.; Sexton, Alan P.; Sorge, Volker

doi:10.1007/978-3-642-02614-0_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5625)

Included in the following conference series:

International Conference on Intelligent Computer Mathematics

Abstract

Many approaches have been proposed over the years for the recognition of mathematical formulae from scanned documents. More recently a need has arisen to recognise formulae from PDF documents. Here we can avoid ambiguities introduced by traditional OCR approaches and instead extract perfect knowledge of the characters used in formulae directly from the document. This can be exploited by formula recognition techniques to achieve correct results and high performance.

In this paper we revisit an old grammatical approach to formula recognition, that of Anderson from 1968, and assess its applicability with respect to data extracted from PDF documents. We identify some problems of the original method when applied to common mathematical expressions and show how they can be overcome. The simplicity of the original method leads to a very efficient recognition technique that not only is very simple to implement but also yields results of high accuracy for the recognition of mathematical formulae from PDF documents.

References

Adobe Systems: PDF Reference, fifth edition: Adobe Portable Document Format Version 1.6 (2004)
Aly, W., Uchida, S., Suzuki, M.: Identifying subscripts, superscripts in mathematical documents. Mathematics in Computer Science (2008)
Anderson, R.H.: Syntax-Directed Recognition of Hand-Printed Two-dimensional Mathematics. PhD thesis, Harvard University, Cambridge, MA (January 1968)
Anjwierden, A.: Aidas: Incremental logical structure discovery in PDF documents. In: Proc. of ICDAR 2001, p. 374. IEEE Computer Society, Los Alamitos (2001)
Baker, J., Sexton, A.P., Sorge, V.: Extracting precise data on the mathematical content of PDF documents. In: Proc. of DML 2008. Masaryk University Press (2008)
Blostein, D., Grbavec, A.: Handbook on Optical Character Recognition and Document Image Analysis, Recognition of Mathematical Notation. World Scientific, Singapore (1996)
Chan, K., Yeung, D.: Mathematical expression recognition: a survey. International Journal on Document Analysis and Recognition (2000)
Judson, T.: Abstract algebra — theory and applications (February 2009), http://abstract.ups.edu/download.html
Kanahori, T., Suzuki, M.: A recognition method of matrices by using variable block pattern elements generating rectangular areas. In: Blostein, D., Kwon, Y.-B. (eds.) GREC 2001. LNCS, vol. 2390, pp. 320–329. Springer, Heidelberg (2002)
Kanahori, T., Suzuki, M.: Detection of matrices and segmentation of matrix elements in scanned images of scientific documents. In: ICDAR 2003, pp. 433–437 (2003)
Phelps, T.: Multivalent, http://multivalent.sourceforge.net/
Roberts, T.: LaTeX mathematics examples (May 2004), http://www.sci.usq.edu.au/staff/aroberts/LaTeX/Src/maths.pdf
Sexton, A., Sorge, V.: Database-driven mathematical character recognition. In: Liu, W., Lladós, J. (eds.) GREC 2005. LNCS, vol. 3926, pp. 218–230. Springer, Heidelberg (2006)
Sternberg, S.: Semi-riemann geometry and general relativity (September 2003), http://www.math.harvard.edu/~shlomo/docs/semi_riemannian_geometry.pdf
Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: Infty — an integrated OCR system for mathematical documents. In: Proceedings of ACM Symposium on Document Engineering, pp. 95–104. ACM Press, New York (2003)
Yang, M., Fateman, R.: Extracting mathematical expressions from postscript documents. In: Proc. of ISSAC 2004, pp. 305–311. ACM Press, New York (2004)
Yuan, F., Liu, B.: A new method of information extraction from PDF files. In: Proc. of Machine Learning and Cybernetics, pp. 1738–1742. IEEE Computer Society, Los Alamitos (2005)

Author information

Authors and Affiliations

School of Computer Science, University of Birmingham, UK
Josef B. Baker, Alan P. Sexton & Volker Sorge

Editor information

Editors and Affiliations

Department of Computing and Software, McMaster University, 1280 Main Street West, L8S 4K1, Hamilton, ON, Canada
Jacques Carette
Informatics, 2.02 Informatics Forum, University of Edinburgh, 10 Crichton street, EH8 9AB, Edinburgh, UK
Lucas Dixon
Department of Computer Science, University of Bologna, via Mura Anteo Zamboni, 7, 40127, Bologna, Italy
Claudio Sacerdoti Coen
Department of Computer Science MC 375, University of Western Ontario, N6A 5B7, London, ON, Canada
Stephen M. Watt

Cite this paper

Baker, J.B., Sexton, A.P., Sorge, V. (2009). A Linear Grammar Approach to Mathematical Formula Recognition from PDF. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds) Intelligent Computer Mathematics. CICM 2009. Lecture Notes in Computer Science, vol 5625. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02614-0_19

DOI: https://doi.org/10.1007/978-3-642-02614-0_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02613-3
Online ISBN: 978-3-642-02614-0
eBook Packages: Computer Science, Computer Science (R0)
