Abstract
Automatic document analysis tools for mathematical texts are necessary to enlarge the pool of mathematical knowledge available in electronic form. However, development of such tools is currently hindered by the weakness of optical character recognition systems in dealing with the large range of mathematical symbols and the often subtle but important distinctions in font usage in mathematical texts. Research on better systems for mathematical optical character recognition crucially depends on having an extensive, high-quality database of glyphs used in mathematical texts for training and testing. We present such a database of symbols, constructed from a large set of characters available in the LaTeX document preparation system, that can serve as a basis for mathematical text recognition. We describe its integration into a prototypical optical character recognition system for mathematics that enables the construction of LaTeX source documents from mathematical documents available as images. From the lessons learned in this work we derive a road map for further research into the area of mathematical text analysis.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sexton, A., Sorge, V. (2006). A Database of Glyphs for OCR of Mathematical Documents. In: Kohlhase, M. (ed.) Mathematical Knowledge Management. MKM 2005. Lecture Notes in Computer Science, vol 3863. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11618027_14
DOI: https://doi.org/10.1007/11618027_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31430-1
Online ISBN: 978-3-540-31431-8
eBook Packages: Computer Science, Computer Science (R0)