Mathematical formula identification and performance evaluation in PDF documents

  • Original Paper
  • International Journal on Document Analysis and Recognition (IJDAR)

Abstract

An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features along with preprocessing and post-processing techniques for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application-adaptable, fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification increase by 11.52% and 10.65%, respectively, compared with our previously proposed formula identification approach.

Notes

  1. http://www.icst.pku.edu.cn/cpdp/data/marmot_data.htm.

  2. http://www.icst.pku.edu.cn/cpdp/data/marmot_data.htm.

References

  1. Anderson, R.H.: Syntax-directed recognition of hand-printed two-dimensional mathematics. PhD thesis, Harvard University, Cambridge, Massachusetts (1968)

  2. Lin, X., Gao, L., Tang, Z., Lin, X., Hu, X.: Mathematical formula identification in PDF documents. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1419–1423. IEEE (2011)

  3. Lin, X., Gao, L., Tang, Z., Hu, X., Lin, X.: Identification of embedded mathematical formulas in PDF documents using SVM. In: Document Recognition and Retrieval (DRR) XIX, 82970D, pp. 1–8 (2012)

  4. Lin, X., Gao, L., Tang, Z., Lin, X., Hu, X.: Performance evaluation of mathematical formula identification. In: The 10th IAPR International Workshop on Document Analysis Systems (DAS), pp. 287–291. IEEE (2012)

  5. Adobe. PDF reference, 7th edition (2008)

  6. Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions. Int. J. Document Anal. Recognit. pp. 1–27 (2011)

  7. Baker, J.B.: A linear grammar approach for the analysis of mathematical documents. PhD thesis, University of Birmingham (2012)

  8. Rahman, F., Alam, H.: Conversion of PDF documents into HTML: a case study of document image analysis. In: Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 87–91. IEEE (2003)

  9. Baker, J.B., Sexton, A.P., Sorge, V.: A linear grammar approach to mathematical formula recognition from PDF. In: Proceedings of the 8th International Conference on Mathematical Knowledge Management, vol. 5625 of LNAI, pp. 201–216. Springer (2009)

  10. Fateman, R.J., Tokuyasu, T., Berman, B.P., Mitchell, N.: Optical character recognition and parsing of typeset mathematics. J. Vis. Commun. Image Represent. 7(1), 2–15 (1996)

  11. Lee, H.J., Wang, J.S.: Design of a mathematical expression understanding system. Pattern Recognit. Lett. 18(3), 289–298 (1997)

  12. Toumit, J.Y., Garcia-Salicetti, S., Emptoz, H.: A hierarchical and recursive model of mathematical expressions for automatic reading of mathematical documents. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR), pp. 119–122. IEEE (1999)

  13. Garain, U., Chaudhuri, B.B.: A syntactic approach for processing mathematical expressions in printed documents. In: Proceedings of the 15th International Conference on Pattern Recognition (ICPR), vol. 4, pp. 523–526. IEEE (2000)

  14. Kacem, A., Belaïd, A., Ben Ahmed, M.: Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. Int. J. Document Anal. Recognit. 4(2), 97–108 (2001)

  15. Inoue, K., Miyazaki, R., Suzuki, M.: Optical recognition of printed mathematical documents. In: Proceedings of the Third Asian Technology Conference on Mathematics, pp. 280–289 (1998)

  16. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: INFTY: an integrated OCR system for mathematical documents. In: Proceedings of the 2003 ACM Symposium on Document Engineering, pp. 95–104. ACM (2003)

  17. Baker, J.B., Sexton, A.P., Sorge, V.: Towards reverse engineering of PDF documents. In: Towards a Digital Mathematics Library, pp. 65–75. Masaryk University Press (2011)

  18. Lee, H.J., Wang, J.S.: Design of a mathematical expression recognition system. In: Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 1084–1087. IEEE (1995)

  19. Chowdhury, S.P., Mandal, S., Das, A.K., Chanda, B.: Automated segmentation of math-zones from document images. In: 7th International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 755–759 (2003)

  20. Chang, T.Y., Takiguchi, Y., Okada, M.: Physical structure segmentation with projection profile for mathematic formulae and graphics in academic paper images. In: The Ninth International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 1193–1197. IEEE (2007)

  21. Garain, U., Chaudhuri, B.B., Chaudhuri, A.R.: Identification of embedded mathematical expressions in scanned documents. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 1, pp. 384–387. IEEE (2004)

  22. Garain, U.: Identification of mathematical expressions in document images. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1340–1344. IEEE (2009)

  23. Jin, J., Han, X., Wang, Q.: Mathematical formulas extraction. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1138–1141. IEEE (2003)

  24. Drake, D.M., Baird, H.S.: Distinguishing mathematics notation from English text using computational geometry. In: Proceedings. Eighth International Conference on Document Analysis and Recognition (ICDAR), pp. 1270–1274. IEEE (2005)

  25. Liu, Y., Bai, K., Gao, L.: An efficient pre-processing method to identify logical components from PDF documents. Adv. Knowl. Discov. Data Min. pp. 500–511 (2011)

  26. Uchida, S., Nomura, A., Suzuki, M.: Quantitative analysis of mathematical documents. Int. J. Document Anal. Recognit. 7(4), 211–218 (2005)

  27. Phillips, I., Chanda, B., Haralick, R.: University of Washington UW-III English technical document image database (1996)

  28. http://ntcir-math.nii.ac.jp/ (2013)

  29. Gao, L., Tang, Z., Lin, X., Qiu, R.: Comprehensive global typography extraction system for electronic book documents. In: The Eighth IAPR International Workshop on Document Analysis Systems (DAS), pp. 615–621. IEEE (2008)

  30. http://www.cs.cmu.edu/~quake/triangle.html (2013)

  31. Bishop, C.M.: Pattern Recognition and Machine Learning, vol. 4. Springer, New York (2006)

  32. Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)

  33. Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)

  34. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

  35. Kim, S.H., Jeong, C.B., Kwag, H.K., Suen, C.Y.: Word segmentation of printed text lines based on gap clustering and special symbol detection. In: Proceedings. 16th International Conference on Pattern Recognition (ICPR), vol. 2, pp. 320–323. IEEE (2002)

  36. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)

  37. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)

Acknowledgments

This work is sponsored by the National Natural Science Foundation of China (No. 61202232) and the National Key Technology R&D Program of China (No. 2012BAH40F01). We would like to thank our colleagues Jing Fang, Yongtao Wang and Luyuan Li for their comments on this paper. The learning algorithms in our paper are implemented using LIBSVM and WEKA, open-source software packages that provide implementations of machine learning algorithms.

Author information

Corresponding author

Correspondence to Liangcai Gao.

About this article

Cite this article

Lin, X., Gao, L., Tang, Z. et al. Mathematical formula identification and performance evaluation in PDF documents. IJDAR 17, 239–255 (2014). https://doi.org/10.1007/s10032-013-0216-1

  • DOI: https://doi.org/10.1007/s10032-013-0216-1
