Advertisement

Mathematical formula identification and performance evaluation in PDF documents

  • Xiaoyan Lin
  • Liangcai GaoEmail author
  • Zhi Tang
  • Josef Baker
  • Volker Sorge
Original Paper

Abstract

An important initial step of mathematical formula recognition is to correctly identify the location of formulae within documents. Previous work in this area has traditionally focused on image-based documents; however, given the prevalence and popularity of the PDF format for dissemination, alternatives to image-based approaches are increasingly being explored. In this paper, we investigate the use of both machine learning techniques and heuristic rules to locate the boundaries of both isolated and embedded formulae within documents, based upon data extracted directly from PDF files. We propose four new features along with preprocessing and post-processing techniques for isolated formula identification. Furthermore, we compare, analyse and extensively tune nine state-of-the-art learning algorithms for a comprehensive evaluation of our proposed methods. The evaluation is carried out over a ground-truth dataset, which we have made publicly available, together with an application adaptable fine-grained evaluation metric. Our experimental results demonstrate that the overall accuracies of isolated and embedded formula identification are increased by 11.52 and 10.65 %, compared with our previously proposed formula identification approach.

Keywords

Mathematical formula identification  Machine learning  PDF documents Performance evaluation 

Notes

Acknowledgments

This work is sponsored by the National Natural Science Foundation of China (No. 61202232) and National Key Technology R&D Program of China (No. 2012BAH40F01). We would like to thank our colleagues Jing Fang, Yongtao Wang and Luyuan Li for their comments on this paper. The learning algorithms in our paper are implemented by LibSVM and weka, which are open source software providing implementations of machine learning algorithms.

References

  1. 1.
    Anderson, R.H.: Syntax-directed recognition of hand-printed two-dimensional mathematics. PhD thesis, Harvard University, Cambridge, Massachusetts (1968)Google Scholar
  2. 2.
    Lin, X., Gao, L., Tang, Z., Lin, X., Hu, X.: Mathematical formula identification in PDF documents. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1419–1423. IEEE (2011)Google Scholar
  3. 3.
    Lin, X., Gao, L., Tang, Z., Hu, X., Lin, X.: Identification of embedded mathematical formulas in PDF documents using SVM. In: Document Recognition and Retrieval (DRR) XIX, pp. 8297 0D 1–8 (2012)Google Scholar
  4. 4.
    Lin, X., Gao, L., Tang, Z., Lin, X., Hu, X.: Performance evaluation of mathematical formula identification. In: The 10th IAPR International Workshop on Document Analysis Systems (DAS), pp. 287–291. IEEE (2012)Google Scholar
  5. 5.
    Adobe. PDF reference, 7th edition (2008)Google Scholar
  6. 6.
    Zanibbi, R., Blostein, D.: Recognition and retrieval of mathematical expressions. Int. J. Document Anal. Recognit. pp. 1–27 (2011)Google Scholar
  7. 7.
    Baker, J.B.: A linear grammar approach for the analysis of mathematical documents. PhD thesis, University of Birmingham (2012)Google Scholar
  8. 8.
    Rahman, F., Alam, H.: Conversion of PDF documents into HTML: a case study of document image analysis. In: Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 87–91. IEEE (2003)Google Scholar
  9. 9.
    Baker, J.B., Sexton, A.P., Sorge, V.: A linear grammar approach to mathematical formula recognition from PDF. In: Proceedings of the 8th International Conference on Mathematical Knowledge Management, vol. 5625 of LNAI, pp. 201–216. Springer (2009)Google Scholar
  10. 10.
    Fateman, R.J., Tokuyasu, T., Berman, B.P., Mitchell, N.: Optical character recognition and parsing of typeset mathematics. J. Vis. Commun. Image Represent. 7(1), 2–15 (1996)CrossRefGoogle Scholar
  11. 11.
    Lee, H.J., Wang, J.S.: Design of a mathematical expression understanding system. Pattern Recognit. Lett. 18(3), 289–298 (1997)CrossRefGoogle Scholar
  12. 12.
    Toumit, J.Y., Garcia-Salicetti, S., Emptoz, H.: A hierarchical and recursive model of mathematical expressions for automatic reading of mathematical documents. In: Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR), pp. 119–122. IEEE (1999)Google Scholar
  13. 13.
    Garain, U., Chaudhuri, B.B.: A syntactic approach for processing mathematical expressions in printed documents. In: Proceedings of the 15th International Conference on Pattern Recognition (ICPR), vol. 4, pp. 523–526. IEEE (2000)Google Scholar
  14. 14.
    Kacem, A., Belaïd, A., Ben Ahmed, M.: Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. Int. J. Document Anal. Recognit. 4(2), 97–108 (2001)Google Scholar
  15. 15.
    Inoue, K., Miyazaki, R., Suzuki, M.: Optical recognition of printed mathematical documents. In: Proceedings of the Third Asian Technology Conference on Mathematics, pp. 280–289 (1998)Google Scholar
  16. 16.
    Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: INFTY: an integrated OCR system for mathematical documents. In: Proceedings of the 2003 ACM Symposium on Document Engineering, pp. 95–104. ACM (2003)Google Scholar
  17. 17.
    Baker, J.B., Sexton, A.P., Sorge, V.: Towards reverse engineering of PDF documents. In: Towards a Digital Mathematics Library, pp. 65–75. Masaryk University Press (2011)Google Scholar
  18. 18.
    Lee, H.J., Wang, J.S.: Design of a mathematical expression recognition system. In: Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 1084–1087. IEEE (1995)Google Scholar
  19. 19.
    Chowdhury, S.P., Mandal, S., Das, A.K., Chanda, B.: Automated segmentation of math-zones from document images. In: 7th International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 755–759 (2003)Google Scholar
  20. 20.
    Chang, T.Y., Takiguchi, Y., Okada, M.: Physical structure segmentation with projection profile for mathematic formulae and graphics in academic paper images. In: The Ninth International Conference on Document Analysis and Recognition (ICDAR), vol. 2, pp. 1193–1197. IEEE (2007)Google Scholar
  21. 21.
    Garain, U., Chaudhuri, B.B., Chaudhuri, A.R.: Identification of embedded mathematical expressions in scanned documents. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 1, pp. 384–387. IEEE (2004)Google Scholar
  22. 22.
    Garain, U.: Identification of mathematical expressions in document images. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1340–1344. IEEE (2009)Google Scholar
  23. 23.
    Jin, J., Han, X., Wang, Q.: Mathematical formulas extraction. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1138–1141. IEEE (2003)Google Scholar
  24. 24.
    Drake, D.M., Baird, H.S.: Distinguishing mathematics notation from English text using computational geometry. In: Proceedings. Eighth International Conference on Document Analysis and Recognition (ICDAR), pp. 1270–1274. IEEE (2005)Google Scholar
  25. 25.
    Liu, Y., Bai, K., Gao, L.: An efficient pre-processing method to identify logical components from PDF documents. Adv. Knowl. Discov. Data Min. pp. 500–511 (2011)Google Scholar
  26. 26.
    Uchida, S., Nomura, A., Suzuki, M.: Quantitative analysis of mathematical documents. Int. J. Document Anal. Recognit. 7(4), 211–218 (2005)CrossRefGoogle Scholar
  27. 27.
    Phillips, I., Chanda, B., Haralick, R: University of Washington UW-III English technical document image database (1996)Google Scholar
  28. 28.
  29. 29.
    Gao, L., Tang, Z., Lin, X., Qiu, R.: Comprehensive global typography extraction system for electronic book documents. In: The Eighth IAPR International Workshop on Document Analysis Systems (DAS), pp. 615–621. IEEE (2008)Google Scholar
  30. 30.
  31. 31.
    Bishop, C.M.: Pattern Recognition and Machine Learning, vol. 4. Springer, New York (2006)Google Scholar
  32. 32.
    Chawla, N.V., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6(1), 1–6 (2004)Google Scholar
  33. 33.
    Batista, G.E., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6(1), 20–29 (2004)Google Scholar
  34. 34.
    Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002) Google Scholar
  35. 35.
    Kim, S.H., Jeong, C.B., Kwag, H.K., Suen, C.Y.: Word segmentation of printed text lines based on gap clustering and special symbol detection. In: Proceedings. 16th International Conference on Pattern Recognition (ICPR), vol. 2, pp. 320–323. IEEE (2002)Google Scholar
  36. 36.
    Chang, C.C., Lin, C.J.: Libsvm: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)Google Scholar
  37. 37.
    Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Xiaoyan Lin
    • 1
  • Liangcai Gao
    • 1
    Email author
  • Zhi Tang
    • 1
  • Josef Baker
    • 2
  • Volker Sorge
    • 2
  1. 1.Institute of Computer Science and TechnologyPeking University No.5 Yiheyuan RoadBeijingChina
  2. 2.School of Computer ScienceUniversity of BirminghamBirminghamUK

Personalised recommendations