
Part-of-Math Tagging and Applications

  • Conference paper
Intelligent Computer Mathematics (CICM 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10383)


Abstract

Nearly all of the recent mathematical literature, and much of the older literature, is online, mostly in natural-language form. Math content processing therefore presents some of the same challenges faced in natural language processing (NLP), such as math disambiguation and math semantics determination. These challenges must be surmounted to enable more effective math knowledge management, math knowledge discovery, automated presentation-to-computation (P2C) conversion, and automated math reasoning. Meeting this goal requires considerable math language processing (MLP) technology.

This project aims to advance MLP by developing (1) a sophisticated part-of-math (POM) tagger, (2) math-sense disambiguation techniques along with supporting machine-learning (ML) based MLP algorithms, and (3) semantics extraction from, and enrichment of, math expressions. Specifically, the project first created an evolving tagset for math terms and expressions, and is developing a general-purpose POM tagger. The tagger works in several scans and interacts with other MLP algorithms that will be developed in this project. In the first scan of an input math document, each math term and some sub-expressions are tagged with two kinds of tags. The first kind consists of definite tags (such as operation, relation, numerator, etc.) that the tagger is certain of. The second kind consists of alternative, tentative features (including alternative roles and meanings) drawn from a knowledge base that has been developed for this project. The second and third scans will, in conjunction with some NLP/ML-based algorithms, select the right features from among those alternatives, disambiguate the terms, group subsequences of terms into unambiguous sub-expressions and tag them, and thus derive definite, unambiguous semantics of math terms and expressions. Developing the NLP/ML-based algorithms needed for this work is another part of this project. These include math topic modeling, math context modeling, math document classification (into various standard areas of math), and definition-harvesting algorithms.
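The multi-scan idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tiny `KNOWLEDGE_BASE`, the `CONTEXT_CUES` table, the `number` tag, and the function names `first_scan` and `later_scan` are all hypothetical, and the keyword-overlap rule merely stands in for the NLP/ML-based selection the project envisions.

```python
from dataclasses import dataclass

# Hypothetical mini knowledge base: token -> (definite tags, tentative features).
KNOWLEDGE_BASE = {
    "+": (["operation"], []),
    "=": (["relation"], []),
    "\\pi": ([], ["constant: 3.14159...", "function: prime-counting", "variable"]),
    "i": ([], ["constant: imaginary unit", "variable: index"]),
}

# Hypothetical context cues: tentative feature -> words that support it.
CONTEXT_CUES = {
    "constant: 3.14159...": {"circle", "radius", "angle"},
    "function: prime-counting": {"prime", "primes"},
    "constant: imaginary unit": {"complex", "imaginary"},
    "variable: index": {"sum", "index"},
}


@dataclass
class TaggedTerm:
    token: str
    definite: list      # tags the tagger is certain of
    alternatives: list  # tentative features awaiting disambiguation


def first_scan(tokens):
    """Attach definite tags and tentative features to each token."""
    tagged = []
    for tok in tokens:
        if tok.isdigit():
            # Assumed tag name; the source does not list a 'number' tag.
            definite, alternatives = ["number"], []
        else:
            definite, alternatives = KNOWLEDGE_BASE.get(tok, ([], ["unknown"]))
        tagged.append(TaggedTerm(tok, list(definite), list(alternatives)))
    return tagged


def later_scan(tagged, context_words):
    """Promote the tentative feature best supported by the surrounding words."""
    context = {w.lower() for w in context_words}
    for term in tagged:
        if not term.alternatives:
            continue
        best = max(
            term.alternatives,
            key=lambda feat: len(CONTEXT_CUES.get(feat, set()) & context),
        )
        # Only commit when at least one cue word actually appears in the context.
        if CONTEXT_CUES.get(best, set()) & context:
            term.definite.append(best)
            term.alternatives = []
    return tagged
```

For example, calling `first_scan(["\\pi", "=", "3"])` leaves `\pi` with three tentative features, and a subsequent `later_scan(..., ["the", "number", "of", "primes"])` would promote the prime-counting reading because "primes" is the only cue word present. A term with no supporting cue keeps its alternatives for a later pass.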

The project will create significant new concepts and techniques that will advance knowledge in two respects. First, the tagger, math disambiguation techniques, and NLP/ML-based algorithms, though they correspond to NLP and ML counterparts, will be quite novel because math expressions are radically different from natural language. Second, the project outcomes will enable the development of new advanced applications such as (1) techniques for computer-aided semantic enrichment of digital math libraries; (2) automated P2C conversion of math expressions from natural form to (i) a machine-computable form and (ii) a formal form suitable for automated reasoning; (3) math question-answering capabilities at the manuscript level and collection level; (4) richer math UIs; and (5) more accurate math optical character recognition.




Author information


Corresponding author

Correspondence to Abdou Youssef.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Youssef, A. (2017). Part-of-Math Tagging and Applications. In: Geuvers, H., England, M., Hasan, O., Rabe, F., Teschke, O. (eds) Intelligent Computer Mathematics. CICM 2017. Lecture Notes in Computer Science, vol 10383. Springer, Cham. https://doi.org/10.1007/978-3-319-62075-6_25


  • DOI: https://doi.org/10.1007/978-3-319-62075-6_25


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62074-9

  • Online ISBN: 978-3-319-62075-6

  • eBook Packages: Computer Science (R0)
