An Evaluation of NLP Methods to Extract Mathematical Token Descriptors

Conference paper · Intelligent Computer Mathematics (CICM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13467)

Abstract

Mathematical formulae are a foundational component of information in all scientific and mathematical papers. Parsing meaning from these expressions by extracting textual descriptors of their variable tokens is a unique challenge that requires semantic and grammatical knowledge. In this work, we present a new manually-labeled dataset (called the MTDE dataset) of mathematical objects, the contexts in which they are defined, and their textual definitions. With this dataset, we evaluate the accuracy of several modern neural network models on two definition extraction tasks. While this is not a solved task, modern language models such as BERT perform well (\(\sim \)90%). Both the dataset and the neural network models (implemented in PyTorch Jupyter notebooks) are available online to aid future researchers in this space.

Supported by University of Illinois at Urbana-Champaign - College of Engineering.

Notes

  1. https://github.com/emhamel/Mathematical-Text-Understanding.

  2. Details about each of these models can be found in the Appendix. PyTorch code for each of these models is available here: https://github.com/emhamel/Mathematical-Text-Understanding.

References

  1. International Mathematical Knowledge Trust. https://imkt.org/

  2. Aizawa, A., Kohlhase, M., Ounis, I.: NTCIR-10 math pilot task overview. In: NTCIR (2013)

  3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate, May 2016. https://doi.org/10.48550/arXiv.1409.0473

  4. Carette, J., Farmer, W.M.: A review of mathematical knowledge management. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds.) CICM 2009. LNCS (LNAI), vol. 5625, pp. 233–246. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02614-0_21

  5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], May 2019

  6. Gao, L., Jiang, Z., Yin, Y., Yuan, K., Yan, Z., Tang, Z.: Preliminary exploration of formula embedding for mathematical information retrieval: can mathematical formulae be embedded like a natural language? arXiv:1707.05154 [cs], August 2017

  7. Ginev, D.: arXMLiv 2020 - an HTML5 dataset for arXiv.org. SIGMathLing (2020)

  8. Greiner-Petter, A., et al.: Discovering mathematical objects of interest—a study of mathematical notations. In: Proceedings of The Web Conference 2020, WWW 2020, pp. 1445–1456. Association for Computing Machinery, Taipei, April 2020. https://doi.org/10.1145/3366423.3380218

  9. Hirschman, L., Gaizauskas, R.: Natural language question answering: the view from here. Nat. Lang. Eng. 7(4), 275–300 (2001). https://doi.org/10.1017/S1351324901002807

  10. Kohlhase, M. (ed.): MKM 2005. LNCS (LNAI), vol. 3863. Springer, Heidelberg (2006). https://doi.org/10.1007/11618027

  11. Kristianto, G.Y., Topić, G., Aizawa, A.: Extracting textual descriptions of mathematical expressions in scientific papers. D-Lib Mag. 20(11/12) (2014). https://doi.org/10.1045/november14-kristianto

  12. Kristianto, G.Y., Topić, G., Aizawa, A.: Utilizing dependency relationships between math expressions in math IR. Inf. Retrieval J. 20(2), 132–167 (2017). https://doi.org/10.1007/s10791-017-9296-8

  13. Kristianto, G.Y., Topic, G., Ho, F.: The MCAT math retrieval system for NTCIR-11 math track, p. 7 (2014)

  14. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization, January 2019. https://doi.org/10.48550/arXiv.1711.05101

  15. Munot, N., Govilkar, S.: Comparative study of text summarization methods. Int. J. Comput. Appl. 102, 33–37 (2014). https://doi.org/10.5120/17870-8810

  16. Pagael, R., Schubotz, M.: Mathematical language processing project. arXiv:1407.0167 [cs], July 2014

  17. Schubotz, M., et al.: Semantification of identifiers in mathematics for better math information retrieval. In: Proceedings of the 39th International ACM SIGIR Conference, SIGIR 2016, pp. 135–144. Association for Computing Machinery, New York, July 2016. https://doi.org/10.1145/2911451.2911503

  18. Shen, L., et al.: Backdoor pre-trained models can transfer to all. In: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pp. 3141–3158, November 2021. https://doi.org/10.1145/3460120.3485370

  19. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014)

  20. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. arXiv:1506.03134 [cs, stat], January 2017

  21. Wang, S., Jiang, J.: Machine comprehension using match-LSTM and answer pointer, November 2016

  22. Youssef, A., Miller, B.R.: Explorations into the use of word embedding in math search and math semantics. In: Kaliszyk, C., Brady, E., Kohlhase, A., Sacerdoti Coen, C. (eds.) CICM 2019. LNCS (LNAI), vol. 11617, pp. 291–305. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-23250-4_20

Author information

Correspondence to Emma Hamel.

Appendix

1 Models

Below is a more detailed overview of the models used in this manuscript. Long Short-Term Memory (LSTM) RNNs were used for all of the models (with the exception of BERT).

1.1 Vanilla Sequence-to-Sequence Model

Fig. 3. The Vanilla Seq2Seq architecture

To find the best output sequence, generative models use the probability chain rule, which states that the probability of a sequence is the product of the conditional probabilities of its tokens. We represent this in the following relation, where \(T\) is the training data and \(\theta \) denotes the parameters of the RNN:

$$\begin{aligned} p_{\theta }(y \mid T) = \prod _{i}{p_{\theta }(y_{i} \mid y_{i-1}, y_{i-2}, \ldots , y_{1}, T)} \end{aligned}$$

Seq2Seq-based models take only one sequence as input. Therefore, a separation token (shown as \(<SEP>\) in Fig. 3) was used to concatenate the variable name and its context. This was done for the Vanilla Seq2Seq model, the Transformer Seq2Seq model, the Pointer Network, and the BERT model (Fig. 4).
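
As a rough illustration of this setup, the sketch below assembles an LSTM encoder-decoder in PyTorch that consumes the concatenated variable-name-plus-context sequence. The vocabulary handling, dimensions, and teacher forcing are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch (not the authors' exact implementation) of the LSTM
# encoder-decoder shown in Fig. 3. Embedding and hidden sizes are assumptions.
import torch
import torch.nn as nn

class VanillaSeq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src, tgt):
        # src: variable name + <SEP> + context, as one token-id sequence
        _, (h, c) = self.encoder(self.embed(src))
        # decoder is conditioned on the final encoder state and the
        # previously generated (teacher-forced) target tokens
        dec_out, _ = self.decoder(self.embed(tgt), (h, c))
        return self.out(dec_out)  # logits over the output vocabulary
```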

1.2 Transformer Seq2Seq Model

Fig. 4. The Transformer Seq2Seq architecture

To create a context vector, an annotation for each input word is generated based on the word itself and its surrounding words. After this, the context vector for the \(i\)-th output word is calculated as the weighted sum of the annotations \(h\):

$$\begin{aligned} c_{i} = \sum ^{k}_{j=1}{\alpha _{ij}h_{j}} \end{aligned}$$

where \(i\) indexes the output sequence and \(k\) is the length of the input sequence. The attention weights are calculated as follows:

$$\begin{aligned} \alpha _{ij} = \frac{\exp {(a(s_{i - 1}, h_{j}))}}{\sum ^{k}_{l=1}{\exp {(a(s_{i - 1}, h_{l}))}}} \end{aligned}$$

where \(a(s_{i - 1}, h_{j})\) is the alignment function, which scores how strongly the \(j\)-th input word is related to the \(i\)-th output word. The alignment function takes the previous decoder hidden state \(s_{i-1}\) and the annotation of the \(j\)-th input word, and learns the weights for each input annotation-output word pair via a feed-forward layer.
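
Below is a minimal PyTorch sketch of this additive attention: it scores each annotation \(h_{j}\) against the previous decoder state \(s_{i-1}\) with a feed-forward alignment layer, normalizes the scores with a softmax, and forms the context vector \(c_{i}\). The projection sizes and tensor shapes are illustrative assumptions, not the authors' exact settings.

```python
# A sketch of the additive (Bahdanau-style) attention described above.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)          # projects s_{i-1}
        self.W_h = nn.Linear(enc_dim, attn_dim)          # projects annotations h_j
        self.v = nn.Linear(attn_dim, 1, bias=False)      # alignment score a(s_{i-1}, h_j)

    def forward(self, s_prev, annotations):
        # s_prev:      (batch, dec_dim)     previous decoder state s_{i-1}
        # annotations: (batch, k, enc_dim)  encoder annotations h_1..h_k
        scores = self.v(torch.tanh(
            self.W_s(s_prev).unsqueeze(1) + self.W_h(annotations)
        )).squeeze(-1)                                   # shape (batch, k)
        alpha = torch.softmax(scores, dim=-1)            # attention weights alpha_{ij}
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=1)  # context vector c_i
        return context, alpha
```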

1.3 Pointer Network

To find the most likely answer among the input tokens, the Pointer Network computes attention weights over the variable-length input as follows:

$$\begin{aligned} \alpha ^{i}_{j} = \mathop {\textrm{softmax}}\limits {(u^{i}_{j})} \end{aligned}$$
$$\begin{aligned} u^{i}_{j} = v^\top \tanh {(W_{1}e_{j} + W_{2}d_{i})} \end{aligned}$$

where \(e_{j}\) is the encoder state at the \(j\)-th input word, \(d_{i}\) is the decoder state at the \(i\)-th output step, and \(v\), \(W_{1}\), \(W_{2}\) are learnable parameters (Fig. 5).

Fig. 5. The Pointer Network architecture
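
A minimal PyTorch sketch of this pointer attention is given below; the dimensions are illustrative, and the code is not the authors' implementation.

```python
# A sketch of the pointer attention in Sec. 1.3: the softmax over u^i_j
# yields a distribution over the (variable-length) input tokens.
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W1 = nn.Linear(enc_dim, attn_dim, bias=False)  # acts on encoder states e_j
        self.W2 = nn.Linear(dec_dim, attn_dim, bias=False)  # acts on decoder state d_i
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, n, enc_dim)  encoder states e_1..e_n
        # dec_state:  (batch, dec_dim)     decoder state d_i
        u = self.v(torch.tanh(
            self.W1(enc_states) + self.W2(dec_state).unsqueeze(1)
        )).squeeze(-1)                     # u^i_j, shape (batch, n)
        return torch.softmax(u, dim=-1)    # alpha^i_j: pointer over input tokens
```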

1.4 Match-LSTM Model

Fig. 6. The Match-LSTM architecture

A visual representation of the model is given in Fig. 6. The attention weights in the Match-LSTM layer are calculated as follows:

$$\begin{aligned} \alpha _{i} = w^{\top }u_{i} + b \end{aligned}$$
$$\begin{aligned} u_{i} = \tanh {(W^{v}h^{v} + (W^{c}h_{i}^{c} + W^{r}h_{i-1}^{r} + b^{c}))} \end{aligned}$$

Here, \(W^{v}, W^{c}, W^{r}, b^{c}, w, b\) are learned weights, \(h_{i}^{c}\) and \(h^{v}\) are the hidden states of the \(i\)-th word of the context text and of the variable name, respectively, and \(h_{i-1}^{r}\) is the previous hidden state of the Match-LSTM. The attention weights are then combined with \(h_{i}^{c}\) and \(h^{v}\) and processed in the Match-LSTM. This process is done in both the forward and backward directions, and all resulting hidden states are coalesced into a final hidden state vector. The final hidden state vector is then processed by a pointer layer, which calculates the most likely position of the definition of the variable in the context text. The following formulas show how the probability \(\beta \) is calculated:

$$\begin{aligned} \beta _{j} = \mathop {\textrm{softmax}}\limits {(v^{\top }s_{j} + b \otimes e_{C + 1})} \end{aligned}$$
$$\begin{aligned} s_{j} = \tanh {(V\bar{H^{r}} + (W^{a}h_{k - 1}^{a} + b^{a}) \otimes e_{C + 1})} \end{aligned}$$

where \(V, W^{a}, b^{a}, v\) and \(b\) are learned weights, \(\bar{H^{r}}\) is the vector of concatenated Match-LSTM hidden states, \(h_{k - 1}^{a}\) is the previous hidden state of the pointer LSTM and \(e_{C + 1}\) is a vector of ones with size C, where C is the length of the context text. \(\beta _{j}\) is combined with \(h_{j}^{c}\) and processed by an LSTM.
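
The sketch below illustrates a single step of the match attention described above. Folding the biases \(b^{c}\) and \(b\) into the linear layers, and concatenating \(h_{i}^{c}\) with the attention-weighted \(h^{v}\) before the Match-LSTM cell, are simplifying assumptions rather than details taken from the paper.

```python
# A sketch of one match-attention step (the equations for u_i and alpha_i).
import torch
import torch.nn as nn

class MatchAttentionStep(nn.Module):
    def __init__(self, hid_dim):
        super().__init__()
        self.W_v = nn.Linear(hid_dim, hid_dim, bias=False)  # variable-name state h^v
        self.W_c = nn.Linear(hid_dim, hid_dim, bias=False)  # context state h_i^c
        self.W_r = nn.Linear(hid_dim, hid_dim)               # previous match state h_{i-1}^r (bias = b^c)
        self.w = nn.Linear(hid_dim, 1)                        # scalar attention weight alpha_i (bias = b)

    def forward(self, h_v, h_c_i, h_r_prev):
        # h_v, h_c_i, h_r_prev: (batch, hid_dim)
        u_i = torch.tanh(self.W_v(h_v) + self.W_c(h_c_i) + self.W_r(h_r_prev))
        alpha_i = self.w(u_i)                    # attention weight for the i-th context word
        # combine alpha_i with h_i^c and h^v; the result is fed to the Match-LSTM cell
        z_i = torch.cat([h_c_i, alpha_i * h_v], dim=-1)
        return z_i, alpha_i
```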

1.5 BERT-Based Model

In this method, a pre-trained Hugging Face BertForQuestionAnswering model was fine-tuned to perform the definition-extraction task. The text was processed with a custom Hugging Face tokenizer created by [18], trained on a mathematical language corpus comprising a wide range of educational materials used in primary-school, high-school, and college-level courses (Fig. 7).

Fig. 7. The BERT architecture
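
For orientation, the sketch below shows the generic Hugging Face fine-tuning pattern for BertForQuestionAnswering on a (variable name, context) pair. The checkpoint name, example inputs, and span positions are placeholders; in particular, the standard BERT tokenizer stands in for the custom mathematical tokenizer of [18].

```python
# A sketch of span-extraction fine-tuning with BertForQuestionAnswering.
# Checkpoint, example text, and span positions are illustrative placeholders.
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # placeholder tokenizer
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# "question" = the variable name, "context" = the sentence defining it
enc = tokenizer("x_t", "Let x_t denote the state of the system at time t.",
                return_tensors="pt")

# gold start/end token positions of the definition span (illustrative values)
start_positions = torch.tensor([3])
end_positions = torch.tensor([8])

outputs = model(**enc, start_positions=start_positions, end_positions=end_positions)
outputs.loss.backward()  # fine-tune with an optimizer such as AdamW [14]
```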

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hamel, E., Zheng, H., Kani, N. (2022). An Evaluation of NLP Methods to Extract Mathematical Token Descriptors. In: Buzzard, K., Kutsia, T. (eds.) Intelligent Computer Mathematics. CICM 2022. Lecture Notes in Computer Science (LNAI), vol. 13467. Springer, Cham. https://doi.org/10.1007/978-3-031-16681-5_23

  • DOI: https://doi.org/10.1007/978-3-031-16681-5_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16680-8

  • Online ISBN: 978-3-031-16681-5
