An Evaluation of NLP Methods to Extract Mathematical Token Descriptors

Conference paper · Intelligent Computer Mathematics (CICM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13467)

Abstract

Mathematical formulae are a foundational component of information in all scientific and mathematical papers. Parsing meaning from these expressions by extracting textual descriptors of their variable tokens is a unique challenge that requires semantic and grammatical knowledge. In this work, we present a new manually-labeled dataset (called the MTDE dataset) of mathematical objects, the contexts in which they are defined, and their textual definitions. With this dataset, we evaluate the accuracy of several modern neural network models on two definition extraction tasks. While this is not a solved task, modern language models such as BERT perform well (\(\sim \)90%). Both the dataset and the neural network models (implemented in PyTorch Jupyter notebooks) are available online to aid future researchers in this space.

Supported by University of Illinois at Urbana-Champaign - College of Engineering.

Notes

  1. https://github.com/emhamel/Mathematical-Text-Understanding.

  2. Details about each of these models can be found in the Appendix. PyTorch code for each of these models is available here: https://github.com/emhamel/Mathematical-Text-Understanding.

References

  1. International Mathematical Knowledge Trust. https://imkt.org/

  2. Aizawa, A., Kohlhase, M., Ounis, I.: NTCIR-10 math pilot task overview. In: NTCIR (2013)

  3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate, May 2016. https://doi.org/10.48550/arXiv.1409.0473

  4. Carette, J., Farmer, W.M.: A review of mathematical knowledge management. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds.) CICM 2009. LNCS (LNAI), vol. 5625, pp. 233–246. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02614-0_21

  5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], May 2019

  6. Gao, L., Jiang, Z., Yin, Y., Yuan, K., Yan, Z., Tang, Z.: Preliminary exploration of formula embedding for mathematical information retrieval: can mathematical formulae be embedded like a natural language? arXiv:1707.05154 [cs], August 2017

  7. Ginev, D.: arXMLiv 2020 - an HTML5 dataset for arXiv.org. SIGMathLing (2020)

  8. Greiner-Petter, A., et al.: Discovering mathematical objects of interest—a study of mathematical notations. In: Proceedings of The Web Conference 2020, WWW 2020, pp. 1445–1456. Association for Computing Machinery, Taipei, April 2020. https://doi.org/10.1145/3366423.3380218

  9. Hirschman, L., Gaizauskas, R.: Natural language question answering: the view from here. Nat. Lang. Eng. 7(4), 275–300 (2001). https://doi.org/10.1017/S1351324901002807

  10. Kohlhase, M. (ed.): MKM 2005. LNCS (LNAI), vol. 3863. Springer, Heidelberg (2006). https://doi.org/10.1007/11618027

  11. Kristianto, G.Y., Topić, G., Aizawa, A.: Extracting textual descriptions of mathematical expressions in scientific papers. D-Lib Mag. 20(11/12) (2014). https://doi.org/10.1045/november14-kristianto

  12. Kristianto, G.Y., Topić, G., Aizawa, A.: Utilizing dependency relationships between math expressions in math IR. Inf. Retrieval J. 20(2), 132–167 (2017). https://doi.org/10.1007/s10791-017-9296-8

  13. Kristianto, G.Y., Topic, G., Ho, F.: The MCAT math retrieval system for NTCIR-11 math track, p. 7 (2014)

  14. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization, January 2019. https://doi.org/10.48550/arXiv.1711.05101

  15. Munot, N., Govilkar, S.: Comparative study of text summarization methods. Int. J. Comput. Appl. 102, 33–37 (2014). https://doi.org/10.5120/17870-8810

  16. Pagael, R., Schubotz, M.: Mathematical language processing project. arXiv:1407.0167 [cs], July 2014

  17. Schubotz, M., et al.: Semantification of identifiers in mathematics for better math information retrieval. In: Proceedings of the 39th International ACM SIGIR Conference, SIGIR 2016, pp. 135–144. Association for Computing Machinery, New York, July 2016. https://doi.org/10.1145/2911451.2911503

  18. Shen, L., et al.: Backdoor pre-trained models can transfer to all. In: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pp. 3141–3158, November 2021. https://doi.org/10.1145/3460120.3485370

  19. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014)

  20. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. arXiv:1506.03134 [cs, stat], January 2017

  21. Wang, S., Jiang, J.: Machine comprehension using match-LSTM and answer pointer, November 2016

  22. Youssef, A., Miller, B.R.: Explorations into the use of word embedding in math search and math semantics. In: Kaliszyk, C., Brady, E., Kohlhase, A., Sacerdoti Coen, C. (eds.) CICM 2019. LNCS (LNAI), vol. 11617, pp. 291–305. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-23250-4_20

Author information

Correspondence to Emma Hamel.

Appendix

1 Models

Below is a more detailed overview of the models used in this manuscript. Long Short-Term Memory (LSTM) RNNs were used for all of the models (with the exception of BERT).

1.1 Vanilla Sequence-to-Sequence Model

Fig. 3. The Vanilla Seq2Seq architecture

To find the best output sequence, generative models use the probability chain rule, which states that the probability of a sequence is the product of the conditional probabilities of its tokens. We represent this in the following relation, where \(T\) is the training data and \(\theta \) denotes the parameters of the RNN:

$$\begin{aligned} p_{\theta }(y \mid T) = \prod _{i}{p_{\theta }(y_{i} \mid y_{i-1}, y_{i-2}, \ldots , y_{1}, T)} \end{aligned}$$

Seq2Seq-based models take only one sequence as input. Therefore, a separation token (shown as \(<SEP>\) in Fig. 3) was used to concatenate the variable name and its context. This was done for the Vanilla Seq2Seq model, the Transformer Seq2Seq model, the Pointer Network, and the BERT model (Fig. 4).
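
As a rough illustration of this setup, the sketch below assembles an LSTM encoder-decoder in PyTorch that consumes the concatenated variable-name-plus-context sequence. The vocabulary handling, dimensions, and teacher forcing are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch (not the authors' exact implementation) of the LSTM
# encoder-decoder shown in Fig. 3. Embedding and hidden sizes are assumptions.
import torch
import torch.nn as nn

class VanillaSeq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src, tgt):
        # src: variable name + <SEP> + context, as one token-id sequence
        _, (h, c) = self.encoder(self.embed(src))
        # decoder is conditioned on the final encoder state and the
        # previously generated (teacher-forced) target tokens
        dec_out, _ = self.decoder(self.embed(tgt), (h, c))
        return self.out(dec_out)  # logits over the output vocabulary
```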

1.2 Transformer Seq2Seq Model

Fig. 4. The Transformer Seq2Seq architecture

To create a context vector, an annotation for each input word is generated based on the word itself and its surrounding words. After this, the context vector for the \(i\)-th output word is calculated as the weighted sum of the annotations \(h\):

$$\begin{aligned} c_{i} = \sum ^{k}_{j=1}{\alpha _{ij}h_{j}} \end{aligned}$$

where \(i\) indexes the output sequence and \(k\) is the length of the input sequence. The attention weights are calculated as follows:

$$\begin{aligned} \alpha _{ij} = \frac{\exp {(a(s_{i - 1}, h_{j}))}}{\sum ^{k}_{l=1}{\exp {(a(s_{i - 1}, h_{l}))}}} \end{aligned}$$

where \(a(s_{i - 1}, h_{j})\) is the alignment function, which scores how strongly the \(j\)-th input word is related to the \(i\)-th output word. The alignment function takes the previous decoder hidden state \(s_{i-1}\) and the annotation of the \(j\)-th input word, and learns the weights for each input annotation-output word pair via a feed-forward layer.
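
Below is a minimal PyTorch sketch of this additive attention: it scores each annotation \(h_{j}\) against the previous decoder state \(s_{i-1}\) with a feed-forward alignment layer, normalizes the scores with a softmax, and forms the context vector \(c_{i}\). The projection sizes and tensor shapes are illustrative assumptions, not the authors' exact settings.

```python
# A sketch of the additive (Bahdanau-style) attention described above.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)          # projects s_{i-1}
        self.W_h = nn.Linear(enc_dim, attn_dim)          # projects annotations h_j
        self.v = nn.Linear(attn_dim, 1, bias=False)      # alignment score a(s_{i-1}, h_j)

    def forward(self, s_prev, annotations):
        # s_prev:      (batch, dec_dim)     previous decoder state s_{i-1}
        # annotations: (batch, k, enc_dim)  encoder annotations h_1..h_k
        scores = self.v(torch.tanh(
            self.W_s(s_prev).unsqueeze(1) + self.W_h(annotations)
        )).squeeze(-1)                                   # shape (batch, k)
        alpha = torch.softmax(scores, dim=-1)            # attention weights alpha_{ij}
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=1)  # context vector c_i
        return context, alpha
```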

1.3 Pointer Network

To find the most likely answer among the input tokens, the Pointer Network computes attention weights over the variable-length input as follows:

$$\begin{aligned} \alpha ^{i}_{j} = \mathop {\textrm{softmax}}\limits {(u^{i}_{j})} \end{aligned}$$
$$\begin{aligned} u^{i}_{j} = v^\top \tanh {(W_{1}e_{j} + W_{2}d_{i})} \end{aligned}$$

where \(e_{j}\) is the encoder state at the \(j\)-th input word, \(d_{i}\) is the decoder state at the \(i\)-th output step, and \(v\), \(W_{1}\), \(W_{2}\) are learnable parameters (Fig. 5).

Fig. 5. The Pointer Network architecture
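
A minimal PyTorch sketch of this pointer attention is given below; the dimensions are illustrative, and the code is not the authors' implementation.

```python
# A sketch of the pointer attention in Sec. 1.3: the softmax over u^i_j
# yields a distribution over the (variable-length) input tokens.
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.W1 = nn.Linear(enc_dim, attn_dim, bias=False)  # acts on encoder states e_j
        self.W2 = nn.Linear(dec_dim, attn_dim, bias=False)  # acts on decoder state d_i
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, n, enc_dim)  encoder states e_1..e_n
        # dec_state:  (batch, dec_dim)     decoder state d_i
        u = self.v(torch.tanh(
            self.W1(enc_states) + self.W2(dec_state).unsqueeze(1)
        )).squeeze(-1)                     # u^i_j, shape (batch, n)
        return torch.softmax(u, dim=-1)    # alpha^i_j: pointer over input tokens
```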

1.4 Match-LSTM Model

Fig. 6. The Match-LSTM architecture

A visual representation of the model is given in Fig. 6. The attention weights in the Match-LSTM layer are calculated as follows:

$$\begin{aligned} \alpha _{i} = w^{\top }u_{i} + b \end{aligned}$$
$$\begin{aligned} u_{i} = \tanh {(W^{v}h^{v} + (W^{c}h_{i}^{c} + W^{r}h_{i-1}^{r} + b^{c}))} \end{aligned}$$

Here, \(W^{v}, W^{c}, W^{r}, b^{c}, w, b\) are learned weights, \(h_{i}^{c}\) and \(h^{v}\) are the hidden states of the \(i\)-th word of the context text and of the variable name, respectively, and \(h_{i-1}^{r}\) is the previous hidden state of the Match-LSTM. The attention weights are then combined with \(h_{i}^{c}\) and \(h^{v}\) and processed in the Match-LSTM. This process is done in both the forward and backward directions, and all resulting hidden states are coalesced into a final hidden state vector. The final hidden state vector is then processed by a pointer layer, which calculates the most likely position of the definition of the variable in the context text. The following formulas show how the probability \(\beta \) is calculated:

$$\begin{aligned} \beta _{j} = \mathop {\textrm{softmax}}\limits {(v^{\top }s_{j} + b \otimes e_{C + 1})} \end{aligned}$$
$$\begin{aligned} s_{j} = \tanh {(V\bar{H^{r}} + (W^{a}h_{k - 1}^{a} + b^{a}) \otimes e_{C + 1})} \end{aligned}$$

where \(V, W^{a}, b^{a}, v\) and \(b\) are learned weights, \(\bar{H^{r}}\) is the vector of concatenated Match-LSTM hidden states, \(h_{k - 1}^{a}\) is the previous hidden state of the pointer LSTM and \(e_{C + 1}\) is a vector of ones with size C, where C is the length of the context text. \(\beta _{j}\) is combined with \(h_{j}^{c}\) and processed by an LSTM.
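
The sketch below illustrates a single step of the match attention described above. Folding the biases \(b^{c}\) and \(b\) into the linear layers, and concatenating \(h_{i}^{c}\) with the attention-weighted \(h^{v}\) before the Match-LSTM cell, are simplifying assumptions rather than details taken from the paper.

```python
# A sketch of one match-attention step (the equations for u_i and alpha_i).
import torch
import torch.nn as nn

class MatchAttentionStep(nn.Module):
    def __init__(self, hid_dim):
        super().__init__()
        self.W_v = nn.Linear(hid_dim, hid_dim, bias=False)  # variable-name state h^v
        self.W_c = nn.Linear(hid_dim, hid_dim, bias=False)  # context state h_i^c
        self.W_r = nn.Linear(hid_dim, hid_dim)               # previous match state h_{i-1}^r (bias = b^c)
        self.w = nn.Linear(hid_dim, 1)                        # scalar attention weight alpha_i (bias = b)

    def forward(self, h_v, h_c_i, h_r_prev):
        # h_v, h_c_i, h_r_prev: (batch, hid_dim)
        u_i = torch.tanh(self.W_v(h_v) + self.W_c(h_c_i) + self.W_r(h_r_prev))
        alpha_i = self.w(u_i)                    # attention weight for the i-th context word
        # combine alpha_i with h_i^c and h^v; the result is fed to the Match-LSTM cell
        z_i = torch.cat([h_c_i, alpha_i * h_v], dim=-1)
        return z_i, alpha_i
```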

1.5 BERT-Based Model

In this method, a pre-trained Hugging Face BertForQuestionAnswering model was fine-tuned to perform the definition-extraction task. The text was processed with a custom Hugging Face tokenizer created by [18], trained on a mathematical language corpus comprising a wide range of educational materials used in primary-school, high-school, and college-level courses (Fig. 7).

Fig. 7. The BERT architecture
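
For orientation, the sketch below shows the generic Hugging Face fine-tuning pattern for BertForQuestionAnswering on a (variable name, context) pair. The checkpoint name, example inputs, and span positions are placeholders; in particular, the standard BERT tokenizer stands in for the custom mathematical tokenizer of [18].

```python
# A sketch of span-extraction fine-tuning with BertForQuestionAnswering.
# Checkpoint, example text, and span positions are illustrative placeholders.
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # placeholder tokenizer
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# "question" = the variable name, "context" = the sentence defining it
enc = tokenizer("x_t", "Let x_t denote the state of the system at time t.",
                return_tensors="pt")

# gold start/end token positions of the definition span (illustrative values)
start_positions = torch.tensor([3])
end_positions = torch.tensor([8])

outputs = model(**enc, start_positions=start_positions, end_positions=end_positions)
outputs.loss.backward()  # fine-tune with an optimizer such as AdamW [14]
```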

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hamel, E., Zheng, H., Kani, N. (2022). An Evaluation of NLP Methods to Extract Mathematical Token Descriptors. In: Buzzard, K., Kutsia, T. (eds.) Intelligent Computer Mathematics. CICM 2022. Lecture Notes in Computer Science (LNAI), vol. 13467. Springer, Cham. https://doi.org/10.1007/978-3-031-16681-5_23

  • DOI: https://doi.org/10.1007/978-3-031-16681-5_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16680-8

  • Online ISBN: 978-3-031-16681-5
