## Abstract

Mathematical formulae are a foundational component of scientific and mathematical papers. Extracting meaning from these expressions by finding textual descriptors of their variable tokens is a distinct challenge that requires both semantic and grammatical knowledge. In this work, we present a new manually labeled dataset (the MTDE dataset) of mathematical objects, the contexts in which they are defined, and their textual definitions. With this dataset, we evaluate the accuracy of several modern neural network models on two definition extraction tasks. While the task is not solved, modern language models such as BERT perform well (\(\sim \)90%). Both the dataset and the neural network models (implemented in PyTorch Jupyter notebooks) are available online to aid future researchers in this space.

Supported by University of Illinois at Urbana-Champaign - College of Engineering.

## Notes

- 1.
- 2. Details about each of these models can be found in the Appendix. PyTorch code for each of these models is available here: https://github.com/emhamel/Mathematical-Text-Understanding.

## References

International Mathematical Knowledge Trust. https://imkt.org/

Aizawa, A., Kohlhase, M., Ounis, I.: NTCIR-10 math pilot task overview. In: NTCIR (2013)

Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate, May 2016. https://doi.org/10.48550/arXiv.1409.0473

Carette, J., Farmer, W.M.: A review of mathematical knowledge management. In: Carette, J., Dixon, L., Coen, C.S., Watt, S.M. (eds.) CICM 2009. LNCS (LNAI), vol. 5625, pp. 233–246. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02614-0_21

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], May 2019

Gao, L., Jiang, Z., Yin, Y., Yuan, K., Yan, Z., Tang, Z.: Preliminary exploration of formula embedding for mathematical information retrieval: can mathematical formulae be embedded like a natural language? arXiv:1707.05154 [cs], August 2017

Ginev, D.: arXMLiv 2020 - an HTML5 dataset for arXiv.org. SIGMathLing (2020)

Greiner-Petter, A., et al.: Discovering mathematical objects of interest—a study of mathematical notations. In: Proceedings of The Web Conference 2020, WWW 2020, pp. 1445–1456. Association for Computing Machinery, Taipei, April 2020. https://doi.org/10.1145/3366423.3380218

Hirschman, L., Gaizauskas, R.: Natural language question answering: the view from here. Nat. Lang. Eng. **7**(4), 275–300 (2001). https://doi.org/10.1017/S1351324901002807

Kohlhase, M. (ed.): MKM 2005. LNCS (LNAI), vol. 3863. Springer, Heidelberg (2006). https://doi.org/10.1007/11618027

Kristianto, G.Y., Topić, G., Aizawa, A.: Extracting textual descriptions of mathematical expressions in scientific papers. D-Lib Mag. **20**(11/12) (2014). https://doi.org/10.1045/november14-kristianto

Kristianto, G.Y., Topić, G., Aizawa, A.: Utilizing dependency relationships between math expressions in math IR. Inf. Retrieval J. **20**(2), 132–167 (2017). https://doi.org/10.1007/s10791-017-9296-8

Kristianto, G.Y., Topic, G., Ho, F.: The MCAT math retrieval system for NTCIR-11 math track, p. 7 (2014)

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization, January 2019. https://doi.org/10.48550/arXiv.1711.05101

Munot, N., Govilkar, S.: Comparative study of text summarization methods. Int. J. Comput. Appl. **102**, 33–37 (2014). https://doi.org/10.5120/17870-8810

Pagael, R., Schubotz, M.: Mathematical language processing project. arXiv:1407.0167 [cs], July 2014

Schubotz, M., et al.: Semantification of identifiers in mathematics for better math information retrieval. In: Proceedings of the 39th International ACM SIGIR Conference, SIGIR 2016, pp. 135–144. Association for Computing Machinery, New York, July 2016. https://doi.org/10.1145/2911451.2911503

Shen, L., et al.: Backdoor pre-trained models can transfer to all. In: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pp. 3141–3158, November 2021. https://doi.org/10.1145/3460120.3485370

Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014)

Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. arXiv:1506.03134 [cs, stat], January 2017

Wang, S., Jiang, J.: Machine comprehension using match-LSTM and answer pointer, November 2016

Youssef, A., Miller, B.R.: Explorations into the use of word embedding in math search and math semantics. In: Kaliszyk, C., Brady, E., Kohlhase, A., Sacerdoti Coen, C. (eds.) CICM 2019. LNCS (LNAI), vol. 11617, pp. 291–305. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-23250-4_20

## Appendices

### Appendix

### 1 Models

Below is a more detailed overview of the models used in this manuscript. Long Short-Term Memory (LSTM) RNNs were used for all of the models (with the exception of BERT).

### 1.1 Vanilla Sequence-to-Sequence Model

To find the best output sequence, generative models use the probability chain rule, which states that the probability of a sequence is the product of the conditional probabilities of its tokens, each conditioned on the tokens that precede it. We represent this in the following relation, where \(T\) is the training data and \(\theta \) denotes the parameters of the RNN:
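As a sketch of this factorization in standard notation (the symbols \(x\) for the input sequence, \(y_{1}, \dots, y_{m}\) for the output tokens, and \(\theta^{*}\) for the trained parameters are assumptions introduced here for illustration), the probability of an output sequence and the usual maximum-likelihood training objective over the pairs in \(T\) can be written as:

\[
p(y \mid x; \theta) = \prod_{t=1}^{m} p(y_{t} \mid y_{1}, \dots, y_{t-1}, x; \theta),
\qquad
\theta^{*} = \arg\max_{\theta} \sum_{(x, y) \in T} \log p(y \mid x; \theta)
\]

At inference time, the decoder searches for the sequence \(y\) that maximizes \(p(y \mid x; \theta^{*})\), typically with greedy or beam search.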

Seq2Seq-based models take only one sequence as their input. Therefore, a separation token (shown as \(<SEP>\) in Fig. 3) was used to concatenate the variable name and its context. This was done for the Vanilla Seq2Seq model, Transformer Seq2Seq model, Pointer Network, and BERT model (Fig. 4).
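The concatenation step can be illustrated with a minimal Python sketch. The function name `build_input`, the token lists, and the ordering (variable name first, then context) are assumptions made for illustration; the released notebooks may arrange the input differently.

```python
SEP_TOKEN = "<SEP>"  # separator shown in the text; the exact vocabulary entry is an assumption


def build_input(variable_tokens, context_tokens, sep_token=SEP_TOKEN):
    """Join variable-name tokens and context tokens into a single input sequence.

    Hypothetical helper: places the variable name before the separator and
    the context after it, so a single-input model sees both.
    """
    return list(variable_tokens) + [sep_token] + list(context_tokens)


example = build_input(["v"], ["let", "v", "be", "the", "velocity"])
print(example)
```

Because the separator is an ordinary token in the sequence, the model can learn where the variable name ends and the context begins.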

### 1.2 Transformer Seq2Seq Model

To create a context vector, an annotation for each input word is generated based on the word itself and its surrounding words. The context vector for the \(i\)-th output word is then calculated as the weighted sum of the annotations \(h\):

\[
c_{i} = \sum_{j=1}^{k} \alpha_{ij} h_{j}
\]

where \(i\) indexes the words of the output sequence and \(k\) is the length of the input sequence. The attention weights are calculated as such:

\[
\alpha_{ij} = \frac{\exp\left(a(s_{i - 1}, h_{j})\right)}{\sum_{l=1}^{k} \exp\left(a(s_{i - 1}, h_{l})\right)}
\]

where \(a(s_{i - 1}, h_{j})\) is the alignment function, which scores how strongly the \(j\)-th input word is related to the \(i\)-th output word. The alignment function takes the decoder hidden state \(s_{i - 1}\) that precedes the \(i\)-th output word and the annotation \(h_{j}\) of the \(j\)-th input word, and its weights are learned via a feed-forward layer.
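The weighted-sum step described above can be sketched in plain Python. This is an illustration under assumptions: the alignment scores are taken as given rather than produced by a learned feed-forward layer, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw alignment scores into attention weights that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(alignment_scores, annotations):
    """Return the attention weights and the weighted sum of the annotations."""
    weights = softmax(alignment_scores)
    dim = len(annotations[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, annotations))
        for d in range(dim)
    ]
    return weights, context


# Two input words with equal alignment scores contribute equally.
weights, context = context_vector([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

In a trained model the scores would come from the alignment function applied to the previous decoder state and each annotation, so the weights change at every output step.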

### 1.3 Pointer Network

To find the most likely answer among the input tokens, the Pointer Network computes attention weights of variable length as such:

\[
u_{j}^{i} = v^{T} \tanh\left(W_{1} e_{j} + W_{2} d_{i}\right), \qquad p_{j}^{i} = \operatorname{softmax}\left(u^{i}\right)_{j}
\]

where \(e_{j}\) is the encoder state at the \(j\)-th input word, \(d_{i}\) is the decoder state at the \(i\)-th output step, and \(v\), \(W_{1}\), \(W_{2}\) are learnable parameters (Fig. 5).
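A minimal Python sketch of this pointer scoring follows. It assumes the additive form of Vinyals et al., in which each input position is scored by \(v^{T} \tanh(W_{1} e_{j} + W_{2} d_{i})\) and the scores are normalized with a softmax; the helper names and the toy parameter values are hypothetical and are not taken from the released code.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def pointer_distribution(encoder_states, decoder_state, v, W1, W2):
    """Return a probability for each input position (one per encoder state)."""
    projected_decoder = matvec(W2, decoder_state)
    scores = []
    for state in encoder_states:
        projected_encoder = matvec(W1, state)
        hidden = [math.tanh(a + b) for a, b in zip(projected_encoder, projected_decoder)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


identity = [[1.0, 0.0], [0.0, 1.0]]
probs = pointer_distribution(
    encoder_states=[[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0]],
    decoder_state=[0.0, 0.0],
    v=[1.0, 0.0],
    W1=identity,
    W2=identity,
)
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_index)
```

The output distribution has one entry per input token, which is why the same parameters work for contexts of any length.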

### 1.4 Match-LSTM Model

A visual representation of the model is given in Fig. 6. The attention weights in the Match-LSTM layer are calculated as such:

Here, \(W^{v}, W^{c}, W^{r}, b^{c}, w, b\) are learned weights, \(h_{i}^{c}\) and \(h^{v}\) are the hidden states of the \(i\)-th word of the context text and of the variable name, respectively, and \(h_{i-1}^{r}\) is the previous hidden state of the Match-LSTM. The attention weights are then combined with \(h_{i}^{c}\) and \(h^{v}\) and processed in the Match-LSTM. This process is carried out in both the forward and backward directions, and all resulting hidden states are concatenated into a final hidden state vector. The final hidden state vector is then processed by a pointer layer, which calculates the most likely position of the definition of the variable in the context text. The following formulas show how the probability \(\beta \) is calculated:

where \(V, W^{a}, b^{a}, v\), and \(b\) are learned weights, \(\bar{H^{r}}\) is the vector of concatenated Match-LSTM hidden states, \(h_{k - 1}^{a}\) is the previous hidden state of the pointer LSTM, and \(e_{C + 1}\) is a vector of ones with size C, where C is the length of the context text. \(\beta _{j}\) is combined with \(h_{j}^{c}\) and processed by an LSTM.
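For reference, the corresponding relations in Wang and Jiang's Match-LSTM formulation, rewritten with the symbols used in this section, take roughly the following shape. This is a sketch under assumptions: the intermediate names \(G_{i}\), \(\alpha_{i}\), and \(F_{k}\), the all-ones vector \(e_{V}\) (one entry per variable-name token), and the exact placement of the bias terms are not given in the text and may differ from the authors' implementation.

\[
G_{i} = \tanh\left(W^{v} h^{v} + \left(W^{c} h_{i}^{c} + W^{r} h_{i-1}^{r} + b^{c}\right) \otimes e_{V}\right),
\qquad
\alpha_{i} = \operatorname{softmax}\left(w^{T} G_{i} + b \otimes e_{V}\right)
\]

\[
F_{k} = \tanh\left(V \bar{H^{r}} + \left(W^{a} h_{k-1}^{a} + b^{a}\right) \otimes e_{C+1}\right),
\qquad
\beta_{k} = \operatorname{softmax}\left(v^{T} F_{k} + b \otimes e_{C+1}\right)
\]

In both pairs, the outer product with the all-ones vector repeats a single column vector across every position, so that it can be added to the matrix of hidden states before the \(\tanh\) is applied.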

### 1.5 BERT-Based Model

For this model, a pre-trained Hugging Face BertForQuestionAnswering model was fine-tuned to perform the definition-extraction task. The text was processed with a custom Hugging Face tokenizer created by [18]. This tokenizer was trained on a mathematical language corpus comprising a wide range of educational materials used in primary-school, high-school, and college-level courses (Fig. 7).
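A question-answering head of this kind produces one start logit and one end logit per input token, and the predicted definition is the span whose start and end scores are jointly highest. The sketch below shows one common way to decode such a span in plain Python; the function name `best_span`, the `max_span_len` limit, and the toy logits are assumptions for illustration, and the decoding used in the released notebooks may differ.

```python
def best_span(start_logits, end_logits, max_span_len=30):
    """Return (start, end) indices maximizing start_logit + end_logit, with end >= start."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_span_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy example: hypothetical logits over an eight-token context.
tokens = ["let", "v", "be", "the", "velocity", "of", "the", "particle"]
start_logits = [0.1, 0.0, 0.2, 2.5, 1.0, 0.0, 0.3, 0.1]
end_logits = [0.0, 0.1, 0.0, 0.2, 1.1, 0.3, 0.2, 2.8]

start, end = best_span(start_logits, end_logits)
definition = tokens[start:end + 1]
print(definition)
```

Restricting the search to spans with `end >= start` avoids the invalid predictions that can occur when the two argmax positions are taken independently.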

## Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

## About this paper

### Cite this paper

Hamel, E., Zheng, H., Kani, N. (2022). An Evaluation of NLP Methods to Extract Mathematical Token Descriptors. In: Buzzard, K., Kutsia, T. (eds) Intelligent Computer Mathematics. CICM 2022. Lecture Notes in Computer Science, vol 13467. Springer, Cham. https://doi.org/10.1007/978-3-031-16681-5_23

DOI: https://doi.org/10.1007/978-3-031-16681-5_23

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-16680-8

Online ISBN: 978-3-031-16681-5

eBook Packages: Computer Science (R0)