Introduction

Documents from Science, Technology, Engineering, and Mathematics (STEM) often contain a significant number of mathematical formulas (Hambasan & Kohlhase, 2015). Formulas are a vital non-textual component for understanding the content of STEM documents. Systems, such as semantic search engines, question answering systems, and document recommender systems, should also be capable of processing formulas and their connections with the surrounding text and mathematical expressions. In information science and technology, the semantics of natural language is typically grasped via conceptualization (Yucong & Cruz, 2011). According to Gruber (1993), the term conceptualization refers to the process of simplifying the representation of objects of discourse and specifying a semantic vocabulary in an ontology (knowledge system). Analogously, to capture the semantics of mathematical language in formulas, we argue for the introduction of a mathematical Formula Concept, which we define as a collection of equivalent formulas with different representations (see also Sect. “Formula concept discovery” below). This extends the definition of the formula content comprising constituents, relations, and semantics of a formula, which was introduced in Scharpf et al. (2018). We select the Klein–Gordon equation as an example for mathematical conceptualization. Figure 1 shows different representations of the Klein–Gordon equationFootnote 1 from quantum mechanics (also referred to as a relativistic wave equation). These representations of the Klein–Gordon equation in the academic literature appear to be diverse, but they all represent the same mathematical concept. Employing additional mathematical Formula Concept examples, we illustrate and discuss differences and explain the resulting challenges of this conceptualization process in detail.
We introduce two tasks: Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR) to (1) identify Formula Concepts and (2) find formulas which are instances of particular Formula Concepts. We present implementations to automatically perform the FCD and FCR tasks using machine learning techniques.

Fig. 1

Representations of the Klein–Gordon equation extracted from physics papers—(a): Arbab (2010), (b): Pecher (1984), (c): Tretyakov and Akgun (2010), (d): Detweiler (1980), (e): Kaloyerou and Vigier (1989), (f): Haroun et al. (2017), (g): Tiwari (1988), (h): Strauss and Vazquez (1978), (i): nLab authors (2022), (j): nLab authors (2022), (k): Morawetz (1968). Some of the representations are written in a general, potentially nonlinear form. With constraints given for the parameters in the respective publications, the equations become the linear Klein–Gordon equation

Novelty of contribution. This paper extends our previous publication (Scharpf et al., 2019b), in which we introduced the first FCD retrieval method implementation. We extend our study of Formula Concepts with two additional FCD retrieval methods, three additional tasks, and an entire section on FCR experiments. A strong focus of this work is placed on the in-depth analytical examination of example Formula Concepts. We discuss 36 different representations of the Klein–Gordon equation, Einstein’s field equations, and Maxwell’s equations. Analyzing their differences, we identify 13 challenges for FCD to derive requirements for the practical implementation of an FCD framework. Furthermore, we investigate the Formula Concept vector space of our examples in four different formula encodings (vector representations). Additionally, we examine the separability or delineation of different Formula Concepts by computing classification accuracy (SVM classifier) and cluster purity (k-means clustering). We also generate formula similarity maps in different encoding measures to illustrate FC class coherence. Finally, we present and discuss several of our FCR implementations, including search rankings and additional machine learning methods.

Related work

This section reviews and explains some background knowledge necessary to understand this research project. This includes our own preliminary work and achievements to tackle FCD, related methods of Mathematical Entity Linking, formula knowledge bases, STEM document dataset sources, and mathematical information system applications.

We recently introduced a first machine learning approach for Formula Concept Discovery (Scharpf et al., 2019a). Using Doc2Vec (Le & Mikolov, 2014) encodings and k-means clustering, equivalent representations of formulas were retrieved and evaluated. The experiment was carried out on a selection of astrophysics papers from the NTCIR arXiv dataset (Aizawa et al., 2014). We took formulas that occurred most often in the corpus (duplicates) as a cluster seed. Furthermore, for the major part of the test selection candidates, a valid Formula Concept name could be retrieved from the surrounding text. For almost all of the retrieved name candidates, a Wikidata QID was available.Footnote 2

In this paper, we extend our Formula Concept Discovery method with novel Formula Concept Recognition methods. Both approaches involve two steps: knowledge base population and content referencing. These can both be described in terms of Mathematical Entity Linking (MathEL) (Scharpf et al., 2021a, 2021b). MathEL approaches link mathematical formulas to unique URLs in a semantic knowledge base. If the URLs are part of Wiki web resources, MathEL can be regarded as the ‘Wikification’ of mathematical content (Kristianto et al., 2016).

In Natural Language Processing, entities are typically linked to Wikipedia via Entity Linking, with a variety of applications, such as Named Entity Recognition (NER), relationship extraction, and entity summarization (Rosales-Méndez et al., 2018). In analogy, methods have been developed to link mathematical expressions in scientific documents to Wikipedia articles using their surrounding text (Kristianto et al., 2016; Kristianto & Aizawa, 2017). One of the conclusions was that for the linking to be reliable, a balanced combination of textual and mathematical elements must be considered. As potential candidates for MathEL, Mathematical Objects of Interest (MOI) were defined, and methods for their discovery were elaborated (Greiner-Petter et al., 2020). MathEL is expected to enhance mathematical subject classification (Scharpf et al., 2020b; Schubotz et al., 2020).

To implement our FCR methods, we employ Wikidata as the semantic grounding for Wikification (entity linking to Wiki web resources). Since Wikipedia is only semi-structured, WikidataFootnote 3 was launched to provide direct access to specific interlingual facts (RDFFootnote 4 triples) and to retrieve information systematically. Wikidata is a free and open semantic knowledge base that can be read and edited by humans and machines (Vrandecic & Krötzsch, 2014). Wikidata stores items with statements and references. In the case of mathematical knowledge, this may include formulas. For example, one may describe the physics concept ‘pressure’ (item ID Q39552) with a ‘defining formula’ property (property ID P2534) \(p = F/S\). To scalably seed information into Wikidata, a primary sources tool (PST)Footnote 5 was introduced. This tool allows active users to quickly browse through new claims and references in order to approve or reject their validity. Currently, Wikidata contains approximately 5.7K items with a ‘defining formula’ property.Footnote 6
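Such ‘defining formula’ statements can be retrieved programmatically from the Wikidata SPARQL endpoint. The following sketch builds such a query in Python; the endpoint URL and property ID appear above, but the query itself is an illustrative assumption, not a query used in our experiments.

```python
# Sketch: retrieving Wikidata items that carry a 'defining formula'
# statement (property P2534), such as 'pressure' (Q39552) with p = F/S.
# The query text is an illustrative assumption.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def build_defining_formula_query(limit=10):
    """Return a SPARQL query for items with a 'defining formula' (P2534)."""
    return f"""
SELECT ?item ?itemLabel ?formula WHERE {{
  ?item wdt:P2534 ?formula .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()

query = build_defining_formula_query(limit=5)
# Execute by sending 'query' to WIKIDATA_SPARQL_ENDPOINT with the header
# 'Accept: application/sparql-results+json' (e.g., via urllib.request).
print(query)
```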

Besides Wikidata, other semantic databases exist that store mathematical formula knowledge. The NIST Digital Library of Mathematical Functions (DLMF, 2022) and NIST Digital Repository of Mathematical Formulae (DRMF) (Cohl et al., 2014) are two examples of maintained high-quality semantic datasets. Moreover, the benchmark mathmlben (Schubotz et al., 2018a) was created to evaluate tools for mathematical format conversion (from latex to mathml to Computer Algebra Systems), containing almost 400 formulas from Wikipedia, the arXiv,Footnote 7 and the DLMF. These were augmented by Wikidata macros in Scharpf et al. (2018).

Mathematical Information Retrieval (MathIR) systems address the information need of people working in STEM fields by retrieving, processing, and analyzing mathematical formulas (Scharpf et al., 2018). Up until now, various formula search engines have been developed. Furthermore, translations between different markups (latex, Presentation, and Content mathml) and standards have been introduced (Guidi & Coen, 2016). Schubotz et al. present a framework to translate mathml into Computer Algebra System (CAS) syntax. Furthermore, standards like OpenMathFootnote 8 and OMCDocFootnote 9 provide extensible ways to represent the semantics of mathematical objects in mathematical documents (Kohlhase, 2006). They can be used to annotate formula expressions in definitions, theorems, and proofs. Given markup on object, statement, and theory level, the soundness of mathematical systems can be assessed (Scharpf et al., 2018). In addition, the PhysML variant accounts for the special characteristics of physics: observables, physical systems, and experiments (Hilf et al., 2006). Moreover, Mathematical Question Answering (MathQA) systems have been built (Schubotz et al., 2019; Scharpf et al., 2022a) to provide quick and concise formula answers to mathematical questions in natural language, which are commonly asked on the web (Scharpf et al., 2020a). MathQA systems can retrieve answers from unstructured text passages or structured knowledge bases. In the latter case, MathEL needs to be employed to assign natural language concept names to mathematical formulas. While classical math search engines typically map a mathematical language query (formula string) to a collection of web resources that include the natural language name of the Formula Concept (Kohlhase & Sucan, 2006), MathQA systems perform the reverse transformation from natural to mathematical language. 
Another application of the mapping from mathematical to natural language using MathEL is question generation (Scharpf et al., 2022b).

For some Mathematical Language Processing (MLP) applications, the formula constituents (operators, identifiers, numbers) have to be annotated using Mathematical Markup Language (mathml). There are several tools available to convert latex into mathml, most prominently the latexml converter.Footnote 10 Furthermore, the occurring symbols (variables, constants) need to be disambiguated, i.e., their meaning inferred from the context by unsupervised retrieval or supervised annotation. There have been previous attempts to automatically retrieve the semantics of identifiers from the surrounding text (Schubotz et al., 2016; Greiner-Petter et al., 2022). However, it was found that not all identifier names could be extracted from the text. To address this, Schubotz et al. cluster identifier namespaces to enable a fallback retrieval from the definition cluster. While Wikipedia articles commonly contain variable definitions in the text, many research papers omit them, assuming expert reader domain knowledge. To build machine-interpretable datasets, manual annotation is thus inevitable. Since this is very time-consuming, formula and identifier annotation recommender systems, such as ‘AnnoMathTeX’ (Scharpf et al., 2019a, 2021a), have been built to speed up the process.

To create labeled formula data benchmarks, we need open access corpora of STEM documents. For research experiment reproducibility, snapshots must be defined. The arXiv.org e-Print archive (McKiernan, 2000) makes available free preprints for an extensive collection of publications from physics, mathematics, computer science, economics, and other fields. On the arXiv, many authors provide their latex source code. Both Wikipedia and arXiv articles were extracted as part of the NTCIR MathIR Task (Aizawa et al., 2014). We employ the NTCIR arXiv dataset for our research in this paper. In 2017, the Special Interest Group for Math Linguistics (SIGMathLing)Footnote 11 was initiated as a forum and resource cooperative for the linguistics of mathematical or technical documents.

Formula concept discovery

In this section, we attempt to formally define a Formula Concept and set up Formula Concept Retrieval Tasks.

Formula concept retrieval tasks

Definition. Following (Scharpf et al., 2018), we define the formula content as the sets of operators, identifiers,Footnote 12 and numbers that a formula contains. Furthermore, we define a Formula Concept as a collection of equivalent formulas with different representations featuring the same formula content (operators, identifiers, and numbers). Consider the Klein–Gordon equation representations in Fig. 1 as an example of a Formula Concept. Obviously, the formula content may vary as the occurring operators, identifiers, and numbers change from instance to instance. Operators such as partial derivatives can be represented in several ways (\(\partial ^2 u / \partial t^2\) vs. \(u_{tt}\) vs. \(\ddot{u}\)), identifiers can be subsumed into others (e.g., \(\alpha = m c / \hbar\)), and physical constants can be transformed into different unit systems (e.g., natural units with \(\hbar = c = 1\)). The Formula Concept Discovery challenges will be discussed in more detail in Sect. “Task 3: identification of challenges”. This motivates our study to find out what equivalent representations can occur and how to handle them.

Tasks Our goal is to map diverse representations of a formula to one unique Formula Concept ID,Footnote 13 e.g., linking all occurrences of the Klein–Gordon equation shown in Fig. 1 to the Wikidata item Q868967.Footnote 14 We define two subtasks of the Formula Concept Retrieval Task:

  • Formula Concept Discovery is a method to find common equivalent representations and a name candidate for a given formula, and

  • Formula Concept Recognition is an approach for recognizing formulas in documents as being instances of a previously defined Formula Concept.

In the following, we present our implementation and evaluation results for Formula Concept Discovery and Formula Concept Recognition. These results are based on analytical examinations, machine learning, fuzzy string matching, and Wikipedia article heuristics.

For the discovery of Formula Concepts, we define the following four tasks:

  1. Task 1:

    Retrieval of formula concept examples,

  2. Task 2:

    Analysis of formula concept examples,

  3. Task 3:

    Identification of formula concept discovery challenges,

  4. Task 4:

    Derivation of formula concept retrieval system requirements.

In Task 1, we employ three methods to retrieve examples of Formula Concepts, which are suitable for discussing and identifying challenges of Formula Concept Discovery and Formula Concept Recognition. In Task 2, we analyze and discuss three selected Formula Concept examples. We choose three sets of differential equations from physics: the Klein–Gordon equation (KGE), Einstein’s field equations (EFE), and Maxwell’s equations (ME). The examples are retrieved from publications found via search engine queries for the Formula Concept name (sources as in Fig. 1), as well as from Wikipedia article content,Footnote 15 and from a textbook (Fließbach, 1990). Given our background in theoretical physics and applied mathematics, we choose examples from this domain. Since we are domain experts on the topics, we can judge the Formula Concept semantics. The formula annotation is achieved in a two-step process: (1) the retrieval by the concept name in the selected sources determines the annotation or assignment of the whole formula; (2) the domain expert subsequently semantically analyzes the formula and retrieves the semantic annotations of the formula constituents by considering the context and descriptions or explanations from the respective sources (text surrounding the formula). In Task 3, we identify and summarize the Formula Concept Discovery challenges, which we observe in the discussion of the three Formula Concept examples. These challenges determine the requirements for technical implementations of FCD and FCR. In Task 4, we address the identified challenges by deriving requirements for a Formula Concept Retrieval system and proposing methods to tackle the challenges.

The developed algorithms, the dataset, and full result tables are available at https://github.com/ag-gipp/formula-concept-retrieval.

Task 1: retrieval of formula concept examples

For the retrieval of example Formula Concepts, we employ the following three methods:

  • Method 1: Search by Formula Concept Name,

  • Method 2: k-Nearest-Neighbors (kNN) in Formula Vector Space,

  • Method 3: Wikipedia Article First Formula Multi-Language Heuristic.

In Method 1, we perform searches by the Formula Concept name in a corpus of publications, a Wikipedia article, and a textbook, respectively. In Method 2, we employ machine learning to retrieve equivalent representations of formulas (Scharpf et al., 2019a), which occur most often (duplicates) in a selected corpus containing astrophysics papers from the NTCIR arXiv dataset (Aizawa et al., 2014). For an introduction to the dataset, see the paragraph ‘Data selection’ in Sect. “Method 2: k-Nearest-neighbors in formula vector space”. In Method 3, we make use of a simple heuristic (Schubotz et al., 2018b; Halbach, 2020). We take the tentative Formula Concept names of the examples retrieved using Method 2. We then extract the corresponding Wikipedia articles. For each Formula Concept article, we retrieve the first five versions in different languages. We then assess how many different representations of the individual Formula Concepts are among these articles.

Method 1: Search by Formula Concept Name

For our first example, the Klein–Gordon equation, we perform a web search to retrieve ten representations from publications (Arbab, 2010; Detweiler, 1980; Haroun et al., 2017; Kaloyerou & Vigier, 1989; Morawetz, 1968; Pecher, 1984; Strauss & Vazquez, 1978; Tiwari, 1988; Tretyakov & Akgun, 2010). Each publication contains the Formula Concept name as a keyword or in the full text. For our second example, Einstein’s field equations, we retrieve representations from the corresponding Wikipedia article.Footnote 15 For our third example, Maxwell’s equations, we take derivations from a textbook on General Relativity (Fließbach, 1990).

Method 2: k-Nearest-Neighbors (KNN) in Formula Vector Space

This subsection is based on our previous publication (Scharpf et al., 2019a), in which we presented Formula Concept Discovery using k-Nearest-Neighbors for the first time. Since it might be impossible to exhaustively define all equivalence transformations formally, we test modeling a Formula Concept in machine learning terms as a collection of approved formula vectors (comparing encodings) within a specified similarity range (comparing metrics). We illustrate the formula space (formula content space in Fig. 4 and formula semantic space in Fig. 5) in Experiment 2 of FCR in Sect. “Experiment 2: formula concept classification and clustering”. It represents formulas as encoded vectors. Then, a Formula Concept can be defined as all vectors around a central vector within a specified distance (cutoff).

Method. We approach the discovery of Formula Concepts by retrieving equivalent formulations with different representations using machine learning (see Fig. 2). The retrieved instances are augmented with name candidates from the surrounding text. The initial step is to identify formula candidates that occur most often within a given dataset. We assume that they are potential seeds of popular Formula Concepts. We first tried formula clustering (Adeel et al., 2012; Ma et al., 2010). However, we discovered that this was not a suitable method for FCD since the number of clusters is a priori unclear.Footnote 16 The tested algorithms are not able to group equivalent formulas. Subsequently, we decided to start with a ranking of formula duplicates (with the same latex string). In contrast to the clustering, this yields valuable results for the selected Formula Concept examples.

Fig. 2

Clustering equivalent representations of formulas in the semantic space as named Formula Concept Wikidata items

Data selection. We employ the NTCIR arXiv dataset (Aizawa et al., 2014), which comprises 105,120 document sections containing over 60 million formulas. The formulas are enclosed in <math> tag environments. The documents were converted from latex to an XHTML format (https://tei-c.org). The disk size of the dataset is about 174GB uncompressed. We confine our computations to the subject class of astrophysics (680 astro-ph documents), employing a domain expert to evaluate the results semantically. To get the most popular formulas in the dataset as potential candidates for important Formula Concepts, we first identify duplicates where the exact formula string reoccurs in multiple documents. We subsequently rank the results by their occurrence frequency, i.e., the number of duplicates d (see the respective column in Table 1). From the duplicate ranking, we select a formula length range between 10 and 30 charactersFootnote 17 and restrict our selection to duplicates occurring in at least two documents (\(D \ge 2\)). Applying these selection criteria yields 3,495 formulas. We then manually select all equations (for now, we confine the Formula Concept definition to include equations only). We discard all stubs without a right-hand side, as well as simple variable dependence definitions, such as \(x = x\left( t\right)\) and \(x=y\) or \(x = \textrm{const}\). The algorithms for the data selection pipeline can be found in the source repository.
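The duplicate-ranking step described above can be sketched as follows; the data layout (a list of document/formula-string pairs) and the sample formula are illustrative assumptions, not the actual pipeline code from the repository.

```python
# Sketch of the duplicate-ranking step: rank formulas by how often the
# exact latex string reoccurs, keeping strings of 10-30 characters that
# appear in at least two distinct documents (D >= 2).
from collections import defaultdict

def rank_duplicates(formulas, min_docs=2, min_len=10, max_len=30):
    """formulas: iterable of (doc_id, latex_string) pairs.
    Returns (latex, d, D) tuples sorted by number of duplicates d."""
    docs_per_formula = defaultdict(set)
    count_per_formula = defaultdict(int)
    for doc_id, latex in formulas:
        if min_len <= len(latex) <= max_len:
            docs_per_formula[latex].add(doc_id)
            count_per_formula[latex] += 1
    ranked = [
        (latex, count_per_formula[latex], len(docs))
        for latex, docs in docs_per_formula.items()
        if len(docs) >= min_docs
    ]
    # Sort by the number of duplicates d (descending), as in Table 1.
    return sorted(ranked, key=lambda t: -t[1])

sample = [
    ("doc1", r"\dot{M}=4\pi r^{2}\rho v"),  # illustrative formula
    ("doc2", r"\dot{M}=4\pi r^{2}\rho v"),
    ("doc2", r"x=y"),                        # too short, filtered out
]
print(rank_duplicates(sample))
```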

Evaluation. For the first 50 samples from the duplicate ranking, we retrieve the operators and identifiers from the provided mathml <mo> and <mi> tags, as well as the surrounding text (words within a window of \(\pm 500\) characters around the formula). We encode both tag contents using the TfidfVectorizer from the Python package Scikit-learn (Pedregosa et al., 2011) and Doc2Vec model (Le & Mikolov, 2014) from the Python package Gensim (Rehurek, 2011). We then assess the performance of a k-Nearest-Neighbors classifier (Shakhnarovish et al., 2005) to retrieve equivalent formula representations. For a given instance of a Formula Concept, we compute the k-Nearest-Neighbors formulas as candidates for variations of that Formula Concept. Subsequently, we use our domain knowledge to judge whether these candidates are indeed equivalent representations of the given Formula Concept. We test the effectiveness of our approach on four different formula vector encodings:

  • math2vec encoding the formula constituents using the Doc2Vec model as proposed in Youssef and Miller (2018);

  • math tf-idf encoding the formula constituents using the TfidfVectorizer;

  • semantics2vec encoding the surrounding text (containing tentative formula semantics) using the Doc2Vec model; and

  • semantics tf-idf encoding the surrounding text using the TfidfVectorizer.
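The extraction of operators and identifiers from the mathml <mo> and <mi> tags, as described above, can be sketched as follows. Real NTCIR markup carries namespaces and attributes that this minimal example omits; the sample formula \(p = F/S\) is the Wikidata ‘defining formula’ example mentioned earlier.

```python
# Sketch of extracting formula constituents from MathML <mo> (operator)
# and <mi> (identifier) tags, assuming plain, namespace-free markup.
import xml.etree.ElementTree as ET

def formula_constituents(mathml):
    """Return (operators, identifiers) extracted from a MathML string."""
    root = ET.fromstring(mathml)
    operators = [el.text for el in root.iter("mo")]
    identifiers = [el.text for el in root.iter("mi")]
    return operators, identifiers

mathml = "<math><mi>p</mi><mo>=</mo><mi>F</mi><mo>/</mo><mi>S</mi></math>"
print(formula_constituents(mathml))  # (['=', '/'], ['p', 'F', 'S'])
```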

Computing the Doc2Vec formula vector encodings is more expensive than computing the TF-IDF encodings, due to the iterative training process of the neural model.
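The tf-idf encoding and k-Nearest-Neighbors retrieval can be illustrated with a small pure-Python stand-in. The actual experiments use Scikit-learn’s TfidfVectorizer and kNN classifier together with Gensim’s Doc2Vec; the token lists below are illustrative assumptions, not formulas from the NTCIR corpus.

```python
# Pure-Python stand-in for tf-idf encoding plus cosine-based
# k-nearest-neighbors retrieval of equivalent formula representations.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of sparse {token: weight} vectors."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_neighbors(i, vectors, k=1):
    """Rank all other formula vectors by cosine similarity to vector i."""
    ranked = sorted(((j, cosine(vectors[i], vectors[j]))
                     for j in range(len(vectors)) if j != i),
                    key=lambda t: -t[1])
    return ranked[:k]

# Tokenized formula constituents (illustrative):
formulas = [
    ["u", "tt", "-", "nabla", "^", "2", "u", "=", "0"],   # u_tt - nabla^2 u = 0
    ["partial", "^", "2", "u", "partial", "t", "^", "2",
     "-", "nabla", "^", "2", "u", "=", "0"],              # wave equation variant
    ["E", "=", "m", "c", "^", "2"],                       # unrelated formula
]
vecs = tfidf_vectors(formulas)
print(nearest_neighbors(0, vecs, k=1))  # the equivalent wave equation ranks first
```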

Results. Table 1 shows the results of our approach for discovering Formula Concepts as published before (Scharpf et al., 2019b). We rank the extracted formulas by the number of duplicates d and list the number of documents D in which they appear. Note that the likelihood of retrieving non-duplicate equivalent representations increases with the number of distinct documents: representations drawn from different documents exhibit more variation than representations within a single document. We can see that only for the first 18 Formula Concept examples are there more than two duplicates from distinct documents, i.e., formulas appearing twice or more within the corpus. We evaluate the first 50 examples. The primary investigation was to compare the performance of four different formula vector encodings in terms of the number of equivalent representations retrieved. In total, we can retrieve 163 equivalent Formula Concept representations for our 50 samples. On average, this corresponds to more than three (163/50 = 3.3) per formula (from 3 different documents) or around one (163/50/4 = 0.8) per source per formula. Some of the retrieved formulas even contain different identifier symbols or varying indices (e.g., a is replaced by R in Example 1, see the first line of Table 1). Increasing the number of formula neighbors parameter k from 1 in integer steps, we cannot find additional matching representations above \(k=9\). We define the retrieval success s of an individual encoding as the percentage of all retrieved representations that this encoding contributes. Calculating the overall success distribution, we discover that the math2vec (\(e_m\)) encoding distinctly outperforms the others by yielding 71% of the retrieved instances, followed by semantics tf-idf (\(\hat{e}_s\)) with 15%, semantics2vec (\(e_s\)) with 11%, and math tf-idf (\(\hat{e}_m\)) with 4%.
Overall, for 34 of the 50 investigated sample formulas, i.e., 34/50 = 68%, we are able to retrieve equivalent representations. We conclude that while the math2vec encoding retrieves the most equivalent formula matches as candidates for a Formula Concept, it is most effective to employ all formula vector encodings simultaneously to maximize the retrieval. Note that we can only determine false positives and compute precision, but not the number of false negatives to compute recall. This is because we do not know a priori how many different equivalent representations, semantically close to the examined concept, still exist. We can determine this neither in general (how many notational variations are possible in principle) nor for the given corpus (how many actually occur). Finally, we list the top five name candidates from the surrounding text. The word window size is chosen to be ±500 characters. Decreasing the window size in steps of 100, the top 50 coverage performance drops from 100% for \(\text {ws} \in \{500, 400, 300\}\) to 17% for \(\text {ws} = 200\) and 11% for \(\text {ws} = 100\). We evaluate whether the candidates contain a suitable name for the Formula Concept to be seeded as a Wikidata item. For our 50 Formula Concept examples, we achieve a recall of 36/50 = 72% for the formula name. Furthermore, for 41/50 = 82% of the retrieved name candidates, a Wikidata QID is available to tag the Formula Concept.

Table 1 Formula concept discovery (Scharpf et al., 2019b)

Method 3: Wikipedia Article First Formula Multi-Language Heuristic

Table 2 shows another approach to discover Formula Concepts. We employ the tentative mathematical concept name candidate and retrieve the corresponding English Wikipedia article. For each Formula Concept article, we retrieve the first five versions in different languages. We then assess how many of these contain a first formula that is a different representation of the Formula Concept. As an example, for formula number 1, the ‘Hubble parameter’, the first formula of the English article is \(v = H_0 D\), while in the German article it is \(H(t)=\frac{\dot{a}(t)}{a(t)}\). We show the success score s in the last column; it is the fraction of different representations among the first five language versions. On average, a Formula Concept appears in two different representations. In our evaluation, we leave out all formulas for which no concept name is available (N/A) to search for Wikipedia articles (−). For the 32 formulas for which we can select a Formula Concept name from the surrounding text candidates, we find 155 Wikipedia articles (for some names, fewer than five language versions are available). In total, 53/155 = 34% of the individual versions contain Formula Concept variations. This corresponds to 19/32 = 59% of the formulas. The results indicate that it is in principle possible to retrieve Formula Concept representations via the Wikipedia article first formula multi-language heuristic. However, this does not work for a significant part of the sample. Our finding aligns with previous results in the literature (Halbach, 2020), which report that considering multiple Wikipedia languages decreases both precision and recall compared to using only English Wikipedia.
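One plausible reading of the success score s can be sketched as follows, using the Hubble parameter formulas quoted above. The choice of the English version as reference and the whitespace normalization are illustrative assumptions.

```python
# Sketch of the Table 2 success score: the fraction of non-English
# language versions whose first formula differs from the English one.
def success_score(first_formulas):
    """first_formulas: dict mapping language code -> first formula (LaTeX)."""
    norm = lambda s: s.replace(" ", "")
    reference = norm(first_formulas["en"])
    others = [norm(f) for lang, f in first_formulas.items() if lang != "en"]
    if not others:
        return 0.0
    return sum(f != reference for f in others) / len(others)

hubble = {
    "en": "v = H_0 D",
    "de": r"H(t)=\frac{\dot a(t)}{a(t)}",
}
print(success_score(hubble))  # 1.0: the German first formula differs
```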

Table 2 Formula Concept Discovery via Wikipedia article first formula multi-language heuristic

Task 2: analysis of formula concept examples

In the following, we do step-by-step examinations of three differential equations from physics:

  1. Example 1:

    Klein–Gordon equation,

  2. Example 2:

    Einstein’s field equations,

  3. Example 3:

    Maxwell’s equations.

The presented representations are not exhaustive. Only some of the most interesting representations are selected and presented to discuss important aspects and derive a list of challenges for Formula Concept Retrieval.

The challenge analysis framework is the following: The domain expert thoroughly examines the formula at hand to understand its specific particularities. Performing a ‘semantic analysis’ means that constraints, notation (see, for example, https://dlmf.nist.gov/not), substitutions, and equivalences are carefully considered.

Example 1: Klein–Gordon equation

The Klein–Gordon equation is a relativistic wave equation. It describes the behavior of particles (modeled as waves) at high energies and velocities comparable to the speed of light (relativistic). Being a partial differential equation containing second partial derivatives in both time \(\partial ^2 / \partial t^2\) and space \(\partial ^2 / \partial x_k^2\), it can be employed to compute the evolution of a quantum wave function \(\psi\) in time t and space \(\vec {x}\) (Gross, 2008). Apart from the terms containing the derivatives of the wave function, there is an additional term with the undifferentiated wave function. Depending on the notation, some terms are additionally multiplied by some factors of constants (not changing in time and space). The signs of the terms depend on the metric signature, a notational convention of how to combine time and space (Fließbach, 1990).

In the first retrieved representation

$$\frac{1}{c^2} \frac{\partial ^2 \psi }{\partial t^2} -\nabla ^2 \psi + \left( \frac{m_0 c}{\hbar } \right) ^2 \psi = 0,$$
(1)

the term pre-factors are \(1/c^2\) and \((m_0 c/\hbar )^2\). The spatial derivatives with respect to the coordinates \(\vec {x} = (x,y,z)\) are encapsulated in the Laplace operator

$$\begin{aligned} \nabla ^2 = {\nabla \cdot \nabla = (\partial _x, \partial _y, \partial _z) \cdot (\partial _x, \partial _y, \partial _z)}. \end{aligned}$$
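For comparison with the later representations, the time and space derivatives in (1) are commonly combined into the d’Alembert operator \(\Box = \frac{1}{c^2}\,\partial _t^2 - \nabla ^2\) (for metric signature \((+,-,-,-)\)), so that representation (1) can be written compactly as

$$\begin{aligned} \left( \Box + \left( \frac{m_0 c}{\hbar } \right) ^2 \right) \psi = 0. \end{aligned}$$

In natural units (\(\hbar = c = 1\)), this reduces to \(\left( \Box + m_0^2 \right) \psi = 0\).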

In the second representation

$$\begin{aligned} u_{tt} + A u + f(u) = 0, \end{aligned}$$
(2)

the wave function is denoted u instead of \(\psi\). Additionally, the second derivative with respect to time is denoted using subscripts \(u_{tt} = \frac{\partial ^2 u }{\partial t^2}\). The spatial derivative is expressed via a matrix multiplication \(A \cdot u\) corresponding to \(\nabla ^2 u\), and the metric signature is chosen such that the term has a positive sign. Finally, the constant factors are absorbed in the function f(u), which is proportional to \((m_0 c/\hbar )^2 u\). In both the previous and following representations, the multiplication is always implicit, i.e., the multiplication sign “\(\cdot\)” is omitted. This representation allows adding an arbitrary function f(u) of u, linear or nonlinear. For it to be the Klein–Gordon equation, f(u) has to equal a non-zero constant times u. In this case, the parameters are set to

$$\begin{aligned} A:=-\varDelta +m^2, m\ne 0, \quad f(u):=\lambda |u|^{\rho -1}u, \lambda \in {{\mathbb {R}}}, \end{aligned}$$

such that the equation contains the second space derivatives in the Laplace operator \(\varDelta\) and is linear in u, e.g., \(f(u) = \lambda u\) for \(\rho = 1\). The need to automatically retrieve this additional constraint information is a major challenge for FCR. In the third representation

$$\begin{aligned} \partial ^2_{ct} h_n (z,t) - \partial ^2_z h_n (z, t) +\nu _n^2 h_n (z,t) = 0, \end{aligned}$$
(3)

the time derivative includes the factor c (the speed of light) and is again denoted using subscripts, such that

$$\begin{aligned} \partial ^2_{ct} = \frac{ \partial ^2}{ \partial (ct)^2} = \frac{1}{c^2} \partial ^2_t. \end{aligned}$$

This is equivalent to the absorption of the factor \(1/c^2\) from the first representation (1). The wave function is here denoted h(z, t), explicitly emphasizing the dependence on space z and time t. Here, only one dimension, the coordinate z, is considered, such that the spatial derivative reduces to \(\partial ^2_z = \partial ^2 / \partial z^2\). The metric signature is the same as in (1) with a minus sign in front of the second term. In the fourth representation

$$\begin{aligned} \nabla ^a \nabla _a \psi = \mu ^2 \psi , \end{aligned}$$
(4)

the wave function is again denoted \(\psi\) as in (1). The constants are absorbed in the factor \(\mu ^2\), such that the linear term containing the undifferentiated wave function is now shifted from the left-hand to the right-hand side of the equation. Both the space and time derivatives are combined into one single term by using Einstein’s summation convention (Einstein, 1916), which states implicit summation over repeated indices. In our case, the summation index a denotes the dimension coordinates of time t and space xyz. Without additional remarks, it is not clear whether all coordinates are considered or some are omitted. The equation could possibly be a time-independent (\(\partial ^2 \psi / \partial t^2 = 0\)) or one-dimensional form (\(\psi (\vec {x}) = \psi (z)\)), as in (3). In the fifth representationFootnote 18

$$\begin{aligned} \frac{\hbar ^2}{c^2} \frac{\partial ^2 \psi }{\partial t^2} -\frac{\hbar ^2 \partial ^2 \psi }{\partial x^2} = - 2 i \hbar \frac{\partial \psi }{\partial \tau }, \end{aligned}$$
(5)

there is an additional term containing a first derivative with respect to proper time \(\tau\), which is proportional to time t for constant speed. The term is imaginary, denoted by the imaginary unit i. Physically, it introduces an exponential decay of the wave function (damping). The sixth representation

$$\begin{aligned} -\hbar ^2 \frac{\partial ^2 \varPsi }{\partial t^2} + c^2 \hbar ^2 \nabla ^2 \varPsi = m_0^2 c^4 \varPsi , \end{aligned}$$
(6)

has a different signature (the term signs differ from the previous representations). However, the term without derivative appears positive on the right-hand side as in (4). Moreover, the pre-factors containing the constants—Planck’s constant \(\hbar\), the speed of light c, and the rest mass \(m_0\)—are distributed differently. In the seventh representation

$$\begin{aligned} \nabla ^2 \phi - \frac{1}{c^2} \frac{\partial ^2 \phi }{\partial t^2} - \frac{2\alpha + a}{c^2} \frac{\partial \phi }{\partial t} -\frac{\alpha ^2 + a \alpha }{c^2} \phi = 0, \end{aligned}$$
(7)

the wave function is denoted \(\phi\). The second space derivatives appear again in the Laplace operator \(\nabla ^2\) as in (1). Here, some additional constants \(\alpha\) and a are introduced, as well as a term containing a first partial time derivative \(\partial \phi / \partial t\), similar to (5). By setting \(a = -2\alpha\) in the publication, this term vanishes, and the equation becomes the Klein–Gordon equation. The eighth representation

$$\begin{aligned} u_{tt} - \varDelta u + m^2 u + G'(u) = 0, \end{aligned}$$
(8)

uses the same variable u and time derivative \(u_{tt}\) as in (2). The Laplace operator performing the second spatial derivatives is denoted as \(\varDelta = \nabla ^2\). The constants are absorbed in the factor \(m^2\), and there is an additional term, the derivative \(G'(u)\) of a function of the wave function. For the representation to be the Klein–Gordon equation, \(G'(u)\) must vanish. The ninth representation

$$\begin{aligned} \left( \eta ^{\mu \nu } \frac{\partial }{\partial x^\mu } \frac{\partial }{\partial x^\nu } -\left( \frac{mc}{\hbar } \right) ^2 \right) \varphi = 0, \end{aligned}$$
(9)

again uses Einstein notation as in (4) for the partial (time and space) derivatives. For the signature (the term signs), the Minkowski metric \(\eta _{\mu \nu }\) is employed. The wave function \(\varphi\) can then be factored out. The tenth representation

$$\begin{aligned} \left( -\frac{1}{c^2} \frac{\partial ^2}{\partial t^2} + \sum _{i=1}^p \frac{\partial }{\partial x^i} \frac{\partial }{\partial x^i} -\left( \frac{mc}{\hbar } \right) ^2 \right) \varphi = 0 \end{aligned}$$
(10)

is similar to (9). However, it explicitly displays the summation using the summation sign \(\sum\) and limits the considered dimensions to p. Lastly, the eleventh representation,

$$\begin{aligned} u_{tt} - \varDelta u + m u + P'(u) = 0 \end{aligned}$$
(11)

is almost identical to (8)—the only difference being that the constant \(m^2\) is replaced by m and the function G by P. This again means that to be the Klein–Gordon equation, the function derivative \(P'(u)\) must vanish.

To summarize, in the different representations of the Klein–Gordon equation extracted from the 11 publications, there are several different symbols used to denote the wave function: \(\psi\), u, h, \(\varPsi\), \(\phi\), and \(\varphi\). The constant factors \(m_0\), c, \(\hbar\), etc., appear at different places in different terms of the equation or are omitted entirely in particular unit systems. The derivative notation varies significantly, e.g., from \(\partial ^2 \psi / \partial t^2\) to \(\partial ^2_{ct}\) to \(u_{tt}\) for the time derivative of the wave function. In (4) and (9), Einstein’s summation notation is used to compactify the derivatives while omitting summation signs. The signs of the terms differ with the metric signature that is used. Additional terms and functions are introduced (e.g., the damping term in (5) and \(G'(u)\) and \(P'(u)\) in equations (8) and (11)). Note that there are potentially more representation variations, which did not occur in our examples and were therefore not considered. For instance, there are forms of the Klein–Gordon equation (KGE) in which the D’Alembert operator

$$\begin{aligned} \Box = \frac{1}{c^2}\frac{\partial ^2}{\partial t^2}-\sum _{i=1}^{d-1} \frac{\partial ^2}{\partial x_i{}^2} \end{aligned}$$

takes care of the time and space derivatives.
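Such equivalences between representations can, in simple cases, be verified with a computer algebra system. As an illustrative sketch (assuming SymPy, not part of the original analysis), the following check confirms that representation (6) is representation (1) multiplied by the constant \(-c^2 \hbar ^2\):

```python
import sympy as sp

t, x, y, z = sp.symbols('t x y z')
c, m0, hbar = sp.symbols('c m_0 hbar', positive=True)
psi = sp.Function('psi')(t, x, y, z)

lap = sum(sp.diff(psi, v, 2) for v in (x, y, z))  # nabla^2 psi

# representation (1), moved entirely to the left-hand side
rep1 = sp.diff(psi, t, 2) / c**2 - lap + (m0 * c / hbar)**2 * psi
# representation (6), moved entirely to the left-hand side
rep6 = -hbar**2 * sp.diff(psi, t, 2) + c**2 * hbar**2 * lap - m0**2 * c**4 * psi

# (6) equals (1) multiplied by the constant factor -c^2 * hbar^2
assert sp.expand(rep6 + c**2 * hbar**2 * rep1) == 0
```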

Example 2: Einstein’s field equations

Einstein’s field equations (EFEs) are the fundamental differential equations in Einstein’s theory of general relativity. They relate the curved geometry of spacetime (space and time are united in this framework) to the distribution of matter, which generates a gravitational field (Einstein, 1916). Mathematically, the EFEs form a system of ten coupled nonlinear partial differential equations (Rendall, 2005). As in the previously discussed representations (4) and (9) of the Klein–Gordon equation, four-dimensional indices are used.

The first representation

$$\begin{aligned} G_{\mu \nu }+\varLambda g_{\mu \nu }=\kappa T_{\mu \nu } \end{aligned}$$
(12)

is a very compact form. The Einstein tensor

$$\begin{aligned} G_{\mu \nu } = R_{\mu \nu } - \tfrac{1}{2} R g_{\mu \nu } \end{aligned}$$

subsumes the Ricci curvature tensor \(R_{\mu \nu }\), the scalar curvature R, and the metric tensor \(g_{\mu \nu }\), which describes the gravitational field. The stress-energy tensor \(T_{\mu \nu }\) describes the density and flux of energy and momentum in spacetime. A tensor is a generalization of a matrix and a vector to higher dimensions. The two-dimensional tensors with two indices \(\mu\) and \(\nu\) can also be written as a matrix (cf. field tensor in Example 3), where the indices correspond to the row and column numbers. In equation (12), the cosmological constant \(\varLambda\) quantifies the contribution of dark energy to the expansion of the universe. Furthermore, there is another constant

$$\begin{aligned} \kappa = 8 \pi G / c^4, \end{aligned}$$

containing the gravitational constant G and the speed of light c. The second representation

$$\begin{aligned} G_{\mu \nu } + \varLambda g_{\mu \nu } = 8 \pi T_{\mu \nu }\,(G=c=1), \end{aligned}$$
(13)

explicitly states that geometric units are used with the constants \(G=c=1\), which sets the pre-factor on the right-hand side to \(\kappa = 8 \pi\). The third representation

$$\begin{aligned} R_{\mu \nu }-\frac{1}{2}g_{\mu \nu }R-\varLambda g_{\mu \nu }=(8\pi G_{N}) T_{\mu \nu }, \end{aligned}$$
(14)

writes out the definition of \(G_{\mu \nu }\) on the left-hand side, and uses \(c=1\) but \(G=G_N\) with an additional index N. The fourth representation

$$\begin{aligned} G_{\mu \nu }=-\varLambda g_{\mu \nu }+\kappa ^{2}T^{\textrm{tot}}_{\mu \nu } \end{aligned}$$
(15)

shows the term with the cosmological constant \(\varLambda\) moved to the right-hand side, \(\kappa ^2\) instead of \(\kappa\), and \(T^\text {tot}_{\mu \nu }\) with an additional superscript. The fifth representation

$$\begin{aligned} G_{\mu \nu }=R_{\mu \nu }-g_{\mu \nu }R/2=\kappa T^{\mu \nu }-\varLambda g_{\mu \nu } \end{aligned}$$
(16)

is a combination of (12) and (15). The sixth representation

$$\begin{aligned} R_{\mu \nu }-\frac{1}{2}Rg_{\mu \nu }=\kappa _{r}(T) T_{\mu \nu }+\varLambda (T)g_{\mu \nu } \end{aligned}$$
(17)

has the sign of the \(\varLambda\)-term changed again, while showing its dependence on T. Furthermore, \(\kappa\) here carries the index r, and its dependence on T is shown as well. In the seventh representation

$$\begin{aligned} K_{\mu \nu }-Kg_{\mu \nu }=-\frac{\kappa ^{2}}{2} T_{\mu \nu }+r_{c}G_{\mu \nu }, \end{aligned}$$
(18)

the units are chosen such that the pre-factor of the \(T_{\mu \nu }\)-term is \(-\kappa ^2/2\), and \(G_{\mu \nu }\) is multiplied by an additional factor \(r_{c}\) (the critical radius of the universe). The eighth representation

$$\begin{aligned} G_{AB}\equiv R_{AB}-{1\over 2}g_{AB}R=\kappa ^{2}\,T_{AB} \end{aligned}$$
(19)

uses the Latin letters A and B instead of the Greek letters \(\mu\) and \(\nu\). The ninth representation

$$\begin{aligned} R_{\mu \nu }-\frac{1}{2}g_{\mu \nu }R+\varLambda g_{\mu \nu } =-8\pi GT_{\mu \nu } f_{R}\,G_{\mu \nu } \end{aligned}$$
(20)

introduces an additional function \(f_R\) and an explicit occurrence of the Newtonian gravitational constant G. The tenth and eleventh representations

$$\begin{aligned} R_{\mu \nu }-\frac{1}{2}g_{\mu \nu }R+\varLambda _{c} g_{\mu \nu }=8\pi GT_{\mu \nu } \end{aligned}$$
(21)

and

$$\begin{aligned} R_{\mu \nu }-{1\over 2}Rg_{\mu \nu }+\varLambda _{eff} g_{\mu \nu }=8\pi GT_{\mu \nu } \end{aligned}$$
(22)

use subscripts to mark the cosmological constant \(\varLambda\) as critical (c) and effective (eff), respectively. The twelfth representation

$$\begin{aligned} G_{\mu \nu }-g_{\mu \nu }\varLambda =\frac{8\pi G}{c_{0}^{4}\phi ^{4}}T_{\mu \nu } \end{aligned}$$
(23)

displays an additional identifier \(\phi\) within \(\kappa\) and an index 0 on c. The thirteenth representation

$$\begin{aligned} E^{\mu \nu }=-G^{\mu \nu }+\kappa T^{\mu \nu }-\varLambda g^{\mu \nu } \end{aligned}$$
(24)

relates a fourth tensor \(E_{\mu \nu }\) to the other three (\(G_{\mu \nu }\), \(g_{\mu \nu }\), and \(T_{\mu \nu }\)). For \(E_{\mu \nu } = 0\) it reduces to (12). In the fourteenth representation

$$\begin{aligned} R_{\mu \nu }-\frac{1}{2}g_{\mu \nu }R=8\pi G_{5}T_{\mu \nu } -\varLambda _{5}g_{\mu \nu }, \end{aligned}$$
(25)

another index 5 is added to the constants G and \(\varLambda\). In the fifteenth representation

$$\begin{aligned} R_{\mu \nu }-\frac{1}{2}Rg_{\mu \nu }=8\pi GT_{\mu \nu } -\varLambda g_{\mu \nu } T^{\textrm{RG}}_{\mu \nu }, \end{aligned}$$
(26)

an additional superscript RG is displayed. The sixteenth representation

$$\begin{aligned} R_{\mu \nu } - \frac{ \varLambda g_{\mu \nu }}{\frac{D}{2}-1} =\frac{8 \pi G}{c^4} \left( T_{\mu \nu } - \frac{1}{D-2}T g_{\mu \nu }\right) , \end{aligned}$$
(27)

contains an additional constant D, which is the dimension of the spacetime. Finally, the seventeenth representation

$$\begin{aligned} G_{\mu \nu }=\kappa _{4}^{2}T_{\mu \nu }-\varLambda g_{\mu \nu } +Q_{\mu \nu } \end{aligned}$$
(28)

adds another subscript for \(\kappa\) and the electromagnetic charge tensor \(Q_{\mu \nu }\) (‘Einstein-Maxwell equations’). Summarizing, Example 2 reiterates that the same Formula Concept can be represented using different unit systems, which modify the coefficients of the individual terms. As in Example 1, different names for identifiers and sub- or superscripts can occur. Furthermore, sometimes a variable dependence is explicitly displayed as in (17).
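The unit-system variation can be made explicit with a small symbolic check. The following sketch (an illustration assuming SymPy) confirms that geometric units \(G = c = 1\) reduce the pre-factor \(\kappa = 8 \pi G / c^4\) of (12) to the \(8\pi\) of representation (13):

```python
import sympy as sp

G, c = sp.symbols('G c', positive=True)
kappa = 8 * sp.pi * G / c**4  # pre-factor of the stress-energy term in (12)

# geometric units set G = c = 1, giving the pre-factor 8*pi of representation (13)
assert kappa.subs({G: 1, c: 1}) == 8 * sp.pi
```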

Example 3: Maxwell’s equations

Maxwell’s equations are the foundation of classical electromagnetism and optics. They describe how charges and electric currents generate electric and magnetic fields and model light as electromagnetic waves (Jackson, 1999). Mathematically, they form a set of four coupled partial differential equations, which—like the Klein–Gordon equation (Example 1)—contain time and space derivatives. While the Klein–Gordon equation is a scalar equation (for the wave function) and Einstein’s field equations relate tensors (curvature and mass-energy), Maxwell’s equations are vector equations (for the electric and magnetic fields).

The first two equations are Gauß’ law for electric and magnetic fields

$$\begin{aligned} \text {div} \ \vec {E} = 4 \pi \rho , \ \text {div} \ \vec {B} = 0. \end{aligned}$$
(29)

They state that the source (given by the divergence) of the electric field (\(\vec {E}\)) is a charge (the density distribution \(\rho\)), while the magnetic field (\(\vec {B}\)) has no source distribution (equals zero). The third and fourth of Maxwell’s equations are Faraday’s law of induction and Ampère’s circuital law

$$\begin{aligned} \text {rot} \ \vec {E} = - \frac{1}{c} \frac{\partial \vec {B}}{\partial t}, \ \text {rot} \ \vec {B} = \frac{4 \pi }{c} \vec {j} + \frac{1}{c} \frac{\partial \vec {E}}{\partial t}. \end{aligned}$$
(30)

They state that electric field vortices (rot \(\vec {E}\), also called curl) are generated by changing magnetic fields (\(\partial \vec {B} / \partial t\)), and magnetic field vortices (rot \(\vec {B}\)) are generated by changing electric fields (\(\partial \vec {E} / \partial t\)) and charge current density distributions (\(\vec {j}\)). Both the curl (rot), i.e., vortex strength, and the divergence (div), i.e., source strength, of the electric and magnetic fields are obtained using permutations of the field components. While the second and third equations are homogeneous, the first and fourth equations are inhomogeneous: the latter two contain source terms (electric charge and current density distributions).

Equations (29) and (30) are the differential forms of Maxwell’s equations. However, it is also possible to represent them in their integral forms. Gauß’s law for the electric field then writes

$$\begin{aligned} \oiint _{\partial \varOmega } \textbf{E} \cdot \textrm{d}\textbf{S} = \frac{1}{\varepsilon _{0}} \iiint _{\varOmega } \rho \ \textrm{d}V \end{aligned}$$
(31)

where \(\oiint _{\partial \varOmega }\) is a surface integral over the boundary surface \(\partial \varOmega\) (with the loop indicating that the surface is closed), and \(\iiint _\varOmega\) is a volume integral over the volume \(\varOmega\). Gauß’s law for the magnetic field then becomes

$$\begin{aligned} \oiint _{\partial \varOmega } \textbf{B} \cdot \textrm{d}\textbf{S} = 0. \end{aligned}$$
(32)

Faraday’s law of induction can be written as

$$\begin{aligned} \oint _{\partial \varSigma } \textbf{E} \cdot \textrm{d} \varvec{l} = -\frac{\textrm{d}}{\textrm{dt}} \iint _{\varSigma } \textbf{B} \cdot \textrm{d}\textbf{S}, \end{aligned}$$
(33)

where \(\oint _{\partial \varSigma }\) is a line integral integrating over the boundary curve \(\partial \varSigma\) (with the loop indicating that the curve is closed), and \(\iint _{\varSigma }\) is a surface integral over the surface \(\varSigma\). Finally, Ampère’s law becomes

$$\begin{aligned} \oint _{\partial \varSigma }&\textbf{B} \cdot \textrm{d} \varvec{l} = \mu _0 \left( \iint _{\varSigma } \textbf{j} \cdot \textrm{d}\textbf{S} + \varepsilon _0 \frac{\textrm{d}}{\textrm{d}t} \iint _{\varSigma } \textbf{E} \cdot \textrm{d} \textbf{S} \right) . \end{aligned}$$
(34)

Maxwell’s equations can also be transformed into a four-vector notation, which includes tensors and Einstein’s summation convention (as in Example 2: Einstein’s field equations). In this notation, the two inhomogeneous partial differential equations are reduced to

$$\begin{aligned} \partial _\alpha F^{\alpha \beta } = \frac{4 \pi }{c} j^\beta , \end{aligned}$$
(35)

and the homogeneous partial differential equation is reduced to

$$\begin{aligned} \varepsilon ^{\alpha \beta \gamma \delta } \partial _\beta F_{\gamma \delta } = 0. \end{aligned}$$
(36)

The charge and current density distributions (\(\rho\) and \(\vec {j}\)) are combined into one four-vector

$$\begin{aligned} (j^\beta ) = (c \rho , j^i). \end{aligned}$$

The four-derivative of both space and time is defined as \(\partial _\alpha = \frac{\partial }{\partial x^\alpha }\). The permutations needed for the curl and the divergence of the electric and magnetic field are encapsulated in the Levi-Civita symbol

$$\begin{aligned} \varepsilon ^{\alpha \beta \gamma \delta } ={\left\{ \begin{array}{ll} +1, &{} (\alpha , \beta , \gamma , \delta ) = \text {even permutation of} \ (0,1,2,3)\\ -1, &{} (\alpha , \beta , \gamma , \delta ) = \text {odd permutation of} \ (0,1,2,3)\\ 0 &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
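The three cases of this definition can be illustrated in code; the following sketch uses SymPy’s built-in LeviCivita function:

```python
import sympy as sp

# even permutations of (0, 1, 2, 3) give +1
assert sp.LeviCivita(0, 1, 2, 3) == 1
assert sp.LeviCivita(2, 3, 0, 1) == 1
# odd permutations give -1
assert sp.LeviCivita(1, 0, 2, 3) == -1
# repeated indices give 0
assert sp.LeviCivita(0, 0, 2, 3) == 0
```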

The electromagnetic field tensor is then defined as

$$\begin{aligned} (F^{\alpha \beta }) =\left( \begin{matrix} 0 &{} -E_x/c &{} -E_y/c &{} -E_z/c \\ E_x/c &{} 0 &{} -B_z &{} B_y \\ E_y/c &{} B_z &{} 0 &{} -B_x \\ E_z/c &{} -B_y &{} B_x &{} 0 \\ \end{matrix}\right) , \end{aligned}$$

containing all six components of both the electric and magnetic fields in three dimensions.
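A quick symbolic check (an illustration assuming SymPy) confirms two defining properties of this matrix form: it is antisymmetric, and its diagonal vanishes, leaving exactly the six independent field components:

```python
import sympy as sp

Ex, Ey, Ez, Bx, By, Bz, c = sp.symbols('E_x E_y E_z B_x B_y B_z c')
F = sp.Matrix([
    [0,     -Ex/c, -Ey/c, -Ez/c],
    [Ex/c,   0,    -Bz,    By  ],
    [Ey/c,   Bz,    0,    -Bx  ],
    [Ez/c,  -By,    Bx,    0   ],
])

# antisymmetry F^{ab} = -F^{ba} and a vanishing diagonal
assert F == -F.T
assert all(F[i, i] == 0 for i in range(4))
```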

Summarizing, Example 3 shows how unification into a single physics framework (Maxwell’s equations of electromagnetism) combines multiple Formula Concepts: Gauß’ law for electric and magnetic fields; Faraday’s law of induction; and Ampère’s circuital law. Equation (35) could either be labeled ‘Gauß’ electric law’ and ‘Ampère’s law’ or ‘Maxwell’s inhomogeneous equations’. Analogously, equation (36) could either be labeled ‘Gauß’ magnetic law’ and ‘Faraday’s law’ or ‘Maxwell’s homogeneous equations’. By transforming to the more compact notation, tensors and indices are introduced. Notably, the electromagnetic field tensor \(F^{\alpha \beta }\) subsumes all six components of the two field vectors.

Task 3: identification of challenges

In the following, we identify the challenges for Formula Concept Discovery and Recognition. They are derived from the discussion of the three Formula Concept examples. The challenges provide an impression of the peculiarities that need to be considered by FCD and FCR approaches.

Table 3 Challenges for Formula Concept Discovery and Recognition, derived from the discussion of three Formula Concept examples (differential equations presented in Sect. “Task 2: analysis of formula concept examples”)

Table 3 contains the results of our evaluation. Most of the challenges are notation issues: different names for symbols (constants or variables) are used, different notation systems are applied for signatures and units, and different forms for derivatives, summations, and tensors are employed. For some challenges, e.g., Challenge 3, there is an overlap between the different Formula Concept examples. For others, e.g., Challenges 10 and 11, the issues only apply to the specific example. We note an average of four challenges per example. It remains an open question whether this number increases or decreases with additional examples, since there can potentially be more or less overlap of challenges shared by examples. If the same challenges do not reoccur frequently and the number of challenges significantly increases with new examples, Formula Concept retrieval methods face additional difficulties.

Task 4: derivation of formula concept retrieval system requirements

In the following, we address the identified Formula Concept Discovery challenges by deriving requirements for a Formula Concept Retrieval system. Since currently fewer than 6,000 formulas are seeded into WikidataFootnote 19 and storing multiple representations as ‘defining formula’ of the same Formula Concept item is not endorsed, we argue for the creation of a specific Wikidata-attached Formula Concept Database (Schubotz et al., 2018b). It should include formalized augmentation to generate equivalent forms using, e.g., commutations, additional sub- and superscripts, unit and reference frame variations, etc. Most importantly, a method for inferring substitutions or implicit terms needs to be developed.

We propose to formalize the augmentation of a Formula Concept as a translation between its different representations. One could use equivalence transformations generated by Computer Algebra Systems to train, e.g., a Siamese Network (Bromley et al., 1993) to assess whether two formulas are representations of the same Formula Concept. For this, the choice of a suitable formula encoding needs to be explored. A hypothesis we have to examine beforehand is whether Formula Concept Recognition relies on identifying equivalent representations or only requires the semantic annotations of formula identifiers. In future work, we will discuss this further and explore the practical implications of interpreting a Formula Concept as a mathematical ‘word’ that can be translated between different representations (analogous to ‘languages’).

Apart from distinguishing FCD and FCR as separate methods, one could also combine them to discover Formula Concepts by recognizing (tagging) an increasing amount of formulas per mathematical concept over time. Therefore, we propose an Active Learning system that shows randomly selected formulas to a user. The system then has to figure out whether, for a shown formula, there is already a mathematical concept identifier available. If missing, it should create one and match the following occurrences to it. Unfortunately, CAS cannot generate all notation transformations (e.g., from vector to tensor, see Formula Concept Example 3).
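A minimal sketch of such an FCD-by-FCR loop could look as follows; the class name, the similarity callable, and the threshold value are illustrative assumptions, not the proposed system’s actual design:

```python
class FormulaConceptIndex:
    """Sketch of the proposed FCD-by-FCR loop: match an incoming formula to an
    existing concept if similar enough, otherwise create a new concept ID."""

    def __init__(self, similarity, threshold=0.8):
        self.similarity = similarity  # callable(formula_a, formula_b) -> [0, 1]
        self.threshold = threshold
        self.concepts = {}            # concept ID -> list of representations

    def tag(self, formula):
        best_id, best_sim = None, 0.0
        for cid, reps in self.concepts.items():
            sim = max(self.similarity(formula, rep) for rep in reps)
            if sim > best_sim:
                best_id, best_sim = cid, sim
        if best_id is None or best_sim < self.threshold:
            best_id = f'FC{len(self.concepts)}'  # mint a new concept identifier
            self.concepts[best_id] = []
        self.concepts[best_id].append(formula)
        return best_id

# toy run with exact string matching as the (placeholder) similarity function
index = FormulaConceptIndex(similarity=lambda a, b: 1.0 if a == b else 0.0)
first = index.tag(r'u_{tt} - \Delta u + m^2 u = 0')
assert index.tag(r'u_{tt} - \Delta u + m^2 u = 0') == first  # recognized again
assert index.tag(r'G_{\mu\nu} = \kappa T_{\mu\nu}') != first  # new concept
```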

Fig. 3
figure 3

Comparison of two representations of the Klein–Gordon equation (left and right). Different constituents of the expression trees are marked as semantic entities that have a unique Wikidata ID

Figure 3 shows the expression trees of two representations of the Klein–Gordon equation (left and right) in comparison. Different constituents of the equations are marked in the trees as semantic entities. They can be matched to unique IDs in a semantic database, e.g., Wikidata. For example, the identifier c representing the ‘speed of light’ is assigned the Wikidata item ID (QID) ‘Q2111’. Since both trees contain the same semantic entities, they can be matched as representing the same Formula Concept.
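The matching step can be sketched as a comparison of the semantic annotation sets. Only the QID ‘Q2111’ for the speed of light is taken from the example above; the remaining IDs are hypothetical placeholders, not actual Wikidata items:

```python
# Identifier-to-semantic-entity annotations for two Klein-Gordon representations.
# 'Q2111' (speed of light) is from the example above; the other IDs are
# hypothetical placeholders for illustration.
annotations_1 = {'c': 'Q2111', 'psi': 'QID_wave_function', 'm_0': 'QID_rest_mass'}
annotations_2 = {'c': 'Q2111', 'u': 'QID_wave_function', 'm': 'QID_rest_mass'}

# Two formulas are matched as the same Formula Concept if their sets of
# semantic entities coincide, regardless of the symbols chosen for them.
assert set(annotations_1.values()) == set(annotations_2.values())
assert annotations_1.keys() != annotations_2.keys()  # the symbols differ
```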

Summarizing, we derive the following Formula Concept Retrieval system requirements from the identified challenges for FCD:

  1. (1)

    Set up a Formula Concept Database (FCDB);

  2. (2)

    Employ equivalence transformations and Computer Algebra Systems;

  3. (3)

    Enable Formula Concept Discovery by Recognition (FCD by FCR); and

  4. (4)

    Integrate formula matching via semantic formula encoding.

Conclusion (FCD)

We compare the effectiveness of retrieving different Formula Concept representations of Method 2 (k-Nearest-Neighbors in formula vector space) with Method 3 (Wikipedia article first formula multi-language heuristic). While Method 3 achieves a precision of 34% for retrieving Formula Concept representations from multilingual Wikipedia articles, Method 2 outperforms this with a precision of 68% using machine learning. The kNN approach not only performs well, but also has the advantage of being easily usable and transferable to other corpora. Method 1 cannot be compared to the other two because it is a priori unclear where (at which number of webpages or textbooks) to stop the search. Therefore, we only concentrate on our three Formula Concept examples (KGE, EFE, and ME), for which we can retrieve a total of more than 30 representations, searching in publications, Wikipedia, and a textbook. We conclude that for Formula Concept Discovery to achieve the best results (retrieval of a large number of equivalent formula representations per concept), it is beneficial to combine the different methods optimally.

Formula concept recognition

In this section, we introduce methods for Formula Concept Recognition (FCR). Recall that the goal of FCR is to recognize formulas in documents as being instances of a previously defined Formula Concept.

The presented FCR methods have not been introduced or published before; prior work included only the first FCD experiments and results. To the best of our knowledge, no other FCR methods have been published so far. However, to establish comparability and replicability, we evaluate the performance of our approaches against that of open source and commercial formula search engines in Experiment 1.

In the following, we describe and evaluate several different approaches for FCR. To assess the feasibility and performance of the proposed methods, we set up the following three experiments:

  1. Experiment 1:

    Formula Concept Search;

  2. Experiment 2:

    Formula Concept Classification and Clustering;

  3. Experiment 3:

    Formula Concept Similarity.

In Experiment 1, we investigate how well Formula Concepts can be retrieved by search queries using the formula LaTeX string or the formula constituents. To this end, we employ several sources, such as Wikidata items, as well as Wikipedia articles and arXiv documents from the NTCIR dataset. The results from Wikidata can be associated with a unique semantic ID (the Wikidata QID). We compare the performance of the open source retrieval to selected competitor (formula) search engines. In Experiment 2, we assess how well a manually labeled balanced dataset of 100 Formula Concept examples from 10 classes can be automatically recognized by machine learning, using classification and clustering to separate the Formula Concepts in several vector encoding spaces. In Experiment 3, we test how well formula (encoding) similarities can indicate that different formulas are representations of the same Formula Concept. To this end, we compute a similarity map matrix of pairwise formula or class similarities. The developed algorithms, the dataset, and full result tables are available at https://github.com/ag-gipp/formula-concept-retrieval.

Experiment 1: formula concept search

We first approach the recognition of Formula Concepts (FCR) as a search ranking problem, in contrast to the classification and clustering examined in the subsequent experiment. To evaluate finding, i.e., recognizing, Formula Concepts in large corpora of mathematical content, we employ three open data sources (Wikidata, Wikipedia, arXiv) and two methods (retrieval using the formula LaTeX string or its constituents). Furthermore, we compare the performance of our methods to two formula search engines, one open source (Approach ZeroFootnote 20) and one commercial (GoogleFootnote 21).

Table 4 Ten classes of our test set with 100 Formula Concept differential equation examples, including a linked Wikidata QID and concept name with Wikipedia article source link (above), as well as an example equation latex string

For this and all subsequent FCR experiments, we collect a test set with 100 Formula Concept example differential equations from 10 classes. Table 4 shows the concept class names and labels, together with the corresponding Wikidata QID (above) and an example LaTeX string (below). The linked Wikipedia article is the source of the respective equations, which we collected for each class. A full list of all 100 collected equations can be found in the appendix. The selection extends the three classes discussed in Sect. “Task 2: analysis of formula concept examples” by 7 additional classes with 10 examples each. Each class corresponds to a Wikipedia article (as indicated in Table 4). This means that we here apply the definition of a Formula Concept as a set of equation representations collected from the same Wikipedia article.

For each of our 100 example formulas, we evaluate the performance of 8 selected Formula Concept search retrieval sources: arXiv latex, arXiv constituents, Wikidata latex, Wikidata constituents, Wikipedia latex, Wikipedia constituents, Approach Zero, and Google. The first 6 represent our retrieval methods over open corpora, while the last 2 employ search engines. The method label ‘latex’ indicates that the formulae are compared via their LaTeX strings, whereas ‘constituents’ means that the formula parts are aligned (set intersections of operators and identifiers).
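As an illustration of the constituents method, the following sketch (a crude simplification, not the actual parser used in the experiments) extracts operator and identifier tokens from LaTeX strings and scores their set overlap with the Jaccard similarity:

```python
import re

def constituents(latex):
    """Crude extraction of operator and identifier tokens from a LaTeX
    string (an illustrative simplification, not the paper's parser)."""
    return set(re.findall(r'\\[A-Za-z]+|[A-Za-z]', latex))

def overlap(a, b):
    """Jaccard similarity of the constituent sets of two formulas."""
    ca, cb = constituents(a), constituents(b)
    return len(ca & cb) / len(ca | cb)

# representations (1) and (6) of the Klein-Gordon equation
kge1 = r'\frac{1}{c^2}\frac{\partial^2\psi}{\partial t^2}-\nabla^2\psi+(\frac{m_0 c}{\hbar})^2\psi=0'
kge6 = r'-\hbar^2\frac{\partial^2\Psi}{\partial t^2}+c^2\hbar^2\nabla^2\Psi=m_0^2c^4\Psi'
assert overlap(kge1, kge6) > 0.5  # high overlap despite different notation
```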

We generated the top 10 results for each of the 8 sources on our 100 examples and manually assessed the ranking of the correct result for the resulting \(10 \times 8 \times 100 = 8,000\) formulae. As ranking measures, we used ‘Top-10 Recall’ and ‘Top-1 Recall’ as well as ‘Mean Rank’ (MR) and ‘Mean Reciprocal Rank’ (MRR), which is defined as Voorhees (1999)

$$\begin{aligned} \text {MRR} = \frac{1}{|Q|} \sum _{i=1}^{|Q|} \frac{1}{\text {rank}_i}, \end{aligned}$$

summing over the \(|Q| = 10\) query results. In this formula, \(\text {rank}_i\) refers to the rank position of the first relevant document for the i-th query. The reciprocal of the Mean Reciprocal Rank is the harmonic mean of the ranks.
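For concreteness, a minimal implementation of this ranking measure (with hypothetical ranks, and None marking queries where the correct formula is not retrieved) could read:

```python
def mean_reciprocal_rank(ranks):
    """MRR over the ranks of the first correct result per query;
    None marks queries where the correct formula is not retrieved."""
    reciprocal = [1 / r if r is not None else 0 for r in ranks]
    return sum(reciprocal) / len(reciprocal)

# five hypothetical queries: correct result at ranks 1, 2, (missing), 1, 4
print(mean_reciprocal_rank([1, 2, None, 1, 4]))  # prints 0.55
```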

Table 5 FCR as formula concept search problem

Table 5 shows the results of the Formula Concept search evaluation. The performance of the different FCR methods is compared to state-of-the-art (formula) search engine competitors (Approach Zero\(^{20}\) and Google\(^{21}\)). We also tested other formula search engines, such as MathWebSearch,Footnote 22,Footnote 23 ZbMATH formulae,Footnote 24 and Wolfram AlphaFootnote 25 but they were either not working, access-restricted, or performing too poorly to be included in the result table. The best results (lowest Mean Rank MR, highest Mean Reciprocal Rank MRR, and Recall) are marked in bold. The results show that the FCR method source ‘Wikipedia latex’ outperformed all other method sources in all metrics. This can be explained by the fact that our FCR examples were extracted from Wikipedia articles. However, not all equations were present in the NTCIR Wikipedia dataset. We find that the formula LaTeX string retrieval outperformed the retrieval using formula constituents. Furthermore, we compare our retrieval methods (FCRs) to the selected search engines (SEs). Our methods outperform the search engines in all metrics except ‘Top-10 Recall’ (and are very close in the ‘Top-1 Recall’ metric). Summarizing, we compare the performance of different retrieval methods and sources using several ranking measures to demonstrate that it is possible to recognize Formula Concepts via search, with a Mean Rank as low as 1.78, a Mean Reciprocal Rank of up to 0.78, and a Recall of up to 0.74. Our FCR methods outperform state-of-the-art search engines.

Experiment 2: formula concept classification and clustering

To assess how well the computer can separate our 100 Formula Concept examples into classes, we examine their joint formula (content or semantic) space. Recall that the formula content was defined, following Scharpf et al. (2018), as the sets of operators, identifiers, and numbers that a formula contains. Because of Challenge 2 (substitutions) and Challenge 13 (different unit systems), we decided to neglect the set of numbers. Compared to the operators and identifiers, there are significantly fewer numbers, and they heavily depend on substitutions and unit systems [e.g., the number 8 in the factor \(8 \pi\) or the exponent 4 in (23)].

Since formulas in mathematics can be syntactically similar to each other yet address completely different concepts semantically, or vice versa, we analyze the relationship between syntactic and semantic encodings. There are two challenging cases: (1) syntactically similar but semantically different formulas (syntactic inter-class coherence but semantic inter-class separability), and (2) syntactically different but semantically coherent formulas (syntactic intra-class separability but semantic intra-class coherence). An example of (1) from our selected classes is:

$$\begin{aligned} a \ \varPsi _{tt} + b \ \nabla ^2 \varPsi + c \ \varPsi = 0 \ \text {(class KGE) vs.} \ a \ \varPsi _{t} + b \ \nabla ^2 \varPsi + c \ \varPsi = 0 \ \text {(class SE)} \end{aligned}$$

or

$$\begin{aligned} -\partial ^2 \varPsi / \partial t^2 + \nabla ^2 \varPsi - m^2 \varPsi = 0 \ \text {(KGE) vs.} \ i \, \partial \varPsi / \partial t + 1/(2m) \, \nabla ^2 \varPsi - V \varPsi = 0 \ \text {(SE)}. \end{aligned}$$

An example of (2) is: \(F = m a\) vs. \(F = p / t\) (class NSL, expressed using mass m and acceleration a vs. momentum p and time t).

Encoding and classifying the syntactic or semantic formula content is indispensable, since the surrounding text is often noisy and the formula concepts are not explicitly named or described. Some authors of mathematical content implicitly assume the reader’s profound background knowledge. This limits the use of text-based encoding and classification methods. In the following, we describe and discuss our tests of the content vs. semantic coherence of Formula Concepts in terms of separability (classification accuracy and cluster centroid distance and purity).

For the machine learning experiments, we create four files with the equation labels, latex strings, content, as well as semantic annotations, including Wikidata QIDs. Each of the files has 100 lines corresponding to the individual formulas, i.e., 10 Formula Concept examples from each of the 10 classes KGE, EFE, ME, etc., respectively (see Table 4). As an example, consider the first formula (12). It belongs to the first class, so the line in the label file reads KGE. In the latex string file, the corresponding line reads

\frac{1}{c^2}\frac{\partial^2 \psi}{\partial t^2} - \nabla^2\psi

+ \left(\frac{m_0 c}{\hbar}\right)^2\psi = 0.

The content line, containing the set of parsed operators and identifiers, then reads

c, \partial, \psi, t, \nabla, m, \hbar.

We encode their semantics as

c: "speed of light" (Q2111),

\partial: "partial derivative" (Q186475),

\psi: "wave function" (Q2362761), t: "time" (Q11471),

\nabla: "del" (Q334508), m: "mass" (Q11423),

\hbar: "Planck constant" (Q122894)

where the ID in parentheses is the unique QID of the item that we find in the semantic knowledge base Wikidata.

Summarizing, the data pipeline is the following: we parse the formula latex strings (‘formula TeX’) into formula constituents (‘formula content’) and annotate them (‘formula semantics’) to obtain Wikidata encodings (‘formula qids’). This yields a dictionary of formula constituent meanings with an average of 2 different annotations per constituent. For example, the identifier ‘R’ appears as ‘distance’ (Q126017) or ‘Ricci curvature’ (Q1195879).
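The annotation-dictionary step can be sketched as follows (a minimal stdlib-only illustration; the function and data names are hypothetical, and the 'R' annotations are taken from the example above):

```python
from collections import defaultdict

def build_annotation_dictionary(annotated_formulas):
    """Collect, for each formula constituent, the set of Wikidata
    annotations (name, QID) it received across all formulas."""
    dictionary = defaultdict(set)
    for formula in annotated_formulas:
        for constituent, annotation in formula.items():
            dictionary[constituent].add(annotation)
    return dictionary

# Two annotated formulas in which the identifier 'R' carries
# different meanings (cf. the example in the text).
formulas = [
    {"R": ("distance", "Q126017"), "t": ("time", "Q11471")},
    {"R": ("Ricci curvature", "Q1195879"), "t": ("time", "Q11471")},
]
d = build_annotation_dictionary(formulas)
print(len(d["R"]), len(d["t"]))  # → 2 1
```

The dictionary thus records constituent ambiguity directly: 'R' is ambiguous (2 annotations), 't' is not.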

In our experiment, we employ the following formula vector encodings of both operators and identifiers:

  • Formula content TF-IDF;

  • Formula content Doc2Vec;

  • Formula semantics TF-IDF; and

  • Formula semantics Doc2Vec.

For the formula content encodings, the sets of the parsed operator and identifier latex strings from the content file are employed. For the formula semantics encodings, we use the sets of Wikidata QIDs. It is important to note that while the sequence of formula constituents does not matter for the TF-IDF encoding, it is considered by the Doc2Vec encoding. In our experiments, we focus on a relative evaluation, i.e., a comparison of different encodings, rather than optimizing the overall performance by tuning hyperparameters.
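Since each constituent appears at most once per formula, the term frequency is effectively binary, and the TF-IDF encoding reduces to an IDF-weighted set indicator. A minimal stdlib-only sketch (the actual experiments may rely on a library implementation; the constituent sets below are illustrative):

```python
import math

def tfidf_vectors(formula_constituents):
    """Encode each formula (a set of constituent tokens) as a
    TF-IDF vector over the joint vocabulary."""
    vocab = sorted(set().union(*formula_constituents))
    n = len(formula_constituents)
    # document frequency and inverse document frequency per constituent
    df = {tok: sum(tok in f for f in formula_constituents) for tok in vocab}
    idf = {tok: math.log(n / df[tok]) for tok in vocab}
    # binary term frequency: idf weight if present, else zero
    return [[idf[tok] if tok in f else 0.0 for tok in vocab]
            for f in formula_constituents]

formulas = [
    {r"\psi", r"\nabla", "c", "m"},   # KGE-like constituent set
    {r"\psi", r"\nabla", "V", "m"},   # SE-like constituent set
    {"G", "R", "T", "g"},             # EFE-like constituent set
]
vectors = tfidf_vectors(formulas)
print(len(vectors), len(vectors[0]))  # → 3 9
```

Constituents unique to one formula receive a positive weight there and zero elsewhere, which is what separates the EFE-like set from the other two.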

30 Examples We first examine the separation of the three Formula Concepts by investigating the formula space in each of the four computed formula vector encodings. Figures 4 and 5 show the resulting plots. We reduce the dimensions via Principal Component Analysis (PCA) to two (x- and y-axes). Furthermore, we color-code the results of our formula clustering experiment (see next paragraph), such that each datapoint color corresponds to a different cluster computed by k-means (\(k=3\)) clustering. Evidently, the three Formula Concept classes are best separated in the formula content space with Doc2Vec encodings (second plot), which exhibits the largest distances between the three cluster centroids (see Table 6). Only two Formula Concept examples of class ME are incorrectly located in the cluster that primarily consists of class KGE examples. We identify these as equations (35) and (36). We suspect that the partial derivative causes the confusion of these ME examples, since partial derivatives predominantly occur in the KGE.
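The clustering step can be sketched with a plain Lloyd's-algorithm k-means on already PCA-reduced 2D points (a stdlib-only illustration with toy data; the experiments may use a standard library implementation, and production k-means would use k-means++ seeding or restarts):

```python
def kmeans(points, centroids, iters=100):
    """Plain Lloyd's algorithm in 2D; returns one cluster label per point."""
    centroids = list(centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
                  for p in points]
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels

# three well-separated toy clusters standing in for KGE, EFE, ME,
# seeded with one point from each cluster
points = [(0, 0), (0.1, 0.2), (5, 5), (5.1, 4.9), (10, 0), (9.9, 0.1)]
print(kmeans(points, centroids=[points[0], points[2], points[4]]))
# → [0, 0, 1, 1, 2, 2]
```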

Table 6 Mean cluster centroid distance after employing PCA to reduce the number of datapoint dimensions to two (see the 2D plots in Figs. 4 and 5)
Fig. 4
figure 4

Formula content space of three selected Formula Concepts (KGE, EFE, ME), using TF-IDF or Doc2Vec encodings, reduced by Principal Component Analysis (PCA) to two dimensions. The color code corresponds to the clusters computed by k-means (\(k=3\)) clustering. The three classes are best separated in the formula content Doc2Vec encoding (second plot) with cluster mean centroid distance of 0.81, purity of 0.94, and classification accuracy of 0.90

Fig. 5
figure 5

Formula semantic space of three selected Formula Concepts (KGE, EFE, ME), using TF-IDF or Doc2Vec encodings, reduced by Principal Component Analysis (PCA) to two dimensions. The color code corresponds to the clusters computed by k-means (k = 3) clustering

As another measure for the separability of our three example Formula Concepts, we calculate the cluster purity as the number of datapoints of the class that makes up the largest fraction of a cluster divided by the cluster size, averaged over all clusters:

$$\text{purity} = \mathop{\text{mean}}_{\text{clusters}} \left[ \frac{1}{\text{cluster size}}\; \max_{\text{classes}} \left( \#\,\text{datapoints in cluster per class} \right) \right].$$
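This purity can be computed directly from the cluster assignments and class labels; a minimal sketch (the class labels are illustrative):

```python
from collections import Counter

def cluster_purity(cluster_labels, class_labels):
    """Mean over clusters of the largest class fraction within each cluster."""
    purities = []
    for c in set(cluster_labels):
        # class labels of all datapoints assigned to cluster c
        classes = [cls for cl, cls in zip(cluster_labels, class_labels)
                   if cl == c]
        purities.append(Counter(classes).most_common(1)[0][1] / len(classes))
    return sum(purities) / len(purities)

# two clusters: the first is pure, the second mixes two classes 3:1
print(cluster_purity([0, 0, 0, 1, 1, 1, 1],
                     ["KGE", "KGE", "KGE", "EFE", "EFE", "EFE", "ME"]))
# → 0.875
```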

Table 7 holds the cluster purities of a k-means clusterer on different formula vector encodings. Evidently, the formula content Doc2Vec encoding outperforms the others, as a comparison of Figs. 4 and 5 illustrates: in the Doc2Vec encoding, the smallest number of Formula Concept labels (only two) is mixed up.

Table 7 Mean cluster purity of a k-means clusterer on different formula vector encodings

As the third measure for the separability of our three example Formula Concepts, we calculate the classification accuracy of a Support Vector Machine (SVM) classifier on our four formula vector encodings. Summarizing, we test FCR approaches for Formula Concept separation using machine learning techniques such as neural formula vector encodings (Doc2Vec), dimensionality reduction (PCA), clustering (k-means), and classification (SVM). Our three measures of separability are 1) mean cluster centroid distance, 2) mean cluster purity, and 3) cross-validated classification accuracy. While the formula semantics TF-IDF encoding performs best in classification (averaged over the two classifiers and cross-validation splittings), the formula content Doc2Vec encoding outperforms the others in both cluster centroid distance and purity.

We avoid data skewness by employing a balanced dataset of examples equally distributed over classes.

The Formula Concept clustering using a k-means algorithm assigns 29/30 \(\simeq\) 97% of the examples correctly, while fuzzy string matchingFootnote 26 performs slightly worse with 28/30 \(\simeq\) 93%. Random sampling only reaches 8/30 \(\simeq\) 27%. The clustering thus outperforms the other methods. However, this only works if the cluster number k (the number of Formula Concept classes in the dataset) is known a priori.
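The fuzzy matching baseline can be illustrated with the standard library's difflib as a stand-in for fuzzywuzzy's partial-ratio scoring (the latex strings below are illustrative):

```python
from difflib import SequenceMatcher

def string_similarity(a, b):
    """Similarity percentage of two latex strings (0-100)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

kge = r"\partial_t^2 \psi - \nabla^2 \psi + m^2 \psi = 0"
se  = r"i \partial_t \psi + 1/(2m) \nabla^2 \psi - V \psi = 0"
efe = r"G_{\mu\nu} = 8 \pi T_{\mu\nu}"

# syntactically, the KGE string is much closer to the SE than to the EFE
print(string_similarity(kge, se) > string_similarity(kge, efe))  # → True
```

A simple nearest-neighbor rule on such scores yields the fuzzy-matching assignment compared against clustering above.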

100 Examples In the next step, we extend our study to the full dataset of 100 FC examples from 10 classes.

Fig. 6
figure 6

Classification accuracies (cross-validated) and cluster purities (labeling-referenced) for a selection of 100 equations, semantically annotated (constituent QIDs) and sorted into 10 classes (formula QIDs). The binomial choice distribution for a selection of N formulas out of the pool is shown above. Four different encodings (Content TF-IDF, Semantics TF-IDF, Content Doc2Vec, and Semantics Doc2Vec) are compared below

Table 8 Classification accuracies (cross-validated) and cluster purities (labeling-referenced) for a selection of 100 equations, semantically annotated (constituent QIDs) and sorted into 10 classes (formula QIDs)

Figure 6 and Table 8 show the performance evaluation of classification (cross-validated) and clustering (labeling-referenced) on the labeled selection of 100 FC examples from 10 FC classes. Classification accuracy (blue bars) and cluster purity (orange bars) are computed for each encoding (content or semantics in TF-IDF or Doc2Vec) in all 1275 combinatoric class choices individually (with N ranging from 3 to 10; see the top plot for the binomial distribution). The displayed values (y-axis) are averaged over all respective combinations for a given number of class choices (x-axis). For each of the 4×1275 runs, we perform N-fold cross-validation to retrieve the classification accuracy.

For the TF-IDF encoding (upper plots), the results are the following: while the classification accuracy remains approximately stable with increasing N, the cluster purity decreases. This means that in the supervised retrieval case (FCR), clustering is most appropriate for a small number of classes. However, it can still be helpful in the unsupervised case for discovering (FCD) and labeling unknown classes. For the Doc2Vec encoding (lower plots), the results are the following: the classification accuracy also decreases with increasing N, and the cluster purity decreases even more strongly. This suggests that it might be preferable to employ TF-IDF instead of Doc2Vec, which also has the additional advantage of being faster to compute.

We conclude that classification is potentially more useful than clustering for labeled FCR (if the formulas are already annotated). For unlabeled formulas, clustering might also not be helpful because, as stated before, the number of different concept clusters is not known a priori. However, in the upcoming Experiment 3, we show that a formula similarity map can be used instead as a means for both FCD and FCR.

Experiment 3: formula concept similarity

In this experiment, we investigate the FC separability using FC similarity map matrices. We start with a preliminary analysis of the small set of 30 examples to be subsequently extended to all 100 examples.

30 Examples Fig. 7 shows the matrix of Formula Concept latex fuzzy string similarities for the small selection of 30 formulas discussed in Sect. “Task 2: analysis of formula concept examples”. We employ the fuzz.partial_ratio function of the Python package fuzzywuzzy.Footnote 27 Each square corresponds to the similarity percentage between the example equation numbered on the x-axis and the example equation numbered on the y-axis. Since pairwise similarities are symmetric, the matrix is symmetric, and we can restrict the investigation to the part above (or below) the diagonal. Evidently, the three Formula Concepts (KGE, equations 1–10; EFE, equations 11–20; ME, equations 21–30) form three large squares (or triangles) aligned on the diagonal (which contains the individual 100% self-similarities). Particularly striking is the EFE square in the center of the matrix with its high values and density. This means that the Einstein Field Equation representations are the most mutually similar, and the Formula Concept is highly coherent. The considered representations of the other two Formula Concepts are much more diverse and thus more difficult to match or identify. Figure 8 shows the matrix of the Formula Concept semantic similarities. The color code corresponds to the number of matching Wikidata QIDs of the corresponding Formula Concept examples (x- and y-axes). The distribution is very similar to the fuzzy latex string content matching shown in Fig. 7 (except that the EFE square is slightly more distinct). Thus, semantification has no significant advantage here. However, in cases where the identifier symbols vary more, we expect an improvement.
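The similarity map is simply a symmetric matrix of pairwise scores; a minimal sketch using difflib's SequenceMatcher as a stdlib stand-in for fuzz.partial_ratio (the formulas are illustrative):

```python
from difflib import SequenceMatcher

def similarity_matrix(latex_strings):
    """Symmetric matrix of pairwise latex string similarity percentages."""
    n = len(latex_strings)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # symmetry: compute the upper triangle only
            s = round(100 * SequenceMatcher(None, latex_strings[i],
                                            latex_strings[j]).ratio())
            m[i][j] = m[j][i] = s
    return m

formulas = [r"\nabla^2 \psi - m^2 \psi = 0",
            r"\nabla^2 \psi + k^2 \psi = 0",
            r"G_{\mu\nu} = 8 \pi T_{\mu\nu}"]
M = similarity_matrix(formulas)
print(M[0][0], M[0][1] == M[1][0])  # → 100 True
```

The diagonal carries the 100% self-similarities, and close formula pairs form the bright blocks visible in Fig. 7.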

Fig. 7
figure 7

Matrix of the Formula Concept latex fuzzy string similarity percentages. On the x- and y-axes, the equation number is displayed such that each little square corresponds to one similarity value between one equation and another

Fig. 8
figure 8

Matrix of the matching numbers of formula semantic QIDs. On the x- and y-axes, the equation number is displayed such that each square corresponds to one similarity value between one equation and another

Fig. 9
figure 9

Comparing unlabeled random equations (left) from the arXiv NTCIR dataset (astro-ph domain) to selected labeled equations (right) annotated by a human domain expert in different encodings (TF-IDF above, Doc2Vec middle, Fuzzy below, Content left, and Semantic right). Axes show random numbers or selected equation class labels. Very high TF-IDF, Doc2Vec cosine, or fuzzy string similarity between equations are marked in red. Figure best viewed in color

100 Examples Fig. 9 shows a comparison of the formula similarities of random unlabeled to all of our 100 selected labeled example formulas. While the random formulas are extracted from the arXiv NTCIR dataset, the labeled selection is taken from Wikipedia articles.

In Doc2Vec and Fuzzy encodings, the random unlabeled similarity map appears to be very similar to that of the labeled selection. This indicates that in both random sampling and labeled sampling, most of the formulas are not very similar to each other (blue background). However, for the labeled selection, there is an apparent self-coherence of the individual labeled FC classes (brighter red squares on the diagonal line).

We conclude that since the similarity map of the labeled FCs shows no weaker similarity structure than that of random formulas, classification and clustering are justified as suitable means to recognize FCs. The low similarity between, i.e., the distinctness of, the labeled classes reflects the real-world situation for formulas in corpora, which is fortunate since it makes search and machine learning methods effective.

We can show that in the random sampling, the formula distinctness (low mutual similarity) is as pronounced as for the labeled selection. This means that our machine learning experiments presented in Sect. “Related work” are reasonable, since they represent an information retrieval scenario that could occur in practice.

Fig. 10
figure 10

Comparing labeled equation similarities for different encodings (TF-IDF above, Doc2Vec middle, Fuzzy below, Content left, and Semantic right). Axes show equation class labels. Very high TF-IDF, Doc2Vec cosine or fuzzy string similarity between equations are marked in red. Similarities are sorted within classes. Figure best viewed in color

Figure 10 shows the formula similarities in different encodings (TF-IDF, Doc2Vec, Fuzzy) for all 100 examples, comparing the content space (formula constituent symbols encoded) to the semantic space (formula constituent QIDs encoded). Similarities are sorted within classes. The self-coherence of the labeled formula classes (labels on axes) is evident in all encodings. However, in the semantic space (Doc2Vec) encoding, additional inter-class / cross-class coherences are visible (some squares span several classes, e.g., ‘BE’ and ‘HE’ in the middle). This indicates latent semantic coherences that are less visible in the unsemantified content encoding.

Fig. 11
figure 11

Comparing labeled averaged class similarities for different encodings (TF-IDF above, Doc2Vec middle, Fuzzy below, Content left, and Semantic right). Axes show equation class labels. Very high TF-IDF, Doc2Vec cosine or fuzzy string similarity between equations are marked in red. Figure best viewed in color

Figure 11 shows the formula similarities in different encodings and spaces averaged over classes (mean pooling). This view helps to better highlight the intra-class and inter-class coherences. On the top-left, the high intra-class coherence of the ‘EFE’ formulas is illustrated by the prominent darker (more intensely red) square. Moreover, the cross-class coherence mentioned in the description of Fig. 10 is apparent again in the semantic space (Doc2Vec) encoding shown in the center of the middle right plot. Besides, other class similarities, such as that of the Klein–Gordon equations (‘KGE’) and Schrödinger equations (‘SE’), can be identified as brighter squares. Notice that the semantic (Fuzzy) space map (Fig. 10, bottom right) shows that the inter-class similarity between KGE and SE in the semantic space is comparable to the intra-class similarity of the ME class. This is reasonable, since they are indeed semantically very close: in the quantum physics framework, one equation can be derived from the other and vice versa. On the other hand, the intra-class similarity of the ME instances is high, since they are mutually semantically related. The FC class similarity maps are also helpful for FCD, i.e., discovering FCs as coherent similarity areas that can subsequently be analyzed and labeled.

Fig. 12
figure 12

Sorted similarity maps (Content TF-IDF encoding) for equations (left) and classes (right). The mean equation similarity is 0.2

Figure 12 illustrates the overall dissimilarity of the equations in a sorted similarity map. The blue space (low similarity) significantly outweighs the red area (high similarity) at the bottom. The low mean equation similarity of 0.2 motivates FCR methods that exploit this separability.

Conclusion (FCR)

In three different experiments, we investigate the feasibility and effectiveness of methods to retrieve, separate, and recognize Formula Concepts (FCR). For all experiments, we employ a manually labeled dataset of 100 Formula Concept examples from 10 classes retrieved from Wikipedia articles.

In Experiment 1 (Formula Concept search), we compare 8 different formula search methods on open corpora (Wikidata, Wikipedia, arXiv) and the web. We test how well Formula Concepts can be retrieved by search queries using either the formula latex string or the formula constituents, respectively. The results show that, using different retrieval methods and sources, it is possible to recognize Formula Concepts via search, with a Mean Rank as low as 1.78, a Mean Reciprocal Rank of up to 0.78, and a Recall of up to 0.74. Our FCR methods outperform the state-of-the-art search engines Approach Zero and Google.

In Experiment 2 (Formula Concept classification and clustering), we assess Formula Concept separability by machine learning classification and clustering in selected formula encodings. The results show that while the cluster purity decreases with more FC classes, the classification accuracy remains approximately stable around 0.9 when using TF-IDF formula encodings. This means that, with stable accuracies, FC classification might be a more powerful means for FCR than FC clustering.

In Experiment 3 (Formula Concept similarity), we visualize formula (encoding) separability in similarity map matrices to illustrate coherence and overlap of Formula Concepts. The results show that similarity maps are a valuable method for identifying both intra-class coherence and inter-class separability or overlap, which is useful for both FCD and FCR. Furthermore, the results motivate the employed machine learning methods, since a comparison of our manual formula selection to randomly chosen formulas shows that in both cases Formula Concepts are rather dissimilar and their classes thus separable from each other.

We conclude that the search for specific formulas within a large dataset of STEM documents is a challenging problem. Furthermore, we note that for FCR, there is an urgent need to augment semantic formula databases, for example, mathmlbenFootnote 28 and Wikidata, such that they allow for multiple representations of a formula to be stored as a Formula Concept. Having formulas tagged by Wikidata QIDs enables using them as markers in documents that can be cited (math citations). Additionally, they can be employed to improve content-based recommender systems for academic literature, plagiarism detection systems, and ontology learning.

Note that our study’s aim is not a large-scale evaluation but rather a deductive conceptual work. The data, plots, and results we presented serve to illustrate the methodological concepts. We demonstrate the fundamental feasibility using examples and outline the potential for machine learning on labeled formula data. For a large-scale analysis using unlabeled formula data, we refer to the literature (Scharpf et al., 2020b; Greiner-Petter et al., 2020).

Future work

This section outlines future endeavors and challenges, which we plan to address to further improve, evaluate, and apply FCD and FCR methods to additional use cases. These include exploring the practicability of a ‘Formula Rank’, investigating a formula semantics sufficiency hypothesis, and developing methods for efficient semantic formula and triple annotation.

FormulaRank and Semantic Indexing. In analogy to Google’s ‘PageRank’ (Brin & Page, 1998) and ‘TextRank’ (Mihalcea & Tarau, 2004), we propose to employ a ‘FormulaRank’ for Formula Concept popularity retrieval. FormulaRank is supposed to rank formulas by the number of nearest neighbors (kNN) or constituent intersections to estimate their importance. For this experiment, we first need to elaborate interpretation standards and evaluation metrics for the results. Second, we will develop and evaluate semantic indexing of the arXiv datasets, containing formulas, their latex strings, constituents retrieved from mathml tags, surrounding text, and more.
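A first sketch of how FormulaRank could score popularity via constituent intersections (hypothetical; neither the function name nor the neighbor threshold is fixed by the proposal):

```python
def formula_rank(formulas, min_shared=2):
    """Rank formulas by their number of neighbors, where a neighbor is
    a formula sharing at least `min_shared` constituents."""
    scores = []
    for i, f in enumerate(formulas):
        neighbors = sum(1 for j, g in enumerate(formulas)
                        if i != j and len(f & g) >= min_shared)
        scores.append((neighbors, i))
    return sorted(scores, reverse=True)  # most popular formulas first

# illustrative constituent sets: three wave-equation-like formulas
# and one isolated EFE-like formula
formulas = [
    {r"\psi", r"\nabla", "m", "c"},
    {r"\psi", r"\nabla", "m", "V"},
    {r"\psi", r"\nabla", "k"},
    {"G", "T", "g"},
]
print(formula_rank(formulas)[-1])  # → (0, 3): the EFE-like set is isolated
```

Variants could weight the intersection by IDF or restrict neighbors to the k nearest, as the kNN formulation suggests.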

Functional vs. Semantic Recognition. Furthermore, we will investigate the following research question: “Does the recognition of Formula Concepts require taking the functional relations of the formulas into account, or is it sufficient to only consider the semantics of the formula constituents?”. As an example, the Klein–Gordon equation

$$\begin{aligned} \frac{1}{c^2} \frac{\partial ^2 \psi }{\partial t^2} - \nabla ^2 \psi + \left( \frac{m_0 c}{\hbar } \right) ^2 \psi = 0, \end{aligned}$$

can be encoded as the semantic fingerprint of its constituents:

c: "speed of light" (Q2111),

\partial: "partial derivative" (Q186475),

\psi: "wave function" (Q2362761), t: "time" (Q11471),

\nabla: "del" (Q334508), m: "mass" (Q11423),

\hbar: "Planck constant" (Q122894)

Alternatively, one could additionally take into account that the partial derivatives \(\partial /\partial t\) and \(\partial /\partial x\) act on the wave function \(\psi\) and are applied with respect to both time t and space x. Considering this circumstance would mean taking the functional relations of the formulas into account instead of merely considering the set of the semantics (fingerprint) of the formula constituents.

Semantic annotations. To enable FCD by FCR, we are building a latex formula annotation recommender system (Scharpf et al., 2019a), which helps and motivates authors from the STEM disciplines to make their papers semantically machine-interpretable by annotating formula and identifier names with Wikidata items (name and QID). We need labeled formula data for the semantic encodings and formula classification introduced in Section 4.2. Our long-term goal for this system is to directly integrate the annotation recommendation into both Wikipedia and Overleaf’s editing or composing views. This would allow the Wikipedia and research communities to be more easily included in the semantification process of mathematical articles and research papers. Employing extended AI-aided formula annotation enables scaling our approaches in further research projects on our infrastructure at Wikimedia, zbMATH, and the University of Göttingen.

RDF triple extraction. In the future, the semantic annotator will provide recommendations of RDF triples, both for natural language and mathematical statements. A natural language statement can be, for example, the triple {theory of relativity (Q43514), instance of (P31), scientific theory (Q3239681)}. For the mathematical statements, the Formula Concepts are represented as the triple {Formula Concept item name, defining formula, formula latex string}.
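Such a mathematical-statement triple could be modeled as follows (an illustrative sketch; the type and field names are hypothetical):

```python
from typing import NamedTuple

class MathTriple(NamedTuple):
    """Triple representation of a mathematical statement."""
    item_name: str   # Formula Concept item name
    relation: str    # e.g., 'defining formula'
    latex: str       # formula latex string

kge = MathTriple(
    item_name="Klein-Gordon equation",
    relation="defining formula",
    latex=r"\frac{1}{c^2}\frac{\partial^2\psi}{\partial t^2}"
          r" - \nabla^2\psi + \left(\frac{m_0 c}{\hbar}\right)^2\psi = 0",
)
print(kge.item_name)  # → Klein-Gordon equation
```

A natural language statement, such as {theory of relativity, instance of, scientific theory}, fits the same subject-predicate-object shape.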