1 Introduction

Historically, proving theorems of mathematics was one of the central aims of artificial intelligence (AI) research (Newell et al., 1957). After some early successes, however, optimism about AI theorem proving faded. This does not mean that computer-assisted theorem proving or other computer applications have become less important for mathematical practice. In applied mathematics, in particular, the use of computer tools has become central to practice. This is an important topic, but in this paper I will focus on the mathematical practice of proving theorems. In that regard, too, computers have had an important effect on mathematics, as recent decades have seen a notable rise in the status of computer-assisted theorem proving. Many theorems, like the four-color theorem and the Kepler conjecture, have received proofs that would not have been possible without computers (Appel & Haken, 1976; Hales et al., 2017).Footnote 1 Nevertheless, the present generation of theorem proving computers is limited in its applications, and as a result the proofs are qualitatively different from proofs conducted by human mathematicians. Typically, present-day computer proofs apply brute computing force to exhaust a finite set of cases, one after another.

In addition to theorem proving, computer software is used to assist with logical deductions. In the field called automated theorem proving (ATP), such proof assistant software can be used for checking proofs, but also potentially for coming up with new proofs and new theorems. A typical automated theorem prover functions by running specific algorithms that represent inferences in a system of logical calculus. Such rule-based software has many advantages. For one, we can generally trust it, because we can know the algorithm it is running. However, rule-based automated theorem provers have clear limitations. They run a mechanical procedure reliably, but their processing does not concern what the theorem is about, or even what the proof does to establish its correctness. In short, even though they are discussed as applications of artificial intelligence, they are not intelligent in any way.Footnote 2

This is particularly important to remember when we consider the potential of computer software in proving new theorems, or providing new proofs for existing theorems, in an autonomous manner. An automated theorem prover can be fed a system of axioms that it then uses to prove and output theorems of the system. Usually in mathematics, the number of theorems is infinite, so the output needs to be limited to a finite subset. But even when limited to some finite subset of the theorems, the automated theorem prover would typically be indiscriminate in its proving capacity. It could list a million theorems as its output, but most likely only a few of them would be in any way interesting to human mathematicians.

One approach to making automated theorem provers more sophisticated is to program them to include criteria for interesting proofs and theorems. As we will see, some moderate progress has been made in this way. Yet the developments so far do not suggest that such software provides significant advances over previous programs. This matter could be different, however, if instead of following specific rules, the computers learned mathematics. The theorem-proving potential of a particular type of artificial intelligence, i.e., a machine learning application run on deep artificial neural networks, is the main topic of this paper. I will analyze how an artificial neural network could develop some way of processing mathematics that would enable it to distinguish between interesting and trivial proofs and theorems. As of now, symbolic mathematics is still a very difficult prospect for artificial neural networks. However, there are some early results that point to the possibility that this could change in the future. In this paper, I will reflect on this possibility. In particular, I am interested in what kind of mathematical role this kind of AI could feasibly play, and how it would be received by human mathematicians and incorporated into the mathematical community.

In Sect. 2, I provide a short history of theorem proving in artificial intelligence, demonstrating how it has been an important issue in AI research ever since the establishment of the discipline. In Sect. 3, I then present the state of the art in automated theorem proving, distinguishing between interactive and autonomous automated theorem proving. The current generation of automated theorem proving software is then analyzed in Sect. 4 in terms of its ability to distinguish between interesting and trivial proofs and theorems. In Sect. 5, the focus switches to machine learning and artificial neural networks, as I review some early results showing potential in the field. Then in Sect. 6, I provide a critical analysis of what artificial neural networks could and could not do in the field of theorem proving. In Sect. 7, I discuss the changes this would cause to mathematical practice in theorem proving, and in Sect. 8 their epistemological importance. I argue that the changing epistemic role of computers in mathematics is best handled within a community approach, in which computer-assisted and computer-generated proofs are assessed by the mathematical community essentially similarly to the way humanly generated proofs are. Finally, in Sect. 9, I briefly discuss the questions of authorship and accountability that emerge from the increasing use of AI in mathematics.

2 A very brief history of theorem-proving AI

Technically speaking, automated theorem proving is a subfield of the more general area in artificial intelligence research called automated reasoning. However, while the latter should in principle apply to different forms of reasoning, in practice automated reasoning has become largely identified with mechanized deductive inferences. Thus, the difference between the notions of automated reasoning and automated theorem proving has largely disappeared. Unless otherwise specified, both can be assumed to refer to mechanical, algorithmic computing procedures that represent inferences in formal systems of a logical calculus (Portoraro, 2021).

Therefore, the history of automated reasoning can be traced back to the formalization of logic. Particularly important in this development was the work of Frege, whose Begriffsschrift introduced a logical calculus of propositional and predicate logic (Frege, 1879). In his Grundlagen der Arithmetik (1884) and the subsequent Grundgesetze der Arithmetik (1893), Frege extended this approach with the – ultimately unsuccessful – aim of deriving the laws of arithmetic from the basic laws of logic. The approach of using symbolic logic to derive mathematical theorems reached its early apex in the three-volume Principia Mathematica of Whitehead and Russell (1910–1913). Their goal was to show that all mathematical theorems can be derived from logical axioms by following rules of proof. If everything about mathematics could be thus formalized, it made sense to conjecture that mathematical reasoning could consequently be automatized. Indeed, there was early promise that this could be done when Presburger presented in 1929 an algorithm for deciding whether a sentence of an arithmetic consisting of natural numbers and addition is true (Presburger, 1929). However, this early promise was very quickly countered by Gödel’s incompleteness theorems. While Presburger’s arithmetic only had addition, Gödel showed that any consistent formal system that can express standard Peano (1889) arithmetic (i.e., arithmetic with both addition and multiplication) is incomplete, i.e., it cannot prove all the truths expressible in the system (Gödel, 1931).

In the study of mathematical logic, Gödel’s result was momentous. Nevertheless, when technology developed sufficiently to make the mechanical application of algorithms a reality, the incompleteness theorems did not discourage researchers of automated reasoning. One of the most important early developments in this regard was Davis’s 1954 programming of the Presburger algorithm on a vacuum tube computer. As reported by Davis (2001), Presburger’s procedure was needlessly complex and the program didn’t fare particularly well. It did manage, however, to prove that the sum of two even numbers is an even number (Davis, 1983). This may have been the first general mathematical theorem proved by a computer.

The work of Davis, however, was only one development in the rapid growth of AI research in the 1950s. Simultaneously with Davis, Newell, Simon and Shaw had been working on automated theorem proving, and in 1956 they presented their computer program Logic Theorist, which proved theorems of the Principia Mathematica of Whitehead and Russell (McCorduck, 2004; Newell et al., 1957). Logic Theorist is often called the first artificial intelligence program (see, e.g., Crevier, 1993) and it was for its time quite impressive. It quickly proved 38 of the first 52 theorems of Chap. 2 of Principia Mathematica.

Intriguingly for the present purposes, in one case the proof provided by the Logic Theorist was deemed more elegant than the one provided originally by Whitehead and Russell (namely theorem 2.85; see McCorduck, 2004, p. 167). This may well have been the first case of an AI improving on the work of human mathematicians when it comes to theorem proving. It could also have become the first computer-assisted proof to be published in an academic journal. As recounted by Crevier (1993, p. 46), Newell and Simon submitted the proof to the Journal of Symbolic Logic. However, the paper was rejected on account of being in the outmoded system of Principia Mathematica, with apparently no significance given to the fact that a computer had come up with the proof.

3 Automated and interactive theorem provers

The great promise shown by early programs like Logic Theorist did not lead to the kind of revolution in mathematics that the first AI researchers may have envisioned. It certainly did not lead to the kind of philosophical revolution that one of its creators, Simon, later claimed:

[W]e invented a computer program capable of thinking non-numerically, and thereby solved the venerable mind/body problem, explaining how a system composed of matter can have the properties of mind. (Simon, 1991, pp. 206–207)

This quotation may be too boastful for most people’s tastes, and perhaps should not be taken at face value. But it also reveals an important belief of the early AI researchers. For them, there was no important difference between an AI showing human-like behavior and it thinking in a human-like fashion. This goes against a basic distinction standardly made in modern AI research, according to which we need to distinguish between intelligence as a property of behavior and as a property of internal processes (see, e.g., Russell & Norvig, 2020, Sect. 1.1). For Simon, because Logic Theorist showed intelligent behavior, it also had to have “properties of the mind”.

In the modern context, this distinction is central. While the promise of the early theorem proving computers may not have (at least yet) been fully realized, recent decades have seen the emergence of many important theorem proving systems, such as Isabelle, Vampire, Prover9, Mizar, OTTER, Waldmeister, Lean and E. In addition, software like MATLAB and Mathematica provides features that can be used for theorem proving purposes. All these systems achieve far more than Logic Theorist ever could, but few would claim that they are in any way intelligent. Whatever the properties of the mind involved in theorem proving may be, theorem proving software is not thought to mirror or instantiate them.

What the current generation of theorem proving software does in most common mathematical applications is roughly the following. The human user gives the software a problem as input, consisting of a set of axioms (first-order formulas) and a conjecture (a first-order formula). Then, standardly using first-order logic with equality, the theorem proving software checks whether the conjecture follows from the axioms (Voronkov, 2003, p. 1607).Footnote 3 Instead of a mere yes/no output from the theorem prover, it is desirable that the software produces a proof (in the case of ‘yes’), which should then be readable by humans (ibid.).
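To make this input format concrete, the following toy example, written in the proof assistant Lean (one of the systems listed above), states an “axioms + conjecture” problem together with a machine-checkable proof. The example and its names are my own illustration, not drawn from the cited sources:

```lean
-- A toy "axioms + conjecture" problem in Lean 4. The hypotheses ax1 and
-- ax2 play the role of the axiom set; the goal p → r is the conjecture.
-- The proof script below is the kind of humanly readable certificate
-- discussed in the text.
theorem conjecture_follows (p q r : Prop)
    (ax1 : p → q) (ax2 : q → r) : p → r := by
  intro hp            -- assume p
  exact ax2 (ax1 hp)  -- chain the two axioms to reach r
```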

As mentioned in the introduction, such programs for automated theorem proving are standardly called proof assistants. Sometimes they are also called interactive theorem provers (ITP). In interactive theorem proving, theorems are proved through a human-machine collaboration. A typical example of this is using the ITP for checking the validity of a formal proof, or a part of it. ITPs are used to differing degrees by many mathematicians and they are becoming increasingly important tools for the mathematical community (see, e.g., Barendregt & Wiedijk, 2005). The automated theorem prover Mizar, for example, is associated with a library (The Mizar Mathematical Library)Footnote 4 of formalized mathematical proofs that can be used by authors to check the validity of their proofs. These proofs currently formalize introductory mathematics, but new submissions are constantly added by the community to contribute more advanced results. This library could in the future provide an easy and reliable way to check the validity of proofs for state-of-the-art mathematical research.

This kind of interactive theorem proving is not the only way in which automated theorem provers can help mathematics progress. Another form of automated theorem proving would be for the software to prove new theorems on its own. In such ATP applications, the idea is that the software proves theorems independently, after getting the initial input of a system of axioms. As a result, the ATP could both prove new theorems and provide new proofs of existing theorems. Let us call this autonomous automated theorem proving (AATP), to distinguish it from interactive theorem proving. In practice, the distinction between ITP and AATP is likely to be based on use, not necessarily on software. It is feasible that the same ATP software could be used for both purposes, even by the same mathematician.

4 Distinguishing between the interesting and the trivial

While automated and interactive theorem provers have developed greatly in recent years, their importance for mathematical practice in the field of theorem proving should not be overestimated. Many mathematicians use such AI applications to varying degrees in their work, and in some tasks, like checking proofs, they can be an indisputably useful tool. In general, in the growing field of experimental mathematics, the use of computers for mathematical purposes has become increasingly important (see, e.g., McEvoy, 2013; van Kerkhove & van Bendegem, 2008). This approach can include testing conjectures, but also discovering new patterns and gaining new insights (Borwein & Bailey, 2008, pp. 3–4).

In this approach, the potential of AATP software has yet to be established. While such software can provide proofs of new theorems, as well as new proofs of existing theorems, it remains to be seen how useful it can be in generating new theorems and proofs that mathematicians find interesting. What an ATP can do is take a system of axioms and derive proofs according to a system of logic. As the output, we could inquire whether a certain conjecture is a theorem of the axiomatic system, which is the ITP approach. Alternatively, we could simply have the ATP list a (finite) subset of theorems of the system, which is the AATP approach. What the AATP cannot currently do is evaluate the theorems it proves in terms of their mathematical importance. So far, to the best of my knowledge, there is no software programmed to autonomously recognize interesting proofs, or interesting theorems. That is still an exclusively human activity, and as such within the field of interactive theorem proving.Footnote 5

This is not to say that automated theorem provers cannot discriminate between proofs based on human proof-theoretic criteria. The most obvious of these is the length of a proof. Veroff, for example, has presented a procedure for searching for the shortest proof with the theorem proving software OTTER (Veroff, 2001). Fitelson and Wos have also used OTTER to find shorter proofs for logical theorems (Fitelson & Wos, 2001). Kinyon has used the software Prover9 for proof simplification, a procedure for shortening proof lengths (Kinyon, 2019). All these can be seen as efforts to find automated ways of establishing humanly appealing proofs. Yet these methods are very simple, and they take a limited approach even to the question of proof length.

To see this, we need to understand better how these systems function. What the ATPs do is provide a list of inference steps and the justification for each step. Thus an ATP proof consists of two parts: a sequence of clauses consisting of atomic formulae and their negations, and the inferences used to derive each clause from its parent clauses (Kinyon, 2019). The length of the proof then refers simply to the number of clauses in the sequence. Yet, as pointed out by Kinyon, the simplicity of an ATP proof could also be measured in at least two other ways. First, instead of a sequence of clauses, a proof can be visualized as a directed graph. Simplicity of the proof could then refer to the complexity of such graphs. Second, in presenting the proof as a sequence of clauses, in addition to the number of lines (clauses) in the proof sequence, the length of the clauses themselves also adds to the complexity of the proof. In the simplest case, this is measured by the number of symbols in each clause, called the weight of the clause (ibid.).
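The two simplest measures just mentioned – the number of clauses and the weight of each clause – are easy to state in code. The following Python sketch uses a schematic proof representation of my own devising; it does not reproduce any particular prover’s output format:

```python
# Schematic representation of an ATP proof: a sequence of clauses, each
# recording its literals, the justification (axiom or inference rule), and
# the indices of its parent clauses. The format is illustrative only.
from dataclasses import dataclass, field

@dataclass
class Clause:
    literals: list                                # atomic formulae and negations
    rule: str                                     # justification for this step
    parents: list = field(default_factory=list)   # indices of parent clauses

    def weight(self) -> int:
        # The simplest weight measure: number of symbols in the clause.
        return sum(len(lit.replace(" ", "")) for lit in self.literals)

proof = [
    Clause(["p(X)"], "axiom"),
    Clause(["~p(X)", "q(X)"], "axiom"),
    Clause(["q(X)"], "resolution", parents=[0, 1]),
]

print("proof length (number of clauses):", len(proof))
print("clause weights:", [c.weight() for c in proof])
```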

This gives us some idea of how difficult it is to measure the simplicity of an ATP proof in an objective manner. So far, the approaches have focused on measuring the number of clauses in a proof sequence, but that is already a simplified procedure. However, even if we had a more inclusive measure, perhaps combining length with clause weights and graph complexity, how would we know how to weight the different notions in assessing the simplicity of a proof? One classic approach to finding a way around such problems has been to invoke the notion of Kolmogorov complexity (Kolmogorov, 1963/1998). Kolmogorov complexity refers to the length of the shortest computer program which has an informative object, such as a string of symbols, as its output. To give a simple example, consider the strings “bbbbbbbbbbbbbbbbbbbb” and “keehfydo38dkrislero2”. Both strings are 20 symbols long, but whereas the second string cannot (presumably) be described by a shorter string, the former can: the English description “20 times b”, for example, is 10 symbols long (counting spaces). Thus, the former string has a lower Kolmogorov complexity than the latter.
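Although Kolmogorov complexity is defined via shortest programs, the intuition behind the example can be demonstrated with a computable stand-in: the length of a compressed encoding, which upper-bounds how briefly a string can be described. A minimal Python sketch follows; compression is only a rough proxy, and for strings this short the compressor’s fixed overhead dominates the absolute numbers, though the relative comparison still holds:

```python
# Compressed length as a rough, computable proxy for the intuition behind
# Kolmogorov complexity: a regular string admits a much shorter description
# than a patternless string of the same length.
import zlib

regular = b"b" * 20                    # "bbbbbbbbbbbbbbbbbbbb"
patternless = b"keehfydo38dkrislero2"  # 20 symbols, no apparent regularity

for s in (regular, patternless):
    print(s.decode(), "->", len(zlib.compress(s, 9)), "bytes compressed")
```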

ATP proofs can also be measured in terms of their Kolmogorov complexity, given that they are informative objects comparable to strings of symbols. This would have the advantage that instead of multiple measures, there would be a single well-defined notion of complexity. Since it refers to the shortest computer program already in its definition, Kolmogorov complexity might initially appear well suited as a general measure of the simplicity of proofs. After all, mathematical proofs can be seen as instances of computer programsFootnote 6, and there is intuitive plausibility in the idea that shorter programs provide simpler proofs. However, Kolmogorov complexity is not without problems. While as a theoretical notion it may seem straightforward and intuitive, it was proved early on that Kolmogorov complexity is in fact incomputable, i.e., there is no general algorithm for determining the Kolmogorov complexity of a string of symbols (Vitanyi, 2020; see also Zvonkin & Levin, 1970). In addition, it has turned out that determining the Kolmogorov complexity of even short strings of symbols is an extremely difficult task (see, e.g., Soler-Toscano et al., 2014).

Through these kinds of considerations, it becomes clear that, at present, it is problematic to apply automated theorem provers even to assess the simplicity of a proof in a technical sense that the ATPs can process. This is to say nothing about the cognitive complexity of a proof. Proof lengths, clause weights and other such measures are related to the difficulty of the cognitive task of understanding a proof, but neither alone nor together can they be equated with it. For this, we need a separate notion of cognitive complexity, one that takes into account particular aspects of human cognition, background knowledge and cultural context (Fabry & Pantsar, 2021; Pantsar, 2021a). This is the case whether we focus on traditional measures of computational complexity or on notions such as descriptive complexity (Pantsar, 2021b). As argued in those papers, computational complexity measures are rarely (if ever) directly applicable to studying the complexity of cognitive tasks and processes.

Based on the above considerations, it is clear that the current generation of theorem proving AI applications lacks the means to distinguish between interesting and uninteresting proofs. Some minimal progress in terms of different understandings of simplicity has been made, but when it comes to having useful tools for discriminating between proofs in terms of how humanly interesting they are, the advances are negligible. In this respect, it is also important to note that simplicity is only one factor by which proofs are assessed by human mathematicians. Aside from simplicity, some notion of “insightfulness” is also likely to be present in assessing proofs (see, e.g., Macbeth, 2012; Weber, 2010). Another often mentioned property of mathematical proofs is their beauty. This topic has been discussed by philosophers in different ways (see, e.g., Johnson & Steinerberger, 2019; Rota, 1997). Recently, it has also been studied by neuroscientists, and the experience of mathematical beauty appears to be associated with activity in the medial orbito-frontal cortex similar to that involved in other experiences of beauty (see, e.g., Zeki et al., 2014 for an experiment on the beauty of mathematical equations). Clearly such experiences related to mathematical proofs are not present in any way in the current generation of automated theorem proving software.

But even if we were able to assess proofs in any such way, it would not help us with the question of how the theorems themselves are assessed by human mathematicians. Why is one theorem considered important and another trivial? As with proofs, considerations of insightfulness and beauty can be relevant to this topic. But equally importantly, mathematical theorems get their importance from their place within mathematical theories and the mathematical community. Evaluating the importance of a mathematical theorem cannot be reduced to the mathematical properties of the particular theorem. Instead, the importance of a theorem is tightly connected to its place in the historical development of mathematics. The importance of Fermat’s Last Theorem, for example, cannot be discussed without including its status in the mathematical community over centuries. Thus, all considerations of mathematical importance – including insightfulness and beauty – need to be located in the context of wider mathematical practice.

In addition, one important factor in assessing the value of mathematical theorems is their applicability both within mathematics and more widely in science (see, e.g., Lange, 2017). These, and many other questions concerning human mathematical practices, are actively studied by philosophers of mathematics (for an introduction, see Mancosu, 2008). Here it is not possible to go further into the details, but the problem should be clear by now. Human mathematicians associate proofs and theorems with a wide variety of valenced assessments. So far, automated theorem provers can only be included in such assessments in very rudimentary ways. For the big questions, i.e., why some theorem is considered important or interesting, or why some proof is considered more elegant than another, their current contribution is negligible. When it comes to constructing artificial mathematical intelligence, the present automated theorem provers have little to contribute aside from their role as one tool at the disposal of the modern mathematician.

5 Artificial neural networks

The problems identified in the previous section relate to the present generation of automated theorem provers, which are rule-based systems. These systems function based on pre-set rules that the software follows to compute an output for a given input. This is what computer programs have traditionally been like: what they can do is constrained by the rules they were programmed to have. In such an approach to automated theorem proving, the question of distinguishing between different types of proofs and theorems is thus constrained by the kind of rules that are programmable. As we have seen, the length of an ATP proof can be one distinguishing factor in a rule-based system. However, factors like insightfulness, beauty and applicability would seem to be hopelessly ambiguous as candidates for inclusion in rule-based systems. Certainly, there can be some programmable rules concerning such factors. For example, proofs by exhaustion (i.e., by the brute force method of verification case by case) are generally not considered elegant by human mathematicians. However, it does not seem feasible to capture notions like insightfulness or beauty by such rules. And if we don’t have a good understanding of those notions in the first place, how can we realistically aim to program a computer to model them? This is to say nothing of the technical challenges involved in formalizing such notions into programmable form even if we did have a good grasp of them.

In this respect, however, machine learning with (deep) artificial neural networks (ANN) can provide a different approach. In this type of artificial neural network (from here on just “neural network”), the computer learns to extract patterns from its input without being provided rules for it. ANN machine learning has made important advances in recent times and it has been applied highly successfully in fields like playing games (such as Go and chess), image recognition, natural language processing, and translation (for overview, see, e.g., Mitchell, 2019). Recently, there have also been advances in machine learning applications in mathematics.

Traditionally, artificial neural networks have struggled with symbolic mathematics, but Lample and Charton (2019) have presented interesting data on a neural network that solved symbolic mathematical problems. The network was fed mathematical formulas (about 200 million of them) in a tree format and was given problems to solve (integration and differential equations; types of problems that do not have simple general-purpose solution algorithms). It performed well (and quickly, at less than a second per problem) on 5000 test equations, giving correct solutions to the vast majority of the problems. In integration tasks the success rate was almost 100%, in differential equations slightly less. Remarkably, for integration problems, it outperformed the standard commercial package Mathematica (Lample & Charton, 2019).
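To give an idea of what feeding formulas “in a tree format” involves, the sketch below serializes an expression tree into a prefix (Polish) notation token sequence that a sequence-to-sequence network can consume, which is the general strategy Lample and Charton describe. The token names and the helper function are my own illustration:

```python
# Sketch of the tree input format: a formula such as x**2 * sin(x) is an
# operator tree, serialized in prefix (Polish) notation into a flat token
# sequence readable by a sequence-to-sequence network.
expr = ("mul", ("pow", "x", "2"), ("sin", "x"))  # x**2 * sin(x)

def to_prefix(node) -> list:
    if isinstance(node, str):   # leaf: a variable or constant
        return [node]
    op, *children = node        # internal node: operator plus subtrees
    tokens = [op]
    for child in children:
        tokens.extend(to_prefix(child))
    return tokens

print(to_prefix(expr))  # ['mul', 'pow', 'x', '2', 'sin', 'x']
```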

Such early results give hope that perhaps in the future, an ANN-based theorem prover can help human mathematicians in ways that automated theorem provers today cannot. Unlike a traditional rule-based theorem prover, such a neural network would have learned mathematics. If it were able to do that in a human-like fashion, it is possible that it could develop some sort of human-like ability to determine which theorems and proofs are interesting and which are not. For example, the AI could prove a million theorems in an axiomatic system and then rank them into categories in terms of their elegance, how interesting they are, and so on. If the AI is trained with the kind of theorems humans find interesting, it could develop an ability to recognize interesting theorems among new theorems as well. The great advantage compared to the current generation of theorem provers would be that there would be no need to specify rules for what makes proofs and theorems interesting. Instead, these notions could remain implicit if the AI is able to detect a pattern in the training data. Presently this is of course in the realm of science fiction, but with the growth in AI development, such scenarios don’t seem impossible.

Some reason for optimism is already given by the important progress that has been made in machine learning applications for interactive theorem provers. One of the key problems in this field is called premise selection. This refers to the problem of finding mathematical statements that are relevant for proving a particular conjecture (Wang et al., 2017). Progress has been reported in machine learning applications to the premise selection task using the Mizar library of formalized mathematics (Alemi et al., 2017; Wang et al., 2017). These approaches have led to progress in the pre-selection of premises both for the first-order automated theorem prover E (Schulz, 2002) and for higher-order logic theorem proving (Bansal et al., 2019). The general idea in these approaches is that machine learning limits the number of premises that are then used by rule-based automated theorem proving software. This is far from developing autonomous automated theorem provers, but it provides an important new research direction for interactive theorem proving. Under this approach, machine learning systems may not necessarily directly lead to proofs of theorems, but they could guide human mathematical intuitions and as such work as a new form of interactive theorem prover (Davies et al., 2021). Indeed, machine learning applications have already been used to find counterexamples to open conjectures (Wagner, 2021).
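The role machine learning plays here can be pictured as a ranking step placed in front of the prover: score every library statement against the conjecture and pass only the top-ranked premises on. The Python sketch below is a deliberately crude mock-up of that idea; embed() is a stand-in for a learned encoder of the kind trained on libraries like Mizar, and all names are hypothetical:

```python
# Premise selection as a ranking problem: score each library statement
# against the conjecture and keep only the top k for the rule-based prover.
# embed() is a mock stand-in for a learned neural encoder.
import hashlib
import math

def embed(statement: str, dim: int = 16) -> list:
    # Mock embedding: a deterministic pseudo-random vector per statement.
    digest = hashlib.sha256(statement.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def select_premises(conjecture: str, library: list, k: int = 2) -> list:
    c = embed(conjecture)
    ranked = sorted(library, key=lambda p: cosine(c, embed(p)), reverse=True)
    return ranked[:k]  # only these premises are handed to the prover

library = ["comm_add", "assoc_add", "zero_add", "succ_inj", "mul_comm"]
print(select_premises("add_comm_nat", library))
```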

It should be noted that all these developments are still in their early stages and their importance should not be overstated. As pointed out by Davis (2019), for example, there are several ways in which the ANN of Lample and Charton is not as impressive as it first seems. First of all, it can handle only a limited subset of the problems that, e.g., Mathematica is able to solve. For problems solvable, or made considerably easier, by simplification, for example, rule-based systems like Mathematica would beat their ANN. Second, at times the output of the ANN is not even a well-formed formula, which is something a rule-based system would never produce. Third, and perhaps most importantly, it cannot be considered a stand-alone system. As Davis says, “the construction of [the ANN of Lample and Charton] is entirely dependent on the pre-existing symbolic processors developed over the last 50 years by experts in symbolic mathematics” (Davis, 2019, p. 6).

Nevertheless, it is clear that an ANN approach to automated theorem proving offers potential advantages that are beyond the capacity of the current generation of rule-based theorem provers. One important reason for this is that, as mentioned above, some of the notions that philosophers have suggested as criteria for interesting proofs and theorems, such as insightfulness and beauty, are too ambiguous to be captured by explicit rules. Machine learning systems, however, could detect patterns that correspond (at least partially) to the human interpretation of such notions. We should not expect this process to be accurate right from the beginning, but another strength of machine learning systems is that they are fast. With trial and error in creating and adjusting datasets for training the AI, progress can be made even if such systems do not function perfectly. Given the success of machine learning systems in many other tasks during recent years, I believe that we must start considering their potential in mathematics, as well as its philosophical significance.

6 The black box problem

We have seen that the current state of the art is far from providing feasible applications of autonomous automated theorem proving, let alone discriminating between interesting and trivial proofs and theorems. But in a philosophical discussion of automated theorem proving and artificial neural networks, we should not get stuck on the technical problems involved in the present generation of applications. However, there is one general difficulty concerning deep neural networks that we need to be concerned about, namely the “black box” problem (Russell & Norvig, 2020, Sect. 19.9.4). We do not have a clear idea what kind of explanations artificial neural network models provide, even when they are highly predictive (see, e.g., Kay, 2018). Deep neural networks learn, but often the only data we get from them is behavioral, i.e., concerning their output. How can we know that a theorem proving neural network has followed rules of proof correctly? Indeed, how could we make the neural network report on its reasoning? This question is closely related to the general problem in the philosophy of science of the epistemic opacity involved in using computational methods in scientific explanations (Durán & Formanek, 2018; Humphreys, 2009).

To tackle the opacity involved in ANN mathematical processing, Lample and Charton (2019) suggest that we could trace the network’s reasoning by making small adjustments to datasets and observing differences in behavior. However, in a scenario in which an artificial intelligence is trained to prove interesting new mathematical theorems, the effects could be exceedingly complex to determine. This general problem is well known in AI research, and the emerging area of explainable AI (XAI) aims to find solutions to the black box problem of machine learning (see, e.g., Doran et al., 2017; Holzinger, 2018; for an analysis of how this line of research connects to philosophy, see Thompson, 2021). Many researchers, however, are skeptical of the possibility of explainable AI in deep neural networks due to the sheer complexity of the millions or even billions of terms in the equations involved in their processing. According to the skeptics, the best we can hope for is subjective interpretation of the neural network, not proper explanation (see, e.g., Landgrebe, 2022).

However, here theorem proving could potentially have different practical characteristics compared to other types of problem solving, such as the cases reported by Lample and Charton. While with differential equations we might be content to simply get a solution, with new theorems mathematicians would expect something more. Instead of just presenting some formula as a theorem, we would ultimately expect the AI to provide some kind of humanly accessible proof of the theorem as well. Therefore, the opacity of the artificial neural network would not prevent humans from evaluating the proof in a sufficient manner, even if the processing of the ANN remained opaque. While the opacity of the ANN would remain a problem, assessing the proofs would potentially give human mathematicians more information about the reliability of the ANN in the case of theorem proving.

It is important to recognize that AI theorem proving of the type we are discussing would not happen in a computer cocoon; rather, it would become part of the activity of the mathematical community. Thus AI-generated proofs could be inspected by human mathematicians, which could be a way to get around the black box problem. While the processing of the theorem proving software would remain a black box, the proof itself would be accessible to the scrutiny of mathematical peers – whether human or artificial. It is conceivable that in such a scenario the AI proof could be accepted along the same standards as human proofs. After all, in mathematics we are currently not concerned about the lack of knowledge about the human cognitive processes involved in proving theorems. What we are interested in are the theorems, their proofs, and partly informal expositions of them. Rather than transparency of the processing of the AI, presenting understandable proofs could be a more realistic aim for developing theorem proving artificial intelligence. This is the idea that I want to develop in the rest of this paper.

However, the above considerations prompt the question of just how human-like the mathematical ability of the AI would be. Certainly, the learning process of the Lample and Charton design, at least, is very much un-human-like. No human mathematician would go through a training set of millions of formulas. Rather, the way humans learn is to use relatively few formulas – in addition to a lot of informal material – to capture the essential content involved in processes like integration and differentiation. Given the vast differences in the respective learning processes, it is reasonable to argue that the way deep neural networks learn things is at the very least a problematic fit with actual human learning.

This is important when we consider the problem of applying machine learning methods to proving theorems that mathematicians find interesting. While it is possible to generate training material of hundreds of millions of differential equations, the corpus of theorems that mathematicians find interesting is much smaller. It is thus questionable whether there could be a sufficiently large training set for an AI to learn to distinguish interesting theorems (or interesting proofs). Of course, some aspects of interesting theorems could be captured easily; for example, equations with identical left and right sides could feasibly be deemed uninteresting, as the sketch below illustrates. But overall, the phenomenon of interesting mathematics is likely to be so complex that the datasets would not be sufficient to detect the relevant patterns.Footnote 7
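A toy version of such a cheap filter (my own illustration, not an existing system) might look as follows:

```python
# A trivial syntactic filter of the kind mentioned above: an equation whose
# two sides are identical strings is flagged as uninteresting before any
# deeper, learned assessment is attempted.
def trivially_uninteresting(equation: str) -> bool:
    lhs, rhs = (side.strip() for side in equation.split("="))
    return lhs == rhs

print(trivially_uninteresting("x + y = x + y"))  # True: identical sides
print(trivially_uninteresting("x + y = y + x"))  # False: needs assessment
```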

Thus, the neural network AATP approach involves two difficult problems. First, the way the ANN learns is un-human-like, which may limit its applicability in recognizing humanly interesting mathematics. Second, the ANN would also be un-ATP-like, in the sense that we could not explain its processing. The great strength of the current generation of rule-based ATPs is that we can trust them to do their (limited) job correctly. With a machine learning ATP, we could no longer count on that.

7 Evolution of mathematical practice

How could ANN automated theorem provers be integrated into the mathematical practice of theorem proving? One possibility that we have already seen in the previous section is to include them in hybrid approaches that combine the use of machine learning with traditional rule-based systems. In this kind of approach, a neural network could come up with conjectures and relevant premises, and a rule-based system would prove them, as the sketch below outlines. This could potentially be a way around the black box problem, because we can trust the logical inferences conducted by the rule-based system.
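In outline, such a hybrid system might look like the following sketch, in which a mock generator stands in for the neural component and sympy’s symbolic simplifier stands in for a genuine rule-based prover; all names here are illustrative assumptions, not an existing system:

```python
# Schematic hybrid pipeline: a (mock) learned component proposes candidate
# conjectures, and a trusted symbolic component tries to verify them.
# sympy's simplifier stands in for a real rule-based theorem prover.
import sympy as sp

def propose_conjectures():
    # Stand-in for a neural generator emitting candidate identities.
    yield ("x + y", "y + x")
    yield ("x * 1", "x")
    yield ("x + 1", "x")  # a false conjecture, which should be rejected

def rule_based_prove(lhs: str, rhs: str) -> bool:
    x, y = sp.symbols("x y")
    env = {"x": x, "y": y}
    diff = sp.sympify(lhs, locals=env) - sp.sympify(rhs, locals=env)
    return sp.simplify(diff) == 0  # a valid identity iff the difference is 0

for lhs, rhs in propose_conjectures():
    verdict = "proved" if rule_based_prove(lhs, rhs) else "rejected"
    print(f"{lhs} = {rhs}: {verdict}")
```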

Yet that kind of hybrid approach would not get us any closer to distinguishing between interesting and trivial theorems and proofs. The familiar question would remain in a new form, namely, whether an ANN could feasibly assess what is interesting and what is trivial as part of a hybrid system. In the previous section we saw that an ANN autonomous automated theorem prover would learn mathematics in a very un-human-like way. But perhaps it could follow its own rationality and establish criteria for what is interesting and what is trivial. These may or may not coincide with human assessments, but they might be a way forward in limiting the number of theorems and proofs that an AATP gives as output. If humans ultimately evaluate the output, such limitations would be crucial.Footnote 8

But perhaps there could be an altogether different approach for a neural network to acquire human-like mathematical ability. While the ANN of Lample and Charton learned to solve problems by being fed exactly those types of problems as input, could an ANN learn mathematics in a more bottom-up way, mirroring the way humans learn mathematics? Indeed, early advances in such an approach have already been made. In experiments reported by Stoianov and Zorzi (2012) and Testolin et al. (2020), an ANN learned numerosity discrimination from visual stimuli similarly to young children (Halberda & Feigenson, 2008; Piazza et al., 2010). This approach has also been extended to counting (Di Nuovo & McClelland, 2019; Fang et al., 2018; Pantsar, 2023). So far, these abilities are very basic and even small integer addition is beyond the reach of the ANNs, but this kind of approach could lead to AIs learning mathematics in a more human-like manner. In that kind of learning process supervised by humans, the AI could also develop an ability to detect what kind of mathematics is interesting to humans.

But what would be the status of such an AI, or of any AI that has developed autonomous theorem proving abilities? If they were used as part of interactive theorem proving, it is feasible that they could enter theorem proving practice. Most likely this would not be without controversy, as we have seen in the history of introducing computer-assisted proofs into mathematics. From the four-color theorem (Appel & Haken, 1976) to the Kepler conjecture (Hales et al., 2017), there has been a significant change in how computer-assisted proofs in mathematics are seen. While the computer-assisted proof of the four-color theorem was criticized, among other things, for potential undetected errors (Tymoczko, 1979), modern proofs like that of the Kepler conjecture seem to be accepted more readily.

Indeed, this seems like a reasonable approach. While computer-assisted proofs may not be fully checkable, that is also the case with many human proofs. As proofs become longer and more complex, it becomes increasingly difficult for humans to check them in an error-free way. Thus, increasing the role of ATPs in mathematical practice, whether for proof-checking or proof-assisting, seems to be a justifiable future direction in mathematics. Ultimately, this could also include autonomous proving of theorems by ATPs, as well as autonomous writing of mathematical papers presenting the proofs.

8 The transforming epistemic role of computer tools in theorem proving

In the final scenario presented in the previous section, AI systems can autonomously prove theorems and write papers. Clearly this kind of epistemic role of computers would be vastly different from the present-day practice. But what exactly would that role be and how should we deal with it? It is good to remember that the initial reaction among philosophers to the introduction of computer-assisted proofs in mathematics was largely skeptical. The likes of Kripke (1980) and Tymoczko (1979) argued that the use of computers in theorem proving makes mathematics partly empirical.

This topic is important for the question of whether mathematical knowledge reached in this way can still be considered a priori. To address such potential threats, Burge (1998) has convincingly argued that a proper kind of assimilation of computer use in theorem proving need not be essentially different from the kind of appeal to other mathematicians that we are happy to accept in knowledge-ascriptions. There are always a posteriori aspects related to mathematics – most trivially, we need to see symbols, etc. – but we may come to trust computers in similar ways to those in which we trust human mathematicians. After all, no mathematician’s knowledge of mathematics is based on understanding all the proofs of known theorems. Sometimes our mathematical knowledge is simply based on trusting other people’s competence. And as Burge rightfully asks, how is this essentially different from trusting the competence of a computer?Footnote 9

The question Burge dealt with was whether computer-assisted mathematical proofs produce a priori knowledge. However, similar considerations are applicable to the more fundamental question of whether computer-assisted mathematical proofs produce knowledge in the first place, whether purely a priori or including empirical aspects. Nowadays, few mathematicians question computer-assisted proofs like that of the four-color theorem. By and large, mathematicians seem to accept that we can trust computers, even though we cannot completely discount the possibility of errors in their functioning. This seems sensible in light of Burge’s analysis. For the most part, we accept that Andrew Wiles’ proof of Fermat’s Last Theorem, for example, is valid because we trust the relevant parts of the mathematical community.

However, the possibility of machine learning systems in theorem proving requires us to reassess the question whether computer-assisted proofs can produce knowledge. The kind of computer-assisted proofs that Burge discussed are firmly within the realm of rule-based systems, which were the only game in town back in 1998. Mathematics applying classical ATPs may be empirical only in the way that mathematics among humans is empirical (i.e., empirical aspects are present, but not in the justification of mathematical results). But could this be different for neural network AATPs? How do we come to trust such machine learning systems, and accept their results as mathematical knowledge?

It is instructive to approach this question, too, by comparing human mathematics with computer-assisted mathematics. Therefore, the first question to ask is how we come to accept human-proved theorems as mathematical knowledge. This is a complex social phenomenon in which a variety of different factors can doubtless be identified. Reputation, for example, matters, as do extra-mathematical aspects like language skills. But hopefully we can assume that an important part of the acceptance procedure is the reviewing process, in which the mathematical content is assessed by competent mathematicians. Ultimately, in this part of mathematical activity, reviewers are expected to focus primarily on the mathematical content. What they are not expected to focus on are the cognitive processes involved in coming up with the proof.

It is important to note that even though the process of accepting a mathematical theorem into the canon of mathematical knowledge is thus focused primarily on the theorem and its proof, there are many important factors involved implicitly. For example, we typically trust that we share some sense of rationality with other humans. Among mathematicians, there is probably a heightened sense of this. For such reasons, it is not insignificant that we believe that a paper we are reviewing was written by a human. We have come to accept – by and large – that the human practice of mathematics can include the application of computers, including for providing parts of proofs that would not otherwise be possible. But this is still a different matter from accepting a proof that is entirely the product of a computer as mathematical knowledge.

However, I believe that this is mainly due to the unfamiliarity that this kind of technology evokes in us. Most importantly, of course, at present the technology does not exist. But given the problems that are associated with machine learning systems in other fields of AI, we can assume that similar considerations would arise also when (or if) AATPs were available. In this section, I want to prepare for that eventuality. More specifically, I want to discuss the scenario in which AATPs are not only available, but they are also trusted implicitly by humans. By this I don’t mean that the AATP-generated proofs are accepted without scrutiny. On the contrary, I mean that they are trusted in the sense that human mathematicians are trusted, i.e., they can make errors but most of the time they function correctly.

Hence, I want to propose here a similar approach for the epistemic evaluation of machine learning AATPs as Burge (1998) proposed for rule-based computer-assisted proofs in mathematics. We should compare the scenario of accepting AATP-generated proofs to the mathematical practice of accepting human-generated proofs. Related approaches have been presented previously in the debate about computer-assisted proofs in mathematics. Detlefsen and Luker (1980), for example, argued against Tymoczko’s (1979) skeptical view about computer-assisted proofs by remarking that humanly checked proofs are never absolutely certain, either. While this is often – including by Tymoczko, Detlefsen and Luker – understood as bringing a probabilistic, empirical element to mathematical proofs, I agree with McEvoy (2013) that this is mistaken. Rather than the proofs, it is our ability to recognize a genuine proof that is probabilistic.

Indeed, as argued by Hales (2008), there is good reason to think that ATP software (in his examples HOL Light and Coq) provides a particularly reliable way of checking – and thus recognizing – proofs. Such systems have a small “logical kernel” whose soundness can be reliably established. If anything, when it comes to classic ATP software, computer-assisted proofs can generally be established with more certainty than “computer-free” proofs, simply due to the likelihood of human error as proofs become longer. However, this clearly changes with AATPs based on machine learning. Such systems do not have a small logical kernel for their functioning, or if they do, we cannot determine what it is. Hence neural network AATPs face many questions about lack of certainty similar to those that computer-assisted proofs faced in the 1970s and 80s. But just as those questions turn out, under proper analysis, to be about recognizing proofs, I contend that the corresponding AATP-specific questions do so as well. We should not disregard the possibility that an AATP-proven putative theorem turns out to be false, and in machine learning systems this possibility could indeed be greater than with classical ATPs. However, rather than being an argument against the use of machine learning systems for theorem proving, this is better understood as an invitation to develop accurate methods of checking proofs presented by such systems.

This question is tightly connected to the kind of output that neural network AATPs are designed to produce. At one extreme is a minimal output that simply states that a particular string of symbols is a theorem of a particular mathematical system. At the other extreme is a full step-by-step proof of the theorem. In the former case, checking the proof would require constructing it, which would make the use of the AATP minimally informative. In the latter case, the proof could be checked by a classic rule-based ATP, in which case we could reach a maximally high standard for checking the validity of the proof.

In practice, it is likely that the output of the neural network would fall between those two extremes. Since it would be trained with existing proofs of theorems, the output would most likely (in the realistic best-case scenario) bear a resemblance to human proofs. If this were indeed the form of the output, then the question of checking the validity of proofs becomes more intricate. The proofs would have gaps in their logical steps, just like human proofs, which would pose challenges for the proof-checking process. However, in such a scenario, we would already have a system of mathematical practice ready. If the output of proofs is similar to humanly produced proofs, it seems reasonable that we would put the proofs under similar scrutiny to humanly produced proofs. This may involve using computers, or it may be purely human proof-checking. But the important point is that in such a scenario we would not discriminate between the proofs based on whether they are human-produced, hybrid-produced, or AATP-produced. They would all face the same level of scrutiny by the mathematical community. Let us call this the community approach to AI-generated mathematics.

The idea of the community approach is that AI systems, including possible AATPs, are assimilated into the mathematical community. This may happen in different ways. One way would be to make the application of an AI system explicit at all stages of practice. This would mean, for example, that proofs for which an AI application has been used are explicitly reported as such and the role of the AI specified. However, it could also mean that an AATP writes a paper independently and the paper is reviewed without the referees being aware that it is AI-written. In this latter scenario, the community approach to AI-written papers would not differ in any way from that to human-written papers. Ultimately, in this scenario artificial agents would be accepted as part of the mathematical community.

9 Authorship and accountability

The above scenario of AATPs also presents potential problems, one of which concerns authorship. In the scenario, we should expect there to be papers with AIs both as co-authors and as sole authors. But how could this work in practice? Could an AI actually be listed as an author? If not, why not? Regulations for co-authorship including AI agents would need to be established, but how could we make sure that AIs are credited according to those regulations? Indeed, a particularly problematic scenario is one in which an AATP creates a proof but a human mathematician takes credit for it. These types of problems have been widely reported in chess after chess-playing AI systems surpassed the level of human players. We should expect similar problems with regard to AI-generated mathematics. Mathematical practice is also about careers and the associated competition, which would become unfair if some mathematicians were using (uncredited) AI applications for their mathematical work.

In addition to authorship, the question of accountability also needs to be discussed. How can we trust AI-generated mathematics and who is accountable if problems emerge? Every AI application, whether a commercial product or open source software, is produced by some group of people. In the scenario in which AATPs provide proofs of theorems, part of establishing trust in the AI system is establishing trust in its developers and users. As pointed out by, e.g., von Eschenbach (2021), AI is always situated within a socio-technical system that includes multiple groups of people, including AI designers but also administrators, legislators, marketers and many others, all the way to the end users of the software. In order to trust an AI application, von Eschenbach argues, we need to justify our trust in that socio-technical system. This kind of ethics-based approach may seem more relevant for AI applications in, for example, medicine, but it is also relevant for theorem proving AI. Mathematical theorems play an important role in technological and other scientific applications and establishing trust in them is crucial.

This question becomes particularly pertinent in a scenario in which an AATP-type AI is accepted as an independent mathematical agent, including as a potential author of articles. In such cases it is important to establish where the authorship, and hence also the accountability, of the AATPs lies. Currently such questions may seem like science fiction, but they might well become important in the future. The future of mathematics is likely to become increasingly open to human-computer collaborations in which the human contribution gradually changes and, in terms of carrying out formal proof procedures, decreases. If the best way to achieve progress in mathematics is by increasing the amount of ATP use, this is likely to happen. In such a scenario, the human role in mathematical practice will change, and being able to apply ATPs may become a central skill. Among other things, this may force us to reconsider what it is to be a mathematician, and what higher mathematics education should be like. But it also raises important questions concerning the accountability involved in mathematics.