Introduction

The ACM 2013 computer science curriculum lists the ability to construct formal proofs as one of the learning outcomes of a basic logic course (Association for Computing Machinery (ACM) and IEEE Computer Society Joint Task Force on Computing Curricula 2013). The three main formal deductive systems are Hilbert systems, sequent calculus, and natural deduction. Natural deduction is probably the most popular system, but classical textbooks on mathematical logic usually also discuss Hilbert systems (Kelly 1997; Mendelson 2015; Enderton 2001). Hilbert systems are part of the necessary foundation for introducing the logics (temporal, Hoare, unity, fixpoint, and description logic) used in teaching various fields of computer science (Varga and Várterész 2006), and are treated in several textbooks on logic for computer science (Ben-Ari 2012; Nievergelt 2002; Arun 2002; van Benthem 2003). Hilbert systems are also taught in mathematics and logic programs (Leary and Kristiansen 2015; Goldrei 2005).

Students have problems with constructing formal proofs. An analysis of the high number of drop-outs in logic classes during a period of eight years shows that many students give up when formal proofs are introduced (Galafassi 2012; Galafassi et al. 2015). Our own experience also shows that students have difficulties with formal proofs. We analyzed the homework handed in by 65 students who participated in the course “Logic and Computer Science” during the academic years 2014-2015 and 2015-2016. Of these students, 22 had to redo their homework exercise on axiomatic proofs. This is significantly higher than, for example, the number of students in the same group who had to redo the exercise on semantic tableaux: 5 out of 65.

A student practices axiomatic proofs by solving exercises. Since it is not always possible to have a human tutor available, an intelligent tutoring system (ITS) might be of help. There are several ITSs supporting exercises on natural deduction systems (Sieg 2007; Perkins 2007; Broda et al. 2006). In these ITSs, students construct proofs and get hints and feedback. We found two e-learning tools that can be used by a student to practice the construction of axiomatic proofs: Metamath Solitaire (Megill 2007) and Gateway to logic (Gottschall 2012). Both tools are proof-editors: a student chooses an applicable rule and the system applies this rule automatically. These systems provide no help on how to construct a proof.

In this paper we describe logax, a new tool that helps students in constructing Hilbert-style axiomatic proofs. logax provides feedback, hints at different levels, next steps, and complete solutions. logax is part of a suite of tools assisting students in studying logic, such as a tool to practice rewriting formulae in disjunctive or conjunctive normal form, and to prove an equivalence using standard equivalences (Lodder et al. 2016; 2019).

logax is an example of an ITS that gives hints and feedback to students solving tasks that can be solved in many different ways. Other domains with similar characteristics are proof systems such as natural deduction and the sequent calculus, but also proving geometry theorems (Matsuda and VanLehn 2004), and constructing a program that satisfies some given properties. Developing an ITS that gives hints for these domains is notably difficult, because not all possible solutions can be calculated upfront or algorithmically, as is possible for almost all kinds of tasks in, for example, algebra (Heeren and Jeuring 2014). If a student takes a step in the current solution space, logax can provide feedback and hints. If a student takes a step outside the current solution space, logax dynamically recalculates the solution space, taking the student step as a starting point. It then uses the new solution space as the source for hints and feedback. The dynamic approach and the algorithm to recalculate the solution space are central to our solution and make it possible to always give feedback and hints to a student. Similar techniques would be useful for ITSs for the other domains mentioned above.

The main contributions of this paper are:

  • an example of a tutoring system giving feedback and hints for a domain for which feedback and hints cannot be specified algorithmically upfront

  • an algorithm for generating axiomatic proofs and dynamically extending partial proofs

  • an extension of this algorithm to incorporate lemmas

  • generating hints and feedback based on this algorithm and studying the effect of these in small-scale experiments

To determine the quality of the proofs generated by logax, we compare the proofs generated by the tool with expert proofs and student solutions. We use the set of homework exercises mentioned above to collect common mistakes, which we have added as buggy rules (rules to provide informative feedback) to logax.

This paper is organized as follows. Section “Teaching Hilbert-Style Axiomatic Proofs” describes Hilbert’s axiom system and the way it is introduced in textbooks and Section “An E-Learning Tool for Hilbert-Style Axiomatic Proofs” explains the interface of our e-learning tool logax. Section “An Algorithm for Generating Proof Graphs” introduces the algorithm to generate proofs automatically. Section “Distilling Proofs for Students” explains how we linearize these generated proofs and Section “Lemmas” how we add the possibility to use lemmas. Section “Hints and Feedback” explains how we use the generated proofs for providing hints. This section also describes how we collect a set of buggy rules. Section “Evaluation of the Generated Proofs” and Section “Small-Scale Experiments with Students” discuss the results of several evaluations of our work. We relate our work to existing approaches of generating solutions and hints in Section “Related Work”. Section “Conclusion and Future Work” concludes and presents ideas for future work.

Teaching Hilbert-Style Axiomatic Proofs

We start with a short description of Hilbert-style axiomatic proofs and the way they are introduced in different textbooks. Axiomatic proof systems come in several variants. The most common axiom systems are

$$ \begin{array}{ll} {\phi\to (\psi\to \phi)} & \text{Axiom a} \\ {(\phi\to (\psi\to \chi))\to ((\phi\to \psi)\to (\phi\to \chi))} & \text{Axiom b} \\ {(\neg \phi\to \neg \psi)\to (\psi\to \phi)} & \text{Axiom c} \end{array} $$

used for example in Ben-Ari (2012), Nievergelt (2002), van Benthem (2003), Goldrei (2005), and Kelly (1997), and the system consisting of Axiom a and b, but Axiom c’ instead of Axiom c:

$$ \begin{array}{ll} {(\neg \phi\to \neg \psi)\to ((\neg \phi\to \psi)\to \phi)} & \qquad\text{Axiom c'} \end{array} $$

used for example in Hirst and Hirst (2015), Arun (2002), Wasilewska (2018), and Mendelson (2015). These axioms are schemas that can be instantiated by replacing the metavariables ϕ, ψ and χ by concrete formulae. A proof consists of a list of statements of the form Σ ⊢ ϕ, where Σ is a set of formulae (assumptions) and ϕ is the formula that is derived from Σ. In a ‘pure’ axiomatic proof, each line is either an instantiation of an axiom, an assumption, or an application of the Modus Ponens (MP) rule:

$$ \text{ if } {\varSigma\vdash\phi} \text{ and } {\varDelta\vdash\phi\to \psi} \text{ then } {\varSigma\cup\varDelta\vdash\psi} $$

From these axioms and MP, the deduction theorem can be derived:

$$ \text{ if } {\varSigma,\phi\vdash\psi} \text{ then } {\varSigma\vdash\phi\to \psi} $$
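To make these rules concrete, the following minimal Python sketch represents a statement Σ ⊢ ϕ as a pair of an assumption set and a formula, and implements the three axiom schemas, Modus Ponens, and the deduction theorem as functions. All class and function names are our own illustrative choices and are not taken from logax. As a check, the sketch derives ⊢ p → p twice: once in the ‘pure’ style from Axioms a and b, and once with the deduction theorem.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Not:
    arg: object

@dataclass(frozen=True)
class Imp:
    left: object
    right: object

# A statement "Sigma |- phi" is modelled as (frozenset of assumptions, formula).
def axiom_a(phi, psi):
    return (frozenset(), Imp(phi, Imp(psi, phi)))

def axiom_b(phi, psi, chi):
    return (frozenset(),
            Imp(Imp(phi, Imp(psi, chi)), Imp(Imp(phi, psi), Imp(phi, chi))))

def axiom_c(phi, psi):
    return (frozenset(), Imp(Imp(Not(phi), Not(psi)), Imp(psi, phi)))

def assumption(phi):
    return (frozenset([phi]), phi)

def modus_ponens(minor, major):
    """From Sigma |- phi and Delta |- phi -> psi, build Sigma u Delta |- psi."""
    (sigma, phi), (delta, implication) = minor, major
    if not (isinstance(implication, Imp) and implication.left == phi):
        raise ValueError("Modus Ponens does not apply")
    return (sigma | delta, implication.right)

def deduction(statement, phi):
    """From Sigma, phi |- psi, build Sigma |- phi -> psi."""
    sigma, psi = statement
    return (sigma - {phi}, Imp(phi, psi))

p = Var("p")

# 'Pure' proof of |- p -> p with Axioms a and b and Modus Ponens.
step1 = axiom_b(p, Imp(p, p), p)
step2 = axiom_a(p, Imp(p, p))
step3 = modus_ponens(step2, step1)
step4 = axiom_a(p, p)
step5 = modus_ponens(step4, step3)

# The same statement with the deduction theorem: from p |- p to |- p -> p.
via_deduction = deduction(assumption(p), p)
```

Both `step5` and `via_deduction` evaluate to the statement with an empty assumption set and the formula p → p, which illustrates how much shorter a proof can become once the deduction theorem is available.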

The Open University of the Netherlands teaches axiomatic proofs in a bachelor course “Logic and computer science” and in a premaster program that prepares for admission to a master in computer science. The learning objective related to axiomatic proofs is:

  • students are able to construct simple axiomatic proofs.

The course lectures start with recognizing instances of the axioms, and proceed with simple proofs, providing strategies such as:

  • can you derive the last line of the proof by an application of the deduction theorem or Modus Ponens?

  • how can you use the assumptions?

The textbooks we studied (Ben-Ari 2012; Nievergelt 2002; van Benthem 2003; Goldrei 2005; Hirst and Hirst 2015; Arun 2002; Wasilewska 2018; Mendelson 2015) do not give explicit learning goals, except for Kelly (1997), which starts each chapter with chapter aims. The aims of the chapter on axiomatic proofs are amongst others: “When you have completed your study of this chapter you should have a clear understanding of the structure of formal axiomatic systems, and be able to construct formal proofs of theorems”. From the other textbooks we can deduce learning goals from the examples and exercises. The textbooks all start the chapter on axiomatic proofs by introducing the axioms, followed by some examples and exercises in which a student has to construct simple proofs or provide the motivation for given proof lines. Some of these proofs use earlier results such as lemmas or derived rules. Some books start with the first two axioms (Wasilewska 2018; Nievergelt 2002) and introduce the negation axiom (Axiom c or c’) after the deduction theorem; others introduce the deduction theorem after the three axioms. After the introduction of the deduction theorem, exercises using this theorem are presented. The exercises in these textbooks suggest that constructing proofs is a learning goal. The single exception is Wasilewska (2018): here most exercises only ask to motivate steps in an already constructed proof.

Hardly any textbook provides substantial information about how to construct a proof, apart from providing examples and showing the use of the deduction theorem. Wasilewska (2018) explicitly states that constructing a proof may start with searching for two statements such that the conclusion is an application of Modus Ponens on these statements, and Kelly (1997) explains how to use the deduction theorem and gives a heuristic to derive Σ ⊢ ψ → ϕ from Σ ⊢ ϕ. Constructing proofs requires knowledge of the syntax of propositional logic, and competencies in rewriting logical formulae. Therefore, most textbooks deal with rewriting formulae using standard equivalences (Goldrei 2005; Ben-Ari 2012; Wasilewska 2018; Arun 2002) or semantic tableaux (Kelly 1997; van Benthem 2003; Ben-Ari 2012), before the introduction of axiomatic proofs.

An E-Learning Tool for Hilbert-Style Axiomatic Proofs

The e-learning tool that we developed, logax, uses the set of axioms a, b and c described in Section “Teaching Hilbert-Style Axiomatic Proofs”, together with Modus Ponens and the deduction theorem. A proof in this system can be constructed in two directions. To take a step in a proof, a student can ask two questions:

  • How can I reach the conclusion?

  • How can I use the assumptions?

An answer to the first question might be: use the deduction theorem to reach the conclusion. This answer creates a new goal to be reached, and adds a backward step to the proof. An answer to the second question might be: introduce an instance of an axiom that can be used together with an assumption in an application of Modus Ponens. This adds one or more forward steps. Figure 1 shows an example of a partial proof, constructed in our tool logax. A full proof that completes this partial proof is:

$$ \begin{array}{l ll r} 1. && {{p}\vdash{p}} & \text{ Assumption } \\ 2. && {{p}\to {q}\vdash{p}\to {q}}&\text{ Assumption}\\ 3. && {{p},{p}\to {q}\vdash{q}} &\text{ Modus Ponens, 1, 2}\\ 4. && {{q}\to {r}\vdash{q}\to {r}}&\text{ Assumption}\\ 5. && {{p},{p}\to {q},{q}\to {r}\vdash{r}} &\text{ Modus Ponens, 3, 4}\\ 6. && {{p}\to {q},{q}\to {r}\vdash{p}\to {r}} &\text{ Deduction 5}\\ 7. && {{q}\to {r}\vdash({p}\to {q})\to ({p}\to {r})} &\text{ Deduction 6} \end{array} $$
Fig. 1 A partial proof of q → r ⊢ (p → q) → (p → r) performed in logax. On the right is the dialog box, in which a student can choose rules and fill in step numbers, and help buttons below this dialog box. On the left is the proof as presented by logax

Figure 1 illustrates most of the functionality of our e-learning tool logax. A student starts with choosing a new exercise from the list, or formulating her own exercise. She continues working in the dialog box to add new proof lines. Here she can first choose which rule to apply: an assumption, axiom, an application of Modus Ponens or deduction theorem, or a new goal. In case of an assumption she enters a formula, and in case of an axiom, logax asks for parameters to add the instantiation of the axiom to the proof. Figure 1 shows adding a Modus Ponens: a student has to fill in at least two of the three line numbers. logax performs a step automatically and adds a forward or backward step to the proof. In the same way, a student provides a line number to perform a backward application of the deduction theorem. If the deduction theorem is applied in a forward step, the student also provides a formula ϕ. The new goal option can be used to formulate a subgoal to be reached.
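The idea that two of the three lines of a Modus Ponens application determine the third can be sketched as follows. This is only an illustration under our own assumptions: the function name `complete_modus_ponens` and the tuple encoding of implications are hypothetical, assumption sets are ignored, and the code is not the logax implementation.

```python
def imp(a, b):
    # An implication a -> b is encoded as a tagged tuple; atoms are strings.
    return ("->", a, b)

def complete_modus_ponens(phi=None, implication=None, psi=None):
    """Given at least two of phi, phi -> psi and psi, return all three."""
    given = sum(x is not None for x in (phi, implication, psi))
    if given < 2:
        raise ValueError("at least two of the three lines are needed")
    if implication is None:
        # phi and psi are known: the missing line is the implication.
        implication = imp(phi, psi)
    if not (isinstance(implication, tuple) and len(implication) == 3
            and implication[0] == "->"):
        raise ValueError("the second line must be an implication")
    _, left, right = implication
    if phi is None:
        phi = left    # backward step: phi becomes a new line to be proven
    if psi is None:
        psi = right   # forward step: psi is derived
    if (phi, psi) != (left, right):
        raise ValueError("the lines do not fit Modus Ponens")
    return phi, implication, psi

forward = complete_modus_ponens(phi="p", implication=imp("p", "q"))
backward = complete_modus_ponens(implication=imp("p", "q"), psi="q")
```

In this sketch, supplying ϕ and ϕ → ψ corresponds to a forward step, and supplying ψ together with one of the other two lines corresponds to a backward step.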

If a student makes a mistake, e.g. she makes a syntax error in a formula, or tries to perform an impossible application of Modus Ponens, the tool provides immediate feedback. At any moment she can ask for a hint, next step, or a complete proof. The high number labelling the target statement (1000) is chosen deliberately, because at the start of the proof it is not yet clear how long the proof will be. After finishing the proof a student can ask the tool to renumber the complete proof.

The reason to use a dialog box in logax to add new proof lines is that a student can concentrate on proof construction. The design choice to allow a student to choose a rule and let the software perform the rule has successfully been applied in several e-learning tools for logic and mathematics (Mostafavi and Barnes 2016; Beeson 1998; Robson et al. 2012). For instance, Robson et al. (2012) state that their interface “allows students to concentrate on strategies while the software carries out procedures”. The use of the dialog box also implies that students can make fewer mistakes. By the time students start to develop proofs in logax, they have had extensive training in writing syntactically correct formulae. Hence, logax does not focus on writing correct formulae. However, a mistake with parentheses in a long formula, such as an instance of Axiom b, is easily made. By using the dialog box, students focus on proof construction and spend less time on correcting syntax errors. The only possible syntax errors students still can make occur in smaller formulae that need to be entered when adding, for example, an instance of an axiom to the proof. The evaluation in Section “Small-Scale Experiments with Students” shows that students make very few syntactical mistakes.

Our approach and design choices build upon scaffolding theories as described by Wood et al. (1976) and Belland (2017). Wood et al. (1976), referring to Bernshteı̌n (1967), mention reducing the degrees of freedom as one of the scaffolding functions. Reducing the number of steps that a student has to perform makes it possible to focus on the elements of the task that lead to learning gains (Belland 2017). The dialog box allows a student to concentrate on the steps that are closely related to the learning goal of logax. ‘Providing just the right amount of support’ is the second scaffolding element in Belland’s list. In an intelligent tutoring system scaffolding is often implemented as a sequence of hints that are increasingly supportive (Belland 2017). We have implemented this scaffolding strategy in our ITS. Since we do not employ a student model at this moment, we cannot apply fading strategies, which reduce the amount of feedback when the system thinks that the student does not need this. This is not necessarily a shortcoming: according to Belland (2017), leaving control of the support by the ITS to a student may lead to transfer of responsibility.

An Algorithm for Generating Proof Graphs

An ITS for axiomatic proofs provides hints and feedback. There are at least two ways to construct hints and feedback for a proof. First, they can be obtained from a complete proof. Such a proof can either be supplied by a teacher or an expert, or deduced from a set of student solutions. An example of an ITS for natural deduction proofs that uses student solutions has been developed by Mostafavi and Barnes (2016). A drawback of this approach is that the tool only recognizes solutions that are more or less equal to the stored proofs. The tool cannot provide hints when a student solution diverges from these stored proofs. Also, this only works for a fixed set of exercises. If a teacher wants to add a new exercise, she also has to provide solutions, and the tool cannot give hints for exercises that are defined by a student herself. The second way to provide the tool with solutions, which we use, is to create proofs automatically. At first sight this might only solve the second problem: automatically providing hints for new exercises. Section “Distilling Proofs for Students” explains how our approach makes it possible to provide hints also in case a student diverges from a model solution.

We develop an algorithm that automatically generates proofs. This algorithm should generate proofs that resemble textbook and expert proofs, which we can use to teach our students. Existing algorithms, such as the Kalmár constructive completeness proof (Kalmár 1935), or the algorithms used in automatic theorem proving (Harrison 2009), are unsuitable for this purpose, because the strategies used in these proofs differ too much from expert and textbook strategies. The Kalmár construction only provides proofs for tautologies. This is not necessarily a problem, since instead of, for example, proving q → r ⊢ (p → q) → (p → r), we can prove ⊢ (q → r) → ((p → q) → (p → r)). However, the Kalmár construction would start with eight proofs (¬)p, (¬)q, (¬)r ⊢ (q → r) → ((p → q) → (p → r)), one for each of the eight valuations of p, q and r, followed by a procedure to combine these eight proofs into a proof of ⊢ (q → r) → ((p → q) → (p → r)). The resulting proof will be considerably longer and more complicated than the proof we present in this article. The proofs generated by automatic theorem proving are also longer than, and different from, textbook or expert proofs. Natural deduction tools such as ProofLab (Sieg 2007) and Pandora (Broda et al. 2006) also use algorithms to calculate solutions, and these algorithms can provide useful hints and feedback. We adapt an existing algorithm for natural deduction to create axiomatic proofs. Before we describe the algorithm, we first explain how we represent proofs.

Figure 1 shows a partial example proof of q → r ⊢ (p → q) → (p → r). There are alternative ways to start this proof. A student may choose between various orders, for example swap line number 1 and line number 2. Using one or more axiom instances we may obtain entirely different proofs. Since we want to recognize different proofs, we represent proofs as labeled directed acyclic multigraphs (DAMs), where the vertices are statements Σ ⊢ ϕ and the edges connect dependent statements. We annotate vertices with the applied rule: Assumption, Axiom, Modus Ponens or Deduction. Note that a statement can be the result of different applications of rules. An example of such a DAM is shown in Fig. 2. Vertices are numbered for readability. A blue (dashed) arrow means that the lower statement follows from the higher by application of the deduction theorem. A pair of red (solid) arrows represents an application of Modus Ponens. This DAM contains three essentially different proofs: one that uses Axiom a and b, one that applies the deduction theorem and Axiom a, and one that uses no axioms and applies the deduction theorem twice. This last proof is a continuation of the proof provided in Fig. 1.

Fig. 2 A DAM for the proof of q → r ⊢ (p → q) → (p → r)
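A possible in-memory representation of such a DAM is sketched below. The class `ProofDAM` and its methods are hypothetical names chosen for this illustration. Each vertex is a statement, and each vertex keeps a list of motivations (rule plus premises), so that one statement can be the result of several rule applications. The example only encodes a small fragment of the graph: the target statement with one motivation by the deduction theorem and one by Modus Ponens.

```python
class ProofDAM:
    """Labeled directed acyclic multigraph of statements (illustrative sketch)."""

    def __init__(self):
        # statement -> list of (rule, premises); several entries are allowed
        self.motivations = {}

    def add(self, statement, rule, premises=()):
        # Premises become vertices too, possibly still without a motivation.
        for premise in premises:
            self.motivations.setdefault(premise, [])
        entry = (rule, tuple(premises))
        known = self.motivations.setdefault(statement, [])
        if entry not in known:
            known.append(entry)

    def edges(self):
        # One labeled edge from every premise to the statement it supports.
        return [(premise, statement, rule)
                for statement, entries in self.motivations.items()
                for rule, premises in entries
                for premise in premises]

def stmt(assumptions, formula):
    # Formulae are kept as plain strings in this sketch.
    return (frozenset(assumptions), formula)

dam = ProofDAM()
target = stmt(["q -> r"], "(p -> q) -> (p -> r)")
dam.add(stmt(["q -> r"], "q -> r"), "Assumption")
dam.add(target, "Deduction", [stmt(["q -> r", "p -> q"], "p -> r")])
dam.add(target, "Modus Ponens",
        [stmt(["q -> r"], "p -> (q -> r)"),
         stmt([], "(p -> (q -> r)) -> ((p -> q) -> (p -> r))")])
```

Storing a list of motivations per vertex, instead of a single parent rule, is what makes the structure a multigraph and allows several essentially different proofs to share vertices.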

The basis for our algorithm for axiomatic proofs is Bolotov’s algorithm for natural deduction proofs (Bolotov et al. 2005). The rules used in this system are presented in Fig. 3, restricted to the connectives ¬ and →, since these are the only connectives used in the Hilbert axiomatic system. Here we use the same notation as presented in van Benthem (2003). A natural deduction proof is here presented as a tree-like structure. The elimination rules (\(\neg _{elim}\) and \(\rightarrow _{elim}\)) express that you can extend a proof of ¬¬ϕ with ϕ and combine subproofs of ϕ → ψ and ϕ into a proof of ψ. The introduction rules discard assumptions: subproofs of ϕ and ¬ϕ can be combined in a proof of ¬ψ by an application of rule \(\neg _{intro}\) while discarding ψ. The last rule, \(\rightarrow _{intro}\), states that if you have a proof of ψ, you can add ϕ → ψ and discard ϕ.

Fig. 3 Rules for natural deduction

The natural deduction rules for implication translate directly to rules in the Hilbert system: \(\rightarrow _{ elim}\) corresponds to Modus Ponens and \(\rightarrow _{ intro}\) to the deduction theorem. The rules for negation do not have direct counterparts in the axiomatic system. Therefore, the first adaptation that we have to make to Bolotov’s algorithm is the use of axiomatic subproofs that mimic the natural deduction rules for negation. The \(\neg _{elim}\) rule is translated to a single subproof, and we use seven different subproofs to translate the \(\neg _{intro}\) rule, mainly to cover the possible different dependencies of ϕ and ¬ϕ on ψ.
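As an illustration of such a subproof, the following derivation shows one way to obtain Δ ⊢ ϕ from Δ ⊢ ¬¬ϕ using one instance of Axiom a, two instances of Axiom c, and Modus Ponens. It is a standard derivation, given here as a sketch; the subproof that logax uses may differ in detail.

$$ \begin{array}{l ll r}
1. && {\varDelta\vdash\neg\neg\phi} & \text{ Given } \\
2. && {\vdash\neg\neg\phi\to (\neg\neg\neg\neg\phi\to \neg\neg\phi)} & \text{ Axiom a } \\
3. && {\varDelta\vdash\neg\neg\neg\neg\phi\to \neg\neg\phi} & \text{ Modus Ponens, 1, 2 } \\
4. && {\vdash(\neg\neg\neg\neg\phi\to \neg\neg\phi)\to (\neg\phi\to \neg\neg\neg\phi)} & \text{ Axiom c } \\
5. && {\varDelta\vdash\neg\phi\to \neg\neg\neg\phi} & \text{ Modus Ponens, 3, 4 } \\
6. && {\vdash(\neg\phi\to \neg\neg\neg\phi)\to (\neg\neg\phi\to \phi)} & \text{ Axiom c } \\
7. && {\varDelta\vdash\neg\neg\phi\to \phi} & \text{ Modus Ponens, 5, 6 } \\
8. && {\varDelta\vdash\phi} & \text{ Modus Ponens, 1, 7 }
\end{array} $$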

The Bolotov algorithm is goal-driven, and uses a stack of goals. We build a DAM using steps that are divided into five groups. The first group contains a single step to initialize the algorithm. The steps in the second group check whether or not a goal is reached. The steps in the third group extend the DAM. The steps in group 4 handle the goals and may add new formulae to the DAM. In this group, a goal F can be added. The symbol F is not part of the language, but we use F as shorthand for “prove a contradiction”. Finally, group 5 completes the algorithm, where we omit certain details for the steps that are needed to prevent the algorithm from looping.

  1. We start the algorithm by adding the target statement (e.g. q → r ⊢ (p → q) → (p → r)) to our stack of goals, and the assumptions of this goal (q → r ⊢ q → r) to the DAM.

Until the stack of goals is empty, repeat:

  2. (a) If the top of the stack of goals (the top goal from now on) belongs to the DAM, we remove this goal from the stack of goals.

         Motivation: the goal is reached.

     (b) If the top goal is Δ ⊢ F and the DAM contains the statements \({\varDelta ^{\prime }\vdash \phi }\) and \(\varDelta ^{\prime \prime }\vdash \neg \phi \) such that \({\varDelta ^{\prime }\cup \varDelta ^{\prime \prime }\subseteq \varDelta }\), we add a set of axioms to the DAM that can be used to prove the goal below the top from these two statements. We remove the goal Δ ⊢ F from the stack.

         Motivation: we can use the contradiction to prove the goal below the top. Apart from the instances of the axioms, this proof will use applications of Modus Ponens. Hence, the goal below the top will be removed in a later step.

  3. (a) If the DAM contains a formula Δ ⊢ ¬¬ϕ, we add an instance of Axiom a (⊢ ¬¬ϕ → (¬¬¬¬ϕ → ¬¬ϕ)) and two instances of Axiom c to the DAM. The next step uses these axioms to deduce Δ ⊢ ϕ.

         Motivation: use the doubly negated formula.

     (b) We close the DAM under applications of Modus Ponens.

         Motivation: here we perform a broad search, and any derivable statement will be added to the DAM.

     (c) If the DAM contains a formula Δ ⊢ ψ and the top goal is Δ ∖ {ϕ} ⊢ ϕ → ψ, we add Δ ∖ {ϕ} ⊢ ϕ → ψ to the DAM.

         Motivation: use the deduction theorem.

  4. (a) If the top goal is Δ ⊢ ϕ → ψ, we add ϕ ⊢ ϕ to the DAM and the goal Δ, ϕ ⊢ ψ to our stack of goals.

         Motivation: prove Δ ⊢ ϕ → ψ with the deduction theorem.

     (b) If the goal is Δ ⊢ ¬ϕ, we add ϕ ⊢ ϕ to the DAM and the goal Δ, ϕ ⊢ F to our stack of goals.

         Motivation: prove Δ ⊢ ¬ϕ by contradiction.

     (c) If the goal is Δ ⊢ p, where p is an atomic formula, we add ¬p ⊢ ¬p to the DAM and the goal Δ, ¬p ⊢ F to our stack of goals.

         Motivation: we cannot prove Δ ⊢ p directly, and hence we prove it by contradiction.

  5. (a) If the top goal is Δ ⊢ F and Δ ⊢ ϕ → ψ belongs to the DAM, we add Δ ⊢ ϕ to our stack of goals.

         Motivation: we cannot prove a contradiction with the steps performed thus far. Hence, we exploit the statements we already have. Since our goal is to prove Δ ⊢ F, any formula is provable from Δ.

     (b) If the top goal is Δ ⊢ F and Δ ⊢ ¬ϕ belongs to the DAM, we add Δ ⊢ ϕ to our stack of goals.

         Motivation: use derived statements.

This algorithm constructs a basic DAM. Bolotov shows that his algorithm is sound and complete. Our adaptations, such as the replacement of a negation introduction rule by a set of instances of axioms, preserve soundness and completeness. We omit a detailed description of our adaptations and a proof of the correctness.
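To illustrate the goal-driven loop, the following Python sketch implements only the implication part of the steps above: initialization (step 1), the goal check (2a), closure under Modus Ponens (3b), and the two deduction steps (3c and 4a). The negation steps, the contradiction goal F, and the loop-prevention details are left out, so this is a simplified stand-in and not the full algorithm. The function names, the tuple encoding of formulae, the decision to keep a single motivation per statement, and the decision to treat a goal as reached when the DAM proves it from a subset of its assumptions are all assumptions made for this sketch.

```python
from itertools import product

def imp(a, b):
    return ("->", a, b)          # atoms are plain strings

def is_imp(f):
    return isinstance(f, tuple) and len(f) == 3 and f[0] == "->"

def reached(dam, delta, phi):
    # Assumption of this sketch: a proof from fewer assumptions also counts.
    return any(f == phi and sigma <= delta for (sigma, f) in dam)

def close_under_mp(dam):
    # Step 3(b): keep applying Modus Ponens until nothing new is derived.
    changed = True
    while changed:
        changed = False
        for (s1, f1), (s2, f2) in product(list(dam), repeat=2):
            if is_imp(f2) and f2[1] == f1:
                new = (s1 | s2, f2[2])
                if new not in dam:
                    dam[new] = ("Modus Ponens", (s1, f1), (s2, f2))
                    changed = True

def build_dam(assumptions, target):
    delta0 = frozenset(assumptions)
    # Step 1: the assumptions go into the DAM, the target onto the goal stack.
    dam = {(frozenset([a]), a): ("Assumption",) for a in delta0}
    goals = [(delta0, target)]
    while goals:
        delta, phi = goals[-1]
        close_under_mp(dam)                               # step 3(b)
        if reached(dam, delta, phi):                      # step 2(a)
            goals.pop()
        elif is_imp(phi):
            _, a, b = phi
            extended = delta | {a}
            proved = next((s for s in dam
                           if s[1] == b and s[0] <= extended), None)
            if proved is not None:                        # step 3(c)
                dam[(proved[0] - {a}, phi)] = ("Deduction", proved)
            else:                                         # step 4(a)
                dam.setdefault((frozenset([a]), a), ("Assumption",))
                goals.append((extended, b))
        else:
            return None   # would need the negation steps omitted here
    return dam

q_r = imp("q", "r")
target = imp(imp("p", "q"), imp("p", "r"))
example_dam = build_dam([q_r], target)
```

For the running example q → r ⊢ (p → q) → (p → r), this sketch finds the proof that applies the deduction theorem twice. Because it never adds axiom instances, it cannot find the other two proofs of Fig. 2.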

The above algorithm only uses axioms in a proof of a contradiction, or in the use of double negations. This means that without extra adaptations, Axiom b will never be used in a generated proof. Since we want the constructed proofs to resemble the proofs constructed by experts or students, and since logax should teach our students to recognize the possibility to use axioms, we use extra heuristic rules to add more instances of axioms to the DAM. With these heuristics we can produce the example DAM in Fig. 2. The heuristics to produce the right branch with nodes 4, 7, 8, 9 and 10 of the DAM are:

  • If the top goal equals Δ ⊢ (ϕ → ψ) → (ϕ → χ) and \({\varDelta ^{\prime }\vdash \psi \to \chi }\) already belongs to the DAM and \({\varDelta ^{\prime }\subseteq \varDelta }\), then add an instance (ϕ → (ψ → χ)) → ((ϕ → ψ) → (ϕ → χ)) of Axiom b to the DAM.

  • If the top goal equals Δ ⊢ ϕ and \({\varDelta ^{\prime }\vdash (\psi \to \chi )\to \phi }\) and \({\varDelta ^{\prime \prime }\vdash \chi }\) belong to the DAM with \({\varDelta ^{\prime }\subseteq \varDelta }\) and \({\varDelta ^{\prime \prime }\subseteq \varDelta }\), then add an instance χ → (ψ → χ) of Axiom a to the DAM.
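The two heuristics above can be read as pattern matches on the top goal and the statements in the DAM. The sketch below is a minimal illustration under our own assumptions: the DAM is reduced to a set of statements, formulae are tagged tuples, and the function names are hypothetical. Applied to the running example, the first heuristic proposes the Axiom b instance and the second then proposes the matching Axiom a instance.

```python
def imp(a, b):
    return ("->", a, b)          # atoms are plain strings

def is_imp(f):
    return isinstance(f, tuple) and len(f) == 3 and f[0] == "->"

def axiom_b_heuristic(goal, dam):
    """Goal Delta |- (phi -> psi) -> (phi -> chi) with psi -> chi available."""
    delta, formula = goal
    if not (is_imp(formula) and is_imp(formula[1]) and is_imp(formula[2])):
        return None
    (_, phi, psi), (_, phi2, chi) = formula[1], formula[2]
    if phi != phi2:
        return None
    if any(f == imp(psi, chi) and sigma <= delta for sigma, f in dam):
        # Instance (phi -> (psi -> chi)) -> ((phi -> psi) -> (phi -> chi))
        return (frozenset(), imp(imp(phi, imp(psi, chi)), formula))
    return None

def axiom_a_heuristic(goal, dam):
    """Goal Delta |- phi with (psi -> chi) -> phi and chi both available."""
    delta, phi = goal
    usable = [(sigma, f) for sigma, f in dam if sigma <= delta]
    for _, f in usable:
        if is_imp(f) and f[2] == phi and is_imp(f[1]):
            _, psi, chi = f[1]
            if any(g == chi for _, g in usable):
                # Instance chi -> (psi -> chi)
                return (frozenset(), imp(chi, imp(psi, chi)))
    return None

q_r = imp("q", "r")
goal = (frozenset([q_r]), imp(imp("p", "q"), imp("p", "r")))
dam = {(frozenset([q_r]), q_r)}

instance_b = axiom_b_heuristic(goal, dam)
dam.add(instance_b)
instance_a = axiom_a_heuristic(goal, dam)
```

With these two instances in the DAM, two applications of Modus Ponens lead from the assumption q → r to the target statement, which gives a branch of five statements.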

Distilling Proofs for Students

In the previous section we described the algorithm used to construct the DAM. Such a DAM may contain different solutions. For example, Fig. 2 shows three essentially different solutions for the proof of q → r ⊢ (p → q) → (p → r). Since we use this DAM to generate proofs for the purpose of giving hints to students or providing sample solutions, we have to find a way to isolate single proofs. Moreover, the proofs in the DAM are structured as directed acyclic graphs, whereas an axiomatic proof is a linear structure. Hence, we need a procedure to extract linear proofs from a DAM. We will not only use this procedure to provide complete solutions, but also to generate next steps and hints, which means that the procedure should meet the following requirements:

  • R1: generate a complete linear proof at once or stepwise

  • R2: complete a partial proof, even if this proof diverges from the generated linear proof or contains a user-defined goal

  • R3: add steps to a proof in an order that corresponds to the way students or experts add steps.

Requirement R1 is a direct consequence of our goal to use the DAM to provide sample solutions, hints, and next steps. Since a student solution may differ from the sample solution constructed from the DAM, we need requirement R2 to ensure that logax can always provide a hint or a next step, using the procedure to complete a partial proof. There are two ways in which the order of the steps while constructing a proof may vary. To illustrate the first way, we look at the example proof in Section “An E-Learning Tool for Hilbert-Style Axiomatic Proofs”. We could construct this proof in a forward way, from top to bottom, starting with line number 1 and finishing with line number 7. However, most textbooks advise applying the deduction theorem backwards. Hence we prefer a solution that starts with line numbers 6 and 7. A second way in which the order of the steps may vary is the order of the lines in the completed proof. Take, for example, the proof in Section “An E-Learning Tool for Hilbert-Style Axiomatic Proofs” again. This proof might also start with line number 4.

To fulfil requirement R3, we studied example proofs in textbooks and student assignments. Most textbooks introduce an assumption or axiom only when it can be used directly. When an axiom or assumption is introduced in line n, it is used in an application of Modus Ponens or the deduction theorem in line n + 1, or in an application of Modus Ponens in line n + 2 combined with another component in line n + 1. We found one case in which an assumption in line n was followed by a proof of a second component of Modus Ponens, and an application of Modus Ponens. All but two of the textbooks described in Section “Teaching Hilbert-Style Axiomatic Proofs” use this proof order. Mendelson (2015) and Wasilewska (2018) form the only exceptions: they start a proof with stating all the necessary assumptions and axioms. In the homework assignments we also notice that students tend to introduce an assumption or axiom only when it is needed. This was confirmed by the pretest, described in Section “Small-Scale Experiments with Students”, where all students who completed at least half of the axiomatic proof in propositional logic (16 out of 18 students) followed this strategy. In exams we also sometimes find student solutions that start by stating all the assumptions. If a student asks for a hint, we want logax to provide the step that would be advised by an expert or a fellow student, hence R3 requires an order of the steps corresponding to the way students or experts add steps. In the rest of this section we first explain how we extract linear proofs, and then motivate why this way of extracting proofs meets the requirements.

The correctness of the algorithm defined in Section “An Algorithm for Generating Proof Graphs” ensures that the DAM contains a complete proof. Extracting a single proof can be seen as searching for a subtree. Linearization of this subtree requires topological sorting. Since generating a stepwise solution is one of the requirements, we perform these two tasks, extracting and linearization, simultaneously.
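The linearization part can be illustrated with a depth-first topological sort. The sketch below assumes that a single proof has already been selected, so that every line has exactly one motivation, and it uses the line numbers of the example proof in Section “An E-Learning Tool for Hilbert-Style Axiomatic Proofs” as vertices. The function name `linearize` and the dictionary encoding are our own; the sketch only produces a final line order and does not model the stepwise construction.

```python
def linearize(premises, root):
    """Return the lines needed for `root`, every premise before its use."""
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for premise in premises.get(node, ()):
            visit(premise)        # premises are emitted first
        order.append(node)        # then the line that depends on them

    visit(root)
    return order

# Dependencies of the seven-line example proof: line -> lines it is derived from.
premises = {3: (1, 2), 5: (3, 4), 6: (5,), 7: (6,)}
lines = linearize(premises, 7)
```

Because a premise is emitted immediately before the first line that needs it, the assumption in line 4 appears only after line 3 has been derived, which matches the order of the example proof.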

The procedure for proof extraction consists of four different kinds of steps, which are repeated until the linearized proof is complete. In each step a new line is added to the proof under construction, or an unmotivated line is motivated. In the following list, the different steps are ordered according to the preference in which a certain step is chosen:

  • a close step: add a motivation to an unmotivated proof line

  • a backward step: add a backward application of the deduction theorem to a proof

  • a forward step: add a forward application of the deduction theorem or Modus Ponens to a proof

  • an introduction step: add an assumption or axiom to a proof.

While performing these steps, the procedure keeps track of the partial linearized proof, which consists of the grounded part (the already proven proof lines), and the ungrounded part. The latter part consists of lines that are already in the linearized proof, but are either unmotivated, or their motivation depends on unmotivated proof lines. The unmotivated lines are part of a list of goals, which may also contain a set of other subgoals to be reached.
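The split between the grounded and the ungrounded part can be computed with a simple fixed-point iteration, as in the sketch below. The encoding is an assumption made for this illustration: a partial proof is a dictionary from line numbers to either `None` (an unmotivated line) or the tuple of line numbers used in its motivation. The line number 999 for the intermediate goal is hypothetical.

```python
def split_grounded(proof):
    """Split line numbers into (grounded, ungrounded) sets."""
    grounded = set()
    changed = True
    while changed:
        changed = False
        for line, used in proof.items():
            if line in grounded or used is None:
                continue
            # A line is grounded once everything it depends on is grounded.
            if all(u in grounded for u in used):
                grounded.add(line)
                changed = True
    return grounded, set(proof) - grounded

partial_proof = {
    1: (),          # assumption
    2: (),          # assumption
    3: (1, 2),      # Modus Ponens, 1, 2
    999: None,      # goal that is not motivated yet
    1000: (999,),   # Deduction 999: motivated, but depends on an open goal
}
grounded, ungrounded = split_grounded(partial_proof)
```

Line 1000 has a motivation but still ends up in the ungrounded part, because its motivation depends on the unmotivated line 999.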

If possible, the procedure performs a close step, since in general this step completes the proof. The next preferred step is a backward step since a backward application of the deduction theorem replaces the goal to be reached by a simpler goal. When applications of the deduction theorem are impossible, the procedure tries to use already proven lines in an application of a forward step, and only if the other three kinds of steps are all impossible does the procedure introduce an assumption or an axiom. Here we have to take care of the logical structure of the proof. We illustrate this by means of the example in Fig. 2. Suppose that the partial proof consists of nodes 6 and 10, which means that deduction was applied to node 10. Continuing with node number 7 would add a superfluous line to the proof. To prevent this, the procedure trims the DAM into a subDAM using the first goal of the list of subgoals as a root. In our example, node number 6 becomes the new root, and the leaves in this subDAM are the nodes 1, 2 and 4. Suppose the procedure continues with node number 1. That leaves two possibilities for the next step, namely node number 2 or 4. Because of the last requirement R3, node number 2 is preferred, since in general students or experts choose an assumption that can be used directly over an assumption that can only be used later. The procedure realizes this preference by adding subgoals to the list of subgoals after performing an introduction step. These subgoals consist of the nodes between the introduced leaf and the node that corresponds to a subgoal already in the list of subgoals. In our example, the line numbers 3 and 5 are added as subgoals, which forces the procedure to look for a next leaf in the subtree rooted by line number 3 in the next step. As a consequence, line number 2 is indeed added in this step.
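Two ingredients of this procedure, the preference order on steps and the trimming of the DAM to the part below the current subgoal, can be sketched as follows. The names `choose_step` and `trim`, the encoding of candidate steps as (kind, description) pairs, and the extra vertex labeled "axiom" are assumptions made for this illustration; the vertices 1 to 7 are the line numbers of the example proof in Section “An E-Learning Tool for Hilbert-Style Axiomatic Proofs”.

```python
# Step kinds in decreasing order of preference.
PREFERENCE = ("close", "backward", "forward", "introduction")

def choose_step(candidates):
    """Pick the most preferred candidate from (kind, description) pairs."""
    if not candidates:
        return None
    return min(candidates, key=lambda step: PREFERENCE.index(step[0]))

def trim(premises, root):
    """Keep only the vertices that `root` depends on, including `root`."""
    keep, todo = set(), [root]
    while todo:
        node = todo.pop()
        if node not in keep:
            keep.add(node)
            todo.extend(premises.get(node, ()))
    return {node: premises.get(node, ()) for node in keep}

# line -> lines it is derived from; "axiom" is a hypothetical unused vertex.
premises = {3: (1, 2), 5: (3, 4), 6: (5,), 7: (6,), "axiom": ()}
sub_dam = trim(premises, 6)
leaves = {node for node, used in sub_dam.items() if not used}

step = choose_step([("introduction", "add assumption q -> r"),
                    ("forward", "Modus Ponens on lines 1 and 2")])
```

Trimming at line 6 removes both the final line and the unused vertex, and leaves the three assumptions as the only candidates for an introduction step. The refinement that prefers the assumption that can be used directly is not modelled here.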

We claim that this procedure meets the three requirements given above. Requirement R1, generating complete proofs, is guaranteed by the construction of the DAM. To show that requirement R2 is met, we distinguish two situations. As long as the steps in the student solution correspond to the steps generated by the procedure described in this section, this proof can be completed directly. If the student solution diverges from the generated solution, logax will use the student solution as a starting point to build a new DAM. Motivated lines will be marked as grounded lines and unmotivated lines will be part of the list of goals. This ensures that the procedure indeed extends the partial proof into a complete proof. Note that this complete proof may contain superfluous lines, for example when the student introduced an assumption that cannot be discarded by an application of the deduction theorem. This will result in a path in the DAM that is not connected to the goal. In such a case, logax will not remove the student lines, but complete the proof using the student lines that lead to the goal.

From student solutions to exercises and the log data collected from logax we know that students can perform the steps of a proof in many different orders. However, there are some heuristics in the construction of a proof, such as trying to use assumptions, or simplifying the goal by applying the deduction theorem. The preference on the order of the steps in our procedure ensures that the procedure follows these heuristics (requirement R3). This implies that steps can be added in two directions, forward and backward, and that a user can switch direction at any moment. Moreover, the order of the steps should be such that we can always motivate the next line: why do we perform a certain step at a certain moment. To achieve this, we use a dynamic programming approach, where subproblems are defined by the list of subgoals. The restriction to subDAMs as described above, ensures that we complete the subproblem defined by the first node of this list before we start a new subproblem. All steps can thus be motivated by a subgoal.

Lemmas

Reusing proven results is common practice in mathematics and logic. For example, the proof of the fundamental theorem of arithmetic (every number larger than 1 can be written in a unique way as a product of primes) uses the lemma that a prime divisor of a product ab is also a divisor of a or b. In logic the use of proven results is widespread too. Here, proven results are sometimes presented as derived rules, such as the rule Modus Tollens (¬ϕ can be derived from ϕ → ψ and ¬ψ) in Huth and Ryan’s textbook (2004). Axiomatic proofs often build on each other: for example, a proof of ⊩ ¬¬p → p can be used as a lemma in another proof.

Lemmas appear in several ITSs that deal with constructing proofs. They serve various purposes, such as a starting set to generate geometry problems (Alvin et al. 2014), or just as a predefined set that can be used by the student to solve a problem (Matsuda and VanLehn 2005). Perhaps more interesting is the possibility to allow the addition of lemmas by the user. In both the Jape natural deduction proof assistant (Bornat 2017) and the proof assistant described by Aguilera et al. (2000), a student can save proven results and use these results as lemmas in a new proof. The proof assistant Gateway to Logic for axiomatic proofs offers a user the possibility to state and use lemmas too (Gottschall 2012).

Lemmas in axiomatic proofs have various shapes. For example, we distinguish tautologies (⊩ ϕ) and valid sequents (ϕ1, ..., ϕn ⊩ ψ), but also schemas (for example ¬¬ϕ → ϕ) and instantiations of schemas (¬¬p → p). We include the use of lemmas in logax. The main purpose of including lemmas is to support adding relatively easy exercises. Without lemmas, many axiomatic proofs are too lengthy and complicated to be used in education. With the possibility to use lemmas, a new class of relatively easy exercises becomes available. The second goal is to give users the possibility to use their own lemmas. logax can provide a student with an exercise together with a lemma that may be used in the proof, and in a user-defined exercise the user can use her own lemmas. We impose two restrictions: predefined exercises use only instantiations of lemmas, and the interface only accepts user-defined lemmas that are instantiations of a tautology. The latter is not a real restriction since a valid sequent ϕ1, ..., ϕn ⊩ ψ can always be rewritten as a tautology ⊩ (ϕ1 → (... → (ϕn → ψ))).
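The rewriting of a valid sequent into a tautology can be spelled out as n successive applications of the deduction theorem, each of which moves the last assumption to the right-hand side as the antecedent of an implication:

$$ \begin{array}{lll} \phi_1,\ldots,\phi_{n-1},\phi_n\vdash\psi & \Rightarrow & \phi_1,\ldots,\phi_{n-1}\vdash\phi_n\to\psi \\ & \Rightarrow & \phi_1,\ldots,\phi_{n-2}\vdash\phi_{n-1}\to(\phi_n\to\psi) \\ & \Rightarrow & \cdots \\ & \Rightarrow & \vdash\phi_1\to(\ldots\to(\phi_n\to\psi)) \end{array} $$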

We adapt the algorithm to create a DAM to support the use of lemmas. In predefined exercises, lemmas are added to the DAM at the start of the algorithm, comparable to the addition of assumptions. The construction of the DAM and the extraction of a linear proof work in the same way as described in Section “An Algorithm for Generating Proof Graphs” and “Distilling Proofs for Students”. Students who solve these predefined exercises receive the lemma as a first line of the proof. To facilitate user-defined lemmas, a lemma rule is added to the set of rules. In a user-defined exercise, a student can introduce a lemma at any stage during the proof. The algorithm constructs a DAM based on the partial proof including the lemma and uses this DAM to provide hints and feedback.

Figure 4 shows an example of an exercise with a lemma. Since a motivation of line 999 as an application of Modus Ponens to the lemma in line 1 and line 4 completes the proof, the hint tells the student to add a motivation.

Fig. 4 A partial proof with a lemma, performed in logax

Hints and Feedback

Hints

One of the reasons for the effectiveness of human tutors is that they provide feedback at the level of solution steps, and help a student to overcome impasses using hints (Merrill et al. 1992). Hence, for ITSs to be as effective as human tutors, they should give stepwise feedback and some form of help. In a first version of logax, we implemented a hint sequence consisting of three hint types: the direction of a next step (forward or backward), the axiom or rule to apply, and a bottom-out hint that shows how to perform a next step. Although these hints can help students to complete a proof, they might not always help a student to understand why a certain step is useful. A study with the Geometry Tutor (McKendree 1990) shows that students who receive informative feedback combined with information about a subgoal are more effective in correcting mistakes than students who only receive informative feedback. Several logic tutors offer hints containing subgoals. An early attempt is the P-logic tutor (Lukins et al. 2002), in which students learn to construct proofs using standard equivalences and inference rules. Since this tutor cannot construct proofs, it uses heuristics to construct possibly useful subgoals, such as an atomic formula from which the truth can be deduced. A drawback of this approach is that the tutor might suggest an unnecessary subgoal. The Deep Thought Logic tutor (Eagle et al. 2012; Barnes and Stamper 2008) uses data mining to construct proofs and subgoal hints from student solutions. In a comparison of the performance of students receiving next step hints with students receiving hints about a subgoal to be reached in two or three steps, the latter group outperformed the former one in the more difficult exercises, both with respect to the time needed to take a step and with respect to accuracy (Cody et al. 2018).

In logax we keep track of a list of subgoals while constructing a proof. We use this list to provide hints about a subgoal. We do not give a subgoal as hint if a student can still apply deduction backwards, when a subgoal coincides with an unmotivated line in the proof, or when a subgoal coincides with the next step. In the other cases, we give a hint concerning a subgoal instead of a hint about the direction of the proof. For instance, the hint for the unfinished proof in Fig. 5 will be: try to prove p, p → q ⊩ q. An example where our algorithm deliberately does not give a subgoal as hint can be found in the proof in Fig. 1. Here, the first hint will be: perform a forward step, since in this case the subgoal p, qqq, Modus Ponens, is equal to the next step.
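The three exceptions to giving a subgoal hint can be read as a simple decision function. The following is a sketch under stated assumptions: the function `choose_hint_type`, its parameters, and the representation of proof lines as plain strings are hypothetical and only illustrate the conditions listed above.

```python
from typing import Iterable


def choose_hint_type(
    subgoal: str,
    next_step: str,
    unmotivated_lines: Iterable[str],
    backward_deduction_possible: bool,
) -> str:
    """Return "subgoal" or "direction" following the three exceptions.

    All arguments are hypothetical: proof lines are plain strings here.
    """
    if backward_deduction_possible:
        return "direction"  # deduction can still be applied backwards
    if subgoal in set(unmotivated_lines):
        return "direction"  # the subgoal is already an unmotivated line
    if subgoal == next_step:
        return "direction"  # the subgoal coincides with the next step
    return "subgoal"
```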

Fig. 5 The start of a proof for q → r ⊩ (p → q) → (p → r)

Feedback

In this subsection we first analyze student errors, and then describe how we use this analysis to create feedback. Students make mistakes in axiomatic proofs. From the homework of 40 students participating in our course we collected a set of mistakes, and classified these mistakes into three categories:

  • oversights (19),

  • conceptual errors (11), and

  • ‘creative’ rule adaptations (9).

Mistakes such as missing parentheses belong to the first category. This category mainly consists of missing parentheses in Axiom b. A typical example of a mistake in the second category is the following application of Modus Ponens:

$$ \begin{array}{lllr} 1. && {\neg {p},\neg {q}\vdash\neg {q}} & \text{ Assumption }\\ 2. && {\neg {q}\vdash\neg {p}\to \neg {q}} & \text{ Deduction 2}\\ 3. && {\vdash(\neg {p}\to \neg {q})\to ({q}\to {p})} & \text{ Axiom c}\\ 4. && {\neg {p}\to \neg {q}\vdash{q}\to {p}} & \text{ Modus Ponens 2, 3 } \end{array} $$

Here, the student has the (wrong) idea that after an application of Modus Ponens on Δ ⊩ ϕ and Σ ⊩ ϕ → ψ, the formula ϕ becomes the assumption of the conclusion. Creative rule adaptations may take various forms. An example of such a rule adaptation is:

$$ \begin{array}{lllr} 1. && {\neg {p}\to ({q}\to \neg {r})\vdash\neg {p}\to ({q}\to \neg {r})} & \text{Assumption}\\ 2. && {{q}\vdash{q}} & \text{Assumption}\\ 3. && {\neg {p}\to ({q}\to \neg {r}),{q}\vdash\neg {p}\to \neg {r}} & \text{Modus Ponens 1, 2} \end{array} $$

In this example the ultimate goal is to prove that ¬p → (q → ¬r), q ⊩ r → p. The student tries to reach this via the subgoal ¬p → (q → ¬r), q ⊩ ¬p → ¬r, but she misses the possibility to reach this subgoal with an instantiation of Axiom b and Axiom a. Instead, she creates her own variant of Modus Ponens in line 3 of the proof.

Further analysis of the homework exercises suggests that students typically make these mistakes when they do not know how to proceed. This is in line with the repair theory, which describes the actions of students when they reach an impasse (Brown and VanLehn 1980).

The example of a conceptual error given above is impossible to construct in logax, since logax fills in the assumptions automatically. In the evaluation in Section “Small-Scale Experiments with Students” we analyze whether students recognize these kinds of conceptual mistakes after practicing with logax. However, it is still possible that a student tries to apply a rule incorrectly. For example, a student might apply Modus Ponens on Δ ⊩ ϕ and \({\varSigma \vdash \phi ^{\prime }\to \psi }\) where ϕ and \(\phi ^{\prime }\) are equivalent but not equal. We used homework solutions to define a set of buggy rules for mistakes that can be made in logax. Most of these rules relate to Modus Ponens (8 buggy forward applications, 3 buggy backward applications and 2 closure rules), the others to deduction (2 buggy backward applications and 4 buggy closure rules). Using these rules, logax can give informative feedback. Shute’s guidelines (2008) state that feedback should be elaborate, specific, clear, and as simple as possible. Our feedback not only points out a mistake, but if possible also mentions exactly which formula, subformula or set of formulae does not match the rule chosen. For example, if a student wants to complete the proof

$$ \begin{array}{lllr} 1. && {\neg {q},\neg {p}\vdash\neg {q}} & \text{ Assumption } \\ 2. && {\neg {q}\vdash\neg {p}\to \neg {q}} & \text{ Deduction 1 }\\ 3. && {\vdash(\neg {p}\to \neg {q})\to ({q}\to {p})} & \text{ Axiom c}\\ 4. && {\neg {p}\to \neg {q}\vdash{q}\to {p}} \end{array} $$

by applying Modus Ponens to lines 2, 3 and 4, she gets a message that line 4 cannot be the result of an application of Modus Ponens on lines 2 and 3, since the assumption of line 2 does not belong to the set of assumptions in line 4.
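A check of this kind can be sketched as follows. The sketch is not the buggy-rule machinery of logax: the `Line` type, the tuple encoding of formulas, the function `check_modus_ponens`, and the wording of the messages are assumptions made for this illustration. It also assumes that the conclusion must carry at least the assumptions of both premises.

```python
from typing import FrozenSet, NamedTuple, Tuple, Union

# Hypothetical encoding: an atom is a string, ("not", f) is a negation,
# and ("imp", f, g) is the implication f -> g.
Formula = Union[str, Tuple]


class Line(NamedTuple):
    assumptions: FrozenSet
    formula: Formula


def check_modus_ponens(antecedent: Line, implication: Line, result: Line) -> str:
    """Return "ok" or a message naming the part that does not match."""
    f = implication.formula
    if not (isinstance(f, tuple) and f[0] == "imp"):
        return "the second line is not an implication"
    if f[1] != antecedent.formula:
        return "the first line is not the left-hand side of the implication"
    if f[2] != result.formula:
        return "the result is not the right-hand side of the implication"
    # Assumption made in this sketch: every assumption of a premise
    # has to occur among the assumptions of the result.
    missing = (antecedent.assumptions | implication.assumptions) - result.assumptions
    if missing:
        return "an assumption of a premise is missing in the result"
    return "ok"


not_p, not_q = ("not", "p"), ("not", "q")
line2 = Line(frozenset({not_q}), ("imp", not_p, not_q))
line3 = Line(frozenset(), ("imp", ("imp", not_p, not_q), ("imp", "q", "p")))
line4 = Line(frozenset({("imp", not_p, not_q)}), ("imp", "q", "p"))
message = check_modus_ponens(line2, line3, line4)
```

For the proof above, the formulas of lines 2, 3 and 4 fit the shape of Modus Ponens, so the sketch reports the assumption mismatch: ¬q is an assumption of line 2 but not of line 4.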

Evaluation of the Generated Proofs

We first evaluate the proofs generated by logax by comparing them with expert proofs. After that, we evaluate the recognition of student solutions by logax. Section “Small-Scale Experiments with Students” describes the final part of the evaluation, which is a small-scale experiment with students using logax.

Comparison of the Generated Proofs with Expert Proofs

We evaluate the proofs generated by logax in two ways. First, we compare the generated proofs with expert proofs. Since example proofs and worked solutions in textbooks often use earlier proofs as a lemma, the number of proofs that we can compare with the logax proofs without lemmas is small. We found 10 examples we could use for a comparison in the textbooks of Ben-Ari, Kelly, and Goldrei, and in the lecture notes of a course on logic (LenI) (Lodder et al. 2018). The list of exercises can be found in Table 1.

Table 1 Exercises without lemmas in textbooks and lecture notes. The last column compares the logax proof with the expert proof

Eight of the ten expert proofs are equal to the logax proofs. In exercise 6, logax uses the deduction theorem instead of Axiom a. The logax proof of ⊩ ¬p → (¬¬p → p) (exercise 5) is one step longer than the expert proof, since logax proves ⊩ ¬¬p → p directly without making use of an assumption ¬p. The logax proof uses a standard construction. Although this standard construction is useful in quite a lot of proofs, in this case, missing the shorter solution is a shortcoming of logax since a goal of the exercise is that the student recognizes the possibility to use this assumption. However, the completion by logax of a student proof that starts using this assumption in an application of Axiom c equals the proof in the lecture notes. In the future, we might fine-tune logax such that the generated solution equals the lecture notes solution also in other cases. We conclude that for most of the examples and exercises in textbooks, logax generates a proof that is equal to the expert proof.

To evaluate the version of logax with lemmas, we have nine expert proofs available. The course on logic contains five examples of exercises using lemmas, together with example proofs of these exercises, see Table 2. In the first three exercises only instantiated lemmas are given. The proofs generated by logax for these exercises are equal to the solutions in the course notes. The fourth and fifth exercise ask for a proof in predicate logic, but we can compare the propositional part of these proofs. Both exercises present lemmas as schemas, and instantiations of these schemas are used in the solution. We add these instantiations as lemmas in logax. The solution to the fifth exercise is equal to the proof generated by logax, except for the order of the proof lines. In the fourth exercise, logax originally only used one of the two lemmas, and the proof by logax was longer than the solution in the logic course notes. After some minor changes in the implementation of the heuristics, logax uses no lemmas, but generates a shorter proof. Both proofs are given in Appendix A. Since exercises in textbooks also use derived rules, we have only 4 extra exercises with lemmas in these books. All the proofs of these exercises are equal to the logax proofs.

Table 2 Exercises in a logic course

To evaluate more proofs we use the large collection of proofs on the Metamath website (see footnote 1). This website collects formal proofs, not only for logic statements, but also for mathematical statements. The part on propositional logic contains proofs of well-known theorems, for example from the Principia Mathematica by Russell and Whitehead. In general, a Metamath theorem is presented as follows:

$$ {\vdash\phi_1}\ \text{and}\ {\vdash\phi_2} \text{ and ... and } \vdash\phi_n\Rightarrow \vdash\psi $$

So if ϕ1, ..., ϕn are provable, then ψ is provable. Since logax cannot deal with general tautologies, we translate a Metamath theorem into a theorem with assumptions. Instead of ⊩ ϕ1 and ⊩ ϕ2 and ... and ⊩ ϕn ⇒ ⊩ ψ, we prove ϕ1, ϕ2, ..., ϕn ⊩ ψ. Since proofs in Metamath build on each other, these proofs seem to be natural candidates to use in a comparison with logax with lemmas. However, the best way to compare proofs is not immediately clear. To demonstrate this, we present an example of a Metamath proof in Fig. 6. This example shows a proof of ϕ → (ψ → (ψ → χ)) ⊩ ϕ → (ψ → χ). The proof uses two previous theorems: id (⊩ ψ → ψ) and mpdi (from ⊩ ϕ → χ and ⊩ ϕ → (ψ → (χ → θ)) follows ⊩ ϕ → (ψ → θ)). The most direct way to translate this into a logax proof with lemmas would be by rewriting mpdi as a theorem with assumptions (ϕ → χ, ϕ → (ψ → (χ → θ)) ⊩ ϕ → (ψ → θ)), and using instantiated versions of these theorems as lemmas in the proof. However, to complete this proof, a forward application of deduction followed by an application of Modus Ponens by logax suffices, which makes the comparison not very informative. A more interesting way to compare proofs would be by letting logax find useful instantiations of lemmas, but so far, we have not implemented this functionality. In the comparison, we therefore add just a single instantiated lemma to logax. We inline the other lemmas in the Metamath proofs. Since Metamath proofs do not make use of the deduction theorem, a last adaptation we have to make is to remove applications of the deduction theorem in the logax proofs. We use the constructive proof of the deduction theorem to remove occurrences of deduction in logax proofs, and compare these proofs with the Metamath proofs. The results are shown in Appendix B. The comparison consists of 24 theorems, 12 with and 12 without negation. Nearly two thirds (15 out of 24) of the proofs are equal except for the order of the lines. The logax proof is shorter than the Metamath proof in eight cases. In seven of these cases logax does not use the lemmas. It is not surprising that in these cases, inlined proofs in Metamath are longer: the Metamath proofs are constructed by choosing suitable lemmas. Metamath does not inline proofs and uses lemmas that are known theorems, or variants, for example originating from the Principia Mathematica. Hence a Metamath proof is short when it uses a small set of lemmas, without caring about the total length of the inlined proof.

Fig. 6 An example of a Metamath proof

In Lodder et al. (2017) we compared 30 Metamath proofs with proofs generated by logax without lemmas. After inlining the used lemmas in Metamath and removing applications of the deduction theorem in the logax proofs, we found that 27 logax proofs were as long as the Metamath proofs. In three cases the logax proof was shorter than the Metamath proof. Although the comparison of logax with Metamath can only be done indirectly, the results (of the proofs with lemmas, nearly two thirds of the logax proofs are equal to Metamath proofs, and most of the logax proofs without lemmas have the same length as the Metamath proofs) indicate that logax indeed generates proofs that are comparable to expert proofs.

Recognizing Student Solutions

In a second evaluation we investigate whether or not correct student solutions can be recognized by logax. Axiomatic proofs are part of a course on logic where we also deal with other topics, such as semantics of predicate logic, axiomatic proofs in predicate logic, structural induction and Hoare calculus. Students may hand in homework to earn a bonus point. The homework exercises contain one exercise about propositional logic axiomatic proofs and one on predicate logic axiomatic proofs. Furthermore, the exams usually have an exercise on axiomatic proofs. We use solutions of two homework exercises, and one exam exercise, to determine whether or not logax recognizes student proofs.

In the exam exercise, students have to prove that ¬¬p → ¬q, r → q ⊩ r → ¬p. The correct student solutions to this exercise can be divided into two groups, where each group contains solutions that are equal up to the order of the proof lines. Solutions in the first group contain an application of Axiom a, b and c, and no application of the deduction theorem. Solutions in the second group contain an application of the deduction theorem, and of Axiom c. Of the 19 correct solutions, the majority (16) belongs to the second group, and the remaining three solutions to the first group. The example solution provided by logax also belongs to the second group. The solutions of the first group do not (yet) appear in our initial DAM, but they do appear in the DAM we dynamically obtain when a student introduces Axiom b, and we use this DAM to provide feedback. In the future we might add an extra heuristic for the use of Axiom b:

  • If the top goal equals Δ ⊩ ϕ → χ, and \({\varDelta ^{\prime }\vdash \phi \to \psi }\) and \({\varDelta ^{\prime \prime }\vdash \psi \to \chi }\) both appear in the DAM, where \({\varDelta ^{\prime }\cup \varDelta ^{\prime \prime }\subseteq \varDelta }\), then add an instance (ϕ → (ψ → χ)) → ((ϕ → ψ) → (ϕ → χ)) of Axiom b to the DAM.
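The proposed heuristic can be sketched as a search over pairs of lines. This is only an illustration of the rule stated above, not part of logax: the DAM is simplified to a flat list of (assumptions, formula) pairs, and the helpers `imp`, `neg` and `axiom_b_instances` are hypothetical names.

```python
from typing import FrozenSet, List, Tuple, Union

Formula = Union[str, Tuple]          # atom, ("not", f) or ("imp", f, g)
Sequent = Tuple[FrozenSet, Formula]  # (assumptions, formula)


def imp(a: Formula, b: Formula) -> Formula:
    return ("imp", a, b)


def neg(a: Formula) -> Formula:
    return ("not", a)


def is_imp(f: Formula) -> bool:
    return isinstance(f, tuple) and f[0] == "imp"


def axiom_b_instances(goal: Sequent, dam: List[Sequent]) -> List[Formula]:
    """Collect Axiom b instances suggested by the heuristic for `goal`."""
    delta, formula = goal
    if not is_imp(formula):
        return []
    phi, chi = formula[1], formula[2]
    found: List[Formula] = []
    for delta1, f1 in dam:
        if not (is_imp(f1) and f1[1] == phi):
            continue
        psi = f1[2]  # candidate middle formula: delta1 proves phi -> psi
        for delta2, f2 in dam:
            # delta2 must prove psi -> chi, and both assumption sets
            # together must be contained in the assumptions of the goal.
            if f2 == imp(psi, chi) and (delta1 | delta2) <= delta:
                instance = imp(imp(phi, imp(psi, chi)),
                               imp(imp(phi, psi), imp(phi, chi)))
                if instance not in found:
                    found.append(instance)
    return found
```

Applied to the exam exercise, with r → q and q → ¬p present in the simplified DAM, the sketch yields the instance (r → (q → ¬p)) → ((r → q) → (r → ¬p)).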

In the first homework exercise students have to prove that q, ¬p → (q → ¬r) ⊩ r → p. Here almost all student solutions (16) use Axiom a, b and c. Only one student uses the deduction theorem instead of Axiom b. logax generates this last proof. Solutions using the three axioms are not part of the DAM that is generated at the start of the exercise, but can be recognized by a dynamically generated DAM. The second homework exercise is an exercise in predicate logic, but it contains a propositional part that amounts to a proof of (p → q) → ¬p ⊩ q → ¬p. Again, there were two groups of solutions: 13 students use Axiom a and the deduction theorem, and 2 students use an extra application of the deduction theorem instead of Axiom a. In this case the solution generated by logax is the solution that does not use Axiom a, but the DAM also contains a solution with Axiom a.

We summarize the results in Table 3. The first column (preferred) shows the number of solutions that corresponds to the preferred solution of logax, the second the number that corresponds to a non-preferred solution, and solutions in the third column can be recognized by a dynamically generated DAM. The conclusion of this evaluation is that with the use of dynamically generated DAMs, we can recognize all student solutions, and also give hints. Still, we might optimize logax by adding more heuristics, e.g. such that the solution generated by logax for homework exercise 2 equals the student solutions. In the current implementation, heuristics for the use of Axiom b and the deduction theorem interfere: extra heuristics for Axiom b can broaden the DAM, but the algorithm for the distillation of a linear proof prefers applications of the deduction theorem, a local decision. We would have to extend this algorithm with global heuristics to ensure that the extracted proof contains instances of axioms when applicable.

Table 3 Recognized solutions

Small-Scale Experiments with Students

We have performed several small-scale experiments with logax. The main results in this section are obtained from an experiment with logax without lemmas, performed in May 2018. The 18 participants in this experiment were preparing for admission to a master program Computer Science at the Open University of the Netherlands. A course on logic is part of the premaster program. We required participants to submit a solution to the first homework exercise, see Section “Comparison of the Generated Proofs with Expert Proofs”, before participating in the experiment. Thus, we guaranteed that participants had studied the subject before the experiment. We used their solutions to the exercise as an indication of their prior knowledge. The experiment consisted of a 20-minute (online) instruction, after which students practiced with the tool for 75 minutes. The experiment concluded with a 20-minute posttest. All interactions of the students with logax were logged. The 10 exercises in the tool and the questions in the posttest can be found in Appendix C.

We use the results of this experiment

  1. to evaluate the hints and feedback given by logax,

  2. to analyze the way students use logax, and

  3. to evaluate the effect of using logax on students’ performance.

Evaluation of Hints and Feedback

We start with evaluating the generated hints and feedback by answering the following questions:

  • does logax recognize common mistakes?

  • is the feedback sufficient to repair mistakes?

  • do hints on subgoals help students to reach these subgoals?

To answer these questions, we analyze the log data of the experiments. Table 4 summarizes the results of this analysis, and in the following paragraphs we will discuss these results in more detail.

Table 4 Number of different errors in logged student steps

Apart from 5 syntax errors, the log data contain 179 incorrect steps out of a total of 1480 steps performed by the students. The syntax errors were made by 5 different students who could repair this error in the next step without asking for extra help. It seems that the dialog box indeed prevents students from spending time repairing syntax errors. In 24% (43/179) of the incorrect steps, a student tries to apply Modus Ponens, but interchanges the first and second line in the dialog box, as shown in Fig. 7. This typically occurs when the implication (ϕ → ψ) has a smaller line number than the antecedent of this implication (ϕ). During the experiment we did not have a buggy rule implemented for this situation, and hence students received the feedback ‘ϕ is not an implication’, or ‘Modus Ponens is not applicable’. In 34 out of the 43 occurrences of this kind of mistake, this feedback was sufficient for the student to fill in the dialog box correctly. In the other cases students asked for a hint or next step, or continued with another rule. One student did not realize that the implication may precede the antecedent, and consequently constructed the proofs in such a way that implications always have a higher line number than the antecedents.

Fig. 7 A common error: interchanging line 1 and 2 in the dialog box for Modus Ponens

A second category of mistakes also seems to have its origin in an incorrect use of the dialog box. Examples are backward applications of Modus Ponens where the last line is already motivated, or the other line is unmotivated (10 occurrences). A student who puts the number 2 in the bottom field in Fig. 7 and leaves the middle field open makes this type of mistake. The feedback message in such a case was not very helpful: “Cannot apply Modus Ponens”. Some buggy applications of Modus Ponens were only recognized when students completed the dialog box ‘correctly’ (entering the implication in the second field). For instance, an erroneous application of Modus Ponens on formulae Σ ⊩ ϕ → ψ and Δ ⊩ ψ is not recognized if the student enters the line number of the first line in the uppermost (antecedent) field, and the second in the middle (implication) field. Some misreading of parentheses is recognized by logax, but, for example, the misreading of parentheses in an application of Modus Ponens on Σ ⊩ p → (q → r) and Δ ⊩ (p → q) → (r → (p → q)) is not recognized as a common mistake. Mistakes in the introduction of an axiom or assumption in the dialog box, such as interchanging p and q, which results in a faulty Modus Ponens application on Σ ⊩ ¬p → ¬q and ⊩ (¬q → ¬p) → (p → q), are not recognized either. The remaining errors that are not recognized cannot be classified as buggy rules, since we cannot find a pattern in, or a misconception as a cause of, these errors.

We further analyzed the log data to see whether in the cases that a common error is detected (86 errors), the error message is sufficient to help a student in making progress. In 60% (52/86) of these cases, a student can proceed without help of the system, in over 8% (7/86) a student gives up on the exercise directly after this mistake, and in 3% (3/86) after one or more erroneous steps, and in the other cases a student can proceed with a hint or next step. If a student needs more help, this does not necessarily mean that the error message is not clear: the log data suggest that often a student recognizes the mistake (in 60% of the cases a student does not make the same error again during the session), but does not know how to proceed.

We conclude that we can improve error messages by recognizing the cases where a student interchanges the lines in the dialog box. In the 43 cases where interchanging the lines was the only mistake, students will receive a message about how to fill in the dialog box correctly. In 15 other cases, where students now get a default message or a message that probably does not refer to the actual mistake, we will also improve the feedback. An example of this situation is the application of Modus Ponens on Δ ⊩ p → q and \({\varDelta ^{\prime }\vdash {q}}\) (entered in this order). At this moment, students receive the error message that q is not an implication, but after recognizing this mistake as a combination of a common error and swapping lines in the dialog box, the error message will say that the formula in the second line should be equal to the left-hand side of the implication instead of the right-hand side. With these improvements we would have provided specific feedback in 80% ((86 + 43 + 15) / 179) of the mistakes made by students, instead of in 48% (86/179) of the mistakes in the version used in the experiment.

Since the possibility to give a hint about a subgoal was new in the logax version that we used in this experiment, we evaluate the effect of this type of hint. A student receives a hint that indicates a subgoal to be reached if it takes more than one step to reach this subgoal, and if the subgoal is not already present in the proof as an unmotivated line. logax gives a hint about a subgoal in 75 of its 192 hints, and 40 of these subgoals were reached by the students without further assistance of logax. In the other cases, the students used next step hints, which told them the rule to proceed with or a next step. This number might seem somewhat disappointing, but a more detailed analysis shows that in general only students who ask for a lot of help do not reach the subgoal without extra help. Figure 8 presents a scatter plot with the total number of different subgoal hints given to the student on the x-axis, and the number of subgoals reached after the hint without further help on the y-axis. The figure shows that students who use fewer than eight different subgoal hints (hints in different exercises or in different stages of their proof), in general reach the subgoal themselves.

Fig. 8 The effectiveness of subgoal hints

Use of logax

The second part of this evaluation is the question: how do students use logax? Do students misuse the system by asking for too much help, or by randomly filling in the dialog boxes, or do they struggle without using the help offered by logax? The results of our experiment show that the use of logax is related to performance in the homework exercise, which we use as a pretest. Of the 12 students who score at least 0.8 (out of 1) in the pretest, eight students can complete the exercises without using much help or making many mistakes. One student reported that he completed the exercises with pen and paper and used logax mainly to check his answers. Two students use quite a lot of hints, also to complete the exercises, and one student seems to misuse the system by performing lots of actions (338 interactions versus an average of 173). Of the students who score lower than 0.8 on the pretest, only one student completes all the exercises without much help. In this group help-seeking strategies differ considerably: two students hardly ask for any help, one student performs 65 help-seeking actions. We conclude that in this experiment, in general, good students use logax as intended, and can complete exercises without a lot of help. Since only six weaker students participated, we cannot draw hard conclusions, but the log data seem to indicate that these students either tend to overuse or underuse help. Another observation is that most of the weaker students can complete the first four or five easier exercises, but the later exercises seem to be too difficult for this group. This indicates a shortcoming in our experiment: it seems the difference in difficulty between the first five and the other exercises is too big. We could regulate hint use by letting logax provide unsolicited hints when students make too many mistakes or do not make progress in their exercise, and on the other hand by limiting the number of hints that a student can ask for.

Evaluation of Learning Effects

The last question we want to answer is whether logax supports students in learning axiomatic proofs. Most of the participants in the experiment were good students, who already performed well on the pretest: the average score on the pretest was 0.84, where the maximum possible score was 1. This might have influenced learning effects negatively. The posttest consisted of two parts. In the first part, students had to point out possible errors in five small proofs. Four of the five proofs were incorrect. The incorrect proofs contained common errors collected from the homework exercises, see Section “Hints”. We deliberately included both errors that can be made in logax and errors that are prevented by the interface, since we wanted to know whether the latter type would occur more often in the posttest. In the second part, students had to provide a proof.

The scores of the posttest can be found in Table 5. The low scores on exercises 1b and 1d are remarkable. The score on exercise 1b was more or less expected, because the exercise contains an error in the set of assumptions after applying Modus Ponens. Since logax determines this set automatically, students do not practice determining it themselves. Exercise 1a also contained an error that is not possible in logax (mixing an application of Modus Ponens with Axiom b), but this error was recognized by students. Exercise 1d applies deduction in ‘the wrong direction’, a common mistake made by students, which is apparently not sufficiently corrected while practicing with logax (the log data contain only 3 occurrences of this error). The misreading of parentheses in exercise 1e (a possible error in logax) is recognized by most students. Students scored lower on exercise 2, in which a student has to construct a proof, than on the pretest. We hypothesize that this is caused by fatigue (most of our students combine study with a job, the experiment took place in the evening, and the posttest started at 21:15), and by the fact that we asked students not to spend more than 20 minutes on the posttest, while they could spend as much time as they needed on the pretest, since this was part of the homework assignment.
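To make concrete what students do not practice when the tool computes the assumption set for them, the following minimal sketch applies Modus Ponens to two statements. It assumes the usual convention that the conclusion depends on the union of the assumption sets of both premises. The encoding of formulas as nested tuples and the names `Statement` and `modus_ponens` are illustrative choices, not the logax implementation.

```python
from typing import FrozenSet, NamedTuple, Tuple, Union

# A formula is an atom name, or ("->", left, right), or ("not", sub).
Formula = Union[str, Tuple]


class Statement(NamedTuple):
    assumptions: FrozenSet  # the set to the left of the turnstile
    formula: Formula        # the formula to the right of the turnstile


def modus_ponens(minor: Statement, major: Statement) -> Statement:
    """From S1 |- phi and S2 |- phi -> psi, derive (S1 union S2) |- psi."""
    f = major.formula
    if not (isinstance(f, tuple) and len(f) == 3 and f[0] == "->"):
        raise ValueError("major premise is not an implication")
    if f[1] != minor.formula:
        raise ValueError("minor premise does not match the antecedent")
    # The new assumption set is the union of both premises' assumption sets.
    return Statement(minor.assumptions | major.assumptions, f[2])
```

For example, from p ⊢ p and p → q ⊢ p → q this yields p, p → q ⊢ q. An error of the kind described for exercise 1b would correspond to writing down a different assumption set for the conclusion.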

Table 5 Results of the posttest

We also looked at the results on the exam. Students who participated in the experiment received on average the same score for the exam exercise on axiomatic proofs as they got for the pretest. Since most students participated in the experiment, we cannot compare their results with those of students who did not participate. The results of an earlier experiment (January 2018) with 9 students were more or less comparable. The average score of this group on the pretest was a little lower (0.8), and the score on the posttest proof, exercise 2, was also lower (0.52). The results on the exam were a bit higher (0.89), but the exercise was slightly different, and the time between the experiment and the exam was considerably shorter: 13 days instead of 40 days for the most recent experiment. We conclude that at this moment we do not have enough data to evaluate learning effects. However, experiments with other tools (Lodder et al. 2019) show that this kind of tutoring system can be effective.

Limitations

In the previous subsections, we described our pilot experiments and mentioned some limitations. First, the number of participants in the experiments is too low to draw statistical conclusions or measure learning gains. Second, most Open University students combine their study with a job, and hence the online lessons are organized in the evening. This may have influenced the results of the posttest. Third, the difference in difficulty between the first five and the remaining exercises is a problem. Weaker students could have benefited from a more gradual increase in difficulty. There are other factors that could have influenced learning effects, for example gaming behaviour of students while working with logax. Since the participants in our experiments were motivated adults preparing for admission to a master's program, we did not expect much gaming behaviour. Still, analysis of the log data shows that one student overused the possibility to let the system perform a next step. This might have been caused by frustration, one of the possible causes of gaming behaviour mentioned by Baker et al. (2008).

Related Work

As mentioned in the introduction, there are two e-learning tools that can be used to practice the construction of Hilbert-style axiomatic proofs in propositional logic: Metamath Solitaire (Megill 2007) and Gateway to logic (Gottschall 2012). Both tools are proof-editors: a student chooses an applicable rule and the system applies this rule automatically. These systems provide no help on how to construct a proof. There are quite a lot of systems that help students with other kinds of exercises in logic, and many more in other subjects.

In the AProS project, Sieg and colleagues have developed Proof Tutor, a tutor that teaches students natural deduction (Sieg 2007; Perkins 2007). They have developed an automated proof search method, which differs from the Bolotov method in the use of normal proofs. Their algorithm uses a set of tactics that are explicitly used as hints for students. Perkins (2007) describes how they provide help such as hints or next steps when the partial solution of a student diverges from a generated model solution. First, they check whether the subgoals of the partial solution are indeed derivable from the assumptions (we do not need this check, since a situation in which a subgoal is not derivable cannot occur in logax). Second, they check whether the partial solution can be completed by the Proof Tutor; if this is not the case, they let the student erase the lines of the part that does not belong to a generated proof. In logax we do not let students erase lines. The consequence is that a final proof may contain unnecessary lines, but also that we will not erase a useful part that is not recognized as useful by logax.

We use a strategy language to generate both the solutions and the feedback (Heeren et al. 2010). The use of such a language is related to the use of production rules as, for example, in Anderson et al. (1995) and Corbett et al. (1997), and the way in which we recognize student solutions is akin to their model-tracing approach. The Geometry Tutor makes use of contextualized rules, which means that a rule will only fire in a specific context. The different axioms that we add in step 2(b) of our version of the algorithm depend on assumptions in the statements \(\varDelta' \vdash \phi\) and \(\varDelta'' \vdash \neg \phi\), and these rules could also be perceived as contextualized rules. When more than one rule can be applied in the same situation, tools based on production systems may add preferences to specific rules (Anderson et al. 1995; Jaques et al. 2013). We use the strategy language to specify a preference in the application of the rules.

Ahmed et al. (2013) use a different approach to generate and solve natural deduction exercises. Their main idea is to use truth tables, representing an equivalence class of logical formulae. The representations of these formulae are used in a proof graph of predefined size. Proof generation consists of finding a truth-table-representative of the assumptions and conclusion, searching for a proof using these representatives, and adding rewrite steps (such as replacing subformulae of the form ¬¬ϕ by ϕ, or vice versa) to this proof. In this way, they can generate exercises and solutions typically used in education. However, in their approach it is essential that rewrite steps on subformulae are allowed, which is not the case in Hilbert axiomatic systems, and also often not in natural deduction systems.
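The idea that a truth table represents a whole equivalence class of formulae can be illustrated with a short sketch. It is not the implementation of Ahmed et al. (2013); the tuple encoding of formulae and the function names `evaluate` and `truth_table` are assumptions made for this example, and only negation and implication are supported.

```python
from itertools import product


def evaluate(formula, valuation):
    """Evaluate a formula given as an atom name or a nested tuple."""
    if isinstance(formula, str):
        return valuation[formula]
    op = formula[0]
    if op == "not":
        return not evaluate(formula[1], valuation)
    if op == "->":
        return (not evaluate(formula[1], valuation)) or evaluate(formula[2], valuation)
    raise ValueError(f"unknown connective: {op}")


def truth_table(formula, atoms):
    """Return the column of truth values over all valuations of `atoms`."""
    return tuple(
        evaluate(formula, dict(zip(atoms, values)))
        for values in product([False, True], repeat=len(atoms))
    )
```

Under this encoding, ¬¬p and p receive the same truth table and therefore the same representative, which is why moving between them requires an explicit rewrite step on a subformula, a step that a Hilbert system does not offer.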

Answer set programming, used for example by O’Rourke et al. (2019), is related to the production systems used by Anderson et al. (1995). The program of O’Rourke et al. finds all different solutions of an algebra exercise, using deductive rules and integrity constraints that forbid certain solutions. Explanations and a subset of misconceptions can also be generated automatically. To restrict the search space to a finite space (which is necessary in their program), the number of terms in an equation and the solution length are bounded. In logax we do not need to bound the length of formulae; formula length is not a good measure for the expected proof length. For technical reasons, we bound calculation time, but thus far logax has been able to solve all problems within this limit.

For some domains it is possible to use existing tools to generate solutions or correct student solutions. An example of this approach is the work of Sadigh et al. (2012): to solve state machine problems, such as finding a trace for a given model that violates or satisfies a certain property, they use existing model checking tools. This approach can be very useful to produce solutions or grade exercises, but is in general less suitable for giving hints, since the steps performed by a tool do not always correspond to human steps.

Conclusion and Future Work

By using an existing algorithm for natural deduction, we developed a sound and complete algorithm to generate Hilbert-style axiomatic proofs, and introduced a representation of these proofs as a directed acyclic multi graph (DAM). We use these DAMs in a new interactive tutoring tool, logax, to give hints and next steps to students, and to extract model solutions. Comparing the generated proofs with expert solutions shows that the quality of the proofs is comparable to that of expert proofs. The tool recognizes most of the steps in a set of student solutions, and in case a step diverges from the generated proof, logax can still provide hints and next steps. This holds both for the original version of logax and for the extension with lemmas. We derived buggy rules from a set of student solutions, and added these to logax. Evaluation with a test set showed that this set covers the majority of student errors. In an experiment with students we discovered that we had overlooked a source of errors originating in the user interface (i.e. errors made while filling in the dialog box for Modus Ponens). We expect that after adding buggy rules for this kind of error, logax will recognize about 80% of the errors.
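As a rough illustration of how a model solution might be extracted from such a graph, the sketch below linearizes one derivation of a goal statement. It is a toy encoding and not the logax data structure: statements are plain strings, the dictionary `dam` maps each statement to a list of alternative derivations (which plays the role of parallel edges), and the function name `extract_proof` is hypothetical. The graph is assumed to be acyclic.

```python
def extract_proof(dam, goal):
    """Linearize one derivation of `goal` from a derivation graph.

    `dam` maps each statement to a list of alternative derivations,
    each a (rule, premises) pair. The first alternative is always used,
    and statements not needed for `goal` are left out.
    """
    lines, seen = [], set()

    def visit(statement):
        if statement in seen:
            return
        seen.add(statement)
        rule, premises = dam[statement][0]
        for premise in premises:
            visit(premise)  # premises are emitted before the line that uses them
        lines.append((statement, rule))

    visit(goal)
    return lines
```

Because only statements reachable from the goal are visited, lines that a student added but that are not needed do not end up in the extracted solution, while they can remain in the graph itself.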

We performed several pilot evaluations with logax. Since the number of participating students was low, and students already performed well on the pretest, we cannot derive conclusions about the learning effect of logax. However, we conclude that well-prepared students use the system as intended, and can complete most of the exercises without using much help or making a lot of mistakes. Students who do not ask for more hints than average reach the subgoal given in a hint without extra help. Future evaluations can investigate whether students indeed learn by using logax. It might be necessary to add more easy exercises for weaker students, since the results of the evaluation indicate that the difficulty level of the exercises rises too quickly for this category of students. Mechanisms to diminish excessive hint use, or to stimulate hint use in case of minimal use, might also help the weaker students to benefit from the use of logax.

We developed algorithms for generating DAMs and distilling proofs from these DAMs for the domain of Hilbert-style axiomatic proofs. The strategy language used to formulate these algorithms is much more broadly applicable, and we expect that some of the ideas used in our algorithms can also be applied to other problem domains. For example, the dynamic extension of partial student proofs, or the use of the strategy language to order the introduction of the proof steps in a way that corresponds to an expert’s pen-and-paper proof, could be based on techniques similar to those used in logax. It would also be interesting to compare the approach in the AProS project with our approach. Contrary to the logax approach, AProS erases lines in a partial student proof that are not used in the generated completion. These different approaches could have different effects on learning and motivation.

Geometry tutors could also benefit from our approach. The geometry prover used by Matsuda and VanLehn (2004) restricts the number of points that can be used as a starting point in a construction to prevent combinatorial explosion. With dynamically generated strategies it could be possible to add points introduced by students without a large increase in CPU time. QED-Tutrix is a geometry tutor that uses a different approach by first extending a problem figure into a super-figure that contains possibly useful points and segments (Font et al. 2020). A student who uses this system can only use points and segments in the figure and the super-figure. The authors state that a limitation of their system is the necessity to provide all the possible elements of the proof in advance. With the dynamic approach, it could be possible to also find solutions that use extra points or segments introduced by students.