Introduction

Explanations are central to both teaching and learning. Teachers and lecturers routinely offer instructional explanations as part of classroom practice, either during instructor-led exposition or in response to students’ questions (e.g., Lew et al., 2016; Treagust & Harrison, 1999). However, explanations vary in quality, to the point where instructional explanations may sometimes have negative effects. Specifically, instructional explanations do not always lead to successful learning (Chi et al., 2001; Lew et al., 2016; VanLehn et al., 2003; Wittwer & Renkl, 2008), and generating low-quality explanations can in some circumstances limit understanding. For example, Lachner and Nückles (2016) conducted a classroom experiment to compare the effects of different explanations of an optimisation problem (involving finding an extremum of a function) on senior high school students. Two types of explanations were compared: one produced by research mathematicians (emphasising principles and conceptual rationale), and the other by mathematics teachers (emphasising the procedural steps of how to solve a problem). The students were randomly assigned to three groups, with two groups receiving explanations prepared by either mathematicians or teachers, and the third (control) group receiving no explanation. The researchers found that the students who received the explanations elucidating principles and conceptual rationale behind the steps (offered by mathematicians) outperformed the other students on a subsequent application test (Lachner & Nückles, 2016).

These findings emphasise the importance of taking into consideration explanation quality: in order to maximise student learning, instruction needs to be based on high-quality explanations and omit low-quality explanations. But, despite the importance of explanation quality, Wittwer and Renkl (2008) noted that the issue of what constitutes an effective explanation has been widely neglected by education researchers. Thus, when teachers and learners generate instructional explanations, they must typically rely upon experience and intuition rather than research-based advice.

Our goal in this paper is to consider the quality of instructional explanations from the perspectives of undergraduate mathematics lecturers and students. There are two broad approaches to analysing explanation quality that can be adopted. The first, which we describe as the top-down approach, is to evaluate the quality of instructional explanations using pre-existing general frameworks that attempt to describe the features that high-quality instructional explanations may have (e.g., Wittwer & Renkl, 2008). An alternative, which we describe as the bottom-up approach, would be to collect a corpus of instructional explanations, develop an empirical method by which mathematicians and students can assess their quality, and then study the features that high- and low-quality explanations have. In light of the lack of frameworks for instructional explanation quality rooted in the mathematics classroom, and in light of the strong possibility (discussed below) that mathematical explanations may be disanalogous to non-mathematical explanations (such as those in science), we adopted the bottom-up approach. That is, instead of attempting to characterise the quality of instructional explanations in mathematics in a top-down fashion, using principles of general frameworks, we aim to empirically explore this notion of explanation quality as it exists among mathematics lecturers and students. Our hope is that such bottom-up exploration can contribute to the more general characterisation of the quality of instructional explanations in mathematics.

More specifically, we address two main questions. First, can mathematicians reliably judge the quality of explanations? In other words, when asked to assess the quality of different mathematical explanations, do mathematics lecturers tend to agree with each other? Second, do the judgements about explanation quality made by lecturers coincide with those of their intended audience, undergraduate students? Before introducing the study we conducted to address these issues, we first briefly review the existing literature on explanations in mathematics.

Philosophical and educational perspectives on mathematical explanations

At least in the area of proof-based mathematics, mathematics education researchers have traditionally turned to the philosophy of mathematics for insights into a notion of explanation in mathematical practice that could be useful to determine what makes a proof explanatory in the mathematics classroom (Hanna, 1990; Hersh, 1993; Weber, 2010). Philosophers have devoted a great deal of attention to understanding what it means to say that A explains B. Some accounts in the general philosophy of science literature rely on either statistical associations or causal mechanisms. For instance, Salmon (1971, 1984) suggested that A explains B if B is consistently correlated with A, or if there is a causal history that connects B and A (cf. Hempel & Oppenheim, 1948). So, we can say that wearing shoes of the wrong size explains why our feet are hurting because there is a causal connection between the two events. But, while causal and statistical accounts may work well in scientific contexts, they seem to fail in mathematics. Mathematical concepts are not related causally, as there is no temporal order in the universe of mathematical definitions, theorems, and proofs. The fact that the square root of 2 is irrational is not located at a particular point in time. Neither are mathematical facts related statistically, as they take no probabilities other than 0 or 1 (i.e., a mathematical statement is either true or false). Consequently, many scientific accounts of explanation do not seem to easily apply in mathematical contexts (Colyvan, 2012; Mancosu, 2001).

If mathematical explanations are not scientific explanations, what are they? This question has generated significant interest. Some regard the lack of causal and correlational relations as a reason to deny that mathematical explanations exist, at least in a manner analogous to scientific explanation (Resnik & Kushner, 1987; Zelcer, 2013). This approach seems at odds with practice. Studies of mathematical language show that research mathematicians do use explanatory words when communicating with one another (Mejía-Ramos et al., 2019). And it is certainly at odds with educational practice. As noted above, explanations are central to mathematical teaching and learning, and many mathematics education researchers have emphasised the importance of engaging students with mathematical proofs that explain theorems, rather than those which merely demonstrate that theorems are true (e.g., Bell, 1979; Hanna, 2000). The desire to distinguish explanatory from non-explanatory proofs also rules out the proposal that A explains B if B can be logically deduced from A. Under such an account, all valid deductive arguments would be equally explanatory. The apparent ubiquity of explanations in mathematical discourse leads most philosophers to reject the suggestion that mathematical explanations do not exist (Weber & Frans, 2017) or that they are simply logical deductions (Colyvan, 2012). Instead, they offer two broad categories of account, which Delarivière et al. (2017) described as the ontic and epistemic approaches to mathematical explanation (see Footnote 1).

Ontic accounts focus on the objective properties of purported explanations. For instance, Steiner (1978) suggested that mathematical explanations are arguments that refer to characterising properties of relevant concepts. Ontic accounts also include Kitcher’s (1984) proposal that mathematical explanations are characterised by the way in which they unify a range of different mathematical concepts, and Lange’s (2014) suggestion that mathematical explanations are characterised by their use of certain types of salient features (e.g., symmetries) of the explained result. What all ontic accounts share is the belief that the (non-)explanatoriness of a mathematical proof—and it is typically proofs that these accounts focus on, a fact that was critiqued as ‘proof chauvinism’ by D’Alessandro (2020)—can be assessed without considering its actual or potential audience. Despite the fact that ontic accounts are audience independent, mathematics education researchers have attempted to use them to address the question of what makes a proof explanatory in the mathematics classroom (Hanna, 1990; Reid, 1995).

In contrast, epistemic and, similarly, functional accounts start with the assumption that explanations are communicative acts that aim to cause understanding (Delarivière et al., 2017; Inglis & Mejía-Ramos, 2021). For example, Wilkenfeld (2014) defined explanations to be those things that generate understanding “under the right circumstances and in the right sort of way” (p. 3367). In order to unpack exactly what the right circumstances and sorts of ways are, Wilkenfeld drew heavily on the epistemology literature (e.g., Grimm et al., 2016).

Although the ontic and epistemic approaches differ in emphasis, they are not necessarily contradictory. Inglis and Mejía-Ramos (2021) offered a functional account which they suggested could encompass the ontic accounts of Steiner (1978), Kitcher (1984), and Lange (2014). Specifically, they argued that modern theories of human cognitive architecture imply that mathematical arguments, which make use of characterising properties, which unify concepts, or which involve Lange-style saliency, will typically help a reader to increase their level of understanding of the domain(s) in which the explanation is situated. In other words, the ontic properties identified by Steiner, Kitcher, and Lange are all likely to contribute to an argument’s epistemic explanatoriness.

The fact that philosophers disagree about whether ontic or epistemic approaches to mathematical explanation are more promising raises the possibility that mathematicians too may adopt differing perspectives on explanatoriness. If this were correct, then we might expect there to be between-mathematician variation in the types of explanations that they deem most explanatory. Assessing the extent to which this is the case is one goal of the study reported in this paper. Adopting an epistemic approach to mathematical explanation requires specifying what type(s) of understanding it is that explanations try to foster. Inglis and Mejía-Ramos (2021) emphasised that, for them, explanations aim to increase objectual understanding, not merely propositional understanding. This distinction can be illustrated by comparing the statements “Tessa understands that there are five continents” (propositional understanding) and “Honali understands topology” (objectual understanding). Whereas objectual understanding admits degrees, propositional understanding does not. It would be straightforward to find someone whose understanding of topology was either higher or lower than Honali’s, but it does not seem possible to understand that there are five continents more or less than anyone else. Although there are various different accounts of precisely how objectual understanding should be characterised (e.g., Grimm, 2006; Kelp, 2016; Kvanvig, 2003), a common theme is that the relationship between previously disconnected information is crucial. For instance, Kvanvig (2003) wrote that “the grasping of relations between items of information is central to the nature of understanding” (p. 197).

Given the importance of explanations for educational practice, and given the unresolved philosophical debates about what exactly mathematical explanations are (and even whether or not they exist), it is perhaps surprising that more attention has not been devoted to understanding explanation quality in educational settings. One exception to this rule is Wittwer and Renkl’s (2008) general framework for exploring the effectiveness of (not necessarily mathematical) instructional explanations. They defined instructional explanations as being explanations given in educational contexts designed for the purpose of teaching, and gave four normative criteria by which their quality can be interrogated.

  1.

    Explanations should take account of learners’ existing knowledge. If the goal of an explanation is to increase objectual understanding, and if increasing objectual understanding involves developing relationships between previously disconnected knowledge schemas, then it seems self-evident that instructional explanations should be designed with reference to the learner’s existing knowledge (e.g., Leinhardt & Steele, 2005). This assumption has been studied empirically. For example, Duffy et al. (1986) compared the verbal explanations given by more and less effective fifth-grade teachers. They found that the more effective teachers would spontaneously adapt their explanations in light of interactions with students, in order to respond to student misunderstandings. The association between an instructor’s knowledge of their students’ knowledge levels and the quality of explanations they produce is causal. Wittwer et al. (2008) experimentally manipulated instructors’ beliefs about their students’ knowledge levels in an online tutoring environment. They found that those instructors given inaccurate information about their students’ existing knowledge produced less appropriate explanations, as assessed by the extent to which students increased their knowledge.

  2.

    Explanations should focus on concepts and principles. Because increasing objectual understanding involves constructing new relationships between knowledge, explanations which focus on generalisable concepts or principles are likely to help learners integrate more of their existing knowledge. Research has shown that principle-oriented explanations promote learners’ mathematical understanding since they integrate conceptual and procedural information, thus making them particularly tangible for novice students (Lachner et al., 2019). It is well documented, however, that teachers often omit conceptual explanations when explaining procedures, which is detrimental to learning (Lachner & Nückles, 2016; Lachner et al., 2019). There are large between-instructor differences in the extent to which explanations focus on generalisable principles. For example, Perry (2000) compared the explanations offered by teachers in first- and fifth-grade classrooms in Japan, Taiwan, and the USA. She found that teachers in the Asian classrooms offered more explanations, but also that their explanations were much more likely to generalise beyond the specific problem being discussed. Perry (2000) suggested that this difference in explanation quality might partly account for differences in performance on international comparisons found between the USA and Asian countries.

  3.

    Explanations should be integrated into the learner’s ongoing cognitive activities. In line with the emphasis that constructivist accounts of learning place on active processing (e.g., Fiorella & Mayer, 2015), if learners actively engage with the information in an explanation—rather than just listen or read it—a higher level of objectual understanding is likely to be reached. For instance, Webb et al. (1995) investigated seventh-grade students’ activity after receiving instructional explanations from their teachers. The extent to which students continued to work on problems after receiving an explanation, by either solving the problem or explaining how to solve the problem, was strongly predictive of scores on a subsequent assessment of their learning.

  4.

    Explanations should not replace learners’ knowledge-construction activities. The last of Wittwer and Renkl’s (2008) criteria for effective explanations concerned when explanations should not be provided. If the provision of an explanation leads to lower levels of engagement with the to-be-learned material, then there is likely to be less learning, regardless of the quality of the provided explanation. For instance, Roy et al. (2017) found that providing undergraduates with verbal explanations of steps in a proof led to lower comprehension retention than if no such explanations were provided. They hypothesised that the extra support the explanations provided disrupted the students’ engagement with the proof, reducing the extent to which they were able to integrate their learning with existing knowledge (Alcock et al., 2015).

It is worth noting two aspects about Wittwer and Renkl’s (2008) framework. First, their four criteria are audience dependent (even the second criterion is ultimately related to learners’ understanding in terms of, and knowledge construction around, these concepts and principles). In this sense, this framework is more closely related to epistemic approaches of mathematical explanation in the philosophy of mathematics (albeit, specific epistemic approaches could operationalise “understanding” differently, e.g., purely in terms of abilities as opposed to cognitive structures and activities). In contrast, from an ontic perspective, these criteria (particularly the first, third, and fourth) are simply irrelevant in the characterisation of mathematical explanation (i.e., from an ontic perspective, there would be nothing wrong if a given mathematical explanation meets those criteria, but those criteria would certainly not be part of the defining features of mathematical explanation). Second, it is important to note that while Wittwer and Renkl’s (2008) framework provides us with general “evidence-based conjectures about how to use instructional explanations to effectively support understanding and learning” (p. 51), it would be difficult to use this framework to compare the quality of specific instructional explanations in a given mathematics classroom (e.g., the individual criteria are somewhat general, and the framework does not assign differential weights to each of them).

The current study employs a novel approach to the investigation of explanation quality in mathematics. In an attempt to complement the top-down application of general characterisations of explanation quality from the philosophy of mathematics and educational psychology literature, we investigated the viability of a bottom-up investigation into the notion of explanation quality as it exists among mathematics lecturers and students. A natural first question in this bottom-up approach concerns the level of agreement in mathematics lecturers’ and students’ assessments of explanation quality in mathematics. The goal of the study reported in this paper is to tackle this question. Specifically, we asked whether university mathematics lecturers and undergraduate students are able to reliably judge explanation quality. In turn, we argue that this empirical investigation has implications for both the philosophy of mathematics and mathematics education. If philosophical accounts of mathematical explanation wish to reflect mathematical practice (Hamami & Morris, 2020; Van Bendegem, 2014), then it is important for philosophers to understand what mathematicians themselves consider to be explanatory. Studies seeking to assess the reliability of mathematicians’ judgements in this domain therefore have the potential to constrain existing and future philosophical accounts of mathematical explanation (we return to this point in the discussion). 
In the case of mathematics education, and indeed educational psychology more generally, results from the current study may help rule out one possible reason why instructors often produce explanations that do not effectively support learning (Chi et al., 2001; Lachner & Nückles, 2016; Lachner et al., 2019; VanLehn et al., 2003; Wittwer et al., 2008), namely that no one in the classroom (neither instructors nor students) can reliably distinguish high-quality explanations from low-quality ones (i.e., that assessing explanation quality generates large between-instructor, between-student, and between-instructor-student disagreements).

Creating a corpus of explanations

To achieve our goal of investigating the reliability of mathematicians’ and undergraduates’ judgements of explanation quality, we first created a corpus of mathematical explanations. Specifically, we aimed to collect mathematicians’ responses to a prompt for an explanation of a mathematical concept. To this end, we designed a short online survey.

Participants in the survey, who received invitations to participate from a member of the research team by email, were all research-active mathematicians working at universities in New Zealand, the USA, Canada, Australia, Singapore, Belgium, and Finland. After giving consent to participate, participants provided us with some demographic information about their research area, and then clicked through to a page with the following prompt:

Imagine that a math major on your linear algebra course comes to your office hours and says that they are confused. They explain that although they have seen the definition, they do not understand what an abstract vector space is, or what it is for. What explanation would you give the student in response? (Feel free to use tex or pidgin tex in your response.)

They were able to respond using a free-text box. This prompt, which referenced the student’s lack of objectual understanding of vector spaces, was designed to describe a realistic scenario in which a university lecturer might be asked for an explanation by a student. We were careful to set the context for the required explanation: a mathematics major taking a linear algebra course, who had seen the definition of a vector space, but who did not understand it and therefore had come to the lecturer’s office hours session.

A convenience sample of twenty mathematicians participated. A variety of different explanations were offered. One participant offered a diagram as part of their explanation (which was sent directly to the researchers outside of the Web form). From the twenty explanations offered, we selected a corpus of ten. These ten were chosen to reflect the full variety of the explanations offered. In other words, we removed extremely similar explanations.

The ten selected explanations, which we edited lightly for clarity, are given in the Appendix. The original versions of all twenty explanations are available in the online dataset which accompanies this article (see Footnote 2). The ten explanations varied considerably in content and other elements. For instance, some involved procedural aspects (e.g., Explanations 1, 7, and 8), whereas others were entirely conceptual (e.g., Explanation 2). Some explanations were primarily geometric (e.g., Explanations 9 and 10), whereas others were focused on building on numerical intuitions (e.g., Explanation 8). The main feature of many explanations was highlighting similarities between examples (Explanations 3, 4, 5, 6, and 9) and one used a non-example (of a set that is not a vector space) to illustrate the concept boundary (Explanation 1). Utility considerations were present in many explanations, accentuating the purpose of the definition and exemplifying its possible uses (e.g., Explanations 1, 4, 5, 6, and 9). Moreover, some explanations explicitly involved learner-instructor interactions such as posing questions and providing further information depending on the answers received.

It is unsurprising that some of the content-related aspects of the ten explanations obtained have been widely discussed in the mathematics education literature. Numerous studies have emphasised the importance of the use of examples in mathematical learning in order to develop conceptual understanding (e.g., Bills & Watson, 2008; Fukawa-Connelly & Newton, 2014; Tall & Vinner, 1981), with the role of non-examples deemed necessary to gain a coherent concept image (Fukawa-Connelly & Newton, 2014; Goldenberg & Mason, 2008; Tall & Vinner, 1981). The importance of geometrical interpretations to supplement explanations and the use of diagrams and graphs has also been identified and discussed widely, emphasising their functional role in promoting learning (e.g., Mejía-Ramos & Weber, 2019; Samkoff et al., 2012; Soto-Johnson & Troup, 2014).

Comparative judgement

To assess the extent to which mathematicians and undergraduates can reliably assess the quality of explanations, we adopted a comparative judgement approach (Bisson et al., 2016; Jones et al., 2019, 2015). Comparative judgement approaches to assessment rely upon the finding that people are more accurate when comparing items than they are when asked to evaluate an item in isolation (Thurstone, 1927a). For example, it is easier to decide which of two weights is the heavier than it is to estimate the weight of one in isolation. Thurstone (1927b) referred to this observation as the ‘law of comparative judgement’ and used it to assign values to a variety of stimuli. For instance, in one study, Thurstone (1927b) was able to measure the perceived seriousness of different criminal offences (libel, perjury, smuggling, etc.) by asking college students to engage in a series of paired comparisons and fitting the resulting data to his psychophysical model.

Modern uses of comparative judgement in assessment rely upon the Bradley-Terry model (Bradley & Terry, 1952). This assumes that each item i (indexed by a positive integer) has a numerical parameter \(\beta _i\) which captures its quality on some dimension of interest. In our case, this is ‘explanatoriness’; in Thurstone’s (1927b) study, the dimension was ‘seriousness of offence’. Given two items i and j, the probability that a judge rates i higher than j on the given dimension is given by

$$\begin{aligned} \mathbb {P}(i>j)=\frac{e^{\beta _i}}{e^{\beta _i}+e^{\beta _j}}. \end{aligned}$$

By presenting judges with repeated pairs of stimuli and asking them to assess which they would rate higher on the given dimension, empirical estimates of the \(\beta _i\) can be obtained. Jones et al. (2019) suggested that an average of 10 judgements per item usually suffices to produce an accurate estimate of each \(\beta _i\).
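The Bradley-Terry probability and the estimation of the \(\beta _i\) from paired judgements can be illustrated with a minimal sketch. This is not the procedure implemented by any particular comparative judgement platform: the function names, the iterative (minorisation-maximisation) fitting routine, the `pseudo` smoothing constant, and the example judgements are all assumptions made for illustration.

```python
import math


def win_probability(beta_i, beta_j):
    """Bradley-Terry probability that item i is judged higher than item j."""
    # Algebraically equal to exp(beta_i) / (exp(beta_i) + exp(beta_j)).
    return 1.0 / (1.0 + math.exp(beta_j - beta_i))


def fit_bradley_terry(n_items, judgements, n_iter=200, pseudo=0.1):
    """Estimate one beta per item from (winner, loser) index pairs.

    Assumption: `pseudo` adds a small fictitious win in both directions for
    every pair of items, so that an item that never wins still receives a
    finite estimate. It is a smoothing device, not part of the model.
    """
    wins = [[0.0 if i == j else pseudo for j in range(n_items)]
            for i in range(n_items)]
    for winner, loser in judgements:
        wins[winner][loser] += 1.0

    strength = [1.0] * n_items  # strength_i plays the role of exp(beta_i)
    for _ in range(n_iter):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                        for j in range(n_items) if j != i)
            updated.append(total_wins / denom)
        # Rescale so the betas sum to zero; only differences are identified.
        log_mean = sum(math.log(s) for s in updated) / n_items
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]


# Made-up judgements over three items: 0 usually beats 1, 1 usually beats 2.
judgements = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
betas = fit_bradley_terry(3, judgements)
print([round(b, 2) for b in betas])
print(round(win_probability(betas[0], betas[1]), 2))
```

Because only differences between the \(\beta _i\) enter the probability, the sketch fixes the scale by forcing the estimates to sum to zero; any other additive constant would describe the same fitted model.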

Comparative judgement techniques are becoming common in educational assessment contexts. They have been used to assess the quality of student essays (Heldsinger & Humphry, 2013) and laboratory reports (McMahon & Jones, 2015), as well as more nebulous constructs such as students’ conceptual understanding (Bisson et al., 2016), students’ problem solving skills (Jones & Inglis, 2015), and mathematicians’ conceptions of mathematical proof (Davies et al., 2021). The method is particularly helpful when one wishes to assess constructs about which people are expected to have an intuitive understanding, but which they may not be able to fully articulate or use to make reliable absolute judgements (Pollitt, 2012).

Critically, comparative judgement can also be used to produce estimates of the reliability of judges’ judgements. A variety of reliability coefficients can be produced from comparative judgement data, but here we focus our attention on intuitively straightforward split-half inter-rater reliability coefficients (Bisson et al., 2016). To calculate such a coefficient, one randomly selects two groups of judges from the total set (typically half in each group) and fits the Bradley-Terry model separately for each group. This produces two \(\beta _i\) estimates for each of the judged items, one derived from each group of judges. The correlation between these two sets of \(\beta _i\) produces an estimate of the extent to which the two groups of judges agree with each other about the construct being assessed. Repeating this procedure 1000 times—with a new random split each time—and taking the average of the correlations yields an overall estimate of the reliability of the judges. A coefficient close to 1 indicates that the judges tend to agree with each other about the construct being assessed, whereas a coefficient close to 0 indicates little or no between-judge agreement. This is the method we used in our study.
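The split-half procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis script used in the study: the data layout of `(judge_id, winner, loser)` triples, the smoothed fitting routine `fit_bt`, the function names, and the simulated judges are all hypothetical.

```python
import math
import random


def fit_bt(n_items, pairs, n_iter=200, pseudo=0.1):
    """Bradley-Terry betas from (winner, loser) pairs.

    Assumption: `pseudo` is a small smoothing count added to every pair.
    """
    wins = [[0.0 if i == j else pseudo for j in range(n_items)]
            for i in range(n_items)]
    for winner, loser in pairs:
        wins[winner][loser] += 1.0
    strength = [1.0] * n_items
    for _ in range(n_iter):
        updated = [
            sum(wins[i]) / sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                               for j in range(n_items) if j != i)
            for i in range(n_items)
        ]
        log_mean = sum(math.log(s) for s in updated) / n_items
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(n_items, judgements, n_splits=1000, seed=0):
    """Mean correlation between betas fitted to random halves of the judges."""
    rng = random.Random(seed)
    judges = sorted({judge for judge, _, _ in judgements})
    half = len(judges) // 2
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(judges)  # new random split of the judges each time
        group_a = set(judges[:half])
        pairs_a = [(win, lose) for judge, win, lose in judgements if judge in group_a]
        pairs_b = [(win, lose) for judge, win, lose in judgements if judge not in group_a]
        correlations.append(pearson(fit_bt(n_items, pairs_a), fit_bt(n_items, pairs_b)))
    return sum(correlations) / len(correlations)


# Simulated judges who share one underlying ordering (made-up data).
sim_rng = random.Random(1)
true_beta = [-2.0, -1.0, 0.0, 1.0, 2.0]
simulated = []
for judge in range(10):
    for _ in range(20):
        i, j = sim_rng.sample(range(5), 2)
        p_i = 1.0 / (1.0 + math.exp(true_beta[j] - true_beta[i]))
        simulated.append((judge, i, j) if sim_rng.random() < p_i else (judge, j, i))
print(round(split_half_reliability(5, simulated, n_splits=50), 2))
```

The split is made over judges rather than over individual judgements, so the coefficient reflects agreement between people and not merely the stability of one pooled set of comparisons.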

Method

Two groups of participants took part in the judging session, where they were asked to judge which of two explanations was the better. Eighteen research mathematicians from the University of Auckland participated, following an email to relevant colleagues in the department. None had participated in the earlier Web survey that generated the corpus of explanations. Each participant was asked to complete 20 judgements, but one did not and was removed from the analysis. One further participant responded extremely quickly to each comparison (a mean of 8.1 s per judgement, compared to the average of 31.9 s) and was also removed from the analysis. Therefore a total of 320 judgements from 16 mathematicians were included in the final analysis, with a mean of 33.3 s per judgement.

We also recruited undergraduates who had recently taken a linear algebra course of the type mentioned in the explanation prompt discussed in Section 3. These participants were studying mathematics at either the University of Auckland or Rutgers University. We aimed to collect the same number of judgements as we had collected from the mathematicians (320), but following feedback from several mathematician participants about the length of the study, we asked each undergraduate participant to make only ten comparisons. Of the 34 participants we recruited, two failed to complete their set of judgements and were removed from the analysis. The mean duration per judgement for the undergraduates was 49.7 s, and no undergraduate was excluded due to a very low mean judgement duration (the lowest was 17.3 s). Therefore, a total of 320 judgements from 32 undergraduates were included in the final analysis. Of the 48 participants, 12 identified themselves as female (1 mathematician, 11 undergraduates), and 36 as male. A majority of mathematicians, 12, described their research area as falling primarily within the domain of pure mathematics.

Participants were invited to take part via an email from a member of the research team, which stated the following:

We are interested in understanding how people assess the quality of mathematical explanations. We asked 10 mathematicians this question:

Imagine that a math major on your linear algebra course comes to your office hours and says that they are confused. They explain that although they have seen the definition, they do not understand what an abstract vector space is, or what it is for. What explanation would you give the student in response?

We are going to ask you to evaluate their responses so that we can understand what you value in an explanation.

If recipients of the email wished to participate, they clicked through to a website which explained the purpose of the study, took some demographic information (research area for the mathematicians, gender for all participants), and then asked participants to start judging. To record participants’ judgements, we used the No More Marking comparative judgement platform (https://www.nomoremarking.com/). Participants were presented with two randomly chosen explanations side by side, and simply asked “which is the better explanation of what a vector space is?”. They responded by selecting either the explanation on the left or the explanation on the right. Participants were instructed that, if they were unsure, they should go with their “gut instinct”. When participants had completed their allotted judgements (20 for mathematicians, 10 for undergraduates), they exited the judging platform. (Raw data, materials, and analysis scripts are available online; see Footnote 3.)

The data were analysed using the comparative judgement approach, based on the Bradley-Terry model, described in Section 4. The method produced \(\beta\) estimates for each judged item, capturing the perceived explanatoriness of each explanation, separately for each group. These \(\beta\)s are unitless, so can only be interpreted in relation to other \(\beta\)s on the same scale. The Bradley-Terry model also produces a standard error associated with each \(\beta\), which captures the precision with which the \(\beta\) has been estimated.

Results

Each group produced highly reliable \(\beta\) estimates. The split-half inter-rater reliability coefficient was \(r=.781\) for the mathematicians and \(r=.796\) for the undergraduates, indicating that the within-group agreement about the quality of the various explanations was high.
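One common way to obtain a split-half coefficient of this kind is sketched below, under stated assumptions: judges are repeatedly divided at random into two halves, each half's judgements are scored separately, and the two sets of scores are correlated. The exact procedure used in the study is described in Section 4 and may differ. For brevity the sketch scores items by their proportion of wins rather than by Bradley-Terry estimates; the `score_fn` argument marks where a model fit would be substituted. All names and data here are hypothetical.

```python
import math
import random
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def win_proportions(n_items, judgements):
    """Stand-in scoring rule: share of its comparisons that each item won."""
    wins = [0] * n_items
    seen = [0] * n_items
    for _judge, winner, loser in judgements:
        wins[winner] += 1
        seen[winner] += 1
        seen[loser] += 1
    return [wins[i] / seen[i] if seen[i] else 0.5 for i in range(n_items)]


def split_half_reliability(n_items, judgements, score_fn=win_proportions,
                           n_splits=100, seed=0):
    """Median correlation between scores from random halves of the judges."""
    rng = random.Random(seed)
    judges = sorted({judge for judge, _winner, _loser in judgements})
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(judges)
        first_half = set(judges[: len(judges) // 2])
        part_a = [j for j in judgements if j[0] in first_half]
        part_b = [j for j in judgements if j[0] not in first_half]
        correlations.append(
            pearson(score_fn(n_items, part_a), score_fn(n_items, part_b))
        )
    return median(correlations)


# Hypothetical data (not from the study): four judges who all rank four
# items identically, each judging every pair once as (judge, winner, loser).
all_pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
example_judgements = [(judge, i, j) for judge in range(4) for i, j in all_pairs]
reliability = split_half_reliability(4, example_judgements)
print(round(reliability, 3))
```

With judges in perfect agreement, as in this toy data, the two halves produce identical scores; disagreement among judges pulls the coefficient down.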

The \(\beta\)s for each explanation produced by each group are shown in Fig. 1. Overall, the correlation between the two groups was high, at \(r=.85\), indicating that the two groups largely agreed with each other about which explanations were better and which were worse.

Fig. 1

The perceived quality of each explanation (the \(\beta\) estimates) produced by the mathematicians and undergraduates. Note. Error bars show ±1 standard error. The numbers next to each point indicate the explanation represented by that point (see the Appendix for the full list of explanations). Note that the units on the x and y scales are not comparable.

Inspecting Fig. 1 indicates that both groups agreed that Explanation 2—which attempted to explain vector spaces in terms of enriched Abelian groups—was the least explanatory. The two groups also both disliked Explanation 10, which attempted to use the geometry of the academic’s office to introduce the notion of linearly independent vectors, and then generalised to abstract vector spaces.

We operationalised disagreement between the two groups as an explanation’s \(\beta\) estimate lying more than two standard errors away from the regression line shown in Fig. 1. There were only two such cases. First, the mathematicians perceived Explanation 9 to be more explanatory than the undergraduates did. This explanation focused on using geometric analogies and then pointed out that vector spaces are useful for understanding the properties of solutions to partial differential equations (PDEs) using geometric ideas, images, and pictures. While this utility consideration would make sense to a mathematician, most students taking a linear algebra course would not be familiar with the theory of PDEs. Second, towards the top end of the scale, the undergraduates rated Explanation 6 as the most explanatory, whereas it was only the fourth most explanatory for the mathematicians. Explanation 6 focused on connecting the notion of a vector space to the properties of mathematical objects that undergraduates could reasonably be expected to be very familiar with (addition and multiplication of numbers, combining functions, adding matrices, etc.). It went on to explain that the purpose of the notion of a vector space is to extract the commonality between these various concepts. Notably, it was the only explanation to start with an informal, hand-wavy definition: “a vector space is just a collection of objects, together with a way to combine those objects, and a list of rules that govern how we combine them”, which may have earned it the undergraduates’ appreciation.
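The two-standard-error rule described above can be illustrated with a short sketch. It rests on assumptions that the text does not settle: the regression is taken to be an ordinary least-squares fit of one group's estimates on the other's, distance from the line is measured vertically, and the standard error used is that of the estimate on the vertical axis. The function names and the numbers are hypothetical and are not the study's data.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mx) * (b - my) for a, b in zip(x, y))
        / sum((a - mx) ** 2 for a in x)
    )
    return slope, my - slope * mx


def flag_disagreements(x, y, se_y, threshold=2.0):
    """Indices of items whose vertical distance from the fitted line
    exceeds `threshold` standard errors of the y-axis estimate."""
    slope, intercept = fit_line(x, y)
    flagged = []
    for index, (a, b, se) in enumerate(zip(x, y, se_y)):
        residual = b - (slope * a + intercept)
        if abs(residual) > threshold * se:
            flagged.append(index)
    return flagged


# Hypothetical estimates for five explanations (not data from the study).
group_a_betas = [-2.0, -1.0, 0.0, 1.0, 2.0]
group_b_betas = [-2.0, -1.0, 1.5, 1.0, 2.0]
group_b_se = [0.25] * 5
disagreements = flag_disagreements(group_a_betas, group_b_betas, group_b_se)
print(disagreements)
```

In this toy data only the third item sits well off the fitted line, so it is the only one flagged. Under a different choice of axis or standard error, the set of flagged items could change.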

Despite these two disagreements, the overall message of our study was one of strong agreement. Using a comparative judgement technique, both the mathematicians and undergraduates were able to reliably assess the quality of these mathematical explanations, in the sense that the members of each group tended to agree with other members of the same group about the explanatoriness of each explanation. Furthermore, the two groups agreed with each other. The mathematicians were largely able to predict how the undergraduates would assess explanation quality, and vice versa.

Discussion

Summary of main findings

Our goal in this paper was to conduct a bottom-up investigation of the quality of mathematical explanations. To this end, we created a small corpus of mathematical explanations, and examined whether mathematicians and undergraduates were able to reliably assess their quality using a comparative judgement approach. We found that both groups showed reasonably high levels of reliability, in the sense that the split-half inter-rater reliability coefficients were well above 0.7 in both cases.

Given this, we asked whether the two groups agreed with each other: in other words, do mathematicians tend to assess explanatoriness in a similar fashion to undergraduates? We found that the two groups’ assessments of the explanations in our corpus were strongly correlated, suggesting that—notwithstanding some differences—the two groups have a shared understanding of what makes a high- and a low-quality mathematical explanation.

Study limitations

Several limitations of our study may have influenced our results. The first concerns the methodological decisions made in the design of the explanations by the participating mathematicians. Some of the shorter explanations produced by the participating mathematicians (such as Explanations 5, 8, 9, and 10) present as outlines (plans) of an explanation, stating the intentions but omitting details. For example, Explanation 9 starts with “I would give geometric explanations and analogies; and show with pictures how the geometric explanations works for all vectors.” This leaves room for various interpretations around the implementation of such an intent, thus translating into more subjective judgements. Given this, it is perhaps surprising that we found such high levels of between-participant agreement.

Second, no information was available about the domain-specific knowledge of the participating undergraduates, other than that they were mathematics majors who had recently completed a linear algebra course. Potentially large differences in participants’ levels of domain knowledge or mathematical experience could have influenced their judgements, which we would expect to suppress observed agreement levels. Future studies could productively investigate whether individual differences influence the type and nature of students’ (and mathematicians’) judgements of explanatoriness.

Moreover, Explanations 1–10 might not have been sensitive enough to capture differences in participants’ perceptions as they were not systematically designed to test variations. Systematically varying aspects of explanations might be worthwhile in future investigations in order to test factors that characterise explanation quality in mathematics. Furthermore, future studies could benefit from increasing the relatively small sample size used in our investigation.

Validity

Although we found strong evidence of the reliability of mathematicians’ and undergraduates’ judgements about explanation quality, a question remains about whether those explanations which our participants perceived to be the most explanatory are actually the most explanatory. Answering this question requires us to—in a top-down fashion—independently specify what explanatoriness consists in.

As noted in Section 2, philosophers typically adopt either ontic or epistemic accounts of mathematical explanation. What would theorists from each of these camps think about the explanations in our corpus? For ontic theorists, this seems a difficult question. As D’Alessandro (2020) noted, ontic theorists have typically assumed that explanations in mathematics must involve proofs, that is, that the only things which can be explanatory in mathematics are explanatory proofs. D’Alessandro (2020) critiqued this view, which he labelled ‘proof chauvinism’. Clearly, the explanations in our corpus are not proofs, so it is difficult to know what Steiner (1978) or Kitcher (1984) would make of them (see Footnote 4). One approach would be for ontic theorists to deny that our explanations are in fact explanations, and perhaps instead to consider them to be mere motivations of the vector space definition. While this approach would have the advantage of being consistent with their conceptualisation of explanation, it has the significant disadvantage of being inconsistent with mathematical practice. The mathematicians who generated the explanations in our corpus were asked “what explanation would you give the student?” They were not asked how they would “motivate the definition.” If we wish our conceptions of mathematical explanation to be consistent with mathematical practice, they ought to be applicable to the kinds of things mathematicians produce when asked to explain.

Epistemic theorists have an easier time when considering the validity of our participants’ explanatory judgements. If explanations are defined to be those things which generate understanding, then better explanations are going to be those explanations which generate more understanding. Wittwer and Renkl’s (2008) framework provides a top-down method for us to consider the extent to which each of the explanations in our corpus is likely to generate understanding for a typical undergraduate, and therefore allows us to interrogate the validity of our participants’ judgements.

We consider each of Wittwer and Renkl’s (2008) criteria in turn. Recall that each explanation is given in the Appendix.

  1. Explanations should take account of learners’ existing knowledge. It seems clear that the lowest-scoring explanation, Explanation 2, falls foul of this criterion. Explanation 2 references Abelian groups, fields, enrichments, rings, and algebras over a ring. Given the context specified in the prompt (an undergraduate taking an introductory linear algebra course), it seems very unlikely that the explanation’s recipient would be familiar with many of these more advanced mathematical concepts. In a similar vein, Explanation 9, which refers to the theory of PDEs, did not score highly in the ranking.

  2. Explanations should focus on concepts and principles. A vector space is an abstract notion that captures a range of mathematical objects that all behave in the same way, in the sense that if a property follows from the vector space axioms, then all vector spaces will have that property. This extraction of common mathematical structure seems a central concept/principle which accounts for why mathematicians are interested in vector spaces. Notably, this idea is entirely absent from Explanation 2, is not clearly included in Explanation 10, and is not clearly expressed in Explanations 8 and 9. These four explanations were rated particularly poorly by our participants. In contrast, all the other explanations reference this key concept/principle in one form or another. Explanations 1, 4, 5, and 6 (all high scoring for both groups) are extremely explicit about this. For example, Explanation 5 included the line “by proving (or knowing) something about a vector space we know it about all of those examples [discussed earlier in the explanation].”

  3. Explanations should be integrated into the learner’s ongoing cognitive activities. Good explanations encourage active engagement by the recipient. Interestingly, Explanations 1, 4, and 5 (three of the high-scoring explanations for both groups) all included some kind of interactive element. For example, Explanation 4 involved rhetorical questions (“Now you might ask in what ways are [real polynomials and \(\mathbb {R}^n\)] similar?”), and Explanation 1 involved multiple possibilities for the student receiving the explanation to contribute (“Is W a vector space? Let’s check: closed under addition—yes.”). Explanation 5 emphasised how the explanation would be contingent on how the student responded (“I would [\(\ldots\)] point out the common properties. Once the student is happy with those, we would say that “something” with all of those properties is called a vector space.”). In contrast, the lowest-scoring explanations did not seem to require the student to interact with the explanation or the explainer at all.

  4. Explanations should not replace learners’ knowledge-construction activities. The final criterion proposed by Wittwer and Renkl (2008) concerned when it would be preferable for instructors to withhold explanations from students, and so does not directly apply to the context specified in our vignette.

In sum, these considerations suggest that the judgements made by the mathematicians and undergraduates in our study were not inconsistent with Wittwer and Renkl’s (2008) framework. Indeed, in some important ways, the higher-scoring explanations seemed to meet the criteria, and the lower-scoring explanations seemed not to. However, these discussions represent, at best, highly indirect evidence of validity. Future studies which directly compare explanatory judgements of different explanations with the extent to which those explanations generate student understanding would be extremely worthwhile. If comparative judgement could be established as both a valid and reliable way of assessing the quality of instructional explanations in mathematics, the method could be harnessed to help improve both classroom instruction and instructional materials such as textbooks and lecture notes.

Developing explanatory skills

Supposing that mathematicians are able to reliably and validly judge the quality of mathematical explanations, this raises a puzzling question. Why do some mathematicians produce low-quality explanations when prompted? Consider Explanation 2, the lowest-ranked explanation. Of the 62 comparisons made by mathematicians involving Explanation 2, it ‘lost’ 92% (in the sense that the explanation it was paired with was deemed the better in 92% of pairings). Given this level of consensus about its low quality, why did the mathematician who wrote it consider it to be appropriate?

One answer might be to suggest that producing high-quality explanations may be considerably harder than assessing the quality of explanations. Analogously, it is possible for critics to assess the quality of novels, even though they may not be able to produce a novel themselves. If this is right, then it raises the question of how teachers and lecturers can be helped to produce better explanations.

We believe that the type of comparative judgement session that we used here for research purposes could serve as a starting point for a productive form of professional development for teachers and lecturers. Comparison has been shown to be an effective pedagogical strategy in other contexts. Rittle-Johnson et al. (2020) recently reviewed the evidence concerning how comparing different problem-solving strategies for the same mathematical problem can support students’ learning (see also Alfieri et al.’s (2013) meta-analysis). It is typically suggested that promotion of analogical reasoning is the mechanism behind this effect: by comparing two examples, students are able to create an analogy between them, which helps them to attend to structural similarities and differences. This, in turn, may facilitate transfer to new situations (Gentner et al., 2003; Gick & Holyoak, 1983). If this is correct, then this mechanism would seem to be as applicable in the context of teachers and lecturers comparing different instructional explanations (along with explicating and reflecting on their differences) as it is in the context of students comparing different problem-solving strategies.

If comparative judgement were to be harnessed to develop professional development materials for teachers and lecturers who want to develop their ability to produce instructional explanations, then there is a large literature on which to draw. For instance, Rittle-Johnson et al. (2020) pointed out that including various instructional supports, such as presenting examples side by side (as in our study), including cues to guide attention to important similarities and differences, and including self-explanation prompts at relevant points, all facilitate learning from comparison.

Implications for the philosophical accounts of mathematical explanation

Finally, we briefly comment on the implications of our results for philosophers interested in mathematical explanation. We first re-emphasise that all the mathematicians in these studies were either willing to produce non-proof explanations, or to compare non-proof explanations. It therefore seems clear that any account of explanation in mathematics—or at least any account that wishes to remain faithful to the practices of mathematicians—must, as argued by D’Alessandro (2020), be able to account for explanations which are not proofs. Our view is that epistemic accounts such as Delarivière et al.’s (2017) and Inglis and Mejía-Ramos’s (2021) can do this with relative ease. In contrast, the challenge for ontic accounts seems much greater.

More generally, the evidence presented here suggests that mathematicians and undergraduates have a shared conception of what constitutes an effective explanation in mathematics, at least in the context of these rather simple explanations. This finding therefore serves as a foundation for designing and undertaking further investigations. Future work should explore whether this remains the case when different types of explanations, including proofs, are considered. If it does, then we see the goal of philosophical accounts of mathematical explanation as being to produce an accurate description of what this shared conception actually is. This may necessitate a consideration of distinct taxonomic groups (such as definitions, theorems, proofs, and problem-solving procedures) within which mathematical explanations can be classified with respect to their quality. Having a method available (comparative judgement) which seems to be able to reliably measure mathematicians’ explanatory judgements will allow existing and future accounts to be tested empirically.