The main purpose of the present thesis is to investigate first-year university students’ understanding of the generality of mathematical statements and its relation to proof reading and construction (see Fig. 4.1 in Chapter 4). The respective research questions were structured through my adapted version of the framework on proof-related activities introduced by Mejía Ramos and Inglis (2009b), in which I suggest explicitly considering the reading of the statement that has to be proven or for which a proof has to be read. Reading a statement involves comprehending it, including, among other aspects, its generality. I defined understanding the generality of statements as giving consistent responses regarding the estimation of truth and the existence of counterexamples, which was then used to operationalize this understanding. To investigate the relation to proof reading and construction, students’ performances in the relevant activities—estimation of truth, proof evaluation regarding conviction, proof comprehension, and proof construction—were considered. Moreover, the experimental design of my study particularly aimed at analyzing the influence of the type of argument as well as the type of statement (truth value and familiarity) on students’ understanding of the generality of statements and their proof skills.
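The consistency-based operationalization described above can be illustrated with a minimal sketch. The response codes and the function name below are hypothetical and chosen for illustration only; the actual coding scheme is the one described in the methods chapter:

```python
# Hypothetical illustration of the consistency-based operationalization:
# a pair of responses is coded as showing a "correct understanding of
# generality" iff the truth estimation and the judgment about the
# existence of counterexamples are logically consistent.

def consistent_understanding(truth_estimation: str, counterexamples_exist: str) -> bool:
    """Return True if the two responses are logically consistent.

    truth_estimation:      "true", "false", or "dont_know"
    counterexamples_exist: "yes", "no", or "dont_know"
    """
    if "dont_know" in (truth_estimation, counterexamples_exist):
        # In the study, "I have no idea" responses were treated as missing.
        raise ValueError("'don't know' responses are treated as missing")
    # A universal statement is true iff no counterexample exists.
    return (truth_estimation == "true") == (counterexamples_exist == "no")

print(consistent_understanding("true", "no"))   # consistent
print(consistent_understanding("true", "yes"))  # inconsistent
```

A student who judges a universal statement to be true while simultaneously allowing that counterexamples may exist would thus be coded as responding inconsistently.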

In the following, the results presented in the previous chapter are interpreted and discussed in the context of prior research. Thereby, I follow the structure of the four sets of research questions derived in Chapter 4. Further, the adapted framework on proof-related activities, methodological decisions, and potential limitations of this study are discussed in Section 7.2. Lastly, the main implications of the results for the learning and teaching of proof at the transition from school to university, and for future research, are presented in Sections 7.3 and 7.4, respectively.

7.1 Interpretation

In the following, the research questions are answered one by one and the results are interpreted and discussed in relation to prior research.

7.1.1 Estimation of Truth and Proof Evaluation Regarding Conviction

The first set of research questions focused on students’ performance in estimating the truth of statements and proof evaluation regarding conviction:

RQ1: Conviction of the truth of universal statements and its relation to reading different types of arguments

  1. RQ1.1:

    How do the type of argument and the type of statement influence students’ estimation of the truth of universal statements?

  2. RQ1.2:

    How do the type of argument, the type of statement, and the level of comprehension influence how convincing students find different types of arguments? What aspects of mathematical arguments do students identify as not convincing?

As researchers and prior studies have suggested (e.g., Barkai et al., 2002; Buchbinder & Zaslavsky, 2007; Dubinsky & Yiparaki, 2000; Hanna, 1989; Ko, 2011), the type of statement, in particular the statement’s truth value but also the familiarity with the statement, affected students’ estimation of truth. The falsity of a statement had a negative effect, and being familiar with a statement had a (smaller) positive effect. These results were not surprising, but they provide clear experimental evidence for what has already been suggested in the literature. The comparatively small effect of familiarity can mainly be explained by students’ estimation of truth of the Pythagorean theorem. A comparatively large percentage of participants was unsure about the truth value of this statement, even though they should be very familiar with it. The fact that the Pythagorean theorem was expressed in natural language and not as an equation is most likely the reason why some students seemed not to have recognized the statement and therefore had difficulties with estimating its truth value. This can be seen as a limitation (see also Section 7.2), but it also provides information regarding students’ content-specific knowledge and level of comprehension of mathematical statements they have been taught in school.

Reading empirical arguments (and generic proofs) supports students in estimating the truth value of statements.

The type of argument affected students’ estimation of truth. Participants who received empirical arguments (and generic proofs, with a smaller effect that did not reach significance after Holm’s correction) were more likely to correctly estimate the truth value of (true) statements than participants who received no arguments. No prior studies on the influence of different types of arguments on students’ estimation of truth existed, which made it difficult to formulate a hypothesis. However, students as well as professional mathematicians use empirical arguments to estimate the truth value of statements (e.g., Alcock & Inglis, 2008; Buchbinder & Zaslavsky, 2007; Lockwood et al., 2016), possibly because these experimental investigations provide a better understanding of the statement and a better intuition regarding its truth value (see also de Villiers, 2010). This is in line with findings reported by Bieda and Lepak (2014) that empirical arguments provide students with more information and enhance their comprehension of the statement in comparison to ordinary proofs. My findings now provide strong evidence that empirical arguments (and to a lesser degree generic proofs) may indeed help students to better understand mathematical statements and therefore lead to better performance in the estimation of truth. Furthermore, reading any of the considered types of arguments, in particular empirical arguments, seems to make participants more likely to choose an answer different from “I have no idea”. This strengthens the assumption that empirical arguments may provide participants with a better understanding of the statement, or at least give them the feeling of understanding it better.
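For readers unfamiliar with the multiple-comparison adjustment mentioned above, Holm’s step-down procedure can be sketched as follows. The p-values are illustrative only and are not the study’s actual results:

```python
# A minimal sketch of the Holm-Bonferroni step-down procedure:
# the k-th smallest of m p-values is compared against alpha / (m - k),
# and testing stops at the first non-rejection.

def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: is H_i rejected after Holm's correction?"""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: all remaining hypotheses are retained
    return reject

print(holm_reject([0.01, 0.04, 0.03]))  # -> [True, False, False]
```

This illustrates how an effect that is significant at the raw 5% level (here p = 0.03 or p = 0.04) can fail to reach significance once the correction for multiple tests is applied.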

The second research question in this set focused on students’ evaluation regarding conviction. Similar to the findings on students’ estimation of truth, the type of statement also affected students’ conviction. Participants were less likely to find the arguments convincing regarding the false (unfamiliar) statement than regarding the true (unfamiliar) statements, which would be expected, because the respective proofs were in fact incorrect. Further, participants were more likely to be convinced by the arguments regarding the familiar statements than the unfamiliar (true) statements, even though this effect was comparatively smaller than the effect of the truth value. A positive effect of familiarity with the statement on students’ conviction was also expected, because the role of familiarity for the acceptance of proof (which most likely influences conviction) has been highlighted in the literature, as already mentioned (e.g., Hanna, 1989). However, prior studies had not found corresponding evidence (e.g., Kempen, 2021; Martin & Harel, 1989).

The type of argument also affected students’ conviction. In line with prior research (e.g., Kempen, 2021; D. Miller & CadwalladerOlsker, 2020; Weber, 2010), participants who received generic or ordinary proofs were more likely to find these arguments convincing than students who received empirical arguments. However, in a comparatively high percentage of observations (about 25%), participants nevertheless claimed to be completely convinced by empirical arguments. Those who were not (completely) convinced by the empirical arguments were asked to explain why. Participants most often referred to a lack of generality of the arguments (78% of observations), which indicates that the majority of these students are not only aware of the limitations of empirical arguments, but also understand, at least to some extent, why these limitations exist. In contrast, in a study conducted by Ufer et al. (2009), only about one third of the participating high school students could “adequately” explain why an empirical argument is not valid. However, in the present study, the 78% reported above refer only to those participants who were not or only partially convinced by these arguments and who responded to the open question. Further, the responses were not coded for the adequacy of the explanations, but with respect to the aspects mentioned (here, generality). Thus, the comparability of the results may be limited, also because of differences regarding the age and experience of the participants (high school students vs. first-year university students).

As discussed in Section 3.2.3, prior research findings on students’ and teachers’ evaluation of generic proofs have been ambiguous. Some studies found that many teachers are not convinced by generic proofs, for instance, because of a perceived lack of generality and modes of representation that do not meet the criteria for proof (e.g., Lesseig et al., 2019; Tabach, Levenson, et al., 2010). The majority of participants in the present study seemed to be (at least partially) convinced by generic proofs; participants claimed to not find these arguments convincing at all in less than 25% of the observations. Moreover, a lack of generality was indeed mentioned more often as a reason for not finding the arguments (completely) convincing by participants who received generic proofs than by those who received ordinary proofs (12% vs. 4%). In contrast to the findings reported by Tabach, Barkai, et al. (2010), the mode of representation was mentioned only occasionally regarding generic proofs (6%), and not at all regarding ordinary proofs. Further, participants in the present study were more convinced by the ordinary proofs than by the generic proofs, which experimentally confirms findings reported by Kempen (2018), for instance.

The results of the present study furthermore clearly confirm the influence of students’ (self-reported) proof comprehension on their (self-reported) conviction (as has been reported by Weber, 2010, for instance): Participants with higher levels of (self-reported) proof comprehension were also more likely to claim to be convinced by the arguments. Moreover, the results of the content analysis of aspects students identified as not convincing also highlight these findings: Insufficient comprehension of the statement or proof was the reason why most of the participants claimed to be not convinced by generic or ordinary proofs (64% and 81%, respectively). While these findings may not be surprising and confirm prior research findings (Ko & Knuth, 2013; Sommerhoff & Ufer, 2019), they nevertheless emphasize the strong relation between proof comprehension and proof evaluation regarding conviction.

Self-reported conviction of arguments does not reflect actual conviction of the truth of statements.

While participants showed higher levels of self-reported conviction regarding generic and ordinary proofs compared to empirical arguments, the participants who received empirical arguments (and generic proofs) were more likely to correctly estimate the truth value of (true) universal statements than those participants who received no arguments. Differences regarding ordinary proofs were not significant. This suggests that participants assume that ordinary (and generic) proofs should generally be convincing, in particular compared to empirical arguments, but that empirical arguments (and potentially generic proofs) actually provide higher levels of conviction regarding the truth of the statement. This finding highlights the gap between self-report—which can potentially be influenced by social desirability, for instance—and reality (Golke, Steininger, & Wittwer, 2022) and is particularly relevant for the construction of future questionnaires. I come back to this finding and its implications in Section 7.4.

7.1.2 Comprehension of Arguments

The second set of research questions aimed at investigating students’ (self-reported) proof comprehension, in particular regarding differences between generic and ordinary proofs:

RQ2: Proof comprehension

  1. RQ2.1:

    How does students’ (self-reported) proof comprehension differ between students who receive generic proofs and those who receive ordinary proofs? How does the familiarity with the statement influence students’ proof comprehension?

  2. RQ2.2:

    What aspects of mathematical arguments do students identify as not understandable? How do these aspects differ regarding generic and ordinary proofs?

Participants show higher levels of proof comprehension regarding generic proofs than regarding ordinary proofs.

Based on prior experimental studies (e.g., Lew et al., 2020), it was hypothesized that no significant differences between students’ comprehension of generic and ordinary proofs exist. Therefore, the finding that participants who received ordinary proofs were less likely to claim to have understood these arguments than participants who received generic proofs was unexpected. In contrast to other studies, for instance, the one by Lew et al., the present study relied on students’ self-reports of their proof comprehension. Thus, the participants who received generic proofs might not actually have understood these proofs better; these types of arguments might just have appeared more comprehensible to them. Previous research has indeed found that mathematics students often inaccurately assess how well they have understood a proof (A. Selden & Selden, 2003). However, it could also be the case that generic proofs do provide students with better understanding, as other researchers have suggested (Dreyfus et al., 2012; Malek & Movshovitz-Hadar, 2011; Mason & Pimm, 1984; Rowland, 2001). Given that generic proofs supported students’ correct estimation of truth (see the finding above), they may indeed also help with the comprehension of proof. Further (experimental) studies that do not rely solely on students’ self-reports are needed to definitively answer this question.

Unexpectedly, the familiarity with the statement had a negative effect on students’ proof comprehension. This result is surprising, because the participants had encountered these statements, and potentially the proofs, before during school and should also be more familiar with the underlying theories of these statements. However, the statements did not only differ with respect to familiarity, but also regarding their content domains: The familiar statements were from geometry and the unfamiliar statements from elementary number theory. Thus, participants most likely perceived the proofs of the familiar statements as more difficult because of the content domain (geometry), not because of the familiarity itself.

With respect to the second research question in this set, participants mainly referred to local aspects, such as not having understood the terms, statements, equations, and/or illustrations used in the proof, which is in line with prior research (e.g., Conradie & Frith, 2000; Moore, 1994; Neuhaus-Eckhardt, 2022; Reiss & Heinze, 2000) and was therefore expected. Moreover, in a comparatively high percentage of observations, participants seemed to have not understood the statements themselves, for instance, the meaning of simple terms such as odd or even numbers, product, or the square of the legs (in German, Kathetenquadrat), and the explanations included in the proofs seemed not to have clarified these terms for the participants. This observation is of practical relevance for the teaching of proof, because it emphasizes the need to first ensure a sufficient understanding of the statements and relevant terms before a proof is presented and discussed, or before students are asked to prove a claim. One would assume that this is obvious, but lecturers might not be aware of the extent to which students have difficulties with simple terms and the comprehension of statements.

In contrast to participants who received ordinary proofs, participants who received generic proofs stated on several occasions (14%) that they had not understood why these proofs are general, which is in line with the findings on aspects participants identified as not convincing reported above. As was expected, participants who received generic proofs also referred to the proof framework slightly more often than participants who received ordinary proofs when asked what they did not understand about the argument (8% vs. 5%). Overall, however, this aspect was not mentioned frequently. Most likely, students generally have limited experience with proof and proving (as suggested by prior research, for instance, Hemmi, 2008; Kempen & Biehler, 2019) and therefore rarely consider the general proof idea, focusing instead on surface features, as has been reported in the literature (e.g., A. Selden & Selden, 2003). Further, a smaller percentage of participants referred to not having understood particular statements, equations, or illustrations used in the generic proofs than in the ordinary proofs (33% vs. 58%), but comparatively more participants did not understand the statements themselves when reading generic proofs than when reading ordinary proofs (24% vs. 13%). This does not necessarily mean that participants who received ordinary proofs had actually understood the statements better; it could also mean that reading generic proofs more often reveals an insufficient understanding. Further research would be needed to investigate this hypothesis.

7.1.3 Justification: Students’ Proof Schemes

The first experimental group did not receive any arguments but instead had to justify why they thought the statements were true or false. This group therefore served as a control group regarding the influence of reading arguments and provided data to answer the third set of research questions, which aimed at analyzing students’ proof schemes:

RQ3: Construction of arguments to justify the truth of universal statements (students’ proof schemes)

  1. RQ3.1:

    What types of arguments do students themselves use to justify the truth or falsity of a universal statement? How do students’ proof schemes differ regarding the type of statement (i.e., familiarity and truth value)?

  2. RQ3.2:

    What potential relation between the type of argument used by students and the level of conviction of the truth of the statement exists?

As was expected based on prior research findings (e.g., Barkai et al., 2002; Bell, 1976; Recio & Godino, 2001; Sevimli, 2018; Stylianou et al., 2006), empirical proof schemes could be observed most often when participants were asked to justify the truth of the unfamiliar statements (45%). In contrast, participants mainly showed external proof schemes regarding familiar statements (86%). These participants often made reference to authorities (24%), such as school or university, claimed the statement is a general rule (37%), or gave pseudo-arguments (25%). Moreover, the majority of participants used counterexamples to correctly refute the false statement (about 51%). Expectedly, deductive proof schemes were much rarer, but could be observed more often regarding the unfamiliar statements from elementary number theory than the familiar statements from geometry. Most likely, not only the (un)familiarity with the statements but also the different content domains account for these differences (see Section 7.2 for a further discussion). Transformative arguments, such as generic proofs, were used only 5 times, and all of these were incomplete. I want to highlight again that the participants were not explicitly asked to (dis)prove the statements, but to justify why they think the statements are true or false, similar to what has been done by Barkai et al. (2002), for instance. Thereby, the aim was to gain insights into the types of arguments that convince students of the truth or falsity of universal statements, in the sense in which Harel and Sowder (1998) defined proof schemes. Thus, these results might not be comparable to those of other studies in which students were explicitly asked to construct a proof, for instance, the studies conducted by Recio and Godino (2001) and Stylianou et al. (2006), even though these studies also found that many students fail to construct valid deductive proofs and often give empirical arguments instead.

A relation between students’ proof schemes and their level of conviction was identified.

Participants who gave empirical arguments were generally only relatively convinced of the truth of the respective statements (in 100% of the observations regarding the familiar statements and in more than 60% regarding the unfamiliar statements). Thus, as argued by Weber and Mejia-Ramos (2015), one should not automatically worry about students’ usage of empirical arguments if they do not gain absolute conviction from these arguments. There was still a comparatively large percentage of participants with empirical proof schemes who seemed to have gained absolute conviction of the truth of the statements; for these students, the usage of empirical arguments might be problematic. However, the fact that these participants gave empirical arguments does not necessarily mean that these arguments were the only source of their conviction in the truth of the statement. Further research is needed to investigate why some students seem to gain absolute conviction in the truth of a statement from empirical arguments and what other factors may influence students’ (level of) conviction. The findings of the present study suggest that students who construct (complete or incomplete) deductive arguments have high levels of conviction of the truth of statements: Participants with deductive proof schemes were in fact almost all absolutely convinced of the truth of the true familiar and unfamiliar statements. However, it cannot be derived from these findings that the construction of deductive arguments (automatically) leads to absolute conviction, because other factors might have played a role as well. The respective participants might have been convinced of the truth of the statements before they even attempted to prove them (for instance, because they were familiar with the statements), as has been pointed out by Polya (1954). Notably, external proof schemes seemed to provide most students with absolute conviction as well. But given that participants mainly had external proof schemes regarding familiar statements, the familiarity with these statements may be mainly responsible for the high levels of conviction.

7.1.4 Understanding the Generality of Statements

Finally, the last set of research questions addressed the focus of the present thesis, namely students’ understanding of the generality of mathematical statements and its relation to proof reading and construction:

RQ4: Students’ understanding of the generality of mathematical statements

  1. RQ4.1:

    What proportion of first-year university students have a correct understanding of the generality of statements?

  2. RQ4.2:

    What is the influence of reading different types of arguments on students’ understanding of the generality of mathematical statements? How does the type of statement influence students’ understanding of its generality?

  3. RQ4.3:

    How does students’ comprehension and conviction of arguments influence their understanding of generality of statements?

  4. RQ4.4:

    What potential relation exists between students’ proof schemes and their understanding of the generality of statements?

In 64% of all observations (about 68% if “don’t knowers” are excluded), participants showed a correct understanding of the generality of mathematical statements. The percentage of students with a correct understanding of the generality of statements thereby differed with respect to the study program: Overall, the higher the level of mathematics in the chosen study program, the higher the percentage of students with a correct understanding of generality. This can mainly be explained by differences in prior knowledge and experience (e.g., attendance of an honors course) and general cognitive skills (e.g., CRT score).

Understanding the generality of statements is not solely determined by students’ knowledge of the meaning of mathematical generality, but is positively related to it.

Most predictive of students’ understanding of the generality of statements was their knowledge of the meaning of mathematical generality. However, since a comparatively large percentage of students with correct knowledge of the meaning of generality still responded inconsistently regarding the estimation of truth and the existence of counterexamples, solely knowing what mathematical generality means is not sufficient for a consistently correct understanding of the generality of statements.

The percentage of observations in which participants responded inconsistently to the two relevant questions also differed by the type of statement, which further indicates that the understanding of the generality of statements is not solely determined by students’ knowledge of the meaning of generality. Participants were more likely to have a correct understanding of the generality of the false and the familiar statements than of the (true) unfamiliar statements, as was expected. The content analysis of students’ proof schemes further suggests that most students who correctly refuted the false statement most likely did so because they found one or more counterexamples. These participants therefore knew that a counterexample exists, which proves the falsity of the statement, and consequently responded more often consistently regarding the truth of the statement and the existence of counterexamples. Similarly, as has been argued in Chapter 4, the participants have most likely applied the familiar statements to many arbitrary cases before, which might have made them more confident in the non-existence of counterexamples and therefore more likely to have a correct understanding of generality for these statements.
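The logical point underlying this argument, that a single counterexample suffices to refute a universal statement, can be made concrete with a small sketch. The claim used here is a classic illustrative example (Euler’s prime-generating polynomial), not one of the study’s actual items:

```python
# Why one counterexample refutes a universal statement: the hypothetical
# claim "for all natural numbers n, n**2 + n + 41 is prime" holds for
# n = 0, ..., 39 but fails at n = 40, which alone proves it false.

def is_prime(k: int) -> bool:
    if k < 2:
        return False
    return all(k % d for d in range(2, int(k ** 0.5) + 1))

def first_counterexample(limit: int = 100):
    """Return the first n refuting the claim, or None if none is found."""
    for n in range(limit):
        if not is_prime(n * n + n + 41):
            return n  # a single counterexample proves the claim false
    return None

print(first_counterexample())  # -> 40, since 40**2 + 40 + 41 = 41**2
```

This also illustrates why checking many confirming cases (here, forty of them) cannot establish the truth of a universal statement, whereas one failing case settles its falsity.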

Reading any type of argument mainly influenced students’ responding behavior in that they were less likely to answer “I have no idea” regarding the estimation of truth and the existence of counterexamples. Thus, reading an argument may at least give them the feeling of knowing enough to make a decision. However, reading generic or ordinary proofs did not lead to a higher probability of having a correct understanding of generality, as had been hypothesized. On the contrary: if anything, it made participants less likely to respond consistently to the respective questions. Reading empirical arguments seemed to have no significant effect on students’ understanding of generality in comparison to reading no arguments at all. The findings reported above indicate that reading ordinary proofs did not support students’ correct estimation of truth, potentially because students lack the knowledge to gain information and certainty from proofs. Because of their limited knowledge, reading proofs might actually make them more uncertain regarding the existence of counterexamples, which might explain the higher likelihood of an incorrect understanding of the generality of the statement. However, the significance of this effect is unclear.

With respect to the third research question, there seems to be a positive relation between students’ (self-reported) proof comprehension and their understanding of the generality of statements. After controlling for other individual resources, such as the CRT score and students’ knowledge of the meaning of generality, this effect was, however, smaller and did not reach significance. Thus, the relation between proof comprehension and students’ understanding of generality might at least partially be explained by other variables that influence both (for instance, the CRT score). In contrast, participation in an honors mathematics course during high school, which was predictive of students’ understanding of generality before proof comprehension was considered, had no significant effect once proof comprehension was included. Thus, participation in honors courses most likely provides students with better proof comprehension, or both are simultaneously influenced by a further variable. As the present study relied on students’ self-reports of their understanding of the arguments, the findings are limited (see Section 7.2.3 for a further discussion). Measuring students’ proof comprehension through assessment tests, as suggested by Mejía Ramos et al. (2012), for instance, might provide further insights into the relation between students’ proof comprehension and their understanding of the generality of statements.

The conviction of empirical arguments is related to students’ understanding of the generality of statements.

The findings reported in Section 6.5 on the influence of conviction on the understanding of generality are inconclusive. A clear negative effect was found regarding students’ conviction of empirical arguments: Participants who were not at all convinced by empirical arguments were more likely to have a correct understanding of the generality of statements than participants who were (at least partially) convinced by these arguments. A relation between students’ conviction of empirical arguments and their understanding of the generality of proof had been suggested by some researchers (e.g., Conner, 2022). My findings now provide clear evidence for a relation between students’ conviction of empirical arguments and their understanding of the generality of statements, which is presumably related to understanding the generality of proofs. The effect of conviction regarding generic or ordinary proofs, however, is less clear. There seemed to be a positive relation between students’ conviction by the argument and their correct understanding of generality; however, after including other (control) variables, this effect diminished and, moreover, participants who claimed to be not at all convinced by generic or ordinary proofs were then more likely to have a correct understanding of generality than those who claimed to be partially convinced by the proofs. Among the participants who claimed not to be convinced by the arguments, a high percentage answered “I have no idea” to the questions relevant for measuring students’ understanding of generality. These responses were treated as missing values, which might have affected the estimates of the regressions. Another explanation for this unexpected relation could be that participants with an overall good understanding of proof, and potentially a correct understanding of generality, tend to be either completely convinced or not at all convinced, but not partially. Or, vice versa, participants who do not have a good understanding of proof and generality might tend to answer that they are partially convinced, simply because they have no reference for what should convince them.

Students with empirical proof schemes are less likely to have a correct understanding of the generality of statements than students with deductive proof schemes.

In contrast to the effect of reading different types of arguments on understanding the generality of statements, a relation between students’ proof schemes and their understanding of generality was found. Participants with empirical proof schemes had an incorrect understanding of generality most often, compared to participants with any other proof scheme. The percentage of participants with a correct understanding of generality was highest among those with analytical proof schemes, in particular deductive ones; the percentage of participants with external proof schemes and a correct understanding of generality lay between these two groups. Overall, these differences were significant with a medium effect size. On the one hand, these results make sense in that participants who were able to construct a proof gained absolute conviction (as discussed above), which may simultaneously have made them more aware that no counterexamples exist, thus leading to a correct understanding of generality. On the other hand, it is interesting that students who give empirical arguments to justify universal statements respond inconsistently most often. It is consistent that they gain only relative conviction (as was found in this study), but this does not explain why their estimation regarding the existence of counterexamples is then inconsistent. These findings, combined with those regarding the influence (or absence thereof) of reading different types of arguments, could suggest that students who give empirical arguments generally have an insufficient understanding of proof, which also affects their understanding of the generality of statements, but that it is not the reading or construction itself that explains this relation.

7.2 Reflections and Limitations

The present study provides many new insights into students’ proof skills, in particular students’ understanding of the generality of statements. In the following sections, I reflect on my adapted framework on proof-related activities and several methodological decisions that were made in this study. In addition, I outline specific limitations of this study as well as more general limitations of empirical (field) research.

7.2.1 The Adapted Framework on Proof-Related Activities

The present study was based on the framework for proof-related activities presented in Section 3.2, which is an adapted version of the framework introduced by Mejía Ramos and Inglis (2009b). I chose to distinguish between activities that are related to the statements, which are to be proven or for which a proof is to be read, and activities that are related to the arguments that aim to justify the statements. Moreover, I proposed potential relationships between the activities (see Fig. 7.1; problem exploration was not explicitly considered in the present study).

Figure 7.1

Adapted framework on proof-related activities based on Mejía Ramos and Inglis (2009b); numbers refer to identified relationships

The adapted framework proved to be useful and coherent in this study. Further, my findings mainly confirm or at least highlight the presumed relationships. The content analysis of students’ responses regarding aspects they did not understand showed that comparatively many participants insufficiently comprehended the statements for which they received and read arguments. These participants were therefore not able to understand the arguments, and they would also not have been able to decide whether the arguments were valid proofs. Thus, not surprisingly, reading the statement with respect to comprehension is required for activities regarding the reading of given arguments (relationship 1 in Fig. 7.1). Similarly, without understanding the statement, students were not able to justify its truth or falsity, which mainly resulted in unclear responses regarding students’ proof schemes (relationship 2 in Fig. 7.1). Reading (particular types of) arguments affected students’ success in estimating the truth of the statements (relationship 3 in Fig. 7.1). In particular, reading empirical arguments (and, to a lesser degree, generic proofs) supported students in deciding whether the statements were true or false, likely because these arguments helped them to better understand the statements. Reading ordinary proofs, however, did not have this effect; if anything, it seemed to negatively influence students’ understanding of the generality of statements (as part of statement comprehension). Moreover, my findings provide evidence that strong relations also exist between different proof reading activities, for instance, between comprehension of the arguments and evaluation regarding conviction. Constructing (different types of) arguments to justify the truth or falsity of statements may also support students’ comprehension of statements and their success in estimating the truth (relationship 4 in Fig. 7.1).
Participants who were not able to give any argument, not even an empirical one, to justify the statements (coded as unclear regarding the proof scheme) were also most often unsuccessful in estimating the truth value of the statements. Conversely, most students who provided arguments correctly estimated the truth of the statements, at least with relative conviction, and the type of proof scheme was related to their level of conviction as well as to their understanding of the generality of the statements. Further research would be needed to explicitly investigate whether and how the construction of arguments supports the comprehension of statements and students’ success in estimating the truth value.

Overall, my adapted framework, in particular the distinction between activities related to statements and arguments, provides a useful basis for further research on proof-related activities and their relations.

7.2.2 Overall Research Design

In the present study, participants were randomly assigned to experimental groups (types of arguments), which was methodologically desirable. However, there might still have been a selection bias induced by the different types of arguments participants received: The percentage of students who chose not to answer or complete the questionnaire was higher for generic and ordinary proofs than for empirical arguments and no arguments, possibly due to the perceived difficulty of reading these proofs. Moreover, the percentage of “I don’t knowers” was lower among the participants who received any type of argument in comparison to participants who were not provided with arguments. Thus, the responding behavior of participants who did not drop out of the experiment early on was influenced by reading the arguments, in that it made them more confident in choosing an answer other than “I have no idea”. Therefore, while the missing values might limit my results to some extent, they also provide information on aspects that make participants less likely to be “I don’t knowers”, for instance, the reading of arguments, which can be useful for future studies.

Furthermore, some of the questions might have been redundant for participants who received empirical arguments, in particular the open-ended question on why they thought these arguments were not convincing, as some students pointed out in their responses. This could have resulted in participants perceiving the questions as too easy and getting bored, thus reducing test-taking motivation, which can lead to lower performance (e.g., Asseburg & Frey, 2013). To avoid this, an alternative approach to investigating the influence of the type of argument could have been taken. For instance, instead of providing participants with the same type of argument, each participant could have been given different types of arguments (as was briefly discussed in Section 5.1.1). However, such a design also comes with downsides. First, participants’ responses, particularly regarding conviction, could be influenced by the possibility of comparing the different types of arguments, which I did not aim for. Second, the number of items in such a questionnaire would have to be larger, because more than one mathematical statement for each type of argument would be necessary to draw conclusions about the influence of the type of argument; otherwise, the differences might result not primarily from the type of argument but from the specific statement. Such an approach could best be used in a laboratory experiment, where the conditions are more controllable (e.g., Döring & Bortz, 2016). In addition, laboratory experiments with financial rewards could increase test-taking motivation, even though the respective research is not consistent (e.g., Baumert & Demmrich, 2001; Braun, Kirsch, & Yamamoto, 2011; O’Neil, Sugrue, & Baker, 1995), resulting in better answer quality (e.g., Wise & DeMars, 2005) and the possibility of increasing testing time.
Alternatively, one could design different versions of the questionnaire, such that all combinations of statements and types of arguments are covered. This might, however, increase the required sample size, because the specific combination of type of argument and (type of) statement could also influence students’ responses. Given the framework of this study, the chosen experimental design seemed to be the best approach to the research questions, even considering the described limitations.

7.2.3 Conceptualization and Operationalization

Only few prior studies have investigated students’ understanding of the generality of statements, and to my knowledge, no studies have explicitly analyzed this understanding and its specific relation to proof construction and reading. Further, prior studies on students’ understanding of generality have mainly reported on students or teachers who were convinced of the correctness of the statement and/or proof but not convinced that no counterexample exists (Chazan, 1993; Knuth, 2002), or on students’ awareness that one counterexample disproves a universal statement (Buchbinder & Zaslavsky, 2019; Galbraith, 1981). I decided to consider participants’ responses regarding the estimation of truth and to relate them to their responses regarding the existence of counterexamples. An incorrect understanding of generality was then defined as inconsistent responses. This conceptualization might have limited the comparability of my results to the few prior studies that have been conducted. Moreover, other reasons for inconsistent responses cannot be ruled out and should also be discussed. For instance, participants responding inconsistently might lack logical reasoning skills, as the question regarding the existence of counterexamples was expressed implicitly via the negation of the statement. However, these students would nevertheless have an insufficient understanding of the particular statement, specifically regarding the existence of respective counterexamples. Moreover, the high correlation between students’ understanding of generality and their knowledge of the meaning of mathematical generality (which was assessed via a closed item) further indicates that the chosen conceptualization and operationalization of students’ understanding of the generality of statements indeed provided valid results.
The chosen approach therefore not only provides new results on students’ understanding of the generality of statements and its relation to proof, but also builds a new basis for future research on students’ understanding of generality.
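The consistency-based operationalization described above can be sketched in a few lines of code. This is a minimal illustration, not the study’s actual analysis code; the response codings, labels, and function name are my own assumptions for the sake of the example.

```python
# Illustrative sketch of the operationalization of understanding of generality:
# a participant's estimation of truth is related to their response on the
# existence of counterexamples; inconsistent pairs are coded as an incorrect
# understanding. Response codings are assumptions for illustration only.

def classify_generality(truth_estimate: str, counterexamples_exist: str) -> str:
    """Classify one participant-statement pair.

    truth_estimate:        "true", "false", or "dont_know"
    counterexamples_exist: "yes", "no", or "dont_know"
    """
    if "dont_know" in (truth_estimate, counterexamples_exist):
        return "missing"  # no value for understanding of generality
    # A universal statement is true if and only if no counterexample exists,
    # so "true" must pair with "no" and "false" with "yes" to be consistent.
    consistent = (truth_estimate == "true") == (counterexamples_exist == "no")
    return "correct" if consistent else "incorrect"
```

For example, a participant who estimates a statement to be true while also affirming that counterexamples may exist would be classified as having an incorrect understanding of generality, in line with the definition used in this thesis.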

As has been highlighted by other researchers (e.g., Sommerhoff, 2017), studies on students’ proof skills have generally not used the same definitions, conceptualizations, and operationalizations. Even though I have tried to identify and implement the essential commonalities, decisions were made regarding conceptualization and operationalization that may limit the generalizability and comparability of my results. For instance, in contrast to other studies, in which proof comprehension was measured via assessment tests, I decided to rely on students’ self-reports on their comprehension of the arguments. Assessing participants’ proof comprehension via tests for five statements was assumed to unreasonably increase the test duration, risking reduced test-taking motivation and mental fatigue effects (e.g., Ackerman & Kanfer, 2009; Möckel, Beste, & Wascher, 2015; van der Linden, Frese, & Meijman, 2003). Further, proof comprehension was measured via a three-level scale (completely, partially, or not at all understood); more nuanced measures might have provided further insights into students’ proof comprehension and its relation to understanding generality. To ensure comparability to some extent, the coding scheme used to analyze participants’ responses regarding aspects they claimed not to have understood was based on the assessment model developed by Mejía Ramos et al. (2012). However, relying on students’ self-reports nevertheless limits the generalizability and comparability of the research findings, because self-reports are subject to several biases. For instance, students might not be able to assess themselves accurately (see, e.g., A. Selden & Selden, 2003). Self-reports on proof comprehension thus measure what participants think they have (not) understood, or how well they believe they have understood an argument, rather than their actual comprehension of the specific proof, which limits the validity of the self-reported data.

To analyze students’ conviction by the argument, participants were first asked if the presented justification had convinced them of the truth of the statement and, if they did not claim to be completely convinced by the argument, why the justification did not convince them. I again relied on self-reports, which implies the limitations already discussed above. In particular, the combined findings on students’ estimation of truth and conviction suggest that asking participants how convinced they are by an argument might reveal what arguments participants assume they should find convincing, not necessarily what types of arguments actually convince them the most of the truth of the statements. As conviction and acceptance criteria for proof are subjective and influenced by prior experience, for instance, in the classroom (e.g., Hanna, 1989; Stylianides, 2007), they may be susceptible to biases such as social desirability. Students might know that empirical arguments do not constitute a proof (e.g., Ufer et al., 2009) and therefore assume that they should not find them convincing. However, empirical arguments can in fact be convincing and lead to high levels of conviction in the truth of a statement (e.g., Weber, 2013), which my findings experimentally confirm. Even though participants were not asked whether they thought the arguments were valid proofs, they might nevertheless have taken this into account (consciously or not) when asked if the argument had convinced them of the truth of the statement. In this regard, as has been discussed by other researchers, for instance Inglis and Mejía-Ramos (2013), participants might interpret questions about how convinced or persuaded they are by an argument differently. To reduce this risk, I decided to explicitly ask participants if the argument convinced them of the truth of the statement.
While this has hopefully led to better comparability of participants’ responses within this study, the comparability of my results to those of other studies might be limited. Furthermore, participants were not provided with predefined criteria for convincing arguments, as suggested by Mejía Ramos and Inglis (2009b), for instance, and it is not fully clear what criteria participants based their decisions on. However, this limitation was at least partially overcome by asking students to explain why they were not (completely) convinced by the arguments. The respective findings of the content analysis further strengthen the assumption that participants may have considered acceptance criteria for proof when asked why the arguments did not convince them of the truth of the statements.

To investigate students’ proof schemes, participants were not explicitly asked to (dis)prove the statements, but to justify why they think the statement is true or false. While some prior studies chose a similar approach (e.g., Barkai et al., 2002; Harel & Sowder, 1998), others explicitly asked the participants to construct a proof (e.g., Recio & Godino, 2001; Stylianou et al., 2006). These different approaches might thus limit the comparability to some of the prior studies on proof schemes.

7.2.4 Number, Selection, and Order of Statements

Significant effects of the type of statement were found across all analyzed activities and students’ understanding of generality. A larger number of statements would have been desirable to increase the validity and reliability of the findings, but due to the limited testing time, the number of statements included in this study had to be restricted to avoid fatigue effects (as mentioned above). Moreover, the defined criteria for the selection of statements also limited the number of suitable items, in particular false statements. It would nevertheless have been beneficial to include additional statements, in particular false ones.

The selection and allocation were mainly based on theoretical considerations, such as the extent to which the statements are present in textbooks and school curricula (see Section 5.3.1). However, because the two unfamiliar statements only require basic knowledge and should be known by teachers, it is possible that teachers teach and discuss these statements with their students, even though the statements are not part of the school curriculum. The content analysis of the data on students’ proof schemes provides evidence that the allocation made in this study is generally sound: Regarding the (true) unfamiliar statements, only a few participants used authority arguments or claimed that the statement is a general rule (4% and 1%, respectively); instead, they mainly used empirical arguments. Thus, it can be assumed that the vast majority did not gain (much) experience with the unfamiliar statements during high school. In contrast, most participants used these types of arguments (authority and rule) to justify the truth of the familiar statements (24% and 37%), which indicates at least some degree of familiarity with these statements. This interpretation might be limited by the fact that the statements differ not only in familiarity but also in content: The familiar statements were taken from geometry and the unfamiliar statements from elementary number theory. This choice was based on the respective criteria defined in Section 5.3.1: school curricula (in NRW) explicitly mention only geometry statements with respect to proving, and statements from elementary number theory are assumed to require comparatively little knowledge to understand and prove, which was one of the main criteria defined for the selection of unfamiliar statements. Few studies have explicitly reported on the effect of the content domain on students’ performance in proof-related activities (e.g., Ko & Knuth, 2013).
The fact that familiarity (or, in other words, geometry) unexpectedly had a negative effect on participants’ (self-reported) proof comprehension suggests that the content area indeed plays a role. For future studies, it would therefore be desirable to consider familiar and unfamiliar statements from both content domains to specifically identify which characteristics (content domain and/or familiarity) contribute to the observed effects, even though the defined criteria for the selection of statements would make such an implementation no simple task (at least in Germany).

Further, to ensure comparability, all statements were mainly expressed in natural language. As a consequence, participants had difficulties understanding and recognizing the Pythagorean theorem, as mentioned several times in this thesis. In particular, a relatively high percentage claimed not to know the truth value of the statement and whether counterexamples exist, which resulted in missing values in the variable understanding of generality. This could have biased the findings regarding the effect of familiarity (because this statement was assumed to be known to the participants), as participants for whom a value for understanding of generality was measured might not only have had better content knowledge, but generally a better understanding of proof and generality.

The order of items can also influence participants’ responding behavior and performance, even though research on this is ambiguous (e.g., Anaya et al., 2022; Bresnock, Graves, & White, 1989; Kleinke, 1980; Newman, Kundert, Jr, & Bull, 1988; Şad, 2020). I decided to order the statements from easiest to most difficult, based on pre-tests and expert opinions (see Section 5.3.1). The percentage of participants who claimed not to know the truth value of the statement and whether counterexamples exist increased over the course of the experiment and was highest for the last statement, the Pythagorean theorem. Other reasons for the high percentage of missing values regarding this statement have already been discussed, but due to the fixed order of statements, it cannot be ruled out that the position of the statement in the questionnaire also affected participants’ (non-)responses. Therefore, randomizing the order of the statements might have been the better choice, even though this could have resulted in a higher percentage of participants dropping out of the experiment early on if, by chance, the first statement had been the most difficult one (Anaya et al., 2022). Randomization might nevertheless have been preferable, because potential order effects could then have been analyzed and accounted for.

7.2.5 Open-Ended Questions and Content Analysis

In general, the collection and analysis of responses to open-ended questions have several limitations. One, already mentioned, is related to sample size and potential selection bias, because some participants might perceive open-ended questions as too time-consuming or lack interest in the topic and decide not to answer them (e.g., Holland & Christian, 2009; A. L. Miller & Lambert, 2014). Another general limitation concerns the texts being analyzed. In the present study, participants were asked open-ended questions regarding their proof schemes, proof comprehension, and conviction. Participants might not be able to fully express their thinking or to identify all aspects they did not comprehend or find convincing, resulting in incomplete responses. Some studies have shown that most people are generally capable of articulating themselves in their answers to open-ended questions (e.g., Geer, 1988), but (more recent) research on this seems to be scarce.

It should be noted that content analyses almost always involve interpretation to some extent (e.g., Bryman, 2012). The coding of complete and incomplete arguments was particularly difficult, because mathematically it is not clear where to draw the line (unless formal proofs had been considered, which was not done for good reasons), even mathematicians do not always agree, and it was not always clear whether participants omitted specific steps in their arguments because they assumed them to be obvious or because they did not think of them. To overcome these limitations and to ensure reliability, the coding schemes were based on previous frameworks, some of which have been used extensively (e.g., Harel & Sowder, 1998), and I provided detailed coding protocols and tried to be as transparent as possible (see Section 5.4.2 and the paragraphs on content analysis in Sections 5.4.3, 5.4.4, and 5.4.5, as well as Appendix B in the Electronic Supplementary Material). This resulted in very high inter-coder reliabilities after coders were sufficiently trained.
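An inter-coder reliability check of the kind reported here can be illustrated with Cohen’s kappa, which corrects raw agreement for chance agreement. The sketch below is illustrative only: the category labels and example codings are invented and do not reproduce the study’s data or the specific reliability coefficient it used.

```python
# Minimal sketch of an inter-coder reliability check via Cohen's kappa for
# two coders assigning nominal codes (e.g., proof-scheme categories).
# Labels and data are illustrative assumptions, not the study's data.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: share of items both coders coded identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    categories = set(coder_a) | set(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:  # both coders constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

codes_a = ["empirical", "deductive", "empirical", "external", "empirical"]
codes_b = ["empirical", "deductive", "empirical", "empirical", "empirical"]
print(round(cohens_kappa(codes_a, codes_b), 2))  # 0.58
```

In practice, such a coefficient would be computed per coding scheme after coder training, with disagreements resolved through discussion, as described in Section 5.4.2.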

Overall, analyzing students’ responses to the open-ended questions provided important insights into their understanding of generality and proof.

7.2.6 Control Variables

Prior knowledge was only indirectly considered, by including participation in an honors mathematics course (LK) and, to a lesser degree, participation in a transition course (Vorkurs). Participation in an honors course proved to be a useful predictor for most of the activities and for students’ understanding of generality. However, it is not completely clear what it actually controls for: mainly content-specific resources such as conceptual and procedural knowledge and domain-specific resources such as mathematical strategic and methodological knowledge, or other (domain-general) resources, or even something else. Similarly, the CRT score was considered as a control variable to account for (domain-general) cognitive resources. Overall, the CRT score was the most significant predictor for almost all proof-related activities (with the exception of conviction) and for students’ understanding of generality. Again, these findings do not provide information regarding the influence of more nuanced (domain-general) resources such as problem-solving and (general) reasoning skills. Furthermore, it can be seen as a limitation that there is an ongoing debate about what the CRT actually measures: cognitive reflection, rational thinking, numeracy, insight problem solving, and/or something else (e.g., Liberali, Reyna, Furlan, Stein, & Pardo, 2012; Patel et al., 2019; Pennycook et al., 2016; Toplak et al., 2014). Nevertheless, as a control variable for individual cognitive differences, the CRT score seems to be useful and easy to measure. Further research is needed to investigate the relation between the CRT score and students’ proof skills (see Section 7.4 for a further discussion).

As it was not the purpose of this study to identify specific predictive resources for students’ proof skills, the limitation regarding the informative value of the considered control variables is acceptable. Moreover, the findings are nevertheless useful in that they provide evidence for differences on an individual level and suggest influences of resources that have not been considered in previous research in that way. While the assessment and inclusion of content- and domain-specific resources would have been beneficial to contribute to the existing research on the influence of individual resources (e.g., Chinnappan et al., 2012; Sommerhoff, 2017), it would not have been reasonable to consider these in this study, for instance, due to limitations regarding the test length and the focus of this thesis.

7.2.7 Sample

The overall sample size was generally satisfactory. It would nevertheless have been beneficial to have larger samples for some of the research questions, in particular those for which only one or two experimental groups were considered, for instance, the analysis of the influence of students’ conviction and proof comprehension on understanding of generality. Larger sample sizes would have increased statistical power and yielded more robust estimates of the coefficients for the respective variables of interest. Moreover, the content analyses would have benefited from larger sample sizes as well, because the sample size was limited not only by the selection of experimental groups, but also by students’ responses to prior questions and their willingness and ability to answer the open-ended questions (see also the discussion further below). Further, the sample was unbalanced regarding the study program, with, for example, a large number of preservice primary school teachers and a much smaller number of preservice secondary school teachers. Even though the effect of the study program was not directly analyzed, individual resources (for instance, the CRT score and participation in honors mathematics courses) and most likely students’ proof skills differ with respect to the study program, which could potentially have biased the research findings. Therefore, a more balanced sample would have been desirable, even though the distribution of study programs roughly corresponded to the actual distribution at Bielefeld University. Using more advanced statistical tools such as generalized linear mixed models contributed to overcoming these limitations by including respective control variables such as the CRT score and the attendance of an honors course.
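The point about statistical power can be made concrete with a small simulation: for a fixed assumed difference between two experimental groups (for instance, in the share of participants with a correct understanding of generality), larger groups detect the difference more reliably. All parameter values below are illustrative assumptions, not figures from this study, and the two-proportion z-test is used only as a simple stand-in for the models actually employed.

```python
# Simulation-based power estimate for comparing two proportions with a
# two-proportion z-test. Effect sizes and group sizes are illustrative
# assumptions, not results from the study.
import random
from math import sqrt, erf

def simulated_power(p1, p2, n_per_group, alpha=0.05, runs=2000, seed=1):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(runs):
        # Draw one simulated experiment: successes in each group.
        x1 = sum(rng.random() < p1 for _ in range(n_per_group))
        x2 = sum(rng.random() < p2 for _ in range(n_per_group))
        pooled = (x1 + x2) / (2 * n_per_group)
        se = sqrt(2 * pooled * (1 - pooled) / n_per_group)
        if se == 0:
            continue  # degenerate sample, counted as non-rejection
        z = abs(x1 / n_per_group - x2 / n_per_group) / se
        # Two-sided p-value from the normal approximation.
        p_value = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
        if p_value < alpha:
            rejections += 1
    return rejections / runs

# Power grows with group size for the same assumed effect (0.45 vs. 0.60):
small = simulated_power(0.45, 0.60, n_per_group=40)
large = simulated_power(0.45, 0.60, n_per_group=160)
print(small < large)  # larger groups detect the same difference more often
```

A normal-approximation calculation gives roughly the same picture (power of about 0.3 at 40 per group versus about 0.8 at 160 per group for this assumed effect), which is why the subgroup analyses with only one or two experimental groups were the most affected by the limited sample.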

7.3 Implications for the Learning and Teaching of Proof at the Transition from School to University

The results of my thesis contribute to the existing research on university students’ proof skills and understanding at the transition from school to university. My findings confirm those of prior studies showing that many students have limited knowledge and understanding of mathematical concepts and proof when they enter university (e.g., Gueudet, 2008; Kempen & Biehler, 2019; Recio & Godino, 2001). In particular, my findings suggest that many students lack sufficient knowledge of the meaning of basic terms and concepts, such as divisibility, even and odd numbers, product, and the meaning of variables. Further, and most importantly for this study, many students also seem to lack a sufficient understanding of the generality of mathematical statements, which appears to be related to their conviction and usage of different types of proofs, in particular empirical arguments. It is therefore no surprise that students have difficulties with proof and proving when they enter university, and may even lack an intellectual need for proof (see also the directions for future research below). Most (German) universities already strive to close the gap at the transition from school to university by offering transition (to proof) courses (see, for instance, Gerdes, Halverscheid, & Schneider, 2022). However, so far, the effect of these courses is unclear (e.g., Greefrath, Koepf, & Neugebauer, 2017; Tieben, 2019), and dropout rates in mathematics remain high at German universities (Heublein et al., 2022). The findings of my study further suggest that the attendance of a transition course had no significant effect on students’ performance in the considered proof-related activities or on their understanding of generality. While one reason might be that these courses are simply too short (two weeks at Bielefeld University), other reasons might include that the content currently covered in these courses does not fully meet students’ (or lecturers’) needs.
In this regard, the findings of the present thesis may be useful for revising the content of transition courses. In particular, they provide a basis for (intervention) studies that (1) aim at improving students’ understanding of the generality of statements, and (2) analyze if and how an improved understanding affects students’ intellectual need for proof and their proof skills (see also Section 7.4).

When planning university courses for first-year students, lecturers should take into account that many students currently have an insufficient knowledge of basic mathematical terms and understanding of generality. In general, more emphasis should be put on sufficiently understanding a theorem first—and definitions, concepts, etc. involved—before students are confronted with its proof. Moreover, reading, constructing, and potentially discussing examples (i.e., empirical arguments) can support students’ understanding of the statements, which may ease comprehension and construction of proofs. Generic proofs may also be useful in this regard, as my findings suggest.

These results are particularly important in lectures for preservice teachers, to break the cycle of teachers not having sufficient knowledge, school students consequently not learning enough about proof and argumentation, and a resulting gap in students’ knowledge at the transition from school to university. The present thesis did not investigate or aim at identifying respective new teaching methods. However, as mentioned, my findings suggest that students lack basic knowledge and understanding and may therefore benefit from activities that specifically aim at assessing and improving their understanding of theorems, including their understanding of mathematical generality. In general, assessing students’ knowledge and understanding can help both lecturers and students: it would be more transparent to the students what is expected from them and what they need to know to follow along, and lecturers would get a better picture of what their students actually know, and what they do not. In this respect, (real-time) quizzes can be an effective way of assessing students’ knowledge and understanding (e.g., Cohn & Fraser, 2016; Méndez Coca & Slisko, 2013; Plump & LaRosa, 2017). Questions could not only assess students’ (prior) knowledge of the terms and statements used in a theorem, but also aim more specifically at their understanding of the generality of (particular) theorems. For instance, after introducing a new theorem, which students most likely assume to be true, lecturers could ask their students whether counterexamples may exist. Thereby, I would avoid using the term counterexample and instead phrase the respective question as suggested in this study (see Section 5.3.5). The responses of the students could provide an opportunity for informative discussions about the theorem itself, including its generality, but also about the purpose of and intellectual need for proof.
However, the effectiveness of such instruction would need to be investigated.

7.4 Directions for Future Research

Finally, the findings and limitations together give rise to directions for future research, some of which have already been identified in Section 7.2. In this section, I first discuss potential research questions regarding the investigation of students’ understanding of generality, before identifying other, partially more general implications for future research.

7.4.1 Further Investigating Students’ Understanding of the Generality of Statements

To further generalize the findings on students’ understanding of the generality of mathematical statements, replication studies at other (German) universities would be valuable. Moreover, it would be beneficial to include additional or different statements from other content domains to analyze whether the findings of this study are content-specific or generalizable to other areas. To analyze the influence of familiarity on students’ understanding of generality—but also on other proof skills—it would be valuable to include familiar and unfamiliar statements from the same content domain, as has been highlighted before.

While the focus of the present study was on first-year university students’ understanding of generality, a replication of the study with experienced university students would make it possible to investigate potential developments of students’ understanding of the generality of statements throughout their studies. I would expect more experienced students to have a more consistent understanding of generality, due to greater experience with higher mathematics and proof in particular, but this hypothesis needs to be tested.

The findings of the present thesis furthermore suggest a relation between students’ understanding of generality and other proof skills, such as proof comprehension and evaluation. However, the results were ambiguous. Several possible reasons have been identified, such as the sample size, but also the reliance on students’ self-reports regarding their proof comprehension. Future studies could replicate the study with a larger sample and/or consider measuring students’ proof comprehension via assessment tests (see also the discussion further below), as suggested by Mejía Ramos et al. (2012), for instance.

Moreover, the correlations between students’ proof schemes and their understanding of generality found in this study need to be investigated further. For instance, the results do not clarify whether students with empirical proof schemes showed an incorrect understanding of generality more often than students with deductive proof schemes because they were not able to produce a general argument, or whether other characteristics of these students explain the correlation. Future studies should take this into account when investigating the relation between students’ proof schemes and their understanding of the generality of statements.

Further, while I chose to investigate students’ proof comprehension, conviction, and proof schemes and their relation to understanding generality, future studies could consider relations to other proof-related concepts or aspects. For instance, it may be worthwhile to analyze potential relations between students’ understanding of generality and their ability to draw logical inferences. Even though research suggests that logical reasoning skills, in particular conditional reasoning skills, play only a minor role regarding students’ proof skills (e.g., Sommerhoff, 2017), a relation to students’ understanding of generality might nevertheless exist. As mentioned before, understanding logical negation should—at least in theory—be particularly relevant for the understanding of the generality of statements as defined in this study, because questions regarding the existence of counterexamples were expressed via the negation of the respective statement (see Section 5.3.5). However, this hypothesis would need to be investigated. Further, the influence of participants’ CRT score on their understanding of generality suggested by my findings implies a potential relation between students’ (logical) reasoning skills and rational thinking on the one hand and their understanding of generality on the other (e.g., Liberali et al., 2012; Primi et al., 2016; Toplak et al., 2014). Further research is needed to analyze 1) what exactly the CRT measures and 2) how this relates to students’ understanding of the generality of statements and their proof skills. Moreover, researchers who want to use the CRT score as a control instrument in future studies may want to consider alternative CRT items to avoid an overemphasis on numerical abilities and floor effects in non-elite populations or among younger students, for instance (e.g., Sirota, Dewberry, Juanchich, Valuš, & Marshall, 2021; Young, Powers, Pilgrim, & Shtulman, 2018).

Further, the awareness and correct understanding that no counterexamples to true universal statements exist might be related to or even increase students’ appreciation of proof and their intellectual need for certainty (introduced by Harel, 2013), because it is mathematical generality that is the defining element of mathematical proof—the reason why a deductive proof is indeed necessary and empirical arguments are not sufficient to rule out the existence of any counterexamples. The findings of the present thesis suggest, for example, that participants with empirical proof schemes more often have an incorrect understanding of generality than students with deductive proof schemes, and further, that students who are convinced by empirical arguments are more likely to have an incorrect understanding of generality than students who are not convinced by these arguments. As some researchers have emphasized the relation between students’ usage of and satisfaction with empirical arguments and their lack of intellectual need for proof (e.g., Zaslavsky, Nickerson, Stylianides, Kidron, & Winicki-Landman, 2012), investigating the potential relation between students’ understanding of generality and their appreciation of and intellectual need for proof could be valuable.

Lastly, the findings of my study give rise to a potential intervention study. The effect of students’ knowledge of the meaning of generality on the consistency of their responses (i.e., their actual understanding of generality) was highly significant. An intervention study that investigates the effect of explicitly teaching the meaning of the generality of mathematical statements on students’ understanding of generality could therefore be promising. However, given that knowledge of the meaning of generality alone was not sufficient for a consistently correct understanding of generality, other factors (such as familiarity with the statement and logical reasoning skills) also play a role and should be considered.

7.4.2 Self-Reported Data and Reality

Several limitations discussed in Section 7.2 concern the relation between self-reported data regarding students’ conviction and proof comprehension and students’ actual conviction and proof comprehension. The findings of the present study indicate that participants might not always be able to assess themselves accurately. For instance, as has been discussed, the question “Does the justification convince you of the correctness of the claim?” does not necessarily provide information about students’ actual conviction of the truth of a statement by different types of arguments, but rather about which types of arguments they think should be convincing to them—that is, their conceptions of convincing mathematical arguments or proof. In general, it is questionable what studies on students’ conviction actually assess: most likely not students’ actual conviction regarding the truth of a statement, but rather their acceptance of the argument and respective criteria in the sense of social proof. The gap between self-reported and actual conviction was only revealed by the experimental design of this study. Thus, further investigating students’ conviction or other proof skills experimentally would be very valuable. In particular, one should be careful when interpreting results based solely on students’ self-reported data. The observation that self-reports do not always reflect reality is not new (e.g., Maki & McGuire, 2002; Thiede, Griffin, Wiley, & Redford, 2009); however, only a few researchers have explicitly investigated this in the context of proof and, to my knowledge, the extent of this phenomenon has not yet been explicitly researched. Further, existing studies on mathematicians’ conviction of arguments have also relied on self-reports.
It would be valuable to conduct a similar experimental study in which the relation between mathematicians’ estimation of truth—based on different types of arguments—and their self-reported level of conviction of the truth of statements by the arguments is analyzed. For such a study, statements that are not too simple (i.e., with which mathematicians are not familiar) should be selected. Another potential future study concerns the relation between students’ self-reported proof comprehension and their actual proof comprehension, assessed via comprehension tests (e.g., Mejía Ramos et al., 2012). Several studies on text comprehension have provided evidence that many learners fail to accurately judge their text comprehension, in that they often overestimate but also underestimate their comprehension (e.g., Golke et al., 2022; Maki & McGuire, 2002; Prinz, Golke, & Wittwer, 2020; Thiede et al., 2009). Similarly, A. Selden and Selden (2003) reported that mathematics students indeed overestimate their understanding of a proof (even though the focus of their study was on proof validation). However, to my knowledge, no studies have explicitly investigated differences between students’ self-reported and actual proof comprehension. Proof comprehension tests have the advantage of providing more valid, reliable, and nuanced results regarding students’ actual proof comprehension. However, constructing and administering such tests is time-consuming and not always feasible. Self-reports provide a much simpler way of measuring students’ proof comprehension, which is why their validity needs to be investigated.

7.4.3 Question Order Effects

As has been discussed, the statements included in the questionnaire were ordered from easiest to most difficult. While this is assumed to have benefits such as a lower percentage of participants abandoning the questionnaire and better performance (e.g., Anaya et al., 2022; Kleinke, 1980), it might also lead to potential order effects, such as less motivation to answer the more difficult questions at the end of the questionnaire. Therefore, future studies on students’ proof skills may want to consider a random ordering of statements. Moreover, it would be beneficial to investigate such order effects experimentally, because research on them is still ambiguous and studies in the context of mathematics education, in particular proof, seem to be scarce. In such a study, several questionnaires with different orderings of statements could be designed, for instance from easiest to hardest, from hardest to easiest, and/or random.