Before the results are presented in this chapter, further information on the sample characteristics and findings from preliminary analyses are provided. The subsequent sections are then structured according to the four sets of research questions. In section 6.6, missing values regarding the understanding of generality are analyzed, that is, those observations in which participants answered “I have no idea” when asked to estimate the truth of a statement and the existence of counterexamples (see section 5.3.5). Lastly, I summarize the main findings in section 6.7.

6.1 Preliminary Analysis

In this section, further information on the sample characteristics and findings from preliminary analyses are provided.

To better interpret and compare the results with respect to the specific sample of my study, information regarding participants’ Cognitive Reflection Test (CRT) scores is reported first. The most frequent CRT score was 0, which means that many of the participants (about one third) answered all four CRT questions incorrectly (see Fig. 6.1 for absolute frequencies of CRT scores). On average, participants solved 1.3 of the four CRT items (about 33%) correctly (\(SD=1.2\)). As discussed in section 5.3.6, similar floor effects in less elite populations have frequently been reported in the literature. The CRT score differed substantially by participants’ study program, as Figure 6.2 illustrates. The floor effect can only be observed for preservice primary school teachers without mathematics as a major: about 45% of these students had a CRT score of 0. This group also had the lowest average CRT score (\(M=0.9, SD=1.1\)). In contrast, less than 5% of the mathematics students did not solve any of the CRT questions correctly, and these students had the highest average CRT score (\(M=2.4, SD=1.0\)) of all study programs. Because the preservice primary school teachers formed the largest group in the sample (see section 5.2.2), the overall distribution shows a floor effect.

Figure 6.1

CRT scores of participants

Figure 6.2

CRT scores of participants by study program

The participants were asked to rate the difficulty of the questions regarding the proof-related activities to check for ceiling or floor effects. Figure 6.3 shows participants’ difficulty ratings. Only a few participants rated the questions as very difficult or very easy, which indicates no ceiling or floor effects, at least regarding the perceived difficulty of the questions.

Even though the participants mainly received the same questions (in particular those in the experimental groups B, C, and D, who were provided with justifications for the statements), the perceived difficulty of the questions differed by the type of argument (see Fig. 6.4). The questions were perceived as most difficult by students who received ordinary proofs and as least difficult by students who received empirical arguments. Students who were not provided with any justification were asked to justify the truth/falsity of the statements themselves. They perceived the questions to be more difficult than the participants who received empirical arguments and slightly more difficult than participants who received generic proofs, but less difficult than the participants who received ordinary proofs. Notably, students who received generic proofs perceived the questions to be less difficult than students who received ordinary proofs.

Figure 6.3

Rating of the difficulty of questions

Figure 6.4

Rating of the difficulty of questions by group (i.e., type of argument)

On average, participants completed the questionnaire in about 24 minutes (\(SD=9\)). Participants who received ordinary proofs spent the most time answering the questions, while participants who received no arguments needed the least amount of time, closely followed by participants who received empirical arguments (see Fig. 6.5). Given that participants who received no arguments were asked to justify the truth/falsity of the statements themselves, it is surprising that they needed the least amount of time to finish the questionnaire.

Figure 6.5

Minutes needed to answer the questionnaire by group

Mostly small to moderate intercorrelations between the considered independent variables were observed (see Tab. 6.1; because some of these variables are nominal, correlation coefficients were calculated with Cramer’s V). The CRT score correlated comparatively strongly with attendance of an honors mathematics course (LK), which should be taken into account when interpreting the results. However, both of these variables served as controls and were not otherwise of particular interest.

Table 6.1 Intercorrelation among variables

Overall, no severe multicollinearity was expected.
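To make the association measure concrete: Cramer’s V rescales the Pearson χ² statistic of a contingency table to the range [0, 1], which is why it can be read like a correlation coefficient for nominal variables. A minimal sketch in Python (the contingency table below is purely hypothetical and not the study’s data):

```python
from math import sqrt

def cramers_v(table):
    """Cramer's V for an r x c contingency table of observed counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Pearson chi-squared: sum over all cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (obs - expected) ** 2 / expected
    k = min(len(table), len(table[0]))  # smaller dimension of the table
    return sqrt(chi2 / (n * (k - 1)))

# Hypothetical cross-tabulation of two binary variables
# (e.g., honors course yes/no vs. CRT score above/below the median)
print(round(cramers_v([[30, 10], [20, 40]]), 2))  # -> 0.41
```

Values near 0 indicate (near-)independence; mostly small to moderate values, as in Tab. 6.1, are consistent with the conclusion that multicollinearity is not severe.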

6.2 Conviction of the Truth of Statements

This section reports on the results regarding the first set of research questions. It is divided into two main parts: Students’ estimation of truth and students’ proof evaluation regarding conviction.

6.2.1 Estimation of Truth

Depending on the statement, between about 45 and 70% of the participants correctly estimated its truth and claimed to be absolutely sure. If relative conviction of the truth of the statement is included, about 60% (regarding the false statement) to 95% (regarding one of the true unfamiliar statements, closely followed by one of the familiar statements) correctly estimated the truth of the statements. Figure 6.6 gives an overview of students’ correct estimation of truth regarding the five statements (as a reminder, statements 1 and 3 were true and supposedly unfamiliar, statements 2 and 5 were familiar, and statement 4 was false). Notably, almost one fourth of all participants claimed to not know whether the Pythagorean theorem (statement 5) is true, which seems unexpected at first. Moreover, comparatively many participants incorrectly estimated the truth values of the unfamiliar statement that the product of two odd numbers is odd (statement 3) and of the false statement that the sum of three consecutive numbers is divisible by 6 (statement 4); about one third of the participants were relatively or absolutely sure that the latter statement is true.

Figure 6.6

Correct estimation of truth by statement (1: sum of two odd numbers is even; 2: sum of interior angles in a triangle; 3: product of two odd numbers is odd; 4: sum of three consecutive numbers is divisible by 6; 5: Pythagorean theorem)

Figure 6.7 shows participants’ correct estimation of the truth of the statements, depending on the type of statement (familiarity and truth value) and argument (experimental group). Participants were generally more successful in estimating the truth value of the familiar and the true unfamiliar statements than that of the false (unfamiliar) statement. In particular, a substantial percentage of the participants who received empirical arguments or ordinary proofs seemed to be absolutely sure that the false statement is true. Furthermore, the graph suggests that students who received empirical arguments were overall the most successful in estimating the truth values of the true universal statements, followed by those who received generic proofs.

Figure 6.7

Correct estimation of truth by type of statement and argument

The effects of the type of argument and statement on students’ estimation of truth were analyzed using mixed effects ordinal logistic regression (see also section 5.4.1). The results are summarized in Table 6.2. Model 2 was selected as the final model. As was expected, the familiarity and the truth value of the statement affected students’ estimation of truth. Being familiar with the statement correlated positively with correctly estimating the truth value, even though not as strongly as was expected (\(\upbeta =.20\), \(\text {p.adj}=.041\)), while the falsity of a statement had a strong negative effect (\(\upbeta =-.92, \text {p.adj}<.001\)). This means that participants were more likely to correctly estimate the truth value of the familiar statements and less likely to do so regarding the false statement, both compared to estimating the truth of the true unfamiliar statements.
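To illustrate how such β coefficients act in a cumulative-link (proportional-odds) model: each predictor shifts all cumulative log-odds of the ordinal response by the same amount. The sketch below combines made-up thresholds with the reported β values of .20 (familiarity) and −.92 (falsity) purely to show the direction of the effects; it is not the fitted CLMM, which additionally contains random effects.

```python
from math import exp

def category_probs(eta, thresholds):
    """Category probabilities of an ordinal response under a cumulative
    logit model: P(Y <= k) = logistic(theta_k - eta)."""
    logistic = lambda x: 1.0 / (1.0 + exp(-x))
    cum = [logistic(t - eta) for t in thresholds] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Hypothetical thresholds for a 3-category response
# (incorrect < correct, relatively sure < correct, absolutely sure)
thresholds = [-1.0, 0.5]

reference = category_probs(0.0, thresholds)    # true unfamiliar statement
familiar = category_probs(0.20, thresholds)    # beta = .20
false_stm = category_probs(-0.92, thresholds)  # beta = -.92

# A positive beta moves probability mass toward the highest category,
# a negative beta toward the lowest.
print([round(p, 2) for p in reference])
print([round(p, 2) for p in familiar])
print([round(p, 2) for p in false_stm])
```

Under this parameterization, the small positive β for familiarity nudges mass toward the correct, absolutely sure category, while the large negative β for falsity shifts it markedly toward incorrect estimates, matching the pattern reported above.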

Table 6.2 CLMM comparison regarding students’ estimation of truth

Furthermore, students who received empirical arguments were more likely to correctly estimate the truth value than students who did not receive any justifications (\(\upbeta =.44, \text {p.adj}=.004\)). Reading generic proofs also had a positive effect on students’ estimation of truth, but this effect did not reach significance regarding the adjusted p-value (\(\upbeta =.27, \text {p.adj}=.088\)). Ordinary proofs had no significant effect on students’ estimation of truth (\(\upbeta =-.09, \text {p.adj}=.537\)).
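Adjusted p-values (p.adj) correct for testing several hypotheses at once. One standard procedure for controlling the family-wise error rate is Holm’s step-down method, sketched below with made-up raw p-values (the study’s actual adjustment method is the one specified in its methods chapter):

```python
def holm_adjust(pvals):
    """Holm's step-down adjusted p-values (controls the family-wise error rate).

    The i-th smallest p-value is multiplied by (m - i); a running maximum
    enforces monotonicity, and values are capped at 1.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Three hypothetical raw p-values, e.g., for three argument-type contrasts
raw = [0.004, 0.040, 0.500]
print([round(p, 3) for p in holm_adjust(raw)])  # -> [0.012, 0.08, 0.5]
```

Note how the middle value crosses from below to above the conventional .05 threshold after adjustment; this is the same situation as the generic-proof effect reported above, which is positive but not significant in terms of the adjusted p-value.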

Among the four control variables (CRT score, honors course, transition course, and final high school mathematics grade), only the CRT score and participation in a mathematics honors course during high school predicted students’ estimation of truth. The higher the CRT score, the more likely participants were to correctly estimate the truth value (\(\upbeta =.63, \textrm{p}<.001\)). Similarly, and with an even larger effect, participants who specialized in mathematics in an honors course during high school were more likely to estimate the truth value correctly (\(\upbeta =.80, \textrm{p}<.001\)). The effect of the final mathematics grade was comparatively smaller and did not quite reach significance (\(\upbeta =-.20, \textrm{p}=.051\); note that in Germany, 1 is the best grade and 6 the worst, which explains the opposite sign of the estimate). Attendance of a transition course had an even smaller effect, which was clearly insignificant (\(\upbeta =.03, \textrm{p}=.798\)); this variable was therefore excluded from the models. Due to its comparatively small effect, the mathematics grade was excluded in Model 3. But because Model 3 did not have a smaller AIC value than Model 2 and the influence of the mathematics grade is plausible, Model 2 seemed to be the best choice overall.
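The model comparison rests on the Akaike information criterion, AIC = 2k − 2 ln L, where k is the number of estimated parameters and ln L the maximized log-likelihood; smaller values are better. A toy comparison with made-up log-likelihoods (not the fitted values behind Table 6.2) illustrates why dropping a weak predictor does not always pay off:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a better
    trade-off between model fit and complexity."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: the reduced model drops one predictor (the grade)
# and loses slightly more fit than the 2-point complexity penalty saves.
aic_full = aic(-512.3, 9)     # e.g., a model like Model 2
aic_reduced = aic(-514.2, 8)  # e.g., a model like Model 3
print(round(aic_full, 1), round(aic_reduced, 1))  # -> 1042.6 1044.4
```

Here the fuller model keeps the lower AIC, so the extra predictor is retained, mirroring the choice of Model 2 over Model 3.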

6.2.2 Proof Evaluation Regarding Conviction

About half of the participants who received generic or ordinary proofs claimed that the argument completely convinced them of the truth of the statement (see Fig. 6.8). Notably, in about 25% of the observations, participants claimed to be completely (!) convinced by the empirical arguments.

Figure 6.8

Conviction by type of argument

Figure 6.9

Conviction by type of argument and statement

Figure 6.10

Conviction by type of argument and statement, and by comprehension of argument

As would be expected, students claimed to be less convinced by the (incorrect) arguments regarding the false statement than regarding the true familiar and unfamiliar statements (see Fig. 6.9). However, over 60% of participants who received ordinary proofs were also at least partially convinced by the argument regarding the false statement. In contrast, less than 50 and 40% of participants who received empirical arguments and generic proofs, respectively, claimed to be at least partially convinced by the arguments regarding the false statement.

Figure 6.10 illustrates the relation between students’ conviction and their self-reported level of comprehension of the argument (completely, partially, not at all). As would be expected, participants who claimed to have (partially) understood the arguments were more often also (partially) convinced by them. Conversely, participants who self-reportedly did not understand the arguments at all were generally also not at all convinced by them.

To investigate the effect of the type of argument and statement as well as students’ proof comprehension on their proof evaluation regarding conviction, mixed effects ordinal logistic regression was used (see Tab. 6.3). As was expected, students’ rating of conviction was affected by the type of argument. Participants who received generic or ordinary proofs were more likely to claim being convinced by the argument than participants who received empirical arguments (\(\upbeta =1.70, \text {p.adj}<.001\) and \(\upbeta =2.20, \text {p.adj}<.001\), respectively). The falsity of the statement had a negative effect on students’ conviction, which means that participants were less likely to be convinced by arguments regarding the false (unfamiliar) statement than regarding the true unfamiliar statements (\(\upbeta =-1.31, \text {p.adj}<.001\)), which was also expected. The familiarity of the statement had a comparatively smaller, but positive effect on students’ conviction (\(\upbeta =.54, \text {p.adj}<.001\)).

Table 6.3 CLMM comparison regarding students’ conviction

Further, understanding the argument correlated strongly with students’ conviction, showing the largest effect overall. Participants who claimed to have completely understood the argument were more likely to be convinced by it, and participants who claimed to have not understood the argument at all were less likely to be convinced, both compared to students who claimed to have partially understood the argument (\(\upbeta =2.73, \text {p.adj}<.001\) and \(\upbeta =-2.84, \text {p.adj}<.001\), respectively), as would be expected.

Among the control variables, only the CRT score seemed to be predictive of students’ proof evaluation. Unexpectedly, the CRT score correlated negatively with students’ conviction, even though the effect was comparatively small (\(\upbeta =-.60, \textrm{p}=.003\)). It was suspected that this effect was caused by including observations regarding empirical arguments: Participants with a higher CRT score were less likely to be convinced by empirical arguments, but not by generic or ordinary proofs. To test this hypothesis, a second regression model was calculated in which these observations were excluded (see Model 2 in Tab. 6.3). The effects reported above mainly remained, but the CRT score no longer had a negative effect. In fact, all controls showed only small and insignificant effects (unadjusted p-values between .475 and .850) and were therefore excluded in Model 3, in which all other effects remain significant, with .002 being the largest (adjusted) p-value.

6.2.2.1 Aspects That Influence Conviction

Participants who were not completely convinced by the arguments were asked to describe why the argument did not convince them. Of the 499 observations in which participants claimed to be not completely convinced by empirical arguments for statements 1 and 2 or by generic or ordinary proofs for statements 1, 2, 3, and 5 (see section 5.4.3 for reasons why the analysis was restricted to these observations), 353 (about 71%) contained an explanation in response to the open-ended question. These were coded according to the coding scheme shown in Table 5.2 in section 5.4.3. In 9 of the responses, participants merely stated that they did not know. These responses were coded as NA and not considered further in the analysis.

Figure 6.11

Reasons why participants did not find arguments convincing by type of argument (based on 344 observations)

The vast majority of students who received generic or ordinary proofs referred to not having understood the argument when asked why they were not (completely) convinced by it (see Fig. 6.11). This finding is in line with the regression analysis above (Tab. 6.3), in which students’ self-reported proof comprehension was highly predictive of their (lack of) conviction. The percentage was particularly high (about 81%) for students who were provided with ordinary proofs. In comparison, about 64% were not (completely) convinced by generic proofs because they did not understand them. More students referred to a lack of generality regarding generic proofs (about 12%) than regarding ordinary proofs (about 4%). Students who received empirical arguments were mostly (about 78%) not convinced because of a lack of generality of the argument. Another 11% referred to the number or selection of examples, for instance, that too few examples were considered or that (seemingly) relevant cases were ignored. Some participants referred to the representation of the argument regarding empirical arguments and generic proofs (3 and 6% of the observations, respectively), but not regarding ordinary proofs. Familiarity with the argument was mentioned on only three occasions, once regarding a generic proof and twice regarding an ordinary proof.

6.3 Comprehension of Arguments

Overall, participants had higher levels of self-reported proof comprehension regarding the generic proofs compared to the ordinary proofs. In particular, more students claimed to have completely understood the generic proofs than the ordinary proofs (see Fig. 6.12). However, the percentage of participants who claimed not to have understood the provided arguments at all was comparatively small in both experimental groups (about 10%).

Figure 6.12

Comprehension of argument: generic vs ordinary proof

Unexpectedly, participants claimed to have comprehended the generic and ordinary proofs regarding the unfamiliar statements from elementary number theory more often than the proofs regarding the familiar geometry statements (see Fig. 6.13).

Figure 6.13

Comprehension of argument by familiarity with the statement: generic vs ordinary proof

Figure 6.14 illustrates the relation between students’ proof comprehension and the type of argument, familiarity with the statement, and their attendance in an honors (LK) or regular course (GK) in mathematics during high school. Students who attended an honors course generally claimed to have comprehended the provided arguments more often than students who attended a regular course.

Figure 6.14

Comprehension of argument by familiarity with the statement: generic vs ordinary proof and LK (honors course) vs GK (regular course)

Mixed effects ordinal logistic regression was used to estimate the effect of the type of argument and familiarity with the statement on students’ (self-reported) proof comprehension (see Tab. 6.4). Model 2 was selected as the final model (see explanation further below). Participants who received ordinary proofs were less likely to claim having understood the arguments than participants who received generic proofs (\(\upbeta =-.59, \textrm{p}<.001\)). The familiarity with the statement also had a significant effect on students’ self-reported proof comprehension. Participants were less likely to claim having (completely) understood the arguments regarding the familiar (geometry) statements (\(\upbeta =-.55, \textrm{p}<.001\)) than the true unfamiliar statements.

Table 6.4 CLMM comparison regarding students’ self-reported proof comprehension

Further, except for attendance of a transition course, all other considered control variables (CRT score, honors vs regular mathematics course, final mathematics grade in high school) correlated positively with students’ self-reported comprehension of the arguments (\(\upbeta =.65, \textrm{p}=.001\), \(\upbeta =.58, \textrm{p}=.003\), and \(\upbeta =-.31, \textrm{p}=.063\), respectively; the negative sign of the grade estimate again reflects the German grading scale, in which 1 is the best grade). Of these variables, the CRT score and participation in an honors class had the largest effects. The mathematics grade had the smallest effect and did not reach significance.

Aspects of Students’ Proof Comprehension

There were 284 observations in which participants claimed not to have completely understood the (correct) generic and ordinary proofs. 149 of these observations contained responses to the open-ended question on what participants did not understand. Because responses to the open-ended question on students’ conviction also often contained information regarding aspects students (seemingly) did not understand, these responses were considered as well (see coding protocol in Appendix B in the Electronic Supplementary Material). In total, 208 responses (about 73% of observations in which students claimed not to have completely understood the arguments) were coded according to the coding scheme shown in Table 5.3 in section 5.4.4.

The aspect students most often referred to when asked what they did not understand was local proof comprehension (32% for generic proofs and 54% for ordinary proofs, see Fig. 6.15). These students claimed not to have understood particular statements, equations, or illustrations used in the proof. In particular, many participants seemed to not fully understand the meaning of variables. Several participants referred to not understanding why two different variables are needed for the two odd numbers, if “\(2n+1\) stands for every odd number”. Further, about one fourth of the participants who received generic proofs had difficulties understanding the statement itself, compared to about 14% of participants who received ordinary proofs. These students lacked knowledge of the meaning of basic terms, for instance, divisible, product, and odd and even numbers. Not understanding the proof’s framework was mentioned slightly more often regarding generic proofs (about 8%) than ordinary proofs (about 5%). Further, reference was made to the generality of the proof in about 14% of the observations for generic proofs, while none of the participants who received ordinary proofs claimed not to have understood why the argument is general. The percentage of participants not being able to specify what they did not understand was higher for ordinary proofs (about 21%) than for generic proofs (about 15%).

Figure 6.15

Aspects of participants’ (self-reported) proof comprehension: generic vs ordinary proof (based on 208 observations)

Figure 6.16

Aspects of participants’ (self-reported) proof comprehension by type of statement (based on 208 observations)

Aspects of proof comprehension mentioned by participants also differed by familiarity with the statement (see Fig. 6.16). Students more often referred to not having understood the arguments regarding unfamiliar statements than regarding familiar ones (25 and 10%, respectively). Further, the proof framework and the generality of the proof were more often (self-reportedly) not understood regarding the familiar (geometry) statements (11 and 10%, respectively) than regarding the unfamiliar (arithmetic) statements (2 and 3%, respectively). The percentage of participants not being able to specify what they did not understand was almost twice as high for familiar statements (about 24%) as for unfamiliar ones (about 13%).

6.4 Justification: Students’ Proof Schemes

Most participants responded when asked to justify why they think the statements are true/false (about 80% of all 580 potential observations, i.e., potential responses of the 116 participants in control group A for each of the 5 statements). Overall, 467 observations were coded according to the proof schemes shown in Table 5.4. Five of these observations were excluded from the analysis reported in this section because they included references to the geometry statements not being true on the sphere (which is of course correct, but made these responses difficult to interpret with respect to students’ proof schemes). Table 6.5 provides an overview of the number of observations in each category.

As was expected, most students used empirical proof schemes (109 in total). Fewer students had a deductive proof scheme (39 in total). Transformative proof schemes were only occasionally observed (5). Many participants also showed external proof schemes, such as referring to an authority (49) or claiming that the statement is a general rule (65). Notably, 59 observations (about 13%) were coded as unclear. Most of these responses were coded as unclear because participants either seemed to have not understood the statement and/or falsely thought the statement was incorrect, or were not able to give any argument, for instance, just stating “I suspect it”.

Table 6.5 Students’ proof schemes (462 observations)

Students’ proof schemes differed strongly by the type of statement (see Fig. 6.17). Regarding familiar statements, participants most often referred to the statement being a general rule or a known theorem (about 37%), in particular regarding the Pythagorean theorem. Further, one quarter of participants used pseudo arguments to justify the familiar statements. For instance, one student wrote that “if the sum is not 180 degrees it is not a triangle” (see also Tab. 5.4 in section 5.4.5). Another quarter of participants gave authoritarian arguments; these students often stated that they had learned the statement in school or in a lecture. In contrast, the majority of justifications for the unfamiliar statements were coded as empirical proof schemes (about 45% in total). Only a few students referred to the statement being a general rule (1%) or to any type of authority (4%). Complete or incomplete deductive arguments were given in 3 and 12% of the observations, respectively. Another 4% contained relevant aspects but no chain of arguments, and even fewer participants attempted to give a transformative argument (2%). 16% of the observations regarding the unfamiliar statements contained pseudo arguments, which often consisted of re-stating the claim. About half of the participants correctly provided one or more counterexamples to refute the false statement. Only a few participants gave complete or incomplete deductive arguments to disprove the false statement (2% each). Further, 18% gave empirical arguments to (falsely) justify the truth of the statement.

Figure 6.17

Students’ proof schemes by type of statement (based on 462 observations); empirical proof schemes in shades of blue, external proof schemes in shades of green, analytical proof schemes in shades of orange, counterexamples in yellow

Figure 6.18

Students’ proof schemes by type of statement and estimation of truth (based on 462 observations)

Figure 6.18 illustrates the relation between students’ proof schemes, the type of statement, and students’ (correct) estimation of truth of the statement. It should be noted that the number of observations for each of the proof schemes differed substantially (see Fig. 6.17) and the frequencies reported are based on these observations. Except for one observation regarding the false statement, participants with a (complete or incomplete) deductive proof scheme mostly estimated the truth of the statements correctly and with absolute conviction. The participant who gave a complete deductive proof to refute the false statement stated that “the statement is correct, provided that one accepts fractions as a solution.” The student then gave a correct proof of why the statement is generally false if fractions are not accepted as a solution. But regarding the closed item on estimating the truth value, the student chose the answer “Yes, I am absolutely sure the claim is correct.” Further, most participants referring to a rule or authority correctly estimated the truth of the statements with absolute conviction. Regarding the true statements, the vast majority of participants giving empirical arguments were only relatively convinced of the truth of the statement. Similarly, most students who gave examples to justify the truth of the false statement were relatively (and not absolutely) convinced of its truth (which is of course not the correct answer in this case). About 25% of participants who provided correct counterexamples for the false statement were not absolutely convinced of the falsity of the statement, but only relatively, even though they had in fact disproven the statement.

6.5 Understanding the Generality of Statements

To better compare the results on students’ understanding of generality with those of previous research, I first report results regarding those students who were absolutely convinced of the truth of a statement but not absolutely sure that there cannot be counterexamples. Depending on the type of statement, about 4 to 35% of the observations in which participants correctly estimated the truth of the statement contained inconsistent estimations of the existence of counterexamples (see Tab. 6.6). Participants more often showed a correct understanding of generality regarding the false statement than regarding the true statements. Moreover, the percentage of observations showing a correct understanding of the generality of statements was higher for the familiar statements than for the (true) unfamiliar statements. Based on a chi-squared test, these differences were highly significant with a medium effect size (\(\chi ^2(2) = 73.4\), \(p < .001\), Cramer’s V \(= .25\)).

Table 6.6 Number/percentage of observations in which the truth of the statement was correctly estimated (with absolute conviction) and the existence of counterexamples was also estimated correctly (yes) or not (no), by type of statement

The percentage of observations in which participants correctly estimated the truth of the statement (with absolute conviction) but inconsistently estimated the existence of counterexamples also differed by the type of argument (see Tab. 6.7). However, these differences were comparatively small. Moreover, in contrast to the type of statement, the differences regarding the type of argument were not significant (\(\chi ^2(3) = 3.5\), \(p = .32\), Cramer’s V \(= .05\)).

Table 6.7 Number/percentage of observations in which the truth of the statement was correctly estimated (with absolute conviction) and the existence of counterexamples was also estimated correctly (yes) or not (no), by type of argument

In the remainder of this section, results regarding students’ understanding of the generality of statements as defined in Table 5.1 are reported. Overall, in about 64% of all observations, participants showed a correct understanding of generality (see Fig. 6.19). In about 6% of the observations, participants claimed to not know the answer to the two respective questions. These observations were therefore treated as missing values in the regression analysis (see section 5.3.5).

Figure 6.19

Participants’ understanding of the generality of statements

The understanding of generality differed (significantly) by study program. As was expected, the higher the level of mathematics in the study program, the more participants seemed to have a correct understanding of generality (see Fig. 6.20). As most of the participants were in their first semester, an influence of the study program itself on students’ proof skills (including their understanding of the generality of statements) is highly unlikely. Therefore, the study program was not used as a predictor in the regression models (see further below). But other control variables that were considered may (at least partially) explain differences regarding the study program. For instance, as was shown in Fig. 6.2, participants in study programs with higher levels of mathematics had higher CRT scores. Unsurprisingly, then, students with a higher CRT score also more often showed a correct understanding of generality (see Fig. 6.21). Another variable that might explain differences regarding the study program is attendance of a mathematics honors course (LK). As was reported in section 5.2.2, students in study programs with higher levels of mathematics also participated in an honors course more often. As expected, the percentage of participants with a correct understanding of generality was higher for students who participated in an honors course than for students who had a regular mathematics course in high school (see Fig. 6.22).

Figure 6.20

Participants’ understanding of generality by study program

Figure 6.21

Participants’ understanding of generality by CRT score

Figure 6.22

Understanding generality by type of mathematics course in high school GK (regular course) vs LK (honors course)

Figures 6.23 and 6.24 illustrate the relation between students’ understanding of generality and the type of argument and the type of statement, respectively. Similar to the results reported above regarding the more restricted assessment of students’ understanding of generality (see Tab. 6.6), differences regarding the type of statement can be observed. The percentage of students who showed a correct understanding of generality was highest for the false (unfamiliar) statement. However, the differences between true familiar and unfamiliar statements are now less obvious. One reason for this is the comparatively high percentage of missing values regarding the familiar statements. Also in line with the results regarding the more restricted assessment of students’ understanding of generality (see Tab. 6.7), the differences regarding the type of argument are comparatively small. However, the percentage of observations in which participants showed a correct understanding of generality was lowest for ordinary proofs. Noteworthy is again the comparatively high percentage of missing values for participants who received no arguments.

Figure 6.23

Understanding of generality by type of argument

Figure 6.24

Understanding of generality by type of statement

At the end of the questionnaire, participants were asked about the meaning of generality (see Fig. 5.12 for the respective item and Fig. 6.25 for the results). The majority chose the correct answer (option 3). However, about a quarter responded incorrectly. Four participants chose option 4. These participants then either gave incorrect meanings of generality (e.g., “it [the statement] is valid until there is a case where this statement is not true.”) or claimed to not know.

Figure 6.25

Students’ knowledge of the meaning of mathematical generality. 1: “The statement is correct for many cases (e.g., for many odd numbers)”; 2: “The statement is correct in general, i.e., with few exceptions”; 3: “The statement is correct without any exceptions”; 4: “Something else, namely:...”

Figure 6.26 shows the relation between students’ actual understanding of generality–as defined in this study via consistent responses regarding the estimation of truth and the existence of counterexamples–and their knowledge of the meaning of mathematical generality. The percentage of participants with an incorrect understanding of mathematical generality was higher for students who also had an incorrect knowledge of the meaning of mathematical generality (about 41% vs. 27%). However, about 27% of participants with a correct knowledge of generality still responded inconsistently regarding their conviction of the truth of statements and the existence of counterexamples.

Figure 6.26

Understanding generality by knowledge of the meaning of mathematical generality

To estimate the effects of the variables of interest on students’ understanding of generality, generalized linear mixed models were used (see also section 5.4.1). As can be seen in Table 6.8, the three models that were fitted do not differ much regarding AIC. Models 2 and 3 seem to be better than Model 1 regarding both AIC and BIC. Given the small difference in AIC and the better BIC value, the smaller Model 3 seemed to be the best choice overall. The GLMM results confirm the observed effects reported above. Participants who received ordinary proofs were less likely to have a correct understanding of generality than participants who received no arguments, even though this effect did not quite reach significance after Holm’s adjustment (\(\upbeta =-.41, \text {p.adj}=.075\)). A similar but smaller effect can be observed for participants who received generic proofs and empirical arguments (\(\upbeta =-.32, \text {p.adj}=.148\) and \(\upbeta =-.27, \text {p.adj}=.148\)), also not reaching significance.
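The Holm-adjusted p-values reported throughout this chapter follow the standard step-down procedure, which can be reproduced in principle with statsmodels; the following is a minimal sketch with hypothetical raw p-values (not the study’s actual values):

```python
# Holm's step-down adjustment, as applied to the pairwise contrasts
# reported in this chapter. The raw p-values below are purely
# illustrative, not the study's actual values.
from statsmodels.stats.multitest import multipletests

raw_p = [0.025, 0.049, 0.074]  # hypothetical raw p-values for 3 contrasts

# Holm: sort ascending, multiply the i-th smallest p-value (0-indexed)
# by (m - i), then enforce monotonicity; reject while p_adj < alpha.
reject, p_adj, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
print([round(p, 3) for p in p_adj])  # -> [0.075, 0.098, 0.098]
```

Note how a raw p-value of .025 can exceed .05 after adjustment, which mirrors the ordinary-proof contrast narrowly missing significance above.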

Table 6.8 GLMM comparison regarding students’ understanding of generality

Familiarity with the statement as well as its truth value seemed to have influenced students’ understanding of generality. Participants were more likely to have a correct understanding of the generality of familiar statements (\(\upbeta =.29, \text {p.adj}=.011\)) and the false (unfamiliar) statement (\(\upbeta =.46, \text {p.adj}=.003\)) compared to true unfamiliar statements. Participants who correctly answered the closed item on the meaning of mathematical generality were also more likely to show a correct understanding of generality as defined in this study (\(\upbeta =.68, \textrm{p}<.001\)). This effect was overall the largest.

Further, among the considered control variables, only the CRT score and participation in an honors course were predictive of students’ understanding of generality. Participants with a higher CRT score were more likely to have a correct understanding of generality than participants with a lower score (\(\upbeta =.66, \textrm{p}<.001\)). Similarly, participants who had attended an honors course were also more likely to have a correct understanding than participants who had taken a regular mathematics course in high school (\(\upbeta =.38, \textrm{p}=.010\)). Participation in the transition course had an unexpected negative effect (and was therefore excluded from the final model), which, however, did not quite reach significance (in Model 2, \(\upbeta =-.22, \textrm{p}=.081\)). The effect of the final mathematics grade was comparatively small and not significant (in Model 1, \(\upbeta =-.10, \textrm{p}=.463\)).
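Since the fitted models use a logit link, the reported coefficients can be read as log odds ratios. A minimal worked example using the CRT coefficient reported above (assuming the score enters the model linearly on the logit scale):

```python
import math

# A logit coefficient is a log odds ratio: beta = 0.66 for the CRT
# score means the odds of showing a correct understanding of
# generality are multiplied by exp(0.66) per one-unit increase of the
# predictor (how the score was scaled is not restated here).
beta_crt = 0.66
odds_ratio = math.exp(beta_crt)
print(round(odds_ratio, 2))  # -> 1.93
```

That is, under this reading, each additional unit on the CRT predictor roughly doubles the odds of a correct understanding of generality.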

Students’ Understanding of Generality in Relation to Their Conviction and Comprehension

Figure 6.27 shows the relation between students’ level of conviction regarding the different arguments and their understanding of generality. For empirical arguments, there is a negative relation between students’ understanding of generality and their level of conviction: Participants who were (partially or completely) convinced by empirical arguments more often had an incorrect understanding of generality (about 37% and 43%, respectively) than participants who claimed to not find the empirical arguments convincing at all (about 20%).

Figure 6.27

Understanding generality by type of argument and level of conviction

A simple mixed effects logistic regression was calculated to analyze this relation, in which the individual participants were considered as a random effect (see Model 1 in Tab. 6.9). Participants who claimed to find the empirical arguments not at all convincing were more likely to have a correct understanding of generality than participants who claimed to be partially convinced by empirical arguments (\(\upbeta =.90, \text {p.adj}=.027\)). In contrast, participants who were completely convinced by empirical arguments were less likely to have a correct understanding of generality than participants who were partially convinced, but this effect was comparatively smaller and not significant (\(\upbeta =-.26, \text {p.adj}=.353\)).
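A model of this structure (binary outcome, fixed effect of conviction level, per-participant random intercept) can be sketched as follows. The study’s models were presumably fitted with a standard frequentist GLMM package (e.g., lme4 in R); this sketch uses statsmodels’ Bayesian mixed GLM as an approximate analogue, on simulated data, and all variable names are hypothetical:

```python
# Sketch of a mixed effects logistic regression with a per-participant
# random intercept, analogous in structure to Model 1 in Table 6.9.
# Data are simulated; column names are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

rng = np.random.default_rng(1)
n_part, n_obs = 40, 4
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_part), n_obs),
    # 0 = not at all, 1 = partially, 2 = completely convinced
    "conviction": rng.integers(0, 3, n_part * n_obs),
})
u = rng.normal(0, 0.8, n_part)                       # random intercepts
eta = -0.3 + 0.9 * (df["conviction"] == 0) + u[df["participant"]]
df["correct"] = (rng.random(len(df)) < 1 / (1 + np.exp(-eta))).astype(int)

# Fixed effect: conviction level; variance component: participant.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ C(conviction)", {"participant": "0 + C(participant)"}, df)
result = model.fit_vb()                              # variational Bayes fit
print(result.summary())
```

The `C(conviction)` coding makes "not at all" the reference level here; in the reported models, "partially convinced" served as the reference category instead.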

Table 6.9 GLMM comparison regarding students’ understanding of generality in relation to conviction and comprehension (Model 1 regarding empirical arguments; Models 2–4 regarding generic and ordinary proofs)

This effect seems to be reversed for generic and ordinary proofs. However, more than a quarter of the observations in which students found the arguments not convincing at all consisted of missing values regarding their understanding of generality (these participants responded “I have no idea” to the two relevant questions, see section 5.3.5). After removing these observations, the percentage of observations with a correct understanding of generality in which participants claimed to be only partially convinced by the generic or ordinary proofs was lower than for both participants claiming to be completely convinced by the argument and participants who were not convinced at all (about 53% vs. about 70% and 65%, respectively). A mixed effects logistic regression was calculated with the individual participants as a random effect and the type of argument and the level of conviction as fixed effects (see Model 2 in Tab. 6.9). Participants who claimed to find generic or ordinary proofs completely convincing were more likely to have a correct understanding of generality than participants who claimed to be only partially convinced (\(\upbeta =.74, \text {p.adj}<.001\)). Notably, participants who were not at all convinced by the proofs were also more likely to have a correct understanding of generality than participants who claimed to be only partially convinced, but this effect did not reach significance (\(\upbeta =.52, \text {p.adj}=.202\)). The effect of the type of argument (generic vs. ordinary proof) was very small and clearly not significant (\(\upbeta =.05, \textrm{p}=.788\)).

Figure 6.28

Understanding generality by type of argument and level of comprehension

Similar to the findings regarding students’ conviction, the percentage of missing values for understanding generality is very high for participants who claimed to have not understood the generic and ordinary proofs at all (see Fig. 6.28). There seems to be a positive relation between students’ self-reported proof comprehension and their understanding of generality, as the percentage of students with a correct understanding of generality was highest for students claiming to have understood the respective arguments completely, for both generic and ordinary proofs (about 68% and 71%, respectively). However, due to the percentage of missing values, the effect is again less clear for students who claimed to have not understood the argument at all. After removing observations with missing values, for generic arguments, the percentage of students with a correct understanding of generality was slightly higher for students claiming to have not understood the respective arguments at all than for those claiming to have partially understood them (about 55% and 52%, respectively). For ordinary arguments, the positive relation between students’ self-reported proof comprehension and their understanding of generality holds in general: The higher the level of proof comprehension, the higher the percentage of observations with a correct understanding of generality (about 43%, 57%, and 71%). A mixed effects logistic regression was again calculated with the individual participants as a random effect and the type of argument and the level of self-reported comprehension as fixed effects (see Model 3 in Tab. 6.9). Participants who claimed to have understood the generic or ordinary proofs completely were more likely to have a correct understanding of generality than participants who claimed to have only partially understood the arguments (\(\upbeta =.70, \text {p.adj}=.001\)).
In contrast, participants who claimed to have not understood the proofs at all were less likely to have a correct understanding of generality, but this effect was comparatively smaller and not significant (\(\upbeta =-.29, \text {p.adj}=.411\)). The type of argument was again not predictive (\(\upbeta =.10, \textrm{p}=.578\)).

A further mixed effects logistic regression was fitted, which included the predictive variables from the main analysis of students’ understanding of generality (Model 3 in Tab. 6.8). After controlling for these variables, the observed effects of students’ conviction and proof comprehension are partly different. The direction of the effect of students’ self-reported proof comprehension remains: Students who claimed to have understood the proofs completely were more likely to have a correct understanding of generality and students who self-reportedly did not understand the proofs at all were less likely to have a correct understanding of generality, both compared to students who claimed to have only partially understood the proofs (\(\upbeta =.42, \text {p.adj}=.169\) and \(\upbeta =-.79, \text {p.adj}=.159\), respectively), even though these effects did not reach significance. The effect regarding students being completely convinced compared to students being only partially convinced by the proofs remains positive, but no longer reaches significance (\(\upbeta =.35, \text {p.adj}=.202\)). The effect that students who were not convinced by the argument at all were more likely to have a correct understanding of generality than students who claimed to be partially convinced is larger after controlling for other variables (\(\upbeta =1.07, \text {p.adj}=.013\)). The positive effects of the CRT score and students’ correct knowledge of the meaning of mathematical generality on students’ understanding of generality mainly remain. However, participation in an honors course (LK) and familiarity with the statement were no longer predictive once students’ proof comprehension and conviction were included in the model (\(\upbeta =.23, \textrm{p}=.263\) and \(\upbeta =.17, \textrm{p}=.299\), respectively).

Students’ Understanding of Generality in Relation to Their Proof Schemes

Figure 6.29 gives an overview of students’ proof schemes in relation to their understanding of generality. The percentage of students with a correct understanding of generality was highest for students with (complete and incomplete) deductive proof schemes (about 90%) and lowest for students with purely empirical (no awareness of generality) or incomplete transformative proof schemes (about 60%). The percentage of participants with a correct understanding of generality was similar for students referring to relevant aspects, authorities, or a rule, or giving pseudo arguments, namely about 75 to 80%.

To analyze the statistical significance of these differences, a chi-square test was used. To increase the power of the test, the categories were summarized as explained in section 5.4.6. Overall, a relation between students’ proof schemes and their understanding of generality seems to exist (see Tab. 6.10). The percentage of students with a correct understanding of generality was highest for students with an analytical proof scheme (about 84%), followed by students with an external proof scheme or one that consists of giving correct counterexamples (about 77% and 76%, respectively). The lowest percentage was observed for students with empirical proof schemes (about 62%). The percentage of students with a correct understanding of generality whose justifications were coded as unclear was lower than the average in group A (about 69% vs. 73%). The reported differences are statistically significant with a medium effect size (\(\chi ^2(4) = 12.0\), \(p=.017\), Cramér’s V \(= .16\)).
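A test of this kind, including the effect size, can be reproduced in principle as follows. The cell counts below are invented to roughly match the reported percentages and are not the study’s actual contingency table:

```python
# Chi-square test of independence plus Cramér's V for the relation
# between summarized proof schemes and understanding of generality.
# Counts are illustrative only, not the study's actual data.
import numpy as np
from scipy.stats import chi2_contingency

# rows: analytical, external, counterexample, empirical, unclear
# columns: correct / incorrect understanding of generality
table = np.array([
    [84, 16],
    [77, 23],
    [76, 24],
    [62, 38],
    [69, 31],
])
chi2, p, dof, expected = chi2_contingency(table)
n = table.sum()
# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.3f}, V = {cramers_v:.2f}")
```

With a 5×2 table, the degrees of freedom are (5−1)(2−1) = 4, matching the reported \(\chi^2(4)\) statistic.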

Figure 6.29

Understanding of generality by students’ proof schemes (based on 462 observations)

Table 6.10 Numbers/percentages of observations in which students had a correct (yes) or incorrect (no) understanding of generality, by proof scheme

6.6 Analysis of Missing Values

The results presented above have shown a comparatively high percentage of students who responded “I have no idea” to both questions used to determine their understanding of generality. Even though these observations are not true missing values (because the participants did in fact choose an answer to the two relevant questions), a decision regarding their understanding of the generality of statements could not be made for these observations. Therefore, they were treated as missing values in the regression analyses. This section aims to identify patterns among these observations by calculating mixed effects logistic regression models. A dummy variable dropout generality was defined as follows:

  • yes (1), for missing values in the variable understanding generality, and

  • no (0), if a value for understanding generality was observed (either yes or no).
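For illustration, this dummy coding can be sketched as follows; the column names and response codes are hypothetical, not the study’s actual variable names:

```python
# Constructing the dropout dummy defined above: 1 if the participant
# answered "I have no idea" to BOTH the estimation of truth and the
# counterexample question (i.e., understanding generality is missing),
# 0 otherwise. Column names and codes are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "truth_estimation": ["true", "no idea", "false", "no idea"],
    "counterexamples":  ["no",   "no idea", "no idea", "no idea"],
})
no_idea_both = (
    (df["truth_estimation"] == "no idea")
    & (df["counterexamples"] == "no idea")
)
df["dropout_generality"] = no_idea_both.astype(int)
print(df["dropout_generality"].tolist())  # -> [0, 1, 0, 1]
```

Note that an observation with only one “I have no idea” response (third row) is not coded as a dropout, since an inconsistency judgment is still possible there.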

Table 6.11 GLMM results of the dropout variable regarding students’ understanding of generality

The individuals were again used as a random effect. All variables that were considered in the regression models above were used as fixed effects to analyze the potential relation between these variables and observations with missing values (in the sense described above) regarding the understanding of generality. Table 6.11 shows the regression results. To estimate the effect of the type of argument participants received, Model 1 excluded the variables regarding students’ conviction and comprehension, because these data were not collected for participants in group A, who received no arguments. Participants who received any type of argument were less likely to “drop out” (i.e., to answer “I have no idea” regarding both the estimation of truth and the existence of counterexamples) than participants who received no arguments at all. This effect was particularly large and highly significant for empirical arguments (\(\upbeta =-1.23, \textrm{p}<.001\)) and generic proofs (\(\upbeta =-.97, \textrm{p}<.001\)), but also present for ordinary proofs (\(\upbeta =-.57, \textrm{p}=.031\)). Compared to the true unfamiliar statements, participants were significantly more likely to answer “I have no idea” regarding the familiar statements (\(\upbeta =1.80, \textrm{p}<.001\)) and the false statement (\(\upbeta =.96, \textrm{p}=.003\)). Participants who had attended a mathematics honors course in high school were less likely to drop out than participants who had attended a regular mathematics course (\(\upbeta =-1.22, \textrm{p}<.001\)). Notably, the CRT score seemed to have only a minor effect on participants’ likelihood of choosing “I have no idea” regarding both questions, not reaching significance (\(\upbeta =-.47, \textrm{p}=.063\)). Furthermore, the higher the mathematics grade (which in Germany means a worse grade), the more likely participants were to drop out (\(\upbeta =.44, \textrm{p}=.036\)). However, similar to the CRT score, this effect was comparatively small.

In Model 2, only observations from groups C and D (generic and ordinary proofs) were considered, and the false statement was again excluded. Because participation in a transition course showed no significant effect, it was excluded from Model 2, also because of the otherwise large number of variables. Participation in an honors course was not predictive of missing values in the variable understanding generality once comprehension and conviction were included (\(\upbeta =-.92, \textrm{p}=.114\)). The levels of self-reported comprehension and conviction were both predictive of the missing values regarding understanding of generality. In particular, students who claimed to have partially or completely understood the proofs were less likely to drop out than students who claimed to have not understood the proofs at all (\(\upbeta =-1.57, \textrm{p}=.007\) and \(\upbeta =-2.52, \textrm{p}=.003\)). Similarly, students who found the proofs partially or completely convincing were less likely to drop out than students who claimed to be not at all convinced by the proofs (\(\upbeta =-1.06, \textrm{p}=.085\) and \(\upbeta =-3.01, \textrm{p}=.004\)).

Overall, these results indicate that the missing values–observations in which students responded “I have no idea”–substantially depended on other variables, such as the type of statement, the type of argument (or, more generally, receiving any argument at all), the self-reported comprehension of the proofs, and how convincing students evaluated the proofs to be.

6.7 Summary of Main Results

This section provides an overview of the main results of the present study, in particular regarding the influence of the type of argument and statement on proof-related activities, and students’ understanding of the generality of mathematical statements and the relation to proof reading and construction.

6.7.1 Influence of the Type of Argument

The study was mainly designed to experimentally analyze the influence of the type of argument–receiving no arguments, empirical arguments, generic proofs, or ordinary proofs–on students’ understanding of the generality of statements and other proof-related activities (see section 5.2.3). In summary, the type of argument significantly influenced:

  • Students’ estimation of truth: Participants who received empirical arguments were more likely to correctly estimate the truth value of the statements than participants who got no arguments. Reading generic proofs had a similar but smaller effect, which did not reach significance after Holm’s correction was applied.

  • Students’ proof evaluation regarding conviction: Participants who received generic or ordinary proofs were more likely to claim to be convinced by these arguments than participants who received empirical arguments. The reasons why participants claimed not to be convinced by the arguments also differed by the type of argument. The reason most often given by participants who received empirical arguments was a lack of generality of these arguments (78% of observations), while participants who received generic or ordinary proofs were mainly not convinced by these arguments because they did not (completely) understand them (64% and 81%, respectively).

  • Students’ proof comprehension: Participants who received ordinary proofs were less likely to have self-reportedly understood the arguments than participants who received generic proofs. Further, the aspects that participants claimed to have not understood differed between generic and ordinary proofs. For instance, the generality of the proof was not mentioned at all by participants who received ordinary proofs, but was mentioned by 14% of those who received generic proofs.

  • The probability of missing values: Participants who received any type of argument were less likely to answer “I have no idea” regarding the estimation of truth and the existence of counterexamples than participants who received no arguments. This effect was particularly strong for participants who got empirical arguments.

The type of argument did not have a large effect on students’ understanding of the generality of statements, but participants who received ordinary (and with a smaller effect generic) proofs were less likely to have a correct understanding than students who got no arguments. However, these effects did not reach significance after Holm’s correction.

6.7.2 Influence of the Type of Statement

To analyze the influence of the type of statement (truth value and familiarity), all participants received five statements of different types: Two (true) familiar statements, two (true) unfamiliar statements, and one false (unfamiliar) statement. Overall, the truth value of the statement and the familiarity with the statement both significantly influenced students’ performance in all considered activities. In summary, compared to true, unfamiliar statements, participants were

  • less likely to correctly estimate the truth value of the false statement,

  • less likely to be convinced by (incorrect) arguments regarding the false statement,

  • more likely to show a correct understanding of generality regarding the false statement,

  • more likely to answer “I have no idea” regarding the estimation of truth and the existence of counterexamples for the false statement.

Regarding the familiar (geometry) statements, participants were

  • more likely to correctly estimate the truth value,

  • more likely to be convinced by the arguments,

  • less likely to claim to have understood the arguments,

  • more likely to show a correct understanding of generality (with a comparatively smaller effect than regarding the false statement),

  • much more likely to answer “I have no idea” regarding the estimation of truth and the existence of counterexamples,

all compared to the true, unfamiliar statements from elementary number theory.

Moreover, students’ proof schemes also differed by the type of statement: Participants mainly gave counterexamples to refute the false statement (51%), empirical arguments to justify the true unfamiliar statements (45%), and external arguments (pseudo, rule based, authority) to justify the true familiar statements (25, 37, and 24%, respectively).

6.7.3 Students’ Understanding of Generality and the Relation to Proof

The focus of the present thesis was to analyze students’ understanding of the generality of mathematical statements. Most predictive of students’ (correct) understanding of generality was their knowledge of the meaning of mathematical generality (measured via the closed item shown in Fig. 5.12). Furthermore, the truth value of the statement and the familiarity with it both influenced students’ understanding of generality (see above). While reading different types of arguments significantly affected the probability of participants answering “I have no idea” to the two relevant questions (estimation of truth and existence of counterexamples), it did not seem to have a large effect on students’ understanding of the generality of statements. The analysis of the relation between students’ understanding of generality and their performance in other proof-related activities suggests:

  • There is a positive relation between students’ self-reported proof comprehension and their understanding of the generality of statements: Students who claimed to have completely understood the proofs were more likely to have a correct understanding of generality and students who claimed to have not understood the proofs at all were less likely to have a correct understanding of generality, both compared to students who claimed to have partially understood the proofs.

  • There is a negative relation between students’ conviction of empirical arguments and their understanding of generality: Students who claimed to be not at all convinced by the empirical arguments were more likely to have a correct understanding of the generality of statements than those who claimed to be partially (or completely) convinced by the arguments.

  • There is no clear relation between students’ evaluation of generic and ordinary proofs and their understanding of generality. Students who claimed to be completely convinced by these arguments might be more likely to have a correct understanding of generality than students who were only partially convinced. However, after considering other (predictive) variables (such as knowledge of the meaning of generality and the CRT score), this effect did not reach significance. Moreover, participants who claimed to be not at all convinced by the arguments were more likely to have a correct understanding of generality.

  • Students’ proof schemes are related to their understanding of the generality of statements: Participants with empirical proof schemes most often had an incorrect understanding of generality, followed by participants with external proof schemes. Among the participants with deductive proof schemes, the percentage of participants with an incorrect understanding of the generality of statements was the lowest.

6.7.4 Predictive Power of Control Variables

The CRT (Cognitive Reflection Test) score and the attendance of an honors mathematics course in high school (LK) were overall the most predictive control variables. In particular, participants with a higher CRT score were more likely

  • to have a correct understanding of generality,

  • to correctly estimate the truth value,

  • to claim to have (completely or partially) understood the proofs.

Attendance of an honors course had similar effects; however, after proof comprehension (and conviction) was considered, it was no longer predictive of a correct understanding of generality (the CRT score still was). Unexpectedly, the final mathematics grade was only predictive of students’ self-reported proof comprehension (with a smaller effect compared to the CRT score and LK participation) and the estimation of truth (even though not reaching significance), but not of the understanding of the generality of statements.