In this chapter, the research design and methods used to answer the research questions are described in detail. First, I justify the chosen approach and give an overview of my research design. Second, the data collection process is outlined; in particular, the experiment and the construction of all instruments are described. Lastly, I explain and justify the methods used to analyze the data.

5.1 Research Design

Based on the present research interest, a quantitative, experimental research design was chosen. In this section, I first justify the decision for the particular research approach. Then, an overview of the research process and the development of the research design is given.

5.1.1 Justification of the Research Approach

Despite numerous studies on students’ proof skills, surprisingly few have specifically focused on students’ understanding of the generality of mathematical statements. Most of these studies have used qualitative methods, which limits the generalizability of their findings (e.g., Bryman, 2012; Mat Roni, Merga, & Morris, 2020). I therefore took a different approach and conducted a quantitative study to determine the extent to which students lack understanding of the generality of statements and to examine how their understanding relates to reading and constructing mathematical arguments. A particular goal was to investigate the impact of reading different types of arguments and statements on students’ performance in proof-related activities, with a specific focus on their understanding of generality. To uncover cause-and-effect relationships, I designed and implemented an experimental study (e.g., Bryman, 2012). The experiment was further supplemented with open-ended questions to gain a deeper understanding of students’ comprehension and conceptions of proof and generality.

To estimate the effect of different types of arguments on students’ performance in proof-related activities, a between-subjects approach was taken, i.e., participants were randomly allocated to different groups, in which they received only one particular type of argument for all statements (experimental groups)—or no arguments at all (control group). Another approach would have been to provide different types of arguments to each participant. However, I decided against such a within-subjects approach regarding the type of argument, mainly because I wanted to avoid potential influences caused by participants directly comparing the types of arguments, in particular regarding proof evaluation—which would also be an interesting question, but not one I was investigating (see also discussion in Section 7.2.2).

A within-subjects approach via repeated measures was chosen to investigate the effect of different types of statements (familiarity and truth value, see Chapter 4) on students’ understanding of generality of the statements and their performances in other proof-related activities. Furthermore, the inclusion of different statements was assumed to provide more reliable results (e.g., Bryman, 2012; Döring & Bortz, 2016).

A field experiment was chosen to provide high external validity (e.g., Döring & Bortz, 2016). The decision against conducting a laboratory experiment was mainly based on the difficulty of recruiting participants and a potentially resulting selection bias. Potential advantages of and suggestions for laboratory experiments are discussed in Section 7.2.2.

5.1.2 Overview of the Research Process

The research process consisted of two parts: the pilot study and the main study. The pilot study was conducted in October 2019 and aimed to ensure the feasibility of the chosen approach and to identify any necessary modifications to the overall design and items. Figure 5.1 provides an overview of the timeline of the research process.

Figure 5.1

Timeline of research process

The first version of the experimental design was developed between June and September 2019. After the research questions had been specified, the experimental groups regarding the type of argument were defined. Due to their prominent role in mathematics education, the following types of arguments were chosen: empirical arguments, generic proofs, and ordinary proofs (see Section 2.4.2). The fourth group received no arguments at all and thus served as the control group.

During the development of the experimental design, the questionnaire was given to several colleagues working in mathematics and mathematics education as well as to a small number of high school and university students. The researchers in mathematics (education) were asked to evaluate the questions regarding correctness (both mathematical and linguistic), validity, completeness, structure, and any other aspects they noticed; the students mainly provided feedback regarding comprehensibility, completeness of answer options, and whether the questionnaire was reasonable in length and duration (Footnote 1). The experimental design and items were then revised accordingly.

The pilot study was deliberately conducted during the second session of a first-semester mathematics lecture (titled Arithmetik und Algebra) to avoid any influence of university lectures on students’ performance in proof-related activities, since the aim of this study was to identify students’ understanding and performance in proof-related activities when they enter university. The participants were students who wanted to become either primary (the majority) or lower secondary school teachers. They were randomly given a piece of paper with a QR code with which they could access one of four different online questionnaires, corresponding to the four experimental groups, via their mobile phones, tablets, or laptops. In total, 382 students completed the pilot questionnaire in October 2019.

The data of the pilot study was then analyzed, and design issues regarding structure, selection of statements and arguments, as well as other necessary modifications were identified. The experimental design was adapted accordingly, and the questionnaire was again given to and discussed with colleagues and students. The main study was conducted in November 2020, again right at the beginning of the winter term. The data was collected in the lecture Arithmetik und Algebra as well as in a second lecture—Lineare Algebra 1—for first-semester mathematics majors and preservice higher secondary school teachers. The decision to collect data in these two lectures was made to obtain a larger sample size and to investigate differences regarding the study program and students’ (cognitive) resources.

5.2 Data Collection

In this section, the data collection of the main study is described in detail. First, background information regarding the setting in which the data collection took place is presented. Second, the characteristics of the sample and the participants’ background are described. Lastly, the experimental design is outlined.

5.2.1 Setting

As mentioned in Section 5.1.2, the experiment was conducted in two mathematics lectures at Bielefeld University in North Rhine-Westphalia (NRW), Germany, which are aimed at two different groups of first-semester university students.

Due to the COVID-19 pandemic and the respective lockdowns, university courses could not be given in person. This meant that, in contrast to the pilot study, the participants of both lectures were not present in a lecture hall but attended the lectures via the online conference tool Zoom. Students therefore participated in the experiment by answering online questionnaires, which implied some lack of control over the participants’ environment, a disadvantage of internet-based research. However, internet-based research seems to be as valid as more traditional methods, such as pencil-and-paper questionnaires (e.g., Gosling, Vazire, Srivastava, & John, 2004). Moreover, because students were not present in a lecture hall, another way of randomly assigning them to the four experimental groups had to be found. The decision was made to use the breakout room function implemented in Zoom, which makes it possible to create several sub-meetings and assign participants to them randomly. The four versions of the questionnaire (one for each experimental group) were implemented with the web-based survey software unipark. The links were given to the participants via chat in the respective Zoom breakout rooms.

5.2.2 Participants

In total, 430 students completed the questionnaire (67.4% female, 31.2% male, and 1.4% chose not to answer). The average age of the participants was about 21 years (\(SD=4.2\)), and about 96% were German native speakers. Of the participants, 116 received no arguments, 112 empirical arguments, 107 generic proofs, and 95 ordinary proofs. As the same number of participants had initially been allocated to each experimental group (via the Zoom breakout room function), the percentage of participants not completing the questionnaire (i.e., dropping out of the experiment) was highest for ordinary proofs.

Figure 5.2 provides an overview of the distribution of participants with respect to the study program. The vast majority of the participants were preservice primary school teachers without mathematics as a major (290; Footnote 2), followed by mathematics students (70; Footnote 3), preservice lower secondary school teachers (26), preservice higher secondary school teachers (25), and preservice primary school teachers with mathematics as a major (19). This distribution corresponds roughly to the actual distribution of mathematics students at the faculty of mathematics at Bielefeld University (see Universität Bielefeld, 2020). About 94% of the participants stated that they were attending the respective lecture for the first time. However, about 25% of the students were in their second semester or higher (Footnote 4). It can therefore not be ruled out that these students had already gained experience with proof in other mathematics lectures. About 57% of the participants attended a transition course—a so-called Vorkurs (see footnote in Section 5.3.7)—prior to the lecture (58% of participants in Arithmetik und Algebra and 54% of participants in Lineare Algebra 1). Most participants obtained their university entrance degree in North Rhine-Westphalia (about 90%), about 9% in another German state, and less than 1% in another country. The mean university entrance grade (Footnote 5) of the participants (\(M=2.28\), \(SD=0.58\)) corresponds approximately to the average university entrance grade in North Rhine-Westphalia (e.g., in 2021: \(M=2.35\); see Kultusministerkonferenz, 2022); the participants’ mean final high school grade in mathematics was \(M=2.36\) with a noticeably higher dispersion (\(SD=1.05\)).
About 37% of participants (about 25% of participants that attended the lecture Arithmetik und Algebra and 81% of participants that attended Lineare Algebra 1) specialized in mathematics during high school in a so-called Leistungskurs (honors course, see footnote in Section 5.3.7).

Figure 5.2

Distribution with respect to study program

5.2.3 Structure of the Experiment

I designed an experiment that mainly aimed at analyzing the influence of different types of arguments on students’ understanding of the generality of mathematical statements, their conviction of the truth of the statements, and their comprehension of proof. Apart from the instructions, the experiment consisted of three main parts, as shown in Figure 5.3. The instructions contained explanations of the overall goal and implications of the project—investigating students’ knowledge at the transition from school to university in order to better support future students. More specific research interests, such as students’ understanding of generality or students’ proof skills, were not communicated, to avoid influencing the participants. In the first part of the experiment, all participants had to read five universal statements and respective arguments. The type of argument participants received depended on the experimental group to which they were randomly assigned: Group A received no arguments at all, Group B only empirical arguments, Group C generic proofs, and Group D ordinary proofs. All participants then had to estimate the truth value of each statement and decide whether or not counterexamples might exist. In addition, participants in Group A—who received no arguments—were asked to justify the truth or falsity of each statement. The participants in the remaining groups were asked to evaluate the provided arguments regarding conviction and whether they had comprehended them. At the end of the first part, all participants were asked to evaluate the difficulty of the questions asked so far. In the second part, all participants had to complete a Cognitive Reflection Test (see Section 5.3.6). In the third and last part, participants were asked to answer questions about their demographics as well as their understanding of the meaning of mathematical generality (in German Allgemeingültigkeit).
The decision to put the demographic questions at the end was made because thinking about these questions can unconsciously influence the participants’ answers to other questions. For instance, Steele and Ambady (2006) showed that “women who were subtly reminded of ... their gender identity ... expressed more stereotype consistent attitudes towards the academic domain of mathematics ... than participants in control conditions.” (p. 428)

Figure 5.3

Overall experimental design

5.3 Instruments

In this section, I first describe and justify the selection of statements and arguments. Then, in Sections 5.3.2, 5.3.3, 5.3.4, and 5.3.5, the respective items addressing the research questions are specified. In Section 5.3.6, the CRT (Cognitive Reflection Test), used as a control instrument for individuals’ cognitive resources, is introduced. Lastly, I give a summary of the collected demographic information.

5.3.1 Selection of Statements and Arguments

The influence of the type of statement (familiarity and truth value) on students’ understanding of the generality of statements, their conviction of the truth of the statements, their comprehension of the statements, and their construction of arguments was investigated in this study (see research questions in Chapter 4). Because of their prominent place in school curricula, the following two familiar statements were chosen: 1) the Pythagorean theorem and 2) the sum of the interior angles of a triangle. Both statements are explicitly listed in all NRW lower secondary curricula, and it is even expected that school students prove these statements or justify the overall idea of the proof (e.g., Ministerium für Schule und Bildung des Landes Nordrhein-Westfalen, 2019, p. 30 and p. 34). Even though it is questionable whether the participants actually had to prove these theorems in school, it can be assumed that they had to apply them frequently in their mathematics classes.

Secondly, for the selection of unfamiliar mathematical statements, three main criteria were specified:

  • The statement should not be explicitly mentioned in the school curricula in NRW.

  • Only (little) basic content knowledge should be necessary to understand and prove the statement.

  • It should be possible to support/prove the statement by the three types of arguments: empirical, generic, and ordinary.

In previous studies, statements from elementary number theory (arithmetic) have often been chosen because they generally require comparatively little prior knowledge, are therefore mostly easy to understand, and can quite easily be proven by generic arguments (e.g., Barkai et al., 2002; Healy & Hoyles, 2000; Kempen, 2018; Martin & Harel, 1989; Tabach et al., 2011). Therefore, two unfamiliar statements (both of them true) were selected: 1) the sum of two odd numbers is always even and 2) the product of two odd numbers is always odd. The first statement in particular has been used in studies on proof and argumentation before (e.g., Healy & Hoyles, 2000; Kempen & Biehler, 2019).

Finding suitable (unfamiliar) universal statements that are false and fulfill all three criteria turned out to be more difficult. In particular, it had to be possible to construct (non-general) arguments (generic and ordinary proofs) for these statements for which it is not too obvious why they do not prove the truth of the statement for all cases. One such statement, which had been used in prior research (e.g., Barkai et al., 2002), was identified: The sum of three consecutive numbers is always divisible by 6. This statement proved to be suitable because both a generic proof and an ordinary proof could be constructed rather easily on the basis that the statement is true if (and only if) the first number is odd (see Fig. 5.5 and 5.6 further below).
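For illustration (this derivation was not part of the questionnaire itself): the sum of three consecutive natural numbers starting at \(n\) is
\[
n + (n+1) + (n+2) = 3n + 3 = 3(n+1),
\]
which is always divisible by 3, but divisible by 6 only if \(n+1\) is even, that is, only if the first number \(n\) is odd. For instance, \(2+3+4=9\) provides a counterexample to the claim, whereas \(3+4+5=12\) conforms to it.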

In summary, five statements were selected for the main study: two of them familiar, two of them unfamiliar, and one of them false (and unfamiliar). The decision was made to phrase all statements in natural language because, firstly, students tend to have difficulties interpreting (more) formal, symbolic statements, in particular quantification (e.g., Dubinsky & Yiparaki, 2000; Piatek-Jimenez, 2004; J. Selden & Selden, 1995). Secondly, the form of all statements became more similar this way and thus more comparable. For instance, the Pythagorean theorem is often presented symbolically as \(a^2+b^2=c^2\), where \(a\), \(b\) are the legs and \(c\) the hypotenuse of a right triangle. To avoid any influence of the statement being expressed as an equation, which students might simply associate with a formula, the Pythagorean statement was also phrased in natural language. Further, to express and emphasize the generality of the statements, the terms beliebig (in English arbitrary) and immer (in English always) were used. The term Behauptung (in English claim) was used for all statements to express uncertainty about the truth value.

Regarding the order of the statements in the questionnaire, two possibilities were discussed: randomizing the order or ordering the statements from (presumably) easiest to most difficult. Randomization would have had the advantage of accounting for potential influences of the order of statements. However, it was decided to order the five statements by difficulty instead, because studies have shown that this can lead to a lower percentage of participants dropping out of the study and to higher overall performance (e.g., Anaya, Iriberri, Rey-Biel, & Zamarro, 2022; Kleinke, 1980). Placing the most difficult item right at the beginning of a questionnaire, which can occur if items are randomly ordered, particularly increases the risk of high dropout rates (Anaya et al., 2022). To assess the difficulty of the statements, the knowledge required to understand each statement and the complexity of the proofs (number of steps, required concepts, etc.) were considered. Further, several colleagues were asked to evaluate the difficulty of the statements and proofs. The type of statement (i.e., familiarity and truth value) was also taken into account. This process led to the following order of statements (English translations; see Appendix A in the Electronic Supplementary Material for the original German items):

  1. Claim 1:

    The sum of two arbitrary odd numbers is always even.

  2. Claim 2:

    In an arbitrary triangle, the sum of the interior angles is always equal to \(180^{\circ }\).

  3. Claim 3:

    The product of two arbitrary odd numbers is always odd.

  4. Claim 4:

    The sum of three arbitrary consecutive natural numbers is always divisible by 6.

  5. Claim 5:

    In an arbitrary right triangle, the sum of the areas of the squares of the legs is always equal to the area of the square of the hypotenuse.

In the following, the selection and phrasing of the different types of arguments are described and justified.

5.3.1.1 Empirical Arguments

The empirical arguments used in the present study consisted of four examples for the false statement and the two unfamiliar (arithmetic) statements, and three examples for both familiar (geometry) statements (Footnote 6) (see Fig. 5.4 for the empirical arguments used to justify claims 1 and 2, respectively). The empirical arguments always started with the sentence Begründung: Ich habe mir verschiedene Beispiele angeschaut und die Behauptung überprüft (which roughly translates to Justification: I looked at several examples and verified the claim), followed by the respective examples. The examples were chosen such that they appear to cover various cases (e.g., right and equilateral triangles)—which some students seemingly consider when they evaluate or construct empirical arguments (e.g., Chazan, 1993)—and such that participants could easily verify their correctness.

Figure 5.4

Example items for empirical arguments to justify claims 1 and 2 (translated)

5.3.1.2 Generic Proofs

All generic proofs (and the incorrect one for statement 4) consisted of specific examples that reveal the respective underlying structure that can be generalized, and an explanation that the illustrated idea indeed works for any other example as well, as suggested, for instance, by Kempen and Biehler (2019) (see Fig. 5.5 for example items). The explanations were rather detailed to ensure that participants could follow the arguments. For instance, in the generic proof for claim 1, it was explained that every odd number can be divided into pairs of twos such that exactly one is left over. The decision to use more detailed explanations was made because students seem to lack knowledge regarding basic concepts (Conradie & Frith, 2000; Moore, 1994; Reiss & Heinze, 2000), which was also observed in the pilot study of this project (e.g., regarding the definition of even and odd numbers). The generic proof for the false claim 4 is based on the generic proof for the (true) statement the sum of four arbitrary consecutive odd numbers is always divisible by 8 given by Brunner (2014, p. 22). A similar argument proves claim 4, but only if the first number is odd. This fact was used to construct the generic proof as well as the ordinary one (see below).

Figure 5.5

Example items for generic proofs to justify claims 1, 2, and 4 (translated)

5.3.1.3 Ordinary Proofs

Similar to the generic proofs, the ordinary proofs (and the incorrect one for statement 4) were rather detailed to enable participants to comprehend the arguments more easily. Further, no illustrative figures were included in the proofs for claims 2 and 5 (the familiar geometry statements), even though they might have facilitated understanding the arguments. This decision was made because such figures always show specific examples, and the distinction between an ordinary proof and a generic one would then not have been as clear. Figure 5.6 provides examples of the ordinary proofs for claims 1, 2, and 4. As already mentioned regarding the generic proof of claim 4, the argument used in the ordinary proof for claim 4 is also not general.
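As an illustration of the style of argument (a paraphrase, not the exact item wording used in the study), an ordinary proof of claim 1 may run as follows: Let \(a\) and \(b\) be arbitrary odd numbers. Then there exist natural numbers \(m\) and \(n\) such that \(a = 2m+1\) and \(b = 2n+1\). Hence
\[
a + b = (2m+1) + (2n+1) = 2m + 2n + 2 = 2(m+n+1),
\]
which is even by definition, since it is divisible by 2.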

Figure 5.6

Example items for ordinary proofs to justify claims 1, 2, and 4 (translated)

5.3.2 Conviction of the Truth of Statements

Two closed items were designed to investigate students’ conviction of the truth of universal statements and the respective influence of different types of arguments. All participants were first asked to estimate the truth value of the statement (see Fig. 5.7).

Figure 5.7

Closed item for the estimation of truth of the statements (translated)

Thereby, participants were given the opportunity to express absolute or relative conviction, as proposed by Weber and Mejia-Ramos (2015). Further, the term richtig (in English correct) was used instead of wahr (in English true) to avoid any confusion about the meaning of true statements. The participants of the three experimental groups B, C, and D, who were provided with different types of arguments, were then asked whether the provided argument had convinced them of the truth (correctness) of the statement (see Fig. 5.8). The decision was made not to simply ask whether the participants found the argument convincing, as it might not have been clear to them what exactly was meant by that, leaving more room for interpretation. Instead, the item referred explicitly to conviction regarding the truth of the statement. The response options were again chosen such that absolute and relative conviction could be expressed, but the participants had only three options. This decision was made because distinguishing between being partially convinced and being partially not convinced did not seem useful. Further, the option I have no idea was not provided, because this question does not assess knowledge and I wanted participants to take a stand. Participants who were not completely convinced by the provided argument were further asked to describe why the justification did not convince them of the correctness of the claim. The responses were coded based on aspects identified in the literature (see Section 5.4.3 for the coding scheme).

Figure 5.8

Closed item for the conviction of arguments (translated)

5.3.3 Comprehension of Arguments

To assess students’ comprehension of generic and ordinary proofs, participants were first asked whether they had understood the provided argument (see Fig. 5.9). Similar to the closed item regarding students’ conviction, three response options were provided. Participants who reported not having completely understood the provided argument were further asked to describe what they did not understand about it. These answers were coded regarding aspects of proof comprehension identified in the literature (see Section 5.4.4 for the respective coding scheme).

Figure 5.9

Closed item for the comprehension of arguments (translated)

5.3.4 Justification: Students’ Proof Schemes

The participants in Group A received no arguments justifying the truth of the statements. Instead, they were asked to explain why they think/are convinced that the claim is correct/false (see Fig. 5.10). The main research question regarding students’ proof construction concerned students’ proof schemes. Participants were not asked to prove the statements, because it would have been unclear what they viewed as a proof, and it was not the goal of this study to investigate their respective understanding; rather, the aim was to analyze how students convince themselves of the truth or falsity of universal statements—and in particular how this relates to their understanding of the generality of statements (RQ4.4). The responses to this open-ended question were coded with respect to the main proof schemes (see Section 5.4.5).

Figure 5.10

Open item regarding students’ proof schemes (translated)

5.3.5 Understanding the Generality of Statements

While students’ understanding of the generality of mathematical statements has not been explicitly defined in previous studies (see also Section 3.1), related findings have mostly referred to students or teachers who were seemingly convinced of the truth of a statement (and/or the correctness of a respective proof) but were at the same time not convinced that no counterexamples can exist (e.g., Chazan, 1993; Knuth, 2002), or to students’ awareness that one counterexample disproves a universal statement (e.g., Buchbinder & Zaslavsky, 2019; Galbraith, 1981). Another approach was taken by Healy and Hoyles (2000), who assessed students’ understanding of the generality of a proven statement by asking them whether the proof “automatically held for a given subset of cases” (p. 402) or whether a new proof had to be constructed. In all these conceptualizations, students’ understanding of the generality of statements is explicitly related to their understanding of the generality of proof.

Table 5.1 Definition of a correct understanding of the generality of mathematical statements
Figure 5.11

Example for a closed item regarding the existence of counterexamples (translated)

My aim was to define and analyze students’ understanding of the generality of statements independent of that of proof. The potential influence of and relation to proof was considered through the experimental design of my study.

Therefore, for this study, the understanding of the generality of mathematical statements was defined as shown in Table 5.1. The checkmarks stand for a consistent estimation of the truth of a statement and the existence of counterexamples and indicate a correct understanding of the generality of mathematical statements. Thus, in addition to the estimation of the truth of the statement (see Fig. 5.7), the participants were asked to decide whether a counterexample to the statement can exist. The term counterexample itself was not used, as it could not be assumed that participants were familiar with the concept. Further, the usage of the term counterexample might even be suggestive and therefore potentially bias the results. Instead, individual closed items were constructed for each statement, in which participants were indirectly asked about the existence of respective counterexamples. Figure 5.11 provides an example of such an item for claim 1. Participants again had the opportunity to express absolute or relative conviction regarding the (non-)existence of counterexamples. The variable measuring students’ understanding of generality (yes/no) was then defined based on Table 5.1. Observations in which participants responded “I have no idea” to both questions were treated as missing values, because no decision could be made regarding the understanding of generality. Observations in which participants responded “I have no idea” to only one of the questions were considered inconsistent and were therefore coded as not showing a correct understanding of generality.
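The coding rule just described can be sketched as a small function. This is a hedged illustration only: the response labels below are hypothetical simplifications of the actual (German) answer options, with absolute and relative conviction collapsed into a single judgment.

```python
# Sketch of the coding rule for the generality variable.
# The string labels are hypothetical, not the original item wording.
NO_IDEA = "no idea"

def code_generality(truth_estimate, counterexamples_estimate):
    """Code a correct understanding of generality (True/False),
    or None (missing value) if both responses were 'I have no idea'.

    truth_estimate: "true", "false", or "no idea"
    counterexamples_estimate: "possible", "impossible", or "no idea"
    """
    if truth_estimate == NO_IDEA and counterexamples_estimate == NO_IDEA:
        return None   # no decision possible: treated as a missing value
    if truth_estimate == NO_IDEA or counterexamples_estimate == NO_IDEA:
        return False  # 'no idea' on only one question: inconsistent
    # Consistent pattern (cf. Table 5.1): a true statement admits no
    # counterexamples; a false statement must admit at least one.
    if truth_estimate == "true":
        return counterexamples_estimate == "impossible"
    return counterexamples_estimate == "possible"
```

For example, a participant who judges a statement to be true while allowing that counterexamples might exist would be coded as not having a correct understanding of generality.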

The last item of the questionnaire (see Fig. 5.12) aimed at assessing students’ (domain-specific) knowledge regarding the meaning of the generality of mathematical statements. A variable regarding correct knowledge of the meaning of generality (yes/no) was defined based on correct (response option c)) and incorrect (response options a), b), and d), as the comments of those who had chosen d) were also incorrect) responses.

Figure 5.12

Item regarding the meaning of generality of mathematical statements (translated)

5.3.6 Cognitive Reflection Test

A so-called Cognitive Reflection Test (CRT) was used to control for individual cognitive resources. The CRT, first described by Frederick (2005), “is designed to measure a person’s propensity to override an intuitive, but incorrect, response with a more analytical correct response” (Thomson & Oppenheimer, 2016, p. 99). It is assumed that the intuitive answers do not require any effort, while effortful thinking is needed to ultimately arrive at the correct solution (Frederick, 2005; Patel et al., 2019). The CRT has therefore been very influential in the literature on so-called dual-process theory, which is based on the assumption that thinking processes can be divided into these very two types, an intuitive System 1 and a more analytical, reflective System 2 (Kahneman & Frederick, 2002; Patel et al., 2019; Stanovich & West, 2000). A large body of research has provided evidence that CRT performance is associated with broad measures of rational thinking and thinking dispositions (Footnote 7) (e.g., Frederick, 2005; Patel et al., 2019; Primi et al., 2016; Thomson & Oppenheimer, 2016). For instance, there seems to be a strong relation between the CRT score and the so-called need for cognition (Frederick, 2005; Toplak, West, & Stanovich, 2014). Need for cognition (Cacioppo & Petty, 1982) is defined as a person’s “tendency to enjoy effortful thinking” (Thomson & Oppenheimer, 2016, p. 99)—which should be quite useful in mathematical activities, in particular those related to proof. Further, CRT performance is also associated with mathematical abilities (Frosch & Simms, 2015); (insight) problem-solving skills, including cognitive restructuring, which “involves the ability to reinterpret a problem” (Patel et al., 2019, p. 2131); the preference for more explanatory detail (Fernbach, Sloman, Louis, & Shube, 2013); general reasoning skills (Primi et al., 2016); decision making (Frederick, 2005; Primi et al., 2016); and belief bias (e.g., Toplak et al., 2011), which is defined as “the tendency to be influenced by the believability of the conclusion when evaluating the validity of logical arguments” (Thomson & Oppenheimer, 2016, p. 99). While several studies have provided evidence (see, e.g., Shenhav, Rand, & Greene, 2012; Toplak et al., 2011) that “the CRT assesses something beyond general intelligence” (Patel et al., 2019, p. 2131), there is no consensus in the literature on whether “individual differences in the disposition to overcome an initial intuition account for the predictive power of the CRT” (Baron, Scott, Fincher, & Emlen Metz, 2015, p. 279) and, in particular, whether the CRT measures something completely distinct from numeracy (Footnote 8) (see, e.g., Pennycook, Cheyne, Koehler, & Fugelsang, 2016; Sinayev & Peters, 2015).

Although more research is needed to fully understand what the CRT actually measures, it nevertheless seems useful for investigating students’ performance in proof-related activities for two main reasons. Firstly, it can be assumed that activities related to proof and argumentation require numeracy as well as high levels of rational thinking, including problem-solving skills (e.g., Chinnappan et al., 2012; Moore, 1994; Stylianou et al., 2006; Weber, 2005), and thinking dispositions (such as need for cognition and belief bias) might also play a role, in particular regarding the understanding of the generality of mathematical statements. As CRT performance correlates with these measures, it should cover several of the individual cognitive resources discussed in Section 3.3. Secondly, the CRT requires little time and thus does not significantly affect the test duration, while still providing potentially useful information on individuals’ cognitive differences. The CRT was therefore used as a control instrument for individual differences in cognitive resources. I want to highlight that the purpose of using the CRT in the present study was not to identify specific cognitive resources underlying students’ performance in proof-related activities. For the main study, German translations of the four items shown in Figure 5.13 were used in random order.

Figure 5.13

CRT items used in the study (based on Frederick, 2005; Primi et al., 2016; Thomson & Oppenheimer, 2016)

The first two items are based on Frederick’s (2005) original version of the CRT. Because of potential prior exposure to the questions, the original CRT’s reliance on numeracy, and its level of difficulty, it was decided to also include CRT items from alternative versions. The third item was taken and translated from the so-called CRT-2 proposed by Thomson and Oppenheimer (2016), which has been shown to be significantly less reliant on numeracy than the original CRT. The fourth and last item is based on an item from the CRT-long proposed by Primi et al. (2016)Footnote 9. The CRT-long was developed as a more appropriate scale for a wider target group, because the original CRT is limited in that its item difficulty leads to floor effects in many populations, such as non-elite groups and adolescents (Frederick, 2005; Primi et al., 2016; Toplak et al., 2014). On average, participants correctly solved 1.24 out of 3 items (about 41%) in the CRT by Frederick and 2.95 out of 6 items (about 49%) in the CRT-long by Primi et al. About 33% and 12% of the participants scored zero in the CRT and the CRT-long, respectively.

In summary, the CRT items were selected such that the risks of prior exposure, over-reliance on numeracy, and floor effects were reduced. Furthermore, the items’ context was considered with regard to a meaningful translation into German.

The CRT score was calculated based on the number of correctly solved items, resulting in values between 0 (no CRT item correctly solved) and 4 (all CRT items correctly solved).

5.3.7 Demographics

In the last part of the questionnaire, the participants were asked about their demographics, including university-entry grade, final high school mathematics grade, degree program, and number of semesters. Furthermore, they were asked if they were attending the respective lecture for the first time and if they had attended a so-called VorkursFootnote 10 prior to the beginning of the winter term. Lastly, they were asked if they had specialized in mathematics during high school in a so-called LeistungskursFootnote 11. This information was used to (at least indirectly) account for participants’ prior mathematical knowledge. It can be assumed that, on average, students who specialized in mathematics gained more experience in mathematics at school than students attending a regular mathematics course. Attendance of a Vorkurs was considered because at least one of the statements used in the study had been discussed and proven in at least one of these courses. Thus, participants who attended such a transition course might have been more familiar with the respective item.

5.4 Data Analysis

In this section, the methods and approaches used to analyze the data are described. First, the procedures used in this study are explained and justified. What follows is a more detailed summary of the analysis process and respective assumptions that were made.

5.4.1 Statistical Analysis

All statistical analyses were performed using the statistical software environment R (version 4.2.0; R Core Team, 2022). The relevant code can be found at Damrau (2023).

To estimate the effects of the type of argument and statement (and other variables of interest) on students’ understanding of generality—the primary outcome measure of this study—generalized linear mixed models (GLMM) were calculated using the glmer function of the R package lme4 (version 1.1-31, Bates, Mächler, Bolker, & Walker, 2015), because understanding of generality was defined as a binary variable (see Section 5.3.5). Classical linear models assume normally distributed residuals and could therefore not be used (e.g., Bolker, 2015). Similarly, to analyze students’ performance in estimating the truth value of statements, students’ conviction, and students’ proof comprehension, respectively, cumulative link mixed models (CLMM) were fitted using the clmm function of the R package ordinal (version 2019.12-10, Christensen, 2019), because the respective dependent variables are ordinal. In both cases (GLMM and CLMM), logistic link functions were used. A further advantage of mixed regression models is that both random and fixed effects can be included (e.g., Bolker, 2015). The individual participants were modeled as a random effect to control for individual differences in the repeated measures. All other independent variables, such as the type of argument and statement, were included as fixed effects.
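Schematically, such a binary GLMM can be written as follows (a generic sketch of this model class, not a formula from the thesis; \(y_{ij}\) denotes the binary understanding-of-generality outcome of participant \(j\) for statement \(i\)):

```latex
\operatorname{logit} P(y_{ij}=1)
  = \beta_0 + \boldsymbol{x}_{ij}^{\top}\boldsymbol{\beta} + u_j,
\qquad u_j \sim \mathcal{N}\left(0,\sigma_u^2\right),
```

where \(\boldsymbol{x}_{ij}\) collects the fixed effects (type of argument and statement, control variables) and \(u_j\) is the random intercept for participant \(j\); the CLMMs analogously model cumulative logits \(\operatorname{logit} P(y_{ij}\le k)\) over the ordinal response categories.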

For variable selection and regression model building, the following approach was taken (based on suggestions in Gelman & Hill, 2007; Harrell, 2015; G. Heinze, Wallisch, & Dunkler, 2018):

  1. Theoretical background information was used to decide which independent variables (IVs) should be considered.

  2. A directed acyclic graph (DAG) was drawn to illustrate the relationships between IVs.

  3. Some IVs were eliminated based on the DAG (e.g., study program).

  4. Backward elimination was cautiously applied: control variables with comparatively small and/or highly insignificant (\(p>.5\)) effects were dropped (as suggested by Harrell, 2015), while the theoretical background (e.g., the expected direction of an effect) was taken into account. The Akaike and Bayesian information criteria (AIC and BIC) were used for choosing the final model (see also Heinze et al., 2018).

The following control variables were included in the global models to account for individual (cognitive) differences between the participants:

  • CRT score

  • attendance of honors course in mathematics during high school (yes/no)

  • attendance of transition course (yes/no)

  • final mathematics grade in high school

The two continuous control variables (CRT score and final mathematics grade) were standardized by subtracting the mean and dividing by two standard deviations, to make the coefficient estimates comparable to those of untransformed binary predictors, as suggested by Gelman (2008).
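This rescaling can be illustrated with a minimal Python sketch (the function name and example scores are hypothetical; the thesis itself used R):

```python
def standardize_2sd(values):
    """Center a predictor and divide by two standard deviations
    (Gelman, 2008), so that its coefficient becomes roughly comparable
    to that of an untransformed binary predictor."""
    n = len(values)
    mean = sum(values) / n
    # sample standard deviation (denominator n - 1)
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / (2 * sd) for v in values]

# hypothetical CRT scores of five participants
print(standardize_2sd([0, 1, 2, 3, 4]))
```

The rationale is that a balanced binary predictor has a standard deviation of about 0.5, so dividing a continuous predictor by two standard deviations puts both on a comparable scale.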

I decided to adjust the p-values in the final regression models whenever multiple testing was involved. This was, for instance, the case for hypotheses on the influence of the type of argument (e.g., comparisons of the experimental groups against the control group) and of the type of statement (see the following sections for the specific comparisons and numbers of tests).

For choosing an appropriate correction method, the expected losses from Type I and Type II errors were considered. The consequences of a Type I error would mainly consist of falsely considering variables (here, for instance, the type of argument) in practical implications and future investigations. A Type II error, on the other hand, would result in falsely excluding relevant aspects. In the context of the present study, neither type of error can be expected to have serious consequences. Further, no comparable research has previously been conducted, in particular regarding students’ understanding of the generality of statements. Thus, to avoid prematurely excluding potential variables (Type II errors), Holm’s correction (Holm, 1979) was used to control the family-wise error rate (FWER) instead of the more conservative Bonferroni correction, which decreases the probability of Type I errors slightly more but (strongly) increases the probability of Type II errors (e.g., Aickin & Gensler, 1996). Had the number of tests been higher, an even less conservative method such as the Benjamini-Hochberg (BH) correction (Benjamini & Hochberg, 1995), which controls the false discovery rate (FDR) instead of the FWER, would also have been a reasonable choice. Due to the small number of tests, however, the results would not be expected to differ much from those obtained with Holm’s correction. Holm’s correction is a step-down procedure in which the p-values (or significance levels) are adjusted iteratively, from the smallest to the largest value. The unadjusted p-values are first sorted in ascending order. The i-th Holm-adjusted p-valueFootnote 12 \(p.adj_i\) is then calculated by

$$p.adj_i=\min \left\{ \max _{1\le j\le i}\,(m-j+1)\cdot p_j,\ 1\right\},$$

where m is the number of tests and \(p_j\) is the j-th smallest unadjusted p-value. The adjusted p-values are then compared to the unadjusted level of significance, which was set to the common value of \(\alpha =.05\) in this thesis. Holm’s adjusted p-values were calculated with the R function p.adjust. In the regression summaries, unadjusted significance is reported using the traditional stars, and significance based on adjusted p-values is marked in bold in the final models. Whenever the level of significance was corrected, Holm’s adjusted p-values are reported as p.adj.
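The step-down procedure can be re-implemented in a few lines (a hypothetical Python sketch mirroring the behavior of R’s p.adjust with method = "holm"; not the thesis code):

```python
def holm_adjust(pvalues):
    """Holm's step-down adjustment: multiply the i-th smallest p-value
    by (m - i + 1), take a running maximum to keep the adjusted values
    monotone, and cap them at 1."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_max = 0.0
    for i, k in enumerate(order):  # i is 0-based, so m - i equals m - j + 1
        running_max = max(running_max, (m - i) * pvalues[k])
        adjusted[k] = min(running_max, 1.0)
    return adjusted

print(holm_adjust([0.01, 0.04, 0.02]))
```

For the three p-values above, the smallest is tripled, the second smallest doubled, and the largest left as is, with the running maximum ensuring the adjusted values never decrease.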

5.4.2 Content Analysis

A quantitative content analysis (sometimes referred to as structured qualitative content analysis, see, e.g., Döring & Bortz, 2016; Mayring, 2022) was chosen to measure students’ proof schemes, students’ proof evaluation regarding conviction, and students’ proof comprehension. The three respective open questions were analyzed following theory-based coding schemes. To refine the coding schemes, the following approach was taken (see Döring & Bortz, 2016; Krippendorff, 2004):

  1. A set of a priori categories was identified based on the literature.

  2. The categories were discussed and modified.

  3. The set of categories was applied to a sample of the data, resulting in the deletion, rephrasing, and differentiation of categories, as well as the addition of a few new categories.

  4. The set of revised categories was further specified to maximize mutual exclusiveness as well as exhaustiveness.

  5. The resulting coding scheme was pretested to ensure applicability.

  6. After the categories had been adequately refined and clarified, the coding scheme was settled upon as final.

The resulting coding schemes are described in Sections 5.4.3, 5.4.4, and 5.4.5, respectively.

Over 20% of randomly chosen students’ responses were coded by two colleagues working in mathematics education. Following suggestions in the literature (e.g., Krippendorff, 2004), the coders were chosen such that they were familiar with the respective content (e.g., with typical proof tasks and university students’ attempts to solve them) and had strong backgrounds in mathematics. A detailed coding protocol (including decision trees) was provided (in German, see Appendix B in the Electronic Supplementary Material) and both coders were sufficiently trained. This led to very good inter-coder reliabilities (\(.88<\kappa <.93\)), based on Cohen’s kappa (e.g., Davey, Gugiu, & Coryn, 2010; McHugh, 2012; O’Connor & Joffe, 2020).
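Cohen’s kappa itself is straightforward to compute (a self-contained Python sketch for nominal codes from two coders; the example codings below are invented for illustration):

```python
def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement between two coders, corrected
    for the agreement expected by chance from the coders' marginal
    category frequencies."""
    n = len(codes_a)
    assert n == len(codes_b) and n > 0
    categories = set(codes_a) | set(codes_b)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    expected = sum(
        (codes_a.count(c) / n) * (codes_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# two coders agreeing on 5 of 6 (invented) binary codes
print(cohens_kappa([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0]))
```

Unlike raw percentage agreement, kappa discounts the agreement that two coders would reach by chance alone, which is why it is the preferred reliability measure here.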

In the following sections, more specific information on both the statistical as well as content analyses of the respective data is provided.

5.4.3 Conviction of the Truth of Statements

Students’ conviction of the truth of statements was assessed via students’ performance in two proof-related activities: the estimation of truth and proof evaluation regarding conviction. I first describe the analysis of students’ responses regarding the estimation of truth and then outline how students’ proof evaluation regarding conviction was analyzed.

5.4.3.1 Estimation of Truth

To be able to meaningfully compare responses regarding both true and false statements, students’ responses were recoded with respect to a correct estimation of truth (“yes (absolutely sure)”, “yes (relatively sure)”, “no (relatively sure)”, “no (absolutely sure)”). The answer “I have no idea” was coded as undecided and placed between “yes (relatively sure)” and “no (relatively sure)”, because these participants could not decide whether or not the statements were true. In this way, the responses were ordered consistently and no information was lost in the regression analysis. As described above, cumulative link mixed models were used to analyze students’ (correct) estimation of truth (as an ordinal variable). The main goal was to estimate the effect of the type of argument and statement. In addition, the four control variables listed in Section 5.4.1 were considered for the global regression model. Holm’s correction was used for analyzing the influence of the type of argument (three comparisons, each experimental group against the control group, which received no arguments) and the type of statement (two comparisons, true familiar and false statements against true unfamiliar statements).
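The recoding logic can be sketched as follows (a hypothetical Python illustration; the option labels are shorthand for the translated questionnaire options):

```python
# Ordinal correctness scale: 0 = confidently incorrect estimation,
# 4 = confidently correct estimation, with "I have no idea" in between.
def recode(raw_answer, statement_is_true):
    """Recode a raw truth-value answer into an ordinal scale of
    *correct* estimation, so that responses to true and false
    statements can be compared on the same scale."""
    if raw_answer == "I have no idea":
        return 2                                  # undecided (middle)
    answered_true = raw_answer.startswith("yes")
    correct = answered_true == statement_is_true
    absolutely_sure = "absolutely" in raw_answer
    if correct:
        return 4 if absolutely_sure else 3
    return 0 if absolutely_sure else 1
```

For a false statement, for example, a confident “no” maps to the top of the scale, since it is a confident correct estimation.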

5.4.3.2 Students’ Conviction (Statistical Analysis)

The analysis of students’ proof evaluation regarding conviction is based on the responses of groups B, C, and D, as participants in group A did not receive any justification. To analyze students’ proof evaluation regarding conviction (measured via the closed item shown in Fig. 5.8), cumulative link mixed models were again calculated. The main goal was to analyze the effect of the type of argument and statement as well as of students’ proof comprehension (as an ordinal variable). In addition, the four control variables listed in Section 5.4.1 (CRT score, attendance of an honors course in mathematics during high school, attendance of a transition course, and final mathematics grade in high school) were considered for the global regression model. Holm’s correction was used for analyzing

  • the influence of the type of argument (three comparisons in total: generic and ordinary proofs each against empirical arguments, and generic against ordinary proofs),

  • the influence of the type of statement (two comparisons, true familiar and false unfamiliar statements both against true unfamiliar statements),

  • the influence of the level of proof comprehension (an ordinal variable) on students’ conviction (two comparisons, completely understood and not at all understood both against partially understood).

Table 5.2 Coding scheme regarding argument evaluation of convincingness (examples translated by the author)

5.4.3.3 Students’ Conviction (Content Analysis)

The coding scheme used to analyze which aspects students claimed not to find convincing in different types of arguments is based on respective aspects identified in research on proof evaluation (see Section 3.2.3) as well as on characteristics and acceptance criteria suggested by Stylianides (2007) and Hanna (1989), as discussed in Section 2.3.3. It was decided to include the aspect of sample size/selection as a possible category regarding the evaluation of empirical arguments, because of respective observations in prior research (e.g., Chazan, 1993). Table 5.2 gives an overview of the final coding scheme. The generated measured values were then analyzed using descriptive statistics. Responses of group B for statements 1 and 2, and responses of groups C and D for statements 1, 2, 3, and 5 were included in the analysis. The goal of including responses regarding empirical arguments was to investigate whether students are aware of the limitations of empirical arguments and thus refer to a lack of generality when asked why the argument did not convince them. For this purpose, it did not seem necessary to analyze the responses regarding all statements, because responses referring to a lack of generality of empirical arguments would become redundant. To account for potential differences regarding familiarity with the statements, statements 1 (unfamiliar, number theory) and 2 (familiar, geometry) were included. Statement 4 was not considered at all, because the statement is false and the corresponding proofs are therefore incorrect.

5.4.4 Comprehension of Arguments

In the analysis of students’ proof comprehension, only responses of groups C and D (generic and ordinary proofs) were considered (as a reminder, group A did not receive any justification and group B only empirical arguments). Further, I excluded responses regarding the false statement 4 from the analyses since proof comprehension relates only to the understanding of correct proofs (see, e.g., Neuhaus-Eckhardt, 2022).

5.4.4.1 Statistical Analysis

Similar to the analysis of students’ estimation of truth and conviction, cumulative link mixed models were used to analyze students’ (self-reported) proof comprehension. The main goal was to estimate the effect of the type of argument (generic vs ordinary proof) and of familiarity with the statement. In addition, the four control variables listed in Section 5.4.1 were again considered for the global regression model. Holm’s correction was used for neither the type of argument (generic vs ordinary proof) nor the type of statement (unfamiliar vs familiar), because each involved only one comparison (i.e., one test).

5.4.4.2 Content Analysis

The coding scheme used to analyze the open item on students’ proof comprehension is mainly based on the local aspects introduced by Mejía Ramos et al. (2012) (see Section 3.2.4). It was decided to explicitly differentiate between students’ lack of understanding of the statement itself (as a prerequisite for understanding the proof) and of statements, terms, illustrations, etc. only used in the proof. Based on prior research, it was assumed that students would generally not refer to holistic aspects or aspects beyond the particular proof when asked to identify what they did not understand (see Section 3.2.4). Generality was added as a possible category to the coding scheme because of its importance for the present thesis. Furthermore, it was expected that students might (implicitly) refer to an insufficient understanding of the generality of generic proofs, as research on proof evaluation suggests that some students and teachers think generic proofs lack generality (see Section 3.2.3). The aspect of generality might contain holistic aspects, because students not understanding why a generic proof is general could be related to an insufficient understanding of the main idea of the proof, which, according to Mejía Ramos et al.’s model, corresponds to holistic understanding. Table 5.3 gives an overview of the final coding scheme. The generated measured values were then analyzed using descriptive statistics.

Table 5.3 Coding scheme regarding proof comprehension (examples translated by the author)

5.4.5 Justification: Students’ Proof Schemes

The coding scheme used to analyze students’ proof schemes is based on Harel and Sowder (1998), Bell (1976), Recio and Godino (2001), and Kempen (2019) (see Section 3.2.5). Table 5.4 gives an overview of the final coding scheme. The generated measured values were then analyzed using descriptive statistics. Further analysis was conducted to investigate the relation between students’ proof schemes and their understanding of generality (see following section).

Table 5.4 Coding scheme regarding students’ proof schemes (examples translated by the author)

5.4.6 Understanding the Generality of Statements

For a potentially easier interpretation and to be able to compare results with those of prior research, I first investigated participants who were absolutely convinced of the truth of a statement but not absolutely convinced that no counterexample to the statement exists (which relates to the first row shown in Tab. 5.1). Chi-square tests and Cramer’s V were used to analyze differences regarding the type of argument and statement.
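As an illustration, Cramer’s V can be derived directly from a contingency table via the (uncorrected) Pearson chi-square statistic (a minimal Python sketch; the cell counts below are invented, not data from the study):

```python
def cramers_v(table):
    """Cramer's V for a contingency table (list of rows): the Pearson
    chi-square statistic normalized by the sample size and the smaller
    table dimension, giving an effect size between 0 and 1."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson chi-square: sum of (observed - expected)^2 / expected
    chi2 = sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )
    k = min(len(row_totals), len(col_totals))
    return (chi2 / (n * (k - 1))) ** 0.5

# e.g. a 2x2 cross-tabulation of a yes/no outcome by two groups
print(cramers_v([[20, 10], [10, 20]]))
```

Because V is normalized by the table dimensions, it can be interpreted with a fixed rule of thumb independently of the sample size, unlike the raw chi-square statistic.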

For interpreting the effect size given by Cramer’s V, the following rule of thumb introduced by Cohen (1988) was used (Table 5.5):

Table 5.5 Cohen’s rule of thumb for interpreting Cramer’s V

As has been described in Section 5.4.1, generalized linear mixed models were then calculated to analyze students’ understanding of the generality of mathematical statements as defined in Table 5.1. In addition to the type of statement and argument, the analysis also aimed at estimating the effect of students’ knowledge of the meaning of mathematical generality (correct knowledge or not, see Fig. 5.12 in Section 5.3.5). The four control variables listed in Section 5.4.1 were again considered for the global regression model. Holm’s correction was used for analyzing the influence of the type of argument (three comparisons, the experimental groups against the control) and the type of statement (two comparisons, true familiar and false unfamiliar statements both against true unfamiliar statements).

5.4.6.1 Understanding Generality in Relation to Conviction and Proof Comprehension

Further regressions were calculated to analyze the relation between students’ understanding of generality and their conviction and comprehension (see footnotes in Section 6.5 for p-value adjustment). Observations regarding the false statement were excluded in these analyses, because the effect was expected to be in the opposite direction, in particular regarding conviction. Moreover, analyzing the relation to proof comprehension was based on the data of groups C and D (participants who received generic or ordinary proofs), because the influence of students’ proof comprehension of empirical arguments did not seem to be meaningful.

5.4.6.2 Understanding Generality in Relation to Proof Schemes

To analyze the relation between students’ understanding of generality and their proof schemes (group A), Chi-square tests and Cramer’s V were used instead of generalized linear mixed models. This decision was made to increase statistical power and because a reference category could not be set in a meaningful way, even though GLMMs would otherwise have been preferable, as individuals could have been included as a random effect. To further increase statistical power, the categories introduced in Section 5.4.5 were summarized as follows, based on the main categories of Harel and Sowder (1998):

  • External conviction: authority based, rule, pseudo

  • Empirical: empirical (no apparent awareness of generality), empirical (awareness of generality)

  • Counterexamples: counterexamples

  • Analytical: deductive (complete and incomplete), transformative (complete and incomplete), relevant aspects

  • Unclear: unclear

In theory, counterexamples could be seen as an empirical proof scheme—because they are indeed empirical—however, as counterexamples prove the falsity of a statement (in contrast to other empirical arguments, which are usually not able to prove the truth of a statement), it seemed oversimplifying to code them as empirical. Because none of the other main categories introduced by Harel and Sowder seemed appropriate either, counterexamples were treated as a separate main category. Further, pseudo arguments were allocated to external conviction proof schemes. Similar arguments, such as restating the statement or stating its contrapositive, had been observed by Harel and Sowder (1998) as examples of authoritarian proof schemes. I decided to use the term external conviction for this main category, based on the external proof schemes proposed by Harel and Sowder, to distinguish pseudo arguments and references to a rule from explicit references to authorities. Responses coded as unclear were kept as an additional category so as not to lose potentially useful information.