Peer-Assisted Reflection: A Design-Based Intervention for Improving Success in Calculus
Introductory college calculus students in the United States engaged in an activity called Peer-Assisted Reflection (PAR). The core PAR activities required students to attempt a problem, reflect on their work, conference with a peer, and revise and submit a final solution. Research was conducted within the design research paradigm, with PAR developed in a pilot study, tried fully in a Phase I intervention, and refined for a Phase II intervention. The department’s uniform grading policy highlighted dramatic improvements in student performance due to PAR. In Phase II, 56 % of students department-wide (excluding the experimental section) received As, Bs, or Cs in Calculus I, as opposed to Ds, Fs, or Ws (withdrawal with a W but no grade on a transcript). In the experimental section, 79 % of students received As, Bs, or Cs, a gain of 23 percentage points. Such increased success has rarely been achieved (the Emerging Scholars Program is a notable program that has done so).
Keywords: Explanation; Formative assessment; Peer assessment; Reflection
This paper documents the use of reflection tools to improve student success in calculus. Since the calculus reform movement (Ganter 2001), calculus learning has been a major research focus in the United States (US), with over 2/3 of departments reporting at least modest reform efforts (Schoenfeld 1995). Despite some successes, introductory college calculus remains an area of persistent difficulty. In the US, each fall semester, over 80,000 students (27 %) fail to successfully complete the course (Bressoud et al. 2013).
Calculus concepts, such as functions (Oehrtman et al. 2008) and limits (Tall 1992), are notoriously difficult for students. These conceptual difficulties are exacerbated by the challenges of the high school to college transition (e.g., developing greater independence, learning new study habits, forming new relationships; cf. Parker et al. 2004). Moreover, many students enter college unprepared; only 26 % of 12th grade students achieve a level of proficient or better on the National Assessment of Educational Progress (NAEP) exam (NCES 2010). Through K12 instruction, students often develop learning dispositions that do not align well with the requirements of collegiate mathematics (Schoenfeld 1988). All of these factors impede student success in calculus.
To help students succeed in calculus, I engaged in three semesters of study using the design-based research paradigm (Cobb et al. 2003). Design-based research aims to make practical and theoretical contributions in real classroom settings (Brown 1992; Burkhardt and Schoenfeld 2003; Gutiérrez and Penuel 2014). By specifying the theoretical underpinnings of my design in detail, I developed a practical instructional tool and refined principles of why it works (Barab and Squire 2004). Over three semesters of design, I developed a collaborative activity called Peer-Assisted Reflection (PAR). PAR was developed in a pilot study, tried fully in a Phase I intervention, and refined for a Phase II intervention.
The core PAR activities required students to: (1) work on meaningful problems, (2) reflect on their own work, (3) analyze a peer’s work and exchange feedback, and finally (4) revise their work based on insights gained throughout this cycle. PAR was based on theoretical principles of explanation (Lombrozo 2006) and assessment for learning (Black et al. 2003). In particular, PAR leverages the connection between peer analysis and self-reflection (Sadler 1989) to help students develop deeper mathematical understandings (Reinholz 2015).
During Phase I and Phase II of the study, I used quasi-experimental methods to study the impact of PAR on student outcomes in calculus. The study took place in a mathematics department with many parallel sections of the same course, all of which used common exams and grading procedures. This paper focuses primarily on student understanding as measured by exam performance. Two companion pieces (Reinholz forthcoming a, forthcoming b) provide in-depth analyses of student explanations and the evolution of the PAR design.
This paper is organized into three major components. The first component describes the PAR intervention, including its theoretical basis, core activities, and a brief history of its evolution. The next component focuses on the impact of PAR on student performance during Phase I and Phase II of the study. Finally, I further elaborate the intervention by discussing the impact of students’ revisions and the mechanisms that appeared to make PAR such an effective intervention.
Efforts to Improve Calculus Learning
To date, two of the most notable efforts to improve calculus learning in the US were the calculus reform movement (Ganter 2001) and Emerging Scholars Program (ESP; Fullilove and Treisman 1990). Internationally, calculus continues to be an area of interest, as evidenced by the recent ZDM special issue focused on calculus research (Rasmussen et al. 2014). As these researchers note, a number of advances have been made (e.g., in understanding how students learn specific concepts), but “these advances have not had a widespread impact in the actual teaching of and learning of calculus” (Rasmussen et al. 2014, p. 512). Accordingly, I focus primarily on the ESP and calculus reform movement, both of which have had notable impacts on the teaching and learning of calculus.
The ESP is based on Treisman’s observational study of minority learners (Fullilove and Treisman 1990); the ESP seeks to reproduce the learning conditions of successful students from the original study. Students in the ESP attend special 2-hour problem sessions, twice a week, in addition to their traditional calculus section. In the sessions, groups of 5–7 students work collaboratively on exceptionally difficult sets of problems; both the quality of the problems and the collaborative environment are essential (Treisman 1992). The ESP is open to students of all races, but enrolls primarily minority students; African American students have increased their success rates by 36 % through their participation (Fullilove and Treisman 1990). Versions of the ESP at other institutions (e.g., the University of Texas at Austin, the City College of New York) have also improved outcomes for minority students (Treisman 1992). Although ESP-style learning has been difficult to implement in regular calculus sections, the ESP provides evidence of the impact of meaningful problems in a supportive, collaborative learning environment.
Calculus reform interventions often introduced technology and/or collaborative group work to help students solve real-world problems. These studies generally reported improvements in engagement and deeper understanding, with mixed performance on traditional exams (Ganter 2001); however, it is difficult to generalize from these studies, due to a lack of common measures (e.g., many of them did not compare student passage rates directly). To contextualize the present study, I report on some notable efforts. The Calculus Consortium at Harvard impacted a number of universities; one of its most notable outcomes was a 12 % improvement in passage rates documented at the University of Illinois at Chicago (Baxter et al. 1998). Nevertheless, this finding is limited, because students were not compared using the same exams. Smaller positive gains were noted in the Calculus, Concepts, Computers, and Cooperative Learning (C4L) program, which showed a 4 % improvement in course GPA scores (Schwingendorf et al. 2000). Other notable efforts, such as Calculus and Mathematica (Roddick 2001) and Project CALC (Bookman and Friedman 1999), did not directly compare student outcomes in Calculus I, but comparisons of the GPAs of traditional and reform students in subsequent courses showed mixed results. As a whole these studies show promise, but in many cases the outcomes were difficult to interpret due to the methodological difficulties of conducting such studies. I attempt to account for some of these issues in the present study by comparing students using a common set of exams.
Explanation and Understanding
Although mathematical understanding has been defined in a number of ways, there is general consistency between recent attempts to create useful definitions (NGAC 2010; NCTM 2000; Niss 2003; NRC 2001). These standards and policy documents tend to focus on learning as both a process of acquiring knowledge and as the ability to engage competently in social, disciplinary practices (Sfard 1998). The standards focus on holistic learning, and as a result, focus on a large number of skills and practices. This is an important shift for improving mathematical teaching, learning, and research, but it also highlights the difficulty of measuring learning in a meaningful way.
Explanation is a highly valued mathematical practice, and is considered a “hallmark” of deep understanding in the Common Core State Standards (NGAC 2010). This aligns well with the five NCTM process standards, of which explanation is fundamental to three (reasoning and proof, communication, and connections) and important to the other two (problem solving and representation; NCTM 2000). Explanation is also prevalent in the Danish KOM standards (e.g., in reasoning and communication; Niss 2003).
Explanation also supports learning, because it helps individuals uncover gaps in their existing knowledge and connect new and prior knowledge (Chi et al. 1994). In this way, explanation provides students with opportunities to grapple with difficult mathematical concepts in a supportive environment, so that they can learn to overcome conceptual difficulties rather than avoid them (Tall 1992).
Focusing on student practices, explanation supports productive disciplinary engagement (Engle and Conant 2002). This framework provides a lens for understanding the types of activities students engaged in through PAR. Engle and Conant (2002) describe four principles for productive disciplinary engagement: (1) problematizing, (2) authority, (3) accountability, and (4) resources. As a whole, these principles require that students work on authentic problems and are given space to address the problems as individuals, but are held accountable to their peers and the norms of the discipline. PAR was designed to support such engagements, because it provides students with opportunities to explain and justify their ideas on rich mathematical tasks, and receive and incorporate feedback from their peers.
Using Assessment for Learning
The PAR intervention was designed using principles of assessment for learning to support student understanding. Recognizing students as partners in assessment (Andrade 2010), such activities focus on how students can evoke information about learning and use it to modify the activities in which they are engaged (Black et al. 2003). Generally speaking, assessment for learning improves understanding (e.g., Black et al. 2003; Black and Wiliam 1998). In the present study, students analyzed their peers’ work to develop analytic skills that they could later use to reflect on their own work (Black et al. 2003).
Through reflection, an individual processes their experiences to better inform and guide future actions (Boud et al. 1996; Kolb 1984). In the context of PAR, these experiences focused on problem-solving processes, such as metacognitive control, explanation, and justification. I use the terms analysis and reflection, rather than assessment, to distinguish PAR from other activities focused on assigning grades, which contain little information to support such learning processes (Hattie and Timperley 2007).
To self-reflect, a learner must: (a) possess a concept of the goal to be achieved (in this case, a high-quality explanation), (b) be able to compare actual performance to this goal, and (c) act to close the gap between (a) and (b) (Sadler 1989). Through actual practice analyzing examples of various quality, students can develop a sense of the desired standard and a lens to view their own work. This allows students to reflect on and improve their mathematical work (Reinholz 2015).
Many researchers have studied students analyzing the work of their peers (Falchikov and Goldfinch 2000), but they focus primarily on peer writing (e.g., Min 2006). Most of these studies have focused on calibration between peer and instructor grades, rather than peer analysis as a tool for learning (Stefani 1998). Even studies focused on learning rarely measured quantitative changes in student outcomes (Sadler and Good 2006). The present study is unique because it focuses on the impact of peer analysis on student outcomes in a domain where such activities are rare.
In this article I define PAR as a specific activity structure, but in theory, PAR could be implemented in other ways. PAR involves students analyzing one another’s work and conferencing about their analyses. Through peer-conferencing, students explain both their own work and the work of their peers, which promotes understanding.
(1) Did PAR improve student exam scores and passage rates in introductory calculus?
(2) How did PAR impact student performance on problems that required explanation compared to those that did not?
(3) In what ways did PAR appear to support student learning?
The first research question focuses on whether or not PAR can help address the persistent problem of low student success in calculus. Although PAR targets student explanations specifically, I was interested in whether or not PAR could improve student performance more broadly (research question two). I address these two questions through quantitative analyses of student performance during Phase I and Phase II of the study. To understand the ways that PAR appeared to support learning, I analyzed how students revised their work and also conducted interviews with students.
Core PAR Activities
Students were assigned one additional problem (the “PAR problem”) as a part of their weekly homework (for a total of 14 problems throughout the semester). The core PAR activities required students to: (1) complete the PAR problem outside of class, (2) self-reflect, (3) trade their initial work with a peer and exchange peer feedback during class, and (4) revise their work outside of class to create a final solution. Students turned in written work for (1)–(4), but only final solutions were graded for correctness. During their Tuesday class session, students were exposed to each other’s work for the first time (unless they worked together outside of class). Each student analyzed their partner’s work silently for 5 min before discussing the problem together for five more minutes, to ensure that students focused on one another’s reasoning and not just the problems themselves. This meant that students spent a total of approximately 10 min of class time each week dedicated to PAR; this was a relatively small amount of the 200 min of class time that students met each week. Most of the time students spent working on PAR took place outside of class.
Through PAR, students practiced explanation; gave, received, and utilized feedback; and practiced analyzing others’ work. PAR feedback was timely (before an assignment was due; cf. Shute 2008) and the activity structure (submission of both initial and final solutions) supported the closure of the feedback cycle (Sadler 1989). Through repeated practice analyzing others’ work, students were intended to transition from external feedback to self-monitoring (see appendices A and B for the self-reflection and feedback forms). To support students to meaningfully engage in PAR conferences, students practiced analyzing hypothetical work during class sessions and discussed it as a class.
PAR problems were inspired by and adapted from the Shell Centre, the Mathematics Workshop Problem Database (a database of problems used in the ESP), Calculus Problems for a New Century, and existing homework problems from the course. I further narrowed the problem sets by drawing on Complex Instruction, a set of equity-oriented practices for K-12 group work (Featherstone et al. 2011), and Schoenfeld’s (1991) problem aesthetic. Ideal problems were accessible, had multiple solution paths to promote making connections, and provided opportunities for further exploration. Most problems required explanation and/or the generation of examples; as a result, each pair of students was likely to have different solutions. Thus, these tasks could be considered real mathematical problems, not just exercises (related to problematizing; cf. Engle and Conant 2002).
I illustrate PAR by discussing a student interaction around PAR10 (the 10th assigned problem). PAR10 required students to trace their hand, use simple shapes to estimate the enclosed area, and estimate an error bound (see Fig. 1). This interaction was chosen because it illustrates how peer discussions were able to support meaningful revisions.
Lance asked Peter if he was trying to get an under approximation (line 12), and then stated that he was trying to get an over approximation (line 14). Peter and Lance realized that combining these ideas together, they could get bounds on the actual value from above and below (lines 15 and 16). Peter used this idea to come up with a correct method for bounding the error (see Fig. 6).
 Lance: Were you trying to…get an under?
 Peter: Yeah, initially.
 Lance: What I was thinking was you could make an over-approximation. Take this right here and create an over-approximation and then subtract what you got from here with your under-approximation and it should get you this space.
 Peter: So it’s showing it has to be between those two values. That’s the error.
 Lance: Right, that actual value is going to be between your low approximation and your high approximation.
 Peter: It says you want to think about bounding your error with some larger value. So that would make sense then.
In Peter’s final solution, he calculated an over approximation and an under approximation, and reasoned that the actual value must be between the two. Peter used the boxes from his initial solution that were entirely inside the hand as an underestimate. He then added additional boxes that surrounded the outside of his hand and reasoned that “since this will be an over approximation, I know that the true area under the curve will be less than the area I calculate by error.” While not all PAR conversations resulted in productive revisions, this example illustrates the PAR process. PAR was designed to provide students with the authority to grapple with rich problems while holding students accountable to their peers through conferencing, key components of productive disciplinary engagement (Engle and Conant 2002).
Development of the Intervention
Although the literature supported using peer analysis to promote self-reflection, it did not specify instruction in detail. Thus, PAR was developed over the three semesters of study. An in-depth analysis of the evolution of the PAR design is reported in a companion piece (Reinholz forthcoming b). For the present paper, I highlight three crucial areas of development: (1) the use of real student work, (2) randomization of partners, and (3) student training.
Use of Real Student Work
The basic PAR activity structure was developed in a pilot study in a community college algebra classroom (during spring 2012). The class consisted of 50 % females and 79 % traditionally underrepresented minorities (of 14 students after dropouts) who were simultaneously enrolled in a remedial English class. Experienced educational designers and community college instructors advised me to have students analyze hypothetical work to mitigate possible issues from students having their work analyzed by peers. However, even by mid-semester, students struggled to analyze hypothetical work. For instance, on the 6th homework assignment students were asked to rank order and analyze four sample explanations. Not a single student provided a clear rationale for their ordering. Some students also remarked that they did not understand the purpose of these activities.
I didn’t get it before, why you were always asking us to explain, but now it makes sense. When you don’t explain things people can’t tell what you’re doing.
Students were now able to discuss their analyses, and the activity was more meaningful, because students could see themselves as helping a peer. Because students had to present to a peer, they were held accountable for the quality of their work by peers in addition to the instructor (Engle and Conant 2002). Finally, students received feedback that they could use to revise their own work (providing additional opportunities or “resources” for improvement). For all of these reasons, the revised activity structure was adopted as the basis for PAR. The PAR procedures for how and when students would engage in this process were solidified at the beginning of Phase I (as described in the Core PAR Activities section).
Randomization of Partners
During Phase I, a small subset of students had short, superficial peer conferences.
In contrast to most student conversations (e.g., Peter and Lance’s conference above), Nicki did not discuss the concepts at all. She simply told Alex what to revise (lines 6, 8, and 10).
 Nicki: I think you did it right, except for the last 3 parts. (in a sing-song voice)
 Alex: Yeah, totally! (sarcastically)
 Nicki: Do you know how to do it, just using triangles?
 Alex: Yeah, I got that.
 Nicki: You gotta add the ones underneath, and subtract the other ones.
 Alex: Yep.
 Nicki: It looks pretty good, and then for more accuracy, you could do some more triangles.
 Alex: Even more triangles. (sarcastically)
 Nicki: And more triangles. (sarcastically)
 Alex: I said yours is awesome, and, yeah.
Alex provided no feedback, which was atypical. Alex revised his solution, but apparently did so without understanding, because he still answered the question incorrectly.
These superficial conversations took place between a small subset of students who worked with the same partners repeatedly. These students appeared to be working with their friends in the class, and usually did not provide useful feedback, or simply provided answers to one another. This issue was addressed in Phase II by having students sit in a random seat on PAR day; I did not observe such conversations during Phase II. Having students sit in a random seat as they entered the classroom also meant that little to no time was required for students to find partners.
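Random partnering can also be generated from a class roster instead of through seating. The following is a minimal sketch, assuming a hypothetical list of student identifiers; it is not the procedure used in the study, which relied on random seats. The function name `random_pairs` and the `seed` parameter are illustrative.

```python
import random


def random_pairs(roster, seed=None):
    """Shuffle a roster and split it into pairs.

    If the roster has an odd number of students, the leftover
    student joins the last pair, forming one group of three.
    """
    rng = random.Random(seed)  # a local generator keeps results reproducible per seed
    names = list(roster)       # copy so the caller's list is not modified
    rng.shuffle(names)
    groups = [names[i:i + 2] for i in range(0, len(names), 2)]
    if len(groups) > 1 and len(groups[-1]) == 1:
        groups[-2].extend(groups.pop())
    return groups


if __name__ == "__main__":
    # Hypothetical identifiers, not students from the study.
    for group in random_pairs(["S1", "S2", "S3", "S4", "S5"], seed=10):
        print(group)
```

Drawing a fresh shuffle each week makes it unlikely that the same students conference together repeatedly, which is the property the random seating was meant to provide.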
Training of Students
 Instructor: What about number 2?
(Three students shake their heads no, Patrick, Colton, and Barry)
 Patrick: In the lab we just did, we created that graph to show that midpoints aren’t always more accurate.
 Instructor: So midpoints aren’t always the best. What else?
 Sue: Is this just general, or about the PAR?
 Instructor: These are always about the PAR.
 Sue: Wouldn’t midpoints be better?
 Instructor: What do you think, would midpoints be better?
 Barry: Would it even matter, because it says “as accurate as you would want,” and you can only get so accurate with midpoints?
 Instructor: What do you guys think about that? Did you not hear him, or do you disagree?
 Jim: I couldn’t hear him.
 Instructor: Could you shout it from a mountain Barry?
 Barry: Yeah, so I just said that the prompt asks how you could improve the method to make your estimate as accurate as you would want, but using midpoints you can only make it so accurate, which is a problem.
After the instructor asked students what they thought about the explanation (line 3), a number of students shook their heads, indicating they thought it had problems. In line 4, Patrick connected the calculator lab that the class had been working on to the existing prompt, noting that midpoints don’t necessarily create the most accurate estimate. Sue was unsure about this, so she asked to clarify (line 8). Rather than answering himself, the instructor allowed the class to respond (giving them authority in the discussion). Barry gave an explanation for why the midpoint method is insufficient to produce arbitrary accuracy (lines 10 and 14). As this brief transcript highlights, students had opportunities to analyze various explanations and explain their reasoning (developing authority; cf. Engle and Conant 2002). This gave students opportunities to calibrate their own observations to the perspectives of their peers and instructor. I now analyze the impact of the intervention.
Phase I (Fall 2012)
Materials and Methods
Phase I took place in a university-level introductory calculus course in the US targeted at students majoring in engineering and the physical sciences. The course met 4 days a week for 50 min at a time. Ten parallel sections of the course were taught using a common syllabus, curriculum, textbook, exams, grading procedures, calculator labs, and a common pool of homework problems (instructors chose which problems to assign). Many of the PAR problems were drawn from this pool, but some were used only in the experimental section.
The calculus course was carefully coordinated, with all instructors meeting on a weekly basis to ensure alignment in how the curriculum was taught. Historically, the course had been taught using primarily a lecture-based format, which I confirmed through observations of three of the comparison sections. Instructors generally dedicated the same number of days to the same sections within the book and covered similar examples. The experimental section also used a lecture format, with some opportunities for student presentations and group work. The primary difference between the experimental and comparison sections was the use of PAR, as described in the Core PAR Activities section. Students had some opportunities to analyze hypothetical work to develop analytic skills, but during Phase I the systematic training procedure had not yet been implemented; students only engaged in three training activities during the entire semester.
Table 1 Phase I data collection
Michelle, who had a PhD in mathematics education and nearly 10 years of teaching experience, taught the experimental section and one of the comparison sections. Michelle taught two sections to help control for the impact of teacher effects. Of the two sections Michelle taught, I had her use PAR in the larger section, to garner evidence that PAR could be used in a variety of instructional contexts (not just small classes). Michelle used identical homework assignments and classroom activities in both sections, except for PAR (students in both sections completed the PAR problems, but the comparison students did not conference about their work).
Comparison instructors were chosen who had considerable prior teaching experience. Heather, a full-time instructor with over a decade of teaching experience, taught another observed comparison section. Logan, an advanced PhD student, taught the final observed comparison section. All observed instructors had taught the course a number of times before. Teachers in the comparison sections taught the course as they normally would.
To document changes in student understanding, I collected exam scores and final course grades for students in all sections of the course; all students took common exams, which allowed me to compare the experimental section to the department average scores. To study student interactions, I video recorded class sessions of the experimental section and three comparison sections. I performed all video observations with two stationary video cameras: one for the teacher, one for the class. As a researcher, I attended all class sessions, taking field notes of student behaviors and class discussions. In the experimental section, I also scanned students’ PAR assignments and made audio records of students’ conversations during peer-conferences. After the second midterm, I conducted semi-structured interviews with students in the experimental section about their experiences with PAR. A summary of the data collected is given in Table 1 (enrollment numbers are for students who remained in the course after the W-drop date, which was approximately halfway through the semester). Any students who enrolled but did not take the first exam were removed from all analyses; taking the first exam was used as an indicator of a serious attempt at the course.
Exam Design and Logistics
A five-member team wrote all exams. After three rounds of revisions, the course coordinator compiled the final version of the exam. Exams were based on elaborated study guides (3–4 pages); students were given the study guides 2–3 weeks in advance and only problems that fell under the scope of the study guides were included on exams. This ensured that exams were unbiased towards particular sections.
Table 2 Exam problem types
Problem solving: Non-rote mathematical problems. Over 80 % had multiple parts and required written explanations.
True/false: Students must explain why a statement is true, or provide a counterexample and explain why it is false.
Pure computation: Procedural practice of limits, derivatives, and integrals.
Miscellaneous: Multiple choice, fill-in-the-blank, and curve sketching.
Table 3 Percentage of points by problem type, with total points possible (Phase I)
Exams were administered in the evenings, each covering 3–4 weeks of material, except for the comprehensive final exam. Grading was blind, with each problem delegated to a single team of 2–3 graders, to ensure objectivity. Each team of graders designed their own grading rubrics, with approval from the course coordinator. These rubrics followed standard department procedures for many types of problems, such as true/false or procedural computations. Students needed to show their work and explain their reasoning to receive full credit on any problem other than pure computations of limits, derivatives, and integrals, and multiple-choice questions (in contrast, true/false questions did require explanations). Partial credit was offered on all problems.
Figure 8 is a typical problem-solving problem. Prompts (a)–(c) were a nontrivial calculus problem (maximization with the use of a parameter), and the final two prompts (d) and (e) required students to explain and justify their work. Figure 9 provides two sample prompts from a true/false problem. On average, each exam included four such prompts.
For true/false questions, students were required to provide an explanation justifying their answer. Even if they gave a counterexample to show the statement was false, they needed to provide a written justification of their counterexample to receive full credit. Problem solving and true/false problems (shown above) all required written explanations, and comprised about two-thirds of each exam (see Table 3). The only problems from exams in the present study that would accurately be classified as procedural recall are the “Pure Computation” problems, which comprised less than 30 % of any given exam. Although miscellaneous questions did not require explanations, none of them could be solved through simple recall.
Success in the course was defined as receiving an A, B, or C (the grade requirement for math-intensive majors like engineering), compared to receiving a D, F, or W (withdrawal with a W on the transcript, which is not calculated into one’s GPA). I computed success rates using student course grades, 70 % of which was based on exam scores and 30 % on homework and calculator lab scores. The course coordinator scaled homework and lab scores to ensure consistency across sections. The experimental section had an 82 % success rate, 13 percentage points higher than the 69 % success rate in the comparison sections. This effect was marginally significant, χ²(1, N = 409) = 3.4247, p = 0.064. This improvement is comparable to other active learning interventions in STEM, which result in a 12 % improvement in passage rates on average (Freeman et al. 2014). Moreover, students in the experimental section were more likely to persist in the course; the experimental drop rate was only 1.75 %, while the drop rate for non-experimental sections was 5.87 %.
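To make the success-rate comparison concrete, the following is a minimal sketch of a chi-square test of independence on a 2 × 2 table of section (experimental or comparison) by outcome (success or not). The cell counts are assumptions: they are rounded back from the reported percentages and section sizes (82 % of 56 and 69 % of 353), not taken from the study's data, so the result will not match the reported statistic exactly. The function name `chi_square_2x2` and the `yates` flag are illustrative.

```python
import math


def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    Set yates=True to apply the continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # ASSUMED counts reconstructed from the reported percentages.
    exp_success, exp_other = 46, 10      # about 82 % of 56
    cmp_success, cmp_other = 244, 109    # about 69 % of 353
    for use_yates in (False, True):
        stat, p = chi_square_2x2(exp_success, exp_other,
                                 cmp_success, cmp_other, yates=use_yates)
        print(f"yates={use_yates}: chi2={stat:.3f}, p={p:.3f}")
```

With these assumed counts, the continuity-corrected statistic falls close to the reported value while the uncorrected one is somewhat larger, which suggests (but does not confirm) that a continuity correction was applied in the original analysis.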
[Table: Comparison of nested models for Phase I exam scores. Labels: fixed effects [estimate (SE), t-value]: 67.583 (3.689), 18.2; 66.866 (3.60), 18.57; 6.972 (3.06), 2.282; random effects [variance (SD)]; overall model tests.]
[Table: Phase I exam (percentage) scores (SD in parentheses; p < 0.01**, p < 0.05*, p = 0.06†). Labels: Experimental (N = 56), Comparison (N = 353), ES (Cohen’s d).]
[Table: Phase I mean (percentage) scores (SD in parentheses) by problem type. Labels: Experimental (N = 56), Comparison (N = 353).]
Student performance by problem type was calculated as a percentage of the total possible points for each problem type, to account for the different number of points assigned to different problem types. Students in the experimental section scored numerically higher on all aspects of the exams, but differences in problem solving were not significant. This contrasts with prior studies on calculus reform that showed that students often fell behind on traditional procedural skills (Ganter 2001).
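The normalization described above (points earned as a percentage of points possible within each problem type) can be sketched as follows. The category names follow the paper, but the point totals are hypothetical and the function name is an assumption made for illustration.

```python
def percent_by_type(earned, possible):
    """Convert raw points per problem type into a percentage of points possible."""
    return {ptype: 100.0 * earned[ptype] / possible[ptype] for ptype in possible}

# Hypothetical point totals for one student on one exam (not study data).
possible_points = {
    "problem solving": 50,
    "true/false": 16,
    "pure computation": 25,
    "miscellaneous": 9,
}
earned_points = {
    "problem solving": 40,
    "true/false": 12,
    "pure computation": 20,
    "miscellaneous": 9,
}

for ptype, pct in percent_by_type(earned_points, possible_points).items():
    print(f"{ptype}: {pct:.1f} %")
```

Scaling each category by its own maximum keeps a 9-point category and a 50-point category on the same 0 to 100 scale, so category means can be compared directly.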
Analyses of student exams addressed the first two research questions: (1) students in the experimental section had 13 % higher success rates (marginally significant) than the other sections, and (2) these improvements were evident throughout the exams, not just on explanation-focused problems (see Tables 4, 5 and 6). Although exams were analyzed by problem type, this analysis did not account for differences in difficulty between items. Finally, using Exam 1 as a proxy for a pre-test score, I established that there were no significant differences between groups in baseline calculus understanding, so the effects found can likely be attributed to PAR.
Phase II (Spring 2013)
Materials and Methods
Phase II took place in a subsequent semester of the same calculus course. The same coordinator ran the course, and the curriculum and lecture-based teaching styles were the same as Phase I. In the experimental section, students engaged in PAR, just as in Phase I. There were three revisions to the Phase I design: (1) minor updates to the reflection and feedback forms, (2) the assignment of random partners (see the Randomization of Partners section), and (3) weekly training in analyzing work (see the Training of Students section).
Phase II once again had a single experimental section with 3 observed comparison sections. I taught the experimental section, to ensure full implementation of the design. I was a graduate student with approximately 3 years of teaching experience. I had not taught introductory calculus in the last 4 years. Comparison instructors were chosen who had prior experiences with teaching and with this particular course to provide a fair comparison; some of the other instructors had little teaching experience or had not taught this course before. Sam, a post-doctoral researcher working on mathematics education projects, taught one of the observed comparison sections. Graduate student instructors from Phase I, Tom and Bashir, taught the other two observed sections. These instructors and I had comparable experience with this course, but their teaching experiences were more recent. Grading procedures were the same as in Phase I.
[Table: Phase II data collection summary.]
Data collection procedures were the same, except for a few minor changes: (1) one camera was used rather than two to reduce logistical difficulties, (2) a research assistant conducted interviews and performed video observations (to maintain objectivity), and (3) students were offered one extra credit homework assignment as an incentive to give an interview, which greatly increased the number of respondents. Also, I had one of the comparison instructors (Tom) assign PAR problems as regular homework, which I collected. Finally, I collected background demographic data for students in the four observed sections.
There were no significant differences in academic background data (ACT scores and high school GPA) between the four observed sections. Numerically, the lowest averages were in the experimental section (mean GPA: 3.43 vs. 3.56, and mean ACT scores: 25.45 vs. 26.3). There were no significant differences in gender or race. The population of students who answered the survey consisted of 19 % females and 17 % traditionally underrepresented minorities. While demographics for all students were not collected, this sampling seemed to be relatively representative of the typical student population of calculus at this institution. Although students were not surveyed specifically, based on an analysis of student PAR conversations, none of the students appeared to have limited English proficiency or to be English Language Learners. Although demographics were not collected for Phase I, these Phase II results suggest that the natural distribution of students across sections was relatively balanced (i.e., the various sections are indeed comparable).
As in Phase I, student course grades were used to compute student success rates. Exam scores made up 70 % of course grades, with the other 30 % assigned to labs and homework. Once again, the course coordinator scaled student homework and lab scores to achieve consistency between sections. During Phase II, the experimental success rate was 79 %, while the comparison success rate was only 56 %. This 23 % difference in success rates was even larger than the 13 % difference during Phase I. This result was statistically significant, χ²(1, N = 336) = 6.3529, p = 0.0117*. To contextualize these results, I note that active learning interventions in STEM result in a 12 % improvement in passage rates on average (Freeman et al. 2014). Moreover, students in the experimental section were more likely to persist in this course; the experimental drop rate was 10.5 %, while the drop rate for non-experimental sections was 15.25 %. These drop rates were much higher than during Phase I, likely due to differences in the students who enroll in the fall and spring versions of this course.
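The p-values reported for the two chi-square tests can be checked against their test statistics. The sketch below uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(√(x/2)). The function name is an assumption made for illustration; the two statistics are the values reported in the text.

```python
import math

def chi2_p_value_df1(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    For df = 1, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is a standard normal variable.
    """
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(statistic / 2.0))

# Test statistics as reported in the text for Phase I and Phase II.
reported = {"Phase I": 3.4247, "Phase II": 6.3529}
for phase, stat in reported.items():
    print(f"{phase}: chi2 = {stat}, p = {chi2_p_value_df1(stat):.4f}")
```

This only converts a statistic to a p-value. Recomputing the statistics themselves would require the underlying 2 × 2 counts of successful and unsuccessful students, which are not given here.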
[Table: Phase II exam (percentage) scores (SD in parentheses; p < 0.05*, p < 0.01**). Labels: Experimental (N = 34), Comparison (N = 302), Difference (Phase II), Effect Sizes (Cohen’s d), Difference (Phase I).]
As before, Exam 1 provides a baseline to further establish the equivalence of the experimental and comparison groups. Because there were no significant differences for Exam 1 (a proxy for a pre-test), but the differences were significant for the other three exams, the differences can likely be attributed to PAR. The improvements in exam performance (row 3) were even larger than in Phase I (row 4). The effect sizes were medium (Cohen 1988). To contextualize these results, I note that active learning interventions in STEM classrooms result in a 6 % improvement in exam scores, on average (Freeman et al. 2014). Once again, the smaller, non-significant differences in Exam 1 scores are likely an indicator of early benefits of PAR instruction.
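For readers unfamiliar with the effect-size measure, the following sketch computes Cohen's d from group summary statistics. It assumes the common pooled-standard-deviation form; the text does not state which variant was used. The group sizes match Phase II, but the means and standard deviations are hypothetical.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled standard deviation of two independent groups."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical means and SDs (not study data); group sizes as in Phase II.
d = cohens_d(mean1=75.0, sd1=10.0, n1=34, mean2=70.0, sd2=10.0, n2=302)
print(f"Cohen's d = {d:.2f}")
```

By Cohen's (1988) conventions, values near 0.2, 0.5, and 0.8 are usually described as small, medium, and large effects, respectively.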
[Table: Comparison of nested models for Phase II exam scores. Labels: fixed effects [estimate (SE), t-value]: 66.98 (2.09), 31.99; 65.48 (1.66), 39.55; 12.28 (3.27), 3.75; random effects [variance (SD)]; overall model tests.]
[Table: Phase II mean (percentage) scores (SD in parentheses) by problem type. Labels: Experimental (N = 34), Comparison (N = 302).]
It is likely that the differences between the experimental and comparison sections are overstated for miscellaneous problems. There were a total of 450 points possible across all exams, with only 35 associated with miscellaneous problems. Because these problems made up such a small percentage of the exams, they are likely to be less reliable than categories such as problem solving, which made up approximately 50 % of the exams.
Phase II provided a replication of Phase I’s results: (1) students in the experimental section had 23 % higher success rates than other sections, and (2) they performed numerically better on all aspects of the common exams (gains for problem solving and miscellaneous were significant). Once again, item difficulty was not taken into consideration. During Phase I the impact of PAR was measured while controlling for teacher effects. Thus, it is unlikely that improvements during Phase II can be attributed entirely to teacher effects. Moreover, Phase II featured an improved version of the Phase I design, which likely accounts for at least some of the additional improvement. Phase II demonstrates that multiple teachers could use PAR successfully.
Comparison of Phases I and II
Students in the experimental sections numerically outperformed the students in the comparison sections for all problem types. Nevertheless, there were notable differences between phases. During Phase I, the differences for true/false and pure computation problems were significant, while they were not during Phase II. Also, during Phase II the differences for problem solving problems were significant, while they were not significant during Phase I. These differences may be attributable to differences in teaching style across phases; Michelle was much more likely to use IRE-style questioning in her classroom, emphasizing procedural computations, while Dan was more likely to require open-ended explanations from the students. Moreover, analyses did not account for the difficulty of items on exams, which may also account for some of these differences.
The average success rate for comparison sections during Phase I was considerably higher than during Phase II (69 % vs. 56 %). This difference was reflected in average exam scores, which were 5–10 % higher on exams 1–3 comparing Phase I to Phase II students; notably, Phase II students scored higher on the final exam compared to the Phase I students, by 5.7 %. In the design of the Phase II final exam, the course coordinator noted that the previous semester’s exam was too difficult, and made efforts to decrease the length and difficulty level of the Phase II exam. The course coordinator also noted that students during spring semesters historically tend to have lower success rates than those in the fall, because they are generally students who were not on the “standard” track, meaning that they may have had to take additional remedial mathematics courses before they could take calculus.
Because Tom and Bashir both taught during Phase I and Phase II, the average scores from their sections also provide a point of comparison. Bashir’s scores increased between phases (63.7 to 69.4 %), while Tom’s remained mostly the same (65.5 to 65.8 %). Given the differences in student populations, this seems to indicate that both instructors improved in their teaching across semesters, but without knowing more about their specific classes no more definitive conclusions can be drawn. The next major section describes student revisions, and the section following that describes the PAR mechanisms that supported learning.
Improvement in Student Explanations
PAR was designed to improve student understanding generally, and student explanations specifically. While in-depth analyses of student explanations are beyond the scope of this paper, they are discussed in a forthcoming paper (Reinholz forthcoming a). To contextualize student improvements on exams, I provide a brief summary of those results.
Student explanations were analyzed on three of the PAR problems (PAR 5, 10, and 14) to see the progression of student explanations over the course of the semester. Student work was analyzed from the Phase I experimental section, the Phase II experimental section, and a comparison section from Phase II. Student explanations were scored using a rubric consisting of four categories: accuracy, mathematical language, clarity, and use of diagrams. Solutions were double coded, with 94.1 % agreement.
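As an illustration of how a double-coding agreement figure such as the one above can be obtained, the sketch below computes simple percent agreement between two raters. The text does not specify the unit of agreement or the exact formula, so this is an assumption, and the rubric scores shown are hypothetical.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of items to which two raters assigned the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both raters must score the same set of items")
    if not codes_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)

# Hypothetical rubric scores from two raters on ten solutions (not study data).
rater_a = [3, 2, 4, 1, 3, 2, 4, 4, 1, 2]
rater_b = [3, 2, 4, 1, 3, 2, 4, 3, 1, 2]
print(f"agreement: {percent_agreement(rater_a, rater_b):.1f} %")
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a common alternative.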
Aggregating explanation scores across the semester, the Phase II section scored more than 4.5 times higher than the comparison section, and more than 1.5 times higher than the Phase I section. The results held across individual dimensions as well, with Phase I scoring higher than the comparison section on all aspects of their explanations and Phase II scoring higher than Phase I on all aspects of their explanations. As this brief summary of some of the results highlights, students improved their explanations considerably as a result of PAR.
Student Revisions in PAR
The quantitative results from Phases I and II provided evidence of the positive impact of PAR on student performance in calculus. To better understand how students learned from PAR, I analyzed PAR assignments in the Phase II experimental section to look for changes in PAR scores as a result of revision.
Materials and Methods
To understand the impact of PAR for different students in the course, I broke the class into thirds (High, Middle, and Low), according to students’ final scores on the PAR assignments. I used a random number generator to select three students from each of these groups. Of these nine students who were randomly sampled, there were four cases in which I had recorded a score for their final solution, but did not have a scan of the student’s PAR packet. These were students who turned in their assignment separately from the rest of the class, and as a result some assignments did not get scanned. I dropped these four solutions from the analysis. I had a total of 122 PAR packets to analyze, each with an initial and final solution.
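The sampling procedure described above (rank students by final PAR score, split into thirds, draw three at random from each third) can be sketched as follows. The function name, the handling of ties and uneven group sizes, the seed, and the 30-student roster are all assumptions made for illustration.

```python
import random

def sample_by_thirds(final_scores, per_group=3, seed=None):
    """Split students into Low/Middle/High thirds by score and sample from each.

    final_scores maps a student identifier to that student's final PAR score.
    """
    rng = random.Random(seed)
    ranked = sorted(final_scores, key=final_scores.get)  # lowest score first
    n = len(ranked)
    cut1, cut2 = n // 3, (2 * n) // 3
    groups = {
        "Low": ranked[:cut1],
        "Middle": ranked[cut1:cut2],
        "High": ranked[cut2:],
    }
    return {name: rng.sample(members, per_group) for name, members in groups.items()}

# Hypothetical roster of 30 students with made-up scores (not study data).
scores = {f"s{i:02d}": i for i in range(30)}
picked = sample_by_thirds(scores, per_group=3, seed=0)
for group, students in picked.items():
    print(group, students)
```

Fixing the seed makes the draw reproducible, which is useful if the sampled students need to be re-identified later.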
To measure the impact of PAR on student solutions, I blindly re-scored each student’s initial and final solutions. Although I did not conduct double scoring to establish inter-rater reliability, the purpose of this analysis was to investigate changes in scores, so any systematic biases in scoring should be present in both the scoring of initial and final solutions.
The sampled students turned in all of their PAR homework assignments, except that two students in the Low group did not turn in PAR 14 (the final problem). This was a 98 % completion rate for PAR homework assignments. In contrast, the comparison section had only a 70.2 % completion rate for the same problems.
[Table: Average PAR scores, by group. Values: 3.4 · 10⁻¹¹, 1.8 · 10⁻⁹, 2.6 · 10⁻⁷.]
[Table: Potential sources of considerable improvement on PAR problems. Labels: Number of students.]
All students made significant improvements to their homework solutions as they revised from initial to final solutions. Students in the Low and Middle groups had similar distributions of initial PAR scores. However, students in the Middle group were much more likely to considerably improve their solutions after revision. These data suggest that one of the key differences between students who scored in the Low and Middle groups may be how they benefited from PAR and their revisions. Table 12 indicates that when students made considerable improvements in their revisions it was mostly due to their PAR conversations and additional time spent working on the problem after their conferences.
The help-seeking literature suggests that students with moderate need are the most likely to seek help (Karabenick and Knapp 1988). This is consistent with the finding that students in the middle group were the most likely to improve as a result of seeking external help. However, the students in the low and middle groups had relatively similar initial scores, so it is unclear what factors may have caused some of them to seek help while others did not. In general, low-performing students may be less likely to seek help due to low self-efficacy or negative emotions related to failure (Karabenick and Knapp 1988), which may have been factors at work here. This is an area for future research.
Materials and Methods
To understand student experiences with PAR, I analyzed interviews from Phase II. I focused on Phase II data because there was a much higher response rate than Phase I, which meant the interviews were more likely to represent a range of opinions. The following analyses focus on the first interview question that was asked: “Let’s discuss the PAR; what’s working well and not so well for you?” I focus on this question, because it was likely to elicit a balance of positive and negative aspects of PAR.
After transcribing student responses, I read through all of the transcripts multiple times to identify themes. After a set of themes was identified, I developed codes, both for positive and negative reactions. Using this set of codes I re-analyzed each student response and marked whether or not each code was present.
[Table: Frequency of student reactions to PAR (N = 22 interviews).]
The positive student reactions described four mechanisms of PAR that appeared to support learning. PAR required students to work in iterative cycles: students made a preliminary attempt at a problem, received feedback and thought about the problem more deeply, revised, and turned in their final solution. Within these iterative cycles, students encountered new ideas to support their learning: by discussing with peers, by explaining and hearing explanations, and by seeing the work of others. These four sources can be consolidated into the acronym IDEA, meaning Iteration, Discussion, Explanation, exposure to Alternatives. I now discuss these mechanisms. The student quotes given below are intended to exemplify each category of student responses.
Mike made a similar remark,
I like the PAR. It’s like we get to come to class and be wrong, and that’s okay. Then later we get to revise our work and be right.
This activity structure seemed to increase students’ perseverance. Rather than giving up when they could not solve a problem on their own, students seemed to realize that getting input from peers, the instructor, or other resources was often sufficient to help them solve challenging problems. This perseverance was evident in students’ homework assignments; the number of students who fully completed the challenging PAR problems in the experimental section was much higher than in the comparison section (98 % vs. 70.2 %).
PAR is good. I like how we can put our initial solution down, and even if it’s wrong it doesn’t really matter, because we can just talk about it with a group member the next day, and figure it out together. And generally you don’t get stuck on a wrong solution, you figure it out.
Discussing the Problem Together
The value of student discussions was exemplified by the first example provided in this paper, in which Peter and Lance revised their methods for approximating error.
I like the PAR because it got you to interact and communicate with the other students…no one likes to just watch someone talk at a board all day. The self teaching and student interaction helped…PAR helps us be more social, so you can talk to other students, set up study groups, and get to know your classmates.
As Maria said, PAR helped her learn:
I like the PAR. I’d take it over the other homework. You do about the same amount of work, and I think you learn more from it. You have to explain what you did, rather than just say here, I got this magical number. You actually understand the process and I think that helps more in learning than just getting the magical number.
how to make it easier to read from another person’s perspective. It’s one thing if I think it looks good, but other people look at it and say it doesn’t make sense to me. So it helps me figure out how to communicate better. It helps me to explain things in a way that is readable to others and not just myself.
Exposure to Alternatives
I really like looking at other people’s initial models. I can see what they are thinking, it puts me in their head, and I can see that. A lot of times I’m really wrong and I can see different ways to do the same thing.
PAR provided students with opportunities to analyze, explain, and discuss the work of their classmates. These opportunities seemed to help students make mathematical connections and develop deeper understandings of the problems. Students in comparison sections rarely had opportunities during class sessions to explore their peers’ reasoning in depth.
 Revati: I know I did it all wrong. I was reading yours and was like, “oh my goodness. How did I miss this?” Okay, so. You did a really good job explaining, so you have all that right. And your math is all correct so… good job! You could have turned this in as your final and gotten 100 %
 Federico: Okay, thank you. Em, well I think now you know the errors?
Negative Reactions to PAR
The other main issue that students had (noted by only two students) was the amount of work required by the course. Given the large number of assignments they had, PAR felt like it was too much on top of an already overloaded course. This criticism was not of PAR specifically, but of the organization of the course.
If it’s a confusing problem we just get together and talk about how neither of us know what is going on or we have no idea how to do it.
This paper focused on how reflection tools could promote improved understanding of calculus. Through cycles of problem solving, reflection, feedback and analysis, and revision, students had opportunities to exercise disciplinary authority and were held accountable to their peers (Engle and Conant 2002). PAR was supported by training exercises that helped students learn to analyze work and provide feedback. The PAR activities were conducted using a rich problem set, which provided opportunities for students to explain their ideas and compare multiple solutions with one another. Although these problems seemed to be an important part of the intervention, by themselves they were insufficient; these problems were also assigned as regular homework in Tom’s comparison section during Phase II with little impact.
Students in the PAR sections were given some opportunities to explain their ideas during class and engage in group work to support PAR. Although in-depth analyses of classroom activities are beyond the scope of this paper, similar activities were observed in some of the comparison classrooms as well (e.g., Heather’s and Sam’s sections). Accordingly, it seems reasonable to assert that the standard classroom activities in the PAR sections were not considerably different from the comparison sections; PAR was implemented in a primarily lecture-based environment, which was typical of calculus instruction at this institution.
Success rates in the experimental sections were higher than in the comparison sections, by 13 % in Phase I (marginally significant) and by 23 % in Phase II (statistically significant). This demonstrated the impact of PAR on student success (research question one). These are impressive gains, showing that the impact of PAR compares favorably with other active learning interventions in STEM (Freeman et al. 2014). Moreover, these gains were replicated over two semesters. These gains are important, because student success in calculus remains an area of concern. The persistence rates were also higher in the experimental sections during both phases of study; it is possible that the community-building aspects of PAR may have made students less likely to drop the course.
Improvements were also apparent on exam scores during Phase I (experimental vs. comparison, same instructor: 6.19 %; and experimental vs. comparison, other instructors: 5.71 %) and Phase II (experimental vs. comparison: 11 %). Students improved numerically on all aspects of their exams, both explanations and procedures (research question two). These are considerable differences, especially given that the experimental sections included more students who would traditionally drop out of the course. No significant differences were apparent on Exam 1, which provides a proxy for a baseline pre-test score to establish the comparability of students in different sections. A companion paper (Reinholz forthcoming a) focuses more directly on student explanations, and provides results consistent with the present findings.
During Phase I, Michelle taught two sections to control for teacher effects. Michelle’s comparison section performed similarly to the other comparison sections, which suggests that improvements can be attributed to PAR, not the teacher. During Phase II, teacher effects were not controlled for specifically. As a result, it is possible that some improvements may be attributed to the particular instructor, but given the impact of PAR during Phase I, and the improvements to the design for Phase II, it is unlikely that improvements can be attributed entirely to teacher effects. A goal of future studies would be to further replicate these results through a randomized experimental design.
The present study also makes an important contribution to literature on assessment for learning. Despite a large body of work on peer assessment, most of it has focused on calibration between instructor and peer grades, not how assessment can be used to promote learning. PAR demonstrates the effectiveness of such techniques, particularly in a content area where such practices are uncommon. Moreover, the iterative nature of PAR seemed to help students develop the perseverance required to solve challenging problems. The impact of PAR on student dispositions is an area for further study.
Design-based revisions provided greater affordances to support student learning (research question three). In particular, students worked with random partners, had regular training opportunities, and used streamlined self-reflection and peer-feedback forms. The underlying principles of having students analyze each other’s work and provide feedback to each other appear to be productive activities that may work in a variety of contexts.
PAR was developed with two very different student populations in different contexts: primarily traditionally underrepresented minorities in a remedial algebra class and mostly White students in introductory college calculus. Since this initial study, PAR has been used in differential equations, introductory mechanics (physics), engineering statics, and thermodynamics. Given that PAR has been implemented in a variety of contexts, it appears to be a general method that could be effective across a broad variety of STEM learning contexts. To implement PAR and the corresponding training activities requires no more than 20 min of in-class time each week, which means that it is possible to include PAR as a part of a variety of different classrooms. As the impact of PAR is studied in new contexts, it will provide further insight into the activity structure and how students learn through peer analysis.
All research reported in this manuscript was conducted in accordance with the ethical standards in the Helsinki Declaration of 1975, as revised in 2000, as well as national law, with approval of the appropriate Institutional Review Board.
The author thanks Elissa Sato, Hee-jeong Kim, and Bona Kang for their helpful feedback on an earlier version of this manuscript. The research reported here was supported in part by the Institute of Education Sciences pre-doctoral training grant R305B090026 to the University of California, Berkeley. The opinions expressed are those of the authors and do not represent views of the Institute of Education Sciences or the U.S. Department of Education.
- Andrade, H. L. (2010). Students as the definitive source of formative assessment: Academic self-assessment and the self-regulation of learning. In NERA Conference Proceedings 2010. Rocky Hill, Connecticut: Paper 25.
- Black, P., Harrison, C., & Lee, C. (2003). Assessment for learning: Putting it into practice. Berkshire: Open University Press.
- Bookman, J., & Friedman, C. P. (1999). The Evaluation of Project Calc at Duke University, 1989–1994. MAA NOTES, 253–256.
- Boud, D., Keogh, R., & Walker, D. (1996). Promoting reflection in learning: A model. In Boundaries of Adult Learning (Vol. 1, pp. 32–56). Routledge.
- Bressoud, D. M., Carlson, M. P., Mesa, V., & Rasmussen, C. (2013). The calculus student: insights from the Mathematical Association of America national study. International Journal of Mathematical Education in Science and Technology, 44(4), 685–698. doi:10.1080/0020739X.2013.798874.
- Cohen, J. (1988). Statistical power analysis for the behavioral sciences. NY: Routledge.
- Featherstone, H., Crespo, S., Jilk, L. M., Oslund, J. A., Parks, A. N., & Wood, M. B. (2011). Smarter together! Collaboration and equity in the elementary math classroom. Reston, VA: National Council of Teachers of Mathematics.
- Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., & Wenderoth, M. P. (2014). Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences, 201319030. http://doi.org/10.1073/pnas.1319030111.
- Ganter, S. L. (2001). Changing calculus: a report on evaluation efforts and national impact from 1988–1998. AMC, 10, 12.
- Kolb, D. A. (1984). Experiential learning: Experience as the source of learning and development (Vol. 1). Upper Saddle River: Prentice-Hall.
- National Center for Education and Statistics. (2010). The nation’s report card: Grade 12 reading and mathematics 2009 National and pilot state results (No. NCES 2011–455). Institute of Education Sciences, U.S. Department of Education: Washington, DC.
- National Governors Association Center for Best Practices & Council of Chief State School Officers. (2010). Common core state standards for mathematics. Washington, DC: Authors.
- National Council of Teachers of Mathematics. (2000). Principles and standards for school mathematics. Reston, VA: The National Council of Teachers of Mathematics.
- Niss, M. (2003). Mathematical competencies and the learning of mathematics: The Danish KOM project. In 3rd Mediterranean conference on mathematical education (pp. 115–124).
- National Research Council. (2001). Adding it up: Helping children learn mathematics. Washington, DC: National Academy Press.
- Oehrtman, M., Carlson, M., & Thompson, P. W. (2008). Foundational reasoning abilities that promote coherence in students’ function understanding. In C. Carlson & C. Rasmussen (Eds.), Making the connection: Research and teaching in undergraduate mathematics education (pp. 27–42). Washington, DC: Mathematical Association of America.
- Rasmussen, C., Marrongelle, K., & Borba, M. C. (2014). Research on calculus: what do we know and where do we need to go? ZDM – The International Journal on Mathematics Education, 46(4), 507–515.
- Reinholz, D. L. (2015). The assessment cycle: a model for learning through peer assessment. Assessment & Evaluation in Higher Education, 1–15. http://doi.org/10.1080/02602938.2015.1008982.
- Reinholz, D. L. (forthcoming a). Using peer-review to improve undergraduate calculus explanations: A design-based approach.
- Reinholz, D. L. (forthcoming b). Design bridges: Supporting inferences across multiple levels of design-based research.
- Roddick, C. D. (2001). Differences in learning outcomes: calculus & mathematica vs. traditional calculus. PRIMUS, 11(2), 161–184.
- Schoenfeld, A. H. (1991). What’s all the fuss about problem solving. ZDM – The International Journal on Mathematics Education, 91(1), 4–8.
- Schoenfeld, A. H. (1995). A brief biography of calculus reform. UME Trends: News and Reports on Undergraduate Mathematics Education, 6(6), 3–5.
- Schwingendorf, K. E., McCabe, G. P., & Kuhn, J. (2000). A longitudinal study of the C4L calculus reform program: comparisons of C4L and traditional students. CBMS Issues in Mathematics Education, 8, 63–76.
- Tall, D. (1992). Students’ difficulties in calculus. In Proceedings of Working Group 3 on Students’ Difficulties in Calculus (Vol. 7, pp. 13–28). International Congress on Mathematics Education.
- Tallman, M., Carlson, M., Bressoud, D. M., & Pearson, M. (forthcoming). A characterization of Calculus I final exams in U.S. colleges and universities.