Introduction

Examining the development of grading in the United States reveals a progression from its informal origins in 1646 to the establishment of the standardized A-F system that characterizes academic accomplishment today (Bowen & Cooper, 2022). Records from the late 1700s indicate that early grades were influenced by social factors such as student socio-economic status (Bowen & Cooper, 2022). With records dating back to 1898, the A-F grading scale was first used to objectively reflect classroom achievement but did not become common until the mid-1900s (Bowen & Cooper, 2022; Schinske & Tanner, 2014). The rising popularity of intelligence testing in the mid-1900s prompted educators to shift away from the previous norm of assigning grades influenced by social factors and instead assign grades based on their perception of student merit and achievement, often curving to fit a bell curve (Bowen & Cooper, 2022; Feldman, 2018; Schinske & Tanner, 2014). Aside from lowering the threshold for failure from 75% to 60% in the mid-20th century, and adjusting the letter-grade bins accordingly, the traditional A-F grading scheme remains in effectively identical form today, with grades of A = 90–100%, B = 80–89%, C = 70–79%, D = 60–69%, and F = 0–59%. The practice of assigning grades based on this scale has become a hallmark of traditional grading practices across higher education.

Traditional grading practices are increasingly critiqued as perpetuating systemic inequities (Feldman, 2018) by conflating the outcome of learning with behaviors in the process of learning (Lipnevich et al., 2020, 2021). Alternative grading practices, which include, but are not limited to, standards-based grading (SBG; Lewis, 2022a), specifications grading (Nilson & Stanny, 2023), and ungrading (Kohn & Blum, 2020), all share the goal of more accurately communicating what a student knows and can do (Clark & Talbert, 2023; Nilson & Stanny, 2023; Schinske & Tanner, 2014; Townsley & Schmid, 2020). The popularity of alternative grading practices in undergraduate Science, Technology, Engineering, and Mathematics (STEM) courses, evidenced by dedicated conferences (e.g., The Grading Conference), Substack newsletters (e.g., Grading for Growth), and a multitude of conversations on social media, highlights several needs for faculty in higher education. First, there is confusion over what elements constitute specific alternative grading practices; these practices have been given many names with little universality in definitions or implementation. Second, interested instructors are often overwhelmed by the many options for alternative grading and struggle to systematically implement classroom changes. Finally, there is skepticism that alternative grading practices positively impact student learning. It is this last need, the need for evidence of efficacy, that we address in this scoping review.

An interdisciplinary scoping review

We used an interdisciplinary approach to explore alternative grading practices across undergraduate STEM disciplines for several reasons. First, STEM disciplines are “not a monolith” (Reinholz et al., 2019); disciplinary differences can have profound impacts on classroom instructional practices, including the uptake of new practices. For example, research finds that adoption of evidence-based teaching strategies is not uniform across STEM fields (Lund & Stains, 2015; Shadle et al., 2017; Stains et al., 2018). Second, grading practices also vary extensively across STEM disciplines (Lipnevich et al., 2020). As a result, we might expect to see differences in how STEM disciplines adopt and adapt alternative grading practices. Such variation has repercussions for students, who must navigate a curriculum that includes a suite of introductory STEM courses, all while making sense of their distinct grading systems. Finally, there is a tendency for discipline-based education research (DBER) to occur in silos, with limited cross-talk across disciplines (Slominski et al., 2023, 2020; Trujillo & Long, 2018). If we are to make systemic changes to our grading practices, it is essential to use interdisciplinary approaches, so that we can build a broad consensus about how grading practices impact student learning, and ultimately, whether they result in more inclusive classrooms.

As an interdisciplinary group, the DBER community at NDSU is uniquely positioned to tackle a scoping review exploring alternative grading practices across undergraduate STEM. We are a collaborative community with faculty, post-doctoral researchers, and graduate students from Biology, Chemistry, Engineering, Physics, and Psychology. We are also active practitioners, teaching courses in which we have implemented alternative grading practices. Represented in our community are faculty who teach nearly every introductory science course; these are typically large-enrollment and often gatekeeping, prerequisite courses. We first began exploring alternative grading practices through a book club in 2020 that read Grading for Equity (Feldman, 2018). Out of this small group grew an interest in alternative grading across the NDSU DBER community, both in practice (i.e., how do we implement this in a large-enrollment, first-year course?) and in research (i.e., what impact do these practices have on student outcomes, both cognitive and affective?).

Scoping review—a type of literature review

The growing interest in alternative grading approaches in undergraduate STEM education, particularly the calls for evidence of their efficacy, warrants an exploration of existing literature. However, the use of alternative grading practices in undergraduate STEM education is relatively recent; as a result, the literature corpus is limited and disparate, and not conducive to more traditional literature reviews like a systematic review or meta-analysis.

Unlike a systematic review, which is narrowly focused and driven by a well-defined research question, a scoping review is suited to rapidly map or describe the current state of an emerging research field (Arksey & O’Malley, 2005; Khalil et al., 2016). Commonly used in healthcare research, scoping reviews follow a systematic approach to determine the extent of research on a particular topic (Arksey & O’Malley, 2005; Munn et al., 2018). In the present study, we adopt the five-stage framework of Arksey and O’Malley (2005), which includes (1) identifying the research question, (2) identifying relevant studies, (3) study selection, (4) analysis, and (5) reporting results.

Scoping reviews report the extent and nature of current research on a particular topic and can be used to clarify key concepts, examine research methods, and identify knowledge gaps in the literature (Munn et al., 2018); however, scoping reviews do not assess the quality of existing research (Arksey & O’Malley, 2005). In our current study, we were particularly interested in gaps in the existing literature on alternative grading. Miles (2017) describes seven categories of research gaps: evidence gaps, knowledge gaps, practical knowledge gaps, methodological gaps, empirical gaps, theoretical gaps, and population gaps. An evidence gap is indicated when results in the body of literature contradict each other. A knowledge gap exists when the desired research is not found; the evidence either does not exist or has not been published. A practical knowledge gap arises when the empirically best-supported action does not translate into practice. A methodological gap occurs when a greater variety of methods is needed to generate new results. An empirical gap occurs when findings or predictions have yet to be empirically verified. A theoretical gap is present when the research area lacks an overarching theoretical backing. Finally, a population gap is indicated when the current literature lacks representation of a population (e.g., by race or gender).

We conducted a scoping review to describe the extent and nature of recent research on alternative grading and the impacts on undergraduate student outcomes (e.g., grades, motivation, etc.) across STEM disciplines, and to identify the types of gaps present in the collective body of literature. This study describes the alternative grading research landscape through three mechanisms: (1) descriptive statistics summarizing the context of existing publications, (2) analysis describing the study characteristics with a focus on the measurements and metrics used, use of any validated instruments, and results, and (3) direct citation and co-citation analyses to understand how publications in this body of work are citing the broader literature.

Methods

Context: developing our collaboration

For the last 15 years, our DBER community at NDSU has hosted a vibrant Journal Club. Though we primarily serve the DBER community, we routinely host faculty from the broader NDSU community who have a growing interest in teaching and learning. All attendees, including faculty, postdocs, graduate students, and undergraduate students, contribute to the Journal Club as leaders, facilitators, and contributors. Each Friday during the fall and spring semesters, we gather to discuss contemporary research (whether our own or others’), which often focuses on evidence-based pedagogical practices. Over the past several years, we held multiple Journal Club sessions that centered on alternative grading approaches. Given our group’s interest in this topic, we decided to allocate a subset of our weekly discussions to initiating this scoping review. While the emphasis of the current work is on the findings from the scoping review, our interdisciplinary approach to this project exemplifies the strengths and advantages that come from this type of collaboration (see Henderson et al., 2017), a point we return to and elaborate on in the Discussion.

During the Fall 2022 semester, we devoted several Journal Club meetings to the scoping review process (Table 1). Our first meeting was faculty-led (AL, JM), with the goal of scaffolding the beginning of the process by exploring different types of literature reviews to confirm that a scoping review was an appropriate tool given our research interests. Subsequent sessions were co-led by faculty and students, who expressed particular interest in later topics associated with the scoping review process. At the end of the Fall 2022 semester, a smaller group of faculty and students formed from those interested in deeper, ongoing involvement with the scoping review; participation in this group was voluntary. During the Spring 2023 semester, our group met biweekly outside of Journal Club to make progress on the scoping review. At each meeting, smaller working groups discussed their progress over the previous two weeks and outlined goals for the upcoming two weeks. These meetings also provided the opportunity for working groups to bring questions and seek input from the larger group. During the summer of 2023, we continued to meet biweekly and make progress in our smaller working groups, which focused primarily on thematic analysis, direct citation/co-citation analysis, and manuscript preparation. During the Fall 2023 semester, we continued to meet biweekly, and our focus shifted to data analysis, figure generation, and writing the final manuscript.

Table 1 Timeline of our collaborative approach to the scoping review

Our group also created and signed an Authorship Agreement (Supplementary Materials), which was intended to help us establish and maintain clear expectations regarding authorship. Authorship was based on criteria described in and derived from the International Committee of Medical Journal Editors (2023).

Scoping review

Our scoping review followed the five stages described in Arksey and O’Malley (2005) and addressed the items in the PRISMA checklist for scoping reviews (Peters et al., 2020; Tricco et al., 2018). We briefly describe each stage as it pertains to our study.

Stage 1: identifying the research question

Alternative grading is both an emerging practice in STEM education and a developing research area in DBER. Early discussions in our group (Table 1) centered on determining our research question. Through iterative discussion, we developed two initial guiding research questions: (1) what is currently known about the impacts of alternative grading practices on student outcomes across STEM disciplines, and (2) what gaps currently exist in the literature. As our scoping review progressed, we identified a third research interest, namely whether the research on alternative grading in STEM was occurring in discipline- or methods-based silos. We recognize ‘alternative grading’ is a broad term, one we chose specifically to encompass the diversity of grading practices faculty are currently adopting (e.g., specifications grading, ungrading, standards-based grading, etc.). We also note the uptake of alternative grading practices is uneven across disciplines, hence our need to use a broad and encompassing term.

Stage 2: identifying relevant studies

As with systematic reviews and meta-analyses, we searched a variety of sources to answer our guiding research questions. In Fall 2022, we searched databases and journals spanning the disciplines of the graduate students and faculty in our group: Biology, Chemistry, Engineering, Physics, and Psychology (Supplementary Materials Table 1). Some of these sources also indexed published conference proceedings, while others did not. After discussion, we decided to include peer-reviewed conference proceedings because they reflect the current state of the research in several disciplines.

As a community, and after a discussion with our Science Librarian, we developed a set of search terms related to alternative grading (Supplementary Materials Table 2). Each keyword was combined with our focal disciplines using Boolean operators (e.g., “standards-based grading AND biol*”). This initial search yielded 467 records (Fig. 1). Duplicate records were removed, leaving 332 distinct records. These 332 records underwent blind inclusion sorting using Rayyan (Ouzzani et al., 2016), a freely available, web-based application designed to support collaborative literature reviews. When the necessary details were not present or clear in a record’s abstract while applying the inclusion criteria at this stage, the record underwent a full-text review.
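For illustration, the full set of query strings can be generated programmatically; the short sketch below uses hypothetical subsets of our keyword and discipline lists (the full lists appear in Supplementary Materials Table 2) and is meant only to show the combination step, not to reproduce our exact queries.

```python
from itertools import product

# Hypothetical subsets of the keywords and discipline terms (illustrative only).
keywords = ["standards-based grading", "specifications grading", "ungrading"]
disciplines = ["biol*", "chem*", "engineer*", "physic*", "psycholog*"]

# Combine each keyword with each discipline using the Boolean AND operator,
# e.g., "standards-based grading" AND biol*
queries = [f'"{kw}" AND {disc}' for kw, disc in product(keywords, disciplines)]
for q in queries:
    print(q)
```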

Fig. 1
figure 1

Flow diagram of the study selection process

Stage 3: study selection

After an initial review of our 332 records, we recognized many studies were not relevant to the goal of our scoping review, which was to characterize the landscape of empirical research on alternative grading. We thus limited ourselves to data-driven, peer-reviewed studies and omitted essays and opinion pieces. We also excluded entire books, theses, and dissertations because we wanted research that was widely available and, importantly, peer reviewed. We further limited our scoping review to studies of formal learning environments at the undergraduate level because our community at NDSU focuses on, and has expertise specifically in, DBER at that level. Additionally, we limited our studies to those completed at schools in the United States because grading practices in the U.S. differ substantially from those in other countries. Finally, based on an early exploration of the literature, we did not limit the publication dates of our search, resulting in a corpus that includes studies as early as the 1970s, with most published after 2014. While the norms of teaching and learning in higher education may have evolved since the 1970s, what we now consider traditional grading practices have not changed since they were first introduced in the 1890s; alternative grading practices have therefore been “alternative” since well before the 1970s. Further, the alternative practices presented in early studies are similar in motivation and execution to those presented in later studies, so we included all studies that met our inclusion criteria, regardless of publication date, to avoid biasing our sample while maximizing the number of included studies. Indeed, both earlier and later studies on this topic are relevant to our goal of characterizing the landscape of research into alternative grading practices. After applying the inclusion criteria (see Table 2), there were 92 records.

Table 2 Inclusion and exclusion criteria

Upon a deeper reading of the records during analysis, we identified 20 records that required further review because they potentially did not fully meet our inclusion criteria. Some of these records were reflective manuscripts with no data collected, and it was unclear whether they had undergone peer review. After multiple rounds of discussion with both the full group and the coding teams (see Stage 4), we excluded 17 records. We also note our search returned two studies in pre-print. One of those studies (Lengyel et al., 2023) was published during our analysis in August 2023 and was therefore included. The other pre-print had not been published in a peer-reviewed journal at the time of writing and was among the 17 excluded. The final number of studies included in our data corpus for this scoping review was 75 (Supplementary Materials Table 3).

Stage 4: analysis of chosen studies—charting the data

We identified two types of coding: study context and study characteristics. We had also intended to conduct a thematic analysis to characterize the motivations, theoretical framework, and implementation of alternative grading in each study. Unfortunately, we found few studies included sufficient detail to enable such an analysis. As a result, we were limited in our ability to make meaningful thematic conclusions about the corpus as a whole, and thus abandoned this aspect of our analysis.

To characterize the context of the studies included in our data corpus, we coded each study for course delivery, type, audience, discipline, enrollment, name of the alternative grading practice used, and Carnegie classification of the institution where the study took place (Table 3). At this stage of the scoping review process, our coding returned additional disciplines beyond our search terms from Stage 2, including geology, computer science, and mathematics.

Table 3 Study contexts, context codes, and number of studies in each code

Each study was initially coded for context by two independent coders (either NJ, AK, JJN, JMN, WF, KG). Following this first round of coding, a subset of coders (NJ, AK, JJN) met to compare codes and flag disagreements. These disagreements were identified, reviewed, and assigned to a third independent coder for further review; all disagreements were discussed asynchronously via Slack until consensus was reached.

A separate coding team (ELH, TS, AM) coded the characteristics of each study included in our data corpus. Specifically, we focused on describing the variables reported in each study, identifying seven categories or types of variables (Table 4) and the tools used in the research. Performance variables included measurements of perceived student performance, performance in subsequent classes or throughout a program, student grades (e.g., GPA, course grade, exam grades), and course-level grade measures (i.e., D/Failure/Withdraw (DFW) rate and grade distribution). Variables with theoretical foundations included affective constructs such as anxiety, mindset, motivation, and self-efficacy. Attitude variables assessed less specific student beliefs about alternative grading (e.g., whether students liked the alternative grading practice) and included course evaluations. The Learning code was assigned to studies that compared end-of-course performance to initial performance. Retention was also reported in some studies, and these measures were coded as Retention.

Table 4 Study characteristics

There was also a subset of studies that reported information about the instructor experience when implementing alternative grading practices. The Instructor Measures code captured faculty or graduate TA perceptions of their time investment and other experiences with alternative grading.

We (ELH, TS, AM) also coded each study for the instrument or tool used to measure the variables identified in Table 4. Surveys developed by the authors of a study were coded as researcher-generated surveys (RGS), and data collected through a focus group or interview were coded as FG/I. When the survey or instrument used in a study came from previously published work, it was coded as a Validated Tool. Tools for reporting grades included final grades, GPA, and exams. A code of ‘Gradebook’ was assigned when course-level DFW rates and grade distributions were measured. When no instrument was reported in a study, we coded it as ‘None’.

Finally, we coded the findings reported by each study. Given that each study could have multiple variable types, the results for each variable type were coded separately. Regardless of the variables measured or tools used, we were interested in whether the research outcomes supported the efficacy of alternative grading practices on student learning. A result was coded as Positive when there was a statistically significant positive impact of alternative grading on a given variable of interest; a Negative code was used when the statistical analysis indicated a statistically significant negative impact. Trending Positive or Trending Negative codes were used when studies reported a trend in favor of or against alternative grading, but that trend was not statistically significant. Studies that presented the results of a variable as both positive and negative (e.g., some students liked the alternative method and some did not) were coded as Mixed. Studies that reported no trend in either direction were coded as Neutral.

Each study was coded for these characteristics by two independent coders. Following this coding, ELH, TS, and AM met to compare codes. Disagreements were identified and discussed until consensus was reached.

Citation network analysis

As mentioned previously, STEM disciplines are “not a monolith” and disciplinary differences inevitably manifest themselves in instructional practices. These differences appear not only in what grading practices a discipline values, but also in how its members communicate with each other and how they approach “solving” the problem of traditional grading in the classroom. With alternative grading practices being relatively new, we were interested in whether all of the STEM disciplines we sampled were citing a common body of literature on alternative grading practices and whether they were building on work across disciplines. To answer this question, we conducted both direct citation and co-citation analyses of the records in our data corpus. These analyses allowed us to gauge whether disciplines are building on the same foundational knowledge of these methods and whether there are shared practices across disciplines.

The data set for the network analysis started with PDFs from the full data corpus identified in stage 3 of the scoping review. These PDFs were scanned for references using Scholarcy (Gooch, 2021) to create a database of reference papers cited by the corpus papers. Each entry was manually checked and corrected for accuracy and completeness (NJ, JB, LS, JJN, DLJC, LM). A matrix was created where each column referred to a paper in the data corpus, and each row corresponded to a paper cited by a paper in the corpus. A “1” was entered into the corresponding cell if the reference was cited by the paper in a given column, and a “0” was entered otherwise.

This matrix was then converted into a direct citation network file and a co-citation network file using Python. In the network files, each row and column corresponds to a paper in the corpus. In the direct citation network file, a “1” was put into a cell if the row paper cited the column paper. For example, if cell [8, 40] has a “1” in it, that means paper 8 cited paper 40. In the co-citation network file, the value in each cell corresponds to the number of commonly cited papers between the row and column papers. For example, if cell [3, 12] has a “5” in it, that means papers 3 and 12 shared 5 citations. These network files were then imported into Gephi (Bastian et al., 2009) for visualization. Colors and shapes were used to visualize both disciplines and alternative grading practices.
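For readers who wish to reproduce this step, the sketch below shows one way the conversion could be implemented in Python with numpy and networkx. The stand-in matrix, the mapping from reference rows to corpus papers, and the GEXF export are illustrative assumptions rather than our exact script.

```python
import numpy as np
import networkx as nx

# Minimal sketch (not the production script). Assumes `M` is the reference-by-corpus
# matrix described above: rows are cited references, columns are corpus papers, and
# M[r, c] = 1 if corpus paper c cites reference r. `ref_to_corpus` maps the row index
# of a reference to a corpus-paper index when that reference is itself a corpus paper.
# Both are stand-in data here.
rng = np.random.default_rng(0)
n_refs, n_corpus = 200, 75
M = (rng.random((n_refs, n_corpus)) < 0.05).astype(int)
ref_to_corpus = {r: r % n_corpus for r in range(0, n_refs, 10)}

# Direct citation network: edge i -> j if corpus paper i cites corpus paper j.
G_direct = nx.DiGraph()
G_direct.add_nodes_from(range(n_corpus))
for r, j in ref_to_corpus.items():
    for i in np.flatnonzero(M[r]):           # corpus papers that cite reference r
        if int(i) != j:
            G_direct.add_edge(int(i), j)

# Co-citation network: weight(i, j) = number of references cited by both i and j,
# i.e., the off-diagonal entries of M^T M.
co = M.T @ M
G_cocite = nx.Graph()
G_cocite.add_nodes_from(range(n_corpus))
for i in range(n_corpus):
    for j in range(i + 1, n_corpus):
        if co[i, j] > 0:
            G_cocite.add_edge(i, j, weight=int(co[i, j]))

# Export in a format Gephi can read.
nx.write_gexf(G_direct, "direct_citation.gexf")
nx.write_gexf(G_cocite, "co_citation.gexf")
```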

To determine whether disciplines or alternative grading practices were citing only within their own discipline or grading practice, the statistical significance of each discipline and grading-practice community was calculated using the methods presented in He et al. (2021). This method considers the edge weights between items within a discipline or grading-practice community and compares them to the weights between items within the community and items outside the community. It then uses a significance testing approach to determine whether the community is statistically significant (p < 0.01).
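The sketch below illustrates the general logic of such a test using a simple permutation approach: it compares the mean within-community edge weight to the mean weight between community members and the rest of the network, then builds a null distribution from randomly drawn communities of the same size. It is a simplified stand-in for the idea, not the exact procedure of He et al. (2021).

```python
import numpy as np

def community_significance(W, members, n_perm=10000, seed=0):
    """Permutation-style significance check for one community (illustrative only).

    W       : symmetric (n x n) co-citation weight matrix with a zero diagonal
    members : indices of the corpus papers assigned to the community
    Returns the observed within-vs-outside weight gap and a one-sided p-value.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    members = np.asarray(members)

    def weight_gap(idx):
        outside = np.setdiff1d(np.arange(n), idx)
        within = W[np.ix_(idx, idx)]
        across = W[np.ix_(idx, outside)]
        k = len(idx)
        # Mean within-community weight (diagonal is zero) minus mean weight to outside.
        within_mean = within.sum() / (k * (k - 1)) if k > 1 else 0.0
        return within_mean - across.mean()

    observed = weight_gap(members)
    # Null distribution: communities of the same size drawn uniformly at random.
    null = np.array([
        weight_gap(rng.choice(n, size=len(members), replace=False))
        for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value
```

Applied to each discipline or grading-practice community in the co-citation network, a test of this kind flags communities whose internal citation overlap is unlikely to arise by chance.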

Results

Stage 5: reporting results

Descriptive overview

Our scoping review identified 75 studies (Fig. 1), including 44 peer-reviewed journal publications and 31 published conference papers (Table 2). The studies range in publication date from 1970 to 2023, with a majority of studies having publication dates of 2016 and later (Fig. 2).

Fig. 2
figure 2

The publication timeline of studies included in the review. Color indicates the discipline represented in the study. Studies are assigned a shape based on their alternative grading practice, with “Other” being any grading practice with only one study (Cafeteria, Criterion, DIR, Multiplier, Outcome-Based, Portfolio, and Ungrading)

Study context

Studies from Chemistry (n = 21) and Engineering (n = 30) made up 68% of our data corpus. Biology (n = 3), Computer Science (n = 1), Geology (n = 2), Mathematics (n = 9), Physics (n = 7), and Psychology (n = 2) comprised the remaining 32% (Fig. 3). The 21 publications in Chemistry were all published in peer-reviewed journals, while 28 of the 30 publications in Engineering were published as conference papers (Table 5). Engineering comprised almost the entirety of the 31 studies we found in published conference papers, with the remaining 3 coming from Physics (2 conference papers) and Computer Science (1 conference paper).

Fig. 3
figure 3

(A) Ratios of alternative grading strategies identified in the studies in our data corpus, broken down by discipline. Studies are assigned a color based on their alternative grading practice with “Other” being any grading practice that had only one record. (B) Ratios of disciplines identified in the studies in our data corpus, broken down by alternative grading strategy. Studies are assigned a color based on their discipline

Table 5 Publication type broken down by discipline

Standards-based grading was the most commonly identified alternative grading practice (n = 18), followed closely by Mastery grading (n = 16) and Specifications grading (n = 14). Seven grading strategies appeared only once in our corpus: Cafeteria, Criterion, DIR, Multiplier, Outcome-Based, Portfolio, and Ungrading. The Keller Method (n = 5) and the 4.0 scale (n = 2) were the only grading methods that appeared in more than one study yet occurred in only a single discipline (Chemistry and Physics, respectively). Engineering had the largest number of alternative grading practices (n = 6), followed by Chemistry (n = 5) (Fig. 3).

Course enrollment was reported in 67 out of the 75 studies (Fig. 4), which were binned as Small Enrollment (<20 students, n = 7), Medium Enrollment (20–60 students, n = 26), and Large Enrollment (>60 students, n = 34) (Fig. 4). Within each size category there were at least 5 different alternative grading practices used, and no single alternative grading practice emerged as dominant. The most common alternative grading practice reported in large enrollment courses was mastery grading (n = 8 of 36); standards-based grading was most common in medium (n = 6 of 23) and small (n = 3 of 8) enrollment courses.

Fig. 4
figure 4

Ratios of alternative grading strategies identified in the studies in our data corpus, broken down by enrollment size. Records with fewer than 20 students per section were classified as “Small”, those with between 20 and 60 students were classified as “Medium”, and those with greater than 60 students were classified as “Large”. Studies are assigned a color based on their alternative grading practice, with “Other” being any grading practice that had only 1 record

Of studies that reported course delivery mode (n = 64), the majority (n = 58) represented in-person courses, with a few (n = 5) being hybrid delivery; only one was online. Two-thirds of the studies (n = 50) reported the courses as lectures, while only a handful (n = 9) were reported as labs. A majority of studies (n = 55) also reported serving mostly students in a STEM major. Studies were overwhelmingly from the introductory level (n = 60). Additionally, a large proportion of these studies were from doctoral universities with very high (n = 29) or high (n = 10) research activity and larger master’s colleges/universities (n = 10).

Study characteristics

We identified 179 variables measured across the 75 studies in our data corpus, indicating most studies measured multiple variables. The most common variable reported in our corpus was Performance (n = 74 of 179), which was reported via final course grades (n = 25), an exam grade (n = 15), GPA (n = 3), student self-reported performance (n = 4), performance in subsequent courses (n = 2), a validated tool (n = 2), or another classroom artifact measure (e.g. learning objectives; n = 2). The remaining 21 instances of the performance variable represent course-level DFW rates and grade distributions (Fig. 5).

Fig. 5
figure 5

Sankey diagram illustrating outcome measurements (N = 179), tool, and whether the results are in favor of (Positive) or against (Negative) alternative grading. Measurements and tools with 2 or fewer occurrences were grouped as “Other”

Another commonly reported variable was general student Attitudes towards the alternative grading strategy (n = 51 of 179). Student attitudes were largely captured using researcher-generated surveys (RGS; n = 36), but were also captured through Course Evaluations (n = 8), focus groups or interviews (FG/I; n = 3), and in one instance, a Validated Tool (Fig. 5). The remaining studies reporting on attitudes did not specify an instrument or tool.

We were interested in the use of validated tools because such studies could support future research looking to compare the impacts of alternative grading practices across contexts. Validated tools were used to capture 17 variables across 10 studies (Supplementary Materials Table 4). Concept inventories were used to characterize learning gains in two studies (Fig. 6). Those concept inventories were the Force Concept Inventory (FCI) and the Strength of Materials Concept Inventory (SMCI), from physics and engineering, respectively. Standardized tests were also used to capture student performance in two studies: the Dunning-Abeles Physics Test and the American Chemical Society (ACS) exam, each used by a single study. Surveys were typically used to characterize affective constructs, but only Dweck’s Implicit Theory of Intelligence Scale (Dweck 3; Dweck, 2006) was used in more than one study, and even then, it was used in only two studies. An additional nine validated surveys were each used in only one study.

Fig. 6
figure 6

Sankey diagram illustrating the outcome measurement (n = 11), specific validated tool, and whether the results are in favor of (Positive) or against (Negative) alternative grading

The studies included in our data corpus largely reported positive impacts of alternative grading practices on student learning and attitudes. Across the 75 studies and 179 variables, we coded 106 outcomes as Positive and 38 as Trending Positive (Fig. 5). Variables measured through RGS were the single largest contributor to Positive results (n = 35). Results from validated tools were mostly Positive (n = 6) or Neutral (n = 7) (Fig. 6).

There were ten outcomes that were coded as Negative, where alternative grading practices had a statistically significant negative association with learning outcomes and attitudes. These negative results predominantly came from performance measures, specifically course-level DFW rates and grade distributions (n = 4), impacts on GPA (n = 2), and exam scores (n = 1). Negative results also came from measures of skill development through an RGS (n = 1) and student anxiety measured with a validated survey (n = 1).

Citation analysis

To determine whether and how the studies in our data corpus were citing one another, we created a direct citation network of the data corpus (Fig. 7). Each study was given a shape corresponding to the alternative grading method used and a color corresponding to discipline. The network representation shows that many citations occur within a discipline and/or grading practice. Of the 90 direct citations that occurred within the dataset, 48 citations had the same discipline and alternative grading practice, 16 had the same discipline with different grading practices, 17 had the same grading practice with different disciplines, and 13 had neither the same discipline nor grading practice, suggesting that both alternative grading method and discipline play a large role in who is citing whom in the alternative grading literature.

Fig. 7
figure 7

Direct citation network illustrating what records cited one another. Each node in the network represents a study in the data corpus, and the directional edges represent which records have cited each other. Studies are given a shape based on the alternative grading practice and a color based on their discipline

To further extend this citation analysis and explore whether disciplines and alternative grading practices cite a similar body of literature as a whole, we created co-citation networks of the studies. Significance testing of the alternative grading practice communities shows that SBG, mastery, specifications, pass/fail, Keller, and contract grading are statistically significant independent communities that each cite consistent but different bodies of literature (p < 0.01; Fig. 8). Similarly, engineering, chemistry, mathematics, and physics are statistically significant independent communities that each cite consistent but different bodies of literature (p < 0.01; Fig. 8). Other communities have fewer than five members, which may explain why they were not detected as statistically independent: either there is limited statistical power to detect differences in smaller communities, or the small community size forces members to cite across communities.

Fig. 8
figure 8

Co-citation networks illustrating the citations shared between studies in the data corpus and the presence of communities based on alternative grading practice or discipline. Each node in the network represents a study in the data corpus, and the thickness of the edges represents the number of shared citations between two nodes. Studies are colored by (A) alternative grading practice and (B) discipline. Groups with an asterisk (*) represent statistically significant communities (p < 0.01)

Not surprisingly, studies that use the same alternative grading practice are more likely to cite one another. However, both the direct citation and co-citation analyses show that discipline plays an equally important role in describing who is citing whom within the alternative grading literature.

Discussion

Alternative grading practices are increasingly popular in STEM classrooms, yet as our scoping review documents, empirical evidence supporting their efficacy on learning outcomes is currently limited. We further find a fragmented landscape with inconsistent terminology, a dizzying array of variables studied, and limited theoretical underpinnings.

Describing the landscape of research on alternative grading practices

Our initial research question sought to describe what we currently know about the impacts of alternative grading practices on student outcomes across STEM disciplines. Unfortunately, we struggled to answer this question beyond superficial findings, namely that most studies do find a positive effect of alternative grading practices on student learning and attitudes. Two factors impeded our analysis. First, the studies in our data corpus had limited connections to theoretical frameworks (discussed in more detail below) and second, these studies used a wide array of tools to measure learning and attitudes. Together, these factors made it impossible to thematically code the studies and limited our ability to more fully generalize the current state of the research on alternative grading practices in STEM.

Identifying gaps in the research

An important outcome of our scoping review was the identification of gaps in the literature. Guided by the types of gaps described by Miles (2017), the authorship team discussed and identified three types of gaps found in the alternative grading practices research.

Knowledge gap

A knowledge gap is indicated when there is a lack of research or a lack of research with desired measures (Miles, 2017). Given that grades and grading are omnipresent in higher education and have immense influence on students’ undergraduate and professional careers (Feldman, 2018), we argue the limited number of studies identified in this scoping review is evidence of a knowledge gap: we lack necessary empirical research investigating the impacts of alternative forms of grading on learning in undergraduate STEM courses.

While the research available on alternative grading is still limited, we are encouraged by recent momentum (Fig. 2), largely led by publications from individual STEM disciplines (namely chemistry and engineering). The skewed disciplinary representation in the aforementioned knowledge gap raises important considerations. First, disciplines are citing more within themselves than across the broader literature (Fig. 8). While there is momentum, the siloed nature of the research makes it more challenging for these findings to be extended into other disciplines from both a research and a practice perspective. Second, the absence of research from disciplines like biology or physics may not necessarily reflect inactivity or disinterest, but rather a difference in the venues through which research is shared. In our corpus, we see research in engineering largely coming from published conference papers (Table 5). Not all DBER communities have formal venues that routinely publish conference papers (as is the case with the American Society for Engineering Education (ASEE)). As such, there may be more research on alternative grading practices happening in other STEM disciplines, but publication practices limit its indexing by databases and search engines. As an example, the American Association of Physics Teachers (AAPT) hosts one of the largest gatherings of physics educators and physics education researchers; however, its conference abstracts are not indexed, nor are papers published as part of conference proceedings. Research on alternative grading practices presented at these conferences that is not subsequently published in a peer-reviewed journal is unlikely to contribute to a broader or interdisciplinary conversation on alternative grading practices in STEM.

Our scoping review identified eight STEM disciplines engaged in research on alternative grading practices; however, the disciplines seem to be largely unaware of each other. The co-citation analysis finds that Chemistry, Engineering, Mathematics, and Physics were each statistically significant independent communities; in other words, studies within each of those disciplines build on a body of work that is different and independent from other disciplines. For example, Engineering papers cite the same body of work as other Engineering papers, but cite different papers from Chemistry, Mathematics, and Physics. The direct citation analysis supports this siloing of disciplines as almost no studies in our corpus cite other corpus studies outside of their respective discipline. Additionally, many studies in our corpus (n = 22) do not cite any other study in the corpus (Fig. 7). This finding may be expected in a relatively young field (half of the studies in our corpus were published in 2016 or beyond), but it also presents an opportunity for alternative grading researchers and practitioners to become familiar with relevant work outside of their disciplinary expertise.

Methodological gap

A methodological gap is present when there is little variation in methodological approaches (Miles, 2017). Our scoping review revealed two major types of methodological gaps, one related to research practices and the second related to implementation of alternative grading practices.

We found evidence to suggest a methodological gap with respect to research practices as most studies in our corpus used tools that have not undergone validation efforts (Fig. 5). While nearly 90% of the outcomes captured by studies that used RGS found positive impacts of alternative grading practices (i.e., Positive or Trending Positive), these studies often provided little description or rationale for the survey design and none included validation efforts. While these surveys may be accurately capturing the researchers’ variables of interest (e.g., attitudes, perceptions, learning gains, affective constructs, etc.), the findings obtained through these measures are limited in their generalizability and not necessarily replicable, which in turn makes it challenging for researchers and practitioners to make comparisons across populations or time.

Just over 13% of the 75 studies used validated tools to measure affective constructs (e.g., Dweck’s implicit theory of Intelligence Scale) and conceptual learning (e.g., Force Concept Inventory; Fig. 6). In these studies, far fewer (47%) found a positive impact of alternative grading. By using validated measures, these studies enable comparisons across populations and time, support generalizability, and build a more robust understanding of how alternative grading strategies impact undergraduate students in STEM courses.

Further, there were few studies measuring content or conceptual learning; most focused on affective constructs or final grades. There are several issues with this. First, we need empirical support that alternative grading positively impacts learning of content and skills, because this is ultimately what most college instructors care about. Second, the dizzying array of affective measures dilutes the findings, and we simply cannot synthesize across studies. This dilution highlights the need for theory-driven methodology: because these studies are not grounded in theories about learning and teaching, they cannot help us understand the mechanisms by which grading practices may impact the learner.

The second methodological gap we found stemmed from a lack of universality in the definitions of alternative grading practices used by the studies in our data corpus. We had initially intended to characterize the grading systems used in each study using the definitions provided by Clark (2023); however, many records did not describe their grading practices thoroughly enough for us to independently characterize the method. Thus, we relied on the name of the method used by the authors of each study. So, while standards-based grading was the most common grading system in our corpus (Fig. 3), there is likely variation in its implementation. This lack of detail also made it challenging to identify themes across grading strategies in a meaningful way, and we were not able to discern whether seemingly discipline-specific grading methods (e.g., the Keller Method) shared attributes with more broadly used methods (e.g., SBG; Fig. 7). In addition, discipline-specific grading methods may contribute to the similarity in citation patterns seen between grading methods (Fig. 8), furthering the idea that defining and implementing alternative grading practices is currently a discipline-specific endeavor. This ultimately precluded our ability to make conclusions about the outcomes or efficacy of any specific alternative grading practice.

Theoretical gap

A theoretical gap is indicated when there is a lack of theory underlying the research in a given area (Miles, 2017). As noted in our methods, we initially intended to capture the underlying theory motivating the research described in the studies included in our data corpus; however, we found these theories were often not sufficiently named or described, which impeded any thematic analysis.

Theoretical frameworks influence all aspects of research, from the questions asked and the methods used to collect data to the interpretation and discussion of the results (Luft et al., 2022). By naming the theoretical framework used in a given study, researchers (both the original authors and other scholars) can better situate the findings and claims presented within the existing body of research. Articulating the theoretical framework gives greater meaning to the choices made by researchers and reveals the lens researchers applied in their attempt to understand a given phenomenon. Currently, only a fraction of the research exploring the impacts of alternative grading on student outcomes explicitly draws on theories of learning or other relevant theoretical frameworks. The effects of this scant, disjointed theoretical footing can be seen in the many disparate variables and tools observed in the current body of research (Fig. 5). With no theoretical foundations in place, researchers working to understand the impact of grading practices on student outcomes are left to place stock in variables and tools that may ultimately be ill-suited for their intended research aims, making it all the more challenging to develop a robust understanding of this complex phenomenon.

Limitations

Our search for empirical research into alternative grading was limited by the disciplines represented by our interdisciplinary team. We have many STEM experts, but not across all STEM disciplines. As a result, we opted to limit our search to those disciplines where we had one or more members with expertise. Through our search process, we did identify and ultimately include several studies that fell outside of our collective expertise (i.e., computer science, geology, and mathematics).

Additionally, while our list of alternative grading practices was exhaustive to the best of our knowledge, given the lack of consensus on names and definitions of the many alternative grading practices, it is likely our search missed studies that did not explicitly name their alternative grading practice or that used a practice not on our list.

Our search of databases and subsequent study analysis revealed that the alternative grading conversation is not restricted to journal articles. The presence of peer-reviewed and published articles from conference proceedings in engineering leads us to believe there are other venues (i.e., conferences) that include work about alternative grading that were not identified through our database search and thus not represented in our data corpus. Therefore, we believe the research into alternative grading practices is broader than we can currently characterize and report.

In addition, the communication channels about alternative grading are not limited to peer-reviewed journals and conference proceedings but include blog posts, books, and social media conversations. While these dissemination platforms were not included in our search criteria, they contribute substantially to the broader conversation around alternative grading in higher education. These venues typically advocate for the adoption of alternative grading practices and are often based on anecdotal or limited empirical evidence. While these less formal dissemination pathways may not contribute to the empirical findings on alternative grading, their role in rapid communication is an important consideration for the landscape as a whole and warrants further exploration.

Implications

Alternative grading is rapidly increasing in popularity in STEM classrooms, resulting in calls for empirical evidence of its efficacy. Our scoping review provides an initial map of the research landscape (Fig. 2) and identifies areas of research needs. First, research in alternative grading needs to be grounded in theoretical frameworks, enabling us to develop informed hypotheses about how and when alternative grading practices should impact learning and other affective constructs. Such grounding will subsequently impact the variables we measure, allowing us to develop a more robust and unifying understanding of how alternative grading practices impact student learning.

Second, it is critical that authors fully describe their implementation of alternative grading practices, including defining their terms using common language. While term definition may seem a trivial task, this lack of consensus is currently hindering our ability to uncover patterns across courses, disciplines, or institutions. Robust descriptions of implementation practices will enable us to develop clearer definitions of alternative grading practices, resulting in better research.

Third, using validated tools rather than researcher-generated instruments will support richer comparisons within and across contexts such as grading systems, disciplines, class sizes, etc. Validated tools also enable us to ask deeper questions, such as whether a particular grading system is better at developing learners’ self-regulation skills or if alternative grading practices create more equitable learning environments. Researcher-generated surveys and course evaluations help us gain insights into an individual course but when placed in the overall landscape of research, do not fully enable meaningful comparisons.

Fourth, the STEM disciplines represented in our corpus are citing different literature (Figs. 7 and 8), which may contribute to the lack of a unifying theory or universal definitions of alternative grading strategies. This disciplinary siloing may also lead to many instances of “reinventing the wheel”, where each discipline does not avail itself of the lessons learned by other disciplines. By extension, students, who are often enrolled in courses across STEM disciplines, may face confusion and shifting expectations. Interdisciplinary efforts are critical to capturing the entire landscape of a research area in STEM education and will be important in building a more in-depth understanding of the costs and benefits of alternative grading practices moving forward.

Conclusions

Our scoping review does not allow us to make comparisons across the studies in our corpus. However, the high proportion of positive results is promising and warrants further investigation. First, future research should explicitly connect to theoretical frameworks to explore how alternative grading practices impact students’ learning of skills and content. Indeed, a primary goal of education is to support students’ learning, so evaluating the extent to which alternative grading practices produce tangible and positive effects on memory and comprehension of material is critical. Research investigating the efficacy of these practices should aim to involve a variety of experimental techniques and draw from various cognitive science frameworks (Creswell & Plano Clark, 2018; Spivey, 2023) to increase generalizability and applicability across disciplines. Second, alternative grading practices might have an impact on students’ development of other skills, such as self-regulated learning skills. For example, many alternative grading approaches involve components that provide students with autonomy in their learning experiences and numerous opportunities to demonstrate competence, two key factors in self-determination theory. Self-determination theory (see Deci & Ryan, 1985, 2008) is a macro-theory of motivation that takes into account an individual’s psychological needs and the factors that impact an individual’s growth and development, with research showing that incorporating activities that support autonomy and competence has a positive effect on motivation and learning in educational environments (for a review, see Niemiec & Ryan, 2009). Content learning and the development of self-regulated learning skills are both areas in which validated tools can be employed, allowing the broader community to draw comparisons between traditional and alternative grading practices. Finally, though preliminary, outcomes from this review indicate increasing use of alternative grading across STEM disciplines, which suggests a need to support effective implementation of these practices. Thus, another fruitful avenue for future research could be to scaffold faculty’s transformation of their grading practices by guiding them through the five stages of Rogers’ Diffusion of Innovations framework: knowledge, persuasion, decision, implementation, and confirmation (see Reinholz et al., 2021; also see Rogers, 1995, 2004).

More generally, this review underscores the need for further interdisciplinary research efforts in STEM, echoing calls like the ones from Henderson and colleagues (Henderson et al., 2017). The structure and composition of the NDSU Journal Club facilitates our ability to conduct cross-disciplinary research on teaching and learning practices in STEM, with findings from the current review highlighting the value, strength, and richness that can come from such collaborations. The interdisciplinary approach employed in this scoping review illustrates how future investigations into alternative grading practices in STEM would be strengthened by increased interdisciplinary communication and collaboration.