Introduction

Digital games have emerged as an effective medium to improve student engagement and learning in some learning domains (Clark et al., 2016; Mayer, 2019; Scoresby & Shelton, 2011). Although many empirical studies have reported that learning games are effective overall, it has been noted that relatively few studies have taken a rigorous empirical approach to understanding why these games are effective (see discussion in Richey et al., 2021). In other words, what behavioral or cognitive changes do digital learning games promote when compared to non-game instruction, and how do these changes relate to learning outcomes? With the increase in efforts to develop learning games for varied content areas and student populations (e.g., Math—Lomas et al., 2013; Khan et al., 2017a, 2017b; McLaren et al., 2017a; Riconscente, 2013; Science—Cheng et al., 2015, 2017; Harpstead et al., 2013; Lester et al., 2014; Shute et al., 2015, 2021; Computational Thinking—Hooshyar et al., 2021; Rowe et al., 2021; Tahir et al., 2020; Policy argumentation—Easterday et al., 2017; Reading—Jacovina et al., 2016), there is a need to identify the features that make digital games effective for learning and understand why these features are beneficial. Accordingly, a few recent reviews have speculated about which features and designs are most likely to lead to games being effective (e.g., Clark et al., 2016; Mayer, 2019; Wouters & van Oostendorp, 2017). However, further empirical research is needed to deeply understand the mechanisms through which digital learning games promote learning. Developing this understanding can ultimately guide the efforts of game designers in building better learning games, and teachers in selecting when and how to use them. In this paper, we focus on gaming the system—a measure of behavioral disengagement—as a mechanism that may explain differences in learning outcomes between games and other learning activities and between different subgroups of students.

The lack of a comprehensive explanation of how, when, and why games are effective also poses challenges to achieving equitable student outcomes, as design choices are often not informed by cognitive theory or clear empirical evidence regarding the psychological and behavioral effects of those choices. Compounding this issue, many studies have not considered whether digital learning games work in the same ways (and with comparable effectiveness) for different sub-groups of overall student populations. Though some studies have looked at whether games work equally well for different groups (e.g., female vs. male students—Papastergiou, 2009; Chung & Chang, 2017; McLaren et al., 2017b; Tsai, 2017; students of different races—Shin et al., 2012; Kao & Harrell, 2015), this remains a small proportion of the studies on learning in games. Furthermore, as noted by Dele-Ajayi et al. (2018), only a small number of the studies that do check for differences in learning or engagement in terms of student group membership continue on to explicitly investigate why and how these differences are seen. It is difficult to change a pattern of lower success for some groups of students, and to design to promote success for all learners, without understanding who is currently less supported, and how games are less successful for those learners. Thus, efforts to understand how digital games work (and how to better design them) must more explicitly investigate not only who benefits from these games, but specifically how these differences manifest—what cognitive and behavioral processes accompany the greater and lesser effectiveness of specific games for specific groups of students? Understanding this can subsequently guide the efforts of game designers in building more effective and more equitable learning games.

To contribute towards answering this question, we investigate previously documented gender differences in the effectiveness of a digital learning game by exploring how female and male students interact with the game. Within different game activities, we compare students’ propensity to game the system, a behavioral measurement of disengagement in which a student misuses a learning system’s properties to complete the learning activity, as opposed to engaging and learning with the material (Baker et al., 2004b). Gaming the system has been found to be associated with differences in learning outcomes in a variety of studies (Cocea et al., 2009; Fancsali, 2014; Pardos et al., 2014), with differences in student emotional experiences (Baker et al., 2010a, 2010b), and with long-term academic and professional outcomes (Almeda & Baker, 2020; San Pedro et al., 2013). One prior study found that different levels of gaming the system explained differences in learning outcomes between a game and non-game control (Richey et al., 2021), with lower levels of gaming the system (i.e., greater behavioral engagement) and better learning outcomes seen in the game condition. This suggests that gaming the system may be a particularly useful behavioral measure for understanding how games affect the engagement of different populations of students differently. This behavior, specific to some types of learning activities, does not represent the full spectrum of disengaged behavior seen across learning activities, but serves as a clear and impactful indicator of disengagement in the games and other learning contexts and activities where it manifests. In this paper, we aim to extend this prior research by examining whether gender differences in learning outcomes might reflect gender differences in engagement, as measured by gaming the system, while students play a digital learning game. Specifically, we compare the difference in gaming the system within different activities between female and male students, and study whether gaming the system in specific activities plays a mediating role in the differences in learning outcomes between female and male students. This represents a step towards understanding the full range of mediating variables that explain these differences. Better understanding how games affect learners’ behaviors and affective experiences—and how female and male students respond differently to the same game activity—is a critical step to inform successful game design that better promotes productive learning processes and outcomes for all students.

Background

Digital learning games

Digital learning games are increasingly used in education and there is increasing evidence that they are effective at promoting successful outcomes in mathematics and science (see reviews in Clark et al., 2016; Mayer, 2019). In math learning, for example, Riconscente (2013) studied the use of a tablet game for fractions and found a significant increase (10–15% on average) in students’ learning, self-efficacy, and math interest compared to the students in the regular mathematics instruction condition. This suggests that math games may be especially beneficial for student groups with lower levels of self-efficacy (e.g., girls; Louis & Mistele, 2012). In another controlled experiment, Siew et al. (2016) reported a significant increase in algebraic thinking in students playing an Android learning game compared to a conventional approach to teaching algebra based on imitation and repetition. Even beyond a general focus on domain content, a wide range of digital learning games have been developed focusing on varied skills and competencies such as creativity (Jackson, 2012), civic engagement (Easterday et al., 2017; Ferguson & Garza, 2011), and visual-spatial abilities and attention (Shute et al., 2015).

One of the key reasons for the uptake of games in education is the potential for them to be more fun and engaging than traditional learning activities. Digital learning games also have been reported to promote motivation in students, with several meta-analyses finding medium, positive effect sizes for digital learning games compared to more traditional instruction (Sitzmann, 2011; Vogel et al., 2006). A meta-analysis conducted by Vogel et al. (2006) compared games and simulated environments to traditional teaching methods and reported significantly better attitudes in students learning from games. A later meta-analysis (Sitzmann, 2011) also found a substantial overall increase in students’ self-efficacy when learning with digital games, but the authors also discovered a concerning potential publication bias in this research by identifying sixteen unpublished results. A key limitation within the literature, identified by these meta-analyses, is that many of the identified studies compared digital learning games to conventional instruction rather than other learning technologies, making it difficult to conclude whether the benefits of games come specifically from their game features or more general aspects of technology-supported instruction. It is also likely that the affective benefits of games vary based on game design. In fact, one research synthesis on affect and engagement in technology-supported instruction found that games and other types of learning technology each had examples of very positive and very poor affect and engagement (Rodrigo & Baker, 2011).

There is a wide variation in the design of games, with game features drawn from and inspired by a range of sources, including theories of learning and motivation (Howard-Jones & Demetriou, 2009; Shute et al., 2014). The in-depth study of the features of games for learning dates back more than four decades (i.e., Malone, 1981), with researchers developing taxonomies of game features and using these taxonomies to study how different game features impact students’ motivation to play (King et al., 2010; Malone, 1981) and their learning outcomes (Bedwell et al., 2012). Recently, meta-analyses have offered insight into which features of games are beneficial for learning and engagement (Clark et al., 2016; Ke, 2016; Mayer, 2019; Wouters & van Oostendorp, 2017). In one such meta-analysis, intentional learning supports such as explicit training or instruction, cues and feedback, in-game learning tools, and prompts for self-explanation or reflection were found to improve students’ learning (Wouters & van Oostendorp, 2017). Another meta-analysis reported a higher success rate in games with more complex game mechanics, a wider variety of potential game actions, and lower degrees of contextualization (Clark et al., 2016).

Motivational theories offer a number of potential explanations for learning benefits from digital games. For example, the four-phase model of interest development suggests that attention-grabbing learning contexts such as digital learning games can trigger situational interest, which in turn may lead to more developed phases of interest if the learner returns to the content over time (Hidi & Renninger, 2006). Heightened situational interest has been associated with greater engagement with the learning content, more connections between new content and prior knowledge, and better learning outcomes (Schraw & Lehman, 2001), providing one potential pathway for digital learning games to produce better learning outcomes compared to non-game learning systems. However, more research is needed to test the motivational pathways through which digital learning games might impact learning outcomes.

Thus far, the majority of research on digital learning games has asked the question of whether they are better for learning and engagement than non-game instruction. While evidence for the overall effectiveness of digital learning games is important to assess the claims for their educational benefits, it is also important to understand the mechanisms that make games effective for some students (Richey et al., 2021). Motivation and engagement are frequently identified as likely mechanisms for explaining the learning benefits of games, but relatively few studies have tested motivation or engagement as a mediating pathway to learning outcomes attained from digital learning games. Specifically, if digital learning games increase learning by increasing engagement, we should also see behavioral changes in how students interact with the game that reflect those effects. In this paper, we aim to do this by examining gaming the system, a measure of behavioral disengagement based on students’ interactions with the game compared to their interactions in a non-game digital control. By comparing rates of gaming the system in the game and a non-game control, we can see whether gender differences in engagement appear only in digital learning games or across both game and non-game digital learning platforms.

Gaming the system

A range of behaviors occur during gameplay and during digital learning in general, with differing impacts on student outcomes. One form of behavior that emerges in a variety of learning systems is gaming the system, defined as “attempting to succeed in an educational task by systematically taking advantage of properties and regularities in the system used to complete that task, rather than by thinking through the material” (Baker et al., 2006a, 2006b). Often construed as a form of behavioral disengagement (e.g., DeFalco et al., 2014), gaming the system is associated with both cognitive and affective processes (Baker et al., 2010a, 2010b) and has been found to have substantial negative relationships with student outcomes (Baker et al., 2004b; Fancsali, 2014; Mogessie et al., 2020; Pardos et al., 2014), even several years after the gaming behavior occurs (Almeda & Baker, 2020; San Pedro et al., 2013). The first automated detector that could recognize this behavior directly from student log data was developed in 2004 for the Cognitive Tutor (Baker et al., 2004a). Since then, detectors of gaming the system have been developed for a variety of other learning systems (ASSISTments: Pardos et al., 2014; Paquette & Baker, 2019; Newton’s Playground: Wang et al., 2013; Decimal Point: Mogessie et al., 2020).

Students game the system in several different fashions; some of the most frequently reported gaming behaviors include help abuse (i.e., repeatedly and quickly asking for hints or help until the learning system provides the answer) and systematic guessing (such as trying every given value in a problem statement, trying every plausible answer, or counting). These behaviors are more strongly associated with low student learning than off-task behaviors (e.g., talking to a neighbor, surfing the web) that do not lead to systematic misuse of the learning system’s features (Cocea et al., 2009; Pardos et al., 2014). Gaming the system has also been reported to be associated with experiencing frustration (Walonoski & Heffernan, 2006a) or boredom (Baker et al., 2010a, 2010b).
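To make the two behaviors described above concrete, the following minimal sketch flags runs of rapid hint requests (help abuse) and rapid incorrect attempts (systematic guessing) in a sequence of logged actions. It is an illustration only and is not the detector used in any of the cited studies: the `Action` record, the event names, and the thresholds `FAST_SECONDS` and `MIN_RUN` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Action:
    kind: str       # "hint" or "attempt" (hypothetical event types)
    correct: bool   # ignored for hint requests
    seconds: float  # time elapsed since the previous action


# Hypothetical thresholds, chosen only for illustration.
FAST_SECONDS = 3.0  # an action faster than this counts as "quick"
MIN_RUN = 3         # this many quick actions in a row raises a flag


def label(action: Action) -> Optional[str]:
    """Classify a single action as a possible gaming step, or None."""
    if action.seconds >= FAST_SECONDS:
        return None
    if action.kind == "hint":
        return "help_abuse"
    if action.kind == "attempt" and not action.correct:
        return "systematic_guessing"
    return None


def flag_gaming(actions: List[Action]) -> List[str]:
    """Return one flag per uninterrupted run of MIN_RUN quick actions."""
    flags: List[str] = []
    run_kind: Optional[str] = None
    run_len = 0
    for action in actions:
        kind = label(action)
        if kind is not None and kind == run_kind:
            run_len += 1
        else:
            run_kind = kind
            run_len = 1 if kind is not None else 0
        # Flag once, at the moment the run reaches the threshold length.
        if kind is not None and run_len == MIN_RUN:
            flags.append(kind)
    return flags


hint_run = [Action("hint", False, 1.0), Action("hint", False, 1.2),
            Action("hint", False, 0.8), Action("attempt", True, 10.0)]
guess_run = [Action("attempt", False, 1.5)] * 3
slow_errors = [Action("attempt", False, 20.0)] * 3
```

Under these assumed thresholds, `hint_run` is flagged as help abuse and `guess_run` as systematic guessing, while `slow_errors` is not flagged, since slow incorrect attempts are more plausibly genuine errors than misuse of the system.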

Several studies have shown that students’ propensity to game the system is influenced considerably by the design features of the learning system they are using. Baker et al. (2009) studied the design features of Cognitive Tutors (Koedinger & Aleven, 2016), a type of intelligent tutoring system, to identify which aspects of design correlated with the different frequencies of gaming behaviors observed in different lessons in the system. They found that lessons that involved concrete materials but with limited engagement-increasing text were gamed more often, and that activities that lacked clarity in the activity or material also tended to be gamed more often. Slater et al. (2016) studied the text and linguistic features of mathematics problems in ASSISTments and found that several such features were associated with differences in the frequency of gaming the system. Specifically, they found that the use of complex grammar and the heavy use of pronouns were associated with more gaming.

Attempts to improve system design by incorporating interventions to prevent gaming have seen partial success, although at the cost of increased complexity in student interaction. Baker et al. (2006a, 2006b) experimented with the use of supplementary exercises with the content on which students gamed and found improvements in their learning. Visualizations of student gaming behavior and meta-cognitive messages about gaming have also led to lower frequency of this behavior (Arroyo et al., 2007; Roll et al., 2007; Walonoski & Heffernan, 2006b; Xia et al., 2020). Personalizing learning content to a student’s personal interests has also been shown to reduce the frequency of gaming (Walkington & Maull, 2011). However, simply making it more difficult to game the system leads to students finding new ways to game the system (Murray & VanLehn, 2005).

Adopting a game-based design might be another way to reduce disengagement, including gaming the system. Though few studies have examined whether digital learning games reduce gaming the system compared to equivalent non-game digital learning systems, this question has been studied in Decimal Point, the digital learning game that is the focus of this study. Though gaming the system is associated with worse outcomes in Decimal Point (Mogessie et al., 2020), Richey et al. (2021) reported significantly lower levels (around half as much) of gaming the system behavior in Decimal Point compared to a more traditional computer-based instructional system covering identical content (non-game). Furthermore, a mediation analysis showed that the better learning seen in students playing the digital game was fully mediated by their lower frequency of gaming the system behavior.

Some studies have suggested that gaming the system is not closely associated with demographic factors, but these studies have only examined a small number of demographic variables. Baker and Gowda (2010) found that the prevalence of gaming the system did not vary based on whether students lived in urban, suburban, or rural areas. In addition, Paquette and Baker (2017) did not find strong evidence that the frequency of gaming the system varied based on urbanicity, race/ethnicity, math and reading proficiency, or economic status. They found that the differences were associated more strongly with learning environments than with student populations. However, research has not yet investigated differences between female and male students in the propensity to game the system. Given the influence of design features on the choice to game the system, and Decimal Point’s overall effect on the prevalence of gaming, there is some reason to anticipate gender differences in gaming the system within Decimal Point. In fact, Decimal Point has led to consistently better learning for female students than male students (Hou et al., 2022; McLaren et al., 2017b, 2022b; Nguyen et al., 2022). One possible hypothesis is that this may be because female students are more engaged—and thus may game the system less often—than male students when playing Decimal Point. Below, we discuss evidence of gender differences in gameplay experiences and outcomes from digital learning games. In the current paper, we obtained multiple data sets from the Decimal Point team, representing data from three studies testing different iterations of Decimal Point, and tested this hypothesis across those datasets. We also examined the hypothesis across different components of the game, in particular problem-solving activities and self-explanation activities.

Gender differences in learning games

Research on gender differences in digital learning game outcomes has shown mixed results, with an overall pattern suggesting female students benefit more. Female students have been shown to enjoy learning games more (Adamo-Villani et al., 2008; Chung & Chang, 2017), to be more likely to find a learning game worth playing (Joiner et al., 2011), and to achieve better learning outcomes (Khan et al., 2017a; Klisch et al., 2012; McLaren et al., 2017b; Nguyen et al., 2022; Tsai, 2017). However, other studies report no gender differences in outcomes (Chang et al., 2014; Clark et al., 2011; Dorji et al., 2015; Manero et al., 2016; Papastergiou, 2009).

The differing effectiveness of learning games for male and female students is sometimes attributed solely to broad differences between genders, such as differences in decision-making processes and the degree of emphasis placed on interpersonal goals versus task-orientation (Koivisto & Hamari, 2014). More generally, prior research has identified gender-based differences in motivation and cognitive strategies (Wolters & Pintrich, 1998). If differences in game behaviors reflect general gender differences, then we would expect to see the same patterns of gender differences in a non-game control; on the other hand, if differences emerge specifically because of the unique features of digital learning games, then we would not expect to see the same differences in a non-game control.

Games may affect emotions and confidence differently across genders, particularly in domains like mathematics in which anxiety and stereotype threat can disproportionately affect female students by reminding them of negative stereotypes and thus consuming available working memory with distracting thoughts (Doyle & Voyer, 2016; Spencer et al., 1999). In this case, a game context might reduce the saliency of math cues and thus free up more working memory space for female students to focus on learning and practicing the academic content of the game. According to this hypothesis, games would not necessarily produce gender differences in engagement or interest in games, and they would tend to benefit female students only in domains in which they experience stereotype threat. Prior work has also found that female students sometimes report lower self-efficacy in certain academic contexts, such as mathematics (Louis & Mistele, 2012). Given that games have been found to increase self-efficacy in math, this might also provide a pathway for games to benefit female students in particular (Riconscente, 2013; Sitzmann, 2011).

Other accounts have suggested that learning differences for female and male students are caused by gender differences in the motivational appeal of learning games and how games are perceived (Ferguson & Olson, 2013; Huang, 2013; Osunde et al., 2018). If groups of students learn less from games because they find them less interesting or engaging, then we also would expect to see differences in how those students play the games, with those behavioral differences mediating the relations between individual characteristics and learning outcomes. However, few studies have tried to analyze learning game behaviors to test whether differences in male and female students’ interactions with the game explain the differences in learning outcomes.

Although digital games are popular among both females and males (Hamari & Keronen, 2017), there are significant gender differences in preferences for digital game features such as avatar characteristics, social interaction, and game speed and style (Aleksić & Ivanović, 2017; Chou & Tsai, 2007; Greenberg et al., 2010; Romrell, 2014). Similarly, there are also gender differences in preferences for learning games. For instance, female students tend to be more collaborative in games, while male students are more competitive (Dele-Ajayi et al., 2018; Garber et al., 2017). Female students also tend to prefer playing competitive games alone while male students prefer to play in the company of other male students (Jenson & de Castell, 2005). A recent game preferences survey of 333 middle school students found that girls reported more interest in the casual, music and party, and cooperative genres of games, while boys tended to prefer action, sports and racing, and battle-oriented game genres (Nguyen et al., 2023). Other studies have reported that scores and rewards are more appealing to and valued by male students (Hartmann & Klimmt, 2006; Raney et al., 2006). Furthermore, rewarding speedy play has a more negative impact on female students than male students (Heeter & Winn, 2008). Given the gender-based differences in game preferences and learning outcomes, there have been discussions around adapting learning game design based on the students’ gender to support them better, based on evidence that this can be useful within digital (non-learning) games more broadly (Boyle & Connolly, 2009; Kinzie & Joseph, 2008; Law, 2010; Steiner et al., 2009). The challenge, however, is that although a lot is known about male and female students’ preferences in games, considerably less is known about how these preferences translate to differences in gameplay behaviors. Do female and male students engage in the same behaviors? Do they behave differently in the presence of specific gameplay features? We investigate these questions within the current study, with the goal of better informing game design to promote learning for all students.

The digital learning game Decimal Point

Decimal Point is a single-player, computer-based game designed for 5th through 7th grade students to learn about decimal numbers, operations, and concepts (McLaren et al., 2017a). The game runs on the Internet, within any standard browser, and was originally developed using Flash and later ported to HTML/JavaScript. The Cognitive Tutor Authoring Tools (or CTAT—Aleven et al., 2016) were used to develop the game and to assess and log student actions. The materials are deployed on the web-based learning management system TutorShop (Aleven et al., 2009), which logs all student actions, such as correct and incorrect steps and hint requests.

The game is set in the thematic context of an amusement park and is composed of a series of 24 mini-games. The mini-games are presented to students in a pre-defined order—at least in the base version of the game and most versions that have been studied—with each mini-game containing two problems for students to solve. In Studies 2 and 3, students were given agency to pick mini-games to play in any order, as explained below. Forty-eight problems were implemented for the game in total. Five types of problems are available in the mini-games, which include (1) ordering decimals; (2) number line placement; (3) completing decimal sequences; (4) sorting decimals into less-than and greater-than “buckets”; and (5) adding decimals. The subject matter and specific content of each problem type was selected because decimal number misconceptions are particularly robust, persisting through middle school and sometimes even into adulthood (Putt, 1995). Each problem type focuses on providing practice opportunities for a specific decimal number operation or concept aligned with specific, well-documented decimal number misconceptions (Isotani et al., 2010). The problems were designed in consultation with a mathematics education expert to specifically target decimal number misconceptions that have been well documented in the math education literature, such as the misconception that longer decimal numbers are larger in magnitude (e.g., 0.234 > 0.9—Irwin, 2001; Isotani et al., 2010; Stacey et al., 2001).
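The "longer decimals are larger" misconception can be illustrated with a short sketch that contrasts one plausible model of the faulty reasoning (reading the digits after the decimal point as a whole number) with a correct magnitude comparison. The function names and the particular way the misconception is modeled are assumptions made for this illustration; they are not taken from the game or from the cited literature.

```python
from decimal import Decimal


def longer_is_larger(a: str, b: str) -> bool:
    """Model the misconception: compare the digits after the decimal
    point as if they were whole numbers (assumes both inputs contain
    a decimal point and share the same integer part)."""
    frac_a = int(a.split(".")[1])
    frac_b = int(b.split(".")[1])
    return frac_a > frac_b


def is_larger(a: str, b: str) -> bool:
    """Correct comparison of decimal magnitudes."""
    return Decimal(a) > Decimal(b)


def sort_smallest_to_largest(values):
    """Correct answer to an ordering-decimals problem."""
    return sorted(values, key=Decimal)


# Under the misconception 0.234 looks larger than 0.9 (234 > 9),
# although 0.9 is in fact the larger number.
misconceived = longer_is_larger("0.234", "0.9")
correct = is_larger("0.234", "0.9")
ordered = sort_smallest_to_largest(["0.9", "0.234", "0.05"])
```

Here `misconceived` and `correct` disagree for the pair 0.234 and 0.9, which is the kind of item that distinguishes a student holding the misconception from one who compares magnitudes correctly.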

Every problem is composed of two steps (i.e., problem-solving and self-explanation). Problem-solving and self-explanation activities are distinct but connected in the game, and each can be expected to play specific roles in the learning process (Richey & Nokes-Malach, 2015). Problem-solving practice consists of executing correct procedures to solve various decimal number problems and is essential for skill acquisition, with repeated practice leading to reduced time and greater accuracy on tasks (Singley & Anderson, 1989). Self-explanation occurs when the learner is prompted to explain what they are learning to themselves, which can involve making inferences about why something is right or wrong, developing justifications, or identifying their own lack of understanding or misunderstanding (Chi & Wylie, 2014). Self-explanation helps learners revise errors in prior knowledge, fill in gaps in their understanding, and connect fragmented knowledge, all of which support more robust learning and transfer (see review in Chi & Wylie, 2014; also see McLaren et al., 2022a; McNamara, 2017; Nokes et al., 2011; Richey & Nokes-Malach, 2015; Rittle-Johnson et al., 2017). Particularly when paired with problem-solving practice, self-explanation can make knowledge of problem-solving procedures more flexible by helping learners connect problem-solving steps with principles and application conditions (Ainsworth & Burcham, 2007; Aleven et al., 2003). Self-explanation activities are more common in non-game digital learning technology than in games (Bisra et al., 2018; McLaren et al., 2008; Renkl & Atkinson, 2002; Roy & Chi, 2005), but some evidence suggests that incorporating self-explanation in games may yield similar benefits. In particular, Johnson and Mayer (2010) found that self-explanation led to better learning outcomes from a digital learning game, but only when self-explanation took a multiple-choice format where students were asked to select the correct explanation. They argued that more open-ended forms of self-explanation, in which learners are prompted to type explanations, do not support learning in games because they disrupt the game flow. Accordingly, Decimal Point incorporates self-explanation prompts in a multiple-choice format.

When students encounter a problem in Decimal Point, they start in the problem-solving step and are prompted to solve the problem through game play. After solving the problem, students then move on to the self-explanation step, reflecting on how they derived the answer by selecting from a multiple-choice list of possible explanations. For example, in an ordering decimals problem, students are asked to “hit the gophers from the smallest to the largest” in the problem-solving step (see the left side of Fig. 1). Once they finish solving the problem, they are presented with the self-explanation question (see the right side of Fig. 1). This self-explanation step is designed based on an extensive literature showing that self-explanation promotes deeper learning, and that multiple-choice self-explanation prompts are most effective in game contexts because they are less disruptive to game flow (Bichler et al., 2022; Chi & Wylie, 2014; Johnson & Mayer, 2010; McLaren et al., 2022a; McNamara, 2017; Richey & Nokes-Malach, 2015; Rittle-Johnson et al., 2017).

Fig. 1. Whac-a-Gopher, an example of an ordering mini-game, includes a problem-solving step (left) and a self-explanation step (right)

Decimal Point incorporates elements of fantasy (Malone, 1981) through the amusement park context and through six alien characters who accompany the students throughout the game. The alien characters playfully incorporate accuracy feedback when students provide correct or incorrect responses, as well as providing encouragement throughout game play. Feedback is immediately provided after each step, and students must correctly answer both the problem solving and self-explanation steps in order to advance to the next problem. There are no penalties or limits on attempts for incorrect responses.

The present set of studies

The present set of studies utilizes past data from the use of a digital learning game, Decimal Point, obtained from the Decimal Point team. In this paper, we seek to answer the following research questions.

1. Do female and male students differ in how they interact with a digital learning game? Specifically, do they differ in the rates of gaming the system (a measure of behavioral disengagement) in a digital learning game as compared to a non-game control?
2. Do differences in gaming the system between female and male students occur in a specific activity (i.e., problem solving, self-explanation) within the game?
3. Do the differences in gaming the system in these specific contexts explain differences in learning outcomes between female and male students?
4. Do female and male students differ in their self-efficacy or interest in the game, and if so, does controlling for these differences eliminate mediating effects of gaming the system?

By comparing the frequency of gaming between female and male students in each step of the game (i.e., the problem solving and self-explanation steps), within the digital learning game Decimal Point and a non-game equivalent, we can investigate where and when differences manifest in this form of engagement between female and male students. We then further investigate whether gaming the system mediates and explains the relationships between gender and learning outcomes. Such mediation models can illuminate the specific learning processes and outcomes for different students playing Decimal Point, which in turn can inform instructional design to better support optimal learning processes.

To understand how female and male students differ in how often they game the system and the impact of this form of engagement on learning, we reanalyzed interaction and outcome data from three studies where students used Decimal Point. Each dataset was obtained from the Decimal Point team and contained pretest scores, immediate posttest scores, delayed posttest scores, and log data capturing students’ interaction with the digital learning game or non-game tutor. We describe each dataset as a separate study below.

Study 1 method

Study 1 utilized a dataset collected in the fall semester of the 2015 school year. The original study investigated the benefit of computer-based games in digital learning and results were first reported in McLaren et al. (2017a). In this experiment, students were assigned to use either the Decimal Point game or a non-game tutor with equivalent problem content. This dataset allowed us to examine the differences in the proportion of gaming between female and male students in both the game and non-game conditions. This allowed us to draw conclusions about the degree to which the game changes engagement compared to a non-game control, and whether male and female students engage differently in the game compared to a non-game tutor. We also examined the effect of engagement on learning outcomes and, given prior findings of gender differences in learning outcomes (Nguyen et al., 2022), we investigated whether levels of engagement mediate the relationship between gender and learning outcomes.

Study 1 participants

In this dataset, 213 students at two middle schools in a northeastern U.S. metropolitan area used either Decimal Point or the non-game equivalent as part of their normal classroom math instruction. Because of the distraction and demotivation that might have occurred with students sitting next to one another but working with very different materials, the researchers assigned students by classroom to one of the two instructional conditions; teachers classified each class as a low-, medium-, or high-performing class, and classes were equally distributed based on these ratings across the two conditions. Students who did not complete the materials in time or had an incomplete pretest, posttest, or delayed posttest (N = 52) were excluded from the analysis. An additional 8 students were removed for having gain scores that were more than 2.5 standard deviations above or below the mean. Of the remaining 153 students, 70 students were assigned to play Decimal Point, while 83 students completed the non-game equivalent of the system covering the same content. Both conditions had similar proportions of male and female students. Specifically, 31 male and 39 female students were in the game condition, and 35 male and 48 female students were in the non-game condition. Demographic information about participants in each study is reported in Table 1.
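The outlier rule described above can be sketched as follows. This is only an illustration: the function name and the dictionary keys `pretest` and `posttest` are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def exclude_gain_outliers(students, threshold=2.5):
    """Keep students whose gain lies within `threshold` SDs of the mean gain.

    `students` is a list of dicts with hypothetical keys "pretest" and
    "posttest". The sample standard deviation is assumed here.
    """
    gains = [s["posttest"] - s["pretest"] for s in students]
    centre, spread = mean(gains), stdev(gains)
    return [
        s for s, gain in zip(students, gains)
        if abs(gain - centre) <= threshold * spread
    ]
```

The same filter could be applied a second time to delayed-posttest gains if both gain scores are screened.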

Table 1 Participant demographic information across studies

Study 1 materials and procedure

Study 1 compared students learning in Decimal Point to a non-game control that presented identical learning and test problems, problem-solving mechanics, self-explanation prompts, and accuracy feedback. Figure 2 shows the equivalent non-game item as the Whac-A-Gopher problem in Fig. 1. The cover stories for the learning problems differed in the non-game context to avoid having a consistent theme. All problems in the non-game tutor were presented on a plain screen without characters.

Fig. 2

The non-game equivalent of the same ordering mini-game shown in Fig. 1

Knowledge tests

Three isomorphic tests were designed to target students’ learning of the decimal number operations practiced during the game and non-game tutor, as well as the underlying concepts and decimal number misconceptions addressed through the game and tutor. Tests were counterbalanced and administered as a pretest for students to take immediately before the beginning of the game or tutor. The pretest was used to assess prior knowledge; a posttest administered immediately after the end of the game or tutor was used to assess knowledge after completing the learning materials; and a delayed posttest administered one week after the end of the game or tutor was used to assess knowledge retention. Each test consisted of 42 items worth a total of 52 points, as some test items were worth multiple points.

Gaming detector construction

Models were developed to recognize gaming the system within the interaction data from Decimal Point and its non-game comparison condition, by first hand-labeling a subset of the Decimal Point data in terms of whether it involved gaming behavior, and then using machine learning to develop “detectors” that replicate those human judgments at scale (Baker et al., 2006a, 2006b).

The hand labels were obtained through text replay coding. Text replay coding has been used in many past studies, producing labels with acceptable inter-rater reliability, and in turn being used to develop automated detectors that are successful at recognizing when gaming the system is occurring (Baker & de Carvalho, 2008; Baker et al., 2006a, 2006b, 2010a, 2010b; Paquette & Baker, 2019). In text replay coding, human coders read through a clip of log data that captures a student’s interaction with the learning environment, and then use their judgment to infer the learner’s behaviors at the time. In the current study, we used text replay coding to identify gaming the system within the log data obtained from Decimal Point and the non-game tutor.

To develop text replays for coding, the research team first breaks down log data into clips, sub-segments of student behavior within the system. In general, each clip can capture a specific amount of time, a specific number of student actions, or a specific segment of an activity. In this study, in order to understand how engagement differs in each step (whether a problem-solving step or a self-explanation step) in male and female students, we delineated clips by treating each step as its own clip. Two iterations of text replay coding were conducted. The first iteration of the text replay coding labeled the gaming behaviors in the problem-solving steps while the second iteration labeled the gaming behaviors in the self-explanation steps.
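A minimal sketch of the clip segmentation described above, treating every step as its own clip. The log field names (`student`, `problem`, `step`) are assumptions made for illustration and are not the actual Decimal Point log schema.

```python
def split_into_clips(log_rows):
    """Group log rows into clips, one clip per (student, problem, step).

    `log_rows` is an iterable of dicts in chronological order. The keys
    "student", "problem", and "step" are hypothetical field names.
    """
    clips = {}
    for row in log_rows:
        key = (row["student"], row["problem"], row["step"])
        clips.setdefault(key, []).append(row)
    return clips
```

Each resulting clip can then be rendered as a text replay and labeled separately for the problem-solving and self-explanation iterations.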

In each iteration, text replay coding was conducted in three phases. In phase 1, two human coders coded a set of clips together. By discussing their judgment and the behavioral patterns noticed in the clips, the coders established a labeling rubric (this rubric was also based on the extensive past work that has been published on understanding gaming the system—cf. Paquette et al., 2014). The rubric contains a set of behavioral patterns that indicate the student is gaming the system. Specifically, within Decimal Point, behaviors that were identified as gaming the system included:

  • Clicking through the hints at high speed to obtain the answer, then immediately entering the answer and moving on

  • Systematically and rapidly guessing numbers (i.e., 1, 2, 3, …)

  • Systematically and rapidly selecting each multiple-choice option (i.e., A, B, C, …)

In phase 2, the two coders coded another set of clips separately using the rubric established. The labels from each coder were then compared and used to compute the inter-rater reliability (Cohen’s Kappa). If the inter-rater reliability had been below criterion, the coders would have discussed the differences and repeated the phase 2 coding until an acceptable inter-rater reliability had been achieved before moving on to phase 3. However, in this specific case, the two coders achieved acceptably high inter-rater reliability (by the typical standards of data used as the basis for machine learning of this nature) on the first round of phase 2 coding for both the problem-solving (k = 0.74) and self-explanation (k = 0.88) steps. In phase 3, coders split the remaining clips and coded independently. Clips were stratified to equally represent schools, problem type, and experiment condition. In total, 800 problem-solving clips and 1500 self-explanation clips were coded and used to construct the automated gaming detectors. More self-explanation clips were coded than problem-solving clips, because the first 800 self-explanation clips only had a small number of positive cases for the algorithm to learn from.
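For readers unfamiliar with the agreement statistic used in phase 2, Cohen's Kappa compares the observed agreement between two coders with the agreement expected by chance from each coder's label frequencies. The sketch below is a generic implementation with made-up labels, not the team's own script.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two coders who labeled the same clips."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of clips")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both coders pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, two coders who agree on 9 of 10 clips, with 3 and 2 positive labels respectively, have observed agreement 0.90 and chance agreement 0.62, giving a Kappa of 0.28 / 0.38, or about 0.74.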

To create automated gaming detectors, the labeled data was input into machine learning algorithms to replicate the coders’ judgment. This approach has been used to detect gaming the system in prior published studies (see, for example, Baker & de Carvalho, 2008; Baker et al., 2010b; Paquette & Baker, 2019). After evaluating the performance of several classification algorithms on this data, an Extreme Gradient Boosting (XGBoost) classifier (Chen & Guestrin, 2016) was used to build the automated detector for each of the two types of steps, classifying whether a clip capturing either the problem-solving or self-explanation step contains a gaming behavior. XGBoost uses an ensemble technique that trains an initial, weak decision tree and calculates its prediction errors. Following the initial training, the classifier then trains subsequent trees iteratively to predict the errors in the previous trees. The final prediction represents the sum of the predictions of all the trees in the set.
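The residual-fitting idea in the paragraph above can be illustrated with a toy example. The sketch below is not XGBoost: it uses one-split decision stumps on a single feature with squared error, and it omits the regularization, second-order gradients, and classification loss that XGBoost uses. All names and the learning rate are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x is identical, so predict the mean everywhere
        overall = sum(residuals) / len(residuals)
        return (xs[0], overall, overall)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is trained on the errors left by the stumps before it."""
    stumps, predictions = [], [0.0] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"rate": learning_rate, "stumps": stumps}


def boosted_predict(model, x):
    """The final prediction is the (scaled) sum over all stumps."""
    return sum(model["rate"] * stump_predict(s, x) for s in model["stumps"])
```

In practice a library implementation would be used rather than hand-written trees; the sketch only shows why the ensemble's output is a sum of per-tree predictions.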

The models were tested with tenfold student-level cross-validation, in which models were trained using data from a subset of students and tested on other students’ data. Based on the cross-validation results, we determined that the models could reliably predict gaming in unseen students in both the problem-solving (AUC = 0.89, k = 0.50) and self-explanation (AUC = 0.99, k = 0.95) steps. The detectors were then applied to the rest of the dataset to predict gaming. We computed the rate of gaming for each student and step using the gaming labels from the detectors. The rate of gaming reflects how often a student gamed the system at either the problem-solving or the self-explanation steps.
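Student-level cross-validation means that all clips from a given student fall on the same side of each train/test split, so a detector is never tested on a student it was trained on. A minimal sketch of the fold assignment follows; the function name, the round-robin assignment, and the seed are assumptions for illustration.

```python
import random


def student_level_folds(clip_students, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) with no student in both sets.

    `clip_students[i]` is the ID of the student who produced clip i.
    """
    students = sorted(set(clip_students))
    random.Random(seed).shuffle(students)
    # Assign whole students, not individual clips, to folds.
    fold_of = {student: i % n_folds for i, student in enumerate(students)}
    for fold in range(n_folds):
        test = [i for i, s in enumerate(clip_students) if fold_of[s] == fold]
        train = [i for i, s in enumerate(clip_students) if fold_of[s] != fold]
        yield train, test
```

Splitting at the clip level instead would let clips from the same student appear in both sets, which tends to overstate performance on unseen students.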

Study 1 results

First, rates of gaming were computed for each student and step to reflect how often each student gamed the system on the problem-solving and self-explanation steps. Correlations between test scores, gender, and rates of gaming are shown in Table 2. For both the problem-solving and self-explanation steps, gaming the system was significantly, negatively correlated with test performance.
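One way to compute such per-student, per-step rates is sketched below. It assumes the rate is the proportion of a student's clips of a given step type that the detector flagged as gaming; the text does not spell out the exact formula, and the tuple layout is hypothetical.

```python
from collections import defaultdict


def gaming_rates(clip_labels):
    """Return {(student, step_type): proportion of clips flagged as gaming}.

    `clip_labels` is an iterable of (student, step_type, is_gaming) tuples,
    where is_gaming is the detector's binary label for one clip.
    """
    tallies = defaultdict(lambda: [0, 0])  # [gamed clips, total clips]
    for student, step_type, is_gaming in clip_labels:
        tally = tallies[(student, step_type)]
        tally[0] += int(is_gaming)
        tally[1] += 1
    return {key: gamed / total for key, (gamed, total) in tallies.items()}
```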

Table 2 Correlations between test performance, gender (female = 0, male = 1), and gaming the system for Study 1

We then compared rates of gaming between female and male students on the problem-solving and self-explanation steps in both the game and non-game conditions. Means and standard deviations for test scores and rates of gaming (for female and male students, in the two conditions, across types of activities) are shown in Table 3. Analysis of variance (ANOVA) was performed to assess whether the students’ rates of gaming on each step differed by gender. ANOVA was selected rather than a non-parametric test, due to lack of evidence of non-normality in the variables (skewness and kurtosis were in the acceptable range for all variables). In the non-game condition, no significant difference between male and female students’ gaming the system behaviors was observed in either the problem-solving (F(1,81) = 0.17, p = 0.68) or self-explanation steps (F(1,81) = 0.07, p = 0.79). This suggests that students engaged similarly with the non-game tutor regardless of gender. However, in the game condition, male students gamed the system significantly more than female students within the self-explanation steps (F(1,68) = 4.83, p = 0.031), indicating that male students demonstrated more disengagement than female students in the game. There was no significant difference on the problem-solving step (F(1,68) = 0.096, p = 0.76).
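The F statistics above come from a one-way ANOVA with gender as the grouping factor. As a reminder of how the statistic and its degrees of freedom arise, here is a generic sketch; it is not the authors' analysis script, and it does not compute p values.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

With 31 and 39 students in the two groups, the degrees of freedom are (1, 68), which matches the F(1,68) reported for the game condition; 35 and 48 students give (1, 81), as in the non-game condition.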

Table 3 Average rates of gaming the system for problem solving (PS) and self-explanation (SE) steps and test scores by condition and gender for Study 1

The lack of differences in the frequency of gaming the system between female and male students in the non-game condition suggests that some aspect of playing the game triggered differences in gaming the system behaviors. To understand how gaming the system in the self-explanation step relates to learning outcomes for female students and male students playing the game, linear regression models were used to predict the immediate and delayed posttest scores by rates of gaming in the self-explanation steps, controlling for pretest scores (Table 4). Gaming the system in the self-explanation step did not significantly predict students’ scores on the immediate posttest when controlling for pretest. However, both pretest and gaming the system in the self-explanation step significantly predicted students’ scores on the delayed posttest (see Table 4). A model with just pretest scores predicted 55 percent of the variance in delayed posttest scores, and the overall model predicted 58 percent of the variance, indicating that adding the rate of gaming the system on the self-explanation step predicted an additional 3 percent of the variance over the pretest alone.
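The "additional variance" figure is the difference in R² between two nested regressions: one with pretest only and one with pretest plus the gaming rate. The sketch below shows that comparison with a small least-squares routine; the function name is illustrative and this is not the authors' code.

```python
def ols_r_squared(predictors, outcome):
    """R^2 of an ordinary least-squares fit with an intercept.

    `predictors` is a list of columns (one list per predictor).
    """
    n = len(outcome)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations (X'X | X'y), solved by Gauss-Jordan elimination.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, outcome))]
           for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    mean_y = sum(outcome) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(outcome, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in outcome)
    return 1 - ss_res / ss_tot
```

Calling `ols_r_squared([pretest, gaming], delayed) - ols_r_squared([pretest], delayed)` gives the increment; with the values reported in the text this is 0.58 − 0.55 = 0.03.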

Table 4 Regression models predicting immediate and delayed posttest with pretest scores and rates of gaming for students in the game condition. Beta and p values are from combined models

In both models predicting students’ test scores, gaming in the self-explanation steps was negatively associated with the posttest performance (according to the β coefficients), suggesting that when controlling for pretest, higher rates of gaming were associated with worse learning outcomes.

Finally, we examined whether the difference in gaming the system in the self-explanation step explained any effect of gender on learning outcomes on the delayed posttest. Given the fact that gaming the system behaviors differ between genders and gaming the system significantly predicts learning outcomes on the delayed posttest, we created a mediation model (Hayes, 2017) examining the relationship between gender and delayed posttest scores with gaming in the self-explanation step as the mediator (Fig. 3). The “mediate” function in the “psych” package in R was used to build each model, using 5000 bootstrap iterations. This model generates confidence intervals to test the indirect effect of gender (female = 0, male = 1) on delayed posttest scores, with gaming the system on the self-explanation step as the mediator. Pretest scores were again included as a covariate. We focused on delayed posttest scores because gaming on the self-explanation step was a significant predictor of delayed posttest score.

Fig. 3

The mediation model showing path standardized coefficients for a mediation analysis of gender on delayed posttest through gaming the system on self-explanation questions

Results indicated that male students had a significantly higher frequency of gaming the system in the self-explanation step, a = 0.06, p < 0.008. Gaming the system on the self-explanation step was negatively associated with performance on the delayed posttest regardless of gender, b = − 16.7, p = 0.03. There was no direct effect of gender on delayed posttest performance when controlling for gaming the system, c′ = − 0.30, p = 0.85, but the indirect effect of gender on delayed posttest through gaming the system on the self-explanation step was significantly different from zero, ab = − 1.07, 95% CI [− 2.71, − 0.06]. This indicates that gaming the system mediates the effect of gender on delayed posttest scores.
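The logic of the bootstrapped indirect effect can be sketched in a few lines. The analysis reported here used the “mediate” function in the R “psych” package; the Python below is only an illustrative re-implementation of the general idea (path a from regressing the mediator on gender and pretest, path b from regressing the outcome on the mediator, gender, and pretest, and a percentile bootstrap of a × b). Function names, the percentile method, and the seed are assumptions.

```python
import random


def ols_coefficients(columns, outcome):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(outcome))]
    k = len(rows[0])
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, outcome))]
           for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def indirect_effect(gender, gaming, posttest, pretest):
    """a * b, with pretest as a covariate in both regressions."""
    a = ols_coefficients([gender, pretest], gaming)[1]
    b = ols_coefficients([gaming, gender, pretest], posttest)[1]
    return a * b


def bootstrap_ci(gender, gaming, posttest, pretest, n_boot=5000, seed=0):
    """Percentile 95% CI for the indirect effect, resampling students."""
    rng = random.Random(seed)
    n = len(gender)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(indirect_effect(
                [gender[i] for i in idx], [gaming[i] for i in idx],
                [posttest[i] for i in idx], [pretest[i] for i in idx]))
        except ValueError:
            continue  # skip degenerate resamples (e.g. only one gender drawn)
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int(0.975 * len(estimates)))]
    return lower, upper
```

Mediation is inferred when the resulting interval excludes zero, as is the case for the intervals reported in the text.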

Study 2 method

Results from Study 1 showed that male and female students engaged differently with the game but not the non-game tutor, and that gaming the system on the self-explanation step mediated the effect of gender on the delayed posttest. In Studies 2 and 3, we examined additional Decimal Point datasets to see whether these effects would replicate across studies. Study 2 utilized a dataset collected in the 2017 fall semester, and results were originally reported in Nguyen et al. (2018). The original purpose of Study 2 was to investigate the effect of agency in digital learning games. Specifically, the study examined whether enabling students to choose which mini-games to play and when to quit would lead to different behaviors and learning outcomes.

Study 2 participants

In the Study 2 dataset, 197 students at one of the same middle schools from the Study 1 dataset and at an elementary school in the same northeastern U.S. metropolitan area used Decimal Point as part of their normal math instruction. Students who did not complete the pretest, posttest or delayed posttest were excluded from the analysis (N = 32). Seven additional students were excluded as outliers because their gains from pretest to posttest or delayed posttest were more than 2.5 standard deviations greater or less than the mean. Of the remaining 158 students, 77 were female and 81 were male. Additional demographic information about the participants in Study 2 is reported in Table 1.

Study 2 materials and procedure

Students in the Study 2 dataset used either the original version of Decimal Point (low-agency condition) or a version of the game in which they could select the order in which they played the mini-games and could choose to quit at any point after completing 24 rounds of mini-games (Nguyen et al., 2018). All problem content and within-game mechanics were the same across conditions, and we analyze the two conditions together in this paper. This choice enables us to focus on investigating the impact of gender on gameplay behaviors and learning outcomes, regardless of what order the games were played in. Study 2 also introduced questionnaires asking students to self-report their interest in the game and self-efficacy for decimal number operations.

Knowledge tests

The same three isomorphic tests designed for Study 1 were used to assess knowledge in Study 2. As in Study 1, tests were counterbalanced and administered as a pretest immediately before the beginning of the game to assess prior knowledge; a posttest administered immediately after the end of the game to assess knowledge after completing the game; and a delayed posttest administered one week after the end of the game to assess knowledge retention.

Gaming detectors

The same models developed to recognize gaming the system within the interaction data from Study 1 were applied to interaction data to detect gaming in Study 2.

Self-efficacy and interest surveys

Study 2 added several additional measures of students’ affective experiences: a questionnaire assessing student self-efficacy, and a questionnaire assessing interest. Self-efficacy items were administered before the start of the game (five items, α = 0.79). Students responded to statements such as “I do well on decimal problems in school” and “Before this lesson, I understood decimals (such as 0.235)”. After completing the game, students responded to three items about their interest in the game (α = 0.86). Example statements included “I liked doing this lesson” and “I would like to do more lessons like this.” For both questionnaires, students responded on a 1–5 Likert-type scale from 1 (strongly disagree) to 5 (strongly agree).
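The reliability coefficients reported for the questionnaires are presumably Cronbach's alpha, which compares the sum of the item variances with the variance of students' total scores. A generic sketch with made-up responses follows; it is not the authors' script.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a questionnaire.

    `item_scores` holds one list per item, each with one response per student
    (students in the same order across items).
    """
    k = len(item_scores)
    totals = [sum(responses) for responses in zip(*item_scores)]
    summed_item_variance = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - summed_item_variance / variance(totals))
```

Values closer to 1 indicate that the items vary together across students, which is the sense in which scales with coefficients of 0.79 and 0.86 are treated as internally consistent.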

Study 2 results

First, rates of gaming were computed for each student and step to reflect how often each student gamed the system on the problem-solving and self-explanation steps. Correlations between test scores, gender, and rates of gaming are shown in Table 5. Similar to Study 1, there were strong, negative correlations between test performance and gaming the system; surprisingly, there were also negative correlations between test performance and decimal self-efficacy.

Table 5 Correlations between test performance, gender (female = 0, male = 1), gaming the system, decimal self-efficacy, and interest in the game for Study 2

We then compared rates of gaming between female and male students on the problem-solving and self-explanation steps in the game. Means and standard deviations for test scores, self-efficacy, interest, and rates of gaming (for female and male students, across types of activities) are shown in Table 6. A one-way ANOVA revealed no significant difference in pretest performance between male and female students, F(1,156) = 3.47, p = 0.064. One-way ANCOVAs controlling for pretest revealed no significant effect of gender on the immediate posttest, F(1,155) = 0.48, p = 0.49, or on the delayed posttest, F(1,155) = 1.19, p = 0.28. As in Study 1, male students gamed the system significantly more than female students within the self-explanation steps, F(1,156) = 4.82, p = 0.030. As in Study 1, there was not a statistically significant difference in gaming the system on the problem-solving step, F(1,156) = 2.67, p = 0.10. Prior research has indicated that apparent gender differences in math or digital learning game outcomes could result from differences in learners’ self-efficacy or interest in the content. To examine this possibility, we assessed gender differences in self-reported decimal self-efficacy and interest in the Decimal Point digital learning game. Contrary to prior research, female students reported significantly higher decimal self-efficacy than male students, F(1,156) = 7.63, p = 0.006. However, there were no significant gender differences in students’ interest in the game, F(1,156) = 1.50, p = 0.22.

Table 6 Average probabilities of gaming the system for problem solving (PS) and self-explanation (SE) activities, and average self-reported decimal self-efficacy and interest

As with Study 1, regression models were used to predict the immediate and delayed posttest scores by rates of gaming in the self-explanation steps, controlling for pretest scores. Both pretest and gaming the system in the self-explanation step significantly predicted students’ scores on the immediate posttest (see Table 7). A model with just pretest scores predicted 74 percent of the variance in posttest scores, and the overall model predicted 75 percent of the variance, indicating that adding the rate of gaming the system on the self-explanation step predicted an additional 1 percent of the variance over the pretest alone. Despite the small amount of additional variance explained by gaming the system, this predictor remained statistically significant in a combined model, t(155) = − 2.46, p = 0.015.

Table 7 Regression models predicting immediate and delayed posttest with pretest scores and rates of gaming. Beta and p values are from combined models

However, while pretest significantly predicted students’ scores on the delayed posttest, gaming the system in the self-explanation step did not significantly predict students’ delayed posttest scores in Study 2, t(155) = − 1.22, p = 0.22.

Since gender did not have a significant effect on test performance, we applied a bootstrap mediation analysis that does not require the predictor variable to significantly predict the outcome variable (Hayes, 2017). This mediation approach can detect significant indirect pathways even when the direct pathway is not significant. We built a mediation model to test the indirect effect of gender (female = 0, male = 1) on posttest scores, with gaming the system on the self-explanation step as the mediator. Pretest scores were again included as a covariate. We focused on posttest scores because gaming the system on the self-explanation step was a significant predictor of posttest score in Study 2.

Results indicated that male students gamed the system significantly more often than female students in the self-explanation step, a = 0.14, p < 0.001, and gaming was found to be significantly, negatively associated with immediate posttest scores, b = − 4.60, p = 0.02. The rate of gaming in the self-explanation steps was shown to explain the relationship between gender and the immediate posttest scores. There was no direct effect of gender on posttest performance when controlling for gaming the system, c′ = 0.05, p = 0.95, but the indirect effect was statistically significantly different from zero, ab = − 0.63, 95% CI [− 1.45, − 0.06] (see Fig. 4). Similar to Study 1, gaming the system mediated the relation between gender and test performance.

Fig. 4

The mediation model showing path standardized coefficients for a mediation analysis of gender on posttest through gaming the system on self-explanation questions, in Study 2

To account for the possibility that gender differences in self-efficacy might contribute to gender differences in gaming the system or learning outcomes, we re-ran the mediation model predicting immediate posttest, this time including decimal self-efficacy as an additional covariate. The overall results of the mediation model did not change. Male students were again found to have gamed the system significantly more often than female students in the self-explanation step, a = 0.14, p < 0.001, and gaming was found to be significantly negatively associated with the immediate posttest scores, b = − 4.60, p = 0.02. The rate of gaming in the self-explanation steps still mediated the relation between gender and the immediate posttest scores, as the indirect effect was statistically significantly different from zero, ab = − 0.63, 95% CI [− 1.47, − 0.045].

Study 3 method

Results from Study 2 replicated our Study 1 findings that male and female students engaged differently with Decimal Point, and that these differences in engagement mediated the relation between gender and learning outcomes. In Study 3, we assessed whether we could replicate our findings again with a third Decimal Point dataset. Study 3 utilized a dataset collected in the 2018 spring semester; results were originally reported in Harpstead et al. (2019). As with Study 2, the original purpose of Study 3 was to investigate the effect of agency in digital learning games. Specifically, Study 3 was originally designed to examine whether enabling students to choose which mini-games to play and when to quit would lead to different behaviors and learning outcomes; it also investigated the effects of indirect control on students’ gameplay choices.

Study 3 participants

In the Study 3 dataset, 285 students at two different middle schools in the same northeastern metropolitan area used Decimal Point as part of their normal math instruction. Students who did not complete the pretest, posttest or delayed posttest, who did not complete the learning materials, or who experienced log-in errors were excluded from the analysis (N = 48). One additional student was excluded as an outlier based on posttest scores, and another was excluded because they declined to provide gender information. In total, 237 students were included in the analysis, with 130 female students and 107 male students. Students used either the original version of Decimal Point or one of two modified versions of the game that allowed students to select the order in which they would play the mini-games and when to quit after completing a minimum of 24 rounds. As with Study 2, we collapsed across all conditions when analyzing the Spring 2018 data set because the within-game mechanisms and content did not vary across conditions.

Study 3 materials and procedure

The materials for Study 3 included the same two conditions used in Study 2 (the original low-agency condition and a high-agency condition that introduced student choice), as well as a third condition that removed the visual path through the amusement park map (Fig. 5). Results from Nguyen et al. (2018) indicated that students tended to follow the same path through the amusement park even when they had the choice to play in a different order, and the authors speculated that the visual path might exert indirect control over students’ selections concerning the order in which they completed the games (Harpstead et al., 2019).

Fig. 5

The original theme park map (left) and the map without a line (right) used to compare high-agency conditions in Study 3; the line was considered a form of indirect control hypothesized to constrain learners’ choices

Knowledge tests

The same three isomorphic tests designed for Study 1 were used to assess knowledge in Study 3. As in Studies 1 and 2, tests were counterbalanced and administered as a pretest immediately before the beginning of the game to assess prior knowledge; a posttest administered immediately after the end of the game to assess knowledge after completing the game; and a delayed posttest administered one week after the end of the game to assess knowledge retention.

Gaming detectors

The same models developed to recognize gaming the system within the interaction data from Study 1 were applied to interaction data to detect gaming in Study 3.

Self-efficacy and interest surveys

Study 3 included the same measures of affective experiences introduced in Study 2: a self-efficacy questionnaire (reduced to four items for Study 3, α = 0.84) and an interest questionnaire (α = 0.87). For both questionnaires, students responded on a 1–5 Likert-type scale from 1 (strongly disagree) to 5 (strongly agree).

Study 3 results

First, rates of gaming were computed for each student and step to reflect how often each student gamed the system on the problem-solving and self-explanation steps. Correlations between test scores, gender, and rates of gaming are shown in Table 8. There were strong, negative correlations between test performance and gaming the system and strong, positive correlations between test performance and decimal self-efficacy.

Table 8 Correlations between test performance, gender (female = 0, male = 1), gaming the system, decimal self-efficacy, and interest in the game for Study 3

We then compared rates of gaming between female and male students on the problem-solving and self-explanation steps in the game. Means and standard deviations for test scores, self-efficacy, interest, and rates of gaming are reported in Table 9. A one-way ANOVA indicated no effect of gender on pretest, F(1,235) = 3.45, p = 0.064. A one-way ANCOVA controlling for pretest revealed a significant effect of gender on posttest, F(1,235) = 3.93, p = 0.048, with female students improving more than male students when controlling for pretest. There was no effect of gender on delayed posttest when controlling for pretest, F(1,234) = 2.08, p = 0.15. As in Studies 1 and 2, male students gamed the system significantly more than female students within the self-explanation steps, F(1,235) = 12.58, p < 0.001. As in Studies 1 and 2, there was not a statistically significant difference on the problem-solving step, F(1,235) = 0.39, p = 0.53. There were also no significant differences in students’ decimal self-efficacy, F(1,235) = 2.45, p = 0.12, or interest in the game, F(1,235) = 3.28, p = 0.072.

Table 9 Average probabilities of gaming the system for problem solving (PS) and self-explanation (SE) activities, and average self-reported decimal self-efficacy and game interest

As with Studies 1 and 2, regression models were used to predict the immediate and delayed posttest scores by rates of gaming in the self-explanation steps, controlling for pretest scores. Both pretest and gaming the system in the self-explanation step significantly predicted students’ scores on the immediate posttest (see Table 10). A model with just pretest scores predicted 64 percent of the variance in posttest scores, and the overall model predicted 66 percent of the variance, indicating that adding the rate of gaming the system on the self-explanation step predicted an additional 2 percent of the variance over the pretest alone. Despite the small amount of additional variance explained by gaming the system, this predictor remained statistically significant in a combined model, t(234) = − 3.92, p < 0.001.

Table 10 Predicting immediate and delayed posttests with pretest and gaming the system

In addition, both pretest and gaming the system in the self-explanation step significantly predicted students’ scores on the delayed posttest (Table 10). A model with just pretest scores explained 62 percent of the variance in delayed posttest scores, and the overall model explained 65 percent, indicating that adding the rate of gaming the system on the self-explanation step explained an additional 3 percent of the variance over the pretest alone. Despite the small amount of additional variance explained by gaming the system, this predictor remained statistically significant in the combined model, t(234) = −4.53, p < 0.001.
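The nested-model comparison reported for both posttests can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' analysis: the function names, the simulated coefficients, and the sample size of 237 (inferred from the reported degrees of freedom) are all assumptions. It computes R² for the pretest-only model and for the model that adds gaming, and uses the standard identity that the F-change for one added predictor equals the square of that predictor's t statistic.

```python
import random
from math import sqrt

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)

def r2_pretest_only(post, pre):
    """R^2 of post ~ pre (simple regression): the squared correlation."""
    return pearson(post, pre) ** 2

def r2_pretest_plus_gaming(post, pre, gaming):
    """R^2 of post ~ pre + gaming, from the three pairwise correlations."""
    r1, r2, r12 = pearson(post, pre), pearson(post, gaming), pearson(pre, gaming)
    return (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)

def abs_t_for_added_predictor(r2_full, r2_reduced, n, k_full=2):
    """|t| of the single added predictor, using F-change = t^2."""
    f_change = (r2_full - r2_reduced) / ((1 - r2_full) / (n - k_full - 1))
    return sqrt(f_change)

# Synthetic data (NOT the study's data): gaming lowers posttest beyond pretest.
rng = random.Random(0)
n = 237  # assumption: inferred from t(234) with two predictors plus an intercept
pre = [rng.gauss(20, 6) for _ in range(n)]
gaming = [min(1.0, max(0.0, 0.4 - 0.01 * p + rng.gauss(0, 0.1))) for p in pre]
post = [5 + 0.9 * p - 6 * g + rng.gauss(0, 3) for p, g in zip(pre, gaming)]

r2_reduced = r2_pretest_only(post, pre)
r2_full = r2_pretest_plus_gaming(post, pre, gaming)
print(f"R2 pretest only:      {r2_reduced:.3f}")
print(f"R2 pretest + gaming:  {r2_full:.3f}")
print(f"Additional variance:  {r2_full - r2_reduced:.3f}")
print(f"|t| for gaming:       {abs_t_for_added_predictor(r2_full, r2_reduced, n):.2f}")
```

Applied to the rounded values reported for the immediate posttest (0.66 and 0.64), the same identity gives |t| of roughly 3.7, which is in line with the reported t(234) = −3.92 once rounding of the R² values is taken into account.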

Since gaming on the self-explanation step was a significant predictor of immediate posttest and delayed posttest, we built mediation models to test the indirect effects of gender (female = 0, male = 1) on both test scores, with gaming the system in the self-explanation step as the mediator. Pretest scores were again included as a covariate.

For the model predicting immediate posttest scores, the indirect effect of gender through the mediator of gaming the system was significantly different from zero, ab = −0.94, 95% CI [−1.63, −0.38]. The same held for the indirect effect of gender on delayed posttest scores through gaming the system, ab = −1.21, 95% CI [−20.01, −0.58] (Fig. 6).
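The mediation test described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' analysis: the section reports confidence intervals for the indirect effect ab but does not say how they were computed, so the percentile bootstrap used here is an assumption, as are the helper names `ols`, `indirect_effect`, and `bootstrap_ci`, the sample size of 237 (inferred from the reported degrees of freedom), and all simulated coefficients.

```python
import random

def ols(predictors, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    rows = [[1.0] + list(r) for r in zip(*predictors)]
    k = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * p for x, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def indirect_effect(gender, gaming, post, pre):
    """a*b, with a from gaming ~ gender + pre and b from post ~ gaming + gender + pre."""
    a = ols([gender, pre], gaming)[1]
    b = ols([gaming, gender, pre], post)[1]
    return a * b

def bootstrap_ci(gender, gaming, post, pre, n_boot=500, seed=1):
    """Percentile bootstrap 95% interval for the indirect effect (an assumed method)."""
    rng = random.Random(seed)
    n = len(post)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample students with replacement
        estimates.append(indirect_effect(
            [gender[i] for i in idx], [gaming[i] for i in idx],
            [post[i] for i in idx], [pre[i] for i in idx]))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Synthetic data (NOT the study's data); female = 0, male = 1 as in the text.
rng = random.Random(0)
n = 237
gender = [rng.randint(0, 1) for _ in range(n)]
pre = [rng.gauss(20, 6) for _ in range(n)]
gaming = [0.2 + 0.17 * g + rng.gauss(0, 0.15) for g in gender]
post = [5 + 0.9 * p - 5.0 * m + rng.gauss(0, 3) for p, m in zip(pre, gaming)]

ab = indirect_effect(gender, gaming, post, pre)
lo, hi = bootstrap_ci(gender, gaming, post, pre)
print(f"indirect effect ab = {ab:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes zero is what the text describes as an indirect effect significantly different from zero. Pretest enters both regressions as a covariate, mirroring the description of the models.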

Fig. 6 The mediation model showing standardized path coefficients for the mediation analysis of gender on posttest through gaming the system on self-explanation questions, in Study 3

Although there were no significant differences in decimal self-efficacy by gender in Study 3, we again ran the mediation model predicting immediate posttest scores including decimal self-efficacy as an additional covariate to account for the possibility that self-efficacy might contribute to gender differences in gaming the system or learning outcomes. The overall results of the mediation model again did not change. Male students were again found to have gamed the system significantly more often than female students in the self-explanation step, a = 0.17, p < 0.001, and gaming was found to be significantly negatively associated with the immediate posttest scores, b = −5.38, p = 0.001. The rate of gaming in the self-explanation steps mediated the relationship between gender and the immediate posttest scores. The total effect (c = −1.78, p = 0.034) was significant but the direct effect (c′ = −0.87, p = 0.31) was not; the indirect effect was significantly different from zero, ab = −0.91, 95% CI [−1.59, −0.35].

Similar results were found for delayed posttest. Male students gamed the system significantly more often than female students in the self-explanation step, a = 0.17, p < 0.001, and gaming was negatively associated with the delayed posttest scores, b = −7.05, p < 0.001. The total effect (c = −1.36, p = 0.12) and direct effect (c′ = −0.17, p = 0.85) were not significant, and the indirect effect of gender on delayed posttest through the mediator of gaming the system was significantly different from zero, ab = −1.19, 95% CI [−2.04, −0.54]. Results from these mediation models suggest that the frequency of gaming in the self-explanation steps explained the impact of gender on both the immediate and delayed posttest scores.
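The coefficients reported for the two models with self-efficacy as a covariate can be checked against the standard decomposition for linear mediation, in which the total effect equals the direct effect plus the indirect effect. In the worked arithmetic below, c is the total effect, c′ is the direct effect, and ab is the indirect effect, as used in the text. The decomposition itself is a general property of these models rather than something stated in the text, and small discrepancies reflect rounding of the reported values.

```latex
% Total effect = direct effect + indirect effect: c = c' + ab
\begin{align*}
\text{Immediate posttest:}\quad
  ab &= 0.17 \times (-5.38) \approx -0.91, &
  c' + ab &= -0.87 + (-0.91) = -1.78 = c \\
\text{Delayed posttest:}\quad
  ab &= 0.17 \times (-7.05) \approx -1.20 \;(\text{reported } -1.19), &
  c' + ab &= -0.17 + (-1.19) = -1.36 = c
\end{align*}
```

In both cases the indirect path accounts for most of the total effect, which matches the conclusion that gaming in the self-explanation steps explained the relationship between gender and posttest scores.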

Discussion

As shown in previous research, digital learning games can be particularly effective for female students. In fact, digital learning games have been found to be more effective for female students than for male students in terms of learning and affective outcomes in a number of studies (Arroyo et al., 2014; Hou et al., 2020; McLaren et al., 2017b; Nguyen et al., 2022). However, few studies have tested whether digital learning games influence students’ gameplay behaviors differently for female and male students, and whether these differences could account for the better learning outcomes often seen for female students.

Within this paper, we investigate this issue in the context of data obtained from the learning game Decimal Point. A number of studies with Decimal Point, over a period of more than five years, have shown that playing Decimal Point led to greater learning gains for female students than male students (Nguyen et al., 2022). Our current paper considered the hypothesis that this effect may have been due to differences in the frequency of gaming the system, a disengaged behavior. We analyzed three retrospective data sets collected from students playing Decimal Point. We found that in the game condition, but not the non-game condition, male students gamed the system significantly more frequently than female students in one key part of the learning experience, the self-explanation step. However, the male students did not game the system more frequently in other activities within Decimal Point.

This pattern of results suggests that male students were not generally inclined to game the system more in Decimal Point, but rather that one specific element of the digital learning game was associated with differences in gaming the system between female and male students. We then investigated whether this difference in gaming behavior could explain the difference in learning outcomes between female and male students. We found that indeed, the difference in the rates of gaming in the self-explanation step mediated the relation between gender and learning outcomes across the three datasets. This result provides a potential explanation for why female students benefited more from using Decimal Point than male students, a finding reported in previous work (Hou et al., 2020; McLaren et al., 2017b; Nguyen et al., 2022). It also provides a broader hypothesis that the differences in learning game effectiveness between female and male students seen in many cases may be due to differences in the engagement produced by specific games when played by different students.

It is worth considering why gaming the system might have played such a significant role in the different levels of learning experienced by female and male students. As discussed above, gaming the system is generally associated with poorer learning outcomes, but the prevalence of gaming the system in self-explanation activities might have played a particularly important role. Self-explanation activities help students connect their prior knowledge to new content, correct errors in understanding, and develop deeper knowledge that supports more robust learning and transfer (Chi & Wylie, 2014; McNamara, 2017; Richey & Nokes-Malach, 2015; Rittle-Johnson et al., 2017). Gaming the system—and therefore successfully completing the self-explanation activities without actually self-explaining—is likely to eliminate most or all of these benefits. Gaming the system on self-explanation steps might therefore be especially detrimental to students’ learning processes, as this choice can disrupt opportunities to connect newly acquired problem-solving skills with existing knowledge and to fill conceptual knowledge gaps related to the content being learned.

Although we had initially hypothesized that gaming behavior could help explain the differences in learning outcomes in Decimal Point for female and male students, we did not expect that gender differences in gaming the system would emerge only during self-explanation. This finding is surprising because Decimal Point generally elicits less gaming the system than an intelligent tutor covering the same content (Mogessie et al., 2020), and its playful game mechanics are more prominent during the problem-solving steps than during the self-explanation steps. Therefore, if the gameplay itself were more engaging for female students than male students, we might have expected to see a greater impact on engagement, and therefore a greater reduction in gaming the system, during problem-solving steps. One possible explanation is that the digital learning game context may have reduced disengagement overall but actually increased the likelihood that students would become disengaged during a specific part of the activity that more closely resembled typical instruction: the self-explanation steps. If this is the case, it may not be that female students found the game more engaging overall, but rather that they were less likely than male students to become disengaged during the less playful components of the game such as the self-explanation steps.

Another possible explanation for this difference comes from the fact that the self-explanation steps were designed in a way that made them easier to game than the problem-solving steps. While a mindless guess-and-check approach to problem-solving steps could include testing a very large number of possibilities (i.e., all possible locations on a number line, a long list of possible values in sequence problems, all order permutations in ordering problems—cf. Paquette et al., 2014), the self-explanation questions were multiple-choice, typically with 3 or 4 options, and therefore could be answered correctly through gaming within a small number of attempts. However, this difference in question design was true of both the game and non-game versions of the content, and the gender differences emerged only in the game. It may be useful for future research to examine whether similar differences in gaming the system between female and male students emerge if self-explanation questions are formatted in a way that can be less easily gamed, such as open-ended responses or drag-and-drop items (McLaren et al., 2022a).
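To make the "small number of attempts" concrete, the expected number of guesses can be worked out under a simple assumption that is not taken from the text: a student tries the k options of a multiple-choice question in a uniformly random order, never repeating an option, and exactly one option is correct.

```latex
% Assumption: k options, one correct, tried in uniformly random order without repeats.
% The correct option is equally likely to be at any position 1..k in that order.
\[
\mathbb{E}[\text{attempts}] \;=\; \sum_{i=1}^{k} i \cdot \frac{1}{k} \;=\; \frac{k+1}{2},
\qquad k = 3 \;\Rightarrow\; 2, \qquad k = 4 \;\Rightarrow\; 2.5,
\]
\[
\text{and never more than } k \text{ attempts.}
\]
```

By contrast, the guess-and-check space for the problem-solving steps described above (number-line locations, value sequences, order permutations) grows far larger, which is consistent with the self-explanation questions being easier to game.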

Limitations and future work

This study has several limitations that should be considered in future work. First of all, it would be worthwhile to consider additional behaviors and indicators that represent engagement and disengagement beyond just gaming the system. Positive engagement produced by the game may manifest as experiences of flow (Perttula et al., 2017) or delight (Rodrigo & Baker, 2011), and may produce positive behaviors such as persistence (Ventura & Shute, 2013). Beyond gaming the system, disengagement may manifest as careless errors (Hershkovitz et al., 2012), or actions within the game not aimed at completing the learning task (Sabourin et al., 2011). Different engagement measures may capture other cognitive and motivational aspects of student experiences within digital learning games, such as a desire to get the experience over with (carelessness) or general disinterest in the game (game task-unrelated behavior), which differ in kind from the motivations and attitudes underlying gaming the system. These measures are not yet available for Decimal Point but could be developed. Therefore, future work should investigate the prevalence of a broader range of behaviors in games such as Decimal Point, and whether they play a mediating role in the relationship between gender and learning outcomes within these games. Doing so will help expand understanding of the role that disengagement plays in the different learning gains seen for female and male students within digital learning games. Beyond this, there will be value in repeating this same type of analysis for other learning activities and contexts, towards fully understanding the many proximal variables that play a role in the complex pattern of differences in learning activities between male and female students.

Another area of future work lies in the application of the research paradigm used here to a broader range of differences between students. As Dele-Ajayi et al. (2018) note, there is evidence that many games’ effectiveness varies considerably depending on the characteristics of the learners using those games, but there has been insufficient research into why these effects are seen. By applying automated detection of disengaged behaviors and other key processes such as self-regulated learning (Fan et al., 2022), we can obtain a set of measures that can be used as mediators to investigate the differences in the effectiveness of games between groups. It is possible that differences in gaming the system may explain some of the differences in learning game effectiveness between groups—it is quite plausible that some combination of disengaged behavior, affective state, and self-regulated learning will explain many of these differences. Replicating the analytical methods used in the current study, future research can investigate how specific aspects of student identity and individual differences (e.g., race/ethnicity, cultural backgrounds, game preferences) influence how students interact with digital learning games and specific game features differently, and how these differences influence learning outcomes. Results from this line of research will expand current understanding of why digital learning games work and for whom they work, helping to produce digital learning games with more equitable and positive impacts. Additionally, while we focused on gender differences, it may be that other factors such as self-efficacy or game interest—factors that are known to often vary by gender and impact learners’ experiences with digital learning games—could also predict disengagement and learning outcomes (Louis & Mistele, 2012; Riconscente, 2013; Sitzmann, 2011).

A third key area of future work involves broadening the conceptualization of gender applied in our current study. When the Decimal Point team initially collected the data used in this paper’s secondary data analyses, gender was treated as a binary categorization. However, gender is increasingly understood in research as a complex and dynamic set of traits that go beyond the birth-assigned, binary representation (e.g., Hyde et al., 2019). Using the birth-assigned, binary categorization, as our paper’s current analysis necessarily did, oversimplifies the complexity of gender and overlooks within-gender heterogeneity and variation in gendered behavior. As shown in Santos et al. (2006), larger differences are often seen when comparing students in terms of self-reported gendered traits (i.e., masculine, feminine, androgynous and undifferentiated traits) instead of binary, birth-assigned genders (i.e., female and male). As such, future work along these lines should leverage a richer understanding of gender, studying students in terms of a multidimensional gender framework that better captures the complexity of gender. For instance, future work could complement binary measures of gender with categorizations of additional dimensions, such as gender identity (Wood & Eagly, 2015) and gender typicality (Egan & Perry, 2001). In fact, Nguyen et al. (2023) have already taken a step in this direction by presenting a game survey to over 300 elementary and middle school students and analyzing it according to multiple dimensions of gender. Using a multidimensional gender framework for analyses will help to explicate not just the overall relationships between gender, engagement, and learning, but also which more nuanced elements and aspects of gender play these roles.

Conclusions

Overall, this paper’s results show that gender interacts with student behavior and learning in complex ways within digital learning games. Previously documented effects for the game Decimal Point indicating that female students have better learning outcomes were explored, using an automated measure of a disengaged behavior, gaming the system. Prior work did not clearly indicate what aspects of the learning activities these differences manifested in. Our current results indicate that female students are less likely to game the system than male students on self-explanation steps, and that this difference in behavior mediates the difference in learning outcomes previously observed.

This pattern of results highlights the importance of delving into the fine-grained details of student behavior to understand differences in learning, and the role that automated detectors making inference from student log data can play in this type of research. The results also highlight the importance of examining students’ interactions with digital learning games in a more comprehensive way that takes users’ gender into consideration.

Going forward, our results show that while there are ample studies investigating the features that make a digital learning game effective, it is equally important to understand how games influence students’ learning behaviors and how individual differences, such as gender, can predict differences in such behaviors and learning outcomes. Such an approach is critical for building understanding of when and how different game features will benefit specific students. Through understanding how different students interact with digital learning games, our field can work towards designing and developing digital learning games that are more equitable and ultimately more effective.