Using latent variable models to make gaming-the-system detection robust to context variations

Huang, Yun; Dang, Steven; Elizabeth Richey, J.; Chhabra, Pallavi; Thomas, Danielle R.; Asher, Michael W.; Lobczowski, Nikki G.; McLaughlin, Elizabeth A.; Harackiewicz, Judith M.; Aleven, Vincent; Koedinger, Kenneth R.

doi:10.1007/s11257-023-09362-1

Using latent variable models to make gaming-the-system detection robust to context variations

Open access
Published: 18 May 2023

Volume 33, pages 1211–1257, (2023)
Cite this article

Download PDF

You have full access to this open access article

User Modeling and User-Adapted Interaction Aims and scope Submit manuscript

Using latent variable models to make gaming-the-system detection robust to context variations

Download PDF

Yun Huang¹,
Steven Dang²,
J. Elizabeth Richey³,
Pallavi Chhabra¹,
Danielle R. Thomas¹,
Michael W. Asher⁴,
Nikki G. Lobczowski³,
Elizabeth A. McLaughlin¹,
Judith M. Harackiewicz⁴,
Vincent Aleven¹ &
…
Kenneth R. Koedinger¹

1369 Accesses
1 Altmetric
Explore all metrics

Abstract

Gaming the system, a behavior in which learners exploit a system’s properties to make progress while avoiding learning, has frequently been shown to be associated with lower learning. However, when we applied a previously validated gaming detector across conditions in experiments with an algebra tutor, the detected gaming was not associated with reduced learning, challenging its validity in our study context. Our exploratory data analysis suggested that varying contextual factors across and within conditions contributed to this lack of association. We present a new approach, latent variable-based gaming detection (LV-GD), that controls for contextual factors and more robustly estimates student-level latent gaming tendencies. In LV-GD, a student is estimated as having a high gaming tendency if the student is detected to game more than the expected level of the population given the context. LV-GD applies a statistical model on top of an existing action-level gaming detector developed based on a typical human labeling process, without additional labeling effort. Across three datasets, we find that LV-GD consistently outperformed the original detector in validity measured by association between gaming and learning as well as reliability. LV-GD also afforded high practical utility: it more accurately revealed intervention effects on gaming, revealed a correlation between gaming and perceived competence in math and helped understand productive detected gaming behaviors. Our approach is not only useful for others wanting a cost-effective way to adapt a gaming detector to their context but is also generally applicable in creating robust behavioral measures.

Modeling and Studying Gaming the System with Educational Data Mining

Gamification for smarter learning: tales from the trenches

Article Open access 21 May 2015

Predictors and Outcomes of Gaming in an Intelligent Tutoring System

1 Introduction

Assessing students’ engagement levels or motivation from their interaction behaviors in digital learning environments is a compelling challenge both practically and theoretically. Practically, valid behavioral assessment of student engagement can drive adaptations that adjust to students’ needs, leading to greater learning and motivation; theoretically, it can be used to better understand when and why interventions or system designs work for enhancing student learning or motivation. One frequently explored behavioral indicator of student engagement is “gaming the system” (abbreviated as “gaming” in this paper), which is defined as “attempting to succeed in an educational environment by exploiting properties of the system rather than by learning the material and trying to use that knowledge to answer correctly” in the seminal works from Baker et al. (2006a, b, 2008a, b). Typical gaming-the-system behaviors include help abuse (e.g., copying the answer from a hint, repeated help requests) and systematic guessing (e.g., quickly answering after errors, making successive errors) (Paquette et al. 2014). Many studies have demonstrated that gaming the system is associated with poor learning outcomes in the short term or the long term (Almeda and Baker 2020; Baker, Corbett, Koedinger, and Wagner 2004; Cocea et al. 2009; Peters et al. 2018; San Pedro et al. 2013). Prior research suggests that interventions directly targeting gaming can reduce gaming behaviors (Baker, Corbett, Koedinger and Roll 2006a; Walonoski and Heffernan 2006b) and improve learning (Baker, Corbett, Koedinger and Roll 2006b), demonstrating the practical value of gaming detection. Recent work (Richey et al. 2021) has also shown that the positive effect of learning with an educational game was fully mediated by lower levels of gaming the system, showcasing the theoretical value of gaming detection for understanding how a specific intervention influences learning.

Over the course of development of gaming detectors, one problem that challenges the validity and practical effectiveness of gaming detectors has emerged but has not received enough attention: detected gaming is not always associated with poorer learning. In the seminal and subsequent works on gaming detectors, the avoidance of learning has been explicitly stated in the definitions of gaming (Baker et al. (2006a, b, 2008a, b; Cocea et al. 2009; Muldner et al. 2011). Thus, theoretically, the unproductiveness or harmfulness for learning is implied in the gaming construct.^{Footnote 1} Practically, a gaming detector that only detects behaviors unproductive or harmful for learning also has higher effectiveness than a detector that does not have such a constraint. If a system intervenes when students are productively engaged due to false alarms of gaming, it may impair learning and reduce students’ trust in the system, leading to actual disengagement and greater learning impairment. Thus, the negative association between gaming measures and learning should be an important aspect of the validity of a gaming detector. However, this property is not inherently guaranteed in the operationalization of gaming, i.e., the detected gaming behaviors.

Our direct application of a previously validated gaming detector (Paquette et al. 2014) showed that detected gaming behaviors were associated with higher learning for a large proportion of students, and were not associated with learning for the overall population (reported later in this paper). Several prior works have also identified cases where detected gaming behaviors were not harmful for learning (Baker, Corbett and Koedinger 2004; Baker et al. 2008a, b; Cocea et al. 2009), or even appeared to be productive for learning (Shih et al. 2008). For example, in some cases the identified behavior of bypassing hints in search of bottom-out hints acting as worked examples can be viewed as “a positive meta-cognitive strategy that is only related to gaming the system at a surface level” (Baker et al. 2013, p. 16). Such productive detected gaming where gaming^{Footnote 2} was associated with higher learning reduces the validity and practical effectiveness of the gaming detectors.

To address this, we only found one approach proposed in Baker et al. (2008a, b), where they used pretest and posttest scores to constrain the detected gaming labels to be assigned only to actions from students with low learning gains and low pretest scores. This approach utilizes the information from the “effects,” i.e., learning gains; we wonder whether we could utilize information from the “causes,” i.e., contextual factors that trigger productive detected gaming, which allows for generating more actionable insights than the former approach. Limited contextual factors have been considered during the human labeling process (in text replays or live observations), which usually sets the “ground truth” for developing gaming detectors (Baker, Corbett, Koedinger, and Wagner 2004; Baker and de Carvalho 2008; Walonoski and Heffernan 2006a). Typically, human coders make judgments of gaming based on observed information (e.g., action time, correctness, help request) from an individual student within a segment (e.g., five consecutive actions in text replays or 20 s in live observations). They do not interpret a student’s behavior in relation to the population behaviors on the task, i.e., the general propensity of a task to trigger detected gaming due to its design; nor do they consider a student’s knowledge on the task or the learning progression across tasks beyond the current segment. The limited contextual information used by human coders makes labeling faster and easier but risks introducing bias. For example, a student’s multiple fast, wrong attempts on drop-down menus in a segment may be labeled as gaming considering only her information within the segment, but if she did not deviate from the general behavior of the population on such steps and got similar steps correct in the future, then it may be more accurate to label these actions as not gaming.

To integrate contextual factors that may account for productive behaviors but are not considered in the original labeling processes, we propose latent variable-based gaming detection (LV-GD), an approach that integrates contextual factors in a cost-effective way for more valid and robust gaming assessment. The cost-effectiveness of LV-GD lies in applying a statistical model on top of an existing gaming detector developed based on a typical human labeling process, without additional labeling effort. The validity and robustness of LV-GD benefits from interpreting a student’s behaviors in relation to the population's behaviors in the same contexts represented by critical contextual factors. In the following subsections, we review related work, motivate the current work, and then introduce our approach and study in more depth.

1.1 Contextual factors for detected gaming behaviors

In this section, we review two kinds of contextual factors that might trigger productive behaviors detected as gaming, but are not considered in typical human labeling processes. One kind is task features, i.e., characteristics of a problem, a set of problems (e.g., lessons, sections), or a system. Baker et al. (2009) examined a year-long log dataset with 22 different lessons of Cognitive Tutor Algebra and identified a set of task features that explained 56% of the variance in gaming, over five times the degree of variance explained in any prior study of student individual differences and gaming. For example, one such feature is “proportion of hints in each hint sequence that refer to abstract principles.” Results showed that gaming was more frequent in lessons that were abstract, ambiguous, and had unclear presentation of the content or task. Although the authors did not investigate whether detected gaming behaviors in such contexts were associated with better learning, results did suggest that in less well-designed tasks, students may game to acquire necessary information to perform the task. In another study, Baker (2007) found that lessons explained over three times as much of the variance in gaming as student individual differences did. In particular, 31% of lessons had average gaming frequencies higher than 20% with three lessons even reaching 40%. Paquette and Baker (2017) also found that differences in gaming behaviors were more strongly associated with the learning environments than with student populations. Although this indicates the important role of task features in explaining gaming, they have not been considered in typical human labeling processes.

Another kind of contextual factor is students’ knowledge levels on tasks. Roll et al. (2014) showed that on steps for which students have low prior knowledge, avoiding help and entering wrong answers repeatedly (which may be traditionally considered as systematic guessing, a form of gaming) is associated with better learning than seeking help. Shih et al. (2008) provided evidence that when students bypass hints to get bottom-out hints (traditionally considered as help abuse, a form of gaming), they are sometimes seeking worked examples. Dang and Koedinger (2019) also suggested that detected gaming can be a desirable adaptive learning behavior when students encounter challenges far beyond their abilities. However, students’ knowledge levels have not been considered in typical human labeling processes. Moreover, it has not been investigated whether these two contextual factors can have an interaction effect on detected gaming behaviors.

1.2 Existing gaming detectors

Past research has developed two classes of gaming detectors: knowledge-engineering models and machine-learned models. In knowledge-engineered models, experts develop rational rules (sometimes called patterns) that can predict well human labels of gaming, and such rules are used to identify gaming behaviors (Muldner, et al. 2011; Paquette et al. 2014; Walonoski and Heffernan 2006b). In machine-learned models, a function between a set of features (e.g., correctness on a step) and human coded gaming labels is learned on a given dataset where only predictive features are maintained in the model, and the final model is used to identify gaming behaviors (Baker et al. 2008a, b; Pardos et al. 2014; Walonoski and Heffernan 2006a). In defining rules or features for the detectors, the emphasis has mainly been put on student features (Muldner, et al. 2011; Paquette et al. 2014; Pardos et al. 2014), such as how a student utilizes help (Aleven et al. 2006) or makes errors (Walonoski and Heffernan 2006a), or a student’s estimated knowledge on the related skill (Baker et al. 2008a, b). Task features have received less attention. We have only identified two machine-learned gaming detectors that incorporated task features such as interfaces (e.g., multiple-choice or textbox; Baker et al. 2008a, b; Walonoski and Heffernan 2006a). Meanwhile, in knowledge-engineered gaming detectors, task features typically are not considered, i.e., rules to identify gaming behaviors are usually described in a task type-independent way (Muldner, et al. 2011; Paquette et al. 2014).

Among existing gaming detectors, one stands out due to its superior performance in recent comprehensive evaluations in terms of generalizability, interpretability, and development cost in new contexts: the knowledge-engineered gaming detector (Paquette et al. 2014), which is referred to as KE-GD in this paper. KE-GD was developed by using cognitive task analysis to elicit knowledge about how experts code students as gaming or not in Cognitive Tutor Algebra (Koedinger and Corbett 2006). It consists of 13 patterns of students’ systematic guessing and help abuse behaviors. KE-GD represents the broad class of behavioral detectors that are built based on rational rules specified by experts. KE-GD has been initially validated by its acceptable predictive performance on human labeled gaming (Paquette et al. 2014). Recent comprehensive studies (Paquette and Baker 2019; Paquette et al. 2015) further compared KE-GD with two separately validated, representative gaming detectors across multiple datasets collected from different systems: a machine-learned model (Baker and de Carvalho 2008), and a hybrid model (Paquette et al. 2015) that combines both knowledge engineering and machine learning. The comparisons focused on predictive performance of human labels of gaming in held-out test sets in the original data and two new datasets collected from two other learning environments; the comparison also considered the interpretability of models. Results showed that KE-GD achieved greater generalizability to new datasets (or systems) and interpretability than the machine-learned model, and achieved comparable to slightly better generalizability and interpretability than the hybrid model. Although the initial cost in developing KE-GD was higher than that of the machine-learned model, it could be directly used in new datasets without further cost since actions that match any of the 13 patterns can be directly labeled as gaming. However, one may need to retrain the machine-learned or hybrid model, which needs a machine-learned model as input, given the much lower (and even unacceptable) predictive performance of the machine-learned model than KE-GD on new datasets. However, despite its proven advantages, the gaming patterns of KE-GD do not consider task features or students’ knowledge levels, and the association between its detected gaming and learning has not been examined in prior studies. This raises questions of the robustness of KE-GD on systems or datasets beyond the ones examined by the authors.

1.3 Evaluation methods for gaming detectors

In past work, the standard procedure to evaluate a gaming detector is as follows (Baker et al. 2008a, b; Pardos et al. 2014; Walonoski and Heffernan 2006a): two or more human coders label gaming on student attempts or actions by classroom observations or text replays; if the inter-rater reliability is acceptable, these labels are used as the ground truth to develop detectors where a detector with better predictions of human labels are preferred. However, as mentioned earlier, human labels could contain bias (due to not considering task features or individuals’ knowledge) against which even a high inter-rater reliability cannot safeguard.

One evaluation method that addresses this concern is to examine the association between gaming estimates and learning, which is also intrinsically required by the standard definition of gaming (Baker et al. 2006a, b, 2008a, b). Some prior works have examined and found higher gaming levels to be associated with lower learning (Baker, Corbett, Koedinger, and Wagner 2004; Mogessie et al. 2020; Muldner, et al. 2011; Richey, et al. 2021), but others have not (Dang and Koedinger 2019; Paquette and Baker 2019; Paquette et al. 2015). Examining the relation with learning has not been generally considered as an integral part of evaluating gaming detectors. However, a gaming detector can be viewed as an instrument for assessing the gaming construct, and thus, validity is of great relevance. Validity provides “an overall evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores or other modes of assessment” in the seminal work (Messick et al. 1995). In particular, the association between gaming and learning is closely related to external validity (i.e., does the test have convergent, discriminant, and predictive qualities?).

Besides validity, there are other desirable properties for a gaming detector that can be considered in the evaluation. Reliability measures the consistency (e.g., correlation) of results of an instrument over multiple samples. Different samples can be collected in various ways, such as across time or across subsets of items, which correspond to different types of reliability with distinct focuses (e.g., test–retest reliability, split-half reliability). Reliability does not imply validity, but it places a limit on the overall validity of an instrument that aims at measuring stable attributes or traits of people. An instrument with both high reliability and validity is usually desirable. To the best of our knowledge, only one prior study has examined and demonstrated the reliability of their proposed gaming detector (Muldner, et al. 2011) where they focused on correlations of gaming estimates from two buckets by a random split of problems or students.

Generalizability is also a desirable property of a gaming detector. It refers to how well a detector developed based on a sample can make predictions or estimations on a new sample, where a new sample can be from a different set of problems, a different student population, or a different system. It has some overlaps with reliability but emphasizes the performance on the new sample rather than the consistency between the estimates of the two samples. This aspect has been examined in various prior works (Baker et al. 2008a, b; Paquette et al. 2015).

1.4 Latent variable models and the trait-like property of gaming

Obtaining a student-level gaming assessment is valuable for studying the relation between student-level attributes (e.g., motivation, learning gain) and gaming for understanding causes of disengagement or the effect on gaming of an intervention. Existing gaming detectors have focused on observed action-level gaming assessment, i.e., whether an action is part of a sequence of gaming behaviors, and the student-level gaming assessment is obtained by computing the proportion of gamed actions (i.e., observed gaming frequencies) or the average of predicted probabilities of gaming on actions for each student (Baker et al. 2008a, b; Paquette and Baker 2017; Razzaq et al. 2005). However, such direct aggregation may be prone to bias as illustrated earlier. Dang and Koedinger (2019) showed that observed gaming frequencies failed to correlate with motivation, while a statistical model that estimates a latent gaming tendency on a student level controlling for curricular sections yielded strong correlations between motivation and gaming. This is an example of a latent variable model, although the authors did not explicitly describe it as such. Latent variable models (LVMs) estimate values of latent theoretical variables (e.g., abilities, attitudes) based on observed response variables (e.g., task performance, survey ratings) through a statistical model that models the observed variables as a function of the latent variables. They have been widely used in knowledge modeling for estimating students’ abilities or knowledge levels (Desmarais and Baker 2012). One widely used LVM for ability assessment is item response theory (De Boeck and Wilson 2004). Item response theory models the observed correctness (responses) on each item (e.g., a problem step) of each student as a function of item difficulties and student abilities and thus provides an ability estimate for each student controlling for item difficulties. In essence, if a student performs better than the expected performance of the population on the items, the student is estimated as having a high ability; for two students with the same proportion correct over items, the one who can get harder items correct is estimated as having a higher ability than the one who only can get easier items correct. Ability estimates obtained in this way are more accurate than simply looking at the proportion correct over all items. Despite the prevalence of LVMs in knowledge modeling, their application in behavior modeling has been limited. Only a handful of papers have applied LVMs to estimate students’ affects or attitudes where latent variables were theoretical constructs measured from surveys such as cognitive appraisal (Sabourin, Mott, and Lester 2011) or attitudes toward learning (Arroyo and Woolf 2005).

To use LVMs in gaming detection, one requisite assumption is the existence of a trait-like property of gaming, for which some evidence has been accumulated. Baker et al. (2008a, b) and Dang and Koedinger (2019) both showed that gaming was associated with a range of survey measures that measured students’ motivational goals, beliefs, and dispositions at the beginning of the use of the system. Muldner et al. (2011) found that student factors explained a much higher proportion of variance than problems (50% vs. 19%) and were significantly more consistent than problems in gaming proportions across randomly split samples. Although some studies support the state-like property of gaming (see those cited in Sect. 1.1), they could not rule out the trait-like property of gaming by their analyses. Several studies suggest that gaming is a mixture of state and trait (Dang and Koedinger 2019; Muldner et al. 2011; Peters et al. 2018), or the domination of the state-like or trait-like property depends on the system design (Botelho et al. 2021). For example, Muldner et al. (2011) showed that a regression model with both students and problems as predictors explained more variance (61%) than students alone (50%) or problems alone (19%), and both predictors were significant. Thus, we can view the gaming construct at two interconnected levels: the latent student level, which corresponds to trait-like gaming tendencies, and the observed action level, which corresponds to state-like gaming behaviors affected by contextual factors and latent gaming tendencies. If we are interested in obtaining a student-level gaming measure for student-level analyses, we should consider extracting the trait-like component of gaming from behaviors across contexts, and LVMs offer an effective way to do so.

1.5 The current study

In the current study, we demonstrate a new approach, latent variable-based gaming detection (LV-GD), that integrates contextual factors in a cost-effective way for more valid and robust gaming detection. We report our comprehensive evaluation and applications of LV-GD to support the validity, robustness, and usability of LV-GD. In addressing the gaps in existing research pointed out above, our work makes two main contributions. One contribution is a general cost-effective approach that can adapt an existing gaming (or other behavior) detector to a new context by integrating contextual factors not originally considered. Another contribution is the use of latent variable modeling in behavior assessment, showing the value of latent trait level assessment which is different from the dominant observed action level assessment.

LV-GD estimates a latent gaming tendency for each student controlling for the propensity of contexts to trigger detected gaming. Essentially, it makes student-level judgments about gaming by looking at a student’s behaviors across different contexts and in relation to the population-level behaviors in the same contexts. A student is estimated as having a high gaming tendency if the student is detected to game more than the expected levels of the population in the same contexts. LV-GD can do so by latent variable modeling (and generalized mixed effect modeling specifically) that predicts the action-level gaming judgements from an existing gaming detector given hypothesized latent factors that trigger detected gaming, without additional requirements on human labeling. From the fitted student random intercepts, we obtain gaming tendency estimates as the gaming measure. In the current study, LV-GD is used on top of the gaming detector KE-GD, a previously validated detector (Paquette and Baker 2019; Paquette et al. 2015); however, LV-GD can be used on top of any action-level gaming detectors.

Our study on LV-GD is reported in the following structure. Section 2 describes the development of LV-GD. We started with applying KE-GD on a dataset collected from experimentation with an algebra tutor. Observing the lack of association between detected gaming and learning, we conducted an iterative exploratory data analysis informed by prior research, and identified overlooked contextual factors. We then formulated LV-GD by incrementally integrating the contextual factors and exploring variants, chose the best model, and established initial validity. In Sect. 3, we examine the generalizability of LV-GD on new datasets in terms of generating gaming measures that can be negatively associated with learning. We compared LV-GD with KE-GD in nine contexts, obtained from three datasets with three condition configurations per dataset. To further support the validity of LV-GD, in Sect. 4 we examine reliability of LV-GD in comparison with KE-GD, which can also be viewed as further evaluating generalizability: how well does a gaming detector generalize to new contexts in terms of having consistent latent gaming tendencies? In Sect. 5, we demonstrate three applications of LV-GD: to study intervention effects on gaming, to explore the relation between gaming and motivation, and to help understand productive detected gaming behaviors through a qualitative analysis. Finally, in Sect. 6 we conclude and discuss the results.

2 Development of LV-GD

2.1 The tutor

We used datasets collected from an algebra intelligent tutoring system for middle and high school students (Huang et al. 2021). Students learn about writing algebraic expressions in story problems in various formats: writing an expression in a textbox with dynamic scaffolding steps that appear if a student fails in the original question (text format as shown in Fig. 1); writing expressions in a table where the main question step and scaffolding steps are accessible at any time and are all required (table format as shown in Fig. 2); explaining a set of expressions extracted from a given equation by choosing the matching textual description from a dropdown menu for each expression (menu format as shown in Fig. 3); and given an equation, writing a set of expressions that match a given set of textual descriptions (flipped-menu format as shown in Fig. 4). These tasks also vary in the complexity of the expressions involved (e.g., one or two operators).

The algebra tutor was continuously redesigned and tested in three experiments with different student populations across 3 years. In each experiment (eight sessions over 4 weeks), we compared two versions of the tutor corresponding to two conditions differing in task design and sequencing. The control (CT) condition, corresponding to the original tutor, provided a normal deliberate practice schedule, where students received tasks with feedback and as-needed repetition for improving critical aspects of performance. Students received full tasks representing the full version of the problem and were required to fill in all steps (including scaffolding steps) given a cover story. There were three consecutive units: the first unit contained all the table tasks, the second unit contained less complex menu and flipped-menu tasks, and the third unit contained more complex menu and flipped-menu tasks. Steps were labeled with coarser-grained knowledge components (KCs; skills). Students received individualized practice until reaching mastery of all KCs in a unit before moving on to the next unit. Across the three experiments, the design of the control condition remained the same. The experimental (EXP) condition corresponds to a redesigned tutor based on data mining outcomes such as a refined KC model revealing hidden difficulties after original KCs were split to differentiate easier and harder use cases. It provided an intense deliberate practice schedule where students practiced on a larger number of KCs with a higher variety of tasks targeting different subsets of KCs. Focused tasks were introduced to reduce over-practicing easier KCs and target particularly difficult KCs. Examples of focused tasks include: text format tasks asking for the final expression without the mandatory intermediate steps required in the table task; text format tasks that further remove the story and focus on learning algebraic grammar rules; and simpler menu and flipped-menu tasks with equations less complex than the original equations. There were three or more learning units where different task formats or task types (full or focused) were interleaved in each unit. Students received individualized practice until reaching mastery of all KCs in a unit before moving on to the next unit. Across the three experiments, the design of the experimental condition was continuously refined to promote greater learning. Our prior work has shown that the experimental condition led to better learning outcomes compared to the control condition (Huang et al. 2021). Here, we are interested to see whether the experimental condition also led to higher behavioral engagement, particularly lower levels of gaming the system, and whether gaming was linked with motivation. We started our investigation with the first dataset collected from the first experiment explained below.

2.2 A previously validated gaming detector did not generalize

We chose a previously validated knowledge-engineered gaming detector, KE-GD, as the starting point for studying students’ behavioral engagement when using the algebra tutor. KE-GD contains 13 interpretable patterns modeling systematic guessing and help abuse. For example, one pattern is “the student enters an incorrect answer, enters a similar and incorrect answer in the same part of the problem and then enters another similar answer in the same part of the problem.” It is coded as “incorrect → [similar answer] [same context] & incorrect → [similar answer] & [same context] & attempt,” consisting of constituents such as “[similar answer]” (judged by Levenshtein distance), and action types such as “attempt” (correct or incorrect) or “help.” If a sequence of actions (i.e., attempts on steps) matches any one of the 13 patterns, then all actions involved are labeled as gaming. Details of the patterns and the validation of KE-GD could be found in (Paquette and Baker 2019; Paquette, de Carvalho, and Baker, 2014).

We used KE-GD to label actions as gaming or not and then examined its validity. We defined two metrics of validity in the current study, both of which evaluate the association between gaming and learning. The primary metric was the correlation between gaming levels and normalized learning gains over students. For each student, we computed a gaming level using the proportion of gamed actions (referred to as proportion of detected gaming or detected gaming (proportion)) for KE-GD, or the estimated gaming tendency for LV-GD (explained in Sect. 2.3.2); we computed the normalized learning gain using the widely adopted formula, (posttest—pretest) / (1- pretest). We used Spearman correlation (rho) because it is less sensitive to outliers than Pearson correlation. As a supplementary metric, we conducted a regression analysis predicting posttest scores controlling for pretest scores and gaming levels over students and examined the coefficient of the variable of gaming levels. We considered negative correlations and coefficient values at a significance level of 0.10 as acceptable validity. Prior studies have used significance levels of 0.05 and 0.10 for correlation analyses involving behavior measures (Baker et al. 2004a, b; Dang and Koedinger 2019; Shih et al. 2008).

Two observations emerged. First, the detected gaming proportion 18% (last column in Table 1) was much higher than the previously reported proportions (3.5% in Dang and Koedinger (2019) and 6.8% in Paquette et al. (2014)) of the same detector in other math intelligent tutoring systems. Second, there was a lack of association between detected gaming and learning (correlation: rho = − 0.02, p = 0.86; regression coefficient: b = 0.07, p = 0.69), challenging KE-GD’s validity in our context.

Table 1 Statistics of the Fall 2019 dataset including gaming levels detected by KE-GD

Using latent variable models to make gaming-the-system detection robust to context variations

Abstract

Similar content being viewed by others

Modeling and Studying Gaming the System with Educational Data Mining

Gamification for smarter learning: tales from the trenches

Predictors and Outcomes of Gaming in an Intelligent Tutoring System

1 Introduction

1.1 Contextual factors for detected gaming behaviors

1.2 Existing gaming detectors

1.3 Evaluation methods for gaming detectors

1.4 Latent variable models and the trait-like property of gaming

1.5 The current study

2 Development of LV-GD

2.1 The tutor

2.2 A previously validated gaming detector did not generalize

2.3 Identifying and integrating contextual factors to improve validity

2.3.1 Identifying the effect of task formats

2.3.2 The basic latent variable model controlling for task formats

2.3.3 Identifying other contextual factors

2.3.4 The full latent variable model accounting for critical contextual factors

3 Generalizability of LV-GD

4 Reliability of LV-GD

4.1 Temporal split

4.2 One-vs-rest format split

4.3 Random format split

4.4 The effect of format context similarity

5 Applications of LV-GD

5.1 Studying intervention effects on gaming

5.2 Studying the relation between motivation and gaming

5.3 Understanding productive detected gaming

6 Conclusions and discussion

6.1 Validity and generalizability of LV-GD

6.2 Implications for developing gaming detectors

6.3 Implications for system design

Notes

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation