Psychometrika, Volume 78, Issue 3, pp 481–497

Using Deterministic, Gated Item Response Theory Model to Detect Test Cheating due to Item Compromise

Authors

  • Z. Shu, Educational Testing Service
  • Robert Henson, The University of North Carolina at Greensboro
  • Richard Luecht, The University of North Carolina at Greensboro
DOI: 10.1007/s11336-012-9311-3

Cite this article as:
Shu, Z., Henson, R. & Luecht, R. Psychometrika (2013) 78: 481. doi:10.1007/s11336-012-9311-3
Abstract

The Deterministic, Gated Item Response Theory Model (DGM, Shu, Unpublished Dissertation. The University of North Carolina at Greensboro, 2010) is proposed to identify cheaters who obtain significant score gain on tests due to item exposure/compromise by conditioning on the item status (exposed or unexposed items). A “gated” function is introduced to decompose the observed examinees’ performance into two distributions (the true ability distribution determined by examinees’ true ability and the cheating distribution determined by examinees’ cheating ability). Test cheaters who have score gain due to item exposure are identified through the comparison of the two distributions. Hierarchical Markov Chain Monte Carlo is used as the model’s estimation framework. Finally, the model is applied in a real data set to illustrate how the model can be used to identify examinees having pre-knowledge on the exposed items.

Key words

cheating; model estimation

Test cheating (Cizek 1999) is defined as any activity that violates the established rules governing the administration of a test, and it occurs at all levels of schooling. Most organizations involved in high-stakes testing are fairly certain that test cheating or compromise (usually by item over-exposure and memorization) occurs widely, especially when many items and test forms must be exposed to accommodate on-demand computer-based testing (CBT). Most of the risk of item exposure stems from seating capacity, scheduling limitations, and item development costs (Luecht, 1998, 2005; Drasgow, Luecht, & Bennett, 2006). Such test cheating has an even greater impact on test validity when it is facilitated by modern technologies. For example, item sharing through internet collaboration can compromise a large number of items in a very short time (Luecht, 1998, 2005). In addition, cheating due to item over-exposure occurs with higher frequency in K-12 settings. Teachers may share items with their students to help increase the average score in their classes, and students may purposely memorize items with the aim of obtaining a higher score that would show their learning effectiveness. Cheating incidents will continue to increase as greater emphasis is placed on test results to evaluate student knowledge, the educational quality of teachers, and the quality of schools.

The Deterministic, Gated Item Response Theory Model (DGM; Shu 2010) was proposed to detect test cheating that results from item over-exposure. Specifically, this model addresses cheating that has occurred because the examinees have had previous access to an item. The DGM classifies test takers as cheaters or non-cheaters by conditioning on two mutually exclusive item types. The first type of item is one that has probably been compromised; such items could be identified based on empirical exposure counts, time in use, or other indicators (called “exposed items”). The second type of item is considered a secure item due to its recent release or other factors (called “unexposed items”) (Segall 2002). Notice that, in many ways, the exposure of items acts as a gate through which cheating is possible: even students with a tendency to cheat are not able to cheat on secured items. The DGM identifies potential test cheaters by computing the score gain on the exposed items relative to the unexposed items. The DGM attributes observed item performance either to an examinee’s true-proficiency function or to a response function driven by a cheating ability. The gating mechanism and specific choice of parameters in the model further allow estimation of a statistical cheating effect at the level of individual examinees or groups (e.g., individuals suspected of collaborating), and identification of examinees’ real competence level. In this context, “gating mechanism” refers to the process of defining those items that have been exposed and thus could be cheated on, as opposed to unexposed items, on which both cheaters and non-cheaters are expected to behave in the same way.

A fair amount of research has focused on cheating detection; however, most of this work has focused on cheating by means of copying answers (e.g., Angoff 1974; Frary, Tideman, & Watts, 1977; Hanson, Harris, & Brennan, 1987; Dwyer & Hecht 1996; Holland 1996; Stocking, Ward, & Potenza, 1998; Watson, Iwamoto, Nungester, & Luecht, 1998; Sotaridona 2003; Sotaridona & Meijer 2003; Wollack, 1997, 2006; Wollack & Cohen 1998; Wollack, Cohen, & Serlin, 2001; van der Linden & Sotaridona, 2004, 2006) and person fit indices (e.g., Levine & Rubin 1979; Drasgow, Levine, & Williams, 1985; Tatsuoka 1996; see Meijer 1996). Although most of this research has proven useful, these methods are rarely discussed in the context of test cheating as a result of item compromise/exposure. For example, several answer-copying indices need to identify a “source”, the person from whom cheaters copy their answers; however, it is difficult to identify the source when cheating is based on item exposure/memorization. In addition, various person fit indices (Nering, 1996, 1997) tend to lose statistical power in detecting test cheaters when a large proportion of the examinee population cheats.

Segall (2002) proposed a test cheating model to detect cheating due to item compromise by assuming that cheaters correctly respond to compromised items with 100 % certainty. It is reasonable to believe that test cheating activities should increase test cheaters’ observed performance. However, it seems unlikely that every examinee will memorize every exposed item with certainty and be able to retrieve those answers while taking the test. Hence, Segall’s cheating detection model makes a strong assumption that may limit its flexibility, its statistical power to correctly classify examinees, and its ability to estimate the magnitude of the cheating in the population. Compared to Segall’s model, the DGM assumes that cheaters generally have a higher probability of correctly answering a set of compromised items, but does not assume that each of the compromised items is answered correctly with 100 % certainty.

The DGM differs from the existing cheating detection methods in its practical functioning and methodological design, and should be a beneficial addition to them. In this article, the model structure and its distinction from models such as the Multidimensional Item Response Theory model and the Mixture Rasch Model are addressed, followed by its estimation framework. A simulation is designed to illustrate the DGM’s properties and evaluate its Type I and Type II error rates by comparison with the lz index (e.g., Drasgow, Levine, & Williams, 1985; Drasgow & Levine 1986; McLeod & Lewis 1999). Finally, a real data application is used to demonstrate the DGM’s performance in a practical setting.

1 Model Structure

The DGM uses a true ability to characterize examinees’ real competency and a cheating ability to estimate the cheating effectiveness. The structural part of the DGM is defined in Equation (1) and the measurement models specific to the two types of ability are defined in Equations (2) and (3):
$$ P(U_{ij}=1 \mid \theta_{tj},\theta_{cj},T_{j},I_{i},b_{i}) = P(U_{ij}=1 \mid \theta_{cj},b_{i})^{T_{j}I_{i}}\,P(U_{ij}=1 \mid \theta_{tj},b_{i})^{1-T_{j}I_{i}} $$
(1)
and
$$ P(U_{ij}=1 \mid \theta_{tj},b_{i})=\frac{\exp(\theta_{tj}-b_{i})}{1+\exp(\theta_{tj}-b_{i})} $$
(2)
$$ P(U_{ij}=1 \mid \theta_{cj},b_{i})=\frac{\exp(\theta_{cj}-b_{i})}{1+\exp(\theta_{cj}-b_{i})} $$
(3)
$$ \sum_{i} b_{i}=0 $$
(4)
$$ \theta_{tj}<\theta_{cj} $$
(5)
where θtj is the true ability characterizing the jth examinee’s real competency level, θcj is the cheating ability determined by the jth examinee’s cheating effectiveness, bi is the ith item difficulty, Ii is a model input defining the ith item compromise status which is referred to as the “gating” mechanism, and Tj is an indicator variable flagging the jth examinee as a cheater or non-cheater conditioning on the item compromise status. Tj=1 represents that the jth examinee is a cheater, and Tj=0 indicates that the jth examinee is not a cheater. The measurement models in Equations (2) and (3) are Rasch models.

The constraint ∑bi=0 is used to center the item scale at zero, which is common in the Rasch model family (e.g., Rost’s Mixture Rasch Model, 1990). Note that item parameters are not class specific in the DGM, which ensures that the two ability distributions are on a common scale. Equation (5) (θt<θc) expresses the model’s assumption that examinees’ cheating ability should be greater than their true ability. Technically, the DGM can handle either θt>θc or θt<θc, but cannot model both simultaneously. The case in which an examinee cheats on a test and obtains a lower score (θt>θc) rarely occurs in real settings and is less important than the case of θt<θc: on the one hand, cheaters in the case of θt>θc have already been penalized by their cheating activities; on the other hand, cheating in the case of θt<θc is more likely to mislead stakeholders. Thus, the DGM primarily focuses on detecting cheaters who achieve a significant score gain on the exposed items relative to the secured/unexposed items.

The value Ii is dichotomously defined relative to item exposure status, where Ii=1 means that the ith item has been exposed/compromised, and Ii=0 means that the ith item is considered to be secure. For example, items that have not been used in previous forms are considered to be secure (i.e., the unexposed items), and those that have been used before are probably compromised, exposed items (e.g., anchor items). The goal of conditioning on the two item types is to use the information provided by the secured items to infer the level of item compromise contained in the exposed items. In an operational setting, test items will have differing degrees of exposure. In such cases, Segall (2002) suggested that the exposed items can be defined as those exposed to test takers beyond a span of a week, a month, or a year, depending on empirical investigations or policy decisions. Moreover, I in the DGM can also be continuously defined as the degree of exposure/compromise (the value of the continuous I is between 0 and 1). For example, McLeod, Lewis, and Thissen (2003) use item difficulty to model the degree of item compromise, which can also be applied in the DGM.

In this paper, both Ii and Tj are dichotomously defined in the DGM. Therefore, the model can be further broken down to four conditional models:
$$ {P}({U}_{{ij}}=1|\theta_{{tj}},\theta _{{cj}},{T}_{{j}},{I}_{{i}}, {b}_{{i}})=\left\{ \begin{array}{l@{\quad}l}P(U_{ij}=1 | \theta_{tj},{b}_{{i}}), & \mbox{when}\ {T}_{{j}}=0, {I}_{{i}}=0 \\ P(U_{ij}=1 | \theta_{tj},{b}_{{i}}),& \mbox{when}\ {T}_{{j}}=1,{I}_{{i}}=0\\ P(U_{ij}=1 | \theta_{cj},{b}_{{i}}),& \mbox{when}\ {T}_{{j}}=1,{I}_{{i}}=1\\ P(U_{ij}=1 | \theta_{tj},{b}_{{i}}),& \mbox{when}\ {T}_{{j}}=0,{I}_{{i}}=1 \end{array} \right. $$
(6)
When T=0, examinees’ responses to all items are based only on their true ability, θt; however, when T=1 (i.e., for examinees that are cheaters), examinees’ responses to the unexposed items (I=0) are based on their true ability (θt) and responses to the exposed items (I=1) are based on their cheating ability (θc). Two extreme cases occur when T=1. First, if all the items are classified as exposed items, the DGM will be simplified to P(Uij=1|θtj,θcj,Tj,Ii,bi)=P(Uij=1|θcj,bi), i∈[1,n] (n is the total number of items), where the model only estimates the cheating ability. Second, if all the items are unexposed, the DGM can be simplified as P(Uij=1|θtj,θcj,Tj,Ii,bi)=P(Uij=1|θtj,bi), i∈[1,n], where the model only estimates the true ability. In these two extreme cases, the model cannot classify examinees as cheaters or non-cheaters.
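The four conditional models in Equation (6) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code; the function names are my own:

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response given ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def dgm_p(theta_t, theta_c, T, I, b):
    """Gated conditional probability from Equation (6): the cheating ability
    governs the response only when the examinee is a cheater (T=1) AND the
    item is exposed (I=1); otherwise the true ability applies."""
    theta = theta_c if (T == 1 and I == 1) else theta_t
    return rasch_p(theta, b)

# A cheater (T=1) with true ability -1 and cheating ability 1.5:
# the true ability governs unexposed items (I=0),
# the cheating ability governs exposed items (I=1).
p_unexposed = dgm_p(-1.0, 1.5, T=1, I=0, b=0.0)
p_exposed = dgm_p(-1.0, 1.5, T=1, I=1, b=0.0)
```

For a non-cheater (T=0), the same call returns the true-ability probability regardless of the item's exposure status.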

The DGM has two ability variables (the cheating ability θc and the true ability θt) to characterize population characteristics, which is similar to multidimensional item response theory (MIRT) models such as the multidimensional 1PL (M1PL; Rasch 1960). However, in the DGM, an examinee’s probability of correctly answering an item is a function of only one ability, either θt or θc, whereas in MIRT models that probability is conditional on two or more abilities simultaneously. The DGM is a Rasch-based model with a mixture structure, like the Mixture Rasch Model (MRM; Rost 1990). The DGM differs from the MRM in that it classifies examinees into two classes based on their responses to the two item types defined by the gating mechanism, which is not required in the MRM. In addition, item parameters in the DGM are not class specific, whereas item parameters in the MRM are estimated for each latent class: in an MRM with two classes, the number of estimated item parameters is (n−1)∗2, but only n−1 item parameters are estimated in the DGM (n is the total number of items). Finally, the true ability θt characterizes both cheaters and non-cheaters and is estimated in a single distribution; however, θt in the cheating class is characterized only by the unexposed items, while in the non-cheating class it is characterized by all the items. The ability θc is an additional parameter used to characterize cheating effectiveness if and only if examinees are classified as cheaters. In contrast, all of the items of an exam are used to estimate ability in the MRM.

2 Model Estimation

Markov chain Monte Carlo (MCMC; Patz & Junker, 1999a, 1999b) is used to estimate the model parameters of the DGM. The prior distribution of each parameter in the DGM is defined as follows:
$$ \theta_{tj}\sim\mathcal{N}\bigl(\mu_{t},\sigma_{t}^{2}\bigr) $$
(7)
$$ \theta_{cj}\sim\mathcal{N}\bigl(\mu_{c},\sigma_{c}^{2}\bigr) $$
(8)
$$ b_{i}\sim\mathcal{N}(0,1) $$
(9)
$$ \operatorname{Cov}(\theta_{tj},\theta_{cj})=0 $$
(10)
where \(\mathcal{N}(.)\) refers to the normal distribution. As is common practice, the prior of the true ability θt is assumed to be normal with mean μt and standard deviation σt. The prior of the cheating ability θc is a normal distribution with mean μc and standard deviation σc. The cheating ability θc and true ability θt are proposed independently (i.e., the prior correlation between the two abilities is zero). Both the means and variances of the cheating and true abilities (μt, μc, \(\sigma_{t}^{2}\), and \(\sigma_{c}^{2}\)) can be estimated assuming non-informative or informative priors, or set at fixed values (e.g., μt=μc=0 and \(\sigma_{t}^{2}= \sigma_{c}^{2}=1\)). Note that even though the prior distribution assumes that these two abilities (the true ability and cheating ability) are independent, their estimates for the cheaters may be correlated in the posterior distribution.
The “prior” for the values T=1 or T=0 is governed strictly by the relationship between θt and θc. That is, T=1 when θt<θc and T=0 otherwise. In other words, T simply behaves as an indicator of whether the cheating ability is greater than the true ability. Therefore, T has a “deterministic” prior, which is inherently determined by the relationship between the two latent abilities, as shown in Equations (11) to (12).
$$ P(T_{j}=1)=P(\theta_{cj}-\theta_{tj}>0) $$
(11)
$$ P(T_{j}=0)=P(\theta_{cj}-\theta_{tj}\leq0) $$
(12)
Because of this design, it is not the prior of the cheating ability alone but the relationship between the priors of the two abilities that suggests whether an examinee is a cheater. Given that the cheating ability and true ability are proposed independently in the prior, θc−θt has a normal prior distribution, defined in Equation (13):
$$ \theta_{c}-\theta_{t}\sim\mathcal{N} \bigl(\mu_{c}-\mu_{t},\sigma_{t}^{2}+\sigma _{c}^{2}\bigr) $$
(13)
According to Equations (11), (12), and (13), P(T=1)=P(T=0)=0.5 when μc=μt, which means that each examinee has an equal prior probability of being a cheater or a non-cheater (i.e., a non-informative prior). Given that a normal prior for the true ability is common practice, using a normal prior for the cheating ability as well is the key to defining a non-informative prior for the cheating indicator T in this particular case. Although the cheating ability probably follows a distribution other than the normal in the population, the normal distribution is still used as the cheating-ability prior because this prior by itself does not determine whether examinees are cheaters. Furthermore, the hyperparameters μt, μc, \(\sigma_{t}^{2}\), and \(\sigma_{c}^{2}\) are used to represent the DGM estimation framework in general form. The purpose of estimating the hyperparameters is to allow the data to suggest whether the prior of the indicator T should be informative or non-informative, and therefore to increase the MCMC efficiency.
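Because θc−θt is normal under the independent priors, the prior probability P(T=1) is a simple normal tail probability. A minimal sketch, using only the standard library (function names are illustrative, not from the paper):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sd):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))

def prior_p_cheater(mu_t, mu_c, var_t, var_c):
    """Prior P(T=1) = P(theta_c - theta_t > 0); the difference of the two
    independent normal priors is itself normal (Equation (13))."""
    mu_diff = mu_c - mu_t
    sd_diff = sqrt(var_t + var_c)
    return 1.0 - normal_cdf(0.0, mu_diff, sd_diff)

# With equal prior means the prior on T is non-informative: P(T=1) = 0.5.
p = prior_p_cheater(mu_t=0.0, mu_c=0.0, var_t=1.0, var_c=1.0)
```

Raising μc above μt makes the prior informative, pushing P(T=1) above 0.5, which is exactly what estimating the hyperparameters allows the data to do.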
In MCMC, the joint acceptance probability for θcj, θtj and Tj at the kth step, \(\alpha(\theta_{cj}^{k},\theta_{tj}^{k},T_{j}^{k})\), is defined as
$$ \begin{aligned}[b] \alpha\bigl(\theta_{cj}^{k},\theta_{tj}^{k}, T_{j}^{k}\bigr) =& \min \bigl[\bigl[P\bigl({U}_{{j}} \bigl| {b}_{{i}},{I}_{{i}},\theta _{cj}^{k},\theta_{tj}^{k}\bigr)*\mathcal{N}\bigl(\theta_{cj}^{k} \bigl| \mu_{c},\sigma_{c}^{2}\bigr)*\mathcal{N}\bigl(\theta_{tj}^{k} \bigl| \mu _{t},\sigma_{t}^{2}\bigr) \\ &{}* P \bigl(T^k_j \bigl| \theta_{cj}^{k},\theta_{tj}^{k}\bigr) *q\bigl(\theta_{cj}^{k};\theta_{cj}^{k-1}\bigr) *q\bigl(\theta_{tj}^{k};\theta_{tj}^{k-1}\bigr) *q\bigl(T_{j}^{k};T_{j}^{k-1}\bigr) \bigr] \\ &{}\times \bigl[P\bigl({U}_{{j}} \bigl| {b}_{{i}},{I}_{{i}},\theta _{cj}^{k-1},\theta_{tj}^{k-1}\bigr)*\mathcal{N}\bigl(\theta _{cj}^{k-1} \bigl| \mu_{c},\sigma_{c}^{2}\bigr) *\mathcal{N}\bigl(\theta _{tj}^{k-1} \bigl| \mu_{t},\sigma_{t}^2\bigr) \\ &{}* P \bigl(T^{k-1}_j \bigl| \theta_{cj}^{k-1},\theta_{tj}^{k-1}\bigr) *q\bigl(\theta_{cj}^{k-1};\theta_{cj}^{k}\bigr) *q\bigl(\theta_{tj}^{k-1};\theta_{tj}^{k}\bigr) *q\bigl(T_{j}^{k-1};T_{j}^{k}\bigr) \bigr]^{-1}, 1\bigr] \\ [8pt] \end{aligned} $$
(14)
where \(\mathcal{N}(.)\) refers to the normal distribution and q(.) is the proposal density of the estimated parameters in MCMC. The prior distributions of both θt and θc are normal (symmetric distributions); so, in a random-walk Metropolis-Hastings sampler, the proposal densities of the two latent variables cancel out,
$$ q\bigl(\theta_{cj}^{k};\theta_{cj}^{k-1}\bigr)=q\bigl(\theta_{cj}^{k-1};\theta_{cj}^{k}\bigr),\qquad q\bigl(\theta_{tj}^{k};\theta_{tj}^{k-1}\bigr)=q\bigl(\theta_{tj}^{k-1};\theta_{tj}^{k}\bigr) $$
because T is determined independently by the relationship between θt and θc at each step, and therefore \(T_{j}^{k}\) is independent of \(T_{j}^{k-1}\). The proposal density of \(T_{j}^{k}\) given \(T_{j}^{k-1}\), \(q(T_{j}^{k};T_{j}^{k-1})\), can be simplified as
$$q\bigl(T_{j}^{k};T_{j}^{k-1}\bigr) =P\bigl(T_{j}^{k-1}\bigl|T_{j}^{k}\bigr)=\frac {P(T_{j}^{k-1},T_{j}^{k})}{P(T_{j}^{k})}=\frac {P(T_{j}^{k-1})*P(T_{j}^{k})}{P(T_{j}^{k})}=P\bigl(T_{j}^{k-1}\bigr) $$
and similarly the proposal density of \(T_{j}^{k-1}\) given \(T_{j}^{k}\), \(q(T_{j}^{k-1};T_{j}^{k})\) can be further written as
$$q\bigl(T_{j}^{k-1};T_{j}^{k}\bigr)=P \bigl(T_{j}^{k}\bigl|T_{j}^{k-1}\bigr)=\frac {P(T_{j}^{k},T_{j}^{k-1})}{P(T_{j}^{k-1})}=\frac {P(T_{j}^{k-1})*P(T_{j}^{k})}{P(T_{j}^{k-1})}=P\bigl(T_{j}^{k}\bigr) $$
Given the T’s deterministic feature, T can be either 0 or 1 at each step, specifically:
  1. (1)
    If \(\theta_{cj}^{k}>\theta_{tj}^{k}\), then \(T_{j}^{k}=1\)
    $$P\bigl(T_{j}^{k}=1\bigr)=P\bigl(\theta_{c}-\theta_{t}>0 \bigl| \mu_{t}, \mu_{c},\sigma_{t}^{2}, \sigma_{c}^{2}\bigr) $$
     
  2. (2)
    If \(\theta_{cj}^{k}<\theta_{tj}^{k}\), then \(T_{j}^{k}=0\)
    $$P\bigl(T_{j}^{k}=0\bigr)=P\bigl(\theta_{c}-\theta_{t}\leq0 \bigl| \mu_{t},\mu_{c},\sigma_{t}^{2}, \sigma_{c}^{2}\bigr) $$
    therefore,
    https://static-content.springer.com/image/art%3A10.1007%2Fs11336-012-9311-3/MediaObjects/11336_2012_9311_Equ15_HTML.gif
    (15)
     
Similarly, the acceptance probability of the item difficulty bi at the kth step, \(\alpha(b_{i}^{k})\) is defined as
$$ \alpha\bigl({b}_{{i}}^{{k}}\bigr)=\min\biggl[\frac{P({U}_{{j}} | {b}_{{i}}^{{k}},{I}_{{i}},\theta _{cj},\theta _{tj})*\mathcal{N}({b}_{{i}}^{{k}} | 0,1)*q({b}_{{i}}^{{k}};{b}_i^{k-1})}{P({U}_{{j}} | {b}_{{i}}^{{k}-1},{I}_{{i}},\theta _{cj},\theta _{tj})*\mathcal{N}({b}_{{i}}^{{k}-1} | 0,1)*q({b}_{{i}}^{{k}-1}; {b}_i^k)},1\biggr] $$
(16)
because the item difficulty bi has a symmetric normal prior distribution, so
$$q\bigl(b_{i}^{k};b_{i}^{k-1}\bigr)= q\bigl(b_{i}^{k-1};b_{i}^{k}\bigr) $$
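The item-difficulty update in Equation (16) can be sketched as one random-walk Metropolis-Hastings step. This Python sketch is illustrative only (the function names and the proposal step size are my own, not from the paper); because the Gaussian random-walk proposal is symmetric, the acceptance ratio reduces to the likelihood ratio times the N(0,1) prior ratio:

```python
import math
import random

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_lik_item(u_col, thetas, b):
    """Log-likelihood of one item's response column, given each examinee's
    governing ability (true or cheating, as selected by Equation (6))."""
    ll = 0.0
    for u, th in zip(u_col, thetas):
        p = rasch_p(th, b)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

def log_normal_pdf(x, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)

def mh_update_b(b_old, u_col, thetas, step=0.2, rng=random):
    """One random-walk Metropolis-Hastings step for an item difficulty:
    the symmetric proposal densities cancel, leaving the likelihood ratio
    times the N(0,1) prior ratio of Equation (16)."""
    b_new = b_old + rng.gauss(0.0, step)
    log_alpha = (log_lik_item(u_col, thetas, b_new)
                 + log_normal_pdf(b_new, 0.0, 1.0)
                 - log_lik_item(u_col, thetas, b_old)
                 - log_normal_pdf(b_old, 0.0, 1.0))
    accept = rng.random() < math.exp(min(0.0, log_alpha))
    return b_new if accept else b_old
```

Iterating this step for each item, alongside analogous draws for the abilities and the deterministic update of T, yields the posterior samples used below.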
Essentially, students’ response vectors can be classified into four categories by the gate function I and indicator variable T in the DGM: A1, A2, A3, and A4 shown in Table 1. The categories A1, A2 and A3 are conditional on the true ability θt, and A4 is conditional on the cheating ability θc. The items in A1 and A2 are unexposed and thus examinees have no opportunity to cheat. Examinees in A3 do not cheat and thus data in A3 provide no information regarding cheating either, even though they have the opportunity to cheat on the exposed items. The examinees in A4 are defined as cheaters who have a score gain. Therefore, the observed data from A1, A2 and A3 are used to estimate the true ability, which is not contaminated by the cheating information, and the data from A4 are used to estimate cheating ability.
Table 1.

Observed data for the four conditional models.

          I=0        I=1
T=0       A1: θt     A3: θt
T=1       A2: θt     A4: θc

The observed response data in A1, A2, and A3 define the scale of the true ability for each student based on the unexposed items. Because the item parameters bi are assumed to be the same for cheaters and non-cheaters, the scale of both the exposed and unexposed items will be fixed by the scale of the true ability. Thus, the true ability, cheating ability, and item difficulty are on a common scale. However, as the number of cheaters grows, the cheating scale has a greater impact on the scale of the exposed items, which inevitably affects the scale of the true ability and the unexposed items. Consequently, the true-ability scale drifts toward a composite scale defined by both cheaters and non-cheaters to accommodate the impact of the cheating ability. If this “drift” occurs, it becomes more difficult to detect differences between cheating and non-cheating response behavior, which undermines the model’s detection power (Shu 2010). Because of this interaction between the cheating scale and the true-ability scale, the mean and standard deviation of the prior distributions of both the cheating ability and the true ability are estimated, which reduces the impact of the cheating-ability scale on the true-ability scale.

3 Classification of Examinees: Cheaters or Non-cheaters

The gating variable I is the key to distinguishing examinees’ real competence level from their cheating ability. The difference between the true value of θcj and the true value of θtj should be zero when the jth examinee is not a cheater, because he/she should perform equally well on both exposed and unexposed items; otherwise, he/she will have a positive score gain (θcj−θtj>0). In the DGM, the existence of a score gain is used as evidence of cheating on the exposed items.
$$ \left\{ \begin{array}{l@{\quad}l}\theta_{cj}-\theta_{tj}=0, & \mbox{for noncheater} \\ \theta_{cj}-\theta_{tj}>0, & \mbox{for cheater} \end{array} \right. $$
(17)
\(\hat{T}_{j}\) is computed as the average of the posterior samples of Tj. Thus, because Tj is defined as an indicator variable, \(\hat{T}_{j}\) represents the probability that a random draw from this posterior distribution is selected such that the posterior cheating ability θcj is greater than the posterior true ability θtj. Theoretically, \(\hat{T}_{j}\) is defined as
$$ \hat{T}_{j}=\frac{1}{K}\sum_{k=1}^{K}T_{j}^{k}=P(\theta_{cj}>\theta_{tj} \mid U_{j}) $$
(18)
A greater \(\hat{T}_{j}\) represents a stronger statement that the jth examinee’s cheating ability is greater than his/her true ability, that is, that a positive score gain exists on the exposed items; the DGM uses such a score gain as evidence of cheating on the exposed items.
In the DGM, examinees can be classified as cheaters or non-cheaters by setting a cut point Pc (0<Pc<1) for \(\hat{T}_{j}\), as shown in Equation (19),
$$ T_{j}=\left\{ \begin{array}{l@{\quad}l} 1, & \hat{T}_{j}\geq P_{c}\\ 0, & \hat{T}_{j}<P_{c} \end{array} \right. $$
(19)
If \(\hat{T}_{j}\) is greater than Pc, the jth examinee is believed to have a cheating ability greater than his/her true ability and is thus classified as a cheater (Tj=1); otherwise, the jth examinee’s cheating ability is considered equal to (or possibly less than) his/her true ability, and he/she is classified as a non-cheater (Tj=0). For example, using 0.85 as the cut point for \(\hat{T}_{j}\) implies that, in a chain of posterior samples, we declare the cheating ability greater than the true ability when at least 85 % of the posterior samples of the cheating ability exceed the corresponding posterior samples of the true ability. Examinees with \(\hat{T}_{j}\) greater than or equal to 0.85 are therefore classified as cheaters, because they are believed to have a score gain due to item exposure. A higher cut point yields a stronger statement that the estimate of the cheating ability is greater than that of the true ability. The selection of the cut point affects both the sensitivity (percentage of cheaters correctly identified) and the specificity (percentage of non-cheaters correctly identified) of detecting test cheating.
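The classification rule in Equations (18) and (19) amounts to averaging the posterior draws of Tj and thresholding at Pc. A minimal sketch (the function name is illustrative, not from the paper):

```python
def classify(posterior_T, cut_point=0.85):
    """Classify one examinee from the posterior samples of T_j:
    T_hat is the proportion of draws with theta_c > theta_t (Equation (18));
    flag a cheater when T_hat >= cut_point (Equation (19))."""
    t_hat = sum(posterior_T) / len(posterior_T)
    return t_hat, int(t_hat >= cut_point)

# 90 of 100 posterior draws had theta_c > theta_t, so with Pc = 0.85
# this examinee is classified as a cheater.
t_hat, flag = classify([1] * 90 + [0] * 10, cut_point=0.85)
```

Raising the cut point trades sensitivity for specificity, as discussed above.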

4 Simulation Design

A simulation study is introduced to illustrate how effectively the DGM can detect examinees showing unusual score gains under different conditions. The lz index is compared with the DGM. Other types of cheating detection indices exist (e.g., Frary et al. 1977; Bellezza & Bellezza 1995; Holland 1996; Lewis & Thayer 1998; Wollack 1997; Sotaridona 2003; van der Linden & Sotaridona, 2004, 2006; van der Linden & Jeon 2012); however, they were developed to detect cheating activities in different scenarios and are therefore not comparable with the DGM.

Cheating Characteristics

Cheating characteristics, including cheating size and cheating effectiveness, are considered in this simulation design. Cheating size refers to how many examinees have pre-knowledge of the exposed items; three levels of cheating size (5 percent, 35 percent, and 70 percent) are considered in this study. Cheating effectiveness represents the level of score gain resulting from pre-knowledge of the exposed items: high-effective cheaters have the largest score gain, low-effective cheaters the smallest, and medium-effective cheaters a score gain in between. Specifically, score gains are simulated from Beta(9,4)∗3 for high-effective cheaters, Beta(5,5)∗3 for medium-effective cheaters, and Beta(1.5,5)∗3 for low-effective cheaters.
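The three score-gain distributions can be drawn with the standard library's Beta generator. This sketch (with assumed function and variable names) simply confirms that the three categories are ordered as described, with mean gains of roughly 9/13∗3 ≈ 2.08, 5/10∗3 = 1.5, and 1.5/6.5∗3 ≈ 0.69:

```python
import random

# Score-gain (Delta) distributions for the three cheater categories:
# scaled Beta draws with A = 3 and B = 0, as described above.
GAIN_PARAMS = {"high": (9, 4), "medium": (5, 5), "low": (1.5, 5)}

def sample_gain(category, a_scale=3.0, rng=random):
    """Draw one score gain Delta ~ Beta(alpha, beta) * A for a cheater
    of the given effectiveness category."""
    alpha, beta = GAIN_PARAMS[category]
    return rng.betavariate(alpha, beta) * a_scale

random.seed(1)
# Monte Carlo estimates of the mean gain per category.
gains = {c: sum(sample_gain(c) for _ in range(5000)) / 5000
         for c in GAIN_PARAMS}
```

With 5,000 draws per category the sample means sit well within a few hundredths of the theoretical Beta means, so the ordering high > medium > low holds.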

Ability Simulation

True ability (θt) was simulated according to the standard normal distribution \(\mathcal{N}(0,1)\), and the cheating ability (θc) was created by combining the true ability and a score gain (Δ).
$$ \theta_{t}\sim\mathcal{N}(0,1) $$
(20)
$$ \theta_{c}=\theta_{t}+\Delta $$
(21)
$$ \Delta\sim A*\mathrm{Beta}(\alpha,\beta)+B $$
(22)
where the parameters A and B in Equation (22) are scaling factors to change the lower and upper limits of the Beta distribution. Specifically, A is set to 3 and B is set to 0 in this simulation design.
Furthermore, competent examinees normally tend to rely on their own knowledge and competence to respond to items, whereas less competent examinees are more likely to cheat in order to obtain a higher test score. Therefore, in this simulation design, 60 percent of cheaters are sampled from low-ability students whose true ability is less than −0.5, 30 percent from medium-ability examinees whose true ability is between −0.5 and 0.5, and 10 percent from high-ability examinees whose true ability is greater than 0.5. As an illustration, when the cheating size is 5 percent, the total number of cheaters is 2000∗5 %=100, of whom 60 (60 %∗100) are low-ability students with true abilities below −0.5, 30 (30 %∗100) have true abilities between −0.5 and 0.5, and 10 (10 %∗100) are capable examinees with true abilities greater than 0.5. Together with cheating effectiveness, the different categories of test cheaters are listed in Table 2.
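The stratified allocation in the worked example above can be sketched as follows (an illustrative helper, not from the paper):

```python
def cheater_counts(n_examinees, cheating_size):
    """Number of cheaters drawn from each true-ability stratum:
    60% low (theta_t < -0.5), 30% medium (-0.5 to 0.5), 10% high (> 0.5)."""
    n_cheaters = round(n_examinees * cheating_size)
    return {
        "low": round(n_cheaters * 0.60),
        "medium": round(n_cheaters * 0.30),
        "high": round(n_cheaters * 0.10),
    }

# The worked example in the text: 2000 examinees, 5% cheating size
# gives 100 cheaters split 60 / 30 / 10 across the three strata.
counts = cheater_counts(2000, 0.05)
```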
Table 2.

Joint conditions of ability distribution.

Category                         High-effective cheaters  Medium-effective cheaters  Low-effective cheaters
60 %, true ability < −0.5        Yes                      Yes                        Yes
30 %, true ability [−0.5, 0.5]   Yes                      Yes                        Yes
10 %, true ability > 0.5         Yes                      Yes                        Yes

Note: “Yes” = the joint condition of column and row is considered in this research.

All the factors considered in this simulation design are listed in Table 3. The total number of joint conditions considered in this research is 27 (1×3×3×3). Each joint condition is replicated 10 times in the simulation.
Table 3.

Joint conditions.

Factors                           Levels
Test length                       40 items1
Proportion of compromised items   30 %, 50 % and 70 %
Cheating size                     5 %, 35 % and 70 %
Cheating category                 High-effective, medium-effective and low-effective
Guessing level                    0

1The item difficulty was simulated by the standard normal distribution N(0,1).

Data Generation

A more general data-generation model, rather than the DGM itself, is used to generate the response data; it is defined as
$$ P(U_{ij}=1)=(1-s )*P(U_{ij}=1 | \theta_{t})+s* P(U_{ij}=1 | \theta_{c}) $$
(23)
where sij=Tj∗Ii. T is the dichotomous cheating parameter vector for the examinees (N×1), and I is the model input indicating item exposure status (1×J). Thus S=TI is an N×J matrix, a joint parameter whose entries sij define whether an individual examinee cheats on each item. As a note: N is the number of examinees and J is the number of items.
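Equation (23) can be sketched in Python as follows (an illustrative sketch under the definitions above; the function names are my own):

```python
import math
import random

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def generate_responses(theta_t, theta_c, T, I, b, rng=random):
    """Generate an N x J 0/1 response matrix from Equation (23):
    s_ij = T_j * I_i selects the cheating ability exactly when
    examinee j is a cheater and item i is exposed."""
    data = []
    for j in range(len(theta_t)):
        row = []
        for i in range(len(b)):
            s = T[j] * I[i]
            p = (s * rasch_p(theta_c[j], b[i])
                 + (1 - s) * rasch_p(theta_t[j], b[i]))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data
```

Because s is 0 or 1 here, each response is drawn from exactly one of the two Rasch functions, matching the dichotomous T and I used in this paper; a continuous I would instead mix the two probabilities.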

5 Results

Two indices, sensitivity and specificity, are used to evaluate the accuracy and reliability of the DGM and of the lz index. Sensitivity is the proportion of true cheaters who are correctly detected as cheaters, and specificity is the proportion of examinees who did not cheat and are correctly classified as non-cheaters. The results of this simulation study are summarized in Tables 4 and 5. The cut point for T is 0.9 in this study, and 0.95 for the lz index; 0.95 was selected for the lz index because it achieves about the same level of specificity as the DGM.
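Given the known simulation labels, both indices reduce to simple proportions. A minimal sketch with invented toy labels:

```python
import numpy as np

def sensitivity_specificity(true_cheater, flagged):
    """true_cheater, flagged: boolean arrays over examinees."""
    sens = flagged[true_cheater].mean()       # detected cheaters / all true cheaters
    spec = (~flagged[~true_cheater]).mean()   # cleared non-cheaters / all non-cheaters
    return sens, spec

# toy example: 2 cheaters, 3 non-cheaters; one hit, one miss, one false alarm
true_cheater = np.array([1, 1, 0, 0, 0], dtype=bool)
flagged = np.array([1, 0, 0, 0, 1], dtype=bool)
sens, spec = sensitivity_specificity(true_cheater, flagged)
```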
Table 4.

Specificity of the DGM and lz index.

                    Cheating size 5 %               Cheating size 35 %              Cheating size 70 %
Proportion      30 %      50 %      70 %        30 %      50 %      70 %        30 %      50 %      70 %
Method        lz   DGM  lz   DGM  lz   DGM   lz   DGM  lz   DGM  lz   DGM   lz   DGM  lz   DGM  lz   DGM

High   Mean  0.99 0.94 0.99 0.96 0.99 0.98  0.98 0.96 0.99 0.98 0.99 0.99  0.94 0.98 0.93 0.99 0.95 1.00
       SD    0.00 0.02 0.00 0.01 0.00 0.01  0.01 0.02 0.01 0.01 0.01 0.01  0.08 0.01 0.11 0.01 0.08 0.00
Medium Mean  0.98 0.94 0.97 0.97 0.98 0.98  0.98 0.96 0.98 0.98 0.99 0.99  0.96 0.99 0.96 0.99 0.97 1.00
       SD    0.01 0.01 0.05 0.01 0.01 0.01  0.01 0.01 0.01 0.01 0.01 0.00  0.03 0.01 0.04 0.00 0.03 0.00
Low    Mean  0.96 0.95 0.97 0.96 0.97 0.97  0.96 0.97 0.96 0.98 0.97 0.99  0.96 0.98 0.96 0.99 0.96 0.99
       SD    0.01 0.00 0.01 0.00 0.01 0.00  0.01 0.00 0.01 0.00 0.01 0.00  0.01 0.00 0.01 0.00 0.01 0.00

High = High-effective cheaters; Medium = Medium-effective cheaters; Low = Low-effective cheaters; Proportion = Proportion of exposed items; Mean/SD = mean/standard deviation of specificity.

Table 5.

Sensitivity of the DGM and lz index.

                    Cheating size 5 %               Cheating size 35 %              Cheating size 70 %
Proportion      30 %      50 %      70 %        30 %      50 %      70 %        30 %      50 %      70 %
Method        lz   DGM  lz   DGM  lz   DGM   lz   DGM  lz   DGM  lz   DGM   lz   DGM  lz   DGM  lz   DGM

High   Mean  0.25 0.79 0.33 0.82 0.33 0.79  0.06 0.70 0.08 0.76 0.08 0.70  0.02 0.50 0.02 0.52 0.02 0.43
       SD    0.23 0.12 0.28 0.11 0.25 0.11  0.07 0.16 0.09 0.15 0.08 0.16  0.01 0.18 0.01 0.21 0.01 0.22
Medium Mean  0.11 0.65 0.15 0.66 0.15 0.59  0.03 0.54 0.05 0.57 0.05 0.50  0.02 0.36 0.02 0.38 0.02 0.31
       SD    0.10 0.12 0.13 0.13 0.12 0.15  0.03 0.14 0.04 0.15 0.03 0.15  0.01 0.12 0.01 0.14 0.01 0.14
Low    Mean  0.02 0.34 0.03 0.32 0.03 0.26  0.02 0.27 0.03 0.26 0.03 0.20  0.02 0.19 0.02 0.18 0.03 0.14
       SD    0.01 0.08 0.01 0.09 0.02 0.09  0.01 0.07 0.01 0.07 0.01 0.08  0.01 0.05 0.01 0.06 0.00 0.06

High = High-effective cheaters; Medium = Medium-effective cheaters; Low = Low-effective cheaters; Proportion = Proportion of exposed items; Mean/SD = mean/standard deviation of sensitivity.

As shown in Table 4, the specificity of the DGM is consistently about 96 percent across all the joint conditions; that is, the DGM classifies almost all innocent test takers as non-cheaters, with a small degree of error. The proportion of exposed items, the cheating size and the cheating effectiveness only slightly affect the model's specificity. Moreover, the DGM performs as well as the lz index in terms of specificity: the lz index performs slightly better when only 5 percent of test takers are cheating, and the DGM slightly outperforms the lz index when 70 percent of test takers are cheating.

Unlike the specificity, the DGM's sensitivity is strongly affected by the cheating effectiveness, the proportion of cheaters and the percentage of exposed items, as presented in Table 5. The DGM is more likely to detect effective cheaters who show a high level of score gain (i.e., a high level of pre-knowledge): the sensitivity in the high-effective cheating category is always greater than in the corresponding medium- and low-effective cases. However, the DGM becomes less sensitive as the proportion of cheaters increases. For instance, the DGM detects about 80 % of high-effective cheaters when only 5 % of the population are cheaters, but only about 48 % when the proportion of cheaters is 70 %. Furthermore, the sensitivity with 50 % of items exposed is uniformly greater than with 30 % or 70 % exposed; in other words, a reliable estimation of both the true and the cheating ability (enough items in each category) increases the sensitivity. In comparison, the DGM shows greater sensitivity than the lz index in all the joint conditions, especially when there is only a small proportion of test cheaters.

In summary, a higher score gain in the exposed items leads to a high probability of being detected by the DGM. In other words, the cheaters with a greater score gain are more likely to be detected by the DGM. However, the power of the DGM decreases when the proportion of cheaters in the population increases, which causes the scale shift discussed in the model estimation section. Overall, the DGM is effective in detecting students with score gain in the exposed items, as compared to the lz index.

6 A Real Data Application

A real data set provided by CTB/McGraw-Hill comes from a low-stakes test measuring student proficiency in English at Grade 4. It has 35 items and more than 15,000 students. Of these 35 items, 14 are anchor items used in a previous administration and 21 are new items never used before. The anchor items are treated as the exposed items, because it was determined that students may have memorized items from the previous administration, conducted three months earlier; the 21 new items are used as the unexposed items. The MCMC was run for 8,000 iterations, with the first 5,000 iterations as the burn-in. The cut point for the posterior \(\hat{T}_{j}\) is set at 0.9 for two reasons: (1) the simulation study shows that the specificity of the DGM is uniformly around 0.96 when the cut point is 0.9, and (2) this is a low-stakes test used for monitoring students' learning progress, so a specificity of 0.96 may be adequate in this particular case. Practitioners may select a higher cut point, together with other evidence, to conduct a more conservative detection in high-stakes tests.

Model Convergence

The parameters of the DGM converge well. As an illustration, the plots of the posterior samples of examinee true ability, item difficulty and the hyperparameters (the mean and standard deviation of the cheating ability) for one of the replications are presented in Figure 1 (convergence of an examinee's true ability), Figure 2 (convergence of an item difficulty) and Figure 3 (convergence of the hyperparameters of the cheating ability). As the three plots demonstrate, the posterior samples fluctuate around a stable value.
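The paper assesses convergence visually from trace plots. As a numerical complement (not part of the paper), a rough Geweke-style diagnostic compares the means of early and late segments of the retained chain; the function and its naive variance treatment are our own sketch.

```python
import numpy as np

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke-style z: compare the mean of the first 10 % with the last 50 % of a chain."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1 - last) * n):]
    # naive standard error of the mean difference (ignores autocorrelation;
    # adequate only as a rough stationarity check)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(2)
stationary = rng.normal(0.0, 1.0, 3000)   # a chain fluctuating around a stable value
z = geweke_z(stationary)                  # |z| near 0 suggests the chain has stabilized
```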
Figure 1.

Example of the convergence of examinee ability.

Figure 2.

Example of the convergence of item difficulty.

Figure 3.

Example of the convergence of the hyperparameter.

Comparison with BILOG-MG

The unexposed items carry no cheating information; thus, examinee abilities estimated solely from the unexposed items represent the examinees' real competence level. Hence, the examinee scores produced by BILOG-MG from only the unexposed items of this test can be used to validate the true abilities estimated by the DGM. The correlation between the BILOG ability estimates (based only on unexposed items) and the DGM true ability estimates for the flagged students is 0.963. Furthermore, the correlation between the item difficulties (including all items) produced by the two methods is 0.991. These two correlations demonstrate the consistency between the DGM and BILOG-MG.
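The validation step amounts to a Pearson correlation between the two sets of estimates. The numbers below are invented purely for illustration; only the procedure mirrors the comparison described above.

```python
import numpy as np

# hypothetical ability estimates for five flagged students from the two methods
bilog_theta = np.array([-1.2, -0.8, -0.5, -1.0, -0.3])   # BILOG-MG, unexposed items only
dgm_theta = np.array([-1.1, -0.9, -0.4, -1.0, -0.35])    # DGM true-ability estimates
r = np.corrcoef(bilog_theta, dgm_theta)[0, 1]            # Pearson correlation
```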

Bootstrap Results

A bootstrap design is used to illustrate the DGM’s invariance. 3,000 examinees are randomly sampled from the whole test population with replacement, and this process is replicated 10 times. As a practical note: the sample size of the examinee population is 16,723. The proportion of examinees that are identified as potential test cheaters by the DGM is shown in Table 6.
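The bootstrap check can be sketched generically as follows. Here a precomputed per-examinee flag vector stands in for refitting the DGM on each resample, which is what the study actually does and is far more expensive; the flag count is set near the observed 9 percent for illustration.

```python
import numpy as np

def bootstrap_flag_rates(flags, n_sample=3000, n_rep=10, seed=0):
    """Resample examinees with replacement and record the flagged proportion each time."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_rep):
        idx = rng.choice(len(flags), size=n_sample, replace=True)
        rates.append(flags[idx].mean())
    return np.array(rates)

population_flags = np.zeros(16723, dtype=bool)   # population size reported in the study
population_flags[:1505] = True                   # ~9 % flagged (illustrative stand-in)
rates = bootstrap_flag_rates(population_flags)   # ten replicate proportions
```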
Table 6.

Proportion of identified students in each replication.

Replicate     R1      R2       R3       R4       R5       R6       R7       R8       R9       R10
Proportion    9 %     9.03 %   8.90 %   8.77 %   9.20 %   9.93 %   9.33 %   9.43 %   8.43 %   9.20 %

R1 = the first replication.

In Table 6, the proportion of test cheaters is consistently around 9 percent across the 10 replications, as expected. Each set of 3,000 examinees is randomly sampled with replacement from a single examinee population, so each set should exhibit similar statistical characteristics without significant sampling variance. The stability of the proportion of test cheaters across replications therefore provides evidence that the DGM is internally consistent in real settings.

Characteristics of Identified Students

T is a deterministic parameter that governs whether the cheating ability plays a role in the DGM. When T=1 (i.e., a score gain exists in the exposed items), the DGM uses the cheating ability to account for the common variance within the exposed items and the true ability to explain the variance within the unexposed items. When T=0 (i.e., no score gain exists in the exposed items), the cheating ability drops out and the true ability explains the variance within both the exposed and the unexposed items. As a result, an unusual score gain in the exposed items is meaningful and can be reported for the jth examinee when \(\hat{T}_{j}\) exceeds the cut point; otherwise, no score gain exists and only the true ability is reported for that examinee. Therefore, unlike θt and \(\hat{T}\), which are parameters for all the students, the score gain plays a role only for the identified students. Furthermore, although the cheating ability enters the DGM directly, the positive score gain (θc−θt) is the quantity of interest. It is recommended to report the score gain for the identified students, rather than the cheating ability for all students.
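The reporting rule above, with the 0.9 cut point used in this application, can be sketched as follows; the function and the toy posterior values are ours.

```python
import numpy as np

def report(theta_t_hat, theta_c_hat, T_post, cut=0.9):
    """Report score gain (delta = theta_c - theta_t) only when posterior T exceeds the cut."""
    flagged = T_post > cut
    delta = np.where(flagged, theta_c_hat - theta_t_hat, np.nan)  # gain reported only if flagged
    return flagged, delta

theta_t_hat = np.array([-0.9, 0.2, -1.1])   # estimated true abilities
theta_c_hat = np.array([0.4, 0.3, -1.0])    # estimated cheating abilities
T_post = np.array([0.97, 0.15, 0.40])       # posterior means of the gate indicator
flagged, delta = report(theta_t_hat, theta_c_hat, T_post)
```

Only the first examinee is flagged; a score gain of 1.3 would be reported for that examinee, and only the true ability for the other two.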

The descriptive statistics for the true score and score gain for the flagged students are presented in Table 7. The mean of the flagged students’ true score is −0.87 and the standard deviation is 0.49; and the distribution of their true score is presented in Figure 4. The mean of the flagged students’ score gain is 1.33 and the standard deviation is 0.28, and its distribution (represented by Delta) is presented in Figure 5.
Figure 4.

Distribution of true score for identified students.

Figure 5.

Distribution of Delta for identified students.

Table 7.

Percentile statistics for the identified students.

Percentile     0 %     10 %    20 %    30 %    40 %     50 %    60 %    70 %    80 %    90 %    100 %
True score   −2.05   −1.50   −1.32   −1.18   −1.034   −0.90   −0.75   −0.58   −0.40   −0.19    0.47
Delta         0.58    1.02    1.08    1.13    1.20     1.26    1.34    1.43    1.56    1.72    2.65

Delta = θc−θt.

In Table 7, 80 percent of the identified students are medium- or low-ability students with true ability between −1.50 and −0.19. The score gain of the identified students has a positively skewed distribution with a mode at 1.28, as shown in Figure 5. The positive skew implies that fewer students achieve larger score gains, which follows from the DGM's design: a greater score gain (θc−θt) yields stronger evidence of cheating, so the DGM tends to identify students with a significant score gain. In IRT settings, students with low or medium true ability estimates have more room for a large score gain than those with high true ability estimates, because capable students do not have much to gain in their ability estimates (given the scale interval of IRT ability estimates). Therefore, the DGM, as an IRT-based model, is more likely to identify low- and medium-ability students, and is less efficient at detecting high-ability students. Although most of the students identified by the DGM are at the low or medium ability level, this does not necessarily mean that high-ability students have no pre-knowledge of the exposed items; fewer high-ability students are identifiable as cheaters because, if such cheaters exist, they generally do not benefit as much as low- and medium-ability students.

In this real application, we classify the new items as unexposed items and the anchor items as exposed items because these anchor items were possibly compromised due to item memorization and the new items were secure. In other real application cases, all of the items in a test may have been exposed previously; and, as a result, students may have a chance to cheat on all the items. In such cases, practitioners could classify items based on other information, such as exposure times, time span in use, or other policy decisions (Segall, 2002, 2004; Mcleod et al. 2003). Some compromised items may be misclassified as uncompromised items, which will undermine the DGM’s power to identify cheaters. However, practitioners could use the item parameters that were estimated during the first administration to counter-balance the impact of mis-classifying items.

As an extension, practitioners may consider replenishing their item banks with new items if a large proportion of students are identified by the DGM as cheaters. Practitioners could also analyze the flagged students' school and teacher information to monitor group cheating that may be organized by teachers or schools; for example, group cheating probably occurs when a large proportion of students from the same school, or taught by the same teacher, are identified as cheaters. Moreover, the flagged students could be evaluated with additional tests (e.g., another form of the test), or a sanction barring them from the test for a period of years may be imposed. When widespread cheating is evident (i.e., a large portion of students are identified as cheaters by the DGM), a remedial action may be to retract all scores and require every examinee to complete a retest. Furthermore, the accuracy of linking/equating is often undermined by students' memorization of anchor items; the DGM can be used to detect the students who show unusual score gain on the anchor items and then exclude them from the linking and equating analysis.

7 Discussion

In this paper, the model structure, estimation framework and population structure of the DGM are presented. Special attention is paid to the definition of the DGM’s indicator variable in addition to its distinction from the MIRT and MRM. Overall, the DGM uses the true ability to characterize the construct of interest and the cheating ability to model the degree of success for cheaters. It classifies examinees as cheaters or non-cheaters through an indicator variable T by conditioning on the performance difference between the compromised and uncompromised items. The simulation study and the real application demonstrate that the DGM can be a potentially useful tool in detecting the test cheating featured with an unusual score gain in exposed items.

The DGM differs from existing cheating detection methods, including answer-copying indices, person-fit indices and Segall's cheating model. Compared to detection indices based on hypothesis tests at the examinee level, the DGM allows for the estimation of a relatively complicated population structure to characterize examinees' cheating activities. Although both the DGM and Segall's cheating model are designed to detect the same type of cheating, the DGM appears more flexible because it makes weaker assumptions. Certainly, the new model has its limitations. First, the DGM cannot detect other types of test cheating (e.g., answer copying, answer erasure). Second, the DGM requires a sufficiently large number of examinees, and of items in both the compromised and the uncompromised categories, to achieve sensitive detection.

Further research on this model is needed, both theoretical and practical. Theoretically, the performance of the DGM based on the 2PL and 3PL IRT models, along with a more time-efficient estimation algorithm, should be investigated and evaluated to further extend the DGM's applicability. Operationally, the cut point for T is currently selected empirically; a parametric null-hypothesis test on T may therefore be one direction for future work. Furthermore, the real data used in this research come from a low-stakes test, so future research should explore applications of the DGM in high-stakes testing environments. Last, but not least, the impact of misclassifying items as exposed or secure on the DGM's performance should also be investigated in future work. Successful and appropriate application in a wide variety of real settings is the ultimate purpose of the DGM, and the final standard by which to evaluate the Deterministic, Gated IRT Model.

Acknowledgements

The real data used in this research were provided by CTB/McGraw-Hill, based in Monterey, CA. We thank Dr. Furong Gao and Dr. Jesswlyn Smith at CTB/McGraw-Hill, who helped review this research, and Dr. Charlie Lewis and Dr. Xueli Xu, both from ETS, for their kind and wise advice.

Copyright information

© The Psychometric Society 2012