DIF Analysis with Unknown Groups and Anchor Items

Ensuring fairness in instruments like survey questionnaires or educational tests is crucial. One way to address this is through differential item functioning (DIF) analysis, which examines whether different subgroups respond differently to a particular item, controlling for their overall latent construct level. DIF analysis is typically conducted to assess measurement invariance at the item level. Traditional DIF analysis methods require knowing the comparison groups (reference and focal groups) and anchor items (a subset of DIF-free items). Such prior knowledge may not always be available, and psychometric methods have been proposed for DIF analysis when one piece of information is unknown. More specifically, when the comparison groups are unknown while anchor items are known, latent DIF analysis methods have been proposed that estimate the unknown groups by latent classes. When anchor items are unknown while comparison groups are known, methods have also been proposed, typically under a sparsity assumption -- the number of DIF items is not too large. However, DIF analysis when both pieces of information are unknown has not received much attention. This paper proposes a general statistical framework under this setting. In the proposed framework, we model the unknown groups by latent classes and introduce item-specific DIF parameters to capture the DIF effects. Assuming the number of DIF items is relatively small, an $L_1$-regularised estimator is proposed to simultaneously identify the latent classes and the DIF items. A computationally efficient Expectation-Maximisation (EM) algorithm is developed to solve the non-smooth optimisation problem for the regularised estimator. The performance of the proposed method is evaluated by simulation studies and an application to item response data from a real-world educational test.


Introduction
Psychometric models for analysing data from instruments such as survey questionnaires and educational tests rely on an equivalence assumption on the item parameters across groups of respondents. That is, conditional on the latent construct measured by the instrument, a respondent's response to each item is independent of their group membership. This assumption is known as measurement invariance. If it is violated, the psychometric properties of the item(s) are not constant across groups, which can cause measurement bias (Millsap, 2012).
The measurement invariance assumption is typically investigated through differential item functioning (DIF) analysis, a class of statistical methods that compares respondent groups at the item level and detects noninvariant, i.e., DIF, items.
The comparison groups may sometimes be unavailable, and DIF analysis in this situation is typically referred to as latent DIF analysis (Cho et al., 2016; De Boeck et al., 2011). As suggested in De Boeck et al. (2011), latent DIF analysis is needed when we do not know the crucial groups for comparison, we cannot observe the groups of interest, or there are validity concerns regarding the true group membership of the respondents. For example, for self-reported health and mental health instruments (Teresi and Reeve, 2016; Reeve and Teresi, 2016; Teresi et al., 2021), many covariates are collected, such as age, gender, ethnicity, and other background variables, but the crucial groups for DIF analysis are typically unclear. For another example, when analysing data from an educational test in which a subset of test takers have preknowledge of some leaked items (Cizek and Wollack, 2017), the two comparison groups of interest -- the ones with and without item preknowledge -- are not directly observable. Moreover, the observed group membership may sometimes poorly indicate the "true" group membership that causes the DIF pattern in the item response data (e.g., Bennink et al., 2014; Cho and Cohen, 2010; Finch and Hernández Finch, 2013; Von Davier et al., 2011). Most existing latent DIF analysis methods assume that an anchor set is known and use a mixture IRT model -- a model that combines IRT and latent class analysis -- to identify the unknown groups and detect the DIF items simultaneously (Cho and Cohen, 2010; Cohen and Bolt, 2005; De Boeck et al., 2011); see Cho et al. (2016) for a review.
In practice, both the comparison groups and the anchor set may be unknown. For example, besides the aforementioned challenges of identifying the crucial comparison groups, the DIF analysis of self-reported health and mental health instruments also faces the challenge of identifying anchor items (Teresi and Reeve, 2016; Reeve and Teresi, 2016). In the item preknowledge example above, not only are the comparison groups unobserved, but prior knowledge about non-leaked items is also likely unavailable, and thus correctly specifying an anchor set is a challenge (O'Leary and Smith, 2016). Almost no general methods are available for latent DIF analysis when the anchor set is unavailable. Two notable exceptions are Chen et al. (2022) and Robitzsch (2022). In Chen et al. (2022), a Bayesian hierarchical model for latent DIF analysis is proposed and applied to the simultaneous detection of item leakage and preknowledge in educational tests. In this model, latent classes are imposed among the test takers to model the comparison groups, and also among the items to model the DIF and non-DIF item sets. In addition, both the person- and item-specific parameters are treated as random variables and inferred via a fully Bayesian approach. However, the inference of this model relies on a Markov chain Monte Carlo algorithm, which suffers from slow mixing. Moreover, as most traditional DIF analysis methods adopt a frequentist setting, it is of interest to develop a frequentist approach to latent DIF analysis when the anchor set is unknown. Robitzsch (2022) proposed a latent DIF procedure based on a regularised estimator under a mixture Rasch model. In that work, a nonconvex penalty called the Smoothly Clipped Absolute Deviation (SCAD) penalty (Fan and Li, 2001), rather than the L_1 penalty, is investigated. The methodology proposed in the current paper is similar in spirit to that of Robitzsch (2022) but was developed independently. The proposed framework focuses on the two-parameter logistic
(2-PL) model (Birnbaum, 1968) with an L_1 penalty and further provides scope to generalise to other item response theory models. This paper proposes a frequentist framework for DIF analysis when both the comparison groups and the anchor set are unknown. The proposed framework combines the ideas of mixture IRT modelling for latent DIF analysis and regularised estimation for manifest DIF analysis with unknown anchor items. More specifically, the unknown groups are modelled by latent classes, and the DIF effects are characterised by item-specific DIF parameters. An L_1-regularised marginal maximum likelihood estimator is proposed, assuming that the number of DIF items is relatively small. This estimator penalises the DIF parameters by a Lasso regularisation term so that the DIF items can be selected from the non-zero pattern of the estimated DIF parameters. Computing the L_1-regularised estimator involves solving a non-smooth optimisation problem.
We propose a computationally efficient Expectation-Maximisation (EM) algorithm (Dempster et al., 1977; Bock and Aitkin, 1981), where the non-smoothness of the objective function is handled by a proximal gradient method (Parikh and Boyd, 2014). We evaluate the proposed method through simulation studies and an application to item response data from a real-world educational test. For the real-world application, we consider data from a midwestern university in the United States. This data set has been studied in Bolt et al. (2002), where end-of-test items are believed to cause DIF due to insufficient time. Both the comparison groups, i.e., the speeded and non-speeded respondents, and the anchor items are unknown. In Bolt et al. (2002), the DIF items and comparison groups are detected by borrowing information from an additional test form which is carefully designed so that the potential speededness-DIF items in the original form are administered at earlier locations, and thus are unlikely to suffer from speededness-DIF. With the proposed procedure, we are able to identify the unknown DIF items and comparison groups without utilising information from the additional test form, and our findings are consistent with those of Bolt et al. (2002).
The rest of the paper is organised as follows. In Section 2, we propose a modelling framework for latent DIF analysis with unknown groups and anchor items and a regularised estimator that simultaneously identifies the unknown groups and detects the DIF items. In Section 3, we propose a computationally efficient EM algorithm. The proposed method is evaluated by simulation studies in Section 4 and further applied to data from a real-world educational test in Section 5. We conclude with discussions in Section 6. Details about the computational algorithm are given in the Appendix.

Measurement Model
Consider N respondents answering J binary items. Let Y_ij ∈ {0, 1}, for i = 1, ..., N and j = 1, ..., J, be a binary random variable recording individual i's response to item j. The response vector of individual i is denoted by Y_i = (Y_i1, ..., Y_iJ)^⊤. We assume that the items measure a unidimensional construct, which is modelled by a latent variable θ_i. We further assume that the respondents are random samples from K + 1 unobserved groups, where the group membership is denoted by the latent variable ξ_i ∈ {0, 1, ..., K}. Given the latent trait θ_i and the latent class ξ_i, consider the two-parameter logistic (2-PL) measurement model (Birnbaum, 1968)

logit P(Y_ij = 1 | θ_i, ξ_i) = a_j θ_i + d_j + δ_{jξ_i},   (1)

where a_j and d_j are known as the discrimination and easiness parameters, respectively, and δ_{jξ_i} is referred to as the DIF-effect parameter, as it quantifies the DIF effect of latent class ξ_i on item j.
We treat ξ_i = 0 as the baseline group, also known as the reference group, and set δ_j0 = 0 for all j = 1, ..., J. In that case, a_j θ_i + d_j is the item response function for the reference group. When a_j is common across all items, the baseline model becomes the Rasch model (Rasch, 1960). We focus on the 2-PL model here, but the proposed method easily adapts to other baseline IRT models.
The parameter δ_jk characterises how respondents in group k differ from those in the reference group in terms of their item response behaviour on item j. For the reference group, the DIF parameter remains zero for all items, serving as a reference point. For the remaining latent classes, the DIF parameter can be non-zero for certain items. Crucially, the magnitude of this parameter is allowed to differ across these latent classes.
This flexibility accounts for varying degrees of DIF effects across the different latent groups when comparing with the reference group. The DIF-effect parameter can also be expressed in terms of log-odds. Specifically, under the 2-PL model,

δ_jk = logit P(Y_ij = 1 | θ_i = θ, ξ_i = k) − logit P(Y_ij = 1 | θ_i = θ, ξ_i = 0),

i.e., δ_jk is the log-odds-ratio when comparing two respondents from group k and the reference group, given that they have the same latent construct level.
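As a concrete illustration of this log-odds interpretation, the following sketch (in Python, with hypothetical parameter values) evaluates the 2-PL item response function with and without a DIF shift; the difference in log-odds between the two groups equals δ_jk exactly.

```python
import math

def irf_2pl_dif(theta, a, d, delta_k):
    """Item response function of the 2-PL model with a class-specific DIF
    shift: logit P(Y = 1 | theta, xi = k) = a*theta + d + delta_k.
    Setting delta_k = 0 recovers the reference-group curve."""
    z = a * theta + d + delta_k
    return 1.0 / (1.0 + math.exp(-z))

# For a respondent at theta = 0 on an item with a = 1 and d = 0 (assumed
# values), a DIF shift of delta = 1 raises the success probability.
p_ref = irf_2pl_dif(0.0, 1.0, 0.0, 0.0)   # reference group
p_foc = irf_2pl_dif(0.0, 1.0, 0.0, 1.0)   # focal group with DIF effect 1
```

Comparing the two probabilities on the log-odds scale recovers the DIF effect: logit(p_foc) − logit(p_ref) = 1, the value of δ_jk used.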

Structural Model
The structural model specifies the joint distribution of the latent variables (θ_i, ξ_i). We assume that the latent classes follow a categorical distribution, where

P(ξ_i = k) = ν_k, k = 0, 1, ..., K,

with ν_k > 0 and ν_0 + ν_1 + ... + ν_K = 1. We further assume that, conditional on ξ_i = k, the latent ability θ_i follows a normal distribution with class-specific mean and variance, i.e.,

θ_i | ξ_i = k ~ N(μ_k, σ_k^2).

To ensure model identification, we fix the mean and variance of the reference group, i.e., μ_0 = 0 and σ_0^2 = 1. The path diagram of this model is given in Figure 1. We note that the model coincides with a MIMIC model (Jöreskog and Goldberger, 1975) for manifest DIF analysis (Muthen and Lehman, 1985; Muthén, 1989; Woods, 2009; Woods and Grimm, 2011) when conditioning on the latent class ξ_i (i.e., viewing ξ_i as observed). However, since ξ_i is unobserved, the statistical inference of the proposed model differs substantially from that of the MIMIC model. More specifically, the inference of the proposed model will be based on the marginal likelihood function, where both latent variables ξ_i and θ_i are marginalised out. When the baseline IRT model is the 2-PL model, the marginal likelihood function takes the form

L(∆) = ∏_{i=1}^N ∑_{k=0}^K ν_k ∫ ∏_{j=1}^J [exp(a_j θ + d_j + δ_jk)]^{y_ij} / [1 + exp(a_j θ + d_j + δ_jk)] φ(θ | μ_k, σ_k^2) dθ,   (2)

where φ(θ | μ_k, σ_k^2) denotes the density function of a normal distribution with mean μ_k and variance σ_k^2, and we use the vector ∆ to denote all the unknown parameters, including the item parameters a_j, d_j and δ_jk and the structural parameters ν_k, μ_k and σ_k^2, for j = 1, ..., J and k = 0, 1, ..., K.
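To make the marginal likelihood concrete, the following sketch evaluates it numerically, replacing the integral over θ by Gauss-Hermite quadrature within each latent class. All parameter values passed to the function are user-supplied, and the routine is an illustration of the likelihood computation rather than the estimation method of Section 3.

```python
import numpy as np

def marginal_loglik(Y, a, d, delta, nu, mu, sigma2, n_quad=31):
    """Marginal log-likelihood of the 2-PL mixture model, with theta
    integrated out by Gauss-Hermite quadrature within each class.
    Y: (N, J) 0/1 responses; a, d: (J,) item parameters;
    delta: (J, K+1) DIF effects (column 0, the reference class, is zero);
    nu, mu, sigma2: (K+1,) class proportions, means, and variances."""
    # Probabilists' Hermite quadrature approximates standard-normal expectations.
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    weights = weights / weights.sum()
    ll = 0.0
    for y in Y:
        like_i = 0.0
        for k in range(len(nu)):
            theta = mu[k] + np.sqrt(sigma2[k]) * nodes          # class-k quadrature points
            z = np.outer(theta, a) + d + delta[:, k]            # (n_quad, J) logits
            p = 1.0 / (1.0 + np.exp(-z))
            cond = np.prod(p**y * (1 - p)**(1 - y), axis=1)     # P(y | theta, xi = k)
            like_i += nu[k] * np.sum(weights * cond)
        ll += np.log(like_i)
    return ll
```

With a single class and flat items (a_j = d_j = 0) every response has probability 0.5, so the log-likelihood reduces to N·J·log(0.5), which provides a quick sanity check of the quadrature.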

Model Identifiability
The current model suffers from two sources of unidentifiability. The first source comes from not knowing the anchor items, and it occurs even if we condition on the latent class ξ_i, i.e., when the model becomes a MIMIC model for manifest DIF analysis. That is, for any constants c_1, ..., c_K, if we simultaneously replace μ_k by μ_k + c_k and replace δ_jk by δ_jk − a_j c_k, for all j = 1, ..., J and k = 1, ..., K, the likelihood function value L(∆) remains the same. This source of unidentifiability can be avoided when one or more anchor items are known a priori. Suppose that item j is known to be DIF-free. Under the proposed model framework, this implies the constraints δ_jk = 0 for all k. Consequently, the aforementioned transformation can no longer be applied, as otherwise the zero constraints for the anchor items would be violated.
As discussed in Section 2.4, this source of unidentifiability can be handled by a regularised estimation approach under a sparsity assumption that many DIF parameters δ_jk are zero.
The second source of unidentifiability is the label-switching phenomenon of latent class models (Redner and Walker, 1984), a result of the exchangeability of the latent classes. Under the current model, the baseline class is uniquely identified through the constraints δ_j0 = 0, μ_0 = 0 and σ_0^2 = 1. However, the remaining latent classes are exchangeable, and the likelihood function value remains the same when switching their labels. While label switching often causes trouble when inferring a latent class model with Bayesian Markov chain Monte Carlo (MCMC) algorithms (Stephens, 2000), it is not a problem for the estimator to be discussed in Section 2.4. Our estimator is proposed under the frequentist setting and computed by an EM algorithm. When the EM algorithm converges, it reaches one of the solutions that are equivalent up to label switching.

Sparsity, Model Selection and Estimation
As explained above, the latent trait cannot be identified without anchor items. In that case, additional assumptions are needed to solve the DIF analysis problem. Specifically, we adopt the sparsity assumption, i.e., that many DIF parameters δ_jk are zero. This is a common assumption in the manifest DIF literature; see, for example, Magis et al. (2015); Tutz and Schauberger (2015); Belzak and Bauer (2020); Bauer et al. (2020); Schauberger and Mair (2020). In many applications, for example the detection of aberrant behaviour or parameter drift in educational testing, the number of DIF items is low, suggesting that this assumption is meaningful.
Under the above sparsity assumption, we propose an L_1-regularised estimator to simultaneously estimate the unknown model parameters and learn the sparsity pattern of the DIF-effect parameters. This estimator takes the form

∆̂_λ = arg min_∆ { −log L(∆) + λ ∑_{j=1}^J ∑_{k=1}^K |δ_jk| },   (3)

where L(∆) is the marginal likelihood function defined in (2), and λ > 0 is a tuning parameter. The computation of this estimator will be discussed in Section 3. Similar to Lasso regression (Tibshirani, 1996), the L_1 regularisation term λ ∑_{j=1}^J ∑_{k=1}^K |δ_jk| in (3) tends to shrink some of the DIF-effect parameters to exactly zero. In the most extreme case, where λ goes to infinity, all the DIF-effect parameters shrink to zero. Under suitable regularity conditions, and when λ is chosen properly (i.e., λ goes to infinity at a suitable speed as the sample size grows), the L_1-regularised estimator yields both estimation and selection consistency (Zhao and Yu, 2006; van de Geer, 2008). In that case, the latent trait is consistently identified, and the consistently selected sparsity pattern of the estimated DIF-effect parameters can be used to classify items as DIF and non-DIF items.
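The mechanism by which the L_1 term produces exact zeros can be seen in a one-parameter toy version of the penalised criterion, where the smooth part is replaced by a simple quadratic. The penalised minimiser is then the soft-threshold of the unpenalised estimate, so it is exactly zero whenever λ exceeds that estimate in magnitude. This is a hypothetical illustration, not the estimator itself.

```python
import numpy as np

def l1_objective(delta, z, lam):
    """Toy one-parameter analogue of the penalised criterion: a quadratic
    fit term plus an L1 penalty, 0.5*(delta - z)**2 + lam*|delta|,
    where z plays the role of the unpenalised estimate."""
    return 0.5 * (delta - z) ** 2 + lam * np.abs(delta)

def l1_minimiser(z, lam):
    """Closed-form minimiser of the toy objective: the soft-threshold
    sign(z) * max(|z| - lam, 0). It is exactly zero once lam >= |z|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

For example, with z = 0.8 and λ = 1 the minimiser is exactly zero, whereas with z = 2 and λ = 0.5 it is the shrunken value 1.5; a grid search over the objective confirms the closed form.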
We select the tuning parameter λ based on the Bayesian Information Criterion (BIC; Schwarz, 1978) using a grid search approach. Specifically, we consider a pre-specified set of grid points for λ, denoted by λ_1, ..., λ_M. For each value λ_m, we solve the optimisation problem (3) and obtain ∆̂^(λ_m). To compute the BIC value for the model encoded by ∆̂^(λ_m), we compute a constrained maximum likelihood estimator ∆̃^(λ_m), fixing at zero those DIF parameters whose estimates in ∆̂^(λ_m) are zero, i.e.,

∆̃^(λ_m) = arg max { log L(∆) : δ_jk = 0 for all (j, k) such that δ̂_jk^(λ_m) = 0 }.   (4)

The BIC corresponding to λ_m is calculated as

BIC_{λ_m} = −2 log L(∆̃^(λ_m)) + log(N) Card(∆̃^(λ_m)),

where Card(∆̃^(λ_m)) denotes the number of free parameters in ∆̃^(λ_m), which equals the total number of free parameters in ∆ minus the corresponding number of zero constraints. The tuning parameter is then selected as λ̂ = arg min_{λ_m} BIC_{λ_m}. Thanks to the asymptotic properties of the BIC (Shao, 1997), the true model will be consistently selected if it is found under one of the candidate tuning parameters.
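The grid search can be sketched as follows; here `fit_fn` stands in for the EM routine of Section 3 (a hypothetical callable returning, for a given λ, the constrained maximum log-likelihood and its free-parameter count).

```python
import numpy as np

def bic(loglik, n_free, N):
    """BIC = -2 * log-likelihood + log(N) * number of free parameters."""
    return -2.0 * loglik + np.log(N) * n_free

def select_lambda(fit_fn, lambdas, N):
    """Grid search over the tuning parameter: for each lambda, fit_fn
    (a placeholder for the regularised fit plus the constrained refit)
    returns (log-likelihood, free-parameter count); keep the lambda with
    the smallest BIC."""
    scores = [bic(*fit_fn(lam), N=N) for lam in lambdas]
    return lambdas[int(np.argmin(scores))], scores
```

Larger λ values remove more DIF parameters, so the BIC trades a lower log-likelihood against a smaller parameter count.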
We use the constrained maximum likelihood estimator ∆̃^(λ̂) as the final estimator of the selected model and declare an item j to be a DIF item if δ̃_jk^(λ̂) ≠ 0 for some k ∈ {1, ..., K}. We summarise this procedure in Algorithm 1 below.

Other Inference Problems
The latent class membership can be inferred by an empirical Bayes procedure, i.e., by the maximum a posteriori (MAP) estimate under the estimated model. For the MAP estimator, the goal is to find the most probable latent class k for each respondent i, given their observed responses y_i. This is done by maximising the posterior probability of ξ_i being equal to k, conditional on the observed responses y_i:

ξ̂_MAP,i = arg max_{k ∈ {0, 1, ..., K}} P(ξ_i = k | y_i),

where the posterior probabilities

P(ξ_i = k | y_i) = ν̃_k f_k(y_i) / ∑_{l=0}^K ν̃_l f_l(y_i),   (6)

are computed based on the estimated model parameters, with f_k(y_i) denoting the conditional probability of the response vector y_i given ξ_i = k, with θ_i marginalised out.

[Algorithm 1 (Regularised estimation and model selection); output: the estimate of the selected model, ∆̃^(λ̂), and the selected DIF items {j : δ̃_jk^(λ̂) ≠ 0 for some k}.]
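Given the per-class log-joint values log(ν̃_k f_k(y_i)), the MAP classification described above amounts to a row-wise normalisation followed by an arg max. A small sketch (the input array is hypothetical):

```python
import numpy as np

def map_class(log_joint):
    """MAP class assignment from an (N, K+1) array whose (i, k) entry is
    log(nu_k * P(y_i | xi_i = k)), i.e. the unnormalised log posterior.
    The posterior probabilities follow by normalising each row."""
    log_joint = np.asarray(log_joint, dtype=float)
    # Subtract the row maximum before exponentiating for numerical stability.
    post = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    return post.argmax(axis=1), post
```

The subtraction of the row maximum leaves the normalised posteriors unchanged while avoiding overflow when the log-joint values are large in magnitude.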
Lastly, the number of latent classes, i.e., the choice of K, can be determined using the BIC. That is, we solve the optimisation problem in (3) for different values of K, yielding ∆̃(K) for each. We thereafter compute the BIC as

BIC_K = −2 log L(∆̃(K)) + log(N) Card(∆̃(K)),

and select the K that yields the smallest BIC: K̂ = arg min_K BIC_K.

Extensions
The proposed model can be extended in several ways to accommodate different data types and more than one latent factor. To make this clear, our model can be expressed as

logit P(Y_ij = 1 | θ_i, ξ_i) = g(θ_i; β_j) + δ_{jξ_i},   (9)

where the function g determines the parametrisation of the person and item parameters, denoted by θ_i and β_j respectively. If, for example, the Rasch model (Rasch, 1960) is adopted, then β_j = (a, d_j)^⊤, where a_j = a for all j = 1, ..., J, and g(θ_i; β_j) = a θ_i + d_j. It is also possible to consider link functions other than the logistic function considered in this paper, such as the probit link, under which Φ^{−1}(P(Y_ij = 1 | θ_i, ξ_i)) = g(θ_i; β_j) + δ_{jξ_i}, with Φ denoting the standard normal distribution function. DIF analysis using multidimensional IRT models with unknown anchor items has recently been considered in Wang et al. (2023). As in the unidimensional case, no method can handle situations where both the groups and the anchor items are missing. Our proposed framework can, however, be extended to handle such situations. Consider the extension of (9) where each respondent, in addition to the latent class membership ξ_i, is represented by an L-dimensional latent vector θ_i = (θ_i1, ..., θ_iL)^⊤. Each item is represented by an intercept parameter d_j and L loading parameters a_j = (a_j1, ..., a_jL)^⊤. This extension of the model can be expressed as a multidimensional 2-PL model with an added DIF component, i.e.,

logit P(Y_ij = 1 | θ_i, ξ_i) = a_j^⊤ θ_i + d_j + δ_{jξ_i}.

The proposed modelling framework can also be extended to accommodate ordinal data. Denoting the response categories of Y_ij by m = 1, ..., M, such a model can, using the logistic link, be expressed as

logit P(Y_ij ≤ m | θ_i, ξ_i) = −a_j θ_i + d_jm + δ_{jξ_i}, m = 1, ..., M − 1.

The model without the DIF parameter is known as the proportional odds model (Samejima, 1969). Note the negative sign in front of the slope parameter, so that if a_j is positive, increasing θ_i will increase the probability of higher-numbered levels of Y_ij.
Lastly, we mention the possibility of extending the model to accommodate DIF effects in the discrimination parameter, known as non-uniform DIF. To consider such a case, we introduce a DIF-effect parameter for a_j, analogous to the one for d_j, and denote the DIF effect on the discrimination parameter by α_{jξ_i}.
Given that ξ_i = 0 is treated as the reference group, we set α_j0 = 0 for all j = 1, ..., J. In this case, the modified 2-PL model with a logit link, which accounts for DIF in both a_j and d_j, can be written as

logit P(Y_ij = 1 | θ_i, ξ_i) = (a_j + α_{jξ_i}) θ_i + d_j + δ_{jξ_i}.

To include this form of DIF in the L_1-regularised estimator, the penalty term needs to be modified to penalise both the α_jk and the δ_jk terms. The modified estimator is given by

arg min_∆ { −log L(∆) + λ ∑_{j=1}^J ∑_{k=1}^K (|α_jk| + |δ_jk|) }.

Computation

The computation of the optimisation problems (3) and (4) is carried out using the EM algorithm (Dempster et al., 1977; Bock and Aitkin, 1981). An EM algorithm is an iterative algorithm, alternating between an Expectation (E) step and a Maximisation (M) step. Optimisation problem (4) involves maximising the marginal likelihood function of a regular latent variable model, and thus can be solved by a standard EM algorithm; its details are therefore skipped here. However, the optimisation problem (3) involves a non-smooth L_1 term. Consequently, the M-step of the algorithm cannot be carried out using a gradient-based numerical solver, such as a Newton-Raphson algorithm. We develop an efficient EM algorithm that uses a proximal gradient update (Parikh and Boyd, 2014) to carry out the non-smooth optimisation in the M-step. In what follows, we elaborate on this algorithm using the 2-PL model as the baseline IRT model, while pointing out that the algorithm easily extends to other baseline IRT models.
Suppose that t iterations of the algorithm have been run and let ∆^(t) be the current parameter value. In the E-step of the tth iteration, we construct a local approximation of the negative objective function at ∆^(t) in the form of

−Q(∆ | ∆^(t)) = −E[ℓ_c(∆) | Y, ∆^(t)] + λ ∑_{j=1}^J ∑_{k=1}^K |δ_jk|,   (12)

where ℓ_c(∆) denotes the complete-data log-likelihood. We note that the expectation in (12) is with respect to the conditional distribution of the latent variables (θ_i, ξ_i), i = 1, ..., N, given the observed responses, evaluated at the current parameter value ∆^(t). In the M-step, we find ∆^(t+1) such that

−Q(∆^(t+1) | ∆^(t)) ≤ −Q(∆^(t) | ∆^(t)),

or equivalently, Q(∆^(t+1) | ∆^(t)) ≥ Q(∆^(t) | ∆^(t)). By Jensen's inequality, this guarantees that the objective function of (3) decreases, i.e., the penalised negative log-likelihood at ∆^(t+1) is no larger than that at ∆^(t). More specifically, we write ∆ = (∆_1^⊤, ∆_2^⊤)^⊤, where ∆_1 = (ν_0, ..., ν_K)^⊤ and ∆_2 contains the rest of the parameters. We notice that −Q(∆ | ∆^(t)) in (12) can be decomposed as the sum of a smooth function of ∆_1 and a (non-smooth) function of ∆_2, so the two blocks can be updated separately. In particular, ∆_1^(t+1) can be obtained by solving the following constrained optimisation problem:

∆_1^(t+1) = arg min_{∆_1} { −Q((∆_1^⊤, (∆_2^(t))^⊤)^⊤ | ∆^(t)) : ν_0 + ν_1 + ... + ν_K = 1, ν_k ≥ 0 }.

Using the method of Lagrange multipliers, this optimisation problem has a closed-form solution; see the Appendix for the details.
We then find ∆_2^(t+1). Write −Q((∆_1^(t+1)⊤, ∆_2^⊤)^⊤ | ∆^(t)) = F_t(∆_2) + G(∆_2), where F_t is smooth and G(∆_2) = λ ∑_{j=1}^J ∑_{k=1}^K |δ_jk|, and consider the optimisation problem min F_t(∆_2) + G(∆_2). Due to the non-smoothness of G, this objective function is not differentiable everywhere. Consequently, gradient-based methods are not applicable. We find ∆_2^(t+1) by using a proximal gradient method (Parikh and Boyd, 2014). Denote the dimension of ∆_2 by d, where d equals the number of free parameters in ∆_2 and is determined by the choice of baseline model (2J + JK + 2K for the 2-PL model). We define the proximal operator of G at ∆_2 as

prox_{αG}(∆_2) = arg min_{x ∈ R^d} { G(x) + (1/(2α)) ‖x − ∆_2‖^2 }.

We update ∆_2 by

∆_2^(t+1) = prox_{αG}(∆_2^(t) − α ∇F_t(∆_2^(t))),   (13)

where ∇F_t(∆_2) denotes the gradient of F_t at ∆_2, and α is a step size. According to Parikh and Boyd (2014, sect. 4.2), for a sufficiently small step size α, it is guaranteed that F_t(∆_2^(t+1)) + G(∆_2^(t+1)) < F_t(∆_2^(t)) + G(∆_2^(t)), unless ∆_2^(t) is already a stationary point. Thus, we select α by a line search procedure, whose details are provided in the Appendix. We note that this proximal operator has a closed-form solution. Specifically, each DIF-effect parameter δ_jk is updated by solving

min_x { λ|x| + (1/(2α)) (x − z_jk)^2 }, where z_jk = δ_jk^(t) − α ∂F_t/∂δ_jk evaluated at ∆_2^(t),

which has a closed-form solution given by soft-thresholding (Chapter 3, Hastie et al., 2009). The rest of the parameters in ∆_2 do not appear in the non-smooth function G(∆_2), and thus the resulting update in (13) degenerates to a vanilla gradient descent update; for example, a_j^(t+1) = a_j^(t) − α ∂F_t/∂a_j evaluated at ∆_2^(t).
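To illustrate the gradient-step-plus-soft-threshold iteration in isolation, the sketch below applies it to a toy L_1-penalised least-squares problem (not the M-step objective F_t itself, whose gradient requires the E-step quantities); the assumed design and penalty level are arbitrary.

```python
import numpy as np

def proximal_gradient_lasso(X, y, lam, alpha=0.01, n_iter=2000):
    """Proximal gradient descent for a toy least-squares problem with an L1
    penalty: each iteration takes a gradient step on the smooth part and
    then applies the soft-thresholding proximal operator of lam * ||.||_1,
    mirroring the structure of the M-step update with a quadratic loss."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / len(y)   # gradient of the smooth part
        step = beta - alpha * grad             # gradient update
        # Proximal step: soft-threshold every coordinate by alpha * lam.
        beta = np.sign(step) * np.maximum(np.abs(step) - alpha * lam, 0.0)
    return beta
```

Coordinates whose gradient contribution stays below the penalty level are driven to exactly zero at the fixed point, which is precisely the mechanism that selects the non-DIF parameters in the proposed algorithm.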
Further details of the proximal gradient update can be found in the Appendix. We summarise the main steps of this EM algorithm in Algorithm 2 below.
Remark. While differentiable approximations, such as the smoothed Lasso, can allow the use of gradient-based methods, they often come with their own set of challenges. Introducing a smoothing parameter can make the method sensitive to its choice, and in some situations the approximation might not be close enough to the original problem, especially when the smoothing parameter is not sufficiently small. We therefore opt for an approach based on non-smooth optimisation. Directly tackling the non-smoothness ensures that we do not compromise on the sparsity of the solution, which is critical for our analysis. We have designed our algorithm around the EM algorithm, which inherently possesses certain convergence properties.
Specifically, the EM algorithm is guaranteed to increase the log-likelihood in each iteration, and under mild regularity conditions, it converges to at least a local maximum of the log-likelihood. While we acknowledge that there are potential pitfalls in the convergence of non-smooth optimisation algorithms, we have implemented strategies to ensure the stability of our algorithm, such as adaptive step sizes and convergence checks.

Simulation Study
In this simulation study, we evaluate the performance of the proposed method, treating the number of latent classes K as fixed. We consider settings with two and three latent classes. For each simulation scenario, we run B = 100 independent replications.

Settings
We examine 16 scenarios under the two-group setting and 8 scenarios under the three-group setting, considering J ∈ {25, 50}, N ∈ {500, 1000}, and K ∈ {1, 2}. Note that K = 1 represents a two-group setting and K = 2 a three-group setting. In the two-group setting, the class proportions are varied, with the reference group consisting of either 50%, 80%, or 90% of the respondents. In the three-group scenario, 50% of the respondents belong to the reference group, 30% to the second latent class, and 20% to the third latent class. The number of DIF items is set to 10 for all cases with J = 25 and to 20 for all cases with J = 50, i.e., the proportion of DIF items remains the same.
The intercept parameters d_j are generated from the Uniform(−2, 2) distribution and the slope parameters a_j from the Uniform(0.5, 1.5) distribution. In the two-group setting, we consider three cases for the class proportions, where ν = (0.1, 0.9), ν = (0.2, 0.8), and ν = (0.5, 0.5), respectively. The latent construct θ_i within each latent class ξ_i follows the class-specific normal distribution of Section 2.2, and for the three-group case, we additionally specify a normal distribution for the third class. The DIF-effect parameters are generated as δ_jk ~ Uniform(0.5, 1.5) for the non-zero elements and set to 0 for the remaining items. For the three-group case, the DIF effects for the second and third latent classes are generated from a Uniform(0.5, 1) and a Uniform(1, 1.5) distribution, respectively. In the two-group setting, the DIF items are positioned at the beginning of the scale (items 1-10 when J = 25 and 1-20 when J = 50). In the three-group setting, the DIF items are positioned at the end of the scale (items 16-25 when J = 25 and 31-50 when J = 50). The true model parameters are given in the supplementary material.
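A sketch of the two-group data-generating process described above (with an assumed 80/20 class split and a focal-class mean of 1, since the exact structural values are deferred to the supplementary material):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(N=500, J=25, n_dif=10, nu=(0.8, 0.2), mu=(0.0, 1.0), sigma=(1.0, 1.0)):
    """Generate 2-PL responses for the two-group (K = 1) design:
    a_j ~ U(0.5, 1.5), d_j ~ U(-2, 2), DIF effects U(0.5, 1.5) on the
    first n_dif items for the focal class and zero elsewhere.
    The class split, mu, and sigma are assumed illustrative values."""
    a = rng.uniform(0.5, 1.5, J)
    d = rng.uniform(-2.0, 2.0, J)
    delta = np.zeros(J)
    delta[:n_dif] = rng.uniform(0.5, 1.5, n_dif)
    xi = rng.choice(len(nu), size=N, p=nu)                     # latent classes
    theta = rng.normal(np.asarray(mu)[xi], np.asarray(sigma)[xi])
    logits = np.outer(theta, a) + d + np.outer(xi == 1, delta)  # DIF shift for class 1
    Y = (rng.uniform(size=(N, J)) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
    return Y, xi, delta
```

The returned `xi` and `delta` serve as the ground truth against which classification and DIF detection can be scored in a replication.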

Evaluation Criteria
Detection of DIF items. We check whether the DIF items can be accurately detected by the proposed method. In this analysis, we assume the number of latent classes K is known. We calculate the average true positive rate (TPR), i.e., the proportion of truly non-zero DIF effects that are correctly flagged,

TPR = (1/B) ∑_{b=1}^B [ ∑_{j,k} 1{δ̃_jk^(b) ≠ 0, δ*_jk ≠ 0} / ∑_{j,k} 1{δ*_jk ≠ 0} ],

where {δ̃_jk^(b)}_{J×K} is the estimated DIF-effect matrix in the b-th replication and {δ*_jk}_{J×K} denotes the true DIF-effect matrix. Similarly, we calculate the average true negative rate (TNR), i.e., the proportion of truly zero elements that are correctly identified as zero:

TNR = (1/B) ∑_{b=1}^B [ ∑_{j,k} 1{δ̃_jk^(b) = 0, δ*_jk = 0} / ∑_{j,k} 1{δ*_jk = 0} ].

To better evaluate the performance of the proposed method in detecting DIF items, we compare the TPR and TNR with the results under an oracle setting where the group membership ξ_i is known while the anchor items are unknown. Under this oracle setting, we solve a regularised estimation problem as in Bauer et al. (2020), in which the latent class memberships ξ_i enter the likelihood as observed and Ξ includes the parameters in ∆ except for those in ν. The tuning parameter λ is chosen by BIC. The TPR and TNR for detecting DIF items are also calculated under this setting and compared with those from the proposed method, where the ξ_i are unknown.
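For one replication, the item-classification rates can be computed as follows (a minimal sketch; `delta_hat` would hold the estimated DIF-effect values and `delta_true` the data-generating ones):

```python
import numpy as np

def tpr_tnr(delta_hat, delta_true):
    """True positive rate (non-zero DIF effects correctly flagged) and true
    negative rate (zero effects correctly left unflagged) for a single
    replication; averaging over replications gives the reported rates."""
    flagged = np.asarray(delta_hat) != 0
    truth = np.asarray(delta_true) != 0
    tpr = (flagged & truth).sum() / truth.sum()
    tnr = (~flagged & ~truth).sum() / (~truth).sum()
    return tpr, tnr
```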
Classification of respondents. We then consider the classification of respondents based on the MAP estimate. Again, we assume the number of latent focal groups K is known. We calculate the average classification error rate, given by the fraction of incorrect classifications out of the overall number of classifications, averaged over the replications:

Err = (1/B) ∑_{b=1}^B (1/N) ∑_{i=1}^N 1{ξ̂_MAP,i^(b) ≠ ξ_i^(b)},

where ξ̂_MAP,i = arg max_{k ∈ {0,1,...,K}} P(ξ_i = k | y_i) and P(ξ_i = k | y_i) is the posterior probability of category k given in (6). We notice that there is a label-switching problem under the setting with K = 2 when calculating the classification error. This problem is solved by a post-hoc label switch based on the estimated ν_1 and ν_2, using the ordering information that class 2 is larger than class 1, i.e., ν_2 > ν_1. We also calculate the MAP classification error under the true model and compare it with the classification error of the proposed method.
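A sketch of the respondent classification error; here label switching is resolved by minimising the error over relabellings of the non-reference classes, a slight variant of the proportion-ordering rule used in the paper:

```python
from itertools import permutations

import numpy as np

def classification_error(xi_hat, xi_true, K):
    """Respondent classification error rate with label switching resolved
    post hoc: the non-reference labels 1..K are permuted to the assignment
    that minimises the error (the reference class 0 is identified and fixed)."""
    xi_hat, xi_true = np.asarray(xi_hat), np.asarray(xi_true)
    best = 1.0
    for perm in permutations(range(1, K + 1)):
        relabel = np.array([0] + list(perm))       # map estimated label -> permuted label
        best = min(best, np.mean(relabel[xi_hat] != xi_true))
    return best
```

Minimising over permutations and ordering by estimated class proportion coincide whenever the proportions are well separated; the permutation version is shown here because it needs no extra input.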
Parameter estimation accuracy. We further evaluate the accuracy of our final estimator ∆̃^(λ̂), assuming that K is known. For each unknown parameter, the root mean square error (RMSE) and the absolute bias are calculated based on the 100 replications.

Results
The classification performance in the simulation study is presented in Tables 1 and 2, which display the respondent and item classification accuracy for the two-group and three-group settings, respectively. First, it is observed that the classification error is predominantly influenced by the number of items, with larger item sets resulting in better respondent classification performance. This observation is consistent with previous literature on DIF detection using IRT models (Chen et al., 2022; Kuha and Moustaki, 2015).
For respondent classification in the two-group setting, we observe in Table 1 that the classification error is small for all simulation scenarios, and always better than a naïve classifier that assigns all respondents to the reference group. This is true even when the focal group consists of only 10% of the respondents. As the proportion in the focal group (denoted by π in Table 1) increases, the proposed method's improvement over the naïve classifier becomes more pronounced. We also note that the AUC values increase as the proportion of respondents in the focal group increases, but the increments are small. Table 1 furthermore shows that the classification accuracy is only slightly worse when using the estimated parameters than when the true model parameters are used.
For item classification in the two-group setting, the true positive rates are very high, with no item falsely flagged as a DIF item, across all scenarios. This suggests that the proposed framework is effective in identifying DIF items while minimising false positives for various numbers of items and focal group proportions. We also note that the oracle estimator performs only slightly better, i.e., knowing the group membership of the respondents leads only to a marginal improvement in item classification accuracy.
In the three-group setting, Table 2 shows that the classification error rates are generally higher than those observed in the two-group case, which is expected given the increased complexity of the DIF detection problem when more than two groups are involved. However, note that the classification error is always clearly smaller than that of the naïve classifier that assigns all respondents to the reference group. This gain in classification performance is particularly clear in the simulation scenarios with J = 50. We also observe that the AUC values for classes 2 and 3 are within reasonable ranges, suggesting that the proposed method is capable of adequately classifying respondents even in more challenging settings. The TPR values for item classification in the three-group setting are furthermore high for both class 2 and class 3 items, with no DIF-free item misclassified. This further supports the effectiveness of the proposed framework in identifying DIF items across different group configurations.
Figures 2 and 3 show the root mean squared errors (RMSEs) for all item parameter estimates in the two-group setting (K = 1). The RMSEs are small for all estimates across the items, with the exception of one or two items that show larger RMSEs for the estimated DIF parameter. The RMSEs for the estimated DIF parameters of the DIF-free items are zero, as the proposed estimation procedure successfully classifies the DIF items. We also see that the number of items and the proportion of respondents in the focal group have essentially no influence on the RMSE for the configurations considered in this simulation study.
In Table 3 and Table 4, the absolute bias and RMSE under the two-group setting, averaged over the number of items of the same type, are displayed for both sample sizes and every focal group proportion π considered.
The absolute bias and RMSE are small for all parameters, and the differences between different values of π and different values of N are very small. The most notable difference is seen for the estimate of the focal group proportion π, where the bias and RMSE clearly decrease when the sample size increases. We also see that the DIF effect parameter is estimated with only a small bias and RMSE. In Table 5, the absolute bias and RMSE are shown for the three-group setting. As in the two-group setting, the bias and RMSE values are small under all settings.
In summary, the simulation results presented in Tables 1-5 and Figures 2-4 demonstrate the potential of the proposed framework for DIF analysis with unknown anchor items and comparison groups. The framework performs well in terms of respondent and item classification accuracy across a range of scenarios, with good parameter recovery, suggesting its applicability in various real-world settings.

Real Data Analysis
To illustrate the proposed method, we analyse a mathematics test from a midwestern university in the United States. This data set has been analysed in both Bolt et al. (2002) and De Boeck et al. (2011). Bolt et al. (2002) hypothesised the existence of two latent classes: one speeded class that answered end-of-test items with insufficient time, and another non-speeded class. The identification of speeded items was conducted using a two-form design. Specifically, they examined common items across two test administrations, where the common items were placed at the end of the test in one form and earlier in the other form. By estimating the item difficulty for the end-of-test common items and comparing it to the difficulty estimates from the other form, they were able to quantify the DIF effect. Our goal is to detect the DIF items, i.e., the speeded items, and classify respondents into latent classes. With our procedure, we can analyse a single form, without using information from the second form. A similar analysis using simulated data was also conducted in Robitzsch (2022), but using a Rasch mixture model.
We start by fitting the proposed model to the data for different values of K, which determines the number of latent classes. The BIC for a model with K = 0, i.e., no latent classes other than the reference group, equals 117,552.2; the BIC for K = 1 equals 92,300.1; and for K = 2, the BIC equals 92,522.8. We therefore proceed with a model using K = 1, i.e., two latent classes. This aligns with the two-group model considered in Bolt et al. (2002). It took the proposed EM algorithm 37.04, 144.76, and 277.39 seconds to converge for the 1-, 2-, and 3-group solutions, respectively. The proposed model classifies 25.8% of the respondents into the second latent class. If we interpret the two classes as a speeded and a non-speeded class, this means that about 26% of the respondents belong to the speeded class. The estimated mean ability in the speeded class equals -0.351, with the estimated standard deviation equal to 1.075. Since the reference group (the non-speeded class) has a prespecified ability mean and standard deviation equal to 0 and 1, respectively, our results indicate that the speeded class has a lower ability on average compared to the non-speeded class. These findings align closely with the results presented in Bolt et al. (2002).
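The model comparison above follows the standard BIC recipe: BIC = −2 log L + p log N, selecting the K that minimises it. A minimal sketch of this step, using hypothetical log-likelihoods and parameter counts (not the values from the analysis above):

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: smaller is better."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

# Hypothetical fitted models indexed by K (number of focal latent classes).
# Each entry: (maximised log-likelihood, number of free parameters).
fits = {0: (-58700.0, 60), 1: (-46000.0, 92), 2: (-45990.0, 124)}
n_obs = 3000  # hypothetical number of respondents

bics = {K: bic(ll, p, n_obs) for K, (ll, p) in fits.items()}
best_K = min(bics, key=bics.get)
```

Note how a small gain in log-likelihood from K = 1 to K = 2 is outweighed by the penalty on the extra parameters, mirroring the pattern in the reported BIC values.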
In Table 6, we give the estimated item parameters from the educational test data. The estimated item discrimination and easiness parameters, â and d̂ respectively, are provided together with the estimated DIF effect δ̂. Common items are denoted by asterisks. For the majority of the items, the DIF effects are estimated to be zero, indicating that these items do not exhibit any significant measurement bias between the groups. However, items 20-26 exhibit non-zero DIF effects, suggesting that these items might be functioning differently for the two latent classes. Among these, items 20, 21, 22, 23, and 24 are also common items, which may require further investigation to ensure fair assessment across test administrations. Since the DIF effects for these end-of-test items are all negative, this suggests that these items become more difficult for the second latent class. This class could therefore consist of respondents who ran out of time and had insufficient time to answer these items. This is known as a speededness effect. As a result, the item difficulty is inflated, which could lead to biased subsequent analyses. The presence of non-zero DIF effects for some end-of-test items highlights the need to scrutinise these items more closely and potentially revise the test administration to minimise the impact of speededness. For instance, increasing the allocated time for the test or redistributing the items more evenly throughout the test could help alleviate the speededness effect and create a more unbiased assessment.

Concluding Remarks
In this paper, we presented a comprehensive framework for DIF analysis that overcomes several limitations of existing methods. Our approach can deal with the situation in which both anchor items and comparison groups are unknown, a setting commonly encountered in real-world applications. The use of latent classes in our approach allows us to model heterogeneity among the observations. In this sense, our approach relates to an exploratory dimensionality analysis where there is, in addition to the primary latent dimension, a second dimension that is treated as unknown. In our empirical analysis, this additional dimension is labelled as a speededness effect. In addition to modelling the additional latent dimension(s), the proposed regularised estimator enables us to identify DIF items and quantify their effect on the intercept parameter of the model. We also propose an efficient EM algorithm for the estimation of the model parameters. One merit of our framework is its flexibility. While focusing on the 2-PL model as the baseline model, our approach can easily be extended to accommodate other widely used IRT models, such as the Rasch model and the proportional odds model. We can also allow the baseline model to be a multidimensional IRT model, as shown in Section 2.6. Our framework is furthermore able to accommodate more than two comparison groups, allowing DIF effects to vary between the groups. Lastly, the proposed method can be extended, as shown in Section 2.6, to detect non-uniform DIF. This flexibility makes our framework applicable to a wide range of contexts.
Although our approach shows promising results, there are still several limitations to be addressed in future research. For example, we do not provide confidence intervals for the DIF effect parameters, which would be useful for practitioners and researchers in interpreting the magnitude and significance of the DIF effects. In Chen et al. (2023), for example, where the comparison groups are known but the anchor items are unknown, the distribution of δ̂_jk − δ_jk is approximated by Monte Carlo simulation to yield valid statistical inference. In essence, this procedure applies to our case as well. In addition, we have not linked the latent classes to covariates, as in, for example, Vermunt (2010) and Vermunt and Magidson (2021). By doing so, researchers can gain insights into the underlying characteristics of the different classes and better understand the factors that may contribute to DIF. This would enhance the interpretability of the results and help identify potential sources of DIF that could be addressed in the development of assessment instruments. To address this limitation, future research could explore the integration of covariates within a structural equation modelling (SEM) framework. This would enable the simultaneous modelling of both the measurement model (i.e., the IRT model) and the structural model (i.e., relationships between latent variables and covariates).
Incorporating covariates in this manner would not only improve the interpretability of the results but could also provide a more comprehensive understanding of the relationships between the items, latent traits, and potential sources of DIF. Our framework could also be extended to accommodate non-uniform DIF, i.e., DIF in the slope parameter, as in Wang et al. (2023), which considers a multidimensional IRT model with known comparison groups and unknown anchor items.
In this study, we focus on the Lasso penalty for its simplicity, computational efficiency, and well-documented ability to perform both variable selection and regularisation. The Lasso's convex optimisation problem is computationally easier to solve than those of some non-convex penalties, such as the SCAD (Fan and Li, 2001) and the Minimax Concave Penalty (MCP; Zhang, 2010). We acknowledge that the Lasso penalty can introduce some bias into parameter estimates. However, in our proposed method, we use the Lasso for model selection. As we thereafter refit the selected model, there will be no bias, asymptotically, provided that the model selection based on the Lasso is consistent (Zhao et al., 2021). Alternative penalties, including the adaptive Lasso (Zou, 2006), the SCAD, and the MCP, have their own merits. However, they also come with some challenges, especially in terms of computational complexity and algorithmic stability. We therefore argue that the Lasso penalty is a suitable choice for the proposed model and its identifying assumptions. It would be interesting in future work to compare the performance of estimators with different penalty functions under the current latent DIF setting.
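The select-then-refit strategy discussed above can be illustrated generically. The sketch below uses a simple sparse linear model rather than the IRT model of this paper, and a basic ISTA (proximal gradient) solver; all names and data are hypothetical, and the point is only the principle: the Lasso picks the support, and an unpenalised refit on that support removes the shrinkage bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:2] = [2.0, -1.5]          # sparse ground truth: two non-zero effects
y = X @ beta + rng.normal(scale=0.5, size=n)

# Step 1: Lasso via proximal gradient (ISTA), used for selection only.
lam = 0.1
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()  # 1 / Lipschitz constant
b = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ b - y) / n
    z = b - step * grad
    b = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold

selected = np.flatnonzero(b != 0.0)

# Step 2: refit by unpenalised least squares on the selected support,
# removing the shrinkage bias of the Lasso estimates.
b_refit = np.zeros(p)
b_refit[selected], *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
```

Under selection consistency, the refitted estimates behave asymptotically like those of the model fitted with the true support known, which is the argument invoked in the paragraph above.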
Our proposed framework provides a powerful tool for DIF analysis with unknown anchor items and comparison groups.The framework has the potential to inform the development of fair and unbiased assessments.
Future research can build upon our approach by addressing the limitations and exploring other applications.
In terms of the potential impact of our work, the framework could be particularly beneficial in specific contexts, such as educational assessment, where identifying and addressing DIF is critical to ensure that tests fairly measure students' abilities across heterogeneous populations, thereby promoting equal access to educational opportunities. It could also be considered in employment selection, where unbiased assessment instruments are crucial for creating a diverse and inclusive workforce that complies with legal requirements related to fairness in employment practices (Ployhart and Holtz, 2008; Hough et al., 2001). Another application is psychological evaluation, where accurate identification of DIF can help improve diagnostic tools and treatment recommendations, leading to better outcomes for individuals from diverse backgrounds (Teresi et al., 2021). By addressing the limitations and further refining our approach, this framework has the potential to contribute to the development of fairer assessment practices in these and other domains, ultimately benefiting a wide range of stakeholders.

A Gradients for the Proximal Gradient Descent
In the M-step of the proposed EM algorithm, we implement a proximal gradient descent. This algorithm requires the gradients of the objective function, which for the proposed model can be expressed in terms of the partial derivatives ∂φ_ij/∂η_j, where η_j is a generic notation for the item parameters (a_j, d_j, δ_jk) and φ_ij = d_j + a_j θ_i + δ_{jξ_i}. To give an example, we provide the explicit expressions for the two-group case. We start by parametrising the latent construct of the focal group as θ_2 = µ_2 + σ_2 θ_1, where µ_2 and σ_2 are the mean and standard deviation of the latent construct in the focal group, and θ_1 ∼ N(0, 1) is the latent construct in the reference group.
We have φ^(1)_ij = d_j + a_j θ_{i,1} for the reference group and φ^(2)_ij = d_j + δ_j + a_j θ_{i,2} for the focal group. The partial derivatives in (17) then follow directly: for the reference group, ∂φ^(1)_ij/∂d_j = 1, ∂φ^(1)_ij/∂a_j = θ_{i,1}, and ∂φ^(1)_ij/∂δ_j = 0; for the focal group, ∂φ^(2)_ij/∂d_j = 1, ∂φ^(2)_ij/∂a_j = θ_{i,2}, and ∂φ^(2)_ij/∂δ_j = 1.

B The Line Search Procedure

The implemented line search algorithm attempts to find an appropriate step size for the proximal gradient descent implemented in the M-step of the proposed EM algorithm. It does so by iteratively adjusting the step size until the change in the objective function (i.e., the penalised log-likelihood) is within a specified tolerance. The algorithm starts with an initial step size and iteratively reduces it by a factor (in this case, dividing by 2) until the new objective function value satisfies the tolerance condition. If the maximum number of iterations is reached without finding a satisfactory step size, the algorithm returns the current step size. The implemented line search algorithm is given in the following Line Search Algorithm.
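As a supplementary illustration of the backtracking scheme described above, the following is a minimal generic sketch, not the authors' implementation; a simple quadratic objective and a generic descent direction stand in for the penalised negative log-likelihood and its gradient:

```python
def line_search(objective, x, grad, step=1.0, tol=1e-8, max_iter=50):
    """Backtracking line search: halve the step size until the change in
    the objective along the gradient step is within the tolerance.
    Returns the current step size if max_iter is reached first."""
    prev = objective(x)
    for _ in range(max_iter):
        # Candidate point for the current step size.
        new = objective([xi - step * gi for xi, gi in zip(x, grad)])
        if abs(new - prev) < tol:
            return step
        # Adjust the step size and update the objective function value.
        step /= 2.0
        prev = new
    return step

# Usage with a toy quadratic objective and a descent direction.
f = lambda v: sum(t * t for t in v)
gamma = line_search(f, [1.0], [1.0])
```

The halving-by-two rule and the stopping condition on the change in the objective mirror the description of the procedure above.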

Figure 1: Path diagram of the proposed model, where the dashed lines indicate the DIF effects.
In Figure 4, we can observe the RMSEs specifically for the three-group scenario. It is important to note that in this case there are two DIF effects, one for each focal group. For the first focal group, the true DIF effects are drawn from the range [0.5, 1], and for the second focal group, they are drawn from the range [1, 1.5]. The increased difficulty of estimating smaller DIF effects is reflected by larger RMSEs for the first focal group.
Line Search Algorithm
For t = 1, 2, . . . , up to the maximum number of iterations:
Update the parameters using the current step size: ∆^(t) ← ∆^(t−1) − γ^(t−1)∇
Calculate the new objective function value: O^(t) ← −l(∆^(t)) + λ‖δ^(t)‖_1
If |O^(t) − O^(t−1)| < tol then return γ^(t)
else adjust the step size and update the value of the objective function: γ^(t) ← γ^(t−1)/2
end if
end for
γ* ← γ^(t)
Return optimal step size γ*

C The Soft-Thresholding Procedure

The soft threshold function and the proximal gradient function are used to identify the DIF-free items, i.e., the anchor items, from the data. The soft threshold function takes a vector x and a scalar λ as inputs and applies element-wise thresholding to x. It sets elements with absolute values less than or equal to λ to zero, subtracts λ from elements greater than λ, and adds λ to elements less than −λ. The proximal gradient function takes the (estimated) DIF effect δ, its gradient ∇_δ, a regularisation parameter λ, and a step size γ as inputs. It calls the soft threshold function with the updated vector δ − γ∇_δ and the product λγ. The output of the proximal gradient function is the updated DIF effect parameter estimate after applying the soft threshold function. These algorithms are summarised below.

Algorithm: Soft Threshold Function
Input: x, λ
Create a temporary variable: temp ← x
Set elements of temp to 0 if their absolute value is less than or equal to λ: temp[|x| ≤ λ] ← 0
Update elements of temp where x > λ: temp[x > λ] ← temp[x > λ] − λ
Update elements of temp where x < −λ: temp[x < −λ] ← temp[x < −λ] + λ
Output: temp

Algorithm: Proximal Gradient Function
Input: δ, ∇_δ, λ, γ
Call the soft threshold function with input (δ − γ∇_δ, λγ): Prox ← soft_thre(δ − γ∇_δ, λγ)
Output: Prox

D The Closed-Form Solution of the Latent Class Proportion

Using the Lagrange multiplier method, we obtain ν_ik = P(ξ_i = k | y_i). We can see this from the following argument, given that D_t(∆_1) = −∑_{i=1}^N E[log ν_{ξ_i} | Y_i, ∆^(t)].
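The soft threshold and proximal gradient functions described in Appendix C translate directly into vectorised code; a minimal NumPy sketch following that description (an illustration, not the authors' code):

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise soft thresholding: set entries with |x| <= lam to zero,
    subtract lam from entries > lam, and add lam to entries < -lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proximal_gradient(delta, grad_delta, lam, gamma):
    """One proximal gradient update of the DIF effect estimates:
    a gradient step followed by soft thresholding with threshold lam * gamma."""
    return soft_threshold(delta - gamma * grad_delta, lam * gamma)

# Usage: entries shrunk towards zero, small ones set exactly to zero,
# which is how the DIF-free (anchor) items are identified.
shrunk = soft_threshold(np.array([-2.0, 0.3, 1.5]), 0.5)   # [-1.5, 0.0, 1.0]
```

The exact zeros produced by the thresholding are what allow the procedure to flag items as DIF-free rather than merely estimating small effects.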

Table 1 :
Respondent and item classification accuracy under different simulation scenarios for the two-group case. The classification error and AUCs present respondent classification performance, where 'AUC true' gives the results using the true parameter values. The TPRs and FPRs present the results for item classification, where 'TPR oracle' and 'FPR oracle' are the performance of the oracle estimator.

Table 2 :
Respondent and item classification accuracy under different simulation scenarios for the three-group case. The classification error and AUCs present respondent classification performance, where 'AUC true' gives the results using the true parameter values. The TPRs and FPRs present the results for item classification, where 'TPR oracle' and 'FPR oracle' are the performance of the oracle estimator.
Six of the common items are of particular interest, as they are positioned at the end of the test.

Table 3 :
Average absolute bias and RMSE over all items by estimated parameter type, when J = 25, for N = 1000 and 5000, under the two-group setting.

Table 4 :
Average absolute bias and RMSE over all items by estimated parameter type, when J = 50, for N = 1000 and 5000, under the two-group setting.

Table 5 :
Average absolute bias and RMSE over all items by estimated parameter type, for N = 1000 and 5000, under the three-group setting.

Table 6 :
Estimated item easiness and DIF effects for the detected DIF items. The asterisks denote the common items.