DIF Statistical Inference Without Knowing Anchoring Items

Establishing the invariance property of an instrument (e.g., a questionnaire or test) is a key step for establishing its measurement validity. Measurement invariance is typically assessed by differential item functioning (DIF) analysis, i.e., detecting DIF items whose response distribution depends not only on the latent trait measured by the instrument but also on the group membership. DIF analysis is confounded by the group difference in the latent trait distributions. Many DIF analyses require knowing several anchor items that are DIF-free in order to draw inferences on whether each of the rest is a DIF item, where the anchor items are used to identify the latent trait distributions. When no prior information on anchor items is available, or some anchor items are misspecified, item purification methods and regularized estimation methods can be used. The former iteratively purifies the anchor set by a stepwise model selection procedure, and the latter selects the DIF-free items by a LASSO-type regularization approach. Unfortunately, unlike the methods based on a correctly specified anchor set, these methods are not guaranteed to provide valid statistical inference (e.g., confidence intervals and p-values). In this paper, we propose a new method for DIF analysis under a multiple indicators and multiple causes (MIMIC) model for DIF. This method adopts a minimal L1 norm condition for identifying the latent trait distributions. Without requiring prior knowledge about an anchor set, it can accurately estimate the DIF effects of individual items and further draw valid statistical inferences for quantifying the uncertainty.
Specifically, the inference results allow us to control the type-I error for DIF detection, which may not be possible with item purification and regularized estimation methods. We conduct simulation studies to evaluate the performance of the proposed method and compare it with the anchor-set-based likelihood ratio test approach and the LASSO approach. The proposed method is applied to analysing the three personality scales of the Eysenck Personality Questionnaire-Revised (EPQ-R).

Supplementary Information: The online version contains supplementary material available at 10.1007/s11336-023-09930-9.


Introduction
Measurement invariance refers to the psychometric equivalence of an instrument (e.g., a questionnaire or test) across several specified groups, such as gender and ethnicity. The lack of measurement invariance suggests that the instrument has different structures or meanings to different groups, leading to biases in measurements (Millsap, 2012).
Measurement invariance is typically assessed by differential item functioning (DIF) analysis of item response data that aims to detect the measurement non-invariant items (i.e., DIF items).
More precisely, a DIF item has a response distribution that depends not only on the latent trait measured by the instrument but also on respondents' group membership. Therefore, the detection of a DIF item involves comparing the item responses of different groups, conditioning on the latent traits. The complexity of the problem lies in that individuals' latent trait levels cannot be directly observed but are measured by the instrument that may contain DIF items. In addition, different groups may have different latent trait distributions. This problem thus involves identifying the latent trait, and then conducting the group comparison given individuals' latent trait levels.
Many statistical methods have been developed for DIF analysis. Traditional methods for DIF analysis require prior knowledge about a set of DIF-free items, which is known as the anchor set.
This anchor set is used to identify the latent trait distribution. These methods can be classified into two categories. Methods in the first category (Mantel and Haenszel, 1959; Dorans and Kulick, 1986; Swaminathan and Rogers, 1990; Shealy and Stout, 1993; Zwick et al., 2000; Zwick and Thayer, 2002; May, 2006; Soares et al., 2009; Frick et al., 2015) do not explicitly assume an item response theory (IRT) model, and methods in the second category (Lord, 1980; Thissen, 1988; Raju, 1988, 1990; Kim et al., 1995; Oort, 1998; Steenkamp and Baumgartner, 1998; Woods et al., 2013; Tay et al., 2015, 2016; Cao et al., 2017) are developed based on IRT models. Compared with non-IRT-based methods, an IRT-based method defines the DIF problem more clearly, at the price of potential model misspecification. Specifically, an IRT model represents the latent trait as a latent variable and further characterizes the item-specific DIF effects by modelling each item response distribution as a function of the latent variable and group membership.
The DIF problem is well-characterized by a multiple indicators, multiple causes (MIMIC) IRT model, which is a structural equation model originally developed for continuous indicators (Zellner, 1970; Goldberger, 1972) and later extended to categorical item response data (Muthén, 1985; Muthén et al., 1991; Muthén and Lehman, 1985). A MIMIC model for DIF consists of a measurement component and a structural component. The measurement component models how the item responses depend on the measured psychological trait and respondents' group membership. The structural component models the group-specific distributions of the psychological trait. The anchor set imposes zero constraints on item-specific parameters in the measurement component, making the model, including the latent trait distribution, identifiable. Consequently, the DIF effects of the rest of the items can be tested by drawing statistical inferences on the corresponding parameters under the identified model.
Anchor-set-based methods rely heavily on a correctly specified set of DIF-free items. The misspecification of some anchor items can lead to invalid statistical inference results: type I errors increase and power decreases when anchor items are not completely DIF-free (Kopf et al., 2015b). To address this issue, item purification methods (Candell and Drasgow, 1988; Clauser et al., 1993; Fidalgo et al., 2000; Wang and Yeh, 2003; Wang and Su, 2004; Wang et al., 2009; Kopf et al., 2015a,b) have been proposed that iteratively select an anchor set by stepwise model selection methods. Several recently developed tree-based DIF detection methods (Strobl et al., 2015; Tutz and Berger, 2016; Bollmann et al., 2018), which can detect DIF brought by continuous covariates, may be viewed as item purification methods. However, with multiple items containing DIF, item purification may suffer from masking and swamping effects (Barnett and Lewis, 1994).
More recently, regularized estimation methods (Magis et al., 2015; Tutz and Schauberger, 2015; Huang, 2018; Belzak and Bauer, 2020; Bauer et al., 2020; Schauberger and Mair, 2020) have been proposed that use LASSO-type regularized estimation procedures for simultaneous model selection and parameter estimation. Moreover, Bechger and Maris (2015) and Yuan et al. (2021) proposed DIF detection methods based on the idea of differential item pair functioning, which does not require prior information about anchor items. Unfortunately, unlike many anchor-set-based methods with a correctly specified anchor set, none of these methods provides valid statistical inference for testing the null hypothesis of "item j is DIF-free", for each item j. Consequently, the type-I error for testing the hypothesis cannot be guaranteed to be controlled at a pre-specified significance level. Furthermore, although the regularized estimation methods have been shown to accurately detect DIF items, they are typically computationally intensive, since they involve solving multiple regularized maximum likelihood estimation problems with different tuning parameters. This paper proposes a new method that addresses the aforementioned issues with the existing methods. The proposed method can estimate the DIF effects in a statistically accurate and computationally efficient manner, without requiring prior knowledge about anchor items. It draws statistical inferences on the DIF effects of individual items, yielding valid confidence intervals and p-values. The point estimation and statistical inference lead to accurate detection of the DIF items, for which the item-level type-I error and further some test-level risk (e.g., false discovery rate) can be controlled by the inference results. The method is proposed under a MIMIC model with a two-parameter logistic (Birnbaum, 1968) IRT measurement model and a linear structural model. The key to this method is a minimal L1 norm assumption for identifying the true model. As will be discussed later, this assumption holds when the proportion of non-DIF items is sufficiently large. Methods are developed for estimating the model parameters and obtaining confidence intervals and p-values.
Procedures for the detection of DIF items are further developed. Our method is compared to the likelihood ratio test method (Thissen et al., 1993) that requires an anchor set, and a recently proposed LASSO-based approach (Belzak and Bauer, 2020).
The rest of the paper is organised as follows. In Section 2, we introduce a MIMIC model framework for DIF analysis. Under this model framework, a new method is proposed for the statistical inference of DIF effects in Section 3. Related works are discussed in Section 4. Simulation studies and a real data application are given in Sections 5 and 6, respectively. We conclude with discussions in Section 7. All the proofs for the proposition and theorems presented in the article, and the implementation details of the proposed algorithms, can be found in the Supplementary Materials.

A MIMIC Formulation of DIF
Consider N individuals answering J items. Let Y_ij ∈ {0, 1} be a binary random variable, denoting individual i's response to item j. Let y_ij be the observed value, i.e., the realization, of Y_ij. For ease of exposition, we use Y_i = (Y_i1, ..., Y_iJ) to denote the response vector of individual i. The individuals are from two groups, indicated by x_i = 0, 1, where 0 and 1 are referred to as the reference and focal groups, respectively. We further introduce a latent variable θ_i, which represents the latent trait that the items are designed to measure. DIF occurs when the distribution of Y_i depends not only on θ_i but also on x_i. More precisely, DIF occurs if Y_i is not conditionally independent of x_i, given θ_i. Although it may appear to be a simple group comparison problem, DIF analysis is non-trivial due to the latency of θ_i. In particular, the distribution of θ_i may depend on the value of x_i, which confounds the DIF analysis. In what follows, we describe a MIMIC model framework for DIF analysis, under which the relationship among Y_i, θ_i, and x_i is characterized. It is worth pointing out that this framework can be generalized to account for more complex DIF situations; see more details in Section 4.

Measurement Model
The two-parameter logistic (2PL) model (Birnbaum, 1968) is widely used to model binary item responses (e.g., wrong/right or absent/present). In the absence of DIF, the 2PL model assumes a logistic relationship between Y_ij and θ_i, which is independent of the value of x_i. That is,

P(Y_ij = 1 | θ_i = θ) = exp(a_j θ + d_j) / (1 + exp(a_j θ + d_j)),    (1)

where the slope parameter a_j and intercept parameter d_j are typically known as the discrimination and easiness parameters, respectively. The right-hand side of (1), as a function of θ, is known as the 2PL item response function. When the items potentially suffer from DIF, the item response functions may depend on the group membership x_i. In that case, the item response function can be modeled as

P(Y_ij = 1 | θ_i = θ, x_i) = exp(a_j θ + d_j + γ_j x_i) / (1 + exp(a_j θ + d_j + γ_j x_i)),    (2)

where γ_j is an item-specific parameter that characterizes its DIF effect. More precisely,

exp(γ_j) = [P(Y_ij = 1 | θ_i = θ, x_i = 1) / P(Y_ij = 0 | θ_i = θ, x_i = 1)] / [P(Y_ij = 1 | θ_i = θ, x_i = 0) / P(Y_ij = 0 | θ_i = θ, x_i = 0)].

That is, exp(γ_j) is the odds ratio for comparing two individuals from the two groups who have the same latent trait level. Item j is DIF-free under this model when γ_j = 0. We further make the local independence assumption as in most IRT models. That is, Y_i1, ..., Y_iJ are assumed to be conditionally independent, given θ_i and x_i.
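As a concrete numerical illustration of the item response function in (2), the following minimal sketch (with hypothetical parameter values) evaluates the 2PL probability under a uniform DIF effect and verifies the odds-ratio interpretation of exp(γ_j):

```python
import math

def irf_2pl_dif(theta, x, a, d, gamma):
    """2PL item response function with a uniform DIF effect:
    P(Y_ij = 1 | theta_i, x_i) = logistic(a_j * theta_i + d_j + gamma_j * x_i)."""
    z = a * theta + d + gamma * x
    return 1.0 / (1.0 + math.exp(-z))

# With gamma = 0 the item is DIF-free: both groups share one response curve.
p_ref = irf_2pl_dif(theta=0.5, x=0, a=1.2, d=-0.3, gamma=0.0)
p_foc = irf_2pl_dif(theta=0.5, x=1, a=1.2, d=-0.3, gamma=0.0)

# With gamma = 0.7, exp(gamma) is the odds ratio between the focal and
# reference groups at the same latent trait level.
p0 = irf_2pl_dif(theta=0.5, x=0, a=1.2, d=-0.3, gamma=0.7)
p1 = irf_2pl_dif(theta=0.5, x=1, a=1.2, d=-0.3, gamma=0.7)
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
```
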

Structural Model
A structural model specifies the distribution of θ i , which may depend on the group membership.
Specifically, we assume the conditional distribution of θ_i given x_i to follow a normal distribution,

θ_i | x_i = 0 ~ N(0, 1) and θ_i | x_i = 1 ~ N(β, σ²).

Note that the latent trait distribution for the reference group is set to a standard normal distribution to identify the location and scale of the latent trait. A similar assumption is typically adopted in IRT models for a single group of individuals.
The MIMIC model for DIF combines the above measurement and structural models, for which a path diagram is given in Figure 1. The marginal likelihood function for this MIMIC model takes the form

L(Ξ) = ∏_{i=1}^N ∫ [∏_{j=1}^J P(Y_ij = y_ij | θ_i = θ, x_i)] φ(θ | x_i) dθ,    (3)

where φ(θ | x_i) denotes the density of the normal distribution of θ_i given x_i, and Ξ = {β, σ², a_j, d_j, γ_j : j = 1, ..., J} denotes all the fixed model parameters.
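To make the integral in the marginal likelihood concrete, here is a minimal numerical sketch (our own illustration, not the authors' implementation; all function and parameter names are assumptions) that evaluates one respondent's likelihood contribution by Gauss-Hermite quadrature:

```python
import numpy as np

def marginal_loglik_person(y, x, a, d, gamma, beta, sigma2, n_quad=41):
    """One respondent's marginal log-likelihood contribution under the MIMIC
    model: theta | x = 0 ~ N(0, 1), theta | x = 1 ~ N(beta, sigma2), with
    conditionally independent 2PL responses carrying uniform DIF effects gamma.
    The integral over theta is approximated by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    w = weights / np.sqrt(2.0 * np.pi)          # weights for a N(0, 1) expectation
    mu = beta * x
    sd = np.sqrt(sigma2) if x == 1 else 1.0
    theta = mu + sd * nodes                     # quadrature nodes for N(mu, sd^2)
    z = np.outer(theta, a) + d + gamma * x      # (n_quad, J) array of logits
    p = 1.0 / (1.0 + np.exp(-z))
    lik_given_theta = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    return float(np.log(w @ lik_given_theta))

# Example: one focal-group respondent answering three items (made-up values).
ll = marginal_loglik_person(
    y=np.array([1, 0, 1]), x=1,
    a=np.array([1.0, 1.5, 0.8]), d=np.array([0.0, -0.5, 0.2]),
    gamma=np.array([0.0, 0.0, 0.4]), beta=0.3, sigma2=1.2,
)
```

In the full likelihood (3), these per-respondent contributions are summed over i; a 41-node rule is typically more than enough for a one-dimensional latent trait.
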
The goal of DIF analysis is to detect the DIF items, i.e., the items for which γ_j ≠ 0. Unfortunately, without further assumptions, this problem is ill-posed due to the non-identifiability of the model. We discuss this identifiability issue below.

Model Identifiability
Without further assumptions, the above MIMIC model is not identifiable. That is, for any constant c, the model remains equivalent if we simultaneously replace β and γ_j by β + c and γ_j − a_j c, respectively, and keep a_j unchanged. This identifiability issue arises because all the items are allowed to have DIF effects. Anchor-set-based methods resolve it by assuming a known set of anchor items for which the corresponding γ*_j s are known to be zero, which fixes the location indeterminacy. However, if no anchor item is known, we need to answer the question: which member of this equivalence class should be used to define DIF effects? We address it in Section 3 below.

Proposed Method
In what follows, we address the model identifiability problem raised above and then propose a new method for DIF analysis that does not require prior knowledge about anchor items. As will be shown in the rest of the paper, the proposed method can not only accurately detect the DIF items, but also provide valid statistical inference for testing the hypotheses γ_j = 0, for j = 1, ..., J.

Model Identifiability, Sparsity, and Minimal L1 Condition
We now address the model identifiability problem. The most natural idea is to choose Ξ* as the true parameter vector when the corresponding γ* = (γ*_1, ..., γ*_J) is the sparsest vector among all equivalent parameter vectors. However, there are still ambiguities with the above sparsity assumption. First, how sparse should γ* be? We note that the true parameters need to satisfy ‖γ*‖_0 ≤ J − 2; that is, there are at least two zeros in the true γ* vector. This is because ‖γ*(c)‖_0 ≤ J − 1 when c = γ*_j/a*_j, for any j, where γ*(c) = (γ*_1 − a*_1 c, ..., γ*_J − a*_J c) denotes the γ vector of the equivalent model indexed by c; consequently, a γ* with at most one zero cannot be the unique sparsest member of its equivalence class.
However, we may not want to identify the latent trait with only two items because, in that case, the two items are of high leverage: the latent trait becomes unidentifiable when we remove one of these items. For the latent trait to be firmly identified, it may be sensible to assume that γ* is sufficiently sparse, where the sparsity may be measured by the proportion of non-zero coefficients in γ*. Further discussions will be provided in the sequel regarding the sparsity level. We note that this "sufficiently sparse" assumption aligns well with the practical utility of DIF analysis in educational testing (e.g., Holland and Wainer, 1993) as well as certain settings of psychological measurement (e.g., Chapter 1, Millsap, 2012) and health-related measurement (e.g., Scott et al., 2010). For example, in educational testing, DIF analysis is conducted to ensure the fairness of a test form. In this application, the test operator aims to identify a small number of DIF items that cause a bias in the test result. The identified items will be reviewed by domain experts, and then revised or removed from the item pool. For this process to be operationally feasible, one typically needs to assume that the majority of the items are DIF-free, i.e., γ* is sufficiently sparse.
Second, the L0 norm is not easy to work with from a statistical perspective. Due to the randomness in the data, likelihood-based estimation methods almost never give us a truly sparse solution.
Consequently, one essentially needs to search over all O(2^J) possible models to find the sparsest model (e.g., using a suitable information criterion). Item purification and regularized estimation methods narrow the search by stepwise procedures and regularized estimation procedures, respectively. Even with these methods, the computation can still be intensive, and consistent selection of the true model is not always guaranteed.
Following the previous discussions, we now impose a condition for identifying the true model parameters, which is statistically easy to work with and suitable when the true γ* is sufficiently sparse. Specifically, we require the following minimal L1 (ML1) condition to hold:

Σ_{j=1}^J |γ*_j| < Σ_{j=1}^J |γ*_j − a*_j c|  for all c ≠ 0.    (4)

This assumption implies that, among all models that are equivalent to the true model, the true parameter vector γ* has the smallest L1 norm. Equivalently, we can rewrite (4) as

0 = arg min_c h(c), with the minimizer being unique,    (5)

where h(c) = Σ_{j=1}^J |γ*_j − a*_j c|. We give an example of h(c) in Figure 2, where h(c) is constructed with a sparse γ*. More specifically, we construct h(c) with J = 10, a*_j = 1 for all j, and γ*_j = 0 for j = 1, ..., 8 and γ*_j = 1 for j = 9, 10. In this example, h(c) has a unique minimum at c = 0, i.e., (4), or equivalently (5), holds.
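The example behind Figure 2 is easy to reproduce numerically. The sketch below evaluates h(c) on a grid for J = 10, a*_j = 1, and a γ* with eight zeros, confirming that the unique minimum sits at c = 0:

```python
import numpy as np

# The construction behind Figure 2: J = 10 items, a*_j = 1, and a sparse
# gamma* with eight zeros and two entries equal to 1.
a_star = np.ones(10)
gamma_star = np.array([0.0] * 8 + [1.0, 1.0])

def h(c):
    """h(c) = sum_j |gamma*_j - a*_j c|: the L1 norm along the equivalence class."""
    return float(np.sum(np.abs(gamma_star - a_star * c)))

grid = np.linspace(-2.0, 2.0, 4001)
values = np.array([h(c) for c in grid])
c_min = grid[np.argmin(values)]   # the unique minimizer: c = 0, so (4) holds
```
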
In what follows, we show that the ML1 condition holds when γ* is sufficiently sparse. The following proposition provides a necessary and sufficient condition for the ML1 condition (4) (or equivalently (5)) to hold. The proof is given in the Supplementary Materials.
Proposition 1. Assume that a*_j ≠ 0 for all j. Condition (4) holds if and only if

Σ_{j=1}^J |a*_j| I(γ*_j/a*_j > 0) < Σ_{j=1}^J |a*_j| I(γ*_j/a*_j ≤ 0)    (6)

and

Σ_{j=1}^J |a*_j| I(γ*_j/a*_j < 0) < Σ_{j=1}^J |a*_j| I(γ*_j/a*_j ≥ 0),    (7)

where I(·) is the indicator function.
Figure 2: An example of h(c), where J = 10, a*_j = 1 for all j, and γ*_j = 0 for j = 1, ..., 8 and γ*_j = 1 for j = 9, 10. The minimal value of h(c) is achieved when c = 0.

We note that inequalities (6) and (7) hold for the example in Figure 2, where the left- and right-hand sides of (6) are 2 and 8, and those of (7) are 0 and 10, respectively. To elaborate on the results of Proposition 1, we first consider a special case where a*_j = 1 for all j, i.e., the measurement model is a one-parameter logistic model when there is no DIF. Then, according to Proposition 1, the ML1 condition holds if and only if the number of items with γ*_j > 0 is smaller than the number of items with γ*_j ≤ 0, and the number of items with γ*_j < 0 is smaller than the number of items with γ*_j ≥ 0. Suppose that more than half of the items are DIF-free, i.e., Σ_{j=1}^J I(γ*_j = 0) > J/2. Then the ML1 condition holds, because each left-hand count is below J/2 while each right-hand count exceeds J/2. More generally, let γ*_(1) ≤ ... ≤ γ*_(J) be the order statistics of γ*_1, ..., γ*_J. The ML1 condition holds when γ*_((J+1)/2) = 0 if J is an odd number, and when γ*_(J/2) = γ*_((J/2)+1) = 0 if J is an even number. That is, the ML1 condition holds when we have similar numbers of positive and negative DIF items and a few non-DIF items, in which case the ML1 condition can hold even if Σ_{j=1}^J I(γ*_j = 0) ≤ J/2. However, if all the DIF items are of the same direction (all positive or all negative), then it is easy to show that the ML1 condition does not hold unless more than half of the items are DIF-free. We then extend the above discussion to the general setting where the discrimination parameters vary across items. Based on Proposition 1, we provide a sufficient condition for the ML1 condition, which suggests that the ML1 condition holds when γ*_j = 0 for a sufficiently large number of items.
Corollary 1. Assume that a*_j ≠ 0 for all j. Let ρ* = max_j{|a*_j|}/min_j{|a*_j|}. Then Condition (4) holds if

ρ* Σ_{j=1}^J I(γ*_j/a*_j > 0) < Σ_{j=1}^J I(γ*_j/a*_j ≤ 0)    (8)

and

ρ* Σ_{j=1}^J I(γ*_j/a*_j < 0) < Σ_{j=1}^J I(γ*_j/a*_j ≥ 0).    (9)

We note that (8) and (9) are sufficient but not necessary, meaning that the ML1 condition can still hold even if (8) and (9) are not satisfied. Here, ρ* quantifies the variation of the absolute discrimination parameters, where a larger value of ρ* indicates a higher variation. Corollary 1 suggests that the ML1 condition holds when, for example, ρ* = 2 and at least two-thirds of the items are DIF-free. This sparsity requirement can be relaxed if the numbers of items with γ*_j/a*_j > 0 and those with γ*_j/a*_j < 0 are balanced.
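Since h(c) = Σ_j |a_j| · |γ_j/a_j − c|, condition (5) amounts to 0 being the strict weighted median of the points γ_j/a_j with weights |a_j|. The following sketch (our own helper function, not from the paper) checks this condition directly and reproduces the qualitative conclusions above:

```python
import numpy as np

def ml1_holds(a, gamma):
    """Check whether h(c) = sum_j |gamma_j - a_j c| is uniquely minimized at
    c = 0.  Writing h(c) = sum_j |a_j| * |gamma_j/a_j - c|, this holds exactly
    when 0 is a strict weighted median of the points gamma_j / a_j with
    weights |a_j| (both one-sided derivative conditions below are strict)."""
    a, gamma = np.asarray(a, float), np.asarray(gamma, float)
    t, w = gamma / a, np.abs(a)
    return bool((w[t > 0].sum() < w[t <= 0].sum()) and
                (w[t < 0].sum() < w[t >= 0].sum()))

# More than half of the items DIF-free: the ML1 condition holds.
ok = ml1_holds(np.ones(10), [0.0] * 8 + [1.0, 1.0])
# All DIF effects in the same direction, only 4 of 10 items DIF-free: fails.
bad = ml1_holds(np.ones(10), [0.0] * 4 + [1.0] * 6)
# Balanced positive and negative DIF: can hold with only 2 DIF-free items.
bal = ml1_holds(np.ones(10), [0.0] * 2 + [1.0] * 4 + [-1.0] * 4)
```
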
The estimate Ξ̂ can be easily computed using a two-step procedure, described in Algorithm 1.

Algorithm 1:
Step 1: Solve the MML estimation problem (12) under the constraint γ_1 = 0, obtaining estimates β̃, σ̃², ã_j, d̃_j, γ̃_j, j = 1, ..., J.
Step 2: Solve the optimization problem (13), ĉ = arg min_c Σ_{j=1}^J |γ̃_j − ã_j c|, and set γ̂_j = γ̃_j − ã_j ĉ and β̂ = β̃ + ĉ, keeping the remaining estimates unchanged.

We provide some remarks about these steps. The estimator (12) in Step 1 can be viewed as the MML estimator of the MIMIC model, treating item 1 as an anchor item. We emphasize that the constraint γ_1 = 0 in Step 1 is an arbitrary but mathematically convenient constraint for ensuring the estimability of the MIMIC model when solving (12). It does not require item 1 to be truly a DIF-free item. This constraint can be replaced by any equivalent constraint, for example γ_2 = 0, without affecting the final estimation result.
Step 2 finds the transformation that leads to the ML1 solution among all the models equivalent to the estimated model from Step 1. The optimization problem (13) is convex and takes the same form as the least absolute deviations (LAD) objective function in median regression (Koenker, 2005). Specifically, the LAD function measures the sum of absolute residuals: given data (x_i, y_i) for i = 1, ..., n, it is defined as S(f) = Σ_{i=1}^n |y_i − f(x_i)|, and we seek the f that minimizes S. Our problem (13) minimizes a convex LAD function over the real line, so any local minimum is a global minimum. Consequently, it can be solved using standard statistical packages/software for quantile regression. The R package "quantreg" (Koenker, 2022) is used in our simulation study and real data analysis.
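Because (13) is a one-dimensional LAD problem, its solution is a weighted median and can be computed in closed form, without a quantile-regression solver. The sketch below (with hypothetical Step-1 estimates; in practice one would use quantreg as the authors do) carries out Step 2 and illustrates that the arbitrary constraint γ_1 = 0 is harmless even when item 1 truly has DIF:

```python
import numpy as np

def ml1_transform(a_tilde, gamma_tilde, beta_tilde):
    """Step 2 of Algorithm 1.  The minimizer of sum_j |gamma_j - a_j c| is a
    weighted median of the points gamma_j / a_j with weights |a_j|, so the
    one-dimensional LAD problem (13) has a closed-form solution."""
    a = np.asarray(a_tilde, float)
    g = np.asarray(gamma_tilde, float)
    t, w = g / a, np.abs(a)
    order = np.argsort(t)
    # weighted median: first sorted point whose cumulative weight reaches half
    c_hat = t[order][np.searchsorted(np.cumsum(w[order]), 0.5 * w.sum())]
    return g - a * c_hat, beta_tilde + c_hat, c_hat

# Suppose item 1 in fact has DIF, with true gamma* = (1, 0, ..., 0, 1, 1).
# The arbitrary Step-1 constraint gamma_1 = 0 then shifts all estimates by 1:
gamma_tilde = np.array([1.0] + [0.0] * 7 + [1.0, 1.0]) - 1.0
gamma_hat, beta_hat, c_hat = ml1_transform(np.ones(10), gamma_tilde, beta_tilde=1.3)
# The ML1 transform undoes the shift (c_hat = -1) and recovers gamma*.
```
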
The ML1 condition (4), together with some additional regularity conditions, guarantees the consistency of the above ML1 estimator. That is, Ξ̂ will converge to Ξ* as the sample size N grows to infinity. This result is formalized in Theorem 1, with its proof given in the Supplementary Materials.
Theorem 1. Suppose that the regularity conditions given in the Supplementary Materials hold. Further, assume that the ML1 condition (4) holds. Then Ξ̂ converges to Ξ* in probability as N → ∞.

With a consistent point estimator, one can consistently select the true model, i.e., identifying the zeros and non-zeros in γ*, using a hard-thresholding procedure (see, e.g., Meinshausen and Yu, 2009). As our focus is on the statistical inference of DIF parameters, we skip the details of the hard-thresholding procedure here.
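As a simple illustration of such a rule (with a hypothetical threshold value; the theory requires the threshold to dominate the estimation error while still shrinking to zero), hard thresholding flags the items whose estimated DIF effect is large in absolute value:

```python
import numpy as np

def hard_threshold(gamma_hat, tau):
    """Flag item j as a DIF item when |gamma_hat_j| > tau.  For consistent
    selection, tau should shrink to zero more slowly than the estimation
    error of gamma_hat as the sample size N grows."""
    return np.abs(np.asarray(gamma_hat, float)) > tau

# Hypothetical estimates for five items, with a hypothetical threshold 0.3:
flags = hard_threshold([0.02, -0.01, 0.95, 0.03, -1.1], tau=0.3)
```
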

Statistical Inference
The statistical inference of the individual γ_j parameters is of particular interest in DIF analysis.
With the proposed estimator, we can draw valid statistical inference on the DIF parameters γ_j.
Note that the uncertainty of γ̂_j is inherited from Ξ̂, where √N(Ξ̂ − Ξ†) asymptotically follows a mean-zero multivariate normal distribution by the large-sample theory for maximum likelihood estimation; see the Supplementary Materials for more details. We denote this multivariate normal distribution by N(0, Σ*), where a consistent estimator of Σ*, denoted by Σ̂_N, can be obtained based on the marginal likelihood. We define a function G_j that maps an arbitrary parameter vector Ξ = {β, σ², a_l, d_l, γ_l : l = 1, ..., J} of the MIMIC model to the γ_j parameter of the equivalent ML1 parameter vector. To draw statistical inference, we need the distribution of γ̂_j − γ*_j = G_j(Ξ̂) − G_j(Ξ†). By the asymptotic distribution of √N(Ξ̂ − Ξ†), this distribution can be approximated by that of G_j(Ξ† + Z/√N) − G_j(Ξ†), and the latter can be further approximated by the distribution of G_j(Ξ̂ + Z/√N) − G_j(Ξ̂), where Z follows a normal distribution N(0, Σ̂_N). Therefore, although the function G_j does not have an analytic form, we can approximate the distribution of γ̂_j − γ*_j by Monte Carlo simulation. We summarize this procedure in Algorithm 2 below.
Algorithm 2: Input: The number of Monte Carlo samples M and significance level α.
Step 1: Generate M i.i.d. samples from a multivariate normal distribution with mean 0 and covariance matrix Σ̂_N. We denote these samples as Z_1, ..., Z_M.
Step 2: For m = 1, ..., M, compute δ_m = G_j(Ξ̂ + Z_m/√N) − G_j(Ξ̂), and let q_α denote the empirical α-quantile of δ_1, ..., δ_M.
Output: The level 1 − α confidence interval for γ*_j is given by (γ̂_j − q_{1−α/2}, γ̂_j − q_{α/2}). In addition, the p-value for a two-sided test of γ*_j = 0 is given by

p_j = 2 min( (1/M) Σ_{m=1}^M I(δ_m ≥ γ̂_j), (1/M) Σ_{m=1}^M I(δ_m ≤ γ̂_j) ).

Algorithm 2 only involves sampling from a multivariate normal distribution and solving a convex optimization problem based on the LAD objective function, both of which are computationally efficient. The value of M is set to 10,000 in our simulation study and 50,000 in the real data example below.
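A self-contained sketch of this Monte Carlo inference, with the map G_j computed via the closed-form weighted-median solution of the one-dimensional LAD problem, and toy Gaussian draws standing in for the samples Ξ̂ + Z_m/√N (all numbers hypothetical, not from the paper):

```python
import numpy as np

def ml1_gamma(a, gamma):
    """Send a parameter vector to the gamma vector of the ML1 member of its
    equivalence class, using the closed-form weighted-median solution of
    min_c sum_j |gamma_j - a_j c| (this plays the role of the map G)."""
    a, gamma = np.asarray(a, float), np.asarray(gamma, float)
    t, w = gamma / a, np.abs(a)
    order = np.argsort(t)
    c = t[order][np.searchsorted(np.cumsum(w[order]), 0.5 * w.sum())]
    return gamma - a * c

def mc_inference(a_hat, gamma_hat, draws, j, alpha=0.05):
    """Monte Carlo confidence interval and two-sided p-value for gamma*_j, in
    the spirit of Algorithm 2.  `draws` are perturbed copies of the estimates,
    standing in for Xi_hat + Z_m / sqrt(N) with Z_m ~ N(0, Sigma_N_hat)."""
    g_j = ml1_gamma(a_hat, gamma_hat)[j]
    delta = np.array([ml1_gamma(a_m, g_m)[j] - g_j for a_m, g_m in draws])
    q_lo, q_hi = np.quantile(delta, [alpha / 2.0, 1.0 - alpha / 2.0])
    ci = (g_j - q_hi, g_j - q_lo)
    pval = min(1.0, 2.0 * min(np.mean(delta >= g_j), np.mean(delta <= g_j)))
    return ci, float(pval)

# Toy example: five items, item 5 carrying a DIF effect of 1 (made-up numbers);
# the perturbed draws mimic the sampling variability of the estimates.
rng = np.random.default_rng(0)
a_hat = np.ones(5)
gamma_hat = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
draws = [(a_hat, gamma_hat + rng.normal(0.0, 0.05, 5)) for _ in range(500)]
ci_dif, p_dif = mc_inference(a_hat, gamma_hat, draws, j=4)    # DIF item
ci_null, p_null = mc_inference(a_hat, gamma_hat, draws, j=0)  # DIF-free item
```
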
The p-values can be used to control the type-I error rate, i.e., the probability of mistakenly detecting a non-DIF item as a DIF item. To control the item-specific type-I errors at a pre-specified level α (e.g., α = 0.05), we detect the items for which the corresponding p-values are below α. Besides the type-I error, we may also consider the false discovery rate (FDR) for DIF detection (Bauer et al., 2020) to account for multiple comparisons, where the FDR is defined as the expected proportion of falsely detected items (i.e., non-DIF items) among all detected items. To control the FDR, the Benjamini-Hochberg (B-H) procedure (Benjamini and Hochberg, 1995) can be applied to the p-values. Other compound risks may also be considered, such as the familywise error rate.
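The B-H step-up rule itself takes only a few lines; here is a minimal sketch applied to hypothetical item-level p-values:

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.05):
    """B-H step-up procedure: reject the hypotheses with the k smallest
    p-values, where k is the largest index with p_(k) <= (k / m) * fdr."""
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= fdr * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))   # largest index meeting the bound
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values for five items; items 1, 3, and 5 are flagged as DIF.
flags = benjamini_hochberg([0.001, 0.20, 0.03, 0.80, 0.004], fdr=0.05)
```
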
Related Works and Extensions

Related Works
Many of the IRT-based DIF analyses (Thissen et al., 1986; Thissen, 1988; Thissen et al., 1993) require prior knowledge about a subset of DIF-free items, which are known as the anchor items.
More precisely, we denote this known subset by A. Under the MIMIC model described above, it implies that the constraints γ_j = 0 are imposed for all j ∈ A in the estimation. With these zero constraints, the γ_j parameters cannot be freely transformed, and thus the above MIMIC model becomes identifiable. Therefore, for each non-anchor item j ∉ A, the hypothesis γ_j = 0 can be tested, for example, by a likelihood ratio test. The DIF items can then be detected based on the statistical inference of these hypothesis tests.
The validity of the anchor-item-based analyses relies on the assumption that the anchor items are truly DIF-free. If the anchor set includes one or more DIF items, then the results can be misleading (Kopf et al., 2015b). To address the issue brought by the misspecification of the anchor set, item purification methods (Candell and Drasgow, 1988; Clauser et al., 1993; Fidalgo et al., 2000; Wang and Yeh, 2003; Wang and Su, 2004; Wang et al., 2009; Kopf et al., 2015a,b) have been proposed that iteratively purify the anchor set. These methods conduct model selection using a stepwise procedure to select the anchor set, implicitly assuming that there exists a reasonably large set of DIF-free items. Then DIF is assessed by hypothesis testing given the selected anchor set. This type of method also has several limitations. First, the model selection results may be sensitive to the choice of the initial set of anchor items and the specific stepwise procedure (e.g., forward or backward selection), which is a common issue with stepwise model selection procedures (e.g., stepwise variable selection for linear regression). Second, the model selection step has uncertainty. As a result, there is no guarantee that the selected anchor set is completely DIF-free, and furthermore, the post-selection statistical inference of items may not be valid in the sense that the type-I error may not be controlled at the targeted significance level. Bechger and Maris (2015) and Yuan et al. (2021) proposed DIF detection methods based on the idea of differential item pair functioning. They considered a one-parameter logistic model setting, which corresponds to the case a_1 = ... = a_J in the current MIMIC model. Their idea is that the difference γ_j − γ_j' is identifiable for any j ≠ j', though each individual γ_j is not identifiable due to the location indeterminacy. Bechger and Maris (2015) focused on testing γ_j − γ_j' = 0 for all item pairs, and Yuan et al. (2021) proposed data visualization methods and a Monte Carlo test to identify individual DIF items. However, they did not provide statistical inferences for the DIF effects of individual items.
Regularized estimation methods (Magis et al., 2015; Tutz and Schauberger, 2015; Huang, 2018; Belzak and Bauer, 2020; Bauer et al., 2020; Schauberger and Mair, 2020) have also been proposed for identifying the anchor items, which also implicitly assume that many items are DIF-free. These methods do not require prior knowledge about anchor items, and simultaneously select the DIF-free items and estimate the model parameters using a LASSO-type penalty (Tibshirani, 1996). Under the above MIMIC model, a regularized estimation procedure solves the following optimization problem,

Ξ̂_λ = arg min_Ξ { −log L(Ξ) + λ Σ_{j=1}^J |γ_j| },    (14)

where L(Ξ) is the marginal likelihood (3) and λ > 0 is a tuning parameter that determines the sparsity level of the estimated γ_j parameters. Generally speaking, a larger value of λ leads to a sparser vector γ̂_λ = (γ̂_λ1, ..., γ̂_λJ). A regularization method (e.g., Belzak and Bauer, 2020) solves the optimization problem (14) for a sequence of λ values, and then selects the tuning parameter λ̂ based on the BIC. Items for which γ̂_λ̂j ≠ 0 are classified as DIF items, and the rest are classified as DIF-free items. While the regularization methods are computationally more stable than the stepwise model selection in the item purification methods, they also suffer from some limitations. First, they involve solving non-smooth optimization problems like (14) for different tuning parameter values, which is not only computationally intensive but also requires tailored computation code that is not available in most statistical packages/software for DIF analysis. Second, these methods may be sensitive to the choice of the tuning parameter. Although methods and theories have been developed in the statistics literature to guide the selection of the tuning parameter, there is no consensus on how the tuning parameter should be chosen, leaving ambiguity in the application of these methods. Third, from the theoretical perspective, it is not clear whether these methods can guarantee model selection consistency. In particular, the model selection consistency of the LASSO procedure almost always requires a strong assumption called the irrepresentable condition (Zhao and Yu, 2006; van de Geer and Bühlmann, 2009). It is not clear when this assumption holds for the current problem. On the other hand, the proposed ML1 condition is much easier to understand and check, as discussed in Section 3.1. Finally, as a common issue of regularized estimation methods, obtaining valid statistical inference from these methods is not straightforward. That is, regularized estimation like (14) does not provide a valid p-value for testing the null hypothesis γ_j = 0. In fact, post-selection inference after regularized estimation was conducted in Bauer et al. (2020), where the type I error could not be controlled at the targeted level under some simulation scenarios.
We notice that there is a connection between the proposed estimator and the regularized estimator (14). Note that Ξ̂ is the one with the smallest Σ_{j=1}^J |γ_j| among all equivalent estimators that maximize the likelihood function (3). When the solution path of (14) is smooth and the solution to the ML1 problem (13) is unique, it is easy to see that Ξ̂ is the limit of Ξ̂_λ as the positive tuning parameter λ converges to zero. In other words, the proposed estimator can be viewed as a limiting version of the LASSO estimator (14). According to Theorem 1, this limiting version of the LASSO estimator is sufficient for the consistent estimation of the true model under the ML1 condition.
We clarify that the proposed method may not always outperform other methods, such as the LASSO procedure, in terms of the accuracy in classifying items. From the simulation results in Section 5 below, we see that the proposed method and the LASSO procedure have similar accuracy in item classification when the DIF parameters are large. The key advantage of the proposed method is that it provides valid statistical inference (e.g., p-values) when anchor items are not available. The inference results allow us to quantify the uncertainty in the decisions of DIF detection, which can be useful in many applications of DIF analysis where high-stakes decisions need to be made.

4 Extensions
While the previous discussion focuses on the two-group setting and uniform DIF (i.e., only the intercepts depend on the groups), the proposed framework is general and can be easily extended to other settings. In what follows, we discuss the ML1 condition under different settings. The proposed methods for point estimation and statistical inference can be extended accordingly.
Non-uniform DIF. Under the 2PL measurement model, non-uniform DIF occurs when the discrimination parameters also differ across groups. To model non-uniform DIF, we extend the current measurement model (2) accordingly, while keeping the structural model the same as in Section 2.2. This extended model has both location and scale indeterminacies.
In the same spirit as the ML1 condition (4), we may assume that the true model parameters Ξ* satisfy analogous minimal-L1 conditions for the location and scale indeterminacies, with equality attained only when m = 0 and c = 0. These conditions tend to be satisfied when the proportion of DIF-free items is sufficiently large.
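As an illustration of what the extended measurement model may look like, one common parameterization of non-uniform DIF under the 2PL model is (a sketch; the exact form used in the paper may differ):

```latex
\mathrm{logit}\, P(Y_{ij} = 1 \mid \theta_i, x_i)
  \;=\; (a_j + \delta_j x_i)\,\theta_i + d_j + \gamma_j x_i ,
```

where δ_j captures the group difference in discrimination and γ_j captures the group difference in the intercept; item j is DIF-free if δ_j = γ_j = 0.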
Multi-group setting. There may be more than two groups in some DIF applications. Suppose that there are K + 1 groups: one reference group and K focal groups. Let x_i ∈ {0, ..., K} indicate the group membership.
For simplicity, we focus on the uniform DIF setting. The measurement and structural models are extended accordingly, with item j now having DIF parameters γ_j1, ..., γ_jK, one for each focal group. Under this model, an item j is DIF-free if γ_jk = 0 for all k. The location indeterminacy under this model leads to a corresponding ML1 condition for identifying the true model parameters. We note that this ML1 condition for the multi-group setting allows the majority of the items to be DIF items, as long as the vector (γ*_1k, ..., γ*_Jk) is sufficiently sparse for each focal group. Similar to the discussion in Section 3.1, in the special case of the one-parameter logistic model, the ML1 condition is guaranteed to hold if ∑_{j=1}^J I(γ*_jk = 0) > J/2 for all k. Note that the set of items satisfying γ*_jk = 0 can vary across focal groups.
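Consistent with the two-group condition (4), a natural form of the multi-group ML1 condition (stated here as a sketch) requires, separately for each focal group k = 1, ..., K:

```latex
\sum_{j=1}^{J} |\gamma^*_{jk}| \;<\; \sum_{j=1}^{J} |\gamma^*_{jk} + c|
\quad \text{for all } c \neq 0,
```

so that the L1 norm of each focal group's DIF vector cannot be reduced by a common location shift of the latent trait.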
Continuous covariates. In some applications, DIF may be caused by continuous covariates, such as age. Suppose that we have K continuous covariates x_i = (x_i1, ..., x_iK)', rather than discrete groups. Then we may consider a measurement model in which γ_j = (γ_j1, ..., γ_jK)' is the corresponding vector of DIF parameters. We may assume that the structural model takes a homoscedastic latent regression form, θ_i | x_i ∼ N(β'x_i, 1), where the variance is fixed to 1 to avoid scale indeterminacy. Under this MIMIC model, an item j is DIF-free if γ_jk = 0 for all k. The location indeterminacy under this model leads to an ML1 condition for identifying the true model parameters that is similar to the one under the multi-group setting. This is because the multi-group setting can be written in a form very similar to the current MIMIC model (by representing the groups with a vector of dummy variables), except that the structural model under the multi-group setting allows heteroscedasticity. We also note that the current model assumes that a DIF effect is a linear combination of the covariates, which may seem inflexible, especially in comparison with tree-based methods (Strobl et al., 2015; Tutz and Berger, 2016; Bollmann et al., 2018). However, one can always move beyond linearity by including transformations of the raw covariates (e.g., using a spline basis) in the covariate vector and increasing the dimension of the DIF parameter vector γ_j accordingly.
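The basis-expansion idea in the last sentence can be illustrated with a minimal sketch (illustrative code, not from the paper; the function name and knot locations are our own choices):

```python
import numpy as np

def hinge_basis(x, knots):
    """Expand a continuous covariate into a piecewise-linear (hinge) basis.

    Returns a matrix whose columns are x and max(0, x - t) for each knot t,
    so a 'linear' DIF model in the expanded covariates becomes piecewise
    linear in the raw covariate.
    """
    cols = [x] + [np.maximum(0.0, x - t) for t in knots]
    return np.column_stack(cols)

age = np.array([20.0, 35.0, 50.0, 65.0])
X = hinge_basis(age, knots=[30.0, 55.0])
# The DIF parameter vector gamma_j now has one entry per basis column.
print(X.shape)  # (4, 3)
```

The same construction works with any spline basis; only the number of columns, and hence the dimension of γ_j, changes.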
Ordinal response data. Finally, we note that the proposed method can be extended to IRT models for other types of response data. To elaborate, we consider the generalized partial credit model (GPCM; Muraki, 1992) for ordinal response data as an example. For simplicity, we focus on the two-group setting (i.e., x_i ∈ {0, 1}) and uniform DIF. Let {0, 1, ..., m_j} be the ordered categories of item j. The measurement model is extended so that the DIF parameters γ_jk depend on both the item and the category, and we keep the structural model the same as in Section 2.2. Under this model, an item j is DIF-free if γ_jk = 0 for all k. The location indeterminacy under this model leads to an ML1 condition, holding for all c ≠ 0, that identifies the true model parameters Ξ* = {β*, (σ*)², a*_j, d*_j, γ*_jk : k = 1, ..., m_j, j = 1, ..., J}.
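For reference, the GPCM with uniform DIF on the category intercepts can be sketched as follows (our notation; the paper's exact parameterization may differ):

```latex
P(Y_{ij} = k \mid \theta_i, x_i)
 \;=\; \frac{\exp\!\Big(\sum_{l=1}^{k}\big(a_j \theta_i + d_{jl} + \gamma_{jl} x_i\big)\Big)}
            {\sum_{k'=0}^{m_j} \exp\!\Big(\sum_{l=1}^{k'}\big(a_j \theta_i + d_{jl} + \gamma_{jl} x_i\big)\Big)},
\qquad k \in \{0, 1, \ldots, m_j\},
```

with the empty sum (for k = 0) equal to zero.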

5 Simulation Study
This section presents simulation studies that evaluate the performance of the proposed method and compare it with the likelihood ratio test (LRT) method (Thissen, 1988) and the LASSO method (Bauer et al., 2020). Note that the LRT method requires a known anchor item set. Correctly specified anchor item sets of different sizes will be considered when applying the LRT method.
In the simulation, we set the number of items to J = 25 and consider two sample sizes, N = 500 and N = 1000. The parameters of the true model are set as follows. First, the discrimination parameters are set between 1 and 2, and we consider two sets of easiness parameters: a small-d_j set with values between −1 and 1 and a large-d_j set with values between −2 and 2. The observations are split into two groups of equal size, indicated by x_i = 0 and 1. The parameter β in the structural model is set to 0.5 and the parameter σ is set to 0.5, so that the latent trait distribution is standard normal N(0, 1) for the reference group and N(0.5, 0.5²) for the focal group. We consider six settings for the DIF parameters: three settings with DIF item proportions ranging from high to low at smaller absolute DIF parameter values, and three with DIF item proportions ranging from high to low at larger absolute DIF parameter values. Specifically, at both the smaller and larger absolute DIF parameter values, the three settings contain 5, 10, and 14 DIF items out of 25 for the low, medium, and high DIF proportions, respectively. The true parameter values are given in Table 1. For all sets of DIF parameters, the ML1 condition is satisfied. The combinations of sample sizes and DIF parameter settings lead to 24 settings in total. For each setting, 100 independent datasets are generated.
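As a rough illustration of this data-generating process, the following sketch simulates responses from the 2PL MIMIC model with uniform DIF (the parameter values are illustrative draws from the stated ranges; the exact values used in the study are in Table 1):

```python
import numpy as np

rng = np.random.default_rng(0)
J, N = 25, 500

a = rng.uniform(1.0, 2.0, size=J)     # discrimination parameters in [1, 2]
d = rng.uniform(-1.0, 1.0, size=J)    # easiness parameters (small-d_j setting)
gamma = np.zeros(J)
gamma[:5] = 0.5                       # 5 DIF items (low-proportion setting)

x = np.repeat([0, 1], N // 2)         # equal-sized reference/focal groups
beta, sigma = 0.5, 0.5
# theta ~ N(0, 1) in the reference group, N(beta, sigma^2) in the focal group
theta = np.where(x == 0,
                 rng.standard_normal(N),
                 beta + sigma * rng.standard_normal(N))

# 2PL with uniform DIF: logit P(Y_ij = 1) = a_j * theta_i + d_j + gamma_j * x_i
logits = theta[:, None] * a + d + x[:, None] * gamma
Y = (rng.random((N, J)) < 1.0 / (1.0 + np.exp(-logits))).astype(int)
print(Y.shape)  # (500, 25)
```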
We first evaluate the accuracy of the proposed estimator Ξ̂ given by Algorithm 1. Table 2 shows the mean squared errors (MSEs) for β and σ, and the average MSEs for the a_j's, d_j's, and γ_j's, obtained by averaging the corresponding MSEs over the J items. As we can see, these MSEs and average MSEs are small in magnitude and decrease as the sample size N increases under each setting. This observation aligns with the consistency result in Theorem 1.
We then compare the proposed method and the LRT method in terms of their performance on statistical inference. Specifically, we focus on whether the FDR can be controlled when applying the B-H procedure to the p-values obtained from the two methods. The comparison results are given in Table 3. As we can see, the FDR is controlled below the targeted level for the proposed method and for the LRT method with 1, 5, and 10 anchor items under all settings.
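The B-H step-up procedure applied here is standard; a minimal generic sketch (not the paper's implementation) is:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level q (B-H step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest i with p_(i) <= i*q/m
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(pvals, q=0.05))  # rejects the two smallest p-values
```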
When anchor items are known, a standard error can be computed for each estimated γ_j, and thus the corresponding Wald interval can be constructed. We compare the coverage rates of the confidence intervals given by Algorithm 2 with those of the Wald intervals based on five anchor items. The results are shown in Figure 3. The coverage rates of the two methods are comparable across all settings and are close to the 95% target level. Note that these coverage rates are calculated from only 100 replicated datasets and may therefore be slightly affected by Monte Carlo error.
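For concreteness, a Wald interval of the kind compared here takes the familiar form estimate ± z·SE (a generic sketch; the estimates and standard errors themselves would come from the anchor-based model fit, and the values below are illustrative):

```python
def wald_ci(estimate, se, level=0.95):
    """Wald confidence interval: estimate +/- z * se."""
    z = {0.90: 1.6449, 0.95: 1.9600, 0.99: 2.5758}[level]
    return estimate - z * se, estimate + z * se

# Illustrative: a hypothetical gamma_j estimate and its standard error
lo, hi = wald_ci(0.42, 0.15)
print(round(lo, 3), round(hi, 3))  # 0.126 0.714
```

The empirical coverage rate is then the fraction of replications in which the interval contains the true γ*_j.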
Finally, we compare the detection power of the different methods using receiver operating characteristic (ROC) curves. For a given method, an ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at different threshold settings. More specifically, ROC curves are constructed for the LASSO method by varying the tuning parameter λ from 0.02 to 0.2, and for the LRT method with 1, 5, and 10 anchor items, respectively. Note that for the LRT method, the TPR and FPR are calculated based on the non-anchor items only. For each method, an average ROC curve is obtained from the 100 replications, and the area under the ROC curve (AUC) is calculated; a larger AUC value indicates better detection power. The AUC values for the different methods across our simulation settings are given in Table 4. According to the AUC values, the proposed procedure, i.e., the p-value-based method from Algorithm 2, performs better than the rest: without knowing any anchor items, it outperforms the LRT method with 1 or 5 anchor items, and it performs similarly to the LRT method with 10 anchor items under some settings with large DIF effects or large sample size N. The superior performance of the proposed procedure stems from the ML1 condition, which identifies the model parameters using information from all the items. The AUC values also show that the LASSO procedure performs similarly to the proposed procedure under some of the large-DIF settings but is less accurate under the small-DIF settings.
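The TPR/FPR sweep described above is equivalent to computing the AUC by the rank (Mann-Whitney) formulation; a minimal generic sketch, where "scores" stand in for any DIF-evidence measure (e.g., negative log p-values) and the example values are made up:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the rank formulation: the probability that a randomly chosen
    DIF item (label 1) scores higher than a randomly chosen non-DIF item
    (label 0), with ties counted as 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
           + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]      # which items truly have DIF
scores = [3.2, 1.1, 0.9, 1.4, 0.2]
print(roc_auc(scores, labels))
```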
6 Application to EPQ-R Data

DIF methods have been commonly used for assessing the measurement invariance of personality tests (e.g., Escorial and Navas, 2007; Millsap, 2012; Thissen et al., 1986). In this section, we apply the proposed method to the Eysenck Personality Questionnaire-Revised (EPQ-R; Eysenck et al., 1985), a personality test that has been intensively studied and applied worldwide (Fetvadjiev and van de Vijver, 2015). The EPQ-R has three scales that measure the Psychoticism (P), Neuroticism (N), and Extraversion (E) personality traits, respectively. We analyze the long forms of the three personality scales, which consist of 32, 24, and 23 items, respectively. Each item has binary responses of "yes" and "no", coded 1 and 0, respectively. This analysis is based on EPQ-R data collected from 1432 participants in the United Kingdom. Among these participants, 823 are female and 609 are male; females and males are indicated by x_i = 0 and 1, respectively. We study DIF caused by gender. The three scales are analyzed separately using the proposed methods. The results are shown in Tables 5-7 and Figure 4.

Figure 3: Scatter plots of the coverage rates of the 95% confidence intervals for the γ*_j's, with item numbers on the x-axes and coverage rates on the y-axes. Panels (a)-(d) correspond to the proposed method, and panels (e)-(h) correspond to the Wald intervals constructed with five anchor items. The six marker types distinguish the small- and large-d_j settings crossed with the high, medium, and low DIF-item proportions.
Specifically, Tables 5-7 present the p-values from the proposed method for testing γ_j = 0 and the detection results for the P, E, and N scales, respectively. In each table, the items are ordered by increasing p-value. The items marked "F" are those detected by the B-H procedure with FDR level 0.05, and those marked "L" are those detected by the LASSO method with the tuning parameter λ chosen by BIC. The item IDs are consistent with those in Appendix 1 of Eysenck et al. (1985), where the item descriptions are given. The three panels of Figure 4 further give the point estimate and confidence interval for each γ_j parameter for the three scales, respectively. Under the current model parameterization, a positive DIF parameter means that a male participant is more likely to answer "yes" to the item than a female participant with the same personality trait level. We note that the absolute values of the γ̂_j's are all below 1, suggesting that there are no items with very large gender-related DIF effects. From Tables 5-7, we see that all three scales have some items whose p-values are close to zero, suggesting that gender DIF may exist in all three scales. The DIF items selected by the B-H procedure at the 5% FDR level seem sensible. In what follows, we give some examples. For the P scale, the top four items are selected. These items are "14. Do you dislike people who don't know how to behave themselves?", "7. Would being in debt worry you?", "34. Do you have enemies who want to harm you?", and "81. Do you generally 'look before you leap'?", with the DIF effect of item 7 being negative and those of the rest being positive. The discovery of items 14, 7, and 34 is consistent with the personality literature, where previous research has found that women are more gregarious and trusting than men, while men tend to be more risk-taking (Costa et al., 2001; Feingold, 1994). It is unclear from previous research why item 81 has a positive DIF effect; we conjecture that it is due to sociocultural influences. This result is consistent with that of another P-scale item with a similar statement, "2. Do you stop to think things over before doing anything?". Although not selected by the B-H procedure, the estimated DIF effect of this item is also positive, and its 95% confidence interval does not include zero.
For the E scale, eleven items are selected by the B-H procedure. Here, we discuss the top five items: "63. Do you nearly always have a 'ready answer' when people talk to you?", "36. Do you have many friends?", "90. Do you like plenty of bustle and excitement around you?", "6. Are you a talkative person?", and "33. Do you prefer reading to meeting people?", where items 63 and 33 have positive DIF effects and the remaining three have negative DIF effects. The discovery of these items is not surprising. The DIF effects of items 36, 90, 6, and 33 are consistent with previous observations that women are more motivated to engage in social activities and tend to have more interconnected and affiliative social groups (Cross and Madson, 1997), which may be explained by the theory of self-construals (Markus and Kitayama, 1991). The DIF effect of item 63 is consistent with previous findings that men tend to score higher on assertiveness (Costa et al., 2001; Feingold, 1994; Weisberg et al., 2011).
For the N scale, ten items are selected by the B-H procedure. Again, we discuss the top five items. Among them, the DIF effect of item 87 is consistent with the fact that women tend to score higher in tender-mindedness (Costa et al., 2001; Feingold, 1994), and the positive DIF effects of items 84 and 70 may again be explained by the theory of self-construals (Markus and Kitayama, 1991).
From Tables 5-7, we see that the selections based on the B-H procedure with FDR level 0.05 and those based on the LASSO procedure are largely consistent but do not match exactly. For the P scale, the two procedures agree on four DIF detections, while the LASSO procedure additionally identifies four DIF items. For the E scale, they agree on six DIF detections, while the B-H procedure additionally identifies five items and the LASSO procedure additionally identifies one. Finally, for the N scale, there are eight common detections; besides these, two items are uniquely identified by the B-H procedure and four by the LASSO procedure. Since the two procedures have different objectives (controlling the FDR versus consistent model selection), it is not surprising that their results do not coincide exactly. A consensus between the two methods suggests strong evidence, and these common detections should therefore draw our attention and be investigated first. For example, the content of the DIF items may be reviewed by experts, and new data may be collected to test these DIF effects through a confirmatory analysis. When resources permit, the items identified by only one of the methods should also be investigated.

Table 5: P-values for testing γ*_j = 0 for items in the P scale. The items are ordered by increasing p-value. Items selected by the B-H procedure with FDR controlled at 5% and by the LASSO method are marked with "F" and "L", respectively, following the item numbers.

This paper proposes a new method for DIF analysis under a MIMIC model framework. It can accurately estimate the DIF effects of individual items without requiring prior knowledge about an anchor item set and can also provide valid p-values. The p-values can be used for detecting DIF items and controlling the uncertainty in the decisions. According to our simulation results, the proposed p-value-based procedure has performance comparable to the LASSO method of Belzak and Bauer (2020) in classifying DIF and non-DIF items. In addition, the p-value-based methods accurately control the item-specific type-I errors and the FDR. Finally, the
proposed method is applied to the three scales of the Eysenck Personality Questionnaire-Revised to study gender-related DIF. For each of the three long forms of the P, N, and E scales, around ten items are detected by the proposed procedures as potential DIF items. The psychological mechanism behind these DIF effects is worth further investigation. While the paper focuses on the two-group setting and uniform DIF, extensions to more complex settings are discussed in Section 4, including the non-uniform DIF, multi-group, continuous covariate, and ordinal response settings. An R package implementing the proposed procedures has been developed and will be published online upon the acceptance of this paper.
The proposed method has several advantages over the LASSO method. First, the proposed method does not require a tuning parameter for estimating the model parameters, whereas the LASSO method requires choosing the tuning parameter of the regularization term; the proposed method is therefore more straightforward for practitioners to use. Second, we do not need to solve the regularized optimization problem repeatedly. Relatedly, a linking approach based on an L_p loss function is similar in spirit to the proposed method but focuses on linking multiple groups rather than DIF detection. We believe the proposed method can be readily adapted to the linking problem to provide consistent parameter estimation and valid statistical inference; this is left for future investigation.

Appendix
This appendix contains the proofs of the proposition and theorems (Section A), and discusses the asymptotic distribution of Ξ̂ and the implementation details of Algorithms 1 and 2 (Section B).

A Proofs of Propositions and Theorems
Proof of Proposition 1. Note that h is differentiable for all c ≠ 0.

The entries of the information matrix involve expectations of terms of the form x_i p_ij (1 − p_ij). In practice, we can use a Gaussian quadrature method to approximate these expectations so as to obtain Î(Ξ̂). Then Σ̂_N can be evaluated as Σ̂_N = Î^{-1}(Ξ̂). This enables Step 1 of Algorithm 1, where Monte Carlo samples of Ξ† are simulated from N(Ξ̂, Σ̂_N).
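A generic sketch of this quadrature-then-sample computation (not the paper's implementation; the parameter dimension and covariance below are toy placeholders):

```python
import numpy as np

def gauss_hermite_expectation(f, mean=0.0, sd=1.0, n=30):
    """Approximate E[f(theta)] for theta ~ N(mean, sd^2) via Gauss-Hermite
    quadrature, using the change of variables theta = mean + sqrt(2)*sd*t."""
    nodes, weights = np.polynomial.hermite.hermgauss(n)
    theta = mean + np.sqrt(2.0) * sd * nodes
    return float(np.sum(weights * f(theta)) / np.sqrt(np.pi))

# Sanity check: E[theta^2] = 1 for theta ~ N(0, 1)
second_moment = gauss_hermite_expectation(lambda t: t ** 2)

# Monte Carlo samples of the parameters are then drawn from N(xi_hat, Sigma_hat);
# xi_hat and sigma_hat here are illustrative placeholders, not fitted values.
rng = np.random.default_rng(0)
xi_hat = np.zeros(3)
sigma_hat = np.eye(3)
draws = rng.multivariate_normal(xi_hat, sigma_hat, size=1000)
```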

Figure 1: The path diagram of a MIMIC model for DIF analysis. The subscript i is omitted for simplicity. The dashed lines from x to Y_j indicate the DIF effects.

Figure 2: Plot of the function h(c) used in the ML1 condition.

Figure 4: Plots of the 95% confidence intervals for the DIF parameters γ*_j on the P, N, and E scale data sets. The red horizontal lines denote γ = 0. Items are arranged by increasing p-value.

Table 1: Discrimination, easiness, and DIF parameter values used in the simulation studies.

Table 2: Mean squared errors of the estimated parameters in the simulation studies. Mean squared errors are first computed by averaging over 100 replications and then averaged across the 25 items to obtain the average mean squared errors for a, d, and γ. The mean squared errors for β and σ are also presented.

Table 3: Comparison of the FDR of the proposed p-value-based method and the LRT method with 1, 5, and 10 anchor items, respectively, at an FDR control level of 5%. The values are averaged over 100 replications.

Table 4: Comparison of the AUC of the proposed p-value-based method, the LASSO method, and the LRT method with 1, 5, and 10 anchor items, respectively.

Table 6: P-values for testing γ*_j = 0 for items in the E scale. The items are ordered by increasing p-value. Items selected by the B-H procedure with FDR controlled at 5% and by the LASSO method are marked with "F" and "L", respectively, following the item numbers.

Table 7: P-values for testing γ*_j = 0 for items in the N scale. The items are ordered by increasing p-value. Items selected by the B-H procedure with FDR controlled at 5% and by the LASSO method are marked with "F" and "L", respectively, following the item numbers.