Semi-automated Rasch analysis with differential item functioning

Rasch analysis is a procedure to develop and validate instruments that aim to measure a person’s traits. However, manual Rasch analysis is a complex and time-consuming task, even more so when the possibility of differential item functioning (DIF) is taken into consideration. Furthermore, manual Rasch analysis by construction relies on a modeler’s subjective choices. As an alternative approach, we introduce a semi-automated procedure that is based on the optimization of a new criterion, called in-plus-out-of-questionnaire log likelihood with differential item functioning (IPOQ-LL-DIF), which extends our previous criterion. We illustrate our procedure on artificially generated data as well as on several real-world datasets containing potential DIF items. On these real-world datasets, our procedure found instruments with clinimetric properties similar to those suggested by experts through manual analyses.


Introduction
In measurement theory, personal aspects may contain latent constructs or traits that cannot be measured directly, such as "intelligence" and "quality of life". In an effort to measure these latent constructs, many scales have been developed from uniquely designed questionnaires. Rasch analysis is one of the scientific methods to transform the original survey into a linear-weighted, clinimetrically sound scale. Using inherent criteria, e.g., goodness-of-fit, unidimensionality, and local dependency (Mesbah, 2010), manual Rasch analysis follows a step-by-step procedure, repeatedly fitting the observed responses to the Rasch model. The worst item(s) are generally removed, after which the remaining items are reevaluated, until a clinimetrically optimal itemset has been obtained. Rasch analysis becomes even more complex when the original survey contains items that can function differently due to the respondents' backgrounds (e.g., age, gender, and nationality). This phenomenon is known as differential item functioning (DIF) (Holland & Wainer, 1993). DIF occurs if respondents from a particular group tend to score higher or lower on a particular item compared to other group(s), despite having otherwise similar characteristics. Such items are often found in clinical observations: for example, on an item about running, older people tend to report more trouble than younger people. Ignoring such differences leads to a biased instrument (Borsboom, 2006; Kopf, Zeileis, & Strobl, 2015). DIF assessment has become one of the standard ingredients of Rasch analysis and has been implemented in various ways, e.g., (Holland & Thayer, 1986; Swaminathan & Rogers, 1990; Kreiner & Christensen, 2011; Magis & Facon, 2013; Tutz & Schauberger, 2015; Komboz, Strobl, & Zeileis, 2018; Schauberger & Mair, 2020; Schneider, Strobl, Zeileis, & Debelak, 2021).
In current practice, step-by-step procedures are carried out manually by experts, which can be relatively time-consuming even with the support of available software packages, such as (Choi, Gibbons, & Crane, 2011; www.rasch.org, 2014; Magis & Facon, 2014; Jeon & Rijmen, 2016; Bollmann, Berger, & Tutz, 2018). Decisions on how to prioritize the various evaluation criteria and which items to include partly rely on human judgments blended with clinical expertise, and different experts may obtain different but equally suitable instruments. These procedures become even more complex when the DIF items have to be resolved iteratively (Andrich & Hagquist, 2015; Hagquist & Andrich, 2017). The objective of this research is to incorporate the DIF assessment procedure while automating the Rasch analysis. In doing so, we extend our previous method, which automates the Rasch analysis using the in-plus-out-of-questionnaire log likelihood (IPOQ-LL) criterion (Wijayanto, Mul, Groot, van Engelen, & Heskes, 2021). The extended method naturally incorporates standard Rasch criteria, e.g., item goodness-of-fit and unidimensionality (Wijayanto et al., 2021). Additionally, we expect the method to perform fairly well, automatically, even though it does not address local dependencies directly: reliable estimation of abilities benefits more from items with uncorrelated residuals than from those with correlated residuals (Wijayanto et al., 2021). Accordingly, we will show that our new procedure also naturally incorporates the standard DIF assessment in Rasch analysis. Our novel procedure makes use of a generalization of the IPOQ-LL criterion, which we will refer to as the in-plus-out-of-questionnaire log likelihood with DIF (IPOQ-LL-DIF).
* Feri Wijayanto f.wijayanto@cs.ru.nl; feri.wijayanto@uii.ac.id
The rest of this article is structured as follows. "Preliminary" section describes the central model in our implementation, the GPCMlasso model (Schauberger & Mair, 2020), its transformation to other models, and the idea to solve its estimation problem using the L1 (lasso) penalty together with coordinate descent. "The proposed method" section discusses the main part of our proposed method, the in-plus-out-of-questionnaire log likelihood with DIF (IPOQ-LL-DIF), which extends the previous method, and argues for the method in comparison with the typical assessment of DIF items in standard Rasch analysis. "Experimental study" section reports our experimental results on an artificial dataset and three real-world datasets. "Discussion and conclusions" section discusses general aspects of our procedure and the results it obtained, and concludes our research. The R package containing the algorithm and results reported in this paper can be found at https://github.com/fwijayanto/autoRasch.

Generalized partial credit model with DIF
Differential item functioning (DIF) refers to the situation where members from different groups (age, gender, race, education, culture) on the same level of the latent trait (disease severity, quality of life) have a different probability of giving a certain response to a particular item (Chen & Revicki, 2014). In short, DIF occurs as a result of an inconsistency between estimated abilities and true abilities for given groups. If the inconsistency uniformly affects all subjects in the group, it is known as uniform DIF; otherwise, it is non-uniform DIF (Hagquist & Andrich, 2017). Additionally, Penfield (2007) discusses the complexity of DIF in the polytomous case by introducing differential step functioning (DSF), which allows an item to function differentially not only at the item level but also at the category level. DSF simplifies to DIF when there is a constant difference between groups at the category level. For now, we consider DIF and provide more details on DSF in Appendix 2.
In this work, we focus on uniform DIF and adopt the GPCMlasso model, introduced in Schauberger and Mair (2020), which extends the generalized partial credit model (GPCM) (Muraki, 1992) after parameterizing the DIF effects. Rooted in the GPCM, the GPCMlasso has the ability to model responses that are coded into two or more ordered categories. We write x_ni ∈ {0, 1, …, m_i} for the observed response of subject n on item i, where item i consists of m_i + 1 ordered categories. We have m_i = 1 for dichotomous test items and m_i > 1 for polytomous items.
The GPCMlasso model contains the same type of parameters as the GPCM: θ_n for the ability of subject n, β_ij, with j = 1, …, m_i, for the difficulties or thresholds of item i, and α_i for the discrimination parameter of item i. Additionally, to model the difference in difficulty on item i between the members and non-members of focal group f, the DIF parameters δ_if are introduced. Furthermore, κ_nf, with f = 1, …, m_f and where m_f represents the number of potential DIF-inducing covariates, is a binary matrix that maps subject n into group f, with κ_nf = 1 if respondent n is a member of group f and κ_nf = 0 otherwise.
Given these definitions, the probability that subject n gives response x on item i is specified by Eq. 1 for x > 0 and by Eq. 2 otherwise. From now on, we will refer to this as the generalized partial credit model with differential item functioning, GPCM-DIF. Setting α_i = 1 for i = 1, …, ℙ in the GPCM-DIF model gives what we will refer to as the partial credit model with DIF (PCM-DIF). Using the PCM-DIF to estimate the respondents' traits is comparable to the use of the partial credit model (PCM) on items after the DIF has been resolved. With δ_if = 0 for i = 1, …, ℙ and f = 1, …, m_f we obtain the GPCM. By then also fixing α_i = 1 for i = 1, …, ℙ, we get the PCM.
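As a reading aid, the model can be written out under the standard GPCM parameterization with a uniform DIF shift added to every threshold of an item. This is a sketch consistent with the definitions above, not a verbatim copy of Eqs. 1 and 2; in particular, the sign with which δ_if enters (here: a positive δ_if makes the item harder for members of group f) is an assumption.

```latex
\begin{align}
P(X_{ni} = x \mid \theta_n, \beta_i, \alpha_i, \delta_i)
  &= \frac{\exp\Bigl(\sum_{j=1}^{x} \alpha_i \bigl(\theta_n - \beta_{ij}
        - \sum_{f=1}^{m_f} \kappa_{nf}\,\delta_{if}\bigr)\Bigr)}
          {1 + \sum_{k=1}^{m_i} \exp\Bigl(\sum_{j=1}^{k} \alpha_i \bigl(\theta_n - \beta_{ij}
        - \sum_{f=1}^{m_f} \kappa_{nf}\,\delta_{if}\bigr)\Bigr)},
  \qquad x > 0, \\
P(X_{ni} = 0 \mid \theta_n, \beta_i, \alpha_i, \delta_i)
  &= \frac{1}{1 + \sum_{k=1}^{m_i} \exp\Bigl(\sum_{j=1}^{k} \alpha_i \bigl(\theta_n - \beta_{ij}
        - \sum_{f=1}^{m_f} \kappa_{nf}\,\delta_{if}\bigr)\Bigr)}.
\end{align}
```

In this form, setting every δ_if to zero recovers the GPCM, and additionally fixing α_i = 1 recovers the PCM, in line with the special cases listed above.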

Coordinate descent
Given observed responses x_ni, the log likelihood of all model parameters for a given set of items S ⊂ {1, …, ℙ} is the sum, over all subjects n and all items i ∈ S, of ln P(X = x_ni | θ, β, α, δ), with P(X = x_ni | θ, β, α, δ) from Eqs. 1 and 2. This log likelihood measures how well the parameters predict the subjects' observed responses on the items from set S.
We turn the log likelihood into a penalized log likelihood by adding penalty terms. As in Wijayanto, Mul, Groot, van Engelen, and Heskes (2021), we add Tikhonov regularization for the abilities θ, to regularize these towards zero, as well as for ln α, to drive the discrimination parameters towards one. Inspired by Schauberger and Mair (2020), we further add a Lasso (L1) penalty for the DIF parameters δ, so that irrelevant DIF parameters are optimized to zero. In the resulting penalized log likelihood (Eq. 4), λ_θ, λ_α, and λ_δ are the penalty coefficients of the θ, α, and δ parameters, respectively.
To optimize Eq. 4, we propose to apply two-level coordinate descent (Friedman, Hastie, Höfling, & Tibshirani, 2007). At the top level, we treat the GPCM parameters θ, α, and β as one coordinate, and the DIF parameters δ as another. Given fixed DIF parameters, we optimize the GPCM parameters using penalized joint maximum likelihood estimation (PJMLE). As an alternative, we could here replace the PJMLE by marginal maximum likelihood estimation (MMLE), which optimizes the β parameters after integrating out the θ parameters. In this paper, we stick to the PJMLE for simplicity. Moreover, recent studies have demonstrated that PJMLE yields estimates comparable to those of the MMLE (Paolino, 2013; Chen, Li, & Zhang, 2019; Robitzsch, 2021). Given fixed GPCM parameters, we optimize the DIF parameters through coordinate descent at the second level, treating each δ_if for i = 1, …, ℙ and f = 1, …, m_f as a unique coordinate. Details of the coordinate descent algorithm applied to Eq. 4 are provided in Appendix 2.
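The penalized log likelihood can be sketched in a few lines of Python. This is a minimal illustration, not the autoRasch implementation: the function names are hypothetical, the penalties are assumed to be squared (ridge) terms on θ and ln α and an absolute-value (L1) term on δ, and the DIF shift is assumed to be added to the item difficulty.

```python
import math


def gpcm_dif_probs(theta, betas, alpha, dif_shift):
    """Category probabilities P(X = 0), ..., P(X = m) for one subject on one item.

    theta: ability; betas: list of m thresholds; alpha: discrimination;
    dif_shift: sum over groups f of kappa_nf * delta_if (assumed to raise difficulty).
    """
    # Exponent of category x is the sum of alpha * (theta - beta_j - dif_shift)
    # over j = 1..x; category 0 has exponent 0.
    exponents = [0.0]
    for beta in betas:
        exponents.append(exponents[-1] + alpha * (theta - beta - dif_shift))
    top = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def penalized_log_likelihood(responses, kappa, theta, betas, alphas, deltas,
                             lam_theta, lam_alpha, lam_delta):
    """Log likelihood minus ridge penalties on theta and ln(alpha) and an L1 penalty on delta.

    responses[n][i]: observed category of subject n on item i
    kappa[n][f]:     1 if subject n belongs to group f, else 0
    deltas[i][f]:    DIF parameter of item i for group f
    """
    log_lik = 0.0
    for n, row in enumerate(responses):
        for i, x in enumerate(row):
            shift = sum(k * d for k, d in zip(kappa[n], deltas[i]))
            probs = gpcm_dif_probs(theta[n], betas[i], alphas[i], shift)
            log_lik += math.log(probs[x])
    penalty = lam_theta * sum(t * t for t in theta)
    penalty += lam_alpha * sum(math.log(a) ** 2 for a in alphas)
    penalty += lam_delta * sum(abs(d) for row in deltas for d in row)
    return log_lik - penalty
```

Because the discrimination penalty acts on ln α, it vanishes at α = 1, which is what drives the discrimination parameters towards one rather than towards zero.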
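The way coordinate descent combines with an L1 penalty to set small effects exactly to zero is easiest to see on a simpler objective. The sketch below swaps in an ordinary least-squares lasso, 0.5 * ||y - Xw||^2 + λ ||w||_1, for the GPCM-DIF likelihood; it is not the update rule of Appendix 2, and the function names are hypothetical. Each coordinate is updated in turn with the soft-thresholding operator while all other coordinates are held fixed.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values within the threshold become 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 by cyclic coordinate descent.

    Assumes every column of X has at least one non-zero entry.
    """
    n_rows, n_cols = len(X), len(X[0])
    w = [0.0] * n_cols
    for _ in range(n_sweeps):
        for j in range(n_cols):
            # Correlation of column j with the residual that ignores coordinate j.
            rho = 0.0
            for r in range(n_rows):
                partial_fit = sum(X[r][k] * w[k] for k in range(n_cols) if k != j)
                rho += X[r][j] * (y[r] - partial_fit)
            col_norm_sq = sum(X[r][j] ** 2 for r in range(n_rows))
            w[j] = soft_threshold(rho, lam) / col_norm_sq
    return w


weights = lasso_coordinate_descent([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=1.0)
print(weights)
```

With the identity design used here, the large effect is shrunk from 3.0 to 2.0 and the small effect of 0.2 is set exactly to zero, which mirrors how a sufficiently large λ_δ removes weak DIF parameters.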

The proposed method
In-plus-out-of-questionnaire log likelihood with DIF
In instrument design, we are given an initial set of ℙ items that, based on responses on a survey including all these items, we would like to reduce to a smaller set of items that make up the final questionnaire. We will refer to the set of included items as the included itemset, denoted S_in, and to its complement as the excluded itemset, denoted S_out = {1, …, ℙ} ⧵ S_in. In Wijayanto et al. (2021), we introduced a novel criterion called in-plus-out-of-questionnaire log likelihood (IPOQ-LL) for evaluating the quality of any split into S_in and S_out given the observed responses on the original survey. Following the same rationale, we here extend this criterion to also incorporate the possibility of item(s) with differential functioning. For a given final questionnaire, only the items in the included itemset S_in can be used to estimate the subjects' abilities θ. We propose to obtain these abilities, and at the same time the discrimination parameters, thresholds, and DIF parameters corresponding to the included items, by maximizing the penalized log likelihood in Eq. 4 on the included itemset. We refer to the log likelihood of these fitted parameters on the included itemset as the in-questionnaire log likelihood with DIF (IQ-LL-DIF). This IQ-LL-DIF resembles standard test statistics in Rasch analysis (e.g., item fit statistics to the resolved DIF items) (Tennant et al., 2004, p. I-40).
Next, although we may not need the excluded items to arrive at a reliable and valid scale, we would like the abilities estimated on S_in to properly represent the observed responses on S_out as well, if only because the original survey was designed to also include these items. We therefore fix the abilities estimated on S_in and optimize the penalized log likelihood given the responses on the excluded items w.r.t. the thresholds, the discrimination parameters, and the DIF parameters. We refer to the resulting log likelihood on the excluded itemset as the out-of-questionnaire log likelihood with DIF. Our new criterion, the in-plus-out-of-questionnaire log likelihood with DIF (IPOQ-LL-DIF), is the total of both log likelihoods. Algorithm 1 outlines the procedure for computing the in-plus-out-of-questionnaire log likelihood with DIF given a subdivision of all items into the included itemset S_in and excluded itemset S_out.
In our earlier work (Wijayanto et al., 2021), we noticed that the outcome of our fitting procedure without the additional DIF parameters is relatively insensitive to the setting of the regularization parameters, as long as the regularization parameter λ_in of the β and α parameters for the included itemset is an order of magnitude larger than the regularization parameter λ_out for the excluded itemset. In this paper, we therefore stick to the same settings: λ_θ = 0.05, λ_in = 50, and λ_out = 1.
Whether non-zero DIF parameters are obtained does depend on the precise setting of the regularization parameter λ_δ: the larger λ_δ, the fewer non-zero DIF parameters will remain. Unless specified otherwise, in this paper we set λ_δ = 10. With this setting, our procedure yields more or less the same DIF items in the three real-world datasets as those obtained with a manual analysis. An arguably more principled, but computationally much more intensive, approach would be a cross-validation procedure for finding the optimal value of λ_δ, as described in Appendix 2.
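The two-stage computation described above can be outlined as follows. This is a schematic sketch only: `fit_included`, `fit_excluded`, and `log_likelihood` are hypothetical callables standing in for the penalized fits and the (unpenalized) log likelihood, and their signatures are assumptions made for illustration.

```python
def ipoq_ll_dif(items, included, fit_included, fit_excluded, log_likelihood):
    """Return the in-plus-out-of-questionnaire log likelihood for one split.

    fit_included(itemset) -> (abilities, item_params): penalized fit on S_in
    fit_excluded(itemset, abilities) -> item_params: penalized fit on S_out
        with the abilities held fixed
    log_likelihood(itemset, abilities, item_params) -> float
    """
    included = set(included)
    excluded = set(items) - included

    # Stage 1: abilities and item parameters are estimated from the included items only.
    abilities, params_in = fit_included(included)
    iq_ll = log_likelihood(included, abilities, params_in)

    # Stage 2: abilities stay fixed; only item parameters of the excluded items are fitted.
    params_out = fit_excluded(excluded, abilities)
    oq_ll = log_likelihood(excluded, abilities, params_out)

    return iq_ll + oq_ll
```

Passing the fitting routines in as arguments keeps the criterion independent of the estimator, so the same outline applies whether the inner fits use PJMLE or another method.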

Comparison with other approaches for DIF assessment
There are two main approaches for handling DIF items.
Blending in with the Rasch analysis. In many practices of Rasch analysis, DIF detection is infused as an additional step in the estimation procedure (Rosato et al., 2016; Vaughan, 2018; 2019). Resolved DIF items are treated as any other items: if they fit the Rasch model well they are kept, otherwise they are removed. In accordance with our previous method (Wijayanto et al., 2021), our new method has a tendency to put predictive split-items in the included itemset: these items help to obtain a better estimate of the subjects' ability not only in the included itemset, but also in the excluded itemset.
Treating the DIF items separately. Andrich and Hagquist (2015) distinguish between 'real' and 'artificial' DIF items. A real DIF item is stable, independent of the inclusion or exclusion of other potential DIF items. An artificial DIF item, on the other hand, only becomes a DIF item by virtue of the presence of other (real) DIF items. Therefore, Andrich and Hagquist (2015) suggest resolving the DIF items iteratively, starting with the largest effect, in an attempt to neutralize the effect of artificial DIF items. Our procedure also applies a thorough strategy to identify and resolve all potential DIF items. However, instead of doing this sequentially, we simultaneously estimate the DIF effects for all items that are still included. The Lasso (L1) penalty helps to distinguish between DIF and non-DIF items by nullifying the insignificant DIF effects.

Itemset selection
In this paper, we introduce a single criterion, IPOQ-LL-DIF, to measure the quality of a final instrument by considering the differential functioning over items. With this criterion, we can in principle apply any optimization procedure to determine which items to keep in the included itemset S_in and which items to put in the excluded itemset S_out. In our experiments, we consider the same optimization procedure as in our previous work (see Wijayanto et al. (2021) for details), i.e., stepwise selection. Stepwise selection alternates between backward elimination, which starts from the full set of items, and forward selection, which starts from the empty set. Starting from the full itemset, backward elimination removes, at each step, the item whose removal yields the highest IPOQ-LL-DIF. Forward selection gives the search procedure the ability to recover items later in the process.
Algorithm 1 Pseudocode for computing the in-plus-out-of-questionnaire log likelihood with DIF for a particular included itemset S_in and excluded itemset S_out.
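The stepwise search over itemsets can be sketched generically for any score function. The exact alternation scheme is described in the earlier work; the variant below is one plausible reading, labeled as an assumption: after every backward step it tries to re-add a single excluded item and accepts that move only if it beats the best score seen so far for itemsets of that size. The function name and the return convention are hypothetical.

```python
def stepwise_selection(items, score):
    """Search for a high-scoring itemset; return (best_score, best_itemset).

    score(itemset) -> float, higher is better (e.g., an IPOQ-LL-DIF value).
    """
    full = frozenset(items)
    current = full
    # Best (score, itemset) found so far for each itemset size.
    best_by_size = {len(current): (score(current), current)}

    while len(current) > 1:
        # Backward step: drop the item whose removal gives the highest score.
        current = max((current - {i} for i in current), key=score)
        current_score = score(current)
        size = len(current)
        if size not in best_by_size or current_score > best_by_size[size][0]:
            best_by_size[size] = (current_score, current)

        # Forward step: re-add one excluded item if that improves on the best
        # itemset of the larger size; this lets removed items be recovered.
        candidates = [current | {i} for i in full - current]
        if candidates:
            grown = max(candidates, key=score)
            grown_score = score(grown)
            if grown_score > best_by_size[len(grown)][0]:
                best_by_size[len(grown)] = (grown_score, grown)
                current = grown

    return max(best_by_size.values(), key=lambda pair: pair[0])
```

Every accepted forward move strictly improves the recorded best score for some itemset size, so the loop terminates. The dictionary `best_by_size` also gives the best itemset found for a fixed size, which is what a comparison with a manual instrument of a given length requires.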

Experimental study
To evaluate our new method, we experiment on an artificial dataset and on three publicly available real-world datasets.

Application to artificial data
This simulation aims to show that our semi-automated algorithm aligns with the standard Rasch analysis procedure for dealing with DIF, i.e., that it identifies and resolves DIF items and removes split items that are relatively hard to predict. In this experiment, we consider an artificial dataset that consists of responses to 14 items from 490 subjects. The dataset is composed of two inhomogeneous subsets with six items each (12 items in total) and two DIF items. To simulate the DIF effect, the subjects are split into two different groups of 245. Responses are generated independently from the generalized partial credit model for the polytomous case with m_i = 5 ordered categories.
For a given dataset containing DIF item(s), the PCM-DIF (see "Generalized partial credit model with DIF" section) can be applied to estimate the DIF parameters. Figure 1 shows that the PCM-DIF can identify the DIF items (item 13 and item 14) when the value of λ_δ is not too high. However, for a high value of λ_δ, these DIF effects disappear and the PCM-DIF leads to the same estimated parameters as a standard PCM without DIF. The PCM-DIF correctly estimates the DIF parameters (δ) of all non-DIF items to equal zero for any value of λ_δ.
Infit is one of the item fit statistics that is commonly used to judge the goodness-of-fit of items to the Rasch model and to the PCM. In Fig. 2, we show that this statistic relates to the discriminative power of the items, represented by the discrimination parameter α. A hard-to-predict item with low discriminative power normally has a high Infit, which indicates misfit. In contrast, an easy-to-predict item with high discriminative power normally has a low Infit. Further, we also show that the misfitting item 13 in Fig. 2a (estimated using the PCM) improves its Infit after considering the DIF effect (estimated using PCM-DIF). The PCM-DIF clearly models the responses of item 13 better than the PCM. As expected, applying the PCM-DIF does not improve the Infit of item 14, the hard-to-predict DIF item. For the non-DIF items, the PCM-DIF estimates are indistinguishable from the PCM estimates.
When DIF is present in particular items, standard Rasch analysis tends to detect and resolve these items. This step, together with expert awareness of the inspected items, is then followed by removing misfits, including those that are hard to predict even after splitting the DIF items. As shown in Fig. 3b, our semi-automated algorithm does the same for reasons explained in "Comparison with other approaches for DIF assessment" section: the IPOQ-LL-DIF score favors the DIF item that has a good fit after being split and puts the one that has a low discrimination parameter in the excluded set. The maximum of the IPOQ-LL-DIF score as a function of the number of items in the included set in this simulation is obtained when seven items are still included, including the resolved item 13.
As a comparison, we also apply our previous criterion, the IPOQ-LL, to this dataset. Figure 3a shows that the IPOQ-LL detects item 13 (the predictive item) as a hard-to-predict item since it cannot estimate the DIF effect. Consequently, when a potential DIF effect is ignored, item 13 will be put in the excluded itemset. As for item 14, being designed as a hard-to-predict DIF item, both IPOQ-LL and IPOQ-LL-DIF agree to put it in the excluded itemset. This is also supported by Fig. 2, which shows that the Infit statistic of item 14 does not improve even after the DIF effect has been identified.
Fig. 1 The estimated DIF parameters (δ) along log(λ_δ) for DIF and non-DIF items. The dashed grey line represents the value of λ_δ that is used in the estimation
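For readers less familiar with the statistic, the Infit mean square of an item can be computed from the model's category probabilities as the sum of squared residuals divided by the sum of model variances (the information-weighted mean square). The sketch below uses this standard definition; the function name and input layout are hypothetical.

```python
def infit_mean_square(responses, category_probs):
    """Information-weighted mean square (Infit) for a single item.

    responses[n]:      observed category of subject n on the item
    category_probs[n]: model probabilities of categories 0..m for subject n
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for x, probs in zip(responses, category_probs):
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residuals += (x - expected) ** 2
        model_variance += variance
    return squared_residuals / model_variance
```

Values near one indicate that the responses vary about as much as the model predicts. Responses that contradict confident predictions push the value above one, which is the pattern described above for hard-to-predict items.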

Application to real-world datasets
To validate our method on real-world data, we searched for datasets that satisfy the following criteria:
• The original dataset (survey with responses) is publicly available.
• A manual Rasch analysis has been applied to develop an instrument.
• According to the manual Rasch analysis, the initial survey contains differential item functioning.
• None of the authors of the current paper have been involved in the development of the instrument.
• The corresponding publication is not more than 5 years old.
We found three such datasets: the Osteopathy Clinical Teaching Questionnaire dataset (Vaughan, 2018), the Interdisciplinary Education Perception Scale dataset (Vaughan, 2019), and the Multiple Sclerosis Quality of Life Scale dataset (Rosato et al., 2016). To these three datasets, we applied our semi-automated procedure with the new criterion, the IPOQ-LL-DIF. For comparison, we also use the IPOQ-LL criterion.

The Osteopathy Clinical Teaching Questionnaire (OCTQ) dataset
The Osteopathy Clinical Teaching Questionnaire (OCTQ) is an instrument that was developed to assess the quality of clinical educators (Vaughan, 2018). The original survey contains 30 items on a five-point Likert scale and three global questions, answered by 399 participants. Vaughan (2018) selected a subset of these items, S_in = {…, 5, 7, 9, 10, 12, 15, 16, 18, 20, 23, 30}, as the final instrument. We will refer to this set of 12 items as the OCTQ manual instrument.
In the original survey, Vaughan (2018) identified some items with disordered thresholds, four items with DIF (item 14, item 19, item 27, and item 28), and 122 misfitting persons. After resolving the few items with disordered thresholds in the original survey (item 1, item 9, item 27, and item 30), we applied the semi-automated procedure for both criteria. Running the whole stepwise procedure leads to the result shown in Fig. 4. Both criteria agree that the maximum of the IPOQ-LL-DIF occurs when the same 26 items are still included. The vertical lines give the location of maximum scores, |S_in| = 26 and |S_in| = 12. The horizontal lines give the location of the corresponding scores and show the score differences among instruments. Figure 4b zooms in on the search result near |S_in| = 12, the number of items in the OCTQ manual instrument.
For a fair and easy comparison with the manual instrument, we zoom in on the semi-automated instruments that are based on the same number of included items. We will refer to these as the IPOQ-LL-DIF (S_in = {2, 3, 5, 7, 10, 12, 14, 16, 18, 22, 26, 30}) and IPOQ-LL (S_in = {2, 3, 5, 7, 10, 12, 14, 16, 17, 22, 26, 30}) instruments, respectively. The semi-automated instruments only differ in one item: 17 versus 18. The overlap between the IPOQ-LL-DIF and the manual instrument is eight items, which can be considered large: the probability of having an overlap of eight or more items just by chance is smaller than 0.05. The overlap between the IPOQ-LL instrument and the manual instrument is seven items.
In the initial analysis, Vaughan (2018) suspected some items to be DIF items, i.e., item 27 and item 28 for institution, item 14 for institution and educator gender, and item 19 for student gender. As part of the Rasch analysis, Vaughan (2018) chose to remove all DIF items to ensure that the final version of OCTQ would be applicable to a range of teaching institutions and free from gender influence. Employing the IPOQ-LL-DIF criterion, our semi-automated procedure retains three DIF items, i.e., item 14, item 22, and item 26.
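The chance-overlap probability quoted above can be checked with a hypergeometric tail. The null model is an assumption on our part: the semi-automated itemset is treated as a uniformly random subset of the same size, drawn from all items of the initial survey. The function name is hypothetical.

```python
from math import comb


def overlap_tail_probability(n_items, n_manual, n_selected, min_overlap):
    """P(overlap >= min_overlap) when n_selected of n_items are drawn at random
    and n_manual of the n_items belong to the manual instrument."""
    total = comb(n_items, n_selected)
    upper = min(n_manual, n_selected)
    favourable = sum(
        comb(n_manual, k) * comb(n_items - n_manual, n_selected - k)
        for k in range(min_overlap, upper + 1)
    )
    return favourable / total


# OCTQ: 30 items, 12-item manual instrument, 12-item selection, overlap of 8 or more.
p_octq = overlap_tail_probability(30, 12, 12, 8)
# IEPS: 16 items, 8-item manual instrument, 8-item selection, overlap of 6 or more.
p_ieps = overlap_tail_probability(16, 8, 8, 6)
print(round(p_octq, 3), round(p_ieps, 3))
```

Under this null model the OCTQ probability is about 0.02 and the IEPS probability about 0.07, in line with the values reported for the two datasets.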
To further illustrate the clinimetric quality of the three (i.e., manual, IPOQ-LL-DIF, and IPOQ-LL) instruments, we consider standard Rasch statistics such as goodness-of-fit, local independency, reliability, and unidimensionality. For comparison we also compute these statistics for 10,000 randomly drawn 12-item instruments. The statistics of the three instruments are all well within the acceptable range and, in this case, are better than most of the random 12-item instruments in local independency (see Appendix 1 for details). Furthermore, we compute the Cronbach-Mesbah curve (Fig. 13) to track how the instrument's internal consistency changes as items are removed one at a time (Mesbah, 2010). Although the highest Cronbach's α is obtained after removing one item, the instrument with the highest IPOQ-LL-DIF score is also considered to have excellent internal consistency (α = 0.97).
Figure 5 compares all instruments using our own IPOQ-LL-DIF criterion. By definition, the IPOQ-LL-DIF instrument is very well optimized for this criterion and scores slightly higher than the IPOQ-LL instrument. However, the manual instrument also does well and better than most of the randomly drawn 12-item instruments, which shows that, as an extension of the IPOQ-LL criterion, the IPOQ-LL-DIF intrinsically captures many of the properties that a typical Rasch analysis cares about, including the presence of DIF.
Considering the standard Rasch statistics, which are averages over all items and all subjects, we conclude that the manual and the semi-automated instruments are clinimetrically all very similar. We then also expect that the abilities estimated for individual subjects based on the manual and the IPOQ-LL-DIF instruments will be very much alike. Figure 6 plots these estimated ability parameters for the two instruments against each other. Indeed, the estimated ability parameters for the two instruments are highly correlated (ρ = 0.975), further showing that both instruments are very similar.
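Each point on a Cronbach-Mesbah curve is a Cronbach's α computed for one itemset. A minimal sketch of that building block, using the textbook formula with sample variances, is given below; the function name and the input layout (one row of item scores per subject) are assumptions made for illustration. Note that α here denotes the reliability coefficient, not the discrimination parameter.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for scores[n][i], the score of subject n on item i.

    Requires at least two items, at least two subjects, and a non-constant total score.
    """
    n_items = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)
```

Recomputing this value after each single-item removal, and keeping at every step the removal that maximizes it, traces out the curve.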

The interdisciplinary education perception scale (IEPS) dataset
The interdisciplinary education perception scale is an instrument to evaluate students' professional perception in a particular program (Vaughan, 2019). The complete survey consists of 18 items on a six-point Likert scale, answered by 319 participants. Adopting the work of Leitch (2014), Vaughan (2019) excluded item 12 and item 17 and applied a manual Rasch analysis to 16 items as the initial survey. During the analysis, Vaughan (2019) identified 51 misfitting persons, resolved four items with disordered thresholds (i.e., item 10, item 13, item 15, and item 16), removed eight items, and ended up with eight items as the final instrument. We will refer to this set of eight items, S_in = {1, 2, 7, 10, 13, 14, 15, 16}, as the IEPS manual instrument.
After resolving the four items with disordered thresholds, we applied our semi-automated procedure to the remaining 16-item IEPS responses. Running the whole stepwise procedure using the IPOQ-LL-DIF and IPOQ-LL criteria leads to the graph shown in Fig. 7. Both criteria agree that the maximum of the IPOQ-LL-DIF occurs when the same 12 items are still included. Using the same setup, the vertical lines give the location of the maximum scores, |S_in| = 12 and |S_in| = 8. The horizontal lines give the location of the corresponding scores and show the score differences among instruments.
For a fair and easy comparison with the manual instrument, we again zoom in on the semi-automated instruments that are based on the same number of included items as the manual instrument. We will refer to these as the IPOQ-LL-DIF and IPOQ-LL instruments, respectively, which happen to contain the exact same items (S_in = {1, 2, 4, 5, 7, 13, 15, 16}). Figure 7b zooms in on the search result near |S_in| = 8, the number of items in the IEPS manual instrument. The overlap between the IPOQ-LL-DIF and the manual instrument is six items. The probability of having an overlap of six or more items just by chance is 0.07.
Vaughan (2019) also reported the presence of three DIF items, i.e., item 6 for year level, item 11 for gender, and item 18 for university. Vaughan (2019) chose to remove all DIF items in order to produce a questionnaire that is free of demographic influence. Our semi-automated procedure also led to the removal of these three potential DIF items, but for a different reason: they did not survive the selection procedure when optimizing the IPOQ-LL-DIF and IPOQ-LL.
The figures in Appendix 1 show that the three (i.e., manual, IPOQ-LL-DIF, and IPOQ-LL) instruments are clearly better than most of the 10,000 randomly drawn eight-item instruments, especially on person separation reliability (PSR), local dependency, and unidimensionality. Furthermore, the statistics for these three instruments are clinimetrically very similar and all well within the acceptable range. Figure 8 shows that the manual instrument obtains an IPOQ-LL-DIF score that is only slightly smaller than the one for the semi-automated instruments. The estimated abilities of the manual and the semi-automated instruments indeed turn out to be very similar (ρ = 0.94) (see Fig. 9). Moreover, in Fig. 16, the Cronbach-Mesbah curve shows that the instrument that obtains the highest IPOQ-LL-DIF score also obtains the highest Cronbach's α.

The multiple sclerosis quality of life (MSQOL) dataset
The multiple sclerosis quality of life questionnaire measures an individual's or a group's perceived physical and mental health over time for people with multiple sclerosis (Rosato et al., 2016). The initial MSQOL survey consists of 54 items with different numbers of categories and was answered by 473 patients. The items are grouped into 12 multi-item and two single-item subscales. Rosato et al. (2016) applied separate manual Rasch analyses to 11 subscales, each originally containing at least three items. For two of these subscales, no items survived the analysis. We will refer to the remaining sets of nine subscales as the MSQOL manual instruments. They are listed in Table 3.
For two subscales ("Bodily Pain" and "Sexual Function"), the manual Rasch analysis kept all items. Running our semi-automated stepwise procedure on these same subscales leads to the results shown in Fig. 10. It can be seen that the semi-automated procedure agrees to keep all items from both subscales: the maximum IPOQ-LL-DIF scores are obtained with all items still in the included set.
Next, we applied our semi-automated procedure on all 11 subscales, constraining the semi-automated instruments to end up with the same number of items as the corresponding manual instruments. For optimization with the IPOQ-LL-DIF and IPOQ-LL criteria we arrived at the exact same included itemsets. We will refer to these as the MSQOL semi-automated instruments, also listed in Table 3. As can be seen, the semi-automated and manual instruments have 21 out of 27 items overlapping, which can be considered large: the probability of having an overlap of 21 or more items just by chance is smaller than 0.01.
Table 4 compares the psychometric quality of the manual and the semi-automated MSQOL instruments for all subscales. It can be seen that the standard Rasch statistics for both instruments are more or less the same. Rosato et al. (2016) also reported the presence of eight DIF items, i.e., item 1, item 8, item 10, item 23, item 32, item 36, item 40, and item 51, and decided to remove these by hand. Our semi-automated procedure, however, retains two of these items, i.e., item 8 and item 40, albeit with corresponding DIF parameters δ set to zero, i.e., without treating these as DIF items.

Discussion and conclusions
In this work, we have successfully enhanced our semi-automated procedure to deal with DIF items.We extend our previous criterion, the so-called in-plus-out-of-questionnaire log likelihood (IPOQ-LL) to a new criterion named in-plus-outof-questionnaire log likelihood with DIF (IPOQ-LL-DIF).The new criterion is based on the same ideas as the IPOQ-LL (Wijayanto et al., 2021): a good final instrument should reliably estimate people's abilities.Although this ability estimate is fitted on the responses of the items in the final instrument, it should still represent the items that are left out.
The effectiveness of our extended procedure in yielding results clinimetrically similar to those of standard Rasch analyses relies on four essential ingredients. Two are passed down from the previous procedure, while the other two are new. The inherited ingredients are the flexible discrimination parameters and stronger regularization for this discrimination parameter on the included itemset compared to the excluded itemset (Wijayanto et al., 2021). The new ingredients are the DIF parameters together with a lasso penalty that distinguishes between the DIF and non-DIF items. We have shown that the new procedure also naturally incorporates essential aspects of Rasch analysis, where it tends to favor DIF items with a good Infit value and to exclude the ones without.

Fig. 7 (continued) ... to |S in | = 10. The horizontal grey lines display the differences of the IPOQ-LL-DIF scores obtained by the three instruments, i.e., the manual, IPOQ-LL, and IPOQ-LL-DIF instruments

Fig. 8 In-plus-out-of-questionnaire log likelihood with DIF (IPOQ-LL-DIF) values for the IPOQ-LL-DIF instrument (green dotted-dashed line), the IPOQ-LL instrument (brown dashed line), the manual instrument from Vaughan (2019) (red dotted line), and random eight-item instruments (histogram) on the IEPS dataset
In our simulations, we have shown that DIF items can indeed help obtain a better parameter estimate, in accordance with Andrich and Hagquist (2015). In a validation with real-world datasets, our procedure yields instruments similar to those from the manual analyses. Not only do our instruments have comparable statistics, but they also contain similar (or even the same) items when we constrain the number of included items to match that of the manual instrument. With reasonable settings for the regularization parameters, our procedure tends to be somewhat more conservative in that it typically prefers to keep more items than the manual instrument.
In our experiments on real-world data, the IPOQ-LL-DIF and the IPOQ-LL criteria lead to very similar, often even identical, instruments. Even though our procedure does detect and include DIF items, properly modeling this DIF effect has a relatively small effect on the selection of other items. However, this does depend on the setting of the regularization parameter λ δ. In our main experiments, we chose λ δ = 10, which is relatively small, to arrive at more or less the same DIF items as in the manual analysis. An alternative approach would be to optimize λ δ in a cross-validation procedure (see Appendix 2). Applying this cross-validation procedure to the real-world data, we obtain a much larger λ δ. With this stronger penalty, our procedure no longer finds any DIF items (see Fig. 17).
To summarize, in our real-world experiments the manual, IPOQ-LL-DIF, and IPOQ-LL instruments are largely comparable and have clinimetrically similar qualities. Compared to randomly generated instruments, they all score well on standard Rasch statistics (see Figs. 11 through 15).
In this paper, we have assumed that the DIF groups are specified beforehand. For binary information (e.g., gender), these groups are naturally defined. For continuous information (e.g., age), our procedure can be easily extended with a recursive partitioning method based on the IPOQ-LL-DIF to find the optimal groups, along the lines of some recent methods for detecting DIF (Strobl, Kopf, & Zeileis, 2015; Tutz & Berger, 2016; Komboz et al., 2018).
Even though our method has the advantage of developing a valid, reliable, and robust instrument from a decent original survey in a less time-consuming and more objective manner, we are aware that it lacks substantive human knowledge in the process. For this reason, we are careful to frame our procedure as semi-automated rather than fully automated: it always welcomes the application of experts' knowledge, e.g., through pre- and post-analysis.
The R package containing the algorithm and results reported in this paper can be found at https://github.com/fwijayanto/autoRasch.

A.3 MSQOL dataset analysis results
This section describes the results of the analyses applied to the 11 subscales of the MSQOL dataset. Table 3 shows the final instruments obtained by the manual Rasch analysis and the semi-automated procedure for each subscale. For all subscales, the IPOQ-LL instruments contain the same items as the IPOQ-LL-DIF instruments. Table 4 reports the commonly used standard Rasch statistics for the subscales on which the manual and the semi-automated instruments show some differences in item preferences.
we then have an inner loop in which we consider the DIF parameters (δ) as separate coordinates and solve them sequentially.
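The inner loop can be illustrated with a generic coordinate-descent sketch for a lasso penalty. As a loudly stated simplification, the example replaces the penalized Rasch likelihood with a least-squares objective, 0.5 * ||y - X δ||² + λ Σ|δ_j|, so that each coordinate update has a closed form (soft thresholding). The function names and the objective are assumptions made for illustration and are not taken from Algorithm 2.

```python
def soft_threshold(z, lam):
    """Closed-form minimizer of 0.5 * (d - z)**2 + lam * abs(d) over d."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize 0.5 * ||y - X @ delta||^2 + lam * sum(|delta_j|).

    Each delta_j is treated as a separate coordinate and solved in turn,
    holding the other coordinates fixed (least-squares stand-in objective).
    """
    n, p = len(X), len(X[0])
    delta = [0.0] * p
    residual = list(y)  # equals y - X @ delta for the current delta
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in col)
            if norm == 0.0:
                continue
            # Inner product of column j with the residual that excludes
            # the current contribution of coordinate j.
            rho = sum(c * (r + c * delta[j]) for c, r in zip(col, residual))
            new_value = soft_threshold(rho, lam) / norm
            change = new_value - delta[j]
            if change != 0.0:
                residual = [r - c * change for c, r in zip(col, residual)]
                delta[j] = new_value
    return delta
```

The soft-thresholding step is what sets small coordinates exactly to zero, which is the mechanism by which a lasso penalty separates DIF from non-DIF items.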

B.2 Parameter Setting
Several methods are available to tune λ δ to its optimal value. These include model selection criteria, e.g., the Akaike information criterion (AIC) (Akaike, 1974) and the Bayesian information criterion (BIC) (Schwarz, 1978), as well as k-fold cross-validation (Schauberger & Mair, 2020). In this work, we implement the k-fold cross-validation procedure described in Algorithm 3. The items and subjects are split into several groups whose combinations constitute small blocks of the dataset. In every iteration of the cross-validation, we use one block as the test set and the rest as the training set.
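The block-wise split can be sketched as follows. This is a minimal illustration rather than Algorithm 3 itself: it assumes that subjects and items are each assigned to folds at random, that a block is the set of (subject, item) cells for one subject fold and one item fold, and that all remaining cells form the training set. The callables `fit` and `score` are hypothetical stand-ins for fitting the model with a given λ δ and evaluating the held-out log likelihood.

```python
import random


def assign_folds(n, k, rng):
    """Randomly assign n units to k folds of (nearly) equal size."""
    folds = [i % k for i in range(n)]
    rng.shuffle(folds)
    return folds


def block_cv_splits(n_subjects, n_items, k_subjects, k_items, seed=0):
    """Yield (train_cells, test_cells) pairs, one per block.

    A cell is a (subject, item) index pair. Each block combines one
    subject fold with one item fold.
    """
    rng = random.Random(seed)
    subject_fold = assign_folds(n_subjects, k_subjects, rng)
    item_fold = assign_folds(n_items, k_items, rng)
    cells = [(n, i) for n in range(n_subjects) for i in range(n_items)]
    for a in range(k_subjects):
        for b in range(k_items):
            test = [c for c in cells
                    if subject_fold[c[0]] == a and item_fold[c[1]] == b]
            held_out = set(test)
            train = [c for c in cells if c not in held_out]
            yield train, test


def choose_lambda(lambdas, splits, fit, score):
    """Return the penalty with the highest summed held-out score.

    fit(train_cells, lam) and score(model, test_cells) are hypothetical
    user-supplied callables.
    """
    splits = list(splits)
    best_lam, best_total = None, float("-inf")
    for lam in lambdas:
        total = sum(score(fit(train, lam), test) for train, test in splits)
        if total > best_total:
            best_lam, best_total = lam, total
    return best_lam
```

Holding out cells rather than whole subjects or whole items means every subject and every item still contributes some responses to the training set, so their parameters remain estimable when the held-out block is scored.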
We apply Algorithm 3 for cross-validation and the results are shown in Fig. 17. In Fig. 17a, we see that for the artificial dataset it is better to give no penalty to the DIF parameters (δ). In contrast, Fig. 17b-d on the real-world datasets suggest giving a large penalty to the DIF parameters (δ), which makes the DIF model equivalent to a non-DIF model. This suggests that the responses on these real-world datasets are not significantly different between the different DIF groups.

Algorithm 2 Pseudocode for two-level coordinate descent.

Algorithm 3 Pseudocode for k-fold cross-validation.

B.3 Differential step functioning
Penfield (2007) introduces the concept of step functions (corresponding to the thresholds in the PCM) that describe at which particular level of ability a subject steps (or advances) from one score level to a higher level. For a given observed response x ni ∈ {0, 1, …, m i }, there will be m i step functions. In the dichotomous case, with m i = 1, any inconsistency between groups in the probability of scoring high can be modelled as DIF at the item level. In the polytomous case, with m i > 1, however, this inconsistency may vary for every step function and is then called differential step functioning (DSF). To link DSF and DIF, Penfield, Gattamorta, and Childs (2009) explain the DIF effect as the aggregated DSF effect across the m i steps. To implement DSF, we generalize Eq. 1 by changing the DIF parameter δ if into the DSF parameter δ jif , which now also depends on the category j.
As an experiment with the DSF model, we use an artificial dataset consisting of responses to six items from 400 subjects. The items comprise four non-DSF items and two DSF items. To simulate the DSF effect, the subjects are split into two groups of 200. Responses are generated independently from the partial credit model (α = 1) for the polytomous case with five ordered categories (m i = 4).
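A data-generating process of this kind can be sketched as below. The design (six items, two groups of 200, five categories, α = 1) follows the description above. Everything else is an assumption made for illustration: the threshold values, the standard normal abilities, the DSF magnitudes, the placement of the DSF effects (step 4 of item 5 and all steps of item 6), and the sign convention that a positive DSF parameter raises the threshold for the second group.

```python
import math
import random


def pcm_probabilities(theta, thresholds, alpha=1.0):
    """Category probabilities of the partial credit model for one item."""
    logits = [0.0]  # category 0 is the reference
    for t in thresholds:
        logits.append(logits[-1] + alpha * (theta - t))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_dsf(n_per_group=200, seed=0):
    """Simulate responses to six items (five categories) for two groups."""
    rng = random.Random(seed)
    n_items, n_steps = 6, 4
    # Hypothetical ordered thresholds per item.
    beta = [sorted(rng.gauss(0.0, 1.0) for _ in range(n_steps))
            for _ in range(n_items)]
    # Hypothetical DSF effects: zero for items 1-4.
    delta = [[0.0] * n_steps for _ in range(n_items)]
    delta[4][3] = 1.5             # item 5: step 4 only
    delta[5] = [1.0] * n_steps    # item 6: all steps
    responses, groups = [], []
    for g in (0, 1):
        for _ in range(n_per_group):
            theta = rng.gauss(0.0, 1.0)
            row = []
            for i in range(n_items):
                shifted = [b + g * d for b, d in zip(beta[i], delta[i])]
                probs = pcm_probabilities(theta, shifted)
                row.append(rng.choices(range(n_steps + 1), weights=probs)[0])
            responses.append(row)
            groups.append(g)
    return responses, groups, delta
```

Averaging each row of `delta` gives one possible item-level aggregate of the step effects. With the hypothetical values above, it is smaller for item 5 than for item 6, because only one of the four steps of item 5 carries an effect.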
Figure 18 shows the paths of the estimated DSF (a) and DIF (b) parameters for various values of λ δ. These figures show that the GPCM-DSF model satisfactorily identifies the differential functioning at step 4 of item 5 and at all steps of item 6. Although item 5 has a large DSF effect at step 4, its other steps have zero effect. As a result, as discussed by Penfield et al. (2009), the DIF effect of item 5, as an aggregation of all DSF effects, will be small.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 2 The estimated discrimination parameters (α) against the Infit statistics of (a) the PCM and (b) the PCM-DIF. Discrimination parameters (α) are estimated using (a) the GPCM and (b) the GPCM-

Fig. 4 The highest IPOQ-LL-DIF scores as a function of the number of included items |S in | when running the semi-automated procedure using the IPOQ-LL-DIF and IPOQ-LL criteria on the OCTQ dataset.

Fig. 7 The highest IPOQ-LL-DIF scores obtained for each number of included items |S in | when running the semi-automated procedure using both the IPOQ-LL-DIF and IPOQ-LL criteria on the IEPS dataset. a Of all available |S in |. b Zoomed-in version from |S in | = 6

Fig. 9 Estimated abilities for individual subjects based on the IPOQ-LL-DIF against those based on the manual instrument.The average root mean squared standard error for the estimates on both axes is visualized through the error bars at the top left

Fig. 17 Results of cross-validation of four datasets using different values of λ δ .a The artificial dataset.b The OCTQ dataset.c The IEPS dataset.d The first dimension of the MSQOL dataset