1 Introduction

Health professions educators have long recognized the importance of fostering a lifelong learning culture in their students (Demuth et al., 2018). However, learning should not be a matter of whether students learn but how they learn (Görlich & Friederichs, 2021). The purpose of progress tests (PTs), which repeatedly test students with a comprehensive exam covering every aspect of the curriculum, is to break the examination's steering effect and encourage deep, long-term learning (Albanese & Case, 2016). More specifically, a PT is a methodical and ongoing evaluation that seeks to gauge students' knowledge at the level they should have attained by the end of the programme (Cecilio-Fernandes et al., 2018).

The majority of medical examinations include multiple-choice questions (MCQs) (Royal et al., 2018). However, creating MCQs for formative assessment (FA) modalities such as PTs is expensive and time-consuming since specialists must manually create and fine-tune each item individually. This onerous task is impractical when hundreds of items are required for multiple test versions or to populate item banks (Lai, Gierl, Byrne et al., 2016). Consequently, there is an urgent need for content-specific test items (and respective feedback) to be available for FA because these questions are regularly given to students, especially in PTs (Gierl & Lai, 2015).

In the next-generation assessment theory known as Automatic Item Generation (AIG), computer algorithms swiftly develop large sets of test items with precise content (Falcão et al., 2022, 2023). Despite its encouraging framework and applicability in high-stakes examinations (e.g., Gierl and Lai, 2013; Pugh, de Champlain et al., 2016), little is known about how AIG-items behave in FA modalities such as PTs compared to the manually written items used in higher education institutions (HEI). This study compared the psychometric properties of AIG-items and manually written items administered in a PT. Based on a robust statistical pipeline, we examined data from a single medical PT conducted at the University of Minho, delivered in December 2021. Our research adds to the body of knowledge by determining whether AIG-items can live up to the high standards required by testing procedures used in FA such as PTs.

1.1 Background

1.1.1 Assessment of learning or assessment for learning?

Educational institutions today require strong tools to accurately evaluate students’ understanding of the concepts covered in their study materials (Nwafor & Onyenwe, 2021). Traditionally, the focus has been on summative assessment (SA), which is considered reliable (Knight, 2002). SA is conducted under standardized conditions using end-of-unit exams that provide a snapshot of educational achievement to guide critical decisions (Gierl & Lai, 2018). However, this approach merely calls for assessments of students’ learning, not for providing them with feedback throughout the learning process (Boston, 2002).

Evaluation should not only determine whether pupils have acquired knowledge (assessment of learning) but also encourage learning (assessment for learning) (Prashanti & Ramnarayan, 2019). Teachers should therefore identify students' misconceptions and deliver feedback on their progress (Preston et al., 2020). FA refers to any practice that improves student learning based on feedback (Irons, 2007). It is a methodical procedure that entails acquiring data on student learning and using this knowledge to clarify how instruction should be modified to better meet students' needs (Black & Wiliam, 2009). Providing pupils with feedback gives them an informed appraisal of their performance, which in turn helps them develop their knowledge and shape their subsequent performance (Xinxin, 2019). Consequently, the developmental underpinnings of assessment become more prominent, and the roles of both teachers and students are acknowledged, making FA a cyclical programme of high- and low-stakes activities in which the student actively participates (Leenknecht et al., 2021).

1.1.2 Making progress: formative assessment within medical education

Student feedback has moved centre stage, and FA is currently used in medical education to monitor students' learning, develop clinical competences and promote clinical reasoning (Schüttpelz-Brauns et al., 2020). However, educators often refrain from delivering feedback, since many students report that the feedback they receive leaves them frustrated (Chowdhury & Kalu, 2004). Considering the shift from judgements based on test scores to student-based instruction, there is a need to engage students with learning in a safe environment in which they can learn and educators can deliver feedback (Watling & Ginsburg, 2019).

PTs are long-term, feedback-focused educational assessment strategies for determining how well students' knowledge develops and endures over time (Couto et al., 2019). Typically, all students take them on a regular basis (2–4 times annually) throughout the academic programme. The test samples all of the knowledge domains that graduates are expected to master upon degree completion, regardless of the student's year level (Öztürk, 2013). Its main purpose is to promote deep learning while reducing the steering effect of exams (Albanese & Case, 2016). Given these properties, PTs are effective and trustworthy instruments for assessing how much is being learned during undergraduate health professions education (Görlich & Friederichs, 2021).

FA modalities such as PTs were traditionally conducted using paper-based tests. However, this was logistically challenging and presented drawbacks, such as requiring students to gather in a single location at a specific time to take the test and obtain feedback (Olson & McDonald, 2004). To make matters worse, PTs are very resource-intensive, as they demand large numbers of test items and considerable time for planning, implementing, and evaluating results (Koşan et al., 2019). Thanks to the advent of modern technology, we have witnessed significant growth in online FA systems, which have maximised the value of instant feedback (Joyce, 2018). Among these, online computerized formative testing (CFT) systems are technology-enhanced assessment systems that deliver feedback to students on their learning progress (Gierl & Lai, 2018). These systems have a solid theoretical basis and are prevalent in HEI, being widely recognized in health professions schools as a resource for self-directed learning (Bijol et al., 2015). Through CFT, students are able to assess their knowledge and determine where they need more training. Additionally, because testing is delivered virtually, these systems promote equity and inclusiveness, allowing students to be evaluated anonymously, anytime, and anywhere, in a safe environment where trial and error is permitted (Mitra & Barua, 2015).

1.1.3 The underlying need for a next-gen assessment theory: automatic item generation

Most tests within medical education, such as PTs, are composed of multiple-choice questions (MCQs) (Dijksterhuis et al., 2009). MCQs are economical, efficient, instantly scored and appropriate for item types that cover a variety of skills (McCoubrie, 2004). These questions adapted rapidly to the hypermedia environment of CFTs, given the ability of these systems to produce automated tests, which facilitates the delivery of feedback (Farrell & Leung, 2004). However, developing MCQs is costly (with prices ranging from US$1500 to US$2000 per item) and time-consuming for specialists, since they must manually write and refine each item individually (Kosh et al., 2019). This daunting task is not feasible when hundreds of items are needed for multiple test versions or when thousands of items are needed to fill item banks (Lai, Gierl, Byrne et al., 2016). Since these items are continuously administered to students and their production is arduous, there is a pressing need for content-specific items and feedback to be available for CFT (Gierl & Lai, 2018; Xinxin, 2019).

Digitally based assessment methods produce sophisticated data that more closely reflect how students interacted with the items than traditional settings do (von Davier et al., 2021). Automatic Item Generation (AIG) is a next-generation assessment theory, validated empirically and theoretically, that promises to ease this burden (Falcão et al., 2022, 2023; Jendryczko et al., 2020). It is a contemporary method that combines the knowledge of content experts with computer modules to produce large numbers of high-quality and content-specific test items, both quickly and effectively, following specific guidelines (Lai, Gierl, Byrne et al., 2016). These guidelines cover (a) the creation of a model containing the variables to be manipulated (item model), which includes the stem (the part of the item model with the data required for problem-solving), the response options (both correct and incorrect ones), a lead-in question (the complete sentence posing the question), as well as supporting information; and (b) the systematic fusion of the components listed above by computer algorithms to produce a large number of new items (Bejar, 2012).

According to Gierl and Lai (2012), a three-step approach is necessary to generate MCQs in health professions education using AIG: (i) in the first step, content specialists outline a framework for item generation with the knowledge and skills students are expected to use to formulate a diagnosis (cognitive model) (Pugh, de Champlain et al., 2016). This framework identifies the problem specific to a test item, presenting different scenarios related to it, the variables to be manipulated for item generation, and the data required to establish a diagnosis (Gierl et al., 2012); (ii) in the second step, the contents of the framework are added to MCQs to form item models, which are similar to templates highlighting the variables to be manipulated and contain the relevant information to answer each question and the respective options (Gunabushanam et al., 2019); (iii) in the third and final step, specialized computer modules work on the item models, manipulating data in components such as the stem and the options, systematically generating large numbers of high-quality digital items (Prasetyo et al., 2020). Figure 1 outlines this process.

Fig. 1 Three-step process for generating medical MCQs based on AIG
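
To make the third step more concrete, the following minimal R sketch is our own illustration, not the EMUM/QuizOne implementation; the stem template, clinical values and object names are invented. It shows how the variables of a simplified item model can be systematically combined to yield new stems, with options attached analogously.

# Hypothetical item model: stem template and the variables to be manipulated
item_model <- list(
  stem    = "A %s-year-old %s presents with %s. What is the most likely diagnosis?",
  age     = c(25, 45, 70),
  sex     = c("male", "female"),
  finding = c("acute chest pain radiating to the left arm",
              "sudden pleuritic chest pain and dyspnoea")
)

# Systematic fusion: every combination of the manipulated variables
combinations <- expand.grid(age = item_model$age,
                            sex = item_model$sex,
                            finding = item_model$finding,
                            stringsAsFactors = FALSE)

# One generated stem per combination (the correct option and distractors
# would be filled in from the cognitive model in the same way)
generated_stems <- apply(combinations, 1, function(row) {
  sprintf(item_model$stem, row["age"], row["sex"], row["finding"])
})

length(generated_stems)  # 3 x 2 x 2 = 12 candidate stems from a single model

In a real application, the cognitive model defined in step one also constrains which combinations of values are clinically coherent, so the generated pool is filtered rather than used wholesale.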

1.1.4 Implementing AIG in online PT

AIG promises to rapidly produce tests based on virtually unlimited item banks. It limits item exposure, predicts the psychometric properties of generated items and presents construct validity, since it relies on the cognitive mechanisms underlying task performance (Harrison et al., 2017). Although the many advantages of AIG have been demonstrated in high-stakes exams, it is not known how AIG-items compare to handwritten items in FA modalities such as PTs. In response to these concerns, strategies to enhance the quality/validity of AIG-items have undergone considerable development. However, the quality/validity of these items has been the subject of only a small number of studies (e.g., Falcão et al., 2022; Pugh et al., 2020), mostly in high-stakes examinations administered by licensing bodies, whose resources for developing and quality-controlling items, and the inherent costs, differ significantly from those of HEI. The University of Minho's School of Medicine (EMUM) has added AIG to its MCQ content development process during the past few years, increasing the capacity of the item banks available to evaluate its students. To gauge the potential of these items, AIG-items have progressively been included in tests. In the present study, we first provide a psychometric analysis of a PT conducted at the EMUM, in which 23 AIG-items were included. Alongside the psychometric approach, the AIG and manually written items included in the PT were both subjected to a qualitative evaluation and a validity assessment procedure. Our study contributes to the body of knowledge by evaluating whether AIG-items can meet the high standards expected of testing methodologies used in FA, such as PTs in HEI.

2 Statistical and psychometric methods

2.1 Study design

This study involved a mixed-methods analysis of real data using software capable of gauging the psychometric qualities of the items included in the PT.

2.2 Data collection and sample

We analysed the responses to 126 dichotomously scored single best-answer five-option MCQs from the EMUM PT of Medicine, administered in December 2021. The questions, which were presented in the form of clinical scenarios, were designed to gauge how well medical knowledge (including that from the foundational medical sciences) is applied. Of these, 23 (18%) were automatically generated from previously designed cognitive models, each capable of generating hundreds of different items. The selection of the AIG-items was random and based solely on the PT topics. Different topics/disciplines, each with unique target competencies, conditions and item numbers, were covered in the PT (Cf. Appendix A). Candidates were expected to understand subject-specific elements as part of the content objectives of the PT. Our sample comprised 279 EMUM medical students (clinical years only). Most students were female (72.9%), with ages ranging from 21 to 40 (M = 24; SD = 3.20). Of these, 243 students had been enrolled in the school's new curricular plan (MinhoMD) since 2020. The remaining students were enrolled in an alternative curricular plan.

2.3 Procedure

Students completed the 3.5-hour PT through an electronic testing platform (QuizOne®) under online supervision from their teachers. QuizOne is an e-assessment management system integrated with AIG functionalities that are designed to deliver and administer knowledge tests, among other features. Results obtained by the students in the PT did not contribute to SAs. After finishing the PT, students submitted their responses, and the platform closed the respective session.

2.4 Psychometric approach

The psychometric properties of the PT questions were analysed using an Item Response Theory (IRT) approach. The Rasch model (RM) (Rasch, 1960) was applied to the PT data because it models item responses through two determinants (respondent ability and item difficulty) and is well suited to achievement tests, providing a proper scaling method to establish measures based on students' response patterns (Hohensinn & Kubinger, 2011; Tor & Steketee, 2011). The RM expresses the conditional probability of a binary outcome given the person's latent trait level (θ) and the item's difficulty level (Rasch, 1960). The mathematical representation of this relationship is as follows:

$$P\left(Y_{ij}=1\mid\theta_{j},b_{i}\right)=\frac{\exp\left(D\left(\theta_{j}-b_{i}\right)\right)}{1+\exp\left(D\left(\theta_{j}-b_{i}\right)\right)}$$
(1)

where P(Yij = 1) is the probability of correctly answering an item, θj is the level of the latent trait of respondent j (j = 1, …, J), bi is the item difficulty parameter for item i (i = 1, …, I), and D is a scaling constant that maps the model's parameters to the scale of a typical ogive model (Desjardins & Bulut, 2018). We recommend De Champlain's (2010) work for a brief overview of IRT.
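
For illustration, Eq. (1) can be computed directly. The short R sketch below is our own illustration (the function name and example values are arbitrary) and simply reproduces the model's behaviour:

# Rasch response probability as in Eq. (1); D defaults to 1 (logistic metric)
rasch_prob <- function(theta, b, D = 1) {
  exp(D * (theta - b)) / (1 + exp(D * (theta - b)))
}

rasch_prob(theta = 0, b = 0)   # 0.50: ability equal to item difficulty
rasch_prob(theta = 0, b = -2)  # ~0.88: an easy item for this respondent
rasch_prob(theta = 0, b = 2)   # ~0.12: a difficult item for this respondent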

Our statistical pipeline was as follows: first, we evaluated whether fundamental assumptions held prior to running the RM: (a) unidimensionality (i.e., the PT was optimally measuring a single underlying construct); (b) local independence (i.e., absence of systematic conditional covariance among items); and (c) monotonicity between θ and true scores (i.e., the requirement that the probability of endorsing an item increases as θ increases). Appendix B contains the strategies for evaluating RM assumptions and the respective results. Second, after ensuring model-data fit (Cf. Appendix C), we conducted a calibration process to estimate item properties (item difficulty – bi) and obtain estimates of θ. Third, item reliability was evaluated using item information function (IIF) plots, whereas exam reliability was examined through the Kuder-Richardson 20 statistic (KR-20) (Kuder & Richardson, 1937), the Person Separation Index (PSI) and test information function (TIF) plots. Fourth, differential item functioning (DIF) was tested for the variable "curricular plan" through the Mantel-Haenszel (MH) method (Mantel & Haenszel, 1959). DIF is assessed in educational data to detect item-level bias and occurs when respondents from different subgroups display the same θ but answer some items differently (Shea et al., 2012). The MH procedure is a nonparametric method for detecting DIF. It is based on comparing matched groups, so that item functioning can be evaluated conditional on θ (Socha et al., 2015). For curricular-plan DIF, the sample was split into two groups: (0) students enrolled in the new curricular plan; and (1) students enrolled in the alternative plan. Students from the new curricular plan were used as the focal group. Fifth and finally, we conducted a traditional distractor analysis to measure how well the incorrect options contributed to the quality of these MCQs. Since students' performance is influenced by how the distractors are designed, it is necessary to include plausible distractors that are more likely to attract examinees with partial knowledge (Desjardins & Bulut, 2018; Wind et al., 2019). Distractor analysis was conducted by examining the percentage of students who chose a particular distractor (Desjardins & Bulut, 2018; Gierl et al., 2017). Items with distractors selected by 5% or less of respondents were candidates for potential revision, as these distractors were endorsed at such a low level as to suggest that most examinees did not consider them viable. A two-sample equal-variance Student t-test compared the number of distractors needing revision between the AIG-items and 23 manually written items selected at random through a random integer generator. The "N-1" chi-squared test (Campbell, 2007) compared the proportion of functional distractors between these item types. All statistical procedures were based on the full sample. A significance level of α = 0.05 was set for all analyses.
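
The following R sketch outlines the core of this pipeline using the packages listed in Sect. 2.7. It is a simplified illustration under assumed object names (resp, a hypothetical 279 × 126 matrix of 0/1 responses; raw, the corresponding matrix of selected options; group, a hypothetical curricular-plan indicator); the actual analysis also relied on Winsteps and on the additional checks reported in the appendices.

library(eRm)    # Rasch model estimation
library(psych)  # KR-20 via alpha() for dichotomous items

# Calibration: conditional ML estimation of the RM
rm_fit  <- RM(resp)
b_hat   <- -rm_fit$betapar            # item difficulties (eRm reports easiness)
persons <- person.parameter(rm_fit)   # theta estimates for each student

# Reliability: KR-20 (raw alpha on dichotomous items) and person separation
kr20 <- psych::alpha(as.data.frame(resp))$total$raw_alpha
psi  <- SepRel(persons)$sep.rel

# Uniform DIF for one item via the Mantel-Haenszel procedure, stratifying on
# the total score (group: 0 = new curricular plan, 1 = alternative plan)
total  <- rowSums(resp)
strata <- cut(total, breaks = 5)
mantelhaen.test(table(resp[, 1], group, strata))

# Distractor analysis: proportion of respondents choosing each option for one
# item; distractors at 5% or less are flagged (the correct option would be
# excluded before flagging)
choice_freq <- prop.table(table(raw[, 1]))
weak_distractors <- names(choice_freq)[choice_freq <= 0.05]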

2.5 Qualitative assessment (expert review)

When evaluating the quality of an item, it is important to consider not only its statistical data but also its qualitative information (Rzasa, 2002). A test development specialist with experience in evaluating MCQs conducted a blind qualitative assessment of each item administered in the PT. The expert used an established item quality rating scheme (Jozefowicz et al., 2002) to evaluate the quality of each MCQ. The authors of the rating scheme designed it to be consistent with standard item-writing guidelines. Each question was given a score between 1 ("the item tested recall only and was technically flawed") and 5 ("the item used a clinical or laboratory vignette, required reasoning to answer, and was free of technical flaws"). The test development specialist scored the items independently, without being aware of the study's objective. The data obtained were compiled using descriptive statistics. To determine whether there were statistically significant differences in the quality ratings between the two item types, a Mann-Whitney U test (Mann & Whitney, 1947) was conducted.
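
As a minimal illustration (with hypothetical object and column names), this comparison can be run in R as follows:

# ratings: hypothetical data frame with one row per item and the columns
# 'quality' (expert rating, 1-5) and 'type' ("AIG" or "Manual")
aggregate(quality ~ type, data = ratings, FUN = mean)  # mean rating per item type
wilcox.test(quality ~ type, data = ratings)            # Mann-Whitney U test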

2.6 Validity assessment (hierarchical linear modeling)

Validity refers to the extent to which available evidence supports the intended use of test results (American Educational Research Association, 2018). Response processes are an understudied yet promising source of validity evidence (Hubley, 2021). They describe the mental activities that a respondent employs when responding to test items (Russell & Hubley, 2017). They reach the core of validity assessments, revealing how theory and evidence support the interpretation of test results. Additionally, they assist researchers in creating better test items, reducing construct-irrelevant variance (Hubley, 2021). In the present paper, the response processes of AIG-items were evaluated using response time (RT; in seconds) as a measure of validity evidence. RT is one useful approach for examining response processes (Padilla & Benítez, 2014). However, it only provides indirect information regarding item difficulty and the degree of processing involved. Therefore, the validity of AIG-item response processes should be evaluated by examining the effect of item type on RT in conjunction with student-related variables through proper statistical models (Hubley, 2021).

In light of this restriction, and since we are dealing with nested data, we employed hierarchical linear modelling (HLM) analysis of our PT data to disentangle possible within- and between-student variance and to predict RT from random and fixed item- and student-level predictors. Using HLM, one obtains accurate estimates of the standard errors of beta coefficients and information on how variance is distributed across levels of analysis (Klusmann et al., 2008). We examined the impact of item type on students' RT while considering answer switching (i.e., the number of changes from the initial response to other option(s) that the student considered more appropriate during the PT). The reason for choosing this variable is that items on which response changes occur may be prompting students to explore their doubts, leading them to change their answers. Item (i.e., item type = Manual vs. AIG) and student variables (i.e., answer switching) were specified at the first and second level, respectively. This may be an interesting way to complement the analysis of the response processes used to answer AIG-items, as we can evaluate whether these can probe students' doubts as handwritten items do. In total, four different models were developed: (i) a random-intercept model (i.e., a model with no predictors; null model); (ii) an intermediate constrained model (ignoring between-cluster variation of the level-1 variable); (iii) an intermediate augmented model (considering between-cluster variation of the level-1 effect); and (iv) a final model including slope residuals and cross-level interactions; for a detailed overview of HLM, see Sommet and Morselli (2017). Continuous predictor variables were centered at the sample's grand mean to enhance the interpretability of the regression coefficients. Dichotomous variables kept their original metrics.
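
A condensed R sketch of these four models is given below, using lme4 and hypothetical variable names (rt in seconds, item_type coded Manual vs. AIG, switching grand-mean centred, student as the level-2 identifier); it illustrates the modelling strategy rather than the exact estimation settings used.

library(lme4)

# long: hypothetical long-format data, one row per student x item,
# with columns rt, item_type (factor: Manual/AIG), switching, student
long$switching_c <- long$switching - mean(long$switching)  # grand-mean centring

# (i) Null (random-intercept) model; ICC = between-student / total variance
m0  <- lmer(rt ~ 1 + (1 | student), data = long)
vc  <- as.data.frame(VarCorr(m0))
icc <- vc$vcov[1] / sum(vc$vcov)

# (ii) Constrained intermediate model: fixed effects, random intercept only
m1 <- lmer(rt ~ item_type + switching_c + (1 | student), data = long)

# (iii) Augmented intermediate model: item-type slope varies across students
m2 <- lmer(rt ~ item_type + switching_c + (item_type | student), data = long)
anova(m1, m2)   # likelihood-ratio test and AIC comparison of (ii) vs. (iii)

# (iv) Final model: random slope plus the cross-level interaction
m3 <- lmer(rt ~ item_type * switching_c + (item_type | student), data = long)
summary(m3)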

2.7 Software

The procedure was implemented in Winsteps (Version 5.3.0) and within the R open-source statistical programming environment (http://www.r-project.org) with the R packages "psych" (Revelle & Revelle, 2015), "mirt" (Chalmers, 2012), "eRm" (Patrick et al., 2018), and "lme4" (Kuznetsova et al., 2015).

3 Results

A detailed explanation of the results of each procedure follows.

3.1 Item calibration: student and question comparison

Appendix D provides the parameter estimates and standard errors from the RM. Three questions (X7; X28; X75) were dropped beforehand for lack of variance (100% correct answers). None of these items were automatically generated. The mean θ of the students was centered on 0.0 logits with a standard error of 0.19. Positive values of θ indicated better medical knowledge and negative values represented less knowledge. A θ value of 0 demonstrates that the PT was well calibrated for the students and that most respondents had a 50% chance of correctly answering most of the provided items. Overall, it can be concluded that the quality of the PT was not affected by the presence of AIG-items. Figure 2 displays the bi hierarchy of the questions as answered by the students (i.e., the distribution of the subjects' θ and the distribution of item bi levels on the same scale). The questions of the PT were spread evenly along the same scale as the subjects' θ. The bi estimates represented the θ at which a student has a 50% chance of correctly answering an item and ranged from −5.7 (AIG_106) to 3.89 (AIG_83) logits (the mean bi value for the RM is 0). These two AIG-items were at the extreme ends of the bi spectrum, which may suggest that they were mismatched and require adjustment. Mean item bi was 0.26 logits below the mean person θ. The numbers of items with bi levels above and below the average bi value were balanced. This means that the items included in the PT covered a wide portion of the θ continuum. The RM estimates of the bi values suggested that the PT was reasonably "moderate" and was targeted toward samples with medium levels of θ.

Fig. 2 Map of the difficulty level of the questions from the RM

Note. Area on the left represents the distribution of the subjects' θ; area on the right represents the distribution of items; items with the highest difficulty level are at the top, while the easiest items are at the bottom; each # represents 4 students; each '.' represents 1–4 students; the values on the left of each scale are in logits. T = 2 standard deviations from the mean; S = 1 standard deviation from the mean; M = mean

The range of bi for AIG-items was between −5.7 (AIG_106) and 3.89 (AIG_83) logits, while the bi for manually written items ranged between −4.31 (Q23) and 3.27 (Q58) logits. The RM calibration bi levels for AIG-items were generally comparable to those of manually written items, supporting validity evidence for AIG. Figure 3 displays the bi values for the AIG-items included in the PT. There were some AIG-items with bi measures below the least able student (easy items) and a few items with bi beyond the most able one (difficult items). However, most of the AIG-items were located at the same level as the majority of the students (medium difficulty), demonstrating that they functioned as intended in the PT and were, therefore, appropriate for evaluating medical students at various levels of the θ continuum in HEI.

Fig. 3 Bi values for the PT AIG-items

Figure 4 displays the item characteristic curves (ICCs) for the AIG-items included in the PT. Most items followed an increasing monotonic function. As the figure shows, most of the curves cluster around the centre, which means there was a 50% probability that students with an ability of 0 logits answered most of the AIG-items correctly. Through this graphical visualization, it becomes clear that items AIG_83 and AIG_106 need to be revised. The former was extremely difficult and required great knowledge to be solved correctly, whereas the latter was so easy that it could be solved even by students with lower θ.

Fig. 4 ICCs for the PT AIG-items

3.2 Reliability

The KR-20 was high (KR-20 = 0.76), suggesting good internal consistency of the PT. The PSI for the PT data was 0.76, which means that 76.4% of the variance in the observed scores was due to the estimated true variance in students' levels of clinical competence. The PSI determines how well students can be differentiated, with values > 0.70 indicating that the exam was adequate for group evaluation (Tennant & Conaghan, 2007). Once again, we found that the AIG-items adapted successfully to the PT at the global level. Both item types provided an equivalent quantity of information, suggesting that the AIG-items were of good quality and provided measurement in the same way as manually written items. The item information curves in Fig. 5 demonstrate how the AIG-items of the PT discriminated between different levels of θ. The high information values of these functions suggest that θ at relevant points can be precisely measured and used to distinguish students from adjacent θ levels. Once again, items AIG_106 and AIG_83 appeared to be problematic. Item AIG_71 also presented a rather flat information curve over the left of the θ continuum, suggesting that this item could also require revision.

Fig. 5 IIFs for the PT-items

The information curves of all PT items were then summed into the overall TIF presented in Fig. 6. The TIF was computed together with the conditional standard error of measurement (CSEM) to evaluate the accuracy with which the PT measured different values of θ throughout the continuum. For the PT, the TIF peaked at a trait level of around 0 (lowest amount of CSEM), consistent with the moderate difficulty values of the exam. The TIF illustrates the region of the underlying θ that is measured most precisely, revealing the reliability of the PT at different levels of students' θ. The PT provided an appropriate test information profile for its intended usage. The test provided reliable information between the middle and high end of the trait; thus, it was able to classify students in that area more accurately. Figure 6 also reveals that the PT was precise in differentiating between students with low and above-average θ, because high amounts of test information were gathered for these students.

Fig. 6 TIF for the PT

3.3 Quantifying DIF

Results from the MH chi-square (MH χ2) test revealed that 10 items in the PT were flagged at the α = 0.05 level for exhibiting uniform DIF between students enrolled in the new curricular plan and students enrolled in the alternative plan (Cf. Table 1). Nine of them were manually written (X3; X33; X43; X72; X84; X108; X115; X118; X119). Only one AIG-item (AIG_66) presented DIF. Effect size measures (∆MH) were used to supplement the chi-square test of statistical significance. DIF in the PT was balanced. Most ∆MH values were negative, indicating that most items were not advantageous to the focal group. Only 4 items (X33, X84, X118, X119) presented positive ∆MH values, indicating that they benefited the focal group. Using the classification scheme for measuring the effect size of DIF developed by Dorans and Holland (1992), we found that the above-mentioned items presented large DIF. Items in this category have an MH common odds ratio that differs from one at the 5% level and a ∆MH that is greater than 1.5 in absolute value – see Socha et al. (2015).
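
As an illustration of how this classification can be reproduced (a sketch with hypothetical inputs rather than the exact routine used here), the MH common odds ratio returned by R's mantelhaen.test (see Sect. 2.4) can be converted to the ETS delta metric and flagged:

# mh: hypothetical result of mantelhaen.test() for a single item
# Delta-MH = -2.35 * ln(common odds ratio), the ETS delta-metric transformation
delta_mh <- -2.35 * log(mh$estimate)

# Large ("C") DIF: odds ratio significantly different from one at the 5% level
# and |Delta-MH| greater than 1.5 (Dorans & Holland, 1992)
large_dif <- (mh$p.value < 0.05) && (abs(delta_mh) > 1.5)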

Table 1 MH procedure for measuring and detecting DIF

3.4 Distractor analysis

A total of 230 distractors were analysed. Distractor performance for the AIG-items was slightly positive: 39% of the distractors had a choice frequency ≤ 5%, and nearly 61% of the distractors were functional (choice frequency > 5%). Distractor performance for the manually written items was slightly superior, with 32% of the distractors presenting a choice frequency ≤ 5%. This means that 68% of the manually written distractors were functional. Table 2 contains the frequency-of-choice distribution for the AIG-items' distractors. Two AIG-items (AIG_71; AIG_106) revealed problems with all distractors, and three manually written items (X23; X44; X84) presented problems with all distractors. These problems may explain the ease of these items and their respective poorer quality. The two-sample equal-variance Student t-test revealed no significant difference in the number of distractors needing revision between the two item types (t(45) = 0.42, p = .15). The N-1 test revealed no significant difference between the proportions of functional distractors of the two item types (χ2 (1) = 0.001, p = .977, 95% CI [−14.7, 14.9]).

Table 2 Distractor analysis

3.5 Qualitative assessment (expert review)

On average, AIG-items received higher quality ratings (M = 4.40; SD = 1.10), whereas manually written items scored slightly lower (M = 4.12; SD = 1.39). These average scores for both item types mean that not all items in the PT satisfied at least one of three conditions: (i) a vignette; (ii) a one-best-answer format (i.e., not true-or-false); and (iii) no technical flaw (Jozefowicz et al., 2002). The Mann-Whitney U test revealed statistically significant differences in the quality ratings between the two item types (U = 831, p < .05, r = .11), with higher quality ratings associated with AIG-items.

3.6 Validity assessment (HLM)

Table 3 contains the results of the HLM analysis. A separate null model for RT was first specified. The model revealed whether the means of RT differed across students (the level-2 unit). The Intraclass Correlation Coefficient (ICC) (Hox et al., 2017) of the null model was 0.11 (p < .001), justifying the use of HLM. This means that 11% of the variance in RT could be attributed to between-student differences; conversely, 89% of the variance in RT could be attributed to within-student differences. The intercept (B) was 61, meaning that students answered each MCQ of the PT in an average time of 61 s (SE = 1.02), regardless of other variables. Next, we addressed whether and how item type and answer switching predicted RT (Model 1). The model included both predictors simultaneously. Results revealed that the fixed effects of item type and answer switching were both positive and significant (Bswitching = 15.6; Btype = 3.29, p < .001). The average effect of answer switching for the typical student on the PT was 15.6 s (SE = 0.47), and the average RT for manually written items was 3.29 s (SE = 0.77) lower than the average RT for AIG-items. Following this, we built an augmented model (Model 2) considering cluster-specific effects of item type and the overall effect of answer switching. Both predictors were again positive and significant predictors of RT, and the coefficients were equal to those of Model 1. However, this time we obtained a measure of the differences in the effect (slope) of item type (the level-1 effect) on RT between students (the level-2 units). The average deviation of a student's item-type effect from the average effect was about 1.89 s. We then compared the deviance of Models 1 and 2 to test whether including the between-student variation of the item-type effect improved the estimation, using the likelihood-ratio test (LR χ²) and the Akaike Information Criterion (AIC). The p-value of the LR χ² was below 0.20 (p = .069) and Model 2 had a lower AIC than Model 1 (AICM1 = 132,377; AICM2 = 132,376). These values indicate that estimating the variance and covariance terms of item type improved the fit (Sommet & Morselli, 2017), which is why we included them in the final model (Model 3). The fixed and random effects included in Model 3 explained 19% of the variance. The coefficient estimates of both item type and answer switching did not vary compared with the other models. In Model 3 we also included a cross-level interaction between answer switching and item type. The coefficient estimate of this cross-level interaction was 0.836 (p = .367). This means that the pooled within-student effect of switching answers did not differ significantly between the two item types. Considering these data, the response processes for AIG and manually written items appear to be quite similar, which is a source of validity evidence for automatically generated items.

Table 3 Hierarchical linear model analysis

4 Discussion

4.1 Overview

There is a high need for numerous and valid MCQs due to changes in student assessment brought about by new computer-based exam formats with which traditional item construction methods cannot keep pace (Arendasy & Sommer, 2012). It is, therefore, necessary to extend classic assessment methods and psychometric procedures so that they cover modern procedures/techniques (von Davier et al., 2021). AIG is a cutting-edge approach to item development and management that combines cognitive and psychometric theories for futuristic assessment services in digital contexts (Choi & Zhang, 2019; Falcão et al., 2023). The study of AIG is a promising pursuit since it enables computer technology to produce many new items (Hommel et al., 2022).

In the present paper, we provide a fresh approach for developing material for MCQ-based FA modalities, such as PTs, using AIG principles and practices. 23 AIG-items were included in a medicine PT along with traditionally developed MCQs. The AIG-items were generated from cognitive models outlined in advance of the PT. The data obtained were analysed from RM, distractor-performance and HLM perspectives. First, the psychometric analysis of the PT demonstrated good internal consistency and test reliability, revealing that the exam was a suitable tool for evaluating students' θ. Additionally, the PT presented unidimensionality, and the questions that composed the exam were locally independent and presented a monotonic relationship between the scores and values of θ. This means that both AIG and manually written items assessed the same construct. No question was redundant, and using true scores in place of θ scores was justified.

In addition to fitting the RM, the AIG-items fit successfully within the PT and proved to be on par with the manually written items in terms of quality. At the item level, psychometric analysis of the PT suggested several strengths of the AIG-items as well as some opportunities for future improvement. AIG-items revealed a broad spread of bi (which demonstrates more adaptability and possibly more discriminating power), with only 3 items (AIG_71; AIG_83; AIG_106) located at the extremes of the bi continuum. These extreme positions indicate that these 3 items were either too simple or too challenging for the students, revealing the need for revision. The remaining AIG-items provided a comprehensive assessment across a wide range of the underlying θ, providing reliable information for sorting students with relatively moderate levels of knowledge. Among the 23 AIG-items included in the exam, only the 3 items mentioned above captured very high or very low values of θ, offering little information about the students' knowledge.

Item bias was evaluated via DIF. The RM analysis found evidence of curricular-plan uniform DIF for only one AIG-item. The possible existence of DIF in AIG-items may sound problematic to some. This concern relates essentially to the solution traditionally adopted by subject matter experts to deal with items that present this phenomenon in achievement testing, which consists of their omission or replacement with alternative items from a larger item bank (Silvia et al., 2021). Concurrently, scholars have in recent years been questioning the value of analysing DIF in educational measures, claiming that tests of cognitive ability and educational achievement are not biased and produce results comparable across groups. These items are professionally developed to evaluate educational achievement and are subject to extensive reviews before being released, which is why they should not automatically be considered biased. However, DIF will always exist, and its sources will always be uncertain because there are too many interrelated variables, which is why we believe that these outcomes do not call into question the quality of the AIG-items (Teresi & Fleishman, 2007).

Distractor performance was slightly better, though not significantly so, for the manually written items, with 68% of their distractors being functional. Approximately 61% of the distractors of the AIG-items used in the PT functioned properly. These outcomes are favorable for the use of AIG. Nevertheless, a considerable number of distractors (39%) call for possible revision. Our results are in line with the literature focusing on the plausibility of AIG item distractors (e.g., Gierl and Lai, 2013; Lai et al., 2016), which finds them less plausible than those of handwritten items and points to the need for a clear methodology for generating distractors. Despite these issues, we must consider that this same research line claims that AIG distractors can already distinguish between low- and high-performing examinees, which is a definite sign that AIG distractors can still be improved and made even more effective.

The qualitative outcomes attained using expert review were partially consistent with the quantitative evaluation procedures that were conducted in this paper. Both points of view agree that the AIG-items were of high quality. However, the qualitative viewpoint used here went further, claiming that AIG-items reflected accepted item-writing principles and assigning slightly higher quality scores to automatically generated items over those manually written. This finding is interesting and merits additional investigation in future studies. It also serves as a potential starting point for a review of the AIG items’ content validity. Since test developers have at least a general understanding of the object they intend to assess, the more clearly these objects are expressed, the more precisely content validity may be examined (Beck, 2020). By conducting this assessment, we are able to identify the most significant defects (in terms of item writing) that these items harbor and what the most important fixes ought to be. In this case, we realized that these defects consist, in particular, of the presence of a possible technical flaw, the possible absence of a vignette, or the absence of a single-answer format.

Finally, the validity of the AIG-items was evaluated using HLM. Since research focused specifically on the validity of AIG-items is practically non-existent, it is not surprising that the validity of these items still falls into a murky area of the literature. However, along with the logistical benefits already mentioned, AIG can indeed play a role in building a validity case (Colvin et al., 2016). Points in favor of this argument relate to possible evidence of content validity (Hommel et al., 2022) and construct validity (Harrison et al., 2017). In this paper we decided to go further. The method employed here for developing this argument is novel, as we gathered evidence supporting the validity of AIG in terms of response processes from a modern psychometrics standpoint (American Educational Research Association, 2018). We found that the RT for manually written items and for AIG-items differed by only about 3 s. This brief time difference, in our opinion, illustrates how similar these item types are. Since RT provides explicit, although indirect, information about the complexity of an item (and, therefore, the amount or degree of processing involved) (Padilla & Benítez, 2014), we may partially claim that the response processes/processing mechanisms used to answer both item types are very similar, at least in terms of how much processing is done or how students interpret the questions (Deng et al., 2021). To substantiate this evidence, the RT to the PT exam questions was analysed considering both answer switching and item type. We found that students performed fairly similarly while responding to both item types used in the PT, since there was no interaction between the number of responses altered and the item type. With RT and the number of altered responses being comparable across both item types, these data complement one another and represent evidence for the AIG-items' validity, strengthening any validity argument for this procedure.

4.2 Practical implications for FA: how should feedback be delivered?

AIG promises to rapidly provide test items and sizable item banks. Since it is based on the cognitive processes that underlie task performance, it limits item exposure and exhibits construct validity (Harrison et al., 2017). These characteristics appeal to SA but are also notably helpful to FA regarding the quantity of testing items available. A potential concern about using CFT with an AIG functionality is how feedback can be delivered within a bank of generated test items. At first glance, it may seem that feedback delivered by these systems could only be given based on achievement standards.

Consequently, students using these platforms could learn by becoming familiar with key elements (through item repetition) or by coming across more challenging items (Choi et al., 2018). Although useful, this approach seems simplistic and amenable to improvement, as students should receive real feedback on how to solve each testing problem (Gierl & Lai, 2018). The available literature offers some suggestions on how AIG can be used to deliver feedback. Gierl and Lai (2018) described a method for generating both the items and the rationales required to solve testing tasks within FA in medical education. According to the authors, rationale generation could be incorporated into the three-step AIG process by expanding the item model in the second step and identifying the key features of the task required to solve the items. Xinxin (2019) presented a modified generation framework that employs a tree structure for cognitive modelling, an assembly mechanism, and a validation tool to support CFT within the context of HEI. In a process known as tree traversal, elements that are related by nodes and edges can be automatically and logically searched for and merged, producing test cases and the corresponding feedback. These methods of providing feedback highlight AIG's adaptability and versatility. For more information on this matter, refer to the original papers.

4.3 Strengths, limitations and directions for future research

This paper's main contribution is a comprehensive psychometric/statistical analysis comparing the quality/validity of AIG-items with that of manually written items within a PT of medicine in an HEI. It also reviews issues pertinent to educational assessment and explains how AIG can be used to complement CFT systems. The novelty of this research is another strength. As far as we know, no research has explicitly examined how AIG-items perform in FA modalities such as PTs, or the validity of these items using quantitative and qualitative procedures such as the ones employed here.

However, this paper has some drawbacks. It is important to note that the sample of AIG-items used in the PT is notably limited compared to the number of manually written items. To compare the exam's questions more accurately, there should be a more even distribution of AIG-items. However, due to school norms, adding more AIG-items to the PT was impossible, as this method is currently at an experimental stage. Additionally, one must consider that the number of respondents who took the test prevented us from using a more complex IRT model that would have allowed us to evaluate other parameters of the AIG-items besides bi, such as discrimination or guessing. Evaluating these parameters would provide a more thorough picture of how these items performed (Gierl & Lai, 2018). Finally, one should note that only one test development expert conducted the qualitative review of the items included in the PT. This constitutes a limitation of our work, as analyses of this type provide more robust results when more than one expert performs the evaluation, so that the degree of agreement between raters can be assessed. However, the expert is a highly trained item writer and trainer in item writing and was blind to the nature of the items. Future work should involve more extensive assessment panels to qualitatively evaluate and compare the quality of both AIG and manually written items.

The mass production, intelligent item calibration and management, learner-centered evaluation, and other elements of AIG are quite varied and promise to revolutionize educational measurement. However, a comprehensive validation of these assertions in educational settings such as FA has not yet been achieved (Choi et al., 2018). There is a clear need for feedback guidelines within AIG frameworks (Gierl & Lai, 2018; Xinxin, 2019). More applications of AIG-items in PTs should be run in order to obtain reliable results regarding the use of automatically generated items in such tests. Research in this area would be beneficial for spreading AIG in the context of FA.

5 Conclusion

AIG-items represent suitable material for evaluating students' knowledge, even in FA modalities such as PTs. Despite being computer-generated, these items are valid, present psychometric quality and are highly advantageous in terms of production speed and quantity. Additionally, they showed superior quality as assessed by expert review. These capabilities are expected to ease the item development burden, resulting in significant cost savings for educational institutions when developing test items.