Responses to the EMQ examination taken by 193 fourth-year medical students were used as the source data. The examination is designed to test factual and applied knowledge taught in the Medical and Surgical Specialties course and is taken by the students at the end of the 16-week course. The course is run three times per year and rotates with two other courses (Paediatrics / Obstetrics / Gynaecology and Primary Care / Public Health / Psychiatry). All questions were devised by the lead educational supervisor within each specialty. Training in EMQ writing was provided to the medical specialty supervisors. The examination consisted of 98 EMQs distributed across eight specialties and 27 themes, each theme comprising two to four EMQs. Each themed group of EMQs had eight to 15 possible response options (see the example in Figure 1). There were 12 Oncology, 14 Anaesthetics, 12 Dermatology, 12 A&E, 12 Infectious Diseases, 16 Orthopaedics, 8 Rheumatology and 12 Rehabilitation EMQs. The final examination mark is the number of correct answers summed across all themes and specialties, giving an indication of applied knowledge across the range of medical and surgical specialties that make up the Medical and Surgical Specialties module.
The only additional information collected about the students was the term in which they had sat the EMQ examination. Because the Medical and Surgical Specialties course is repeated three times a year, with the examination taken at the end of each run, Differential Item Functioning (see below) was used to determine the impact of the term in which the examination was taken on student performance.
The Rasch model is a probabilistic unidimensional model which asserts that (1) the easier the question, the more likely the student is to answer it correctly, and (2) the more able the student, the more likely he/she is to answer the question correctly compared with a less able student. The model assumes that the probability that a student will correctly answer a question is a logistic function of the difference between the student's ability [θ] and the difficulty of the question [β] (i.e. the ability required to answer the question correctly), and only a function of that difference.
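As a minimal illustrative sketch (not the RUMM2020 implementation), the logistic relationship described above can be written in its standard dichotomous form, P(correct) = exp(θ − β) / (1 + exp(θ − β)). The function name `rasch_probability` and the example values are assumptions chosen purely for illustration.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model.

    theta: student ability in logits.
    beta:  question difficulty in logits.
    The result depends only on the difference theta - beta.
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


# Hypothetical values: when ability equals difficulty the probability is 0.5.
print(rasch_probability(0.0, 0.0))  # 0.5
# A more able student has a higher probability on the same question.
print(rasch_probability(1.0, 0.0) > rasch_probability(0.0, 0.0))  # True
```

Shifting both θ and β by the same amount leaves the probability unchanged, which is what "only a function of that difference" means in practice.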
From this, the expected pattern of responses to questions can be determined given the estimated θ and β. Even though each response to each question must depend upon the student's ability and the question's difficulty, in the data analysis it is possible to condition out or eliminate the students' abilities (by taking all students at the same score level) in order to estimate the relative question difficulties [14, 16]. Thus, when data fit the model, the relative difficulties of the questions are independent of the relative abilities of the students, and vice versa. A further consequence of this invariance is that it justifies the use of the total score [18, 19]. In the current analysis this estimation is done through a pair-wise conditional maximum likelihood algorithm, which underlies the RUMM2020 Rasch measurement software [20, 21].
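The idea of conditioning out ability can be illustrated numerically. The sketch below is not the RUMM2020 algorithm; it only shows, under the dichotomous Rasch model, that among students who answer exactly one of two questions correctly, the probability that it was the first question does not depend on θ. The function names and the difficulty values are hypothetical.

```python
import math


def p_correct(theta: float, beta: float) -> float:
    """Dichotomous Rasch probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def p_first_given_one_correct(theta: float, beta_i: float, beta_j: float) -> float:
    """P(question i correct, question j incorrect | exactly one of the two correct)."""
    p_i = p_correct(theta, beta_i)
    p_j = p_correct(theta, beta_j)
    only_i = p_i * (1.0 - p_j)
    only_j = (1.0 - p_i) * p_j
    return only_i / (only_i + only_j)


# Hypothetical difficulties; the conditional probability is the same
# for a weak student (theta = -2) and a strong one (theta = 3).
weak = p_first_given_one_correct(-2.0, 0.5, -0.3)
strong = p_first_given_one_correct(3.0, 0.5, -0.3)
print(abs(weak - strong) < 1e-12)  # True
```

Because θ cancels, this conditional probability reduces to 1 / (1 + exp(β_i − β_j)), so it carries information about the relative difficulty of the two questions only.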
If the above assumptions hold true then the relationship between the performance of students on an individual question and the underlying trait (applied knowledge within the medical and surgical specialties course) can be described by an S-shaped curve (the item response function). Thus the probability of answering the question correctly increases consistently as the location on the trait (knowledge) increases (Figure 2). The steepness of the curve indicates how rapidly the probability of a student answering the question correctly changes as a function of this location (ability). The location of the curve along the horizontal axis (defined by the point on that axis at which the probability of a correct response is 0.5) indicates the difficulty of the question. The location of the student on the same axis indicates their level (of knowledge, ability etc.) on the trait.
When the observed response pattern does not deviate significantly from the expected response pattern then the questions constitute a true measurement or Rasch scale. Taken with confirmation of local independence of questions, that is, no residual associations in the data after the person ability (first factor) has been removed, this supports the unidimensionality of the scale [23, 24].
General tests of fit
In this analysis, responses to the EMQs are analysed as dichotomous, that is, the one correct answer is scored as correct and all of the other options are analysed together as a single incorrect response. To determine how well each question fits the model, and so contributes to a single trait, a set of 'fit' statistics is used which tests how far the observed data match those expected by the model. The trait refers to the required knowledge base that the student must acquire within the medical and surgical specialties course. The Item – Trait Interaction Statistic (denoted by the chi-square value) reflects the degree of invariance across the trait. A significant chi-square value indicates that the relative location of the question difficulty is not constant across the trait. In addition, question fit statistics are examined as residuals (a summation of the deviations of individual students' responses from the expected response for the question). An estimate of the internal consistency reliability of the examination is based on the Person Separation Index, where the estimates on the logit scale for each person are used to calculate reliability.
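One common way to express such residuals is as standardised differences between the observed score (1 or 0) and the model-expected probability. The sketch below assumes that form; the exact transformation applied by RUMM2020 is not described here, and the function names are hypothetical.

```python
import math


def standardized_residual(x: int, theta: float, beta: float) -> float:
    """(observed - expected) / sqrt(variance) for one dichotomous response."""
    p = 1.0 / (1.0 + math.exp(-(theta - beta)))  # expected score under the model
    return (x - p) / math.sqrt(p * (1.0 - p))    # Bernoulli variance is p(1 - p)


def sum_squared_residuals(responses, abilities, beta: float) -> float:
    """Sum of squared standardised residuals for one question across students."""
    return sum(
        standardized_residual(x, theta, beta) ** 2
        for x, theta in zip(responses, abilities)
    )


# Hypothetical data: two students located exactly at the question difficulty.
print(standardized_residual(1, 0.0, 0.0))              # 1.0
print(sum_squared_residuals([1, 0], [0.0, 0.0], 0.0))  # 2.0
```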
Misfit of a question indicates a lack of the expected probabilistic relationship between the question and the other questions in the examination. This may indicate that the question does not contribute to the trait under consideration. In the current study students are divided into three ability groups (upper third, middle third and lower third), termed class intervals, with approximately 65 students in each. Furthermore, significance levels of fit to the model are adjusted to take account of multiple testing (e.g. for 24 items the level would be 0.002 and for 98 items the level would be 0.0005).
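The quoted levels are consistent with a Bonferroni-style adjustment of a nominal 0.05 level divided by the number of items. Both the 0.05 starting level and the Bonferroni form are assumptions inferred from the figures, not stated in the text; the sketch below reproduces the arithmetic.

```python
def adjusted_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Bonferroni-style per-test significance level (alpha assumed to be 0.05)."""
    return alpha / n_tests


print(round(adjusted_alpha(24), 3))  # 0.002
print(round(adjusted_alpha(98), 4))  # 0.0005
```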
As well as invariance across the trait, questions should display the same relative difficulty, irrespective of which externally defined group is being assessed. Thus, given the same ability level, the probability of correctly answering a question should be the same in each group. For example, given the same ability, students should not be more likely to answer a question correctly simply because they sat the examination in the third term instead of the first or second term. This type of analysis is called Differential Item Functioning (DIF). The basis of DIF analysis lies in the item response function, and the proportion of students at the same ability level who correctly answer the question. If the question measures the same ability across groups of students then, except for random variations, the same response curve is found irrespective of the group for whom the function is plotted. Thus DIF refers to questions that do not yield the same response function for two or more groups (e.g. gender or the cohort of students).
DIF is identified by two-way analysis of variance (ANOVA) of the residuals, with the term in which the examination was taken by the student as one factor and the class interval as the other. Two types of DIF are identified: (a) uniform DIF, in which the effect of the term in which the examination was taken is the same across all class intervals (main effect), and (b) non-uniform DIF, in which the effect of the term in which the student sat the examination differs across class intervals (interaction effect). Where there are more than two levels of a factor, Tukey's post hoc test is used to indicate which groups are contributing to the significant difference.
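The sketch below does not perform the ANOVA itself; it only tabulates the mean residual in each term-by-class-interval cell, which is the quantity the two-way ANOVA compares. A shift between terms that is similar in every class interval would suggest uniform DIF, whereas a shift that varies between class intervals would suggest non-uniform DIF. The function name, record layout and data are hypothetical.

```python
from collections import defaultdict


def cell_means(records):
    """Mean residual per (term, class_interval) cell.

    records: iterable of (term, class_interval, residual) tuples for one question.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for term, interval, residual in records:
        sums[(term, interval)] += residual
        counts[(term, interval)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Hypothetical residuals for a single question.
records = [
    (1, "lower", 0.5), (1, "lower", -0.5),
    (2, "lower", 1.0),
]
print(cell_means(records))  # {(1, 'lower'): 0.0, (2, 'lower'): 1.0}
```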
Although EMQs are analysed as though they have dichotomous response categories (correct or incorrect), it is possible to examine how the separate incorrect options within an individual EMQ are contributing to the student's response. This procedure is very similar to the technique of Graphical Item Analysis (GIA), though in this case the RUMM2020 programme produces the analysis with no extra user effort. The proportions of students in each class interval who have selected the various response categories, including the correct option, are plotted on a graph of the item response function. This visually illustrates how often the various response options are being selected by the students in relation to one another, and can be compared across themes given that different options are likely to have different response patterns for different questions within a theme. This is particularly useful in improving the quality of the distractor responses.
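The proportions that such a plot displays can be computed directly. The following is a minimal sketch, independent of RUMM2020, with a hypothetical function name and made-up option labels: for each class interval it returns the proportion of students selecting each response option.

```python
from collections import Counter, defaultdict


def option_proportions(choices):
    """Proportion of students choosing each option within each class interval.

    choices: iterable of (class_interval, selected_option) pairs for one EMQ.
    """
    counts = defaultdict(Counter)
    for interval, option in choices:
        counts[interval][option] += 1
    return {
        interval: {opt: n / sum(tally.values()) for opt, n in tally.items()}
        for interval, tally in counts.items()
    }


# Hypothetical selections: option "A" is taken to be the correct answer.
choices = [("lower", "A"), ("lower", "B"), ("upper", "A"), ("upper", "A")]
print(option_proportions(choices))
# {'lower': {'A': 0.5, 'B': 0.5}, 'upper': {'A': 1.0}}
```

A distractor that attracts a similar proportion of students in every class interval, or one that is never selected, is a candidate for revision.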
In view of our limited sample size (and particularly the ratio of students to items) we elected in the first instance to examine in detail the psychometric properties of the musculoskeletal component of the EMQ examination, acknowledging the limitations associated with the accuracy of the person estimate based upon 24 items [29]. Subsequent analysis of the whole examination is reported to demonstrate the potential benefits of using Rasch analysis, again acknowledging that only limited conclusions can be drawn about student ability and question difficulty estimates for the whole examination when 98 items are analysed with 193 students.