Introduction

The American Board of Surgery Certifying Exam (ABS-CE) is the final step in becoming a board-certified general surgeon. The purpose of the exam is to verify that an individual is competent and safe for independent practice by assessing knowledge and decision-making. Mock oral exams (MOEs) are simulations of the ABS-CE for surgical trainees. MOEs closely resemble the certifying exam, and MOE performance has been shown to predict passage of the ABS-CE [1, 2].

The structure and content of the ABS-CE are confidential. Therefore, most MOEs are written by local content experts not affiliated with the American Board of Surgery and lack standardization. Since MOEs must discern an examinee’s mastery of surgical knowledge using a limited number of questions, each question should be regularly assessed for effectiveness [3, 4]. Effective questions should assess more advanced skills such as synthesis of ideas and problem-solving, with fewer questions testing lower-order cognitive functions such as recall of information [5]. Exam questions should avoid focusing on tedious details and instead emphasize practical applications of knowledge [6]. Previous studies have shown that item analysis of multiple-choice examinations can distinguish questions by reliability, discriminative ability, and overall difficulty [7,8,9,10].

The objective of this study was to perform a thorough item analysis of a large, multi-center MOE. We hypothesized that item analysis would identify questions that best measure content mastery, predict exam passage, and distinguish between high- and low-performing residents.

Methods

Data from a 2022 standardized multi-institutional general surgery MOE with 64 participating residents from 6 institutions were retrospectively reviewed. Exam questions were written by participating faculty on a volunteer basis. All cases were reviewed and revised by study authors for clarity and content. Examinees were asked 73 questions across 12 standardized cases, and examiners recorded a “pass” or “fail” after each question. In the examination, each case consisted of discrete questions. In our analysis, each question was considered a separate item that could be graded as “pass,” “fail,” or, for some items, “critical fail.” For each item, examiners were given a rubric with specific criteria for each grade. An item was considered correctly answered if both examiners agreed the examinee passed. At the end of each case, examiners graded the overall performance as “pass,” “borderline,” or “fail” without specific criteria defining each grade. Thus, the number of items answered correctly was not directly tied to passage or failure of the entire case. Resident levels (PGY-3, PGY-4, PGY-5) were blinded to examiners to avoid bias from expectations tied to the amount of training. Furthermore, examiners were paired only with examinees from a different institution to remove preconceived notions about the resident due to personal familiarity. “Room score” was defined as the mean of the 4 case scores. MOE passage was defined as both the percentage of questions answered correctly and the mean room score falling no more than 1 standard deviation below the respective cohort mean. Study authors (JW, JW, FC) categorized questions by clinical topic (surgical critical care, skin and soft tissue, large intestine, stomach, pediatric surgery, breast, endocrine, biliary, trauma) and clinical competency (diagnosis, decision to operate, operative approach, professionalism, patient care, medical knowledge).
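
As an illustration of the passage criterion described above, the following is a minimal Python sketch that applies the same cutoff logic to synthetic data; the array values and the printed pass rate are hypothetical assumptions, not study data.

```python
import numpy as np

# Synthetic example data (illustrative only, not study data): fraction of the
# 73 items answered correctly and mean room score for each of 64 examinees.
rng = np.random.default_rng(0)
percent_correct = rng.uniform(0.5, 1.0, size=64)
room_score = rng.uniform(1.0, 3.0, size=64)

def meets_cutoff(metric: np.ndarray) -> np.ndarray:
    """True where a value is no more than one standard deviation below its mean."""
    return metric >= metric.mean() - metric.std()

# An examinee passes the MOE only if BOTH metrics meet the cutoff.
moe_pass = meets_cutoff(percent_correct) & meets_cutoff(room_score)
print(f"Illustrative pass rate: {moe_pass.mean():.1%}")
```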

Rates of passage were computed for all participants and stratified by PGY level using Microsoft Excel. Independent two-sample t-tests were used to compare rates of item passage between pairs of PGY levels. For all analyses, p ≤ 0.05 was considered statistically significant.
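
A minimal sketch of this comparison is shown below, assuming synthetic per-examinee percent-correct values; the study’s analysis was run in Microsoft Excel, and scipy is used here only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic per-examinee percent-correct values for two PGY levels
# (group sizes mirror the PGY-3 and PGY-5 cohorts; values are not study data).
rng = np.random.default_rng(1)
pgy3_percent_correct = rng.normal(loc=0.72, scale=0.08, size=15)
pgy5_percent_correct = rng.normal(loc=0.82, scale=0.08, size=26)

# Independent two-sample t-test comparing item passage rates between the levels.
t_stat, p_value = ttest_ind(pgy3_percent_correct, pgy5_percent_correct)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p <= 0.05 treated as significant
```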

Item analysis was performed for each test question. We assessed whether answering correctly was associated with MOE passage and whether answering incorrectly was associated with MOE failure using Fisher’s exact test. We defined a question as “effective” if there was a significant association between answering it correctly and passing the MOE, because this indicated that examinees who answered correctly were more likely to have mastery of the MOE content overall. We defined a question as “ineffective” if it: 1) had no discriminatory ability (all examinees answered correctly, or all examinees answered incorrectly) and did not have a critical-fail option, or 2) showed a correlation between answering correctly and failing the exam.
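
The per-item test can be sketched as a 2 × 2 Fisher’s exact test of item result against MOE outcome; the counts below are hypothetical and serve only to show the classification logic.

```python
import numpy as np
from scipy.stats import fisher_exact

# Hypothetical 2 x 2 table for a single item (counts are illustrative only).
# Rows: item answered correctly / incorrectly; columns: MOE passed / failed.
table = np.array([[45, 5],
                  [4, 10]])

odds_ratio, p_value = fisher_exact(table)

# Flag the item as "effective" when answering correctly is significantly and
# positively associated with passing the MOE.
effective = (p_value <= 0.05) and (odds_ratio > 1)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}, effective = {effective}")
```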

Results

Exam characteristics and examinee performance

A total of 64 resident examinees, PGY 3–5, from six general surgery residency programs participated in the MOE. The rate of overall MOE passage was 76.7% (49/64) and the total percentage of items answered correctly was 78.0% (Table 1). There were no statistically significant differences in pass rates by clinical year. By PGY, pass rates were 73.3% (11/15), 78.3% (18/23), and 76.9% (20/26) for PGY-3, 4, and 5, respectively. The overall percentage of items answered correctly was 71.7% for PGY-3s, 77.6% for PGY-4s, and 81.8% for PGY-5s. Differences in the percentage of items answered correctly between PGY levels were all statistically significant (p < 0.01).

Table 1 Number of exam participants, pass rates, and percent questions correct by PGY

Effective items

Item analysis identified 17 items (23.3%) that were significantly correlated with MOE passage overall. With each question considered individually, residents who answered correctly had a pass rate ranging from 79.7 to 93.8%. Conversely, residents who answered one of these items incorrectly had a pass rate ranging from 0 to 65.6% (Figs. 1, 2). These items with high discriminatory ability were relatively evenly distributed by clinical topic. Topics included endocrine, surgical critical care, trauma, and large intestine (17.7% each) with fewer in stomach (11.8%), breast (11.8%), and pediatric surgery (5.9%). By clinical competency, most of these items pertained to patient care (52.9%) and operative approach (23.5%), with fewer related to diagnosis (11.8%), decision to operate (5.9%), and professionalism (5.9%) (Fig. 3).

Fig. 1 MOE pass rates for predictive questions when answered correctly

Fig. 2 MOE pass rates for predictive questions when not answered correctly

Fig. 3 Distribution of predictive questions by clinical topic and clinical competency

Ineffective items

Item analysis identified 14 items with zero or negative discriminatory ability. First, three items had a 100% correct rate and therefore could not discriminate between pass and fail outcomes. Of note, none of these questions included a “critical fail” option covering elements of knowledge deemed essential to safe practice. Although questions that all examinees answer correctly are not necessarily ineffective, they are unable to distinguish residents with low content mastery from those with high content mastery within our local cohort. Second, eleven items had a higher pass rate for examinees who answered incorrectly than for those who answered correctly, although these findings did not reach statistical significance (Fig. 4). By clinical topic, ineffective items had a greater proportion of stomach (35.7%) and large intestine (21.4%) with lesser representation of surgical critical care, pediatric surgery, breast, and endocrine (7.1% each). Clinical competencies tested by these items were mostly related to diagnosis (42.9%) and decision to operate (21.4%). Operative approach (14.3%), patient care (14.3%), and professionalism (7.1%) were less represented (Fig. 5).

Fig. 4 Questions that displayed greater pass rates when not answered correctly

Fig. 5 Distribution of low-discrimination questions by clinical topic and clinical competency

Discussion

Our item analysis of mock oral examination questions found that only 23.3% of questions were associated with passing the exam, and it also identified 19.2% of questions as ineffective. It is crucial to identify and expand upon the most effective questions and to minimize or eliminate the least effective ones. Item analysis is a means of quality assurance and improvement for question writing when no other standard is available. Our study is the first to describe item analysis in MOEs.

Consistent with the results of our study, previous studies have found that item analysis is effective in classifying multiple-choice questions according to discriminatory ability [8,9,10]. Moderately difficult questions were most discriminatory, whereas excessively difficult or easy questions were less effective [8]. Removal of ineffective questions with negative discrimination (i.e., low achievers performed better than high achievers) from subsequent exams increases validity and reliability [9]. Similarly, another study found that revision or replacement of low-discrimination questions led to increased discrimination and exam quality in subsequent iterations of the exam [10]. Taken together with our results, item analysis is a valuable practice for general surgery programs writing their own mock oral examination questions to increase the reliability and efficiency of their exams. The easiest items to target for removal or revision are those with negative discrimination or no discriminatory ability. Finally, it is worth noting that a poorly performing item may simply have a poorly written scoring rubric, rather than any problem with the topic or the examinee. Nevertheless, item analysis can at least identify problematic items for further scrutiny.
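
For programs screening their own question banks, the negative-discrimination pattern referenced above can also be surfaced with a classical upper/lower-group discrimination index. The sketch below uses synthetic scores and is offered as an adjunct screen, not as the method used in this study, which relied on Fisher’s exact test.

```python
import numpy as np

# Synthetic scores for illustration only: total items correct per examinee
# and pass/fail on a single item of interest.
rng = np.random.default_rng(2)
total_correct = rng.integers(40, 74, size=64)
item_correct = rng.integers(0, 2, size=64).astype(bool)

# Classical discrimination index: proportion passing the item in the top 27%
# of overall scorers minus the proportion passing in the bottom 27%.
order = np.argsort(total_correct)
k = max(1, int(round(0.27 * len(order))))
lower_group, upper_group = order[:k], order[-k:]
discrimination = item_correct[upper_group].mean() - item_correct[lower_group].mean()
print(f"Discrimination index: {discrimination:+.2f}")  # negative => low scorers did better
```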

Although eliminating ineffective questions is straightforward, it is more difficult to discern what makes one question more effective than another. Effectiveness for a general surgery mock oral exam is defined by its ability to simulate the ABS-CE, but the content of the ABS-CE cannot be shared with individuals writing the mock oral exam. Unsurprisingly, then, mock oral examinations are widely utilized in surgical education but have untested reliability and validity [11, 13]. In our study, we found that effective questions were relatively evenly distributed across clinical topics. We did note that when effective questions were stratified by clinical competency, 76% were related to patient care and operative approach. In comparison, 64% of ineffective questions related to diagnosis and decision to operate. Thus, future question writers should focus on patient care and operative approach, which are typically learned in the senior years of training. While other areas are important, it is possible these would be better assessed in a different format.

In our analysis of question performance by PGY level, clinical topic, and clinical competency, we found statistically significant differences in performance across all PGY levels. This finding supports the validity of our MOE, as the questions, taken in aggregate, assess knowledge and skills that improve as examinees attain higher levels of training. We also identified 13 questions that may be particularly effective at discriminating among levels of training. The distribution of effective items by clinical topic was relatively even, whereas the greatest proportions of ineffective items were identified among the topics of stomach and large intestine. This may be explained in part by the clarity of the rubrics provided for these items. It is also plausible that questions regarding the stomach and large intestine tested knowledge that is attained at lower PGY levels, and thus had low discriminatory value. Regarding clinical competencies, patient care and operative approach contained greater proportions of effective items, likely because these are skills that improve significantly with increasing levels of training and clinical exposure. There were relatively greater proportions of ineffective items in the categories of diagnosis and decision to operate, which may be explained in part by residency curricula prioritizing diagnostic skills and indications for surgical management earlier in training than other competencies.

Our study has several limitations. First, we cannot assess the true validity of the mock oral examination relative to the American Board of Surgery Certifying Exam, as our results are based on passage of the mock oral exam itself. Our study lacks sufficient longitudinal data to analyze the correlation of performance on our MOE with ABS-CE passage. However, with subsequent iterations of the MOE and increasing available data, assessing this relationship may be a promising area of future research. Despite this, multiple studies have found that participating in practice exams significantly improves exam performance [4, 12]. Whether item analysis improves the validity of general surgery MOEs, and whether more valid mock oral exams lead to better performance on the ABS-CE, are two areas of possible future research. Second, this study is based on a limited sample of residents from southern California general surgery programs. A larger study incorporating a nationwide sample may yield more generalizable results.

Conclusion

Many general surgery residency programs use MOEs to prepare residents for the ABS-CE. Therefore, MOEs must be optimized with questions that are reliable, discriminatory, and predictive of overall performance. Item analysis can identify both effective and ineffective questions to guide future exam development.