Background

Accurate self-assessment is crucial for the professional development of physicians [1, 2]. It helps them identify their own strengths and weaknesses, set realistic expectations, and engage in self-directed lifelong learning [3]. Despite these benefits, previous studies across various medical specialties have demonstrated inconsistency between self-assessments and external measures, such as expert assessments or objective examinations [2, 4,5,6,7,8,9]. It is therefore necessary to identify factors that may affect the accuracy of self-assessment and to develop strategies to reduce this inconsistency.

The Anesthesiology Milestones 2.0 were developed by the Accreditation Council for Graduate Medical Education (ACGME) to assess the competency acquisition of anesthesia residents in the United States (US). They describe the behaviors that residents are expected to demonstrate as they progress through training in six domains of competency: patient care (PC), medical knowledge (MK), systems-based practice (SBP), practice-based learning and improvement (PBLI), professionalism (PROF), and interpersonal and communication skills (ICS) [10]. In the US, each residency training program is required to establish a Clinical Competency Committee (CCC), which rates each resident on the Anesthesiology Milestones every six months [11]. In recent years, many residency training programs have encouraged residents to assess themselves using the ACGME Milestones [9, 12,13,14,15,16,17]. Recognizing the differences between self- and CCC-assessments can reinforce residents' understanding of the Milestone evaluation system and improve their reflective practice. However, data comparing self- and CCC-assessments on the Anesthesiology Milestones remain sparse.

The aim of this study was to evaluate the differences between resident self-assessments and faculty assessments on the Anesthesiology Milestones and to investigate the associated factors.

Methods

This was a single-center cross-sectional study conducted at Peking Union Medical College Hospital (PUMCH), a general tertiary university-affiliated teaching hospital in Beijing, China. The institutional review board of PUMCH classified the study as "exempt" and waived the requirement for written informed consent. This article adheres to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.

Study population

We included anesthesia residents enrolled in the standardized residency training program in postgraduate years (PGY) 2 and 3 at the time of the Milestone evaluation. Medical education programs vary widely in China. After high school graduation, there are two major pathways to a clinical medical degree: a 5-year program leading to a Bachelor of Medicine and an 8-year program leading to a Doctor of Medicine [18]. The 8-year programs are favored for their extensive clinical rotations, research opportunities, and enhanced employment prospects; consequently, they require higher scores in the National College Entrance Examination. Upon passing the National Graduate School Entrance Examination, graduates of 5-year programs may enroll in a Master of Medicine program, during which they undergo residency training and engage in research. Graduates of 5-year programs who do not pass this examination, as well as graduates of 8-year programs, must also complete standardized residency training before promotion to attending physician. The anesthesiology residency training program in China spans three years: a 9-month rotation through anesthesia-related departments, including medical, surgical, and intensive care departments, followed by a 27-month rotation through the subspecialties of anesthesiology and pain medicine [19]. Residents in PGY1 spend most of their first year rotating through anesthesia-related departments and were therefore excluded from this evaluation.

Development and validation of the Chinese version of the Anesthesiology Milestones 2.0

The Anesthesiology Milestones 2.0 [20] were translated into Chinese by two professors of anesthesiology and a professor of English literature. There were 23 subcompetencies, each with five milestone levels. Progressing from level 1 to level 5 corresponds to moving from novice to expert in that subcompetency. A numeric rating scale of 0–10 was used (0: not yet assessable; 1: not yet completed level 1; 2: equal to level 1; 3: between levels 1 and 2; 4: equal to level 2; …; 10: equal to level 5). The CCC of our program consisted of eight anesthesiologists. Five had more than 15 years of experience in postgraduate education, and the remaining three had 7–8 years. They specialized in diverse subspecialties, including cardiac, thoracic, obstetric, pediatric, and regional anesthesia, enabling them to assess residents comprehensively across subcompetencies. The CCC faculty assessed 64 anesthesia residents using the Chinese version of the Anesthesiology Milestones 2.0 in 2022. Their ratings demonstrated satisfactory inter-rater reliability, internal consistency, and correlation with written examination scores.
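
As an illustration, the mapping between numeric ratings and milestone levels can be written out explicitly (a minimal sketch in R; the variable name is ours, not part of the instrument):

```r
# Mapping of the 0-10 numeric rating scale to milestone levels: even ratings
# 2, 4, 6, 8, and 10 correspond to levels 1-5, odd ratings 3, 5, 7, and 9 fall
# between adjacent levels, and 0 and 1 are the pre-level-1 categories.
rating_labels <- c(
  "0"  = "not yet assessable",
  "1"  = "not yet completed level 1",
  "2"  = "level 1", "3" = "between levels 1 and 2",
  "4"  = "level 2", "5" = "between levels 2 and 3",
  "6"  = "level 3", "7" = "between levels 3 and 4",
  "8"  = "level 4", "9" = "between levels 4 and 5",
  "10" = "level 5"
)
rating_labels["7"]  # a rating of 7 means "between levels 3 and 4"
```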

Data collection

We asked the CCC faculty to evaluate the PGY2 and PGY3 anesthesia residents on the 23 subcompetencies using the Chinese version of the Anesthesiology Milestones 2.0 in January 2023 and January 2024. All CCC faculty members supervised PGY2 and PGY3 residents in daily work and were thus familiar with their performance. Before assessing the residents, they were trained in the use of the Anesthesiology Milestones 2.0 based on the Supplemental Guide issued by the ACGME [21]. They discussed the ratings for each resident until a consensus was reached. During these discussions, scores from daily supervisor evaluations, quarterly written examinations, and annual Objective Structured Clinical Examinations were provided. At the same time, the PGY2 and PGY3 residents were asked to select the level that best described their own performance on each subcompetency using the same version of the Milestones. The faculty and residents were blinded to each other's rating scores during this process.

We also collected data on variables that might be associated with the accuracy of Milestone assessments, including age, gender, grade, evaluation year, medical education degree, and rank of written examination scores. The written examinations were conducted every three months and were composed of case-based single- and multiple-choice questions. We first standardized the scores of each written examination as z scores (subtracting the average score of all residents in an examination from each resident's score and dividing by the standard deviation). We then ranked the residents by their mean standardized score across all written examinations within one year before the evaluation.
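
A minimal sketch of this standardization and ranking in R, using simulated data (the column names resident_id, exam_id, and score are hypothetical):

```r
library(dplyr)

# Simulated example: 5 residents taking 4 quarterly written examinations
set.seed(1)
exams <- expand.grid(resident_id = 1:5, exam_id = 1:4)
exams$score <- round(rnorm(nrow(exams), mean = 75, sd = 8))

ranked <- exams %>%
  group_by(exam_id) %>%
  mutate(z = (score - mean(score)) / sd(score)) %>%  # z score within each exam
  ungroup() %>%
  group_by(resident_id) %>%
  summarise(mean_z = mean(z)) %>%                    # mean standardized score
  mutate(exam_rank = rank(-mean_z))                  # rank 1 = highest mean z
```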

Outcome measure

The primary outcome was the difference between self- and faculty-assessments, measured by subtracting the CCC-rated score from the self-rated score on each subcompetency of the Anesthesiology Milestones 2.0.

Statistical analysis

Normally distributed continuous variables and categorical variables were described as mean (standard deviation) and number (percentage), respectively. The differences between self- and faculty-rated scores on each subcompetency were analyzed using the paired Student's t test, as histograms indicated that the differences were normally distributed. The consistency between the two assessments was analyzed using the intraclass correlation coefficient (ICC), estimated by a two-way mixed-effects model for absolute agreement. Bland–Altman plots were used to assess the agreement between self- and faculty-rated scores within each competency.
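
The following is a minimal sketch of these analyses in R with the packages listed below, using simulated paired sum scores in place of the study data:

```r
library(irr)      # icc()
library(blandr)   # blandr.statistics()

# Simulated paired sum scores for 88 resident evaluations (illustrative only)
set.seed(2)
ccc_sum  <- rnorm(88, mean = 114, sd = 24)
self_sum <- ccc_sum + rnorm(88, mean = 6, sd = 31)

# Paired Student's t test of self- vs. faculty-rated scores
t.test(self_sum, ccc_sum, paired = TRUE)

# ICC from a two-way model with absolute agreement, single-rater unit
icc(cbind(self_sum, ccc_sum), model = "twoway", type = "agreement",
    unit = "single")

# Bland-Altman bias and limits of agreement with 95% confidence intervals
blandr.statistics(self_sum, ccc_sum)
```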

In the analysis of factors associated with the differences between self- and faculty-assessments, the assessment of each subcompetency was treated as one observation. The association was analyzed using a multivariable generalized estimating equation (GEE) linear model with robust standard error estimates to account for the clustering of assessments on different subcompetencies within the same resident. In addition to the variables describing resident characteristics, the domain of competency was included in the multivariable model as an independent variable, since residents might rate themselves higher or lower in certain domains. Independence, first-order autoregressive, and exchangeable working correlation structures were compared; the independence structure was selected because it had the smallest quasi-likelihood information criterion (QIC), indicating the best fit.
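
A sketch of the GEE analysis with geepack, shown here with simulated data and only two of the covariates (variable and column names are hypothetical; the full model also included age, gender, grade, evaluation year, and examination rank in the same way):

```r
library(geepack)

# Simulated long-format data: 88 resident evaluations x 23 subcompetencies,
# ordered by resident so the working correlation structures apply correctly
set.seed(3)
long <- data.frame(
  resident_id = rep(1:88, each = 23),
  domain = factor(rep(rep(c("PC", "MK", "SBP", "PBLI", "PROF", "ICS"),
                          length.out = 23), times = 88)),
  degree = factor(rep(sample(c("bachelor", "master", "doctorate"), 88,
                             replace = TRUE), each = 23)),
  diff   = rnorm(88 * 23, mean = 0.26, sd = 1.3)  # self minus faculty score
)

# Fit the GEE linear model under each candidate working correlation structure
fits <- lapply(c("independence", "ar1", "exchangeable"), function(cs) {
  geeglm(diff ~ degree + domain, id = resident_id, data = long,
         family = gaussian, corstr = cs, std.err = "san.se")  # robust SEs
})

sapply(fits, QIC)   # choose the structure with the smallest QIC
summary(fits[[1]])  # the independence structure was selected in our analysis
```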

In the power analysis, the design effect (D) was calculated as D = 1 + ICC × (m − 1) [22], where m is the number of subcompetencies per resident. The ICC was estimated using a linear mixed-effects model in which the faculty rating on each subcompetency was the dependent variable and the resident was the random intercept. An ICC of 0.53 and an m of 23 yielded a design effect of 12.6. The effective sample size (N) was then calculated as N = m × number of resident evaluations / D. With 88 resident evaluations across the two years, the effective sample size was 160. At a two-sided type I error probability of 0.05, the statistical power was 82.8% to detect a mean difference between self- and faculty-rated Milestone scores of 0.3 with a pooled standard deviation of 1.3.
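
The design-effect and power calculation can be reproduced as follows (a sketch plugging in the quantities reported above; in the study itself, the ICC was estimated from the faculty ratings with lme4::lmer as the random-intercept variance divided by the total variance):

```r
library(pwr)

icc <- 0.53   # ICC of faculty ratings from the linear mixed-effects model
m   <- 23     # subcompetencies per resident
n   <- 88     # resident evaluations across the two years

deff  <- 1 + icc * (m - 1)   # design effect: 1 + 0.53 * 22 = 12.66
n_eff <- m * n / deff        # effective sample size: ~160

# Power of a two-sided paired t test to detect a mean difference of 0.3
# with a pooled standard deviation of 1.3 (effect size d = 0.3 / 1.3)
pwr.t.test(n = n_eff, d = 0.3 / 1.3, sig.level = 0.05,
           type = "paired", alternative = "two.sided")  # power ~ 0.83
```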

A two-sided P < 0.05 was considered statistically significant. Statistical analyses were carried out in R (version 4.2.1, R Foundation for Statistical Computing, Vienna, Austria, 2022) with the irr, blandr, geepack, lme4, and pwr packages.

Results

A total of 46 and 42 residents in the PGY2 and PGY3 grades were enrolled in the residency training program at PUMCH in January 2023 and January 2024, respectively. Faculty- and self-assessments on the Anesthesiology Milestones 2.0 were collected and analyzed for all of these residents. Notably, only 64 distinct residents participated, as 24 residents were in PGY2 at the first evaluation and had transitioned to PGY3 by the second, and were therefore assessed twice. Table 1 provides further details of the residents' characteristics.

Table 1 Characteristics of residents

Table 2 summarizes the self- and faculty-rated scores on the 23 subcompetencies of the Anesthesiology Milestones 2.0. The self-rated sum score was significantly higher than the faculty-rated sum score [mean (standard deviation, SD): 120.39 (32.41) vs. 114.44 (23.71), P = 0.008 by paired t test], with an ICC of 0.55 [95% confidence interval (CI): 0.31 to 0.70] that did not indicate strong consistency. Residents' ratings were significantly higher than faculty ratings on 10 subcompetencies and significantly lower on foundational knowledge. Across subcompetencies, the ICCs varied widely, from 0.22 (95% CI: -0.17 to 0.49) for PROF3 to 0.60 (95% CI: 0.40 to 0.74) for ICS2. The Bland–Altman plots (Fig. 1) showed that residents rated themselves significantly higher than faculty in PC (bias 0.32, 95% CI: 0.05 to 0.60), PBLI (bias 0.45, 95% CI: 0.07 to 0.84), and PROF (bias 0.37, 95% CI: 0.02 to 0.72). The bias of the sum rating was 5.94 (95% CI: -0.75 to 12.63), with a lower limit of agreement of -55.95 (95% CI: -67.42 to -44.47), an upper limit of agreement of 67.83 (95% CI: 56.36 to 79.31), and 5 (5.7%) outliers beyond the limits.

Table 2 Comparison between self- and faculty-assessments on Anesthesiology Milestones (N = 88)
Fig. 1

Bland–Altman plots of self- and Clinical Competency Committee-evaluated Milestone rating scores. A: patient care; B: medical knowledge; C: systems-based practice; D: practice-based learning and improvement; E: professionalism; F: interpersonal and communication skills; G: sum score. Grey area: bias and 95% confidence interval; blue area: upper limit of agreement and 95% confidence interval; pink area: lower limit of agreement and 95% confidence interval

Table 3 presents the results of the GEE linear regression models of the differences between self- and faculty-assessments. Medical education degree and domain of competency were significantly associated with the differences between self- and faculty-rated Milestone scores in the multivariable model (Table 3). Ratings from residents with master's degrees (mean difference: -1.06, 95% CI: -1.80 to -0.32, P = 0.005) and doctoral degrees (mean difference: -1.14, 95% CI: -1.91 to -0.38, P = 0.003) were closer to the faculty-assessments than those from residents with bachelor's degrees. Compared with the PC competency, the differences between self- and faculty-rated scores were smaller in MK (mean difference: -0.18, 95% CI: -0.35 to -0.02, P = 0.031) and ICS (mean difference: -0.41, 95% CI: -0.64 to -0.19, P < 0.001). The multivariable model did not detect significant associations between age, gender, grade, or rank of written examination scores and the differences between self- and faculty-assessments.

Table 3 Generalized estimating equation linear regression models of the differences between self- and faculty-assessments on Anesthesiology Milestones (N = 2024)

Discussion

This cross-sectional study found that residents generally rated themselves significantly higher than the CCC faculty did on the Anesthesiology Milestones. Medical education degree and domain of competency were independently associated with the differences between resident self- and faculty-assessments.

This study adds emerging evidence that anesthesia residents tend to overestimate their own competence, which might affect their clinical judgement. Anesthesiologists often face unexpected crisis events in clinical practice, and a key element of crisis management is calling for help decisively when needed. It is therefore of utmost importance to cultivate the ability of self-assessment during residency training, which can help residents recognize when a clinical situation is beyond their capacity and additional help is required. To our knowledge, there are only limited data on the accuracy of anesthesia residents' self-assessments. Ross et al. demonstrated strong agreement between resident self- and faculty-assessments on the Anesthesiology Milestones 1.0 [16]. However, Fleming et al. found that anesthesia residents' ratings were lower than the faculty's on a 5-point anchored Likert scale designed to evaluate overall clinical performance [7]. Previous studies have also shown conflicting results regarding the consistency between self- and faculty-assessments on the ACGME Milestones among residents in other specialties, including surgery, ophthalmology, emergency medicine, and family medicine [8, 9, 12,13,14,15, 17].

Several potential limitations in Milestone evaluation and feedback may have caused inaccurate assessments in our program. Methods of faculty assessment differ across programs. In our program, the CCC faculty discussed ratings based on their own impressions, daily assessments from supervisors, and objective examination scores. In some other programs, CCC faculty reviewed available 360° evaluations of residents' performance [9, 15, 16]. The latter method should be more objective, since CCC faculty may lack opportunities to observe residents' behaviors in all subcompetencies and thus need supplemental evidence, such as recent evaluations from other physicians, nurses, peers, or patients. Furthermore, residents in the US have received regular Milestone-based feedback for more than five years [23, 24], whereas the Milestone evaluation system has only recently been introduced in China. Our residents may therefore not understand the Milestone descriptions as well as their US counterparts, which might have contributed to their overestimation. This could be improved with the "ask-tell-ask" feedback method, in which faculty first ask residents to assess themselves, then inform them of the faculty-assessments, and finally discuss action plans together [25, 26].

This study suggested that the differences between self- and faculty-assessments varied across competencies. Residents in our program overestimated their own competency in PC, PBLI, and PROF (Fig. 1). The self- and faculty-assessments differed less in MK and ICS than in PC (Table 3), aligning with the findings of a study of ophthalmology residents [9]. A plausible explanation is that our residents received feedback on their written examination scores every three months, giving them a clear understanding of their progress in the MK competency. Similarly, residents could assess their own ICS competency from others' attitudes towards them and the feedback received during their interactions. Conversely, the PBLI and PROF competencies were not frequently remarked upon by supervisors or discussed among peers. Some residents acknowledged the significance of PBLI and committed considerable time to it; nevertheless, insufficient self-learning skills kept them from achieving the desired learning outcomes. Such residents were prone to overestimating their proficiency in PBLI, possibly because they conflated effort with actual competency.

Our study revealed that residents with higher medical degrees had ratings closer to the faculty-assessments than those with a bachelor's degree. On the one hand, residents with advanced medical degrees received higher scores in the National Entrance Examinations, indicating better academic performance that may be associated with enhanced self-awareness [27]. On the other hand, their graduate education, which includes clinical rotations and research training, could effectively strengthen their capacity for self-awareness.

Accumulating evidence has shown that female residents in surgery programs are more likely than male residents to underestimate themselves [14, 15, 28], which was not supported by our data. It is worth noting that females accounted for approximately 70% of the residents, 80% of the CCC members, and 60% of all faculty in our department; female anesthesiologists were therefore not at a disadvantage, a major difference from surgery. We also did not observe a significant association between the differences and resident grade or rank of written examination scores. Some studies have found a tendency towards underestimation in senior residents and overestimation in junior residents [29, 30], which has been explained by the metacognitive bias known as the Dunning-Kruger effect, whereby the least skilled individuals tend to be the most overconfident [3]. However, the Dunning-Kruger effect has recently been questioned, as poor performers lack the cognitive resources to assess their ability and may therefore either overestimate or underestimate themselves [31].

This study had several limitations. First, the cross-sectional design limited our ability to draw causal conclusions; future studies are required to validate the associations found here. Second, the accuracy of self- and faculty-assessments may be influenced by factors such as the experience and number of CCC faculty, the evaluation methodology, and the feedback provided to residents. In addition, variations in medical education and residency training across countries may restrict the generalizability of our findings to other centers, countries, or specialties. Finally, this study did not investigate the underlying reasons for the differences between resident- and faculty-assessments, since some potentially associated factors could not be captured as quantitative data; interviews with residents and faculty could provide more detailed information.

Conclusions

Our study revealed differences between resident self- and faculty-assessments on the Anesthesiology Milestones among anesthesia residents in China. These differences were associated with residents' medical education degrees and the domains of competency. These findings emphasize the need to improve the accuracy of Milestone self-assessment, especially among residents with bachelor's degrees and in the PC, PBLI, and PROF competencies.