INTRODUCTION

The Accreditation Council for Graduate Medical Education (ACGME) outcomes project marked the transition from process-based to outcomes-based assessment in medical education.1 This project introduced the six core competencies of patient care, medical knowledge, systems-based practice (SBP), practice-based learning and improvement (PBLI), professionalism, and interpersonal and communication skills (ICS). While these competencies increased emphasis on educational outcomes, assessing individual learner performance in the competencies can be difficult.2

With the Next Accreditation System (NAS), the ACGME and the American Board of Medical Specialties introduced 22 competency-based developmental outcomes, referred to as sub-competencies or reporting milestones.3 Advantages of sub-competencies include the incorporation of specific behavioral anchors and the ability to gauge learners’ progression over time. However, utilizing sub-competencies for clinical assessments may be challenging as they were designed primarily as a tool for reporting, not assessment.4

Consequently, entrustable professional activities (EPAs) and observable practice activities (OPAs) have been developed to operationalize the assessment of sub-competencies in the clinical setting.5 EPAs incorporate entrustment decisions—determinations regarding levels of responsibility—into the assessment of clinical performance.6–9 Ultimately, EPAs are useful for clinical assessment because they incorporate multiple competencies, observable behaviors, and the ability to draw conclusions about accountability and trustworthiness. OPAs are more context or content specific than EPAs but are used in a similar manner for learner assessment.10 Observable practice activities were originally described as a way to assess learners’ observed performance, which in turn informs entrustment within EPAs.10 However, EPAs expand beyond OPAs by addressing activities that society trusts all physicians to do,5 and by allowing assessment of these activities over the course of a resident’s training.

Although experts have clearly articulated the importance of EPAs in outcomes-based assessment, there remains a need for quantitative research on developing and assessing the quality of EPAs.11 Literature regarding EPAs is just emerging, with recent studies exploring the feasibility and process of using EPAs for assessment.12–15

Several national organizations have published EPAs for assessing medical students or internal medicine residents.16–18 While these EPAs are informative and useful, it is recognized that individual programs will need to develop additional EPAs at the local level in order to determine learners’ progress and readiness for advancement in their unique clinical environments.17 Consequently, there is a need for validated tools that will help educational programs create high-quality EPAs. Our objectives were to: 1) develop and validate an instrument for assessing the quality of EPAs for milestones-based assessment, 2) describe the features of high-quality EPAs, and 3) examine associations between EPA quality scores and the characteristics of the rotations from which they originated, in order to determine whether certain learning environments were more conducive to writing EPAs for use in resident assessment.

METHODS

Setting and Participants

This was a prospective content validation study to identify and rate the quality of locally developed EPAs that were designed to assess the competence of internal medicine residents at our institution in 2014. Our program consists of 168 resident physicians, 33 rotation directors, and 10 associate program directors (APDs). All rotation directors were provided with basic education regarding the definition and purpose of EPAs at a noon conference. This one-hour educational session involved an interactive presentation by our local expert in EPAs, who reviewed the literature regarding EPAs and discussed the purpose of creating EPAs for evaluation. This education was reinforced with a subsequent e-mail message. Rotation directors were then instructed to write and submit EPAs that pertained to their subspecialty rotations, resulting in 229 locally developed EPAs (see electronic addendum). We rated the quality of these EPAs, as well as previously published EPAs,17 with an instrument we developed for the purpose of this study. The study was deemed exempt by the Mayo Clinic Institutional Review Board.

Instrument Development

To develop an instrument for rating the quality of EPAs, we created a team of raters separate from the rotation directors who had submitted the EPAs. This team of raters consisted of five APDs (authors JP, CW, DD, KT, and TB) who had previous experience in graduate medical education, resident assessment, and scale development and validation.

Identifying Item Content

The team reviewed salient literature on the topic of EPAs,2,5–9,11,13–15,19,20 and then met to discuss the qualities of EPAs that would be useful in assessment. After repeated review and discussions, the team reached consensus on the following essential domains of an EPA: 1) Focus, 2) Observable, 3) Clear Intention, 4) Realistic, 5) Articulates Trustworthiness, 6) Generalizable Across Rotations, and 7) Integrates Multiple Competencies. For each domain, every team member proposed three potential items, resulting in 105 candidate items equally distributed across these seven domains. Subsequently, the team reached agreement on three items for each EPA quality domain. Items were structured on a five-point scale (1 = strongly disagree, 5 = strongly agree).

Determining Item Reliability

To pilot test the instrument, all team members rated a convenience sample of ten locally developed EPAs. Intraclass correlation coefficients (ICCs) were calculated to determine inter-rater reliability for each item, with ICCs < 0.4 considered to be poor, 0.4 to 0.75 considered to be fair to good, and > 0.75 considered to be excellent.21 This first round of pilot testing revealed good to excellent reliability (ICC range 0.72 to 0.94) for the domains of Focus, Observable, Realistic, Generalizable, and Integrates Multiple Competencies. However, the domains of Clear Intention and Articulates Trustworthiness showed poor to fair reliability (ICC range 0.24 to 0.61). Despite additional pilot testing with ten different EPAs, the Clear Intention and Articulates Trustworthiness domains continued to perform poorly, so they were dropped from the instrument, leaving the five domains of Focus, Observable, Realistic, Generalizable, and Integrates Multiple Competencies.
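The specific ICC model used for these calculations is not stated above. As a minimal illustrative sketch only, assuming ratings for a single QUEPA item are arranged as a matrix of pilot EPAs by raters, the Python code below computes a two-way random-effects, absolute-agreement, single-rater ICC (Shrout and Fleiss ICC[2,1]); the function name and the simulated data are hypothetical.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """Shrout & Fleiss ICC(2,1): two-way random effects, absolute agreement,
    single rater. `ratings` has shape (n_targets, n_raters), e.g., 10 pilot
    EPAs rated 1-5 by 5 raters on one QUEPA item."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    target_means = ratings.mean(axis=1)   # mean rating for each EPA
    rater_means = ratings.mean(axis=0)    # mean rating given by each rater

    # Sums of squares from the two-way ANOVA decomposition
    ss_targets = k * np.sum((target_means - grand_mean) ** 2)
    ss_raters = n * np.sum((rater_means - grand_mean) ** 2)
    ss_error = np.sum((ratings - grand_mean) ** 2) - ss_targets - ss_raters

    ms_targets = ss_targets / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_targets - ms_error) / (
        ms_targets + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )

# Hypothetical example: 10 pilot EPAs rated by 5 raters on a 1-5 scale
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(10, 5)).astype(float)
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")  # values < 0.4 would be flagged as poor
```

Under the thresholds described above, items falling below 0.4 in such a calculation would correspond to the poorly performing domains that were ultimately dropped.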

Rating the Quality of EPAs

The final instrument was called the Quality of EPA (QUEPA) tool. Since the items in these five domains had demonstrated good to excellent inter-rater reliability, the team members were each randomly assigned an equally sized subset (n = 46) of the locally developed EPAs, which they rated independently using the QUEPA tool. Each team member also rated the Alliance for Academic Internal Medicine (AAIM) End of Training EPAs (EOTEPAs) to assess the application of the QUEPA tool to EPAs that were not locally developed. Study data were collected and managed using REDCap electronic data capture tools hosted at Mayo Clinic Rochester.22 REDCap (Research Electronic Data Capture) is a secure, web-based application designed to support data capture for research studies.

Study Variables

Independent variables were primary ACGME competency (patient care, ICS, professionalism, medical knowledge, SBP, and PBLI), practice location (inpatient, outpatient), rotation type (general medicine, medicine subspecialty, non-medicine specialty), and activity type (EPA vs. OPA). For primary ACGME competency and activity type, each team member classified every locally developed EPA. Although EPAs reflect multiple competencies, the group assessed which competency they felt was most represented by each EPA. This is a relevant step, as many programs will likely use performance in EPAs to assess resident progress in the sub-competencies, and high-quality EPAs will need to be created to inform these decisions. In circumstances where the team did not agree, author JP adjudicated the final decision. The outcome variable used to calculate associations with the clinical context was the average QUEPA score (scale 1–5).

Data Analysis

Characteristics of rotation directors, including sex, academic rank, time on faculty, and proceduralist versus non-proceduralist specialty, were summarized using descriptive statistics. For the purposes of this study, non-proceduralist specialties were defined as general medicine, allergy, endocrinology, hematology, infectious diseases, nephrology, neurology, preventive medicine, and rheumatology; proceduralist specialties were defined as cardiology, gastroenterology, emergency medicine, and pulmonary critical care. The dimensionality of the final 15-item QUEPA instrument was determined using factor analysis with orthogonal rotation. To account for the clustering of multiple ratings within raters and EPAs, we generated an adjusted correlation matrix using generalized estimating equations. This adjusted correlation matrix was then used to perform confirmatory factor analysis with orthogonal rotation. As a sensitivity analysis, we also performed factor analysis using an unadjusted correlation matrix and within rater and EPA combinations separately. Factors were identified using the minimal proportion criterion, and the threshold for item retention was a factor loading > 0.4. We then calculated the percentage of shared variance that the extracted factors contributed to the original variables. Internal consistency reliability for items within factors and overall was calculated using Cronbach α, with coefficients > 0.7 considered acceptable. ANOVA models with random effects for rotation directors were used to compare overall QUEPA scores for subcategories within ACGME competency, practice location, rotation type, and activity type. The level for statistical significance was set at α = 0.05. Statistical analyses were conducted using SAS 9.3 (SAS Institute Inc., Cary, North Carolina).
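The analyses above were conducted in SAS 9.3. As an illustrative sketch only, the Python code below mirrors two of the described steps: Cronbach α for the items loading on one factor, and a random-effects comparison of mean QUEPA scores across ACGME competencies with a random intercept for rotation director. The file and column names (quepa_ratings.csv, focus_1 through focus_3, quepa_score, competency, rotation_director) are hypothetical, and the GEE-adjusted factor analysis is not reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of item columns (one row per rated EPA)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# One row per EPA rating; all column names below are hypothetical
ratings = pd.read_csv("quepa_ratings.csv")

# Internal consistency for the three items loading on one factor
focus_alpha = cronbach_alpha(ratings[["focus_1", "focus_2", "focus_3"]])
print(f"Focused factor Cronbach alpha: {focus_alpha:.2f}")  # > 0.7 considered acceptable

# Mixed model analogous to an ANOVA with a random effect for rotation director:
# fixed effect of ACGME competency on the overall QUEPA score
model = smf.mixedlm(
    "quepa_score ~ C(competency)",
    data=ratings,
    groups=ratings["rotation_director"],
)
print(model.fit().summary())
```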

RESULTS

Summary of Rotation Directors who Submitted EPAs

Of the 33 rotation directors, 15 (45 %) provided a total of 229 EPAs for 20 rotations. Participating rotation directors were predominantly female (n = 9, 60 %) and from non-proceduralist specialties (n = 10, 67 %). Academic ranks of the rotation directors were Instructor (n = 2, 13 %), Assistant Professor (n = 11, 73 %), and Associate Professor (n = 2, 13 %), with an average of 9.3 years on faculty (SD = 7.1).

Factor Analysis and Internal Consistency Reliability

Factor analysis revealed that the five essential domains of an EPA reduced to four distinct factors when the QUEPA instrument was applied (number of items in parentheses): Realistic and Generalizable (6), Observable (3), Focused (3), and Multiple Competencies (3). These factors explained 100 % of the shared variance among the original items. Internal consistency for the individual factors was excellent (Cronbach α range: 0.950 to 0.990) (Table 1).

Table 1. Quality of Entrustable Professional Activity (QUEPA) Rating Instrument: Item Loadings and Internal Consistency Reliability

Associations Between QUEPA Scores and Competency or Clinical Context

Statistically significant associations were seen between mean QUEPA score and ACGME competency (p < 0.0001), activity type (p < 0.0001), and practice location (p = 0.03). Across the six competencies, ICS EPAs were rated highest (4.00), while medical knowledge EPAs were rated lowest (3.34). Entrustable Professional Activities scored higher on average (4.00) than Observable Practice Activities (3.51), and EPAs from inpatient rotations scored higher on average (3.88) than their outpatient counterparts (3.66). There were no significant associations seen between QUEPA scores and rotation type (Table 2).

Table 2. Associations Between Quality of Entrustable Professional Activity (QUEPA) Scores with ACGME Competency, Rotation Practice Location, Rotation Specialty, and Entrustable Professional Activity (EPA) Versus Observable Practice Activity (OPA)
Table 3. End of Training EPA Ratings Using QUEPA

Ratings of End of Training EPAs

QUEPA ratings of EOTEPAs showed good to excellent inter-rater reliability (overall mean ICC = 0.75) (Table 3). The overall mean QUEPA score for EOTEPAs was 3.83, with the highest rated factor being Multiple Competencies (4.53), and the lowest rated factor being Focused (2.60). The highest rated individual EOTEPA was “facilitate family meetings” (4.43), and the lowest rated EOTEPA was “Improve the quality of health care at both the individual and systems level” (3.31).

DISCUSSION

To our knowledge, this is the first study to create and validate an instrument for rating the quality of EPAs. This instrument also allowed us to determine associations between EPA quality and characteristics of the clinical context in an internal medicine residency training program. We found that EPAs written to assess medical knowledge scored lowest among the six ACGME competencies; that EPAs for outpatient rotations scored lower than EPAs for inpatient rotations; and that EPAs were rated more highly than OPAs. These findings have implications for creating and refining EPAs for learner assessment in residency training.

We found a significant difference in QUEPA scores, with EPAs assessing medical knowledge scoring lower than EPAs for the five other competencies. A potential reason for this finding is that EPAs are designed for evaluating activities, such as communication, that integrate multiple skills and tasks.6 Previous studies have found that assessing the medical knowledge of trainees is challenging;23–25 therefore, it is understandable that faculty would have difficulty creating EPAs for assessing medical knowledge. Furthermore, it can be challenging in the clinical setting to distinguish resident performance in the medical knowledge competency from performance in the patient care competency based solely on observation. Hence, the most effectively written EPAs are more likely to describe an observable patient care behavior rather than the simple recollection of facts.

We also found that EPAs written for inpatient rotations were rated higher than EPAs written for outpatient rotations. This finding supports a previous study showing that EPAs written for the inpatient setting had the highest ratings.13 One explanation for this finding is that learner–patient interactions are more likely to be observed in the inpatient setting than in the outpatient setting. Traditionally, hospital rounds create opportunities to see a learner’s performance at the bedside, whereas outpatient clinics often require only the confirmation of key findings by the faculty, which may allow for less frequent direct observation. Moreover, senior residents usually function more independently in the clinic and are thus supervised less directly by faculty members. Consequently, we suspect that faculty members who teach primarily in the hospital were better positioned to write higher quality EPAs that focused on observed behaviors. This finding suggests that clinic-based faculty might consider models of teaching that include more direct observation.

Another finding was that activities identified as EPAs had higher QUEPA ratings than those identified as OPAs. When we asked rotation directors to create EPAs, some submitted items that reflected OPAs, as they were much more limited in scope or setting. These OPAs may be very commonly observed in some rotation settings, but they are unlikely to generalize to other settings or contexts. An example of an OPA from our study was “manage hypertension in the peripartum period.” While this is an important skill, it is not likely to generalize to other settings or patient populations. One EPA from our study reads, “Recommend appropriate preventive services or diagnostic interventions based on review of the patient’s current goals of care, prognosis, and evidence based guidelines.” This example illustrates an activity that extends beyond a single setting or context to include a range of behaviors reflecting degrees of skill over the course of residency training. Using the QUEPA instrument to rate the quality of EPAs may help programs avoid choosing items that are overly narrow in scope or content.

The QUEPA ratings reported in this study reflect diverse validity evidence.26,27 The main source of validity evidence for this study is content, which is supported by QUEPA items that were based on existing EPA literature, input from team members with expertise in resident education and scale design, and instrument refinement through iterative pilot testing. Internal structure validity is supported by the instrument dimensionality demonstrated by factor analysis and by excellent inter-rater and internal consistency reliability. Criterion (i.e., relations to other variables) validity is supported by associations between QUEPA ratings and meaningful characteristics of the EPAs, including ACGME competency, practice location, and activity type.

This study has several limitations. First, it was performed at a single institution, so the findings may not generalize to all settings. Second, we utilized a cross-sectional study design of locally developed EPAs and one set of nationally developed EPAs; other EPAs will need to be studied in the future. Third, we report only the creation of EPAs and their quality ratings, which represent Kirkpatrick Level 3 (behaviors) and Level 1 (reaction), respectively.28 Nonetheless, a systematic review showed that most education research studies report outcomes at the reaction level,29 and behavior-level outcomes have been noted to strike the ideal balance between feasibility and rigor.30 Finally, it should be noted that the high inter-rater reliability we reported for the assessment of EPAs was achieved through repeated discussions; similar results may not be obtained elsewhere without a similar degree of attention to the rating process.

This study has important implications for graduate medical education. The Education Redesign Committee of the AAIM has recently introduced 16 end-of-training EPAs for internal medicine training.17 The authors note that these EPAs, while well developed and very appropriate for the end of training, may be overly broad. This statement is supported by our study, which revealed that EOTEPAs had low QUEPA scores in the Focused and Observable domains. This finding may reflect that EOTEPAs were developed for summative, end-of-training assessments, and not necessarily for continuous assessment throughout training. Therefore, it is likely that programs will need to develop more narrowly focused EPAs for ongoing assessments of resident performance. The QUEPA tool should help training programs create and identify their own high-quality EPAs. Since the QUEPA contains 15 items with excellent reliability, it would be reasonable, for practical applications, to select a smaller number of items from each of the four QUEPA domains. For the purpose of research, however, it would be advisable to replicate the entire QUEPA, given the potential for factor instability when utilizing an instrument in new educational environments.31

This study provides a new instrument for assessing the quality of EPAs written by internal medicine faculty. We are hopeful that this tool can aid in the development of effective EPAs for evaluating residents’ progress through training. Care should be taken when writing outpatient EPAs and those related to medical knowledge, as these EPA categories were of lower quality on average and may be more challenging to create. Our findings have implications for graduate medical education as residency programs shift to EPA-informed evaluation systems. Further research should study the mapping of EPAs to sub-competencies and core competencies, compare EPA-related assessment with historical assessment methods, determine learners’ perceptions of this new approach, and, most importantly, explore the performance of these EPAs in the actual assessment of resident physicians.