Current theory in medical education emphasizes competencies that extend beyond medical knowledge and procedural skills.1 In Canada, residency training and assessment are organized according to the CanMEDS competency framework, which incorporates seven competencies: the intrinsic CanMEDS roles of Communicator, Collaborator, Health Advocate, Manager, Scholar, and Professional are positioned around the Medical Expert competency, which occupies the central role.2 Primary causes of adverse events in health care are more often related to communication and teamwork than to knowledge deficits or a lack of technical skills (TSs).3-6 High-reliability industries have referred to the skills required to manage critical situations as crisis resource management or non-technical skills (NTSs).7,8 These NTSs are increasingly identified in health care and constitute a perspective on competency that parallels and overlaps the intrinsic competencies of the CanMEDS framework.

Controversy exists in the literature regarding the taxonomy, nomenclature, and definitions of the terms used for managing critical situations in health care. For high-reliability organizations, NTSs have been defined as “the cognitive, social, and personal resource skills that complement TSs and contribute to safe and efficient task performance”.9 Seven skills have been well described: situation awareness, decision-making, communication, teamwork, leadership, managing stress, and coping with fatigue. The adoption of industry concepts, such as TSs and NTSs, is problematic and has recently been questioned because of inaccuracy.10 In contrast with a dichotomous approach, the CanMEDS framework offers a more nuanced view that reflects the complex interaction and overlap of the various generic core competencies for physicians. Nevertheless, the competencies in the CanMEDS framework have not been designed specifically for crisis management.

The Medical Expert competency is traditionally assessed using written, oral, and objective structured clinical examinations (OSCEs).11 Communication and interviewing skills are also assessed using OSCEs.12 The intrinsic CanMEDS competencies as well as NTSs are generally assessed in the workplace training environment and are under-represented in formal assessments. Since it is challenging to assess crisis management skills using traditional assessment tools, simulation has been proposed as an alternative assessment tool to evaluate both technical and non-technical skills.13 This proposal has stimulated the development of various assessment tools to measure the performance of NTSs during clinical crisis simulation. These scales do not explicitly mention CanMEDS, but they overlap with the Communicator, Collaborator, and Manager competencies.8,14 The focus of other evaluation tools is almost solely on the Medical Expert role, neglecting intrinsic competencies and the NTSs discourse.15-17 In our view, CanMEDS has the requisites to serve as the conceptual framework to facilitate translation and integration of the industry concepts of TSs and NTSs to medical and health care education.

The objective of this study was to integrate CanMEDS intrinsic and Medical Expert competencies with NTSs into a Generic Integrated Objective Structured Assessment Tool (GIOSAT) that is capable of assessing residents’ performance during clinical crisis simulation and is appropriate for different specialties, scenarios, and environments. We investigated the reliability and construct validity of the GIOSAT using simulation scenarios with anesthesia residents.

Methods

Development of the tool

Using an approach similar to that of other researchers,18 we developed the GIOSAT following the steps shown in Fig 1.19 A group of five physicians (V.M.N., A.N., P.M., D.D., H.W.) from a variety of clinical backgrounds in anesthesia, intensive care, and obstetrics and with experience in postgraduate medical education defined the purpose of the assessment tool.

Fig. 1

Generic Integrated Objective Structured Assessment Tool (GIOSAT) development process. (Modified with permission from: Hamstra SJ. Keynote address: the focus on competencies and individual learner assessment as emerging themes in medical education research. Acad Emerg Med 2012; 19(12): 1336-43. Chichester, UK: Wiley)19

Two physicians and a senior investigator (V.M.N., H.W., and S.H.) performed a literature search in English and French using the internet search engines, PubMed, Medline, and Ovid, for the period from January 2000 to December 2010. Search terms included: assessment, assessment tools, simulation, crisis resource management, non-technical skills, anesthesia, critical care, surgery, emergency medicine, medicine, and pediatrics. We included articles that tested an assessment tool during crisis simulation and investigated construct validity as part of the study. We hand-searched reference lists from relevant papers and also included resulting key articles. The original search produced 828 publications, and the abstracts of these articles were screened, leaving 86 articles for full text review. Eighteen publications were finally selected. The retained articles were classified according to their main focus as NTSs,8,14,20-24 Medical Expert competency,16,17,25-30 or both.13,15,31 The assessment tools were classified as checklists and global rating scales (GRS) (Fig. 2), and items in each assessment tool were classified according to the evidence of construct validity.32 Assessment tool content was mapped to corresponding CanMEDS competencies and rated as 0 = not applicable or 1 = applicable (table available as Electronic Supplementary Material).

Fig. 2

Literature search and article selection process for assessment methods in clinical crisis simulation

We used a modified Delphi process33 to design an assessment tool based on the existing literature. A draft of the assessment tool was sent to the original group, and after two iterations of the first version, the GIOSAT was developed. We designed a GRS of four points (1 = poor, 2 = marginal, 3 = acceptable, 4 = good) with free text for comments divided into two sections: Medical Expert competency with 12 items and intrinsic CanMEDS competencies with eight items.

Pilot study

After obtaining Ethics Board approval at the Children’s Hospital of Eastern Ontario Research Institute (Dec 3, 2008), we performed a pilot study (Dec 2008 to May 2009) with two video-recorded pediatric anesthesia scenarios (laryngospasm and hyperkalemia) based on the perioperative cardiac arrest and closed claims studies in pediatric anesthesia.6,34 We invited anesthesia residents (postgraduate year [PGY] 3-5) to participate in the study and obtained informed consent. Ten residents participated in the laryngospasm scenario and 14 participated in the hyperkalemia scenario.

Four trained raters independently assessed the residents’ video-recorded performances using the GIOSAT. Inter-rater reliability was analyzed using the intraclass correlation coefficient (ICC) and categorized according to Landis and Koch.35 The demographics of residents participating in the pilot study are shown in Table 1. The hyperkalemia scenario had moderate to substantial ICCs in all Medical Expert competencies. The Communicator, Collaborator, and Manager competencies had substantial ICCs (0.69-0.77), but there was poor reliability for the Professional competency (ICC = 0.06). Health Advocate and Scholar roles were not identified in these scenarios and were left blank by raters (Table 2). We did not undertake a formal validity analysis for this pilot study due to the small number of participants.

Table 1 Pilot study
Table 2 Inter-rater reliability of four raters using the first version of the GIOSAT

Validation study

The pilot study underwent critical review by an expert group of medical education researchers (D.B., S.B., V.N., and S.H.). The results of this process were used to redesign the tool, and a second study was performed to evaluate the new version. The revised version of the GIOSAT is divided into two sections: Medical Expert competencies with eight items and intrinsic competencies with six items. Each item has abbreviated descriptors and is scored with a GRS (1 = very poor to 6 = very good). Summed Medical Expert items, intrinsic items, and total GIOSAT scores are 8-48, 6-36, and 14-84, respectively (Appendix; available as Electronic Supplementary Material).
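The summed score ranges quoted above follow directly from the number of items and the bounds of the rating scale. The following Python sketch is only an illustration of that arithmetic; the function and constant names are hypothetical and are not part of the GIOSAT itself.

```python
# Illustrative sketch: names are hypothetical, values are taken from the text
# (8 Medical Expert items, 6 intrinsic items, each rated from 1 to 6).
SCALE_MIN, SCALE_MAX = 1, 6
MEDICAL_EXPERT_ITEMS = 8
INTRINSIC_ITEMS = 6


def score_range(n_items, low=SCALE_MIN, high=SCALE_MAX):
    """Return the (minimum, maximum) possible sum over n_items ratings."""
    return n_items * low, n_items * high


def summed_scores(medical_expert, intrinsic):
    """Sum item ratings per section and overall, checking counts and bounds."""
    if len(medical_expert) != MEDICAL_EXPERT_ITEMS:
        raise ValueError("expected 8 Medical Expert item ratings")
    if len(intrinsic) != INTRINSIC_ITEMS:
        raise ValueError("expected 6 intrinsic item ratings")
    for rating in list(medical_expert) + list(intrinsic):
        if not SCALE_MIN <= rating <= SCALE_MAX:
            raise ValueError(f"rating {rating} is outside the 1-6 scale")
    me_total = sum(medical_expert)
    intrinsic_total = sum(intrinsic)
    return me_total, intrinsic_total, me_total + intrinsic_total


if __name__ == "__main__":
    print(score_range(MEDICAL_EXPERT_ITEMS))                    # Medical Expert section
    print(score_range(INTRINSIC_ITEMS))                         # intrinsic section
    print(score_range(MEDICAL_EXPERT_ITEMS + INTRINSIC_ITEMS))  # total GIOSAT score
```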

In this second study, we used video recordings from other previously reported research36 in order to examine the reliability and validity of the newly designed GIOSAT scale. Research Ethics Board approval was obtained from St. Michael’s Hospital, Toronto for the original study (October 3, 2008) comparing debriefing techniques with a pre-test/post-test design. Only pre-test video recordings were used in our analysis to avoid any bias. The sample size was calculated a priori for the original study. We used the method recommended by Cohen for an intraclass correlation of 0.7 using four raters for a power of 0.8 and found that 35 participants were required.37 We decided to include all pre-test video recordings (n = 50). An amendment was approved by the same Research Ethics Board to use the previously collected video recordings for our study (February 3, 2010). Written informed consent was previously obtained from the subjects.

Fifty anesthesia residents with different levels of training (PGY 2-5) were randomized to perform in one of two intraoperative advanced cardiac life support (ACLS) scenarios lasting five minutes: ventricular fibrillation due to hyperkalemia or pulseless ventricular tachycardia secondary to myocardial infarction.36 Four independent raters (three anesthesiologists and one emergency physician) from the University of Ottawa were trained to use the GIOSAT.38 Raters, blinded to subjects’ identity and PGY level of training, independently scored all residents’ performances using the GIOSAT (October-November 2010). To examine the relative difficulty of the two scenarios, GIOSAT scores and competency scores were compared between scenarios using Student’s t test.

Investigation of reliability

The ICC was determined with a consistency definition using a two-way random model for both single measures (individual rater) and average measures (the average of the four raters’ scores) for total GIOSAT scores, summed Medical Expert scores, summed intrinsic scores, and individual item scores.
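As a rough illustration of the two quantities described above, the following Python sketch computes consistency-type ICCs from a subjects-by-raters table using the standard two-way analysis of variance mean squares: single measures as (MSR - MSE) / (MSR + (k - 1) MSE) and average measures as (MSR - MSE) / MSR, where MSR is the between-subjects mean square, MSE the residual mean square, and k the number of raters. This is a minimal sketch under those textbook formulas, not the SPSS procedure used in the study; the function name and the ratings are made up for illustration.

```python
def icc_consistency(ratings):
    """Consistency ICCs from a table with one row per subject, one column per rater.

    Returns (single_measure, average_measure).
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 subjects and 2 raters")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in rater_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)
    average = (ms_subjects - ms_error) / ms_subjects
    return single, average


if __name__ == "__main__":
    # Made-up ratings: 6 subjects scored by 4 raters (not study data).
    example = [
        [9, 2, 5, 8],
        [6, 1, 3, 2],
        [8, 4, 6, 8],
        [7, 1, 2, 6],
        [10, 5, 6, 9],
        [6, 2, 4, 7],
    ]
    single, average = icc_consistency(example)
    print(f"single measures ICC = {single:.2f}, average measures ICC = {average:.2f}")
```

Because the consistency definition ignores systematic differences between raters, a rater who is uniformly stricter than the others does not lower these coefficients. The average measures value is at least as high as the single measures value whenever the latter is positive, which matches the pattern reported for the GIOSAT.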

Investigation of construct validity

Our primary outcome was the correlation between total GIOSAT scores and PGY level, as measured using Spearman’s correlation coefficient (Rho). Our secondary outcomes were correlations between PGY level and summed scores for Medical Expert items, summed scores for intrinsic items, and individual items. We hypothesized that residents’ performances would improve with level of training.
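For readers unfamiliar with Spearman’s Rho, the following Python sketch shows one common way to compute it: convert both variables to ranks (tied values share the average of their ranks, which matters here because PGY level takes only a few distinct values) and then compute the Pearson correlation of the ranks. It is a minimal illustration, not the SPSS routine used in the study; the function names and the example data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's Rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical PGY levels and total scores (not study data).
    pgy = [2, 2, 3, 3, 4, 4, 5, 5]
    total = [40, 52, 47, 55, 50, 63, 58, 70]
    print(f"Rho = {spearman_rho(pgy, total):.2f}")
```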

Comparison of performance in the two scenarios

The potential confounding effect of gender and type of scenario on the GIOSAT score was analyzed with a two-way analysis of variance. A Z-test was performed to compare ICCs between scenarios, with correction for multiple testing using Holm’s method.39 P values < 0.05 were considered significant. We used SPSS® version 18 (SPSS Inc. 2010, Chicago, IL, USA) for the statistical analysis.
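Holm’s method is a step-down correction: the P values are sorted in ascending order, the smallest is multiplied by the number of tests m, the next by m - 1, and so on, and each adjusted value is kept at least as large as the one before it. The Python sketch below illustrates this procedure in general terms; the function name and the example P values are hypothetical and are not the values from this study.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted P values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The (rank+1)-th smallest P value is multiplied by (m - rank), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03]  # hypothetical unadjusted P values
    for p, p_adj in zip(raw, holm_adjust(raw)):
        flag = "significant" if p_adj < 0.05 else "not significant"
        print(f"raw P = {p:.3f}, Holm-adjusted P = {p_adj:.3f} ({flag})")
```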

Results

Results of the validation study

The revised GIOSAT was used in the second study. Residents’ distribution according to scenario, gender, and level of training is shown in Table 3. The PGY level was similar in both scenarios, but there was an apparent imbalance in gender distribution between scenarios (Fisher’s exact test P = 0.045). No significant differences in GIOSAT scores were found between scenarios: Medical Expert scores P = 0.40; intrinsic scores P = 0.56; and total scores P = 0.54 (Student’s t test) (Table 4). Scholar was scored as not applicable (N/A) by raters in 69.5% of the ratings (139/200), and for that reason, it was excluded from the analysis.

Table 3 Demographic characteristics of 50 anesthesia residents performing two ACLS scenarios
Table 4 GIOSAT score results in two ACLS scenarios (VF/ VT) in 50 anesthesia residents

Investigation of reliability

Inter-rater intraclass correlations with pooled results of both scenarios are shown in Fig. 3. The ICCs were substantial for single measure summed scores of Medical Expert competencies (0.69), intrinsic competencies (0.62), and total GIOSAT scores (0.62). The single measure ICCs for individual Medical Expert items were moderate to substantial (0.43-0.69), except for the item “examine the patient and equipment” (0.29). The single measure ICCs for individual intrinsic items were fair to moderate (0.36-0.60). The average measure ICCs were substantial to almost perfect for individual items (0.61-0.90) and almost perfect for summed Medical Expert items, summed intrinsic items, and total scores (0.87-0.90).
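The verbal labels used here (fair, moderate, substantial, almost perfect) correspond to the commonly cited Landis and Koch bands for agreement coefficients. The Python sketch below encodes those bands as they are usually quoted (below 0 poor, 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect); the function name is hypothetical and the cut-offs are assumed from the usual statement of the scale rather than taken from this article.

```python
# Upper bound of each band, as commonly quoted for the Landis and Koch scale.
LANDIS_KOCH_BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def landis_koch_label(coefficient):
    """Map an agreement coefficient (for example an ICC) to a verbal label."""
    if coefficient > 1.0:
        raise ValueError("coefficient cannot exceed 1")
    if coefficient < 0:
        return "poor"
    for upper_bound, label in LANDIS_KOCH_BANDS:
        if coefficient <= upper_bound:
            return label
    raise ValueError("unreachable for coefficients in [0, 1]")


if __name__ == "__main__":
    # Single measure ICCs quoted in the text for the summed scores.
    for name, icc in [("Medical Expert", 0.69), ("intrinsic", 0.62), ("total", 0.62)]:
        print(f"{name}: ICC = {icc:.2f} ({landis_koch_label(icc)})")
```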

Fig. 3

Intra-class correlation coefficients (ICCs) with 95% confidence intervals (CI) for Generic Integrated Objective Structured Assessment Tool (GIOSAT) total scores and items. Dots represent ICCs and lines represent 95% CI of GIOSAT scores from four raters assessing two simulated advanced cardiac life support (ACLS) scenarios with 50 anesthesia residents. Ventricular fibrillation (n = 27), ventricular tachycardia (n = 23). Examine Pt and equipment = examine patient and equipment; Diagnosis & dif. = diagnosis and differentials

Investigation of construct validity

For our primary outcome, we found a significant correlation between PGY and total GIOSAT scores (r = 0.36; P = 0.011) (Fig. 4). For our secondary outcomes, we found a significant correlation between PGY and summed Medical Expert competencies (r = 0.42; P < 0.003), but not for summed intrinsic competencies (r = 0.24; P = 0.09).

Fig. 4

Relationship between Generic Integrated Objective Structured Assessment Tool (GIOSAT) total score and postgraduate year. Spearman’s correlation coefficient (Rho) = 0.36 (P = 0.011). GIOSAT score is the sum of all item scores with a maximum of 84. For each box, the dark horizontal line represents the median. The bottom of the box represents the 25th percentile, and the top represents the 75th percentile. A whisker extends down from the bottom of the box to the lowest data point within 1.5 times the interquartile range (IQR) of the 25th percentile. A whisker extends up from the top of the box to the highest data point within 1.5 IQR of the 75th percentile. Points beyond the whiskers are represented as open circles
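The box-and-whisker convention described in this caption can be reproduced with a few lines of Python. The sketch below is an illustration only: it uses the standard library’s "inclusive" quartile method, which is an assumption and may differ slightly from the quartile definition used by the statistical software that produced the figure, and the function name is hypothetical.

```python
from statistics import quantiles


def box_summary(scores):
    """Median, quartiles, whisker ends, and outlying points for one box."""
    ordered = sorted(scores)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [s for s in ordered if low_fence <= s <= high_fence]
    outliers = [s for s in ordered if s < low_fence or s > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,  # drawn as open circles in the figure
    }


if __name__ == "__main__":
    # Made-up scores, not study data.
    print(box_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```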

Comparison of performance in the two scenarios

The results of the analysis of variance showed that residents’ GIOSAT scores were not significantly influenced by scenario type or residents’ gender. We found significant differences between scenarios for single measure ICCs in two of the Medical Expert competencies, diagnosis and differentials and dealing with changing situations (P = 0.03 and 0.006, respectively), and in one intrinsic competency, Collaborator (P = 0.006). No differences between scenarios were found in average measure ICCs.

Discussion

We developed a global rating scale, the GIOSAT, that integrates NTSs concepts with the CanMEDS competencies for generic assessment during crisis simulation. A pilot study was used to refine the GIOSAT through an iterative process. We found evidence of substantial reliability (single measures) for the refined tool using four raters for Medical Expert, intrinsic, and total scores. There was evidence of construct validity for the Medical Expert and total scores, but we found no evidence of construct validity for the intrinsic scores. The Medical Expert section of the GIOSAT has good psychometric properties and could be used for summative assessment during simulated crises. Although the metrics for reliability and validity for the total score are acceptable, the intrinsic competencies section is problematic, which may raise concerns for total scores as well.

Non-technical skills have been described as being difficult to define10,40 and assess.8,14,20 The reliability of the GIOSAT is comparable with that reported in previous studies of NTSs, which have shown fair to substantial reliability. As would be expected, NTSs scales have greater reliability when their items are summed, as we have found with the GIOSAT. Our finding of poor reliability and construct validity for some of the intrinsic competencies not included in the NTSs discourse is in keeping with the literature on professionalism, which is not often taught formally (i.e., it is part of the hidden curriculum) and is difficult to assess.41,42

Our study has a number of limitations. The data are based on a re-analysis of scenarios not designed to identify certain competencies, such as Professional, Health Advocate, and Scholar. It may also be that the ACLS scenarios were too short to identify all CanMEDS competencies fully. Increasing the number of scenarios and changing scenario design may improve construct validity of the intrinsic section. Reliability has also been shown to improve with increased testing time; for instance, several hours of testing and ten or more cases are required in OSCEs used for high-stakes examinations.12

A further limitation is the possibility that intrinsic competencies are underrepresented in GIOSAT and the Medical Expert role is overrepresented. Future iterations of the GIOSAT tool may emphasize the intrinsic competencies either in terms of the number of items or the extensiveness of the descriptors. According to current concepts, reliability and validity are not properties of the instrument but properties of the instrument’s scores and interpretations.43 The same instrument used in a different setting or with different subjects may produce different results; thus, our results may not necessarily be generalizable to other populations.

On the basis of its psychometric properties, the Medical Expert competencies section of the GIOSAT is appropriate for use in summative assessment. The intrinsic competencies section and, by implication, the scale as a whole are not yet appropriate for summative assessment.

The aim of future research will be to identify the number of raters, scenarios, and examinees necessary to establish reliability and generalizability. The development of scenarios specifically designed to challenge specific domains (Professional, Health Advocate and Scholar: PHAS Roles) may be required for the appropriate use of GIOSAT for summative assessment.