Measurement of faculty anesthesiologists’ quality of clinical supervision has greater reliability when controlling for the leniency of the rating anesthesia resident: a retrospective cohort study
Our department monitors the quality of anesthesiologists’ clinical supervision and provides each anesthesiologist with periodic feedback. We hypothesized that greater differentiation among anesthesiologists’ supervision scores could be obtained by adjusting for leniency of the rating resident.
From July 1, 2013 to December 31, 2015, our department utilized the de Oliveira Filho unidimensional nine-item supervision scale to assess the quality of clinical supervision provided by faculty as rated by residents. We examined all 13,664 ratings of the 97 anesthesiologists (ratees) by the 65 residents (raters). Testing for internal consistency among answers to questions (large Cronbach’s alpha > 0.90) was performed to rule out that one or two questions accounted for leniency. Ratees were compared using mixed-effects logistic regression, which controlled for rater leniency, vs Student’s t tests, which did not.
The mean supervision scale score was calculated for each combination of the 65 raters and nine questions. The Cronbach’s alpha was very large (0.977). The mean score was calculated for each of the 3,421 observed combinations of resident and anesthesiologist. The logits of the percentage of scores equal to the maximum value of 4.00 were normally distributed (residents, P = 0.24; anesthesiologists, P = 0.50). There were 20/97 anesthesiologists identified as significant outliers (13 with below average supervision scores and seven with better than average) using the mixed-effects logistic regression with rater leniency entered as a fixed effect but not by Student’s t test. In contrast, there were three of 97 anesthesiologists identified as outliers (all three above average) using Student’s t tests but not by logistic regression with leniency. The 20 vs 3 was significant (P < 0.001).
Use of logistic regression with leniency results in greater detection of anesthesiologists with significantly better (or worse) clinical supervision scores than use of Student’s t tests (i.e., without adjustment for rater leniency).
Keywords: Average Quality; Clinical Supervision; Rater Leniency; Anesthesia Resident; Periodic Feedback
Table 1 de Oliveira Filho et al.’s instrument6 for measuring faculty anesthesiologists’ supervision of residents during clinical operating room care
1. The faculty provided me timely, informal, nonthreatening comments on my performance and showed me ways to improve
2. The faculty was promptly available to help me solve problems with patients and procedures
3. The faculty used real clinical scenarios to stimulate my clinical reasoning, critical thinking, and theoretical learning
4. The faculty demonstrated theoretical knowledge, proficiency at procedures, ethical behaviour, and interest/compassion/respect for patients
5. The faculty was present during the critical moments of the anesthetic procedure (e.g., anesthesia induction, critical events, complications)
6. The faculty discussed with me the perianesthesia management of patients prior to starting an anesthetic procedure and accepted my suggestions, when appropriate
7. The faculty taught and demanded the implementation of safety measures during the perioperative period (e.g., anesthesia machine checkout, universal precautions, prevention of medication errors, etc.)
8. The faculty treated me respectfully and strived to create and maintain a pleasant environment during my clinical activities
9. The faculty gave me opportunities to perform procedures and encouraged my professional autonomy
Table 2 Previous findings regarding supervision of anesthesia residents and nurse anesthetists by faculty anesthesiologists
1. Supervision is a single-dimensional construct that incorporates several different attributes, including participation in perianesthesia planning, availability for help/consultation, presence during critical phases of the anesthetic, and fostering safety measures6,7,10,12
2. Supervision can be quantified reliably using an instrument with nine questions, each question assessing a different attribute of supervision.6 The nine questions take < 90 sec to complete. The Cronbach’s alpha achieved in routine use was equal to 0.948 ± 0.001 (SE)10
3. Raters evaluate how often each attribute is demonstrated by the anesthesiologist (never = 1; rarely = 2; frequently = 3; and always = 4), and the supervision score is the mean of the nine answers.6 When each anesthesiologist’s mean resident and mean nurse anesthetist scores were paired, the means were correlated (P < 0.0001).8,9 Thus, the behaviour and attributes used to assess the quality of an anesthesiologist’s supervision have significant commonality between residents and nurse anesthetists8,9
4. There were very small differences in anesthesiologist supervision scores provided by residents when 1) a resident had more units of work that day with the rated anesthesiologist (“units together”; τb = 0.083 ± 0.014) or 2) the rated staff anesthesiologist had more units of work that same day with other providers (“units not together”; τb = −0.057 ± 0.014).8 Anesthesiologists’ mean supervision scores provided by residents and nurse anesthetists were not correlated with anesthesiologists’ semi-annual clinical activity (multiple tests, all P > 0.65).4 A very active clinician can provide ineffective supervision, and a less active clinician can be very effective.4 Supervision served as an independent contributor to the value that an individual anesthesiologist added to the care of the patient4
5. The most common supervision score provided by nurse anesthetists was 4.0 (P < 0.0001), indicating that all of the questions were considered important, including those related to teaching.9 Anesthesiologist supervision scores provided by residents are even greater (P < 0.0001).8,9 The pairwise differences by anesthesiologist are also significantly greater than zero (P < 0.0001)8,9
6. All residents evaluated all anesthesiologists’ supervision during a study performed during a single weekend, such that each resident was in one class (e.g., “CA-1”, “CA-2”, etc.).7 There was no association between residents’ perception of supervision by anesthesiologists that met expectations and years since the start of training (P = 0.77).5 There were very small differences among classes (mean differences ≤ 0.07 units).7 Thus, “residents” can be treated as a single group, regardless of total years of clinical experience.
7. Mean resident scores for anesthesiologist’s supervision were correlated with mean resident choice of the anesthesiologist to care for their family (Kendall’s τb = +0.77; P < 0.0001),7 mean resident evaluations of the anesthesiologist’s clinical teaching (τb = +0.87; P < 0.0001),7 and mean nurse anesthetist scores for anesthesiologist’s supervision (τb = +0.36; P < 0.0001 among all anesthesiologists and τb = +0.51; P < 0.0001 among those with 15 raters of each type).
8. When the supervision instrument was applied to departments12 (Tables 2.14 and 2.15), the internal consistency (Cronbach’s alpha) of the scale was 0.909 ± 0.007. Convergent validity was based on a positive correlation between supervision and variables related to safety culture (all P < 0.0001): “Overall perceptions of patient safety”, “Teamwork within units”, “Non-punitive response to errors”, “Handoffs and transitions”, “Feedback and communication about error”, “Communication openness”, and the rotation’s “overall grade on patient safety”.12 Convergent validity was based on significant negative correlation with variables related to the rater’s burnout (all P < 0.0001): “I feel burnout from my work”, “I have become more callous toward people since I took this job”, and “errors with potential negative consequences to patients (that you have) made and/or witnessed”.12 Among these variables, supervision was most closely predicted by the same one variable using multiple types of regression trees: “Teamwork within (the rotation)” (e.g., “When one area in this rotation gets busy, others help out”).12 Discriminant validity was based on absence of rank correlation of supervision score with characteristics of raters and programs (all P > 0.10): age, hours worked per week, sex, promptness of survey response, number of survey raters from the department, and rotation (specialty) (as random effect)12
9. There was no significant association between anesthesiologist supervision score and the number of occasions that a resident rater had worked with the anesthesiologist, based on billing data (by patients, τb = +0.01; P = 0.71 and by days, τb = −0.01; P = 0.46)7
10. Among anesthesia residents, “the mean ± standard deviation of staff supervision scores that meets expectations”, neither “exceeds expectations” nor is “below expectations” was 3.40 ± 0.30.5 “Most … residents (94%) perceived that supervision that met their expectations was at least frequent (i.e., a score ≥ 3.0)” (P < 0.0001).5 These values were greater than for nurse anesthetists (P < 0.0001)5
11. Anesthesia departments can measure individual anesthesiologists’ supervision with high reliability (i.e., mean score is known with precision) when supervision scores are provided by at least nine different resident raters per anesthesiologist.7 Monitoring is done by taking each of the raters’ mean supervision scores for the anesthesiologist and weighting them equally (i.e., treating each rater’s mean as a single observation)8,15
12. With residents’ evaluations of anesthesiologists, mean supervision scores differed among anesthesiologists based on generalizability analysis (P < 0.0001)7
13. Anesthesiologist performance can be monitored daily using Bernoulli cumulative sum (CUSUM) control charts.15 A reasonable threshold for low scores is < 3.0 for residents.15 The true positive detection of anesthesiologists with incidences of low scores greater than the chosen “out-of-control” rate was 14/14.15 The false-positive detection rate was 0/29.15 Bernoulli CUSUM detection of low scores was within 50 ± 14 (median ± quartile deviation) days15
14. Anesthesia residents’ mean scores for anesthesiologists’ supervision for entire departments were significantly lower (P < 0.0001) than the mean scores for individual anesthesiologist’s supervision.13 The median ratio was 86% ± 1%. The correlation between departmental and mean (individual) anesthesiologists’ scores was τb = 0.35 ± 0.11 (P = 0.0032).13 When considering national survey results, individual anesthesiologists’ supervisory performance needs to be greater.13
15. Anesthesia residents reporting mean supervision scores for their entire department (i.e., the mean of all anesthesiologists) that were < 3.00 (i.e., less than “frequent”) reported anesthesiologists making more “mistakes that had negative consequences for the patient”, with an accuracy (area under the curve) of 89% (99% confidence interval [CI], 77 to 95).11 Supervision less than “frequent” (i.e., < 3.00) predicted “medication errors (dose or incorrect drug) in the last year” with an accuracy of 93% (99% CI, 77 to 98).11 Among residents reporting overall supervision during the current rotation that was less than frequent (i.e., < 3.0) vs frequent, the 10th, 25th, 50th, 75th, 90th, and 95th percentiles of errors were 1 vs 1, 1 vs 1, 2 vs 2, 3 vs 2, 4 vs 3, and 6 vs 4, respectively (P < 0.0001).12 There was no detected effect of resident burnout on numbers of reported errors while controlling for supervision (all P > 0.138 by different types of analyses)12
16. Nurse anesthetists’ comments with (not) “see” or (not) “saw” and the theme “I did not see the anesthesiologist during the case(s) together” increased the odds of a nurse anesthetist providing a supervision score < 3 (odds ratio 48.2; P < 0.0001).9 Many more such comments were made by nurse anesthetists than by residents (P < 0.0001).9 Nevertheless, resident comments of insufficient anesthesiologist presence were associated with evaluation scores that were less than other evaluations with comments (P < 0.0001).10 Each anesthesiologist with at least one such resident comment had lower mean scores than the other anesthesiologists (P = 0.0071)10
17. For both residents and nurse anesthetists, monitoring anesthesiologists’ supervision and providing feedback resulted in greater scores by individual anesthesiologist.4 For example, pairwise by anesthesiologist, the mean supervision scores provided by residents increased by 0.08 ± 0.01 points when equally weighting each anesthesiologist (P < 0.0001) and by 0.04 ± 0.02 points weighting by the precision of the difference (P = 0.0011).4 Similarly, pairwise by anesthesiologist, the supervision scores provided by nurse anesthetists increased by 0.28 ± 0.02 points when equally weighting each anesthesiologist (P < 0.0001) and by 0.27 ± 0.02 points weighted by the precision of the difference (P < 0.0001).4 Among nurse anesthetists, this was due principally to questions associated with teaching (e.g., “stimulate my clinical reasoning, critical thinking, and theoretical learning”)4
18. Among anesthesia residents, evaluations of anesthesiologists with comments related to poor teaching had lower scores than the other evaluations with comments (P < 0.0001).10 The anesthesiologists who each received a comment related to poor teaching had lower mean scores than the other anesthesiologists (P < 0.0001).10 Each increase in the anesthesiologist’s number of comments of poor-quality teaching was associated with a lower mean score (P = 0.0002).10 Likewise, each increase in the anesthesiologist’s number of resident comments of being disrespectful was associated with a lower mean supervision score (P = 0.0002).10 A low supervision score (< 3.00; i.e., less than “frequent”) had an odds ratio of 85 for disrespectful faculty behaviour (P < 0.0001)10
Supervision, in this context, refers to all clinical oversight functions directed toward assuring the quality of clinical care whenever the anesthesiologist is not the sole anesthesia care provider.3-5 The de Oliveira Filho unidimensional nine-item supervision instrument is a reliable scale used to assess the quality of supervision provided by each anesthesiologist (Table 1).6-9 The scale measures all attributes of anesthesiologists’ supervision of anesthesia residents (Table 2.1)6,7,10-13 and has been shown, in multiple studies, to do this as a unidimensional construct (Table 2.2).6,9,10,12 Low supervision scores are associated with written comments about the anesthesiologist being disrespectful, unprofessional, and/or teaching poorly that day (Table 2.18).10,14,15 Scores increase when anesthesiologists receive individual feedback regarding the quality of their supervision (Table 2.17).4 Scores are monitored daily and each anesthesiologist is provided with periodic feedback.3,15
The supervision scale’s maximum value is 4.00, which corresponds to a response of 4 (i.e., “always”) to each of the nine questions (Table 1).6 Because of the ceiling effect, multiple scores of 4.00 reduce the scale’s reliability7,9,10,15 to differentiate performance among the anesthesiologists, even though such differentiation is mandatory (see Discussion).
We previously asked residents to provide a single evaluation for the overall quality of supervision they received from the department’s faculty (i.e., as if intended as an evaluation of the residency program) (Table 2.14).13 We compared those overall scores pairwise with the mean of each resident’s evaluations of all individual anesthesiologists with whom they worked during the preceding eight months.13 Both sets of scores showed considerable heterogeneity among the residents (e.g., some residents provided overall lower scores than those of other residents).13 Consequently, our hypothesis was that greater differentiation among anesthesiologists’ supervision scores could be obtained by incorporating scoring leniency by the resident (rater) into the statistical analysis (i.e., treating a high score as less meaningful when given by a resident who consistently provides high scores, in other words, lenient relative to other raters).1
The University of Iowa Institutional Review Board affirmed (June 8, 2016) that this investigation did not meet the regulatory definition of research in human subjects. Analyses were performed with de-identified data.
From July 1, 2013 to December 31, 2015, our department utilized the de Oliveira Filho supervision scale to assess the quality of clinical supervision by staff anesthesiologists (Table 1).6,7 The cohort reported herein includes all rater evaluations of all staff anesthesiologists (ratees) over that 2.5-year period chosen for convenience. We used five six-month periods because we previously showed that six months was a sufficient duration in our department for nearly all ratees to receive evaluations and for an adequate number of unique raters to differentiate reliably among ratees using the supervision scale.9,10,15
The evaluation process consisted of daily, automated e-mail requests16 to raters to evaluate the supervision provided by each ratee with whom they worked the previous day in an OR setting for at least one hour, including obstetrics and/or non-operating room anesthesia (e.g., radiation therapy).4,8-10 Raters evaluated ratees’ supervision by logging in to a secure webpage.8 The raters could not submit their rating until each of the nine questions was answered with their choice of 1-4: 1 = never; 2 = rarely; 3 = frequently; or 4 = always (Table 1). The “score” for each evaluation was equal to the mean of the responses to the nine questions (Table 1). The scores remained confidential and were provided to the ratees periodically (every six months) only after averaging among multiple raters.1,15,17
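The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the department's software; the function name `supervision_score` and the example answers are hypothetical.

```python
from statistics import mean

# Response options of the nine-item scale, as described in the text.
LABELS = {1: "never", 2: "rarely", 3: "frequently", 4: "always"}

def supervision_score(answers):
    """Return the mean of the nine item responses (each an integer 1-4)."""
    if len(answers) != 9:
        raise ValueError("an evaluation needs all nine answers")
    if any(a not in LABELS for a in answers):
        raise ValueError("each answer must be 1, 2, 3, or 4")
    return mean(answers)

# Hypothetical evaluation: seven answers of "always", two of "frequently".
example = [4, 4, 3, 4, 4, 4, 4, 3, 4]
print(round(supervision_score(example), 2))  # 3.78
```

A score of exactly 4.00 requires all nine answers to be "always", which is why the ceiling is reached by whole evaluations rather than by single questions.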
If one or two of the nine questions resulted in leniency among raters, a potential intervention would have been to either modify the question(s) or provide an example of behaviour that should affect the answer to the question(s) (see Discussion). In contrast, if leniency were present throughout all questions, then an analysis of leniency would need to incorporate the average scores of the raters. The question whether leniency was present in a few vs all questions was addressed by analyzing the mean ratings for each combination of the 65 raters and nine questions. These means had sample sizes of at least 37 answers and mean sample sizes of 210 answers (i.e., sufficient to make the ranks of 1, 2, 3, or 4 into interval levels of measurement). Cronbach’s alpha, a test for internal consistency among answers to questions,2 was calculated using the resulting 65 × 9 matrix, equally weighting each rater. The confidence interval (CI) for Cronbach’s alpha was calculated using the asymptotic method.18
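As an illustration of the internal consistency calculation, the sketch below computes Cronbach's alpha from a rater-by-question matrix of mean answers using the standard variance formula. It is an assumption that this formula matches the software used in the study, and the asymptotic confidence interval is not reproduced. The toy matrix is hypothetical.

```python
from statistics import variance

def cronbach_alpha(matrix):
    """Cronbach's alpha for a matrix with one row per rater and one
    column per question (each cell is that rater's mean answer)."""
    k = len(matrix[0])
    # Sample variance of each question across raters.
    item_vars = [variance([row[j] for row in matrix]) for j in range(k)]
    # Sample variance of each rater's total across the k questions.
    total_var = variance([sum(row) for row in matrix])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical raters who differ only by an overall leniency shift that
# applies equally to all nine questions: alpha is then at its maximum of 1.
shifts = [3.2, 3.5, 3.8, 4.0]
toy = [[s] * 9 for s in shifts]
print(round(cronbach_alpha(toy), 3))  # 1.0
```

A value near 1, as in the toy matrix, corresponds to leniency that is spread across all questions rather than concentrated in one or two of them.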
The same approach was used when calculating the average of the means by ratee. In order to assess whether the scores of individual ratees were unusually low or high, we compared the averages of the means of each ratee with the value of 3.80 using Student’s t test and Wilcoxon signed-rank test. The value of 3.80 is the overall mean supervision score among all ratees’ scores (see Results). The P values using the Wilcoxon signed-rank test were exact, calculated using StatXact® 11 (Cytel Inc., Cambridge, MA, USA). Student’s t test does not adjust for rater leniency.
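The unadjusted comparison can be sketched as follows, assuming that each rater's scores for a ratee are first collapsed to a single mean and that those means are then compared with 3.80. The function names and the example data are hypothetical. Only the t statistic and degrees of freedom are computed here; a P value would come from the t distribution in a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

def rater_means(scores_by_rater):
    """Collapse each rater's scores for one ratee into a single mean,
    so that every rater is weighted equally."""
    return [mean(scores) for scores in scores_by_rater.values()]

def one_sample_t(values, reference=3.80):
    """Return (t statistic, degrees of freedom) for H0: mean == reference."""
    n = len(values)
    t = (mean(values) - reference) / (stdev(values) / sqrt(n))
    return t, n - 1

# Hypothetical ratee evaluated by four residents.
ratee = {
    "resident_A": [4.00, 3.89, 4.00],
    "resident_B": [3.56, 3.67],
    "resident_C": [3.78],
    "resident_D": [3.44, 3.56],
}
values = rater_means(ratee)
t, df = one_sample_t(values)
print(f"mean of rater means = {mean(values):.2f}, t = {t:.2f}, df = {df}")
```

Nothing in this calculation uses information about how the same residents scored other anesthesiologists, which is the sense in which it does not adjust for rater leniency.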
Internal consistency of raters’ answers to the nine questions contributing to the score
Individual questions did not contribute significantly to leniency (i.e., consideration of individual questions could not improve the statistical modelling). Cronbach’s alpha as to the raters’ answers to questions was high in value (0.977; 95% CI, 0.968 to 0.985); therefore, the score for each rating could be used (i.e., the mean of the answers to the nine questions in the supervision scale) (Table 1).
Statistical distributions of rater and ratee scores
We used the 13,664 scores,4 with 3,421 observed combinations of the 65 raters and 97 ratees. In Appendix 2, we show lack of validity of the statistical assumptions for a random effects model in the original score scale.20,21
We treated the rater as a fixed effect to incorporate rater leniency in a mixed-effects logistic regression model. Fig. 2 shows the distribution among raters of the percentage of scores equal to the maximum value of 4.00 (i.e., all nine questions answered “always”). The 65 raters differed significantly amongst each other in terms of the percentages of their scores equal to 4.00 (P < 0.001 using fixed-effect logistic regression).
The mixed-effects model with rater as a fixed effect and ratee as a random effect relies on the assumption that the distribution of the logits among ratees follows a normal distribution, which it does. Specifically, no ratee had all scores equal to 4.00 (i.e., for which the logit would have been undefined because of division by zero). In addition, no ratee had all scores less than 4.00 (i.e., for which the logit would also be undefined). There were 60 ratees each with ≥ 14 scores lower than 4.00 among their ≥ 32 scores (i.e., sample sizes large enough to obtain reliable estimates of the logits).8,22 The logits followed a normal distribution [Lilliefors test, P = 0.50; mean (standard deviation [SD]), −0.781(0.491)].
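The two undefined cases mentioned above follow directly from the definition of the logit. A minimal sketch, assuming the logit is taken of the proportion of a ratee's scores equal to 4.00 (the function name and example data are hypothetical):

```python
from math import log

def ceiling_logit(scores, maximum=4.00):
    """Logit of the proportion of scores equal to the maximum.

    Returns None when the logit is undefined, i.e., when all scores
    equal the maximum (division by zero) or none do (log of zero).
    """
    n_max = sum(1 for s in scores if s == maximum)
    n_below = len(scores) - n_max
    if n_max == 0 or n_below == 0:
        return None
    # log(p / (1 - p)) simplifies to log(n_max / n_below).
    return log(n_max / n_below)

# Hypothetical ratee: 5 of 20 scores are at the ceiling.
scores = [4.0] * 5 + [3.5] * 15
print(round(ceiling_logit(scores), 3))  # -1.099
print(ceiling_logit([4.0] * 3))         # None
```

A negative logit means that fewer than half of the scores were at the ceiling.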
Effectiveness of logistic regression with leniency relative to Student’s t tests (i.e., without adjustment for leniency)
The principal result is that 20/97 ratees were identified as outliers using the logistic regression with leniency, but not by Student’s t tests. There were 3/97 ratees identified as outliers using the Student’s t tests, but not by logistic regression with leniency. The 20 vs 3 is significant; exact P < 0.001 using McNemar’s test. Thus, adjusting for rater leniency increased the ability to distinguish the quality of anesthesiologists’ clinical supervision.
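The 20 vs 3 comparison can be checked with a short calculation. The sketch below uses the usual exact (binomial) form of McNemar's test and assumes that 20 and 3 are the two discordant counts; the function name is hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the two discordant counts.

    Under the null hypothesis, each discordant pair falls in either
    cell with probability 0.5, so the smaller count is Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(mcnemar_exact(20, 3))  # 0.00048828125
```

The result, approximately 0.0005, is consistent with the reported exact P < 0.001.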
In Appendix 3, we confirm the corollary that there is less information from scores < 4.00 vs the percentage of scores equal to the maximum score of 4.00.
In Appendix 4, we show that our previous observation of an increase in supervision score over time with evaluation and feedback (Table 2.17)4 holds when analyzed using logistic regression with leniency.
In Appendix 5, we show that our previous analyses and publications without consideration of rater leniency were reasonable because initially there was greater heterogeneity of scores among ratees.
Graphical presentation of the principal result
In this final section, we examine why incorporating rater leniency increased the sensitivity to detect both below average and above average performance differences among ratees. Readers who are less interested in “why” may want to go directly to the Discussion.
Statistical significance of the logistic regression with leniency depended on the number of scores < 4.00, shown on the vertical axes of Figs 3-6 (see Appendix 1 and Appendix 6). For a given ratee average score, blue circles showing lack of statistical significance are more often present for smaller sample sizes than red triangles and orange squares.
Among the 30 ratees with average scores < 3.80, 13 were not significantly different from the average of 3.80 using the Student’s t test, but were significantly different from the other ratees by logistic regression with leniency (Fig. 3). For illustration, we consider the ratee with an average score of 3.56, shown by the left-most orange square. This score was the smallest value not found to be significantly less than the overall average of 3.80 using the Student’s t test, but found to differ significantly from the other ratees by logistic regression with leniency. In Appendix 7, we show that this finding was caused by substantial variability among raters (i.e., residents) regarding how much the ratee’s quality of supervision was less than the maximum score (4.00).
Among the 53 ratees who had average supervision scores > 3.80 and who had at least nine different raters, seven were not significantly different from average as determined by the Student’s t test, but were significantly different using logistic regression with leniency (Fig. 4). There were 3/53 ratees who were significantly different from average by the Student’s t test, but not significantly different using logistic regression with leniency. For illustration, we consider the ratee with the highest average score. In Appendix 8, we show that logistic regression with, or without, leniency (Fig. 6) lacked statistical power to differentiate this ratee from other anesthesiologists because the ratee had above average quality of supervision and relatively few clinical days (i.e., ratings).
The supervision scores are the cumulative result of how the anesthesiologists perform in clinical environments. The scores reflect in situ performance and can improve with feedback.4,15 Supervision scores are used in our department for mandatory annual collegiate evaluations and for maintenance of hospital clinical privileges (i.e., the United States’ mandatory semi-annual “Ongoing Professional Practice Evaluation”). Consequently, the statistical comparisons could reasonably be considered to represent high-stakes testing.5 We therefore considered statistical approaches that satisfy statistical assumptions as much as possible. In addition, we conservatively treated as statistically significant only those differences in ratee scores with small P values < 0.01 and used random effects modelling (i.e., shrinkage of estimates for anesthesiologists with small sample sizes toward the average).23-27 Nevertheless, we show that mixed-effects logistic regression modelling, with rater leniency entered as a fixed effect, resulted in greater detection of performance outliers than the Student’s t test (i.e., without adjustment for rater leniency). Comparing the mixed-effects logistic regression model with rater leniency with multiple Student’s t tests, rather than with a random effects model of the average scores without rater leniency, resulted in a lesser chance23-25 of detecting benefit in logistic regression (i.e., our conclusion is deliberately conservative).
Previous psychometric studies of anesthesiologists’ assessments of resident performance have also found significant rater leniency.28,29 Even with an adjustment of the average scores for rater leniency, the number of different ratings that faculty needed for a reliable assessment of resident performance exceeded the total number of faculty in many departments.28 Our paper provides a methodological framework for future statistical analyses of leniency for such applications.
Suppose the anesthesiologists were distributed into nine categories. There are those with a less than average, average, and greater than average annual number of clinical days, thereby receiving a less than average, average, and greater than average number of evaluations of their clinical performance. There are anesthesiologists who provide less than average, average, and greater than average quality of supervision. We think that, among these nine (3 × 3) groups, the least institutional cost for misclassifying the quality of clinical supervision (below average, average, above average) would be to consider the group of anesthesiologists providing less than average clinical workload and greater than average quality of supervision as providing average quality of supervision. Because this was the only group that was “misclassified” through use of logistic regression with leniency, we think it is reasonable managerially to use this method to analyze the supervision data.
We showed that leniency in the supervision scale (Table 1) was caused by the cumulative effect of all questions (i.e., leniency was not the disproportionate effect of a few questions). If an individual question had accounted for variability in leniency among raters, providing examples of behaviour corresponding to an answer could have been an alternative intervention to reduce leniency. Because our department provides OR care for a large diversity of procedures, it is not obvious to us how to provide examples because there are so many different interactions between residents and anesthesiologists that could contribute to less than or greater than average quality of supervision.1,10 Nevertheless, the finding that leniency arises because of the cumulative effect of all questions shows that the issue is moot. Variability in rater leniency is the result of the raters’ overall (omnibus) assessments of anesthesiologists’ performance, without distinction among the nine items describing specific attributes of supervision.
The supervision score is a surrogate for whether a resident would choose the anesthesiologist to care for their family (Table 2.7).7 Supervision scores for specific rotations are associated with perceived teamwork during the rotation (Table 2.8).12 Observation of intraoperative briefings has found that sometimes anesthesiologists barely participate (e.g., being occupied with other activities).30 Team members can “rate the value” of the intraoperative briefing performed “in the OR when the patient is awake”.31 Thus, we have hypothesized that leniency may be related to interaction among organizational safety culture, residents’ perceptions of the importance of the intraoperative briefing to patient outcome, and the anesthesiologists’ participation (or lack) in the briefings. Our finding of large internal rater consistency among the nine questions shows that such a hypothesis cannot be supported. Supervision begins when residents and anesthesiologists are assigned cases together, ends after the day’s patient care is completed, and includes inseparable attributes (Table 1). Future studies could evaluate whether rater leniency is personality based and/or applies to rating other domains such as quality of life.
Our findings are limited by raters being nested within departments (i.e., residents in one department rarely work with anesthesiologists in other departments). Consequently, for external reporting, we recommend that evaluation of each ratee (anesthesiologist, subspecialty,12 or department11) be performed using the equally weighted average of the scores from each rater. Results are reported as average scores of equally weighted raters, along with confidence intervals.8,C In contrast, for assessment and progressive quality improvement within a department, we recommend the use of mixed-effects logistic regression with rater leniency. Results are reported as odds ratios, along with confidence intervals. Regardless, in situ assessment of the quality of supervision depends (Figs 4 and 6) on there being at least nine (and preferably more) unique raters for each ratee (Table 2.11).7 Although this generally holds for operating room anesthesia, it can be a limitation for specialties (e.g., chronic pain) in which residents rotate for weeks at a time and work with one or two attending physicians.
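The equally weighted average recommended for external reporting can be illustrated with a minimal Python sketch. The function name, the (rater, ratee, score) tuple layout, and the example values are assumptions made for illustration; they are not the study's code or data. Each rater contributes a single mean score per ratee, so a rater who submitted many evaluations counts no more than one who submitted few.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def equally_weighted_scores(ratings):
    """Summarize each ratee using one mean score per rater.

    ratings: iterable of (rater, ratee, score) tuples (hypothetical layout).
    Returns {ratee: (mean of rater means, standard error, number of raters)}.
    """
    # Collect all scores for each observed (ratee, rater) pair.
    by_pair = defaultdict(list)
    for rater, ratee, score in ratings:
        by_pair[(ratee, rater)].append(score)

    # Reduce each pair to a single mean so every rater carries equal weight.
    rater_means = defaultdict(list)
    for (ratee, _rater), scores in by_pair.items():
        rater_means[ratee].append(mean(scores))

    summary = {}
    for ratee, means in rater_means.items():
        n = len(means)
        # The standard error is undefined with a single rater.
        se = stdev(means) / sqrt(n) if n > 1 else float("nan")
        summary[ratee] = (mean(means), se, n)
    return summary


# Made-up example: rater "A" rated ratee "X" four times, rater "B" once.
example = [("A", "X", 4.0)] * 4 + [("B", "X", 3.0)]
result = equally_weighted_scores(example)
```

In this made-up example the pooled mean of the five ratings would be 3.8, whereas the equally weighted mean of the two raters is 3.5. A confidence interval would follow from the mean and standard error using a Student t quantile with n − 1 degrees of freedom, where n is the number of unique raters.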
Leniency is the scientific term. We searched Google Scholar on December 8, 2016. There were 962 results from “rater leniency” OR “raters’ leniency” OR “rating leniency” OR “leniency of the rater” OR “leniency of the raters”. There were 93.4% fewer results for “rater heterogeneity” OR “raters’ heterogeneity” OR “heterogeneity of the rater” OR “heterogeneity of the raters”.
See http://FDshort.com/CronbachSplitHalf, accessed February 2017. For each respondent, select four of the nine questions, calculate the mean score, and calculate the mean score of the other five questions. Calculate among all raters the correlation coefficient between the pairwise split-half mean scores. Repeat the process using all possible split halves of the nine questions. The mean of the correlation coefficients is Cronbach’s alpha. This measure of internal consistency provides quantification for the reliability of the use of the score alone.
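In practice, Cronbach’s alpha is usually computed directly from the item variances and the variance of the total score rather than by enumerating split halves. The following is a minimal Python sketch of that standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the small data sets are hypothetical, and this is not the analysis code used in the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha from item and total-score variances.

    rows: one list per respondent holding that respondent's k item scores
    (hypothetical layout; requires at least two respondents and two items).
    """
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up data: three respondents who answer every item identically.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha_consistent = cronbach_alpha(consistent)

# Made-up data: two items that agree only partly.
partial = [[1, 2], [2, 1], [3, 3]]
alpha_partial = cronbach_alpha(partial)
```

When every respondent gives the same answer to all items, the items are perfectly consistent and alpha equals 1; disagreement among items lowers alpha.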
The sample sizes are too small to estimate the variance within pairs, and the variances are generally unequal among pairs.7,8 See the Anesthesia & Analgesia companion papers for mathematical details.7,8 Even when there are many ratings per rater, using each rating’s score minimally influences final assessments clinically.19
Residents provided a response for 99.1% (n = 14,585) of the 14,722 requests.10 For 6.3% (n = 921) of requests, residents responded that they worked with the faculty for insufficient time to evaluate supervision, leaving n = 13,664 ratings.10 The mean (SD) intraoperative patient care time together was 4.87 (2.53) h·day⁻¹.10
High-stakes Testing. Wikipedia. Available from URL: https://en.wikipedia.org/wiki/High-stakes_testing (accessed February 2017).
Conflicts of interest
This submission was handled by Dr. Hilary P. Grocott, Editor-in-Chief, Canadian Journal of Anesthesia.
Franklin Dexter and Bradley J. Hindman helped design the study. Franklin Dexter helped conduct the study. Franklin Dexter and Johannes Ledolter helped analyze the data. Franklin Dexter, Johannes Ledolter, and Bradley J. Hindman helped write the manuscript.
- 20. Doane DP, Seward LE. Measuring skewness: a forgotten statistic? J Stat Educ 2011; 19(2).
- 21. Box GE, Cox DR. An analysis of transformations. J R Stat Soc Series B Stat Methodol 1964; 26: 211-52.