# Measurement of faculty anesthesiologists’ quality of clinical supervision has greater reliability when controlling for the leniency of the rating anesthesia resident: a retrospective cohort study


## Abstract

### Background

Our department monitors the quality of anesthesiologists’ clinical supervision and provides each anesthesiologist with periodic feedback. We hypothesized that greater differentiation among anesthesiologists’ supervision scores could be obtained by adjusting for leniency of the rating resident.

### Methods

From July 1, 2013 to December 31, 2015, our department utilized the de Oliveira Filho unidimensional nine-item supervision scale to assess the quality of clinical supervision provided by faculty, as rated by residents. We examined all 13,664 ratings of the 97 anesthesiologists (ratees) by the 65 residents (raters). Internal consistency among answers to the questions was tested (criterion: large Cronbach’s alpha > 0.90) to rule out the possibility that one or two questions accounted for leniency. Mixed-effects logistic regression, which compares ratees while controlling for rater leniency, was compared with Student’s *t* tests without adjustment for rater leniency.

### Results

The mean supervision scale score was calculated for each combination of the 65 raters and nine questions; Cronbach’s alpha was very large (0.977). The mean score was also calculated for each of the 3,421 observed combinations of resident and anesthesiologist. The logits of the percentage of scores equal to the maximum value of 4.00 were normally distributed (residents, *P* = 0.24; anesthesiologists, *P* = 0.50). Twenty of the 97 anesthesiologists were identified as significant outliers (13 with below average supervision scores and seven with above average scores) using mixed-effects logistic regression with rater leniency entered as a fixed effect, but not by Student’s *t* test. In contrast, three of the 97 anesthesiologists were identified as outliers (all three above average) using Student’s *t* tests, but not by logistic regression with leniency. The difference (20 *vs* 3) was significant (*P* < 0.001).

### Conclusions

Use of logistic regression with leniency results in greater detection of anesthesiologists with significantly better (or worse) clinical supervision scores than use of Student’s *t* tests (i.e., without adjustment for rater leniency).

## Keywords

Average Quality · Clinical Supervision · Rater Leniency · Anesthesia Resident · Periodic Feedback


*In situ* assessments quantify anesthesiologists’ clinical performance in the dynamic and unpredictable environment where they personally deliver care. This environment includes a range of large and unexpected problems, where anesthesiologists’ roles include foreseeing and preventing problems and where social, team, and environmental factors influence anesthesiologists’ effectiveness.1,2 Thus, as part of an overall assessment of clinical competency, our department uses *in situ* assessments of individual anesthesiologists working in operating rooms and other procedural locations (henceforth referred to as “ORs”) to determine how well they provide clinical supervision of anesthesia residents (Table 1).3-14 Higher scores for clinical supervision are associated with fewer resident reports of errors with adverse effects on patients (Table 2.15)11-13 and greater preference for the anesthesiologist to care for the rating resident’s family (Table 2.7).7

Table 1 de Oliveira Filho *et al.*’s instrument6 for measuring faculty anesthesiologists’ supervision of residents during clinical operating room care

1. The faculty provided me timely, informal, nonthreatening comments on my performance and showed me ways to improve

2. The faculty was promptly available to help me solve problems with patients and procedures

3. The faculty used real clinical scenarios to stimulate my clinical reasoning, critical thinking, and theoretical learning

4. The faculty demonstrated theoretical knowledge, proficiency at procedures, ethical behaviour, and interest/compassion/respect for patients

5. The faculty was present during the critical moments of the anesthetic procedure (e.g., anesthesia induction, critical events, complications)

6. The faculty discussed with me the perianesthesia management of patients prior to starting an anesthetic procedure and accepted my suggestions, when appropriate

7. The faculty taught and demanded the implementation of safety measures during the perioperative period (e.g., anesthesia machine checkout, universal precautions, prevention of medication errors, etc.)

8. The faculty treated me respectfully and strived to create and maintain a pleasant environment during my clinical activities

9. The faculty gave me opportunities to perform procedures and encouraged my professional autonomy

Table 2 Previous findings regarding supervision of anesthesia residents and nurse anesthetists by faculty anesthesiologists

1. Supervision is a single-dimensional construct that incorporates several different attributes, including participation in perianesthesia planning, availability for help/consultation, presence during critical phases of the anesthetic, and fostering safety measures6,7,10,12

2. Supervision can be quantified reliably using an instrument with nine questions, each question assessing a different attribute of supervision.6 The nine questions take < 90 sec to complete. The Cronbach’s alpha achieved in routine use was equal to 0.948 ± 0.001 (SE)10

3. Raters evaluate how often each attribute is demonstrated by the anesthesiologist (never = 1; rarely = 2; frequently = 3; and always = 4), and the supervision score is the mean of the nine answers.6 When each anesthesiologist’s mean resident and mean nurse anesthetist scores were paired, the means were correlated (…)

4. There were very small differences in anesthesiologist supervision scores provided by residents when 1) a resident had more units of work that day with the rated anesthesiologist (“units together”, τ …)

5. The most common supervision score provided by nurse anesthetists was 4.0 (…)

6. All residents evaluated all anesthesiologists’ supervision during a study performed during a single weekend, such that each resident was in one class (e.g., “CA-1”, “CA-2”, etc.).7 There was no association between residents’ perception of supervision by anesthesiologists that met expectations and years since the start of training (…)

7. Mean resident scores for anesthesiologists’ supervision were correlated with mean resident choice of the anesthesiologist to care for their family (Kendall’s τ …)

8. When the supervision instrument was applied to departments12 (Tables 2.14 and 2.15), the internal consistency (Cronbach’s alpha) of the scale was 0.909 ± 0.007. Convergent validity was based on a positive correlation between supervision and variables related to safety culture (all …)

9. There was no significant association between anesthesiologist supervision score and the number of occasions that a resident rater had worked with the anesthesiologist, based on billing data (by patients, τ …)

10. Among anesthesia residents, “the mean ± standard deviation of staff supervision scores that meets expectations”, neither “exceeds expectations” nor is “below expectations”, was 3.40 ± 0.30.5 “Most … residents (94%) perceived that supervision that met their expectations was at least frequent (i.e., a score ≥ 3.0)” (…)

11. Anesthesia departments can measure individual anesthesiologists’ supervision with high reliability (i.e., the mean score is known with precision) when supervision scores are provided by at least nine different resident raters per anesthesiologist.7 Monitoring is done by taking each of the raters’ mean supervision scores for the anesthesiologist and weighting them equally (i.e., treating each rater’s mean as a single observation)8,15

12. With residents’ evaluations of anesthesiologists, mean supervision scores differed among anesthesiologists based on generalizability analysis (…)

13. Anesthesiologist performance can be monitored daily using Bernoulli cumulative sum (CUSUM) control charts.15 A reasonable threshold for low scores is < 3.0 for residents.15 The true positive detection of anesthesiologists with incidences of low scores greater than the chosen “out-of-control” rate was 14/14.15 The false-positive detection rate was 0/29.15 Bernoulli CUSUM detection of low scores was within 50 ± 14 (median ± quartile deviation) days15 (an illustrative sketch of such a chart follows this table)

14. Anesthesia residents’ mean scores for anesthesiologists’ supervision for entire departments were significantly lower (…)

15. Anesthesia residents reporting mean supervision scores for their entire department (i.e., the mean of all anesthesiologists) that were < 3.00 (i.e., less than “frequent”) reported anesthesiologists making more “mistakes that had negative consequences for the patient”, with an accuracy (area under the curve) of 89% (99% confidence interval [CI], 77 to 95).11 Supervision less than “frequent” (i.e., < 3.00) predicted “medication errors (dose or incorrect drug) in the last year” with an accuracy of 93% (99% CI, 77 to 98).11 Among residents reporting overall supervision during the current rotation that was less than frequent (i.e., < 3.0) …

16. Nurse anesthetists’ comments with (not) “see” or (not) “saw” and the theme “I did not see the anesthesiologist during the case(s) together” increased the odds of a nurse anesthetist providing a supervision score < 3 (odds ratio 48.2; …)

17. For both residents and nurse anesthetists, monitoring anesthesiologists’ supervision and providing feedback resulted in greater scores by individual anesthesiologist.4 For example, pairwise by anesthesiologist, the mean supervision scores provided by residents increased by 0.08 ± 0.01 points when equally weighting each anesthesiologist (…)

18. Among anesthesia residents, evaluations of anesthesiologists with comments related to poor teaching had lower scores than the other evaluations with comments (…)
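
To make the monitoring of Table 2.13 concrete, the following is a minimal Python sketch of a Bernoulli CUSUM chart. The in-control rate `p0`, out-of-control rate `p1`, and signal threshold `h` are illustrative assumptions, not the design values used in reference 15.

```python
import math

def bernoulli_cusum(lows, p0=0.05, p1=0.20, h=3.0):
    """Detect an elevated incidence of low supervision scores.

    lows: sequence of daily 0/1 flags (1 = a score < 3.0 occurred).
    p0, p1, h: illustrative in-control rate, out-of-control rate, and
    signal threshold (not the values from reference 15).
    Returns the index of the first signal, or None if none occurs.
    """
    up = math.log(p1 / p0)                 # increment when a low score occurs
    down = math.log((1 - p1) / (1 - p0))   # (negative) increment otherwise
    s = 0.0
    for i, low in enumerate(lows):
        s = max(0.0, s + (up if low else down))
        if s >= h:
            return i
    return None

# Example: the chart signals only after a run of low scores begins.
print(bernoulli_cusum([0] * 30 + [1] * 5))  # signals at index 32
```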

Supervision, in this context, refers to all clinical oversight functions directed toward assuring the quality of clinical care whenever the anesthesiologist is not the sole anesthesia care provider.3-5 The de Oliveira Filho unidimensional nine-item supervision instrument is a reliable scale used to assess the quality of supervision provided by each anesthesiologist (Table 1).6-9 The scale measures all attributes of anesthesiologists’ supervision of anesthesia residents (Table 2.1)6,7,10-13 and has been shown, in multiple studies, to do this as a unidimensional construct (Table 2.2).6,9,10,12 Low supervision scores are associated with written comments about the anesthesiologist being disrespectful, unprofessional, and/or teaching poorly that day (Table 2.18).10,14,15 Scores increase when anesthesiologists receive individual feedback regarding the quality of their supervision (Table 2.17).4 Scores are monitored daily and each anesthesiologist is provided with periodic feedback.3,15

The supervision scale’s maximum value is 4.00, which corresponds to a response of 4 (i.e., “always”) to each of the nine questions (Table 1).6 Because of the ceiling effect, multiple scores of 4.00 reduce the scale’s reliability7,9,10,15 to differentiate performance among the anesthesiologists, even though such differentiation is mandatory (see Discussion).

We previously asked residents to provide a single evaluation for the overall quality of supervision they received from the department’s faculty (i.e., as if intended as an evaluation of the residency program) (Table 2.14).13 We compared those overall scores pairwise with the mean of each resident’s evaluations of all individual anesthesiologists with whom they worked during the preceding eight months.13 Both sets of scores showed considerable heterogeneity among the residents (e.g., some residents provided overall lower scores than those of other residents).13 Consequently, our hypothesis was that greater differentiation among anesthesiologists’ supervision scores could be obtained by incorporating scoring leniency by the resident (rater) into the statistical analysis (i.e., treating a high score as less meaningful when given by a resident who consistently provides high scores, that is, one who is lenient relative to other raters).^{1}

## Methods

The University of Iowa Institutional Review Board affirmed (June 8, 2016) that this investigation did not meet the regulatory definition of research in human subjects. Analyses were performed with de-identified data.

From July 1, 2013 to December 31, 2015, our department utilized the de Oliveira Filho supervision scale to assess the quality of clinical supervision by staff anesthesiologists (Table 1).6,7 The cohort reported herein includes all rater evaluations of all staff anesthesiologists (ratees) over that 2.5-year period chosen for convenience. We used five six-month periods because we previously showed that six months was a sufficient duration in our department for nearly all ratees to receive evaluations and for an adequate number of unique raters to differentiate reliably among ratees using the supervision scale.9,10,15

The evaluation process consisted of daily, automated e-mail requests16 to raters to evaluate the supervision provided by each ratee with whom they worked the previous day in an OR setting for at least one hour, including obstetrics and/or non-operating room anesthesia (e.g., radiation therapy).4,8-10 Raters evaluated ratees’ supervision by logging in to a secure webpage.8 The raters could not submit their rating until each of the nine questions was answered with their choice of 1-4: 1 = never; 2 = rarely; 3 = frequently; or 4 = always (Table 1). The “score” for each evaluation was equal to the mean of the responses to the nine questions (Table 1). The scores remained confidential and were provided to the ratees periodically (every six months) only after averaging among multiple raters.1,15,17
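
As a minimal sketch of the scoring just described (the function name is ours), each completed evaluation maps the nine required responses to their mean:

```python
def supervision_score(answers):
    """Mean of the nine required responses (1 = never ... 4 = always)."""
    if len(answers) != 9 or any(a not in (1, 2, 3, 4) for a in answers):
        raise ValueError("submission requires all nine answers, each 1-4")
    return sum(answers) / 9.0

print(supervision_score([4] * 9))                  # 4.0, the scale maximum
print(round(supervision_score([4] * 8 + [3]), 2))  # 3.89
```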

### Statistical analysis

If one or two of the nine questions accounted for leniency among raters, a potential intervention would have been to either modify the question(s) or provide an example of behaviour that should affect the answer to the question(s) (see Discussion). In contrast, if leniency were present throughout all questions, then an analysis of leniency would need to incorporate the average scores of the raters. The question of whether leniency was present in a few *vs* all questions was addressed by analyzing the mean ratings for each combination of the 65 raters and nine questions. These means had sample sizes of at least 37 answers and a mean sample size of 210 answers (i.e., sufficient to treat the ranks of 1, 2, 3, or 4 as interval levels of measurement). Cronbach’s alpha, a test for internal consistency among answers to questions,^{2} was calculated using the resulting 65 × 9 matrix, equally weighting each rater. The confidence interval (CI) for Cronbach’s alpha was calculated using the asymptotic method.18
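
A minimal sketch of this calculation, with synthetic data standing in for the 65 × 9 matrix of rater-by-question mean scores (function and variable names are ours; the asymptotic CI of reference 18 is not computed here):

```python
import numpy as np

def cronbach_alpha(x):
    """Cronbach's alpha for x: a 2-D array, raters (rows) by questions (columns)."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1).sum()   # sum of per-question variances
    total_var = x.sum(axis=1).var(ddof=1)     # variance of the row totals
    return k / (k - 1) * (1.0 - item_vars / total_var)

# Synthetic 65 x 9 matrix: a common rater effect (leniency) plus small noise,
# so the nine questions move together and alpha is large.
rng = np.random.default_rng(0)
leniency = rng.normal(3.6, 0.25, size=(65, 1))
x = np.clip(leniency + rng.normal(0.0, 0.05, size=(65, 9)), 1.0, 4.0)
print(round(cronbach_alpha(x), 3))   # close to 1, as in the study (0.977)
```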

For each rater, we obtained the equally weighted average of the mean scores provided by that rater for all ratees (Fig. 1).8,^{3} We previously showed that the number of scores per pair differs markedly among raters for each ratee (i.e., there is non-random assignment of residents and anesthesiologists such that leniency will not average out; *P* < 0.001).8,19

The same approach was used when calculating the average of the means by ratee. In order to assess whether the scores of individual ratees were unusually low or high, we compared the averages of the means of each ratee with the value of 3.80 using Student’s *t* test and Wilcoxon signed-rank test. The value of 3.80 is the overall mean supervision score among all ratees’ scores (see Results). The *P* values using the Wilcoxon signed-rank test were exact, calculated using StatXact^{®} 11 (Cytel Inc., Cambridge, MA, USA). Student’s *t* test does not adjust for rater leniency.
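
A minimal sketch of these per-ratee comparisons, assuming a pandas DataFrame `df` with columns `rater`, `ratee`, and `score` (hypothetical names); scipy’s signed-rank test is an approximation of the exact StatXact calculation used in the study:

```python
import pandas as pd
from scipy import stats

def ratee_vs_overall(df, overall_mean=3.80):
    """One-sample tests of each ratee's rater-weighted average vs 3.80."""
    # Mean score for each observed (rater, ratee) pair, so each rater
    # contributes one equally weighted observation per ratee.
    pair_means = df.groupby(["ratee", "rater"])["score"].mean()
    results = {}
    for ratee, means in pair_means.groupby(level="ratee"):
        t_stat, p_t = stats.ttest_1samp(means, popmean=overall_mean)
        # scipy's Wilcoxon P value is approximate; StatXact computed it exactly.
        w_stat, p_w = stats.wilcoxon(means - overall_mean)
        results[ratee] = (p_t, p_w)
    return results
```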

Stata^{®} 14.1 (StataCorp LP, College Station, TX, USA) was used to perform mixed-effects analyses treating the rater as a categorical fixed effect and the ratee as a random effect. Results of the mixed-effects analyses allowed us to assess the ratees’ quality of supervision. Mixed-effects analyses were carried out for two separate dependent variables, modelled individually: 1) the average score, and 2) the binary variable of whether the score equalled the maximum of 4.00 (Fig. 2). The logistic regression was performed using the “melogit” command option of mean-variance adaptive Gauss-Hermite quadrature with 30 quadrature points. As described later (see Results), several analyses were repeated using other estimation methods, including cluster-robust variance estimation, the unstructured covariance matrix, or estimation by Laplace approximation. All tests for differences between ratees were performed treating two-sided *P* < 0.01 as statistically significant. The imbalance in the number of resident ratings per ratee is considered in Appendix 1.
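
For readers without Stata, the following is a minimal sketch of the model structure in Python; statsmodels’ variational-Bayes mixed GLM is an approximate stand-in for melogit’s adaptive Gauss-Hermite quadrature, and the synthetic `df` (column names are ours) stands in for the 13,664 ratings:

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Synthetic stand-in data: one row per rating, with rater and ratee labels.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "rater": rng.integers(0, 65, n).astype(str),
    "ratee": rng.integers(0, 97, n).astype(str),
})
df["top_score"] = rng.binomial(1, 0.65, n)   # 1 if the score equals 4.00

# Rater leniency as a categorical fixed effect, ratee as a random effect.
model = BinomialBayesMixedGLM.from_formula(
    "top_score ~ C(rater)", {"ratee": "0 + C(ratee)"}, df)
result = model.fit_vb()   # variational Bayes; melogit used quadrature
print(result.summary())
```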

## Results

### Internal consistency of raters’ answers to the nine questions contributing to the score

Individual questions did not contribute significantly to leniency (i.e., consideration of individual questions could not improve the statistical modelling). Cronbach’s alpha for the raters’ answers to the questions was high (0.977; 95% CI, 0.968 to 0.985); therefore, the score for each rating (i.e., the mean of the answers to the nine questions in the supervision scale) could be used (Table 1).

### Statistical distributions of rater and ratee scores

We used the 13,664 scores,^{4} with 3,421 observed combinations of the 65 raters and 97 ratees. In Appendix 2, we show lack of validity of the statistical assumptions for a random effects model in the original score scale.20,21

We treated the rater as a fixed effect to incorporate rater leniency in a mixed-effects logistic regression model. Fig. 2 shows the distribution among raters of the percentage of scores equal to the maximum value of 4.00 (i.e., all nine questions answered “always”). The 65 raters differed significantly from one another in the percentages of their scores equal to 4.00 (*P* < 0.001 using fixed-effect logistic regression).

The mixed-effects model with rater as a fixed effect and ratee as a random effect relies on the assumption that the distribution of the logits among ratees follows a normal distribution, which it does. Specifically, no ratee had all scores equal to 4.00 (i.e., for which the logit would have been undefined because of division by zero). In addition, no ratee had all scores less than 4.00 (i.e., for which the logit would also be undefined). There were 60 ratees each with ≥ 14 scores lower than 4.00 among their ≥ 32 scores (i.e., sample sizes large enough to obtain reliable estimates of the logits).8,22 The logits followed a normal distribution [Lilliefors test, *P* = 0.50; mean (standard deviation [SD]), −0.781 (0.491)].
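
A minimal sketch of this check, assuming the DataFrame `df` (with columns `ratee` and `top_score`) from the earlier sketch:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Proportion of maximum (4.00) scores per ratee; the logit is undefined
# when the proportion is exactly 0 or 1, so such ratees are excluded
# (in the study, there were none).
p = df.groupby("ratee")["top_score"].mean()
p = p[(p > 0) & (p < 1)]
logits = np.log(p / (1.0 - p))

stat, p_value = lilliefors(logits, dist="norm")
print(f"Lilliefors P = {p_value:.2f}")   # study value: P = 0.50
```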

### Effectiveness of logistic regression with leniency relative to Student’s t tests (i.e., without adjustment for leniency)

For each ratee, we determined whether the average score was significantly (*P* < 0.01) different from the average score among all ratees when using a Student’s *t* test (i.e., without adjustment for rater leniency). We also determined whether the ratee’s percentage of scores < 4.00 differed from that of other ratees when using mixed-effects logistic regression, with rater leniency treated as a fixed effect and ratee as a random effect. We subsequently refer to that mixed-effects model as “logistic regression with leniency”.

The principal result is that 20/97 ratees were identified as outliers using the logistic regression with leniency, but not by Student’s *t* tests. There were 3/97 ratees identified as outliers using the Student’s *t* tests, but not by logistic regression with leniency. The difference (20 *vs* 3) was significant (exact *P* < 0.001 using McNemar’s test). Thus, adjusting for rater leniency increased the ability to distinguish the quality of anesthesiologists’ clinical supervision.
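
A minimal sketch reproducing this comparison; only the discordant cells (20 and 3) affect McNemar’s exact test, so the diagonal entries below are placeholders that simply make the total 97:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: t-test outlier (yes/no); columns: logistic-regression-with-leniency
# outlier (yes/no). Counts flagged by both methods were not reported, so the
# diagonal entries (0 and 74) are placeholders; they do not affect the test.
table = [[0, 3],     # flagged by t test only: 3
         [20, 74]]   # flagged by logistic regression with leniency only: 20
print(mcnemar(table, exact=True))   # exact P ≈ 0.0005, i.e., P < 0.001
```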

In Appendix 3, we confirm the corollary that there is less information from scores < 4.00 *vs* the percentage of scores equal to the maximum score of 4.00.

In Appendix 4, we show that our previous observation of an increase in supervision score over time with evaluation and feedback (Table 2.17)4 holds when analyzed using logistic regression with leniency.

In Appendix 5, we show that our previous analyses and publications without consideration of rater leniency were reasonable because initially there was greater heterogeneity of scores among ratees.

### Graphical presentation of the principal result

In this final section, we examine why incorporating rater leniency increased the sensitivity to detect both below average and above average performance differences among ratees. Readers who are less interested in “why” may want to go directly to the Discussion.

The figures separately consider ratees potentially performing below average *vs* those potentially performing above average.

Statistical significance of the logistic regression with leniency depended on the number of scores < 4.00, shown on the vertical axes of Figs 3-6 (see Appendix 1 and Appendix 6). For a given ratee average score, blue circles showing lack of statistical significance are more often present for smaller sample sizes than red triangles and orange squares.

Among the 30 ratees with average scores < 3.80, 13 were not significantly different from the average of 3.80 using the Student’s *t* test, but were significantly different from the other ratees by logistic regression with leniency (Fig. 3). For illustration, we consider the ratee with an average score of 3.56, shown by the left-most orange square. This score was the smallest value not found to be significantly less than the overall average of 3.80 using the Student’s *t* test, but found to differ significantly from the other ratees by logistic regression with leniency. In Appendix 7, we show that this finding was caused by substantial variability among raters (i.e., residents) regarding how much the ratee’s quality of supervision was less than the maximum score (4.00).

Among the 53 ratees who had average supervision scores > 3.80 and who had at least nine different raters, seven were not significantly different from average as determined by the Student’s *t* test, but were significantly different using logistic regression with leniency (Fig. 4). There were 3/53 ratees who were significantly different from average by the Student’s *t* test, but not significantly different using logistic regression with leniency. For illustration, we consider the ratee with the highest average score. In Appendix 8, we show that logistic regression with, or without, leniency (Fig. 6) lacked statistical power to differentiate this ratee from other anesthesiologists because the ratee had above average quality of supervision and relatively few clinical days (i.e., ratings).

## Discussion

The supervision scores are the cumulative result of how the anesthesiologists perform in clinical environments. The scores reflect *in situ* performance and can improve with feedback.4,15 Supervision scores are used in our department for mandatory annual collegiate evaluations and for maintenance of hospital clinical privileges (i.e., the United States’ mandatory semi-annual “Ongoing Professional Practice Evaluation”). Consequently, the statistical comparisons could reasonably be considered to represent high-stakes testing.^{5} We therefore considered statistical approaches that satisfy statistical assumptions as much as possible. In addition, we conservatively treated as statistically significant only those differences in ratee scores with small *P* values (< 0.01) and used random effects modelling (i.e., shrinkage of estimates for anesthesiologists with small sample sizes toward the average).23-27 Nevertheless, we showed that mixed-effects logistic regression modelling, with rater leniency entered as a fixed effect, resulted in greater detection of performance outliers than the Student’s *t* test (i.e., without adjustment for rater leniency). Comparing the mixed-effects logistic regression model with rater leniency against multiple Student’s *t* tests, rather than against a random effects model of the average scores without rater leniency, reduced the chance23-25 of detecting a benefit of logistic regression (i.e., our conclusion is deliberately conservative).

Previous psychometric studies of anesthesiologists’ assessments of resident performance have also found significant rater leniency.28,29 Even with an adjustment of the average scores for rater leniency, the number of different ratings that faculty needed for a reliable assessment of resident performance exceeded the total number of faculty in many departments.28 Our paper provides a methodological framework for future statistical analyses of leniency for such applications.

Suppose the anesthesiologists were distributed into nine (3 × 3) categories: those with a less than average, average, or greater than average annual number of clinical days (and thereby a less than average, average, or greater than average number of evaluations of their clinical performance), crossed with those providing less than average, average, or greater than average quality of supervision. We think that, among these nine groups, the misclassification with the least institutional cost would be to consider the anesthesiologists with a less than average clinical workload and greater than average quality of supervision as providing average quality of supervision. Because this was the only group that was “misclassified” through use of logistic regression with leniency, we think it reasonable, managerially, to use this method to analyze the supervision data.

We showed that leniency in the supervision scale (Table 1) was caused by the cumulative effect of all questions (i.e., leniency was not the disproportionate effect of a few questions). If an individual question had accounted for variability in leniency among raters, providing examples of behaviour corresponding to an answer could have been an alternative intervention to reduce leniency. Because our department provides OR care for a large diversity of procedures, it is not obvious to us how to provide examples because there are so many different interactions between residents and anesthesiologists that could contribute to less than or greater than average quality of supervision.1,10 Nevertheless, the finding that leniency arises because of the cumulative effect of all questions shows that the issue is moot. Variability in rater leniency is the result of the raters’ overall (omnibus) assessments of anesthesiologists’ performance, without distinction among the nine items describing specific attributes of supervision.

The supervision score is a surrogate for whether a resident would choose the anesthesiologist to care for their family (Table 2.7).7 Supervision scores for specific rotations are associated with perceived teamwork during the rotation (Table 2.8).12 Observation of intraoperative briefings has found that sometimes anesthesiologists barely participate (e.g., being occupied with other activities).30 Team members can “rate the value” of the intraoperative briefing performed “in the OR when the patient is awake”.31 Thus, we have hypothesized that leniency may be related to interaction among organizational safety culture, residents’ perceptions of the importance of the intraoperative briefing to patient outcome, and the anesthesiologists’ participation (or lack) in the briefings. Our finding of large internal rater consistency among the nine questions shows that such a hypothesis cannot be supported. Supervision begins when residents and anesthesiologists are assigned cases together, ends after the day’s patient care is completed, and includes inseparable attributes (Table 1). Future studies could evaluate whether rater leniency is personality based and/or applies to rating other domains such as quality of life.

Our findings are limited by raters being nested within departments (i.e., residents in one department rarely work with anesthesiologists in other departments). Consequently, for external reporting, we recommend that evaluation of each ratee (anesthesiologist, subspecialty,12 or department11) be performed using the equally weighted average of the scores from each rater. Results are reported as average scores of equally weighted raters, along with confidence intervals.8 In contrast, for assessment and progressive quality improvement within a department, we recommend the use of mixed-effects logistic regression with rater leniency. Results are reported as odds ratios, along with confidence intervals. Regardless, *in situ* assessment of the quality of supervision depends (Figs 4 and 6) on there being at least nine (and preferably more) unique raters for each ratee (Table 2.11).7 Although this generally holds for operating room anesthesia, it can be a limitation for specialties (e.g., chronic pain) in which residents rotate for weeks at a time and work with one or two attending physicians.

## Footnotes

- 1. Leniency is the scientific term. We searched Google Scholar on December 8, 2016. There were 962 results for “rater leniency” OR “raters’ leniency” OR “rating leniency” OR “leniency of the rater” OR “leniency of the raters”. There were 93.4% fewer results for “rater heterogeneity” OR “raters’ heterogeneity” OR “heterogeneity of the rater” OR “heterogeneity of the raters”.

- 2. See http://FDshort.com/CronbachSplitHalf, accessed February 2017. For each respondent, select four of the nine questions and calculate the mean score; then calculate the mean score of the other five questions. Calculate, among all raters, the correlation coefficient between the pairwise split-half mean scores. Repeat the process using all possible split halves of the nine questions. The mean of the correlation coefficients is Cronbach’s alpha. This measure of internal consistency quantifies the reliability of using the score alone.

- 3. The sample sizes are too small to estimate the variance within pairs, and the variances are generally unequal among pairs.7,8 See the *Anesthesia & Analgesia* companion papers for mathematical details.7,8 Even when there are many ratings per rater, using each rating’s score minimally influences final assessments clinically.19

- 4. Residents provided a response for 99.1% (*n* = 14,585) of the 14,722 requests.10 For 6.3% (*n* = 921) of requests, residents responded that they worked with the faculty for insufficient time to evaluate supervision, leaving *n* = 13,664 ratings.10 The mean (SD) intraoperative patient care time together was 4.87 (2.53) h·day^{−1}.10

- 5. High-stakes testing. Wikipedia. Available from URL: https://en.wikipedia.org/wiki/High-stakes_testing (accessed February 2017).


## Conflicts of interest

None declared.

## Editorial responsibility

This submission was handled by Dr. Hilary P. Grocott, Editor-in-Chief, *Canadian Journal of Anesthesia.*

## Author contributions

*Franklin Dexter* and *Bradley J. Hindman* helped design the study. *Franklin Dexter* helped conduct the study. *Franklin Dexter* and *Johannes Ledolter* helped analyze the data. *Franklin Dexter, Johannes Ledolter,* and *Bradley J. Hindman* helped write the manuscript.

## Funding

Departmental funding

## References

- 1. *Dexter F, Ledolter J, Hindman BJ*. Quantifying the diversity and similarity of surgical procedures among hospitals and anesthesia providers. Anesth Analg 2016; 122: 251-63.
- 2. *Dexter F, Epstein RH, Dutton RP, et al*. Diversity and similarity of anesthesia procedures in the United States during and among regular work hours, evenings, and weekends. Anesth Analg 2016; 123: 1567-73.
- 3. *Epstein RH, Dexter F*. Influence of supervision ratios by anesthesiologists on first-case starts and critical portions of anesthetics. Anesthesiology 2012; 116: 683-91.
- 4. *Dexter F, Hindman BJ*. Quality of supervision as an independent contributor to an anesthesiologist’s individual clinical value. Anesth Analg 2015; 121: 507-13.
- 5. *Dexter F, Logvinov II, Brull SJ*. Anesthesiology residents’ and nurse anesthetists’ perceptions of effective clinical faculty supervision by anesthesiologists. Anesth Analg 2013; 116: 1352-5.
- 6. *de Oliveira Filho GR, Dal Mago AJ, Garcia JH, Goldschmidt R*. An instrument designed for faculty supervision evaluation by anesthesia residents and its psychometric properties. Anesth Analg 2008; 107: 1316-22.
- 7. *Hindman BJ, Dexter F, Kreiter CD, Wachtel RE*. Determinants, associations, and psychometric properties of resident assessments of faculty operating room supervision. Anesth Analg 2013; 116: 1342-51.
- 8. *Dexter F, Ledolter J, Smith TC, Griffiths D, Hindman BJ*. Influence of provider type (nurse anesthetist or resident physician), staff assignments, and other covariates on daily evaluations of anesthesiologists’ quality of supervision. Anesth Analg 2014; 119: 670-8.
- 9. *Dexter F, Masursky D, Hindman BJ*. Reliability and validity of the anesthesiologist supervision instrument when certified registered nurse anesthetists provide scores. Anesth Analg 2015; 120: 214-9.
- 10. *Dexter F, Szeluga D, Masursky D, Hindman BJ*. Written comments made by anesthesia residents when providing below average scores for the supervision provided by the faculty anesthesiologist. Anesth Analg 2016; 122: 2000-6.
- 11. *De Oliveira GS Jr, Rahmani R, Fitzgerald PC, Chang R, McCarthy RJ*. The association between frequency of self-reported medical errors and anesthesia trainee supervision: a survey of United States anesthesiology residents-in-training. Anesth Analg 2013; 116: 892-7.
- 12. *De Oliveira GS Jr, Dexter F, Bialek JM, McCarthy RJ*. Reliability and validity of assessing subspecialty level of faculty anesthesiologists’ supervision of anesthesiology residents. Anesth Analg 2015; 120: 209-13.
- 13. *Hindman BJ, Dexter F, Smith TC*. Anesthesia residents’ global (departmental) evaluation of faculty anesthesiologists’ supervision can be less than their average evaluations of individual anesthesiologists. Anesth Analg 2015; 120: 204-8.
- 14. *Dexter F, Szeluga D, Hindman BJ*. Content analysis of resident evaluations of faculty anesthesiologists: supervision encompasses some attributes of the professionalism core competency. Can J Anesth 2017. DOI: 10.1007/s12630-017-0839-7.
- 15. *Dexter F, Ledolter J, Hindman BJ*. Bernoulli cumulative sum (CUSUM) control charts for monitoring of anesthesiologists’ performance in supervising anesthesia residents and nurse anesthetists. Anesth Analg 2014; 119: 679-85.
- 16. *Epstein RH, Dexter F, Patel N*. Influencing anesthesia provider behavior using anesthesia information management system data for near real-time alerts and post hoc reports. Anesth Analg 2015; 121: 678-92.
- 17. *O’Neill L, Dexter F, Zhang N*. The risks to patient privacy from publishing data from clinical anesthesia studies. Anesth Analg 2016; 122: 2017-27.
- 18. *Feldt LS, Woodruff DJ, Salih FA*. Statistical inference for coefficient alpha. Appl Psychol Meas 1987; 11: 93-103.
- 19. *Yamamoto S, Tanaka P, Madsen MV, Macario A*. Analysis of resident case logs in an anesthesiology residency program. A A Case Rep 2016; 6: 257-62.
- 20. *Doane DP, Seward LE*. Measuring skewness: a forgotten statistic? J Stat Educ 2011; 19(2).
- 21. *Box GE, Cox DR*. An analysis of transformations. J R Stat Soc Series B Stat Methodol 1964; 26: 211-52.
- 22. *Dexter F, Wachtel RE, Todd MM, Hindman BJ*. The “fourth mission”: the time commitment of anesthesiology faculty for management is comparable to their time commitments to education, research, and indirect patient care. A A Case Rep 2015; 5: 206-11.
- 23. *Austin PC, Alter DA, Tu JV*. The use of fixed- and random-effects models for classifying hospitals as mortality outliers: a Monte Carlo assessment. Med Decis Making 2003; 23: 526-39.
- 24. *Racz MJ, Sedransk J*. Bayesian and frequentist methods for provider profiling using risk-adjusted assessments of medical outcomes. J Am Stat Assoc 2010; 105: 48-58.
- 25. *Yang X, Peng B, Chen R, et al*. Statistical profiling methods with hierarchical logistic regression for healthcare providers with binary outcomes. J Appl Stat 2014; 41: 46-59.
- 26. *Glance LG, Li Y, Dick AW*. Quality of quality measurement: impact of risk adjustment, hospital volume, and hospital performance. Anesthesiology 2016; 125: 1092-102.
- 27. *Dexter F, Hindman BJ*. Do not use hierarchical logistic regression models with low-incidence outcome data to compare anesthesiologists in your department. Anesthesiology 2016; 125: 1083-4.
- 28. *Baker K*. Determining resident clinical performance: getting beyond the noise. Anesthesiology 2011; 115: 862-78.
- 29. *Baker K, Sun H, Harman A, Poon KT, Rathmell JP*. Clinical performance scores are independently associated with the American Board of Anesthesiology Certification Examination scores. Anesth Analg 2016; 122: 1992-9.
- 30. *Whyte S, Cartmill C, Gardezi F, et al*. Uptake of a team briefing in the operating theatre: a Burkean dramatistic analysis. Soc Sci Med 2009; 69: 1757-66.
- 31. *Einav Y, Gopher D, Kara I, et al*. Preoperative briefing in the operating room: shared cognition, teamwork, and patient safety. Chest 2010; 137: 443-9.