Anesthesia is a high-risk medical specialty in which the ability to perform practical procedures proficiently is essential. Despite this, training opportunities are under significant pressure from a variety of factors, including the reduction in working hours mandated by the European Working Time Directive and an increasingly time-pressured clinical environment. This has led some in the medical profession to question whether there is time for procedural skills to be learned adequately within current training programs.1

In 2007, the American Accreditation Council for Graduate Medical Education (ACGME) developed an initiative called the “Outcome Project”,2 which emphasized the educational outcomes of residency programs rather than only their potential to educate. This requires that learners’ performance be assessed adequately and their competence documented reliably. This ethos is invaluable for ensuring that doctors are properly trained and for protecting patients from unsafe practice, and it is in line with the international trend towards competency-based training.

A robust system for ensuring competence in procedural skills in anesthesia is therefore required: first, to address concerns regarding the lack of training opportunities, and second, to show that the training delivered is effective. Current methods of evaluating technical skills are logbook summaries and work-based assessments (WBAs). The former is best suited to detailing the learning cases encountered.3 It does not normally include a record of success or failure and cannot identify unsafe or poor practice. The WBAs, designed to document proficiency in specific skills, also have weaknesses: they assess only single (often favourably selected) episodes; they may be completed only after success; and the assessor can be carefully chosen to avoid poor reports. These problems affect their validity and reliability. It is usually easy for instructors to recognize trainees having extreme difficulties from logbook analysis and WBAs, but it is much harder to identify more subtle performance deficiencies.4 The rotational nature of most training programs compounds this problem, as it commonly results in trainees working with many different trainers. The resulting lack of continuity in supervision, and therefore in evaluations and assessments, may mean that episodes of poor performance are dismissed as “one-off” mistakes rather than recognized as a pattern of behaviour, repeated failings, or an inability to progress that requires further action.

Assessment of procedural skills in anesthesia is poor compared with other domains of learning and has fallen behind surgical fields.5 Because patients’ outcomes depend on a surgeon’s technical skills, research in this area has been pioneered in surgery.6 Several assessment tools have now been developed and validated for use with surgical trainees outside the operating room. Examples include the Objective Structured Assessment of Technical Skills,7,8 which involves a task checklist and a global rating score; the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills,9 which tests generic laparoscopic skills; and the Imperial College Surgical Assessment Device, which tracks trainees’ hand movements via sensors and provides an effective index of technical skill in both laparoscopic10 and open11,12 procedures. Anesthesia knowledge, clinical judgment, and communication skills are all tested in postgraduate exams,13 but there is currently no formal evaluation of procedural skills. Given that numerous studies have shown that the time required to achieve competency at specific procedures varies widely between individual learners,4,14 there is a great need for a reliable and valid method of demonstrating procedural competency and of identifying struggling trainees who require additional support.

Cumulative sum (cusum), a statistical method that looks at the outcome rather than at the process of performing procedural skills, is an alternative tool that may be used to assess an individual’s procedural performance. Initially developed during World War II as a quality control tool in munitions factories,15 it produces graphs that allow rapid detection of deviations from a pre-established standard. The graphs are generated by relatively simple calculations based on set acceptable and unacceptable failure rates and the degree to which type 1 (α) and type 2 (β) errors (false positive and false negative errors) will be tolerated (Appendix). The null hypothesis is that the true failure rate does not differ from the acceptable failure rate. The calculations produce decision limits (h0 and h1) and a step value, s. The cusum value is plotted on the y-axis, and the number of consecutive attempts is plotted on the x-axis,16 as shown in Fig. 1. The graphs start at zero; each success causes the cusum to fall by s and each failure to rise by 1 − s. To aid interpretation, the decision limits (and multiples thereof) may be drawn onto the graphs as horizontal boundary lines. When α and β are equal, h0 and h1 are of the same magnitude. Crossing the lower decision limit (h0) from above means that the true failure rate does not differ significantly from the acceptable failure rate, with the probability of a type 2 error equal to β (as occurs in Fig. 1 for doctor A after 39 attempts).4 In the scientific literature, this has been taken to show competency as defined by cusum. When the upper line (h1) is crossed from below, the actual failure rate is greater than the unacceptable failure rate (as occurs in Fig. 1 for doctor B after 16 attempts). This shows a process that is out of control. From this position, competency (or the acceptable failure rate) can be achieved only by a falling cusum that crosses two adjacent boundary lines.17 When the plot is between the decision limits, no statistical inference can be made and performance remains uncertain.4
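
To make the mechanics concrete, the following minimal Python sketch computes s, h0, and h1 from the four input parameters and applies the plotting rule described above. It assumes the standard sequential probability ratio test formulas on which cusum charts of this kind are based (the Appendix formulas are not reproduced here, but these expressions reproduce the worked values in Fig. 1); the function names are illustrative only.

```python
import math

def cusum_parameters(f0, f1, alpha, beta):
    """Return (s, h0, h1) for a cusum chart.

    f0: acceptable failure rate; f1: unacceptable failure rate;
    alpha, beta: tolerated type 1 and type 2 error probabilities.
    Assumes the standard sequential probability ratio test formulas.
    """
    p = math.log(f1 / f0)
    q = math.log((1 - f0) / (1 - f1))
    s = q / (p + q)                                # fall per success
    h0 = -math.log((1 - alpha) / beta) / (p + q)   # lower decision limit
    h1 = math.log((1 - beta) / alpha) / (p + q)    # upper decision limit
    return s, h0, h1

def cusum_values(outcomes, s):
    """Start at zero; each success falls by s, each failure rises by 1 - s."""
    values, c = [], 0.0
    for success in outcomes:
        c += -s if success else (1 - s)
        values.append(c)
    return values
```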

Fig. 1

This is a fictional example of a cusum chart for two doctors serially performing a procedure. Using the formulae in the Appendix, the decision limits h0 and h1 were calculated, along with s (the amount the cusum falls with each success). The unacceptable failure rate (f1) was set as 0.4 (40%) and the acceptable failure rate (f0) as 0.2 (20%). A 10% chance of a type 1 or type 2 error was tolerated, so alpha and beta were both 0.1. From this, h0 was calculated as −2.240 and h1 as 2.240; these decision limits have been drawn on the graph as horizontal lines. The value for s was 0.293, so each success causes the cusum to fall by 0.293 and each failure to rise by 0.707 (1 − s). The graph shows doctor A’s cusum crossing h0 after 39 procedures, indicating cusum-defined competency with a 10% risk of a type 2 error. Doctor B’s cusum crosses h1 after 18 attempts, demonstrating an unacceptably high failure rate
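
As a check, the sketch above reproduces the parameters quoted in this caption (a hypothetical usage; the minimum-run arithmetic is an interpretation added here, not a figure from the original):

```python
s, h0, h1 = cusum_parameters(f0=0.2, f1=0.4, alpha=0.1, beta=0.1)
print(f"s={s:.3f}, h0={h0:.3f}, h1={h1:.3f}")  # s=0.293, h0=-2.240, h1=2.240

# The fastest possible route to cusum-defined competency is an unbroken
# run of successes: math.ceil(abs(h0) / s) = ceil(2.240 / 0.293) = 8.
```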

Cusum charts have been used in a variety of specialties (including endoscopy, orthopedics, surgery, and anesthetics) as a quality control method for experienced clinicians and to examine trainee learning curves.18 Their use, though, is currently limited to research, despite their potential to provide continuous performance data, evidence of achieved competency, and a means of assessing training programs themselves. The aim of this review is to evaluate the available literature on the current use of cusum in anesthetic training with a view to establishing its role.

Literature review

A literature search of MEDLINE® (1950 to present), EMBASE™ (1980 to present), BNI (1985 to present), and CINAHL® (1981 to present) was conducted using the key terms: anesthesiology/ed (ed = Education) OR an*esthes* (all fields); cusum (all fields) OR ‘Cumulative Sum*’ (all fields) OR learning curves (all fields). The Cochrane Library, NHS Evidence, and the Trip database were also reviewed. The last electronic search was performed in October 2012. All papers using cusum to investigate performance in anesthetic procedural skills were included. The search was limited to studies reported in English. Review articles, commentaries, abstracts, and letters were excluded. Thirteen relevant studies were identified and are shown in Table 1.

Table 1 Summary of papers

All 13 studies had small sample sizes (< 30), and most examined novice trainees’ performance. The procedural skills investigated can be split broadly into three groups: regional anesthesia, airway and cannulation, and ultrasound skills. These are dealt with in turn.

Regional anesthesia

Four studies examined cusum charts for epidural insertion.4,16,17,19 Naik17 had a cohort of 11 novices, ten of whom achieved cusum-defined competency (i.e., crossing the lower boundary line from above) within 1-85 attempts. In contrast, Kestin16 found only 4/12 recruits were competent, needing 29-128 attempts, with five trainees having an unacceptable failure rate. De Oliveira Filho4 had similar results to those of Kestin,16 with 4/11 trainees achieving the acceptable failure rate.

One reason for the variation in the numbers of trainees reaching competency is illustrated by Sivaprakasam et al.19 They adjusted the acceptable and unacceptable failure rates from 10% and 15% to 20% and 30%, respectively. By doing so, the number of trainees reaching competency increased from 4/6 to 5/6, showing the importance of the initial variables used to construct the cusum graph. The methods used for deciding the failure rates varied widely between the papers: Sivaprakasam’s team19 set the rates arbitrarily; de Oliveira Filho4 used rates from a control sample of trained anesthetists; and Kestin16 and Naik17 employed departmental consensus. Table 2 shows the differences in set failure rates for the four studies, along with the numbers of participants reaching competency. Lowering the chosen failure rates means that more successes, and therefore more attempts, are required to reach competency; thus, in order to produce valid results, it is essential that these values are set appropriately for the level of the individual’s training.
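
The size of this effect is easy to quantify. The snippet below (using the cusum_parameters helper sketched earlier) compares the two pairs of rates from Sivaprakasam’s study; α and β are assumed to be 0.1 purely for illustration, as the study’s error tolerances are not restated above. The “minimum run” is the fastest possible route to competency: an unbroken string of successes from zero down to h0.

```python
import math

for f0, f1 in [(0.10, 0.15), (0.20, 0.30)]:
    s, h0, h1 = cusum_parameters(f0, f1, alpha=0.1, beta=0.1)
    min_run = math.ceil(abs(h0) / s)  # consecutive successes needed from zero
    print(f"f0={f0:.2f} f1={f1:.2f}: s={s:.3f} h0={h0:.2f} "
          f"-> at least {min_run} consecutive successes")

# f0=0.10 f1=0.15: s=0.124 h0=-4.75 -> at least 39 consecutive successes
# f0=0.20 f1=0.30: s=0.248 h0=-4.08 -> at least 17 consecutive successes
```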

Table 2 Summary of regional anesthesia papers

The definition of success and failure also varied between the studies and was another reason for the differing number of trainees attaining competency. In Naik’s trial, any degree of pain relief from the epidural signified a success,17 but both Kestin16 and de Oliveira Filho4 had stricter criteria: Kestin required satisfactory analgesia/anesthesia, and de Oliveira Filho required technical success at the first interspace chosen and adequate surgical anesthesia. It is, however, well documented that correct anatomical placement of an epidural catheter does not always provide adequate, or indeed any, analgesia,20,21 so it could be argued that complete analgesia/surgical anesthesia may be too rigid an endpoint by which to judge the technical ability of a trainee inserting an epidural (although it is obviously the ideal outcome for the patient). This highlights the point that clear, unambiguous, and consistent definitions of success and failure must be used to avoid confusion and ensure meaningful cusum results.

Three studies examined spinal anesthesia. The studies by Kestin16 and de Oliveira Filho4 had similar definitions of success, namely, adequate surgical anesthesia, but the former used stricter acceptable and unacceptable failure rates in the statistical analysis (10% and 20% vs 15% and 30%, respectively). This accounts for the fact that 64% (7/11) of de Oliveira Filho’s trainees were deemed competent vs 25% (2/8) of Kestin’s. Again, this stresses the importance of the figures used in the cusum calculations.

One randomized controlled study investigated whether there was a difference in the learning curves of trainees using two different spinal needles (25G and 27G),22 and found no significant difference. Using cusum to evaluate the effect of different equipment on the acquisition of technical skills is novel in anesthetic practice. Given its origins as a quality control tool, cusum could also be employed to assess the impact of changes in equipment on the performance of experts. For example, if an expert practitioner plotted their cusum chart for a particular procedure (e.g., epidural insertion), their performance would be expected to be in steady state, i.e., tracking between the decision limits h0 and h1. If the equipment used (needles, syringes, etc.) then changed, the cusum chart would identify any impact on performance. A plot remaining within the decision limits would indicate no significant effect, but if either limit were crossed, a statistically significant change in performance would have occurred: an improvement if h0 were breached or a deterioration if h1 were crossed (Fig. 2). In a similar manner, other interventions, e.g., different teaching methods such as simulation-based procedural training, could be investigated by assessing their effect on the cusum plot.
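
A short simulation can illustrate this monitoring use. Everything below is an illustrative assumption (an expert whose baseline failure rate of 5% matches the acceptable rate and deteriorates to 25% after an equipment change); cusum_parameters is the helper sketched earlier, and the reset-on-h0 rule anticipates the resetting approach discussed later.

```python
import random

random.seed(2024)  # reproducible illustration
s, h0, h1 = cusum_parameters(f0=0.05, f1=0.15, alpha=0.1, beta=0.1)

def monitor(p_fail_before, p_fail_after, n_before=50, n_after=100):
    """Track a cusum across an equipment change and report any h1 crossing."""
    c = 0.0
    for i in range(n_before + n_after):
        p = p_fail_before if i < n_before else p_fail_after
        c += (1 - s) if random.random() < p else -s
        if c >= h1:
            return f"h1 crossed at attempt {i + 1}: significant deterioration"
        if c <= h0:
            c = 0.0  # reset after demonstrating the acceptable rate
    return "plot stayed within the decision limits: no detectable change"

print(monitor(p_fail_before=0.05, p_fail_after=0.25))
```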

Fig. 2

This is an example of a cusum chart showing the possible effects a change in equipment could have on an expert’s performance. The first 22 procedures were carried out prior to the change and show a steady-state performance graph. The plot then splits into the possible outcomes after different equipment has been introduced: no change in performance (blue line); a statistically significant improvement in performance (green line crossing h0); or a statistically significant decrease in performance (purple line crossing h1)

Schuepfer23 used cusum to investigate a new technique for performing psoas compartment blocks (PCBs) in children. By examining the learning curves of residents practicing the procedure, it was calculated that at least 55 blocks would need to be performed to achieve a success rate of 70%. Schuepfer concluded that, with a strict definition of success, > 100 PCBs may need to be attempted. This has significant implications: if cusum is used to define competency and average procedure-specific learning curves are known, institutions and training rotations could be evaluated to determine whether they are likely to provide trainees with enough opportunities to achieve competency. This is an interesting area, and one that demands further research.

Airway and cannulation skills

The findings of studies plotting cusum curves for basic airway and cannulation skills are summarized in Table 3. Kestin’s16 study had few recruits for arterial and central line insertion (five and two, respectively) because most of the trainees had prior experience and were therefore excluded. Also, the trainees who were included performed only a small number of the procedures, so interpretation of the results is difficult. It is clear, though, that the cusum method requires novices to perform a large and highly variable number of procedures before they are statistically proven to have an acceptable failure rate. This applies even to basic skills like cannulation: the interns in de Oliveira Filho’s study4 required 19-146 attempts to achieve competency, despite an acceptable failure rate of 20%.

Table 3 Summary of airway and cannulation papers

Komatsu’s group24 performed an interesting additional analysis in their trial. Airway management was risk stratified by grading the likelihood of difficulties in bag-mask ventilation and tracheal intubation. This produced a risk-adjusted cusum. As a single failure has a significant impact on the cusum graph, a few atypically difficult patients and subsequent procedural failures can require learners to perform large numbers of procedures successfully in order to be anywhere near the lower boundary line of statistical significance. Risk-adjusting the cusum score would help account for this, and therefore, this approach is very appealing. Komatsu used this adjusted score to assess trainee performance as either better than expected, given the level of difficulty encountered, or worse than expected. The “expected” level of performance was taken from the average performance of all of the interns. Ideally, this would have been derived from a larger external source of performance data.
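
One simple way to risk-adjust a cusum is an observed-minus-expected formulation, in which each case contributes its outcome minus its predicted failure probability. The sketch below shows this common construction for illustration only; it is not necessarily the exact method used by Komatsu’s group, and the probabilities are hypothetical.

```python
def risk_adjusted_cusum(cases):
    """Observed-minus-expected cusum.

    cases: iterable of (failed, expected_p) pairs, where expected_p is the
    predicted failure probability given the case's difficulty grading.
    A failure on a predictably difficult case moves the chart far less
    than a failure on an easy one.
    """
    values, c = [], 0.0
    for failed, expected_p in cases:
        c += (1.0 if failed else 0.0) - expected_p
        values.append(c)
    return values

# Hypothetical example: a failed difficult airway (expected_p = 0.60) adds
# 0.40, whereas a failed easy airway (expected_p = 0.05) adds 0.95.
print(risk_adjusted_cusum([(False, 0.05), (True, 0.60), (True, 0.05)]))
# [-0.05, 0.35, 1.30]
```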

Two studies investigated the learning curves of trainees performing airway procedures (upper airway endoscopy and orotracheal intubation with the Truview EVO2 laryngoscope) on mannequins.25,26 Again, a wide variation in the number of attempts was needed to reach proficiency in both studies, showing that performance varies greatly even in a controlled environment with the same training opportunities and teaching. This suggests that training should be individualized and should ensure that extended practice of a procedure is possible (when required) to allow each trainee time to become competent.

Ultrasound skills

Five studies investigated ultrasound skills pertinent to anesthesia.5,27-30 In one study, cusum was used to determine the amount of training required to achieve competency in spinal ultrasound.27 The conclusion was that 20 attempts and a coaching session were not sufficient to teach the relevant skills, and that this should inform the planning of future educational sessions and workshops. Unlike most other papers, no feedback was given during the attempts, which meant that all learning was experiential once the trial began. This was similar to the study by de Oliveira Filho5 that involved needling a phantom: after the trial, 6/26 subjects were deemed competent at following their needle using ultrasound, and only 2/18 were able to follow their needle to a target. The argument for the lack of feedback was that “most individuals use a type of discovery learning when incorporating ultrasound guidance into their practice”.31,32 Feedback, which has been shown to be of great value in learning,33 was given when required in all the other studies. Its absence might account for the low number of trainees achieving competency at these basic ultrasound skills in these two studies.

Barrington et al.28 used a bovine cadaver to assess the number of attempts 15 trainees required before they were able to visualize their needles competently on ultrasound during simulated sciatic nerve blocks. The trainers provided feedback after each attempt. The mean number of attempts to achieve competency was 28, but again, the range was wide.

Niazi et al.29 used cusum to assess the effect of simulation on the acquisition of ultrasound skills in 20 novices by splitting them into two groups: one simulation trained, the other acting as a control. In the simulation group, 8/10 achieved proficiency compared with only 4/10 in the control group. This difference did not reach statistical significance, but the study was hampered by its small sample. Nevertheless, it highlights another potential use for cusum, i.e., as a tool to evaluate the effectiveness of different teaching methods in the development of procedural skills.

Halpern et al.30 used cusum to show that it is possible to learn to identify the lumbar spinous processes using ultrasound. Two experienced anesthesiologists performed an ultrasound scan of the lumbar spine and placed a radio-opaque marker at a designated level. The actual level was then determined by a radiologist after reviewing the patients’ computed tomography scans. The results showed that skilled anesthesiologists required a minimum of 22 attempts to become reliable at defining lumbar spine anatomy with ultrasound, a skill that could be used to improve the accuracy of needle placement during neuraxial techniques.

Use of cusum for quality control in anesthesia

Only one study34 has implemented cusum analysis as a quality control system for anesthesiologists, which is surprising given that quality control was the original reason for the method’s creation. The study was performed by an experienced consultant who (bravely) published cusum charts for all arterial and central line insertions he performed over a three-year period. He concluded that it was a good practical performance monitor for consultants (and ideal for appraisals), but that it would need to be adapted to reflect trainees’ level of experience if used to monitor them.

Two other studies investigated the use of cusum with experienced doctors.35,36 They compared the cusum graphs of a registrar with those of a consultant performing non-anesthetic medical procedures. The consultant’s graphs rapidly produced a steady-state plot with acceptable failure rates, whereas the registrar’s graphs were more haphazard and showed a significant learning curve.

Discussion

In the anesthetic literature, the use of cusum analysis has been limited almost entirely to investigating the learning curves of specific procedures. From this type of work, an estimate of the number of procedures required to achieve competency can be made. This information could then be used to inform and evaluate training programs and to help guide decisions about the most appropriate hospitals for trainees to rotate to, depending on their educational requirements.

The ACGME requires that graduating residents perform a minimum of 50 spinal and 50 epidural techniques for surgical procedures.37 Judging from the published cusum studies, this number may be sufficient for some trainees to acquire competency but certainly not for all. It is difficult, however, to provide an accurate estimate of the actual number needed from the available literature, because the existing studies report significantly different results owing to their varying definitions of procedural success and failure, the differences in the variables used to construct the cusum graphs (e.g., acceptable and unacceptable failure rates), and their small sample sizes. It is clear, though, that there is a wide spectrum of learning curves; consequently, the only way to guarantee competency is to tailor training to the individual rather than to focus on minimum numbers.

The accepted meaning of cusum-defined competency in the literature is crossing the h0 boundary line from above or crossing any two consecutive boundary lines from above.17 The problem is that the latter criterion may demand a significantly larger number of successes than the former, as the distance to travel down the cusum chart is much greater. Indeed, at certain points on the chart, the number of successes required to achieve the acceptable failure rate is almost double that needed when starting from zero. This means that novices who have several initial failures (which is to be expected when learning a new skill) can end up at a great disadvantage when trying to prove their competency. It would therefore be more appropriate to reset the cusum to zero each time the upper boundary is breached. The same has been suggested when the lower boundary line is breached: if the cusum is allowed to continue to fall, a subsequent run with an unacceptably high failure rate may go unnoticed.38 In fact, reaching a steady state on the graph may be enough assurance to conclude that the learning curve has settled down.18
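
A minimal sketch of this resetting behaviour, following the update convention used earlier (the structure is illustrative, not taken from the cited studies):

```python
def cusum_with_resets(outcomes, s, h0, h1):
    """Cusum that resets to zero whenever a decision limit is reached.

    Resetting after crossing h1 stops early failures from permanently
    handicapping a learner; resetting after crossing h0 ensures a later
    run of failures is not masked by previously accumulated successes.
    """
    events, c = [], 0.0
    for i, success in enumerate(outcomes, start=1):
        c += -s if success else (1 - s)
        if c >= h1:
            events.append((i, "h1 crossed: unacceptably high failure rate"))
            c = 0.0
        elif c <= h0:
            events.append((i, "h0 crossed: acceptable failure rate shown"))
            c = 0.0
    return events
```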

There are several problems with using cusum analysis to assess performance in procedural skills. First, there are no nationally agreed definitions of success or failure for any given procedure, and those used in the literature vary greatly. There is also currently no consensus as to where the acceptable and unacceptable boundaries should be set or to what degree alpha and beta errors should be tolerated. Tight boundaries are important for quality control and for assessing trained individuals, but should they be much wider for the novice trainee, to allow for the learning curve and to provide encouragement and a sense of achievement? The number of competent doctors produced can increase dramatically simply by altering the boundaries.19 Therefore, if procedural competency is to be defined by cusum, it would be necessary to establish national rates, tailored to the experience of the trainee.

Second, ensuring the accuracy of the recorded data is problematic. Cusum often relies on self-reporting, which introduces a subjective element into the interpretation of a procedural outcome. There is also the potential for recording bias, where favourable results are documented more frequently than unfavourable ones.5 If competency is to be defined by cusum, the consequences of repeated failures become significant for trainees; this will increase the pressure on them to perform and may therefore positively skew their reported procedural outcomes.

Third, as trainee seniority increases, so does exposure to more difficult procedures. This could result in a deterioration in the cusum curve, as failures become more likely with increasing procedural difficulty despite no change in skill level.5 As described previously, Komatsu24 risk adjusted the cusum score for airway management, showing that this can be achieved successfully. It is, however, a single study with a small sample, and the approach has not been validated. A universally recognized and accepted method is therefore required to stratify the technical difficulty of different procedures.

Finally, cusum graphs can be difficult to construct and interpret. A recent review article suggested that only 17 of the 31 cusum graphs analyzed were drawn correctly.18 If these problems were overcome, cusum would have a valuable role in assessing trainee procedural performance.

Cusum is a good performance monitor for trained individuals and a valuable quality control tool that could be used for revalidation and appraisal. It could be employed for rapid detection of medical errors, near misses, and suboptimal clinical performance, and to monitor the effects of prolonged periods away from work. For example, Kestin18 identified a registrar whose performance at spinal anesthesia fell significantly after an 18-month period of non-anesthetic medical training. With the introduction of increasingly complex procedures and technologies, it may also be more sensitive in assessing health care providers’ skill than the currently available methods.18 Finally, it could help assess the impact of new equipment on performance and thereby inform the procurement of medical supplies.

In summary, cusum has many potential applications in anesthesia. In its current form, it could readily be adopted to monitor performance in trained individuals. It can also produce an objective graph of performance in newly learned techniques, providing trainers with information that is unattainable from logbooks or WBAs. This allows trainees to assess their progress and consequently self-direct their learning, and it gives trainers the opportunity to review a trainee’s current skills on first contact. Poor performance can be readily identified and rapidly remediated, thus supporting high-quality health care.39 There are, however, several hurdles to overcome before cusum can be used reliably as proof of trainee competency. Further work should focus on establishing the failure rates of expert anesthesiologists for individual procedures so that informed decisions can be made about acceptable and unacceptable trainee failure rates. Setting such standards nationally would aid the move towards competency-based residency training and act as a benchmark for future research. This work should also investigate ways to adjust cusum scores for predictably difficult procedures, e.g., epidurals in morbidly obese patients, and include validation studies.