Ensuring adequate anesthetic depth (AD) is a fundamental responsibility of the anesthesiologist during general anesthesia. Levels of anesthesia that are too deep or too light are disadvantageous for patients. While an AD that is too light can result in intraoperative awareness and major psychological sequelae, an AD that is too deep carries the risk of hemodynamic instability, prolonged awakening, and postoperative delirium.1,2,3,4,5,6 Clinical evaluation is still considered to play the leading role in AD assessment during general anesthesia.

Despite their well-known limitations,7 electroencephalography (EEG)-based monitors are increasingly used to support AD assessment.2,8,9 It has been shown that using an EEG monitor can reduce anesthetic consumption,10,11 recovery time,10,12 postoperative nausea and vomiting,10,13 awareness,14 and postoperative delirium.15,16 Taking these data into account, one can assume that clinical judgement is inferior to EEG-derived measurements in determining adequate AD, and leads to inadvertent under- or overdosing of anesthetic agents.

The aim of this prospective observational cohort study was to compare the clinical, EEG-free judgement of anesthesiologists in daily practice with an EEG-based measurement of AD during stable general anesthesia. We hypothesized that there would be no differences in the level of agreement between an EEG-based measurement and clinical judgement of AD.

Methods

This study was approved by the Ethics Committee of Upper Austria (study number C-109-16) and written informed consent was obtained from all subjects participating in the trial. The trial was registered prior to patient enrollment at clinicaltrials.gov (NCT02766894, principal investigator: Wolfgang Puchner, date of registration: May 2016) and conducted in accordance with ethical principles of the Declaration of Helsinki. The manuscript was prepared as suggested by the STROBE (STrengthening the Reporting of OBservational studies in Epidemiology) statement. The Narcotrend® monitor used in this study was provided free of charge by Gepa-Med Medizintechnik (Vienna, Austria). The company played no role in the study design, data collection, statistical analysis, or interpretation of study results. This analysis was designed as an investigator-initiated, single-centre, prospective, observational cohort study. It was performed in the operating theatres of a tertiary university hospital in Linz, Austria from May 2016 to February 2017.

Selection of study patients

Patients aged between one and 110 yr, who were scheduled for elective or non-elective surgery requiring a general anesthetic, were eligible for study enrollment. Duration of surgery < ten minutes, a moribund state, proximity of the EEG electrodes to the operating field, skin lesions over the forehead, known pathologies of the frontal brain lobe, hypoxic brain damage, and hypothermia (≤ 33°C) were considered exclusion criteria.

Selection of anesthesiologists

Twenty anesthesiologists with no experience in EEG monitoring and EEG-based guidance of anesthesia were invited to take part in this study. Participation was voluntary and decided on a first-come, first-served basis. Each physician received detailed information before the study and was expected to participate in 30 individual measurements. This number of measurements was estimated to allow for correction of within-subject correlation. If a participant could not complete 30 measurements within the study period, their data were omitted, and the physician was replaced by another colleague, who performed 30 measurements.

Data collection

All measurements were taken under stable general anesthesia, no earlier than five minutes after induction, when the surgical intervention was already in progress. The general anesthetic regimen (balanced anesthesia with volatile anesthetics or total intravenous anesthesia) and the type of anesthetics were at the discretion of the attending anesthesiologist, except for ketamine, dexmedetomidine, and nitrous oxide, as these substances modify the EEG signal and are not accurately reflected in the processed EEG index.17,18,19

The Narcotrend®-Compact M monitor (software version 3.0, MT Monitor Technik, Bad Bramstedt, Germany) was used to objectively measure AD.20 This device processes the raw EEG signal (acquired over an average of 20 sec) to the Narcotrend index, which is updated every second. This index is a dimensionless number from 0 (isoelectric EEG) to 100 (awake). These indices are categorized by the manufacturer into six stages (A [awake] to F [anesthesia with burst suppression]), with the stages C, D, E0,1 reflecting adequate AD (Table 1). Similar to the bispectral index, the Narcotrend monitor is a validated tool for measuring AD.21,22,23 It has repeatedly been used to measure AD in patients aged one year or older.24,25

Table 1 Levels of anesthetic depth assessed by clinical judgement and Narcotrend® monitor measurement

The scale by which the anesthesiologist determined the AD was derived from the Narcotrend® index scale and had five categories (Table 1). After surgery had started, the attending anesthesiologist excluded an inadequate AD, and the investigator verified stable and undisturbed EEG processing and index recording (electrode impedance < 8 kΩ, no interference by electrocautery) for at least one minute. Under these conditions, the attending anesthesiologist, who was blinded to the raw EEG and the Narcotrend index, determined the AD using their clinical judgement and standard perioperative monitoring. According to the AD scale (Table 1), the anesthesiologist could then choose from the remaining three AD categories: mid-adequate, (adequate but) fairly deep, or (adequate but) fairly light. Simultaneously with clinical judgement, the raw EEG trace was recorded and the Narcotrend index documented. Only one measurement was performed per patient. The attending anesthesiologist was not given access to the raw EEG or the index on the Narcotrend monitor during the course of the study.

In addition, the following data were documented at the same time: the attending anesthesiologist, number of the anesthesiologist’s years of practice, and the period of time from induction of anesthesia until measurement. We collected the following patient data: age, sex, body mass index, redhead status (defined by phenotypic appearance, e.g., red or orange hair, freckles), frailty (based on the Fried criteria),26 the American Society of Anesthesiologists (ASA) physical status classification, history of awareness or postoperative nausea and vomiting, urgency of surgery, pre-medication received, anesthesia regimen, calculated effect-site or plasma concentration of propofol (in case a target controlled infusion was used) or the age-adjusted endtidal minimum alveolar concentration at the time of measurement, neuromuscular blockade during assessment, heart rate and mean arterial blood pressure during assessment, intraoperative movements before measurement, and intraoperative use of catecholamines. Within 24 hr after recovery from anesthesia, study patients were interviewed by one of four trained investigators using the modified Brice method to identify intraoperative awareness.27 This information was not obtained in very young children, patients with postoperative delirium, or critically ill subjects.

Study endpoints

The primary endpoint of this study was the level of agreement between the anesthesiologist’s clinical judgement and the Narcotrend measurement of AD. The level of agreement was determined for each measurement by counting the number of AD levels by which the anesthesiologist’s judgement differed from the Narcotrend monitor measurement (d-score). For example, if the AD levels determined by clinical judgement and the Narcotrend monitor were in agreement, the number of discordant levels was zero. If the AD level judged by the anesthesiologist was two levels lower than the level measured by the Narcotrend monitor, the number of discordant levels (d-score) was two. We used negative values to indicate that the Narcotrend monitor categorized patients into deeper AD and positive values if the Narcotrend monitor categorized patients into lighter AD than judged by the anesthesiologist. Discordance by one AD level was considered minor, whereas discordance by two or more AD levels was considered major. The secondary study endpoint was the identification of risk factors for discordance between the anesthesiologists’ judgement and the Narcotrend measurement, using the feature selection algorithm of a “random forest” model.
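
The d-score definition above can be illustrated with a minimal Python sketch. The integer coding of the five AD categories (larger value = lighter anesthesia) and the names `Depth`, `d_score`, and `discordance` are assumptions made for illustration only; the actual category definitions are those of Table 1.

```python
from enum import IntEnum


class Depth(IntEnum):
    """Hypothetical ordinal coding of the five AD categories (larger = lighter)."""
    TOO_DEEP = 0
    FAIRLY_DEEP = 1
    MID_ADEQUATE = 2
    FAIRLY_LIGHT = 3
    TOO_LIGHT = 4


def d_score(judged: Depth, measured: Depth) -> int:
    """Signed number of levels between the monitor and the clinical judgement.

    Negative: the monitor indicates deeper AD than judged.
    Positive: the monitor indicates lighter AD than judged.
    """
    return int(measured) - int(judged)


def discordance(d: int) -> str:
    """Classify a d-score as concordant, minor (one level), or major (two or more)."""
    if d == 0:
        return "concordant"
    return "minor" if abs(d) == 1 else "major"


# Judged fairly light, measured fairly deep: the monitor is two levels deeper.
example = d_score(Depth.FAIRLY_LIGHT, Depth.FAIRLY_DEEP)
print(example, discordance(example))
```

With this coding, the sign convention follows the text: a monitor reading deeper than the judgement yields a negative score.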

Statistical analysis

As no data on the level of agreement between clinical judgement and an EEG monitor-based measurement of AD existed at the time of study planning, we could not perform a power analysis to determine which sample size would be adequate for our study. We knew that 20 anesthesiologists were willing to participate in the study, and we believed that 30 measurements per anesthesiologist would be sufficient to identify participant-specific effects. This resulted in a sample size of 600, a number which can also be seen as a lower limit for the number of measurements necessary for applying a machine learning algorithm.

The level of agreement (beyond that observed by chance) between the anesthesiologist’s judgement and Narcotrend measurements was tested using Cohen’s kappa. Furthermore, the number of measurements where the d-score was −1, 0, or 1 was counted to determine how often no more than a minor deviation between the judgement of the anesthesiologist and the measurement of the Narcotrend was observed.
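
For readers unfamiliar with the statistic, the following is a minimal sketch of unweighted Cohen’s kappa in Python, computed as (observed agreement − chance agreement) / (1 − chance agreement). The function name and the toy ratings are hypothetical, and the sketch does not reproduce the significance test or the R routine used in the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long sequences of category labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same, non-zero number of ratings")
    n = len(rater_a)

    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Toy data: clinical judgement vs. monitor category for four measurements.
judged = ["deep", "light", "mid", "mid"]
measured = ["deep", "mid", "mid", "light"]
print(cohens_kappa(judged, measured))
```

A kappa near zero, as reported in this study, means the observed agreement is about what the marginal frequencies alone would produce.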

Additionally, the “random forest” algorithm described by Breiman28 was used to predict the d-score as defined above (R statistical software, package “randomForest”). For this purpose, the data set was split into a training data set (80% of measurements) and a test data set (20% of measurements). After training the model on the training data set, the test data set was used for prediction. The relative feature importance was determined using the “Boruta” package provided by “The Comprehensive R Archive Network” and is given in arbitrary units.
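
The 80/20 split described above can be sketched as follows. This is an illustrative Python version using only the standard library, not the authors’ R code; the function name `split_train_test`, the fixed seed, and the use of a simple random shuffle (rather than, for example, a split stratified by anesthesiologist) are assumptions.

```python
import random


def split_train_test(records, test_fraction=0.2, seed=0):
    """Randomly partition records into (training, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must lie strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 600 measurements, identified here only by a running index.
train, test = split_train_test(range(600))
print(len(train), len(test))
```

With 600 measurements, this yields 480 training and 120 test records.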

All statistical analyses were performed using the open-source R statistical software package, version 3.5.0 (The R Foundation for Statistical Computing, Vienna, Austria). Variables are given as median values with interquartile ranges or absolute numbers with percentages.
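
As a small illustration of the summary statistics mentioned above, the following Python sketch computes a median with interquartile range. The helper name `median_iqr` and the example values are hypothetical, and the quartile interpolation method chosen here is an assumption that may differ from the one used in the original analysis.

```python
import statistics


def median_iqr(values):
    """Return the median and the (first quartile, third quartile) pair."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


# Hypothetical ages in years, for illustration only.
ages = [34, 51, 47, 68, 72, 25, 59]
median, (lower, upper) = median_iqr(ages)
print(f"{median} ({lower} to {upper})")
```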

Results

During the study period, complete sets of intraoperative measurements were taken in 600 patients (Table 2, Fig. 1) who were anesthetized by 20 anesthesiologists. In all but 41 patients, postoperative data on intraoperative awareness and dreams could be collected.

Table 2 Characteristics of all study patients as well as subjects with concordant and discordant judgement of the anesthetic depth
Fig. 1

Histogram of measured anesthetic depths (AD). Distribution of AD of all 600 patients during stable conditions of adequate anesthesia from the attending anesthesiologist’s perspective, as measured by Narcotrend index (NI) values. Dark-shaded bars represent measurements of inadequate AD with an NI outside the range of 20–79, which the manufacturer of Narcotrend considers either too deep or too light a level of anesthesia
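
The adequacy range named in the caption can be expressed as a simple rule. The sketch below is in Python, treats both bounds (20 and 79) as inclusive, which is an assumption, and uses a hypothetical function name; the finer stage boundaries of Table 1 are not reproduced here.

```python
def classify_narcotrend_index(ni: float) -> str:
    """Label a Narcotrend index (0 to 100) relative to the 20-79 adequacy range."""
    if not 0 <= ni <= 100:
        raise ValueError("the Narcotrend index ranges from 0 to 100")
    if ni < 20:
        return "too deep"
    if ni > 79:
        return "too light"
    return "adequate"


for value in (12, 45, 85):
    print(value, classify_narcotrend_index(value))
```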

Primary study endpoint

In 42% of patients (n = 250), the anesthesiologist’s clinical judgement of AD agreed with the anesthetic level measured by the Narcotrend monitor. In 46% of patients (n = 274), the anesthesiologist and Narcotrend monitor differed by one AD level (minor discordance). Major discordance was observed in 76 (13%) measurements (judged deeper than measured, n = 29 [5%]; judged lighter than measured, n = 47 [8%]). In 7% of patients (n = 44), the Narcotrend index was outside the limits of adequate AD (too deep, n = 28 [5%]; too superficial, n = 16 [3%]). The overall level of agreement between the anesthesiologist’s clinical judgement and the Narcotrend monitor was not statistically significant (Cohen’s kappa, −0.039; P = 0.17). Figure 2 gives an overview of the levels of agreement and disagreement between the anesthesiologist’s clinical judgement and the Narcotrend index results for all patients. None of the 559 study patients who were followed up with interviews reported signs of intraoperative awareness.
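
The reported percentages follow directly from the counts; a short Python check is given below. The counts are taken from the text above, while the variable and function names are illustrative. Because each share is rounded independently, the three rounded values (42%, 46%, 13%) sum to 101%.

```python
TOTAL = 600
counts = {
    "concordant": 250,
    "minor discordance": 274,
    "major discordance": 76,
}


def percentages(category_counts, total):
    """Convert absolute counts into percentages of the total."""
    return {label: 100 * n / total for label, n in category_counts.items()}


shares = percentages(counts, TOTAL)
for label, share in shares.items():
    print(f"{label}: {share:.1f}%")

# Share of measurements with any discordance (minor or major).
any_discordance = shares["minor discordance"] + shares["major discordance"]
print(f"any discordance: {any_discordance:.1f}%")
```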

Fig. 2

Cross-reference of judged and measured anesthetic depths (AD). Depiction of all 600 judgements of AD (left: categories of fairly deep and fairly light anesthesia, right: category of mid-adequate level of anesthesia) compared with the measurements by the Narcotrend index (middle: table of index values in steps of 5). The thickness of each line corresponds to the number of patients. Horizontal lines indicate agreement between assessed and measured anesthetic levels; falling or ascending lines indicate discordance. Dark-shaded areas show the range of inadequate anesthesia levels according to the recommendations of the manufacturer of Narcotrend

Secondary study endpoint

In 90.8% of patients, the difference between the real and the predicted d-score was −1, 0, or 1, showing a very high accuracy of the underlying model. Performing the feature selection algorithm of the “Boruta” package on the same data set revealed age, mean arterial blood pressure, pediatric surgery, ASA physical status classification, body mass index, and frailty as the variables with the highest relative feature importance for predicting the d-score (Fig. 3). All other variables contributed less to predicting the d-score.
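
The accuracy criterion used above (prediction within one level of the real d-score) can be sketched in Python as follows. The function name and the toy values are hypothetical, and the sketch assumes that predictions are compared with the observed d-scores as plain numbers.

```python
def within_one_level(actual, predicted):
    """Fraction of cases whose predicted d-score is within one level of the real one."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two equally long, non-empty sequences")
    hits = sum(abs(a - p) <= 1 for a, p in zip(actual, predicted))
    return hits / len(actual)


# Toy example with four measurements: two predictions are off by two levels.
real_scores = [0, 1, -2, 2]
predicted_scores = [0, 0, 0, 0]
print(within_one_level(real_scores, predicted_scores))
```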

Fig. 3

Relative feature importances of study variables as determined by the Boruta algorithm. Based on the inferences of a Random Forest model, features are removed from the training set, and model training is performed anew. Boruta infers the relative importance of each independent variable (feature) in the obtained predictive outcomes by creating shadow features. All features that gain a higher relative feature importance than the shadow feature with the highest relative feature importance are defined as relevant for the prediction. For our data set, the relevant features (marked in green) are age, mean arterial blood pressure, pediatric surgery, American Society of Anesthesiologists classification, body mass index, and frailty
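
The selection rule described in the caption (a feature counts as relevant if its importance exceeds that of the best shadow feature) can be sketched for a single comparison as below. This is a simplified Python illustration, not the Boruta package itself, which repeats the comparison over many model fits; the function name and all importance values are invented for the example.

```python
def relevant_features(real_importance, shadow_importance):
    """Return the features whose importance exceeds the best shadow feature."""
    if not shadow_importance:
        raise ValueError("at least one shadow feature is required")
    threshold = max(shadow_importance.values())
    return sorted(
        name for name, value in real_importance.items() if value > threshold
    )


# Invented importance values in arbitrary units, for illustration only.
real = {"age": 9.1, "body mass index": 4.2, "sex": 0.8}
shadow = {"shadow_age": 1.5, "shadow_body mass index": 2.0, "shadow_sex": 1.1}
print(relevant_features(real, shadow))
```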

Discussion

The main finding of our study is that clinical judgement of AD during stable anesthesia by the anesthesiologist agreed with Narcotrend monitor measurements in 42% of patients. We observed minor discordance between clinical and EEG-based assessments of AD in 46% of subjects and major discordance in 13% of subjects. The Cohen’s kappa coefficient, which was used to statistically report the level of agreement between the two assessments of AD, was low (−0.039) and not significant (P = 0.17), indicating that clinical judgement and EEG-based measurements of AD were not in agreement in our study.

Currently, there is no scientific evidence on the optimal AD in terms of either clinical assessment or EEG-derived indices, including the Narcotrend index. In contrast to another study which evaluated the ability of anesthesiologists to predict the bispectral index,29 the aim of our study was to evaluate the agreement of the clinical judgement of AD and measurements of the EEG-based Narcotrend monitor. Therefore, we used a three-stage scale to clinically report AD (i.e., fairly light, mid-adequate, fairly deep). This pragmatic approach was chosen to reflect clinical practice in the absence of other validated scales to clinically report AD. Nevertheless, it needs to be kept in mind that this three-stage scale has not been validated, and may have influenced the non-agreement between clinical judgement and EEG-based measurement of AD in our study.

Minor discordance was observed in 46% and major discordance in 13% of study subjects. In one third of the study patients, the attending anesthesiologist assessed AD as either adequate but “fairly deep” or “fairly light”. The rationale behind individual decision-making processes was not determined in our protocol. Thus, it can only be speculated on which variables the anesthesiologists’ decisions were based. Given that age, blood pressure, pediatric surgery, the ASA classification, body mass index, and frailty were the variables with the highest relative feature importance for predicting the level of agreement between the anesthesiologist and the Narcotrend monitor, one could assume that the anesthesiologists in our study mainly relied on hemodynamic indicators and physical appearance of the patients. This practice is, however, not supported by scientific evidence, which shows that hemodynamic variables may be influenced by comorbid conditions and a relevant patient-to-patient variability in the pharmacodynamics of anesthetics.30,31 Notably, the anesthesiologists’ judgement of adequate but “fairly deep” or adequate but “fairly light” was rarely confirmed by the Narcotrend index. This appears to be clinically relevant for certain surgical procedures and circumstances (e.g., ophthalmological surgery, neurosurgery, end of surgery) where anesthesiologists intend to achieve fairly deep or fairly light AD. Our data indicate that without EEG-based measurement of the AD, this might not be reliably achieved by the anesthesiologist’s clinical judgement alone.

Age, blood pressure, pediatric surgery, the ASA classification, body mass index, and frailty score were identified as clinically relevant risk factors for discordance between clinical judgement and Narcotrend measurements of AD. Deeper levels of anesthesia than clinically perceived were more frequently observed in old and comorbid patients, whereas lighter levels of anesthesia were predominantly observed in younger patients. This finding is in accordance with the well-known phenomenon of anesthesia-overdosing in older patients.32,33 It was even more evident when older age was associated with a higher ASA score, suggesting that managing geriatric anesthesia without EEG monitoring is prone to misjudgement of AD and therefore overdosing of hypnotics. Our results could therefore explain why the use of an EEG monitor reduced anesthetic consumption,10,11 shortened recovery time,10,12 and decreased the incidence of postoperative delirium15,16 in other studies.

In study subjects aged one to 18 yr, lighter than clinically judged levels of AD were observed more frequently than in older patients in our analysis. These results are consistent with reports on increased total EEG power and on higher-frequency bands in children and adolescents during anesthesia resulting in processed EEG values indicating a light AD, which may lead to increased doses of hypnotics being administered unnecessarily.34,35 Similarly, high EEG index values were measured with an entropy monitor in adults despite an adequate clinical AD.36,37 It was hypothesized that this could be an artefact of the device’s processing algorithm.36,37 Furthermore, an analgesia-based general anesthetic may result in an AD that is lighter than that assessed by EEG monitors but still clinically adequate.38 Accordingly, we did not observe any indicators of explicit awareness in this population, despite current literature suggesting higher rates of intraoperative awareness in children than in adults.39

Neither the anesthesiologist nor their professional experience was associated with the primary study endpoint in our population. It can be hypothesized that anesthesiologists with limited experience at the time of study participation had been sufficiently trained to assess and achieve an adequate AD. Nevertheless, when interpreting these results, it needs to be kept in mind that young anesthesiologists less frequently cared for study patients with a higher ASA score or for very young or very old patients.

Strengths of our study are its prospective design, the high number of study patients included, and the enrollment of 20 anesthesiologists, which minimized the effect of inter-individual differences on the study results. Certain limitations need to be considered when interpreting the results of this study. First, measurements were taken only once and at single, arbitrarily chosen time points during general anesthesia. Therefore, we cannot conclude that the values obtained correlate with other phases of anesthesia (e.g., during episodes of instability or intense/minimal surgical stimulation). Second, we only recorded the Narcotrend index and the respective AD levels. As a result, we did not analyze the original EEG trace, which may give more accurate information about the true AD.

In conclusion, the results of our study suggest that clinical judgement of AD during stable anesthesia did not agree with EEG-based assessment of anesthetic depth in 58% of cases. Nevertheless, this finding could be influenced by the lack of validated scales to clinically judge AD. Age, arterial blood pressure, the ASA classification, body mass index, and frailty appear to determine the level of disagreement between clinical judgement and the Narcotrend monitor.