Background

Machine learning and natural language processing (NLP) techniques, coupled with the adoption of electronic health records (EHRs) and the widespread availability of high-performance computational resources, offer new avenues for perioperative risk stratification: free-form text sources, such as medical notes, may be loaded directly into prediction models without the need to define, input, or abstract predetermined data elements (e.g. diagnoses, medications). This creates opportunities for preoperative assessment triage, flagging of critical or pertinent data in a voluminous electronic medical record, and a variety of other use cases based on clinician notes, which often contain narratives that richly and concisely describe a nuanced clinical picture of the patient while prioritizing the clinician’s pertinent concerns. Unlike historical keyword-based approaches, modern NLP techniques using large pretrained language models can account for inter-word dependencies across the entire text sequence and have achieved state-of-the-art performance on a variety of NLP tasks [1,2,3,4], including text classification [5, 6]. However, it is unknown whether these techniques can be successfully applied to perioperative risk stratification.

In this feasibility study, we hypothesize that NLP models can be applied to unstructured anesthesia preoperative evaluation notes written by clinicians to predict the American Society of Anesthesiologists Physical Status (ASA-PS) score [7, 8]. These preoperative evaluation notes summarize the patient’s medical and surgical history and the reason for surgery, all of which reflect the pre-anesthesia medical comorbidities that the ASA-PS aims to represent. In particular, we investigate four text classification approaches that span the spectrum of historical and modern techniques: (1) random forest [9] with n-gram and term frequency-inverse document frequency (TFIDF) transform [10], (2) support vector machine [11] with n-gram and TFIDF transform, (3) fastText [12, 13] word vector model, and (4) BioClinicalBERT deep neural network language model. We also investigate the impact of using the entire note versus specific note sections. We compare each model’s prediction against the ASA-PS assigned by the anesthesiologist on the day of surgery and assess catastrophic errors made by one of these models. Finally, we use Shapley values to visualize which portions of the note text were associated with the model’s predictions and to explain these catastrophic errors. This approach shows that clinicians can understand how complex NLP models arrive at their predictions, which is an important criterion for clinical adoption.

Methods

This retrospective study of routinely collected health records data was approved by the University of Washington Institutional Review Board with a waiver of consent. This study followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guideline [14] and other guidelines specific to machine learning projects [15,16,17]. Figure 1 depicts a flow diagram of study design.

Fig. 1

Flowchart of study design: dataset creation, model development, evaluation, and interpretation

Study cohort

Inclusion criteria were patients who had a procedure with anesthesia at the University of Washington Medical Center or Harborview Medical Center from January 1, 2016 through March 29, 2021 and who also had an anesthesia preoperative evaluation note filed up to 6 h after the anesthesia end time. This 6-h grace period reflects the reality that, in some urgent or emergent situations or due to EHR behavior, text documentation may be time-stamped out of order.

The anesthesia preoperative evaluation note must have contained the following sections: History of Present Illness (HPI), Past Medical and Surgical History (PMSH), Review of Systems (ROS), and Medications; notes missing at least one of these sections were excluded. No other note type was used. Cases must have had a recorded ASA-PS assigned by the anesthesiologist of record, a free-form text Procedure description, and a free-form text Diagnosis description; cases missing at least one of these values were excluded.

The unit of analysis was a single case with an anesthesia preoperative evaluation note filed within 90 days of the procedure. This unit was chosen because the ASA-PS is typically recorded on a per-case basis by the anesthesiologist to reflect the patient’s pre-anesthesia medical comorbidities at the time of the procedure. Preoperative evaluation notes filed more than 90 days before the case may not reflect the patient’s current state of health and were therefore excluded. Data were randomly split 70%-10%-20% into training, validation, and test datasets, respectively. Patients with multiple cases were randomized into a single data split to avoid information leakage between the three datasets. New case number identifiers were generated for this study and used to refer to each case.
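
A patient-grouped split of this kind can be sketched, for example, with scikit-learn’s GroupShuffleSplit; the DataFrame and column names below are assumptions for illustration, not the exact code used in this study. Because the split is performed over patients (groups), the resulting case-level proportions are approximate.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(cases: pd.DataFrame, seed: int = 42):
    """Split cases ~70/10/20 while keeping each patient in a single split.

    `cases` is a hypothetical DataFrame with one row per case and a `patient_id` column.
    """
    # Hold out ~20% of patients for the test set
    outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
    train_val_idx, test_idx = next(outer.split(cases, groups=cases["patient_id"]))
    train_val = cases.iloc[train_val_idx]

    # Hold out 12.5% of the remaining patients (~10% of the total) for validation
    inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=seed)
    train_idx, val_idx = next(inner.split(train_val, groups=train_val["patient_id"]))

    return train_val.iloc[train_idx], train_val.iloc[val_idx], cases.iloc[test_idx]
```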

Outcomes

The outcome variable is a modified ASA-PS with valid values of ASA I, ASA II, ASA III, and ASA IV-V. ASA V cases are extremely rare, resulting in class imbalances that affect model training and performance; ASA IV and V were therefore combined into a compound class “IV-V”. ASA VI organ procurement cases were excluded. The final categories retain the spirit of the ASA-PS for perioperative risk stratification and resemble the original ASA-PS devised by Saklad in 1941 [7, 18]. The emergency surgery modifier “E” was discarded.

Predictors and data preparation

Free-form text from the anesthesia preoperative evaluation note is organized into many sections. Regular expressions were used to extract the HPI, PMSH, ROS, and Medications sections from the note. Although Diagnosis and Procedure sections exist within the note, they were less frequently documented than in the procedural case booking data from the surgeon; free-form text for these two sections was therefore taken from the case booking. Newline characters and extra whitespace were removed from the text. Note section headers were excluded so that only the body of text from each section was included. We used text from each section to train models for ASA-PS prediction, resulting in 8 prediction tasks: Diagnosis, Procedure, HPI, PMSH, ROS, Medications (Meds), Note, and Truncated Note (Note512). “Note” refers to using the whole note text as the predictor to train a model. When BioClinicalBERT is applied to the “Note” task, the WordPiece tokenizer [19,20,21] truncates input text to 512 tokens; this truncation does not occur for the other models. For equitable comparison across models, we define the “Note512” task, which truncates the note text to the first 512 tokens used by the BioClinicalBERT model.
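
As an illustration of how a Note512 input can be derived, the sketch below applies the Hugging Face WordPiece tokenizer associated with BioClinicalBERT; the checkpoint name and the round-trip decoding are assumptions for illustration rather than a description of the exact preprocessing pipeline used here.

```python
from transformers import AutoTokenizer

# Assumed public checkpoint providing BioClinicalBERT's WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

def make_note512(note_text: str) -> str:
    # Keep only the first 512 WordPiece tokens, mirroring BioClinicalBERT's input limit
    ids = tokenizer(note_text, truncation=True, max_length=512)["input_ids"]
    # Decode back to text so the same truncated input can also be fed to RF, SVM, and fastText
    return tokenizer.decode(ids, skip_special_tokens=True)
```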

Statistical analysis and modeling

Four model architectures with different conceptual underpinnings were trained: (1) random forest (RF) [9], (2) support vector machine (SVM) [11], (3) fastText [12, 13], and (4) BioClinicalBERT [22]. Each model architecture was trained on each of the 8 prediction tasks for a total of 32 final models.

Each model was trained on the training dataset. Model hyperparameters were tuned using Tune [23] with the BlendSearch [24, 25] algorithm to maximize the Matthews correlation coefficient (MCC) computed on the validation dataset. The number of hyperparameter tuning trials was set to 20 times the number of model hyperparameters, with early stopping if the MCC of the last 3 trials reached a plateau with standard deviation < 0.001. The best model was then evaluated on the held-out test dataset. Details on the approach taken for each of the four model architectures are available in the Supplemental methods.
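
A rough sketch of this tuning loop is shown below, assuming Ray Tune 2.x with the FLAML integration installed; `fit_model`, `X_val`, and `y_val` are hypothetical placeholders for one of the four architectures, the search space is illustrative, and the plateau-based early stopping described above is omitted.

```python
from ray import tune
from ray.tune.search.flaml import BlendSearch  # BlendSearch via the FLAML integration
from sklearn.metrics import matthews_corrcoef

def objective(config):
    # fit_model, X_val, y_val are hypothetical placeholders for one model architecture
    model = fit_model(config)
    mcc = matthews_corrcoef(y_val, model.predict(X_val))
    tune.report(mcc=mcc)  # report validation MCC back to Tune

analysis = tune.run(
    objective,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),  # illustrative hyperparameters
        "epochs": tune.randint(5, 50),
    },
    search_alg=BlendSearch(),
    metric="mcc",
    mode="max",
    num_samples=40,  # e.g. 20 x the number of tunable hyperparameters
)
best_config = analysis.best_config
```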

Baseline models

Two baseline models were created for comparison: a random classifier and an age & medications classifier. The random classifier generates a random prediction without using any features, serving as a negative-control baseline. The age & medications classifier serves as a simple clinical baseline: it uses the patient’s age, medication list, and total medication count as input features to a multiclass logistic regression model with cross-entropy loss and L2 penalty to predict the modified ASA-PS outcome variable. Defaults were used for all other model parameters. Both baselines were implemented using scikit-learn.
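
A minimal scikit-learn sketch of these two baselines is shown below; the column names and the bag-of-words encoding of the medication list are assumptions for illustration, not necessarily the configuration used in this study.

```python
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Negative-control baseline: predicts a class uniformly at random, ignoring all features
random_baseline = DummyClassifier(strategy="uniform", random_state=0)

# Age & medications baseline: age, a bag-of-words medication list, and medication count
# feed a multiclass logistic regression (cross-entropy loss, L2 penalty by default)
features = ColumnTransformer([
    ("age", "passthrough", ["age"]),
    ("med_count", "passthrough", ["medication_count"]),
    ("meds", CountVectorizer(), "medication_list"),
])
age_meds_baseline = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])
```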

Evaluation metrics

Final models were evaluated on the held-out test dataset by computing both class-specific and class-aggregate performance metrics. Class-specific metrics included the receiver operating characteristic (ROC) curve, area under the ROC curve (AUROC), precision-recall curve, area under the precision-recall curve (AUPRC), precision (positive predictive value), recall (sensitivity), and F1. Class-aggregate performance metrics included MCC and AUCμ [26], a multiclass generalization of the binary AUROC. Macro-averaged AUROC, AUPRC, precision, recall, and F1 were also computed. Each metric and model-task combination was computed with 1000 bootstrap iterations, each with 100,000 bootstrap samples drawn from the test set. For each metric, p-values were computed for all 400 pairwise model-task comparisons using the Mann–Whitney U test, followed by the Benjamini–Hochberg procedure to control the false discovery rate at α = 0.01.
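
A sketch of this bootstrap and significance-testing procedure for one metric (MCC) is shown below; the prediction arrays and the collected list of pairwise p-values are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import matthews_corrcoef
from statsmodels.stats.multitest import multipletests

def bootstrap_mcc(y_true, y_pred, n_iter=1000, n_samples=100_000, seed=0):
    # Each iteration draws n_samples test-set predictions with replacement and recomputes MCC
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    idx = rng.integers(0, len(y_true), size=(n_iter, n_samples))
    return np.array([matthews_corrcoef(y_true[i], y_pred[i]) for i in idx])

# Compare two model-task combinations on one metric (prediction arrays are hypothetical)
mcc_a = bootstrap_mcc(y_test, preds_model_a)
mcc_b = bootstrap_mcc(y_test, preds_model_b)
_, p_value = mannwhitneyu(mcc_a, mcc_b)

# After collecting all pairwise p-values, control the false discovery rate at alpha = 0.01
reject, p_adjusted, _, _ = multipletests(all_p_values, alpha=0.01, method="fdr_bh")
```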

Model interpretability and error analysis

Four-by-four contingency tables were generated to visualize the distribution of model errors. Catastrophic errors were defined as cases where the model predicted ASA IV-V but the anesthesiologist assigned ASA I, or vice versa. For catastrophic errors made by the BioClinicalBERT model on the Note512 task, three new anesthesiologist raters independently assigned an ASA-PS based only on the input text from the Note512 task. These new ASA-PS ratings were compared against both the original anesthesiologist’s ASA-PS and the model-predicted ASA-PS.
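
For reference, the contingency table and catastrophic-error counts can be tallied roughly as follows, assuming parallel lists of anesthesiologist-assigned and model-predicted modified ASA-PS labels (hypothetical variable names).

```python
import pandas as pd

# y_true / y_pred: hypothetical lists of anesthesiologist-assigned and model-predicted labels
order = ["I", "II", "III", "IV-V"]
table = pd.crosstab(
    pd.Categorical(y_true, categories=order),
    pd.Categorical(y_pred, categories=order),
    rownames=["Anesthesiologist ASA-PS"],
    colnames=["Model ASA-PS"],
    dropna=False,
)

# Catastrophic errors: model predicts ASA IV-V when ASA I was assigned, or vice versa
catastrophic = table.loc["I", "IV-V"] + table.loc["IV-V", "I"]
```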

The SHAP [27] python package was used to train a Shapley-value feature attribution model on the test dataset to understand which words support prediction of each modified ASA-PS outcome class. Model errors were reviewed alongside Shapley-value feature attributions for each of the catastrophic error examples, with representative examples included in the manuscript. Shapley values for predicting each ASA-PS are visualized as a heatmap over text examples. Text examples were de-identified by replacing ages, dates, names, locations, and entities with pseudonyms to achieve data obfuscation while preserving structural similarity to the original passage.
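
For a transformer classifier such as BioClinicalBERT, token-level Shapley attributions can be produced roughly as below; the fine-tuned checkpoint path, the example text, and the class label name are hypothetical and depend on how the model’s labels are configured.

```python
import shap
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Hypothetical path to a BioClinicalBERT model fine-tuned for the 4-class modified ASA-PS task
model_dir = "path/to/finetuned-bioclinicalbert"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
clf = pipeline("text-classification", model=model, tokenizer=tokenizer,
               top_k=None, truncation=True)

# SHAP wraps the pipeline and attributes each token's contribution to every output class
explainer = shap.Explainer(clf)
shap_values = explainer(["<de-identified Note512 text>"])  # hypothetical example input

# Render a token-level heatmap; the class name depends on the model's label configuration
shap.plots.text(shap_values[:, :, "IV-V"])
```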

Results

Our study comprised 38,566 patients undergoing 61,503 procedures with anesthesia care, with 46,275 anesthesia preoperative evaluation notes. Baseline patient, procedure, and note characteristics are described in Table 1. A flow diagram describing dataset creation is shown in Fig. 2. A total of 30 class-aggregate and class-specific metrics were computed; with 400 pairwise comparisons per metric, this yielded 12,000 pairwise comparisons. Only 20 of these comparisons were not statistically significant (Supplemental Tables 7 and 8). All comparisons across the same model type with varying task, or across the same task with varying model, were statistically significant for the reported metrics.

Table 1 Dataset characteristics
Fig. 2

CONSORT Flow Diagram for Dataset Creation. If a patient has multiple procedural cases and pre-anesthesia notes, all of a patient’s cases and notes are allocated to the same data split

AUROC for each model architecture and task is shown in Table 2; AUPRC is shown in Table 3; AUCµ and MCC are shown in Supplemental Table 1. RF, SVM, and fastText performed best when using the entire note rather than individual note sections. Tasks with longer text snippets yielded better performance: the HPI, ROS, and Meds sections resulted in better model performance than the Diagnosis, Procedure, and PMSH sections. On the Note task, fastText performed best; on the Note512 task, BioClinicalBERT performed best.

Table 2 Area under receiver operator characteristic for all models
Table 3 Area Under Precision-Recall Curve

Direct comparison of models is most appropriate on the Note512 task, since all models are given the same information content. For the Note512 task, BioClinicalBERT has better class-aggregate performance across AUROC, AUPRC, AUCμ, MCC, and F1 (Supplemental Table 2) compared to the other models. While F1 for fastText and BioClinicalBERT is similar, fastText achieves it with higher macro-precision (positive predictive value) (Supplemental Table 3), whereas BioClinicalBERT achieves it with higher macro-recall (sensitivity) (Supplemental Table 4). Class-specific metrics show that fastText’s worse recall stems from imbalanced recall performance, with higher recall for ASA II and III, the most prevalent classes, but poor recall for ASA I and IV-V. Conversely, BioClinicalBERT has worse precision than fastText for all classes except ASA III. BioClinicalBERT has similar or better AUROC and AUPRC across all ASA-PS classes. This is also seen in the ROC curves (Fig. 3) and precision-recall curves (Fig. 4), in which the BioClinicalBERT model shows slightly better performance across most thresholds.

Fig. 3

ROC performance of each model architecture on the Note512 task compared to baseline models. Each plot depicts model performance for predicting a specific ASA-PS

Fig. 4

Precision-recall curve performance of each model architecture on the Note512 task compared to baseline models. Each plot depicts model performance for predicting a specific ASA-PS

Figure 5 depicts 4-by-4 contingency tables visualizing the distribution of model errors on the Note512 task. When erroneous predictions occur, they are typically adjacent to the ASA-PS assigned by the original anesthesiologist. In the analysis of 40 catastrophic errors made by the BioClinicalBERT model on the Note512 task, the mean absolute difference between the model prediction and the new anesthesiologist raters was 1.025, whereas the difference from the original anesthesiologist was 3 (Fig. 6). This disparity with the original anesthesiologist and greater concordance with the new anesthesiologist raters indicates that some of the “incorrect predictions” on the test set are not true failures of the model but rather reflect data-quality issues in ASA-PS assignments documented during routine clinical care.

Fig. 5

4-by-4 contingency tables for each model architecture on the Note512 task. The vertical axis corresponds to the modified ASA-PS recorded in the anesthetic record by the anesthesiologist. The horizontal axis corresponds to the model-predicted modified ASA-PS. Numbers in the table represent case counts from the test set. Percentages are case counts normalized over the model-predicted ASA-PS, representing the distribution of actual ASA-PS recorded in the anesthetic record for a given model-predicted ASA-PS. Cells outlined in red in the BioClinicalBERT contingency table correspond to our definition of catastrophic errors. The 21 cases where the anesthesiologist assigned ASA I and the BioClinicalBERT model predicted ASA IV-V comprise 1.7% of all cases. The 19 cases where the anesthesiologist assigned ASA IV-V and the BioClinicalBERT model predicted ASA I comprise 1.6% of all cases

Fig. 6

Rater assignments of ASA-PS for catastrophic error examples from the BioClinicalBERT model on the Note512 task. The top plot shows the scenario where the model prediction is ASA IV-V but the original anesthesiologist assigned ASA I. The bottom plot shows the scenario where the model prediction is ASA I but the original anesthesiologist assigned ASA IV-V. Three anesthesiologist raters were asked to read the input text from the Note512 task and assign an ASA-PS for each of the catastrophic error examples. For each case, a dot marks a rater’s ASA-PS assignment. The model’s prediction and the original anesthesiologist’s ASA-PS are shown as highlighted regions overlaid on the plots. Shapley feature attribution visualizations are shown for cases #57482 (Fig. 7, Supplemental Fig. 2), #41739 (Supplemental Fig. 3), #11950 (Supplemental Fig. 4), and #29054 (Supplemental Fig. 5)

Shapley values in Fig. 7 provide clinically plausible explanations for model predictions, highlighting how strongly, and in which direction, specific input text contributes to predicting a given ASA-PS. These feature attributions often provide clinically plausible explanations for why a model makes a wrong prediction and allow the clinician to evaluate the evidence the model is considering. Additional examples are shown in Supplemental Figs. 2, 3, 4 and 5.

Fig. 7

Attribution of input text features to predicting the modified ASA-PS for the BioClinicalBERT model on the Note512 task. Shapley values for each text token are shown to compare feature attributions to ASA I (top) and ASA IV-V (bottom). Red tokens positively support predicting the target ASA-PS, whereas blue tokens argue against it. The magnitude and direction of support are overlaid on a force plot above the text. The baseline probability of predicting each class in the test set is shown as the “base value” on the force plot. The base value plus the sum of Shapley values from each token corresponds to the probability of predicting that ASA-PS and is shown as the bolded number. For simplicity, feature attributions to ASA II and III are omitted in this figure; a full visualization with all outcome ASA-PS classes for this text snippet is available in Supplemental Fig. 2. Text examples are de-identified by replacing ages, dates, names, locations, and entities with pseudonyms to achieve data obfuscation while preserving structural similarity to the original passage

Discussion

In this study of ASA-PS prediction using NLP techniques, we found that more advanced models made fewer categorization errors. Further, an assessment of catastrophic errors made by the BioClinicalBERT model suggests that, in the majority of cases, expert review judged the ASA-PS assigned by the original anesthesiologist, rather than the ASA-PS assigned by the NLP model, to be erroneous. Shapley-value feature attributions enable a clinician to readily identify whether model predictions are erroneous or clinically plausible. From these feature attributions, we find that NLP models are able to associate both obvious and subtle clinical cues with the patient’s illness severity.

Text classification techniques have undergone substantial evolution over the past decade, and most of them will be unfamiliar to the practicing clinician. In brief, RF and SVM represent more rudimentary approaches that utilize bag-of-words and n-grams. These techniques are sensitive to word misspellings, cannot easily account for word order, have difficulty capturing long-range references within sentences, and struggle to represent the different meanings a word can take in different contexts [28,29,30,31,32,33].

Modern NLP techniques have overcome many of these challenges with vector space representations of words [12, 13, 34,35,36] and subword components [13, 19, 20, 37], as seen in the fastText model, as well as the attention mechanism [38, 39] and pretrained deep autoregressive neural networks [40,41,42] such as transformer neural networks [43]. This has resulted in successful large language models such as BERT [21, 44] and the domain-specific BioClinicalBERT [22]. Perhaps the most widely known large language model is ChatGPT (OpenAI, San Francisco, CA), a general-purpose chatbot based on the GPT-3 model, which contains 175 billion parameters [45]. In contrast, the BioClinicalBERT model used in this feasibility study contains roughly 1500 times fewer parameters, but has been trained specifically on clinical notes, which makes it well suited for the ASA-PS prediction task [46].

Longer text provides more information for the model to make an accurate prediction. Even though short text snippets such as Diagnosis or Procedure may be highly relevant to the illness severity of the patient, the better performance on longer input text sequences indicates that more information is generally better. This mirrors the multifaceted practice of clinical medicine, where a patient’s overall clinical status is often better understood as the sum of many weaker but synergistic signals rather than a single descriptor. The limited input sequence length of BioClinicalBERT creates a performance ceiling because it limits the amount of information available to the model. Comparing the Note and Note512 tasks, all other models that can utilize the full note perform better when this input-length restriction is lifted, with fastText being the top performer. These findings suggest that a future large language model similar to BioClinicalBERT but capable of accepting a longer input context would likely have superior performance characteristics. fastText requires significantly fewer compute resources for model training and inference than BioClinicalBERT and remains a good option in lower-resource settings. RF and SVM were our worst performing models, confirming that modern word-vector and neural network language model-based approaches are superior.

There is significant variability in the length and quality of the free-form clinical narrative written in the note, especially in the HPI section, which is typically a clinician’s narrative of the patient’s medical status and the need for the procedure. In some cases, the HPI section is only one or two words long (Supplemental Fig. 4), whereas in other cases it is a rich narrative (Supplemental Figs. 2, 5). We believe that the relatively poor performance of ASA-PS prediction using the HPI alone is a consequence of this variability in documentation, as the model may have limited information for prediction if the note text does not richly capture the clinical scenario.

These models rarely made catastrophic errors. Erroneous predictions were typically adjacent to the ASA-PS assigned by the anesthesiologist, suggesting the model makes appropriate associations between free-form text predictors and the outcome variable (Fig. 5). Furthermore, when new anesthesiologist raters were asked to assign ASA-PS to the cases where catastrophic errors occurred for the BioClinicalBERT model on the Note512 task, there was greater concordance between the model predictions and the new anesthesiologists than with the original anesthesiologist (Fig. 6). Shapley feature attributions for one of these catastrophic errors in Fig. 7 reveal that the original anesthesiologist may have made the wrong assignment, or may have written a note that does not reflect the true clinical scenario. In this example, the original anesthesiologist assigned the case ASA IV-V, but the model predicted ASA I. Feature attributions show that the BioClinicalBERT model correctly identifies pertinent negatives on the trauma exam, a normal hematocrit of 33, and a normal Glasgow Coma Scale (GCS) of 15 as all supporting a prediction of ASA I and against ASA IV-V [47]. In this example, all new anesthesiologist raters agreed with the model rather than the original anesthesiologist. These findings from our catastrophic error analysis suggest that model performance may be underestimated by our evaluation metrics, as the ground-truth test set contains imperfect ASA-PS assignments. They also illustrate how the model is robust to potentially faulty labels: despite a noisy training and evaluation set, NLP models are still able to make clinically appropriate ASA-PS predictions.

Our exploration of Shapley feature attributions reveals that the model can identify indirect indicators of a patient’s illness severity. For example, subcutaneous heparin is often administered to bed-bound inpatients to prevent deep vein thrombosis. Supplemental Fig. 4 depicts an example where the model learns to associate mention of subcutaneous heparin in the medication list with a higher ASA-PS, likely because hospitalized patients are generally more ill than outpatients who present to the hospital for same-day surgery. Similarly, the model learns an association between the broad-spectrum antibiotic ertapenem and a higher ASA-PS, compared with narrow-spectrum or prophylactic antibiotics such as metronidazole or cefazolin. These observations show that the model is able to identify and link these subtle indicators to a patient’s illness severity. Shapley-value feature attributions prove to be an effective tool for clinicians to understand how a model makes its predictions from text predictors.

Limitations

Our dataset is derived from a real-world EHR used to provide clinical care and includes human- and computer-generated errors. These issues include data entry and spelling errors, the use of abbreviations, references to other notes and test results not available to the model, and automatically generated or inserted text from note templates. For this feasibility study we used the anesthesia preoperative evaluation note. This note is typically written days or weeks in advance for elective procedures, but is sometimes written immediately before, during, or after the procedure in urgent or emergent scenarios. These notes were included because our goal was to study the factors that affect ASA-PS prediction from note text with NLP models. We have not conducted clinical validation of these models, nor have we validated model performance across multiple institutions.

The BioClinicalBERT model is limited to an input sequence of 512 tokens; future investigation is needed to understand whether longer-context large language models can achieve better performance. We also did not explore more advanced NLP approaches, such as entity and relation extraction, which may further enhance prediction performance. Larger models such as GPT-3 have been shown to achieve improved performance across a variety of tasks, but these models are not specialized for the clinical domain; we do not explore them in this feasibility study and leave this exploration to future research [48].

Finally, the ASA-PS is known to have only moderate interrater agreement among human anesthesiologists [49, 50]. Consequently, a perfect classification on this task is not possible since the ground truth labels derived from the EHR encapsulate this interrater variability.

Conclusions

Our feasibility assessment suggests that NLP models can accurately predict a patient’s illness severity using only free-form text descriptions of patients without any manual data extraction. They can be automatically applied to entire panels of patients, potentially allowing partial automation of preoperative assessment triage while also serving as a measure of perioperative risk stratification. Clinical decision support tools could use techniques like these to improve identification of comorbidities, resulting in improved patient safety. These tools may also be used at the healthcare system level for population health analyses and for billing purposes. Predictions made by more advanced NLP models benefit from explainability through Shapley feature attributions, which produce explanations that logically support model predictions and are understandable to clinicians. Future work includes assessment of more advanced natural language models that have more recently become available, use of non-anesthesiologist clinician notes, and exploration of NLP-based prediction of other outcome variables which may be less subject to interrater variability.