Take-home message

Natural language processing (NLP) of electronic caregiver notes is a novel and powerful epidemiological tool to identify behavioral disturbance in critically ill patients. NLP identifies more patients with abnormal behavior and disturbed cognitive state who will receive antipsychotic medications, are more severely ill, are likely to stay in the intensive care unit (ICU) and hospital for longer, and are more likely to die than CAM-ICU positive patients.


Delirium is common in critically ill patients [1] and associated with poor outcomes [2, 3]. The fourth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM IV) defines delirium as a “disturbance in consciousness that is accompanied by a change in cognition that cannot be accounted for by a preexisting or evolving dementia” [4]. The manual describes three criteria that may be used to identify such a disturbance in consciousness: A, “reduced clarity of awareness of the environment”; B, “accompanying change in cognition” or “development of a perceptual disturbance”; and C, “develops over a short period of time and tends to fluctuate during the course of the day”.

The fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM V) reconsiders delirium diagnosis, listing five diagnostic criteria. Three describe disturbance in behavior (i.e., “disturbance in attention”, “develops over a short period” and “disturbance in cognition”), and two describe the absence of pre-existing disturbances or conditions (i.e., “preexisting, established or evolving neurocognitive disorder” and “a direct physiological consequence of another medical condition”) that, if present, would exclude a diagnosis of delirium [5]. Although more restrictive than the DSM IV and dependent on interpretation, the DSM V criteria have been found to identify a similar population of patients to the DSM IV criteria [6, 7].

Despite such constructs, the systematic identification of delirium in intensive care unit (ICU) patients through direct application of these criteria is considered challenging [8]. As a consequence, derivative methodologies have been developed to operationalize screening for delirium in the ICU [9]. Of these, the Confusion Assessment Method for ICU (CAM-ICU) [10, 11] and the Intensive Care Delirium Screening Checklist (ICDSC) [12], both derived from the DSM IV, have been recommended as the most reliable for use within the adult ICU [13]. In our ICUs, the CAM-ICU methodology is used by bedside nurses once per shift to assess patients for possible delirium. The results of this assessment are documented in the electronic progress notes. However, progress notes also contain nursing, medical, and allied health caregivers’ narrative observations, which may further describe a patient’s behavior and cognitive state.

We hypothesized that, in their clinical progress notes, caregivers might use words suggestive of disturbed behavior or cognition such as “agitated” and “confused”, which would increase the information available for the assessment of the patient’s cognitive state. In a recent study [14], we successfully used Natural Language Processing (NLP) techniques to detect these words in the electronic clinical progress notes of a large cohort of critically ill patients. NLP is a software technology that is used to automatically analyze textual information in documentation such as progress notes. We referred to patients identified by such words as having “NLP-Diagnosed Behavioural Disturbance” (NLP-Dx-BD).

Accordingly, in this study, we compared the prevalence, clinical characteristics, outcomes, and antipsychotic medications-based treatment of patients with NLP-Dx-BD with those of patients with CAM-ICU-based delirium screening assessments.


Study design

This was a non-interventional, retrospective study, which used data derived from the electronic health records (EHR) of a university-affiliated ICU system in Melbourne, Australia. This study was approved by the Austin Hospital Human Research Ethics Committee (LNR/19/Austin/38), which waived the need for informed consent. We included all adult patients (≥ 18 years old) admitted to the ICUs of the Austin Hospital, Melbourne between 1 May 2019 and 31 December 2020. If a patient had multiple admissions, only the first admission was considered for inclusion. No additional exclusion criteria were applied.

Data collection and manipulation

All baseline and outcome data were collected from the Australian and New Zealand Intensive Care Society Adult ICU Patient Database run by the Centre for Outcome and Resource Evaluation [15]. Using a proprietary intensive care clinical information system, we obtained electronic data from all progress notes entered into the ICU-specific electronic health record (EHR) by clinical staff including doctors, nurses, physiotherapists and other allied health professionals.

During the study period, by ICU policy, patients received general care aimed at decreasing the risk of delirium, including frequent family visits, dimmed lights at night, minimal interaction to facilitate night-time sleep cycling, and ensuring use of spectacles and hearing aids as necessary. CAM-ICU and Richmond Agitation-Sedation Scale (RASS) assessments were recorded in the clinical progress notes by nursing staff during every 8-h shift. NLP techniques were then used to extract the detail of these assessments from the notes. Further, the clinical progress notes of all caregivers were analyzed using NLP tokenizing techniques (natural language toolkit; NLTK 3.5) [16]. As previously described [14], progress notes were converted to sentence vectors. Each vector was then searched for the presence of words, terms, or expressions suggestive of behavioral disturbance, in accordance with terms selected in a previously published survey of clinical staff [17] (eTable 1a and eTable 1b in the Online Supplement). In this survey, clinical staff were asked to identify words, terms or expressions that they would use to describe a situation where they thought a patient was exhibiting disturbed behavior.
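The term search described above can be illustrated with a minimal sketch. It is not the study pipeline: the study used NLTK 3.5 tokenizers, whereas this example assumes a naive regular-expression sentence splitter from the Python standard library so that it runs without external data. The term list is a placeholder containing only the two words named in the text (“agitated”, “confused”); the full list is in eTable 1a and eTable 1b and is not reproduced here. The function names are hypothetical.

```python
import re

# Placeholder term list: only "agitated" and "confused" are named in the text.
# The full list is in eTable 1a/1b of the supplement and is not reproduced here.
BD_TERMS = ("agitated", "confused")

# Whole-word, case-insensitive match for any listed term.
_TERM_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in BD_TERMS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(note: str) -> list[str]:
    """Naive sentence splitter (a stand-in for a proper NLP tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [p for p in parts if p]


def find_bd_terms(note: str) -> list[tuple[int, str]]:
    """Return (sentence index, matched term) pairs for one progress note."""
    hits = []
    for i, sentence in enumerate(split_sentences(note)):
        for match in _TERM_RE.finditer(sentence):
            hits.append((i, match.group(1).lower()))
    return hits


# Invented example note, not taken from any patient record.
example_note = "Patient settled overnight. Became Agitated at 0300 and appeared confused."
print(find_bd_terms(example_note))
```

A plain keyword match of this kind would also flag negated phrases such as “not confused”; how such cases were handled in the study is not stated in this section.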

Finally, we obtained data on antipsychotic medications used in our ICUs (haloperidol, olanzapine, quetiapine, and risperidone) from the hospital EHR and correlated them with other study data. We hypothesized that treatment with antipsychotic medications may reflect caregivers’ discomfort with patients’ cognitive dysfunction and disturbed behavior. Accordingly, we chose this outcome as the clinically relevant primary outcome for our study.

Data definition

We searched nursing progress notes for CAM-ICU and RASS assessments using NLP. Simultaneously, we analyzed medical, nursing and allied health progress notes for words describing NLP-Dx-BD.

We classified patients as CAM-ICU positive, negative, or unable to be assessed (e.g. the patient does not speak English or the patient refuses to cooperate) or missing.

We then classified patients as CAM-ICU positive when, in at least one progress note, a CAM-ICU positive assessment was reported.

We classified patients as NLP-Dx-BD positive when, in at least one progress note, we found a word indicative of behavioral disturbance (eTable 1a and eTable 1b).
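The “at least one progress note” rule can be sketched as a simple aggregation from note-level results to a patient-level flag. The function names and the encoding (True, False, None) are hypothetical. The rule for a negative patient (at least one valid negative assessment and no positive one) is an assumption made for illustration, since the text only specifies how positivity was defined.

```python
from typing import Iterable, Optional


def patient_cam_icu_status(note_results: Iterable[Optional[bool]]) -> Optional[bool]:
    """Aggregate per-note CAM-ICU results into one patient-level status.

    Encoding (assumed): True = positive, False = negative,
    None = unable to be assessed or missing.
    """
    results = list(note_results)
    if any(r is True for r in results):
        return True   # positive in at least one progress note
    if any(r is False for r in results):
        return False  # assumed rule: assessed at least once, never positive
    return None       # never successfully assessed


def patient_nlp_bd_positive(term_hits_per_note: Iterable[int]) -> bool:
    """Positive when at least one note contains a behavioral disturbance term."""
    return any(n > 0 for n in term_hits_per_note)


print(patient_cam_icu_status([None, False, True]))
print(patient_nlp_bd_positive([0, 0, 2]))
```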


The primary exposure of the present study was the combination of CAM-ICU and NLP-Dx-BD diagnoses. Patients were, therefore, classified in four groups:

Group 1: Negative for both assessments.

Group 2: Only CAM-ICU positive.

Group 3: Only NLP-Dx-BD positive.

Group 4: Positive for both assessments.
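As an illustration, the four exposure groups follow directly from the two patient-level flags. The function name below is hypothetical; the group numbering is the one defined above.

```python
def exposure_group(cam_icu_positive: bool, nlp_bd_positive: bool) -> int:
    """Map the two patient-level flags to exposure Groups 1 to 4."""
    if cam_icu_positive and nlp_bd_positive:
        return 4  # positive for both assessments
    if nlp_bd_positive:
        return 3  # only NLP-Dx-BD positive
    if cam_icu_positive:
        return 2  # only CAM-ICU positive
    return 1      # negative for both assessments


for cam, nlp in [(False, False), (True, False), (False, True), (True, True)]:
    print(cam, nlp, exposure_group(cam, nlp))
```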


The primary outcome was use of antipsychotic medications. Such drugs included haloperidol, quetiapine, olanzapine and risperidone. Despite controversy [18], these medications represent the most frequent medical intervention applied to agitation, perceived delirium and/or behavioral disturbance in ICU.

Secondary outcomes included the duration of mechanical ventilation (in ventilated patients), ICU and hospital length of stay, and ICU, hospital, and 28-day in-hospital mortality.

Statistical analysis

All continuous data are reported as medians (quartile 25%–quartile 75%) and categorical data as numbers and percentages. Baseline, clinical characteristics, and outcomes of the patients were compared among the groups using Fisher exact test and the Kruskal–Wallis test. The proportion of patients receiving medications over time is presented in Kaplan–Meier curves and compared with the log-rank test. Multivariable logistic regression models were used to assess the association of the exposure groups with hospital mortality, with Group 1 (CAM-ICU negative and NLP-Dx-BD negative) used as reference. To account for immortal time bias, we additionally conducted a time-dependent Cox proportional hazards model for the primary outcome and hospital mortality that considered all measurements available in each note. For the primary outcome, only exposures happening before the first outcome (first time the patient received an antipsychotic) were included in the model to avoid exposures measured after the medication had already been given. The models were adjusted for age, type of admission, and the Australian and New Zealand Risk of Death (ANZROD) after log transformation [19]. ANZROD is a validated and accurate predictor of mortality in ICUs in Australia and New Zealand [20]. Effect estimates were reported as odds ratios (OR) for the logistic regression models and hazard ratios (HR) for the Cox models, each with its 95% confidence interval (CI).
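For readers less familiar with these effect estimates, the sketch below shows how an unadjusted odds ratio against a reference group and its Wald 95% CI are obtained from a 2 × 2 table. It is only an illustration: the study fitted logistic regression and Cox models in R, and adjusted estimates cannot be reproduced this way. The function name is hypothetical and the counts are invented, not study data.

```python
import math


def odds_ratio_wald_ci(exposed_events, exposed_no_events,
                       ref_events, ref_no_events, z=1.96):
    """Unadjusted odds ratio versus a reference group with a Wald CI.

    z = 1.96 gives an approximate 95% interval. All four cells must be > 0.
    """
    cells = (exposed_events, exposed_no_events, ref_events, ref_no_events)
    if any(c <= 0 for c in cells):
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (exposed_events * ref_no_events) / (exposed_no_events * ref_events)
    # Standard error of log(OR) is the square root of the summed reciprocal cells.
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only (not taken from the study).
or_, lo, hi = odds_ratio_wald_ci(20, 80, 10, 90)
print(f"OR {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```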

Sensitivity analyses were performed stratifying the cohort according to the use of mechanical ventilation. The overall rate of missing data was low and is reported in eTable 2 in the Online Supplement. All analyses were complete-case analyses and were conducted in R v.4.0.3 (R Foundation) [21]. A P value < 0.05 was considered significant.


The baseline characteristics of the study patients according to the four groups are shown in Table 1. Their median age was 63 years, most were male, most were admitted due to medical conditions, and the largest group of patients were admitted due to cardiovascular disease. The most prevalent pre-existing disorder was diabetes, followed by chronic lung disease. Overall, just above half received mechanical ventilation and just below half received vasopressor/inotropic drugs. However, irrespective of CAM-ICU status, the presence of NLP-Dx-BD positivity identified patients with greater illness severity. The mean RASS during ICU admission was − 0.5 (− 1 to − 0.2), and was similar between the groups (P = 0.238) (eFigure 1 in Online Supplement).

Table 1 Baseline characteristics of included patients

Characteristics of CAM-ICU delirium and NLP behavioral disturbance diagnosis

The time to first CAM-ICU positivity or to first NLP-Dx-BD assessment is shown in eFigure 2 in the Online Supplement. The frequency of words used to describe NLP-Dx-BD is shown in eTable 3 and, as a word cloud, in eFigure 3 in the Online Supplement. Overall, 32% of the patients with NLP-Dx-BD were identified on day 0 (day of admission), but only 11% of CAM-ICU positive patients were identified on this day. In addition, the largest proportion of patients positive for NLP-Dx-BD (37%) and for CAM-ICU (29%) were identified on day 1.

Among NLP-Dx-BD-positive patients, the median number of notes with BD positive words was 3 (1–8) and the median cumulative number of BD positive words was 4 (2–13) per patient. Among CAM-ICU positive patients, the median number of notes that were positive for NLP-Dx-BD was 7 (3–18) and the median cumulative number of words was 11 (4–30). The median number of ICU-shift notes per patient where CAM-ICU was reported was 3 (2–7). Among CAM-ICU positive patients a median of 2 (1–4) notes were positive. The number of notes per patient with CAM-ICU reported as “unable to be done” was 0 (0–2).

Classification according to NLP-Dx-BD and CAM-ICU

We studied 2932 patients. Among these, NLP-Dx-BD or CAM-ICU assessments were not available and reported as “unable to be performed” or missing in 1 (0%) and in 619 (21.1%) patients, respectively, leaving 2313 patients with complete data for analysis. Of these, 1246 (53.9%) were NLP-Dx-BD positive and 578 (25%) were CAM-ICU positive. Among NLP-Dx-BD positive patients 539 (43.3%) were CAM-ICU positive, while among CAM-ICU positive patients 539 (93.3%) were NLP-Dx-BD positive.

When assessing the four key groups, 1028 (44.4%) were categorized in Group 1; 39 (1.7%) in Group 2; 707 (30.6%) in Group 3; and 539 (23.3%) in Group 4. The distribution of patients into these groups shows that NLP-Dx-BD identified more patients than CAM-ICU (Fig. 1). It also demonstrates that, overall, only 44% of patients were negative for both assessments during their ICU stay.
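The reported proportions can be checked directly from the four group counts given above. The short sketch below uses only those counts; the variable and function names are hypothetical.

```python
# Group counts as reported in the text.
groups = {1: 1028, 2: 39, 3: 707, 4: 539}

total = sum(groups.values())            # patients with complete data
nlp_positive = groups[3] + groups[4]    # NLP-Dx-BD positive (Groups 3 and 4)
cam_positive = groups[2] + groups[4]    # CAM-ICU positive (Groups 2 and 4)
both_positive = groups[4]


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


print("total:", total)
print("NLP-Dx-BD positive:", nlp_positive, pct(nlp_positive, total))
print("CAM-ICU positive:", cam_positive, pct(cam_positive, total))
print("CAM-ICU positive among NLP-Dx-BD positive:", pct(both_positive, nlp_positive))
print("NLP-Dx-BD positive among CAM-ICU positive:", pct(both_positive, cam_positive))
for g, n in groups.items():
    print(f"Group {g}:", n, pct(n, total))
```

The asymmetry between the last two overlap figures is what the text describes: NLP-Dx-BD captured most CAM-ICU positive patients, while fewer than half of NLP-Dx-BD positive patients were CAM-ICU positive.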

Fig. 1 Flow chart of participation

We additionally assessed categorization according to different levels of conscious state from coma to agitation using the lowest and highest RASS score (eFigure 4 and eFigure 5, Online Supplement) to illustrate the impact of such states during each shift on the classification of patients according to NLP-Dx-BD and CAM-ICU status. NLP identified more patients than CAM-ICU when the shift RASS score was in the awake range and when the highest shift RASS score was in the coma or agitated range.

Primary outcome: use of antipsychotic medications

The use of antipsychotic medications was highest in patients in Group 4 (24.3%) (Table 2 and Fig. 2). The group with the second greatest use of these medications was Group 3 (10.5%). For all antipsychotic medications, the rate of use in Group 2 patients was 5.1%, similar to that of Group 1 at 2.3% (Table 2). The impact of NLP-Dx-BD on the use of antipsychotic medications was confirmed by univariate and multivariate Cox models that treated groups as time-dependent variables (eTable 4 and eTable 5).

Table 2 Use of antipsychotic medications (APM) in the study population

Fig. 2 Bar plot illustrating antipsychotic medication use across observation groups

By day 3, most of such treatment had been given, with the exception of Group 4 patients who continued to accrue further treatment up to one week after ICU admission (Fig. 3).

Fig. 3 Kaplan–Meier plots of time to event for antipsychotic medication use for observation groups

Secondary clinical outcomes

Clinical outcomes of study patients are shown in Table 3. Overall hospital mortality was 6.3%. Mortality was highest in Group 4 (10.8%), lower but similar in Group 2 and Group 3 (7.7% and 7.5%, respectively), and lowest in Group 1. Duration of ICU and hospital stay was significantly longer in the presence of NLP-Dx-BD positivity, irrespective of CAM-ICU status.

Table 3 Clinical Outcomes of Included Patients

On univariable analysis, both Group 4 (OR, 3.77 [95% CI 2.43–5.95]; P < 0.001) and Group 3 (OR, 2.53 [95% CI, 1.62–4.00]; P < 0.001) were associated with increased risk for hospital mortality (eTable 6 in Online Supplement). On multivariable analysis, only Group 3 (OR 1.69 [95% CI 1.05–2.76]; P = 0.03) was independently associated with increased risk for hospital mortality (eTable 7). On univariate modelling that treated groups as time-dependent variables both NLP-Dx-BD and CAM-ICU status were associated with mortality but, on multivariable modelling, no overall effect was found (eTable 8 and eTable 9).


Key findings

In a cohort of more than two thousand ICU patients, we compared the prevalence, characteristics, treatment with antipsychotic medications, and outcomes of patients with NLP-diagnosed behavioral disturbance (NLP-Dx-BD) and patients with Confusion Assessment Method for ICU (CAM-ICU) positivity. We found that more than half were NLP-Dx-BD positive and a quarter were CAM-ICU positive. Among NLP-Dx-BD positive patients, four out of ten were CAM-ICU positive. In contrast, among CAM-ICU positive patients, nine out of ten were NLP-Dx-BD positive. NLP-Dx-BD identified significantly more patients likely to receive antipsychotic medications and, in the absence of NLP-Dx-BD, treatment with antipsychotic medications was uncommon. Finally, regardless of CAM-ICU status, NLP-Dx-BD was associated with significantly longer duration of ICU and hospital stay and greater hospital mortality (Table 4).

Table 4 Univariable and multivariable models with hospital mortality as outcome

Relationship to previous studies

Dependent on risk factors and delirium phenotype, the prevalence of CAM-ICU positive patients has been reported between 10 and 89% [22,23,24,25]. Our prevalence of 25% falls within this range. CAM-ICU positive patients appear to have a hospital mortality rate of between 10.7 and 27% [26,27,28]. Our hospital mortality rate of 10.6% falls just below this range. Further, our finding that NLP-Dx-BD identified more patients with behavioral and/or cognitive disturbance earlier in the critical care episode than CAM-ICU is also consistent with previous findings that, in routine practice, the use of CAM-ICU may delay the detection of delirium [29]. Our observations that the prevalence of behavioral disturbance as detected by CAM-ICU was lower than with NLP-Dx-BD are also similar to findings from previous studies where CAM-ICU was found to detect lower rates of delirium than unstructured assessments [29, 30].

CAM-ICU was not obtained in one in five patients. This is aligned with work by Terry et al. who found a missed documentation rate of more than half of available opportunities in a pair of American medical and surgical ICUs [31] and with data from Vanderbilt University Medical Centre showing a 16% non-compliance for CAM-ICU assessment in an institution highly dedicated to CAM-ICU assessment [32]. Moreover, Kanova et al. reported in a prospective study that 14% of patients could not be assessed for CAM-ICU due to prolonged coma [33]. Our observations indicate that, during a given shift, and in the presence of RASS defined coma, CAM-ICU assessment was missing in most patients. However, this was also true to a similar extent for NLP-Dx-BD words and coma was an uncommon state in our cohort.

Finally, our observations of the prevalence, characteristics, and outcome of NLP-Dx-BD are aligned with those found in a much larger population of more than 12,000 patients recently investigated by our group [14]. Moreover, the words used to define this condition are supported by our survey of the terms that clinicians would use to describe the presence of an acute behavioral disturbance [17].

Implications of study findings

Our findings imply that NLP-Dx-BD may identify more patients with abnormal behavior and disturbed cognitive state than CAM-ICU. Moreover, they imply that NLP-Dx-BD positive patients are significantly more likely to receive antipsychotic medications than CAM-ICU positive patients. Furthermore, they suggest that NLP-Dx-BD positive patients are more severely ill, more likely to stay in ICU and hospital for longer, and more likely to die. This is consistent with a previous study that found that agitated behavior, even without diagnosed delirium, correlated with poorer outcomes [34]. It is also consistent with clinical observations that the level of sedation and behavioral symptoms often fluctuate over an 8-h or 12-h shift. Thus, NLP-Dx-BD will capture behavioral screening features over the whole period but CAM-ICU, which represents a “snapshot” at a single point of time during the shift, may not. Finally, our findings suggest that NLP may be particularly sensitive during shifts where a given RASS score has identified the presence of either deep sedation or agitation.

Strengths and limitations

Our study has several strengths. It studied a novel approach to the assessment of the epidemiology of behavioral disturbance in critically ill patients. It compared such assessment with delirium assessment performed by the CAM-ICU methodology. It also conducted such comparisons in a large and heterogeneous group of ICU patients, thus increasing the external validity of the observations. The incidence of CAM-ICU positive assessment was consistent with the literature and our study applied a primary outcome based on therapy, which is likely to be relevant to clinicians as well as patients. Finally, by showing that NLP-Dx-BD positivity captured almost all patients with CAM-ICU positive status, it provided additional indirect evidence of the validity of its construct.

We acknowledge several limitations. Our study was undertaken in a large tertiary intensive care unit system in a university-affiliated hospital of a resource-rich country. Therefore, its findings may not apply to intensive care units in low- or middle-income countries. Multiple progress notes were missing a positive or negative CAM-ICU assessment. However, this may reflect the well-described operational challenges in applying CAM-ICU and had minimal material impact on our findings. We did not provide data on non-pharmacological interventions including family visits and environmental management. However, many were applied by unit policy and the efficacy of such interventions remains uncertain, as demonstrated in a recent stepped-wedge cluster controlled trial [35]. Patients were not assessed for the presence of delirium by an independent, psychiatrically trained clinician. However, we were studying the relationship between patient populations identified through the use of alternative screening methodologies. Thus, we were not attempting to confirm that the patients identified by such screening would then be diagnosed with delirium by such clinicians. Consequently, the changes in behavior may have reflected drug withdrawal, exacerbation of dementia or of pre-existing psychiatric disorders. Caregivers generating the progress notes may have been aware of CAM-ICU results at the time of writing their notes; however, it is impossible to determine if an NLP-Dx-BD positive note preceded or followed a CAM-ICU assessment. Further, significantly more patients were NLP-Dx-BD positive than CAM-ICU positive, thus demonstrating that NLP-Dx-BD positivity was often determined independently of CAM-ICU assessment. Clinical progress notes may include errors in annotation that incorrectly characterize a patient’s behavior [36, 37]. However, our study analyzed multiple progress notes (nursing, medical, allied health) that documented the same shift, making systematic errors unlikely. There may also have been errors in the records of administration of antipsychotic medication. However, our ICU uses an audited medication management system that accounts for the prescription, retrieval and administration of medications, which would minimize such errors. Finally, we did not assess long-term cognitive outcomes. Thus, we cannot make any statement about the long-term sequelae of NLP-Dx-BD.


In conclusion, Natural Language Processing of electronic caregiver notes and the Confusion Assessment Method for ICU (CAM-ICU) describe partly overlapping populations. However, NLP-Dx-BD identified more patients with behavioral and/or cognitive disturbance than CAM-ICU, while identifying more than 90% of CAM-ICU positive patients as having such a disturbance. Moreover, NLP-Dx-BD identified significantly more patients who went on to receive treatment with antipsychotic medications. In contrast, in the absence of NLP-Dx-BD, very few CAM-ICU positive patients received such treatment. Further, irrespective of CAM-ICU status, NLP-Dx-BD also identified patients with longer duration of ICU and hospital stay and greater hospital mortality. These observations suggest that NLP-Dx-BD may be a novel and useful epidemiologic screening tool in ICU patients.