INTRODUCTION

Vital signs that reflect the cardiovascular and respiratory systems are continuously displayed on bedside monitors in the neonatal intensive care unit (NICU), and aberrations may signal a variety of pathologic processes.1 Subtle changes can occur before overt clinical signs of illness, prompting the development of early warning systems that alert clinicians to changes in patient status requiring attention.2,3,4 One example is the finding of abnormal heart rate (HR) characteristics of decreased variability and transient repetitive decelerations that sometimes precede the clinical presentation of sepsis, necrotizing enterocolitis (NEC), or other infections in very low birth weight (VLBW) preterm infants.5,6,7,8,9

In a nine-NICU randomized clinical trial of 3003 VLBW infants, display of an HR characteristic index, the fold increase in risk of sepsis being diagnosed in the next day, was associated with a 22% relative decrease in mortality rate.10 Another example of a change in vital signs in preterm infants is the simultaneous fall in HR and oxygen saturation (SpO2) during neonatal apnea, the familiar bradycardia-desaturation spell.11,12,13 A measure of this, the maximum cross-correlation of HR and SpO2, increased prior to diagnosis of sepsis or NEC in a study of 1065 VLBW infants in two NICUs.14

HR and SpO2 are affected not only by illness and stress but also by maturation and by clinical care practices such as the mode of respiratory support.15,16,17 Different bedside monitor hardware and sensors may also contribute to differences in vital sign measurements across units. Here, we examined the ranges of values of canonical vital signs for >1000 VLBW infants at three large tertiary care NICUs during the first 6 weeks of hospitalization. We also compared the number of bradycardia and desaturation events and the cross-correlation of HR and SpO2. As a step toward developing mathematical predictive algorithms that are generalizable across NICUs, we sought to determine the expected ranges of these parameters over time and how they varied among infants at the three sites.

METHODS

We analyzed vital sign data from VLBW infants (≤1500 g birth weight) admitted from 2012 to 2018 at three level IV NICUs (University of Virginia (UVA): University of Virginia Children’s Hospital, Columbia University (CU): NewYork-Presbyterian Morgan Stanley Children’s Hospital, and Washington University in St. Louis (WUSTL): St. Louis Children’s Hospital). Institutional Review Boards at each site approved the study with a waiver of consent. We excluded infants with congenital or chromosomal anomalies that could impact oxygenation, those transitioned to comfort care only, and those with fewer than 7 days of HR and SpO2 data within the first 28 days after birth.

The three participating centers routinely collect and store NICU bedside monitor vital sign data using the BedMaster system (Excel Medical, Jupiter, FL). In addition, UVA collects data using the Cardiopulmonary Corporation system (Milford, Connecticut). During the period of study, UVA and CU NICUs used GE bedside monitors (GE Healthcare, Waukesha, WI) with Masimo pulse oximeters (Masimo Corporation, Irvine, CA), and data were recorded at 0.5 Hz. The WUSTL NICU used Philips monitors (Philips Corporation, Andover, MA) with Nellcor Oximax pulse oximeters (Medtronic, Minneapolis, MN), and data were recorded at 1 Hz but down-sampled to 0.5 Hz to match the other sites. All pulse oximeters had an 8 s averaging time. During the study period, UVA and WUSTL clinicians had a default goal SpO2 range 88–95%, increasing slightly as infants approached term-corrected gestational age. CU used a goal range of 85–93% until August 2013 and then switched to 90–95%. Bradycardia alarms were set at 90 beats per minute (b.p.m.) at UVA and 100 b.p.m. at the other two sites.

HR, PR, and SpO2 metrics

We analyzed continuously measured electrocardiogram-derived HR, pulse oximeter-derived pulse rate (PR), and SpO2. Daily mean, standard deviation, skewness, and kurtosis of HR, PR, and SpO2 were computed for each infant over the first 6 weeks after birth. To control for artifact, all values of zero were removed and, for measurements other than mean, values >99th percentile were censored to the 99th percentile value.
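The daily summary step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Matlab code: the function names `percentile` and `daily_summary` are hypothetical, and the percentile interpolation rule and the use of moment-based (non-bias-corrected) skewness and kurtosis are assumptions, since the text does not specify them.

```python
import math


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..100) of an ascending list."""
    if not sorted_vals:
        raise ValueError("no data")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def daily_summary(samples):
    """Summarize one infant-day of HR, PR, or SpO2 samples."""
    # Values of zero are treated as artifact and removed.
    valid = [x for x in samples if x != 0]
    if len(valid) < 2:
        raise ValueError("need at least two non-zero samples")

    # The mean is computed on the uncensored values.
    mean = sum(valid) / len(valid)

    # For the other measures, values above the 99th percentile
    # are censored (capped) to the 99th percentile value.
    cap = percentile(sorted(valid), 99)
    capped = [min(x, cap) for x in valid]

    n = len(capped)
    m = sum(capped) / n
    m2 = sum((x - m) ** 2 for x in capped) / n
    m3 = sum((x - m) ** 3 for x in capped) / n
    m4 = sum((x - m) ** 4 for x in capped) / n

    sd = math.sqrt(m2 * n / (n - 1))            # sample standard deviation
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0    # assumption: moment-based
    kurt = m4 / m2 ** 2 if m2 > 0 else 0.0      # assumption: non-excess kurtosis
    return {"mean": mean, "sd": sd, "skewness": skew, "kurtosis": kurt}
```

For example, `daily_summary([0, 150, 150, 160, 160])` drops the zero and reports a mean of 155 b.p.m.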

Bradycardia and desaturation events were quantified using thresholds and definitions we have previously published.18,19 Bradycardia was defined as HR <100 b.p.m. for at least 4 s and desaturation as SpO2 < 80% for at least 10 s. Events were joined if they occurred within 4 or 10 s of each other for bradycardia and desaturation, respectively. We report the mean number and duration of events per day as well as the percentage of time spent in bradycardia or desaturation. We calculated the cross-correlation of HR or PR and SpO2%. We used our own code written in Matlab for the analyses. Data were smoothed using a sliding window of 7 days as we have done in prior work.20,21
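The event definitions above can be illustrated with a short Python sketch. It is not the authors' Matlab implementation. The names `detect_events` and `summarize_events` are hypothetical, and two details are assumptions: each 0.5 Hz sample is taken to represent a 2 s interval, and nearby runs are joined before the minimum-duration criterion is applied (the text does not state the order).

```python
def detect_events(values, threshold, min_duration_s, join_gap_s,
                  sample_period_s=2.0):
    """Return (start_s, end_s) tuples for events below `threshold`."""
    # 1. Find runs of consecutive samples below the threshold.
    runs = []
    start = None
    for i, v in enumerate(values):
        below = v is not None and v < threshold
        if below and start is None:
            start = i
        elif not below and start is not None:
            runs.append([start * sample_period_s, i * sample_period_s])
            start = None
    if start is not None:
        runs.append([start * sample_period_s, len(values) * sample_period_s])

    # 2. Join runs separated by a gap of at most `join_gap_s` seconds.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= join_gap_s:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Keep only events lasting at least `min_duration_s` seconds.
    return [(s, e) for s, e in merged if e - s >= min_duration_s]


def summarize_events(events, total_s):
    """Event count, mean duration (s), and percentage of time in events."""
    durations = [e - s for s, e in events]
    n = len(durations)
    return {
        "count": n,
        "mean_duration_s": sum(durations) / n if n else 0.0,
        "percent_time": 100.0 * sum(durations) / total_s,
    }


# Bradycardia: HR < 100 b.p.m. for at least 4 s, joined within 4 s.
hr = [150, 150, 90, 90, 150, 95, 95, 150, 150, 150, 150, 90, 150]
brady = detect_events(hr, threshold=100, min_duration_s=4, join_gap_s=4)
```

Desaturation events would use the same function with `threshold=80`, `min_duration_s=10`, and `join_gap_s=10`. In the toy series, the two nearby dips are joined into one event and the isolated single-sample dip is discarded as too short.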

Statistics

We assessed for site effects on each metric using daily means from the day of birth through day 42 by n-way analysis of variance. Pairwise comparisons between sites used a Bonferroni correction to account for multiple comparisons, with significance set at p < 0.05/42/3 (42 days of comparisons, three pairwise comparisons). Figures show estimated marginal means corrected for birth weight, gestational age, and sex differences between sites. Estimated population marginal means control for the influence of the covariates (gestational age, birthweight, and sex) on the outcome variable of interest (HR, SpO2%, etc.).22 They adjust for any biases from imbalances in the covariates. The estimated mean for the variable of interest is based on the equal-weighting method, resulting in adjusted means that are equally balanced across all values of all covariates. To calculate the estimated marginal means, we used the multcompare function in Matlab using a linear repeated-measures model of the data from the anovan function. The statistical impact of the site on a particular metric was measured using −log10(p value), that is, by reporting the number of leading zeros for the p value.
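The arithmetic behind the significance threshold and the leading-zeros measure can be made concrete with a small Python sketch. The function names are hypothetical and this is not the authors' Matlab code. The negative base-10 logarithm of a p value is roughly its number of leading zeros. The exact zero count shown here is one possible convention, assumed for illustration.

```python
import math

ALPHA = 0.05   # nominal significance level
N_DAYS = 42    # days of comparisons
N_PAIRS = 3    # pairwise site comparisons


def bonferroni_threshold(alpha=ALPHA, n_days=N_DAYS, n_pairs=N_PAIRS):
    """Bonferroni-corrected threshold for the pairwise comparisons."""
    return alpha / n_days / n_pairs


def significance_score(p):
    """Negative base-10 logarithm of a p value (larger = more significant)."""
    return -math.log10(p)


def leading_zeros(p):
    """Zeros between the decimal point and the first nonzero digit, 0 < p < 1."""
    return max(0, math.ceil(-math.log10(p)) - 1)


threshold = bonferroni_threshold()   # 0.05 / 42 / 3, about 0.0004
```

Under this convention, a p value of 0.0004 has three leading zeros and a score of about 3.4.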

RESULTS

During the period of study, 3209 VLBW infants were admitted to the three NICUs with vital sign data recording available, 1168 of whom had no exclusions and had at least 7 days of stored vital sign data available for analysis in the first 4 weeks after birth. Demographics of the infants in the three site cohorts are shown in Table 1. We analyzed 35,238 infant-days of data (96 infant-years). The distribution of data availability by postmenstrual age (PMA) was the same for UVA and CU, but WUSTL had lower coverage after 28 weeks PMA (Supplementary Figure S1).

Table 1 Demographic and clinical variables.

As shown in Fig. 1, the mean HR and SpO2% were similar at the three sites over the 6 weeks of study. The mean HR rose from ~150 b.p.m. in the first week to ~160 b.p.m. and changed little thereafter. After 2 weeks of age, there was a small (~4 b.p.m.) difference in infants’ daily mean HR between sites. The daily mean SpO2 was slightly different (~1%) between sites in the first 2 weeks after birth.

Fig. 1: HR and SpO2 trends of VLBWs at three NICUs.

Daily mean HR (left) and SpO2 (right) are shown for VLBW infants at the three NICUs through the first 6 weeks from birth. Thin dotted lines indicate the 95% confidence interval. Asterisks indicate a statistically significant difference on that day compared to one other site (small asterisk) or both other sites (large asterisk). Y-axis limits are the 10th and 90th percentiles.

Figure 2 shows the number (top panels) and durations (bottom panels) of bradycardia events (left) and desaturation events (right). The differences were as large as twofold; infants at CU had up to twice as many bradycardia events per day, and infants at WUSTL had about half as many desaturation events, with the magnitude of the differences varying over time. By 3 weeks after birth, the difference in daily numbers of bradycardias between sites was no longer evident, while the difference in daily numbers of desaturations between sites increased from birth to 6 weeks. The smaller differences in event durations remained similar throughout. The percentages of time spent in bradycardia or desaturation are shown in Supplementary Figure S2. The numbers of bradycardia and desaturation events are shown split by birthweight in Supplementary Figure S3.

Fig. 2: Bradycardia and desaturation events by site.

Mean number of bradycardia (a) and desaturation (b) events per day of data are shown for the first 6 weeks from birth for VLBW infants at UVA, CU, and WUSTL. Mean event duration in seconds is shown in the bottom panels (c, d). Dotted lines indicate the 95% confidence interval. Asterisks indicate a statistically significant difference on that day compared to one other site (small asterisk) or both other sites (large asterisk). Y-axis limits are the 10th and 90th percentiles.

Although the absolute differences in some of the HR and SpO2 metrics between sites were very small, the large number of data points analyzed gave some of these differences high statistical significance. This is depicted in Fig. 3 as a heat map of the number of leading zeros in the p value for inter-site differences in each metric for each day from birth through day 42 (with correction for multiple comparisons, thus statistical significance set at p < 0.05/42 or approximately p < 0.001). Metrics are ordered from those with the most to the least inter-site differences. Notably, skewness of PR measured from the pulse oximeter had more significant inter-site differences (appearing near the top of the list of metrics) compared to skewness of HR measured from the electrocardiogram (appearing near the bottom of the list). Individual trends for all Fig. 3 metrics not shown in Figs. 1 and 2 are shown in Supplementary Figures S4–S8. Supplementary Figure S9 provides a probability density plot for all vital sign metrics in Fig. 3.

Fig. 3: Magnitude of statistical significance of site differences in HR, PR, and SpO2 metrics.

For each metric shown on the left y-axis, the number of zeros preceding the p value for inter-site differences each day shown on the x-axis is depicted as a heat map. Black boxes indicate no significant difference between sites (adjusting for 42 comparisons, p < 0.05/42 or approximately p < 0.001). Progressively darker shades of blue indicate more leading zeros in the p value for inter-site differences. *Computed using 10 min averages. SD standard deviation, XC cross-correlation, HR ECG heart rate.

Using the average value for each infant for all HR, PR, and SpO2% metrics across each infant’s whole stay, we ran a rank-sum test to look for a difference between sexes. Upon correcting for birthweight, gestational age, and institution, only the mean, skewness, and kurtosis of SpO2% were significantly different (p < 0.05) between the sexes (Supplementary Figure S10), but the differences were small (<1% difference in mean SpO2%).

DISCUSSION

Abnormal values, trends, and patterns of continuously monitored vital signs in NICU patients can predict imminent or longer-term adverse events and outcomes. Assessment of potential inter-site differences in infants’ vital sign patterns is needed in order to optimize predictive algorithms. We, therefore, performed a three-center comparison of the most frequently monitored vital signs in VLBW infants, HR (HR measured by ECG and PR measured by pulse oximeter) and SpO2, in the first 6 weeks after birth. We found inter-center variability that may reflect differences in patient populations, equipment, or care practices.

With regard to HR and SpO2, Fig. 1 shows that the overall mean HR increased from ~150 b.p.m. in the first week after birth to ~160 b.p.m. from weeks 2–6, while the mean SpO2 of ~94% was consistent over the time period studied. The change in HR over time that we show in this VLBW cohort is similar to that previously published for term infants, but with an offset of ~20 b.p.m. (preterm infants have higher HR than term infants).23 The HR values in Fig. 1 are also similar to those previously published in a single site report on preterm infants at UVA.24 There were inter-site differences of ~4 b.p.m. in HR and 1% in SpO2, which, due to the large volume of data, were statistically significant, if not clinically meaningful. Whether these small differences would impact a mathematical model to predict outcomes would be model-specific; this highlights the importance of developing and testing models at multiple sites.

We also found inter-site differences in bradycardia and desaturation events. In Fig. 2, we note that infants at CU had more bradycardia events during the first 2 weeks after birth. Our definition of bradycardia of <100 b.p.m. was the same as the alarm threshold at CU and WUSTL (and only 10 b.p.m. higher than the alarm threshold at UVA) and thus it is unlikely that the difference in the number of bradycardia events is due to center-specific alarm management. A possible explanation for more bradycardia events at CU is less use of mechanical ventilation and more use of nasal continuous positive airway pressure25 leading to more apnea-associated HR decelerations.26 This is also supported by higher cross-correlation of HR and SpO2 (Supplementary Figure S8). With regard to desaturations, we found lower rates and durations for infants at WUSTL compared to the other two sites. The reason for this difference is unknown, but may relate to the monitor alarm tones. UVA and CU use monitors with a high alert tone for bradycardia events and a softer alert tone for desaturation events, whereas WUSTL monitors give the same alert tone for both desaturations and bradycardias. Another consideration is that different monitors and sensors have different hardware and algorithms, which could impact vital sign values. We are not implying that the bradycardias and desaturations are benign; we are highlighting that differences in clinical care and patient populations between NICUs can impact bradycardias and desaturations. Therefore, cardiorespiratory predictive algorithms should be externally validated.

The small but statistically significant difference in cross-correlation of HR and SpO2 between sites, especially in the first week after birth (Supplementary Figure S7) may be an important finding since we identified its association with apnea and exaggerated periodic breathing.13 Moreover, the cross-correlation of HR and SpO2 was a significant predictor in a model targeting imminent septicemia or NEC.14 In that study of >1000 VLBW infants, we also found that infants at CU had a slightly higher baseline cross-correlation of HR and SpO2 than infants at UVA. The mechanism is unknown, but may relate to less mechanical ventilation at CU and thus more apnea, with a concurrent decline in HR and SpO2.

The strength of this analysis is the large number of VLBW infants and days of data analyzed at three NICUs with diverse patient populations and clinical practices. We acknowledge there are a number of limitations as well. We do not have individual patient data on daily respiratory support in the infants included in these analyses to validate the assumption that different approaches to mechanical ventilation at the three units impact desaturation and bradycardia events due to apnea. We are able to report more generally, however, that days on mechanical ventilation for VLBW infants is quite different at CU compared to WUSTL and UVA (mean 10, 35, and 33 days, respectively, in 2017–2018). Another limitation is that we do not have dates and doses of caffeine, although practices for caffeine administration are similar at the three NICUs. Also, the patient demographics and outcomes are different at WUSTL compared to the other two sites in that infants were, on average, ~1 week lower gestational age and had higher morbidities and mortality. This likely reflects the sociodemographic variables that contribute to well-documented higher infant mortality in St. Louis compared to the other two sites.27,28,29 In the future, we will address the impact of mechanical ventilation and oxygen support on cardiorespiratory events and outcomes of extremely preterm infants in the Pre-Vent multi-NICU collaboration, in which there are granular data on daily respiratory support, medications, and clinical outcomes linked to bedside monitor vital sign data on over 700 infants <29 weeks gestation.30

The differences we see here highlight the importance of multicenter studies, especially when developing predictive analytics. Variations in demographics, clinical practices, and monitors or sensors all have an impact on continuous vital sign data. More than 40 years ago, Ransohoff and Feinstein31 analyzed why diagnostic tests fail. They advanced the concept, later called spectrum bias, that a test is limited if developed on diseased patients who did not represent the spectrum of pathology or clinical features, or if tested on control patients that had a different spectrum of comorbidity. A vivid example of failed external validation of not one but dozens of predictive models is the recent Physionet Sepsis Challenge.32 No model that had a good performance on the two-hospital training data set did at all well on a test set from a third hospital. We note, however, a prominent example of a successful NICU predictive model generated at a single center, the HR characteristics index developed at UVA, which performed well in external validation at a second NICU and was then shown to reduce mortality in a nine-NICU study.10 Moreover, in more recent work, an HR and SpO2 model for predicting sepsis performed well at both UVA and CU in spite of differences in vital sign trends.14

CONCLUSION

In a three-NICU study of 1168 VLBW infants from birth through 6 weeks of age, we found that mean HR and SpO2 were generally similar, but bradycardia and desaturation events differed in the first 2 weeks after birth. The differences we found in bradycardia and desaturation events between sites may inspire mechanistic studies into the impact of variations in respiratory support or other clinical practices on measures of cardiorespiratory instability, which may impact clinical outcomes. Since this work is presented here in the context of developing tools for predictive analytics monitoring, our findings highlight the importance of developing and validating vital sign-based analytics at multiple sites.