Introduction

In 2016, the Sepsis-3 task force defined sepsis as a life-threatening organ dysfunction caused by a dysregulated host response to infection [1]. This condition is a major global health problem and places a significant burden on health care systems worldwide. Sepsis is one of the leading causes of death in the intensive care unit (ICU), affecting 49 million people annually as of 2017 [2]. It occurs in up to 30% of ICU patients and results in a mortality rate twice as high as that of non-septic patients [3]. Recognition and treatment of sepsis should therefore be considered medical emergencies, in order to reduce the time to treatment and the risk to patients [4, 5]. This is crucial because sepsis is a rapidly progressive condition, and patient mortality has been shown to correlate with the timeliness of therapeutic intervention, highlighting the importance of early detection and treatment [6, 7]. Even a few hours of delay between onset and detection and treatment are associated with a reduced survival rate [6, 8, 9].

Unfortunately, there is currently no gold-standard test for the diagnosis of sepsis. Consequently, different sepsis scoring systems (SSS) are commonly used in clinical practice. Strengths, weaknesses and areas of preferential application have been recognized for each of these screening tools [5, 10]. Manually tabulated SSS such as the Systemic Inflammatory Response Syndrome (SIRS) criteria [11] and the Sequential Organ Failure Assessment (SOFA) [12] are usually used to identify sepsis. These tools include the evaluation of several parameters obtained from laboratory tests. Conversely, the Quick-SOFA (qSOFA) [1] uses only three non-laboratory variables and is often employed as a quick assessment that may prompt further investigation. It is normally used to predict organ dysfunction and death in patients with confirmed or suspected sepsis in the emergency department [4, 5, 7]. Unfortunately, the coexistence of multiple definitions of sepsis and of recommendations for different SSS can create confusion in clinical practice, hindering rapid diagnosis and treatment as well as the definition of shared treatment protocols [4, 10, 13, 14]. Furthermore, the use of SSS entails costs, such as those of laboratory tests and the time needed to obtain the score, and shows limited sensitivity [2]. These limitations can be particularly evident in low- and middle-income countries, where timely execution of laboratory tests can be difficult. It has recently been suggested that using multiple SSS simultaneously (mixed models) may be useful, but such models can further delay the evaluation of patients [5, 10, 15]. All of this may limit the use of SSS and indicates the need to keep studying tests and procedures that can promptly recognize the presence of sepsis.

In this context, the availability of electronic clinical records, together with data from continuous monitoring of vital signs, could provide important support for sepsis identification. Among the data available in these datasets are those relating to the microcirculation. These data are important because multiple clinical trials have shown common microcirculatory dysfunctions in sepsis patients [16, 17].
Microcirculatory alterations have been associated with organ failure and increased mortality [18,19,20,21]. In sepsis patients, microcirculatory dysfunctions are reflected in parameters that can easily be evaluated at the skin level, such as the photoplethysmogram (PPG). This signal is commonly monitored using devices such as the pulse oximeter, which is widely used, user-friendly and affordable. The PPG is an optical signal that exploits the absorption or reflection of light by blood to detect changes in blood volume and oxygen saturation at a peripheral site, typically the finger. It is worth noting that perfusion characteristics depend on the measurement site, which must be defined as part of the experimental protocol [22, 23]. The photoplethysmogram is now widely used in intensive care units for cardiovascular monitoring, since it allows non-invasive, continuous and comfortable measurement. Notably, the photoplethysmogram waveform contains information on heart rate, venous blood volume and peripheral vascular tone; taken together, this information is valuable because it can support monitoring of the cardiovascular system.

Spectral analysis of the photoplethysmogram has already been used to gain insight into the peripheral microcirculatory function of sepsis patients. Piepoli et al. [24] showed that the low-frequency (LF, 0.04–0.15 Hz) band of fingertip PPG was suppressed in septic shock patients. This is considered relevant because the low-frequency band of fingertip PPG has been associated with sympathetic control over the peripheral circulation. Middleton et al. [25] reported that the mid-frequency (MF, 0.09–0.15 Hz) band of earlobe PPG showed a significant decrease in power spectral density in severe-stage sepsis patients, compared to controls and early-stage sepsis subjects.

Traditional machine learning algorithms have previously been applied to the detection of sepsis in ICUs. Calvert et al. [26] developed a classical machine learning algorithm to identify sepsis using several vital signs and demographic features. Other studies subsequently validated the same algorithm [27, 28] using different input features and different data sets. These studies showed that the algorithm outperformed standard sepsis diagnostic methods, such as tabulated scoring systems.

Mollura et al. [29] trained multiple machine learning classifiers using features extracted from continuously recorded electrocardiogram (ECG) and arterial blood pressure (ABP) signals, in order to identify sepsis within one hour of admission to the ICU. The authors reported that the classification results were comparable with those obtained with tabulated scores, suggesting that vital sign waveforms might be useful in the early detection of sepsis.

Many recent studies have used deep learning approaches to carry out medical tasks, highlighting their potential in the healthcare field [30,31,32]. Deep learning models learn automatically from raw data without requiring conventional feature extraction and selection steps. Among deep learning architectures, Convolutional Neural Networks (CNN) are currently the state-of-the-art technique for signal processing applications. Consequently, CNNs have been increasingly used in biomedical signal analysis [33, 34]. CNN models and photoplethysmographic signals have previously been used jointly to perform classification tasks. For example, some authors used spectrograms and scalograms obtained from PPG signals to train CNN models for blood pressure classification [35, 36].

In this study, raw fingertip photoplethysmography time series from ICU patients were used to train and evaluate a CNN-based model. The aim of the study was to verify the feasibility of detecting sepsis from the photoplethysmographic signal acquired by the pulse oximeter.

Materials and methods

Dataset

This study used the MIMIC-III database [37], a large, freely accessible critical care database. MIMIC-III is provided as a collection of comma-separated value files, which we imported into a PostgreSQL relational database system. The data are organised in tables containing information such as demographic data, vital sign measurements, laboratory test results, procedures and mortality. The tables are linked by identifiers, allowing information on the same patient to be extracted across tables.

Waveform recordings, such as ECG and PPG, are stored in a separate database, the “MIMIC-III Waveform Database” [38]. In particular, a subset of the waveform database, the “MIMIC-III Waveform Database Matched Subset” [39], contains the recordings whose patients have been linked to the clinical information available in the MIMIC-III database.

The MIMIC-III database contains a highly heterogeneous population of subjects, allowing it to be used for a variety of analytical studies. However, this heterogeneity can make the development of an efficient machine learning algorithm challenging [40]. Moreover, diagnoses are reported only as an ICD code generated at the end of the hospitalization, without any information on the date of the diagnosis. Thus, we selected a subset of the subjects, identified as “sepsis” (cases) and “non-sepsis” (controls) patients. The criteria used to select sepsis and non-sepsis subjects are reported in Table 1. In this phase, a custom Structured Query Language (SQL) query was used.

Table 1 Criteria for the definition of septic patients and control (non-septic) patients
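As an illustration, the case selection can be expressed as a query against the PostgreSQL instance. The following is a minimal sketch, assuming the standard MIMIC-III table layout; the ICD-9 codes shown (sepsis, severe sepsis, septic shock) are examples consistent with the three severity classes mentioned in the Discussion, not the full set of Table 1 criteria.

```python
# Sketch of cohort extraction from a local MIMIC-III PostgreSQL instance.
import psycopg2

# Example ICD-9 codes: 99591 (sepsis), 99592 (severe sepsis), 78552 (septic shock).
SEPSIS_CODES = ("99591", "99592", "78552")

QUERY = """
SELECT DISTINCT d.subject_id
FROM diagnoses_icd AS d
JOIN admissions AS a ON a.hadm_id = d.hadm_id
WHERE d.icd9_code IN %(codes)s;
"""

with psycopg2.connect(dbname="mimic") as conn:  # connection details are illustrative
    with conn.cursor() as cur:
        cur.execute(QUERY, {"codes": SEPSIS_CODES})
        sepsis_subjects = [row[0] for row in cur.fetchall()]
```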

These selection criteria resulted in a much larger number of control subjects than of patients with sepsis. Therefore, we limited the control group to 40 subjects per ICD-9 code in order to obtain a more balanced selection. As a result, the sepsis group comprised 178 subjects, while the control group comprised 200 subjects.

The MIMIC-III Waveform Database contains a variety of signals (such as ECG, ABP and PPG), but not all of them are available for every patient. Therefore, we further restricted the selection to those patients for whom the PPG signal was available. As a result, the sepsis group was reduced to 147 subjects, while the control group consisted of 155 subjects.

Preprocessing

We downloaded the recordings from the Matched Subset of the MIMIC-III Waveform Database and extracted the PPG of the selected patients using the WFDB Python package [41]. The selected signals were split into 2-min segments, and segments shorter than 2 min were discarded. Furthermore, in order to reduce the degree of similarity within the collected signals, we kept only every other segment.
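The segmentation step can be sketched as follows; the record name and directory are illustrative placeholders, while the 125 Hz sampling frequency and the keep-every-other-segment rule are those described above.

```python
import numpy as np
import wfdb

FS = 125                 # sampling frequency of the waveform database (Hz)
SEG_LEN = 2 * 60 * FS    # 2-min segment length in samples

# Record name and directory are illustrative placeholders.
record = wfdb.rdrecord("p000020-2183-04-28-17-47",
                       pn_dir="mimic3wdb-matched/p00/p000020")
ppg = record.p_signal[:, record.sig_name.index("PLETH")]

# Split into non-overlapping 2-min segments, discarding the shorter remainder,
# then keep only every other segment to reduce within-record similarity.
n_seg = len(ppg) // SEG_LEN
segments = ppg[: n_seg * SEG_LEN].reshape(n_seg, SEG_LEN)
segments = segments[::2]
```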

Afterwards, the regularity and quality of each 2-min segment was assessed using a template matching approach, a technique already used by other authors [42, 43]. This quality estimation was carried out using a 3-s running window over the 2-min segment. We classified each window by comparing the signal acquired from the patient to an optimal template PPG signal. The similarity between the two time series was quantified with Pearson’s correlation coefficient.

The template was generated using the NeuroKit2 Python toolbox, a package for neurophysiological signal processing [44]. The reference PPG signal was simulated without noise and motion artifacts. The simulation algorithm also requires as input the sampling frequency of the signal and the mean heart rate within each window. The sampling frequency was set to 125 Hz, which is the sampling frequency of all the signals in the waveform database. The mean heart rate was calculated from the distance between the systolic cardiac peaks. To identify the peak locations, we first filtered the signal and then used NeuroKit’s peak finding method, as illustrated in Fig. 1a.
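A minimal sketch of the template generation is shown below, using NeuroKit2's ppg_simulate. The text states only that noise and motion artifacts were disabled, so the exact parameter settings here are assumptions.

```python
import neurokit2 as nk

def make_template(mean_hr_bpm, duration_s=3, fs=125):
    """Clean reference PPG at the window's estimated mean heart rate."""
    return nk.ppg_simulate(
        duration=duration_s,
        sampling_rate=fs,
        heart_rate=mean_hr_bpm,
        frequency_modulation=0,   # no beat-to-beat variability
        ibi_randomness=0,
        drift=0,                  # no baseline wander
        motion_amplitude=0,       # no motion artifacts
        powerline_amplitude=0,    # no powerline noise
        burst_number=0,           # no artifact bursts
        random_state=0,
    )
```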

Fig. 1 Template matching method. a The raw and filtered signal. The filtered signal was used to identify the positions of the systolic peaks, indicated by green dots. b Alignment between the acquired signal and the template on the first systolic peak, which allowed us to calculate the correlation coefficient between the two waveforms

Signal filtering was carried out using a third-order Butterworth bandpass filter with cut-off frequencies of 0.5 and 8 Hz. The objective of the filtering was to remove the baseline component and frequencies that are not relevant for the systolic peaks. The peak finding function implements a method previously proposed by Elgendi et al. [45], based on event-related moving averages with dynamic thresholds. On the basis of the procedure reported above, we were able to identify the location of the systolic peaks and therefore estimate the mean heart rate within the window considered. At this stage, we excluded segments containing windows with only constant values, for which identification of systolic peaks was not possible, as well as windows with an estimated mean heart rate below 45 bpm. Once the reference signal was generated, we aligned the patient-acquired window and the template signal on the first systolic peak (Fig. 1b). We then calculated the Pearson correlation coefficient in order to evaluate the similarity between the two signals. A flow chart summarising the template matching algorithm is shown in Fig. 2.
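The per-window quality check can be sketched as follows, reusing the make_template helper above. The filter settings, the 45 bpm threshold and the constant-value exclusion are those described in the text; the equal template and window durations and the exact alignment arithmetic are assumptions.

```python
import numpy as np
import neurokit2 as nk
from scipy.signal import butter, filtfilt
from scipy.stats import pearsonr

def window_quality(window, fs=125):
    """Pearson correlation between a 3-s PPG window and a clean template,
    or None when the window cannot be assessed."""
    if np.all(window == window[0]):
        return None  # constant-valued window: no peaks identifiable

    # Third-order Butterworth bandpass (0.5-8 Hz) to remove the baseline
    # and frequencies irrelevant to the systolic peaks.
    b, a = butter(3, [0.5, 8], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, window)

    # Systolic peak detection with Elgendi's moving-average method.
    peaks = nk.ppg_findpeaks(filtered, sampling_rate=fs,
                             method="elgendi")["PPG_Peaks"]
    if len(peaks) < 2:
        return None

    # Mean heart rate from inter-peak distances; reject below 45 bpm.
    mean_hr = 60 * fs / np.mean(np.diff(peaks))
    if mean_hr < 45:
        return None

    # Align window and template on their first systolic peaks.
    template = make_template(mean_hr, duration_s=3, fs=fs)
    t_peaks = nk.ppg_findpeaks(template, sampling_rate=fs,
                               method="elgendi")["PPG_Peaks"]
    shift = peaks[0] - t_peaks[0]
    w_seg = window[shift:] if shift >= 0 else window
    t_seg = template if shift >= 0 else template[-shift:]
    n = min(len(w_seg), len(t_seg))
    r, _ = pearsonr(w_seg[:n], t_seg[:n])
    return r
```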

Fig. 2 Flow chart of the template matching algorithm. The template matching procedure was performed on 3-s windows obtained from each 2-min PPG segment. The correlation coefficient values obtained for each window were stored and subsequently used to classify the quality of the 2-min segment as acceptable or unacceptable

The 3-s windows were grouped into four classes using thresholds on the correlation coefficient, as illustrated in Table 2. These thresholds were chosen experimentally, by visually inspecting a set of signal samples associated with different correlation values.

Table 2 Pearson correlation categories between windows of the patient-acquired signal and the reference signal

Example samples from each correlation group are shown in Fig. 3.

Fig. 3 Examples of 3-s windows classified according to Pearson’s correlation coefficient. The figure shows that samples belonging to different groups have a different quality. The signals of group I (a) and group II (b) present the typical morphology of the PPG signal. The signals of group III (c) and group IV (d), associated with lower values of the correlation coefficient, are of poor quality

Segments containing windows belonging to group III or IV (Pearson correlation coefficient lower than 0.6) were discarded.
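The resulting acceptance rule can be sketched as follows, using the window_quality helper above; the 0.6 threshold is the group III/IV boundary stated in the text, while the use of non-overlapping windows is an assumption.

```python
THRESHOLD = 0.6  # windows with r < 0.6 fall into groups III-IV

def segment_is_acceptable(segment, fs=125, win_s=3):
    """Keep a 2-min segment only if every 3-s window reaches r >= 0.6."""
    win = win_s * fs
    for start in range(0, len(segment) - win + 1, win):
        r = window_quality(segment[start:start + win], fs)
        if r is None or r < THRESHOLD:
            return False
    return True
```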

As a result, we obtained 720 h of recording from 139 control subjects and 2272 h of recording from 111 sepsis subjects.

Finally, the training and test sets were created. The subjects were randomly assigned to the training and test sets in an 80:20 ratio. Segments from a single patient could not appear in both sets.

Moreover, the maximum amount of data per patient was set to 3 h, in order to avoid any patient being over-represented. For patients with more than 3 h of available signal, the segments were selected by random sampling. After the data selection phase, some patients were represented by only a few data samples. We considered that a small number of samples could indicate an unreliable signal with a low signal-to-noise ratio. Therefore, we set a minimum threshold of 1 h of signal per patient, in order to further improve the quality of the data set. Patients who did not reach this minimum amount of signal were excluded. A statistical description of the resulting training and test sets is shown in Table 3.
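A sketch of the per-patient limits and the subject-wise split is given below; X, y and subject_ids are assumed to hold the accepted segments, their labels and the subject identifier of each segment.

```python
import random
from sklearn.model_selection import GroupShuffleSplit

MAX_SEGS = 3 * 30   # 3-h cap: 90 two-minute segments per patient
MIN_SEGS = 1 * 30   # 1-h floor: 30 two-minute segments per patient

def limit_patient(segments, rng=random.Random(0)):
    """Apply the 1-h minimum and the 3-h maximum (via random sampling)."""
    if len(segments) < MIN_SEGS:
        return None                        # patient excluded
    if len(segments) > MAX_SEGS:
        return rng.sample(segments, MAX_SEGS)
    return segments

# Subject-wise 80/20 split: segments of one patient never cross sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))
```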

Table 3 Data set description

After the patient selection process, data from 85 sepsis patients and 101 controls were considered. Figure 4 summarizes the procedure used for defining the dataset.

Fig. 4 Main steps in dataset construction. For each step, the number of subjects involved is given for the septic group and the control group

Network model

We based our model on the widely used ResNet architecture [46]. The architecture starts with an input layer, followed by a single convolutional layer and a max pooling layer. After this, we added 8 identity blocks separated by max pooling layers. Each identity block, illustrated in Fig. 5, includes a shortcut connection and two convolutional layers initialized using the Glorot function. Each convolutional layer is followed by a batch normalization layer and a ReLU activation. The shortcut connection sums the input of the identity block and the output of the last ReLU activation.

After the identity blocks, we added a fully connected dense layer with 100 units, a dropout layer with a 0.2 dropout rate and, lastly, a fully connected layer with a Softmax activation. The dense layers used the same initialization function as the convolutional layers. As input, the model used raw 2-min PPG segments, normalized to the range [\(-1\), 1]. All convolutional layers had 40 filters with a filter width of 3. The complete architecture is illustrated in Fig. 6.
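A minimal Keras sketch of this architecture is given below. The framework is not stated in the text; the flattening step before the dense layers and the absence of an activation on the 100-unit layer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters=40, width=3):
    """Two Glorot-initialized conv layers, each with BN + ReLU; the block
    input is summed with the output of the last ReLU."""
    shortcut = x
    for _ in range(2):
        x = layers.Conv1D(filters, width, padding="same",
                          kernel_initializer="glorot_uniform")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.Add()([shortcut, x])

def build_model(input_len=2 * 60 * 125):       # 2-min segments at 125 Hz
    inputs = layers.Input(shape=(input_len, 1))
    x = layers.Conv1D(40, 3, padding="same",
                      kernel_initializer="glorot_uniform")(inputs)
    x = layers.MaxPooling1D(2)(x)
    for _ in range(8):                         # 8 identity blocks
        x = identity_block(x)
        x = layers.MaxPooling1D(2)(x)
    x = layers.Flatten()(x)                    # assumed flattening step
    x = layers.Dense(100, kernel_initializer="glorot_uniform")(x)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(2, activation="softmax",
                           kernel_initializer="glorot_uniform")(x)
    return tf.keras.Model(inputs, outputs)
```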

Fig. 5 Composition of the identity block. The identity block consists of two convolutional layers, each followed by a batch normalization layer and a ReLU activation. The output of the identity block is created by summing the input of the identity block and the output of the last ReLU activation

Fig. 6 The network architecture used in our study. The core of the network consists of 8 consecutive identity blocks and max pooling layers. At the end of the network, two fully connected layers separated by a dropout layer are present

Evaluation

To evaluate the performance, we trained the model using k-fold cross-validation. K-fold cross-validation involves dividing the training data into k subsets of approximately equal size, called “folds”. The model training is then repeated k times, so that at each iteration one of the folds is used as the validation set and the other \(k-1\) folds constitute the training data. In this work, k was set to 5, so the training data were split into 5 separate folds. The division of the data ensured that PPG samples acquired from the same subject were not present in multiple folds. As a result of this approach, we obtained 5 different models, each trained and validated on data from different patients. The best model weights from each training iteration were selected using the validation loss as a metric. These weights were then used as an ensemble to perform majority voting prediction on the test set.
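A sketch of this procedure is shown below, using scikit-learn's GroupKFold for the subject-wise fold assignment; train_fold is the per-fold training routine sketched in the next section, and X, y, subject_ids and X_test are assumed arrays.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Subject-wise 5-fold split: samples of one patient stay in a single fold.
gkf = GroupKFold(n_splits=5)
fold_models = []
for train_idx, val_idx in gkf.split(X, y, groups=subject_ids):
    fold_models.append(train_fold(X[train_idx], y[train_idx],
                                  X[val_idx], y[val_idx]))

# Majority vote of the five fold models on the test set.
votes = np.stack([m.predict(X_test).argmax(axis=1) for m in fold_models])
ensemble_pred = (votes.sum(axis=0) >= 3).astype(int)  # >= 3 of 5 vote "sepsis"
```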

Parameter optimization

To train the model, we set the learning rate to 1e−6, the batch size to 128 and the number of epochs to 800. In addition, we used the Adam optimizer and the binary cross-entropy loss.
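Under the same framework assumption as above, the per-fold training routine might look as follows; the labels are assumed to be one-hot encoded over the two classes.

```python
import tensorflow as tf

def train_fold(X_tr, y_tr, X_va, y_va):
    """Train one fold with the stated hyperparameters, keeping the weights
    that achieve the lowest validation loss."""
    model = build_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    checkpoint = tf.keras.callbacks.ModelCheckpoint(
        "best_fold.weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True)
    model.fit(X_tr, y_tr, validation_data=(X_va, y_va),
              batch_size=128, epochs=800, callbacks=[checkpoint])
    model.load_weights("best_fold.weights.h5")
    return model
```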

To select the optimal parameters, we conducted several experiments that led to the final version of the model. In this section we discuss the selection of the architecture, hyperparameters, input data length and data presentation format.

Architecture

We chose the ResNet architecture [46] for this project due to its prominent status and the signal classification capabilities demonstrated in the literature [47]. Once the architecture type was chosen, we ran several empirical experiments to determine a suitable depth for the network. Based on the results, we chose a depth of 8 identity blocks.

Learning rate

One of the most important hyperparameters is the learning rate, which typically takes values between 1 and 1e−6 [48]. The learning rate defines how large the updates applied to the model weights during the backward pass are, in response to the estimated error. In our study, we found a suitable learning rate experimentally by running multiple trainings with various commonly used learning rates: negative powers of 10 ranging from 1e−2 to 1e−7 were evaluated. When deciding on the appropriate value, we considered quantitative metrics, such as maximum accuracy and minimum loss, and qualitative criteria, such as the perceived smoothness of learning, convergence, and the absence of under- or overfitting. Based on these metrics, we chose a learning rate of 1e−6.

Batch size

Smaller batch sizes have been shown to improve generalization [49], but they can be computationally less efficient than larger batches [48]. In this study, we performed experiments using batch sizes in powers of 2, ranging from 16 to 1024. Based on these experiments, we chose a batch size of 128, which provided a good balance between computational efficiency and accuracy.

Data augmentations

Deep learning thrives on large datasets, but the available training data are often scarce. To mitigate this issue, data augmentation is commonly used to increase the amount of training data. However, in the case of biosignals, the design of data augmentation techniques must preserve the time-domain characteristics that represent physiological phenomena [50].

In this study, we evaluated the effectiveness of adding noise (jitter) and of using random windows. The noise was sampled from a normal distribution and added to the normalized PPG signal. After the noise addition, the resulting noisy signal was normalized again to obtain values within the [\(-1\), 1] range expected by the model. The noise augmentation was applied to the signal with a 50% probability. Random windows were implemented by taking a continuous 90-s window from a random location of the 2-min PPG segment; a sketch of both augmentations is given after Table 4. As shown in Table 4, the use of data augmentations did not lead to a significant improvement in performance. The combined use of jitter and windows led to modest improvements in accuracy and specificity compared to the model without augmentations. As the baseline approach yielded the best sensitivity, it was selected for the final version of the model.

Table 4 Augmentation results
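The two augmentations can be sketched as follows; the noise standard deviation is not reported in the study, so the 0.05 value here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
FS = 125

def augment(segment):
    """Jitter with 50% probability, then a random continuous 90-s window."""
    if rng.random() < 0.5:
        noisy = segment + rng.normal(0, 0.05, size=segment.shape)
        # Re-normalize to the [-1, 1] range expected by the model.
        segment = 2 * (noisy - noisy.min()) / (noisy.max() - noisy.min()) - 1

    win = 90 * FS
    start = rng.integers(0, len(segment) - win + 1)
    return segment[start:start + win]
```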

Segment length

We also explored the classification of photoplethysmography segments of various lengths. The exploration started with segments lasting 1 h; the length of the segments was then gradually reduced to 1 min. Shorter segments increased the amount of data and led to an improved signal-to-noise ratio, since they allowed visual inspection of the signal, which helped in identifying and discarding various artifacts. Furthermore, shorter segments made it possible to assess the effectiveness of signal quality metrics.

Frequency domain presentation

We also investigated a frequency-domain representation of the input, in addition to the raw time series. We observed that the frequency-domain representation compared favorably to the time series when the PPG segments were longer, but lost its advantage when the segments were shorter. We hypothesized that the better outcome obtained with the longer segments might be due to the simplified representation in the frequency domain.

Results

All trained models and their ensemble, with the corresponding evaluation metrics, are shown in Table 5. Each model is indicated in the table by the name of the fold used as the validation set. Accuracy identifies the percentage of correctly classified samples, sensitivity the proportion of correctly classified sepsis samples, and specificity the percentage of correctly predicted control samples.

As shown in Table 5, the accuracy of the individual models varies from 72.27 to 74.59%, the sensitivity from 65.72 to 71.31% and the specificity from 76.47 to 79.72%. The ensemble method achieves an accuracy of 76.37%, with a sensitivity of 70.95% and a specificity of 81.04%. In addition to accuracy, sensitivity and specificity, the Receiver Operating Characteristic (ROC) curve was calculated for the ensemble. The ROC curve, illustrated in Fig. 7, shows that our method reaches an Area Under the Curve (AUC) of 0.842.
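For reference, these metrics can be computed from the ensemble outputs as follows; y_true holds the test labels (1 = sepsis) and y_score an assumed continuous ensemble score, such as the mean predicted sepsis probability across the five models.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_true, (y_score >= 0.5).astype(int)).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)   # correctly classified samples
sensitivity = tp / (tp + fn)                    # correctly classified sepsis samples
specificity = tn / (tn + fp)                    # correctly classified control samples
auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
```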

Table 5 Evaluation results
Fig. 7 Calculated ROC curve. The red diagonal line represents the points where the true positive rate equals the false positive rate. Points to the left of the diagonal line indicate that the proportion of true positives is higher than that of false positives. The optimal value is at the top left corner

Fig. 8 Validation loss curves for the first 400 training epochs. In the figure, folds 0 and 3 have not yet reached a plateau, in contrast to folds 2 and 4, which show a trend that could indicate overfitting on the training data. Lower loss values indicate better performance

Figure 8 shows the trend of the loss function on the validation set for all 5 models during the first 400 training epochs. The loss curves show that the model identified as Fold 0 (gray) achieves the lowest loss, whereas Fold 4 (blue) performs the worst.

Discussion

This study investigated the feasibility of using a deep learning based method to classify sepsis through the analysis of the photoplethysmogram signal alone. In particular, we developed a deep learning based model and trained it on PPG signals extracted from a public ICU waveform database (MIMIC-III). To the best of our knowledge, this is the first study aimed at verifying the possibility of using only the photoplethysmogram signal, together with a deep learning based method, to classify sepsis.

As regards the data analysis, 5-fold cross-validation was used, which resulted in 5 models, each trained using different training and validation subjects. Due to the differences in the training and validation data in each cross-validation iteration, the performances of the models differed when evaluated on the test set. Our method showed a mean and standard deviation of \(73.398 \pm 0.784\)% for accuracy, \(67.754 \pm 1.921\)% for sensitivity and \(78.26 \pm 1.168\)% for specificity. In this regard, we hypothesized that the performance variations between folds depended on how well the training-validation data split of a given fold represented the test data; this also suggests that the selected dataset contains heterogeneous populations, and thus that a larger dataset may improve the classification results.

The final prediction was carried out using majority voting across all the trained models. Due to the differences among the folds, the ensemble was expected to make better decisions than any of the models independently. Using the ensemble, we achieved an accuracy of 76.37%, a sensitivity of 70.95% and a specificity of 81.04%, demonstrating promising results and indicating the possibility of using the PPG signal to assist in diagnosing sepsis. Our method could support the sepsis diagnostic process and allow a more timely diagnosis. Importantly, the method might not require recording extra signals, because the PPG is already commonly recorded in ICU patients. Moreover, the acquisition and processing of photoplethysmogram signals can ensure continuous, low-cost monitoring of patients with, or at risk of, sepsis.

Unlike previous studies, the model proposed in this article performs a binary classification of sepsis and was trained using only the raw photoplethysmographic signal. Previous studies using the MIMIC database for sepsis detection were mainly conducted using vital parameters and laboratory measurements. Among these, it is worth mentioning some works that used a deep learning approach for sepsis identification.

Table 6 Summary of the results of other works on sepsis identification from the MIMIC database; abbreviations: Accuracy (ACC), Sensitivity (SE), Specificity (SP)

Kam and Kim [51] extracted the hourly minimum, mean and maximum values of vital signs and laboratory measurement parameters from the MIMIC-II database. The extracted features were used to train different architectures to predict sepsis 3 h before the estimated onset time. The authors reported that the Long Short-Term Memory (LSTM) architecture was the most effective, based on the Area Under the Curve (AUC) criterion.

Ashuroğlu et al. [52] proposed a model called the Deep SOFA-Sepsis Prediction Algorithm, which combined CNN and Random Forest algorithms to predict the SOFA score. The authors trained the model with 7 vital signs obtained from the MIMIC-III database. Laboratory results were excluded in order to assess the feasibility of estimating the risk score. They also evaluated the performance of their architecture in predicting sepsis 6 h before the estimated onset time.

Scherpf et al. [53] developed a Recurrent Neural Network (RNN) model to predict sepsis 3 h before the estimated onset time. The model was trained using white blood cell count and vital signs averaged over one-hour intervals. The training data were obtained from the MIMIC-III database.

Our method achieved an AUC of 0.842, compared to 0.929 reported by Kam and Kim [51], 0.972 by Ashuroğlu et al. [52] and 0.81 by Scherpf et al. [53].

Table 6 summarises in more detail the cited works that used the MIMIC database for the identification of sepsis. Although these studies performed better than our method, we believe our results can still be considered very interesting, as our method uses only the PPG signal as input.

Our study has some limitations. As reported in the MIMIC-III documentation, the ICD code is generated at the end of the hospitalisation; consequently, information on when the diagnosis was made or when the patient first showed symptoms is not available. Due to this limitation, our subject selection included only those sepsis patients who were hospitalised in the ICU once. We hypothesised that, under this criterion, the corresponding signals contained sufficient information on the target pathology.

Nevertheless, it should be mentioned that some studies have tried to estimate the onset time of sepsis in the MIMIC database on the basis of the diagnostic criteria for sepsis: the presence of 2 or more SIRS criteria, or a SOFA score \(>2\). After extracting the parameters necessary to estimate the SIRS or SOFA scores, several authors [26, 27, 52, 53] considered the onset time of the disease to be the moment when the estimated score met the diagnostic criterion for sepsis. A second limitation of the study is the selection of control and target diagnoses. In our study, the control group was restricted to a small subset (n = 5) of ICD-9 mental disorders, and the sepsis group consisted of multiple (n = 3) classes of different sepsis severities. Furthermore, all waveforms used for training and testing the method were collected from the same database.

Based on the above limitations, the future direction of this research involves evaluating the model on a larger set of control diagnoses as well as sepsis diagnoses stratified by severity. Patient selection could be improved by including subjects from other databases and by extending the subject selection in MIMIC-III. To assess the generalisation capability of our model, we intend to test the performance of our method on other datasets. Furthermore, to evaluate the ability of the method in predicting the onset of sepsis, we plan to train and test the model on a dataset where the diagnosis times are known.

Conclusion

This study explored the feasibility of using a deep learning based method to classify sepsis relying only on the photoplethysmogram signal. This was made possible by the use and analysis of the MIMIC-III database. The proposed method achieved an AUC of 0.842 and an accuracy of 76.37% on the test set, demonstrating promising results. Using only a non-invasive signal, the proposed method is well suited for long-term monitoring of patients at risk. We hypothesize that this method could serve as an early warning system, triggering the application of more invasive tests and thus reducing the time needed to reach a diagnosis. This method could contribute to improving the quality of patient treatment. However, as discussed in the Discussion, future studies with a larger number of patients and data from other databases will be necessary to assess the effectiveness of the proposed method.