
1 Introduction

In recent years, wireless vital sign sensors, commonly known as the Internet of Medical Things (IoMT), have broadly extended the boundaries of remote patient monitoring (RPM) [1, 2]. These sensors remotely and continuously measure a patient's five vital signs, namely oxygen saturation (SpO2), blood pressure (BP), body temperature (Temp), respiratory rate (RR), and heart rate (HR), which allows early identification of any abnormality or deterioration in the patient's condition [3]. Continuously monitored patient data and its extracted features can be used for decision-making, early clinical event prediction models [4], and automated risk analysis [5, 6].

The modified early warning score (MEWS) is widely used in hospital wards by nurses to assess a patient's condition and raise alerts [7]. First introduced as the early warning score (EWS) [8], it was later replaced by the modified early warning score (MEWS) [9, 10]. A weighted score from 0 to 3 is assigned to each of the five vital sign parameters based on thresholds on the raw values; the single-parameter scores are then summed to generate the aggregate MEWS score [10, 11], as shown in Table 1. Several approaches to detecting patient deterioration using MEWS and related systems, including single-parameter, multiple-parameter, aggregate-score, and combination methodologies, have been reviewed [11]; however, these approaches commonly assume the minimal value of a parameter when no value is recorded for it. In this context, missing data refers to any parameter with no recorded value in the five vital signs dataset. Much research has focused on dealing with missing data [12, 13]; however, missing data in continuous vital signs in RPM is an unexplored area. The volume of parameter data observed continuously in RPM makes the database highly susceptible to missing data, and its analysis is a significant challenge. Missing data is generally grouped into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [13].

Table 1. Modified Early Warning Score [9, 10]
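To illustrate how the aggregate score is formed, the following minimal Python sketch maps raw readings to 0–3 sub-scores and sums them. The threshold bands are illustrative stand-ins for Table 1 and may differ from the exact cut-offs used in this study.

def sub_score(value, bands, default=3):
    # Return the 0-3 weight of the band that contains `value`.
    for low, high, score in bands:
        if low <= value <= high:
            return score
    return default  # outside every band -> maximum weight

# (low, high, score) bands per parameter -- illustrative stand-ins for Table 1.
BANDS = {
    "HR":   [(41, 50, 1), (51, 100, 0), (101, 110, 1), (111, 129, 2)],
    "RR":   [(9, 14, 0), (15, 20, 1), (21, 29, 2)],
    "SBP":  [(81, 100, 2), (101, 199, 0)],
    "Temp": [(35.0, 38.4, 0)],
    "SpO2": [(95, 100, 0), (92, 94, 1)],
}

def mews(vitals):
    # Sum the per-vital sub-scores into the aggregate MEWS value.
    return sum(sub_score(v, BANDS[name]) for name, v in vitals.items())

print(mews({"HR": 105, "RR": 18, "SBP": 95, "Temp": 37.1, "SpO2": 93}))
# 1 (HR) + 1 (RR) + 2 (SBP) + 0 (Temp) + 1 (SpO2) = 5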

The amount of missing data in any clinical setting depends on various causes, such as delayed data transmission, environmental issues, network connectivity issues, sensor malfunction, power loss, and sensor detachment from the skin. Consequently, missing or poor-quality data will impact the evaluation of vital sign abnormalities, clinical event detection algorithms, and the risk analysis of the patient [14].

Extensive work exists in the literature on analyzing missing data [13, 15]. A general regression neural network (GRNN) and a successive geometric transformation model (SGTM) were applied in one study to handle missing clinical data [16]. Missing data may occur frequently in RPM, and the variation in the amount of recorded data makes it challenging to analyze, as these algorithms need significant data for training [17]. In addition, these techniques often ignore unique patterns in the data and the correlations between the observed vitals.

This study proposes a similar-pattern-matching approach that continuously imputes missing values based on the temporal association of vitals within a sliding window. The patterned modified early warning score (PMEWS) is introduced in this research as a pre-processing feature for finding pattern matches for the missing vital sign values. This pattern-matching technique helps identify trends and patterns in the data even when the variability of the missing data is significant. The approach uses vital sign data from a clinical trial of our real-time RPM implementation [18]. The main contributions of this study are summarized below:

  • We propose a pattern-matching algorithm to impute missing sensor data from different sensor streams. Our algorithm creates patterns from the vital sign values appearing at a particular time and uses the pre-processing feature PMEWS to impute the missing data.

  • Our novel algorithm can detect and predict similar matching patterns for the missing values.

  • The proposed approach uses the MAR and MCAR mechanisms to simulate incomplete data at proportions of 10%, 20%, and 30%. The values suggested by our similar-pattern matching for the missing entries are then verified against the actual dataset without omissions to check the accuracy of the proposed algorithm; a sketch of this masking step follows this list.
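A minimal sketch of such MCAR masking, assuming the vitals are held in a hypothetical pandas DataFrame named vitals_df, could look as follows:

import numpy as np
import pandas as pd

def mask_mcar(df, columns, fraction, seed=0):
    # Blank a random `fraction` of the entries in `columns` (MCAR mechanism).
    rng = np.random.default_rng(seed)
    masked = df.copy()
    for col in columns:
        drop_idx = rng.choice(df.index.to_numpy(),
                              size=int(len(df) * fraction), replace=False)
        masked.loc[drop_idx, col] = np.nan
    return masked

# e.g. produce 10%, 20% and 30% incomplete versions of a vitals table:
# incomplete = {p: mask_mcar(vitals_df, ["SpO2", "HR"], p) for p in (0.1, 0.2, 0.3)}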

This paper is organized as follows: Sect. 2 outlines the related work of this research. Section 3 describes our proposed solution. Section 4 describes our results, and Sect. 5 concludes the paper.

2 Related Work

Imputation is the recovery-based approach for overcoming the limitations of complete case analysis when the missing data falls under the MCAR category [14]. Imputation replaces missing data values using single mean imputation, conditional mean imputation, last observation carried forward (LOCF), multiple imputation (MI), and full information maximum likelihood (FIML) [19]. Single-mean imputation, where missing data is replaced by the mean of the recorded values of that attribute, has the limitation of distorting the dataset's variation, whereas conditional mean imputation intensifies the multivariate relationships in the data. Due to these limitations, mean and conditional mean imputation are recommended only for baseline measurements in randomized trials [20]. LOCF is similar to single imputation and replaces missing values with the last observed value. MI has three phases: the imputation phase, the analysis phase, and the pooling phase [21]. MI generates n replacements for each missing value; the completed datasets produced with the n imputed values are then analyzed and integrated into the final output. MI applies to both MAR and MCAR. FIML best suits the MAR context, as it considers only the observed data and ignores the non-response data [22]. These imputation methods are inappropriate if the missing data falls under the MNAR category.
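For concreteness, single-mean imputation and LOCF as described above can be sketched in a few lines of pandas; the small DataFrame is purely illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "HR":   [99, 101, np.nan, 97],
    "SpO2": [96, np.nan, 95, 96],
})

mean_imputed = df.fillna(df.mean())  # single-mean imputation per column
locf_imputed = df.ffill()            # last observation carried forward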

Random forest is inherently a multiple-imputation approach; missForest imputes by averaging over numerous unpruned classification or regression trees [23, 24]. Bayesian ridge regression uses a probability distribution when fitting the linear regression, which enables an automated process for handling missing data [25]. In hot/cold-deck imputation, a cluster's variable mean or mode is used to impute the missing values [26]. K-nearest neighbour imputation defines the similarity between two records using the Euclidean distance and imputes each missing value from the most similar record [26].
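The nearest-neighbour and missForest-style approaches above have off-the-shelf counterparts; the sketch below uses scikit-learn's KNNImputer and IterativeImputer (the latter with a random-forest estimator to approximate missForest) on a small illustrative array of HR, SpO2, and Temp readings:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[99.0, 96.0, 36.8],
              [101.0, np.nan, 37.0],
              [np.nan, 95.0, 36.9],
              [97.0, 96.0, np.nan]])

knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)
rf_filled = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
).fit_transform(X)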

Normalization, formatting, and synchronization of the data are also necessary [27]. Normalizing the recorded variables is essential because various scoring systems, such as the sequential organ failure assessment (SOFA) and MEWS, were registered at different frequencies [28]. Removing abnormal values from the data is also described in the literature as part of pre-processing [29]. One typical statistical analysis, complete case analysis, removes all rows with missing values and studies only the records with complete data [30]. This method has the disadvantage of reducing the amount of usable data, which produces biased results. Von Russum [12] analyzes techniques such as linear interpolation, spline interpolation, last observation, mean-forward, and cluster-based imputation for missing values. Imputation using these techniques resulted in early warning score (EWS) miscalculations ranging from 1% to 8%, and these methods produced more biased results than oversized observation windows [12]. A deep learning-based protocol for accurately predicting missing data has also been proposed in the literature [31]. Expectation maximization (EM) [32] is a popular algorithm for imputing missing data; one disadvantage of EM is that the recorded data must be highly correlated to obtain reliable information [33].
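The interpolation-based fills reviewed in [12] can likewise be sketched with pandas on a single vital sign series (the values are illustrative; the spline option requires SciPy):

import numpy as np
import pandas as pd

hr = pd.Series([99, np.nan, np.nan, 97, 98, np.nan, 101])

linear_fill = hr.interpolate(method="linear")           # linear interpolation
spline_fill = hr.interpolate(method="spline", order=2)  # spline interpolation
last_obs    = hr.ffill()                                # last observation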

Existing approaches perform well in imputation accuracy, though their efficiency is greatly affected by factors such as computational time, complexity, and the correlation of parameters. Our proposed pattern-matching imputation approach accounts for the correlation between parameters, which helps reduce biased imputation, and its computational time can be controlled through the sliding window, giving efficient performance in handling missing values.

3 Pattern-Matching for Missing Values

The structure of our method is divided into two stages. First, the data is pre-processed using the patterns created from the MEWS [34], as discussed in Sect. 3.1. Pre-processing the data helps ensure an accurate assessment of the patient's condition and includes removing erroneous values, time points with no recorded vital values, abnormal values, and high-frequency noise [35]. In the second stage, a pattern match is found for the missing values in an observation window. These two components work together to provide a high-similarity match for the missing data.

Vital sign data for the parameters systolic blood pressure, blood oxygen, body temperature, respiratory rate, and heart rate, streamed at discrete times for each parameter from wearable sensors worn by patients in a general ward [18], were used in the current study. Publicly available datasets were not used, to ensure that the data represented real rather than simulated recordings and had not been curated in any other way. The imputation method considers the data's temporal context and predicts the missing values more accurately using a sliding window. A sliding window has been used to find patterns within a particular data segment [21, 22]. A six-hour prediction window for clinical deterioration is considered ideal [36]. The sliding window is predefined in our algorithm.

A sliding window analyses a subset of data within a process window of fixed length, often with a minimum overlap of one data point, shifting by the window increment time across the process window. The notations used for the pattern-match algorithm are the number of vitals to be recorded (TMV), the sliding window time (t), the string pattern array (SP[i]), the variables (Str1, Str2), and the counter for matched characters (CSP[k]). Figure 1 shows the sliding window increment and overlaps in the process window; T10, T20, T30, T40, T50, and T60 denote times in Fig. 1. The sliding window (\(s{w}_{\delta }^{i}\)) consists of different time slots \({t}_{\left(\delta +n\right)}^{i}\), as shown in Eq. (1).

Fig. 1. Sliding window with the overlap and window increment

The process window length (L), where \(L\in \mathbb{N}\) (the set of all natural numbers), and the slide window increment (δ), where \(1\le \delta <L\), can be decided by the physician, as shown in Fig. 1. If urgent attention is needed, a shorter window increment can deliver more frequent information on the patient's condition. On the other hand, a longer window increment can be employed to lower the computational burden and boost the analysis accuracy if the patient's condition is stable and not urgent. A large observation window with personalized decision-making and imputation produced positive results for missing values on maternal health data [14]. The window number is denoted by (i), where \(i\in \mathbb{N}\). The notation \({t}_{\left(\delta +n\right)}^{i}\) will be written as TS in the calculations for the rest of the paper, as shown in Eq. (2).

$$s{w}_{\delta }^{i}=\left({t}_{\left(\delta +0\right)}^{i},{t}_{\left(\delta +1\right)}^{i},{t}_{\left(\delta +2\right)}^{i},\dots ,{t}_{\left(\delta +n\right)}^{i}\right)$$
(1)
$$TS={t}_{\left(\delta +n\right)}^{i}$$
(2)
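A minimal sketch of how such a sliding window over the process window could be generated is shown below; the list of time-ordered records and the example values of L and δ are assumptions for illustration rather than the authors' implementation:

def sliding_windows(records, L, delta):
    # Yield overlapping windows of length L, shifted by delta records each step.
    for start in range(0, max(len(records) - L, 0) + 1, delta):
        yield records[start:start + L]

# e.g. a 60-slot process window advanced in 10-slot increments:
# for sw in sliding_windows(patterns, L=60, delta=10):
#     ...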

The physician will decide the start time (T) and the time slot (TS) for the observation. The slide window increment (δ) value is valid for the time slot (TS) and can be changed for the next process window (L). The “patterned modified early warning score” (PMEWS), which forms the basis for finding similar pattern matches for missing vital values, is described in the next section, followed by the pattern-matching method.

3.1 Patterned Modified Early Warning Score (PMEWS)

Our method for handling missing values is to form a pattern from the recorded values, replacing the missing values with letters or symbols. As shown in Table 2, the recorded raw vital values at any data point are converted into their respective MEWS scores to form the pattern, where ‘N’ is used for null values. Another significant benefit of using a pattern with letters is that it avoids relying on any previously registered threshold value that could provide a biased estimation and thereby affect the prediction of the patient's risk assessment. In our method, patterns can be recorded in any format.

An ordered sequence of MEWS values forming a pattern is recorded from the patient's observed data and is labelled the PMEWS. The pattern array is well suited to handling sensor signal discrepancies caused by null values and by discrete-time readings from different sensors. This is the second key step in transforming the raw data.

Table 2. Pattern Formation

For sliding windows (\(s{w}_{\delta }^{i}\)) consisting of time slots (TS), the recorded values of the patient's vitals are processed into their relevant MEWS scores (M), and the resulting pattern is stored in an integer array \(P\left[TS\right]\). Equation (3) shows an example of the pattern at time slot (TS). MHR, MBP, MT, MSPO2, and MRR are the MEWS scores derived from the raw vital data for heart rate, blood pressure, temperature, oxygen saturation, and respiratory rate, respectively.

$$P\left[TS\right]=\left[{\text{MHR}},{\text{ MBP}},{\text{ MT}},{\text{ MSPO2}},{\text{ MRR}}\right]$$
(3)
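To make the pattern formation concrete, the following minimal sketch assembles P[TS] as a string from the per-vital MEWS sub-scores, writing 'N' for missing vitals; the ordering follows Eq. (3), and the input values are illustrative:

VITAL_ORDER = ["HR", "BP", "Temp", "SpO2", "RR"]  # ordering of Eq. (3)

def pmews_pattern(mews_scores):
    # Build the pattern P[TS] for one time slot; None becomes 'N'.
    return "".join(
        "N" if mews_scores.get(name) is None else str(mews_scores[name])
        for name in VITAL_ORDER
    )

print(pmews_pattern({"HR": 0, "BP": None, "Temp": None, "SpO2": 2, "RR": 0}))
# -> "0NN20", the form of pattern used in the example of Sect. 3.2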

The following section describes the pattern-match algorithm to identify similar patterns for missing values of vital signs.

3.2 Pattern-Matching Algorithm

The main aim of this algorithm is to match a pattern containing missing values with the closest matching pattern that appeared in the sliding window. When a match is found, the missing parameter is imputed with the corresponding value from the matched record. For example, in Table 3, the raw data is recorded for heart rate (HR), temperature (T), blood pressure (BP), oxygen saturation (SpO2), and respiratory rate (RR), and the corresponding patterns are created for minutes one to four. HR is missing at minute four, giving the pattern ‘NNN20’; this is matched to the pattern ‘0NN20’ at minute one, which has the highest number of matched symbols, and the HR is imputed with that record's value of 99.

Table 3. Pattern Match

After pre-processing, the first step is to initialize the variables T, L, t, TMV, δ, \(s{w}_{\delta }^{i}\), and TS. The pattern \(P\left[TS\right]\) is generated from the MEWS for the recorded raw values. A loop traverses the string pattern array SP[] from the start of the sliding window, with i = T to TS. If the pattern SP[i + 1] contains no null values, the counter i is incremented to the following index. Otherwise, an inner loop over the string pattern array with index j = T to i ensures that all patterns are considered for matching the null data values. To perform a match, two variables, str1 and str2, are set to SP[j] and SP[i + 1], respectively. These two strings are passed to the count function, which checks how far str1 matches str2. For each matching pattern character, a count is maintained for that pattern and stored in the counter array CSP[k]. The counter array CSP[k] from j = T to TS is then checked for the pattern with the highest number of character matches over the number of vitals observed, as shown in Table 4. The relevant patterns and count values are displayed as suggested outcomes. The time slot TS is updated with the sliding window increment time δ to traverse the sliding window loop. This iterative process over the sliding window ensures that all patterns are considered.

The proposed algorithm that accomplishes our pattern matching is shown as follows:

(Pattern-matching algorithm listing)
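A minimal Python sketch of this procedure is given below. It assumes that each record carries its raw readings (None when missing) and its PMEWS pattern string, that the candidates are the earlier patterns within the same sliding window, and that the count function counts positions at which two patterns agree; the names and data structures are illustrative rather than the authors' implementation.

def count_matches(str1, str2):
    # Counter CSP: number of positions at which the two patterns agree.
    return sum(a == b for a, b in zip(str1, str2))

def impute_window(window, vital_order=("HR", "BP", "Temp", "SpO2", "RR")):
    # Impute missing vitals inside one sliding window by best pattern match.
    for i, record in enumerate(window):
        if "N" not in record["pattern"]:
            continue                    # nothing missing at this time slot
        candidates = window[:i]         # earlier patterns in the same window
        if not candidates:
            continue
        best = max(candidates,
                   key=lambda c: count_matches(c["pattern"], record["pattern"]))
        for k, name in enumerate(vital_order):
            if record["pattern"][k] == "N" and best["raw"].get(name) is not None:
                record["raw"][name] = best["raw"][name]  # copy matched value
    return window

Under these assumptions, the minute-four record of Table 3, with pattern ‘NNN20’, would be matched to ‘0NN20’ and take its HR value of 99 from that record.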

4 Result and Discussion

To assess the plausibility of our technique, we used the dataset collected from the clinical trial [18]. The trial dataset contains the raw values for the vital parameters Temp, BP, SpO2, RR, and HR, together with the time of each record, as shown in Table 3. For this study, only two parameters, SpO2 and HR, were imputed. The dataset was split into additional sets of various record lengths. MI and EM are commonly used to impute missing values, so our proposed method is compared against these. Normalization is an essential pre-processing step, as it scales the data and makes it more suitable for analysis by reducing biased outcomes; here, normalization is performed using the MEWS. Starting with the complete set of values, we removed various percentages of the SpO2 and HR readings to test the precision of our technique. The missing parameter values in the dataset were then imputed using our pattern-match imputation, EM, and MI within the same sliding window; our algorithm selects the highest-matching pattern for imputation. For comparison, we calculated the root mean squared error (RMSE) between the imputed values and the original values from the complete dataset. Table 4 presents the results obtained from four different datasets using the different imputation methods.

For dataset D1:

  • SpO2: Pattern-match has the lowest RMSE of 0.82, outperforming EM and MI.

  • HR: Pattern-match has the lowest RMSE of 3.9, outperforming EM and MI.

For dataset D2:

  • SpO2: Pattern-match has the lowest RMSE of 2.03, outperforming EM and MI.

  • HR: Pattern-match has an RMSE of 13.9, higher than EM and MI.

For dataset D3:

  • SpO2: Pattern-match has an RMSE of 2.4, while EM has the lowest RMSE.

  • HR: Pattern-match has the lowest RMSE of 3.9, followed by EM and MI.

For dataset D4:

  • SpO2: Pattern-match has an RMSE of 2.5, while EM has the lowest RMSE.

  • HR: Pattern-match has the lowest RMSE of 7.3, followed by EM and MI.

Table 4. Results of the Datasets.
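For reference, the RMSE values reported in Table 4 and in Figs. 2 and 3 correspond to the standard computation sketched below for one parameter; the variable names in the usage comment are illustrative:

import numpy as np

def rmse(original, imputed):
    # Root mean squared error between imputed and withheld original values.
    diff = np.asarray(original, dtype=float) - np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# e.g. rmse(original_spo2[masked_idx], imputed_spo2[masked_idx])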

Figures 2 and 3 show the variation in RMSE for the two imputed parameters, SpO2 and heart rate (HR). In the context of missing data, the sliding window approach and the PMEWS have proven very useful, as the algorithm demonstrates comparable results. The algorithm considers every pattern in the sliding window before the missing pattern and uses these patterns to find a similar match. It then slides the window across the process window to cover all the patterns that appeared in that timeframe.

Fig. 2. Root Mean Square Error for SpO2

Fig. 3. Root Mean Square Error for HR

The proposed algorithm's accuracy was satisfactory, which suggests that this approach is practical for finding an effective similar match for missing vital parameter values in medical datasets generated in RPM settings. It was observed that most of the similar patterns in the data appeared after the normalization phase of pre-processing. The algorithm can leverage these patterns to find close matches for the missing vital values, which can improve the accuracy of the prediction. These findings have implications for researchers, physicians, and practitioners who work with vital sign datasets containing missing values in RPM for patient risk assessment.

5 Conclusion

This study presents a method for pattern-matching the missing values in an RPM vital sign dataset using the PMEWS and a sliding window. The method identifies similar patterns in the data, and the technique is further improved by leveraging these patterns through the PMEWS. Compared to other approaches, this technique considers the temporal context of the missing values, which can be particularly important in the medical domain, where the timing of measurements can have critical implications for patient care. The method also has the potential to revolutionize how continuous RPM data is processed and analyzed, leading to more efficient healthcare delivery. Future research could explore other pre-processing techniques to improve the pattern match further. Overall, the proposed method can improve the accuracy of medical data analysis for decision-making, specifically for missing vital values.