Confidence-Enhanced Early Warning Score Based on Fuzzy Logic

Cardiovascular diseases are one of the world’s major causes of loss of life. A patient's vital signs can indicate an impending deterioration up to 24 hours before such an incident happens. Healthcare professionals commonly use the Early Warning Score (EWS) in healthcare facilities to indicate the health status of a patient. However, the chance of survival of an outpatient could be increased if a mobile EWS system monitored them during their daily activities and raised an alert in case of danger. Because healthcare professionals can supervise this health condition assessment only to a limited extent, a mobile EWS system needs to have an acceptable level of reliability, even if errors such as noisy signals and detached sensors occur in the monitoring setup. In earlier works, a data reliability validation technique has been presented that gives information about the trustworthiness of the calculated EWS. In this paper, we propose an EWS system enhanced with the self-aware property confidence, which is based on fuzzy logic. In our experiments, we demonstrate that, under adverse monitoring circumstances (such as noisy signals, detached sensors, and non-nominal monitoring conditions), our proposed Self-Aware Early Warning Score (SA-EWS) system provides a more reliable EWS than an EWS system without self-aware properties.


Introduction
Cardiovascular diseases are considered one of the major causes of death worldwide [1]. The vital signs of a patient reflect the patient's health condition, and monitoring these vital signs establishes a basis for predicting a possible deterioration of the health condition. Even up to 24 hours before a sudden health deterioration occurs, specific symptoms are visible in the vital signs of a patient [2]. Assessing a patient's health condition by means of the EWS is common practice in hospitals and is done manually by healthcare professionals. The EWS is a number that indicates the level of criticality [3].
The availability of an autonomous mobile EWS system that constantly monitors patients' vital signs to calculate the EWS could increase the life expectancy of outpatients. High-risk patients could wear such a system, which monitors them during their daily life activities and alerts in case of an emergency. Besides a much higher survival rate, a mobile EWS system could also decrease costs related to healthcare and reduce the duration of hospitalization periods.

Maximilian Götzinger, maxgot@utu.fi. Extended author information available on the last page of the article.
The Internet of Things (IoT), with its small devices and wearable technologies, is a key enabler to provide autonomous health monitoring for a mobile EWS system in a cost-efficient manner [4][5][6][7]. Such a system cannot be supervised continuously by healthcare professionals, yet its reliability and the accuracy of the calculated EWS are of utmost importance. Manual monitoring by healthcare professionals of a patient who is admitted and lying in a hospital bed faces far fewer problems than automated monitoring of a patient who is at home carrying out daily tasks [8]. One of the widely acknowledged and intrinsic challenges for wearable devices is the movement artifact [9]. Moreover, incorrectly attached or detached sensors, broken sensors, and noise can affect the calculation of the EWS, which could lead to a false or, even worse, a missing alarm with all its consequences [10].
Self-awareness has various properties which help to make computer systems more autonomous, smarter, and more reliable [11,12]. Therefore, it can also be an enabler to make the monitoring of patients and the calculation of the EWS more robust and reliable. In one of our previous works [13], we presented a data reliability assessment technique based on fuzzy logic, which gives information about the trustworthiness of the calculated EWS. However, although the proposed system outputs a reliability value which correlates with the correctness of the monitored vital signs, the system can only provide an unmodified EWS, which is incorrect when the input data is corrupted. To improve the decision-making ability of a system, another self-aware property can be utilized, namely confidence. In other words, data reliability and confidence are two self-aware properties that can enhance the conventional EWS system. Both reliability and confidence are metadata. Reliability is metadata of the given input data and provides information on the degree to which the data is reliable, that is, the degree to which the system can trust its sensors. Confidence, in turn, is metadata for decisions, which are motivated by observations of various pieces of information and other metadata; the system can base its decisions on this confidence.
In this paper, we propose a self-aware EWS system which validates reliability and bases all decisions on a confidence assessment. These validations and assessments are techniques based on fuzzy logic. To show the effectiveness of these two self-aware properties, we recorded vital signs of a set of persons with high-quality and low-quality sensors. In our experiments, we demonstrate that the Self-Aware Early Warning Score (SA-EWS) system calculates the EWS correctly, or with a small error close to the value it should have, even if the monitoring circumstances are adverse. The results show that our proposed SA-EWS system is more reliable than an EWS system without self-awareness. In other words, we show that self-awareness is a good foundation for a reliable EWS system that classifies the EWS in a trustworthy manner even if there is some faulty sensory data. Our main contributions are:
1. We propose a fuzzy logic based confidence metric for the quality assessment of the calculated EWS,
2. we show how a fuzzy logic based reliability metric gives information about the correctness of the input data,
3. we introduce a method for combining the input data reliability and the confidence of the system to calculate output data reliability based on both factors, and
4. using extended experiments, we demonstrate that our proposed system gives equally good or better results than a similar system that does not use reliability and confidence metrics.
After reviewing relevant related work in Section 2, we explain the self-awareness properties reliability and confidence in Section 3. Section 4 shows the system architecture as well as the implementation of our proposed system. Section 5 explains the experimental setup and presents the results, and Section 6 concludes the paper.

Background and related work
In 1997, Morgan et al. proposed a medical method called EWS that is currently widely used in hospitals to help determine the degree of patients' health deterioration. The patient's vital signs, such as respiration rate, heart rate, systolic blood pressure, body temperature, blood oxygen saturation (SpO2), and the level of consciousness, are manually collected in a regular routine and classified into different scores. These scores, ranging from 0 to 3, are determined according to the observations and predefined ranges of the vital signs. Table 1 shows an EWS chart used for obtaining the various scores. In this chart, score 0 is allocated to a vital sign that is in perfect condition; e.g., a heart rate in a range between 60 and 100 beats per minute. If the value of a vital sign is a bit worse than this (a bit too low or too high), the corresponding score is 1. If the value of a vital sign is in an even worse condition (still higher or lower), the vital sign is classified as score 2. Any value worse (depending on the case, higher or lower in absolute value) than the above ranges is classified as score 3.
The EWS is a simple aggregate of the scores that are abstracted from the patient's vital signs. The lower the calculated EWS, the better the patient's condition. A high EWS corresponds to a high risk of death or critical medical conditions [15]. Therefore, the score reveals early signs of health deterioration and can be used to trigger a rapid response team to evaluate the patient. Similarly, an approach to predict potential sudden patient death has recently received FDA approval [16].
The EWS itself can be classified into three different risk levels: low (EWS: 0-3), medium (EWS: 4-6), and high (EWS: 7 or higher). A low-risk level requires a nurse to assess the patient periodically. A medium-risk level requires the medical team to be informed urgently. A high-risk level should trigger an urgent clinical response, as the patient's condition is critical [17][18][19].
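The aggregation and the three risk levels described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the example scores are assumptions made here, and the per-vital-sign scores are taken as given rather than derived from Table 1.

```python
def aggregate_ews(scores: dict[str, int]) -> int:
    """Sum the per-vital-sign scores (each in 0..3) into the EWS."""
    for name, score in scores.items():
        if score not in (0, 1, 2, 3):
            raise ValueError(f"score for {name} must be 0..3, got {score}")
    return sum(scores.values())


def risk_level(ews: int) -> str:
    """Map an EWS to the risk levels from the text: 0-3, 4-6, 7 or higher."""
    if ews < 0:
        raise ValueError("EWS cannot be negative")
    if ews <= 3:
        return "low"
    if ews <= 6:
        return "medium"
    return "high"


# Hypothetical scores for five vital signs (illustrative values only).
example_scores = {
    "heart_rate": 1,
    "respiration_rate": 2,
    "systolic_blood_pressure": 0,
    "body_temperature": 1,
    "spo2": 1,
}
example_ews = aggregate_ews(example_scores)
print(example_ews, risk_level(example_ews))  # 5 medium
```

Keeping the aggregation separate from the risk classification mirrors the text, where the EWS is first summed and only then mapped to a response level.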
There are, nevertheless, various restrictions and issues, such as latency and inaccuracy, in this manual data acquisition. Furthermore, this system is restricted to hospital settings where patients are stationary. In this regard, an IoT-based health monitoring system has been proposed to monitor the vital signs autonomously and deliver the EWS to healthcare providers [20]. Estimations suggest that the ratio between the world's population and IoT devices will be one to four [21]. These small IoT devices and wearables form a good basis for a well-structured EWS system which autonomously monitors a patient in a cost-efficient way while decreasing the mortality rate [4][5][6][7]. Although IoT provides a potential solution for monitoring a person's vital signs, the conventional EWS system is still not applicable to out-of-hospital monitoring, since daily activities and the environment influence the vital signs and, subsequently, the decision making. Usually, a person has a higher heart rate, blood pressure, respiratory rate, and body temperature during physical effort (e.g., running and riding a bicycle) than during more relaxed activities such as sitting or sleeping. Using the same score classification ranges (such as those in Table 1) would lead to a high EWS during physically demanding activities although there is no emergency. To this end, a modified EWS system has been proposed for everyday settings, providing a self-aware decision (i.e., the score) according to the context information and five vital signs (see footnote 2) [22]. An autonomous mobile EWS system still faces problems that have to be solved before it can offer a reliable EWS calculation. Incorrectly attached or detached sensors, broken sensors, or a noisy signal affect the EWS calculation. If the calculated value is still close to the truth, this may not be a problem. In contrast, an EWS that deviates more from the truth could lead to a false or, even worse, a missing alarm with all its consequences.
Self-awareness is a promising solution to tackle this problem. Self-awareness is the ability of the system to monitor itself and its environment regarding the state, behavior, performance, and goals. This is often accompanied by an adjustment of some of the components and parameters, which leads to achieving or approaching the goals of the system [23]. This process has been modeled in different ways by various groups, among which some of the more well-known ones are Observe-Decide-Act (ODA) [24] and Monitor-Analyze-Plan-Execute over a shared Knowledge (MAPE-K) [25]. Several works have implemented self-awareness in various systems and taken advantage of its properties [12,23,24,[26][27][28]. However, most of these works focus on the smart decision-making process, while paying little attention to the observation (monitoring) part of the process. In 2016, TaheriNejad et al. published a paper [29] which highlighted this aspect and elaborated on different elements of observation and their potential effect on self-awareness and the overall performance of the system. Since then, several publications have appeared in the literature which demonstrated this effect in various applications [13,[26][27][28][30][31][32][33].

Footnote 2: The level of consciousness is excluded because it is not applicable in out-of-hospital monitoring.
Our previous works utilize various self-awareness properties to overcome different issues. Anzanpour et al. exploited self-awareness in IoT-based EWS systems. In this work, situation awareness was utilized to improve the specificity of the EWS values, considering the impact of the user's physical activities in the calculation. Attention, as another self-awareness property, was also used to enable a self-organized system that dynamically adjusts the system's configuration to reduce power consumption [26]. Such a dynamic behavior can increase system battery life, but it could decrease the reliability of the EWS in the case of low-quality signals. In another work [13], the proposed system assesses the reliability of the calculated EWS. The fuzzified reliability validation tackles the fact that the knowledge about the vital signs as well as their interactions is not complete. With this technique, it was possible to recognize erroneous vital signs caused by various measurement artifacts such as detached sensors, loose sensors, and other interferences.
Our results show that self-awareness can tackle various issues that affect the reliability of a mobile EWS system. Although the proposed system of [13] provides information about the trustworthiness of the calculated EWS, the EWS itself is still incorrectly calculated if the input data is corrupted. Enhancing the decision-making mechanism of the EWS system is a way to solve this problem and improve reliability.

Self-awareness properties
In this work, we study two aspects of self-awareness, namely confidence and data reliability, and the interplay between the two as well as their effect on the overall performance of the system. Moreover, we have tried to formalize these concepts, which were initially described in [29] only conceptually, in order to establish a more uniform understanding of these concepts.

Data reliability
Data Reliability describes the trustworthiness of a set of data at hand, which can be divided into accuracy, precision, and truthfulness. A sensor may be accurate and precise. However, if it is used outside its assumed working conditions, it does not provide reliable data; i.e., it does not provide truthful data. Moreover, even though accuracy and precision provide general measures on the overall quality of a data set (or performance of a sensor), they do not provide an explicit meta-data on each data point. A (resource constrained) self-aware system such as ours, however, sometimes needs to make decisions based on single or few data points. Therefore, accuracy and precision do not provide enough situational information for such cases, and the system needs to estimate and be aware of the overall reliability of those data points based on which it makes a decision.

Formal definition
As mentioned before, data reliability can be broken down into accuracy, precision, and truthfulness. Accuracy, $A(X')$, is the systematic bias of the data set at hand, i.e., $X' = x'_0, \dots, x'_n$, compared to the ground truth values, $X = x_0, \dots, x_n$. As a measure of statistical bias it can be defined as

$$A(X') = \frac{1}{n} \sum_{i=0}^{n} (x'_i - x_i).$$

Precision presents the random errors in the data (for a measurement, it would be the random errors of repeated measurements under the same conditions). Since precision is a measure of statistical variability, it can be defined as:

$$P(X') = \sqrt{\frac{1}{n} \sum_{i=0}^{n} (x'_i - \mu)^2},$$

where $\mu = \frac{1}{n} \sum_{i=0}^{n} x'_i$. Truthfulness, $t$, is the distance of each value at hand, $x'_i$, from the corresponding ground truth value $x_i$:

$$t(x'_i) = |x'_i - x_i|.$$

The overall truthfulness, $T(X')$, of a set of values can be defined as

$$T(X') = \frac{1}{n} \sum_{i=0}^{n} t(x'_i).$$

Accuracy and precision are defined on one or more data sets, $X'$ and $X$, and hence are a property of a set, whereas truthfulness is defined on each data sample, $x'_i$. Therefore, even though $A$, $P$, and $t$ (and consequently $T$) are correlated, a closed-form formula describing their dependency often cannot be established. Moreover, in many cases the ground truth value, $x_i$, is not available, which makes the calculation of $t$ impossible. In consequence, often an estimation of $t$, namely $t'$, is devised, which may or may not include the effect of accuracy and precision.
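To make the three components more concrete, the following Python sketch computes one plausible reading of the verbal definitions above: accuracy as the mean signed error, precision as the standard deviation of the measured values around their own mean, and truthfulness as the absolute distance to the ground truth. The function names, the normalization by the number of samples, and the example temperatures are assumptions made for illustration and are not taken from the paper.

```python
import math


def accuracy(measured, truth):
    """Systematic bias: mean signed difference between measured and true values."""
    return sum(m - t for m, t in zip(measured, truth)) / len(measured)


def precision(measured):
    """Statistical variability: standard deviation around the measured mean."""
    mu = sum(measured) / len(measured)
    return math.sqrt(sum((m - mu) ** 2 for m in measured) / len(measured))


def truthfulness(measured_value, true_value):
    """Distance of a single measured value from its ground truth value."""
    return abs(measured_value - true_value)


def overall_truthfulness(measured, truth):
    """Average per-sample truthfulness over the whole data set (assumed form)."""
    total = sum(truthfulness(m, t) for m, t in zip(measured, truth))
    return total / len(measured)


# Hypothetical body temperature readings (degrees Celsius) and ground truth.
measured = [37.0, 37.2, 36.8, 37.0]
truth = [36.5, 36.5, 36.5, 36.5]

print(round(accuracy(measured, truth), 3))              # 0.5
print(round(precision(measured), 3))                    # 0.141
print(round(overall_truthfulness(measured, truth), 3))  # 0.5
```

In this toy data set the sensor is fairly precise but biased by about half a degree, which illustrates why a good precision figure alone says little about whether an individual sample can be trusted.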
In summary, given a sequence of sampled data points $X'$, the data reliability $R$ of $X'$ is given as (the same can be defined for each value)

$$R(X') = f\big(A(X'), P(X'), T(X')\big),$$

where $f$ determines the role of each parameter and thus how well $R$ fits its purpose. For example, the reliability of $x'_i \in X'$ could be calculated as

$$r(x'_i) = c_1 A(X') + c_2 P(X') + c_3 t(x'_i),$$

with constants $c_1$, $c_2$, and $c_3$ defining the relative weights given to the three components of the data reliability. Ideally, the reliability is defined such that it maps to the range between zero and one:

$$0 \le r(x'_i),\ R(X') \le 1.$$

In a cyber-physical system, $A$ and $P$ are usually provided by the producers of the sensors (even though that is not always the case), and $t$ and $f$ are to be calculated or estimated by the system using the sensor. In the absence of these values, the designer needs to estimate $r$ or $R$ by $r'$ and $R'$, respectively, using custom methods. In this work, we present our proposed method to calculate $r'$ and $R'$, which we use as our measure of data reliability.
In the following, we present three measures which can provide insight into the reliability of the data at hand. These are consistency, plausibility, and correlation of data. An important feature of these measures is that they can be applied to low-level data (obtained directly from sensors) or higher-level data (obtained from processes and algorithms within a system).

Plausibility
Data sets can often be associated with a membership function, specifically in the case of cyber-physical systems, that expresses how plausible the existence of a datum with a certain value in the data set is. For example, the oxygen saturation can only be in the range of 0-100%; any other value reported is a sign of malfunction and unreliability of the data. The same could be said for a heart rate of 300 beats per minute for an adult person. By tagging such data as less reliable or unreliable, a self-aware system could react accordingly (e.g., look for further sources of information or dismiss the data).

Consistency
A certain consistency is often observed within the members of a data set. This is particularly valid in the case of data sets representing natural phenomena, i.e., data collected by a sensor from the real world. Such signals often experience limited changes from one sample to the next. Therefore, the history of a signal and its consistency can provide some information on how reliable is that source of data. For example, it is established that the body temperature cannot change several degrees per minute [34]. Hence, if a larger rate of change occurs in a data set, a self-aware system should tag such an observation (which may be caused by a sensor detachment or a fault/failure in the sensor) as unreliable (regardless of its cause) and react accordingly.
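The plausibility and consistency checks described in the two subsections above can be sketched in crisp (non-fuzzy) form as follows. The function names and the rate limit of 1 degree Celsius per minute are hypothetical values chosen for illustration; the text only states that body temperature cannot change by several degrees per minute.

```python
def is_plausible(value: float, low: float, high: float) -> bool:
    """Plausibility: the value lies inside the physically possible range."""
    return low <= value <= high


def is_consistent(previous: float, current: float,
                  dt_seconds: float, max_rate_per_minute: float) -> bool:
    """Consistency: the change between two samples stays below a rate limit."""
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    rate_per_minute = abs(current - previous) / (dt_seconds / 60.0)
    return rate_per_minute <= max_rate_per_minute


# Oxygen saturation can only lie between 0 and 100 percent.
print(is_plausible(104.0, 0.0, 100.0))  # False -> tag as unreliable

# A drop of 3.8 degrees within 30 seconds, e.g. after a sensor detachment.
# The limit of 1.0 degree per minute is an assumed, illustrative threshold.
print(is_consistent(36.8, 33.0, dt_seconds=30.0, max_rate_per_minute=1.0))  # False
```

A crisp check like this only yields reliable or unreliable; replacing the hard thresholds with membership functions gives the graded reliability values used later in the paper.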

Cross-validity
In some cases, there exists a correlation between the values of two data sets (or such a correlation can be established). In such cases, this correlation can be exploited to evaluate the probability or possibility of the coexistence of two or more values. If their coexistence is not possible (e.g., a living patient with valid heart rate and respiratory rate but a negative body temperature), then one or some of those data could be tagged as unreliable (in this example, the body temperature). If their coexistence is possible but not very probable (e.g., a body temperature around 30 °C with typical values for other biological signals), the reliability of the data could be reduced, signaling to the system a need for further analysis. In the use case of this work, several works have tried to establish such correlations between vital signals of the body [35][36][37]. Although they do not always provide a conclusive insight, they help us to enhance the robustness of our system by enabling additional data reliability assessments.

Confidence
Confidence is a measure of the reliability of an algorithm or a process in the system [29]. Conceptually, we can say that confidence provides the system with a measure of how much the results of an algorithm or a process can be relied upon, in other words, how close the output of this algorithm or process would be to the ideal output. All of this holds under the assumption that the system has received flawless input data. More often than not, however, the input data collected by the sensors are not ideal (as discussed in the data reliability subsection). Therefore, the reliability of the output of a system depends on both its confidence and the data reliability of its inputs.
The importance of confidence is in its ability to improve the decision-making processes [12] and allow a self-aware system to question certain abstracted data it has processed, and make more reliable decisions based on the reliability of its sub-processes and sub-algorithms. An important application of this concept for the decision-making unit is to enable it to switch between different algorithms based on their confidence, the usefulness of which has been shown in [28].

Formal definition
If $I$ is an ideal function defined over $X = x_0, \dots, x_n$ and $g$ is the unideal function at hand, also defined over $X$, then the confidence of $g(x_i)$ (defined for each member of $X$) can be defined as a function of $g(x_i)$ and $I(x_i)$. Let $\delta\big(g(x_i), I(x_i)\big)$ represent the "distance" between $I$ and $g$ based on some application-specific metric for distance, normalized such that $0 \le \delta \le 1$. Thus, for the confidence, $c$, we have:

$$c\big(g(x_i)\big) = 1 - \delta\big(g(x_i), I(x_i)\big).$$

Overall confidence of $g$ (as opposed to confidence at each point), represented by $C$, is the average confidence of $g$ over $X$:

$$C(g) = \frac{1}{n} \sum_{i=0}^{n} c\big(g(x_i)\big).$$

We note that $0 \le c(g), C(g) \le 1$ and $c(I) = C(I) = 1$. How to calculate $c$ (and consequently $C$), however, is case specific. Often the ground truth ($I$) is not available and the aforementioned distance cannot be calculated. Therefore, a function $\delta'$ is used instead to estimate $\delta$, which is what we do in the rest of this work too. That is, we propose an estimation of $c$ (i.e., $c'$). In other words, all the confidence ($c$) functions hereafter refer to $c'$, which is an estimation of $c$.
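As a small illustration of this definition, the sketch below treats confidence as one minus a normalized distance between the output of the unideal function and the ideal output, and the overall confidence as the average over all points. The names `delta`, `pointwise_confidence`, and `overall_confidence`, the choice of an absolute difference divided by a fixed scale as the distance metric, and the example scores are all assumptions made here, since the text leaves the metric application specific.

```python
def delta(g_value: float, ideal_value: float, scale: float) -> float:
    """Normalized distance in [0, 1]; absolute difference is an assumed metric."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return min(abs(g_value - ideal_value) / scale, 1.0)


def pointwise_confidence(g_value: float, ideal_value: float, scale: float) -> float:
    """Confidence of a single output: one minus its normalized distance."""
    return 1.0 - delta(g_value, ideal_value, scale)


def overall_confidence(g_values, ideal_values, scale: float) -> float:
    """Average pointwise confidence over all data points."""
    confidences = [pointwise_confidence(g, i, scale)
                   for g, i in zip(g_values, ideal_values)]
    return sum(confidences) / len(confidences)


# Hypothetical vital sign scores: ideal abstraction vs. an imperfect one.
ideal_scores = [0, 1, 2, 3]
actual_scores = [0, 1, 3, 3]

# The largest possible score distance is 3, so it is used as the scale.
print(round(overall_confidence(actual_scores, ideal_scores, scale=3.0), 3))  # 0.917
print(overall_confidence(ideal_scores, ideal_scores, scale=3.0))             # 1.0
```

The second call reproduces the boundary property stated in the text: the ideal function has a confidence of one at every point and overall.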

Combination of data reliability and confidence
In this section, we have so far discussed the concepts of data reliability (as a property of a data set) and confidence (as a property of a process or algorithm) independently. However, in a real-world system, these two are often tightly intertwined. Processes consume data and produce data. Assuming an ideal input, the data reliability of the output data of a process could be associated with its confidence (although not always in a straightforward or simple manner). However, most often, the input data are unideal and subject to a data reliability below one. Therefore, the data reliability of the output data of a process is a function (φ) of the input data reliability and the confidence of the process. Calculating the output data reliability of a process (which in turn could be the input data reliability of another process) is particularly difficult when data reliability or confidence are obtained using estimation functions. In this work, we explore this realm and propose a method which shows good promise in estimating the output data reliability of different processes in our system based on the respective input data reliability and confidence of each process. More details on our practical implementation are found in Section 4.3.

Formal definition
If $X' = x'_0, \dots, x'_n$ is the data set at hand (i.e., the unideal values), corresponding to the ground truth values $X = x_0, \dots, x_n$, we have:

$$R\big(g(X')\big) = \varphi\big(R(X'), C(g)\big).$$

Since, as mentioned before, $\forall x:\ c(g(x)) \le c(I(x))$ and

History
History enables access to time-dependent information in a system. For example, whether the performance of a (sub)system has been improving or degrading. The historical data can provide meta-data on the current status of the system and its environment. They also help in predicting the (near) future status of the system and its environment. Given that most systems have memory limitation, choosing the type and mode of storing historical value, and a smart usage of it are important points to be considered when designing a self-aware system using history for enhancing its performance.

Formal definition
There are several methods to track the past values in a sequence. Given the sequence of values or symbols $X = x_0, \dots, x_n$, $H = h_0, \dots, h_m$ is a subsequence of $X$, in which $m \le n$. If $m = n$, the system is memorizing everything, which is undesirable. Therefore, most often $m < n$ and preferably $m \ll n$. We note that the history function is a specific form of abstraction which concerns time, i.e., the sequence length of $X$. As such, we can define it as

$$H = y(X),$$

where, at the sequence point of $x_s$,

$$H_s = y(x_0, \dots, x_s),$$

and the function $y$ determines how exactly the history $H$ is extracted from $X$. An interpretation or abstraction of $X$ (such as the average of a certain number of data points), or a direct storage of the values themselves, are examples of $y$.
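A bounded history of this kind can be sketched with a fixed-length buffer. The class name `History`, the window length, and the use of a moving average as the abstraction are illustrative assumptions and are not prescribed by the text; they show the two examples of the history function mentioned above, namely direct storage of recent values and an averaged abstraction of them.

```python
from collections import deque


class History:
    """Keeps only the most recent max_len values of a sequence."""

    def __init__(self, max_len: int) -> None:
        if max_len <= 0:
            raise ValueError("max_len must be positive")
        # A deque with maxlen silently drops the oldest value when full.
        self._values = deque(maxlen=max_len)

    def record(self, value: float) -> None:
        """Append the newest value to the history."""
        self._values.append(value)

    def values(self) -> list:
        """Direct storage: return the remembered values, oldest first."""
        return list(self._values)

    def mean(self) -> float:
        """Abstraction: return the average of the remembered values."""
        if not self._values:
            raise ValueError("history is empty")
        return sum(self._values) / len(self._values)


history = History(max_len=3)
for sample in [1.0, 2.0, 3.0, 4.0, 5.0]:
    history.record(sample)

print(history.values())  # [3.0, 4.0, 5.0]
print(history.mean())    # 4.0
```

Bounding the buffer length is what keeps the number of stored values far below the length of the full sequence on a memory-limited device.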

System architecture and implementation
A hierarchical agent-based architecture (as shown in Fig. 1) consists of independent modules which can communicate with each other and may be on different hierarchical levels. The possibility of structuring the agents hierarchically makes it possible to process data on different levels of abstraction [38]. The EWS is the aggregate of various scores abstracted from different vital signs. The task of abstraction is the same for each vital sign, but the ranges vary from vital sign to vital sign. The assessment of reliability and the confidence-based decisions are done on different levels of abstraction. As an example, a part of the reliability assessment is based on the absolute value and the slope of the signal of a vital sign (principles of plausibility and consistency in Sections 3.1.2 and 3.1.3). To analyze whether a signal is plausible and consistent, the raw data is of interest. In contrast, making a statement about the correlation between vital signs (principle of cross-validity in Section 3.1.4) requires already abstracted information. Because of these differences in the horizontal direction (different vital signs) and in the vertical direction (different levels of abstraction), a hierarchical agent-based model (Fig. 1) constitutes an appropriate practical architecture for this purpose.
Because an ODA loop is an appropriate approach to implement self-awareness, our system is also based on this concept [23,29,39]. Each agent acts like an ODA loop, which means that it monitors its inputs (sensor or agent), decides what to do, and acts accordingly. Furthermore, this approach allows implementing a highly modular model easily.
While the abstraction from the raw sensor value to the vital sign score (with the help of Table 1) takes place in the lower hierarchical level, the agent on top aggregates the five scores into the overall score, the EWS. In other words, each low-level agent abstracts the actual samples obtained from its dedicated sensor and sends the result to the high-level agent, which sums up all these scores. Both the reliability assessment and the confidence-based decision-making take place in the lower and in the higher hierarchical level. However, the implementations of these processes are different in the two hierarchical levels. In the next two sections, we explain the reliability assessment and the confidence-based decision-making process, before Section 4.3 shows the workflow of the proposed system in detail.

Fig. 1 Hierarchical agent-based system architecture
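The two-level structure described above can be sketched as follows. The class names, the `step` method, and the toy scoring functions are hypothetical; in particular, only the heart rate range of 60 to 100 for score 0 comes from the text, and the remaining thresholds are placeholders rather than the values of Table 1.

```python
from typing import Callable, Dict, List


class LowLevelAgent:
    """Observes one sensor and abstracts each raw sample to a score."""

    def __init__(self, name: str, to_score: Callable[[float], int]) -> None:
        self.name = name
        self._to_score = to_score

    def step(self, sample: float) -> int:
        # Observe the sample, decide on its score, act by reporting it upward.
        return self._to_score(sample)


class HighLevelAgent:
    """Collects the scores of its child agents and sums them to the EWS."""

    def __init__(self, children: List[LowLevelAgent]) -> None:
        self._children = children

    def step(self, samples: Dict[str, float]) -> int:
        return sum(child.step(samples[child.name]) for child in self._children)


def toy_heart_rate_score(bpm: float) -> int:
    # Score 0 for 60..100 bpm follows the text; everything else is a placeholder.
    return 0 if 60 <= bpm <= 100 else 1


def toy_spo2_score(percent: float) -> int:
    # Hypothetical threshold, not taken from Table 1.
    return 0 if percent >= 95 else 1


top_agent = HighLevelAgent([
    LowLevelAgent("heart_rate", toy_heart_rate_score),
    LowLevelAgent("spo2", toy_spo2_score),
])
print(top_agent.step({"heart_rate": 120.0, "spo2": 97.0}))  # 1
```

Because every agent exposes the same `step` interface, further vital signs or additional hierarchy levels could be added without changing the existing agents, which is the modularity argument made in the text.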

Fuzzified reliability assessment
Due to the lack of complete knowledge of all functions of a patient's body, it is very challenging to determine whether a vital sign is monitored correctly or incorrectly. Therefore, in contrast to one of our previous works [32], we use fuzzy logic instead of simple boolean logic to assess the reliability value. The usage of fuzzy logic enables the coverage of the unsharp ranges in which a patient's vital sign is not tagged merely as correct or incorrect, but rather somewhere on the spectrum of reliability. Hence, the data reliability of a vital sign is assigned a value in the range between 0 and 1.
The reliability of a patient's vital sign, $vs_i$, is composed of two different reliability assessments: the reliability of the signal's absolute value, $r_{abs,i}$, and the reliability of the signal's slope, $r_{slo,i}$. These correspond to the plausibility and consistency of data, as described in Section 3.
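The reliability for being plausible, $r_{abs,i}$, is the output of a fuzzy membership function (Fig. 2) defined by four points and three intervals. If the absolute value is in the interval $[p_b, p_c]$, it is certainly reliable. If it falls in one of the two adjacent intervals, $(p_a, p_b)$ or $(p_c, p_d)$, it is partially reliable, and outside of these intervals it is unreliable:

$$r_{abs,i} = \begin{cases} \dfrac{v_{a,i} - p_a}{p_b - p_a} & \text{if } p_a < v_{a,i} < p_b, \\ 1 & \text{if } p_b \le v_{a,i} \le p_c, \\ \dfrac{p_d - v_{a,i}}{p_d - p_c} & \text{if } p_c < v_{a,i} < p_d, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$u_{abs,i} = 1 - r_{abs,i},$$

where the points $p_a$, $p_b$, $p_c$, and $p_d$, and thus the intervals between them, are configured to match the characteristics of the assigned vital sign. Similarly, for the reliability for being consistent, $r_{slo,i}$, and its counterpart (the unreliability, $u_{slo,i}$), a fuzzy membership function of the same shape exists (Fig. 2). Again, these functions are defined by four points and the three intervals between them. They are as follows:

$$r_{slo,i} = \begin{cases} \dfrac{g - p_a}{p_b - p_a} & \text{if } p_a < g < p_b, \\ 1 & \text{if } p_b \le g \le p_c, \\ \dfrac{p_d - g}{p_d - p_c} & \text{if } p_c < g < p_d, \\ 0 & \text{otherwise,} \end{cases}$$

and

$$u_{slo,i} = 1 - r_{slo,i},$$

where $g$ is the gradient from the previous value, $v_{p,i}$, to the actual value, $v_{a,i}$. This gradient is calculated by

$$g = \frac{v_{a,i} - v_{p,i}}{\Delta t},$$

where $\Delta t$ constitutes the time between the samples. Depending on which of the two reliabilities is assessed, the abscissa constitutes the absolute value or the slope of a vital sign. The ordinate of the fuzzy membership function then constitutes the corresponding reliability. While the abscissa gives space for all values (from $-\infty$ to $+\infty$), the reliability values on the ordinate are limited to the range between 0 and 1.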
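A membership function defined by four points and three intervals, as described above, has a trapezoidal shape and can be sketched as follows. The function names and the four body temperature points are hypothetical values chosen for illustration; the paper configures these points per vital sign and they are not given in this section.

```python
def trapezoid(x: float, p_a: float, p_b: float, p_c: float, p_d: float) -> float:
    """Trapezoidal membership: 0 outside [p_a, p_d], 1 inside [p_b, p_c]."""
    if not (p_a < p_b <= p_c < p_d):
        raise ValueError("points must satisfy p_a < p_b <= p_c < p_d")
    if x <= p_a or x >= p_d:
        return 0.0
    if x < p_b:
        return (x - p_a) / (p_b - p_a)  # rising ramp
    if x <= p_c:
        return 1.0                      # plateau: certainly reliable
    return (p_d - x) / (p_d - p_c)      # falling ramp


def gradient(actual: float, previous: float, dt: float) -> float:
    """Slope between the previous and the actual sample."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    return (actual - previous) / dt


# Hypothetical points for the absolute value of body temperature (deg C).
TEMP_POINTS = (30.0, 35.0, 40.0, 43.0)

r_abs = trapezoid(32.5, *TEMP_POINTS)
u_abs = 1.0 - r_abs
print(r_abs, u_abs)                    # 0.5 0.5
print(trapezoid(37.0, *TEMP_POINTS))   # 1.0
print(trapezoid(50.0, *TEMP_POINTS))   # 0.0
```

The same `trapezoid` function can be reused for the slope reliability by passing the result of `gradient` as `x` together with a second, slope-specific set of four points.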
After the assessment of $r_{abs,i}$ and $r_{slo,i}$, the input reliability, $r_{in,i}$, can be calculated in many different ways, such as conjunction ($\wedge$), disjunction ($\vee$), or multiplication of different inputs, as well as if-then rules and other methods. We decided to use the conjunction operator because a vital sign is reliable when its absolute value and its slope are reliable. Therefore, the input reliability $r_{in,i}$ of a vital sign is given by

$$r_{in,i} = r_{abs,i} \wedge r_{slo,i} = \min(r_{abs,i}, r_{slo,i}),$$

where the fuzzy conjunction is equal to a minimum function [40]. Because the input reliability, $r_{in,i}$, depends only on the raw sensor data (absolute value and gradient of the signal), it is calculated in the low-level agents, which are also responsible for the abstraction of the vital signs. This input reliability is calculated for every vital sign and provides information on whether it is reliable or unreliable when considered separately. In other words, the reliability of one vital sign does not take the condition of other vital signs into account.
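The effect of choosing the conjunction can be seen in a short sketch that compares it with two of the alternatives named above. The function names and the example reliabilities are assumptions for illustration only.

```python
def check_unit_interval(*values: float) -> None:
    """Reject reliabilities outside the range between 0 and 1."""
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"reliability must be in [0, 1], got {value}")


def input_reliability(r_abs: float, r_slo: float) -> float:
    """Fuzzy conjunction (minimum), the operator chosen in the text."""
    check_unit_interval(r_abs, r_slo)
    return min(r_abs, r_slo)


def disjunction(r_abs: float, r_slo: float) -> float:
    """Fuzzy disjunction (maximum), shown only for comparison."""
    check_unit_interval(r_abs, r_slo)
    return max(r_abs, r_slo)


def product(r_abs: float, r_slo: float) -> float:
    """Multiplication of the inputs, shown only for comparison."""
    check_unit_interval(r_abs, r_slo)
    return r_abs * r_slo


# A plausible absolute value combined with a suspicious slope.
r_abs_example, r_slo_example = 0.9, 0.4
print(input_reliability(r_abs_example, r_slo_example))  # 0.4
print(disjunction(r_abs_example, r_slo_example))        # 0.9
```

With the minimum, a single suspicious component is enough to lower the input reliability, whereas the maximum would let a plausible absolute value mask an implausible jump in the signal.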
Vital signs impact each other, and therefore one vital sign, $vs_i$, usually does not have a terrible score while the others have a perfect score. For this reason, a cross-validation reliability value is needed. The cross-validation reliability, $r_{cro,i,j}$, for the vital signs $vs_i$ and $vs_j$ is calculated from the abstracted scores of these two vital signs, $s_i$ and $s_j$, and a coefficient $p_{cro,i,j} \in (0, \infty)$ that denotes the strength of the correlation (see footnote 5) between $vs_i$ and $vs_j$. Because the cross-validity reliability, $r_{cro,i,j}$, already makes use of the abstracted information (the various vital sign scores), it is calculated in the high-level agent, which is responsible for the calculation of the EWS.

Fuzzified confidence-based decisions
As already stated in Section 3.1, data reliability describes the trustworthiness of a set of data at hand, which can be divided into accuracy, precision, and truthfulness. If a sample (a sensor value) is not very accurate, two different possibilities exist. If the real vital sign value (the ground truth) is somewhere in the middle of a score range of Table 1 and the sensor's inaccuracy is not very high, the abstracted score will most likely be equal to the ground truth. In contrast, a wrong score abstraction could result from a ground truth value very close to a boundary of such a range or from a highly inaccurate sensor.

Footnote 5: The reliability module in our implementation limits the cross-validity reliability, $r_{cro,i,j}$, to a value between 0 and 1, although theoretically, a coefficient less than 1 can lead to an $r_{cro,i,j}$ higher than 1. The standard value of $p_{cro,i,j}$ is 1.
To overcome this issue, the abstraction process in the lower hierarchical level is not merely based on a simple lookup table as in Table 1. Instead, the boundaries of the different score ranges intersect, which means that the score ranges partly overlap. Figure 3a shows an example of the vital sign abstraction with four different fuzzy membership functions, one for each score.
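To illustrate how overlapping score ranges yield two candidate scores, the following minimal sketch uses a standard trapezoidal membership function. It is an assumption-laden illustration rather than the paper's exact confidence functions: the breakpoint names `p_a` to `p_d` follow the usual trapezoid convention, and the numeric breakpoints are hypothetical values that are not taken from Table 1 or Fig. 3a.

```python
def trapezoid(v: float, p_a: float, p_b: float, p_c: float, p_d: float) -> float:
    """Trapezoidal membership: rises on [p_a, p_b], is 1 on [p_b, p_c],
    and falls on [p_c, p_d]."""
    if not p_a < p_b <= p_c < p_d:
        raise ValueError("breakpoints must satisfy p_a < p_b <= p_c < p_d")
    if v <= p_a or v >= p_d:
        return 0.0
    if v < p_b:
        return (v - p_a) / (p_b - p_a)  # rising ramp
    if v <= p_c:
        return 1.0                      # plateau
    return (p_d - v) / (p_d - p_c)      # falling ramp


# Hypothetical breakpoints: the falling ramp of score 0 (98..102)
# coincides with the rising ramp of the upper branch of score 1.
SCORE_0 = (48.0, 52.0, 98.0, 102.0)
SCORE_1_UPPER = (98.0, 102.0, 108.0, 112.0)


def confidences(v: float) -> dict:
    """Confidence of abstracting value v to score 0 and to score 1."""
    return {0: trapezoid(v, *SCORE_0), 1: trapezoid(v, *SCORE_1_UPPER)}


print(confidences(75.0))   # {0: 1.0, 1: 0.0}
print(confidences(101.0))  # {0: 0.25, 1: 0.75}
```

Inside the shared ramp, the value is abstracted to both scores, each with a partial confidence, so a small sensor inaccuracy near a boundary no longer flips the score abruptly.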
In similar fashion to the reliability fuzzy functions, various intervals describe the confidence fuzzy membership functions. Because the fuzzy membership functions are an extension of Table 1, the intervals can vary between different vital signs. Heart rate and systolic blood pressure are symmetrical in the sense that each score higher than 0 is available both for values that are too low and for values that are too high. In contrast, respiratory rate, body temperature, and blood oxygen saturation are asymmetrical; some scores are missing on one side or on both sides. In the case of a symmetrically segmented vital sign (Fig. 3b), the confidence functions of abstracting the actual value of a vital sign to a score $s_i$, denoted $c_{s,i}$, are calculated by Eq. 21, where $s \in \{1, 2, 3\}$ is one of the three possible scores that the actual value, $v_{a,i}$, of the vital sign can have. Because score 0 of each vital sign has only one range in Table 1, the confidence function of abstracting a vital sign's actual value to score 0, $c_{0,i}$, is calculated by Eq. 22.
In the configuration of the proposed system, the interval of the ramp of a fuzzy membership function is congruent with the interval of the ramp of the next fuzzy membership function (e.g., $p_c$ and $p_d$ of $c_{0,i}$ are equal to $p_a$ and $p_b$ of $c_{1,i}$, respectively). This approach makes it possible to abstract a vital sign value to two different scores, each with a certain confidence. As an example, let us assume that the actual value of a vital sign, $v_{a,i}$, lies in the interval between $p_{0,c}$ and $p_{0,d}$ (which is equal to the interval between $p_{1,e}$ and $p_{1,f}$). In this case, the vital sign will be abstracted to score 0 with $c_{0,i} = \frac{v_{a,i} - p_{0,d}}{p_{0,d} - p_{0,c}}$ and to score 1 with $c_{1,i} = \frac{v_{a,i} - p_{1,e}}{p_{1,f} - p_{1,e}}$.
However, the high-level agent also evaluates various confidences. Similar to Eq. 20, the cross-validity confidence, $c_{\mathrm{cro},i,j}$, is calculated based on a patient's individual correlations between the various vital signs; for example, Eq. 20 does not reflect the truth if a patient in normal health condition has tachypnea, hypertension, or another vital sign that leads to a score higher than the scores of the other vital signs. Based on the frequency of the various occurring score differences, $SD_{i,j}$, between the two vital signs $vs_i$ and $vs_j$, a patient profile is established, which gives information about the likelihood of a score difference between two different vital signs. For this purpose, the patient (in normal condition) is monitored for the period $T$ (the time of $n$ samples). After $n$ samples have been recorded, four different quantities, $q_{SD_{i,j}}$, one for each of the four possible score differences $SD_{i,j} \in \{0, 1, 2, 3\}$, are known. With the knowledge of these quantities, the cross-validity confidence between the two vital signs $vs_i$ and $vs_j$, $c_{\mathrm{cro},i,j}$, is calculated by Eq. 23.
Figure 1 shows the system architecture we propose in this work. At the bottom are five sensors, which monitor the five different vital signs (Table 1) and transmit the raw data to their dedicated agents in the lower hierarchical level. Figure 4 shows a simple schematic of the whole procedure for one low-level agent; the others are faded out. The functional principle of the agents of both the lower and the higher hierarchical level is explained in detail in the following.

Lower hierarchical level of computation
Each of these five low-level agents abstracts the actual value (obtained from its dedicated sensor) by calculating the confidences for every possible score, $c_{s,i}$ for $s = 0, 1, 2, 3$, by Eqs. 22 and 21. As shown in Fig. 4, each low-level agent also receives score suggestions from the high-level agent. Each of these suggestions consists of a score and its reliability, $r_{\mathrm{sug},s,i}$. Their calculation is based on Eqs. 20 and 23; Section 4.3.2 shows the exact procedure for generating these suggestions. Additionally, the input reliability, $r_{\mathrm{in},i}$, of the corresponding vital sign is calculated by Eq. 19. In the next step, the output reliability of each possible score, $r_{\mathrm{out},s,i}$, is calculated by

$$r_{\mathrm{out},s,i} = r_{\mathrm{in},i} \wedge c_{s,i} \wedge r_{\mathrm{sug},s,i} \qquad (24)$$

because the score is reliable if the vital sign value is reliable, the abstraction is done with high confidence, and the score correlates with the other vital signs (based on the suggested scores). After the output reliability of every possible score has been calculated, the low-level agent chooses the score with the highest output reliability and saves it in a history if the reliability is higher than a certain threshold (footnote 6). In the next step, the low-level agent sends the last saved score and its output reliability to the high-level agent. In other words, if the reliability of the current score is higher than the set threshold, it is sent to the high-level agent; otherwise, the previous score is sent.
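The selection step of a low-level agent can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `select_score`, the threshold value of 0.5, the tie-breaking behavior, and the fallback used when the history is still empty are all hypothetical choices for illustration and are not specified in the text.

```python
from typing import Dict, List, Tuple

ScoreEntry = Tuple[int, float]  # (score, output reliability)


def select_score(
    r_in: float,
    confidences: Dict[int, float],   # c_s for every possible score s
    suggestions: Dict[int, float],   # r_sug,s from the high-level agent
    history: List[ScoreEntry],
    threshold: float = 0.5,          # hypothetical value
) -> ScoreEntry:
    """Return the score (and reliability) a low-level agent would send."""
    # Output reliability per score: fuzzy conjunction (minimum) of the
    # input reliability, the confidence, and the suggestion reliability.
    r_out = {s: min(r_in, confidences[s], suggestions[s]) for s in confidences}
    # Ties are resolved in favor of the first score encountered (assumption).
    best = max(r_out, key=lambda s: r_out[s])
    if r_out[best] > threshold:
        history.append((best, r_out[best]))
    if not history:
        # Assumption: nothing saved yet, so fall back to the current best.
        return best, r_out[best]
    return history[-1]


history: List[ScoreEntry] = []
conf = {0: 0.2, 1: 0.8, 2: 0.0, 3: 0.0}
sugg = {0: 0.5, 1: 0.7, 2: 0.3, 3: 0.1}
print(select_score(0.9, conf, sugg, history))  # (1, 0.7)
# A noisy sample lowers the input reliability; the saved score is re-sent.
print(select_score(0.3, conf, sugg, history))  # (1, 0.7)
```

Re-sending the last saved entry is what lets the agent ride out short periods of unreliable input, such as a temporarily detached sensor.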

Higher hierarchical level of computation
The high-level agent calculates the EWS and its overall reliability, $r$. For this purpose, the agent reads all low-level scores and their output reliabilities, $r_{\mathrm{out},s,i}$. However, from the perspective of the high-level agent, these reliabilities are input reliabilities, and therefore, they are called $r_{\mathrm{in},i}$. The EWS is simply the sum of all five vital sign scores and is thus calculated by

$$EWS = \sum_{i=1}^{5} s_i \qquad (25)$$

With all five input reliabilities, $r_{\mathrm{in},i}$, the combined input reliability, $r_{\mathrm{in}}$, is calculated by Eq. 26. For two vital sign scores, the cross-validity reliability is calculated by Eq. 20, and the personalized cross-validity confidence by Eq. 23. After the calculation of both of these metrics, the personalized cross-validity reliability, $r_{\mathrm{per,cro},i,j}$, can be calculated in different ways. We decided to use the disjunction ($\vee$) operator because the correlation is plausible if it accords with our general rule (20) or matches the personalized body functions of the patient (23). Therefore, the personalized cross-validity reliability, $r_{\mathrm{per,cro},i,j}$, is given by

$$r_{\mathrm{per,cro},i,j} = r_{\mathrm{cro},i,j} \vee c_{\mathrm{cro},i,j} \qquad (27)$$

where the fuzzy disjunction is equal to a maximum function [40]. The overall reliability of the calculated EWS is composed of all input reliabilities and all personalized cross-validity reliabilities, $r_{\mathrm{per,cro},i,j}$. For this purpose, all $r_{\mathrm{per,cro},i,j}$ are combined into the combined cross-validity reliability, $r_{\mathrm{per,cro}}$, by Eq. 28, where the cross-validity reliabilities for $i = j$ are not calculated because they are certain to be 1 (20). Consequently, the overall reliability, $r$, is given by

$$r = r_{\mathrm{in}} \wedge r_{\mathrm{per,cro}} \qquad (29)$$

and constitutes, besides the EWS (25), the output of our proposed system.
As mentioned in Section 4.3.1, the high-level agent also makes score suggestions, which are sent to each low-level agent. For this purpose, theoretical personalized cross-validity reliabilities are calculated for each possible score ($s \in \{0, 1, 2, 3\}$) into which a vital sign could be classified. In particular, the four theoretically possible scores of one agent are evaluated with Eq. 27, where the score difference is based on the comparison with the real scores from the other four low-level agents. The reliability of each theoretically possible score (the suggested score), $r_{\mathrm{sug},s,i}$, is calculated for each possible score $s \in \{0, 1, 2, 3\}$; the comparison of one vital sign with itself is not performed. This procedure is repeated for all five vital signs, and the results (the four possible scores and their theoretical cross-validity reliabilities) are sent to the dedicated low-level agent.
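The parts of the high-level computation that the text states explicitly (the EWS as a sum of scores, the fuzzy disjunction as a maximum, and the final conjunction as a minimum) can be sketched as below. The function names are assumptions for illustration. How the individual values are merged into the combined input reliability and the combined cross-validity reliability is not reproduced here, so both are simply passed in as given numbers.

```python
from typing import Sequence


def early_warning_score(scores: Sequence[int]) -> int:
    """EWS as the sum of the vital sign scores."""
    return sum(scores)


def personalized_cross_validity(r_cro: float, c_cro: float) -> float:
    """Fuzzy disjunction (maximum): plausible if either the general rule
    or the patient's personal profile supports the score combination."""
    return max(r_cro, c_cro)


def overall_reliability(r_in: float, r_per_cro: float) -> float:
    """Fuzzy conjunction (minimum) of the combined input reliability and
    the combined personalized cross-validity reliability."""
    return min(r_in, r_per_cro)


scores = [0, 1, 0, 2, 0]                       # one score per vital sign
print(early_warning_score(scores))             # 3
# The general rule finds the score pair implausible (0.25), but it is
# common for this particular patient (0.8).
print(personalized_cross_validity(0.25, 0.8))  # 0.8
print(overall_reliability(0.9, 0.8))           # 0.8
```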

Experimental results
In this section, we describe our experimental setup as well as the validation method of our proposed system. We also discuss the experimental results in detail.

Experimental data
The data collection was performed on eight different participants aged from 23 to 37 (see Table 2). Half of the participants were male, and the other half were female.
As listed in Table 3 and shown in Fig. 5, we recorded and abstracted the vital signs with different sensors and in different ways. One set of sensors provides a high-accuracy source, and another set of sensors provides a low-accuracy source for normal and fault-emulated signals. As the high-accuracy sensor set, we use (i) a chest strap heart rate monitor for recording the Electrocardiogram (ECG) signal, (ii) a sensitive temperature sensor attached to the subject's nose for recording the airflow signal, (iii) an accurate temperature sensor attached to the armpit (axilla) (footnote 7), (iv) an upper arm blood pressure monitor, and (v) a high-fidelity Photoplethysmogram (PPG) sensor for recording infrared and red PPG signals (footnote 8). The low-accuracy sensor set consists of (i) another PPG sensor, which consumes less power and records the PPG signal with a lower Signal-to-Noise Ratio (SNR), (ii) a temperature sensor with lower sensitivity, which is attached to the armpit (axilla) and measures skin temperature, and (iii) a wrist-type blood pressure monitor, which measures an estimate of the blood pressure. Table 3 shows the details of the sensors in each set. All continuously recording sensors (footnote 9) were connected to an ATMEGA328P microcontroller, which reads the sensor values with a sampling frequency of 50 Hz. Finally, an Android phone, connected to this microcontroller via a USB-to-Serial converter, recorded the data.
In the next step, these recorded signals were analyzed to extract the vital signs. As listed in Table 3, we use two sets of PPG signals to obtain two sources of heart rate, respiration rate, and SpO$_2$ values (i.e., low-accuracy and high-accuracy values). First, a filter-based method is used to extract the respiratory and heartbeat signals. In this method, the cut-off frequencies are selected based on the Power Spectral Density (PSD) of the PPG signals [42][43][44]. Note that an acceptable SNR is needed for this method, as a high noise level influences the PSD of the signal and subsequently disturbs the cut-off frequency selection. Next, the respiration rate and heart rate values are determined via a peak detection method. Moreover, the SpO$_2$ value is calculated from the PPG signals using two light sources with different wavelengths (i.e., red at 660 nm and infrared at 880 nm) [45,46]. In addition to the PPG signals, further high-accuracy heart rate and respiration rate values are determined using the two other sources (i.e., ECG and airflow signals). Similarly, we use peak detection methods for the detection of these two vital signs. In total, we extracted three heart rate, three respiration rate, two SpO$_2$, two skin temperature, and two blood pressure signals.
Table 4 shows the different scenarios in which the participants were monitored. In Scenario S1, the participants were sitting without performing any physical activity. P1 and P3 were also monitored during three additional scenarios in which errors were induced in parts of the low-accuracy sensor setup (Scenarios S2, S3, and S4). Six participants were monitored two times, and the other two participants four times (Table 2), resulting in 20 measurements in total.
Footnote 7: Because Table 1 shows the body core temperature, the measured skin temperature had to be converted to an estimated core temperature. This was done as described by Richmond et al. in [41].
Footnote 8: As shown in Table 3, the MAX30100 PPG sensor was used as the accurate source for monitoring SpO$_2$ and as one of the inaccurate sources for monitoring heart rate and respiratory rate.
Footnote 9: The two blood pressure devices were manually operated and were not continuous.

Validation of the EWS systems
As mentioned in Section 5.1, we recorded and abstracted the vital signs of each scenario (listed in Table 4) with different sensors and in different ways. All these data sets were then processed with both the conventional EWS system without any self-awareness properties and our proposed SA-EWS system. The output of each of these systems is an EWS signal of the same length as the experimental data sets (one EWS value for each vital sign sample set). To have a common benchmark for comparing both systems, a ground truth for each of the 20 measurements (footnote 10) is needed. Due to the lack of a real ground truth, we took, for each experiment, the data set that matches the ground truth most closely. These Ground Truth Datasets (GTDSs) consist of the vital signs $HR_r$, $RR_r$, $SpO_{2,r}$, $ST_r$, and $BP_r$ of Table 3. To ensure that the GTDSs are as close as possible to the real ground truth, all of these signals were additionally filtered (footnote 11) to remove noise. Due to corrupted measurements of the vital signs of participant P5, no valid ground truth could be established. Therefore, this participant was excluded from our analysis. This exclusion reduces the number of measurements from 20 to 18 and, consequently, the number of experiments from 1440 to 1296.
Table 4
Scenario | Description
S1 | The person was sitting and no additional error was induced during the measurement.
S2 | The person was sitting and the temperature sensor was temporarily detached.
S3 | The person was sitting and contracted his/her biceps for a period of the measurement.
S4 | The person was sitting and the temperature sensor was temporarily detached. In addition, the person contracted his/her biceps for a period of the measurement.
The EWS Ground Truth Dataset (EGTDS) was then created by processing the GTDSs with the conventional EWS system. The conventional EWS system is used for this purpose because, in contrast to the SA-EWS system, it does not manipulate the output by leveraging self-aware properties. However, because the conventional EWS system generated the EGTDSs, it is possible that, if the vital signs of the GTDSs still contain some noise or errors, the SA-EWS system's assessment is tagged as erroneous whereas, in reality, the error is in the EGTDS.
We use various metrics to compare these two systems. The Root-Mean-Square Deviation (RMSD), which indicates how close two different signals are to each other, is given by

$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( EWS_{GT,i} - EWS_i \right)^2}$$

where $EWS_{GT,i}$ is the $i$-th EWS value of the EGTDS, $EWS_i$ is the $i$-th output EWS value of the system that is compared with the EGTDS, and $n$ is the number of samples. However, the RMSD is not the best way to compare the two systems. A signal that deviates slightly (e.g., by only one score) for a long period may have a worse RMSD than a signal that shows a much larger deviation but only for a short time. While the former signal will most likely not result in a false or missing alarm, the latter signal will cause problems. Another metric, namely the maximum absolute error ($\varepsilon_{\max}$), which gives the highest deviation of a signal from the ground truth, is more relevant in this context. It is calculated by

$$\varepsilon_{\max} = \max_{i} \left| EWS_{GT,i} - EWS_i \right|$$

where $EWS_{GT,i}$ and $EWS_i$ are defined as above.
The last metric is the number of false and missing alarms. As mentioned in Section 2, the calculated EWS indicates a low, medium, or high medical risk of a patient. If the classification of the calculated EWS deviates from the classification of the ground truth EWS, a false or missing alarm is indicated. For example, if the ground truth EWS has a value that belongs to the low or medium risk class but the EWS of the system is in a higher class, a false alarm is raised. In contrast, a calculated EWS in a lower class than the ground truth EWS leads to a missing alarm, which means that an alarm should have been raised but was missed. As a third option, both the ground truth EWS and the calculated EWS are in the same class. In this case, there is neither a false nor a missing alarm.
Table 5 shows which vital signs are corrupted and which are uncorrupted in the various experiments.
To evaluate which of these signals are correct and which contain errors, the output of the conventional EWS system processing an experiment was compared with the EGTDS of the same experiment. If a vital sign score abstracted from the vital sign (e.g., $RR_{t1}$) deviates, at any [...]
Based on the number of different vital sign signals (three heart rate, three respiration rate, two SpO$_2$, two skin temperature, and two blood pressure signals), $3 \times 3 \times 2 \times 2 \times 2 = 72$ different combinations (setups) of vital sign sets are possible. Such a setup can contain some correct and some erroneous vital signs. An important factor is how many vital signs show an error at the same time in an experiment. The second column of Table 6 shows this number, which ranges from 0 to 4 simultaneous errors. For this purpose, all 1296 experiments (18 measurements with 72 different setups) have been processed by the conventional EWS system, and the results were compared to their dedicated EGTDS. Based on the number of simultaneous vital sign errors, the EWS and the SA-EWS system are compared for each participant. In other words, all experiments performed on each person with different vital sign setups were separated into groups according to the number of vital sign errors that occurred at the same time. Each row in Table 6 shows the performance of the two compared systems in the form of the minimum, average, and maximum RMSD of all calculated EWS values that are in the same group of the number of vital sign errors. Additionally, and more importantly, the maximum absolute error, $\varepsilon_{\max}$, is shown for each group.
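The three metrics used in this section can be sketched as follows. This is a minimal illustration, not the evaluation code of the paper: the function names are assumptions, and the risk class thresholds in `risk_class` are hypothetical placeholders, since the class boundaries are defined in Section 2 and not repeated here.

```python
import math
from typing import Sequence


def rmsd(gt: Sequence[float], out: Sequence[float]) -> float:
    """Root-mean-square deviation between ground truth and system output."""
    if len(gt) != len(out) or len(gt) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    return math.sqrt(sum((g - o) ** 2 for g, o in zip(gt, out)) / len(gt))


def max_abs_error(gt: Sequence[float], out: Sequence[float]) -> float:
    """Largest absolute deviation from the ground truth."""
    if len(gt) != len(out) or len(gt) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    return max(abs(g - o) for g, o in zip(gt, out))


def risk_class(ews: int, medium_from: int = 5, high_from: int = 7) -> int:
    """0 = low, 1 = medium, 2 = high risk. Thresholds are hypothetical."""
    if ews >= high_from:
        return 2
    if ews >= medium_from:
        return 1
    return 0


def alarm_type(gt_ews: int, out_ews: int) -> str:
    """'false' if the output class is higher than the ground truth class,
    'missing' if it is lower, and 'none' if both classes agree."""
    gt_c, out_c = risk_class(gt_ews), risk_class(out_ews)
    if out_c > gt_c:
        return "false"
    if out_c < gt_c:
        return "missing"
    return "none"


ground_truth = [2] * 100
slightly_off = [3] * 100           # off by one score for the whole signal
briefly_far_off = [2] * 99 + [9]   # off by seven scores for one sample

print(rmsd(ground_truth, slightly_off), max_abs_error(ground_truth, slightly_off))
print(rmsd(ground_truth, briefly_far_off), max_abs_error(ground_truth, briefly_far_off))
print(alarm_type(2, 3), alarm_type(2, 9), alarm_type(8, 4))
```

In this constructed example, the slightly deviating signal has the larger RMSD (1.0 versus approximately 0.7) although only the briefly deviating signal changes the risk class, which is why the maximum absolute error and the alarm counts are the more relevant metrics here.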

Results
As can be seen in Table 6, in most cases our proposed system performed equally well or considerably better than a conventional EWS system without self-awareness. For a better understanding of the table, we discuss the results using participant P1 as an example. In the experiments where no vital sign showed any error, both systems produced an output in which the calculated EWS never deviated ($\varepsilon_{\max} = 0$). In these setups, the calculated EWS signal was identical to the ground truth EWS signal (RMSD = 0). In these experiments, both systems performed equally.
In contrast, the conventional EWS system performed much worse than our proposed system in setups in which three of the vital signs contained errors at the same time. One of these experiments is shown in Fig. 6. Whereas Fig. 6a shows the ground truth vital signs and the corrupted signals, Fig. 6b presents the ground truth EWS as well as the outputs of both systems. As can be seen, the difference between the EWS of the conventional system and the ground truth is large (up to 7 scores), whereas the SA-EWS shows absolute errors of only 1 or, in the worst case, 2.
The RMSD values of all considered experimental results show that the output of the SA-EWS system was significantly closer to the ground truth. However, the maximum error shows the real importance of an intelligent EWS system. Participant P5 was excluded from these experiments because of corrupted measurements, which led to an invalid ground truth.
In the four cases of P3, P7, and P8 in Table 6, the conventional EWS system performed slightly better. Some of the participants were sometimes slightly uneasy, which led to temporarily irregular breathing. As mentioned, we removed the majority of such noise from the GTDS. However, if some noise was left, the EWS system may have an advantage over the SA-EWS system because the conventional EWS system generated the EGTDSs.
Table 6 The minimum, average, and maximum RMSD as well as the maximum error of both systems compared on the basis of the various participants and the number of vital sign errors occurring at the same time. Green color highlights the system with better performance.
When comparing the RMSD and the maximum error, the SA-EWS system performed better than the conventional EWS system in eleven cases. In nine cases, the performance was equal, and in only four cases, the conventional EWS system performed slightly better. However, in the latter cases, the performance difference between the two systems was very small and did not lead to any additional false or missed alarms. In fact, the number of false or missed alarms of the SA-EWS system was always equal to or lower than that of the conventional system. Table 7 shows how often both systems failed to raise an alarm or raised a false alarm. As mentioned, the EWS itself can be classified into three different classes, namely low, medium, and high risk. If the class of the system's output EWS deviates from the class of the ground truth EWS, it causes a false or missing alarm. In all cases, our proposed system performed better than (marked in green in Table 7) or equal to the EWS system.
The false and missed alarms were counted in two ways: based on the number of samples that deviate from the ground truth, and based on the number of times (events) an alarm was incorrectly raised (false positive) or incorrectly not raised (false negative). Event-based means that when two or more samples of the same event (samples in a row) deviate from the ground truth, the false or missed alarm is counted only once. In the example of participant P8, four and eight false alarms were raised by the SA-EWS system. However, each of these false alarms had a length of only one sample. That is why the numbers in both rows (samples and events) are the same. We can argue that if a doctor monitors a patient's vital signs and obtains an unrealistic result, he/she tries to redo the measurement. A logical consequence of this could be to ignore alarms with a length of only one or a few samples (one sample corresponds to one second). However, this is out of the scope of this paper and serves only as an additional note. Therefore, we did not discount any alarms, even if they were very short.
Figure 7 shows the occurrence frequency of absolute errors of different sizes for all experiments combined. Both systems have almost the same number of absolute errors of size 0 and 1. Overall, except for errors of 1 score, the proposed system is always better (including the case in which the system made no false recognition, i.e., 0 in Fig. 7). In particular, the SA-EWS system produces larger errors less often than the EWS system. We can see that the SA-EWS system never produces absolute errors larger than 5 (whereas the conventional EWS system produces them more than a thousand times), and it produces significantly (approximately one order of magnitude) fewer errors of size 4 and 5. This is particularly important since larger errors imply a deviation from the ground truth risk class, which matters most with regard to false or missing alarms. Altogether, Fig. 7 indicates that the proposed SA-EWS system is more reliable (less error-prone) than its conventional counterpart.
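The difference between sample-based and event-based counting of false or missed alarms can be sketched as below. This is a minimal sketch under the assumption that an event is a run of consecutive deviating samples; the function name `count_alarm_errors` and the boolean input format are hypothetical, and in practice the count would be made separately for false and for missed alarms.

```python
from typing import Sequence, Tuple


def count_alarm_errors(deviating: Sequence[bool]) -> Tuple[int, int]:
    """Return (sample count, event count) for a per-sample deviation flag.

    deviating[i] is True if the risk class of sample i differs from the
    ground truth class. Consecutive True samples form one event.
    """
    samples = sum(1 for flag in deviating if flag)
    events = sum(
        1
        for i, flag in enumerate(deviating)
        if flag and (i == 0 or not deviating[i - 1])  # start of a new run
    )
    return samples, events


# One event of three samples followed by one event of a single sample.
print(count_alarm_errors([False, True, True, True, False, True, False]))  # (4, 2)
# Only isolated single-sample events: both counts coincide.
print(count_alarm_errors([True, False, True, False, True]))               # (3, 3)
```

The second call mirrors the situation described for participant P8, where every false alarm lasted only one sample and the sample-based and event-based numbers are therefore identical.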

Conclusion and future work
Self-awareness has proven to be advantageous in many applications, and here we show its benefits for wearable medical devices. In particular, we showed how using basic observation elements such as history, data reliability, and confidence can lead to reliable results without incurring the massive processing loads that conventional Artificial Intelligence (AI) algorithms impose on systems. From the application point of view, we demonstrated that, even with less reliable, low-quality sensors (which are cheaper), our system is able to calculate the EWS properly and comparably to a system with highly reliable, high-quality sensors (which are more expensive). We also showed that our proposed system is resilient against intentionally introduced measurement errors.
In summary, our contributions are: (a) formalizing data reliability, confidence, and history, (b) proposing functions for the aforementioned self-awareness properties, in particular for combining data reliability and confidence, (c) performing extended experiments with a large number of sensors and test scenarios, and (d) improving the reliability of the EWS assessment using cheaper sensors and despite adversities in real-life measurements.
We note that many of the proposed functions are designed heuristically. Therefore, other functions could be proposed and studied that lead to further improved results. We leave that for future work. Moreover, in some cases, we have tried alternative parameter settings and chose the better ones; however, these studies were not systematic or comprehensive, mainly due to the extensive time that it takes to process all combinations of sensors and errors using a single set of setup values. A systematic study of the parameter settings is, therefore, another topic for future work.
Table 7 The number of missing and false alarms of both systems compared on the basis of the various participants and the number of vital sign errors occurring at the same time. Green color highlights the system with better performance.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.