A Novel Hybrid Methodology for Anomaly Detection in Time Series

Numerous research methods have been developed to detect anomalies in the areas of security and risk analysis. In healthcare, there are many use cases where anomaly detection is relevant; the early detection of sepsis is one of them. Early treatment of sepsis is cost-effective and reduces the number of hospital days that patients spend in the ICU. No single procedure is sufficient for sepsis diagnosis, so combinations of approaches are needed. Detecting anomalies in patient time series data could help accelerate certain clinical decisions; however, our algorithm must be viewed as complementary to other approaches based on laboratory values and physician judgment. The focus of this work is to develop a hybrid method for detecting anomalies that occur, for example, in multidimensional medical signals, sensor signals, or other time series in business and nature. The novelty of our approach lies in extending and combining existing techniques: statistics, Self-Organizing Maps (SOM) and Linear Discriminant Analysis (LDA), in a unique and unprecedented way, with the goal of identifying different types of anomalies in real-time measurement data and determining the point where an anomaly occurs. The proposed algorithm not only has the potential to detect anomalies, but also to find the actual points at which an anomaly starts.


Introduction
With the increased use of sensor technology, anomaly detection has become a special focus of scientific research. It is performed by defining a region of the value space that represents normal behaviour and finding observations that do not belong to that region. Anomaly detection in the medical field is an important problem that has been studied extensively, and a variety of deep learning approaches have been proposed in different medical areas [1,2]. The context-dependent behaviour of anomalies is a major challenge: a given heart rate can be normal in one context while indicating a health concern in another. On the other hand, dynamic changes in monitoring environments can yield higher false positive rates, where normal examples appear abnormal. Sepsis is one of the leading causes of death worldwide, and recognizing it as early as possible provides the best chances of recovery. Although the medical societies have proposed new criteria for sepsis recognition [3], early detection and treatment of sepsis remain challenging. By the time sepsis is detected, organ damage has often already progressed to a potentially irreversible stage. The general symptoms and signs of sepsis are non-specific, so we should look for specific signs, e.g. anomalies, that point us to the etiology of sepsis. Timely medical interventions that include antibiotics, intravenous fluids, and targeted recovery treatments can halve the risk of dying. Patients with suspected sepsis should be referred immediately to an appropriate medical facility for treatment. Early treatment of sepsis is cost-effective, reducing the hospital days of patients in intensive care units. No single technique is sufficient to diagnose sepsis, and combinations of different approaches are needed.
Different studies based on retrospective patient data [4] that include more than 40 patient features have been used for optimal treatment strategies in sepsis and for sepsis classification [5][6][7]. To the best of our knowledge, no study based on waveform vital sign parameters from patient bedside time series data has been conducted in the context of sepsis and anomaly detection. However, our algorithm must be seen as complementary to other approaches based on laboratory signs and physician estimation. We tried to detect the moments of abnormality in patient time series data that could support physician decisions and potentially lead to earlier treatment of sepsis. In our previous research, we developed a hybrid algorithm for sepsis risk classification based on statistics, Dynamic Time Warping (DTW) and DTW Barycenter Averaging (DTW-DBA) [8]. Although we obtained significant precision in risk detection within a certain interval, the disadvantage of that approach is that it cannot accurately define the beginning of the event. Additionally, we also applied our algorithm to sensor signals from integrated environmental sensors developed specifically for mobile applications and wearables. Since the Internet of Things (IoT) generates a tremendous amount of data, anomalies occur as part of such systems. They can be related to problems such as indicators of critical situations in industrial systems or abnormal behaviour in medical devices or patient data. It is therefore very important both to recognize anomalies at an early stage and to identify the time instance at which an anomaly starts.
In this paper, we present hybrid models that are able to detect anomalies and find the time instances when an anomaly starts or stops. Our proposed algorithm integrates multiple scientific fields by combining statistical methods, SOM and LDA; it can identify different types of anomalies in real-time measured data and defines the point at which the anomaly occurs. The results show that the proposed algorithm not only has the potential to detect anomalies, but also to find the real points at which an anomaly starts.

Related Works
In real-time applications, we usually have an insufficient number of abnormal observations for model training. On the other hand, it is sometimes very difficult to generate anomalies within the limitations of the system. Different statistical and Machine Learning (ML) methods are reported in the literature to address anomaly prediction, using historical and real-time data. The authors in [9] published a systematic review of different anomaly detection, analysis and prediction approaches based on ML and statistical methods. They collected scientific methods used in intelligent inhabitant environments, transportation systems, smart objects and healthcare systems, critically appraised the research studies and synthesized the findings qualitatively and quantitatively. The authors in [10] suggested detection and data recovery based on a multivariate statistical analysis approach that exploits spatial density. Research study [11] uses a Gaussian mixture approach to detect probabilistic abnormality in motion, occupancy and door sensors. It models the data points with a mixture of a finite number of Gaussian distributions with unknown parameters. Alerts are generated for specific activity events; the weakness of this method is that no warning is generated in the absence of an activity event. The researchers in studies [12,13] proposed a hierarchical Markov model and a semi-Markov model to predict sequential data and abnormal human behaviour for embedded state sensors in smart home settings. An improved Hidden Markov Model (HMM) using frequent common patterns for anomaly detection has also been suggested in medical studies [14,15]. A major obstacle to using HMMs for anomaly detection is the long training time required to construct the behavioural model.
The authors of a recently published study [16] proposed a model that uses a spurious correlation coefficient to calculate graph entropy. The proposed model reduces the distance between two climate events with similar graph entropy. The focus of this research was to detect an abnormal graph within a dynamic graph instead of trying to detect changes. The proposed algorithm ignores the spatial information of the dynamic graph and cannot recognize outliers with an abnormal neighbour structure. The study [17] proposed a new regression approach based on delegating classifiers to predict radon concentration in soil gas and its anomalies, passing samples that do not meet the desired threshold down to the next lower level. The authors in [18] employed SOM for data clustering in the first phase and then applied a shortest-path algorithm to recognize anomalous events. A method based on statistical percentiles is used in supervised learning over patterns of normal behaviour to detect abnormally long periods of inactivity in a home [19]. Bayesian statistics and forecasting are used in many studies to define behavioural patterns, estimated statistically from three probabilistic features: activation likelihood, sequence likelihood and event duration likelihood [20,21]. As the authors conceded, the amount of data used to evaluate these models was small and should be increased.
Different ML techniques are also represented in recent studies that detect and predict anomalous behaviour in data. A new approach to anomaly detection using deep learning techniques with delayed prediction is presented in a recent study [22]. The authors in [23] introduced a novel computational approach based on Recurrence Quantification Analysis. This approach is suitable for multivariate time series data; it provides multiple predicted value candidates and selects the one closest to the measured value as the prediction. The authors in [24] proposed a long short-term memory based anomaly detection method for discord search in univariate time series data. The structural features of normal training data are learned, and anomaly detection is then performed using a statistical strategy based on the prediction error for observed data. The main disadvantages of this approach are that it cannot directly address multivariate sequences and that bias exists in the selection of the public data.
Algorithms based on clustering techniques work by grouping similar objects into clusters and then assuming that anomalies either do not belong to any cluster, lie very far from the cluster centre, or belong to clusters with low gravity. Generally, k-nearest neighbours (kNN) anomaly detection schemes are classified into two categories: density-based and distance-based schemes [25]. Combinations of self-adaptive and dynamic k-means are also used on training data to learn weights prior to anomaly detection [26]. SOMs have been introduced for anomaly detection in combination with the kNN approach and Particle Swarm Optimization (PSO). The authors in [27] use the competitive learning process of the SOM and a data-clustering technique for anomaly detection. Singular Value Decomposition (SVD) is also one of the most popular methods for anomaly detection [28][29][30]. Algorithms based on a principal component classifier, i.e. Principal Component Analysis (PCA), are also used with high-dimensional data [31,32]. A hybrid model that combines Recurrent Neural Networks and Convolutional Neural Networks with SVD is presented in a recent study [33]. The power of LDA and Logistic Regression (LR) has also been used to solve the problem of detecting atypical objects [34,35]. To our knowledge, our proposed hybrid combination of statistics, SOM and LDA has not yet been studied in the context of time series data.

Step 1: Data Sets
Sensor data: We used measurements from fifteen IoT kits containing Bosch BME680 environmental sensors [36]. These sensors measure humidity, air quality and temperature. A Raspberry Pi was used as an access point and data sink for the IoT kits. Communication between the individual devices was performed via Message Queuing Telemetry Transport (MQTT). The Raspberry Pi publishes a measurement request via MQTT, to which the kits listen and then respond with the measured values. The received measurements are then written to a database. All sensors were placed on a table in the middle of a 9 m² room.
The window and the opposite door were opened for 5 min each hour to create anomalous measurements. Afterwards they were closed again for 55 min, as it was found that this time was necessary for the readings to settle back to normal levels. This information is used to validate our model. We defined different data sets using 5-minute intervals, 30-minute intervals and randomly generated time intervals for testing and validation. All measurements are stored in a PostgreSQL database. A visualisation of the sensor measurements is presented in Fig. 1.
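The known ventilation schedule (window and door open for the first 5 minutes of every hour) directly yields ground-truth labels for validation. A minimal sketch of such a labeling rule (the function name and 5-minute default are ours, matching the schedule described above):

```python
from datetime import datetime

def is_anomalous(ts: datetime, open_minutes: int = 5) -> bool:
    """Label a timestamp as anomalous if it falls within the first
    `open_minutes` of the hour, i.e. while window and door were open."""
    return ts.minute < open_minutes

# 10:03 falls inside the 5-minute ventilation window, 10:30 does not
print(is_anomalous(datetime(2021, 1, 1, 10, 3)))   # True
print(is_anomalous(datetime(2021, 1, 1, 10, 30)))  # False
```

A rule like this can be applied to every stored measurement to produce the positive (anomaly) and negative (no-anomaly) data sets used later.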
Waveform data: As described in [4], the MIMIC-III Waveform Database (MIMICWF) contains a huge number of recordings of multiple physiologic signals and time series of vital signs collected in an Intensive Care Unit (ICU). Numeric data from MIMICWF typically include heart and respiration rate, SpO2, and systolic, mean and diastolic blood pressure, with others as available [4]. Three vital sign measurements are used in this study: heart rate, respiratory rate and mean arterial pressure. As we could not obtain information about the status of the patients' diseases from the vital sign time series alone, we merged the patients' waveform data with the Medical Information Mart for Intensive Care (MIMIC-III) numeric data to obtain data already labeled by the different sepsis criteria [3]. Here we tried to recognize certain states of dysfunction and disease, but also to find the starting points of abnormal activity. The main idea behind our algorithm in the medical domain is to find the best representative classes in the positive (disease) and negative (no-disease) direction.
The first decision to make when splitting the data is the proportion of data in the test set. Empirical studies show that the best results are obtained when 20-30% of the data is used for testing and the remaining 70-80% for training [37,38]. In our splitting choice, we followed the study [39] that provided an explanation for this empirical result: out of all possible values of the training fraction p, one selects the value for which the product between p and (p − 1) is largest, which occurred for p between 0.7 and 0.8. We considered two factors: sample size and computational intensity. Since the sample size is large enough, we used 30% of the data for testing and 70% for training. In order to estimate how well our model has been trained and to estimate the model's properties, we used an external validation data set.
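The 70/30 split described above can be sketched as follows; this is a generic shuffled split for illustration (the seed and function name are ours), not the authors' exact splitting code:

```python
import random

def split_70_30(samples, train_fraction=0.7, seed=42):
    """Shuffle sample indices and split them 70/30 into training and
    test sets, following the empirical guidance cited in the text."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(samples) * train_fraction)
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test

data = list(range(100))
train, test = split_70_30(data)
print(len(train), len(test))  # 70 30
```

An external validation set, as the text notes, must be held out entirely and never passed through this split.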

Step 2: Data Transformation and SOM Classification
One of the most popular artificial neural network algorithms in the unsupervised learning category is the SOM. A SOM is a type of artificial neural network whose training is performed by unsupervised learning in order to obtain a low-dimensional discrete representation of the input patterns. This discrete representation of the data is called a map. Self-organizing maps preserve the topological properties of the inputs by using a neighbourhood function between neurons. This model was first described as a neural network by Teuvo Kohonen [40], so these networks are also called Kohonen maps. SOMs work in two phases: learning and mapping. Learning builds the map using the input patterns; mapping classifies a new input vector. Our proposed hybrid model is presented step by step in Figs. 2, 3 and 4. The first step in Fig. 2 represents data acquisition and the definition of the data sets. First, we divided the data into a positive data set (with anomaly) and a negative data set (without anomaly). Following study [39], we divided the data set into a training part and a testing part. We used time series measurements from D different sensors S_d, each of time series length n, and denote by N_f the total number of features in our data set. Each feature measurement can then be written as in Eq. (1):

a_k^{d,i},  d = 1, ..., D,  i = 1, ..., N_f,  k = 1, ..., n,   (1)

where d denotes the sensor index, i the feature and k the time index, respectively. Thus, each sensor measurement can be presented as in Eq. (2):

S_d^i = (a_1^{d,i}, a_2^{d,i}, ..., a_n^{d,i}).   (2)

We performed descriptive statistics and an exploratory data analysis to find an approach for deriving state space models from multiple time series. Before forming the input training vector for the SOM, we represented the data using the Hankel matrix [41]. Input-output data from Markov parameters are traditionally used to build the Hankel matrix, but there also exists a strategy in which the Hankel matrix itself is identified from input-output data, as explained in [42].
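The Hankel (trajectory) matrix construction and the per-column statistics applied to it can be sketched as follows; this is a minimal stdlib-only illustration (function names are ours), not the authors' implementation:

```python
from statistics import mean, pstdev

def hankel(series, window):
    """Build a Hankel (trajectory) matrix from a 1-D series:
    row j holds series[j : j + window], so anti-diagonals are constant."""
    n = len(series)
    return [series[j:j + window] for j in range(n - window + 1)]

def column_stats(matrix):
    """Per-column mean and standard deviation: the statistical
    transformation applied to the trajectory matrix before
    forming the SOM input vector."""
    cols = list(zip(*matrix))
    return [mean(c) for c in cols], [pstdev(c) for c in cols]

H = hankel([1, 2, 3, 4, 5], window=3)
# H == [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
mu, sigma = column_stats(H)
print(mu)  # [2, 3, 4]
```

Concatenating `mu` and `sigma` gives one statistically transformed input row per feature and sensor.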
For each sensor S_d and each feature i, i = 1, ..., N_f, we constructed the Hankel matrix [41] given by Eq. (3):

H^{i,d} = [ a_1     a_2       ...  a_{r+1}
            a_2     a_3       ...  a_{r+2}
            ...
            a_{n-r} a_{n-r+1} ...  a_n ],   (3)

where a_n represents the last measurement in the taken window of size n and r + 1 is the row length. After transforming the input data into the Hankel trajectory matrix, we calculated the standard deviation σ_j^{i,d} over each column j of the Hankel matrix H^{i,d}. Similarly, we calculated the mean value μ_j^{i,d} for each column of H^{i,d}. Using the statistical transformation described above, the final form of each input is given by Eq. (4):

x^{i,d} = (μ_1^{i,d}, ..., μ_{r+1}^{i,d}, σ_1^{i,d}, ..., σ_{r+1}^{i,d}).   (4)

We repeated the same procedure for each series separately. Using the statistical knowledge and the Hankel matrix, we transformed the one-dimensional time series into the multidimensional matrix of Eq. (5):

X = [ x^{1,1}; x^{2,1}; ...; x^{N_f,D} ],   (5)

where X represents the target input data for competitive learning (SOM). We then trained a SOM network S to compute the class vectors for each of the training inputs. According to [43], we used M ≈ 5√N classes, where N represents the number of observations and M the total number of neurons. To preserve data variability, we performed several trials to define the best class size and reduce the maximum number of neurons. This approach also allowed us to take into account various window sizes and different time series lengths.
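The competitive learning step can be illustrated with a deliberately minimal one-dimensional SOM; real applications would use a tuned 2-D map sized by the M ≈ 5√N rule, so treat this as a sketch (all parameter values are ours) rather than the trained network used in the paper:

```python
import math
import random

def train_som(data, n_classes, epochs=50, lr=0.5, seed=0):
    """Minimal 1-D self-organizing map: for each input, the closest
    unit wins and all units are pulled toward the input, weighted by
    a Gaussian neighbourhood that shrinks over the epochs."""
    rng = random.Random(seed)
    dim = len(data[0])
    w = [[rng.random() for _ in range(dim)] for _ in range(n_classes)]
    for e in range(epochs):
        radius = max(1.0, n_classes / 2 * (1 - e / epochs))
        alpha = lr * (1 - e / epochs)
        for x in data:
            win = min(range(n_classes),
                      key=lambda c: sum((w[c][k] - x[k]) ** 2 for k in range(dim)))
            for c in range(n_classes):
                h = math.exp(-((c - win) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    w[c][k] += alpha * h * (x[k] - w[c][k])
    return w

def classify(w, x):
    """Map an input vector to its best-matching unit (class index)."""
    return min(range(len(w)),
               key=lambda c: sum((w[c][k] - x[k]) ** 2 for k in range(len(x))))
```

For the paper's setting, each row of the matrix X would be fed in as `x`, and `classify` yields the class label used in the subsequent membership analysis.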

Step 3: Evaluation of Best Classes
After applying the data transformation and SOM unsupervised learning, the next steps, see Fig. 3, describe the approach to finding the best representative classes for the anomaly state and the best representatives for the non-anomaly state. For each measurement a_k of each input from the data set, using the trained SOM network, we set a position to "1" if the element is a member of class c and to "0" otherwise. We calculated the membership of each class as presented in the pseudo-code:
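The membership computation described above can be sketched as a one-hot indicator matrix plus per-class relative frequencies (function names are ours; the paper's pseudo-code is not reproduced verbatim):

```python
def membership_matrix(assignments, n_classes):
    """One-hot membership: row k has a 1 in the column of the SOM
    class that measurement k was mapped to, and 0 elsewhere."""
    return [[1 if c == a else 0 for c in range(n_classes)]
            for a in assignments]

def class_frequencies(assignments, n_classes):
    """Relative frequency of each class over a window, used later to
    rank the best representative classes."""
    counts = [0] * n_classes
    for a in assignments:
        counts[a] += 1
    m = len(assignments)
    return [c / m for c in counts]

freqs = class_frequencies([0, 1, 1, 2], 4)
print(freqs)  # [0.25, 0.5, 0.25, 0.0]
```

Summing a column of the membership matrix gives the cumulative membership of that class used in the next step.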

Step 4: Membership Analysis and Error Estimation
The next step is to check the cumulative membership of the best classes C+ and C−. For each training and testing element of the data sets T_i, we calculated the cumulative membership of the best classes in the positive (anomaly) and negative (no-anomaly) direction, as given by Eqs. (11) and (12). For each relative frequency distribution over each class d_c, we calculated the mean value over the rows, as given by Eq. (6). After this step, four column vectors μ_1, μ_2, μ_3 and μ_4 with the same dimension (m × 1) are obtained. Let μ̂_{1,2} denote the mean value between μ_1 and μ_2; similarly, μ̂_{3,4} is the mean value between μ_3 and μ_4, as given by Eq. (7). The final difference calculated in Eq. (8) is used to find the best representative classes for anomaly and the best representative classes for non-anomaly. In Eq. (9) we define C+ and C− as the best representative classes for anomaly and non-anomaly, respectively. Next, we calculate how many elements from Eq. (11) fall into the opposite group of Eq. (12) and vice versa, as defined in Eq. (13). For each data set T_i, i = 1, ..., 4, using Eq. (13), we calculated the cardinality of the corresponding set, as presented in Eq. (14). The final error is then calculated by Eq. (15).
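The core of this step, ranking classes by the difference between their mean relative frequency in anomalous versus normal windows, can be sketched as follows (a simplification of Eqs. (6)-(9); function and variable names are ours):

```python
def best_classes(freq_pos, freq_neg, top=5):
    """Rank SOM classes by the difference between their mean relative
    frequency in anomalous windows (freq_pos) and in normal windows
    (freq_neg). Returns (C_plus, C_minus): the classes most and least
    characteristic of the anomaly state."""
    n = len(freq_pos[0])
    mean_pos = [sum(r[c] for r in freq_pos) / len(freq_pos) for c in range(n)]
    mean_neg = [sum(r[c] for r in freq_neg) / len(freq_neg) for c in range(n)]
    diff = [mean_pos[c] - mean_neg[c] for c in range(n)]
    order = sorted(range(n), key=lambda c: diff[c], reverse=True)
    return order[:top], order[-top:]

# Toy example: class 0 dominates anomalous windows, class 2 normal ones
freq_pos = [[0.8, 0.1, 0.1], [0.6, 0.2, 0.2]]
freq_neg = [[0.1, 0.1, 0.8], [0.2, 0.2, 0.6]]
c_plus, c_minus = best_classes(freq_pos, freq_neg, top=1)
print(c_plus, c_minus)  # [0] [2]
```

The misclassification counts of Eqs. (13)-(15) then amount to counting how many windows are dominated by the "wrong" set of classes.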

Step 5: Linear Discriminant Model
After finalizing the previous steps, our next goal is to create a linear model representation using LDA. The goal is to build a model as a linear combination of the best predictors (in our case, the best classes) that best separates the two classes (in our case: anomaly and no-anomaly). In this step we investigate how to build a linear model from the previous steps using the best classes. LDA performs dimensionality reduction while preserving as much of the class discriminatory information as possible. As we only know that an anomaly exists somewhere in a data set, without knowing the precise moment, we defined two classes: 0 (no-anomaly) and 1 (anomaly). The training and testing data sets for LDA contain the elements of d_c reduced to the best classes. Using a linear discriminant function, we classify [44,45] each row of the data, according to the relative frequency distribution over the best classes, into one of the groups "+" or "−", and calculate the posterior matrix, the unconditional predictive probability density of the sample observations, as well as the constant K and the linear coefficients L_i describing the boundary between the regions separating each pair of groups. The final model is presented in Fig. 4 and given by Eq. (16):

L(d) = K + Σ_i L_i d_i,   (16)

where d_i represents the relative frequency distribution for the best classes. Using the proposed approach, we used an external validation data set to investigate how well the model can classify new cases in the domain.
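A two-class Fisher discriminant of the form K + Σ L_i d_i can be sketched by hand for two-dimensional inputs, where the pooled scatter matrix is invertible in closed form; this is a stdlib-only illustration of the boundary in Eq. (16), not the authors' LDA fitting code (which also computes posteriors):

```python
def lda_fit(X0, X1):
    """Two-class Fisher LDA: L = Sw^{-1}(m1 - m0) and constant K that
    places the decision boundary midway between the projected means.
    Restricted to 2-D inputs so Sw can be inverted by hand."""
    def mean(X):
        return [sum(x[k] for x in X) / len(X) for k in range(2)]
    def scatter(X, m):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for x in X:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s
    m0, m1 = mean(X0), mean(X1)
    s0, s1 = scatter(X0, m0), scatter(X1, m1)
    Sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    inv = [[Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det, Sw[0][0] / det]]
    dm = [m1[0] - m0[0], m1[1] - m0[1]]
    L = [inv[0][0] * dm[0] + inv[0][1] * dm[1],
         inv[1][0] * dm[0] + inv[1][1] * dm[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    K = -(L[0] * mid[0] + L[1] * mid[1])
    return K, L

def lda_predict(K, L, d):
    """Classify a frequency vector d: positive score -> class 1 (anomaly)."""
    return 1 if K + L[0] * d[0] + L[1] * d[1] > 0 else 0
```

Here `d` plays the role of the relative frequency distribution over the best classes; in practice the model has one coefficient L_i per retained class.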

Results
In the first step, we implemented our approach using measurements collected from 15 sensors during a 24 h interval. We defined data sets over different time intervals (5, 15 and 30 min); for example, we used 30-minute data intervals. The SOM weights, which present the weight distribution from the i-th input to the layer's neurons, are shown in Fig. 5a. The most negative connections are shown in black, zero connections in red, and the strongest positive connections in yellow. The neuron locations are shown in Fig. 5c. The graphical representation of the best classes and the histogram for the sensor example are given in Fig. 6. As a result, we can conclude that the five best representative classes for anomaly are 1, 16, 12, 22 and 13, and the five best representative classes for non-anomaly are classes 10, 19, 5, 15 and 20.
As we have information about the existence of anomalies for given intervals, we could validate the proposed approach not only on recognizing an anomaly, but also on finding the time instance when the anomaly starts (time: 10:00, interval 9:45-10:15). Using LDA, we built a linear model from the best classes obtained in the previous steps. As we know that an anomaly starts each hour and lasts for 5 minutes, we can validate our approach and check how well the proposed model recognizes the anomaly and, if it does, how precise it is with respect to the time point. The final output of the linear and SOM classification for the above-mentioned time interval is given in Fig. 7.
We also applied our algorithm to the patients' matched MIMICWF data set, covering 964 patients from the matched subset of MIMICWF data. Three vital sign parameters are used in this study: heart rate, mean arterial pressure and respiratory rate. As the length of the vital sign measurements differs between individual patients, we tested the model over different window sizes. We defined different window sizes (over 4, 8, 16 and 24 hours) and applied our algorithm to data labeled by the different criteria [3]: Angus, Martin, CDC, Sepsis-3 and Sequential Organ Failure Assessment (SOFA). We also defined different numbers of classes to find the best representative classes for the diseases. We tried to identify specific patterns for diseased patients and, using LDA, to identify interval points of "disease" behaviour. Furthermore, we created one data set of patients that were positive (or negative) for all criteria and applied the algorithm to that data set. The final results for the patient data set (single admission) are presented in Tables 1, 2, 3, 4, 5 and 6.
We applied our algorithm to a data set of 171 patients (Table 6), where patients are positive if they are labeled as positive by all criteria (102 patients) and, similarly, negative if labeled as negative by all criteria (69 patients). Here we also investigated an interval of 100 min. The final results show that the best sensitivity for the algorithm over 4 hours is obtained for the SOFA criterion, as presented in Table 5. The worst results are obtained for the CDC criterion in Table 3. An example of an original vital sign with the final classification results is presented in Fig. 8.
Additionally, we evaluated our anomaly detection algorithm using open-source anomaly detection benchmarks: the Skoltech Anomaly Benchmark (SKAB) [46], Yahoo Webscope [47] and the Numenta Anomaly Benchmark (NAB) [48]. The SKAB data set contains multivariate time series collected from sensors installed on a test bed. We used data from experiments in normal mode (no anomaly) and data obtained from the other experiments (with anomaly). As explained in [49], for the univariate time series included in NAB, the anomaly labeling mechanism is sometimes not reliable, because a window is marked as anomalous even when it probably contains only 1-2 anomalous points. We used the synthetic anomaly and no-anomaly NAB data sets. In Fig. 9, we present one result from the anomaly-labeled NAB data. As the last anomaly benchmark, we used the publicly available Yahoo Webscope data set, which contains synthetic as well as real data. The anomalies in the Yahoo A1 benchmark are marked by humans and may therefore not be consistent [47]; all other Yahoo data sets are synthetic. The highest F1 score is obtained for the synthetic Yahoo A3 data set and the lowest using the Yahoo A1 data set. The results are presented in Table 7. We also investigated other machine learning approaches and compared them with our hybrid approach (Table 8). The results demonstrate that our algorithm shows comparable performance on the Yahoo A1, Yahoo A3 and SKAB data sets, with a slight advantage for support vector machines on the A2 data set.
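The F1 score used to compare against these benchmarks combines point-wise precision and recall; a minimal sketch (our own implementation, with anomaly labeled as 1):

```python
def f1_score(y_true, y_pred):
    """Point-wise F1: harmonic mean of precision and recall,
    with the anomaly class labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))  # 0.5
```

Note that, as [49] observes for NAB, point-wise scoring can penalize a detector on windows where only 1-2 points are truly anomalous; windowed scoring variants exist for that case.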

Conclusion
In our approach, we demonstrated how a hybrid model using statistical methods, neural networks and discriminant analysis can be applied to solve complex anomaly problems not only in the area of smart sensors, but also in complex medical problems using medical big data. Our proposed model has the potential to detect anomalies, the positions of the objects and the real moment when an anomaly starts. Using the validation data set (time series data that were not presented to our model in the training or testing phase), the algorithm correctly identifies anomalies and sensor types (numbers). The algorithm detects the real time point at which an anomaly starts with 93 percent accuracy. In the case of only one sensor, the algorithm detects an existing anomaly 10 seconds later than it really starts. We also used vital sign time series data and applied our model to detect anomalies in patient behaviour and to recognize diseases.
The algorithm also shows the potential to recognize "disease" behaviour in time series data using only 3 vital sign parameters. The best classification results were obtained for the patient data set of patients that were positive/negative by all criteria, with a precision of 0.95 and a recall above 0.92. Modeling and validation of the proposed approach were performed in Python, MATLAB and a PostgreSQL environment. The next phase of our research will focus on fusing the proposed model with genetic algorithms and Stochastic Petri Nets. Our goal is to optimize the parameters of the proposed model with genetic algorithms and apply it to the results derived from Stochastic Petri Nets to improve the classification strategy.