1 Introduction

Time series patterns refer to certain regularities appearing in temporal data that may be of interest to users. The detection of patterns is crucial in time series analysis tasks such as anomaly detection, sensor stream classification and data segmentation. There are two types of time series patterns: events and states.

An event pattern indicates the occurrence of an event. For example, one heartbeat showing on ECG readings can be viewed as an event. A spike of ultrahigh voltage appearing on a power network can also be viewed as an event. One malicious request to a web server is another example of an event. Events are often ephemeral as they do not last long. The task of finding this type of pattern, in other words recognising the occurrence of an event, is known as event detection. Event detection can be viewed as classification or anomaly detection depending on the task domain.

A pattern of a time series state, however, shows a certain regularity consistently appearing over a period of time. For example, a power network maintaining the voltage at a certain value can be viewed as the occurrence of a state. A patient’s blood pressure being higher than normal can also be viewed as a state pattern. Recognising a state is similar to, yet different from, detecting an event. The need for identifying states from time series is increasing significantly due to the proliferation of sensor technologies, especially mobile sensors. In this study, we aim to establish a competitive method based on genetic programming (GP) for detecting state patterns.

It should be noted that recognising patterns from time series data is one part of time series analysis. Another, more active, area of time series analysis is prediction, e.g. forecasting future values based on current values. Prediction is not the aim of this study, although GP has achieved certain success in that area. Our study is in the area of time series classification, as identifying the occurrence of a state can be viewed as classifying whether a segment of time series is in that state (positive) or not.

In this study, we will present a novel method which addresses the shortcomings of current time series classification approaches that require manually designed features. To determine which features are most suitable, domain knowledge of the task at hand is required. In addition, the size of the target pattern should be known in advance (Way et al. 2012; Safavian and Landgrebe 1991; Garrett et al. 2003; Nanopoulos et al. 2001); otherwise feature extraction would be difficult, as the sampling size is undetermined. However, in the scenario of state detection, the size of a state pattern, or state length, may not be available. For a state, as long as a data sequence along the time axis is longer than the minimum length of the state pattern, a good detection algorithm should be able to handle it. Note that the terms “State Length” and “Pattern Size” are used interchangeably throughout this paper.

Another challenge often seen in real-world time series analysis tasks is how to handle multiple channels of streaming input. Some channels are related while others are independent of each other. A state pattern in such data may appear across several channels. Most general time series analysis methods are designed for single-channel streams; they treat a multi-channel stream as multiple single-channel streams, so the interdependence between channels is difficult to capture.

Given the above challenges, a method that can automatically learn patterns of states from multi-channel time series data would be highly desirable. Ideally, such a method should not require the state length to be known beforehand, nor should it rely on manually constructed features. Fortunately, in a series of studies such as Xie et al. (2012), genetic programming has shown its effectiveness in event detection on multi-channel time series data. The GP-based method requires neither the event size nor predefined features. In this study, we extend GP to state detection, which can be considered a more challenging task than event detection. In particular, four research questions are investigated in this study:

  1. Is the GP-based event detection method applicable to state detection problems?

  2. How does GP perform on a range of multivariate synthetic state detection tasks?

  3. How does GP perform when applied to real scenarios, e.g. activity recognition tasks?

  4. Is the pattern learnt by GP reasonable or arbitrary?

The rest of this paper is organised as follows. Section 2 presents a brief background of this work, including related work. Section 3 describes the methodology of our state detection approach in detail, focusing on three important functions. In Sect. 4, a group of state detection problems with increasing levels of difficulty is described. Section 5 presents the experiments and the corresponding results, accompanied by discussions. Section 6 concludes this paper.

2 Background

Firstly, the concept of a “state” in time series is described in this section. It is not a widely recognised concept, as the time series patterns most often seen in the literature are “events”. The differences between a state and an event can be shown from the following three aspects:

  1. Causes A state is usually evidence of a stable condition, for example, a person standing or sitting. In comparison, an event usually indicates the occurrence of a certain change, transitioning from one status to another, for example, a person sitting down (changing from standing to sitting).

  2. Data characteristics A state may show a certain degree of repetition over a time period. For example, a state of walking would show repetitive limb movements from the person. On the contrary, an event is one-off and non-repetitive along the time axis. Figure 1 shows the body sensor readings from a person changing from standing to sitting. The figure contains three regions separated by two dotted grey lines, corresponding to three stages: standing straight, changing from standing to sitting, and sitting peacefully. It contains two states and one event, and illustrates the difference between the two types of time series patterns.

  3. Detection mechanisms An event can be determined at the completion of the event, especially when the event duration is known. However, this approach is not applicable to state detection, as a state may last indefinitely. As shown in Fig. 1, the transition event between standing and sitting does not last long, whereas standing and sitting, which are states, have no maximum length.

Fig. 1

Example of event and state: sensor reading—changing from standing to sitting

Fig. 2

Example of a complex state: sensor reading of a state of walking (four steps)

The boundary between a state and an event can be subtle. For example, Fig. 2 shows sensor readings from four steps of walking. Each step can be viewed as one event, the building block of walking, as walking is nothing but a repetition of “steps”. Hence, a state pattern can consist of multiple events or event patterns. A good state detection algorithm should be able to handle such complex patterns. To the best of our knowledge, no existing general method can deal with this type of composite state pattern.

Since both event detection and state detection can be viewed as classification tasks, we briefly describe typical existing methods in the literature. Two main categories of methods can be found for learning or classifying time series patterns:

  1. Similarity-based methods.

  2. Feature-based methods.

In the first category, the class of a time series segment is determined by the similarity between the segment and segments from all classes. Nearest neighbour, a typical distance-based method, is probably the most popular way to measure similarity and to determine the class of a piece of new data (Way et al. 2012). Another popular choice is measuring the similarity of decision trees generated from different segments (Safavian and Landgrebe 1991). Obviously, the key factor affecting the performance of these methods is the similarity measure (Xing et al. 2010a). Common choices include Euclidean distance (Ralanamahatana et al. 2005; Chan and Fu 1999) and dynamic time warping (DTW) (Berndt and Clifford 1996). Note that these similarity-based methods assume that the patterns of a same class are identical or nearly identical. However, this assumption is not always true, especially in many real-world applications. This type of method does not involve features and can be viewed as a part of sequence classification (Xing et al. 2010b).

Methods in the second category perform classification based on time series features. Hence, knowledge of the nature of the pattern is required to design the most suitable features to distinguish different classes (Garrett et al. 2003; Nanopoulos et al. 2001). However, features are highly problem specific; a good feature set suitable for one problem often cannot be transferred to another domain (Subasi 2007; Englehart et al. 1999; Brooks et al. 2003). There is no general time series feature set that can capture the characteristics of all types of time series patterns.

In addition, methods from the above two categories need to know the pattern size beforehand; otherwise the sampling window size cannot be determined, so data segments cannot be generated for classification purposes. In comparison, our GP-based approach does not require such information. Moreover, the majority of the aforementioned methods are designed for single-channel time series. Patterns spanning several parallel time series are very difficult for these methods to capture, because redundant or irrelevant channels have to be removed from the decision process, while the dependencies between relevant channels, which can be highly complex, have to be captured. Our proposed method aims to handle multiple channels.

Table 1 Function set
Table 2 Terminal set

3 Methodology for learning patterns of states

The method is built upon tree-based genetic programming. The fitness function is simply the classification accuracy during training. It can be defined as:

$$\begin{aligned} \mathrm{Fitness} = \mathrm{Accuracy} = \frac{\mathrm{True}~\mathrm{positives} + \mathrm{True}~\mathrm{negatives}}{\mathrm{All}~\mathrm{training}~\mathrm{instances}} \end{aligned}$$
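As a minimal sketch of this fitness computation (function and variable names are ours, not the authors'):

```python
def fitness(true_positives, true_negatives, n_training_instances):
    """GP fitness: classification accuracy on the training set."""
    return (true_positives + true_negatives) / n_training_instances

# An individual that classifies 40 positives and 45 negatives correctly
# out of 100 training instances:
print(fitness(40, 45, 100))  # 0.85
```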

Table 1 lists the function set. The parameters and return types of each function are listed in the table as well. There are four basic arithmetic functions plus three extra functions specifically designed for handling multi-channel time series data. The Window and Multi-Channel functions take special terminals, such as Operation and Temporal Index, which are listed in Table 2.

The three functions “Window”, “Temporal_Diff” and “Multi-Channel” are explained in the following subsections.

3.1 Window function

The window function defines a window over the incoming sequential input, selects data points inside the window, and applies an operation to the selected data points. It takes three parameters: i, temporal index and operation.

The first parameter (i.e. i) is the input of this function, which samples a data point at every time step. It keeps the subsequence of historical values of that input in memory. The length of this subsequence is called the “Window” function size (denoted as S), which is manually adjustable. The reserved data points are marked from the earliest to the most recent as \(t_0, t_1 ,\ldots , t_{S-1}\). The value of S is set to 8 in this study. Greater values are not used, so that the evolved programs remain simple enough for analysis; this value does not degrade performance.

The second parameter comes from the terminal “temporal index”, which returns a random integer within the range \([1, 2^S-1]\). First, the integer is converted into its binary form. If the binary string is shorter than S and thus not sufficient to mark all elements in the subsequence, it is left-padded with 0s. For example, assuming S is 8 and the parameter value is 5, the binary string is 00000101, in which the first five 0s come from padding. This binary string is then mapped onto the subsequence of time series data under the window: a bit of “1” indicates that the data point with the same index is selected, while a bit of “0” means it is discarded.

The third parameter (i.e. operation) is a randomly generated integer in the range [1, 4]. Each value corresponds to one of four operations: AVG, STD, DIF and SKEWNESS, which calculate the average, the standard deviation, the sum of absolute differences and the skewness of the selected points under the window, respectively. The return value is the final output of the window function.
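The selection-and-operation logic described above can be sketched in Python as follows. This is a minimal illustration under our own naming, not the authors' implementation:

```python
import numpy as np

S = 8  # window function size used in this study

def window(history, temporal_index, operation):
    """Sketch of the Window function. `history` holds the last S values
    of the input, oldest first, i.e. t_0 .. t_{S-1}."""
    # Convert the temporal index (1 .. 2^S - 1) into an S-bit string,
    # left-padded with zeros; a '1' bit selects the point at that position.
    mask = format(temporal_index, f"0{S}b")
    selected = np.array([v for v, bit in zip(history, mask) if bit == "1"],
                        dtype=float)
    if operation == 1:                                   # AVG
        return selected.mean()
    if operation == 2:                                   # STD
        return selected.std()
    if operation == 3:                                   # DIF: sum of abs diffs
        return np.abs(np.diff(selected)).sum()
    m, s = selected.mean(), selected.std()               # SKEWNESS
    return 0.0 if s == 0 else (((selected - m) / s) ** 3).mean()

# Temporal index 5 -> 00000101, so only t_5 and t_7 are selected:
hist = [10, 20, 30, 40, 50, 60, 70, 80]
print(window(hist, 5, 1))  # AVG of 60 and 80 -> 70.0
```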

3.2 Temporal-difference function

The temporal-difference function (denoted as Temporal_Diff) is introduced to capture the temporal change between adjacent points, which is important for identifying the occurrence of events.

It takes only one double-valued parameter i, which defines the input. It stores the value \(t_{i-1}\), which is one time stamp earlier, and returns the difference between the current value \(t_i\) and \(t_{i-1}\). It can consequently be considered to have an effective window size of 2. In effect, it calculates the first derivative of the time series, as temporal changes can be more revealing. Higher-order derivatives can be obtained as well if this function is applied iteratively.
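A minimal sketch of this behaviour follows (our naming; the handling of the very first time step, where no previous value exists yet, is our assumption):

```python
class TemporalDiff:
    """Sketch of Temporal_Diff: returns t_i - t_{i-1} on a streaming input,
    i.e. the first difference (discrete first derivative)."""
    def __init__(self):
        self.prev = None

    def __call__(self, value):
        # Before any history exists we return 0.0 (an assumption).
        diff = 0.0 if self.prev is None else value - self.prev
        self.prev = value
        return diff

td = TemporalDiff()
print([td(v) for v in [1.0, 3.0, 6.0, 6.0]])  # [0.0, 2.0, 3.0, 0.0]

# Nesting two Temporal_Diff instances yields the second difference:
a, b = TemporalDiff(), TemporalDiff()
print([b(a(v)) for v in [1.0, 3.0, 6.0, 6.0]])  # [0.0, 2.0, 1.0, -3.0]
```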

3.3 Multi-channel function

The two functions mentioned above only handle temporal dependence, that is, they work on a sequence along the time axis by themselves. They can hardly capture any relationship across channels at a particular time point. Consequently, a state occurring in multiple channels would not be captured by those functions. To address this problem, the multi-channel function is introduced. It selects an arbitrary collection of channels and computes characteristics of these channels. It takes two integers as its parameters: channel index and channel operation. No input parameter needs to be specified, as the whole set of channels is treated as input. The parameter channel index works similarly to the temporal index in the window function, except that its range is from 1 to \(2^M-1\), where M is the total number of channels. So, assuming there are 6 channels in total, the binary form of 13, 001101, tells the function to operate on the 3rd, the 4th and the 6th channels. The channel operation parameter also returns an integer from 1 to 4, corresponding to the following operations: the median value of the selected channels (MED), their average (AVG), their standard deviation (STD), and the distance between the maximum and minimum values (RANGE).
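The channel selection mirrors the temporal-index mechanism. A minimal sketch, reproducing the 6-channel example from the text (names are ours):

```python
import numpy as np

def multi_channel(channel_values, channel_index, channel_op):
    """Sketch of the Multi-Channel function over M parallel channels.
    `channel_values` holds the current value of each channel."""
    M = len(channel_values)
    mask = format(channel_index, f"0{M}b")   # left-pad to M bits
    sel = np.array([v for v, bit in zip(channel_values, mask) if bit == "1"],
                   dtype=float)
    if channel_op == 1:                      # MED: median of selected channels
        return float(np.median(sel))
    if channel_op == 2:                      # AVG
        return float(sel.mean())
    if channel_op == 3:                      # STD
        return float(sel.std())
    return float(sel.max() - sel.min())      # RANGE

# Channel index 13 over six channels -> mask 001101,
# i.e. the 3rd, 4th and 6th channels are selected:
vals = [9.0, 9.0, 1.0, 2.0, 9.0, 6.0]
print(multi_channel(vals, 13, 2))  # AVG of 1, 2 and 6 -> 3.0
```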

The window function can be integrated with the multi-channel function by taking the output of the latter as its input data. Such a combination enables GP to find temporal relationships and cross-channel dependencies simultaneously.

4 State patterns

In this section, we introduce six sets of synthetic data which contain different state patterns of increasing complexity. They are used to evaluate the capability of our proposed GP-based state detection method. These data sets vary in state length and in the number of channels. They are divided into two groups: the first group consists of four single-channel data sets, while the second group contains two multi-channel data sets.

4.1 Synthetic single-channel state patterns

The state patterns in these single-channel time series data are explained below.

  1. Box functions The task is to identify a state when the signal volume is around a certain level, 100 in this case. An example is shown in Fig. 3. The starting point and the end point of a state are marked with red dots. This simulates scenarios like the voltage of a power network or the temperature of a region being maintained at a certain level with minor fluctuations.

  2. Oscillations In some real-world applications, constant oscillation may be viewed as a certain state, such as the vibration of a spring, which may indicate the spring is functioning normally. In this type of time series, a state is defined as consecutive peaks of which the top value is above 180 and the bottom value is below 10 (Fig. 4). Note that the state should last for at least p samples (\(p=4\)), with each sample taking 12 time points (see the data points between the two red dots in Fig. 4).

  3. Sine waves vs. random numbers A state can be more complex than hovering around a constant value or oscillating; it can be a periodic signal in the form of sine waves. In these data, the state pattern is a complete sine wave. The task is to distinguish sine waves from random signals with the same magnitude range. The regular signal is produced by \(y=|100\times \sin (x)|\), sampled at every \(\frac{\pi }{30}\). The minimum state size is then 8. An example of such a state pattern is shown in Fig. 5.

  4. Different sine waves The target state patterns in this case are two different sine waves. The positive class is the same wave as in the previous data set. The negative class, however, consists of other types of sine waves, generated by a similar periodic function \(y=|100\times \sin (x\times f)|\), where \(f = 2,3,4,5,6\) (shown in Fig. 6). The aim is to investigate whether our method can discriminate between similar periodic patterns. The minimum state size is also 8 in this scenario.
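As an illustration of how such data can be produced, the sine-wave task (pattern 3) might be generated as follows. Function names and the segment-based layout are our assumptions, not the authors' exact generator:

```python
import math
import random

def sine_segment(length, step=math.pi / 30):
    """Positive data for Task 3: y = |100 * sin(x)| sampled every pi/30."""
    return [abs(100 * math.sin(i * step)) for i in range(length)]

def noise_segment(length, low=0.0, high=100.0):
    """Negative data: random values covering the same magnitude range."""
    return [random.uniform(low, high) for _ in range(length)]

# One arch of the rectified sine spans 30 samples (period pi over step pi/30):
seg = sine_segment(31)
print(round(seg[15], 6))  # crest of the arch -> 100.0
```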

Fig. 3

An example of box function (non-periodical) (color figure online)

Fig. 4

An example of oscillation

Fig. 5

An example of sine wave vs. random numbers

Fig. 6

An example of mix of different sine waves

4.2 Synthetic multi-channel state patterns

  5. Two-channel sine waves This scenario involves two channels. The sine waves here are also \(y=100\times \sin (x)\), but sampled at every \(\frac{\pi }{7}\). The state pattern, or positive class, in this case requires both channels to show sine waves; if one channel contains random numbers, the segment is considered negative. For example, in Fig. 7, only the middle section between the red dots is considered to contain the target state pattern, double sine waves.

  6. Box functions in four out of five channels There are five channels in this task. The target pattern is more complicated: a state is positive when at least four of the five channels receive signal values above 90 for at least 8 consecutive points. There is no restriction on which channel may lack a high reading during a positive state.

Fig. 7

An example of two-channel sine waves (color figure online)

4.3 Real-world state patterns for activity recognition

To further evaluate the performance of our method, a benchmark data set for mobile-based activity recognition is used (Xie et al. 2014). The data set includes 21 channels of sensor data collected from five subjects. There are four different states to be learnt and detected: sitting, walking, running and lying flat. The walking state includes different gaits, including going upstairs and going downstairs.

Note that no prior knowledge about these state patterns is available: we do not know the minimum state length for sitting, walking, running or lying, and the suitable window size for each pattern is also unknown. Our GP-based method is expected to handle these raw data and learn the characteristics of the state patterns.

5 Experiments and results

The data sets described in Sect. 4 were used to evaluate the method proposed in Sect. 3. For comparison purposes, five non-GP classifiers were also applied to the same data: OneR (Holte 1993), J48 (Quinlan 1993), Naive Bayes (John and Langley 1995), IB1 (Aha and Kibler 1991), and AdaBoost (Freund and Schapire 1996), which combines multiple classifiers as an ensemble to enhance learning performance. For each task, the best of the four conventional classifiers was selected as the base classifier in AdaBoost.

Note that there are other methods capable of time series classification, such as neural networks (Wang et al. 2016), genetic algorithms (Eads et al. 2002), SVMs (Ando and Suzuki 2014) and so on. Comparing against all of them is not the purpose of this study. Our aim is to show that the proposed method is a good candidate for detecting states from time series; we do not claim it to be the best time series analysis method. Therefore, the comparison does not include all major existing methods.

The experimental settings for our method and conventional methods are shown in Sects. 5.1 and 5.2. The experimental results are presented in Sect. 5.3.

5.1 Experimental settings for our GP-based method

The runtime parameters for the synthetic data sets are shown in Table 3. For the activity recognition task, the population size is increased to 1000, and the window function size is tuned to a larger value, 12, due to the complexity of the problem. All experiments are repeated for ten runs and the best run is taken as the final result.

Table 3 Runtime parameters for synthetic data sets
Fig. 8

An example showing converting a two-channel time series data stream to vectors for conventional classifiers (pattern size: 3)

Fig. 9

Illustration of extraction feature type B (pattern size: 3)

Fig. 10

Illustration of extraction feature type C (pattern size: 3)

5.2 Experimental settings for conventional classifiers

Our GP-based method handles raw multi-channel time series data directly. For the conventional classifiers, however, we prepared both raw data and feature data. For each task, data segments were extracted by a sliding window whose size equals the state length. This ensures that each data segment contains a sufficient amount of information but no redundant information.
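The segmentation step can be sketched as follows (a minimal illustration with our own naming, assuming a channels-by-time NumPy array):

```python
import numpy as np

def sliding_segments(series, window, step=1):
    """Cut a (channels x T) series into segments whose length equals the
    assumed state length (the sliding-window size)."""
    n_channels, T = series.shape
    return [series[:, i:i + window] for i in range(0, T - window + 1, step)]

# One channel, ten time points, pattern size 3:
stream = np.arange(10, dtype=float).reshape(1, 10)
segments = sliding_segments(stream, window=3)
print(len(segments))          # 8 segments
print(segments[0].tolist())   # [[0.0, 1.0, 2.0]]
```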

Table 4 Training and test data of the six synthetic state detection tasks
Table 5 Comparison on raw input vectors (synthetic state patterns %)
Table 6 Comparison on feature sets (synthetic state patterns %)

Raw data from these segments can be converted into an input vector which can be fed to classifiers directly. In addition, a set of features can be extracted. The process is demonstrated using the following example, assuming the time series contains two channels and the pattern size is 3:

  (A) Raw input vector A sliding window moves through the time series to extract raw input vectors. For multi-channel time series, all channels are flattened into one row, similar to representing a matrix as a one-dimensional array; see Fig. 8.

  (B) Feature set—wave length This feature is specifically designed for sine functions and is calculated by \(\sum _{i=1}^{3} |t_i-t_{i-1}|\). It is not affected by the phase of the sine wave and generates identical values for sine waves at any time point, making it a good feature for defining a state of sine waves; see Fig. 9.

  (C) Feature set—temporal average and variance This feature set captures temporal regularities. It provides the average and standard deviation over the length of a pattern, so its size is double the number of channels; see Fig. 10.

  (D) Feature set—channel average This feature set captures cross-channel regularities. It enumerates the average value over all channels at each time point; the number of features equals the pattern size, see Fig. 11.
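Under the same assumptions (a channels-by-time NumPy array, our own naming), the four representations can be sketched as:

```python
import numpy as np

def raw_vector(segment):
    """(A) Flatten a (channels x pattern_size) segment row by row."""
    return segment.flatten()

def wave_length(segment):
    """(B) Per-channel sum of absolute first differences."""
    return np.abs(np.diff(segment, axis=1)).sum(axis=1)

def temporal_avg_std(segment):
    """(C) Per-channel mean and standard deviation (2 x channels features)."""
    return np.concatenate([segment.mean(axis=1), segment.std(axis=1)])

def channel_average(segment):
    """(D) Cross-channel average at each time point (pattern_size features)."""
    return segment.mean(axis=0)

# Two channels, pattern size 3:
seg = np.array([[1.0, 2.0, 3.0],
                [4.0, 6.0, 8.0]])
print(raw_vector(seg).tolist())       # [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
print(wave_length(seg).tolist())      # [2.0, 4.0]
print(channel_average(seg).tolist())  # [2.5, 4.0, 5.5]
```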

Fig. 11

Illustration of extraction feature type D (pattern size: 3)

Table 4 lists a data summary for all synthetic state detection tasks, including the number of training and test cases, the pattern size, the number of attributes in the raw data, the type of feature used, and the number of attributes in the feature data.

5.3 Experimental results on synthetic state patterns

Table 5 shows the results of the six methods on the six synthetic state detection tasks using raw input data. All results are on test data only. The results of our GP method are highlighted in bold where they are better than those of the other methods. It can be seen clearly that the overall performance of GP is much better than that of the other classifiers; in particular, on Tasks 3 and 5, GP significantly outperformed the other methods.

Table 6 presents the results of the conventional classifiers on the pre-defined feature sets B, C and D. The results from GP runs on raw data are also listed for comparison. GP results are highlighted in bold where they are better than the other results in the same row. These manually designed features clearly help the conventional classifiers achieve better results. However, the performance of GP is still consistently better, although it did not claim the top position on Task 3.

The results in Tables 5 and 6 demonstrate that GP individuals are capable of extracting features that identify a state pattern from the rest of the time series data, even when the state pattern exists over several channels. Given the provided functions and terminals, GP is able to operate on raw numeric values and form a discriminative function. This can actually be considered an implicit feature construction process.

Table 7 Leave-one-person-out: Accuracies, true-positive and true-negative rates (trained and tested on data from the right front pant pocket) (%)

5.4 Results on real-world state patterns

A leave-one-person-out validation is conducted for the real-world state detection scenarios; that is, for each detection task, the classification is conducted five times, each time using the records from one subject for testing and the rest for training.
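This protocol can be sketched as below; `train_and_test` stands in for any trainer/scorer (GP in the paper), and the record layout is our assumption:

```python
def leave_one_person_out(records_by_subject, train_and_test):
    """For each subject, train on the other subjects' records and test on
    the held-out subject's records."""
    results = {}
    for subject in records_by_subject:
        train = [r for s, recs in records_by_subject.items()
                 if s != subject for r in recs]
        test = records_by_subject[subject]
        results[subject] = train_and_test(train, test)
    return results

# Hypothetical scorer that just reports the split sizes:
data = {f"subject{i}": [f"s{i}_rec{j}" for j in range(3)] for i in range(1, 6)}
sizes = leave_one_person_out(data, lambda tr, te: (len(tr), len(te)))
print(sizes["subject1"])  # (12, 3)
```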

Table 7 shows the test results of all four state detection tasks on the five subjects. Our method achieved consistently good accuracy over the different state detection scenarios. These results show that GP can detect states not only in synthetic time series but also in a complex, real-world scenario.

Fig. 12

Example of an individual for classifying patterns of state “Running”

Fig. 13

An individual for classifying patterns of state “Running”

As mentioned earlier, no domain knowledge is available for activity recognition, e.g. which channels are involved in running; such knowledge is not available even to human experts. Although it is difficult to verify whether the evolved individuals have captured a genuine state pattern, we can still analyse some of the best individuals.

The program shown in Fig. 12 is an individual with over 95 % test accuracy. Its tree structure is visualised in Fig. 13. We can see that it is small in size and simple in structure, comprising five levels of windows. Due to the nested windows, this individual can actually capture many more data points than the window size at any point. It uses only three channels, 1, 12 and 15, which are raw acceleration Y, raw rotation rate X and unbiased rotation rate X. That means only up-down acceleration and rolling backwards and forwards are relevant to determining the running state. This pattern seems understandable, as the up-and-down acceleration should be the most significant movement of the body. The GP-evolved pattern detector is quite selective about which channels are used, considering there are 21 available channels in the raw input.

Another point worth mentioning is that the analysis shows the evolved GP programs do not use all data points of a time series. That might be the reason why the performance of a program is not severely affected when there are missing data points or other forms of imperfection in the input time series. In addition, operators like AVG and DIF can compensate for minor variations in the signal input. This tolerance is important for real-world applications.

As understandability is a well-known issue in genetic programming, evolved trees are difficult for humans to fully understand. However, the analysis of these GP classifiers still reveals that their good performance is not arbitrary: they are capable of capturing certain characteristics of time series patterns, whether patterns of events or of states.

6 Conclusions

Patterns of states are an important type of time series pattern. In this study, we presented a GP-based method for learning state patterns from multi-channel time series input. This method requires little domain knowledge about the target states and does not need manually defined features. Our experiments show that this method can achieve significantly better results on raw inputs of synthetic single-channel state patterns, synthetic multi-channel state patterns and real-world multi-channel state patterns, compared with conventional classification methods which took both raw data and pre-defined feature data as input. It should be noted that the comparison with conventional methods is intended to illustrate the advantages of this method, not to claim that it will always outperform traditional methods. Tuning manually designed features may improve classification accuracy on a specific problem; however, our method does not require this process.

The analysis of an evolved GP individual shows that our method is capable of selecting a small number of relevant channels and of forming accurate pattern detectors. A fixed window size does not restrict individuals from constructing more flexible input windows.

In conclusion, genetic programming can be adapted to formulate a highly competitive method for complex state detection on multi-channel stream input. In the near future, the proposed method will be evaluated on more scenarios, especially those with noisy data.