This section provides some detail on the distance measures used in this work (DTW, TWED, EDR, SAX and SFA). It explains how they work and illustrates the problem that they do not readily produce PSD kernels. Once we possess a distance measure d on time series x and y we can form a kernel in one of two obvious ways (a sketch of both constructions follows the list):
- The Negative kernel: \(K(x,y) = -d(x,y)\).
- The Gaussian kernel: \(K(x,y) = e^{- \gamma d(x,y)}\), where \(\gamma\) is a parameter to be optimized.
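For illustration, the following sketch (the function name, the use of numpy and the argument layout are our own) builds both kernel matrices from an arbitrary pairwise distance d:

```python
import numpy as np

def kernel_matrices(series, dist, gamma=1.0):
    """Build the Negative and Gaussian kernel matrices from a pairwise distance.

    `series` is a list of 1-D arrays, `dist` is any distance function
    d(x, y) >= 0, and `gamma` is the Gaussian parameter to be optimized.
    """
    n = len(series)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(series[i], series[j])
    K_negative = -D                   # K(x, y) = -d(x, y)
    K_gaussian = np.exp(-gamma * D)   # K(x, y) = exp(-gamma * d(x, y))
    return K_negative, K_gaussian
```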
Dynamic time warping
We will re-introduce the DTW algorithm presented in Shimodaira et al. (2002), as well as in Cuturi (2011) and Cuturi et al. (2007). We begin this section with the definition of an alignment between two time series, and then define the DTW kernel in terms of those alignments. Both of these definitions were made in Shimodaira et al. (2002).
To compare two time series that do not line up on the x-axis we may use the dynamic time warping distance. As can be seen in Fig. 1, when comparing two time series X and Y that are similar in shape except that the crest of one is shifted along the x-axis, the DTW distance reflects this similarity by warping the time axis so that the two time series align. In contrast, the Euclidean distance completely misses the inherent similarity between the two series as a result of the misalignment. In summary, DTW is elastic and Euclidean distance is not.
The dynamic time warping distance is not a metric since it does not satisfy the triangle inequality (see Footnote 1). If the distance underlying a kernel fails this test, the kernel will fail to be PSD. A kernel that is not PSD is known as indefinite, and in theory the SVM optimization problem induced by an indefinite kernel need not be a convex quadratic programming problem; as a result the training algorithm is not guaranteed to terminate at a global optimum. This theoretical problem turns out to be of little consequence in practice: as we find in the experiments section, there is very often only a tiny difference between the Gaussian and Negative kernels, even though the former nearly always produces PSD kernels while the latter does not. For example, with DTW there is no statistically significant difference in mean accuracy between the Gaussian and the Negative kernels.
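To make the PSD check concrete, one can inspect the eigenvalues of the Gram matrix computed on a set of training series; any eigenvalue below zero (up to numerical tolerance) indicates an indefinite kernel. A minimal sketch (the function name and tolerance are our own choices):

```python
import numpy as np

def is_psd(K, tol=1e-8):
    """Return True if the symmetric Gram matrix K is (numerically) PSD."""
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2)  # symmetrize for numerical stability
    return bool(eigenvalues.min() >= -tol)
```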
To calculate the dynamic time warping distance we must first define what is meant by a good alignment. An alignment \(\pi\) is a pair of functions \(\pi ^{1}\) and \(\pi ^{2}\) which satisfy the following properties: (\([v] = \{1,...,v\}\))
$$\begin{aligned}&\pi ^{1}: [l] \rightarrow [m]\end{aligned}$$
(1)
$$\begin{aligned}&\pi ^{2}: [l] \rightarrow [n] \end{aligned}$$
(2)
where l is known as the length of the alignment.
$$\begin{aligned} \pi ^{k}_{1} = 1,\quad \text{for} \quad k \in [2] \end{aligned}$$
(3)
and
$$\begin{aligned}&\pi ^{1}_{l} = m \end{aligned}$$
(4)
$$\begin{aligned}&\pi ^{2}_{l} = n \end{aligned}$$
(5)
$$\begin{aligned}&\pi ^{k}_{i}-\pi ^{k}_{i-1} \in \{0,1\} \quad \forall k \in [2] \ \forall i \in \{2,...,l\} \end{aligned}$$
(6)
$$\begin{aligned}&\pi ^{a}_{i}=\pi ^{a}_{i-1} \implies \pi ^{b}_{i} - \pi ^{b}_{i-1} = 1, \ \forall a,b \in [2], a \ne b, \ \forall i \in \{2,...,l\} \end{aligned}$$
(7)
We may summarize the criteria as follows: both \(\pi ^{1}\) and \(\pi ^{2}\) must be monotonic functions from [l] onto [m] and [n] respectively, such that they contain no simultaneous repetitions (Eq. 7).
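For illustration (this example is ours, not taken from the cited papers): for \(m = 2\) and \(n = 3\), the pair \(\pi ^{1} = (1, 1, 2)\), \(\pi ^{2} = (1, 2, 3)\) is a valid alignment of length \(l = 3\), since both components start at 1, end at m and n respectively, increase by at most one per step, and never repeat simultaneously. By contrast, \(\pi ^{1} = (1, 1, 2)\), \(\pi ^{2} = (1, 1, 3)\) is invalid: both components repeat at the second step, and \(\pi ^{2}\) then jumps by two.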
Once alignments have been defined, we may define the dynamic time warping distance between two time series \({\mathbf {x}}\) of length m and \({\mathbf {y}}\) of length n.
$$\begin{aligned} d({\mathbf {x}},{\mathbf {y}}) = \min \limits _{\pi \in {\mathcal {A}}({\mathbf {x}},{\mathbf {y}})}(\sum _{k=1}^{l}\Vert {\mathbf {x}}(t_{\pi ^{1}_{k}}) - {\mathbf {y}}(t_{\pi ^{2}_{k}}) \Vert ) \end{aligned}$$
(8)
where \({\mathcal {A}}({\mathbf {x}},{\mathbf {y}})\) is the set of all possible alignments and \(\Vert \cdot \Vert\) is the regular Euclidean norm.
We may calculate the dynamic time warping distance in O(mn) via the recurrence relation:
$$\begin{aligned} M_{i,j} = \min (M_{i-1,j},M_{i-1,j-1},M_{i,j-1})+ \Vert {\mathbf {x}}_{i}-{\mathbf {y}}_{j} \Vert \end{aligned}$$
(9)
The resultant \(M_{m,n}\) is the dynamic time warping distance between \({\mathbf {x}}\) and \({\mathbf {y}}\). Note that it is customary to use a warping window, which limits the maximum warping that may occur between the two time series. This is trivial to implement: if TOL is our tolerance (maximum warping) then we simply ensure that \(M_{i,j} = \infty\) whenever \(|i-j|>TOL\). By doing this we ensure there is an upper bound on the warping. A warping window of 0 is equivalent to Euclidean distance, so by considering all warping windows we are also considering Euclidean distance.
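A direct implementation of recurrence (9) with the warping window described above might look as follows (the function name, the variable names and the use of numpy are our own):

```python
import numpy as np

def dtw_distance(x, y, window=None):
    """Dynamic time warping distance between 1-D series x and y.

    `window` is the tolerance TOL on |i - j|; None means unconstrained warping,
    and window=0 forces i == j, i.e. a lock-step comparison.
    """
    m, n = len(x), len(y)
    if window is None:
        window = max(m, n)
    M = np.full((m + 1, n + 1), np.inf)
    M[0, 0] = 0.0
    for i in range(1, m + 1):
        lo, hi = max(1, i - window), min(n, i + window)
        for j in range(lo, hi + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # recurrence (9): extend the cheapest of the three neighbouring cells
            M[i, j] = cost + min(M[i - 1, j], M[i - 1, j - 1], M[i, j - 1])
    return M[m, n]
```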
DTW is an elastic distance measure in that it measures two time series as similar even when there is misalignment along the time axis.
Time warp edit distance
Time Warp Edit Distance (TWED) is a metric that operates on time series. It was first proposed in Marteau (2008) as an elastic time series measure which, unlike DTW, is a proper metric, satisfying all the axioms of a metric function. The algorithm is similar to both DTW and EDR. The similarity between two time series is measured as the minimum-cost sequence of “edit operations” needed to transform one time series into the other. These edit operations are defined using the paradigm of a graphical editing process, which leads to a dynamic programming algorithm that can compute TWED in roughly the same complexity as DTW.
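Since the recurrence is not reproduced above, the sketch below follows our reading of Marteau's formulation, assuming unit-spaced time stamps, with `nu` (stiffness) and `lam` (edit penalty) as the two TWED parameters; it should be read as an approximation of the published algorithm rather than a definitive implementation.

```python
import numpy as np

def twed_distance(x, y, nu=1.0, lam=1.0):
    """Time Warp Edit Distance between 1-D series x and y (time stamps 1, 2, ...).

    `nu` penalizes warping in time (stiffness); `lam` is the constant cost
    added to each delete operation. Assumed recurrence, not taken verbatim
    from the cited paper.
    """
    # Pad with a leading 0 so x[0] / y[0] act as the "previous" sample at i = 1.
    x = np.concatenate(([0.0], np.asarray(x, dtype=float)))
    y = np.concatenate(([0.0], np.asarray(y, dtype=float)))
    m, n = len(x) - 1, len(y) - 1
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = (D[i - 1, j - 1]
                     + abs(x[i] - y[j]) + abs(x[i - 1] - y[j - 1])
                     + 2 * nu * abs(i - j))
            delete_x = D[i - 1, j] + abs(x[i] - x[i - 1]) + nu + lam
            delete_y = D[i, j - 1] + abs(y[j] - y[j - 1]) + nu + lam
            D[i, j] = min(match, delete_x, delete_y)
    return D[m, n]
```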
Edit distance on real sequences
The Edit Distance on Real sequences (EDR) distance was first published in Chen et al. (2005). The authors describe an adaptation of edit distance, which operates on strings of characters, into a distance measure that works on time series.
The idea is to count the number of edit operations (insert, delete, replace) that are necessary to transform one series into the other. Of course our time series are numerical functions and will therefore almost never match up exactly. We therefore use a parameter epsilon: if two time series at a particular point in time are within epsilon of each other, we count that as a match; otherwise we do not.
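A minimal sketch of this recurrence (the function name, the use of numpy and the default value of epsilon are our own choices):

```python
import numpy as np

def edr_distance(x, y, epsilon=0.5):
    """Edit Distance on Real sequences between 1-D series x and y.

    Two samples match (cost 0) when they are within `epsilon` of each other;
    otherwise a substitution, insertion or deletion each costs 1.
    """
    m, n = len(x), len(y)
    D = np.zeros((m + 1, n + 1))
    D[:, 0] = np.arange(m + 1)   # deleting i samples from x
    D[0, :] = np.arange(n + 1)   # inserting j samples from y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if abs(x[i - 1] - y[j - 1]) <= epsilon else 1
            D[i, j] = min(D[i - 1, j - 1] + sub_cost,  # match / replace
                          D[i - 1, j] + 1,             # delete
                          D[i, j - 1] + 1)             # insert
    return D[m, n]
```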
Symbolic aggregate approximation
The approaches to time series analysis above are numerical. Here we introduce a symbolic approach to time series analysis: Symbolic Aggregate Approximation (SAX). One motivation for moving towards a symbolic approach is that we can then utilize the wealth of data mining techniques pertaining to string representations, one example being edit distance. Another source of motivation is that a symbolic approach may yield significant dimensionality reductions. For further explanation see Lin et al. (2007).
The first step in SAX is to discretize the input space. First we set the alphabet size (\(a>2\)) for the problem. Then, for every time series \(C = c_{1},...,c_{n}\), we must map each \(c_{i}\) to a letter of the alphabet; for example, if \(a=3\) our alphabet could be \(\{{\mathbf {a}},{\mathbf {b}},{\mathbf {c}}\}\). Our approach to discretization is to first transform the data into the Piecewise Aggregate Approximation (PAA) representation and then symbolize the PAA representation into a discrete string (Fig. 2). The two main benefits of this process are the well-documented dimensionality reduction of PAA (Keogh et al. 2001b; Yi and Faloutsos 2000) and lower bounding: our distance measure between two symbolic strings lower bounds the true distance between the original time series (Keogh et al. 2001a; Yi and Faloutsos 2000).
We can represent a time series C of length n in a w-dimensional space by a vector \({\bar{X}} = \bar{x_{1}},...,\bar{x_{w}}\). As in Lin et al. (2007), we can calculate \(\bar{x_{i}}\) by
$$\begin{aligned} \bar{x_{i}} = \frac{w}{n}\sum \limits _{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i}x_{j} \end{aligned}$$
(10)
We have reduced the time series from n (temporal) dimensions to w dimensions by dividing the data into w equally sized frames; \(\bar{x_{i}}\) is simply the mean value of the time series over the i-th frame. We may think of this process as approximating our time series by a linear combination of box functions. It is important to z-normalise each time series first. We then define breakpoints which determine the letter of our alphabet to which each \(\bar{c_{i}}\) will be mapped. Usually we do this by analyzing the statistics of our time series and choosing breakpoints so that each letter of the alphabet is equally likely to appear; in other words, we choose breakpoints that spread the data evenly across the letters. For a more thorough explanation see Lin et al. (2007). Once the breakpoints are determined it is straightforward to map our PAA representation to a string of letters from our alphabet. Whereas the alphabet size regulates the discretization of the value (y) axis, the PAA coefficient (the number of frames w) regulates the discretization of the time (x) axis.

When we have a long time series our approach is to first discretize the time series into a long string and then extract a bag of words. We choose a sliding window length, usually by parameter optimization, and this length becomes the length of each word in our bag of words. We thus turn the long string of letters representing the original time series into a series of words, each of the sliding window length: the first word starts at the first index of the long string, the second word starts at the second index, and we proceed until all the words have been extracted.
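A minimal sketch of this pipeline, assuming z-normalised data, n divisible by w, and equiprobable Gaussian breakpoints obtained via `scipy.stats.norm.ppf` (a common choice, not necessarily the exact procedure of Lin et al. 2007); the function name and parameters are our own:

```python
import numpy as np
from scipy.stats import norm

def sax_words(series, w=8, a=3, word_length=4):
    """Convert a time series into a bag of SAX words.

    `w` is the number of PAA frames, `a` the alphabet size and `word_length`
    the sliding-window length used to cut the symbolic string into words.
    """
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                    # z-normalise
    paa = x.reshape(w, -1).mean(axis=1)             # PAA: mean of each frame (n divisible by w)
    breakpoints = norm.ppf(np.arange(1, a) / a)     # equiprobable bins under N(0, 1)
    symbols = np.digitize(paa, breakpoints)         # integers 0 .. a-1
    string = ''.join(chr(ord('a') + s) for s in symbols)
    # bag of words: one word per position of the sliding window
    return [string[i:i + word_length] for i in range(len(string) - word_length + 1)]
```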
As a distance measure between the time series we could use Euclidean distance; however, to make our time series analysis more elastic we use edit distance. Time series that are similar but not aligned will then be marked as similar by our distance measure.
Symbolic Fourier approximation
Symbolic Fourier Approximation (SFA) was introduced by Schäfer and Högqvist (2012) as an alternative to SAX built upon the same idea of dimensionality reduction by symbolic representation. Unlike SAX, which works in the time domain, SFA works in the frequency domain. The algorithm is discussed and developed in Schäfer and Högqvist (2012).
SFA uses the Discrete Fourier Transform (DFT) to represent each time series as a linear combination of sines and cosines. Recall that \(\{\sin (kx),\cos (kx)\}_{k=0}^{\infty }\) forms an orthogonal basis for real (and indeed complex) valued continuous functions. For each time series we perform what is known as an orthogonal projection onto basis functions. Let V be an inner product space over the field \({\mathbb {F}}\). Let \({\mathbf {v}}\in V\setminus \{0\}\). We want to decompose an arbitrary vector \({\mathbf {y}}\in V\) into the form:
$$\begin{aligned} {\mathbf {y}}= \alpha {\mathbf {v}}+ {\mathbf {z}}\end{aligned}$$
(11)
where \({\mathbf {z}}\in \{{\mathbf {x}}| \langle {\mathbf {x}},{\mathbf {v}}\rangle = 0 \}\) and \(\alpha \in {\mathbb {F}}\). Since \({\mathbf {z}}\bot {\mathbf {v}}\) we have:
$$\begin{aligned} \langle {\mathbf {v}},{\mathbf {y}}\rangle = \langle {\mathbf {v}}, \alpha {\mathbf {v}}+ {\mathbf {z}}\rangle = \langle {\mathbf {v}}, \alpha {\mathbf {v}}\rangle + \langle {\mathbf {v}}, {\mathbf {z}}\rangle = \alpha \langle {\mathbf {v}}, {\mathbf {v}}\rangle \end{aligned}$$
(12)
which implies
$$\begin{aligned} \alpha = \frac{\langle {\mathbf {v}}, {\mathbf {y}}\rangle }{\langle {\mathbf {v}},{\mathbf {v}}\rangle } \end{aligned}$$
(13)
Whence we define the orthogonal projection of \({\mathbf {y}}\) onto \({\mathbf {v}}\):
$$\begin{aligned} Proj_{{\mathbf {v}}}({\mathbf {y}}) = \frac{\langle {\mathbf {v}}, {\mathbf {y}}\rangle }{\langle {\mathbf {v}},{\mathbf {v}}\rangle } {\mathbf {v}}\end{aligned}$$
(14)
In this case our inner product is defined as:
$$\begin{aligned} \langle f,g \rangle = \int _{-\infty }^\infty f(x)g(x)dx \end{aligned}$$
(15)
and our basis vectors are \(\mu _{k} = \{ e^{\frac{i2\pi kn}{N}}| n\in \{0,...,N-1\}\}\). If k were allowed to go to infinity it would be possible to approximate any real-valued continuous function to arbitrary precision, in much the same way as a Taylor series can be used. However, for SFA we use the discrete transformation, i.e. \(k < \infty\).
The DFT approximation is part of the preprocessing step of the SFA algorithm, where every time series is approximated by computing its DFT coefficients. Once these DFT coefficients are calculated, Multiple Coefficient Binning (MCB) is used to turn the approximated time series, represented as a series of coefficients, into a string representation. Next we use a simple sliding-window technique to extract a bag of features. Our first feature is the substring running from the first letter up to the length of the sliding window. We then slide the window along by one letter, starting from the second letter instead of the first, and extract the next feature, continuing until we have exhausted the string. Once we have mapped each time series to a list of strings whose length equals the sliding window, we use edit distance to compare the string representations (Fig. 3).
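As a rough sketch of this preprocessing, assuming truncation of the first few `numpy.fft.rfft` coefficients and equi-depth MCB bins (the function names and the equi-depth choice are our assumptions and may differ in detail from Schäfer and Högqvist 2012); the sliding-window bag-of-words step then works exactly as in the SAX sketch above:

```python
import numpy as np

def dft_approximation(series, n_coeffs=4):
    """Keep the first `n_coeffs` Fourier coefficients (real and imaginary parts)."""
    coeffs = np.fft.rfft(np.asarray(series, dtype=float))[:n_coeffs]
    return np.concatenate([coeffs.real, coeffs.imag])

def mcb_bins(training_approximations, a=4):
    """Equi-depth Multiple Coefficient Binning: breakpoints learned per coefficient."""
    A = np.asarray(training_approximations)
    quantiles = np.linspace(0, 1, a + 1)[1:-1]      # a - 1 interior breakpoints
    return [np.quantile(A[:, k], quantiles) for k in range(A.shape[1])]

def sfa_word(series, bins, n_coeffs=4):
    """Map one series to its SFA string using precomputed MCB breakpoints."""
    approx = dft_approximation(series, n_coeffs)
    symbols = [np.digitize(v, b) for v, b in zip(approx, bins)]
    return ''.join(chr(ord('a') + s) for s in symbols)
```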