## 1 Introduction

Symbolic representations of time series are an active area of research, being useful for many data mining tasks including dimension reduction, motif and rule discovery, prediction, and clustering of time series. Symbolic time series representations allow for the use of algorithms from text processing and bioinformatics, which often take advantage of the discrete nature of the data. Our focus in this work is to develop a symbolic representation which is dimension reducing whilst preserving the essential shape of the time series. Our definition of shape is different from the one commonly implied in the context of time series: we focus on representing the peaks and troughs of the time series in their correct order of appearance, but we are happy to slightly stretch the time series in both the time and value directions. In other words, our focus is not necessarily on approximating the time series values at the correct time points, but on representing the local up-and-down behavior of the time series and identifying repeated motifs. This is obviously not appropriate in all applications, but we believe it is close to how humans summarize the overall behavior of a time series, and as such our representation might be useful for trend prediction, anomaly detection, and motif discovery.

To illustrate, let us consider the time series shown in Fig. 1. This series is sampled at equidistant time points with values $$t_0,t_1,\ldots , t_N\in \mathbb {R}$$, where $$N=230$$. There are various ways of describing this time series, for example:

(a) It is exactly representable as a high-dimensional vector $$T = [ t_0, t_1, \ldots , t_N] \in \mathbb {R}^{N+1}$$.

(b) It starts at a value of about $$-3$$, then climbs up to a value of about 0 within 25 time steps, then it stays at about 0 for 100 time steps, after which it goes up to a value of about 3 within 25 time steps, and so on.

(c) It starts at a value of about $$-3$$, then goes up rapidly by about 3 units, followed by a longer period with almost no change in value, after which it again goes up rapidly by about 3 units, and so on.

Note how in (a) and (b) the emphasis is on the actual values of the time series, whereas in (c) we mainly refer to trends in the time series in relation to previously observed trends. High-level information might be difficult to extract from (a) directly, while (b) could be seen as putting too much emphasis on the time series values instead of the overall shape. The symbolic representation developed in this paper, called adaptive Brownian bridge-based aggregation (ABBA), adaptively reduces T to a shorter sequence of symbols with an emphasis on the shape information. The resulting description will be conceptually similar to (c) from the examples above.

To formalize the discussion and introduce notation, we consider the problem of aggregating a time series $$T = [ t_0, t_1, \ldots , t_N] \in \mathbb {R}^{N+1}$$ into a symbolic representation $$S = [ s_1, s_2, \ldots , s_n ]\in \mathbb {A}^n$$, where each $$s_j$$ is an element of an alphabet $$\mathbb {A} = \{ a_1, a_2, \ldots , a_k\}$$ of k symbols. The sequence S should be of considerably lower dimension than the original time series T, that is $$n \ll N$$, and it should only use a small number of meaningful symbols, that is $$k \ll n$$. The representation should also allow for the approximate reconstruction of the original time series with a controllable error, with the shape of the reconstruction suitably close to that of the original. Both n, the length of the symbolic representation, and k, the number of symbols, should be chosen automatically without parameter tuning required.

This paper is organized as follows. In Sect. 2 we give an overview of existing symbolic representations and other algorithms which are conceptually similar to ABBA. To evaluate the approximation accuracy of ABBA, we must compare the shape of the original time series and the reconstruction from its symbolic representation. Section 3 reviews existing distance measures for this purpose and discusses how well they perform in measuring shape. Sections 4–7 contain the key contributions of this paper:

• Section 4 introduces ABBA, our novel dimension-reducing symbolic time series representation which aims to preserve the shape of the original time series. We explain in detail how ABBA’s compression and reconstruction procedures work.

• In Sect. 5 we show that the error of the ABBA reconstruction behaves like a random walk with pinned start and end values. This observation appears to be novel in itself and allows us to balance the error of the piecewise linear approximation with that of the digitization procedure, thereby allowing the method to choose the number of symbols k automatically.

• Section 6 contains performance comparisons of ABBA with other popular symbolic representations using various distance measures, with a particular emphasis on the compression versus accuracy relation. Aside from verifying that ABBA can represent time series to higher accuracy than SAX and 1d-SAX using a comparable number of symbols k and string length n, we also find that SAX outperforms 1d-SAX when the same number of symbols k is used for both.

• In Sect. 7 we discuss some practical applications of ABBA including the handling of linear trends, anomaly detection, and VizTree visualization.

Finally, we conclude in Sect. 8 with an outlook on future work.

## 2 Background and related work

Despite the large number of dimension-reducing time series representations in the literature, very few are symbolic. Most techniques are numeric in the sense that they reduce a time series to a lower-dimensional vector with its components taken from a continuous range; see Bettaiah and Ranganath (2014), Fu (2011), Lin et al. (2007) for reviews. Here we provide an overview of existing symbolic representations relevant to ABBA.

The construction of symbolic time series representations typically consists of two parts. First, the time series is segmented, with the length of each segment being either specified by the user or found adaptively via a bottom-up, top-down, or sliding window approach (Keogh et al. 2001). The segmentation procedure intrinsically controls the degree of dimension reduction. The second part, the discretization process, assigns a symbol to each segment.

Symbolic Aggregate approXimation (SAX), a very popular symbolic representation, consists of a piecewise approximation of the time series followed by a symbolic conversion using Gaussian breakpoints (Lin et al. 2007). SAX starts by partitioning T into segments of constant length $$\texttt {len}$$, and then represents each segment by the mean of its values (i.e., a piecewise constant approximation). The means are converted into symbols using breakpoints that partition a Gaussian bell curve into k equally-sized areas. In addition to its simplicity, an attractive feature of SAX is the existence of distance measures that serve as lower bounds for the Euclidean distance between the original time series. On the other hand, both the segment length $$\texttt {len}$$ and the number of symbols k must be specified in advance. SAX is designed such that each symbol appears with equal probability, which works best when the time series values are approximately normally distributed.
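The SAX pipeline just described can be sketched in a few lines (a minimal illustration; the function name and structure are ours, not those of any SAX library):

```python
import numpy as np
from statistics import NormalDist

def sax(ts, seg_len, k):
    """Minimal SAX sketch: z-normalize, take piecewise constant segment means,
    then convert to symbols via equal-probability Gaussian breakpoints."""
    ts = np.asarray(ts, dtype=float)
    ts = (ts - ts.mean()) / ts.std()                 # normalize first
    n_seg = len(ts) // seg_len                       # fixed-length segments
    means = ts[:n_seg * seg_len].reshape(n_seg, seg_len).mean(axis=1)
    # breakpoints splitting the standard normal into k equal-area regions
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    codes = np.searchsorted(breakpoints, means)      # region index per segment
    return "".join(chr(ord("a") + c) for c in codes)
```

With segment length 10 and $$k=4$$, a series of 100 points is reduced to a string of 10 symbols drawn from {a, b, c, d}; note that both `seg_len` and `k` must be supplied by the user, as discussed above.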

The literature on applications of SAX is extensive and many variants have been proposed. Most variants modify the symbolic representation to incorporate the slope of the time series on each segment. This is often justified by applications in finance, where the extreme values of time series provide valuable information which is lost with the piecewise constant approximation used in SAX. The modifications often come at the cost of losing the lower bounds on distance measures. We now provide a brief overview of some of these variants.

Trend-based and Value-based Approximation (TVA) uses SAX to symbolically represent the time series values, enhanced with U, D, or S symbols to represent an upwards, downwards, or straight trend, respectively (Esmael et al. 2012). The TVA representation alternates between value symbols and slope symbols, making the symbolic representation twice as long as a SAX representation with the same number of segments. A similar approach is Trend-based SAX (TSAX), which uses two trend symbols per segment (Zhang et al. 2018).

Extended SAX (ESAX) represents each segment by the minimum, maximum, and mean value of the time series ordered according to their appearance in the segment, defining the mean to appear in the center of the segment (Lkhagva et al. 2006). This results in a symbolic representation three times longer than the corresponding SAX representation with the same number of segments. ENhanced SAX (EN-SAX) forms a vector for each segment consisting of the minimum, maximum and mean value. The vectors are then clustered and a symbol is allocated to each cluster (Barnaghi et al. 2012). Time-Weighted Average for SAX (TWA_SAX) uses the time weighted average for each segment instead of the mean (Benyahmed et al. 2015). This can encapsulate important patterns which are missed by the mean.

Trend-based Symbolic approximation (TSX) represents each segment by four symbols (Li et al. 2012). The first symbol corresponds to the SAX representation. The following three symbols encode the slopes between the first, last, “most peak”, and “most dip” points, where the latter two are defined by their vertical distance from the trend line (the straight line connecting the end point values of a segment). The slopes are converted to symbols using a lookup table. This results in a symbolic representation four times longer than the SAX representation with the same number of segments.

The 1d-SAX algorithm uses linear regression to fit a straight line to each segment (Malinowski et al. 2013). Each segment is then represented by the gradient and the average value of the line. Two sets of Gaussian breakpoints are used to provide symbols for both the averages and the slopes. It is unclear how many breakpoints should be allocated for the averages, and how many should be allocated for the slopes. The total number of symbols is the product of the respective number of breakpoints.

Using the same number of segments, the above SAX variants result in an increase in the length of the symbolic representation by some factor. It is unclear whether any of these approaches performs better than SAX when the SAX segment length $$\texttt {len}$$ is decreased by the same factor (keeping the overall length of the symbolic representation constant). As with the original SAX approach, all of these variants require the user to specify the segment length $$\texttt {len}$$ and the number of symbols k in advance.

In many time series applications, the assumption that the values of the normalized time series follow a normal distribution is a strong one. To overcome this, the adaptive SAX algorithm (aSAX) uses k-means clustering to find the breakpoints for the symbolic conversion (Pham et al. 2010). However, as piecewise constant approximations are used, the aSAX approach fails to represent the extreme points of the time series.

SAX’s digitization procedure based on Gaussian breakpoints allows its extension to a multi-resolution symbolic representation known as indexable SAX (iSAX) (Shieh and Keogh 2008). This clever indexing procedure allows mining of datasets containing millions of time series. At the heart of the algorithm is a SAX representation where each window uses Gaussian breakpoints with $$2^c$$ regions, where c can change from segment to segment.

The sensorPCA algorithm overcomes the fixed window length problem by using a sliding window to start a new segment when the standard deviation of the approximation exceeds some prespecified tolerance (Ganz et al. 2013). However, Ganz et al. (2013) does not provide a method to convert the mean values and window lengths to a symbolic representation.

Symbolic Aggregate approXimation Optimized by data (SAXO) is a data-driven approach based on a regularized Bayesian coclustering method called minimum optimized description length (Bondu et al. 2016; Boullé 2006). The discretization of the time series is optimized using Bayesian statistics. The number of symbols and the underlying distribution change for each time interval. The computational complexity of SAXO is far greater than that of SAX.

Mörchen and Ultsch (2006) take a completely different approach based on the persistence of a time series. A persistent time series is one where the value at a certain point is closely related to the previous value; see also Kim (2000). The authors provide “persist”, a symbolic representation based on the Kullback–Leibler divergence between the marginal and the self-transition probability distributions of the discretization symbols.

Symbolic Polynomial (SP) is a symbolic representation designed to detect local patterns (Grabocka et al. 2014). It is constructed by an overlapping sliding window of length w and stepsize 1. For each window, one computes the coefficients of a regression polynomial of degree d. The coefficients of each order are collected and allocated a symbol using an equi-area discretization. This symbolic representation provides no dimension reduction as each window is represented by d symbols.

Baydogan and Runger (2015) introduce a symbolic representation of multivariate time series called SMTS. They construct a data table consisting of time index, time values, and first differences of the time series. A tree learner is trained on the data and each of the leaf nodes is allocated a symbol. Their approach allows multiple tree learners, which in the univariate case results in a symbolic representation much larger than the original.

Piecewise linear approximations of time series have been used for many years. The lengths of the linear pieces (segments) can be prespecified or chosen adaptively. Each segment is approximated using either linear interpolation or linear regression (Keogh et al. 2001). Luo et al. (2015) describe how the linear segments can be stitched so that each piece is represented by two parameters rather than three. An example of a piecewise linear approximation algorithm is the Ramer–Douglas–Peucker algorithm, an iterative endpoint fitting procedure which uses adaptive linear interpolation with a prespecified tolerance. These methods provide an effective shape-preserving and dimension-reducing representation but not a symbolic representation.
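As an illustration of adaptive linear interpolation with a prespecified tolerance, here is a compact sketch of the Ramer–Douglas–Peucker recursion (function and variable names are ours):

```python
import numpy as np

def rdp(points, eps):
    """Ramer-Douglas-Peucker: keep the point farthest from the chord joining
    the endpoints, and recurse on both halves while that distance exceeds eps."""
    pts = np.asarray(points, dtype=float)
    start, end = pts[0], pts[-1]
    chord = end - start
    norm = np.hypot(chord[0], chord[1])
    if norm == 0.0:
        dist = np.linalg.norm(pts - start, axis=1)
    else:
        # perpendicular distance of each point from the start-end chord
        dist = np.abs(chord[0] * (pts[:, 1] - start[1])
                      - chord[1] * (pts[:, 0] - start[0])) / norm
    idx = int(np.argmax(dist))
    if dist[idx] > eps:
        left = rdp(pts[:idx + 1], eps)
        right = rdp(pts[idx:], eps)
        return np.vstack([left[:-1], right])  # drop the duplicated split point
    return np.vstack([start, end])
```

Applied to a straight run of points with a single spike, the recursion keeps only the two endpoints and the spike, which is exactly the shape-preserving, dimension-reducing behavior described above.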

## 3 Distance measures

The accuracy of a symbolic time series representation S can be assessed by the distance between the original time series T and its reconstruction $$\widehat{T}$$ from S. We note that the original time series should first be normalized to have zero mean and unit variance. This ensures that distance measures are comparable across different time series; see Keogh and Kasetty (2003) for a discussion of the importance of normalization.

A detailed overview of time series distance measures and their applications can be found in Aghabozorgi et al. (2015). Distance measures for time series typically fall into two main categories: lock-step alignment and elastic alignment (Abanda et al. 2019). Lock-step alignment refers to the element-wise comparison of time series, i.e., the i-th element of one time series is compared to the i-th element of another. Such measures can only compare time series of equal length. The most popular lock-step distance is the Euclidean distance. The Euclidean distance is a poor measure of shape similarity in two particular cases: if the time series have the same shape but are stretched in value (see Fig. 2a), or if the time series have the same shape but are warped in time (see Fig. 2b). The first issue can be mitigated by differencing the time series before measuring the distance. The second issue is intrinsic to lock-step alignment distance measures.
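The effect of differencing is easiest to see for a pure vertical offset, the simplest form of value stretching, which differencing removes exactly (a minimal numpy check):

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 50)
a = np.sin(t)
b = np.sin(t) + 2.0   # identical shape, shifted in value

# Plain Euclidean distance is large despite the identical shape ...
plain = np.linalg.norm(a - b)
# ... while differencing removes the constant offset entirely.
differenced = np.linalg.norm(np.diff(a) - np.diff(b))
print(plain, differenced)
```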

Elastic alignment distance measures construct a nonlinear mapping between time series elements, effectively allowing for one value in a time series to be compared to multiple consecutive values in another. The most popular elastic alignment method is Dynamic Time Warping (DTW), originally proposed in Berndt and Clifford (1994). The DTW distance measure corresponds to the Euclidean distance between two DTW-aligned time series. This distance measure can be used to compare time series of different lengths but it has a quadratic computational complexity in both time and space; for further details see Keogh and Ratanamahatana (2005). Many methods have been proposed to either approximate the DTW distance at a reduced cost or calculate bounds to avoid computing the DTW alignment altogether. Keogh and Pazzani (2001) notice that DTW may pair a rising trend in one time series with a falling trend in another, and they overcome this problem by a variant known as Derivative Dynamic Time Warping (DDTW). The elastic alignment allows DTW to overcome the issues when two time series have the same shape but are warped in time (see Fig. 2b), but DTW is still a poor measure of shape similarity if the time series have the same shape but are vertically stretched (see Fig. 2a). Again, this can be mitigated by differencing the time series before measuring their DTW distance.
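A textbook dynamic-programming implementation makes both the elasticity and the quadratic cost of DTW explicit (a minimal sketch, not an optimized implementation):

```python
import numpy as np

def dtw(x, y):
    """Quadratic-cost DTW: D[i, j] is the cheapest cost of aligning the first
    i points of x with the first j points of y; returns the Euclidean
    distance of the two series under the optimal warping path."""
    nx, ny = len(x), len(y)
    D = np.full((nx + 1, ny + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[nx, ny])

t = np.linspace(0, 2 * np.pi, 60)
a = np.sin(t)
b = np.sin(np.linspace(0, 2 * np.pi, 90))  # same shape, warped in time
print(dtw(a, b))
```

Note that `a` and `b` have different lengths, so a lock-step distance is not even defined, yet their DTW distance is small because the warping path aligns the matching parts of the two sinusoids.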

It is because of these advantages and drawbacks of the Euclidean and DTW distance measures and their differenced counterparts that we will test the performance of ABBA with all these distance measures in Sect. 6.

## 4 Adaptive Brownian bridge-based aggregation

We now introduce ABBA, a symbolic representation of time series where the symbolic length n and the number of symbols k are chosen adaptively. The ABBA representation is computed in two stages.

1. *Compression*: The original time series T is approximated by a piecewise linear and continuous function, with each linear piece being chosen adaptively based on a user-specified tolerance. The result is a sequence of tuples $$(\texttt {len}, \texttt {inc})$$ consisting of the length of each piece and its increment in value.

2. *Digitization*: A near-optimal alphabet $$\mathbb {A}$$ is identified via mean-based clustering, with each cluster corresponding to a symbol. Each tuple $$(\texttt {len}, \texttt {inc})$$ is assigned the symbol of the cluster to which it belongs.

The reconstruction of a time series from its ABBA representation involves three stages.

1. *Inverse-digitization*: Each symbol of the symbolic representation is replaced with the center of the associated cluster. The length values of the centers are not necessarily integers.

2. *Quantization*: The lengths of the reconstructed segments are realigned with an integer grid.

3. *Inverse-compression*: The piecewise linear continuous approximation is converted back to a pointwise time series representation using a stitching procedure.

Both the computation of the ABBA representation and the reconstruction are inexpensive. It is essential that the digitization process uses incremental changes in value rather than slopes. This way, ABBA consistently works with increments in both the time and value coordinates. Only then will a mean-based clustering algorithm identify meaningful clusters in both coordinate directions. As we will explain in Sect. 5, the error of the ABBA reconstruction behaves like a random walk pinned at zero for both the start and the end point of the time series. But first, we provide a more detailed explanation of the key parts of ABBA. For clarity, we summarize the notation used throughout this section in Table 1.

### 4.1 Compression

The ABBA compression is achieved by an adaptive piecewise linear continuous approximation of T. Given a tolerance $$\texttt {tol}$$, the method adaptively selects $$n+1$$ indices $$i_0 = 0< i_1<\cdots < i_n = N$$ so that the time series $$T = [t_0,t_1,\ldots , t_N]$$ is approximated by a polygonal chain going through the points $$(i_j , t_{i_j})$$ for $$j=0,1,\ldots ,n$$. This gives rise to a partition of T into n pieces $$P_j = [ t_{i_{j-1}},t_{i_{j-1}+1},\ldots , t_{i_j} ]$$, each of length $$\texttt {len}_j := i_j - i_{j-1}\ge 1$$ in the time direction. We ensure that the squared Euclidean distance of the values in $$P_j$$ from the straight polygonal line is bounded by $$(\texttt {len}_j - 1)\cdot \texttt {tol}^2$$. More precisely, starting with $$i_0 = 0$$ and given an index $$i_{j-1}$$, we find the largest possible $$i_j$$ such that $$i_{j-1} < i_j\le N$$ and

\begin{aligned} \sum _{i=i_{j-1}}^{i_j} \left( \underbrace{t_{i_{j-1}} {+} (t_{i_j} {-} t_{i_{j-1}})\cdot \frac{i - i_{j-1}}{i_j - i_{j-1}}}_{\text {straight line approximation}} \ \ - \underbrace{t_i}_{\text {actual value}} \right) ^2 \le (i_{j} - i_{j-1} -1)\cdot \texttt {tol}^2. \quad \end{aligned}
(1)

Note that the first and the last values $$t_{i_{j-1}}$$ and $$t_{i_{j}}$$ are not counted in the distance measure as the straight line approximation passes exactly through them. If required, one can restrict the maximum length of each segment by imposing an upper bound $$i_{j} \le i_{j-1} + \texttt {max\_len}$$ with a given integer $$\texttt {max\_len}\ge 1$$.

Each linear piece $$P_j$$ of the resulting polygonal chain $$\widetilde{T}$$ is described by a tuple $$(\texttt {len}_j, \texttt {inc}_j)$$, where $$\texttt {inc}_j = t_{i_j} - t_{i_{j-1}}$$ is the increment in value (not the slope!). As the polygonal chain is continuous, the first value of a segment can be inferred from the end value of the previous segment. Hence the whole polygonal chain can be recovered exactly from the first value $$t_0$$ and the tuple sequence

\begin{aligned} (\texttt {len}_1, \texttt {inc}_1), (\texttt {len}_2, \texttt {inc}_2), \ldots , (\texttt {len}_n, \texttt {inc}_n)\in \mathbb {R}^2. \end{aligned}
(2)
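The compression loop can be sketched as follows; this simplified greedy variant extends each piece until (1) is first violated (function and variable names are ours, not those of the reference implementation):

```python
import numpy as np

def compress(ts, tol, max_len=np.inf):
    """Adaptive piecewise linear continuous approximation (ABBA compression
    sketch): grow each piece while the squared-error bound (1) holds, then
    emit the piece as a (len, inc) tuple."""
    ts = np.asarray(ts, dtype=float)
    pieces, start = [], 0
    while start < len(ts) - 1:
        end = start + 1                  # a length-1 piece always satisfies (1)
        while end + 1 < len(ts) and end + 1 - start <= max_len:
            cand = end + 1
            k = cand - start
            # straight line through (start, ts[start]) and (cand, ts[cand])
            line = ts[start] + (ts[cand] - ts[start]) * np.arange(k + 1) / k
            if np.sum((line - ts[start:cand + 1]) ** 2) > (k - 1) * tol ** 2:
                break
            end = cand
        pieces.append((end - start, ts[end] - ts[start]))  # (len_j, inc_j)
        start = end
    return pieces
```

A perfectly linear series collapses to a single tuple, while the piece lengths always sum to N, so the polygonal chain can be recovered from $$t_0$$ and the tuples alone.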

An example of the ABBA compression procedure applied to the time series in Fig. 1 is shown in Fig. 3. Here a tolerance of $$\texttt {tol} = 0.4$$ has been used, resulting in $$n=7$$ pieces. As the approximation error on each piece $$P_j$$ satisfies (1), the polygonal chain $$\widetilde{T}$$ also has a bounded Euclidean distance from T:

\begin{aligned} \texttt {euclid}(T, \widetilde{T})^2 &\le \left[ (i_1 - i_0 -1) + (i_2 - i_1 - 1) + \cdots + (i_n - i_{n-1} - 1) \right] \cdot \texttt {tol}^2 \\ &= (N -n)\cdot \texttt {tol}^2. \end{aligned}
(3)

Hence we are sure that the ABBA approximation $$\widetilde{T}$$ (red dashed curve) in Fig. 3 has a Euclidean distance of at most $$\sqrt{223}\times 0.4\approx 6.0$$ from the original time series T (black solid curve).

Let us comment on the computational complexity of this compression phase. Assuming that all pieces have an average length $$\overline{\texttt {len}}$$, the evaluation of (1) for each $$i_j=i_{j-1}+2,i_{j-1}+3,\ldots ,i_{j-1}+\overline{\texttt {len}}$$ involves a sum of $$1,2,\ldots ,\overline{\texttt {len}}-1$$ terms, respectively. (Recall that the straight line approximation passes through the end points and hence the corresponding terms in the sum need not be evaluated.) Therefore, the overall number of summands to be evaluated and added up is $$1+2+\cdots +(\overline{\texttt {len}}-1) = \mathcal O(\overline{\texttt {len}}^2)$$ per piece. Under the natural assumption that the average piece length $$\overline{\texttt {len}}$$ is independent of the length N of the time series, we have $$N \propto n\cdot \overline{\texttt {len}}$$. Hence, the compression phase has a linear complexity of $$\mathcal {O}(N) = \mathcal O(n)$$.

### 4.2 Digitization

Digitization refers to the assignment of the tuples in (2) to k clusters $$S_1,S_2,\ldots ,S_k$$. Before clustering, we separately normalize the tuple lengths and increments by their standard deviations $$\sigma _{\texttt {len}}$$ and $$\sigma _{\texttt {inc}}$$, respectively. We use a further scaling parameter $$\texttt {scl}$$ to assign different weight (“importance”) to the length of each piece in relation to its increment value. Hence, we effectively cluster the scaled tuples

\begin{aligned} \left( \texttt {scl}\frac{\texttt {len}_1}{\sigma _{\texttt {len}}}, \frac{\texttt {inc}_1}{\sigma _{\texttt {inc}}}\right) , \left( \texttt {scl}\frac{\texttt {len}_2}{\sigma _{\texttt {len}}}, \frac{\texttt {inc}_2}{\sigma _{\texttt {inc}}} \right) , \ldots , \left( \texttt {scl}\frac{\texttt {len}_n}{\sigma _{\texttt {len}}}, \frac{\texttt {inc}_n}{\sigma _{\texttt {inc}}} \right) \in \mathbb {R}^2. \end{aligned}
(4)

If $$\texttt {scl} = 0$$, then clustering is performed on the increments alone, while if $$\texttt {scl} = 1$$, we cluster in both the length and increment dimension with equal weighting. The cluster assignment is performed by (approximately) minimizing the within-cluster-sum-of-squares

\begin{aligned} \texttt {WCSS} = \sum _{i=1}^{k} \sum _{(\texttt {len}, \texttt {inc}) \in S_i} \Bigg \Vert \left( \texttt {scl}\frac{\texttt {len}}{\sigma _{\texttt {len}}}, \frac{\texttt {inc}}{\sigma _{\texttt {inc}}}\right) - \overline{\mu }_i \Bigg \Vert ^2 , \end{aligned}

with each 2d cluster center $$\overline{\mu }_i = (\overline{\mu }_i^{\texttt {len}},\overline{\mu }_i^{\texttt {inc}})$$ corresponding to the mean of the scaled tuples associated with the cluster $$S_i$$. In certain situations one may want to cluster only on the lengths of the pieces and ignore their increments, formally setting $$\texttt {scl} = \infty$$. In this case, the cluster assignment is performed by (approximately) minimizing

\begin{aligned} \texttt {WCSS} = \sum _{i=1}^{k} \sum _{(\texttt {len}, \texttt {inc}) \in S_i} \Bigg | \frac{\texttt {len}}{\sigma _{\texttt {len}}} - \overline{\mu }_i^{\texttt {len}} \Bigg |^2 , \end{aligned}

where $$\overline{\mu }_i^{\texttt {len}}$$ is the mean of the scaled lengths in the cluster $$S_i$$.

Given a clustering of the n tuples into clusters $$S_1,\ldots ,S_k$$ we use the unscaled cluster centers $$\mu _i$$

\begin{aligned} \mu _i = (\mu _i^{\texttt {len}}, \mu _i^{\texttt {inc}}) = \frac{1}{|S_i|} \sum _{(\texttt {len}, \texttt {inc}) \in S_i} (\texttt {len}, \texttt {inc}) \end{aligned}

to define the maximal cluster variances in the length and increment directions as

\begin{aligned} \mathrm {Var}_{\texttt {len}} &= \max _{i=1,\ldots ,k} \frac{1}{|S_i|} \sum _{(\texttt {len}, \texttt {inc}) \in S_i} \left| \texttt {len} - \mu _i^{\texttt {len}}\right| ^2, \\ \mathrm {Var}_{\texttt {inc}} &= \max _{i=1,\ldots ,k} \frac{1}{|S_i|} \sum _{(\texttt {len}, \texttt {inc}) \in S_i} \left| \texttt {inc} - \mu _i^{\texttt {inc}}\right| ^2, \end{aligned}

respectively. Here, $$|S_i|$$ is the number of tuples in cluster $$S_i$$. We seek the smallest number of clusters k such that

\begin{aligned} \max (\texttt {scl}\cdot \mathrm {Var}_{\texttt {len}}, \mathrm {Var}_{\texttt {inc}}) \le \texttt {tol}_s^2 \end{aligned}
(5)

with a tolerance $$\texttt {tol}_s$$. This tolerance will be specified in Sect. 5 as a function of the user-specified tolerance $$\texttt {tol}$$ and is therefore not a free parameter. (In the case of $$\texttt {scl} = \infty$$, we seek the smallest k such that $$\mathrm {Var}_{\texttt {len}} \le \texttt {tol}_s^2$$.) Once the optimal k has been found, each cluster $$S_1, \ldots , S_k$$ is assigned a symbol $$a_1, \ldots , a_k$$, respectively. Finally, each tuple in the sequence (2) is replaced by the symbol of the cluster it belongs to, resulting in the symbolic representation $$S = [ s_1,s_2,\ldots ,s_n]$$.
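With the default $$\texttt {scl}=0$$, the search for the smallest adequate k reduces to one-dimensional clustering of the increments (rescaling by $$\sigma _{\texttt {inc}}$$ does not change a 1d partition, so the raw increments can be clustered and the unscaled variance bound tested directly). A sketch of this search, substituting plain Lloyd iterations with deterministic quantile seeding for the optimal ckmeans algorithm used in the paper:

```python
import numpy as np

def digitize_inc(incs, tol_s, max_k=20):
    """ABBA digitization sketch for scl = 0: cluster the increments in 1d,
    returning the smallest k whose worst cluster variance is <= tol_s**2."""
    incs = np.asarray(incs, dtype=float)
    for k in range(1, max_k + 1):
        # deterministic seeding: k mid-quantiles of the increments
        centers = np.quantile(incs, (np.arange(k) + 0.5) / k)
        for _ in range(100):                       # Lloyd iterations
            labels = np.argmin(np.abs(incs[:, None] - centers[None, :]), axis=1)
            new = np.array([incs[labels == i].mean() if np.any(labels == i)
                            else centers[i] for i in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        max_var = max(np.mean((incs[labels == i] - centers[i]) ** 2)
                      if np.any(labels == i) else 0.0
                      for i in range(k))
        if max_var <= tol_s ** 2:                  # criterion (5) with scl = 0
            return labels, centers
    return labels, centers                         # fall back to k = max_k
```

For increments clustered around two well-separated values, the search stops at $$k=2$$: one symbol for the upward moves and one for the downward moves.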

If $$\texttt {scl} = 0$$ or $$\texttt {scl} = \infty$$, a 1d clustering method can be used which takes advantage of sorting algorithms; see the review by Grønlund et al. (2017). We use the ckmeans algorithm (Wang and Song 2011), an order $$\mathcal {O}(n \log n+ kn)$$ dynamic programming algorithm which optimally clusters the data by minimizing the WCSS in just one dimension. We have modified the algorithm to choose the smallest k such that the maximal cluster variance is bounded by $$\texttt {tol}_s^2$$.

For nonzero finite values of $$\texttt {scl}$$, k-means clustering is used. This algorithm has an average complexity of $$\mathcal {O}(kn)$$ per iteration [see also Arthur and Vassilvitskii (2006) for an analysis of the worst case complexity] and might of course result in a suboptimal clustering. In our ABBA implementation the user can specify an interval $$[\texttt {min\_k}, \ldots , \texttt {max\_k}]$$ and we search for the smallest k in that interval such that (5) holds. If no such k exists, we set $$k = \texttt {max\_k}$$.

By default, we set $$\texttt {scl}=0$$ as we believe this corresponds most naturally to preserving the up-and-down behavior of the time series. In other words, we ignore the lengths of the pieces and only cluster the value increments. With the value increments represented accurately, the errors in lengths correspond to horizontal stretching in the time direction.

An illustration of the digitization process on the pieces from Fig. 3 can be seen in Fig. 4 with $$\texttt {scl} = 0$$ (our default parameter choice), Fig. 5 with $$\texttt {scl} = 1$$, and Fig. 6 with $$\texttt {scl} = \infty$$.

### 4.3 Inverse digitization and quantization

When reversing the digitization process, each symbol of the alphabet is replaced by the center $$(\overline{\texttt {len}}_i, \overline{\texttt {inc}}_i)$$ of the corresponding cluster given as

\begin{aligned} (\overline{\texttt {len}}_i, \overline{\texttt {inc}}_i) = \frac{1}{|S_i|} \sum _{(\texttt {len}, \texttt {inc}) \in S_i} (\texttt {len}, \texttt {inc}). \end{aligned}

Note that the mean-based clustering for digitization is performed on the scaled tuples (4), but the cluster centers used for the inverse digitization are computed with the unscaled tuples (2). The inverse digitization process results in a sequence of n tuples

\begin{aligned} (\widetilde{\texttt {len}}_1, \widetilde{\texttt {inc}}_1), (\widetilde{\texttt {len}}_2, \widetilde{\texttt {inc}}_2), \ldots , (\widetilde{\texttt {len}}_n, \widetilde{\texttt {inc}}_n) \in \mathbb {R}^{2}, \end{aligned}

where each tuple is a cluster center, that is $$(\widetilde{\texttt {len}}_i, \widetilde{\texttt {inc}}_i) \in \{(\overline{\texttt {len}}_1, \overline{\texttt {inc}}_1), (\overline{\texttt {len}}_2, \overline{\texttt {inc}}_2), \ldots , (\overline{\texttt {len}}_k, \overline{\texttt {inc}}_k)\}$$.

The lengths $$\widetilde{\texttt {len}}_i$$ obtained from this averaging are not necessarily integer values as they were in the compressed representation (2). We therefore perform a simple quantization procedure which realigns the cumulated lengths with their closest integers. We start with rounding the first length, $$\widehat{\texttt {len}}_1 := \mathrm {round}(\widetilde{\texttt {len}}_1)$$, keeping track of the rounding error $$e := \widetilde{\texttt {len}}_1 - \widehat{\texttt {len}}_1$$. This error is added to the second length $$\widetilde{\texttt {len}}_2 := \widetilde{\texttt {len}}_2 + e$$, which is then rounded to $$\widehat{\texttt {len}}_2 := \mathrm {round}(\widetilde{\texttt {len}}_2)$$ with error $$e := \widetilde{\texttt {len}}_2 - \widehat{\texttt {len}}_2$$, and so on. As a result we obtain a sequence of n tuples

\begin{aligned} (\widehat{\texttt {len}}_1, \widehat{\texttt {inc}}_1), (\widehat{\texttt {len}}_2, \widehat{\texttt {inc}}_2), \ldots , (\widehat{\texttt {len}}_n, \widehat{\texttt {inc}}_n) \in \mathbb {R}^{2} \end{aligned}
(6)

with integer lengths $$\widehat{\texttt {len}}_i$$. (The increments remain unchanged but we rename them for consistency: $$\widehat{\texttt {inc}}_i := \widetilde{\texttt {inc}}_i$$.)
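The quantization recurrence just described, together with the stitching procedure used in the inverse-compression step, can be sketched as follows (function names are ours):

```python
import numpy as np

def quantize(lens):
    """Round each non-integer length while diffusing the rounding error into
    the next length, keeping the cumulated lengths aligned with the integers."""
    out, e = [], 0.0
    for ln in lens:
        ln = ln + e           # carry the previous rounding error forward
        r = round(ln)
        out.append(int(r))
        e = ln - r
    return out

def stitch(t0, tuples):
    """Inverse-compression: linearly interpolate each (len, inc) piece,
    starting every piece at the previous piece's end value."""
    values = [t0]
    for ln, inc in tuples:
        values.extend(t0 + inc * np.arange(1, ln + 1) / ln)
        t0 += inc
    return np.array(values)
```

For example, five fractional lengths of 1.4 quantize to [1, 2, 1, 2, 1], whose sum 7 matches the rounded total 5 × 1.4; stitching two tuples (2, 2.0) and (2, −2.0) from $$t_0 = 0$$ yields the polygonal chain [0, 1, 2, 1, 0].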

## 5 Error analysis

During the compression procedure, we construct a polygonal chain $$\widetilde{T}$$ going through selected points $$\{ (i_j, t_{i_j}) \}_{j = 0}^{n}$$ of the original time series T, with a controllable Euclidean distance (3). After the digitization, inverse digitization, and quantization, we obtain a new tuple sequence (6) which can be stitched together to a polygonal chain $$\widehat{T}$$ going through the points $$\{ (\widehat{i}_j, \widehat{t}_{i_j}) \}_{j = 0}^{n}$$, with $$(\widehat{i}_0,\widehat{t}_{i_0})=(0,t_0)$$. Our aim is to analyze the distance between $$\widehat{T}$$ and $$\widetilde{T}$$, and then balance it with the distance between $$\widetilde{T}$$ and T.

We first note that

\begin{aligned} (\widehat{i}_j, \widehat{t}_{i_j} ) = \left( \sum _{\ell = 1}^{j} \widehat{\texttt {len}}_{\ell }, \, t_0 + \sum _{\ell = 1}^{j} \widehat{\texttt {inc}}_{\ell }\right) , \quad j=0,\ldots ,n. \end{aligned}

As all the lengths $$\widehat{\texttt {len}}_{\ell }$$ and increments $$\widehat{\texttt {inc}}_{\ell }$$ correspond to cluster centers (averages of all the points in a cluster, consistently rounded during quantization), we have the interesting property that the accumulated deviations from the true lengths and increments exactly cancel out at the right endpoint of the last piece $$P_n$$, that is: $$(\widehat{i}_n, \widehat{t}_{i_n} ) = (i_n, t_{i_n}) = (N, t_N)$$. In other words, the polygonal chain $$\widehat{T}$$ starts and ends at the same values as $$\widetilde{T}$$ (and hence T).

We now analyze the behavior of $$\widehat{T}$$ between the start and end points, focusing on the case $$\texttt {scl}=0$$ and assuming for simplicity that all clusters $$S_i$$ have the same mean length $$\mu _i^\texttt {len} = N/n$$. (This is not a strong assumption, as the lengths of the pieces are irrelevant in the dynamic time warping distance.) We compare $$\widehat{T}$$ with the polygonal chain $$\widetilde{T}$$ time-warped to the same regular length grid as $$\widehat{T}$$, which gives an upper bound on $$\texttt {dtw}(\widehat{T},\widetilde{T})$$. Denoting by $$d_\ell := \widehat{\texttt {inc}}_{\ell } - \widetilde{\texttt {inc}}_{\ell }$$ the local deviation of the increment value of $$\widehat{T}$$ on piece $$P_\ell$$ from the true increment of $$\widetilde{T}$$, we have that

\begin{aligned} \widehat{t}_{i_j} - t_{i_j} = \sum _{\ell =1}^j d_\ell =: e_{i_j}, \quad j=0,\ldots ,n. \end{aligned}

Recall from Sect. 4.2 that we have controlled the variance of the increment values in each cluster to be bounded by $$\texttt {tol}_s^2$$. As a consequence, the increment deviations $$d_\ell$$ have variance bounded by $$\texttt {tol}_s^2$$ and mean zero, as they correspond to deviations from their respective cluster center. It is therefore reasonable to model the “global increment errors” $$e_{i_j}$$ as a random process with fixed values $$e_{i_0} = e_{i_n} = 0$$, expectation $$\mathrm {E}(e_{i_j})=0$$, and variance

\begin{aligned} \mathrm {Var}({e_{i_j}}) = \texttt {tol}_s^2 \cdot \frac{j (n-j)}{n}, \quad j=0,\ldots ,n. \end{aligned}

In the case that the $$d_\ell$$ are i.i.d. normally distributed, such a process is known as a Brownian bridge. See also Fig. 7 for an illustration.

Note that so far we have only considered the variance of the global increment errors $$e_{i_j}$$ at the left and right endpoints of each piece $$P_j$$, but we are actually interested in analyzing the error of the reconstruction $$\widehat{T}$$ on the fine time grid. To this end, we now consider a “worst-case” realization of $$e_{i_j}$$ which stays s standard deviations away from its zero mean. That is, we consider a realization

\begin{aligned} {e_{i_j}} = s\cdot \texttt {tol}_s \cdot \sqrt{\frac{j (n-j)}{n}}, \quad j=0,\ldots ,n. \end{aligned}

By piecewise linear interpolation of these errors from the coarse time grid $$i_0,i_1,\ldots ,i_n$$ to the fine time grid $$i=0,1,\ldots ,N$$ (in accordance with the linear stitching procedure used in ABBA), we find that

\begin{aligned} {e_{i}} \le \sqrt{\frac{n}{N}}\cdot s\cdot \texttt {tol}_s \cdot \sqrt{\frac{i (N-i)}{N}}, \quad i=0,\ldots ,N, \end{aligned}

using that the interpolated quadratic function on the right-hand side is concave. We can now bound the squared Euclidean norm of this fine-grid “worst-case” realization as

\begin{aligned} \sum _{i=0}^N e_i^2 \le \frac{n\cdot s^2 \cdot \texttt {tol}_s^2}{N^2} \cdot \sum _{i=0}^N i (N-i) = \frac{n\cdot s^2 \cdot \texttt {tol}_s^2}{N^2} \cdot \frac{N^3 - N}{6} \le {n\cdot s^2 \cdot \texttt {tol}_s^2} \cdot \frac{N}{6}. \end{aligned}

This is a probabilistic bound on the squared Euclidean error caused by a “worst-case” realization of the Brownian bridge, and thereby a probabilistic bound on the error incurred by the digitization procedure. Equating this bound with the bound (3) on the accuracy of the compression, we find that we should choose

\begin{aligned} \texttt {tol}_s = \frac{\texttt {tol}}{s} \sqrt{\frac{6(N-n)}{Nn}}, \end{aligned}

with the user-specified tolerance $$\texttt {tol}$$. We have experimentally determined that $$s=0.2$$ typically gives a good balance between the compression accuracy and the number of clusters determined using this criterion.
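The balancing formula is straightforward to evaluate; the helper below is an illustrative sketch (not the ABBA implementation), checked against the heat-exchanger example of this section ($$N=7127$$, $$n=123$$, $$\texttt{tol}=0.1$$, $$s=0.2$$).

```python
import math

def increment_tolerance(tol, N, n, s=0.2):
    """Clustering tolerance tol_s that balances the digitization error
    against the compression tolerance tol, per the formula above."""
    return (tol / s) * math.sqrt(6.0 * (N - n) / (N * n))
```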

Example: We now illustrate the above analysis on a challenging real-world example. Consider a time series T ($$N=7127$$) consisting of temperature readings from a heat exchanger in an ethylene cracker. We use $$\texttt {tol} = 0.1$$ to compress this time series, resulting in a polygonal chain $$\widetilde{T}$$ with $$n=123$$ pieces and an approximation error of $$\texttt {euclid}(T,\widetilde{T}) = 5.3 \le \sqrt{N-n}\cdot \texttt {tol} \approx 8.4$$. See Fig. 8 for a plot of the original time series T and its reconstruction $$\widetilde{T}$$ after compression.

We then run the ABBA digitization procedure with scaling parameter $$\texttt {scl}=0$$, resulting in a symbolic representation S of length n using $$k=14$$ symbols. In Fig. 7 we show the “global increment errors” $$e_{i_j}$$ of the reconstruction $$\widehat{T}$$ on each piece $$P_j$$, that is, the increment deviation of $$\widehat{T}$$ from T at the endpoints of $$P_j$$, $$j=1,\ldots ,n$$. Note how this error is pinned at zero at $$j=0$$ and $$j=n$$, and how it resembles a random walk in between.

The reconstruction $$\widehat{T}$$ on the fine time grid is also shown in Fig. 8. The reconstruction error measured in the time warping distance is $$\texttt {dtw}(\widetilde{T}, \widehat{T}) = 9.5$$ and the overall error is $$\texttt {dtw}(T, \widehat{T}) = 10.8$$, both of which are approximately of the same order as $$\sqrt{N-n}\cdot \texttt {tol} \approx 8.4$$. Note that the ABBA reconstruction $$\widehat{T}$$ visually deviates a lot from T due to the rather high tolerance we have chosen for illustration, but nevertheless, the characteristic up-and-down behavior of T is well represented in $$\widehat{T}$$, despite the high compression rate of $$123/7128 \approx 1.7\,\%$$.

## 6 Discussion and performance comparison

A Python implementation of ABBA, along with codes to reproduce the figures and performance comparisons in this paper, can be found at https://github.com/nla-group/ABBA

When the scaling parameter is $$\texttt {scl} = 0$$ or $$\texttt {scl} = \infty$$, our implementation calls an adaptation of the univariate k-means algorithm from the R package Ckmeans.1d.dp written in $$\texttt {C++}$$. We use SWIG, the open-source “Simplified Wrapper and Interface Generator”, to call $$\texttt {C++}$$ functions from Python. If $$\texttt {scl} \in ( 0, \infty )$$, we use the k-means algorithm from the Python sklearn library (Pedregosa et al. 2011).

ABBA uses the lengths and increments of a polygonal chain on each segment to construct its symbolic time series representation. Symbolic Polynomial (Grabocka et al. 2014) with $$d=1$$ and 1d-SAX (Malinowski et al. 2013), on the other hand, use linear regression to fit a polynomial to a window of fixed pre-specified length. As we discussed in Sect. 2, Symbolic Polynomial provides no dimension reduction and was specifically designed for time series classification problems. Most other SAX variants increase the length of the symbolic representation by enhancing the string with additional characters to capture shapes and trends. It is not clear whether these representations outperform SAX with a reduced width parameter to compensate for the increased string length. Such a comparison would be interesting but is independent of ABBA’s performance and outside the scope of this paper. SMTS (Baydogan and Runger 2015) and aSAX (Pham et al. 2010) use machine learning techniques to discretize their representation. SMTS is primarily designed for multivariate time series and provides no dimension reduction. EN-SAX (Barnaghi et al. 2012) and aSAX suffer from a loss of trend information in their compression step.

For these reasons, we focus on profiling the reconstruction errors of the ABBA, SAX (Lin et al. 2007), and 1d-SAX (Malinowski et al. 2013) algorithms, as these are most closely related and easily comparable. Note that none of these representations was primarily designed as a compression algorithm. ABBA was designed to be adaptive in both the segment length and alphabet cardinality, whereas SAX and 1d-SAX have many other benefits such as being hashable (Chiu et al. 2003), indexable (Shieh and Keogh 2008), and permitting lower bounding distance measures. Our test set consists of all time series in the UCR Time Series Classification Archive (Dau et al. 2018) with a length of at least 100 data points. There are 128,978 such time series from a variety of applications. Although the archive is primarily intended for benchmarking time series classification algorithms, our primary focus in this paper is on the approximation performance of the symbolic representations. Our experiment consists of converting each time series $$T=[t_0,t_1,\ldots ,t_N]$$ into its symbolic representation $$S = [s_1,\ldots ,s_n]$$, and then measuring the distance between the reconstruction $$\widehat{T}=[\widehat{t}_0,\widehat{t}_1,\ldots ,\widehat{t}_N]$$ and T in the (differenced) Euclidean and DTW norms, respectively.

Recall from Sect. 2 that both SAX and 1d-SAX require a choice for the fixed segment length. In order to provide a fair comparison, we first run the ABBA compression with an initial tolerance $$\texttt {tol}=0.05$$. This returns n, the number of pieces required to approximate T to this tolerance. If n turns out to be larger than N/5, we successively increase the tolerance by 0.05 and rerun until a compression rate of at least 20 % is achieved. If a time series cannot be compressed to at least 20 % even at the rather crude tolerance of $$\texttt {tol}=0.5$$, we consider it too noisy and exclude it from the test. We also exclude all time series which, after ABBA compression, result in fewer than nine pieces: this is necessary because we want to use $$k=9$$ symbols for all compared methods. Table 2 shows how many of the 111,889 remaining time series were compressed at each tolerance. The table gives evidence that most of these time series can be compressed reasonably well while maintaining a rather high accuracy. The average compression rate is 10.3 %.
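The tolerance-escalation loop can be sketched as follows. Here `compress(T, tol)` is a hypothetical stand-in for the ABBA compression returning the number of pieces n; the step size and cap match the description above.

```python
def choose_tolerance(T, compress, tol0=0.05, step=0.05, tol_max=0.5):
    """Increase the compression tolerance until the number of pieces n
    satisfies n <= N/5, i.e. a compression rate of at least 20 %.
    Returns (tol, n), or None if the series is too noisy to compress
    even at tol_max.  `compress(T, tol)` is a stand-in for the ABBA
    compression returning the number of pieces."""
    N = len(T) - 1
    tol = tol0
    while tol <= tol_max + 1e-12:
        n = compress(T, tol)
        if n <= N / 5:
            return tol, n
        tol += step
    return None  # exclude the series from the test
```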

After the number of pieces n has been specified for a given time series T, we determine the fixed segment length $$\texttt {len} = \lfloor (N+1)/n \rfloor$$ to be used in the SAX and 1d-SAX algorithms. We then apply SAX and 1d-SAX to the first $$n\cdot \texttt {len}$$ points of T. This guarantees that all three algorithms (SAX, 1d-SAX, and ABBA) produce a symbolic representation with n pieces. If $$N+1$$ is not divisible by n, SAX and 1d-SAX are applied to a slightly shorter time series than ABBA. The number of symbols used for the digitization is $$k=9$$ for all three methods. In the case of 1d-SAX this means that three symbols are used for the mean value and three symbols for the slope on each piece. Each algorithm produces a symbolic representation of length n using an alphabet of cardinality $$k=9$$. SAX and 1d-SAX require the values of w and k for the reconstruction, whereas ABBA requires the 2k numbers representing the lengths and increments of each cluster. In total, ABBA requires more storage to represent a time series using a string of length n and an alphabet of cardinality k, but is able to represent the whole time series more accurately without truncation.

To visualize the results of our comparison we use performance profiles (Dolan and Moré 2002). Performance profiles allow us to compare the relative performance of multiple algorithms over a large set of test problems. Each algorithm is represented by a non-decreasing curve in a $$\theta$$–p graph. The $$\theta$$-axis represents a tolerance $$\theta \ge 1$$ and the p-axis corresponds to a fraction $$p\in [0,1]$$. If a curve passes through a point $$(\theta ,p)$$, the corresponding algorithm performed within a factor $$\theta$$ of the best observed performance on $$100\cdot p$$ % of the test problems. For $$\theta =1$$ one can read off the fraction of test problems on which each algorithm was the best performer, while as $$\theta \rightarrow \infty$$ all curves approach $$p = 1$$ (unless an algorithm has failed on a fraction of the test problems, which is not the case here).
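A performance profile is computed directly from a matrix of per-problem errors; the following minimal sketch (names are illustrative) returns, for each algorithm and each value of $$\theta$$, the fraction p of problems on which that algorithm was within a factor $$\theta$$ of the best.

```python
def performance_profile(errors, thetas):
    """errors[a][t] is the error of algorithm a on test problem t.
    Returns one curve per algorithm: the fraction of problems on which
    its error is within a factor theta of the best error, per theta."""
    n_alg = len(errors)
    n_prob = len(errors[0])
    # Best (smallest) error achieved on each problem by any algorithm.
    best = [min(errors[a][t] for a in range(n_alg)) for t in range(n_prob)]
    profiles = []
    for a in range(n_alg):
        ratios = [errors[a][t] / best[t] for t in range(n_prob)]
        profiles.append([sum(r <= theta for r in ratios) / n_prob
                         for theta in thetas])
    return profiles
```

With two algorithms that each win on half of the problems, both curves start at $$p=0.5$$ for $$\theta=1$$ and reach $$p=1$$ once $$\theta$$ covers the worst ratio.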

In Figs. 9a–10d we present eight performance profiles for the ABBA scaling parameters $$\texttt {scl} = 0$$ and $$\texttt {scl} = 1$$, respectively, and with four different distance measures: the Euclidean and DTW distances and their differenced counterparts. Figure 9a shows the performance profile for $$\texttt {scl} = 0$$, with the distance between T and $$\widehat{T}$$ measured in the Euclidean norm. As expected, SAX consistently outperforms ABBA because the Euclidean distance is very sensitive to horizontal shifts in the time direction, which ABBA has completely ignored due to the $$\texttt {scl} = 0$$ parameter. However, it is somewhat surprising that SAX also outperforms 1d-SAX. When $$k=9$$, 1d-SAX allocates 3 symbols for the slopes and 3 symbols for the averages. To represent all possible combinations of slopes and averages, 9 unique symbols are needed. It appears that the use of the slope information in 1d-SAX is detrimental to the approximation accuracy and, if the number of symbols is kept constant, they are better used to represent the averages rather than the slopes.

The performance changes when we use the DTW distance, thereby allowing for shifts in time. In this case, ABBA outperforms SAX and 1d-SAX significantly; see Fig. 9b. This is because ABBA has been tailored to preserve the up-and-down shape of the time series, at the cost of allowing for small errors in the lengths of the pieces which are easily corrected by time warping. Again SAX with $$k=9$$ symbols performs better than 1d-SAX with $$k=9$$ symbols.

The performance gain of ABBA becomes even more pronounced when we difference the data before computing the Euclidean and DTW distances; see Fig. 9c, d, respectively. Moreover, we observe that when differencing is used, 1d-SAX performs slightly better than SAX. Computing the Euclidean and DTW distances of the differenced data amounts to comparing the gradients of the time series, rather than their values. The gradient information is better captured when we allocate some symbols for the slope information (as in 1d-SAX) rather than allocating all symbols for the averages (as in SAX). This explains the slight advantage of 1d-SAX over SAX with differenced distance measures.

In the next four tests we set $$\texttt {scl} = 1$$, so the ABBA clustering procedure considers both the increments and lengths equally. Figure 10a, b show the resulting performance profiles using the Euclidean and DTW distance measures, respectively. As expected, ABBA becomes more competitive even for the Euclidean distance measure. Computationally, however, this comes at the cost of not being able to use a fast optimal 1d-clustering algorithm. Finally, Fig. 10c, d show the performance profiles for the Euclidean and DTW distance measures on the differenced data, respectively. As in the case $$\texttt {scl} = 0$$, differencing helps to improve the performance of ABBA in comparison to SAX and 1d-SAX even further.

## 7 Further discussion and applications

Section 6 demonstrated that ABBA provides high compression rates while guaranteeing that the time series reconstruction stays close to the original. The high compression is a consequence of the stitching procedure during the compression stage. Section 5 showed how errors accumulate piece by piece in the stitching process. We believe that this property prevents ABBA from admitting lower bounding distance measures such as those available for SAX. SAX’s lower bounding measure and indexability make it suitable for applications where multiple time series have to be compared (such as time series classification). ABBA, on the other hand, appears best suited for applications where information has to be extracted from a single time series, such as anomaly detection, motif discovery, and trend prediction. As the output of ABBA is simply a string sequence, it can be combined with existing algorithms that previously used, e.g., a SAX representation. Below we discuss various aspects and applications of ABBA.

In-built differencing. Working with the increments (instead of slopes) allows ABBA to capture linear trends in time series without preprocessing. In Fig. 11 we consider the simple test problem of a sine wave with a gradual linear trend in the presence of noise. After normalization, SAX is able to accurately represent the time series as shown in Fig. 11(i). If we used the symbolic representation for trend prediction, however, the SAX representation would be unsuitable for continuing the linear trend as new symbols would need to be introduced. Of course, this problem could be overcome by removing the linear trend through differencing the time series. A SAX representation of the differenced time series is shown in Fig. 11(ii). Unfortunately, differencing the noisy time series amplifies the noise. Figure 11(iii) compares the original time series against the reconstructed time series from the SAX representation of the differenced data. As we can see, the increased noise level renders the SAX representation extremely inaccurate. ABBA, on the other hand, does not require any differencing as it works with increments by default. As a consequence, the ABBA reconstruction shown in Fig. 11(iv) stays very close to the original time series, capturing both the gradual linear trend as well as the characteristic up-and-down behavior.
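The noise amplification caused by differencing has a simple quantitative explanation: for i.i.d. noise of variance $$\sigma ^2$$, each difference is the difference of two independent samples and so has variance $$2\sigma ^2$$. A quick numerical check under the assumption of standard normal noise (sample size and seed are arbitrary):

```python
import random

# Differencing i.i.d. noise doubles its variance: for noise of
# variance sigma^2, the differenced series has variance 2*sigma^2.
rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(100000)]
diffs = [b - a for a, b in zip(noise, noise[1:])]

def var(xs):
    """Plain (biased) sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

Here `var(noise)` comes out close to 1 and `var(diffs)` close to 2, which is why the SAX representation of the differenced series in Fig. 11(iii) degrades so badly.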

Anomaly detection refers to the problem of finding points or intervals in time series which display surprising or unexpected behavior. Recent literature reviews of existing anomaly detection algorithms are given in Gupta et al. (2013) and Atluri et al. (2018). The ABBA representation can be used for anomaly detection in a variety of ways. Trend anomalies can be detected in the digitization procedure via k-means clustering of the lengths and increments. The alphabet is ordered such that ‘a’ is the most frequent symbol, followed by ‘b’ and so forth. If the kth cluster contains very few elements relative to the other clusters, then this might be considered a trend anomaly.
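This idea can be sketched by counting how many pieces each symbol (cluster) covers and flagging the rare ones; the frequency threshold below is our illustrative choice, not part of ABBA.

```python
from collections import Counter

def trend_anomalies(symbols, min_fraction=0.05):
    """Flag symbols whose cluster contains very few pieces relative to
    the others as potential trend anomalies.  `symbols` is the ABBA
    string; a symbol covering less than `min_fraction` of all pieces
    is reported (the threshold is a heuristic choice)."""
    counts = Counter(symbols)
    n = len(symbols)
    return sorted(s for s, c in counts.items() if c / n < min_fraction)
```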

TARZAN (Keogh et al. 2002) is a popular anomaly detection algorithm with linear time and space complexity (Pelkonen et al. 2015). The algorithm requires two time series: a reference time series R containing normal behavior, and the test time series X. Both time series are converted to a symbolic representation and stored in a suffix tree (McCreight 1976). An anomaly score is computed by comparing the frequency of a substring in X to an expected frequency computed from R. SAX can be used for the discretization process in TARZAN and has been shown to outperform other symbolic representations with no dimension reduction (Lin et al. 2007).

If both symbolic representations are short and X contains a symbol that does not appear in R, the TARZAN score can suffer from a lack of perspective. For example, suppose the expected frequency of the substring ‘abc’ is 4.2 and ‘abc’ appears 3 times in X; the anomaly score is then $$3 - 4.2 = -1.2$$. Now suppose the symbol ‘d’ does not appear in R but ‘ada’ appears once in X. The expected frequency of the substring ‘ada’ is 0, so its anomaly score is $$1 - 0 = 1$$. In absolute value, this suggests that ‘abc’ is more of an anomaly than ‘ada’. This issue can be overcome by dividing the anomaly score by the larger of the expected and actual frequencies.
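The normalization can be written as a one-line score; this is our illustration of the modification just described, not the exact TARZAN implementation. With it, the never-seen substring ‘ada’ scores 1 in magnitude while ‘abc’ scores only about 0.29, restoring the intended ordering.

```python
def tarzan_score(actual, expected):
    """Anomaly score of a substring: the difference between its observed
    frequency in X and its expected frequency from R, normalized by the
    larger of the two so that rare substrings are not drowned out."""
    denom = max(actual, expected)
    if denom == 0:
        return 0.0  # substring occurs in neither series
    return (actual - expected) / denom
```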

In Figs. 12 and 13 we consider a simple experiment comparing SAX, 1d-SAX, and ABBA as discretization procedures for TARZAN with the modified anomaly score. The reference time series R is a simple sine wave where each period spans 25 time samples. The time series X has one full wave replaced by a flat line of 22 time points. The SAX and 1d-SAX representations use a window length $$w=5$$ and $$k=9$$ symbols, whereas ABBA uses a tolerance tuned to give a symbolic representation of equal length with k bounded by 9. The time series R and X and their symbolic reconstructions are shown in Fig. 12. If the length of the anomaly does not align with the window length w, then SAX and 1d-SAX tend to represent the sine wave following the anomaly as a different substring. The adapted TARZAN score is required as certain symbols appear in X that do not appear in R. Figure 13 shows the resulting TARZAN anomaly scores. Both SAX and 1d-SAX suffer from the fixed window length, returning high anomaly scores long after the anomaly has passed, whereas TARZAN using ABBA recovers almost immediately due to the adaptive segment lengths.