ABBA: Adaptive Brownian bridge-based symbolic aggregation of time series

A new symbolic representation of time series, called ABBA, is introduced. It is based on an adaptive polygonal chain approximation of the time series into a sequence of tuples, followed by a mean-based clustering to obtain the symbolic representation. We show that the reconstruction error of this representation can be modelled as a random walk with pinned start and end points, a so-called Brownian bridge. This insight allows us to make ABBA essentially parameter-free, except for the approximation tolerance which must be chosen. Extensive comparisons with the SAX and 1d-SAX representations are included in the form of performance profiles, showing that ABBA preserves the essential shape information of time series better than these approaches. Advantages and applications of ABBA are discussed, including its in-built differencing property and use for anomaly detection, and Python implementations are provided.

To formalize the discussion and introduce notation, we consider the problem of aggregating a time series $T = [t_0, t_1, \ldots, t_N] \in \mathbb{R}^{N+1}$ into a symbolic representation $S = [s_1, s_2, \ldots, s_n] \in A^n$, where each $s_j$ is an element of an alphabet $A = \{a_1, a_2, \ldots, a_k\}$ of $k$ symbols. The sequence $S$ should be of considerably lower dimension than the original time series $T$, that is $n \ll N$, and it should only use a small number of meaningful symbols, that is $k \ll n$. The representation should also allow for the approximate reconstruction of the original time series with a controllable error, with the shape of the reconstruction suitably close to that of the original. Both $n$, the length of the symbolic representation, and $k$, the number of symbols, should be chosen automatically without parameter tuning required.
This paper is organized as follows. In Section 2 we give an overview of existing symbolic representations and other algorithms which are conceptually similar to ABBA. To evaluate the approximation accuracy of ABBA, we must compare the shape of the original time series and the reconstruction from its symbolic representation. Section 3 reviews existing distance measures for this purpose and discusses how well they perform in measuring shape. Sections 4-7 contain the key contributions of this paper:
- Section 4 introduces ABBA, our novel dimension-reducing symbolic time series representation which aims to preserve the shape of the original time series. We explain in detail how ABBA's compression and reconstruction procedures work.
- In Section 5 we show that the error of the ABBA reconstruction behaves like a random walk with pinned start and end values. This observation appears to be novel in itself and allows us to balance the error of the piecewise linear approximation with that of the digitization procedure, thereby allowing the method to choose the number of symbols k automatically.
- Section 6 contains performance comparisons of ABBA with other popular symbolic representations using various distance measures, with a particular emphasis on the compression versus accuracy relation. Aside from verifying that ABBA can represent time series to higher accuracy than SAX and 1d-SAX using a comparable number of symbols k and string length n, we also find that SAX outperforms 1d-SAX when the same number of symbols k is used for both.
- In Section 7 we discuss some practical applications of ABBA including the handling of linear trends, anomaly detection, and VizTree visualization.
Finally, we conclude in Section 8 with an outlook on future work.

Background and related work
Despite the large number of dimension-reducing time series representations in the literature, very few are symbolic.
Most techniques are numeric in the sense that they reduce a time series to a lower-dimensional vector with its components taken from a continuous range; see [9,17,32] for reviews. Here we provide an overview of existing symbolic representations relevant to ABBA.
The construction of symbolic time series representations typically consists of two parts. First, the time series is segmented, with the length of each segment being either specified by the user or found adaptively via a bottom-up, top-down, or sliding window approach [22]. The segmentation procedure intrinsically controls the degree of dimension reduction. The second part, the discretization process, assigns a symbol to each segment.
Symbolic Aggregate approXimation (SAX), a very popular symbolic representation, consists of a piecewise approximation of the time series followed by a symbolic conversion using Gaussian breakpoints [32]. SAX starts by partitioning T into segments of constant length len, and then represents each segment by the mean of its values (i.e., a piecewise constant approximation). The means are converted into symbols using breakpoints that partition a Gaussian bell curve into k equally-sized areas. In addition to its simplicity, an attractive feature of SAX is the existence of distance measures that serve as lower bounds for the Euclidean distance between the original time series. On the other hand, both the segment length len and the number of symbols k must be specified in advance. SAX is designed such that each symbol appears with equal probability, which works best when the time series values are approximately normally distributed.
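The SAX steps described above (fixed-length segments, segment means, Gaussian breakpoints) can be illustrated with a minimal sketch. This is not the reference SAX implementation: the function name `sax`, the parameter name `seg_len` (standing in for the segment length len), and the choice to drop trailing points that do not fill a segment are assumptions made for illustration only.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, seg_len, k):
    """Toy SAX: z-normalize, average fixed-length segments, map means to symbols.

    Assumes the series is not constant (nonzero standard deviation) and k <= 26.
    """
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma for x in series]
    # Breakpoints that split the standard normal density into k equal-area regions.
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    symbols = []
    # Assumption: trailing points that do not fill a whole segment are dropped.
    for start in range(0, len(z) - seg_len + 1, seg_len):
        segment_mean = mean(z[start:start + seg_len])
        # The symbol index is the number of breakpoints below the segment mean.
        index = sum(segment_mean > b for b in breakpoints)
        symbols.append(chr(ord("a") + index))
    return "".join(symbols)

ramp = [float(i) for i in range(12)]
word = sax(ramp, seg_len=4, k=3)  # "abc": low, middle, and high segment means
```

Note that both `seg_len` and `k` have to be supplied by the caller, which is exactly the tuning burden discussed above.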
The literature on applications of SAX is extensive and many variants have been proposed. Most variants modify the symbolic representation to incorporate the slope of the time series on each segment. This is often justified by applications in finance, where the extreme values of time series provide valuable information which is lost with the piecewise constant approximation used in SAX. The modifications often come at the cost of losing the lower bounds on distance measures. We now provide a brief overview of some of these variants.
Trend-based and Valued-based Approximation (TVA) uses SAX to symbolically represent the time series values, enhanced with U, D, or S symbols to represent an upwards, downwards, or straight trend, respectively [16]. The TVA representation alternates between value symbols and slope symbols, making the symbolic representation twice as long as a SAX representation with the same number of segments. A similar approach is Trend-based SAX (TSAX) which uses two trend symbols per segment [43].
Extended SAX (ESAX) represents each segment by the minimum, maximum, and mean value of the time series ordered according to their appearance in the segment, defining the mean to appear in the center of the segment [33]. This results in a symbolic representation three times longer than the corresponding SAX representation with the same number of segments. ENhanced SAX (EN-SAX) forms a vector for each segment consisting of the minimum, maximum and mean value. The vectors are then clustered and a symbol is allocated to each cluster [5]. Time-Weighted Average for SAX (TWA SAX) uses the time weighted average for each segment instead of the mean [7]. This can encapsulate important patterns which are missed by the mean.
Trend-based Symbolic approximation (TSX) represents each segment by four symbols [28]. The first symbol corresponds to the SAX representation. The following three symbols correspond to the slopes between the first, last, most peak and most dip points, which are defined in terms of vertical distance from the trend line (the straight line connecting the end point values of a segment). The slopes are converted to symbols using a lookup table. This results in a symbolic representation four times longer than the SAX representation with the same number of segments.
The 1d-SAX algorithm uses linear regression to fit a straight line to each segment [35]. Each segment is then represented by the gradient and the average value of the line. Two sets of Gaussian breakpoints are used to provide symbols for both the averages and the slopes. It is unclear how many breakpoints should be allocated for the averages, and how many should be allocated for the slopes. The total number of symbols is the product of the respective number of breakpoints.
Using the same number of segments, the above SAX variants result in an increase in the length of the symbolic representation by some factor. It is unclear whether any of these approaches performs better than SAX when the SAX segment length len is decreased by the same factor (keeping the overall length of the symbolic representation constant). As with the original SAX approach, all of these variants require the user to specify the segment length len and the number of symbols k in advance.
In many time series applications, the assumption that the values of the normalized time series follow a normal distribution is a strong one. To overcome this, the adaptive SAX algorithm (aSAX) uses k-means clustering to find the breakpoints for the symbolic conversion [40]. However, as piecewise constant approximations are used, the aSAX approach fails to represent the extreme points of the time series.
SAX's digitization procedure based on Gaussian breakpoints allows its extension to a multi-resolution symbolic representation known as indexable SAX (iSAX) [41]. This clever indexing procedure allows mining of datasets containing millions of time series. At the heart of the algorithm is a SAX representation where each window uses Gaussian breakpoints with $2^c$ regions, where $c$ can change from segment to segment.
The sensorPCA algorithm overcomes the fixed window length problem by using a sliding window to start a new segment when the standard deviation of the approximation exceeds some prespecified tolerance [18]. However, [18] does not provide a method to convert the mean values and window lengths to a symbolic representation.
Symbolic Aggregate approXimation Optimized by data (SAXO) is a data-driven approach based on a regularized Bayesian coclustering method called minimum optimized description length [10,11]. The discretization of the time series is optimized using Bayesian statistics. The number of symbols and the underlying distribution change for each time interval. The computational complexity of SAXO is far greater than that of SAX.
The authors in [37] take a completely different approach based on the persistence of a time series. A persistent time series is one where the value at a certain point is closely related to the previous value; see also [27]. The authors provide "persist", a symbolic representation based on the Kullback-Leibler divergence between the marginal and the self-transition probability distributions of the discretization symbols.
Symbolic Polynomial (SP) [19] is a symbolic representation designed to detect local patterns. It is constructed by an overlapping sliding window of length w and stepsize 1. For each window, one computes the coefficients of a regression polynomial of degree d. The coefficients of each order are collected and allocated a symbol using an equiarea discretization. This symbolic representation provides no dimensional reduction as each window is represented by d symbols.
The authors in [6] introduce a symbolic representation of multivariate time series called SMTS. They construct a data table consisting of time index, time values, and first differences of the time series. A tree learner is trained on the data and each of the leaf nodes is allocated a symbol. Their approach allows multiple tree learners, which in the univariate case results in a symbolic representation much larger than the original.
Piecewise linear approximations of time series have been used for many years. The lengths of the linear pieces (segments) can be prespecified or chosen adaptively. Each segment is approximated using either linear interpolation or linear regression [22]. The authors of [34] describe how the linear segments can be stitched so that each piece is represented by two parameters rather than three. An example of a piecewise linear approximation algorithm is the Ramer-Douglas-Peucker algorithm, an iterative endpoint fitting procedure which uses adaptive linear interpolation with a prespecified tolerance. These methods provide an effective shape-preserving and dimension-reducing representation but not a symbolic representation.

Distance measures
The accuracy of a symbolic time series representation $S$ can be assessed by the distance between the original time series $T$ and its reconstruction $\widehat{T}$ from $S$. We note that the original time series should first be normalized to have zero mean and unit variance. This ensures that distance measures are comparable across different time series; see [23] for a discussion of the importance of normalization.
A detailed overview of time series distance measures and their applications can be found in [2]. Distance measures for time series typically fall into two main categories: lock-step alignment and elastic alignment [1]. Lock-step alignment refers to the element-wise comparison of time series, i.e., the i-th element of one time series is compared to the i-th element of another. Such measures can only compare time series of equal length. The most popular lock-step distance is the Euclidean distance. The Euclidean distance is a poor measure of shape similarity in two particular cases: if the time series have the same shape but are stretched in value (see Figure 2a), or if the time series have the same shape but are warped in time (see Figure 2b). The first issue can be mitigated by differencing the time series before measuring the distance. The second issue is intrinsic to lock-step alignment distance measures.
Elastic alignment distance measures construct a nonlinear mapping between time series elements, effectively allowing for one value in a time series to be compared to multiple consecutive values in another. The most popular elastic alignment method is Dynamic Time Warping (DTW), originally proposed in [8]. The DTW distance measure corresponds to the Euclidean distance between two DTW-aligned time series. This distance measure can be used to compare time series of different lengths but it has a quadratic computational complexity in both time and space; for further details see [25]. Many methods have been proposed to either approximate the DTW distance at a reduced cost or calculate bounds to avoid computing the DTW alignment altogether. The authors of [26] notice that DTW may pair a rising trend in one time series with a falling trend in another, and they overcome this problem by a variant known as Derivative Dynamic Time Warping (DDTW). The elastic alignment allows DTW to overcome the issues when two time series have the same shape but are warped in time (see Figure 2b), but DTW is still a poor measure of shape similarity if the time series have the same shape but are vertically stretched (see Figure 2a). Again, this can be mitigated by differencing the time series before measuring their DTW distance.

[Figure 2: The time series in these plots have the same essential shape according to our interpretation. Euclidean distance is a poor measure of shape for (a) and (b), whereas DTW distance is a poor measure of shape for (a). A differencing of the time series in (a) would make DTW a suitable shape distance.]
It is because of these advantages and drawbacks of the Euclidean and DTW distance measures and their differenced counterparts that we will test the performance of ABBA with all these distance measures in Section 6.
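To make the distinction between the distance measures concrete, the following sketch implements the textbook quadratic-time DTW recursion together with first-order differencing. It is an illustration only, not the code used for the experiments; the helper names `dtw` and `diff` and the toy series are assumptions. The vertically modified example uses a constant offset, which is the simplest distortion that differencing removes exactly.

```python
import math

def dtw(x, y):
    """Quadratic-time DTW: Euclidean distance along the optimal alignment."""
    inf = math.inf
    # cost[i][j] = minimal accumulated squared difference aligning x[:i] with y[:j].
    cost = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[len(x)][len(y)])

def diff(x):
    """First-order differences of a series."""
    return [b - a for a, b in zip(x, x[1:])]

base = [0.0, 1.0, 0.0, 1.0, 0.0]
warped = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]  # same shape, warped in time
shifted = [v + 2.0 for v in base]              # same shape, offset in value

d_warped = dtw(base, warped)                   # 0.0: DTW absorbs the time warp
d_shifted = dtw(base, shifted)                 # positive: DTW sees the offset
d_shifted_diff = dtw(diff(base), diff(shifted))  # 0.0 after differencing
```

The lock-step Euclidean distance cannot even be evaluated for `base` and `warped`, since the two series have different lengths.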

Adaptive Brownian bridge-based aggregation
We now introduce ABBA, a symbolic representation of time series where the symbolic length n and the number of symbols k are chosen adaptively. The ABBA representation is computed in two stages.
1. Compression: The original time series T is approximated by a piecewise linear and continuous function, with each linear piece being chosen adaptively based on a user-specified tolerance. The result is a sequence of tuples (len, inc) consisting of the length of each piece and its increment in value.
2. Digitization: A near-optimal alphabet A is identified via mean-based clustering, with each cluster corresponding to a symbol. Each tuple (len, inc) is assigned a symbol corresponding to the cluster in which it belongs.
The reconstruction of a time series from its ABBA representation involves three stages.
1. Inverse-digitization: Each symbol of the symbolic representation is replaced with the center of the associated cluster. The length values of the centers may not necessarily be integers.
2. Quantization: The lengths of the reconstructed segments are re-aligned with an integer grid.
3. Inverse-compression: The piecewise linear continuous approximation is converted back to a pointwise time series representation using a stitching procedure.
Both the computation of the ABBA representation and the reconstruction are inexpensive. It is essential that the digitization process uses incremental changes in value rather than slopes. This way, ABBA consistently works with increments in both the time and value coordinates. Only in this case will a mean-based clustering algorithm identify meaningful clusters in both coordinate directions. As we will explain in Section 5, the error of the ABBA reconstruction behaves like a random walk pinned at zero for both the start and the end point of the time series. But first, we provide a more detailed explanation of the key parts of ABBA. For clarity, we summarize the notation used throughout this section in Table 1.

Compression
The ABBA compression is achieved by an adaptive piecewise linear continuous approximation of $T$. Given a tolerance tol, the method adaptively selects $n + 1$ indices $i_0 = 0 < i_1 < \cdots < i_n = N$ so that the time series $T = [t_0, t_1, \ldots, t_N]$ is approximated by a polygonal chain going through the points $(i_j, t_{i_j})$ for $j = 0, 1, \ldots, n$. This gives rise to a partition of $T$ into $n$ pieces $P_j = [t_{i_{j-1}}, \ldots, t_{i_j}]$, each of integer length $\mathrm{len}_j := i_j - i_{j-1} \geq 1$ in the time direction.

[Table 1: Summary of notation after compression, after digitization, after quantization, and after inverse-compression.]

We ensure that the squared Euclidean distance of the values in $P_j$ from the straight polygonal line is bounded by $(\mathrm{len}_j - 1) \cdot \mathrm{tol}^2$. More precisely, starting with $i_0 = 0$ and given an index $i_{j-1}$, we find the largest possible $i_j$ such that $i_{j-1} < i_j \leq N$ and

$$\sum_{i=i_{j-1}}^{i_j} \left( t_{i_{j-1}} + (t_{i_j} - t_{i_{j-1}}) \cdot \frac{i - i_{j-1}}{i_j - i_{j-1}} - t_i \right)^2 \leq (i_j - i_{j-1} - 1) \cdot \mathrm{tol}^2. \tag{1}$$

Note that the first and the last values $t_{i_{j-1}}$ and $t_{i_j}$ are not counted in the distance measure as the straight line approximation passes exactly through them. If required, one can restrict the maximum length of each segment by imposing an upper bound $i_j \leq i_{j-1} + \mathrm{max\_len}$ with a given integer $\mathrm{max\_len} \geq 1$. Each linear piece $P_j$ of the resulting polygonal chain $\widetilde{T}$ is described by a tuple $(\mathrm{len}_j, \mathrm{inc}_j)$, where $\mathrm{inc}_j = t_{i_j} - t_{i_{j-1}}$ is the increment in value (not the slope!). As the polygonal chain is continuous, the first value of a segment can be inferred from the end value of the previous segment. Hence the whole polygonal chain can be recovered exactly from the first value $t_0$ and the tuple sequence

$$(\mathrm{len}_1, \mathrm{inc}_1), (\mathrm{len}_2, \mathrm{inc}_2), \ldots, (\mathrm{len}_n, \mathrm{inc}_n) \in \mathbb{R}^2. \tag{2}$$

An example of the ABBA compression procedure applied to the time series in Figure 1 is shown in Figure 3. Here a tolerance of tol = 0.4 has been used, resulting in n = 7 pieces.
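As the approximation error on each piece $P_j$ satisfies (1), the polygonal chain $\widetilde{T}$ also has a bounded Euclidean distance from $T$:

$$\mathrm{euclid}(T, \widetilde{T})^2 \leq \sum_{j=1}^{n} (i_j - i_{j-1} - 1) \cdot \mathrm{tol}^2 = (N - n) \cdot \mathrm{tol}^2. \tag{3}$$

Hence we are sure that the ABBA approximation $\widetilde{T}$ (red dashed curve) in Figure 3 has a Euclidean distance of at most $\sqrt{223} \times 0.4 \approx 6.0$ from the original time series $T$ (black solid curve).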
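The compression step can be sketched in a few lines of Python. This is a simplified illustration rather than the published implementation: the function name `compress` is hypothetical, and the sketch greedily extends each piece until the tolerance criterion is first violated, which is an assumption that may stop earlier than the largest admissible index.

```python
def compress(ts, tol, max_len=None):
    """Greedy adaptive polygonal chain; returns a list of (len, inc) tuples."""
    pieces = []
    start = 0
    last = len(ts) - 1
    while start < last:
        end = start + 1  # a piece of length 1 has no interior points, so zero error
        while end < last and (max_len is None or end + 1 - start <= max_len):
            cand = end + 1
            inc = ts[cand] - ts[start]
            # Squared distance of the interior points from the straight line
            # connecting (start, ts[start]) and (cand, ts[cand]).
            err = sum(
                (ts[start] + inc * (i - start) / (cand - start) - ts[i]) ** 2
                for i in range(start + 1, cand)
            )
            if err > (cand - start - 1) * tol ** 2:
                break  # assumption: stop at the first violation
            end = cand
        pieces.append((end - start, ts[end] - ts[start]))
        start = end
    return pieces

triangle = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
pieces = compress(triangle, tol=0.1)  # [(3, 3.0), (3, -3.0)]
```

The lengths of the returned pieces always sum to N, and the increments sum to the difference between the last and the first value of the series, so the chain can be rebuilt from the first value and the tuples alone.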

Digitization
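Digitization refers to the assignment of the tuples in (2) to $k$ clusters $S_1, S_2, \ldots, S_k$. Before clustering, we separately normalize the tuple lengths and increments by their standard deviations $\sigma_{\mathrm{len}}$ and $\sigma_{\mathrm{inc}}$, respectively. We use a further scaling parameter scl to assign different weight ("importance") to the length of each piece in relation to its increment value. Hence, we effectively cluster the scaled tuples

$$p_1 = \left( \mathrm{scl} \cdot \frac{\mathrm{len}_1}{\sigma_{\mathrm{len}}}, \frac{\mathrm{inc}_1}{\sigma_{\mathrm{inc}}} \right), \ldots, p_n = \left( \mathrm{scl} \cdot \frac{\mathrm{len}_n}{\sigma_{\mathrm{len}}}, \frac{\mathrm{inc}_n}{\sigma_{\mathrm{inc}}} \right). \tag{4}$$

If scl = 0, then clustering is performed on the increments alone, while if scl = 1, we cluster in both the length and increment dimension with equal weighting. The cluster assignment is performed by (approximately) minimizing the within-cluster-sum-of-squares

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{(\mathrm{len}, \mathrm{inc}) \in S_i} \left\| (\mathrm{len}, \mathrm{inc}) - \mu_i \right\|^2,$$

with each 2d cluster center $\mu_i = (\mu_i^{\mathrm{len}}, \mu_i^{\mathrm{inc}})$ corresponding to the mean of the scaled tuples associated with the cluster $S_i$. In certain situations one may want to cluster only on the lengths of the pieces and ignore their increments, formally setting scl = ∞. In this case, the cluster assignment is performed by (approximately) minimizing

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{(\mathrm{len}, \mathrm{inc}) \in S_i} \left| \mathrm{len} - \mu_i^{\mathrm{len}} \right|^2,$$

where $\mu_i^{\mathrm{len}}$ is the mean of the scaled lengths in the cluster $S_i$.

Given a clustering of the $n$ tuples into clusters $S_1, \ldots, S_k$ we use the unscaled cluster centers $\mu_i$ to define the maximal cluster variances in the length and increment directions as

$$\mathrm{Var}_{\mathrm{len}} = \max_{i=1,\ldots,k} \frac{1}{|S_i|} \sum_{(\mathrm{len}, \mathrm{inc}) \in S_i} \left| \mathrm{len} - \mu_i^{\mathrm{len}} \right|^2 \quad \text{and} \quad \mathrm{Var}_{\mathrm{inc}} = \max_{i=1,\ldots,k} \frac{1}{|S_i|} \sum_{(\mathrm{len}, \mathrm{inc}) \in S_i} \left| \mathrm{inc} - \mu_i^{\mathrm{inc}} \right|^2,$$

respectively. Here, $|S_i|$ is the number of tuples in cluster $S_i$. We seek the smallest number of clusters $k$ such that

$$\max(\mathrm{scl} \cdot \mathrm{Var}_{\mathrm{len}}, \mathrm{Var}_{\mathrm{inc}}) \leq \mathrm{tol}_s^2 \tag{5}$$

with a tolerance $\mathrm{tol}_s$. This tolerance will be specified in Section 5 as a function of the user-specified tolerance tol and is therefore not a free parameter. (In the case of scl = ∞, we seek the smallest $k$ such that $\mathrm{Var}_{\mathrm{len}} \leq \mathrm{tol}_s^2$.) Once the optimal $k$ has been found, each cluster $S_1, \ldots, S_k$ is assigned a symbol $a_1, \ldots, a_k$, respectively. Finally, each tuple in the sequence (2) is replaced by the symbol of the cluster it belongs to, resulting in the symbolic representation $S = [s_1, s_2, \ldots, s_n]$.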
If scl = 0 or scl = ∞, a 1d clustering method can be used which takes advantage of sorting algorithms; see the review [20]. We use the ckmeans algorithm [42], an order $O(n \log n + kn)$ dynamic programming algorithm which optimally clusters the data by minimizing the WCSS in just one dimension. We have modified the algorithm to choose the smallest $k$ such that the maximal cluster variance is bounded by $\mathrm{tol}_s^2$. For nonzero finite values of scl, k-means clustering is used. This algorithm has an average complexity of $O(kn)$ per iteration (see also [3] for an analysis of the worst case complexity) and might of course result in a suboptimal clustering. In our ABBA implementation the user can specify an interval [min_k, ..., max_k] and we search for the smallest $k$ in that interval such that (5) holds. If no such $k$ exists, we set k = max_k.
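For the default case scl = 0, the search for the smallest admissible k can be sketched as follows. This is not the ckmeans algorithm used in the ABBA implementation: as a stand-in it uses a plain $O(kn^2)$ dynamic program over the sorted increments, which also yields an optimal 1d clustering but is slower. The function names `cluster_1d` and `digitize` and the use of lowercase letters as symbols are assumptions for illustration.

```python
def cluster_1d(values, k):
    """Optimal 1d k-means by dynamic programming over sorted values (O(k n^2))."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    x = [values[i] for i in order]
    n = len(x)
    # Prefix sums give the sum of squared deviations of x[a:b] in O(1).
    s1, s2 = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, v in enumerate(x):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def sse(a, b):
        total = s1[b] - s1[a]
        return max(s2[b] - s2[a] - total * total / (b - a), 0.0)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for b in range(c, n + 1):
            for a in range(c - 1, b):
                cand = best[c - 1][a] + sse(a, b)
                if cand < best[c][b]:
                    best[c][b], cut[c][b] = cand, a
    # Backtrack the cluster boundaries and map labels to the original order.
    labels = [0] * n
    b = n
    for c in range(k, 0, -1):
        a = cut[c][b]
        for pos in range(a, b):
            labels[order[pos]] = c - 1
        b = a
    return labels

def digitize(incs, tol_s, min_k=1, max_k=None):
    """Smallest k in [min_k, max_k] whose maximal cluster variance is <= tol_s^2."""
    max_k = min(max_k or len(incs), len(incs))
    for k in range(min_k, max_k + 1):
        labels = cluster_1d(incs, k)
        var_inc = 0.0
        for c in range(k):
            members = [v for v, lab in zip(incs, labels) if lab == c]
            mu = sum(members) / len(members)
            var_inc = max(var_inc, sum((v - mu) ** 2 for v in members) / len(members))
        if var_inc <= tol_s ** 2:
            break  # if no k satisfies the bound, the loop ends with k = max_k
    return "".join(chr(ord("a") + lab) for lab in labels), k

symbols, k = digitize([3.0, -3.0, 3.1, -2.9, 0.1], tol_s=0.2)  # ("cacab", 3)
```

In this toy example two clusters cannot meet the variance bound because the small increment 0.1 would have to share a cluster with increments near ±3, so a third symbol is introduced.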
By default, we set scl = 0 as we believe this corresponds most naturally to preserving the up-and-down behavior of the time series. In other words, we ignore the lengths of the pieces and only cluster the value increments. With the value increments represented accurately, the errors in lengths correspond to horizontal stretching in the time direction.
An illustration of the digitization process on the pieces from Figure 3 can be seen in Figure 4 with scl = 0 (our default parameter choice), Figure 5 with scl = 1, and Figure 6 with scl = ∞.

Inverse digitization and quantization
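When reversing the digitization process, each symbol of the alphabet is replaced by the center $(\overline{\mathrm{len}}_i, \overline{\mathrm{inc}}_i)$ of the corresponding cluster given as

$$(\overline{\mathrm{len}}_i, \overline{\mathrm{inc}}_i) = \frac{1}{|S_i|} \sum_{(\mathrm{len}, \mathrm{inc}) \in S_i} (\mathrm{len}, \mathrm{inc}).$$

Note that the mean-based clustering for digitization is performed on the scaled tuples (4), but the cluster centers used for the inverse digitization are computed with the unscaled tuples (2). The inverse digitization process results in a sequence of $n$ tuples

$$(\widetilde{\mathrm{len}}_1, \widetilde{\mathrm{inc}}_1), (\widetilde{\mathrm{len}}_2, \widetilde{\mathrm{inc}}_2), \ldots, (\widetilde{\mathrm{len}}_n, \widetilde{\mathrm{inc}}_n) \in \mathbb{R}^2,$$

where each tuple is a cluster center, that is $(\widetilde{\mathrm{len}}_i, \widetilde{\mathrm{inc}}_i) \in \{(\overline{\mathrm{len}}_1, \overline{\mathrm{inc}}_1), (\overline{\mathrm{len}}_2, \overline{\mathrm{inc}}_2), \ldots, (\overline{\mathrm{len}}_k, \overline{\mathrm{inc}}_k)\}$.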
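The lengths $\widetilde{\mathrm{len}}_i$ obtained from this averaging are not necessarily integer values as they were in the compressed representation (2). We therefore perform a simple quantization procedure which realigns the cumulated lengths with their closest integers. We start with rounding the first length, $\widehat{\mathrm{len}}_1 := \mathrm{round}(\widetilde{\mathrm{len}}_1)$, keeping track of the rounding error $e := \widetilde{\mathrm{len}}_1 - \widehat{\mathrm{len}}_1$. This error is added to the second length, $\widetilde{\mathrm{len}}_2 := \widetilde{\mathrm{len}}_2 + e$, which is then rounded to $\widehat{\mathrm{len}}_2 := \mathrm{round}(\widetilde{\mathrm{len}}_2)$ with error $e := \widetilde{\mathrm{len}}_2 - \widehat{\mathrm{len}}_2$, and so on. As a result we obtain a sequence of $n$ tuples

$$(\widehat{\mathrm{len}}_1, \widehat{\mathrm{inc}}_1), (\widehat{\mathrm{len}}_2, \widehat{\mathrm{inc}}_2), \ldots, (\widehat{\mathrm{len}}_n, \widehat{\mathrm{inc}}_n) \in \mathbb{R}^2 \tag{6}$$

with integer lengths $\widehat{\mathrm{len}}_i$. (The increments remain unchanged but we rename them for consistency: $\widehat{\mathrm{inc}}_i := \widetilde{\mathrm{inc}}_i$.)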
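The quantization with error carry-over, followed by the stitching of the pieces into a pointwise series (inverse-compression), can be sketched as below. The function names `quantize` and `inverse_compress` are hypothetical, and forcing every rounded length to be at least 1 is an extra safeguard assumed here so that the stitching never divides by zero; it is not stated in the text.

```python
def quantize(lengths):
    """Round lengths to integers while carrying the rounding error forward."""
    rounded = []
    carry = 0.0
    for length in lengths:
        target = length + carry
        # Python's round() sends exact halves to the nearest even integer.
        # Assumption: enforce a minimum piece length of 1.
        q = max(1, round(target))
        carry = target - q
        rounded.append(q)
    return rounded

def inverse_compress(start_value, pieces):
    """Stitch (len, inc) tuples with integer lengths into a pointwise series."""
    series = [start_value]
    for length, inc in pieces:
        base = series[-1]
        for step in range(1, length + 1):
            # Linear interpolation along the piece; the last step adds the full inc.
            series.append(base + inc * step / length)
    return series

int_lengths = quantize([2.4, 2.4, 2.2])  # [2, 3, 2], same total length 7
series = inverse_compress(0.0, list(zip(int_lengths, [1.0, -1.5, 0.5])))
```

Because the rounding error is carried forward rather than discarded, the cumulated integer lengths stay within half a grid point of the cumulated real-valued lengths, provided the minimum-length safeguard is never triggered.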

Error analysis
During the compression procedure, we construct a polygonal chain $\widetilde{T}$ going through selected points $\{(i_j, t_{i_j})\}_{j=0}^{n}$ of the original time series $T$, with a controllable Euclidean distance (3). After the digitization, inverse digitization, and quantization, we obtain a new tuple sequence (6) which can be stitched together to a polygonal chain $\widehat{T}$ going through the points $\{(\widehat{i}_j, \widehat{t}_{i_j})\}_{j=0}^{n}$, with $(\widehat{i}_0, \widehat{t}_{i_0}) = (0, t_0)$. Our aim is to analyze the distance between $\widetilde{T}$ and $\widehat{T}$, and then balance it with the distance between $T$ and $\widetilde{T}$.
As all the lengths $\widehat{\mathrm{len}}_\ell$ and increments $\widehat{\mathrm{inc}}_\ell$ correspond to cluster centers (averages of all the points in a cluster, consistently rounded during quantization), we have the interesting property that the accumulated deviations from the true lengths and increments exactly cancel out at the right endpoint of the last piece $P_n$, that is: $(\widehat{i}_n, \widehat{t}_{i_n}) = (i_n, t_{i_n}) = (N, t_N)$. In other words, the polygonal chain $\widehat{T}$ starts and ends at the same values as $\widetilde{T}$ (and hence $T$). We now analyze the behavior of $\widehat{T}$ in between the start and endpoints, focusing on the case that scl = 0 and assuming for simplicity that all clusters $S_i$ have the same mean length $\mu_i^{\mathrm{len}} = N/n$. (This is not a strong assumption as in the dynamic time warping distance the lengths of the pieces are irrelevant.) We compare $\widehat{T}$ with the polygonal chain $\widetilde{T}$ time-warped to the same regular length grid as $\widehat{T}$, which will give an upper bound on $\mathrm{dtw}(\widetilde{T}, \widehat{T})$. Denoting by $d_\ell := \widehat{\mathrm{inc}}_\ell - \mathrm{inc}_\ell$ the local deviation of the increment value of $\widehat{T}$ on piece $P_\ell$ from the true increment of $\widetilde{T}$, we have that

$$\widehat{t}_{i_j} - t_{i_j} = \sum_{\ell=1}^{j} d_\ell =: e_{i_j}, \quad j = 0, \ldots, n.$$

Recall from Section 4.2 that we have controlled the variance of the increment values in each cluster to be bounded by $\mathrm{tol}_s^2$. As a consequence, the increment deviations $d_\ell$ have bounded variance $\mathrm{tol}_s^2$, and mean zero as they correspond to deviations from their respective cluster center. It is therefore reasonable to model the "global increment errors" $e_{i_j}$ as a random process with fixed values $e_{i_0} = e_{i_n} = 0$, expectation $E(e_{i_j}) = 0$, and variance

$$\mathrm{Var}(e_{i_j}) = \mathrm{tol}_s^2 \cdot \frac{j(n - j)}{n}, \quad j = 0, \ldots, n.$$
In the case that the $d_\ell$ are i.i.d. normally distributed, such a process is known as a Brownian bridge. See also Figure 7 for an illustration. Note that so far we have only considered the variance of the global increment errors $e_{i_j}$ at the left and right endpoints of each piece $P_j$, but we are actually interested in analyzing the error of the reconstruction $\widehat{T}$ on the fine time grid. To this end, we now consider a "worst-case" realization of $e_{i_j}$ which stays $s$ standard deviations away from its zero mean. That is, we consider a realization

$$e_{i_j} = s \cdot \mathrm{tol}_s \cdot \sqrt{\frac{j(n - j)}{n}}, \quad j = 0, \ldots, n.$$
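The variance profile of a discrete Brownian bridge can be checked numerically. The sketch below is an illustration of the model only, not of ABBA itself: it draws i.i.d. Gaussian steps with standard deviation `sigma` (playing the role of $\mathrm{tol}_s$), pins the walk at zero at both ends by subtracting the linear drift, and compares the empirical variance with $\sigma^2 \cdot j(n-j)/n$. The function names and the sample sizes are assumptions.

```python
import random

def bridge_paths(n, sigma, trials, seed=0):
    """Simulate random walks of n Gaussian steps pinned at zero at both ends."""
    rng = random.Random(seed)
    paths = []
    for _ in range(trials):
        steps = [rng.gauss(0.0, sigma) for _ in range(n)]
        total = sum(steps)
        walk, acc = [0.0], 0.0
        for j, step in enumerate(steps, start=1):
            acc += step
            # Subtracting the linear drift pins the walk at zero for j = n.
            walk.append(acc - total * j / n)
        paths.append(walk)
    return paths

def empirical_variance(paths, j):
    """Second moment at position j (the mean of the bridge is zero)."""
    return sum(p[j] ** 2 for p in paths) / len(paths)

n, sigma = 20, 1.0
paths = bridge_paths(n, sigma, trials=4000, seed=1)
mid_var = empirical_variance(paths, n // 2)
mid_var_theory = sigma ** 2 * (n // 2) * (n - n // 2) / n  # 5.0 for n = 20
```

The empirical variance is largest in the middle of the bridge and vanishes at both ends, which matches the behavior of the global increment errors described above.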
By piecewise linear interpolation of these errors from the coarse time grid $i_0, i_1, \ldots, i_n$ to the fine time grid $i = 0, 1, \ldots, N$ (in accordance with the linear stitching procedure used in ABBA), we find that

$$e_i^2 \leq (s \cdot \mathrm{tol}_s)^2 \cdot \frac{\frac{in}{N} \left( n - \frac{in}{N} \right)}{n}, \quad i = 0, \ldots, N,$$

using that the interpolated quadratic function on the right-hand side is concave. We can now bound the squared Euclidean norm of this fine-grid "worst-case" realization as

$$\sum_{i=0}^{N} e_i^2 \leq \frac{(s \cdot \mathrm{tol}_s)^2}{n} \sum_{i=0}^{N} \frac{in}{N} \left( n - \frac{in}{N} \right) = (s \cdot \mathrm{tol}_s)^2 \cdot \frac{n(N^2 - 1)}{6N} \leq (s \cdot \mathrm{tol}_s)^2 \cdot \frac{nN}{6}.$$

This is a probabilistic bound on the squared Euclidean error caused by a "worst-case" realization of the Brownian bridge, and thereby a probabilistic bound on the error incurred from the digitization procedure. Equating this bound with the bound (3) on the accuracy of the compression, we find that we should choose

$$\mathrm{tol}_s = \frac{\mathrm{tol}}{s} \sqrt{\frac{6(N - n)}{Nn}}$$

with the user-specified tolerance tol. We have experimentally determined that s = 0.2 typically gives a good balance between the compression accuracy and the number of clusters determined using this criterion.
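As a small worked example, the balancing of the two error bounds can be evaluated numerically. The sketch assumes that the digitization bound takes the form $(s \cdot \mathrm{tol}_s)^2 \cdot nN/6$ and that equating it with the compression bound $(N - n) \cdot \mathrm{tol}^2$ gives $\mathrm{tol}_s = (\mathrm{tol}/s) \sqrt{6(N - n)/(Nn)}$; the function name `digitization_tolerance` is hypothetical.

```python
import math

def digitization_tolerance(tol, N, n, s=0.2):
    """tol_s obtained by equating (s*tol_s)^2 * n*N/6 with (N - n) * tol^2.

    Assumption: this closed form follows from the balance described in the text.
    """
    return tol / s * math.sqrt(6.0 * (N - n) / (N * n))

# Values of the heat exchanger example: N = 7127, n = 123, tol = 0.1.
tol_s = digitization_tolerance(0.1, N=7127, n=123)   # about 0.11
compression_bound = math.sqrt(7127 - 123) * 0.1      # about 8.4
```

The resulting $\mathrm{tol}_s$ shrinks as the number of pieces n grows, so longer symbolic strings are digitized with tighter clusters in order to keep the accumulated error comparable to the compression error.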
Example: We now illustrate the above analysis on a challenging real-world example. Consider a time series $T$ ($N = 7127$) consisting of temperature readings off a heat exchanger in an ethylene cracker. We use tol = 0.1 to compress this time series, resulting in a polygonal chain $\widetilde{T}$ with $n = 123$ pieces and an approximation error of $\mathrm{euclid}(T, \widetilde{T}) = 5.3 \leq \sqrt{N - n} \cdot \mathrm{tol} \approx 8.4$. See Figure 8 for a plot of the original time series $T$ and its reconstruction $\widetilde{T}$ after compression.
We then run the ABBA digitization procedure with scaling parameter scl = 0, resulting in a symbolic representation $S$ of length $n$ using $k = 14$ symbols. In Figure 7 we show the "global increment errors" $e_{i_j}$ of the reconstruction $\widehat{T}$ on each piece $P_j$, that is, the increment deviation of $\widehat{T}$ from $\widetilde{T}$ at the endpoints of $P_j$, $j = 1, \ldots, n$. Note how this error is pinned at zero at $j = 0$ and $j = n$, and how it resembles a random walk in between.
The reconstruction $\widehat{T}$ on the fine time grid is also shown in Figure 8. The reconstruction error measured in the time warping distance is $\mathrm{dtw}(\widetilde{T}, \widehat{T}) = 9.5$ and the overall error is $\mathrm{dtw}(T, \widehat{T}) = 10.8$, both of which are approximately of the same order as $\sqrt{N - n} \cdot \mathrm{tol} \approx 8.4$. Note that the ABBA reconstruction $\widehat{T}$ visually deviates a lot from $T$ due to the rather high tolerance we have chosen for illustration, but nevertheless, the characteristic up-and-down behavior of $T$ is well represented in $\widehat{T}$, despite the high compression rate of 123/7128 ≈ 1.7 %.

When the scaling parameter is scl = 0 or scl = ∞, our implementation calls an adaptation of the univariate k-means algorithm from the R package Ckmeans.1d.dp written in C++. We use SWIG, the open-source "Simplified Wrapper and Interface Generator", to call C++ functions from Python. If scl ∈ (0, ∞), we use the k-means algorithm from the Python sklearn library [38].

ABBA uses the lengths and increments of a polygonal chain on each segment to construct its symbolic time series representation. Symbolic Polynomial [19] (with d = 1) and 1d-SAX [35], on the other hand, use linear regression to fit a polynomial to a window of fixed pre-specified length. As we discussed in Section 2, Symbolic Polynomial provides no dimensional reduction and was specifically designed for time series classification problems. Most other SAX variants increase the length of the symbolic representation by enhancing the string with additional characters to capture shapes and trends. It is not clear whether these representations outperform SAX with a reduced width parameter to compensate for the increased string length. A comparison of this would be interesting but is independent of ABBA's performance and out of the scope of this paper. SMTS [6] and aSAX [40] use machine learning techniques to discretize their representation. SMTS is primarily designed for multivariate time series and provides no dimensional reduction.
EN-SAX [5] and aSAX suffer from a loss of the trend information in their compression step.
For these reasons, we focus on profiling the reconstruction errors of the ABBA, SAX [32], and 1d-SAX [35] algorithms, as these are most closely related and easily comparable. Note that none of the representations were primarily designed as compression algorithms. ABBA was designed to be adaptive in both segment length and alphabet cardinality, whereas SAX and 1d-SAX have many other benefits such as being hashable [12], indexable [41], and permitting lower bounding distance measures. Our test set consists of all time series in the UCR Time Series Classification Archive [13] with a length of at least 100 data points. There are 128,978 such time series from a variety of applications. Although the archive is primarily intended for benchmarking time series classification algorithms, our primary focus in this paper is on the approximation performance of the symbolic representations. Our experiment consists of converting each time series $T = [t_0, t_1, \ldots, t_N]$ into its symbolic representation $S = [s_1, \ldots, s_n]$, and then measuring the distance between the reconstruction $\widehat{T} = [\widehat{t}_0, \widehat{t}_1, \ldots, \widehat{t}_N]$ and $T$ in the (differenced) Euclidean and DTW norms, respectively.
Recall from Section 2 that both SAX and 1d-SAX require a choice for the fixed segment length. In order to provide a fair comparison, we first run the ABBA compression with an initial tolerance tol = 0.05. This returns n, the number of pieces required to approximate T to this tolerance. If n turns out to be larger than N/5, we successively increase the tolerance by 0.05 and rerun until n ≤ N/5, that is, until the time series is compressed to at most 20 % of its length. If a time series cannot be compressed to this level even at the rather crude tolerance of tol = 0.5, we consider it too noisy and exclude it from the test. We also exclude all time series which, after ABBA compression, result in fewer than nine pieces: this is necessary because we want to use k = 9 symbols for all compared methods. Table 2 shows how many of the 111,889 remaining time series were compressed at what tolerance. The table gives evidence that most of these time series can be compressed reasonably well while maintaining a rather high accuracy. The average compression rate is 10.3 %. After the number of pieces n has been specified for a given time series T, we determine the fixed segment length $\mathrm{len} = \lfloor (N + 1)/n \rfloor$ to be used in the SAX and 1d-SAX algorithms. We then apply SAX and 1d-SAX to the first n · len points of T. This guarantees that all three algorithms (SAX, 1d-SAX, and ABBA) produce a symbolic representation with n pieces. If N + 1 is not divisible by n, SAX and 1d-SAX are applied to a slightly shorter time series than ABBA. The number of symbols used for the digitization is k = 9 for all three methods. In the case of 1d-SAX this means that three symbols are used for the mean value and three symbols are used for the slope on each piece. Each algorithm thus produces a symbolic representation of length n using an alphabet of cardinality k = 9. SAX and 1d-SAX require the values of w and k for the reconstruction, whereas ABBA requires the 2k numbers representing the lengths and increments of each cluster. In total, ABBA requires more storage to represent a time series using a string of length n and an alphabet of cardinality k, but is able to represent the whole time series more accurately without truncation.
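The tolerance selection and the matching SAX segment length described above can be sketched as follows. This is only an illustration under stated assumptions: `count_pieces` is a hypothetical stand-in for the ABBA compression step (it is not part of any published interface), and the function and parameter names are chosen here for readability.

```python
from typing import Callable, Optional, Sequence, Tuple


def choose_tolerance(
    ts: Sequence[float],
    count_pieces: Callable[[Sequence[float], float], int],
    min_pieces: int = 9,
    step: float = 0.05,
    max_tol: float = 0.5,
) -> Optional[Tuple[float, int, int]]:
    """Return (tol, n, seg_len), or None if the series is excluded.

    `count_pieces(ts, tol)` is a hypothetical callable standing in for
    the ABBA compression; it must return the number of pieces n.
    """
    N = len(ts) - 1  # the series is T = [t_0, ..., t_N]
    steps = round(max_tol / step)
    for i in range(1, steps + 1):
        tol = round(i * step, 10)  # 0.05, 0.10, ..., 0.50
        n = count_pieces(ts, tol)
        if n <= N / 5:  # compressed to at most 20 % of its length
            if n < min_pieces:
                return None  # too few pieces for k = 9 symbols
            seg_len = (N + 1) // n  # fixed window for SAX and 1d-SAX
            return tol, n, seg_len
    return None  # considered too noisy, even at tol = max_tol
```

With the returned values, SAX and 1d-SAX would be applied to `ts[: n * seg_len]`, so that all three methods produce n pieces.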
To visualize the results of our comparison we use performance profiles [14]. Performance profiles allow one to compare the relative performance of multiple algorithms over a large set of test problems. Each algorithm is represented by a non-decreasing curve in a θ-p graph. The θ-axis represents a tolerance θ ≥ 1 and the p-axis corresponds to a fraction p ∈ [0, 1]. If a curve passes through a point (θ, p), it means that the corresponding algorithm performed within a factor θ of the best observed performance on 100 · p % of the test problems. For θ = 1 one can read off on what fraction of all test problems each algorithm was the best performer, while as θ → ∞ all curves approach the value p → 1 (unless an algorithm has failed on a fraction of the test problems, which is not the case here).
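As an illustration of how such a profile is computed, the following minimal sketch evaluates the fraction p for a list of tolerances θ. It assumes that every algorithm reports a strictly positive cost (here, a reconstruction distance) on every test problem; the function name and data layout are illustrative choices, not taken from [14].

```python
from typing import Dict, List, Sequence


def performance_profile(
    costs: Dict[str, Sequence[float]], thetas: Sequence[float]
) -> Dict[str, List[float]]:
    """For each algorithm, return the fraction p of problems on which its
    cost is within a factor theta of the best cost, for each theta."""
    names = list(costs)
    n_problems = len(costs[names[0]])
    # Best (smallest) cost observed on each test problem.
    best = [min(costs[a][i] for a in names) for i in range(n_problems)]
    profile = {}
    for a in names:
        # Performance ratio relative to the best algorithm, always >= 1.
        ratios = [costs[a][i] / best[i] for i in range(n_problems)]
        profile[a] = [sum(r <= th for r in ratios) / n_problems for th in thetas]
    return profile
```

Plotting each returned list against `thetas` gives one non-decreasing curve per algorithm; the value at θ = 1 is the fraction of problems on which that algorithm was (jointly) best.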
In Figures 9a-10d we present eight performance profiles for the ABBA scaling parameters scl = 0 and scl = 1, each with four different distance measures: the Euclidean and DTW distances and their differenced counterparts. Figure 9a shows the performance profile for scl = 0, with the distance between $T$ and $\widehat{T}$ measured in the Euclidean norm. As expected, SAX consistently outperforms ABBA because the Euclidean distance is very sensitive to horizontal shifts in the time direction, which ABBA ignores completely when scl = 0. However, it is somewhat surprising that SAX also outperforms 1d-SAX. It appears that the use of slope information in 1d-SAX is detrimental to the approximation accuracy and that, if the number of symbols is kept constant, the symbols are better used to represent the time series values alone. This observation can also be made in the other performance profiles: irrespective of the distance measure being used, SAX with k = 9 symbols performs better than 1d-SAX with k = 9 symbols.
The performance changes when we use the DTW distance, thereby allowing for shifts in time. In this case, ABBA outperforms SAX and 1d-SAX significantly; see Figure 9b. This is because ABBA has been tailored to preserve the up-and-down shape of the time series, at the cost of allowing for small errors in the lengths of the pieces which are easily corrected by time warping. The performance gain of ABBA becomes even more pronounced when we difference the data before computing the Euclidean and DTW distances; see Figures 9c and 9d, respectively.
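The effect of time warping on a shifted shape can be seen in a small example. The sketch below uses a textbook dynamic-programming DTW with squared pointwise costs and a final square root; this particular variant is an assumption made for illustration and need not coincide in every detail with the distance defined in Section 3.

```python
import math
from typing import List, Sequence


def euclid(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two series of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw(x: Sequence[float], y: Sequence[float]) -> float:
    """Textbook DTW: minimal root-sum-of-squares cost over warping paths."""
    inf = float("inf")
    # D[i][j] = minimal accumulated squared cost aligning x[:i] with y[:j].
    D = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    D[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[-1][-1])


def diff(x: Sequence[float]) -> List[float]:
    """First-order differences x[i+1] - x[i]."""
    return [b - a for a, b in zip(x, x[1:])]


# A triangular pulse and a copy shifted by one sample in time.
x = [0, 0, 1, 2, 1, 0, 0, 0]
y = [0, 0, 0, 1, 2, 1, 0, 0]
print(euclid(x, y), dtw(x, y))                    # 2.0 0.0
print(euclid(diff(x), diff(y)), dtw(diff(x), diff(y)))
```

The Euclidean distance penalizes the one-sample shift, whereas DTW aligns the two pulses exactly. The differenced variants are obtained by applying `diff` to both series before measuring the distance.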
In the next four tests we set scl = 1, so that the ABBA clustering procedure weights the increments and lengths equally. Figures 10a and 10b show the resulting performance profiles using the Euclidean and DTW distance measures, respectively. As expected, ABBA becomes more competitive even for the Euclidean distance measure. Computationally, however, this comes at the cost of not being able to use a fast optimal 1d-clustering algorithm. Finally, Figures 10c and 10d show the performance profiles for the Euclidean and DTW distance measures on the differenced data, respectively. As in the case scl = 0, differencing improves the performance of ABBA in comparison to SAX and 1d-SAX even further.¹

Further discussion and applications
Section 6 demonstrated that ABBA provides high compression rates while guaranteeing that the time series reconstruction is still close to the original. The high compression is a consequence of the stitching procedure during the compression stage. Section 5 showed how errors are accumulated piece by piece in the stitching process. We believe that this property prevents ABBA from admitting lower bounding distance measures as are available for SAX. SAX's lower bounding measure and indexability make it suitable for applications where multiple time series have to be compared (like time series classification). ABBA, on the other hand, appears best suited for applications where information has to be extracted from a single time series, such as anomaly detection, motif discovery, and trend prediction. As the output of ABBA is simply a string sequence, it can be combined with existing algorithms that previously used, e.g., a SAX representation. Below we discuss various aspects and applications of ABBA.
In-built differencing. Working with the increments (instead of slopes) allows ABBA to capture linear trends in time series without preprocessing. In Figure 11 we consider the simple test problem of a sine wave with a gradual linear trend in the presence of noise. After normalization, SAX is able to accurately represent the time series as shown in Figure 11(i). If we used the symbolic representation for trend prediction, however, the SAX representation would be unsuitable for continuing the linear trend as new symbols would need to be introduced. Of course, this problem could be overcome by removing the linear trend through differencing the time series. A SAX representation of the differenced time series is shown in Figure 11(ii). Unfortunately, differencing the noisy time series amplifies the noise. Figure 11(iii) compares the original time series against the reconstructed time series from the SAX representation of the differenced data. As we can see, the increased noise level renders the SAX representation extremely inaccurate. ABBA, on the other hand, does not require any differencing as it works with increments by default. As a consequence, the ABBA reconstruction shown in Figure 11(iv) stays very close to the original time series, capturing both the gradual linear trend and the characteristic up-and-down behavior.
Anomaly detection refers to the problem of finding points or intervals in time series which display surprising or unexpected behavior. Recent literature reviews of existing anomaly detection algorithms are given in [4,21]. The ABBA representation can be used for anomaly detection in a variety of ways. Trend anomalies can be detected in the digitization procedure via k-means clustering of the lengths and increments. The alphabet is ordered such that 'a' is the most frequent symbol, followed by 'b', and so forth. If the kth cluster contains very few elements relative to the other clusters, then this might be considered a trend anomaly.
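The frequency ordering of the alphabet and the flagging of sparsely populated clusters can be sketched as follows. The function names and the threshold `max_share` are hypothetical; the text does not prescribe how few elements make a cluster anomalous, so the cut-off is left as a user choice. The sketch also assumes at most 26 clusters so that lowercase letters suffice.

```python
import string
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def relabel_by_frequency(labels: Sequence[int]) -> Tuple[str, Dict[str, int]]:
    """Map cluster labels to letters so that 'a' is the most frequent."""
    counts = Counter(labels)
    # most_common() sorts by decreasing count; ties keep first-seen order.
    order = [lab for lab, _ in counts.most_common()]
    to_symbol = {lab: string.ascii_lowercase[i] for i, lab in enumerate(order)}
    symbols = "".join(to_symbol[lab] for lab in labels)
    return symbols, {to_symbol[lab]: c for lab, c in counts.items()}


def rare_symbols(freq: Dict[str, int], max_share: float = 0.05) -> List[str]:
    """Symbols whose share of all pieces is at most `max_share`
    (a hypothetical threshold chosen for illustration)."""
    total = sum(freq.values())
    return sorted(s for s, c in freq.items() if c / total <= max_share)
```

Here `labels` would be the cluster index assigned to each piece by the digitization step; pieces carrying a symbol returned by `rare_symbols` are candidates for trend anomalies.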
TARZAN [24] is a popular anomaly detection algorithm with linear time and space complexity [39]. The algorithm requires two time series, a reference time series R containing normal behavior and the test time series X. Both time series are converted to a symbolic representation and stored in a suffix tree [36]. An anomaly score is computed by comparing the frequency of a substring in X to an expected frequency computed from R. SAX can be used for the discretization process in TARZAN and has been shown to outperform other symbolic representations with no dimensional reduction [32].
If both symbolic representations are short and X contains a symbol that does not appear in R, then the TARZAN score can suffer from a lack of perspective. For example, suppose the expected frequency of the substring 'abc' is 4.2 and 'abc' appears 3 times in X; then the anomaly score is 3 − 4.2 = −1.2. Suppose further that the symbol 'd' does not appear in R but 'ada' appears in X. The expected frequency of the substring 'ada' is 0 and 'ada' appears only once, so the anomaly score is 1 − 0 = 1. In absolute value, this implies that 'abc' is more of an anomaly than 'ada', even though 'ada' contains a symbol never seen in the reference. This issue can be overcome by dividing the anomaly score by the larger of the expected and actual frequencies.
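The normalization can be checked on the two substrings from the example. The sketch below uses the sign convention "actual minus expected" of the first example; the function names and the convention of returning 0 when both frequencies vanish are assumptions made here for illustration.

```python
def tarzan_score(actual: float, expected: float) -> float:
    """Raw score: observed count in X minus expected count from R."""
    return actual - expected


def normalised_score(actual: float, expected: float) -> float:
    """Raw score divided by the larger of the two frequencies, so the
    result lies in [-1, 1]; defined as 0 when both frequencies are 0."""
    largest = max(actual, expected)
    return 0.0 if largest == 0 else (actual - expected) / largest


# 'abc': expected 4.2, observed 3 times; 'ada': expected 0, observed once.
abc = normalised_score(3, 4.2)   # about -0.29
ada = normalised_score(1, 0)     # 1.0
```

After normalization, 'ada' attains the largest possible magnitude, while 'abc' is only a mild deviation, which matches the intuition that an unseen symbol is the stronger anomaly.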
In Figures 12 and 13 we consider a simple experiment comparing SAX, 1d-SAX, and ABBA as discretization procedures for TARZAN with the modified anomaly score.² The reference time series R is a simple sine wave where each period spans 25 time samples. The time series X has a full wave replaced by a flat line of 22 time points. The SAX and 1d-SAX representations use a window length w = 5 and k = 9 symbols, whereas ABBA uses a tolerance tuned to give a symbolic representation of equal length, with k bounded by 9. The time series R and X and their symbolic reconstructions are shown in Figure 12. If the length of the anomaly does not align with the window length w, then SAX and 1d-SAX tend to represent the sine wave following the anomaly as a different substring. The adapted TARZAN score is required as certain symbols appear in X that do not appear in R. Figure 13 shows the resulting TARZAN anomaly scores. Both SAX and 1d-SAX suffer from the fixed window length, returning high anomaly scores for the whole period following the anomaly, whereas TARZAN using ABBA is able to recover almost immediately after the anomaly due to the adaptive segment lengths.
Fig. 13: A comparison of the TARZAN anomaly detection algorithm using the SAX, 1d-SAX, and ABBA representations, respectively. The first time series R is the reference, while the second time series X is to be tested. The final three plots show the adapted TARZAN anomaly scores for the SAX, 1d-SAX, and ABBA representations, respectively. The black dashed lines indicate tolerances that could be used to define the anomalies.
VizTree. We finally mention the possibility of representing an ABBA output as a VizTree, a time series pattern discovery and visualization tool based on suffix trees [29,30,31]. The VizTree authors use SAX to discretize the time series before building a suffix tree. Each branch of the suffix tree represents a substring and the thickness of that branch represents the frequency of the substring in the symbolic representation. In principle, SAX pairs well with the visualization as the Gaussian breakpoints should ensure that all symbols are equally likely to appear. In practice, this is often not the case. One could use ABBA's discretization process instead of SAX by relating the thickness of each line to the frequency of the symbols determined in the clustering procedure. A poor choice of the window length w in the piecewise aggregate approximation in SAX could lead to missing motifs if the distance between motif occurrences is not near a multiple of w. Furthermore, SAX might fail to detect motifs if time warping has occurred, whilst VizTree via ABBA should be able to better capture time-warped motifs as the segment lengths are chosen adaptively. A further exploration of this application will be the subject of future work.

Conclusions and future work
We introduced ABBA, an adaptive symbolic time series representation which aims to preserve the essential shape of a time series. We have shown that the ABBA representation has favorable approximation properties compared to other popular representations, in particular when the dynamic time warping distance is used. Furthermore, we demonstrated the use of ABBA in some important data mining applications, including trend prediction and anomaly detection. Future research will be devoted to an online streaming version of ABBA with the necessary adaptations of the Brownian bridge-based error analysis, as well as a more in-depth study of VizTree visualizations. Our recent work [15] explores ABBA's potential for time series forecasting.
[...] Classification Archive. We also thank the three anonymous referees and the editor for their helpful comments, which significantly improved the paper.