A Review and Evaluation of Elastic Distance Functions for Time Series Clustering

Time series clustering is the act of grouping time series data without recourse to a label. Algorithms that cluster time series can be classified into two groups: those that employ a time series specific distance measure; and those that derive features from time series. Both approaches usually rely on traditional clustering algorithms such as $k$-means. Our focus is on distance based time series clustering that employs elastic distance measures, i.e. distances that perform some kind of realignment whilst measuring distance. We describe nine commonly used elastic distance measures and compare their performance with k-means and k-medoids clustering. Our findings are surprising. The most popular technique, dynamic time warping (DTW), performs worse than Euclidean distance with k-means, and even when tuned, is no better. Using k-medoids rather than k-means improved the clusterings for all nine distance measures. DTW is not significantly better than Euclidean distance with k-medoids. Generally, distance measures that employ editing in conjunction with warping perform better, and one distance measure, the move-split-merge (MSM) method, is the best performing measure of this study. We also compare against clustering with DTW using barycentre averaging (DBA). We find that DBA does improve DTW k-means, but that it is still worse than using MSM. Our conclusion is to recommend MSM with k-medoids as the benchmark algorithm for clustering time series with elastic distance measures. We provide implementations in the aeon toolkit, and results and guidance on reproducing results on the associated GitHub repository.


Introduction
Clustering is an unsupervised analysis technique where a set of cases, defined as a vector of continuous or discrete variables, are grouped to create clusters which contain cases considered to be homogeneous, whereas cases in different clusters are considered heterogeneous (Bonner, 1964).
Time series clustering is the act of grouping ordered, time series data without recourse to a label. We use the acronym TSCL for time series clustering, to make a distinction between TSCL and time series classification, which is commonly referred to as TSC. There have been a wide range of algorithms proposed for TSCL. Our focus is specifically on clustering using elastic distance measures, i.e. distance measures that use some form of realignment of the series being compared. Our aim is to perform a comparative study of these algorithms that follows the basic structure of bake offs in distance based TSC (Lines and Bagnall, 2015), univariate TSC (Bagnall et al., 2017) and multivariate TSC (Ruiz et al., 2021). A huge number of alternative approaches have been proposed for TSCL, e.g. transformation based (Li et al., 2021), deep learning based (Lafabregue et al., 2022) and statistical model based (Caiado et al., 2015) algorithms. These approaches are not the focus of this research. Our aim is to provide a detailed description of nine commonly used elastic distance measures and conduct extensive experiments to compare their utility for TSCL.
Clustering is often the starting point for exploratory data analysis, and is widely performed in all fields of data science (Zolhavarieh et al., 2014). However, clustering is harder to define than, for example, classification. There is debate about what clustering means (Jain and Dubes, 1988) and no accepted standard definition of what constitutes a good clustering or what it means for cases to be homogeneous or heterogeneous. For example, homogeneous could mean generated by some common underlying process, or sharing some common hidden variable. A clustering of patients based on some medical data might group all male patients in one cluster and all female patients in another. The clusters are from one view homogeneous, but that does not mean it is necessarily a good clustering. Interpreting the usefulness of a clustering of a particular dataset requires domain knowledge. This makes comparing algorithms over a range of problems more difficult than performing a bake off of TSC algorithms. Nevertheless, there have been numerous comparative studies (for example, (Aghabozorgi et al., 2015) and (Javed et al., 2020)) which take TSC problems, then evaluate clusterings based on how well they recreate class labels. We aim to add to this body of knowledge with a detailed description and a reproducible evaluation of clustering with elastic distance measures (described in Section 3) using the UCR datasets (Dau et al., 2019). We do this in the context of partitional clustering algorithms (see Section 2). Our first null hypothesis is that elastic distance measures do not perform any better than Euclidean distance with k-means clustering. Elastic distance measures are significantly more accurate on average for TSC when used with a one nearest neighbour classifier (Lines and Bagnall, 2015). We evaluate whether this improvement translates to centroid based clustering. We conclude that, somewhat surprisingly, this is not the case with k-means clustering. Only one elastic distance measure, move-split-merge (Stefan et al., 2013), is significantly better than Euclidean distance, and five of those considered are significantly worse, including the most popular approach, dynamic time warping (DTW). We believe we have detected this difference where others have not because of two factors: firstly, we normalise all series, thus removing the confounding effect of scale; secondly, we evaluate on separate test sets, rather than compare algorithms on train data performance. We explore using dynamic time warping with barycentre averaging (DBA) to improve k-means (Petitjean et al., 2011). DBA finds centroids by aligning cluster members and averaging over values warped to each location. We reproduce the reported improvement that DBA brings to DTW with k-means, but we observe that the improvement is not enough to make it better than Euclidean distance, and it comes with a heavy computational overhead. Our first conclusion is that k-means clustering with elastic distances is generally not a useful benchmark for TSCL, and if it is used as such, it should employ MSM distance. We then experiment with k-medoids TSCL. Using medoids, data points in the training data, rather than averaged cluster members, overcomes the mismatch between simple averaging and elastic distances that barycentre averaging is designed to mitigate. We found that k-medoids improved the clustering of every elastic distance measure, and that again MSM was the top performing algorithm. Our concluding recommendation is that k-medoids with MSM should be the standard benchmark clusterer against which new algorithms should be assessed. We have provided optimised Python implementations for the distances and clusterers in the aeon toolkit (https://github.com/aeon-toolkit), which provides notebooks on using distances and clusterers and contains all our results with guidance on how to reproduce them.
The remainder of this paper is structured as follows: Section 2 describes the general TSCL problem, and Section 3 gives a detailed description of the nine elastic distance measures from the literature that are used in our experiments. For more general background on clustering, we direct the reader to (Jain et al., 1999; Saxena et al., 2017). Section 5 gives details of the data we use in the analysis and reviews performance metrics for clustering algorithms. Section 6 presents the results of our experiments and describes our analysis of performance on UCR datasets. Finally, Section 7 concludes and signposts the future direction of our research.

Time Series Clustering Background
A time series $x$ is a sequence of $m$ observations, $(x_1, \ldots, x_m)$. We assume all series are equal length. For univariate time series, the $x_i$ are scalars; for multivariate time series, the $x_i$ are vectors. A time series data set, $D = \{x_1, x_2, \ldots, x_n\}$, is a set of $n$ time series cases. A clustering is some form of grouping of cases. Broadly speaking, clustering algorithms are either partitional or hierarchical. Partitional clustering algorithms assign (possibly probabilistic) cluster membership to each time series, usually through an iterative heuristic process of optimising some objective function that measures homogeneity. Given a dataset of $n$ time series, $D$, the partitional time series clustering problem is to partition $D$ into $k$ disjoint clusters, where $k$ is the number of clusters. It is usually assumed that the number of clusters is set prior to the optimisation heuristic. If $k$ is not set, it is commonly found through some wrapper method. We assume $k$ is fixed in advance for all our experiments.
Hierarchical clustering algorithms form a hierarchy of clusters in the form of a dendrogram, a tree where all elements are in a single cluster at the root and in a unique cluster at the leaves of the tree. This presents the analyst with a range of clusterings with different numbers of clusters. Time series hierarchical clustering is beyond the scope of this study.
As with classification, clustering algorithms can be split into those that work directly with the time series, and those that employ a transformation to derive features prior to clustering. The focus of this study is on non-probabilistic partitional clustering algorithms that work directly with time series.

Partitional Time Series Clustering Algorithms
Partitional clustering algorithms share the same basic components, built around example cases (which we call exemplars) that characterise clusters: initialise the cluster exemplars; assign cases to clusters based on their distance to the exemplars; and update the exemplars based on the revised cluster membership. Iterations of assign and update are repeated until some convergence condition is met.
k-means (MacQueen et al., 1967), also known as Lloyd's algorithm (Lloyd, 1982), is the most well known and popular partitional clustering algorithm, in both standard clustering and time series clustering. The algorithm uses k centroids as exemplars for each cluster. A centroid is a summary of the members of a cluster found through the update operation, which for standard k-means involves averaging each time point over cluster members. Each case is assigned to the cluster of its closest centroid, as defined by a distance measure.
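The assign/update cycle described above can be sketched as follows. This is a minimal illustration with a pluggable distance and averaging function; the function name, signature and `init` argument are our own invention for exposition, not part of the aeon API.

```python
import numpy as np

def kmeans(X, k, dist, mean_fn, n_iters=50, init=None, rng=None):
    """Minimal Lloyd-style k-means with a pluggable distance and
    averaging function. X is an (n, m) array of n series."""
    rng = np.random.default_rng(rng)
    if init is None:
        # Choose k random training instances as initial centroids.
        init = rng.choice(len(X), size=k, replace=False)
    centroids = X[np.asarray(init)].astype(float)
    labels = None
    for _ in range(n_iters):
        # Assign: each series joins the cluster of its closest centroid.
        new_labels = np.array(
            [np.argmin([dist(x, c) for c in centroids]) for x in X]
        )
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no change in cluster membership
        labels = new_labels
        # Update: recompute each centroid as the average of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = mean_fn(members)
    return labels, centroids
```

Swapping `dist` for an elastic distance and `mean_fn` for an elastic averaging method gives the time series variants discussed later.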
Many enhancements of the core algorithm have been proposed. One of the most effective refinements is changing how the exemplars are initialised. By default, the initial centroids for k-means are chosen randomly, either by randomly assigning cluster membership then averaging, or by choosing random instances from the training data as initial centroids. However, this risks choosing centroids that are in fact in the same cluster, making it harder to split the clusters. To address this problem, forms of restart and selection are often employed. For example, (Bradley and Fayyad, 1998) propose restarting k-means over subsamples of the data and using the resulting centroids to seed the full run. Another solution is to run the algorithm multiple times and keep the model that yields the best result according to some unsupervised performance measure.
k-means assumes the number of clusters, k, is set a priori. There are a range of methods for finding k. These often involve iteratively increasing the number of clusters until some stopping condition is met, using some form of elbow finding or an unsupervised quality measure such as the silhouette value (Lletı et al., 2004). Time series specific enhancements concern the distance measure used and the averaging technique used to recalculate centroids. Averaging series matched with an elastic distance measure will often mean that the characteristics that made the series similar are lost in the centroid. (Petitjean et al., 2011) describe an alternative averaging method based on pairwise DTW. This is described in detail in Section 3.8.
k-medoids is an alternative clustering algorithm which uses cases, or medoids, as cluster exemplars. One benefit of using instances as cluster centres is that the pairwise distance matrix of the training data is sufficient to fit the model, and this can be calculated before the iterative partitioning. This is particularly important when performing the update operation, which is the main difference between k-means and k-medoids: k-medoids chooses a cluster member as the exemplar rather than an average. The medoid is the case with the minimum total distance to the other members of the cluster. Refinements such as Partition Around Medoids (PAM) (Leonard Kaufman, 1990) avoid the complete enumeration of total distances through a swapping heuristic. However, with a precalculated distance matrix and access to reasonable computational resources, this approximation is not necessary.
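The exhaustive medoid update from a precomputed pairwise distance matrix can be sketched as below; the function name is ours, and the detail of aeon's k-medoids implementation differs.

```python
import numpy as np

def update_medoids(D, labels, k):
    """Exhaustive medoid update: for each cluster, pick the member with the
    minimum total distance to the other members. D is the precomputed
    n x n pairwise distance matrix, labels assigns each case a cluster."""
    medoids = np.empty(k, dtype=int)
    for j in range(k):
        members = np.flatnonzero(labels == j)
        # Sub-matrix of within-cluster distances; row sums give each
        # candidate's total distance to the rest of the cluster.
        totals = D[np.ix_(members, members)].sum(axis=1)
        medoids[j] = members[np.argmin(totals)]
    return medoids
```

Because only `D` is needed, any elastic distance can be plugged in by precomputing the matrix once before the iterative partitioning starts.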

Time Series Distance Measures
Suppose we want to measure the distance between two equal length, univariate time series, $a = \{a_1, a_2, \ldots, a_m\}$ and $b = \{b_1, b_2, \ldots, b_m\}$. The Euclidean distance, $d_{ed}(a, b) = \sqrt{\sum_{i=1}^{m} (a_i - b_i)^2}$, is the $L_2$ norm between series and a standard starting point for distance based analysis. It puts no priority on the ordering of the series and, when used in TSC, is a poor benchmark for distance based algorithms and a very long way from state of the art (Middlehurst et al., 2021). Elastic distance measures allow for possible error in offset by attempting to optimally align two series based on distorting indices or editing series. These have been shown to significantly improve k-nearest neighbour classifiers in comparison to $d_{ed}$ (Lines and Bagnall, 2015). Our aim is to see if we observe the same improvement with clustering.

Dynamic Time Warping
Dynamic Time Warping (DTW) (Ratanamahatana and Keogh, 2005) is the most widely researched and used elastic distance measure. It mitigates distortions in the time axis by realigning (warping) the series to best match each other. Let $M(a, b)$ be the $m \times m$ pointwise distance matrix between series $a$ and $b$, where $M_{i,j} = (a_i - b_j)^2$. A warping path $P = \langle (e_1, f_1), (e_2, f_2), \ldots, (e_s, f_s) \rangle$ is a set of pairs of indices that define a traversal of matrix $M$. A valid warping path must start at location $(1, 1)$, end at point $(m, m)$ and not backtrack, i.e. $0 \leq e_{i+1} - e_i \leq 1$ and $0 \leq f_{i+1} - f_i \leq 1$ for all $1 \leq i < s$. The distance for any path $P$ of length $s$ is $D_P(a, b) = \sum_{i=1}^{s} M_{e_i, f_i}$. If $\mathcal{P}$ is the space of all possible paths, the DTW path $P^*$ is the path with minimum distance, and hence the DTW distance between series is $d_{dtw}(a, b) = D_{P^*}(a, b)$. The optimal warping path $P^*$ can be found exactly through the dynamic programming formulation described in Algorithm 1. This can be a time-consuming operation, and it is common to put a restriction on the amount of warping allowed. Figure 1 describes the two most frequently used bounding techniques, the Sakoe-Chiba band (Sakoe and Chiba, 1978) and the Itakura parallelogram (Itakura, 1975). In Figure 1 each individual square represents an element of matrix $M$. Applying a bounding constraint (represented by the darker squares in Figure 1) reduces the required computation. The Sakoe-Chiba band creates a warping path window that has the same width along the diagonal of $M$. The Itakura parallelogram allows for less warping at the start or end of the series than in the middle. Algorithm 1 assumes a Sakoe-Chiba band.
More general bands can be imposed in an implementation by setting values outside the band in the matrix, M , to infinity.Figure 3 gives a demonstration of the effect of constraining the warping path on DTW using the same two series from Figure 2. The relatively extreme warping from point 0 to point 5 evident in Figure 2(a) is constrained when the maximum warping allowed is 2 places (w = 0.2) in Figure 3.
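The dynamic programming recurrence with a Sakoe-Chiba band can be written directly as below. This is an unoptimised sketch for exposition (the function name is ours; aeon's Numba-compiled version is far faster), with the window expressed as a fraction of series length, as in our implementations.

```python
import numpy as np

def dtw_distance(a, b, window=1.0):
    """DTW by dynamic programming with a Sakoe-Chiba band.

    window is the band width as a fraction of series length; window=1.0
    allows full warping. Uses squared pointwise differences."""
    m = len(a)
    w = max(1, int(np.ceil(window * m)))  # band half-width in time points
    C = np.full((m + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, m + 1):
        # Only cells within the band around the diagonal are considered.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Best of: match (diagonal), or warping a step on either axis.
            C[i, j] = cost + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    return C[m, m]
```

For example, `dtw_distance` of `[0, 0, 1]` and `[0, 1, 1]` is zero, because the second point of one series can be warped onto the third of the other, whereas the squared Euclidean distance between the same series is one.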
Derivative Dynamic Time Warping

Keogh and Pazzani (2001) proposed a modification of DTW called Derivative Dynamic Time Warping (DDTW) that first transforms each series into a difference series. The difference series of $a$ is $a' = \{a'_2, a'_3, \ldots, a'_{m-1}\}$, where $a'_i$ is the average of the slope of the line between $a_{i-1}$ and $a_i$ and the slope of the line between $a_{i-1}$ and $a_{i+1}$, i.e.

$$a'_i = \frac{(a_i - a_{i-1}) + (a_{i+1} - a_{i-1})/2}{2}$$

for $1 < i < m$. The DDTW distance is then simply the DTW distance between the difference series,

$$d_{ddtw}(a, b) = d_{dtw}(a', b'). \quad (2)$$
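The difference transform can be computed in one vectorised step; `derivative_transform` is an illustrative name, not the aeon API.

```python
import numpy as np

def derivative_transform(a):
    """Difference series in the style of Keogh and Pazzani (2001): each
    interior point becomes the average of the slope from a[i-1] to a[i]
    and half the slope from a[i-1] to a[i+1]. The end points are dropped,
    so the result is two points shorter than the input."""
    a = np.asarray(a, dtype=float)
    return ((a[1:-1] - a[:-2]) + (a[2:] - a[:-2]) / 2.0) / 2.0
```

DDTW is then DTW applied to the transformed series of both inputs.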

Weighted Dynamic Time Warping
A weighted form of DTW (WDTW) was proposed by Jeong et al. (2011). WDTW adds a multiplicative weight penalty based on the warping distance between points in the warping path. It is a smooth alternative to the cutoff point approach of using a warping window. When creating the distance matrix $M$, a weight penalty $w_{|i-j|}$ for a warping distance of $|i-j|$ is applied, so that $M_{i,j} = w_{|i-j|} \cdot (a_i - b_j)^2$. A logistic weight function is proposed in Jeong et al. (2011), so that a warping of $d$ places imposes a weighting of

$$w_d = \frac{w_{max}}{1 + e^{-g \cdot (d - m/2)}},$$

where $w_{max}$ is an upper bound on the weight (set to 1), $m$ is the series length and $g$ is a parameter that controls the penalty level for large warpings. The larger $g$ is, the greater the penalty for warping. Note that WDTW does not benefit from the reduced search space a window induces. The WDTW distance is then

$$d_{wdtw}(a, b) = \min_{P \in \mathcal{P}} \sum_{i=1}^{s} w_{|e_i - f_i|} \cdot (a_{e_i} - b_{f_i})^2. \quad (3)$$

Figure 4 shows the warping resulting from two parameter values. For this example, $g = 0.2$ gives the same warping as full window DTW, but $g = 0.3$ is more constrained. We also investigate the weighted derivative distance (WDDTW), $d_{wddtw}(a, b) = d_{wdtw}(a', b')$.
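The logistic weighting and the resulting full-window WDTW can be sketched as follows (illustrative names and code; the aeon implementation is Numba-compiled).

```python
import numpy as np

def wdtw_weights(m, g=0.05, w_max=1.0):
    """Logistic weight for each possible warping distance |i - j|, as in
    Jeong et al. (2011): w(d) = w_max / (1 + exp(-g * (d - m/2)))."""
    d = np.arange(m)
    return w_max / (1.0 + np.exp(-g * (d - m / 2.0)))

def wdtw_distance(a, b, g=0.05):
    """Full-window DTW where each pointwise cost is scaled by the weight
    of its warping distance |i - j|."""
    m = len(a)
    w = wdtw_weights(m, g)
    C = np.full((m + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            # Larger |i - j| (more warping) attracts a larger weight.
            cost = w[abs(i - j)] * (a[i - 1] - b[j - 1]) ** 2
            C[i, j] = cost + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    return C[m, m]
```

Because the weight grows smoothly with the warping distance, extreme warpings are discouraged without the hard cutoff of a band.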

Longest Common Subsequence
DTW is usually expressed as a mechanism for finding an alignment, so that points are warped onto each other to form a path. An alternative way of looking at the process is as a mechanism for forming a common series between the two input series. With this view, the warping operation can be seen as inserting a gap in one series, or removing an element from another series. This way of thinking derives from approaches for aligning sequences of discrete variables, such as strings or DNA. The Longest Common Subsequence (LCSS) distance for time series is derived from a solution to the problem of finding the longest common subsequence between two discrete series through edit operations. For example, if two discrete series are abaacb and bcacab, the LCSS is baab. Unlike DTW, the LCSS does not provide a path from $(1, 1)$ to $(m, m)$. Instead, it describes edit operations to form a final sequence, and these operations are given a certain cost. So, for example, to edit abaacb into the LCSS baab requires two deletion operations. For DTW, you can then think of the choice between the three warping options in line 5 of Algorithm 1 as $C_{i-1,j}$ being a deletion in series $b$, $C_{i,j-1}$ a deletion in series $a$ and $C_{i-1,j-1}$ a match. The warping path shown in Figure 3, the sequence of pairs $\langle (1,1), (2,1), (3,2), (4,3), (4,4), (5,5), (6,5), (7,5), (8,6), (8,7), (8,8), (8,9), (9,10), (10,10) \rangle$, can instead be expressed as the edited series $\langle (1,1), (3,2), (4,3), (5,5), (8,6), (9,10) \rangle$.
With this representation the warping operations are in fact deletions (or gaps) in series a in positions 2, 6, 7 and 10 and in b in positions 4, 7, 8 and 9. With discrete series, the matches in the common subsequence have the same value. Thus, each pair in the subsequence would be, for example, a letter in common for the two series. Obviously, actual matches in real valued series will be rare.
The discrete LCSS algorithm can be extended to consider real-valued time series by using a distance threshold $\epsilon$, which defines the maximum difference between a pair of values that is allowed for them to be considered a match. The length of the LCSS between two series $a$ and $b$ can be found using Algorithm 2. If two cells are considered the same (line 4), the previously considered LCSS is increased by one. If not, then the LCSS seen so far is carried forward. The LCSS distance between $a$ and $b$ is

$$d_{lcss}(a, b) = 1 - \frac{LCSS(a, b)}{m}. \quad (5)$$

Figure 5 shows the cost match between our two example series. The longest sequence of matches is seven. These are the pairs $\langle (2,1), (3,2), (4,4), (5,5), (6,8), (8,8), (9,10) \rangle$.
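The recurrence of Algorithm 2 and the subtraction in Equation 5 can be sketched as below (illustrative code, not the aeon implementation).

```python
import numpy as np

def lcss_distance(a, b, epsilon=0.05):
    """LCSS for real-valued series: two points match when their absolute
    difference is at most epsilon; the distance is one minus the match
    count normalised by the series length."""
    m = len(a)
    L = np.zeros((m + 1, m + 1), dtype=int)
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= epsilon:
                # Extend the common subsequence found so far.
                L[i, j] = L[i - 1, j - 1] + 1
            else:
                # Carry forward the best subsequence seen so far.
                L[i, j] = max(L[i - 1, j], L[i, j - 1])
    return 1.0 - L[m, m] / m
```

Identical series score zero; series with no pair of values within `epsilon` score one.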

Edit Distance on Real Sequences (EDR)
Algorithm 3 EDR(a, b (both series of length m), $\epsilon$ (equality threshold))
1: Let E be an (m + 1) × (m + 1) matrix initialised to zero, indexed from zero.
2: for i ← 1 to m do
3: for j ← 1 to m do
4:

LCSS was adapted in Chen et al. (2005), where the edit distance on real sequences (EDR) is described. Like LCSS, EDR uses a distance threshold to define when two elements of a series match. However, rather than count matches and look for the longest sequence, EDR applies a (constant) penalty for non-matching elements where deletions, or gaps, are created in the alignment. Given a distance threshold, $\epsilon$, the EDR distance between two points in series $a$ and $b$ is given by Algorithm 3. The EDR distance between $a$ and $b$ is then defined by Equation 6.
At any step, an elastic distance can use one of three costs in forming an alignment: diagonal, horizontal or vertical. In terms of forming a subsequence from series $a$, we can characterise these as the operations match (diagonal), deletion (horizontal) and insertion (vertical). Insertion to $a$ can also be characterised as deletion from $b$, but we retain the match/delete/insert terminology for consistency and clarity. EDR does not satisfy the triangular inequality, as equality is relaxed by assuming elements are equal when the distance between them is less than or equal to $\epsilon$. EDR is very similar to LCSS, but it directly finds a distance rather than simply counting matches, thus removing the need for the subtraction in Equation 5. The resulting cost matrix, shown in Figure 6, can be used to find either an alignment path or a common subsequence.

Edit Distance with Real Penalty (ERP)

An alternative to EDR was proposed in Chen and Ng (2004), where edit distance with real penalty (ERP) was introduced. LCSS produces a subsequence that can have gaps between elements. For example, there is a gap between (3, 2) and (4, 4) in the subsequence shown in Figure 5. ERP calculates a cost matrix $E$ that is more like the matrix $M$ used for DTW than the LCSS matrix: it describes a path alignment of two series based on edits rather than warping. The ERP distance between two series is described in Algorithm 4. The edge terms are initialised to a large constant value (lines 5 and 7). The cost matrix $E$ is then built from either the cost of matching, where $d(a_i, b_j)$ is the distance between two points, or the cost of inserting/deleting a term on either axis ($E_{i-1,j} + d(a_i, g)$ or $E_{i,j-1} + d(g, b_j)$), where $g$ is a constant gap value. The ERP distance between $a$ and $b$ is given by Equation 7.
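The ERP recurrence can be sketched as follows. This illustration uses squared pointwise distances (implementations also use the absolute difference) and cumulative gap penalties on the matrix edges, a common alternative to the large-constant initialisation of Algorithm 4.

```python
import numpy as np

def erp_distance(a, b, g=0.0):
    """ERP: a DTW-like alignment where insertions/deletions are penalised
    by the distance of the skipped value to a constant gap value g."""
    m = len(a)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    E = np.zeros((m + 1, m + 1))
    # Edges: aligning a prefix against an empty series costs the
    # cumulative gap penalty.
    E[1:, 0] = np.cumsum((a - g) ** 2)
    E[0, 1:] = np.cumsum((b - g) ** 2)
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            E[i, j] = min(
                E[i - 1, j - 1] + (a[i - 1] - b[j - 1]) ** 2,  # match
                E[i - 1, j] + (a[i - 1] - g) ** 2,             # gap in b
                E[i, j - 1] + (b[j - 1] - g) ** 2,             # gap in a
            )
    return E[m, m]
```

With `g = 0` on normalised series, skipping a value costs its squared deviation from the mean, so ERP prefers to delete values near zero.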

Move-Split-Merge (MSM)
Algorithm 5 MSM(a, b (both series of length m), c (minimum cost), d (pointwise distance function))

Fig. 7 An example of ERP distance calculation using our example series, with g = 0.

Move-Split-Merge (MSM) (Stefan et al., 2013) is a distance measure that is conceptually similar to ERP. The core motivation for MSM is that the cost of insertion/deletion in ERP is based on the distance of the value from some constant value $g$, and thus ERP prefers inserting and deleting values close to $g$ over other values. Algorithm 5 shows that the major difference is in the deletion/insertion operations on lines 10 and 11. The move operation in MSM
uses the absolute difference rather than the squared distance used for matching in ERP. The insert cost in ERP, $d(a_i, g)$, is replaced by the split operation $C(a_i, a_{i-1}, b_j)$, where $C$ is the cost function given in Equation 8. If the value $a_i$ lies between $a_{i-1}$ and $b_j$, the cost is a constant value $c$. If not, the cost is $c$ plus the minimum of the distance from $a_i$ to the previous point $a_{i-1}$ and the distance from $a_i$ to $b_j$. The delete cost of ERP, $d(g, b_j)$, is replaced by the merge cost $C(b_j, b_{j-1}, a_i)$, which is simply the same operation on the second series. Thus, the cost of splitting and merging values depends on the value itself and adjacent values, rather than treating all insertions and deletions equally as ERP does. The MSM distance between $a$ and $b$ is given by Equation 9 and illustrated in Figure 8. MSM satisfies the triangular inequality and is a metric.
Fig. 8 An example of MSM distance calculation using our example series.
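The split/merge cost and the full recurrence can be sketched as below (illustrative code; aeon provides an optimised implementation).

```python
import numpy as np

def _msm_cost(x, y, z, c):
    """Split/merge cost in the style of Stefan et al. (2013): constant c
    if the new value x lies between its neighbours y and z, otherwise c
    plus the smaller deviation of x from y or z."""
    if (y <= x <= z) or (z <= x <= y):
        return c
    return c + min(abs(x - y), abs(x - z))

def msm_distance(a, b, c=1.0):
    """MSM distance: moves cost the absolute difference between the matched
    values; splits and merges cost _msm_cost, which depends on the value
    being duplicated/removed and its neighbours."""
    m = len(a)
    D = np.zeros((m, m))
    D[0, 0] = abs(a[0] - b[0])
    for i in range(1, m):
        D[i, 0] = D[i - 1, 0] + _msm_cost(a[i], a[i - 1], b[0], c)
    for j in range(1, m):
        D[0, j] = D[0, j - 1] + _msm_cost(b[j], b[j - 1], a[0], c)
    for i in range(1, m):
        for j in range(1, m):
            D[i, j] = min(
                D[i - 1, j - 1] + abs(a[i] - b[j]),               # move
                D[i - 1, j] + _msm_cost(a[i], a[i - 1], b[j], c),  # split
                D[i, j - 1] + _msm_cost(b[j], b[j - 1], a[i], c),  # merge
            )
    return D[m - 1, m - 1]
```

Unlike ERP's constant reference value, the edit costs here are anchored to the local shape of each series.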

Time Warp Edit (TWE)
Introduced in Marteau (2009), Time Warp Edit (TWE) distance is an elastic distance measure described in Algorithm 6. It encompasses characteristics of both warping and editing approaches. The warping, called stiffness, is controlled by a parameter $\nu$. Stiffness enforces a multiplicative penalty on the distance between matched points in a way that is similar to WDTW, where $\nu = 0$ gives no warping penalty. The stiffness is only weighted this way when considering the match option (line 11). For the delete and insert operations (lines 12 and 13), a constant stiffness ($\nu$) and edit ($\lambda$) penalty are applied, since the warping is considered from consecutive points in the same series. An example is shown in Figure 9.
Fig. 9 An example of TWE distance calculation using our example series.
A summary of distance function parameters and their default values in our implementations is given in Table 1. DTW is sensitive to the window parameter (Dau et al., 2018), and large windows can cause pathological warpings. Based on experimental results (Ratanamahatana and Keogh, 2005), we set the default warping window to 0.2. WDTW uses a default scale parameter value for $g$ of 0.05, based on results reported in Jeong et al. (2011). LCSS and EDR both have an $\epsilon$ parameter that is a threshold for similarity: if the difference between two values is below $\epsilon$, the observations are considered a match. The variability in parameter effects depending on the values of the series is one of the arguments for normalising all series. We set the default $\epsilon$ to 0.05. MSM has a single parameter, $c$, representing the cost of the move-split-merge operation; this is set to 1 based on the original paper. The TWE $\lambda$ parameter is analogous to the $c$ parameter in MSM, so we also set it to one, whereas $\nu$ is related to the weighting parameter of WDTW, so we set it to 0.05.

Algorithm 6 TWE(a, b (both series of length m), $\lambda$ (edit cost), $\nu$ (warping penalty factor), d (pointwise distance function))

Table 1 Summary of distance functions, their parameters and the default values

Averaging Time Series
k-means clustering requires a way of characterising a set of time series to form an exemplar. The standard approach for k-means is simply to find the mean of the current members of a cluster over time points. This is appropriate if the distance function is Euclidean distance, since the average centroid is the series with the minimal Euclidean distance to members of the cluster. However, if cluster membership is being assigned based on an elastic distance measure, the average centroid may misrepresent the elements of a cluster; it is unlikely to be the series that minimizes the elastic distance to cluster members. Dynamic time warping with barycentre averaging (DBA) (Petitjean et al., 2011) (see Algorithm 7) was proposed to overcome this limitation in the context of DTW.
DBA is a heuristic to find a series that minimizes the elastic distance, rather than the Euclidean distance, to cluster members. Starting with some initial centre, DBA works by first finding the warping path of each series in the cluster onto the centre. It then updates the centre by finding which points were warped onto each element of the centre for all members of the cluster, then recalculating the centre as the average of these warped points.

Algorithm 7 DTW Barycentre Averaging(c, the initial average sequence; $X_p$, the p time series to average)
1: Let dtw_path be a function that returns a list of tuples that contain the indexes of the warping path between two time series.
2: Let W be a list of empty lists, where $W_i$ stores the values in $X_p$ of points warped onto centre point $c_i$.
3: for x ∈ $X_p$ do
4: P ← dtw_path(x, c)
5: for (i, j) ∈ P do
6:
Suppose function $f(a, b)$ returns the warping path of indexes $P = \langle (e_1, f_1), (e_2, f_2), \ldots, (e_s, f_s) \rangle$ generated by dynamic time warping. Given an initial centre $c = \langle c_1, \ldots, c_m \rangle$, DBA is described by Algorithm 7. It warps each series onto $c$ (line 4), then uses the warping path to associate the values in the series with the values in the barycentre (lines 5 and 6). Once this is finished for all series, the average of the values warped to each index of the centroid is taken. This is meant to better characterise the cluster members in the iterative partitioning of the clustering algorithm.
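One DBA update step can be sketched as follows. `dtw_path` and `dba_update` are illustrative names; this path function uses a full window and squared pointwise distances.

```python
import numpy as np

def dtw_path(a, b):
    """Full-window DTW returning the optimal warping path as a list of
    0-indexed (i, j) pairs, found by backtracking the cost matrix."""
    m, n = len(a), len(b)
    C = np.full((m + 1, n + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            C[i, j] = cost + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    # Backtrack from (m, n) to (1, 1), always taking the cheapest parent.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([C[i - 1, j - 1], C[i - 1, j], C[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def dba_update(centre, X):
    """One DBA iteration: warp every series onto the centre, then replace
    each centre point by the mean of all values warped onto it."""
    buckets = [[] for _ in centre]
    for x in X:
        for i, j in dtw_path(x, centre):
            buckets[j].append(x[i])
    return np.array([np.mean(bk) if bk else cj
                     for bk, cj in zip(buckets, centre)])
```

In practice the update is iterated a fixed number of times or until the centre stops changing.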

Implementation
We have implemented the nine distance functions used in the evaluation in aeon, a scikit-learn compatible Python package. The distance functions are all implemented using Numba, which uses just in time compilation to dynamically convert Python to C. There are of course numerous open source packages that offer some form of elastic distance functions, most commonly variants of DTW. Guidance on usage is available on the associated repository. Listing 1 shows how to calculate the DTW distance and alignment path for the series used in Figure 2.
It also shows how to use the factory to get and call a distance function.The distance factory avoids the repeated numba compilation of distance functions that can occur with different parameters, so it is our recommended method for creating a distance function.There is also the option to find the pairwise distance matrix.
from aeon.distances import dtw_distance, dtw_alignment_path
from aeon.distances import distance_factory, distance, distance_alignment_path
from aeon.distances import pairwise_distance
import numpy as np

a = np.array([0.018, 1.537, -0.141, -0.761, -0.177, -2.192, -0.193, -0.465, -0.944, -0.240])
b = np.array([-0.755, 0.446, 1.198, 0.171, 0.564, 0.689, 1.794, 0.066, 0.288, 1.634])
d1 = dtw_distance(a, b)
d2 = dtw_distance(a, b, window=0.2)
d3 = distance(a, b, metric='dtw', window=0.2)
p1 = dtw_alignment_path(a, b)
p2 = dtw_alignment_path(a, b, window=0.2)

# For repeated use create a numba compiled callable
p3 = distance_alignment_path(a, b, metric='msm')
dtw_numba = distance_factory(a, b, metric='dtw')
twe_numba = distance_factory(a, b, metric='twe')
d3 = dtw_numba(a, b)
dist_paras = {"mu": 0.5, "lmda": 0.5}
d4 = twe_numba(a, b, dist_paras)

# To find a distance matrix between one or two 2D matrices, use pairwise_distance
pair = np.array([a, b])
dist = pairwise_distance(pair, metric="twe")
# Equivalent to this
dist = pairwise_distance(pair, pair, metric="twe")

Listing 1 A simple example of finding distances and alignments in aeon

We have implemented the k-means and k-medoids clusterers in aeon. These can be easily used and configured, as shown in Listing 2.
from aeon.clustering.k_means import TimeSeriesKMeans
from aeon.clustering.k_medoids import TimeSeriesKMedoids
from aeon.datasets import load_unit_test

trainX, trainY = load_unit_test(split="train", return_type="np2D")
testX, testY = load_unit_test(split="test", return_type="np2D")
clst1 = TimeSeriesKMeans()
clst2 = TimeSeriesKMeans(
    averaging_method="dba",

Listing 2 An example of using k-means and k-medoids in aeon

To conduct an experiment, we generate results in a standard format.
from aeon.benchmarking.experiments import run_clustering_experiment

run_clustering_experiment(

Currently, we collate these results files using the associated Java package tsml. An example of this is available on this paper's repository.

Methodology
Our experiments involve two steps that are not universally performed in the clustering literature: we normalise all series to zero mean and unit variance; and we train and test on separate folds. The issue of whether to always normalise is an open question for TSC, since some discriminatory features may be in the scale or variance, and whether to normalise is often treated as a parameter for TSC. However, for clustering we believe it makes sense to always normalise when comparing over multiple datasets; scale is much more likely to dominate an unsupervised algorithm and confound the comparison. Conversely, evaluating on unseen test data is essential for any TSC comparison. The issue is less clear cut with clustering, which is used more as an exploratory tool than a predictive model. Nevertheless, we believe the problem of overestimating performance through assessing on training data applies equally to clustering. By evaluating algorithms on both training and testing data we get an idea of a clustering algorithm's ability to generalise, and we can place the clustering comparison in the context of classification research. We perform all training on the default train split provided, but present results on both the train set and the test set.

Data
We experiment with time series data from the University of California, Riverside (UCR) archive (Dau et al., 2019). We restrict our attention to univariate time series, and in all experiments we use 112 of the 128 datasets in the UCR time series archive. We exclude datasets containing series of unequal length or with missing values. We also remove the Fungi dataset, which provides only a single train case for each class label. We report results using the six performance measures described in Section 5.2 on both the train sample and the test set.
Using the class labels to assess performance on some of these data may be unfair. For some problems, clustering algorithms naturally find clusters that are independent of the class labels but are nevertheless valid. For example, the GunPoint dataset was created by a man and a woman drawing a gun from their belt, or pretending to draw a gun. The class labels are Gun/No Gun. However, many clusterers find the Man/Woman clusters rather than Gun/No Gun. Without supervision, this is a perfectly valid clustering, but it will score approximately 50% accuracy, since the man and woman cases are split evenly between Gun/No Gun. This is an inherent problem with attempting to evaluate exploratory, unsupervised algorithms by comparing them with what we know to be true a priori: if a clustering simply finds what we already know, its utility is limited. Furthermore, as observed in (Lafabregue et al., 2022), some of the datasets contain the same time series but with different labels. We aim to mitigate these problems by using a large number of problems, but also explore the effect of removing problems where the algorithms perform little better than forming a single cluster.

Clustering Metrics
Clustering accuracy (CL-ACC), like classification accuracy, is the number of correct predictions divided by the total number of cases. To determine whether a cluster prediction is correct, each cluster has to be assigned to its best matching class value. This can be done naively, by taking the maximum accuracy over every permutation of cluster-to-class assignments, $S_k$.

$$\mathrm{ACC}(y, \hat{y}) = \max_{s \in S_k} \frac{1}{n} \sum_{i=1}^{n} \left[\, y_i = s(\hat{y}_i) \,\right],$$
where $s$ maps each cluster label to a class value and $[\cdot]$ denotes the indicator function.
Checking every permutation like this is prohibitively expensive, however; the optimal assignment can be found more efficiently using combinatorial optimisation algorithms for the assignment problem. A contingency matrix of cluster assignments and class values is created, and turned into a cost matrix by subtracting each value from the maximum value of the contingency matrix. Clusters are then assigned to classes using an algorithm such as the Hungarian algorithm (Kuhn, 1955) on the cost matrix. If the class value of a case matches the assigned class value of its cluster, the prediction is deemed correct; otherwise it is incorrect. As each class can have only one assigned cluster, when the number of clusters and class values differ, all cases in unassigned clusters are counted as incorrect.
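A minimal sketch of this procedure, using scipy's `linear_sum_assignment` to solve the assignment problem. The function name `clustering_accuracy` is our own; this is an illustration of the technique, not the evaluation code used in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Accuracy after optimally matching cluster labels to class labels."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # contingency[i, j] = number of cases with class i assigned to cluster j
    contingency = np.zeros((len(classes), len(clusters)), dtype=int)
    for c, k in zip(y_true, y_pred):
        contingency[np.searchsorted(classes, c), np.searchsorted(clusters, k)] += 1
    # turn counts into costs so that minimising cost maximises agreement
    cost = contingency.max() - contingency
    row_ind, col_ind = linear_sum_assignment(cost)
    return contingency[row_ind, col_ind].sum() / len(y_true)

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0, 0])  # cluster 1 ~ class 0, cluster 0 ~ class 1
print(clustering_accuracy(y_true, y_pred))  # 5 of 6 cases match after assignment
```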
The Rand index (RI) measures the similarity between two sets of labels. These could be the labels produced by two different clustering models (thereby allowing direct comparison), or the ground truth labels and those a model produced. The Rand index is the proportion of pairs of cases on which the two labellings agree, i.e. pairs grouped together in both labellings or separated in both.
A limiting factor of RI is that its score is inflated, especially as the number of clusters increases. The adjusted Rand index (ARI) compensates for this by adjusting the RI for the expected score of a purely random assignment.
Mutual information (MI) measures the agreement of two clusterings, or of a clustering and a true labelling, based on entropy. Normalised mutual information (NMI) rescales MI onto [0, 1], and adjusted mutual information (AMI) adjusts MI to account for the class distribution.
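All of these label-agreement measures are implemented in scikit-learn. A brief, self-contained illustration on toy labels (the labels are our own example, not data from the paper):

```python
from sklearn.metrics import (
    rand_score,
    adjusted_rand_score,
    mutual_info_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
)

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]  # one case of class 1 placed in the wrong cluster

print("RI :", rand_score(labels_true, labels_pred))
print("ARI:", adjusted_rand_score(labels_true, labels_pred))
print("MI :", mutual_info_score(labels_true, labels_pred))
print("NMI:", normalized_mutual_info_score(labels_true, labels_pred))
print("AMI:", adjusted_mutual_info_score(labels_true, labels_pred))
```

Note that all of these measures are invariant to a permutation of cluster labels, so no explicit cluster-to-class assignment is needed, unlike clustering accuracy.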
The Davies-Bouldin index is an unsupervised measure we employ for tuning clusterers. It compares within-cluster variation with between-cluster separation, rewarding clusterings in which there is good separation between clusters.
To compare multiple clusterers on multiple datasets we use an adaptation of the critical difference diagram (Demšar, 2006) based on the rank ordering of the algorithms, replacing the post-hoc Nemenyi test with pairwise Wilcoxon signed-rank tests between all algorithms, with cliques formed using the Holm correction, as recommended in (García and Herrera, 2008; Benavoli et al., 2016). We use α = 0.05 for all hypothesis tests. Critical difference diagrams such as those shown in Figure 10 display the algorithms ordered by the average rank of the statistic in question, and the groups of algorithms between which there is no significant difference (cliques). So, for example, in Figure 10(c), MSM has the lowest average rank of 3.8973 and is not in a clique, so it has significantly better ARI than all the other algorithms. TWE, ERP, WDTW and ED are grouped into a clique, meaning there is no pairwise significant difference between them. LCSS, EDR, WDDTW and DTW form another clique, and DDTW is significantly worse than all other algorithms.
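The pairwise testing procedure can be sketched as follows, assuming per-dataset scores are already available. The scores below are randomly generated placeholders, not results from the paper.

```python
from itertools import combinations
import numpy as np
from scipy.stats import wilcoxon

# hypothetical per-dataset ARI scores for three clusterers on 112 datasets
rng = np.random.default_rng(0)
scores = {
    "msm": rng.uniform(0.3, 0.6, 112),
    "dtw": rng.uniform(0.1, 0.4, 112),
    "ed": rng.uniform(0.2, 0.5, 112),
}

# one Wilcoxon signed-rank test per pair of algorithms
pairs = list(combinations(scores, 2))
pvals = np.array([wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs])

# Holm correction: compare the i-th smallest p-value against alpha / (m - i)
order = np.argsort(pvals)
m, alpha = len(pvals), 0.05
reject = np.zeros(m, dtype=bool)
for rank, idx in enumerate(order):
    if pvals[idx] <= alpha / (m - rank):
        reject[idx] = True
    else:
        break  # once one test fails, all larger p-values are also retained

for (a, b), p, r in zip(pairs, pvals, reject):
    print(f"{a} vs {b}: p={p:.4g}, significant={r}")
```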

Results
We report a sequence of three sets of experiments. Firstly, in Section 6.1 we compare k-means clustering using 10 different distance functions and put these in the context of classification. Secondly, in Section 6.2 we report the k-medoids results. We address the questions of whether the same pattern of performance seen in k-means is observable in k-medoids, and whether k-medoids is generally more effective than k-means. Thirdly, in Section 6.3 we evaluate whether we can recreate the reported improvement to DTW from DBA, and assess how this improvement compares to our previous experiments.

Elastic distances with k-means
The first set of experiments compares alternative distance functions with k-means clustering using the arithmetic mean of series as the centroid. Our primary aim is to investigate whether there are any significant differences between the measures when used on the UCR univariate classification datasets. For each experiment, we ran k-means clustering with the default aeon parameters (random initialisation, maximum 300 iterations, 10 restarts, averaging method the mean) with Euclidean distance and the nine elastic distance measures, set up with the default parameters listed in Table 1. Figure 10 shows the summarised results on the test data. Full results for the test data are given in the Appendix in Table 6. These results, and the code to reproduce them, are in the associated GitHub repository. The first obvious conclusion is that MSM is the best performing distance measure: it is significantly better than the other nine distances on all performance measures except accuracy, where it is top ranked. This mirrors similar findings when comparing distance measures with one nearest neighbour classification (see Section 4 of Bagnall et al. (2017)). Nevertheless, it is still surprising, and we believe not widely known in the time series research community. The second stand out result is that DTW is significantly worse than Euclidean distance. This is particularly surprising given the prevalence of DTW in the TSCL literature, and is even more notable when we observe that WDTW is significantly better than DTW. To verify this result, we reran the clustering experiments using the Java toolkit tsml version of DTW in conjunction with the WEKA k-means clusterer, and found no significant difference in the results. Furthermore, we have checked that the aeon DTW distance implementation produces the same distances as the other implementations listed in Table 2.
Further experimentation has shown that full window DTW performs worse than DTW constrained to a 20% window. It appears that windowing is significantly less effective than weighting for clustering. This is not true of classification, and reflects the greater uncertainty inherent in unsupervised learning. Tuning significantly improves the performance of DTW 1-NN classification, so it is possible that it may also improve DTW based clustering. To tune a clusterer we need an unsupervised cluster assessment measure. We use the Davies-Bouldin score, since it is popular and available in scikit-learn. We evaluate DTW for ten window sizes on even intervals in the range [0, 1) and use the window with the best Davies-Bouldin score for our final clustering. Figure 11 shows the scatter plot of DTW vs tuned DTW. There is no significant difference between tuned and untuned DTW. The inherent uncertainty of clustering and the lack of sensitivity of the performance measure mean we do not see the improvement through tuning that we observe in classification with elastic distances. The two derivative approaches are both significantly worse than their alternatives on the raw data, suggesting that clusters in the time domain better reflect the true classes. LCSS and EDR also perform poorly on all tests. This implies that simple edit thresholding is not sensitive enough to find clusterings that represent the class labels. The best overall performing measures are MSM and TWE. These measures are similar, in that they combine elements of both warping and editing. Overall, the clear conclusion is that MSM is the best approach for k-means clustering with standard averaging to find centroids. WDTW, MSM, TWE and ERP all impose an explicit penalty for warping. The DTW penalty for warping is implicit (warping means a longer path), and these results indicate that this allows more warping than is useful for clustering.
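The tuning loop can be sketched generically. Note that in scikit-learn's `davies_bouldin_score`, a lower value indicates a better clustering, so the selection minimises the score. Because an elastic-distance clusterer is not assumed available here, the sketch selects among standard `KMeans` candidates on toy data; in the paper the candidates would instead be DTW clusterers with window sizes in [0, 1).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def select_by_davies_bouldin(X, candidate_models):
    """Fit each candidate and return the one with the best (lowest) DB score."""
    best_model, best_score = None, np.inf
    for model in candidate_models:
        labels = model.fit_predict(X)
        if len(np.unique(labels)) < 2:
            continue  # DB score is undefined for a single cluster
        score = davies_bouldin_score(X, labels)
        if score < best_score:
            best_model, best_score = model, score
    return best_model, best_score

# toy data: two well-separated blobs in 8 dimensions
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 8)), rng.normal(5, 0.5, (50, 8))])
# stand-in candidates; the paper would vary the DTW window, not n_clusters
candidates = [KMeans(n_clusters=k, n_init=10, random_state=1) for k in (2, 3, 4)]
model, score = select_by_davies_bouldin(X, candidates)
print(model.n_clusters, round(score, 3))
```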
The obvious first follow-up question is: does it make sense to cluster time series with k-means and cluster averaging at all? Before directly addressing that question in Sections 6.2 and 6.3, we set the context more thoroughly by considering how much information the clusterings carry in relation to class values.
Although MSM is significantly better than Euclidean distance, it could be that this effect simply makes a bad approach marginally better. Table 7 shows that the average accuracy over all problems is 51.78% for Euclidean, 49.08% for DTW and 54.16% for MSM. For reference, predicting just the majority class gives an overall accuracy of 34.26%. If we take Euclidean distance as our control, then on average DTW is 2.7% worse and MSM is 2.38% better. Clustering clearly provides some insight into class membership. Figure 13 shows the accuracy scatter plots of MSM vs ED and DTW, and clearly demonstrates the overall superiority of MSM. To estimate how much information our clusterers capture, we can upper bound performance with the accuracy of supervised approaches. A 1-NN classifier using Euclidean distance averages 72.36% on these problems, 1-NN DTW achieves an average accuracy of 76.74% and the state of the art, HIVE-COTE v2.0 (Middlehurst et al., 2021), has an average accuracy of 89.12%. This difference is not surprising: an unsupervised method cannot hope to achieve equivalence with a supervised algorithm, but the size of the gap is worth observing. Clustering improves on random guessing on average but, as discussed in Section 5, many of these data are probably inappropriate for clustering. If we take the highest accuracy of the ten clusterers as our reference and compare it to default class predictions, we find that 18 datasets are less than 5% better than predicting a single cluster, and two (MedicalImages and ChlorineConcentration) are more than 10% worse than using a single class. Table 8 lists the best performing clusterer, the accuracy obtained from predicting a single cluster and the difference between the two.
Figure 14 shows the critical difference diagrams for accuracy and Rand index when we rerun the comparison without these 18 datasets. The pattern of results is very similar to that shown in Figure 10. For completeness, we continue with all 112 datasets, but note that it may be worthwhile forming a subset of the UCR problems most suitable for assessing clusterers. For consistency with some of the related research, Figure 12 shows the same results on the train data. The pattern of results is the same as on the test data, although unsurprisingly there is less significance in the results.

k-medoids
Using standard centroids means that the averaging method is not related to the distance measure used, unless Euclidean distance is employed. This disconnect between the clustering stages may account for the poor performance of many of the distance measures relative to k-means with Euclidean distance. We repeated our experiments using the aeon k-medoids clusterer. Figure 15 shows the ranked performance summary for the ten distance measures. The pattern of performance is broadly the same as with k-means (Figure 10), with some notable differences: DTW is no longer worse than Euclidean; ERP performs much better; and there is no overall difference between ERP, TWE and MSM as the top performing algorithms. The top performing distance functions all involve an explicit penalty for warping. As with k-means, this indicates that regularisation of path length produces better clusters on average. Also of interest is the relative performance of k-means and k-medoids. Table 4 lists the average accuracy over all problems of k-means, k-medoids and single cluster predictions for the ten distance measures. Accuracy increases for all distances except Euclidean. This indicates that k-medoids is a much stronger benchmark for TSCL algorithms than k-means.
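One attraction of k-medoids is that the update step needs only a precomputed distance matrix: the medoid of a cluster is the member minimising the total distance to the other members, with no averaging required. A minimal sketch of this update (our own helper, not the aeon implementation):

```python
import numpy as np

def update_medoids(D, labels, n_clusters):
    """Return the medoid index of each cluster given a precomputed
    pairwise distance matrix D (n x n) and current cluster labels."""
    medoids = np.empty(n_clusters, dtype=int)
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        # medoid = member with the smallest total distance to the other members
        within = D[np.ix_(members, members)]
        medoids[k] = members[within.sum(axis=1).argmin()]
    return medoids

# toy distance matrix for 5 one-dimensional cases in 2 clusters
X = np.array([0.0, 0.1, 0.5, 5.0, 5.2]).reshape(-1, 1)
D = np.abs(X - X.T)
labels = np.array([0, 0, 0, 1, 1])
print(update_medoids(D, labels, 2))  # [1 3]
```

Because D can be computed once with any elastic distance and reused for every iteration, this also applies directly to MSM, TWE and the other measures in this study.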

kmeans-DBA
Using medoids mitigates the problem of averaging centres with k-means.DBA, described in Section 3.8, has also been proposed as a means of improving centroid finding for DTW.We have repeated our experiments with the same k-means set up, but centroids found with the original DBA described in Algorithm 7. Figure 16(a) shows that DBA does indeed significantly improve DTW, a result that reproduces the findings in (Petitjean et al., 2011).
However, Figure 16(b) illustrates that it is not significantly different to k-medoids DTW. Furthermore, Figure 17 shows that k-means with DBA is significantly worse than the top clique, composed of k-medoids MSM, k-means MSM and k-medoids TWE. Our implementation of DBA is faithful to the original. An alternative version of DBA was described in (Schultz and Jain, 2017). This version is implemented in the tslearn toolkit. We have wrapped the tslearn k-means DBA algorithm into the aeon toolkit and reran this version.
Figure 18 compares the performance of the two DBA versions with MSM k-medoids and k-means. Table 5 summarises the time taken to run the experiments. We ran experiments on our HPC cluster, so timing results are indicative only. The differences are fairly small. However, we note that of the two best performing algorithms, k-medoids MSM and k-medoids TWE, k-medoids MSM is the faster and hence is our recommended benchmark for elastic distance time series clustering.

Conclusions
We have described ten time series distance measures and compared them when used to cluster the time series of the UCR archive. There is a wealth of other time series clustering algorithms that we have not evaluated. We have opted to provide an in-depth description, with examples and associated code, and a tightly constrained bake off, rather than attempt to include all variants of, for example, deep learning and transformation based time series clusterers. Distance based approaches are still very popular, and we believe our experimental observations will help practitioners choose distance based TSCL more effectively. Our first conclusion is that k-means using DTW and centroid averaging is not a useful benchmark for new clustering algorithms. For k-means, MSM is the best distance measure to use. However, k-means with averaging of centroids is itself not an effective approach for TSCL: k-medoids is significantly better for all nine elastic distances, and has the added benefit of being able to use a precomputed distance matrix. A popular solution to the problem of averaging centroids for k-means is to use DBA, which uses the warping mechanism to align centroids. Whilst this does indeed improve DTW k-means, it is computationally intensive and does not improve DTW enough to make it competitive with the best k-medoids approach. The improved version of DBA (Schultz and Jain, 2017) implemented in tslearn is significantly better than the original, but not significantly different from either MSM clusterer, and it takes approximately twice as long to run. Our overall conclusion is that, without any data specific prior knowledge as to the best approach, k-medoids with either MSM or TWE is the best approach for distance based clustering of time series, with MSM preferred because of its lower run time. These should form the basis for assessing new clustering algorithms (although the tslearn version of DBA is also a good benchmark). There is little difference in run time between the algorithms: k-medoids is a little faster but uses more memory, and TWE is slightly slower than the other distances. However, run time and memory were not a major constraint for these experiments.
We have released all our results and code in a dedicated repository. We welcome contributions of new distance functions to aeon, and would be happy to extend the evaluation to include them. We encourage would-be contributors to engage with the aeon community and to look for or raise issues related to the distances module on the aeon GitHub repository. Our next stage is to extend the bake off to consider clustering algorithms not based on elastic distances, to assess the impact of alternative clustering algorithm parameters (e.g. the initialisation technique), and to investigate alternative medoids based approaches such as partitioning around medoids (Kaufman, 1990). Furthermore, we have not yet investigated the effect of having to set the number of clusters, k. A future experiment will test whether the relative performance remains the same when using standard techniques for setting k. There are also several possible directions for algorithmic advancement highlighted by this research: we will try combining the clusterings through an ensemble in a manner similar to the elastic ensemble for classification (Lines and Bagnall, 2015). Clustering ensembles are more complex than classification ensembles, requiring some form of alignment of labellings. Nevertheless, it seems reasonably likely that there may be some improvement from doing so. These are just some of the numerous open issues in TSCL research. Our study aims to put future research on a sound basis, to facilitate the accurate assessment of algorithmic improvements in a fully reproducible manner.
Figure 2 helps explain how the DTW calculations are arrived at. Euclidean distance is simply the sum of the diagonal of the matrix $M$ in Figure 2(a). DTW constructs $C$ using $M$ and previously found values. For example, $C_{1,1} = M_{1,1} = 0.6$, and $C_{1,2}$ is the minimum of $C_{1,1}$, $C_{0,1}$ and $C_{0,2}$, plus $M_{1,2}$. $C_{0,2}$ and $C_{0,1}$ are initialised to infinity, so the best path to $C_{1,2}$ has distance $C_{1,1} + M_{1,2} = 0.6 + 5.25 = 5.85$. Similarly, cell $C_{10,10}$ is the minimum of the three cells $C_{9,9}$, $C_{10,9}$ and $C_{9,10}$, plus the pointwise distance $M_{10,10}$. The optimal path is the trace back through the minimum values (shown in white in Figure 2(b)).
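The recursion described above can be sketched directly. This is a minimal full-window implementation, not the aeon code; it uses squared pointwise differences for $M$, a common convention.

```python
import numpy as np

def dtw_distance(a, b):
    """Full-window DTW following the recursion
    C[i, j] = M[i, j] + min(C[i-1, j-1], C[i-1, j], C[i, j-1])."""
    n, m = len(a), len(b)
    M = (a[:, None] - b[None, :]) ** 2  # pointwise squared distances
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0  # boundary condition; all other border cells stay at infinity
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = M[i - 1, j - 1] + min(
                C[i - 1, j - 1], C[i - 1, j], C[i, j - 1]
            )
    return C[n, m]

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 3.0, 3.0])
print(dtw_distance(a, b))  # 1.0
```

A Sakoe-Chiba band would simply restrict the inner loop to indices with |i - j| within the window.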

Fig. 2
Fig. 2 An example of the Euclidean and DTW distance functions for two series. The left hand matrix, Figure 2(a), shows the pointwise distances between the series (matrix M in Equation 1). The Euclidean distance is the sum of the diagonal path. The right hand matrix, Figure 2(b), shows the DTW distances (matrix C in Equation 1) and the resulting warping path when the window size is unconstrained.

Fig. 3
Fig. 3 An example of constraining the DTW window. The left hand matrix, Figure 3(a), shows the DTW distances for the same series used in Figure 2 when the window is constrained to 20% of the series length using a Sakoe-Chiba band. Figure 3(b) shows the resulting alignment.
(a) Weighted DTW with weight g = 0.2 (b) Weighted DTW with a weight g = 0.3.

Fig. 4
Fig. 4 Examples of the weighted DTW cost matrix C and resulting alignment for two weight parameters.

Fig. 5
Fig. 5 An example of the LCSS cost matrix with $\epsilon = 1$, where the white cells are members of the LCSS.

Fig. 6
Fig. 6 An example of the EDR cost matrix with $\epsilon = 1$, where the white cells are members of the least cost alignment path.
clst3 = TimeSeriesKMedoids()
clst4 = TimeSeriesKMedoids(
    metric="dtw",
    distance_params={"window": 0.2},
    ...

clst1.fit(trainX)
pred = clst1.predict(testX)
print(pred)

An example of running an experiment in aeon. Random state is set to 1 throughout our experiments to facilitate reproducibility.

Fig. 10
Fig. 10 Critical difference diagrams for k-means clustering using nine different elastic distance functions on 112 UCR problems using six performance measures. Results are derived from the test files.

Fig. 11
Fig. 11 Scatter plots of accuracy and rand index for DTW vs Tuned DTW.

Fig. 12
Fig. 12 Train results for k-means clustering with 10 different distance measures.

Fig. 13
Fig. 13 Scatter plots of k-means MSM against k-means Euclidean distance and k-means DTW.

Fig. 15
Fig. 15 Critical difference diagrams for k-medoids clustering with ten distance measures assessed on the test data.
Fig. 16 Scatter plots of accuracy for DTW based k-means, with DBA against standard averaging (a) and against k-medoids (b).

Fig. 18
Fig. 18 Comparison of performance of two DBA implementations and k-medoids and kmeans with MSM distance.
Sakoe H, Chiba S (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26(1):43-49
Saxena A, Prasad M, Gupta A, Bharill N, Patel OP, Tiwari A, Er MJ, Ding W, Lin CT (2017) A review of clustering techniques and developments. Neurocomputing 267:664-681
Schultz D, Jain BJ (2017) Nonsmooth analysis and subgradient methods for averaging in dynamic time warping spaces. CoRR abs/1701.06393, URL http://arxiv.org/abs/1701.06393
Stefan A, Athitsos V, Das G (2013) The Move-Split-Merge metric for time series. IEEE Transactions on Knowledge and Data Engineering 25(6):1425-1438
Zolhavarieh S, Aghabozorgi S, Teh YW (2014) A review of subsequence time series clustering. The Scientific World Journal 2014

Table 2 lists some of the most popular packages with DTW implementations and summarises which other elastic distance measures they contain. We have confirmed equivalence of DTW results with these packages. The time taken to perform 200 DTW distance calculations on random series of length 1000 to 10,000 on a desktop PC is shown in Table 3. aeon run time is very similar to tslearn and rust. dtw-python is approximately five times slower than the other packages. aeon offers the widest range of elastic distance measures with equivalent run time to the other packages. Distance functions can be used directly, either by explicitly importing them or by using the distance factory provided. A Jupyter notebook with example

Table 2
Summary of elastic distance function availability in five Python packages.

Table 3
Time (in seconds) to perform 200 full window DTW distance calculations with random series of length 1000 to 10000.
Test results for k-means clustering with the 93 datasets where the best clusterer is greater than 5% more accurate than predicting a single cluster.

Table 4
Accuracy averaged over 112 problems for k-means and k-medoids clustering.

Table 5
Run time (in hours): average, maximum and total over 112 problems.

Critical difference diagrams for the best performing k-means and k-medoids distances, and k-means with DBA (dtw-dba).