Parameterizing the cost function of dynamic time warping with application to time series classification

Dynamic time warping (DTW) is a popular time series distance measure that aligns the points in two series with one another. These alignments support warping of the time dimension to allow for processes that unfold at differing rates. The distance is the minimum sum of costs of the resulting alignments over any allowable warping of the time dimension. The cost of an alignment of two points is a function of the difference in the values of those points. The original cost function was the absolute value of this difference. Other cost functions have been proposed. A popular alternative is the square of the difference. However, to our knowledge, this is the first investigation of both the relative impacts of using different cost functions and the potential to tune cost functions to different time series classification tasks. We do so in this paper by using a tunable cost function $\lambda_\gamma$ with parameter $\gamma$. We show that higher values of $\gamma$ place greater weight on larger pairwise differences, while lower values place greater weight on smaller pairwise differences.
We demonstrate that training $\gamma$ significantly improves the accuracy of both the DTW nearest neighbor and Proximity Forest classifiers.

Dynamic Time Warping (DTW) (Sakoe and Chiba 1971, 1978) is a popular distance measure for time series and is often employed as a similarity measure such that the lower the distance the greater the similarity. It is used in numerous applications including speech recognition (Sakoe and Chiba 1971, 1978), gesture recognition (Cheng et al. 2016), signature verification (Okawa 2021), shape matching (Yasseen et al. 2016), road surface monitoring (Singh et al. 2017), neuroscience (Cao et al. 2016) and medical diagnosis (Varatharajan et al. 2018).
DTW aligns the points in two series and returns the sum of the pairwise distances between each of the pairs of points in the alignment. DTW provides flexibility in the alignments to allow for series that evolve at differing rates. In the univariate case, pairwise distances are usually calculated using a cost function, $\lambda(a \in \mathbb{R}, b \in \mathbb{R}) \to \mathbb{R}^+$. When introducing DTW, Sakoe and Chiba (1971) defined the cost function as $\lambda(a, b) = |a - b|$. However, other cost functions have subsequently been used. The cost function $\lambda(a, b) = (a - b)^2$ (Tan et al. 2018; Dau et al. 2019; Mueen and Keogh 2016; Löning et al. 2019; Tan et al. 2020) is now widely used, possibly inspired by the (squared) Euclidean distance. ShapeDTW (Zhao and Itti 2018) computes the cost between two points by computing the cost between the "shape descriptors" of these points. Such a descriptor can be the Euclidean distance between segments centered on these points, taking into account their local neighborhood.
To our knowledge, there has been little research into the influence of tuning the cost function on the efficacy of DTW in practice. This paper specifically investigates how actively tuning the cost function influences the outcome on a clearly defined benchmark. We do so using $\lambda_\gamma(a, b) = |a - b|^\gamma$ as the cost function for DTW, where γ = 1 gives us the original cost function, and γ = 2 the now commonly used squared Euclidean distance.
We motivate this research with an example illustrated in Figure 1 relating to three series, S, T and U. U exactly matches S in the high amplitude effect at the start, but does not match the low amplitude effects thereafter. T does not match the high amplitude effect at the start but exactly matches the low amplitude effects thereafter. Given these three series, we can ask which of T or U is the nearest neighbor of S.
As shown in Figure 1, the answer varies with γ. Low γ emphasizes low amplitude effects and hence identifies S as more similar to T, while high γ emphasizes high amplitude effects and assesses U as most similar to S. Hence, we theorized that careful selection of an effective cost function on a task by task basis can greatly improve accuracy, which we demonstrate in a set of nearest neighbor time series classification experiments. Our findings extend directly to all applications relying on nearest neighbor search, such as ensemble classification (we demonstrate this with Proximity Forest (Lucas et al. 2019)) and clustering, and have implications for all applications of DTW.
The remainder of this paper is organized as follows. In Section 2, we provide a detailed introduction to DTW and its variants. In Section 3, we present the flexible parametric cost function $\lambda_\gamma$ and a straightforward method for tuning its parameter. Section 4 presents experimental assessment of the impact of different DTW cost functions, and the efficacy of DTW cost function tuning in similarity-based time series classification (TSC). Section 5 provides discussion, directions for future research and conclusions.

Dynamic Time Warping
The DTW distance measure (Sakoe and Chiba 1971) is widely used in many time series data analysis tasks, including nearest neighbor (NN) search (Rakthanmanon et al. 2012; Tan et al. 2021a; Petitjean et al. 2011; Keogh and Pazzani 2001; Silva et al. 2018). Nearest neighbor with DTW (NN-DTW) has been the historical approach to time series classification and is still used widely today.
DTW computes the cost of an optimal alignment between two equal length series, S and T with length L, in $O(L^2)$ time (lower costs indicating more similar series), by minimizing the cumulative cost of aligning their individual points. The resulting sequence of alignments is known as the warping path. The warping path of S and T is a sequence $W = \langle W_1, \ldots, W_P \rangle$ of alignments (dotted lines in Figure 2). Each alignment is a pair $W_k = (i, j)$ indicating that $S_i$ is aligned with $T_j$. W must obey the following constraints:
- Boundary Conditions: $W_1 = (1, 1)$ and $W_P = (L, L)$.
- Continuity and Monotonicity: for any $W_k = (i, j)$ with $1 \le k < P$, $W_{k+1} \in \{(i+1, j), (i, j+1), (i+1, j+1)\}$.
The cost of a warping path is minimized using dynamic programming by building a "cost matrix" $M_{DTW}$ for the two series S and T, such that $M_{DTW}(i, j)$ is the minimal cumulative cost of aligning the first i points of S with the first j points of T. The cost matrix is defined in Equations 1a to 1d, where $\lambda(S_i, T_j)$ is the cost of aligning the two points, discussed in Section 3. It follows that $DTW(S, T) = M_{DTW}(L, L)$.
$$\begin{aligned}
M_{DTW}(0, 0) &= 0 && \text{(1a)} \\
M_{DTW}(i, 0) &= +\infty && \text{(1b)} \\
M_{DTW}(0, j) &= +\infty && \text{(1c)} \\
M_{DTW}(i, j) &= \lambda(S_i, T_j) + \min \{ M_{DTW}(i-1, j-1),\ M_{DTW}(i-1, j),\ M_{DTW}(i, j-1) \} && \text{(1d)}
\end{aligned}$$
Figure 3 shows the cost matrix of computing DTW(S, T). The warping path is highlighted using the bold boxes going through the matrix.
DTW is commonly used with a global constraint applied on the warping path, such that $S_i$ and $T_j$ can only be aligned if they are within a window range, w. This limits the distance in the time dimension that can separate $S_i$ from points in T with which it can be aligned (Sakoe and Chiba 1971; Keogh and Ratanamahatana 2005). This constraint is known as the warping window, w (previously Sakoe-Chiba band) (Sakoe and Chiba 1971). Note that we have 0 ≤ w ≤ L − 2; DTW with w = 0 corresponds to a direct alignment in which i = j for all (i, j) ∈ W; and DTW with w ≥ L − 2 places no constraints on the distance between the points in an alignment. Figure 3 shows an example with warping window w = 2, where the alignment of S and T is constrained to be inside the colored band. Light gray cells are "forbidden" by the window.
Warping windows provide two main benefits: (1) preventing pathological warping of S and T; and (2) speeding up DTW by reducing its complexity from $O(L^2)$ to $O(w \cdot L)$ (Tan et al. 2018, 2021b).
Alternative window constraints have also been developed, such as the Itakura Parallelogram (Itakura 1975) and the Ratanamahatana-Keogh band (Ratanamahatana and Keogh 2004). In this paper, we focus on the Sakoe-Chiba Band, which is the constraint defined in the original definition of DTW.
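As a rough illustration of the cost matrix and the Sakoe-Chiba window described in this section, the following Python sketch computes DTW for two equal-length series. The function name `dtw`, its arguments and the list-of-lists matrix are illustrative assumptions, not the implementation used in the experiments; the point-wise cost is passed in as a parameter so that different cost functions can be swapped in.

```python
import math


def dtw(s, t, cost=lambda a, b: abs(a - b), w=None):
    """DTW between two equal-length sequences (illustrative sketch).

    cost: point-wise cost function lambda(a, b).
    w:    warping window; cells with |i - j| > w are forbidden.
          None means no window constraint.
    """
    n = len(s)
    if len(t) != n:
        raise ValueError("this sketch assumes equal-length series")
    if w is None:
        w = n
    # m[i][j] = minimal cumulative cost of aligning s[:i] with t[:j]
    m = [[math.inf] * (n + 1) for _ in range(n + 1)]
    m[0][0] = 0.0
    for i in range(1, n + 1):
        # only visit the cells that lie inside the window
        for j in range(max(1, i - w), min(n, i + w) + 1):
            m[i][j] = cost(s[i - 1], t[j - 1]) + min(
                m[i - 1][j - 1], m[i - 1][j], m[i][j - 1]
            )
    return m[n][n]
```

With `w=0` only the diagonal is visited, so the result is the sum of point-wise costs; restricting the inner loop to the window is also what reduces the work from the full matrix to a band of width proportional to w.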

Amerced Dynamic Time Warping
DTW uses a crude step function to constrain the alignments, where any warping is allowed within the warping window and none beyond it. This is unintuitive for many applications, where some flexibility in the exact amount of warping might be desired. The Amerced Dynamic Time Warping (ADTW) distance measure is an intuitive and effective variant of DTW (Herrmann and Webb in press). Rather than using a tunable hard constraint like the warping window, it applies a tunable additive penalty ω for non-diagonal (warping) alignments (Herrmann and Webb in press).
ADTW is computed with dynamic programming, similar to DTW, using a cost matrix $M_{ADTW}$ with $ADTW_\omega(S, T) = M_{ADTW}(L, L)$. Equations 2a to 2d describe this cost matrix, where $\lambda(S_i, T_j)$ is the cost of aligning the two points, discussed in Section 3.
$$\begin{aligned}
M_{ADTW}(0, 0) &= 0 && \text{(2a)} \\
M_{ADTW}(i, 0) &= +\infty && \text{(2b)} \\
M_{ADTW}(0, j) &= +\infty && \text{(2c)} \\
M_{ADTW}(i, j) &= \min \{ M_{ADTW}(i-1, j-1) + \lambda(S_i, T_j),\ M_{ADTW}(i-1, j) + \lambda(S_i, T_j) + \omega,\ M_{ADTW}(i, j-1) + \lambda(S_i, T_j) + \omega \} && \text{(2d)}
\end{aligned}$$
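A minimal sketch of this dynamic program is given below, assuming (as the description above suggests) that the penalty ω is added to every non-diagonal step and never to a diagonal one. The function name `adtw` and its signature are hypothetical and chosen only for illustration.

```python
import math


def adtw(s, t, omega, cost=lambda a, b: abs(a - b)):
    """Amerced DTW between two equal-length sequences (illustrative sketch).

    omega: additive penalty paid for each non-diagonal (warping) step.
    """
    n = len(s)
    if len(t) != n:
        raise ValueError("this sketch assumes equal-length series")
    m = [[math.inf] * (n + 1) for _ in range(n + 1)]
    m[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            c = cost(s[i - 1], t[j - 1])
            m[i][j] = min(
                m[i - 1][j - 1] + c,      # diagonal step: no penalty
                m[i - 1][j] + c + omega,  # warping step: penalized
                m[i][j - 1] + c + omega,  # warping step: penalized
            )
    return m[n][n]
```

With `omega=0` the sketch behaves like unconstrained DTW, while a very large `omega` forces the direct (diagonal) alignment.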
The parameter ω works similarly to the warping window, allowing ADTW to be as flexible as DTW with w = L − 2, and as constrained as DTW with w = 0. A small penalty should be used if large warping is desirable, while a large penalty minimizes warping. Since ω is an additive penalty, its scale relative to the time series in context matters, as a small penalty in a given problem may be a huge penalty in another one. An automated parameter selection method has been proposed in the context of time series classification that considers the scale of ω (Herrmann and Webb in press). The scale of penalties is determined by multiplying the maximum penalty $\omega'$ by a ratio 0 ≤ r ≤ 1, i.e. $\omega = \omega' \times r$. The maximum penalty $\omega'$ is set to the average "direct alignment" sampled randomly from pairs of series in the training dataset, using the specified cost function. A direct alignment does not allow any warping, and corresponds to the diagonal of the cost matrix (e.g. the warping path in Figure 3a). Then 100 ratios are sampled from $r_i = (i/100)^5$ for 1 ≤ i ≤ 100 to form the search space for ω. Apart from being more intuitive, ADTW when used in a NN classifier is significantly more accurate than DTW on 112 UCR time series benchmark datasets (Herrmann and Webb in press).
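The construction of the search space for ω can be sketched as follows. The helper names `mean_direct_alignment` and `omega_grid` are hypothetical, and how the pairs of training series are sampled is left to the caller since it is not detailed here.

```python
def mean_direct_alignment(pairs, cost):
    """Average diagonal (no-warping) cost over sampled pairs of series.

    pairs: iterable of (s, t) tuples of equal-length sequences.
    """
    totals = [sum(cost(a, b) for a, b in zip(s, t)) for s, t in pairs]
    return sum(totals) / len(totals)


def omega_grid(max_penalty, n=100, exponent=5):
    """Candidate penalties max_penalty * (i / n) ** exponent for i = 1..n."""
    return [max_penalty * (i / n) ** exponent for i in range(1, n + 1)]
```

Because of the fifth power, most candidates are concentrated near small penalties, which gives a finer resolution where large amounts of warping are still permitted.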
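Note that ω can be considered as a direct penalty on path length. If series S and T have length L and the length of the warping path for $ADTW_\omega(S, T)$ is P, the sum of the ω terms added will equal 2ω(P − L): the path takes P − 1 steps to advance both indices from 1 to L, so exactly 2(P − L) of those steps are non-diagonal, and a purely diagonal path (P = L) incurs no penalty. The longer the path, the greater the penalty added by ω.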

Tuning the cost function
DTW was originally introduced with the cost function $\lambda(a, b) = |a - b|$. Nowadays, the cost function $\lambda(a, b) = (a - b)^2 = |a - b|^2$ is also widely used (Dau et al. 2019; Mueen and Keogh 2016; Löning et al. 2019; Tan et al. 2020). Some generalizations of DTW have also included tunable cost functions (Deriso and Boyd 2022). To our knowledge, the relative strengths and weaknesses of these two common cost functions have not previously been thoroughly evaluated. To study the impact of the cost function on DTW, and its recent refinement ADTW, we use the cost function $\lambda_\gamma(a, b) = |a - b|^\gamma$. We primarily study the cost functions $\lambda_\gamma$ for $\gamma \in \Gamma = \{1/2, 1/1.5, 1, 1.5, 2\}$. To the best of our knowledge, the remaining cost functions, $\lambda_{0.5}(a, b)$, $\lambda_{0.\overline{6}}(a, b)$ and $\lambda_{1.5}(a, b)$, have not been previously investigated. As illustrated in Figure 4, relative to γ = 1, larger values of γ penalize small differences less, and larger differences more. Reciprocally, smaller values of γ penalize large differences less, and small differences more. We will show in Section 4 that learning γ at train time over these 5 values is already enough to significantly improve nearest neighbor classification test accuracy. We will also show that expanding Γ to a larger set {1/5, 1/4, 1/3, 1/2, 1/1.5, 1, 1.5, 2, 3, 4, 5}, or a denser set {1/2, 1/1.75, 1/1.5, 1/1.25, 1, 1.25, 1.5, 1.75, 2}, does not significantly improve the classification accuracy, even when doubling the number of explored parameters. Note that all the sets have the form {1/n, ..., 1, ..., n}. Although this balancing is not necessary, we did so to strike a balance in the available exponents.
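The effect of the exponent can be checked numerically with a few lines of Python. The sketch below evaluates $|a - b|^\gamma$ on one small difference (0.1) and one large difference (10) for the five default exponents 0.5, 1/1.5, 1, 1.5 and 2; the function name `cost_gamma` and the two sample differences are arbitrary illustrative choices.

```python
def cost_gamma(a, b, gamma):
    """Parameterized cost |a - b| ** gamma."""
    return abs(a - b) ** gamma


GAMMAS = [1 / 2, 1 / 1.5, 1, 1.5, 2]

for gamma in GAMMAS:
    small = cost_gamma(0.0, 0.1, gamma)   # a difference below 1
    large = cost_gamma(0.0, 10.0, gamma)  # a difference above 1
    print(f"gamma={gamma:.2f}  cost(0.1)={small:.4f}  cost(10)={large:.2f}")
```

As γ grows, the cost of the small difference shrinks while the cost of the large difference grows, so the ratio between the two widens; as γ decreases the two costs move closer together.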
Tuning $\lambda_\gamma$ amounts to learning the parameter γ at train time. This means that we now have two parameters for both DTW (the warping window w and γ) and ADTW (the penalty ω and γ). In the current work, the w and ω parameters are always learned independently for each γ, using the standard method (Herrmann and Webb in press). We denote DTW with $\lambda_x = |S_i - T_j|^x$ as $DTW^x$, and ADTW with $\lambda_x$ as $ADTW^x$. We indicate that the cost function has been tuned with the superscript +, i.e. $DTW^+$ and $ADTW^+$.
Note that with a window w = 0, $DTW^+(S, T) = \sum_{i=1}^{L} |S_i - T_i|^\gamma$ for the selected exponent γ. In other words, it is the Minkowski distance (Thompson and Thompson 1996) to the power γ, providing the same relative order as the Minkowski distance, i.e. they both have the same effect for nearest neighbor search applications.
The parameters w and ω have traditionally been learned through leave-one-out cross-validation (LOOCV) evaluating 100 parameter values (Tan et al. 2018, 2020; Lines and Bagnall 2015; Tan et al. 2021b). Following this approach, we evaluate 100 parameter values for w (and ω) per value of γ, i.e. we evaluate 500 parameter values for $DTW^+$ and $ADTW^+$. To enable a fair comparison in Section 4 of $DTW^+$ (resp. $ADTW^+$) against $DTW^\gamma$ (resp. $ADTW^\gamma$) with fixed γ, the latter are trained evaluating both 100 parameter values (to give the same space of values for w or ω) as well as 500 parameter values (to give the same overall number of parameter values).
Given a fixed γ, LOOCV can result in multiple parameterizations for which the train accuracy is equally best. We need a procedure to break ties. This could be achieved through random choice, in which case the outcome becomes nondeterministic (which may be desired). Another possibility is to pick a parameterization depending on other considerations. For DTW, we pick the smallest window as it leads to faster computations. For ADTW, we follow Herrmann and Webb (in press) and pick the median value.
We also need a procedure to break ties when more than one pair of values over two different parameters all achieve equivalent best performance. We do so by forming a hierarchy over the parameters. We first pick a best value for w (or ω) per possible γ, forming dependent pairs (γ, w) (or (γ, ω)). Then, we break ties between pairs by picking the one with the median γ. In case of an even number of equal best values for γ, taking a median would result in taking an average of dependent pairs, which does not make sense for the dependent value (w or ω). In this case we select between the two middle pairs the one with a γ value closer to 1, biasing the system towards a balanced response to differences smaller or larger than one.
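The second level of this hierarchy can be sketched in Python as below. The function name `break_ties` is hypothetical, "closer to 1" is interpreted here as the smaller absolute difference |γ − 1|, and an exact tie in that distance is resolved towards the lower exponent; both choices are assumptions made only for this illustration.

```python
def break_ties(best_pairs):
    """Pick one (gamma, param) pair among equally accurate candidates.

    best_pairs: list of (gamma, param) tuples, one per tied gamma, where
    param is the value of w (or omega) already selected for that gamma.
    """
    if not best_pairs:
        raise ValueError("no candidate pairs")
    ordered = sorted(best_pairs, key=lambda pair: pair[0])
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]  # odd count: the median pair
    low, high = ordered[n // 2 - 1], ordered[n // 2]
    # even count: keep the middle pair whose gamma is closer to 1
    return low if abs(low[0] - 1) <= abs(high[0] - 1) else high
```

Selecting an existing pair, rather than averaging the two middle exponents, keeps the dependent parameter paired with the exponent it was tuned for.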
Our method does not change the overall time complexity of learning the parameters of DTW and ADTW. The time complexity of using LOOCV for nearest neighbor search with these distances is $O(M \cdot N^2 \cdot L^2)$, where M is the number of parameter values, N is the number of training instances, and L is the length of the series. Our method only impacts the number of parameter values M. Hence, using 5 different exponents while keeping a hundred values for w or ω effectively increases the training time 5-fold.
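The origin of the M and N² factors can be seen in a naive LOOCV grid search, sketched below. This is a deliberately unoptimized illustration with hypothetical names (`loocv_accuracy`, `tune`, `make_dist`); it is not the training procedure used in the experiments and ignores the early-abandoning and pruning techniques found in practical implementations.

```python
def loocv_accuracy(train, labels, dist):
    """Leave-one-out 1-NN accuracy on the training set: N * (N - 1) distances."""
    correct = 0
    for i, query in enumerate(train):
        nearest = min(
            (j for j in range(len(train)) if j != i),
            key=lambda j: dist(query, train[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(train)


def tune(train, labels, make_dist, gammas, params):
    """Evaluate every (gamma, param) combination: M = len(gammas) * len(params).

    make_dist(gamma, param) must return a distance function dist(s, t).
    Returns the list of equally best (gamma, param) pairs and their accuracy.
    """
    scores = {
        (g, p): loocv_accuracy(train, labels, make_dist(g, p))
        for g in gammas
        for p in params
    }
    best = max(scores.values())
    return [pair for pair, acc in scores.items() if acc == best], best
```

Each distance call costs up to O(L²) for an elastic distance, which gives the overall bound; the returned list of tied pairs is what a tie-breaking rule would then reduce to a single choice.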

Experimentation
We evaluate the practical utility of cost function tuning by studying its performance in nearest neighbor classification. While the technique has potential applications well beyond classification, we choose this specific application because it has well accepted benchmark problems with objective evaluation criteria (classification accuracy). We experimented over the widely-used time series classification benchmark of the UCR archive (Dau et al. 2018), removing the datasets containing series of variable length or classes with only one training exemplar, leading to 109 datasets. We investigate tuning the exponent γ for $DTW^+$ and $ADTW^+$ using the following sets (and we write e.g. $DTW^{+a}$ when using the set a):
- a = {1/2, 1/1.5, 1, 1.5, 2}
- b = {1/5, 1/4, 1/3, 1/2, 1/1.5, 1, 1.5, 2, 3, 4, 5}
- c = {1/2, 1/1.75, 1/1.5, 1/1.25, 1, 1.25, 1.5, 1.75, 2}
The default set a is the one used in Figure 4, and the one we recommend.
We show that a wide range of different exponents γ each perform best on different datasets. We then compare $DTW^{+a}$ and $ADTW^{+a}$ against their classic counterparts using γ = 1 and γ = 2. We also address the question of the number of evaluated parameters, showing with both DTW and ADTW that tuning the cost function is more beneficial than evaluating 500 values of either w or ω with a fixed cost function. We then show that compared to the large set b (which looks at exponents beyond 1/2 and 2) and to the dense set c (which looks at more exponents between 1/2 and 2), a offers similar accuracy while being less computationally demanding (evaluating fewer parameters). Just as ADTW is significantly more accurate than DTW (Herrmann and Webb in press), $ADTW^{+a}$ remains significantly more accurate than $DTW^{+a}$. This holds for sets b and c.
Finally, we show that parameterizing the cost function is also beneficial in an ensemble classifier, showing a significant improvement in accuracy for the leading similarity-based TSC algorithm, Proximity Forest (Lucas et al. 2019).

Analysis of the impact of exponent selection on accuracy
Figure 5 shows the number of datasets for which each exponent results in the highest accuracy on the test data for each of our NN classifiers and each of the three sets of exponents. It is clear that there is great diversity across datasets in terms of which γ is most effective. For DTW, the extremely small γ = 0.2 is desirable for 12% of datasets and the extremely large γ = 5.0 for 8%.
The optimal exponent differs between $DTW^\gamma$ and $ADTW^\gamma$, due to different interactions between the window parameter w for DTW and the warping penalty parameter ω for ADTW. We hypothesize that low values of γ can serve as a form of pseudo ω, penalizing longer paths by penalizing large numbers of small difference alignments. ADTW directly penalizes longer paths through its ω parameter, reducing the need to deploy γ in this role. If this is correct then ADTW has greater freedom to deploy γ to focus more on low or high amplitude effects in the series, as illustrated in Figure 1.

Comparison against non tuned cost functions
Figures 6 and 7 present accuracy scatter plots over the UCR archive. A dot represents the test accuracy of two classifiers on a dataset. A dot on the diagonal indicates equal performance for the dataset. A dot off the diagonal means that the classifier on the corresponding side (indicated in top left and bottom right corners) is more accurate than its competitor on this dataset.
On each scatter plot, we also indicate the number of times a classifier is strictly more accurate than its competitor, the number of ties, and the result of a Wilcoxon signed ranks test indicating whether the accuracy of the classifiers can be considered significantly different. Following common practice, we use a significance level of 0.05.
Figures 6 and 7 show that tuning the cost function is beneficial for both DTW and ADTW, when compared to both the original cost function $\lambda_1$ and the popular $\lambda_2$. The Wilcoxon signed ranks tests show that $DTW^+$ significantly outperforms both $DTW^1$ and $DTW^2$. Similarly, $ADTW^+$ significantly outperforms both $ADTW^1$ and $ADTW^2$.

Investigation of the number of parameter values
$DTW^+$ and $ADTW^+$ are tuned on 500 parameter options. To assess whether their improved accuracy is due to an increased number of parameter options rather than due to the addition of cost tuning per se, we also compared them against $DTW^1$ and $ADTW^1$ tuned with 500 options for their parameters w and ω instead of the usual 100. Figure 8 shows that increasing the number of parameter values available to $DTW^1$ and $ADTW^1$ does not alter the advantage of cost tuning. Note that the warping window w of DTW is a natural number for which the range of values that can result in different outcomes is 0 ≤ w ≤ L − 2. In consequence, we cannot train DTW on more than L − 1 meaningfully different parameter values. This means that for short series (L < 100), increasing the number of possible windows from 100 to 500 has no effect. ADTW suffers less from this issue due to the penalty ω being sampled in a continuous space. Still, increasing the number of parameter values yields ever diminishing returns, while increasing the risk of overfitting. This also means that for a fixed budget of parameter values to be explored, tuning the cost function as well as w or ω allows the budget to be spent exploring a broader range of possibilities.

Comparison against larger tuning sets
Our experiments so far allow us to achieve our primary goal: to demonstrate that tuning the cost function is beneficial. We did so with the set of exponents a. This set is not completely arbitrary (1 and 2 come from current practice, we added their mean 1.5 and the reciprocals). However, it remains an open question whether or not it is a reasonable default choice. Ideally, practitioners need to use expert knowledge to offer the best possible set of cost functions to choose from for a given application. In particular, using an alternative form of cost function to $\lambda_\gamma$ could be effective, although we do not investigate this possibility in this paper.
Figure 9 shows the results obtained when using the larger set b, made of 11 values extending a with 3, 4, 5 and their reciprocals. Compared to a, the change benefits $DTW^+$ (albeit not significantly according to the Wilcoxon test), at the cost of more than doubling the number of assessed parameter values. On the other hand, $ADTW^+$ is mostly unaffected by the change.
Figure 10 shows the results obtained when using the denser set c, made of 9 values between 0.5 and 2. In this case, neither distance benefits from the change.

Runtime
There is usually a tradeoff between runtime and accuracy for a practical machine learning algorithm. Sections 4.2 and 4.3 show that tuning the cost function significantly improves the accuracy of both ADTW and DTW in nearest neighbor classification tasks. However, this comes at the cost of having more parameter values to evaluate (500 instead of 100 with a single exponent). TSC using the nearest neighbor algorithm paired with $O(L^2)$ complexity elastic distances is well known to be computationally expensive, taking hours to days to train (Tan et al. 2021b). Therefore, we discuss in this section the computational details of tuning the cost function exponent γ and assess the tradeoff in accuracy gain.
We performed a runtime analysis by recording the total time taken to train and test both DTW and ADTW for each γ from the default set a. Our experiments were coded in C++ and parallelised on a machine with 32 cores and an AMD EPYC-Rome 2.2 GHz processor for speed. The C++ pow function, which supports exponentiation of arbitrary values, is computationally demanding. Hence, we use specialized code to calculate the exponents 0.5, 1.0 and 2.0 efficiently, using sqrt for 0.5, abs for 1.0 and multiplication for 2.0.
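The dispatch idea can be sketched as follows. The experiments themselves were coded in C++; this Python version, with the hypothetical helper `make_cost`, only illustrates how the exponent can be resolved once, outside the inner loop of the distance computation, and its timings would not mirror those of the C++ code.

```python
import math


def make_cost(gamma):
    """Return a function computing |a - b| ** gamma.

    Exponents 0.5, 1 and 2 get dedicated code paths; every other
    exponent falls back to generic exponentiation.
    """
    if gamma == 0.5:
        return lambda a, b: math.sqrt(abs(a - b))
    if gamma == 1.0:
        return lambda a, b: abs(a - b)
    if gamma == 2.0:
        return lambda a, b: (a - b) * (a - b)
    return lambda a, b: abs(a - b) ** gamma  # generic, slower path
```

The returned function can be passed as the point-wise cost of a DTW or ADTW routine, so the choice of code path is made once per exponent rather than once per cell of the cost matrix.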
Figure 11 shows the LOOCV training time for both ADTW and DTW on each γ, while Figure 12 shows the test time. The runtimes for γ = 0.67 and γ = 1.5 are both substantially longer than those of the specialized exponents. The total times to tune the cost function and the other parameters on the 109 UCR time series datasets are 6250.94 seconds (approximately 2 hours) and 9948.98 seconds (approximately 3 hours) for ADTW and DTW respectively. This translates to $ADTW^+$ and $DTW^+$ being approximately 25 and 38 times slower than the baseline setting with γ = 2. A potential strategy for reducing these substantial computational burdens is to only use exponents that admit efficient computation, such as powers of 2 and their reciprocals. Also, the parameter tuning for w and ω in these experiments does not exploit the substantial speedups of recent DTW parameter search methods (Tan et al. 2021b). Despite being slower than both distances at γ = 2, completing the training of all 109 datasets in under 3 hours is still significantly faster than many other TSC algorithms (Tan et al. 2022; Middlehurst et al. 2021).

Noise
As γ alters the relative responsiveness of DTW to different magnitudes of effect in a pair of series, it is credible that tuning it may be helpful when the series are noisy. On one hand, higher values of γ will help focus on large magnitude effects, allowing DTW to pay less attention to smaller magnitude effects introduced by noise. On the other hand, lower values of γ will increase focus on small magnitude effects introduced by noise, increasing the ability of $DTW^\gamma$ to penalize long warping paths that align sets of similar values.
To examine these questions we created two variants of each of the UCR datasets. For the first variant (moderate noise) we added 0.1 × N(0, σ) to each time step, where σ is the standard deviation of the values in the series. For the second variant (substantial noise) we added N(0, σ) to each time step.
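A minimal sketch of this perturbation is shown below. The function name `add_noise` is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not specify which was used.

```python
import random
import statistics


def add_noise(series, scale=1.0, rng=None):
    """Return a copy of series with scale * N(0, sigma) added to each point.

    sigma is the (population) standard deviation of the series itself.
    scale=0.1 corresponds to the moderate setting, scale=1.0 to the
    substantial setting.
    """
    if rng is None:
        rng = random.Random()
    sigma = statistics.pstdev(series)
    return [x + scale * rng.gauss(0.0, sigma) for x in series]
```

Passing a seeded `random.Random` instance makes the noisy variants reproducible across runs.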
The results for $DTW^\gamma_{w=\infty}$ (DTW with no window) are presented in Figures 13 (no additional noise), 14 (moderate additional noise) and 15 (substantial additional noise). Each figure presents a critical difference diagram. $DTW^\gamma$ has been applied with all 109 datasets at each γ ∈ a. For each dataset, the performance for each γ is ranked in descending order on accuracy. The diagram presents the mean rank for each $DTW^\gamma$ across all datasets, with the best mean rank listed rightmost. Lines connect results that are not significantly different at the 0.05 level on a Wilcoxon signed rank test (for each line, the settings indicated with dots are not significantly different). With no additional noise, no setting of γ significantly outperforms the others. With a moderate amount of noise, the three lower values of γ significantly outperform the higher values. We hypothesize that this is a result of DTW using the small differences introduced by noise to penalize excessively long warping paths. With high noise, the three lowest γ still significantly outperform the highest level, but the difference in ranks is closing. We hypothesize that this is due to increasingly large differences in value being the only ones that remain meaningful, and hence increasingly needing to be emphasized.
The results for $ADTW^\gamma$ are presented in Figures 16 (no additional noise), 17 (moderate additional noise) and 18 (substantial additional noise). With no additional noise, γ values of 1.5 and 1.0 both significantly outperform 0.5. With a moderate amount of noise, γ = 2.0 increases its rank and no value significantly outperforms any other. With substantial noise, the two highest γ significantly outperform all others. As ADTW has a direct penalty for longer paths, we hypothesize that this gain in rank for the highest γ is due to ADTW placing higher emphasis on larger differences that are less likely to be the result of noise.
The results for DTW with window tuning are presented in Figures 19 (no additional noise), 20 (moderate additional noise) and 21 (substantial additional noise). No setting of γ has a significant advantage over any other at any level of noise. We hypothesize that this is because the constraint a window places on how far a warping path can deviate from the diagonal only partially restricts path length, allowing any amount of warping within the window. Thus, DTW still benefits from the use of low γ to penalize excessive path warping that might otherwise fit noise. However, it is also subject to a countervailing pressure towards higher values of γ in order to focus on larger differences in values that are less likely to be the result of noise.
It is evident from these results that γ interacts in different ways with the w and ω parameters of DTW and ADTW with respect to noise. For ADTW, larger values of γ are an effective mechanism to counter noisy series.

Comparing $DTW^+$ vs $ADTW^+$
From Herrmann and Webb (in press), $ADTW^2$ is more accurate than $DTW^2$. Figure 22 shows that $ADTW^{+a}$ is also significantly more accurate than $DTW^{+a}$. Interestingly, it also shows that $ADTW^{+a}$ is more accurate than $DTW^{+b}$, even though the latter benefits from the larger exponent set b.

Comparing PF vs $PF^+$
Proximity Forest (PF) (Lucas et al. 2019) is an ensemble classifier relying on the same 11 distances as the Elastic Ensemble (EE) (Lines and Bagnall 2015), with the same parameter spaces. Instead of using LOOCV to optimise each distance and ensemble their results, PF builds trees of proximity classifiers, randomly choosing an exemplar, a distance and a parameter at each node. This strategy makes it both [...]
We define a new variant of Proximity Forest, $PF^+$, which differs only in replacing the original cost functions for DTW and its variants by our proposed parameterized cost function. We replace the cost function of DTW (with and without window), WDTW, DDTW, DWDTW and SQED by $\lambda_\gamma$, and randomly select γ from the set a at each node. Note that replacing the cost function of SQED in this manner makes it similar to a Minkowski distance.
We leave the tuning of other distances and their specific cost functions for future work.This is not a technical limitation, but a theoretical one: we first have to ensure that such a change would not break their properties.
The scatter plot presented in Figure 23 shows that $PF^+$ significantly outperforms PF, further demonstrating the value of extending the range of possible parameters to the cost function.
While similarity-based approaches no longer dominate performance across the majority of the UCR benchmark datasets, there remain some tasks for which similarity-based approaches still dominate. Table 1 shows the accuracy of $PF^+$ against four TSC algorithms that have been identified (Middlehurst et al. 2021) as defining the state of the art: HIVE-COTE 2.0 (Middlehurst et al. 2021), TS-CHIEF (Shifaz et al. 2020), MultiRocket (Tan et al. 2022) and InceptionTime (Fawaz et al. 2020). This demonstrates that similarity-based methods remain an important part of the TSC toolkit.

Conclusion
DT W is a widely used time series distance measure.It relies on a cost function to determine the relative weight to place on each difference between values for a possible alignment between a value in one series to a value in another.In this paper, we show that the choice of the cost function has substantial impact on nearest neighbor search tasks.We also show that the utility of a specific cost function is task-dependent, and hence that DT W can benefit from cost function tuning on a task to task basis.
We present a technique to tune the cost function by adjusting the γ exponent in a family of cost functions λ γ (a, b) = |a − b| γ .We introduced new time series distance measures utilizing this family of cost functions: DT W + and ADT W + .Our analysis shows that larger γ exponents penalize alignments with large differences while smaller γ exponents penalize alignments with smaller differences, allowing the focus to be tuned between small and large amplitude effects in the series.
We demonstrated the usefulness of this technique in both the nearest neighbor and Proximity Forest classifiers. The new variant of Proximity Forest, PF+, establishes a new benchmark for similarity-based TSC, and is more accurate than all of HIVE-COTE 2.0, TS-CHIEF, MultiRocket and InceptionTime on six of the UCR benchmark tasks, demonstrating that similarity-based methods remain a valuable alternative in some contexts.
We argue that cost function tuning can address noise through two mechanisms. Low exponents can exploit noise to penalize excessively long warping paths. It appears that DTW benefits from this when windowing is not used. High exponents direct focus to larger differences that are least affected by noise. It appears that ADTW benefits from this effect.
We need to stress that we only experimented with one family of cost functions, on a limited set of exponents. Even though we obtained satisfactory results, we urge practitioners to apply expert knowledge when choosing their cost functions, or a set of cost functions to select from. Without such knowledge, we suggest what seems to be a reasonable default set of choices for DTW+ and ADTW+, significantly improving the accuracy over DTW and ADTW. We show that a denser set does not substantially change the outcome, while DTW may benefit from a larger set that contains more extreme values of γ such as 0.2 and 5.
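Selecting γ from a candidate set by leave-one-out cross-validation of a nearest neighbor classifier can be sketched as follows. This is an illustrative sketch only: the names `loocv_accuracy` and `tune_gamma` are hypothetical, the candidate tuple is a placeholder rather than the default set proposed in this paper, ties are broken in favor of the first candidate listed, and the unconstrained `dtw` helper is a plain textbook implementation.

```python
import math


def dtw(s, t, gamma):
    """Unconstrained DTW with cost |a - b| ** gamma (textbook version)."""
    n, m = len(s), len(t)
    prev = [0.0] + [math.inf] * m
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            step = abs(s[i - 1] - t[j - 1]) ** gamma
            curr[j] = step + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[m]


def loocv_accuracy(series, labels, gamma):
    """Leave-one-out accuracy of 1-NN DTW on the training set."""
    correct = 0
    for i, query in enumerate(series):
        best_dist, best_label = math.inf, None
        for j, candidate in enumerate(series):
            if i == j:
                continue  # the query is left out of its own neighbor search
            dist = dtw(query, candidate, gamma)
            if dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(series)


def tune_gamma(series, labels, candidates=(0.5, 1.0, 2.0)):
    """Return the candidate exponent with the highest LOOCV accuracy."""
    return max(candidates, key=lambda g: loocv_accuracy(series, labels, g))
```

The cost of this procedure grows linearly with the number of candidate exponents, which is one practical reason to prefer a small default set when no expert knowledge is available.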
A small number of exponents, specifically 0.5, 1 and 2, lend themselves to much more efficient implementations than alternatives. It remains for future research to investigate the contexts in which the benefits of a wider range of exponents justify their computational costs.
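One plausible reading of this efficiency gap is that these three exponents avoid a general power computation: γ = 1 needs only an absolute value, γ = 2 a single multiplication, and γ = 0.5 a square root. The sketch below illustrates that idea; the factory name `make_cost` is hypothetical and the code is not taken from our implementation.

```python
import math


def make_cost(gamma):
    """Return a cost function for lambda_gamma, specialized where possible."""
    if gamma == 1.0:
        return lambda a, b: abs(a - b)
    if gamma == 2.0:
        # One subtraction and one multiplication, no call to pow.
        return lambda a, b: (a - b) * (a - b)
    if gamma == 0.5:
        return lambda a, b: math.sqrt(abs(a - b))
    # General case: falls back to exponentiation.
    return lambda a, b: abs(a - b) ** gamma
```

Because the cost function is evaluated once per cell of the DTW matrix, choosing the specialized function once, outside the inner loop, keeps the dispatch out of the hot path.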
We expect our findings to be broadly applicable to time series nearest neighbor search tasks. We believe that these findings also hold promise of benefit from greater consideration of cost functions in the myriad other applications of DTW and its variants.

Fig. 1: Tuning the cost function changes which series are considered more similar to one another. U exactly matches the first 7 points of S, but then flattens, running through the center of the remaining points in S. In contrast, T starts with lower amplitude than S over the first seven points, but then exactly matches S for the remaining low amplitude waves. The original DTW cost function, $\lambda(a, b) = |a - b|$, results in $DTW(S, T) = DTW(S, U) = 9$, with DTW rating T and U as equally similar to S. The commonly used cost function, $\lambda(a, b) = (a - b)^2$, results in $DTW(S, U) = 9.18 < DTW(S, T) = 16.66$. More weight is placed on the high amplitude start, and S is more similar to U. Using the cost function $\lambda(a, b) = |a - b|^{0.5}$ results in $DTW(S, U) = 8.98 > DTW(S, T) = 6.64$, placing more weight on the low amplitude end, and S is more similar to T. In general, changing the cost function alters the amount of weight placed on low amplitude vs high amplitude effects, allowing DTW to be better tuned to the varying needs of different applications.

Fig. 5: Counts of the numbers of datasets for which each value of γ results in the highest accuracy on the test data.
Fig. 9: Comparison of default exponent set a and larger set b.

Fig. 11: LOOCV train time in seconds on the UCR Archive (109 datasets) of each distance, per exponent. These timings were obtained on a machine with 32 cores and an AMD EPYC-Rome 2.2 GHz processor.

Fig. 12: Test time in seconds on the UCR Archive (109 datasets) of each distance, per exponent. These timings were obtained on a machine with 32 cores and an AMD EPYC-Rome 2.2 GHz processor.

Fig. 13: Critical Difference Diagram for $DTW_{w=\infty}$ on the UCR Archive (109 datasets) with no additional noise.

Fig. 14: Critical Difference Diagram for $DTW_{w=\infty}$ on the UCR Archive (109 datasets) with moderate additional noise.

Fig. 15: Critical Difference Diagram for $DTW_{w=\infty}$ on the UCR Archive (109 datasets) with substantial additional noise.

Fig. 16: Critical Difference Diagram for ADTW on the UCR Archive (109 datasets) with no additional noise.

Fig. 17: Critical Difference Diagram for ADTW on the UCR Archive (109 datasets) with moderate additional noise.

Fig. 18: Critical Difference Diagram for ADTW on the UCR Archive (109 datasets) with substantial additional noise.

Fig. 19: Critical Difference Diagram for DTW on the UCR Archive (109 datasets) with no additional noise.

Fig. 20: Critical Difference Diagram for DTW on the UCR Archive (109 datasets) with moderate additional noise.

Fig. 21: Critical Difference Diagram for DTW on the UCR Archive (109 datasets) with substantial additional noise.

Fig. 22: Accuracy scatter plot over the UCR archive comparing $ADTW^{+a}$ against $DTW^{+}$ tuned over a and b.
Fig. 6: Accuracy scatter plot over the UCR archive comparing $DTW^{+a}$ against $DTW^{1}$ and $DTW^{2}$.
Fig. 7: Accuracy scatter plot over the UCR archive comparing $ADTW^{+a}$ against $ADTW^{1}$ and $ADTW^{2}$.

Fig. 23: Accuracy scatter plot over the UCR archive comparing the original Proximity Forest (PF) against Proximity Forest using $\lambda_\gamma$ for DTW and its variants (PF+).

Table 1: Six benchmark UCR datasets for which PF+ is more accurate than all four algorithms that have been identified as defining the current state of the art in TSC.

DTW with and without a window; DDTW adding the derivative to DTW (Keogh and Pazzani 2001); WDTW (Jeong et al. 2011); DWDTW adding the derivative to WDTW; LCSS (Hirschberg 1977); ERP (Chen and Ng 2004); MSM (Stefan et al. 2013); and TWE (Marteau 2009).