Hydra: competing convolutional kernels for fast and accurate time series classification

We demonstrate a simple connection between dictionary methods for time series classification, which involve extracting and counting symbolic patterns in time series, and methods based on transforming input time series using convolutional kernels, namely Rocket and its variants. We show that by adjusting a single hyperparameter it is possible to move by degrees between models resembling dictionary methods and models resembling Rocket. We present Hydra, a simple, fast, and accurate dictionary method for time series classification using competing convolutional kernels, combining key aspects of both Rocket and conventional dictionary methods. Hydra is faster and more accurate than the most accurate existing dictionary methods, achieving similar accuracy to several of the most accurate current methods for time series classification. Hydra can also be combined with Rocket and its variants to significantly improve the accuracy of these methods.


Introduction
Dictionary methods and Rocket (Dempster et al, 2020) represent two seemingly quite different approaches to time series classification. Dictionary methods involve extracting and then counting symbolic patterns in time series.

Hydra
Fig. 1 Mean rank of Hydra in terms of accuracy versus several prominent dictionary methods over 30 resamples of 106 datasets from the UCR archive. Lower rank indicates higher accuracy. Hydra is more accurate than the most accurate existing dictionary methods.
The number of kernels per group, k, determines whether Hydra more closely resembles dictionary methods or more closely resembles Rocket, and has a decisive influence on the accuracy of the method: see Figure 2. With multiple kernels per group, Hydra more closely resembles conventional dictionary methods. As the number of kernels per group approaches one (k → 1), Hydra more closely resembles Rocket. In fact, where k = 1, Hydra essentially is Rocket, with some qualifications. However, the most Rocket-like variant is not the most accurate variant of Hydra.
The use of random patterns in the form of random convolutional kernels has two key advantages: (1) it is computationally efficient; and (2) it produces high classification accuracy. Figure 1 shows the mean rank of Hydra versus several prominent dictionary methods (cBOSS, S-BOSS, WEASEL, TDE, and MrSQM) over 30 resamples of the 106 datasets from the UCR archive for which benchmark results are available for all relevant methods. We rank the accuracy of each method on each dataset, and take the mean rank over all 106 datasets. Lower mean rank corresponds to higher accuracy. Methods for which the pairwise difference in accuracy is not statistically significant, per a Wilcoxon signed-rank test with Holm correction, are connected with a black line (see Demšar, 2006; García and Herrera, 2008; Benavoli et al, 2016).
Hydra is faster and more accurate than the most accurate existing dictionary methods, TDE and MrSQM, and is competitive in terms of accuracy with several of the most accurate current methods for time series classification: see Section 4.1. Hydra can train and test on this subset of 106 datasets in approximately 28 minutes using a single CPU core, as compared to approximately 1 hour 41 minutes for MrSQM, 4 hours for cBOSS, 22 hours for TDE, more than 24 hours for WEASEL, and more than 6 days for S-BOSS. Compute times for Hydra and MrSQM are averages over 30 resamples, run on a cluster using Intel Xeon E5-2680 and Xeon Gold 6150 CPUs, restricted to a single CPU core per dataset per resample. Compute times for cBOSS, TDE, S-BOSS, and WEASEL are taken from Middlehurst et al (2020a).
The rest of this paper is structured as follows. In Section 2, we present relevant related work. In Section 3, we explain Hydra in detail. In Section 4, we present experimental results, including results for a number of larger datasets and a sensitivity analysis for key hyperparameters.
Related Work

Dictionary Methods
Dictionary (or 'bag of words') methods represent a prominent approach to time series classification. BOSS was identified in Bagnall et al (2017) (the 'bake off' paper) as one of the three most accurate methods for time series classification on the datasets in the UCR archive (Dau et al, 2019), and was included in the original HIVE-COTE ensemble, then the most accurate method for time series classification on the datasets in the UCR archive (Lines et al, 2018). The most accurate current dictionary methods, TDE (Middlehurst et al, 2020a) and MrSQM (Le Nguyen and Ifrim, 2022), are competitive with several of the most accurate current methods for time series classification on the datasets in the UCR archive. TDE is one of the four components of HIVE-COTE 2 (HC2), currently the most accurate method for time series classification on the datasets in the UCR archive (Middlehurst et al, 2021).
Most dictionary methods work in broadly the same way, i.e., by passing a sliding window over each time series, smoothing or approximating the values in each window, and assigning the resulting values to letters from a symbolic alphabet (Large et al, 2019). The counts of the resulting patterns are used as the basis for classification. BOSS and several other methods use Symbolic Fourier Approximation (SFA) (Schäfer, 2015), which involves:
• passing a sliding window over the input time series;
• applying the Fourier transform to the values in the window, dropping the high-frequency coefficients (in effect, smoothing the values in the window using a low-pass filter); and
• assigning the remaining values to one of four letters to form words.
The counts of the resulting words are used to perform classification with a 1-nearest-neighbour (1NN) classifier. BOSS is a large ensemble of such classifiers using different hyperparameter configurations.
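The window-transform-quantise pipeline described above can be sketched roughly as follows. This is a simplified, SFA-like illustration only, not the actual BOSS implementation: the window length, word length, and the fixed quantisation boundaries are our assumptions (real SFA learns its boundaries from the data).

```python
import numpy as np

def sfa_like_words(x, window=16, word_len=4, alphabet="abcd"):
    """Slide a window over x, keep the lowest Fourier coefficients
    (a low-pass approximation of the window), and quantise each
    retained value to one of four letters."""
    bins = [-0.5, 0.0, 0.5]  # illustrative fixed boundaries (real SFA learns these)
    words = []
    for i in range(len(x) - window + 1):
        w = x[i:i + window]
        coeffs = np.fft.rfft(w)[:word_len]  # drop high-frequency coefficients
        vals = np.concatenate([coeffs.real, coeffs.imag])[:word_len]
        words.append("".join(alphabet[np.searchsorted(bins, v)] for v in vals))
    return words

x = np.sin(np.linspace(0, 8 * np.pi, 64))
counts = {}
for w in sfa_like_words(x):
    counts[w] = counts.get(w, 0) + 1  # bag-of-words counts fed to the classifier
```

The resulting dictionary of word counts is what the 1NN classifier (or, in later methods, a linear classifier) operates on.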
The resulting feature space is typically very large and very sparse (see Schäfer and Leser, 2017; Large et al, 2019; Le Nguyen and Ifrim, 2022), and the resulting patterns represent a high degree of approximation, as the input is both smoothed and quantised to a very small set of discrete values. In addition, for methods using SFA or a variation thereof, the patterns are formed over values in the frequency domain, rather than the original input.
Despite the broad similarities between many dictionary methods, different methods can produce very different results due to differences in the way that the input is approximated or quantised (Bagnall et al, 2017; Large et al, 2019). The most prominent dictionary methods are BOSS (Schäfer, 2015), cBOSS (Middlehurst et al, 2019), S-BOSS (Large et al, 2019), WEASEL (Schäfer and Leser, 2017) and, more recently, TDE (Middlehurst et al, 2020a) and MrSQM (Le Nguyen and Ifrim, 2022).
In contrast to BOSS, cBOSS randomly selects hyperparameter combinations for its ensemble, sets an upper limit on ensemble size, and uses a different ensemble weighting. cBOSS is considerably faster than BOSS, with approximately the same accuracy on the datasets in the UCR archive.
S-BOSS adds temporal information by recursively dividing the input time series into subseries, and forming dictionaries over the subseries. S-BOSS also uses a different distance measure for the 1NN classifiers. S-BOSS is more accurate than BOSS on the datasets in the UCR archive, but at considerable computational expense.
WEASEL selects Fourier coefficients using a statistical test, performs quantisation based on information gain, and uses a chi-squared test to perform feature selection. WEASEL uses the resulting features to train a logistic regression model. WEASEL is more accurate than BOSS or cBOSS on the datasets in the UCR archive. TDE incorporates the ensembling method from cBOSS, the temporal information and distance measure from S-BOSS, and the quantisation method from WEASEL. TDE is currently one of the two most accurate dictionary methods on the datasets in the UCR archive and, as noted above, is one of the four components of HC2.
MrSQM builds on an earlier method, MrSEQL (Le Nguyen et al, 2019). MrSQM uses a combination of random feature selection and feature selection via a chi-squared test, and uses the resulting features to train a logistic regression model. MrSQM is both significantly more accurate, and an order of magnitude faster, than MrSEQL. Along with TDE, MrSQM is one of the two most accurate dictionary methods on the datasets in the UCR archive.

Rocket, MiniRocket, and MultiRocket
The Rocket 'family' of methods, namely Rocket and its variants MiniRocket (Dempster et al, 2021) and MultiRocket (Tan et al, 2022), represents a seemingly quite different approach to time series classification.
Rocket transforms input time series using a large number of random convolutional kernels (by default, 10,000), and uses the transformed features to train a linear classifier. Rocket uses kernels with lengths selected randomly from {7, 9, 11}, weights drawn from N(0, 1), biases drawn from U(−1, 1), random dilation (on an exponential scale), and random padding. Rocket applies both global max pooling, and 'proportion of positive values' (PPV) pooling to the convolution output. The resulting features are used to train a ridge regression classifier or logistic regression (for larger datasets). The two most important aspects of Rocket in terms of accuracy are the use of dilation and PPV pooling. Rocket achieves state-of-the-art accuracy with a fraction of the computational expense of other methods of comparable accuracy, with the exception of MiniRocket and MultiRocket. Along with TDE, an ensemble of Rocket models, known as Arsenal, is one of the four components of HC2 (Middlehurst et al, 2021).
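The two pooling operations Rocket applies to each convolution output can be sketched as follows. This is a minimal NumPy illustration of a single dilated kernel with PPV and global max pooling, not the actual Rocket implementation; the function and variable names are ours.

```python
import numpy as np

def ppv_and_max(x, weights, bias, dilation):
    """Convolve x with one dilated kernel and return the two Rocket
    pooled features: proportion of positive values (PPV) and global max."""
    k = len(weights)
    span = (k - 1) * dilation  # effective kernel extent, minus one
    out = np.array([
        np.dot(x[i:i + span + 1:dilation], weights) + bias
        for i in range(len(x) - span)
    ])
    return (out > 0).mean(), out.max()

rng = np.random.default_rng(0)
x = rng.normal(size=100)
w = rng.normal(size=9)       # weights ~ N(0, 1); length drawn from {7, 9, 11}
b = rng.uniform(-1, 1)       # bias ~ U(-1, 1)
ppv, gmax = ppv_and_max(x, w, b, dilation=2)
```

In Rocket proper this is repeated for 10,000 random kernels, yielding 20,000 features (two per kernel) for the linear classifier.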
MiniRocket makes several key changes to Rocket in order to remove randomness and significantly speed up the transform. MiniRocket uses a fixed kernel length of 9, a small, fixed set of 84 kernels, bias values drawn from the convolution output, a fixed set of dilation values (fixed relative to the length of the input time series), and only produces PPV features. MiniRocket uses the same classifiers as Rocket. MiniRocket is significantly faster than any other method of comparable accuracy, including Rocket, and demonstrates that it is possible to achieve essentially the same accuracy as Rocket using a fixed set of nonrandom kernels.
MultiRocket represents a further extension of Rocket/MiniRocket, adding three additional pooling operations (in addition to PPV), as well as transforming both the original time series and the first-order difference. MultiRocket uses the same kernels and classifiers as MiniRocket. MultiRocket is currently the next-most-accurate method for time series classification after HC2 on the datasets in the UCR archive. MultiRocket achieves broadly similar accuracy to HC2 while being several orders of magnitude faster.
While seemingly very different, we show that there is, in fact, a simple connection between Rocket and dictionary methods, and that this connection provides the basis for performing fast and accurate dictionary-like classification using convolutional kernels.

Other State of the Art
There has been significant progress in terms of accuracy in time series classification since Bagnall et al (2017). In addition to Rocket and its variants, the most accurate current methods on the datasets in the UCR archive include TDE, MrSQM, DrCIF (Middlehurst et al, 2020b, 2021), InceptionTime (Ismail Fawaz et al, 2020), TS-CHIEF (Shifaz et al, 2020), and HC2 (see Bagnall et al, 2020; Middlehurst et al, 2021).
DrCIF builds on an earlier method, TSF, to extract multiple features, including 'catch22' features (Lubba et al, 2019), from random intervals. DrCIF takes intervals from the input time series, the first-order difference, and a periodogram. Like TDE and Arsenal, DrCIF is one of the components of HC2.
InceptionTime is an ensemble of deep convolutional neural networks based on the Inception architecture, and is currently the most accurate convolutional neural network model for time series classification on the UCR archive.
TS-CHIEF is an extension of an earlier method, ProximityForest (Lucas et al, 2019), and is an ensemble of decision trees using distance measures, intervals, and spectral splitting criteria.
HC2 supersedes earlier variants of HIVE-COTE (Lines et al, 2018;Bagnall et al, 2020), and is an ensemble comprising TDE, DrCIF, Shapelet Transform, and Arsenal.HC2 is currently the most accurate method for time series classification on the datasets in the UCR archive.
These methods, while highly accurate, are all burdened by high computational complexity. DrCIF takes almost 2 days to train and test on 112 datasets in the UCR archive, TDE more than 3 days, InceptionTime more than 3 days (using GPUs), HC2 approximately 2 weeks, and TS-CHIEF several weeks (Middlehurst et al, 2021). MrSQM is faster, taking approximately 3 hours to train and test on the same datasets using broadly comparable hardware. In comparison, Hydra completes the same task in approximately 36 minutes.

Overview
Hydra is a dictionary method which uses convolutional kernels, incorporating aspects of both Rocket and conventional dictionary methods.
Hydra involves transforming the input time series using a set of random convolutional kernels, arranged into g groups with k kernels per group, and then at each timepoint counting the kernels representing the closest match with the input time series for each group: see Figure 3. The counts are then used to train a linear classifier. (Note that there are g groups per dilation.) Like other dictionary methods, Hydra uses patterns which approximate the input, and produces features which represent the counts of those patterns. However, unlike typical dictionary methods, Hydra uses random patterns, represented by random convolutional kernels.

Fig. 3 Hydra convolves each input time series with a set of random convolutional kernels, organised into g groups with k kernels per group, and at each timepoint counts the kernels representing the closest match with the input time series for each group.
While the use of random patterns distinguishes Hydra from typical dictionary methods, the difference is not as radical as it might first appear. The patterns extracted by typical dictionary methods represent a considerable degree of approximation of the input time series: see Section 2.1. Although they represent a different form of approximation (randomisation, rather than smoothing and quantisation), the patterns identified and counted by Hydra are not necessarily 'more approximate' than the patterns used in typical dictionary methods.
Like Rocket, Hydra transforms the input time series using random convolutional kernels. However, unlike Rocket and its variants: (1) the kernels are organised into groups; and (2) Hydra counts the kernels in each group representing the closest match with the input at each timepoint. In effect, Hydra treats each kernel as a pattern in a dictionary, and treats each group as a dictionary. In a sense, the kernels in each group are forced to 'compete' in order to be counted at each timepoint.
The organisation of the kernels into groups, although seemingly simple, is what allows us to move between models more closely resembling dictionary methods and models more closely resembling Rocket, and has a decisive influence on the accuracy of the method. Further, the move away from producing PPV features represents a major departure from the Rocket paradigm. PPV pooling is a defining characteristic of the Rocket family of methods, and one of the most important factors allowing these methods to achieve high accuracy.
Listing 1 PyTorch-like pseudocode for Hydra. The transform involves convolving the input with the kernels, rearranging the output into g groups, performing an argmax (and/or max) operation, and incrementing the counts for the relevant kernels.

    # X : n input time series
    # W : g * k kernels
    def transform(X, W):
        indices = conv1d(X, W).reshape(n, g, k, -1).argmax(2)
        return zeros(n, g, k).scatter_add_(-1, indices, ones_like(indices))

The key hyperparameters for Hydra are:
• the characteristics of the kernels: for simplicity, Hydra largely inherits the characteristics of the kernels from Rocket and its variants (Section 3.2);
• the way in which the kernels are counted: by default, at each timepoint Hydra counts both the kernel with the maximum response and the kernel with the minimum response, using both 'hard' and 'soft' counting, as explained below (Section 3.3);
• the number of groups and the number of kernels per group: by default, Hydra uses 64 groups (g = 64) with 8 kernels per group (k = 8), for a total of k × g = 512 kernels per dilation (Section 3.4); and
• whether or not to include the first-order difference: by default, Hydra acts on both the original time series as well as the first-order difference (Section 3.5).
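Listing 1 is written against PyTorch. An equivalent, self-contained NumPy sketch of the same counting transform (hard counting of the maximum response only, a single dilation, no padding; the function and variable names are ours) is:

```python
import numpy as np

def hydra_transform(X, W, g, k):
    """Count, for each group, how often each kernel has the maximum
    response at a timepoint. X: (n, length); W: (g * k, kernel_len)."""
    n, length = X.shape
    T = length - W.shape[1] + 1          # timepoints in the convolution output
    counts = np.zeros((n, g, k))
    for i in range(n):
        # convolution output for all g * k kernels: shape (g * k, T)
        out = np.array([np.convolve(X[i], w[::-1], mode="valid") for w in W])
        # winning kernel (index within its group) at each timepoint: (g, T)
        winners = out.reshape(g, k, T).argmax(axis=1)
        for gi in range(g):
            for t in range(T):
                counts[i, gi, winners[gi, t]] += 1
    return counts.reshape(n, g * k)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 50))
W = rng.normal(size=(2 * 8, 9))          # g = 2 groups of k = 8 kernels, length 9
F = hydra_transform(X, W, g=2, k=8)      # (4, 16) feature matrix
```

Because exactly one kernel per group is counted at each timepoint, the counts within each group sum to the number of timepoints, which is the 'competition' described in the text.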
In setting the values of these hyperparameters, we have restricted ourselves to the same subset of 40 'development' datasets (default training/test splits) from the UCR archive per Dempster et al (2020, 2021), in order to avoid overfitting the entire archive.
Hydra is conducive to a very simple implementation: see the pseudocode in Listing 1. We implement Hydra using PyTorch (Paszke et al, 2019). We use the ridge regression classifier from scikit-learn (Pedregosa et al, 2011), and logistic regression implemented using PyTorch. Our code and full results will be made available at https://github.com/angus924/hydra.
Bias values, closely linked to the PPV features produced by Rocket and its variants, are not used at all in Hydra. As such, we do not have the same flexibility to 'spread' features over different dilations (e.g., by assigning more bias values to smaller or larger dilations), or to increase or decrease the total number of features (e.g., by increasing or decreasing the total number of bias values). Accordingly, Hydra uses a simplified set of dilations in the range {2^0, 2^1, 2^2, ...} (such that the maximum effective length of a kernel including dilation is the length of the input time series), and does not use a fixed feature count (cf. approx. 10,000 for MiniRocket, 20,000 for Rocket, and approx. 50,000 for MultiRocket). Additionally, for simplicity, Hydra always uses padding (zeros are added to the start and end of each time series such that the convolution operation begins with the middle element of the kernel centred on the first element of the series and ends with the middle element of the kernel centred on the last element of the series). Using a fixed kernel length, simplified dilation, and consistent padding greatly simplifies the organisation of kernels into groups. Always padding has the additional advantage of allowing the transform to work by default with variable-length time series.
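The simplified dilation scheme, powers of two up to the point where a dilated kernel no longer fits within the input, can be computed as follows (a sketch; we assume the fixed kernel length of 9, matching MiniRocket, and the function name is ours):

```python
import numpy as np

def hydra_dilations(input_length, kernel_length=9):
    """Dilations {2^0, 2^1, 2^2, ...} such that the maximum effective
    kernel length, (kernel_length - 1) * dilation + 1, does not exceed
    the input length."""
    max_exponent = int(np.log2((input_length - 1) / (kernel_length - 1)))
    return 2 ** np.arange(max_exponent + 1)
```

For example, an input of length 100 yields dilations {1, 2, 4, 8}: dilation 8 gives an effective kernel length of 65, while dilation 16 would give 129, exceeding the input.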
As noted above, there are g groups per dilation. In this sense, dilation acts as a kind of super grouping of kernels. Accordingly, the total number of kernels (k × g × d, where d is the number of dilations) will grow logarithmically with time series length and, for very long time series, it may be desirable to, e.g., subsample the dilation values, use a smaller number of groups (per dilation), and/or downsample the input. However, we have not used any such restrictions in any of the experiments presented in this paper.
We normalise the kernel weights by subtracting the mean and dividing by the sum of the absolute values. This is to ensure that one kernel is not counted over another kernel because of a spurious difference in the mean or magnitude of the kernel weights. (The scale of the weights is unimportant, provided that the weights for the kernels in each group are on the same scale.)

It is possible to perform the transform using a smaller set of kernels, e.g., the MiniRocket kernels, and to form the groups by resampling the convolution output, i.e., such that the convolution output for each kernel may appear in more than one group. This could improve speed and scalability, and potentially provide the basis for a deterministic transform. However, initial experimentation suggests that accuracy decreases as the number of unique kernels decreases.
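The weight normalisation described here is straightforward to express (a sketch; the function name and the axis convention, one kernel per row, are ours):

```python
import numpy as np

def normalise_kernels(W):
    """Subtract the mean and divide by the sum of absolute values, per
    kernel, so that no kernel wins the argmax merely because its weights
    have a different mean or magnitude than those of the other kernels."""
    W = W - W.mean(axis=-1, keepdims=True)
    return W / np.abs(W).sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Wn = normalise_kernels(rng.normal(size=(16, 9)))
```

After normalisation every kernel has zero mean and unit absolute sum, so the maximum response at a timepoint reflects the shape of the kernel rather than its scale.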
As noted above, MiniRocket and MultiRocket achieve high accuracy without using random kernels. This suggests that it is not necessarily the 'randomness' of the kernels used in Hydra which is important. It is possible that a larger set of MiniRocket-like kernels would achieve similar accuracy to the random kernels used in Hydra. However, it is not clear whether a fixed grouping or predefined resampling would be effective, or whether the resampling procedure should remain stochastic, even for nonrandom kernels. We leave the exploration of these questions for future work.

Maximum vs Minimum Response
By default, at each timepoint Hydra counts both the kernel with the maximum response (output value), i.e., the kernel representing the closest match with the input time series, and the kernel with the minimum response.

Unlike Rocket and its variants, where Hydra counts a kernel at a given timepoint, the information for the other kernels at that timepoint is discarded. Counting the kernel with the minimum response, in addition to the kernel with the maximum response, preserves more of the information in the convolution output without requiring additional work in terms of the transform, and improves accuracy: see Section 4.3. Note, however, that where k = 1, counting both the minimum response and the maximum response is redundant: they are the same.
The kernel with the minimum response does not represent the 'poorest match' to the input time series, but rather an alternative closest match. (The 'poorest match' would be the kernel with the lowest-magnitude response.) Counting both the kernel with the maximum response and the kernel with the minimum response is equivalent to looking for the closest match twice: once with the given set of kernels, W, and once with the 'inverted' set of kernels, −W. (As the kernels are random, the sign of the kernels is arbitrary.)

Hard (argmax) vs Soft (max) Counting
Hydra uses two different forms of counting:
• 'hard' counting: incrementing the count of the kernel with the maximum (and/or minimum) response at each timepoint; and
• 'soft' counting: accumulating the value of the maximum (and/or minimum) response for the kernel with the maximum (and/or minimum) response at each timepoint.
Hard counting involves an argmax (or argmin) operation over the channels in the convolution output for each group (one channel per kernel), returning the index of the kernel with the maximum (or minimum) response at each timepoint. This is then used to increment the counts for the relevant kernels.
Soft counting involves a max (or min) operation over the same channels, returning the maximum (or minimum) response at each timepoint. This is then accumulated for the relevant kernels.
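Given the convolution output for one group, the two forms of counting differ only in what is accumulated for the winning kernel (a NumPy sketch for the maximum response; the names are ours):

```python
import numpy as np

def hard_and_soft_counts(out):
    """out: (k, T) convolution output for one group of k kernels.
    Hard: increment the winning kernel's count at each timepoint.
    Soft: accumulate the winning kernel's response instead."""
    k, T = out.shape
    winners = out.argmax(axis=0)   # index of the max-response kernel per timepoint
    hard = np.zeros(k)
    soft = np.zeros(k)
    for t in range(T):
        hard[winners[t]] += 1
        soft[winners[t]] += out[winners[t], t]
    return hard, soft

rng = np.random.default_rng(0)
out = rng.normal(size=(8, 100))
hard, soft = hard_and_soft_counts(out)
```

The hard counts always sum to the number of timepoints, and the soft counts sum to the total of the per-timepoint maximum responses, which is why the two capture broadly similar information.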
Although superficially different, both hard and soft counting capture broadly similar information. Soft counts can be expected to be roughly proportional to hard counts, and vice versa. Where a kernel predominates, it will have a higher relative count, and the sum of the response (where that kernel is counted) is also likely to be relatively large.
By default, Hydra uses both hard counting and soft counting: soft counting for the maximum response, and hard counting for the minimum response. (The allocation of soft counting to the maximum response and hard counting to the minimum response is essentially arbitrary.) Using both soft and hard counting is more accurate than using either alone: see Section 4.3.2.

Clipping
By default, Hydra counts the kernel with the maximum response whether or not the maximum response is positive (and, likewise, counts the kernel with the minimum response whether or not the minimum response is negative).
It is possible to 'clip' the values in the convolution output, such that the kernel with the maximum response is only counted when the maximum response is positive (and, likewise, the kernel with the minimum response is only counted when the minimum response is negative). This is essentially equivalent to passing the convolution output through a ReLU function.
Without clipping, the features produced by the most Rocket-like variant (i.e., k = 1) are uninformative:
• hard counting implies counting every timepoint for every kernel, such that all the features are the same, i.e., a value representing the length of the input time series; and
• soft counting produces features which are approximately zero, as positive and negative values in the convolution output will 'cancel each other out'.
With clipping, the features produced where k = 1 are equivalent to PPV (hard counting) and global average pooling (soft counting): see Section 3.4.1. However, clipping is of little practical significance. While it improves accuracy for k = 1, it has little or no effect for k ≥ 2, and has no effect on the optimal values of g and k.

Number of Groups (g) & Kernels Per Group (k)
The number of groups (g) or, equivalently, the number of kernels per group (k), controls the extent to which Hydra more closely resembles dictionary methods or more closely resembles Rocket in two senses:
• the features produced by the transform: whether the features are more like the features produced by dictionary methods or more like the features produced by Rocket; and
• the level of approximation provided by the random kernels: whether the patterns identified and counted by Hydra are more, or less, representative of the input time series.

Features
With multiple kernels per group (k > 1), Hydra produces counts like typical dictionary methods. However, as the number of kernels per group decreases, in particular, as the number of kernels per group approaches one (k → 1), the features produced by the transform undergo a qualitative change.
Where k = 1, provided that at each timepoint a kernel is only counted if the maximum response is positive (or the minimum response is negative), Hydra produces PPV and/or global average pooling features:
• hard counting involves incrementing the count for every kernel wherever the convolution output is positive, i.e., PPV; and
• soft counting involves accumulating the output values for each kernel wherever the convolution output is positive, i.e., global average pooling.
In other words, where k = 1, the features produced by Hydra no longer represent counts, but instead represent the average response of every kernel across all timepoints in the form of PPV and/or global average pooling. In this sense, k = 1 represents the most Rocket-like variant of Hydra.
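This k = 1 equivalence can be checked numerically: with clipping, the hard count for a kernel is the number of positive outputs (PPV up to a factor of the output length), and the soft count is the sum of the positive outputs (global average pooling up to the same factor). A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
out = rng.normal(size=200)     # convolution output of a single kernel (k = 1)

clipped = np.maximum(out, 0)   # clip: only count where the response is positive

hard = (clipped > 0).sum()     # hard counting with k = 1 ...
soft = clipped.sum()           # ... and soft counting with k = 1

assert hard == (out > 0).sum()               # = PPV * output length
assert np.isclose(soft, out[out > 0].sum())  # = global average pooling * output length
```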
However, as noted above, Hydra only produces PPV and/or global average pooling features where k = 1, and then only where the values in the convolution output are 'clipped'. Additionally, the most Rocket-like variant of Hydra is not the most accurate variant of Hydra, whether or not clipping is used. This may be, in part, because the effectiveness of PPV features is closely linked to the use of a large number of bias values and Hydra, in contrast to Rocket and its variants, does not use bias values at all.

Approximation
A large number of kernels per group (i.e., a small number of groups) implies a low degree of approximation. The larger the number of kernels per group, the more kernels are 'competing' to have the maximum (or minimum) response at each timepoint, and the higher the probability that the kernels identified and counted by Hydra closely match the input time series: see Figure 4. (With an infinite number of kernels per group, the kernel with the maximum response at each timepoint would be expected to exactly match the input time series.) In this sense, with a larger number of kernels per group, the patterns identified and counted by Hydra are more like patterns extracted from the input as in typical dictionary methods.
A small number of kernels per group implies a high degree of approximation. The smaller the number of kernels per group, the fewer kernels are 'competing', and the lower the probability that the kernels identified and counted by Hydra match the input time series. As the number of kernels per group decreases, the patterns identified and counted by Hydra become increasingly approximate in the sense of becoming increasingly 'random', i.e., unrelated to the input time series.

First-Order Difference
Like MultiRocket and DrCIF, Hydra operates on both the original input time series and the first-order difference. This is achieved by assigning half of the groups (i.e., by default, 32 of the 64 groups) to operate on the first-order difference. This allows the transform to incorporate the first-order difference without introducing additional features or computation. Adding the first-order difference increases the accuracy of every variant of Hydra: see Section 4.3.2.
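The point that no extra features are introduced is simply that the difference series replaces, rather than augments, the input for half of the groups (a sketch; the function name and return structure are ours):

```python
import numpy as np

def split_inputs(X, g=64):
    """Assign half the groups to the original series and half to the
    first-order difference; the total number of groups (and hence the
    number of features) is unchanged."""
    return {"original": (X, g // 2),
            "difference": (np.diff(X, axis=-1), g // 2)}

X = np.random.default_rng(0).normal(size=(3, 50))
parts = split_inputs(X)
```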

Classifier
Hydra uses the same classifiers as Rocket and its variants, i.e., a ridge regression classifier or logistic regression (for larger datasets, i.e., where the number of training examples is greater than approx. 10,000). In order to limit peak memory usage, by default the transform is performed in batches. As for Rocket and its variants, the Hydra transform can be naturally integrated into the minibatch updates for logistic regression trained using stochastic gradient descent or similar.

Experiments
We evaluate Hydra on the datasets in the UCR archive (Section 4.1), demonstrating that Hydra is more accurate than the most accurate existing dictionary methods, TDE and MrSQM, and achieves similar accuracy to several of the most accurate current methods for time series classification. We further demonstrate the accuracy of Hydra on a number of larger datasets (Section 4.2). We also explore the effect of key hyperparameters in terms of the number of groups (g) and kernels per group (k), the way in which the kernels are counted, whether or not to include the first-order difference, and whether or not to 'clip' the convolution output (Section 4.3).

UCR Archive
We evaluate Hydra on the datasets in the UCR archive (Dau et al, 2019). We compare Hydra against several prominent dictionary methods, and against the most accurate current methods for time series classification on the datasets in the UCR archive, namely, TDE, MrSQM, DrCIF, Rocket and its variants, InceptionTime, TS-CHIEF, and HC2. For direct comparability with other published results, we evaluate Hydra on the same 30 resamples of 112 datasets from the UCR archive per Middlehurst et al (2020a,b, 2021). The total compute time for Hydra on all 112 datasets, averaged over 30 resamples, is approximately 36 minutes using a single CPU core.

Fig. 6 Mean rank of Rocket and its variants, with and without Hydra, over the same 30 resamples of 112 datasets.
Figure 1 (page 1) shows the mean rank of Hydra versus several prominent dictionary methods, namely, cBOSS, S-BOSS, WEASEL, TDE, and MrSQM. Hydra is faster and, on average, more accurate than all other dictionary methods, including TDE and MrSQM. This ranking is performed on only 106 of the 112 datasets, as results for S-BOSS and WEASEL are unavailable for the six largest datasets due to computational constraints (see Middlehurst et al, 2020a). We have obtained the results for MrSQM using the implementation available at https://github.com/mlgig/mrsqm, with the hyperparameter configuration as set out in Le Nguyen and Ifrim (2022).
Figure 5 shows the mean rank of Hydra and MultiRocket+Hydra versus the most accurate current methods for time series classification over 30 resamples of 112 datasets from the UCR archive. Hydra is combined with MultiRocket by simply concatenating the features produced by each transform. On average, Hydra is more accurate (i.e., is more accurate on more than half the datasets) than TDE, MrSQM, and DrCIF, and achieves broadly similar accuracy to Rocket, InceptionTime, and TS-CHIEF, but is less accurate than HC2. On average, MultiRocket+Hydra is not significantly less accurate than HC2.
Figure 6 shows the mean rank of Rocket and its variants, with and without Hydra. (The pairwise differences between MultiRocket, Rocket+Hydra, and MiniRocket+Hydra are not statistically significant. Similarly, the pairwise differences between Rocket+Hydra, MiniRocket+Hydra, and MultiRocket+Hydra are not statistically significant, as indicated by the orange line joining the relevant methods.) The addition of Hydra significantly improves the accuracy of both Rocket and MiniRocket. Rocket+Hydra and MiniRocket+Hydra both achieve similar accuracy to MultiRocket. This is consistent with results for a number of larger datasets: see Section 4.2. While MultiRocket+Hydra is, on average, more accurate than MultiRocket, this is largely an artefact of the large pairwise advantage of MultiRocket+Hydra over MultiRocket. The actual differences in accuracy are mostly very small. This is not necessarily surprising, as MultiRocket is already one of the two most accurate methods for time series classification on the datasets in the UCR archive, and already produces a large number of features.
Figure 7 shows the pairwise accuracy of Hydra versus the two most accurate existing dictionary methods for time series classification: TDE (left) and MrSQM (right). Hydra is more accurate than TDE on 73 datasets, and less accurate on 37. Similarly, Hydra is more accurate than MrSQM on 69 datasets, and less accurate on 42.
Figure 8 shows the pairwise accuracy of Hydra versus Rocket (left), and an alternative configuration of Hydra using a single group (right). (This is the most dictionary-like variant of Hydra, using a single group and hard counting.) Hydra is more accurate than Rocket on 56 datasets, and less accurate on 53. The default variant of Hydra (k = 8/g = 64) is more accurate than the most dictionary-like variant of Hydra (k = 512/g = 1) on 95 datasets, and less accurate on 15.
Figure 9 shows the pairwise accuracy of Hydra (left) and MultiRocket+Hydra (right) versus HC2. Hydra is more accurate than HC2 on 27 datasets and less accurate on 80. MultiRocket+Hydra is more accurate than HC2 on 48 datasets and less accurate on 57 (compared to 45/62 for MiniRocket+Hydra and 44/62 for Rocket+Hydra).
While the pairwise comparison between Hydra and HC2 is heavily in favour of HC2, the actual differences in accuracy are mostly relatively small. The magnitude of the pairwise difference between Hydra and HC2 is: less than 5% for 90% of datasets (94% for MultiRocket+Hydra); less than 2.5% for 72% of datasets (85% for MultiRocket+Hydra); and less than 1% for 53% of datasets (57% for MultiRocket+Hydra). In contrast, the difference in computational cost is enormous. Hydra, like Rocket and its variants, requires only a small fraction of the compute time of other methods of comparable accuracy: less than 1% of the compute time of TDE, and less than 0.2% of the compute time of HC2. In other words, the advantage of HC2 over Hydra in terms of accuracy comes at a more than 500× computational cost.

We conduct the sensitivity analysis using the subset of 40 'development' datasets (default training/test splits) from the UCR archive per Dempster et al (2020, 2021): see Section 3. Results are mean results over 10 runs.

Number of Groups (g) & Kernels per Group (k)
Figure 10 shows the relationship between accuracy and the number of groups (g), and kernels per group (k), for the most accurate variant of Hydra (counting both the maximum and minimum response, using both soft and hard counting, including the first-order difference, and no clipping: see Sections 4.3.2 & 4.3.3). Accuracy is represented by mean rank: lower mean rank corresponds to higher accuracy. For each dataset, we rank the accuracy of each combination of k and g, then take the mean rank over all datasets.
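This mean-rank computation can be sketched as follows (a simplified illustration with made-up accuracies; ties receive the average of the tied ranks, as is standard):

```python
import numpy as np

def mean_ranks(acc):
    # acc: (n_datasets, n_configs) matrix of accuracies. For each
    # dataset, rank the configurations so that the most accurate gets
    # rank 1 (ties averaged), then take the mean rank over datasets.
    n_datasets, n_configs = acc.shape
    ranks = np.empty_like(acc, dtype=float)
    for i in range(n_datasets):
        order = np.argsort(-acc[i])  # descending accuracy
        r = np.empty(n_configs)
        r[order] = np.arange(1, n_configs + 1)
        for v in np.unique(acc[i]):  # average the ranks of tied values
            mask = acc[i] == v
            r[mask] = r[mask].mean()
        ranks[i] = r
    return ranks.mean(axis=0)
```

Lower values of the resulting mean rank correspond to higher accuracy, as in Figure 10.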
Figure 10 (left) shows that 64 groups of 8 kernels (k = 8/g = 64) produces the highest accuracy. Interestingly, this result is consistent across every hyperparameter configuration: counting the maximum response or both the maximum response and minimum response, soft, hard, or both soft and hard counting, with or without the first-order difference, and with or without clipping. Additional results for the same model with and without the first-order difference, and with and without clipping, are shown in Appendix A.
Figure 10 (right) shows the same mean ranks versus k. Each line represents a different total number of kernels. This shows that, except for a very small total number of kernels, the optimal value of k is approximately 8, suggesting that for any given 'budget' (k × g), it is more effective to increase the number of groups, keeping the number of kernels per group at approximately 8.
This also shows that, in general, adding more kernels increases accuracy. For any number of kernels per group (k), a larger number of groups (g), and thus a larger total number of kernels (k × g), is more accurate. The number of kernels represents a tradeoff between accuracy and computational efficiency. While increasing the total number of kernels tends to increase accuracy, the magnitude of the improvement decreases as the number of kernels increases. Although there is a clear difference in rank between k × g = 256 and k × g = 512, the actual differences in accuracy are small: see Figure 13, Appendix A. Increasing the total number of kernels in order to meaningfully increase accuracy beyond k × g = 512 (e.g., k × g = 1024) would involve considerable additional computational burden for little practical gain.
Additionally, limiting the total number of kernels per dilation (k × g) to 512 means that the total number of features per dataset is less than 10,000 (at least for the 40 'development' datasets) or, in other words, roughly commensurate with MiniRocket.For these reasons, and given that Hydra is already approximately as accurate as Rocket, we set k × g = 512 as the maximum number of kernels per dilation.
Different values of k represent different levels of approximation in the kernels: see Section 3.4.2. Figure 10 shows that a relatively high degree of approximation is desirable, broadly consistent with the high level of approximation typical of dictionary methods. The most accurate variants of Hydra are high-g/low-k models, i.e., using a large number of small groups rather than a small number of large groups. This implies that it is preferable to combine the information from many approximate matches. A large number of small groups also has the benefit of low variability: see Figure 14, Appendix A.
We also observe that, as k increases, there is increasing sparsity, i.e., an increasing proportion of kernels which are never counted, although there is large variability between datasets: see Figure 15, Appendix A. For the optimal combination of k and g (i.e., k = 8/g = 64), Hydra produces dense features.
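Sparsity in this sense can be measured as the proportion of kernels whose counts are zero across every example; a minimal sketch (the array and function names are illustrative):

```python
import numpy as np

def kernel_sparsity(counts):
    # counts: (n_examples, n_kernels) array of per-kernel counts from
    # the transform; a kernel is 'never counted' if its column is zero
    # for every example in the dataset.
    return np.mean(counts.sum(axis=0) == 0)
```

With dense features (low sparsity), most kernels contribute information to the classifier, as observed for k = 8/g = 64.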

Method of Counting and First-Order Difference
Figure 11 shows the mean rank of different variants of Hydra, with and without the first-order difference (for the most accurate combination of k/g, being k = 8/g = 64 in all cases, and in all cases not using clipping). Figure 11 shows that counting both the maximum and minimum responses is more accurate than only counting the maximum response, that using both soft and hard counting is more accurate than using either soft or hard counting alone, and that including the first-order difference always improves accuracy.

Clipping
Figure 12 shows the mean rank of different variants of Hydra, with and without clipping (for the most accurate combination of k/g, being k = 8/g = 64 in all cases, and in all cases including the first-order difference). Figure 12 shows that clipping has very little effect on accuracy for optimal values of k/g.

Conclusion
We demonstrate a simple connection between dictionary methods for time series classification, which extract and count symbolic patterns in time series, and methods based on transforming the input time series using convolutional kernels, namely Rocket. This provides the basis for fast and accurate dictionary-like classification using convolutional kernels. We present Hydra, a simple, fast, and highly accurate dictionary method for time series classification, incorporating aspects of both Rocket and conventional dictionary methods. Hydra demonstrates the advantages of performing dictionary-like classification with convolutional kernels and random patterns. Hydra is faster and more accurate than the most accurate existing dictionary methods, and consistently improves the accuracy of Rocket and its variants when used in combination with these methods. In future work we plan to explore deterministic variants of Hydra, more sophisticated approaches to combining Hydra with other methods, and the application of Hydra to multivariate time series.

B.2 Training Details
For each method, the transform is performed in larger batches of 4,096 examples, further divided into minibatches for training. For MiniRocket and MultiRocket, we fit the bias values using the first such batch of 4,096 examples. We cache the transformed features in order to avoid unnecessarily repeating the transform when training for multiple epochs. We use the Adam optimizer (Kingma and Ba, 2015), and we use the same hyperparameters for all methods and datasets: a validation set of 2,048 training examples, a minibatch size of 256, and an initial learning rate of 10⁻⁴. The learning rate is halved if validation loss does not improve after 50 updates, and training is stopped if validation loss does not improve after 100 updates (but only after the first epoch).
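One plausible reading of this schedule, sketched as a small state machine (plain Python; the function and parameter names are illustrative, and the actual implementation may differ, e.g., in how repeated halving is handled):

```python
def make_scheduler(initial_lr=1e-4, halve_after=50, stop_after=100):
    # Tracks the best validation loss seen so far and the number of
    # consecutive updates without improvement; halves the learning rate
    # after `halve_after` such updates and signals early stopping after
    # `stop_after`.
    state = {"lr": initial_lr, "best": float("inf"), "since": 0}

    def step(val_loss):
        if val_loss < state["best"]:
            state["best"] = val_loss
            state["since"] = 0
        else:
            state["since"] += 1
            if state["since"] == halve_after:
                state["lr"] /= 2  # halve the learning rate
        return state["lr"], state["since"] >= stop_after

    return step
```

The returned `step` function would be called once per optimiser update with the current validation loss, yielding the learning rate to use and a flag indicating whether to stop training.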
We run the experiments on the same cluster referred to in Section 1, performing five separate runs per dataset for each method (results are mean results over those five runs), and using eight cores per dataset per run.

Fig. 5 Mean rank of Hydra and MultiRocket+Hydra in terms of accuracy versus other SOTA methods over 30 resamples of 112 datasets from the UCR archive.

Fig. 10 Mean rank (accuracy) versus the number of kernels per group (k), and the number of groups (g).

Fig. 11 Mean rank of different variants of Hydra, with and without first-order difference.

Fig. 12 Mean rank of different variants of Hydra, with and without clipping.

Table 3 Training times for Mini/Multi/Rocket and Hydra on the three largest datasets.

Table 4 Training times for Mini/Multi/Rocket+Hydra on the three largest datasets.