In this section, we evaluate MultiRocket on the datasets in the UCR univariate time series archive (Dau et al. 2018). We show that MultiRocket is significantly more accurate than its predecessor, MiniRocket, and not significantly less accurate than the current most accurate TSC classifier, HIVE-COTE 2.0. By default, MultiRocket generates 50,000 features. We show that even with 50,000 features, MultiRocket is only about 10 times slower than MiniRocket, but orders of magnitude faster than other current state-of-the-art methods. Our experiments also show that the smaller variant of MultiRocket with 10,000 features (the same number of features as MiniRocket) is as fast as MiniRocket while being significantly more accurate. Finally, we explore key design choices, including the choice of transformations, features and the number of features. These design choices are tuned on the same 40 “development” datasets as used in (Dempster et al. 2021, 2020) to reduce overfitting to the whole UCR archive.
MultiRocket is implemented in Python, compiled via Numba (Lam et al. 2015), and uses the ridge regression classifier from scikit-learn (Pedregosa et al. 2011). Our code and results are publicly available at https://github.com/ChangWeiTan/MultiRocket. All of our experiments were conducted on a cluster with an AMD EPYC 7702 CPU, using 32 threads and 64 GB of memory.
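For illustration, the following is a minimal sketch of this pipeline, assuming a `MultiRocket` transform with scikit-learn-style `fit_transform`/`transform` methods (the actual implementation is in the repository above, so the class and import here are hypothetical stand-ins); the ridge regression classifier with cross-validated regularisation follows the setup used throughout the Rocket family:

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

# from multirocket import MultiRocket  # hypothetical import from the repository

# Placeholder data standing in for a UCR dataset: 100 train and 50 test
# series of length 150, with 3 classes.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 150)), rng.normal(size=(50, 150))
y_train, y_test = rng.integers(3, size=100), rng.integers(3, size=50)

transform = MultiRocket(num_features=50_000)  # hypothetical transform class

# Ridge regression with cross-validated regularisation strength, as used by
# Rocket and MiniRocket.
classifier = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))
classifier.fit(transform.fit_transform(X_train), y_train)
accuracy = classifier.score(transform.transform(X_test), y_test)
```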
Comparing with current state of the art
First, we evaluate MultiRocket and compare it with the current most accurate TSC algorithms, namely HIVE-COTE 2.0, TS-CHIEF, InceptionTime, MiniRocket, Arsenal, DrCIF, TDE, STC and ProximityForest. These algorithms are chosen because they are the most accurate in their respective domains. ProximityForest represents the distance-based algorithms; STC represents shapelet-based algorithms; while TDE and DrCIF represent dictionary-based and interval-based algorithms respectively.
For consistency and direct comparability with the SOTA TSC algorithms, we evaluate MultiRocket on the same 30 resamples of 109 datasets from the UCR archive as reported and used in (Middlehurst et al. 2021; Dempster et al. 2021; Bagnall et al. 2020). Note that each resample creates a different distribution for the train and test sets. Each dataset is resampled by first pooling the original train and test sets, then performing a stratified split into new train and test sets, keeping the original train and test sizes for every resample.
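This resampling procedure can be sketched with standard scikit-learn tooling (the function name and signature here are ours, for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def resample(X_train, y_train, X_test, y_test, seed):
    """One stratified resample that preserves the original train/test sizes."""
    X = np.concatenate([X_train, X_test])
    y = np.concatenate([y_train, y_test])
    # Stratifying on the labels keeps the class distribution of each split
    # close to that of the pooled data for every resample.
    return train_test_split(
        X, y, train_size=len(y_train), stratify=y, random_state=seed
    )
```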
Figure 4 shows the average rank of MultiRocket against all the SOTA methods mentioned. The black line groups methods that do not have a pairwise statistical difference, using a two-sided Wilcoxon signed-rank test (\(\alpha =0.05\)) with Holm correction as the post-hoc test to the Friedman test (Demšar 2006). MultiRocket is on average significantly more accurate than most SOTA methods. The critical difference diagram of the top 5 algorithms in Fig. 1 shows that MultiRocket is significantly more accurate than its predecessor, MiniRocket, but not significantly less accurate than HIVE-COTE 2.0, TS-CHIEF and InceptionTime, all of which are ensemble-based algorithms. Note that MultiRocket is one of the few non-ensemble algorithms to have achieved SOTA accuracy. Appendix C shows pairwise comparisons of some SOTA algorithms.
Figure 5 shows the pairwise comparisons and statistical significance for the top SOTA methods. Each cell in the matrix shows the wins, draws and losses on the first row, and the p-value of the two-sided Wilcoxon signed-rank test below. Values in bold indicate that the two methods are significantly different after applying Holm correction. Overall, as expected and pointed out in (Middlehurst et al. 2021), HIVE-COTE 2.0 is significantly more accurate than most other methods, with p-values much less than 0.001 even after applying Holm correction. MultiRocket is the only method with a p-value larger than 0.001, and it is not significantly different from HIVE-COTE 2.0 after applying Holm correction. The figure also shows that MultiRocket is significantly more accurate than most other methods.
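A minimal sketch of this significance testing follows: pairwise two-sided Wilcoxon signed-rank tests over per-dataset accuracies, with Holm correction applied across all pairwise comparisons. The accuracy values below are random placeholders.

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
accuracies = {  # classifier -> per-dataset accuracies, aligned by dataset
    name: rng.uniform(0.6, 1.0, size=109)
    for name in ["MultiRocket", "HIVE-COTE 2.0", "InceptionTime"]
}

pairs = list(combinations(accuracies, 2))
p_values = [wilcoxon(accuracies[a], accuracies[b]).pvalue for a, b in pairs]
# Holm correction controls the family-wise error rate across all pairs.
reject, p_holm, _, _ = multipletests(p_values, alpha=0.05, method="holm")
for (a, b), p, significant in zip(pairs, p_holm, reject):
    print(f"{a} vs {b}: corrected p = {p:.4f}, significant = {significant}")
```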
Although HIVE-COTE 2.0 is more accurate than MultiRocket, winning on 59 of the 109 datasets, the difference in accuracy between the two methods lies within \(\pm 5\%\), as shown in Fig. 6a, indicating that there is relatively little difference between them. On the other hand, MultiRocket and InceptionTime are not significantly different from each other, although the wins for MultiRocket tend to be larger, as depicted in Fig. 6b. For instance, MultiRocket achieves its largest win over InceptionTime on the SemgHandMovementCh2 dataset, with accuracies of 0.792 and 0.551 respectively, while InceptionTime achieves its largest win over MultiRocket on the PigAirwayPressure dataset, with accuracies of 0.922 and 0.647. The large variance in the difference in accuracy between MultiRocket and InceptionTime implies that each method has its own strengths, and that MultiRocket can potentially be improved on datasets where InceptionTime performs much better.
HIVE-COTE 2.0, TS-CHIEF and InceptionTime are able to capture aspects of the time series that MultiRocket does not. This shows the importance of classifier diversity for achieving high classification accuracy. However, as shown in Fig. 2a, MultiRocket takes only 5 min (using 32 threads) to complete training and classification on all 109 datasets, at least an order of magnitude faster than HIVE-COTE 2.0, TS-CHIEF and InceptionTime.
As seen in both Fig. 6a and b, MultiRocket performed worst on the PigAirwayPressure dataset, with the largest accuracy differences of 0.308 and 0.275 compared with HIVE-COTE 2.0 and InceptionTime respectively. Rocket also performed poorly on this dataset, which (Dempster et al. 2021) attributes to the way the bias values are sampled. This issue was mitigated in MiniRocket by sampling the bias values from the convolution output, instead of from a uniform distribution \(U(-1,1)\) as in Rocket (Dempster et al. 2020). MultiRocket samples different sets of biases for the base and first order difference series; it is possible that the first order difference series gives rise to the poor performance on this dataset.
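A simplified sketch of this bias-sampling strategy is shown below: biases are drawn from quantiles of the convolution output of a randomly chosen training series, rather than from \(U(-1,1)\). Dilation and MiniRocket's exact quantile scheme are omitted for brevity, so this is illustrative rather than a faithful reimplementation.

```python
import numpy as np

def sample_biases(X_train, kernel, num_biases, rng):
    example = X_train[rng.integers(len(X_train))]  # random training series
    output = np.convolve(example, kernel, mode="valid")
    # Random quantiles of the actual output values, so every bias splits the
    # output at a meaningful level (unlike a fixed uniform range).
    return np.quantile(output, rng.uniform(0, 1, size=num_biases))
```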
Runtime analysis
The addition of the first order difference transform and the 3 additional features increases the total compute time relative to MiniRocket. Figure 7a and b show the total compute time (training and testing) of both MultiRocket and MiniRocket with 10,000 and 50,000 features, using an AMD EPYC 7702 CPU with a single thread. The default MultiRocket with 50,000 features is about an order of magnitude slower than the default MiniRocket with 10,000 features. Compared using the same number of features (50,000), MultiRocket is only 4 times slower than MiniRocket, which is expected, as MultiRocket computes four features per kernel instead of one. Taking approximately 40 min to complete all 109 datasets, MultiRocket is still significantly faster than all other SOTA methods, as shown in Table 2. Running MultiRocket with 32 threads reduces this time to 5 min, as shown in Fig. 2a; hence we recommend using MultiRocket in a multi-threaded setting. Note that MultiRocket with 10,000 features is significantly more accurate than MiniRocket, as shown in Appendix D.
All the other SOTA methods have long run times, as reported in (Middlehurst et al. 2021). We took the total train times on 112 UCR datasets from (Middlehurst et al. 2021) and show them in Table 2, together with variants of MultiRocket and MiniRocket with 10,000 and 50,000 features for comparison. As expected, MiniRocket is the fastest, taking just under 3 min to train, followed by MultiRocket at around 16 min. Rocket took approximately 3 h to train, while Arsenal, an ensemble of Rocket classifiers, took 28 h. The fastest non-Rocket algorithm is DrCIF, taking about 2 days to train, followed by TDE with 3 days. Finally, the collective ensembles are the slowest, taking at least 14 days to train. Note that the time for InceptionTime is not directly comparable, as it was trained on a GPU.
Table 2 Run time to train a single resample of the 112 UCR problems. MultiRocket and MiniRocket variants are run with a single thread on a cluster using an AMD EPYC 7702 CPU. Times for the other algorithms are as reported in (Middlehurst et al. 2021)

Ablation study
So far, we have shown that MultiRocket performs well overall. In this section, we explore the effect of key design choices for MultiRocket: (A) selecting the time series representations, (B) selecting the set of pooling operators, and (C) increasing the number of features.
Time series representations
We explore the effect of the different representations using MiniRocket as the baseline. We consider the first and second order differences, which estimate the derivatives of the time series, and the periodogram, which captures information about the frequencies present in the time series. Figure 8 compares the different combinations of all 4 representations (including the base time series) for MiniRocket. The figure shows that using any representation alone does not improve accuracy, as some information is inevitably lost during the transformation process. However, combining the base series with any of the other representations improves MiniRocket, with the first order difference being the most accurate. This indicates that adding diversity to MiniRocket by combining different time series representations with the base time series improves its performance.
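These representations can be computed directly with NumPy/SciPy, as sketched below for an illustrative series `x`; the exact setup within MiniRocket/MultiRocket follows the implementation in the accompanying repository.

```python
import numpy as np
from scipy.signal import periodogram

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 512)) + 0.1 * rng.normal(size=512)

first_diff = np.diff(x, n=1)    # estimates the first derivative, length 511
second_diff = np.diff(x, n=2)   # estimates the second derivative, length 510
freqs, power = periodogram(x)   # power at each frequency present in x
```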
We then performed the same experiment on MultiRocket and observed similar results, as shown in Fig. 9. We used the smaller variant of MultiRocket to be comparable with MiniRocket. In this case, comparing the base versions (MiniRocket and MultiRocket (10k) base) shows that adding the 3 additional pooling operators also improves the discriminating power of MiniRocket, as discussed in Appendix D.
Pooling operators
The previous section showed that applying convolutions to the base and first order difference series improves the discriminating power of MiniRocket and MultiRocket, hence this combination is chosen as the default for MultiRocket. Now, we explore the effect on classification accuracy of the different combinations of pooling operators used by each kernel. Figure 10 compares the different pooling operator combinations of MultiRocket with 10,000 features against the baseline MiniRocket and MiniRocket with base and first order difference series. The results show that the variant using all pooling operators performs best overall, confirming our justification for using all four pooling operators in Sect. 3. Figure 10 also shows that PPV is a strong feature: most of the combinations did not perform better than PPV alone, and each of the other pooling operators used alone performed significantly worse than PPV.
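A sketch of the four pooling operators follows, applied to the output of a single convolution and following the definitions in Sect. 3; `conv_output` is assumed to already have the bias subtracted, and edge cases (no positive values) are handled in a simplified way, with the exact conventions in the accompanying repository.

```python
import numpy as np

def pooling_features(conv_output: np.ndarray) -> np.ndarray:
    positive = conv_output > 0
    ppv = positive.mean()  # PPV: proportion of positive values
    mpv = conv_output[positive].mean() if positive.any() else 0.0  # MPV: mean of positive values
    mipv = np.flatnonzero(positive).mean() if positive.any() else -1.0  # MIPV: mean of indices of positive values
    lspv = 0  # LSPV: longest stretch of consecutive positive values
    run = 0
    for is_positive in positive:
        run = run + 1 if is_positive else 0
        lspv = max(lspv, run)
    return np.array([ppv, mpv, mipv, lspv])
```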
Number of features
The default setting of MultiRocket uses the combination of the base and first order difference series and extracts 4 features per convolution kernel. In this section, we explore the effect of increasing the number of features in MultiRocket. Figure 11 compares MultiRocket with different numbers of features, as well as the default MiniRocket, MiniRocket with 50,000 features, and MiniRocket with base and first order difference series. Overall, using 50,000 features is the most accurate, and there is little benefit in using 100,000 features, as more and more of the features become similar to one another; a similar phenomenon was observed in Dempster et al. (2021). Figure 12a and b show that MultiRocket with 50,000 features is significantly more accurate than both MiniRocket with 50,000 features and MiniRocket with base and first order difference series, being more accurate on 76 and 68 datasets respectively. These results show that the increase in accuracy is not just due to the large number of features, but also to the diversity of the features extracted using the four pooling operators and the first order difference. Therefore MultiRocket uses 50,000 features by default.