1 Introduction

Time series analysis is a popular topic in machine learning and data mining research. Thousands of research papers in this field have been published in the last decade. Various algorithms have been proposed for disparate tasks across a wide range of applications. The main reason for this development is the increased ability to store data over time and the spread of cheap sensor technology to most fields of science. For example, solar panels depend on sensors to maximise their potential (e.g. to tilt the solar panel so that the sun shines directly on it) and hospitals routinely record and store patient data such as vital signs. This vast wealth of data offers great potential for data mining.

Two of the most researched time series machine learning/analysis tasks are classification (Bagnall et al. 2017; Middlehurst et al. 2023; Ruiz et al. 2021) and forecasting (Makridakis et al. 2008). Time Series Classification (TSC) involves building a predictive model from (possibly multivariate) time series for a categorical target variable. TSC differs from standard classification in that the discriminatory features are often in the shape of the series or the autocorrelation. Forecasting consists of predicting (usually numeric) values based on past observations. Forecasting is usually approached with a model-based algorithm (e.g., autoregressive or exponential smoothing) or by reducing the forecasting problem to a regression problem through a sliding window, then applying deep learning or a global model such as XGBoost.

Tan et al. (2021) formally specified a related, but distinct, type of time series regression problem: Time Series Extrinsic Regression (TSER). Rather than being derived from a forecasting problem, TSER involves a predictive model built on time series to predict a real-valued variable distinct from the training input series. For example, Fig. 1 shows soil spectrograms which can be used to estimate the potassium concentration. Ground truth is found through expensive, time-consuming lab-based experiments. Spectrograms (ordered data series we treat as time series) are cheap to obtain and the data can be collected in any environment. An accurate regressor from spectrogram to concentration would make land and crop management more efficient. A TSER example already in the archive is shown in Fig. 2. Each multivariate time series, comprising an electrocardiogram (ECG) and a photoplethysmogram (PPG), can be used for heart rate estimation.

Fig. 1 Three examples of soil spectrograms from the PotassiumConcentration dataset. These spectrograms are used to predict the target variable, the potassium concentration level, the values of which are shown in the legend

Fig. 2 Two examples of electrocardiograms and photoplethysmograms, with their corresponding heart rate values, from the BIDMC32HR dataset

TSER is related to TSC as traditional regression is to classification: the only difference is that the target variable is real-valued rather than categorical. The qualifier extrinsic is needed because, in the forecasting literature, the term time series regression commonly refers to reducing forecasting to regression through a sliding window.

The first benchmarking work for TSER (Tan et al. 2021) introduced an archive of 19 TSER problems, including four univariate and 15 multivariate datasets. They performed an experimental comparison of the performance of 13 algorithms on these data. The two algorithms adapted from the TSC literature, the RandOm Convolutional KErnel Transform (ROCKET) (Dempster et al. 2020) and the deep learner InceptionTime (Ismail Fawaz et al. 2020), were top-ranked. However, there was no significant difference in Root Mean Square Error (RMSE) between the ten best-performing algorithms, possibly because of the relatively small number of datasets and the conservative nature of the adjustment for multiple tests used. The abstract of Tan et al. (2021) states that “we show that much research is needed in this field to improve the accuracy of ML models [for TSER]”.

Despite the paper’s popularity and the identification of a clear need for novel research, there has been little or no progress in addressing this challenge. We have responded to this call to arms and developed and assessed a range of TSER algorithms. We have proposed new algorithms that are significantly better than the ones evaluated in Tan et al. (2021).

Our starting point for TSER is to adapt TSC algorithms for regression. The ROCKET family of classifiers all involve a transformation using randomised convolutions and a pooling operation, followed by a linear or ridge classifier. The original ROCKET was converted to TSER by switching the classifier for a ridge regressor. We extend this to consider a more recent ROCKET variant, MultiROCKET (Tan et al. 2022). Deep learning algorithms are also simple to adapt, and we extend the previous study to include a convolutional neural network in addition to an ensemble regression version of InceptionTime (Ismail Fawaz et al. 2020).
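To illustrate how simple this adaptation is, the following is a minimal from-scratch sketch of a ROCKET-style pipeline for regression. It is not the actual ROCKET implementation: kernel generation is simplified and only univariate, equal-length series are handled. The classifier-to-regressor swap amounts to fitting a ridge regressor on the pooled convolution features.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def random_kernels(n_kernels, rng):
    """Generate random convolutional kernels in the spirit of ROCKET."""
    kernels = []
    for _ in range(n_kernels):
        weights = rng.standard_normal(rng.choice([7, 9, 11]))
        weights -= weights.mean()              # mean-centred weights
        bias = rng.uniform(-1, 1)
        dilation = int(2 ** rng.uniform(0, 5))
        kernels.append((weights, bias, dilation))
    return kernels

def transform(X, kernels):
    """Two pooling operations per kernel: max and proportion of positive values."""
    feats = np.empty((len(X), 2 * len(kernels)))
    for i, x in enumerate(X):
        for j, (w, b, d) in enumerate(kernels):
            span = (len(w) - 1) * d            # receptive field of the dilated kernel
            if span >= len(x):
                conv = np.array([b])
            else:
                idx = np.arange(len(x) - span)
                conv = b + sum(w[k] * x[idx + k * d] for k in range(len(w)))
            feats[i, 2 * j] = conv.max()
            feats[i, 2 * j + 1] = (conv > 0).mean()
    return feats

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 150))             # 30 toy univariate series
y = X[:, :10].sum(axis=1)                      # toy real-valued target
kernels = random_kernels(100, rng)
reg = RidgeCV(alphas=np.logspace(-3, 3, 10)).fit(transform(X, kernels), y)
```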

An alternative approach to TSC is to use a large number of unsupervised summary features as a transform. A review of a range of alternatives (Middlehurst and Bagnall 2022) found that the Fresh Pipeline with RotatIoN forest Classifier (FreshPRINCE) was the best transform pipeline for TSC. The FreshPRINCE uses the Time Series Feature Extraction based on Scalable Hypothesis Tests (TSFresh) (Christ et al. 2018) followed by a Rotation Forest (RotF) classifier (Rodriguez et al. 2006). We implement the FreshPRINCE for TSER.

Interval-based classifiers also extract unsupervised features, but they do so by ensembling pipelines with randomly selected intervals and a fast base classifier. The first interval-based approach for TSC was the Time Series Forest (TSF) (Deng et al. 2013). TSF generates a set of random intervals and concatenates each interval's mean, standard deviation and slope to make a unique feature space for every base classifier. The Canonical Interval Forest (CIF) (Middlehurst et al. 2020a), and the subsequent Diverse Representation Canonical Interval Forest (DrCIF) (Middlehurst et al. 2021), adopt a similar model to the TSF but use different summary features and data representations. CIF uses the Canonical Time Series Characteristics (Catch22) (Lubba et al. 2019) feature set. Details of these transformation-based algorithms and how we have adapted them to TSER are provided in Sect. 3. Implementations of these regressors are available in the aeon toolkit (Footnote 1). Our main contributions can be summarised as follows:

  1. We provide 44 new datasets for the TSER archive, including 24 univariate and 20 multivariate datasets, taking the archive to 63 datasets;

  2. We extend the study from Tan et al. (2021) to these new data to examine whether the conclusions translate to the larger collection;

  3. We adapt recently proposed convolution-based, feature-based, interval-based, and deep learning-based TSC algorithms to TSER;

  4. We conduct an extensive experimental study using 21 regressors and demonstrate that feature-based and interval-based regressors, on average, achieve a significantly better RMSE than any other assessed algorithm;

  5. We carry out a comprehensive analysis and an ablative study of the two best proposed approaches: FreshPRINCE and DrCIF;

  6. We provide open-source, scikit-learn compatible implementations, clear guidance on reproducibility, and detailed results on the associated repository (Footnote 2).

The rest of this paper is structured as follows. In Sect. 2, the background and related work are presented. Sect. 3 describes our new TSER algorithms in detail. In Sect. 4, we give an overview of the new archive and describe our experimental setup. In Sect. 5, experimental results for the 21 approaches applied to the 63 datasets are presented. Section 6 looks at these results in more detail. Finally, Sect. 7 summarises our findings and highlights future work.

2 Background and related work

TSER aims to create a mapping function between a time series and a scalar value (Tan et al. 2021). A time series is composed of real-valued ordered observations. Formally, a univariate time series of length m is defined as \(\textbf{x} = \{x_1, x_2, \ldots , x_m\}\). A multivariate time series with d channels is specified as \(X = \{\textbf{x}_1, \textbf{x}_2, \ldots , \textbf{x}_d\}\), where \(\textbf{x}_k = \{x_{1, k}, x_{2, k}, \ldots , x_{m, k}\}\) and a collection of n time series is denoted \(\textbf{X}\). Hence, \(x_{i,j,k}\) represents the j-th observation of the i-th case for the k-th channel. A dataset D is composed of n time series samples and an associated response variable, \(D = \{\textbf{X}, \textbf{y}\}\), where \(\textbf{y} = \{y_1, y_2, \ldots , y_n \}\) are the output continuous values, i.e. the input time series \(\textbf{x}_i\) (univariate) or \(X_i\) (multivariate) is associated with the target variable \(y_{i}\).

A TSER model is a mapping function \(\mathcal {T} \rightarrow \mathcal {R}\), where \(\mathcal {T}\) is the space of all time series and \(\mathcal {R}\) the space of continuous target values. A TSER model is trained on a dataset \(D_{TRAIN}\) and evaluated on an independent test dataset \(D_{TEST}\).

TSER shares some similarities with Scalar-on-Function Regression (SoFR) (Goldsmith and Scheipl 2014), a functional regression model where basis functions are applied to the series prior to regression, i.e. the goal is to fit a regression model with scalar responses and functional data points as predictors (Reiss et al. 2017). Tan et al. (2021) used two SoFR models in their comparison based on Goldsmith and Scheipl (2014). These were Functional Principal Component Regression (FPCR) and FPCR with B-splines (FPCR-Bs).

Time series forecasting is often reduced to regression through the application of a sliding window to form a collection of time series \(\textbf{S}\) and a forecast horizon specifying how to select the target \(\textbf{y}\). The most common techniques used in Time Series Forecasting Regression (TSFR) include deep learning variants and global models, where channels are concatenated and a standard regressor such as Random Forest or XGBoost is applied.

Our primary source for TSER (Tan et al. 2021) compared three standard regression algorithms (Support Vector Regressors (SVR) (Drucker et al. 1996), Random Forest (RandF) (Breiman 2001), and eXtreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016)); two k-Nearest Neighbours models using Euclidean and Dynamic Time Warping distances with both one and five neighbours; three deep learning approaches (Fully Convolutional Neural Network (FCN) (Wang et al. 2017), Residual Network (ResNet) (He et al. 2016) and InceptionTime (Ismail Fawaz et al. 2020)); two functional analysis approaches (FPCR and FPCR with B-splines (Goldsmith and Scheipl 2014)); and ROCKET (Dempster et al. 2020). The standard regressors adopt the approach of global forecasting regressors: time series are flattened into a vector concatenating all the channels. Hence, a multivariate time series of length m and d channels is converted into a single vector of length \(m \times d\). Subsequently, there have been very few algorithmic advances for TSER. Most novel developments are domain specific and not aimed at TSER as a whole. Among them, a Linear State-Space Layers (LSSL) model (Gu et al. 2021) has been tested on three TSER datasets, achieving low error metrics. Another state-space model, Liquid-S4 (Hasani et al. 2022), has also been evaluated on those three datasets, and claims better results. An architecture based on Graph Neural Networks called TISER-GCN (Bloemheuvel et al. 2022) has been applied to seismic data as an extrinsic regression task. In a similar context, Siddiquee et al. (2022) introduce Septor, a hierarchical neural network model developed to estimate the depth of seismic events from waveform data, i.e. a domain-specific extrinsic regression task. ROCKET-XGBoost (Bayani 2022) has been the only novel algorithm evaluated on the 19 TSER archive datasets, but it offered no significant improvement over the algorithms evaluated in Tan et al. (2021).
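The flattening used by the standard regressors is a simple reshape; a minimal sketch (assuming a channel-first array layout):

```python
import numpy as np

# Flatten a collection of multivariate series for a standard regressor:
# (n_cases, d_channels, m_timepoints) -> (n_cases, m * d)
X = np.random.default_rng(0).standard_normal((100, 3, 50))
X_flat = X.reshape(len(X), -1)   # each row concatenates all channels
print(X_flat.shape)              # (100, 150)
```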

2.1 Time series classification (TSC) algorithms

There are a plethora of algorithms for TSC that have been compared in reproducible comparative studies (Bagnall et al. 2017; Middlehurst et al. 2023; Ruiz et al. 2021). Broadly, algorithms can be grouped by how they extract and learn from temporal patterns in the time series. We provide a very brief overview with a focus on how classifiers have been or could be adapted to TSER.

Distance-based classifiers use a distance function in conjunction with an algorithm such as a Nearest Neighbour (NN) classifier. The two most commonly used distance functions are Euclidean distance and Dynamic Time Warping (DTW). An NN classifier can trivially be adapted for regression by averaging over the target variables of the nearest neighbours. For multivariate data, using the terminology of Shokoohi-Yekta et al. (2017), DTW can either be independent (find the DTW distance on each channel separately, then sum the values) or dependent (use all channels in the pointwise distance calculation).
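As an illustration, the following is a compact sketch of a DTW-based k-NN regressor for univariate series (a naive \(O(m^2)\) DTW with no warping window; the independent and dependent multivariate variants extend the distance computation as described above):

```python
import numpy as np

def dtw(a, b):
    """Dynamic Time Warping distance between two univariate series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

def knn_regress(X_train, y_train, x_new, k=1):
    """k-NN regression: average the targets of the k nearest training series."""
    dists = np.array([dtw(x, x_new) for x in X_train])
    return y_train[np.argsort(dists)[:k]].mean()
```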

Feature-based algorithms transform series into features using unsupervised descriptive statistics, then complete the pipeline with a classifier trained on the new feature set.

Interval-based classifiers are an extension of feature-based pipeline classifiers: rather than forming summary features over the whole series, they concatenate features found over different intervals. They then form an ensemble over different randomised intervals rather than use a single estimator. We group feature-based and interval-based approaches together as unsupervised feature-based classifiers. Adapting these algorithms to TSER is our primary research goal, so we cover this topic in more detail in Sect. 3.

Kernel/convolution-based models find convolutions from the space of all possible subseries and use them to create features through a form of pooling operation. The most popular approach, ROCKET (Dempster et al. 2020), generates random convolutions and is used in conjunction with a ridge classifier in a pipeline. It was adapted for TSER by simply changing the ridge classifier for a ridge regressor (Tan et al. 2021). More recently, MultiROCKET (Tan et al. 2022) was proposed as an improved version of ROCKET. ROCKET uses two pooling operations to generate features: max pooling and the percentage of positive values. MultiROCKET adds three new pooling operations: mean of positive values, mean of indices of positive values, and longest stretch of positive values. It also extracts features from first order differences in addition to the raw data.
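The three additional pooling operations are straightforward to compute on a convolution output; a toy sketch (the fallback values for all-negative outputs are our own assumption, not MultiROCKET's):

```python
import numpy as np

def multirocket_poolings(conv):
    """The three pooling operations MultiROCKET adds, on one convolution output."""
    pos = conv > 0
    mpv = conv[pos].mean() if pos.any() else 0.0              # mean of positive values
    mipv = np.flatnonzero(pos).mean() if pos.any() else -1.0  # mean of indices of positive values
    longest = run = 0                                         # longest stretch of positive values
    for p in pos:
        run = run + 1 if p else 0
        longest = max(longest, run)
    return mpv, mipv, longest

print(multirocket_poolings(np.array([-1.0, 2.0, 3.0, -0.5, 1.0])))
# (2.0, 2.33..., 2): positive values 2, 3, 1 at indices 1, 2, 4
```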

Deep learning continues to be popular for TSC (Ismail Fawaz et al. 2019), although to our knowledge InceptionTime (Ismail Fawaz et al. 2020) is still the best performing deep learner. The study in Tan et al. (2021) used Residual Networks (ResNet), Fully Convolutional Neural Network (FCN), and InceptionTime. The original InceptionTime paper (Ismail Fawaz et al. 2020) proposed an ensemble of five InceptionTime classifiers to obtain the final results. However, Tan et al. (2021) used a single InceptionTime model for TSER. We evaluate both a single InceptionTime (Inception) and an InceptionTime Ensemble (InceptionE) faithful to the TSC version for TSER. We also evaluate the Convolutional Neural Network (CNN) regressor based on the classifier described in Zhao et al. (2017).

Shapelet-based approaches (Bostrom and Bagnall 2017; Ye and Keogh 2011) base classification on the presence of selected phase-independent subseries found from the training data. For classification, shapelets are assessed with a supervised measure such as information gain. Furthermore, the most accurate shapelet-based approaches (Middlehurst et al. 2021) evaluate shapelets with a one-vs-many approach and balance the search procedure between classes to improve diversity. Adapting shapelets for TSER requires significant internal changes and design decisions, and is beyond the scope of this paper.

Dictionary-based algorithms use a bag-of-words-like approach to base classification on the number of occurrences of approximated subseries (patterns). The most successful dictionary-based classifiers (Schäfer 2015; Schäfer and Leser 2023; Middlehurst et al. 2020b) involve a degree of supervised selection, using accuracy for filtering/weighting or feature selection, and their adaptation for TSER is also beyond the scope of this paper.

3 Unsupervised feature-based regressors

Approaches which extract features from time series in an unsupervised process have been shown to perform well in classification scenarios. ROCKET (Dempster et al. 2020) and CIF (Middlehurst et al. 2020a) perform as well as or better than single representation approaches such as shapelet or dictionary algorithms. These algorithms also have the benefit of lower complexity, essentially consisting of transform to classifier pipelines or an ensemble of pipelines. ROCKET and CIF (Middlehurst et al. 2020a) were also top ranked for Multivariate TSC (MTSC) in a recent survey (Ruiz et al. 2021).

We describe the features extracted and our adaptations for two additional algorithms based on unsupervised transformations. The first is the FreshPRINCE (Middlehurst and Bagnall 2022), a pipeline using the TSFresh (Christ et al. 2018) feature set. The second is DrCIF (Middlehurst et al. 2021), an interval-based ensemble.

3.1 FreshPRINCE

FreshPRINCE is a pipeline algorithm for regression with two components: the TSFresh feature extraction algorithm that transforms the input time series into a feature vector, and a Rotation Forest (RotF) (Rodriguez et al. 2006) estimator that builds a model and makes target predictions. The structure of a generic pipeline algorithm for TSER is displayed in Fig. 3. TSFresh (Christ et al. 2018) is a collection of just under 800 features that can be extracted from time series data. While the features can be used on their own, a feature selection method called Fresh is provided to remove irrelevant features. FreshPRINCE does not make use of this feature selection: it keeps the transformation process entirely unsupervised and allows the RotF to decide the utility of features. TSFresh is popular within the data science community, and has been shown to perform better than other unsupervised transformation pipelines on classification problems as part of FreshPRINCE (Middlehurst and Bagnall 2022).

Fig. 3 A diagram visualising a simple transformation pipeline for TSER. A transformation will convert the series into a usable feature vector for a regression algorithm

RotF is an ensemble of tree classifiers which has been shown to accurately make predictions for problems where the attributes are continuous (Bagnall et al. 2018). The classifier has been used as a benchmark and as a part of other pipeline classifiers in TSC (Bagnall et al. 2017; Middlehurst et al. 2021), and performed better than a ridge classifier and XGBoost (Chen and Guestrin 2016) when paired with unsupervised transforms for TSC (Middlehurst and Bagnall 2022). Full descriptions of the RotF algorithm are available in Rodriguez et al. (2006) and Bagnall et al. (2018). RotF is easily adaptable for regression: the implementation we developed removes class subsampling (Pardo et al. 2013), replaces the C4.5 decision tree with a Classification and Regression Tree (CART) (Breiman 2017), and averages the target predictions for each tree in the forest. The full TSFresh transformation and altered RotF make up our FreshPRINCE adaptation for TSER.
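A minimal sketch of this pipeline structure is shown below, using the tsfresh package. Note two simplifications: a RandomForestRegressor stands in for our adapted Rotation Forest (which is not available in scikit-learn), and the reduced MinimalFCParameters feature set is used for speed, whereas FreshPRINCE extracts the full set of nearly 800 features (the tsfresh default):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from tsfresh.utilities.dataframe_functions import impute

def to_long_format(X):
    """Convert an (n_cases, series_length) array to tsfresh's long format."""
    n, m = X.shape
    return pd.DataFrame({
        "id": np.repeat(np.arange(n), m),
        "time": np.tile(np.arange(m), n),
        "value": X.ravel(),
    })

rng = np.random.default_rng(0)
X_train = rng.standard_normal((40, 100))   # 40 toy univariate series of length 100
y_train = X_train.mean(axis=1) + 0.1 * rng.standard_normal(40)

# Unsupervised transform: summary features per series (no Fresh feature selection)
feats = extract_features(to_long_format(X_train), column_id="id",
                         column_sort="time",
                         default_fc_parameters=MinimalFCParameters())
impute(feats)   # replace NaN/inf produced by degenerate features

reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(feats, y_train)
```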

3.2 DrCIF

Interval-based techniques select phase-dependent intervals of fixed offsets from which to extract summary features. These intervals share their position for all time series, with the aim of discovering discriminatory features from particular locations in time. Most interval techniques take the form of a forest of decision trees, using different intervals to achieve diversity in the ensemble. While some interval forests do make use of supervised feature extraction (Cabello et al. 2020), TSF (Deng et al. 2013) and DrCIF (Middlehurst et al. 2021) are completely unsupervised in their method for selecting intervals and extracting features from said intervals. All that we change from the classification implementation for TSER is the tree algorithm used. TSF can be adapted for the regression task in the same way.

From a series of length m, there are \(m(m-1)/2\) possible intervals when considering all interval lengths and positions. Even for small series lengths, it is infeasible to extract features from, or evaluate, all possible intervals. To decide which intervals from this pool to select, DrCIF uses a random-forest-based approach. An ensemble of CART regressors is formed, built on the output of different random interval transformations. Algorithm 1 describes the full build process for DrCIF. The transformation has three steps. First, the base time series is split into three series representations: the original time series, the first order differences of the series, and the periodogram of the series (line 3 of Algorithm 1). The differences and periodogram series-to-series transformations have been shown to provide useful information in classification approaches (Flynn et al. 2019; Cabello et al. 2020; Tan et al. 2022; Keogh and Pazzani 2001). Then, a different transform is created for each base regressor: a pool of a features is selected from a candidate pool of 29 features (line 6). DrCIF makes use of the CAnonical Time series CHaracteristics (Catch22) (Lubba et al. 2019). Catch22 is a diverse set of 22 features filtered from the 7000+ available in the Highly Comparative Time Series Analysis (HCTSA) toolbox (Fulcher and Jones 2017). The Catch22 features were selected for use on normalised data, but we do not make that assumption. Hence, seven additional summary statistics are also candidates: the mean, standard deviation, slope, median, interquartile range, min, and max. Then, for each data representation, a set of k random intervals is selected (lines 10–13), and the a unsupervised features are calculated and concatenated from a randomly selected channel (lines 13–15). Finally, a CART tree is trained on the feature set unique to each ensemble member. Figure 4 visualises the transformation (upper figure) and ensemble (lower figure) process for DrCIF. Predictions for new cases are found by averaging the predictions of the base regressors.

Algorithm 1 DrCIF(A list of n cases of length m with d channels, \(\varvec{T}=(\varvec{X},\varvec{y})\))

Fig. 4 Diagrams visualising the transformation (top) and ensemble structure (bottom) of DrCIF

4 Methodology

We summarise the new problems we have added, the regressors used in experiments, and a description of our experimental method.

The previous version of the TSER archive includes 19 different datasets. We have increased the number of datasets in the archive to 63. There are now 28 univariate problems and 35 multivariate, with the number of channels ranging from 2 to 24. Dataset sizes range from 93 to over 16,000 cases, with 70% used for training and 30% for testing. Series length ranges from 14 to 7500. Nine of the problems contain missing values and two have unequal-length series.

The new datasets have been taken from Kaggle competitions and other archives and repositories/websites associated with applied research. Table 1 summarises the gathered data. More details on the datasets are available in "Appendix A" and on the associated repository. None of the datasets have been normalised. One of the new problems has unequal-length series. For experiments, keeping with the practice in Tan et al. (2021), missing values in the series are linearly interpolated, and unequal-length series are truncated to the length of the shortest series. Full descriptions, and the original series with unequal lengths and missing values, are available on the associated website.

Table 1 New TSER datasets

4.1 Regression algorithms

The full list of the 21 regressors (with associated abbreviation) evaluated in Sect. 5 is presented as a taxonomy in Fig. 5.

Fig. 5 Taxonomy of current literature in TSER. All these regressors have been evaluated in Sect. 5

Parameter settings for all algorithms are described in “Appendix B”.

4.2 Experimental design

Each dataset is provided with a default train/test split. We repeat every experiment 30 times to mitigate sampling variation. The first experiment uses the default train/test split. Subsequent experiments are conducted with data resampled by pooling the train and test sets and randomly partitioning the data with the same train/test proportions as the original. Performance is measured with the RMSE to conform with Tan et al. (2021). To compare regressors, we first average RMSE over all resamples. We use ranks in all statistical tests. For multiple regressors over multiple datasets we use an adaptation of the critical difference diagram (Demšar 2006), replacing the post-hoc Nemenyi test with a comparison of all regressors using pairwise Wilcoxon signed-rank tests, and cliques formed using the Holm correction, as recommended in García and Herrera (2008) and Benavoli et al. (2016).
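A minimal sketch of this resampling protocol, assuming a generic scikit-learn style regressor supplied as a factory so that each resample gets a fresh model (e.g. `mean_rmse(lambda: RidgeCV(), ...)`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def mean_rmse(make_reg, X_train, y_train, X_test, y_test, n_resamples=30):
    """Average RMSE over resamples; resample 0 is always the original split."""
    X = np.concatenate([X_train, X_test])
    y = np.concatenate([y_train, y_test])
    test_size = len(y_test) / len(y)
    rmses = []
    for seed in range(n_resamples):
        if seed == 0:
            Xtr, Xte, ytr, yte = X_train, X_test, y_train, y_test
        else:   # re-partition the pooled data, preserving the original proportions
            Xtr, Xte, ytr, yte = train_test_split(
                X, y, test_size=test_size, random_state=seed)
        pred = make_reg().fit(Xtr, ytr).predict(Xte)
        rmses.append(np.sqrt(np.mean((yte - pred) ** 2)))
    return float(np.mean(rmses))
```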

5 Results

Our experiments are structured as follows. In Sect. 5.1 we recreate the results presented in Tan et al. (2021) on the original 19 datasets. We then extend the analysis to our larger collection of datasets to test whether the conclusions reached in Tan et al. (2021) generalise to the new archive of 63 problems. In Sect. 5.2 we compare the best performing regressors from the previous experiments to the new algorithms we are proposing, FreshPRINCE and DrCIF. We also include two improvements for regressors used in the previous study and other regressors available in open source toolkits. RMSE results for the best performing regressors on the 63 TSER datasets are available on the accompanying website (Footnote 3).

5.1 Recreating results on the 19 TSER datasets

We ran the 13 regressors reported in Tan et al. (2021) on the current 19 datasets in the archive. Figure 6 shows a critical difference diagram of our results alongside the results presented in Tan et al. (2021). Broadly, the ordering of algorithms and the cliques are similar. There are some differences, with ROCKET and FCN ranked lower, and Inception and ResNet higher, in our experiments than in the original.

Fig. 6 Reproduction of the RMSE ranks on the original archive (using 19 datasets). Left is the original image from Tan et al. (2021). Right is our recreation using 5 resamples

Differences may be explained by a slight difference in experimental procedure. The experiments in Tan et al. (2021) involved five repetitions on the default train/test split with a different random seed, whereas we resampled the data on each repetition. We did this for consistency with our later experiments. We also have more diverse cliques than observed in Tan et al. (2021). This is because our adjustment for multiple testing is less conservative than the one used in Tan et al. (2021), where a full Bonferroni adjustment is used rather than a Holm correction.

Fig. 7 RMSE ranks for 13 regressors used in Tan et al. (2021) on 63 TSER datasets

In Fig. 7 we compare the 13 regressors used for Fig. 6 on the larger archive of 63 datasets. For all subsequent experiments we extend the number of resamples on each dataset from five to 30. All resampling is done without replacement and maintains the train/test sizes of the original datasets. The first resample is always the original train/test split, and the resamples are seeded so they can be exactly reproduced (see the accompanying website for an example). We observe that RandF is now the best, improving significantly on ROCKET, and better in rank than Inception and ResNet, among others. Hence, the time series specific methods previously proposed for TSER are not better than using an off-the-shelf regressor with concatenated features.

5.2 Benchmarking the new TSER archive

For the next set of experiments, we take the top five algorithms in Fig. 7 and compare them to some alternative adaptations of time series specific algorithms. The good performance of XGBoost and RandF suggests we should not overlook standard regressors. Rotation Forest (RotF) (Rodriguez et al. 2006) is a classifier that can be easily adapted to regression by simple averaging (Pardo et al. 2013). It has been shown to be particularly effective for problems with all real-valued attributes, including time series (Bagnall et al. 2018, 2017). Hence, we include a regression adaptation in this round of experiments. We also add the standard Ridge regressor for completeness. In addition, the open source toolkit aeon (Footnote 4) includes two regression implementations not previously evaluated in the context of TSER: the Time Series Forest (TSF) regressor, an adaptation of the TSF classifier (Deng et al. 2013), and CNNRegressor (CNN), a Convolutional Neural Network based on the version described in Zhao et al. (2017). On further investigation, we found that the results for InceptionTime in Tan et al. (2021) were created with a single InceptionTime model (Inception). However, in the original work (Ismail Fawaz et al. 2020), the results supporting InceptionTime as a classifier are found with an ensemble of five InceptionTime models. We include an InceptionTime ensemble model for regression (InceptionE). Furthermore, an improved version of the ROCKET algorithm, known as MultiROCKET (Tan et al. 2022), has recently been published; we adapted it to TSER accordingly. Finally, we also included the proposed regressors based on unsupervised feature extraction, DrCIF and FreshPRINCE. We provide implementations of all these approaches in the aeon toolkit. Figure 8 shows the results of the five best algorithms from the experiments presented in Fig. 7 and eight new regressors. We have had to exclude the AustralianRainfall dataset, reducing the number of datasets in this study to 62, because we were unable to run experiments with the MultiROCKET regressor: it requires over 600GB of memory for this dataset and takes more than 15 days to complete.

Fig. 8 RMSE ranks for 13 regressors on 62 TSER datasets

Figure 8 shows that DrCIF, FreshPRINCE and InceptionE form the top clique. DrCIF is the top-ranked regressor, and it is the only algorithm significantly better than RotF, the best performing standard approach of all those we have tried. FreshPRINCE achieves the second best average rank, though it is not significantly different from several regressors, such as RotF or MultiROCKET. InceptionE is the third best algorithm. InceptionE is often very good: it is top ranked on 13 of the 62 problems, and it is significantly better than a single Inception network, which is top ranked on only one dataset. However, InceptionE also fails spectacularly on many problems. Furthermore, the CNN and Ridge regressors are not competitive with the other eleven algorithms. As expected, MultiROCKET is significantly better than ROCKET. Another interesting feature is that RotF is one of the top performing algorithms, achieving similar results to MultiROCKET, InceptionE and FreshPRINCE. RotF is highly effective with real-valued input (Bagnall et al. 2018) and is the best performing standard algorithm for TSC (Bagnall et al. 2017), so this is perhaps not surprising. It does well with time series because it removes embedded correlations through randomised PCA transforms on subsets of attributes. Despite this, the fact that an algorithm for standard regression outperforms a wide range of the deep learning and time series specific approaches is indicative of the scope for improvement in the field of TSER. Thus far, DrCIF is the only regressor that is on average significantly better in terms of RMSE than RotF. In subsequent analyses, the focus is directed towards the seven top-ranked regressors, as shown in Fig. 8.

Table 2 Mean and Standard Deviation (STD) of RMSE and MAE over all problems. The STD is computed as the average of the STD of the datasets

Table 2 provides an overview of the mean and Standard Deviation (STD) of the RMSE and the Mean Absolute Error (MAE) for the top seven algorithms. Figure 9 illustrates the relative performance of the top seven regressors for RMSE with boxplots. The y-axis of Fig. 9 is the relative deviation of the RMSE, calculated as \(\frac{RMSE}{RMSE + Median(RMSE)}\), across all problems. Lower values are better, and values below 0.5 indicate performance better than the median algorithm. A tight distribution indicates an algorithm is consistent in its performance relative to other algorithms. Considered together, Table 2 and Fig. 9 highlight the relative performance of these algorithms. DrCIF has the lowest average RMSE, closely followed by FreshPRINCE. The latter achieves the lowest average MAE, followed by RotF. The standard deviations demonstrate that the regressors have comparable variability, except for InceptionE, Inception, and MultiROCKET, which have higher variance for both RMSE and MAE. DrCIF again stands out as the most robust and stable, followed by RotF, FreshPRINCE and TSF.

Examining all regressors on the x-axis of Fig. 9, only DrCIF and FreshPRINCE are consistently better than the median performance, and the distribution is tightly coupled. Inception and InceptionE have the widest spread.
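The relative deviation is straightforward to compute per dataset; a small sketch:

```python
import numpy as np

def relative_deviation(rmse_per_regressor):
    """rmse / (rmse + median(rmse)) on one dataset; below 0.5 beats the median."""
    rmse = np.asarray(rmse_per_regressor, dtype=float)
    return rmse / (rmse + np.median(rmse))

print(relative_deviation([1.0, 2.0, 4.0]))   # median 2.0 -> [0.333, 0.5, 0.667]
```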

Fig. 9 Distribution of relative deviation of the Root Mean Square Error (RMSE), calculated as \(\frac{RMSE}{RMSE + Median(RMSE)}\), for the top seven regressors (lower values are better)

Figure 10 summarises the performance of the top seven algorithms using a heatmap derived from the average RMSE results. The figure was generated with a recently proposed results visualisation tool (Footnote 5). These results reinforce our previous observations.

Fig. 10 Summary performance results for the best seven regressors

6 Analysis

We explore our results in more detail to better profile the regressors and gain insights into the drivers behind their performance.

6.1 Run time

Figure 11 shows the average rank RMSE against the run time (on a log scale) for the six regressors from Fig. 9. Note that the timings for Inception are not considered reliable, as it was executed on a combination of CPU and GPU. We see a direct trade-off between run time and performance. All algorithms were run on a single CPU thread except for InceptionE, which ran on a GPU. This means the graph is very flattering for InceptionE: even on a GPU, it is slower than RotF on a CPU.

Fig. 11 Run time in milliseconds (log scale, averaged over all datasets) plotted against average rank for RMSE

6.2 Performance by data characteristics

We break down the performance of the regressors by the core characteristics of the data to help gain insight into when different algorithms perform well. We stress this is purely exploratory: the relatively small number of datasets in each category (given in brackets in the tables) precludes useful significance testing.

Table 3 Average rank RMSE split by number of train cases

Table 3 shows the average rank RMSE when we group problems by the number of training cases. The pattern is that DrCIF and FreshPRINCE are better with a small number of cases, whereas InceptionE is better with larger train set sizes.

Table 4 Average rank RMSE split by number of channels

Table 4 shows the average rank RMSE when we group problems by the number of channels. DrCIF is better on univariate problems, while FreshPRINCE achieves better results on multivariate datasets. InceptionE performs relatively better on multivariate problems with three or four channels.

Table 5 Average rank RMSE split by series length (there are no problems with length 366-999)

Table 5 shows the average rank RMSE when we group problems by series length. The interval-based DrCIF performs relatively better than FreshPRINCE on long series but worse on shorter series (length < 50). InceptionE achieves a good rank for relatively long time series (150 < length \(\le\) 365).

Finally, we also assessed relative performance for different problem types but did not detect any interesting trends.

6.3 Ablation of FreshPRINCE

FreshPRINCE is a pipeline of a TSFresh transform and a RotF regressor. We address the question of whether the performance of this regressor is due to the transform, the regressor, or both. Figure 12 summarises the performance of FreshPRINCE, RotF on the raw series, and the TSFresh transform followed by an alternative regressor. It demonstrates that the TSFresh transform followed by RandF or XGBoost is no better than simply applying RotF to the raw data. We conclude that it is the combination of transform and regressor that gives the significantly better performance.

Fig. 12 RMSE ranks for FreshPRINCE, standard RotF and TSFresh (Fresh) with alternative regressors

Fig. 13 Scatter plot of predicted vs actual for DrCIF on BarCrawl6min

Fig. 14 Scatter plot of predicted vs actual for InceptionE on BarCrawl6min

There is very little agreement between InceptionE and the other regressors. We believe one reason InceptionE's performance is so variable is that it sometimes completely fails to find anything useful in a dataset where other models have at least some predictive power. To demonstrate this, we look at the standardised residuals of DrCIF and InceptionE on the BarCrawl6min dataset. The time series are accelerometer data, and the response variable is the transdermal alcohol concentration of the test subjects. The response variable is bounded below by zero. In traditional regression, the analyst might look to transform the response with, for example, a Yeo-Johnson transform (Yeo and Johnson 2000). We are interested in performance over multiple datasets without hand-tailored transforms. The RMSE on the default train/test partition is 0.0017 for DrCIF and 0.0045 for InceptionE. If we plot the predicted response against the actual response for DrCIF (Fig. 13), we see that DrCIF makes negative predictions for low actual values, underestimates higher values, and there appears to be some heteroscedasticity in the residuals. Nevertheless, it has clearly found some relationship between the regressor series and the response. However, in the same plot for InceptionE (Fig. 14), we see that InceptionE nearly always predicts the same value of 0.082. It is likely that careful configuration, tuning and transformation of the data would improve InceptionE. However, the same is true for DrCIF and FreshPRINCE. We are using InceptionE in the way recommended by its creators (Ismail Fawaz et al. 2020).
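For reference, the kind of response transform mentioned above can be sketched with scikit-learn's PowerTransformer wrapped in a TransformedTargetRegressor; the regressor choice here is arbitrary and illustrative:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = 0.01 * np.exp(X[:, 0])   # a skewed, non-negative toy response

# Fit on the Yeo-Johnson transformed response; predictions are back-transformed
reg = TransformedTargetRegressor(
    regressor=RandomForestRegressor(random_state=0),
    transformer=PowerTransformer(method="yeo-johnson"))
reg.fit(X, y)
print(reg.predict(X[:3]))
```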

Fig. 15 Scatter plot of relative RMSE for DrCIF and InceptionE

Furthermore, Fig. 15 shows a scatter plot of the relative RMSE used in Fig. 9 for DrCIF against InceptionE. There is no correlation. For context, among our top four regressors the strongest correlation is between DrCIF and RotF (\(R^2=0.27\)), while DrCIF and FreshPRINCE are only very weakly correlated (\(R^2 = 0.13\)). This diversity suggests there may be some value in ensembling.

7 Conclusions

We have proposed new algorithms for Time Series Extrinsic Regression (TSER) based on classifiers and conducted an extensive experimental study. We have increased the TSER archive size from 19 to 63 datasets, introduced improved versions of regressors used in the previous study, and shown that our new adaptations of classification algorithms are significantly better than the best alternatives. There are several limitations to this study. The data is not randomly sampled (we have taken problems from wherever we can find them) and some domains may be over-represented; we have not tuned any of these regressors (the computation required to tune these algorithms over 63 datasets would be prohibitive); and we have not looked at more complex diagnostics of performance such as residual analysis. Nevertheless, we believe we have made a significant contribution to advance the new field of TSER. RotF outperforms all previously assessed regressors, and DrCIF and FreshPRINCE are the only TSER algorithms proposed so far that significantly outperform all standard regression algorithms. We have made all our experiments reproducible by releasing structured source code compatible with standard toolkits, guidance on reproducing experiments, and all of our results. We have hosted our results on the repository (Footnote 6). The datasets and these results can be downloaded directly using the aeon toolkit and compared to results for new regressors (Footnote 7). Detailed examples of how to do this are on the repository we use to run experiments (Footnote 8).

We believe there is scope for further improvement in algorithms for TSER. Adapting supervised Time Series Classification (TSC) approaches may help further leverage this popular research theme. InceptionE is the most promising deep learning approach, and perhaps it can be engineered to avoid the catastrophic failures it tends towards with smaller train set sizes. Heterogeneous ensembles are very successful for TSC, and the diversity in performance between the best performing algorithms suggests this may, with careful adaptation, translate to TSER. We will continue to enhance the repository with more problems and would welcome all donations.