1 Introduction

Time series data hold great potential for various prediction tasks [1], and time series classification is one of the most challenging tasks in data mining [2]. A typical time series classification task involves multiple variables, represented by multiple data streams, each corresponding to one variable. This is known as multivariate time series classification (MTSC): given a group of time-aligned segments of these data streams, the task is to assign the correct classification label to the group. MTSC is significant in various applications, such as activity recognition [3], disease diagnosis [4], and automatic device classification [5]. Multivariate time series carry temporal information from different sources; hence, measuring the interactions among sources and learning temporal representations are the keys to accurate MTSC [6]. Different tasks impose different requirements on the classifier, which makes building a generally applicable classifier a challenge. For example, EEG-based MTSC can target many different goals, such as emotion recognition [7, 8], decoding of cognitive skills [9], investigation of sustained attention, detection of sleep disorders, and decoding of cognitive tasks in brain-computer interfaces. In EEG classification, performance is jointly sensitive to many factors, such as the number of recording channels (i.e., the feature dimension), the recording length (i.e., the number of features), the number of individuals in each group, the feature extraction method, and the classifier's architecture.

Traditional methods for time series classification include distance-based models (e.g., k-nearest neighbors) and feature-based models (e.g., random forest [10] and support vector machine [11]). These models rely heavily on manually defined features, which are heuristic and task-dependent [12]. Designing such features also takes considerable time and domain expertise. Furthermore, conventional machine learning (ML) techniques have limitations in processing high-dimensional data and representing complicated functions efficiently [13].

Recently, deep learning (DL) has gained popularity in computer vision, natural language processing, and data mining, thanks to its advantages in capturing complicated, nonlinear relations from massive data [14]. Deep neural networks usually stack multiple neural layers for automatic feature extraction and representation learning [15]. Many neural network architectures, such as Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Transformer [16], Long Short-Term Memory (LSTM) [17], and Gated Recurrent Unit (GRU) [18], have been applied to time series analysis. In particular, RNN feeds the prior output into the next step to facilitate temporal feature extraction; it therefore requires long training time and cannot be parallelized. CNN can extract temporal features and be parallelized during training to fully exploit the power of Graphics Processing Units (GPUs); however, it struggles to capture long-range temporal dependencies and is therefore less used for time series classification. Transformer [16] has recently emerged as a promising solution to multivariate time series classification. While the transformer supports both parallel computing and efficient temporal feature extraction, its multiple fully connected layers require a massive number of parameters, making training extremely time-consuming. Furthermore, the transformer suffers from overfitting on small datasets [19] and faces challenges in capturing short-range temporal information [20]. Besides, existing solutions to MTSC commonly require careful adjustment of architectures and parameters to deal with time series of various lengths. This is a critical yet little-studied issue in existing time series classification research.

To summarize, ML methods are expertise-dependent and limited in representing complicated nonlinear functions. Among the DL methods, CNN is efficient for training and inference but struggles to capture long-range dependencies; RNN can effectively learn the temporal representations of long sequences but is computationally expensive; the transformer contains too many parameters, making it prone to overfitting on small datasets. We aim for accurate MTSC that can adapt to time series of various lengths to address the above deficiencies of existing studies. To this end, we propose a novel CNN architecture called Attentional Gated Res2Net (AGRes2Net) for MTSC. Our model overcomes the shortcomings of the standard CNN architecture by enabling the extraction of both global and local temporal features. It also leverages multi-granular feature maps through channel-wise and block-wise attention mechanisms. In a nutshell, we make the following contributions in this paper:

  • We propose a novel AGRes2Net architecture for accurate MTSC. Our model can capture dependencies over various ranges and exploit the inter-variable relations to achieve high performance on time series of various lengths, making it feasible for various tasks.

  • We propose two attention mechanisms, namely channel-wise attention and block-wise attention, to leverage multi-granular temporal information for tasks with different data characteristics. The former has advantages on datasets with many variables, while the latter can effectively prevent overfitting on datasets with very few variables.

  • We conducted extensive experiments on 14 benchmark datasets to evaluate the model. A comparison with several baselines and state-of-the-art methods shows the superior performance of our model. Besides, plugging our model into MLSTM-FCN, a state-of-the-art CNN-RNN parallel model, demonstrates the model’s capability to improve existing models’ performance.

The remainder of the paper is organized as follows. Section 2 overviews the related work; Sect. 3 presents the proposed model and attention mechanisms; Sect. 4 reports our experiments and results; and finally, Sect. 5 gives the concluding remarks.

2 Related Work

2.1 Multivariate Time Series Classification

MTSC is a longstanding problem that has traditionally been addressed by statistical and ML methods [21,22,23]. A representative example is k-Nearest Neighbors (KNN), which has proven effective for MTSC [24]. Its combination with Dynamic Time Warping (DTW) can achieve even better performance [25, 26]. DL methods are increasingly applied to MTSC, given their capability for automatic feature extraction and for learning complex relations from massive amounts of data [27,28,29]. Commonly used DL architectures include Recurrent Neural Networks (RNNs), Gated Recurrent Unit (GRU) [18], Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) [17], and Transformer [16]. Recent studies rely heavily on CNNs to overcome the efficiency and scalability issues of recurrent models (e.g., RNN, LSTM, and GRU) [30,31,32].

Traditionally, CNNs are used for computer vision tasks, such as image recognition [33], object detection [34,35,36], and semantic segmentation [37]. Recent studies [38,39,40,41,42] show 1D-CNN is promising for temporal feature extraction: the convolution computation can capture potential temporal patterns, while information fusion across channels can cope with the inter-relations among variables. Further, Inception [43] uses multiple parallel convolutional kernels of different sizes to address the challenges faced by CNNs in capturing long-range temporal dependencies [44, 45]. However, Inception's receptive field has a restricted width, which limits its ability to capture long-range dependencies.

The combination of CNN and RNN represents an effort to exploit the advantages of both [46]. Hybrid CNN-RNN architectures generally follow a parallel or cascade style to facilitate temporal feature extraction over various ranges. For example, LSTM-FCN [47] uses CNN and RNN in parallel and achieves state-of-the-art performance on several benchmark datasets. Since LSTM-FCN employs RNN as a component, it cannot fully leverage the power of GPUs, leading to extended training time. In comparison, the transformer [16] learns both temporal dependencies and inter-variable relations based on positional embedding and the attention mechanism. It achieves state-of-the-art performance on several time-series datasets [48, 49] but suffers from extended training time and overfitting on small datasets [19] due to its massive number of trainable parameters. It also has difficulty capturing short-range temporal information compared with RNN.

2.2 Attention Mechanism

The attention mechanism was first used in the seq2seq model for machine translation [18]. A vanilla seq2seq model first feeds the input sequence to an encoder (consisting of multiple recurrent layers) [18] to generate hidden states and outputs. It then collects the hidden states of all the steps to represent the information of the input. An attention mechanism forces the model to learn the weights of these hidden states in the decoder during this process. Thus, the model can focus on a specific region of the input sequence, leading to a significant performance improvement.

Recent studies have designed different attention modules and applied them to various domains [50, 51]. Among them, the Squeeze-and-Excitation Block (SE) [52] is widely used for various tasks thanks to its ease of implementation. SE works in two steps. First, it uses global average pooling to obtain an information vector summarizing the feature maps of different channels. Then, it employs fully connected layers to capture the inter-relations between feature maps, learning their weights and highlighting the critical information.
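For concreteness, a minimal 1-D SE block following the two steps above could look like the PyTorch sketch below; the module name and the reduction ratio of 16 are illustrative assumptions (the common SE default), not settings taken from this paper.

```python
import torch
import torch.nn as nn

class SEBlock1d(nn.Module):
    """Minimal sketch of a 1-D Squeeze-and-Excitation block (channel reweighting)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        w = x.mean(dim=-1)            # squeeze: global average pooling over time
        w = self.fc(w).unsqueeze(-1)  # excitation: learn one weight per channel
        return x * w                  # reweight the channel feature maps
```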

3 Our Approach

We propose Attentional Gated Res2Net for accurate classification of multivariate time series of various lengths. In particular, we incorporate gating and attention mechanisms on top of Res2Net [53], where gates control the information flow across the groups of convolutional filters, and the attention module harnesses the feature maps at different levels of granularity.

The overall architecture of AGRes2Net (shown in Fig. 1) consists of two stages: Convolution and Attention. We illustrate these two stages in the following subsections, respectively.

Fig. 1
figure 1

The structure of Attentional Gated Res2Net. It consists of two stages: convolution and attention. The convolution stage feeds the input to a convolutional layer for channel expansion and then groups the output along the channel. Each group (except the first) conducts convolution based on its input and its precedent group’s output (passed through gates). The attention stage forces the model to consider the temporal information at different levels of granularity. Finally, the network uses a convolutional layer for channel compression and information fusion

3.1 Convolution Stage

We design the convolution stage based on Res2Net [53], a CNN backbone specially designed to achieve multi-scale receptive fields based on group convolution. Group convolution first appeared in AlexNet [54] and significantly reduced the number of the parameters in that model. It has since been adopted in many lightweight networks [55, 56] to generate a large number of feature maps with a small number of parameters.

Unlike conventional CNNs, which use a single set of filters to work on all channels, Res2Net includes multiple groups of filters and uses a separate group to handle each subset of channels. These filter groups are connected in a hierarchical, residual-like style, and they work as follows. First, a convolutional layer takes the input data and outputs a feature map for channel expansion. Then, the feature map is split into groups along the channel, generating groups of feature maps, i.e., input feature maps. Finally, for each input feature map, a separate group of filters extracts features and generates the corresponding output, i.e., an output feature map. In particular, when extracting features from an input feature map, the filter group also takes into account the output of the filter group that comes immediately before it. The whole process repeats until all input feature maps are processed.

Suppose X is the feature map obtained from channel expansion, and X is evenly divided into s groups, \(\{{\mathbf {x}}_i\}_{i=1}^{s}\), where \({\mathbf {x}}_i\) denotes the ith group. Each group contains an input feature map that has the same temporal size but contains only 1/s of the channels in X. Let \({\mathbf {K}}_i\) be the convolution operation. Then, given an input feature map \({\mathbf {x}}_i\), the convolution output, \({\mathbf {y}}_i\), is calculated as follows:

$$\begin{aligned} {\mathbf {y}}_{i}=\left\{ \begin{array}{ll} {{\mathbf {x}}_{i}} &{} {i=1} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}\right) } &{} {i=2} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}+{\mathbf {y}}_{i-1}\right) } &{} {2<i \leqslant s.} \end{array}\right. \end{aligned}$$
(1)

By feeding the concatenation of all the outputs to a convolutional layer, Res2Net achieves multi-scale receptive fields to facilitate multivariate time series classification. However, it has difficulty in controlling the information flow between the feature-map groups—at each step, \({{\textbf {y}}}_{i}\) is always fully sent to the next group regardless of whether it avails or harms the model’s performance.

Addressing this limitation is important, as it enables the model to control how to weigh the precedent output feature map against the current input feature map in an input-dependent manner. This, in turn, mitigates the problem of vanishing gradients without incurring long delays. To this end, we introduce the gating mechanism [31] into Res2Net at the convolution stage to enhance feature extraction. Specifically, in our model (shown in Fig. 1), all groups of feature maps (except the first) are sent to convolutional layers for feature extraction, and a gating unit lies between each pair of adjacent feature-map groups to control how much information flows from the precedent group to the current one. Given a feature-map group (more specifically, an input feature map) \({{\textbf {x}}}_{i}\), the value of the corresponding gate, \({{\textbf {g}}}_{i}\), is calculated as follows:

$$\begin{aligned} {\mathbf {g}}_{i}=\tanh \left( a\left( {\text {concat}}\left( a({\mathbf {y}}_{i-1}), a\left( {\mathbf {x}}_{i}\right) \right) \right) \right) . \end{aligned}$$
(2)

where a can be either a fully-connected layer or a 1-D convolutional layer, concat is the concatenation operation, and tanh is the activation function commonly used for gates.

Note that we only use the precedent output feature map \({{\textbf {y}}}_{{i-1}}\) and the current input feature map \({{\textbf {x}}}_{i}\) to calculate the gate; this is different from the gating mechanism in [31]. More specifically, we omit the undivided feature map X as it contains redundant information and does not significantly improve the performance. Eventually, after the convolution stage, we obtain \({\mathbf {y}}_i\) as follows:

$$\begin{aligned} {\mathbf {y}}_{i}=\left\{ \begin{array}{ll} {{\mathbf {x}}_{i}} &{} {i=1} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}\right) } &{} {i=2} \\ {{\mathbf {K}}_{i}\left( {\mathbf {x}}_{i}+{\mathbf {g}}_{i}\cdot {\mathbf {y}}_{i-1}\right) } &{} {2<i \leqslant s}. \end{array}\right. \end{aligned}$$
(3)
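A minimal PyTorch sketch of this convolution stage is given below. It implements the channel expansion, the split into s groups, and the gated hierarchy of Eqs. (1)-(3); for brevity, the inner projections a(·) of Eq. (2) are folded into a single 1x1 convolution, and all layer sizes are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
import torch.nn as nn

class GatedRes2NetStage(nn.Module):
    """Sketch of the convolution stage: channel expansion, split into s groups,
    and hierarchical convolution with gated information flow (Eqs. 1-3)."""

    def __init__(self, in_channels: int, width: int, s: int = 4, kernel_size: int = 3):
        super().__init__()
        assert width % s == 0, "expanded channels must divide evenly into s groups"
        self.s, group_ch = s, width // s
        self.expand = nn.Conv1d(in_channels, width, kernel_size=1)      # channel expansion
        pad = kernel_size // 2
        # One filter group K_i per feature-map group (none for i = 1, which is passed through).
        self.convs = nn.ModuleList(
            [nn.Conv1d(group_ch, group_ch, kernel_size, padding=pad) for _ in range(s - 1)]
        )
        # Gate g_i computed from concat(y_{i-1}, x_i); the projections a(.) of Eq. (2)
        # are collapsed into a single 1x1 convolution here for brevity.
        self.gates = nn.ModuleList(
            [nn.Conv1d(2 * group_ch, group_ch, kernel_size=1) for _ in range(s - 2)]
        )
        self.compress = nn.Conv1d(width, in_channels, kernel_size=1)    # compression / fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, time)
        xs = torch.chunk(self.expand(x), self.s, dim=1)   # split along the channel dimension
        ys = [xs[0]]                                      # y_1 = x_1
        ys.append(self.convs[0](xs[1]))                   # y_2 = K_2(x_2)
        for i in range(2, self.s):                        # y_i = K_i(x_i + g_i * y_{i-1})
            g = torch.tanh(self.gates[i - 2](torch.cat([ys[-1], xs[i]], dim=1)))
            ys.append(self.convs[i - 1](xs[i] + g * ys[-1]))
        return self.compress(torch.cat(ys, dim=1))        # concatenate groups and fuse
```

In the full model, the attention stage of Sect. 3.2 operates on the group outputs \(\{{\mathbf {y}}_i\}\) before the final compression layer; the sketch above only covers the gated group convolutions.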

3.2 Attention Stage

The convolution stage only considers the information flow between adjacent feature-map groups. As such, it limits the model's ability to capture dependencies between groups that are far apart. We therefore design an attention stage that attends to the relevant parts of the output feature maps. In particular, we propose two types of attention modules, namely a channel-wise attention module and a block-wise attention module, to harness multi-granular temporal patterns effectively.

3.2.1 Channel-wise Attention

Channel-wise attention captures the relations between channels of the convolution stage’s output, i.e., output feature maps, \(\{{\mathbf {y}}_i\}_{i=1}^{s}\), where s is the number of feature-map groups in the convolution stage.

Suppose every \({{\textbf {y}}}_{i}\) contains the same number of channels, say J channels; this is reasonable as we divide the original feature map X evenly along the channel. Let \({{\textbf {h}}}_{{i,j}}\) be the feature map for the jth channel of \({{\textbf {y}}}_{i}\). We use three fully-connected layers to learn the query, key, and value of \({{\textbf {h}}}_{{i,j}}\) (denoted by \({{\textbf {q}}}_{{i,j}}\), \({{\textbf {k}}}_{{i,j}}\), and \({{\textbf {v}}}_{{i,j}}\), respectively). Similarly, we denote by \({{\textbf {q}}}_{{m,n}}\), \({{\textbf {k}}}_{{m,n}}\), and \({{\textbf {v}}}_{{m,n}}\) the query, key, and value of \({{\textbf {h}}}_{{m,n}}\), the feature map for the nth channel of \({{\textbf {y}}}_{m}\). Given two different feature maps, \({{\textbf {h}}}_{i,j}\) and \({{\textbf {h}}}_{{m,n}}\), we calculate the channel-wise attention as follows:

$$\begin{aligned} \mathbf {attention}\left( {\mathbf {q}}_{i, j}, {\mathbf {k}}_{m, n}\right) =\frac{{\mathbf {q}}_{i, j} {\mathbf {k}}_{m, n}^{T}}{\sqrt{J}} \end{aligned}$$
(4)

Once computed, we can update the feature map of every channel according to its relations with all the other feature maps. As the feature maps contain temporal information within various ranges, channel-wise attention can capture temporal dependencies at multiple levels of granularity. Based on the above, the updated feature map \(\tilde{{\mathbf {h}}}_{i,j}\) can be calculated as follows:

$$\begin{aligned} \tilde{{\mathbf {h}}}_{i,j}= \sum _{m=1}^{s} \sum _{n=1}^{J} {\text {Softmax}}\left( \frac{\mathbf {attention}\left( {\mathbf {q}}_{i, j}, {\mathbf {k}}_{m, n}\right) }{\sum _{m=1}^{s} \sum _{n=1}^{J} \mathbf {attention}\left( {\mathbf {q}}_{i, j}, {\mathbf {k}}_{m, n}\right) }\right) {\mathbf {v}}_{m, n} \end{aligned}$$
(5)

Given s output feature maps each having J channels with k dimensions, the total number of feature maps for channel-wise attention is \(s \times J\), resulting in the computational complexity of \({\mathcal {O}}\left( (s \times J)^{2} k\right) \).
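A simplified sketch of channel-wise attention is shown below. It treats each of the s x J per-channel feature maps as one row and applies scaled dot-product attention over all of them; a plain softmax stands in for the normalised form of Eq. (5), and the projection sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """Sketch of channel-wise attention over the s*J per-channel feature maps
    (cf. Eqs. 4-5; a plain softmax replaces the paper's normalised variant)."""

    def __init__(self, time_len: int, channels_per_group: int, d_attn: int = 64):
        super().__init__()
        self.scale = math.sqrt(channels_per_group)        # sqrt(J) scaling of Eq. (4)
        self.query = nn.Linear(time_len, d_attn)
        self.key = nn.Linear(time_len, d_attn)
        self.value = nn.Linear(time_len, time_len)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, s * J, time) -- every per-channel feature map is one row
        q, k, v = self.query(h), self.key(h), self.value(h)
        scores = q @ k.transpose(1, 2) / self.scale       # (batch, s*J, s*J)
        return torch.softmax(scores, dim=-1) @ v          # updated feature maps
```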

3.2.2 Block-wise Attention

Block-wise attention regards each \({{\textbf {y}}}_{i}\) as an individual block that contains temporal information at a certain granularity. Instead of calculating attention values along the channel, block-wise attention directly feeds \({{\textbf {y}}}_{i}\) to the fully-connected layers to calculate the corresponding query, key, and value. Block-wise attention has advantages in mitigating overfitting as it considers sparse relations when computing the attention.

Suppose \({{\textbf {y}}}_{i}\) and \({{\textbf {y}}}_{m}\) are two output feature maps. We denote by \({{\textbf {q}}}_{{i}}\), \({{\textbf {k}}}_{{i}}\) and \({{\textbf {v}}}_{{i}}\) the query, key and value of \({{\textbf {y}}}_{i}\); similarly, we denote by \({{\textbf {q}}}_{{m}}\), \({{\textbf {k}}}_{{m}}\) and \({{\textbf {v}}}_{{m}}\) the query, key and value of \({{\textbf {y}}}_{m}\). Then, we calculate the block-wise attention as follows:

$$\begin{aligned} \mathbf {attention}\left( {\mathbf {q}}_{i}, {\mathbf {k}}_{m}\right) =\frac{{\mathbf {q}}_{i} {\mathbf {k}}_{m}^{T}}{\sqrt{s}} \end{aligned}$$
(6)

Once computed, we can update the feature map of every block according to its relations with all the other feature maps. The updated feature map for each block, \(\tilde{{\mathbf {y}}}_{i}\), is calculated as follows:

$$\begin{aligned} \tilde{{\mathbf {y}}}_{i}= \sum _{m=1}^{s} {\text {Softmax}}\left( \frac{\mathbf {attention}\left( {\mathbf {q}}_{i}, {\mathbf {k}}_{m}\right) }{\sum _{m=1}^{s} \mathbf {attention}\left( {\mathbf {q}}_{i}, {\mathbf {k}}_{m}\right) }\right) {\mathbf {v}}_{m} \end{aligned}$$
(7)

Given s feature maps, each having J channels with k dimensions, the computational complexity of block-wise attention is \({\mathcal {O}}\left( s^{2} Jk\right) \).
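Block-wise attention can be sketched analogously: each group output \({\mathbf {y}}_i\) is flattened into a single vector, so the attention matrix is only s x s. Again, a plain softmax stands in for Eq. (7), and the layer sizes are assumptions.

```python
import math
import torch
import torch.nn as nn

class BlockWiseAttention(nn.Module):
    """Sketch of block-wise attention: each group output y_i is one block (cf. Eqs. 6-7)."""

    def __init__(self, block_dim: int, s: int, d_attn: int = 64):
        super().__init__()
        self.scale = math.sqrt(s)                         # sqrt(s) scaling of Eq. (6)
        self.query = nn.Linear(block_dim, d_attn)
        self.key = nn.Linear(block_dim, d_attn)
        self.value = nn.Linear(block_dim, block_dim)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, s, J * time) -- each of the s blocks flattened into one vector
        q, k, v = self.query(y), self.key(y), self.value(y)
        scores = q @ k.transpose(1, 2) / self.scale       # (batch, s, s)
        return torch.softmax(scores, dim=-1) @ v          # updated blocks
```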

4 Experiments

This section reports our extensive experiments to evaluate our proposed approach, including comparisons against baselines, ablation studies, and parameter studies on several public time-series datasets. We demonstrate that our approach can be used as a plugin to improve the performance of state-of-the-art methods and provide practical advice on how to adapt our approach to a specific problem.

4.1 Datasets

We conducted experiments on 14 public multivariate time series datasets (summarized in Table 1). These datasets cover various tasks from different application domains, such as activity recognition, EEG classification, and weather forecasting. They contain time series of various lengths with different numbers of variables. We carefully selected these datasets to cover applications in various domains and to ensure sufficient diversity in sequence length and variable number, reflecting different difficulty levels in real-world multivariate time series classification problems.

Table 1 A list of our experimental datasets

4.2 Baseline Methods

We selected several competitive baselines and state-of-the-art (SOTA) methods to compare with our approach.

  • Res2Net [53]: this is a CNN backbone that uses group convolution and hierarchical residual-like connections between convolutional filter groups to achieve multi-scale receptive fields.

  • GRes2Net [31]: this work incorporates gates in Res2Net, where the gates’ values are calculated based on a different method from ours—it additionally takes into account the original feature map before it is divided into groups when calculating gates’ values.

  • Res2Net+SE: this work combines Res2Net with a Squeeze-and-Excitation Block (SE) [52] to leverage the effectiveness of attention modules.

  • GRes2Net+SE: this work combines GRes2Net with SE to leverage the effectiveness of attention modules.

We briefly introduce the SOTA methods for the experimental datasets below. A full list of SOTA methods is given in Table 1.

  • MLSTM-FCN [47]: a multivariate LSTM fully convolutional network that concatenates the outputs of two parallel blocks: a fully convolutional block (embedded with SEs) and an LSTM block. It is a variant of LSTM-FCN.

  • MALSTM-FCN [47]: a multivariate attention LSTM fully convolutional network, which resembles MLSTM-FCN but replaces LSTM cells with attention LSTM cells.

  • MUSE [58]: a model that extracts and filters multivariate features by encoding context information into each feature.

  • InceptionTime [60]: a CNN-based model transferred from computer vision to time series classification, which stacks multiple parallel convolutional filters for temporal feature extraction.

  • Time Series Forest [21]: an ensemble tree-based method that employs a combination of entropy gain and a distance measure to evaluate the differences between time-series sequences.

  • Canonical Interval Forest [61]: a model that refines Time Series Forest by upgrading the interval-based component.

  • Dynamic Time Warping (DTW) [62]: a traditional distance-based machine learning method for time series analysis.

  • Random Convolutional Kernel Transform (ROCKET) [63]: a CNN-based model that uses random convolutional kernels to extract multi-granular temporal features.

4.3 Model Configuration and Evaluation Metric

Table 2 Experiment configuration settings

We followed the preprocessing methods described in the SOTA papers. In particular, we normalized each dataset to zero mean and unit standard deviation. We also applied zero padding to cope with sequences of various lengths in the same training set. The experimental results of each method were obtained under the optimal or suggested settings provided in the original paper.
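For reference, this preprocessing can be sketched as follows; normalising each variable separately and padding at the tail up to the longest training sequence are our assumptions, as the exact scheme is not spelled out.

```python
import numpy as np

def preprocess(series_list, target_len):
    """Z-normalise each multivariate series and zero-pad it to a common length.
    Each series has shape (length, n_variables)."""
    out = []
    for x in series_list:
        x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)   # zero mean, unit std per variable
        pad = np.zeros((target_len - x.shape[0], x.shape[1]))
        out.append(np.vstack([x, pad]))                     # zero padding at the tail
    return np.stack(out)
```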

To ensure a fair comparison, all the models based on Res2Net, GRes2Net, and our approach contained the same number of feature-map groups and used identical filters for each group.

We used our model as the backbone for feature extraction and trained it for 500 epochs using the Adam optimizer [64]. The learning rate was set to 0.001 and reduced to 1/10 of its value after every 100 epochs. The dropout rate was set to 0.4 to avoid possible overfitting. We repeated the training and test processes five times and took the average over the runs as the final result; this mitigates the impact of randomized parameter initialization. Details, including the number of layers, the number of convolutional groups, and the dropout rate settings, can be found in Table 2.
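The training configuration above corresponds to a loop along the following lines (a minimal PyTorch sketch; the cross-entropy loss is an assumption, and dropout lives inside the model itself rather than in this loop):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

def train(model: nn.Module, loader, epochs: int = 500, device: str = "cpu"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()                        # assumed classification loss
    optimizer = optim.Adam(model.parameters(), lr=1e-3)      # Adam, lr = 0.001
    scheduler = StepLR(optimizer, step_size=100, gamma=0.1)  # lr /= 10 every 100 epochs
    for _ in range(epochs):
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```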

We used accuracy, which is currently used by all the SOTA methods on the experimental datasets, as the metric for comparing the methods. However, accuracy alone is not comprehensive enough to measure classifier performance. Although the vast majority of the related work uses accuracy as the only evaluation metric, we additionally use precision, recall, and F-score in our parameter and ablation studies to gain further insight into how our model performs.
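These additional metrics can be computed, for instance, with scikit-learn; macro averaging over classes is our assumption for the multi-class datasets.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F-score."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```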

4.4 Comparison of Different Methods

Table 3 shows a performance comparison of all the methods on the experimental datasets. Our proposed model, using either channel-wise or block-wise attention, consistently outperformed all the other compared methods on all the 14 datasets, demonstrating our model’s superiority in solving MTSC in diverse contexts regardless of the lengths of time-series sequences.

Channel-wise attention favors longer time-series sequences: it beats block-wise attention on all of the top-8 datasets with the longest sequences. The results conform to our intuition that channel-wise attention may have an edge in capturing multi-granular temporal information.

Block-wise attention tends to excel on datasets that contain fewer variables. Among the top-4 datasets with the fewest variables, it beats channel-wise attention on three (AREM, LP5, and EEG); this is also consistent with our intuition that block-wise attention may have advantages in preventing overfitting thanks to the sparse relations considered in its attention calculation.

An exception occurs on the ECG dataset, which has as few as two variables; the reason is that this dataset contains abundant sequences, allowing channel-wise attention to fully exploit the training data without overfitting.

Table 3 Accuracy of different models on 14 benchmark datasets. AGRes2Net+CA and AGRes2Net+BA represent our Attentional Gated Res2Net model incorporated with channel-wise attention and block-wise attention, respectively. The improvement is computed between the SOTA method and the proposed model

Figure 2 shows the result of the Wilcoxon signed-rank test on the compared methods' performance. It shows that, overall, our model achieves similar classification performance with channel-wise attention and with block-wise attention. Either way, our model performs significantly better than the baselines. This result demonstrates the effectiveness of harnessing inter-dependencies between variables and multi-granular feature maps (as our model does through gates, attention, and group convolution) in improving classification performance on sequences of various lengths.

Fig. 2
figure 2

Critical difference diagram of the arithmetic means of the ranks on all datasets

4.5 Impact of Depth and Width of Model

In this experiment, we study how the depth and width of our model impact the classification performance. Generally, a deeper and wider model has a stronger capability to capture complex relations from data. Our model becomes more complex as we increase its depth (by stacking more layers), width (by expanding the number of feature-map groups), or both.

We trained our model under different width and depth settings and applied different types of attention for this experiment. Given the large number of experimental datasets, we only show the results on two representative datasets, Action 3D and Heartbeat. The former has medium-length sequences and a large number of variables; in contrast, the latter has long sequences but a medium number of variables, making them ideal for exemplifying the experimental results. In particular, we show the results of our model with channel-wise attention on the Heartbeat dataset and with block-wise attention on the Action 3D dataset.

Our results (Table 4) show that wider models beat deeper models in both the training and test phases. While stacking multiple layers yields large receptive fields that can capture dependencies over a larger range, a wider model achieves receptive fields of multiple sizes and fuses the feature maps from different convolution filters to learn multi-granular temporal patterns. As a result, a wider model leverages the temporal features of time-series sequences more effectively, making it generally the better choice. Several studies [65, 66] in the computer vision field draw similar conclusions.

Table 4 Training and test results under varying widths and depths

4.6 Impact of Group Number

In this experiment, we further explore the impact of the hyperparameter s, which determines the number of feature-map groups (as well as the number of filter groups) in our model. Intuitively, a larger s gives a wider model that can fuse more temporal features extracted by convolutional filters with multiple sizes of receptive fields, thus facilitating capturing long-range dependencies.

We kept all other settings (e.g., number of layers, epochs, learning rate, dropout rate) unchanged while varying the value of s to explore its influence on classification results. Similar to the preceding experiment, we show the results on four datasets with significantly different sequence lengths (namely LP5, AREM, Ozone, and Action 3D) to avoid information overload. We used block-wise attention on the first two datasets and channel-wise attention on the last two.

Our results (Table 5) show that our model consistently achieved better training performance as s increased, and we can easily tune the model towards capturing a broader range of temporal information by allowing more groups with a greater s. However, greater values of s bring the risk of overfitting, demonstrated by decreased test performance, e.g., on the Ozone and Action 3D datasets. The results suggest the necessity of tuning the hyperparameter s for a specific dataset to obtain the best performance.

Table 5 Training and test results under different s. We set greater s values for the AREM dataset as it has much longer sequences than the others do. We set 6 layers for Ozone, 6 layers for AREM, 4 layers for Action 3D, and 4 layers for LP5

Beyond the above results, we may consider our model recurrent in nature because each group's output feature map is sent to the subsequent group. Following this idea, we may regard the group number s as the number of steps the model takes during its recurrent computation. While traditional convolutional neural networks obtain larger receptive fields by stacking multiple layers or employing dilated convolution layers, they are not as flexible or effective as our model in capturing multi-granular temporal information.

4.7 Impact of Attention Modules

The superiority of our attention modules over SE is indicated by our model outperforming those baselines that incorporate SE [52] (see Table 3). Specifically, the SE module uses global average pooling, which generates a scalar to represent the feature map of each channel. In comparison, our attention mechanisms (channel-wise and block-wise attention) avoid using global average pooling, thus preventing the information loss caused by the pooling operation.

Table 6 further shows our model's performance with the two attention modules during training and test. We show the results on three datasets that cover a large range of variable numbers (7 for AREM, 72 for Ozone, and 570 for Action 3D). The results (Table 6) are consistent with our findings in Sect. 4.4: channel-wise attention generally beats block-wise attention except on small datasets with very few variables.

In this experiment, both the Ozone and Action 3D datasets contain many variables (72 and 570, respectively) and sufficient sequences during training for channel-wise attention to perform well. In contrast, AREM contains only 43 sequences that cover as many as seven classes. With so few sequences per class, channel-wise attention easily leads to overfitting.

Table 6 Training and test results of our model with different attention modules

4.8 Ablation Study

We conducted ablation studies to explore the effectiveness of the gates and our attention modules. The model without gates and attention modules is identical to vanilla Res2Net. We separately incorporated gates only, attention only, and both gates and attention into Res2Net and compared the results.

Again, we only present the results on the EEG and AREM datasets to avoid information overload. For each dataset, we tested the attention mechanism that performed worse of the two, i.e., channel-wise attention on the EEG dataset and block-wise attention on the AREM dataset, to make the comparisons more evident.

Our results (Table 7) show that the attention modules contribute slightly more than the gates to improving the performance of Res2Net, but every component contributes significantly to the overall improvement.

Table 7 Ablation test for our model

4.9 Time Consumption of Attention Modules

We conducted experiments to analyze the extra time consumption introduced by the attention modules. We selected two datasets, MotorImagery and DuckDuckGeese, because their sequence lengths and variable numbers are both large. We trained the models on an i7-8700K CPU instead of a GPU because a GPU is powerful enough to mask the differences in time consumption. We stacked 4 layers and used 64 groups of convolutional filters at each layer. We trained the model with channel-wise attention, with block-wise attention, and without any attention module for 300 epochs each, and recorded the training and test time per epoch. We report the average time consumption and the standard deviation. The results are shown in Tables 8 and 9.
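The per-epoch timing was gathered along these lines (a simple wall-clock sketch; `run_epoch` is a hypothetical callable wrapping one full training or test pass):

```python
import time
import numpy as np

def time_epochs(run_epoch, n_epochs: int = 300):
    """Run `run_epoch` repeatedly and report the mean and standard deviation
    of the wall-clock time per epoch."""
    times = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        run_epoch()
        times.append(time.perf_counter() - start)
    return float(np.mean(times)), float(np.std(times))
```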

Table 8 Time consumption with and without attention modules on DuckDuckGeese. CA denotes channel-wise attention and BA denotes block-wise attention. Values in brackets are standard deviations
Table 9 Time consumption with and without attention modules on MotorImagery. CA denotes channel-wise attention and BA denotes block-wise attention. Values in brackets are standard deviations

According to the results, time consumption increases significantly when using an attention module. Of the two, channel-wise attention is more computationally expensive: compared with the model without any attention module, its training time is about 2.2 times higher on DuckDuckGeese and 4.9 times higher on MotorImagery, whereas block-wise attention takes about 2.1 times and 2 times as long, respectively. Although the attention modules improve performance (shown in Sect. 4.8), they also make the model less efficient, which poses challenges for deploying the model on devices with limited computing resources.

4.10 Impact of Feature Dimension Reduction

As discussed in Sect. 4.9, our model is less efficient when the time series contains too many variables. We therefore conducted experiments to explore the impact of combining feature dimension reduction algorithms with AGRes2Net. We selected the SelfRegulationSCP2 dataset as it contains 1152 variables and used Principal Component Analysis (PCA) to reduce the number of variables from 1152 to 28. We stacked 4 layers, each with 8 groups of convolutional filters, and used the same dropout rate and experimental settings described in Sect. 4.3. We trained the model on an i7-8700K CPU and recorded the accuracy, precision, recall, F1-score, and time consumption of both the training and test phases. The results are given in Table 10.
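The variable reduction can be sketched as below; fitting PCA on all training time steps pooled together is our assumption, as the paper does not detail the fitting scheme.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_variables(train_x, test_x, n_components: int = 28):
    """Reduce the variable (channel) dimension of multivariate time series with PCA.
    Inputs have shape (n_series, length, n_variables)."""
    n_tr, length, n_vars = train_x.shape
    pca = PCA(n_components=n_components)
    pca.fit(train_x.reshape(-1, n_vars))                  # pool all training time steps
    tr = pca.transform(train_x.reshape(-1, n_vars)).reshape(n_tr, length, n_components)
    te = pca.transform(test_x.reshape(-1, n_vars)).reshape(len(test_x), length, n_components)
    return tr, te
```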

Table 10 Performance comparison with and without PCA on SelfRegulationSCP2. Values in brackets are standard deviations

According to the results, all performance metrics decrease, but the time consumption is significantly reduced. Specifically, the accuracy after using PCA decreases by 9.59%, but the test speed of the model is about 33 times faster. Dimension reduction algorithms such as PCA are therefore a practical option for dropping some features when we want to make the model more efficient on time series that contain too many variables.

4.11 Effectiveness of Our Model as a Plugin

We used MLSTM-FCN, the SOTA architecture on most datasets (as shown in Table 1), to demonstrate the effectiveness of our model as a plugin. The original MLSTM-FCN follows a CNN-LSTM parallel architecture: the input goes through LSTM and CNN branches, and their outputs are concatenated and passed through a fully connected layer for information fusion. We conducted this experiment by replacing the original convolutional modules of MLSTM-FCN with our model while preserving the architecture and all other parts of MLSTM-FCN.
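A schematic of the resulting hybrid is sketched below; the module and parameter names are hypothetical, and the recurrent branch is heavily simplified compared with the original MLSTM-FCN (which also uses dimension shuffle and squeeze-and-excitation blocks).

```python
import torch
import torch.nn as nn

class PluggedMLSTMFCN(nn.Module):
    """Sketch of an MLSTM-FCN-style parallel model whose convolutional branch
    is replaced by an AGRes2Net block (hypothetical names, simplified branches)."""

    def __init__(self, n_vars: int, n_classes: int, agres2net_block: nn.Module,
                 conv_out: int, lstm_hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_vars, lstm_hidden, batch_first=True)  # recurrent branch
        self.conv_branch = agres2net_block                          # our model as a drop-in
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(lstm_hidden + conv_out, n_classes)      # fusion layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_vars)
        _, (h, _) = self.lstm(x)                                    # last hidden state
        c = self.pool(self.conv_branch(x.transpose(1, 2))).squeeze(-1)
        return self.fc(torch.cat([h[-1], c], dim=1))                # concatenate and fuse
```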

We show the comparison results on two datasets, AREM and Gesture Phase, to demonstrate the impact of our model on the overall performance of MLSTM-FCN. Specifically, we adopted block-wise attention on the AREM dataset and channel-wise attention on the Gesture Phase dataset, without any particular reason for this choice. We omit the results on the other datasets as they lead to similar conclusions.

The results (Fig. 3) show a significant improvement in the classification accuracy of MLSTM-FCN on both datasets after the replacement, demonstrating the positive effect of our model on the performance of existing multivariate time series classification models when used as a plugin.

Fig. 3
figure 3

Accuracy comparison between the vanilla MLSTM-FCN (blue bar) and the MLSTM-FCN where our model replaces the convolutional modules (orange bar). Block-wise attention and channel-wise attention are applied to the AREM dataset and the Gesture Phase dataset, respectively

4.12 Exploring the Threshold for Choosing Channel-wise Attention and Block-wise Attention

As discussed in the previous sections, channel-wise attention performs better on datasets with many variables, while block-wise attention performs better on datasets with few variables. This section further explores whether a standard threshold exists for choosing the proper attention module. We selected two datasets, LSST and HeartBeat, for this experiment because they contain many variables and channel-wise attention outperforms block-wise attention on them; we can thus use dimension reduction to gradually lower the variable number and find the point at which block-wise attention becomes the better choice. We used PCA to control the variable number. The results can be seen in Tables 11 and 12.

According to the results, the thresholds of the two datasets differ (3 for LSST and 2 for HeartBeat). Besides, when the variable number on LSST is reduced to 3, the performance of both attention modules decreases significantly, making the results less convincing. Moreover, from the results given in Table 3, block-wise attention performs better on the FingerMovements dataset, while channel-wise attention outperforms block-wise attention on the ECG dataset; yet FingerMovements contains 28 variables, whereas ECG contains only 2. To summarize, the threshold is case-by-case, and no standard threshold exists. Although we can follow the rule of thumb that channel-wise attention is preferable for datasets with many variables (such as SelfRegulationSCP2, Action 3D, and DuckDuckGeese), we still need to conduct empirical studies on each dataset to choose the proper attention module.

Table 11 Performance comparison based on the different variable numbers on LSST
Table 12 Performance comparison based on the different variable numbers on HeartBeat

4.13 Practical Advice

We offer several suggestions on applying our model to broader scenarios based on the above experimental results and our analysis:

  • Avoid very deep models: a wider model is generally more capable than a deeper one of addressing a general multivariate time series classification problem. We should prioritize constructing wider models rather than stacking more layers when faced with a new problem.

  • Focus on tuning the hyperparameter s: setting a larger s increases the number of convolutional-filter groups, leading to multiple receptive fields that capture temporal patterns in various ranges. Tuning the hyperparameter s is especially important for long time-series sequences to achieve the best possible performance. It is generally worthwhile to tune s ahead of investigating the optimal settings of other parameters.

  • Choose the attention module based on the variable number: based on our experiments, the number of variables is by far the most useful single criterion for deciding which attention module to use. As discussed, block-wise attention is preferred for sequences with a small number of variables, whereas channel-wise attention is more suitable for sequences with many variables. Further criteria, such as the number of sequences available for training, the number of classes, and the sequence length, must be weighed case by case.

5 Conclusion and Future Work

In this paper, we propose a novel deep learning architecture called Attentional Gated Res2Net for accurate multivariate time series classification. Our model comprehensively incorporates gates and two types of attention modules to capture multi-granular temporal information. We evaluate the model on diverse datasets that contain sequences of various lengths with a wide range of variable numbers. Our experiments show the model outperforms several baselines and state-of-the-art methods by a large margin. We thoroughly investigate the effect of different components and settings on the model’s performance and provide hands-on advice on applying our model to a new problem. Our test on plugging the model into a state-of-the-art architecture, MLSTM-FCN, demonstrates the potential for using our model as a plugin to improve the performance of existing models.

However, our attention modules increase the training and inference time when facing time series with many variables. Although dimension reduction algorithms can alleviate the time consumption, they negatively influence the classification accuracy. In the future, we aim to explore a pluggable feature selection module that selects the essential variables, hence accelerating the training and inference process. Besides, our model still relies on manual fine-tuning for different datasets. We wish to make our model dynamic instead of static so that it can adapt automatically to a given dataset.