Attentional Gated Res2Net for Multivariate Time Series Classification

Multivariate time series classification is a critical problem in data mining with broad applications. It requires harnessing the inter-relationships among multiple variables and temporal dependencies over various ranges to assign the correct label to a time series. Multivariate time series may come from a wide range of sources and be used in various scenarios, which makes temporal representation learning challenging for a classifier. We propose a novel convolutional neural network architecture called Attentional Gated Res2Net for multivariate time series classification. Our model uses hierarchical residual-like connections to achieve multi-scale receptive fields and capture multi-granular temporal information. The gating mechanism enables the model to consider the relations between the feature maps extracted by receptive fields of multiple sizes for information fusion. Further, we propose two types of attention modules, channel-wise attention and block-wise attention, to better leverage the multi-granular temporal patterns. Our experimental results on 14 benchmark multivariate time-series datasets show that our model outperforms several baselines and state-of-the-art (SOTA) methods by a large margin: its classification accuracy is 10.16% better than that of the SOTA model. Besides, we demonstrate that our model improves the performance of existing models when used as a plugin. Further, based on our experiments and analysis, we provide practical advice on applying our model to a new problem.

is, therefore, less used for time series classification. Transformer [16] has recently emerged as a promising solution to multivariate time series classification (MTSC). While transformer supports both parallel computing and efficient temporal feature extraction, it requires massive parameters for the multiple fully connected layers, making the training extremely time-consuming. Furthermore, transformer suffers from overfitting on small datasets [19] and faces challenges in capturing short-range temporal information [20]. Besides, existing solutions to MTSC commonly require careful adjustments of architectures and parameters to deal with time series of various lengths. This is a critical yet little-studied issue in existing time series classification research.
To summarize, ML methods are expertise-dependent and have difficulty representing complicated non-linear functions. Among the DL methods, CNN is efficient for training and inference but struggles to capture long dependencies; RNN can effectively learn the temporal representations of long temporal features but is computationally expensive; transformer contains too many parameters, making it prone to overfitting on small datasets. We aim for accurate MTSC that can adapt to time series of various lengths to address the above deficiencies of existing studies. To this end, we propose a novel CNN architecture called Attentional Gated Res2Net (AGRes2Net) for MTSC. Our model can overcome the shortcoming of the standard CNN architecture by enabling the extraction of both global and local temporal features. It can also leverage multi-granular feature maps through channel-wise and block-wise attention mechanisms. In a nutshell, we make the following contributions in this paper:
• We propose a novel AGRes2Net architecture for accurate MTSC. Our model can capture dependencies over various ranges and exploit the inter-variable relations to achieve high performance on time series of various lengths, making it feasible for various tasks.
• We propose two attention mechanisms, namely channel-wise attention and block-wise attention, to leverage multi-granular temporal information for tasks with different data characteristics. The former has advantages on datasets with many variables, while the latter can effectively prevent overfitting on datasets with very few variables.
• We conducted extensive experiments on 14 benchmark datasets to evaluate the model. A comparison with several baselines and state-of-the-art methods shows the superior performance of our model. Besides, plugging our model into MLSTM-FCN, a state-of-the-art CNN-RNN parallel model, demonstrates the model's capability to improve existing models' performance.
The remainder of the paper is organized as follows. Section 2 overviews the related work; Sect. 3 presents the proposed model and attention mechanisms; Sect. 4 reports our experiments and results; and finally, Sect. 5 gives the concluding remarks.
Traditionally, CNNs are used for computer vision tasks, such as image recognition [33], object detection [34][35][36], and semantic segmentation [37]. Recent studies [38][39][40][41][42] show 1D-CNN is promising for temporal feature extraction: the convolution computation can capture potential temporal patterns, while the information fusion across channels can cope with the inter-relations among variables. Further, Inception [43] uses multiple parallel convolutional kernels of different sizes to address the challenges faced by CNNs in capturing long-range temporal dependencies [44,45]. However, Inception's receptive field has a restricted width, which limits its ability to capture long-range dependencies.
The combination of CNN and RNN represents an effort to exploit the advantages of both [46]. Hybrid CNN-RNN architectures generally follow a parallel or cascade style to facilitate temporal feature extraction in various ranges. For example, LSTM-FCN [47] uses CNN and RNN in parallel and achieves state-of-the-art performance on several benchmark datasets. Since LSTM-FCN employs RNN as a component, it cannot fully leverage the power of GPUs, leading to extended training time. In comparison, transformer [16] learns both temporal dependencies and inter-variable relations based on positional embedding and the attention mechanism. It achieves state-of-the-art performance on several time-series datasets [48,49] but suffers from extended training time and overfitting on small datasets [19] due to its massive number of trainable parameters. It also has difficulty capturing short-range temporal information when compared with RNN.

Attention Mechanism
Attention mechanism was first used in the seq2seq model for machine translation [18]. A vanilla seq2seq model first feeds the input sequence to an encoder (which consists of multiple recurrent layers) [18] to generate hidden states and outputs. It then collects the hidden states of all the steps to represent the information of the input. An attention mechanism forces the model to learn the weights of hidden states in the decoder part during this process. Thus, the model can focus on a specific region of the input sequence, leading to a significant performance improvement.
Recent studies have designed different attention modules and applied them to various domains [50,51]. Among them, the Squeeze-and-Excitation (SE) block [52] is widely used for various tasks thanks to its ease of implementation. SE works in two steps. First, it uses global average pooling to obtain an information vector of feature maps from different channels. Then, it employs fully connected layers to capture the inter-relations between feature maps, learn the weights of the feature maps, and highlight the critical information.
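The two SE steps described above can be illustrated with a minimal NumPy sketch. It assumes the commonly used SE formulation (a bottleneck of two fully connected layers with a ReLU and a sigmoid, and a reduction ratio `r`); the function name `se_block`, the toy shapes, and the random weights are illustrative choices, not details taken from this paper.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation on a 1-D feature map x of shape (channels, length)."""
    # Squeeze: global average pooling yields one scalar per channel.
    z = x.mean(axis=1)                                # (C,)
    # Excitation: bottleneck FC + ReLU, then FC + sigmoid (assumed standard form).
    h = np.maximum(0.0, w1 @ z + b1)                  # (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # (C,), each in (0, 1)
    # Re-weight every channel of the input by its learned importance.
    return x * weights[:, None], weights

rng = np.random.default_rng(0)
C, T, r = 8, 16, 4                                    # toy sizes (hypothetical)
x = rng.normal(size=(C, T))
w1, b1 = rng.normal(size=(C // r, C)), np.zeros(C // r)
w2, b2 = rng.normal(size=(C, C // r)), np.zeros(C)
out, weights = se_block(x, w1, b1, w2, b2)
```

Note that the squeeze step keeps only a single scalar per channel, so the temporal profile of each feature map is not visible to the excitation layers.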

Our Approach
We propose Attentional Gated Res2Net for accurate classification of multivariate time series of various lengths. In particular, we incorporate gating and attention mechanisms on top of Res2Net [53], where gates control the information flow across the groups of convolutional filters, and the attention module harnesses the feature maps at different levels of granularity. The overall architecture of AGRes2Net (shown in Fig. 1) consists of two stages: Convolution and Attention. We illustrate these two stages in the following subsections, respectively.

Fig. 1 The structure of Attentional Gated Res2Net. It consists of two stages: convolution and attention. The convolution stage feeds the input to a convolutional layer for channel expansion and then groups the output along the channel. Each group (except the first) conducts convolution based on its input and its precedent group's output (passed through gates). The attention stage forces the model to consider the temporal information at different levels of granularity. Finally, the network uses a convolutional layer for channel compression and information fusion.

Convolution Stage
We design the convolution stage based on Res2Net [53], a CNN backbone specially designed to achieve multi-scale receptive fields based on group convolution. Group convolution first appeared in AlexNet [54] and significantly reduced the number of the parameters in that model. It has since been adopted in many lightweight networks [55,56] to generate a large number of feature maps with a small number of parameters.
Unlike conventional CNNs, which use a single set of filters to work on all channels, Res2Net includes multiple groups of filters and uses a separate group to handle each subset of channels. These filter groups are connected in a hierarchical, residual-like style, and they work as follows. First, a convolutional layer takes the input data and outputs a feature map for channel expansion. Then, the feature map is split into groups along the channel, generating groups of feature maps, i.e., input feature maps. Finally, for each input feature map, a separate group of filters extracts features and generates the corresponding output, i.e., an output feature map. In particular, when extracting features from an input feature map, the filter group also takes into account the output of the filter group that comes immediately before it. The whole process repeats until all input feature maps are processed.
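The hierarchical, residual-like processing described above can be sketched in a few lines of NumPy. The sketch follows the original Res2Net formulation [53] as we understand it (the first group is passed through unchanged, the second group is convolved on its own, and every later group is convolved after adding the preceding group's output); the helper `conv1d_same`, the all-ones kernels, and the impulse input are illustrative assumptions used only to make the growth of the receptive field visible.

```python
import numpy as np

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution. x: (C_in, T); w: (C_out, C_in, k), k odd."""
    k, T = w.shape[2], x.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    # Sum the contribution of every kernel tap (cross-correlation, as in DL frameworks).
    return sum(w[:, :, t] @ xp[:, t:t + T] for t in range(k))

def res2net_groups(X, kernels):
    """Split X (channels, T) into s groups and process them hierarchically."""
    s = len(kernels) + 1                 # the first group has no filter group
    xs = np.split(X, s, axis=0)
    ys = [xs[0]]                         # first group: passed through unchanged
    for i in range(1, s):
        # second group: own input only; later groups: input + preceding output
        inp = xs[i] if i == 1 else xs[i] + ys[-1]
        ys.append(conv1d_same(inp, kernels[i - 1]))
    return ys

# Toy input: 8 channels in 4 groups, a single impulse in the second group.
X = np.zeros((8, 10))
X[2, 5] = 1.0
kernels = [np.ones((2, 2, 3)) for _ in range(3)]
ys = res2net_groups(X, kernels)
# Number of time steps that each output "sees" the impulse at.
widths = [int(np.count_nonzero(y.any(axis=0))) for y in ys]
print(widths)  # [0, 3, 5, 7]
```

With width-3 kernels, the impulse spreads over 3, 5, and 7 time steps in successive groups, which is the multi-scale receptive field effect that the concatenated outputs expose to the final convolutional layer.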
Suppose $X$ is the feature map obtained from channel expansion, and $X$ is evenly divided into $s$ groups, $\{x_i\}_{i=1}^{s}$, where $x_i$ denotes the $i$th group. Each group contains an input feature map that has the same temporal size but contains only $1/s$ of the channels in $X$. Let $K_i$ be the convolution operation. Then, given an input feature map $x_i$, the convolution output, $y_i$, is calculated as follows:

By feeding the concatenation of all the outputs to a convolutional layer, Res2Net achieves multi-scale receptive fields to facilitate multivariate time series classification. However, it has difficulty in controlling the information flow between the feature-map groups: at each step, $y_i$ is always fully sent to the next group regardless of whether it helps or harms the model's performance.
Addressing this limitation is important as it enables the model to control how to weigh the precedent output feature map against the current input feature map in an input-dependent manner. This, in turn, mitigates the problem of vanishing gradients without having to take long delays. To this end, we introduce the gating mechanism [31] into Res2Net at the convolution stage to enhance feature extraction. Specifically, in our model (shown in Fig. 1), all groups of feature maps (except the first) are sent to convolutional layers for feature extraction, and a gating unit lies between each pair of adjacent feature-map groups to control how much information flows from the precedent to the current group. Given a feature-map group (or more specifically, input feature map), $x_i$, the value of the corresponding gate, $g_i$, is calculated as follows:

where $a$ can be either fully-connected or 1-D convolutional layers, concat is the concatenation operation, and tanh is the activation function commonly used for gates. Note that we only use the precedent output feature map $y_{i-1}$ and the current input feature map $x_i$ to calculate the gate; this is different from the gating mechanism in [31]. More specifically, we omit the undivided feature map $X$ as it contains redundant information and does not significantly improve the performance. Eventually, after the convolution stage, we obtain $y_i$ as follows:
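A minimal NumPy sketch of one plausible reading of the gated convolution stage is given below. It assumes that the gate is $g_i = \tanh(a(\mathrm{concat}(y_{i-1}, x_i)))$ with $a$ realized as a width-1 convolution, and that the gated output is $y_i = K_i(x_i + g_i \cdot y_{i-1})$ for every group after the first. Both forms are assumptions inferred from the symbol definitions in the text, not the exact equations; `conv1d_same`, `gated_res2net_groups`, and all shapes are hypothetical.

```python
import numpy as np

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution. x: (C_in, T); w: (C_out, C_in, k), k odd."""
    k, T = w.shape[2], x.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    return sum(w[:, :, t] @ xp[:, t:t + T] for t in range(k))

def gated_res2net_groups(X, kernels, gate_weights):
    """Hierarchical group convolution with a gate between adjacent groups."""
    s = len(kernels) + 1
    xs = np.split(X, s, axis=0)
    ys, gates = [xs[0]], []              # first group: passed through unchanged
    for i in range(1, s):
        # Gate computed only from the preceding output and the current input
        # (assumed form: tanh of a width-1 convolution over their concatenation).
        g = np.tanh(conv1d_same(np.concatenate([ys[-1], xs[i]], axis=0),
                                gate_weights[i - 1]))
        gates.append(g)
        # The gate scales how much of the preceding output reaches this group.
        ys.append(conv1d_same(xs[i] + g * ys[-1], kernels[i - 1]))
    return ys, gates

rng = np.random.default_rng(0)
s, c, T = 4, 2, 10                       # groups, channels per group, length (toy)
X = rng.normal(size=(s * c, T))
kernels = [rng.normal(size=(c, c, 3)) for _ in range(s - 1)]
gate_weights = [rng.normal(size=(c, 2 * c, 1)) for _ in range(s - 1)]
ys, gates = gated_res2net_groups(X, kernels, gate_weights)
```

Under this reading, a gate value near zero lets a group ignore its predecessor, while a value near one recovers the ungated Res2Net behaviour for that pair of groups.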

Attention Stage
The convolution stage only considers the information flow between adjacent feature-map groups. As such, it limits the model's ability to capture the dependencies between groups that are far apart. In this regard, we design an attention stage that lets the model attend to specific parts of the output feature maps. In particular, we propose two types of attention modules, namely the channel-wise attention module and the block-wise attention module, to harness multi-granular temporal patterns effectively.

Channel-wise Attention
Channel-wise attention captures the relations between channels of the convolution stage's output, i.e., the output feature maps $\{y_i\}_{i=1}^{s}$, where $s$ is the number of feature-map groups in the convolution stage.
Suppose every $y_i$ contains the same number of channels, say $J$ channels; this is reasonable as we divide the original feature map $X$ evenly along the channel. Let $h_{i,j}$ be the feature map for the $j$th channel of $y_i$. We use three fully-connected layers to learn the query, key, and value of $h_{i,j}$ (denoted by $q_{i,j}$, $k_{i,j}$, and $v_{i,j}$, respectively). Similarly, we denote by $q_{m,n}$, $k_{m,n}$, and $v_{m,n}$ the query, key, and value of $h_{m,n}$, the feature map for the $n$th channel of $y_m$. Given two different feature maps, $h_{i,j}$ and $h_{m,n}$, we calculate the channel-wise attention as follows:

Once computed, we can update the feature map of every channel according to its relations with all the other feature maps. As the feature maps contain temporal information within various ranges, channel-wise attention can capture temporal dependencies at multiple levels of granularity. Based on the above, the updated feature map $\tilde{h}_{i,j}$ can be calculated as follows:

Given $s$ output feature maps each having $J$ channels with $k$ dimensions, the total number of feature maps for channel-wise attention is $s \times J$, resulting in the computational complexity
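As a hedged illustration of channel-wise attention, the sketch below treats each of the $s \times J$ channel feature maps as one vector of $k$ dimensions and applies standard scaled dot-product attention with a softmax. The scaling, the softmax, the inclusion of each map's relation to itself, and the names `channel_wise_attention`, `Wq`, `Wk`, and `Wv` are assumptions made for the example; they are not confirmed by the text. Under this assumed form, computing the score matrix over $sJ$ vectors of dimension $k$ costs on the order of $(sJ)^2 k$ multiply-adds.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_wise_attention(ys, Wq, Wk, Wv):
    """ys: list of s arrays of shape (J, k). Returns updated maps and attention."""
    H = np.concatenate(ys, axis=0)            # (s*J, k): one row per channel map
    Q, K, V = H @ Wq, H @ Wk, H @ Wv          # queries, keys, values via FC layers
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))  # (s*J, s*J) pairwise relations
    return A @ V, A                           # each map updated from all the maps

rng = np.random.default_rng(0)
s, J, k = 4, 3, 16                            # toy sizes (hypothetical)
ys = [rng.normal(size=(J, k)) for _ in range(s)]
Wq, Wk, Wv = (rng.normal(size=(k, k)) / np.sqrt(k) for _ in range(3))
updated, A = channel_wise_attention(ys, Wq, Wk, Wv)
```

The attention matrix here has $(sJ)^2$ entries, which is why this module becomes costly as the number of groups or channels grows.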

Block-wise Attention
Block-wise attention regards each y i as an individual block that contains temporal information at a certain granularity. Instead of calculating attention values along the channel, block-wise attention directly feeds y i to the fully-connected layers to calculate the corresponding query, key, and value. Block-wise attention has advantages in mitigating overfitting as it considers sparse relations when computing the attention.
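For comparison with channel-wise attention, a minimal NumPy sketch of block-wise attention follows. It assumes that each block $y_i$ is flattened into a single vector before the fully-connected layers and that standard scaled dot-product attention with a softmax is used; these choices and the names `block_wise_attention`, `Wq`, `Wk`, and `Wv` are illustrative assumptions rather than the exact formulation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def block_wise_attention(ys, Wq, Wk, Wv):
    """ys: list of s arrays of shape (J, k). Each whole block is one attention unit."""
    B = np.stack([y.reshape(-1) for y in ys])   # (s, J*k): one row per block
    Q, K, V = B @ Wq, B @ Wk, B @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))  # (s, s): only block-to-block relations
    out = (A @ V).reshape(len(ys), *ys[0].shape)
    return out, A

rng = np.random.default_rng(0)
s, J, k = 4, 3, 16                              # toy sizes (hypothetical)
ys = [rng.normal(size=(J, k)) for _ in range(s)]
d = J * k
Wq = rng.normal(size=(d, 8)) / np.sqrt(d)
Wk = rng.normal(size=(d, 8)) / np.sqrt(d)
Wv = rng.normal(size=(d, d)) / np.sqrt(d)       # keeps the block shape after update
updated, A = block_wise_attention(ys, Wq, Wk, Wv)
```

In this toy setting the attention matrix has only $s^2 = 16$ entries, compared with $(sJ)^2 = 144$ pairwise relations if every channel were treated separately, which illustrates the sparser set of relations that block-wise attention considers.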
Suppose $y_i$ and $y_m$ are two output feature maps. We denote by $q_i$, $k_i$, and $v_i$ the query, key, and value of $y_i$; similarly, we denote by $q_m$, $k_m$, and $v_m$ the query, key, and value of $y_m$. Then, we calculate the block-wise attention as follows:

Once computed, we can update the feature map of every block according to its relations with all the other feature maps. The updated feature map for each block, $\tilde{y}_i$, can be calculated as follows:

Given $s$ feature maps, each having $J$ channels with $k$ dimensions, the computational complexity of block-wise attention is $O(s^2 J k)$.

series datasets. We demonstrate that our approach can be used as a plugin to improve the performance of state-of-the-art methods and provide practical advice on how to adapt our approach to a specific problem.

Datasets
We conducted experiments on 14 public multivariate time series datasets (summarized in Table 1). These datasets cover various tasks from different application domains, such as activity recognition, EEG classification, and weather forecasting. They contain time series of various lengths with different numbers of variables. We have carefully selected these datasets to reflect applications in various domains and ensure that they are diverse enough in the length and variable number of time series to reflect different difficulty levels in real-world multivariate time-series classification problems.

Baseline Methods
We selected several competitive baselines and state-of-the-art (SOTA) methods to compare with our approach.
• Res2Net [53]: this is a CNN backbone that uses group convolution and hierarchical residual-like connections between convolutional filter groups to achieve multi-scale receptive fields.
• GRes2Net [31]: this work incorporates gates in Res2Net, where the gates' values are calculated based on a different method from ours; it additionally takes into account the original feature map before it is divided into groups when calculating gates' values.
• Res2Net+SE: this work combines Res2Net with a Squeeze-and-Excitation Block (SE) [52] to leverage the effectiveness of attention modules.
• GRes2Net+SE: this work combines GRes2Net with SE to leverage the effectiveness of attention modules.
We briefly introduce the SOTA methods for the experimental datasets below. A full list of SOTA methods is given in Table 1.

Model Configuration and Evaluation Metric
We followed the preprocessing procedures used by the SOTA methods. In particular, we normalized each dataset to zero mean and unit standard deviation. We also applied zero padding to cope with sequences of various lengths in the same training set. The experimental results of each method were obtained under the optimal or suggested settings provided in the original paper.
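The preprocessing described above can be sketched as follows. The sketch assumes that the mean and standard deviation are computed per variable over all observed time steps, that normalization is applied before padding, and that zeros are appended at the end of shorter sequences; none of these details is stated in the text, and the function name `pad_and_normalize` is hypothetical.

```python
import numpy as np

def pad_and_normalize(sequences):
    """sequences: list of arrays of shape (T_n, V) with varying lengths T_n."""
    stacked = np.concatenate(sequences, axis=0)        # all observed time steps
    mean = stacked.mean(axis=0)                        # per-variable statistics
    std = stacked.std(axis=0)
    std = np.where(std == 0, 1.0, std)                 # guard constant variables
    max_len = max(len(seq) for seq in sequences)
    batch = np.zeros((len(sequences), max_len, stacked.shape[1]))
    for n, seq in enumerate(sequences):
        batch[n, :len(seq)] = (seq - mean) / std       # the tail stays zero-padded
    return batch

rng = np.random.default_rng(0)
seqs = [rng.normal(3.0, 2.0, size=(5, 2)), rng.normal(3.0, 2.0, size=(3, 2))]
batch = pad_and_normalize(seqs)
```

In practice the statistics would be estimated on the training set only and reused for the test set.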
To ensure a fair comparison, we configured all the models based on Res2Net, GRes2Net, and our approach to contain the same number of feature-map groups and to use identical filters for each group. We used our model as the backbone for feature extraction and trained it for 500 epochs using the Adam optimizer [64]. The learning rate was set to 0.001 and reduced to 1/10 of its value after every 100 epochs. The dropout rate was set to 0.4 to avoid possible overfitting. We repeated the training and test processes five times and took the average of the runs as the final result; this mitigates the impact of randomized parameter initialization. The details, including the number of layers, the number of convolutional groups, and the dropout rate settings, can be found in Table 2.
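The step-wise learning rate schedule described above can be written as a small function. The sketch assumes zero-indexed epochs; the function name `learning_rate` and its parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.1, step=100):
    """Learning rate at a zero-indexed epoch: scaled by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Learning rate at the first epoch of each 100-epoch stage of a 500-epoch run.
schedule = [learning_rate(e) for e in range(0, 500, 100)]
```

Over 500 epochs this yields five stages, starting at 0.001 and ending at roughly 1e-7.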
We used accuracy, which is currently used by all the SOTA methods on the experimental datasets, as the metric for evaluating the methods. However, accuracy alone is not comprehensive enough to measure the performance of a classifier. Although the vast majority of the related work uses accuracy as the only evaluation metric, we additionally use precision, recall, and F-score in our parameter and ablation studies to gain further insights into how our model performs.

Comparison of Different Methods
Table 3 shows a performance comparison of all the methods on the experimental datasets. Our proposed model, using either channel-wise or block-wise attention, consistently outperformed all the other compared methods on all 14 datasets, demonstrating our model's superiority in solving MTSC in diverse contexts regardless of the lengths of time-series sequences.
Channel-wise attention favors longer time-series sequences, as it beats block-wise attention on all of the top-8 datasets with the longest sequences. The results conform to our intuition that channel-wise attention may have an edge in capturing multi-granular temporal information.
Block-wise attention tends to excel on datasets that contain fewer variables. Among the top-4 datasets with the fewest variables, it beats channel-wise attention on 3 of them (AREM, LP5, and EEG); this is also consistent with our intuition that block-wise attention may have advantages in preventing overfitting thanks to the sparse relations considered in its attention calculation.
An exception occurs on the ECG dataset, which has as few as two variables; the reason is that this dataset contains abundant sequences, which allow channel-wise attention to fully exploit the training data without causing overfitting.
Figure 2 shows the result of the Wilcoxon signed-rank test on the performance of the compared methods. It shows that, overall, our model achieves similar classification performance when using channel-wise attention and block-wise attention. Either way, our model performs significantly better than the baselines. This result demonstrates the effectiveness of harnessing inter-dependencies between variables and multi-granular feature maps (as our model does using gates, attention, and group convolution) in improving classification performance on sequences of various lengths.

Impact of Depth and Width of Model
In this experiment, we study how the depth and width of our model impact the classification performance. Generally, a deeper and wider model has a stronger capability to capture complex relations from data. Our model becomes more complex as we increase its depth (by stacking more layers), width (by expanding the number of feature-map groups), or both. We trained our model under different width and depth settings and applied different types of attention for the experiment. Given the large number of experimental datasets, we only show the results on two representative datasets, Action 3D and Heartbeat. The former has medium-length sequences and a large number of variables; in contrast, the latter has long sequences but a medium number of variables, making them ideal for exemplifying the experimental results. In particular, we show the results of our model after applying channel-wise attention and block-wise attention on the Heartbeat and Action 3D datasets, respectively.

Our results (Table 4) show that wider models beat deeper models in both the training and test phases. While stacking multiple layers leads to large receptive fields that can capture dependencies in a larger range, a wider model can achieve receptive fields of multiple sizes and fuse the feature maps from different convolution filters to learn multi-granular temporal patterns. In comparison, a wider model leverages the temporal features of time-series sequences more effectively, making it generally a better choice. Several studies [65,66] in the computer vision field draw similar conclusions.

Impact of Group Number
In this experiment, we further explore the impact of the hyperparameter s, which determines the number of feature-map groups (as well as the number of filter groups) in our model. Intuitively, a larger s gives a wider model that can fuse more temporal features extracted by convolutional filters with multiple sizes of receptive fields, thus facilitating capturing long-range dependencies.
We kept all other settings (e.g., number of layers, epochs, learning rate, dropout rate) unchanged while varying the value of s to explore its influence on classification results. Similar to the previous experiment, we show the experimental results on four datasets that have significantly different sequence lengths (namely LP5, AREM, Ozone, and Action 3D) to avoid information overload. We used block-wise attention on the first two datasets and channel-wise attention on the last two.
Our results (Table 5) show that our model consistently achieved better performance during training as s increased. We can thus easily tune our model towards capturing a broader range of temporal information by allowing for more groups with a greater s. However, greater values of s bring the risk of overfitting, demonstrated by decreased performance in the test phase, e.g., in the case of the Ozone and Action 3D datasets. The results suggest the necessity of tuning the hyperparameter s for a specific dataset to gain the best performance.
Beyond the above results, we may consider our model as recurrent because each group's output feature map is sent to the subsequent group. Following this idea, we may regard the group number s as the number of steps that the model takes during its recurrent computation. While traditional convolutional neural networks obtain larger receptive fields by stacking multiple layers or employing dilated convolution layers, they are not as flexible or effective as our model in capturing multi-granular temporal information.

Impact of Attention Modules
The superiority of our attention modules over SE is indicated by our model outperforming those baselines that incorporate SE [52] (see Table 3). Specifically, the SE module uses global average pooling, which generates a scalar to represent the feature map of each channel. In comparison, our attention mechanisms (channel-wise and block-wise attention) avoid using global average pooling, thus preventing the information loss caused by the pooling operation. Table 6 further shows our model's performance when using the two attention modules during training and test. We show the results on three datasets, which cover a large range of variable numbers (7 for AREM, 72 for Ozone, and 570 for Action 3D). The results (Table 6) are consistent with our findings in Sect. 4.4 that channel-wise attention generally beats block-wise attention except for small datasets with very few variables.
As for this experiment, both Ozone and Action 3D datasets contain many variables (72 and 570) and sufficient sequences during training for channel-wise attention to perform well. In contrast, AREM contains only 43 sequences that cover as many as seven classes. The number of sequences is extremely limited for each class, making channel-wise attention easily lead to overfitting.

Table 4 Training and test results under varying widths and depths. Bold values indicate the best performance.

Table 5 Training and test results under different s. We set greater s values for the AREM dataset as it has much longer sequences than the others do. We set 6 layers for Ozone, 6 layers for AREM, 4 layers for Action 3D, and 4 layers for LP5. Bold values indicate the best performance.

Ablation Study
We conducted ablation studies to explore the effectiveness of gates and our attention modules. The model without gates and attention module is the same as vanilla Res2Net. We separately incorporate gates, attention, and both attention and gates in Res2Net and compare the results. Again, we only present the results on EEG and AREM datasets to avoid information overload. For each dataset, we tested the attention mechanism that led to inferior performance to the other, i.e., channel-wise attention on the EEG dataset and block-wise attention on the AREM dataset, to make the comparisons more evident.
Our results (Table 7) show that the attention modules contribute slightly more than gates to improving the performance of Res2Net, but every component contributes significantly to the improved performance.

Time Consumption of Attention Modules
We conducted experiments to analyze the extra time consumption of the attention modules. We selected two datasets, MotorImagery and DuckDuckGeese, because both their sequence lengths and their numbers of variables are large. We trained the models on an i7-8700K CPU instead of a GPU because a GPU is powerful enough to mask the overhead of the attention modules. We stacked 4 layers and used 64 groups of convolutional filters at each layer. We trained the model with channel-wise attention, with block-wise attention, and without any attention module for 300 epochs each, and recorded the training time and test time per epoch. We report the average time consumption and the standard deviation in Tables 8 and 9.
According to the results, the time consumption increases significantly when using an attention module. Of the two attention modules, channel-wise attention is more computationally expensive. Compared with the model without any attention module, the training time with channel-wise attention is about 2.2 times as long on DuckDuckGeese and 4.9 times as long on MotorImagery, while the training time with block-wise attention is 2.1 times as long on DuckDuckGeese and 2 times as long on MotorImagery. Although attention modules improve the performance (shown in Sect. 4.8), they also make the model less efficient, which brings challenges for employing the model on devices with limited computing resources.

Impact of Feature Dimension Reduction
As discussed in Sect. 4.9, we find that our model is less efficient when the time series contains too many variables. We therefore conducted experiments to explore the impact of combining feature dimension reduction algorithms with AGRes2Net. We selected the SelfRegulationSCP2 dataset as it contains 1152 variables. We used Principal Component Analysis (PCA) to reduce the number of variables from 1152 to 28. We stacked 4 layers, and each layer has 8 groups of convolutional filters. We used the same dropout rate and experiment settings as described in Sect. 4.3. We trained the model on an i7-8700K CPU. We recorded the performance, including accuracy, precision, recall, F1-score, and time consumption of both the training phase and the test phase. The results are given in Table 10.

Table 6 Training and test results of our model with different attention modules
According to the results, all performance metrics degrade, but the time consumption is significantly reduced. Specifically, the accuracy decreases by 9.59% after using PCA, but the model is about 33 times faster at test time. Dimension reduction algorithms (such as PCA) are therefore a practical way to drop some features when we want to make the model more efficient on time series that contain too many variables.
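To illustrate the kind of variable reduction discussed above, the following NumPy sketch projects the variable axis of a batch of time series onto its leading principal components. It assumes that every time step of every sequence is treated as one observation when fitting PCA; the text does not state how PCA was applied, and the function name `pca_reduce_variables` and the toy sizes (12 variables reduced to 4, rather than 1152 to 28) are illustrative.

```python
import numpy as np

def pca_reduce_variables(X, n_components):
    """X: (N, T, V) batch of time series. Returns an (N, T, n_components) batch."""
    N, T, V = X.shape
    flat = X.reshape(N * T, V)                 # each time step is one observation
    centered = flat - flat.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]             # (n_components, V)
    return (centered @ components.T).reshape(N, T, n_components)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20, 12))               # toy data (hypothetical)
reduced = pca_reduce_variables(X, n_components=4)
```

The temporal axis is left untouched, so the reduced batch can be fed to the same model with fewer input variables.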

Effectiveness of Our Model as a Plugin
We use MLSTM-FCN, the SOTA architecture on most datasets (as shown in Table 1), to demonstrate the effectiveness of our model as a plugin. The original MLSTM-FCN follows a CNN-LSTM parallel architecture. The input goes through multiple LSTMs and CNNs, and the outputs are concatenated and go through a fully connected layer for information fusion. We conducted this experiment by replacing the original convolutional modules of MLSTM-FCN with our model while preserving the architecture and all the other parts of MLSTM-FCN. We show the comparison results on two datasets, AREM and Gesture Phase, to demonstrate the impact of our model on the overall performance of MLSTM-FCN. We adopted block-wise attention on the AREM dataset and channel-wise attention on the Gesture Phase dataset; this choice was arbitrary. We omit the results on other datasets as they lead to similar conclusions.
The results (Fig. 3) show a significant improvement in the classification accuracy of MLSTM-FCN on both datasets after the replacement, demonstrating the positive effect of our model on the performance of existing multivariate time series classification models when used as a plugin.

Exploring the Threshold for Choosing Channel-wise Attention and Block-wise Attention
As discussed in the previous section, channel-wise attention performs better on datasets with many variables, while block-wise attention performs better on datasets with few variables. This section further explores whether a standard threshold exists for choosing the proper attention module. We select two datasets, LSST and HeartBeat, for the experiments because they contain many variables and channel-wise attention performs better than block-wise attention on them; we can therefore use dimension reduction to lower the number of variables step by step and find the point at which block-wise attention starts to perform better. We use PCA to gradually reduce the number of variables. The results are shown in Tables 11 and 12. The thresholds of the two datasets are different (3 for LSST and 2 for HeartBeat). Besides, when we reduce the number of variables to 3 on LSST, the performance of both attention modules decreases significantly, making the results less convincing. Besides, from the results given in Table 3, we can see on the FingerMovements dataset, the block-wise attention performs better, while on the ECG dataset, the channel-wise attention

Practical Advice
We offer several suggestions on applying our model to broader scenarios based on the above experimental results and our analysis:
• Avoid very deep models: a wider model is generally more capable than a deeper model of addressing a general multivariate time series classification problem. We should prioritize constructing wider models rather than stacking more layers when faced with a new problem.
• Focus on tuning the hyperparameter s: setting a larger s increases the number of convolutional-filter groups, leading to multiple receptive fields that capture temporal patterns in various ranges. Tuning the hyperparameter s is especially important for long time-series sequences to achieve the best possible performance. It is generally worthwhile to tune s ahead of investigating the optimal settings of other parameters.
• Choose the attention module based on the number of variables: based on our experiments, the number of variables is, by far, the most useful single criterion for deciding which attention module to choose for our model. As discussed, block-wise attention is preferred for sequences with a small number of variables, and channel-wise attention is more suitable for sequences with a large number of variables. Further criteria include the number of sequences available for training, the number of classes, and the length of sequences, which must be assessed case by case.

Conclusion and Future Work
In this paper, we propose a novel deep learning architecture called Attentional Gated Res2Net for accurate multivariate time series classification. Our model incorporates gates and two types of attention modules to capture multi-granular temporal information. We evaluate the model on diverse datasets that contain sequences of various lengths with a wide range of variable numbers. Our experiments show that the model outperforms several baselines and state-of-the-art methods by a large margin. We thoroughly investigate the effect of different components and settings on the model's performance and provide hands-on advice on applying our model to a new problem. Our test on plugging the model into a state-of-the-art architecture, MLSTM-FCN, demonstrates the potential for using our model as a plugin to improve the performance of existing models. However, our attention modules increase training and inference time on time series with many variables. Although dimension reduction algorithms can alleviate this cost, they negatively influence classification accuracy. In the future, we aim to explore a pluggable feature selection module that selects essential variables, thereby accelerating training and inference. Besides, our model still relies on manual fine-tuning for each dataset. We wish to make our model dynamic instead of static so that it adapts automatically to a given dataset.

Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. This work was supported by the Australian Research Council (Grant numbers DE180100251, DP220103717 and LP180100654).

Data Availability
The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Conflict of Interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.