Improving Position Encoding of Transformers for Multivariate Time Series Classification

Transformers have demonstrated outstanding performance in many applications of deep learning. When applied to time series data, transformers require effective position encoding to capture the ordering of the time series data. The efficacy of position encoding in time series analysis is not well-studied and remains controversial, e.g., whether it is better to inject absolute position encoding or relative position encoding, or a combination of them. In order to clarify this, we first review existing absolute and relative position encoding methods when applied in time series classification. We then propose a new absolute position encoding method dedicated to time series data called time Absolute Position Encoding (tAPE). Our new method incorporates the series length and input embedding dimension in absolute position encoding. Additionally, we propose a computationally Efficient implementation of Relative Position Encoding (eRPE) to improve generalisability for time series. We then propose a novel multivariate time series classification (MTSC) model combining tAPE/eRPE and convolution-based input encoding named ConvTran to improve the position and data embedding of time series data. The proposed absolute and relative position encoding methods are simple and efficient. They can be easily integrated into transformer blocks and used for downstream tasks such as forecasting, extrinsic regression, and anomaly detection. Extensive experiments on 32 multivariate time-series datasets show that our model is significantly more accurate than state-of-the-art convolution and transformer-based models. Code and models are open-sourced at https://github.com/Navidfoumani/ConvTran.


Introduction
A time series is a time-dependent quantity recorded over time. Time series data can be univariate, where only a sequence of values for one variable is collected, or multivariate, where data are collected on multiple variables. There are many applications that require time series analysis, such as human activity recognition [1], diagnosis based on electrocardiogram (ECG) and electroencephalogram (EEG) recordings, and systems monitoring problems [2]. Many of these applications are inherently multivariate in nature: various sensors are used to measure human activities, and EEGs use a set of electrodes (channels) to measure brain signals at different locations of the brain. Hence, multivariate time-series analysis methods such as classification and segmentation are of great current interest [3][4][5].
Convolutional neural networks (CNNs) have been widely employed in time series classification [4,5]. Many studies have shown that convolution layers tend to have strong generalization with fast convergence due to their strong inductive bias [6]. While CNN-based models are excellent for capturing local temporal/spatial correlations, these models cannot effectively capture and utilize long-range dependencies. Also, they only consider the local order of data points in a time series rather than the order of all data points globally. Due to this, many recent studies have used recurrent neural networks (RNNs) such as LSTMs to capture this information [7]. However, RNN-based models are computationally expensive, and their capability in capturing long-range dependencies is limited [8,9].
On the other hand, attention models can capture long-range dependencies, and their broader receptive fields provide more contextual information, which can improve the models' learning capacity. Not surprisingly, with the success of attention models in natural language processing [8,10], many previous studies have attempted to bring the power of attention models into other domains such as computer vision [11] and time series analysis [9,12,13].
The transformer's core is self-attention [8], which is capable of modeling the relationships within an input time series. Self-attention, however, has a limitation: it cannot capture the ordering of the input series. Hence, adding explicit representations of position information is especially important for attention, since the model is otherwise entirely invariant to input order, which is undesirable for modeling sequential data. This limitation is even worse in time series data since, unlike images and text, which can use Word2Vec-like embeddings, time series data has a less informative data context.
There are two main methods for encoding positional information in transformers: absolute and relative. Absolute methods, such as those used in [8,10], assign a unique encoding vector to each position in the input sequence based on its absolute position in the sequence. These encoding vectors are combined with the input encoding to provide positional information to the model. On the other hand, relative methods [14,15] encode the relative distance between two elements in the sequence, rather than their absolute positions. The model learns to compute the relative distances between any two positions during training and looks up the corresponding embedding vectors in a pre-defined table to obtain the relative position embeddings. These embeddings are used to directly modify the attention matrix. Position encoding has been verified to be effective in natural language processing and computer vision [16]. However, in time series classification, the efficacy is still unclear.
The original absolute position encoding was proposed for language modeling, where high embedding dimensions like 512 or 1024 are usually used for position embedding of input with a length of 512 [8]. For time series tasks, however, embedding dimensions are relatively low, and the series might have a variety of lengths (ranging from very short to very long). In this paper, for the first time, we study the efficiency (i.e., how well resources are utilized) and the effectiveness (i.e., how well the encodings achieve their intended purpose) of existing absolute and relative position encodings for time series data. We then show that the existing absolute position encodings are ineffective with time series data. We introduce a novel time series-specific absolute position encoding method that takes into account the series embedding dimension and length. We show that our new absolute position encoding outperforms the existing absolute position encodings in time series classification tasks.
Additionally, since the existing relative position encodings have a large memory overhead and require a large number of parameters to be trained, they are very likely to overfit on time series data. We propose a novel computationally efficient implementation of relative position encoding to improve their generalisability for time series. We show that our new relative position encoding outperforms the existing relative position encodings in time series classification tasks. We then propose a novel time series classification model, named ConvTran, based on the combination of our proposed absolute and relative position encodings to improve the position embedding of time series data. We further enrich the data embedding of time series using a CNN rather than linear encoding. Our extensive experiments on 32 benchmark datasets show ConvTran is significantly more accurate than the previous state-of-the-art in deep learning models for time series classification (TSC). We believe our novel position encodings can boost the performance of other transformer-based TSC models.

Related Work
In this section, we briefly discuss the state-of-the-art multivariate time series classification (MTSC) algorithms, as well as CNN and attention-based models that have been applied to MTSC tasks. We refer interested readers to the corresponding papers or the recent survey on deep learning for time series classification [17] for a more detailed description of these algorithms and models.

State-of-the-art MTSC Algorithms
Many MTSC algorithms have been proposed in recent years [2,4,5], many of them adapted from their univariate versions. A recent survey [5] evaluated most of the existing MTSC algorithms on the UEA MTS archive, which consists of 26 equal-length time series datasets. This benchmark includes a few deep learning as well as non-deep learning approaches. The survey concluded that there are four main state-of-the-art methods: ROCKET [18], HIVE-COTE [19], CIF [20] and Inception-Time [21].
ROCKET [18] is a scalable TSC algorithm that uses 10,000 random convolution kernels to extract 2 features from each input time series, creating 20,000 features for each time series. Then a linear model, such as ridge or logistic regression, is used for classification. Mini-ROCKET [22] is an extension of ROCKET with some slight modifications to the feature extraction process. It is significantly more scalable than ROCKET and uses only 10,000 features without compromising accuracy. Multi-ROCKET [23] extends Mini-ROCKET by leveraging the first derivative of the series as well as extracting 4 features per kernel. It is significantly more accurate than both ROCKET and Mini-ROCKET on 128 univariate TSC tasks. Note that neither Mini-ROCKET nor Multi-ROCKET has previously been benchmarked on the UEA MTS archive. The adaptation of ROCKET, Mini-ROCKET and Multi-ROCKET to multivariate time series is done by randomly selecting different channels of the time series for each convolutional kernel.
The Canonical Interval Forest (CIF) [20] is an interval-based classifier. It first extracts 25 features from random intervals of the time series and builds a time series forest with 500 trees. It is an algorithm initially designed for univariate TSC and was adapted to multivariate TSC by expanding the random interval search space, where an interval is defined as a random dimension of the time series.
The Hierarchical Vote Collective of Transformation-based Ensembles (HIVE-COTE) is a meta ensemble for TSC. It forms its ensemble from classifiers of multiple domains. Since its introduction in 2016, HIVE-COTE has gone through a few iterations. The version used in the MTSC benchmark [5] comprises four ensemble members: Shapelet Transform Classifier (STC), Time Series Forest (TSF), Contractable Bag of Symbolic Fourier Approximation Symbols (CBOSS) and Random Interval Spectral Ensemble (RISE), each of them being the state of the art in its respective domain. Since these algorithms were designed for univariate time series, their adaptation to multivariate time series is not easy. Hence, they were adapted for multivariate time series through ensembling over all the models built on each dimension independently. This means that they are computationally very expensive, especially when the number of channels is large. Recently, the latest HIVE-COTE version, HIVE-COTEv2.0 (HC2), was proposed [24]. It is currently the most accurate classifier for both univariate and multivariate TSC tasks [24]. Despite being the most accurate on the 26 benchmark MTSC datasets, which are relatively small, HC2 is not scalable to either large datasets with long time series or datasets with many channels.

CNN Based Models
CNNs are popular deep learning architectures for MTSC due to their ability to extract latent features from the time series data efficiently. Fully Convolutional Neural Network (FCN) and Residual Network (ResNet) were proposed in [25] and evaluated in [4]. FCN is a simple convolutional network that does not contain any pooling layers in convolution blocks. The output from the last convolution block is averaged with a Global Average Pooling (GAP) layer and passed to a final softmax classifier. ResNet is one of the deepest architectures for MTSC (and TSC in general), containing three residual blocks followed by a GAP layer and a softmax classifier. It uses residual connections between blocks to reduce the vanishing gradient effect in deep learning models. ResNet was one of the most accurate deep learning TSC architectures on 85 univariate TSC datasets [3,4]. It was also proven to be an accurate deep learning model for MTSC [4,5].
Inception-Time is the current state-of-the-art deep learning model for both univariate TSC and MTSC [5,21]. Inception-Time is an ensemble of five randomly initialised inception network models, each of which consists of two blocks of inception modules. Each inception module first reduces the dimensionality of a multivariate time series using a bottleneck layer with length and stride of 1 while maintaining the same length. Then, 1D convolutions of different lengths are applied to the output of the bottleneck layer to extract patterns at different sizes. A max pooling layer followed by a bottleneck layer is also applied to the original time series to increase the robustness of the model to small perturbations. Residual connections are also used between each inception block to reduce the vanishing gradient effect. The output of the second inception block is passed to a GAP layer before feeding into a softmax classifier.
Recently, Disjoint-CNN [26] showed that factorizing 1D convolution kernels into disjoint temporal and spatial components yields accuracy improvements with almost no additional computational cost. Applying disjoint temporal convolution and then spatial convolution behaves similarly to the "Inverted Bottleneck" [27]. Like the Inverted Bottleneck, the temporal convolutions expand the number of input channels, and spatial convolutions later project the expanded hidden state back to the original size to capture the temporal and spatial interaction.

Attention Based Models
Self-attention has been demonstrated to be effective in various natural language processing tasks due to its higher capacity and superior ability to capture long-term dependencies in text [8]. Recently, it has also been shown to be effective for time series classification tasks. Cross Attention Stabilized Fully Convolutional Neural Network (CA-SFCN) [9] applies the self-attention mechanism to leverage long-term dependencies for the MTSC task. CA-SFCN combines FCN and two types of self-attention, temporal attention (TA) and variable attention (VA), which interact to capture both long-range temporal dependencies and interactions between variables. With evidence that multi-headed attention dominates self-attention, many models try to adapt it to the MTSC domain. Gated Transformer Networks (GTN) [28], similar to CA-SFCN, use two-tower multi-headed attention to capture discriminative information from the input series. They merge the output of the two towers using a learnable matrix named gating.
Inspired by the development of transformer-based self-supervised learning like BERT [13], many models try to adopt the same structure for time series classification [12,13]. BErt-inspired Neural Data Representations (BENDER) replaces the word2vec encoder in BERT with wav2vec to leverage the same structure for time series data. BENDER shows that, given a massive amount of EEG data, the pre-trained model can be used effectively to model EEG sequences recorded with differing hardware. Similarly, Voice-to-Series with Transformer-based Attention (V2Sa) uses a large-scale pre-trained speech processing model for downstream problems like time series classification [29]. Recently, a Transformer-based Framework (TST) was also introduced to adapt vanilla transformers to the multivariate time series domain [12]. TST uses only the encoder part of transformers and pre-trains it with proportionally masked data in an unsupervised manner.

Background
This section provides a basic definition of self-attention and an overview of current position encoding models. Note that position encoding refers to the method that integrates position information, e.g., absolute or relative. Position embedding refers to a numerical vector associated with position encoding.

Problem Description and Notation
Given a time series dataset $X$ with $n$ samples, $X = \{x_1, x_2, ..., x_n\}$, where each sample $x_t = \{x_1, x_2, ..., x_L\}$ is a $d_x$-dimensional time series of length $L$, i.e., $x_t \in \mathbb{R}^{L \times d_x}$, and the set of relevant response labels $Y = \{y_1, y_2, ..., y_n\}$, where $y_t \in \{1, ..., c\}$ and $c$ is the number of classes, the aim is to train a neural network classifier to map the set $X$ to $Y$.

Self-Attention
The first attention mechanisms were proposed in the context of natural language processing [30]. While they still relied on a recurrent neural network at their core, Vaswani et al. [8] proposed a transformer model that relies on attention only. Transformers map a query and a set of key-value pairs to an output. More specifically, for an input series $x_t = \{x_1, x_2, ..., x_L\}$, self-attention computes an output series $z_t = \{z_1, z_2, ..., z_L\}$, where $z_i \in \mathbb{R}^{d_z}$ is computed as a weighted sum of input elements:
$$z_i = \sum_{j=1}^{L} \alpha_{i,j} \, (x_j W^V) \tag{1}$$
Each coefficient weight $\alpha_{i,j}$ is calculated using the softmax function:
$$\alpha_{i,j} = \frac{\exp(e_{ij})}{\sum_{k=1}^{L} \exp(e_{ik})} \tag{2}$$
where $e_{ij}$ is an attention weight from position $j$ to position $i$ and is computed using a scaled dot-product:
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^{T}}{\sqrt{d_z}} \tag{3}$$
The projections $W^Q, W^K, W^V \in \mathbb{R}^{d_x \times d_z}$ are parameter matrices and are unique per layer. Instead of computing self-attention once, Multi-Head Attention (MHA) [8] does so multiple times in parallel, i.e., employing $h$ attention heads. The attention head outputs are concatenated, and a linear transformation projects them back to the standard dimensions.
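As an illustration of the scaled dot-product self-attention described above, the following is a minimal single-head NumPy sketch, not the authors' implementation. The function names, the random inputs, and the sizes are assumptions chosen for the example. The final lines check that permuting the input positions merely permutes the outputs, which is why position information has to be supplied explicitly.

```python
import numpy as np

def softmax(e, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = e - e.max(axis=axis, keepdims=True)
    w = np.exp(e)
    return w / w.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """x: (L, d_x); W_q, W_k, W_v: (d_x, d_z). Returns z (L, d_z), alpha (L, L)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    e = q @ k.T / np.sqrt(W_q.shape[1])  # scaled dot-product scores e_ij
    alpha = softmax(e, axis=-1)          # each row of alpha sums to 1
    return alpha @ v, alpha              # z_i = sum_j alpha_ij * (x_j W_v)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
L, d_x, d_z = 6, 4, 8
x = rng.normal(size=(L, d_x))
W_q, W_k, W_v = (rng.normal(size=(d_x, d_z)) for _ in range(3))
z, alpha = self_attention(x, W_q, W_k, W_v)

# Shuffling the time steps only shuffles the outputs: no notion of order.
perm = np.array([3, 0, 5, 1, 4, 2])
z_perm, _ = self_attention(x[perm], W_q, W_k, W_v)
order_blind = np.allclose(z_perm, z[perm])
```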

Position Encoding
The self-attention layer cannot preserve time series positional information in the transformer architecture, since the transformer contains neither recurrence nor convolution. However, the local positional information, i.e., the ordering of the time series, is essential. The practical approach in transformer-based methods involves using multiple encodings [16,31,32], such as absolute or relative positional encoding, to enhance the temporal context of time-series inputs.

Absolute Position Encoding
The original self-attention considers the absolute position [8] and adds the absolute positional embedding $P = (p_1, ..., p_L)$ to the input embedding $x$ as:
$$x_i = x_i + p_i \tag{4}$$
where the position embedding $p_i \in \mathbb{R}^{d_{model}}$. There are several options for absolute positional encodings, including the fixed encodings by sine and cosine functions with different frequencies, called Vanilla APE, and the learnable encodings through trainable parameters (we refer to this as the Learn method) [8,10]. By using sine and cosine for fixed position encoding, the $d_{model}$-dimensional embedding of the $i$-th time step position can be represented by the following equation:
$$p_i(2k) = \sin(i\,\omega_k), \qquad p_i(2k+1) = \cos(i\,\omega_k), \qquad \omega_k = 10000^{-2k/d_{model}} \tag{5}$$
where $k$ is in the range of $[0, d_{model}/2]$, $d_{model}$ is the embedding dimension and $\omega_k$ is the frequency term. Variations in $\omega_k$ ensure that no two positions below $10^4$ are assigned similar embeddings.
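The fixed sine/cosine encoding can be sketched in a few lines of NumPy. This is an illustrative sketch only: it assumes the standard frequency term $\omega_k = 10000^{-2k/d_{model}}$ from [8], positions counted from 0, and an even $d_{model}$. The function name `vanilla_ape` is hypothetical.

```python
import numpy as np

def vanilla_ape(L, d_model):
    """Fixed sinusoidal position embeddings of shape (L, d_model); d_model even."""
    pos = np.arange(L)[:, None]              # positions i = 0 .. L-1 (assumed 0-based)
    k = np.arange(d_model // 2)[None, :]     # frequency index k
    omega = 10000.0 ** (-2.0 * k / d_model)  # assumed frequency term omega_k
    pe = np.zeros((L, d_model))
    pe[:, 0::2] = np.sin(pos * omega)        # even dimensions: sine
    pe[:, 1::2] = np.cos(pos * omega)        # odd dimensions: cosine
    return pe

pe = vanilla_ape(L=50, d_model=16)
# Dot product between position 0 and every other position; this is the kind of
# similarity-versus-distance curve discussed for position embeddings.
similarity_to_first = pe @ pe[0]
```

Because $\sin a \sin b + \cos a \cos b = \cos(a - b)$, the dot product of two such embeddings depends only on the distance between the two positions, not on where they sit in the series.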

Relative Position Encoding
In addition to the absolute position embedding, recent studies in natural language processing and computer vision also consider the pairwise relationships between input elements, i.e., relative position [14,15]. This type of method encodes the relative distance between the input elements $x_i$ and $x_j$ into vectors $p^V_{i,j}, p^K_{i,j} \in \mathbb{R}^{d_z}$. The encoding vectors are embedded into the self-attention module, which modifies Equation 1 and Equation 3 as
$$z_i = \sum_{j=1}^{L} \alpha_{i,j} \, (x_j W^V + p^V_{i,j}) \tag{6}$$
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + p^K_{i,j})^{T}}{\sqrt{d_z}} \tag{7}$$
By doing so, the pairwise positional relation is trained during transformer training. Shaw et al. [14] proposed the first relative position encoding for self-attention. Relative positional information is supplied to the model on two levels: values and keys. First, relative positional information is included in the model as an additional component to the keys. The softmax operation in Equation 2 remains unchanged from vanilla self-attention. Lastly, relative positional information is resupplied as a sub-component of the values matrix. Besides, the authors believe that relative position information is not useful beyond a certain distance, so they introduced a clip function to reduce the number of parameters. Encoding is formulated as follows to consider the distance between inputs $i$ and $j$ in computing their attention:
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + p^K_{clip(i-j,\,k)})^{T}}{\sqrt{d_z}} \tag{8}$$
$$z_i = \sum_{j=1}^{L} \alpha_{i,j} \, (x_j W^V + p^V_{clip(i-j,\,k)}) \tag{9}$$
$$clip(x, k) = \max(-k, \min(k, x)) \tag{10}$$
where $p^V$ and $p^K$ are the trainable weights of relative position encoding on values and keys, respectively.
The scalar $k$ is the maximum relative distance. However, this technique (Shaw) is not memory efficient. As can be seen in Equation 8, it requires $O(L^2 d)$ memory due to the additional relative position encoding. Huang et al. [15] introduced a new method (called the Vector method in this paper) of computing relative positional encoding that reduces its intermediate memory requirement from $O(L^2 d)$ to $O(Ld)$ using a skewing operation [15]. The authors also dropped the additional relative positional embedding corresponding to the value term and focused only on the key component. Encoding is formulated as follows:
$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^{T} + S^{rel}_{ij}}{\sqrt{d_z}} \tag{11}$$
$$S^{rel} = \mathrm{Skew}\big((x W^Q)\,(p^K)^{T}\big) \tag{12}$$
where the Skew procedure uses padding, reshaping and slicing to reduce the memory requirement [15]. In Table 1 we provide a summary of the parameter sizes, memory, and computation complexities of various position encoding methods (including the ones proposed in this paper) for comparison purposes.
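To make the memory argument concrete, the following NumPy sketch shows a clipped relative-position lookup on the key side in the spirit of Shaw et al. [14]. It is an assumption-laden illustration rather than their implementation. The names `clipped_relative_index` and `shaw_key_logits` are hypothetical, and the inputs are already-projected queries and keys. The gathered tensor `rel` has shape $(L, L, d_z)$, which is the $O(L^2 d)$ intermediate that the skewing trick avoids.

```python
import numpy as np

def clipped_relative_index(L, k):
    """(L, L) table mapping each pair (i, j) to a row of a (2k+1)-row table."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    return np.clip(i - j, -k, k) + k       # shift offsets from [-k, k] to [0, 2k]

def shaw_key_logits(q, keys, p_K, k):
    """q, keys: (L, d_z) projected queries/keys; p_K: (2k+1, d_z) trainable table."""
    L, d_z = q.shape
    rel = p_K[clipped_relative_index(L, k)]      # (L, L, d_z): the O(L^2 d) tensor
    content = q @ keys.T                         # content-content term
    position = np.einsum("id,ijd->ij", q, rel)   # content-position term
    return (content + position) / np.sqrt(d_z)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
L, d_z, k = 5, 4, 2
q = rng.normal(size=(L, d_z))
keys = rng.normal(size=(L, d_z))
p_K = rng.normal(size=(2 * k + 1, d_z))
idx = clipped_relative_index(L, k)
logits = shaw_key_logits(q, keys, p_K, k)
```

All pairs farther apart than `k` share the same embedding row, which is how clipping caps the number of trainable position vectors at `2k + 1`.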

Position Encoding of Transformers for MTSC
We design our position encoding methods to examine several aspects which are not well studied in prior transformer-based time series classification work (see the analysis in Section 5.4). As a first step, we propose a new absolute position encoding method dedicated to time series data called time Absolute Position Encoding (tAPE). tAPE incorporates the series length and input embedding dimension in absolute position encoding. We then introduce efficient Relative Position Encoding (eRPE) to explore encoding positions independently of the input encodings. After that, to study the integration of eRPE into a transformer model, we compare different ways of integrating position information into the attention matrix. Finally, we provide an efficient implementation of our methods.

Time Absolute Position Encoding (tAPE)
Absolute position encoding was originally proposed for language modeling, where high embedding dimensions like 512 or 1024 are usually used for position embedding of input with a length of 512 [8]. Fig. 1a shows the dot product between two sinusoidal positional embeddings whose distance is $K$, using Equation 5 with various embedding dimensions. Clearly, higher embedding dimensions, such as 512 (thick red line), can better reflect the similarity between various positions. As shown in Fig. 1a, for the lower embedding dimensions (thin blue and orange lines), the dot product does not always decrease as the distance between two positions increases. We call a steady decrease of the dot product with distance the distance awareness property; it disappears when lower embedding dimensions, such as 64, are used for position encoding. While high embedding dimensions show a desirable monotonic decreasing trend as the distance between two positions increases (see the red line in Fig. 1a), they are not suitable for encoding time series datasets. The reason is that most time series datasets have relatively low input dimensionality (e.g., 28 out of 32 datasets have fewer than 64 input dimensions), and higher embedding dimensions may yield inferior model throughput due to extra parameters (increasing the chances of overfitting the model).
On the other hand, in low embedding dimensions, the similarity value between two random embedding vectors is high, making the embedding vectors very similar to each other. In other words, we cannot fully utilise the embedding vector space to differentiate between two positions. Fig. 1b depicts the embedding vectors of the first and last positions for an embedding dimension of 128 and a series length of 30. In this figure, almost half of the components of the two embedding vectors are the same. This is called the anisotropic phenomenon [33]. The anisotropic phenomenon makes the position encoding ineffective in low embedding dimensions, as the embedding vectors become similar to each other, as shown in Fig. 1a (the blue line).
Hence, we require a position embedding for time series that has distance awareness while simultaneously being isotropic. In order to incorporate distance awareness, we propose to use the time series length in Equation 5. In this equation, $\omega_k$ refers to the frequency of the sine and cosine functions from which the embedding vectors are generated. Without our modification, as the series length $L$ increases, the dot product of positions becomes ever less regular, resulting in a loss of distance awareness. By incorporating the length parameter in the frequency terms of both the sine and cosine functions in Equation 5, the dot product remains smoother with a monotonic trend.
As the embedding dimension $d_{model}$ increases, it is more likely that the vector embeddings are sampled from low-frequency sinusoidal functions, which results in the anisotropic phenomenon. To alleviate this, we incorporate the $d_{model}$ parameter into the frequency term of both the sine and cosine functions in Equation 5. We propose a novel absolute position encoding for time series called tAPE, in which $\omega^{new}_k$ takes into account the input embedding dimension and length as follows:
$$\omega_k = 10000^{-2k/d_{model}}, \qquad \omega^{new}_k = \frac{\omega_k \times d_{model}}{L} \tag{13}$$
where $L$ is the series length and $d_{model}$ is the embedding dimension.
To provide further illustration, we compare our new tAPE position encoding with vanilla sinusoidal position encoding. Using $d_{model} = 128$ dimensional vectors, Figs. 2a-b show the dot product (similarity) of two positions with a distance of $K$ for series of length $L = 1000$ and $L = 30$, respectively. As depicted in Fig. 2a, in vanilla APE, only the closest positions in the series have a monotonically decreasing trend, and approximately from a distance of 50 onwards ($|K| > 50$) on both sides, the decreasing similarity trend becomes less apparent as the distance between two positions in the time series increases. However, tAPE has a more stable decreasing trend and more steadily reflects the distance between two positions. Meanwhile, Fig. 2b shows that the embedding vectors of tAPE are less similar to each other compared to vanilla APE. This is due to better utilisation of the embedding vector space to differentiate between two positions, as we discussed earlier.
Note that in Equation 13 our $\omega^{new}_k$ is equal to the $\omega_k$ of vanilla APE when $d_{model} = L$, so the encodings of tAPE and vanilla APE are then the same. However, if $d_{model} \neq L$, tAPE encodes the positions in the series more effectively than vanilla APE due to the two properties we discussed earlier.
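The relationship between tAPE and vanilla APE can be checked numerically. The sketch below is a hedged illustration, assuming the tAPE frequency term is the vanilla term $10000^{-2k/d_{model}}$ rescaled by $d_{model}/L$ and that positions are counted from 0. The function name `position_encoding` and its `tape` flag are hypothetical.

```python
import numpy as np

def position_encoding(L, d_model, tape=True):
    """Sinusoidal embeddings of shape (L, d_model); tape=True rescales omega_k."""
    pos = np.arange(L)[:, None]
    k = np.arange(d_model // 2)[None, :]
    omega = 10000.0 ** (-2.0 * k / d_model)  # vanilla frequency term
    if tape:
        omega = omega * d_model / L          # assumed tAPE rescaling by d_model / L
    pe = np.empty((L, d_model))
    pe[:, 0::2] = np.sin(pos * omega)
    pe[:, 1::2] = np.cos(pos * omega)
    return pe

# With d_model == L the two encodings coincide.
same = np.allclose(position_encoding(64, 64, tape=True),
                   position_encoding(64, 64, tape=False))

# For a short series and a larger embedding (L = 30, d_model = 128) they differ.
tape_pe = position_encoding(30, 128, tape=True)
vanilla_pe = position_encoding(30, 128, tape=False)
differ = not np.allclose(tape_pe, vanilla_pe)

# Similarity between the first position and all others under each encoding.
tape_similarity = tape_pe @ tape_pe[0]
vanilla_similarity = vanilla_pe @ vanilla_pe[0]
```

Plotting `tape_similarity` against `vanilla_similarity` is one way to reproduce the kind of comparison described for the similarity figures, under the assumptions stated above.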

Efficient Relative Position Encoding (eRPE)
There are multiple extensions of the relative position embeddings described in Section 3.3.2 in machine translation and computer vision [16,31,32]. However, input embeddings are the basis for all previous methods of relative position encoding (adding or multiplying the position matrices to the query, key, and value matrices). In this study, we introduce an efficient model of relative position encoding that is independent of input embeddings. In particular, we propose the following formulation:
$$\alpha_i = \sum_{j \in L} \Big( \underbrace{\frac{\exp(e_{i,j})}{\sum_{k \in L} \exp(e_{i,k})}}_{A_{i,j}} + w_{i-j} \Big) x_j \tag{14}$$
where $L$ is the series length, $A_{i,j}$ is the attention weight and $w_{i-j}$ is a learnable scalar (i.e., $w \in \mathbb{R}^{O(L)}$) that represents the relative position weight between positions $i$ and $j$.
It is worth comparing the strengths and weaknesses of relative position encodings and attention to determine what properties are more desirable for relative position encoding of time series data. Firstly, the relative position embedding $w_{i-j}$ is an input-independent parameter with static values, whereas an attention weight $A_{i,j}$ is dynamically determined by the representation of the input series. In other words, attention adapts to the input series via a weighting strategy (input-adaptive weighting [8]). Input-adaptive weighting enables models to capture the complicated relationships between different time points, a property that we desire most when we want to extract high-level concepts in time series. This can be, for instance, the seasonality component in a time series. However, when we have limited data, we are at a greater risk of overfitting when using attention.
Secondly, the relative position embedding $w_{i-j}$ takes into account the relative shift between positions $i$ and $j$ and not their values. This is similar to the translation equivalence property of convolution, which has been shown to enhance generalization [6]. We propose to treat $w_{i-j}$ as a scalar rather than a vector to enable translation equivalence without blowing up the number of parameters. In addition, the scalar representation of $w$ provides the benefit that the value of $w_{i-j}$ for all $(i, j)$ can be subsumed within the pairwise dot-product attention function, resulting in minimal additional computation (see Section 4.2.1). We call our proposed efficient relative position encoding eRPE.
Theoretically, there are many possibilities for integrating relative position information into the attention matrix, but we empirically found that attention models perform better when we add the relative position to the model after applying the softmax to the attention matrix, as shown in Equation 14. We presume this is because the position values will be sharper without the softmax. Sharper position embeddings seem to be beneficial in the TSC task, as they put more emphasis on informative relative positions for classification compared to existing models in which softmax is applied to relative position embeddings.
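The idea of adding a learnable scalar per relative offset after the softmax can be sketched as follows. This is a single-head NumPy illustration under stated assumptions, not the authors' code: the weighted sum is taken over a value matrix `v`, the scalars `w` are random stand-ins for trained parameters, and the 0-based layout `w[i - j + L - 1]` is a choice made for this example.

```python
import numpy as np

def softmax(e, axis=-1):
    e = e - e.max(axis=axis, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=axis, keepdims=True)

def erpe_attention(q, k, v, w):
    """q, k, v: (L, d); w: (2L-1,) one scalar per relative offset i - j."""
    L, d = q.shape
    A = softmax(q @ k.T / np.sqrt(d))   # input-adaptive attention weights A_ij
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    bias = w[i - j + L - 1]             # (L, L) input-independent position weights
    return (A + bias) @ v, bias         # position term added AFTER the softmax

# Hypothetical sizes and random stand-ins for learned parameters.
rng = np.random.default_rng(0)
L, d = 6, 4
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))
w = rng.normal(size=2 * L - 1)
out, bias = erpe_attention(q, k, v, w)
```

The bias matrix is constant along each diagonal, so it depends only on the shift `i - j`. With `w` set to zero the function reduces to plain softmax attention.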

Efficient Implementation: Indexing
To implement the efficient version of eRPE in Equation 14 for input time series with a length of $L$, for each head, we create a trainable parameter $w$ of size $2L - 1$, as there are $2L - 1$ possible relative offsets $i - j$. Then, for two position indices $i$ and $j$, the corresponding relative scalar is $w_{i-j+L}$, where indexes start from 1 instead of 0 (1-based indexing). Accordingly, we need to index $L^2$ elements from the vector of size $2L - 1$.
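The indexing step can be illustrated with a small NumPy sketch. It uses 0-based indexing, so the 1-based index $i - j + L$ becomes `i - j + L - 1`. Here `np.take` stands in for a gather-style lookup, and the helper name `relative_index` is hypothetical.

```python
import numpy as np

def relative_index(L):
    """Flat list of L*L indices into a (2L-1,) vector; can be built once and cached."""
    i = np.arange(L)[:, None]
    j = np.arange(L)[None, :]
    return (i - j + L - 1).reshape(-1)    # values lie in [0, 2L-2]

L = 4
w = np.arange(2 * L - 1, dtype=float)     # stand-in for the trainable scalars
idx = relative_index(L)                   # depends only on L, not on the data
bias = np.take(w, idx).reshape(L, L)      # (L, L) matrix of relative scalars
```

Because `idx` depends only on the series length, it can be computed once and reused for every batch, leaving only a memory lookup at run time.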
On GPU, a more efficient way to index is to use gather, which only requires memory access. At inference time, indexing the $L^2$ elements from the vector of size $2L - 1$ can be pre-computed and cached to increase the processing speed further. As shown in Table 1, our proposed eRPE is more efficient in terms of both memory

ConvTran
Now we look at how we can utilize our new position encoding methods to build a time series classification network. According to the earlier discussion, global attention has a quadratic complexity w.r.t. the series length. This means that if we directly apply the proposed attention in Equation 14 to the raw time series, the computation will be excessively slow for long time series. Hence, we first use convolutions to reduce the series length and then apply our proposed position encodings once the feature map has been reduced to a less computationally intense size. See Fig. 4, where the convolution blocks come first, followed by the attention blocks. Another benefit of using convolutions is that convolution operations are very well-suited to capturing local patterns. By using convolutions as the first component in our architecture, we can capture any discriminative local information that exists in the raw time series.
As shown in Fig. 4, as the first step in the convolution layers, $M$ temporal filters are applied to the input data. In this step, the model extracts temporal patterns in the input series. Next, the output of temporal filtering is convolved with $d_{model}$ spatial filters of shape $d_x \times M$ to capture the correlations between variables in multivariate time series and construct input embeddings of size $d_{model}$. Such disjoint temporal and spatial convolution is similar to the "Inverted Bottleneck" in [27]. It first expands the number of input channels and then squeezes them. A key reason for this choice is that the Feed Forward Network (FFN) in transformers [8] also expands the input size and later projects the expanded hidden state back to the original size to capture the spatial interactions.
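The two-step embedding can be sketched with plain NumPy. This is a simplified illustration under several assumptions: zero "same" padding, stride 1, cross-correlation without kernel flipping, and no normalisation or activation between the steps. The function name, kernel size and tensor layouts are hypothetical, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def disjoint_conv_embedding(x, temporal, spatial):
    """x: (L, d_x); temporal: (M, ks); spatial: (d_model, M, d_x) -> (L, d_model)."""
    ks = temporal.shape[1]
    pad = ks // 2
    # Zero-pad along time so the output keeps length L ("same" padding).
    x_pad = np.pad(x, ((pad, ks - 1 - pad), (0, 0)))
    windows = sliding_window_view(x_pad, ks, axis=0)   # (L, d_x, ks)
    # Temporal step: each of the M filters slides over every variable separately.
    h = np.einsum("tcs,ms->tmc", windows, temporal)    # (L, M, d_x)
    # Spatial step: each d_x-by-M filter mixes variables and temporal channels.
    return np.einsum("tmc,dmc->td", h, spatial)        # (L, d_model)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
L, d_x, M, ks, d_model = 20, 3, 8, 5, 16
x = rng.normal(size=(L, d_x))
temporal = rng.normal(size=(M, ks))
spatial = rng.normal(size=(d_model, M, d_x))
emb = disjoint_conv_embedding(x, temporal, spatial)    # one embedding per time step
```

The intermediate tensor `h` has `M * d_x` features per time step (the expansion), and the spatial step squeezes them down to `d_model`.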
Before feeding the input embedding to the transformer block, we add the tAPE-generated position embedding to the input embedding vector so that the model can capture the temporal order of the time series. The size of the position embedding vector is $d_{model}$, which is the same as that of the input embedding. Inside the multi-head attention, the inputs with dimension $L \times d_{model}$ are first converted to shape $L \times d_z \times 3$ using a linear layer to get the qkv matrix, in which $d_z$ indicates the model dimension and is defined by the user. Each of the three matrices of shape $L \times d_z$ represents the Query (q), Key (k) and Value (v) matrices. These q, k, and v matrices are reshaped to $h \times L \times d_z/h$ to represent the $h$ attention heads. Each of these attention heads can be responsible for capturing different patterns in time series. For instance, one attention head can attend to the non-noisy data, another head can attend to the seasonal component and another to the trend. Once we have the q, k, and v matrices, we finally perform the attention operation inside the multi-head attention block using Equation 14.
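The shape bookkeeping for the qkv projection and the split into heads can be sketched as below. The weight layout (query columns first, then key, then value, each ordered head by head) is an assumption made for this example, as are the function name `split_heads` and the sizes.

```python
import numpy as np

def split_heads(x_emb, W_qkv, h):
    """x_emb: (L, d_model); W_qkv: (d_model, 3*d_z). Returns q, k, v: (h, L, d_z // h)."""
    L = x_emb.shape[0]
    d_z = W_qkv.shape[1] // 3
    qkv = (x_emb @ W_qkv).reshape(L, 3, h, d_z // h)  # one linear layer yields q, k, v
    q, k, v = qkv.transpose(1, 2, 0, 3)               # (3, h, L, d_z // h), then unpack
    return q, k, v

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
L, d_model, d_z, h = 10, 16, 16, 4
x_emb = rng.normal(size=(L, d_model))      # input embedding with position added
W_qkv = rng.normal(size=(d_model, 3 * d_z))
q, k, v = split_heads(x_emb, W_qkv, h)

# Per-head scaled dot-product scores, one (L, L) matrix per head.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_z // h)
```

Each head works on a `d_z / h` slice of the projected features, so the heads can specialise without increasing the total projection size.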
According to Equation 14, the eRPE, which has the same $L \times L$ shape, is also added to the attention output. We treat each $w_{i-j}$ as a scalar (i.e., $w \in \mathbb{R}^{O(L)}$) to enable a global convolution kernel without increasing the number of parameters. The relative position embedding enables the model to learn not only the order of time points but also the relative position of pairs of time points, which can capture richer information than other position embedding strategies.
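The qkv projection, the reshaping into $h$ heads, and the scalar relative bias described above can be sketched as follows. This is a hedged NumPy sketch rather than the paper's code: the function name `erpe_attention`, the use of one bias table per head, the $1/\sqrt{d_z/h}$ scaling, and the choice to add the bias to the $L \times L$ attention weights after the softmax are assumptions, since Equation 14 is not reproduced in this section.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def erpe_attention(x, w_qkv, rel_bias, n_heads):
    """Multi-head attention with a scalar relative position bias (sketch).

    x:        (L, d_model)     input embeddings (tAPE already added)
    w_qkv:    (d_model, 3*d_z) joint projection producing q, k, and v
    rel_bias: (n_heads, 2L-1)  one scalar w_{i-j} per relative offset
    returns:  (L, d_z)
    """
    L = x.shape[0]
    d_z = w_qkv.shape[1] // 3
    d_head = d_z // n_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=1)  # each (L, d_z)

    def to_heads(m):  # (L, d_z) -> (h, L, d_z / h)
        return m.reshape(L, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, L, L)
    attn = softmax(scores, axis=-1)
    # Map each pair (i, j) to offset i - j, shifted into [0, 2L - 2].
    offsets = np.arange(L)[:, None] - np.arange(L)[None, :] + L - 1
    # Assumption: the bias is added to the attention weights post-softmax.
    attn = attn + rel_bias[:, offsets]  # (h, L, L)
    out = attn @ v  # (h, L, d_z / h)
    return out.transpose(1, 0, 2).reshape(L, d_z)
```

Because the bias depends only on $i - j$, each head stores $2L - 1$ scalars instead of an $L \times L$ matrix or a vector per offset, which is where the parameter saving comes from.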
The FFN is a multi-layer perceptron block consisting of two linear layers with Gaussian Error Linear Units (GELUs) as the activation function. The outputs of the FFN block are again added to its inputs (via a skip connection) to obtain the final output of the transformer block. Finally, just before the fully connected layer, max-pooling and global average pooling (GAP) are applied to the output of the last layer's ELU activation function, which makes the model more translation-equivariant.
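A minimal sketch of the FFN block with its skip connection is given below. It assumes the tanh approximation of GELU and omits any normalisation or dropout layers; the names `gelu` and `ffn_block` are illustrative only and are not taken from the paper.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (an assumption; the exact form uses erf).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def ffn_block(x, w1, b1, w2, b2):
    """Two linear layers with GELU in between, plus a skip connection.

    x:  (L, d_model)
    w1: (d_model, d_hidden), b1: (d_hidden,)  expansion layer
    w2: (d_hidden, d_model), b2: (d_model,)   projection back to d_model
    """
    hidden = gelu(x @ w1 + b1)
    return x + (hidden @ w2 + b2)  # residual: add the FFN output to its input
```

With the expansion factor of 4 used in the parameter settings, `d_hidden` would be `4 * d_model`.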

Experimental Results
In this section, we evaluate the performance of our ConvTran model on the UEA time series repository [2] and two large multivariate time series datasets, and compare it with the state-of-the-art models. All of our experiments were conducted using the PyTorch framework in Python on a computing system consisting of a single Nvidia A5000 GPU with 24GB of memory and an Intel(R) Core(TM) i9-10900K CPU. To promote reproducibility, we have provided our source code and more experimental results online.
We have divided our experiments into four parts. First, we present an ablation study on various position encodings. Then, we demonstrate that our ConvTran model outperforms existing CNN and transformer-based models. Next, we compare the performance of ConvTran with four state-of-the-art MTSC algorithms (covering both deep learning and non-deep learning categories) identified in [5,24]. For HiveCote2 (HC2), CIF, ROCKET, and Inception-Time, we report the results provided on the archive website, which cover only 26 of the 30 UEA datasets (Section 5.6). Finally, we evaluate the efficiency and effectiveness of ConvTran by comparing it with the current state-of-the-art model, ROCKET.

UEA Repository
The archive consists of 30 real-world multivariate time series datasets from a wide range of applications such as Human Activity Recognition, Motion classification, and ECG/EEG classification [2]. The number of dimensions ranges from 2 to 1,345, and the length of the time series ranges from 8 to 17,984. The training set size ranges from 12 to 25,000.

Ford Challenge
This dataset is obtained from the Kaggle challenge website. It includes measurements from a total of 600 real-time driving sessions, each lasting 2 minutes and sampled every 100 ms. The trials were sampled from 100 drivers of both genders and of different ages. The training data file consists of 604,329 data points, each belonging to one of 500 trials. The test file contains 120,840 data points belonging to 100 trials. Each data point comes with a label in {0, 1} and contains 8 physiological, 12 environmental, and 10 vehicular features acquired while driving.

Actitracker Human Activity Recognition
This dataset describes six daily activities collected in a controlled laboratory environment. The activities are "Walking", "Jogging", "Stairs", "Sitting", "Standing", and "Lying Down", recorded from 36 users carrying a cell phone in their pocket. The data has 2,980,765 samples with 3 dimensions, is split subject-wise into train and test sets, and has a sampling rate of 20 Hz [1].

Evaluation Procedure
We use classification accuracy as the overall metric to compare different models. We then rank each model based on its classification accuracy per dataset. The most accurate model is assigned a rank of 1 and the worst-performing model is assigned the highest rank. In case of ties, the average rank is assigned. The average rank of each model is then computed across all datasets in the repository.
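The ranking procedure above can be sketched in plain Python as follows. The function names and the accuracy values in the usage lines are made up for illustration and are not results from the paper.

```python
def rank_with_ties(accuracies):
    """Rank models on one dataset: rank 1 is the highest accuracy, and
    tied models share the mean of the ranks they would otherwise occupy."""
    order = sorted(range(len(accuracies)), key=lambda i: -accuracies[i])
    ranks = [0.0] * len(accuracies)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next model has the same accuracy.
        while (end + 1 < len(order)
               and accuracies[order[end + 1]] == accuracies[order[pos]]):
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def average_ranks(acc_by_dataset):
    """acc_by_dataset: one list of per-model accuracies per dataset,
    with the models in the same order in every list."""
    n_models = len(acc_by_dataset[0])
    totals = [0.0] * n_models
    for accs in acc_by_dataset:
        for model, rank in enumerate(rank_with_ties(accs)):
            totals[model] += rank
    return [t / len(acc_by_dataset) for t in totals]


# Hypothetical accuracies for three models on two datasets.
print(average_ranks([[0.9, 0.8, 0.7], [0.5, 0.6, 0.6]]))
```

In the second hypothetical dataset, the two tied models share rank 1.5 rather than taking ranks 1 and 2.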
This gives a direct general assessment of all the models: the lowest rank corresponds to the method that is the most accurate on average. The average rank of each model is presented in the form of a critical difference diagram [34], where models in the same clique (the black bar in the diagram) are not significantly different from each other. For the statistical test, we used the Wilcoxon signed-rank test with Holm correction as the post hoc test to the Friedman test [34].
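The Holm correction applied to the pairwise tests can be sketched as below. The sketch assumes the pairwise p-values have already been computed (for example with `scipy.stats.wilcoxon`); the function name `holm_reject`, the significance level of 0.05, and the p-values in the usage line are illustrative assumptions.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns, for each hypothesis, whether it
    is rejected at family-wise significance level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        # Compare the (step + 1)-th smallest p-value with alpha / (m - step).
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values for three pairwise model comparisons.
print(holm_reject([0.01, 0.04, 0.03]))
```

Pairs whose hypothesis is not rejected would be joined by a black bar in the critical difference diagram.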

Parameter Setting
Adam optimization is used together with an early stopping method based on validation loss. We use the default settings for the other models. We set the default number of temporal and spatial filters to 64 and the length of the temporal filters to 8. The width of the spatial convolutions is set equal to the input dimension [26]. Similar to TST, the transformer-based model for MTSC [12], and the default transformer block [8], we use 8 heads to capture the varieties of attention from the input series. The dimension of the transformer encoding is set to $d_{model} = d_z = 64$, and the FFN in the transformer block expands the input size by 4x and later projects the 4x-wide hidden state back to the original size.
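For reference, the default settings listed above can be collected in a small configuration object. The class name `ConvTranConfig` and its field names are hypothetical and chosen only for illustration; they are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvTranConfig:  # hypothetical container, not the authors' API
    n_temporal_filters: int = 64   # M temporal filters
    n_spatial_filters: int = 64    # spatial filters, equal to d_model
    temporal_kernel_len: int = 8   # length of each temporal filter
    d_model: int = 64              # input embedding size
    d_z: int = 64                  # attention dimension (d_model = d_z)
    n_heads: int = 8
    ffn_expansion: int = 4         # FFN hidden width = 4 x d_model

    @property
    def head_dim(self) -> int:
        # Per-head size after reshaping q, k, v to h x L x d_z / h.
        return self.d_z // self.n_heads

    @property
    def ffn_hidden(self) -> int:
        return self.d_model * self.ffn_expansion


cfg = ConvTranConfig()
print(cfg.head_dim, cfg.ffn_hidden)
```

With these defaults each of the 8 heads works on an 8-dimensional slice, and the FFN hidden layer has 256 units.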

Ablation Study on Position Encoding
In this section, we first compare our proposed tAPE with the existing absolute position encodings. Second, we compare our proposed eRPE with the existing relative position encoding methods. As a final step, we combine tAPE and eRPE into a single framework and compare it with all possible combinations of absolute and relative position encodings.
For this ablation study, we run a single-layer transformer five times on all 30 UEA benchmark datasets for classification. Fig. 5a illustrates the critical difference diagram of a single-layer transformer with different absolute position encodings. Note that in a critical difference diagram, methods grouped by a black line are not significantly different from each other. In Fig. 5, None is the model without any position encoding, Learn is the model with learned absolute position encoding parameters [10], Vanilla APE is the vanilla sinusoidal function-based encoding [8], Vector is the vector-based implementation of input-dependent relative position embedding [15], and our proposed methods are shown as tAPE and eRPE.
As depicted in Fig. 5a, tAPE has the highest rank in terms of accuracy and is significantly better than the other absolute position encodings, because it effectively utilises the embedding space to provide an isotropic encoding while retaining the distance awareness property. As expected, the model without position encoding has the least accurate results, highlighting the importance of absolute position encoding in time series classification. Vanilla APE, which has fewer parameters, also improves overall performance, although it is not significantly more accurate than Learn APE.

Fig. 5b shows the critical difference diagram of a single-layer transformer with different relative position encodings. As shown in this figure, eRPE has the highest rank and is significantly better than the other encodings in terms of accuracy, as it has fewer parameters and is therefore less likely to overfit. It is not surprising that the model without position encoding has the least accurate results, highlighting the importance of relative position encoding and the translation equivariance property in time series classification. The input-dependent Vector encoding also improves overall performance and is significantly better than the None model.

Fig. 6 shows the critical difference diagram for the various combinations of absolute and relative position encodings. As depicted in this figure, the combination of our proposed tAPE and eRPE is significantly more accurate than all other combinations. This shows the high potential of our encoding methods to incorporate position information into transformers. The combination of Learn and Vector has the least accurate results, most likely due to its high number of parameters.

Fig. 6 The average rank of various combinations of absolute and relative position encodings.

Comparing with State-of-the-Art Deep Learning Models
We compare our ConvTran with the following convolution-based and transformer-based models for MTSC:
FCN: The Fully Convolutional Neural Network is one of the most accurate deep neural networks for MTSC reported in the literature [4].
ResNet: The Residual Network is also one of the most accurate deep neural networks for both univariate TSC and MTSC reported in the literature [4].
Disjoint-CNN: An accurate and lightweight CNN-based model that factorizes convolution kernels into disjoint temporal and spatial convolutions [26].
Inception-Time: The most accurate deep learning algorithm for univariate TSC and MTSC to date [5,21].
TST: A transformer-based model for MTSC [12].
Fig. 7 shows the average rank of ConvTran on 32 MTS datasets against all convolution-based and transformer-based methods. On average, ConvTran has the lowest average rank and is more accurate than all other methods. Considering Fig. 7 and Table 2, we can conclude that ConvTran is the most accurate TSC method on average across all 32 benchmark datasets. In particular, it has superior performance on datasets with enough training data (i.e., more than 100 training samples per class), where it wins on all but one of the 12 datasets.

Benchmark against State-of-the-Art Models
Given that the experiments on the 32 datasets show that ConvTran has the best performance among all convolution-based and transformer-based models, we now benchmark it against the state-of-the-art MTSC models, covering both deep learning and non-deep learning models. We compare against the HC2, CIF, and ROCKET models on only 26 of the 32 MTSC benchmark datasets [5], because the other six datasets either have a large number of training samples or have varying series lengths, which makes it almost impossible to run HC2 on them. To give detailed insight into ConvTran's performance, we provide a pairwise comparison between our proposed model and each of these models. As shown in Fig. 8, our proposed model mostly outperforms HC2, ROCKET, CIF, and Inception-Time on the datasets with 100 or more training samples per class (marked with a blue circle). However, the state-of-the-art models outperform ConvTran on datasets with few training instances, such as EigenWorms with 26 training samples per class. Indeed, as shown in Table 2, all CNN-based models fail to perform competitively on the EigenWorms dataset, owing to the limitation that CNN-based models cannot capture long-term dependencies in very long time series. Note that ConvTran is the most accurate among all CNNs on this dataset. Adding a transformer improves the performance, but it still requires more training samples to perform as well as the other models.
It is also interesting to observe from Figs. 8a and 8c that HC2 and CIF perform better than ConvTran on the EthanolConcentration dataset, which is based on spectra of water-and-ethanol.

We refer interested readers to Appendix A.1 for a more comprehensive empirical evaluation of efficiency and effectiveness on all datasets. Notably, ConvTran has a faster inference time than ROCKET across all datasets. It is important to note that all the ConvTran experiments are performed on GPUs, whereas the ROCKET experiments are performed on a CPU (please refer to Section 5 for computing system details).

Conclusion
This paper studies the importance of position encoding for time series for the first time and reviews existing absolute and relative position encoding methods in time series classification. Based on the limitations of the current position encodings for time series, we proposed two novel absolute and relative position encodings specifically for time series, called tAPE and eRPE, respectively. We then integrated our two proposed position encodings into a transformer block, combined them with a convolution layer, and presented a novel deep-learning framework for multivariate time series classification (ConvTran). Extensive experiments show that ConvTran benefits from the position information, achieving state-of-the-art performance on multivariate time series classification in the deep learning literature. In future, we will study the effectiveness of our new transformer block in other transformer-based TSC models and other downstream tasks such as anomaly detection.

Table A2 shows that some of the non-deep learning models failed to handle specific datasets due to either computational complexity or the inability to handle varying input series lengths. For example, we were not able to run HC2 and CIF on the larger HAR, Ford, and InsectWingbeat datasets due to computational complexity. They were also not designed to handle varying-length time series such as the CharacterTrajectories, SpokenArabicDigits, and JapaneseVowels datasets.

Fig. 1 Sinusoidal absolute position encoding. a) The dot product of two sinusoidal position embeddings whose distance is K with various embedding dimensions. b) 128-dimensional sinusoidal positional encoding curves for positions 1 and 30 in a series of length 30.
Fig. 2a shows a case in which $d_{model} < L$ and Fig. 2b shows a case in which $d_{model} > L$. In both cases, tAPE utilises the embedding space to provide an isotropic encoding while retaining the distance awareness property. In other words, tAPE provides a balance between these two properties in its encodings. The superiority of tAPE over vanilla APE and learned APE on time series datasets of various lengths is shown in the experimental results section.

Fig. 2 Comparing the dot product between two positions whose distance is K in a time series using tAPE and vanilla APE with a $d_x = 128$ dimension vector for series of length a) L = 1000 and b) L = 30.

Fig. 3 Self-attention modules with relative position encoding using scalar and vector parameters. Newly added parts are depicted in grey.

Fig. 5 Critical difference diagram of various position encodings over thirty datasets from the UEA MTSC archive based on average accuracies: a) Various absolute position encodings, b) Various relative position encodings. The lowest rank corresponds to the method that is the most accurate on average.

Fig. 8 Pairwise comparison of ConvTran with the state-of-the-art models: (a) HC2, (b) ROCKET, (c) CIF, and (d) Inception-Time. The datasets with 100 training samples per class or more are marked with a blue circle, while the others are marked with a red square. The three values at the top of each figure show the number of wins/draws/losses from left to right.

Fig. 9 Comparison of runtime and accuracy between ConvTran and ROCKET on the largest UEA dataset, InsectWingBeat, with 25,000 training samples. The figure shows the runtime of the two models on datasets of different sizes and their corresponding classification accuracy.

Table 1
Comparing the parameter sizes, memory, and computation complexities of various position encoding methods. In our implementation, $d_z$ is equal to $d_{model}$.
Method | Parameters | Memory | Complexity
Relative: Shaw [14] | $(2L - 1)d_z$ | $L^2 d_z + L^2$ | $L^2 d_z$
Relative: Vector [15] | | |

Fig. 7 The average rank of ConvTran against all deep learning based methods on all 32 MTS datasets.

It is important to observe that ConvTran is significantly more accurate than its predecessors, i.e., the convolution-based model, Disjoint-CNN, as well as the transformer-based model, TST. This indicates the effectiveness of adding tAPE and eRPE to transformers. Table 2 presents the classification accuracy of each method on all 32 datasets, and the highest accuracy for each dataset is highlighted in bold. In this table, datasets are sorted by the number of training samples per class.

Table 2
Average accuracy of six deep learning based models over 32 multivariate time series datasets. Datasets are sorted by the number of training samples per class. The highest accuracy for each dataset is highlighted in bold.

On the other hand, ROCKET has a few wins compared to ConvTran (Fig. 8b). Most of the datasets on which ROCKET performs better, such as StandWalkJump, have a small number of time series instances per class. For instance, StandWalkJump has 3 classes with 12 training instances, which is 4 time series per class. This is insufficient to train the large number of parameters in deep learning models such as ConvTran to achieve better performance. Note that, as mentioned, these results are for 26 datasets only, excluding six datasets on which we could not run HC2 (which has high computational complexity and cannot be applied to variable-length time series). Among the excluded datasets, four are large datasets from which ConvTran could have benefited. Even so, ConvTran achieves competitive performance compared to SOTA deep and non-deep models.
To provide further insight into the efficiency of our model on datasets of varying sizes, we conducted additional experiments on the largest UEA dataset, InsectWingBeat, with 25,000 series for training. We compare the training time and test accuracy of our proposed ConvTran and ROCKET on random subsets of 5,000, 10,000, 15,000, 20,000, and 25,000 training samples. The results depicted in Fig. 9 demonstrate that ROCKET has a faster training time than ConvTran on the smaller datasets, specifically the 5k and 10k subsets, while achieving a similar training time to ConvTran on the 15k subset. However, our deep learning-based model, ConvTran, trains faster than ROCKET as the amount of data increases, as expected. Additionally, we observe from the figure that ConvTran is consistently more accurate than ROCKET on this dataset.

Table A1
Comparison of runtime and accuracy between ConvTran and ROCKET on 32 datasets of varying sizes. To facilitate easy identification, superior performance in both accuracy and runtime is highlighted in bold in the table. For a detailed comparison, the runtimes are shown in seconds.

ConvTran vs non-deep learning SOTA Models
Table A2 compares the performance of ConvTran against three non-deep learning models (ROCKET, HC2, and CIF) on different datasets with varying training sample sizes. The table presents the accuracy of each model on each dataset, with boldface indicating superior accuracy. "-" denotes methods that could not be run, either due to computational complexity or the inability to handle series of varying lengths. Overall, ConvTran outperforms the non-deep learning models on 19 out of 32 datasets (for the HC2 and CIF models, we only have results for 26 datasets, and ConvTran outperforms the other models on 13 of the 26). It performs better on datasets with larger training sample sizes, such as InsectWingBeat, while other models perform better on datasets with fewer training samples, such as StandWalkJump, which has only 12 training samples.