Series2Vec: Similarity-based Self-supervised Representation Learning for Time Series Classification

We argue that time series analysis is fundamentally different in nature from either vision or natural language processing with respect to the forms of meaningful self-supervised learning tasks that can be defined. Motivated by this insight, we introduce a novel approach called \textit{Series2Vec} for self-supervised representation learning. Unlike other self-supervised methods for time series, which carry the risk of positive sample variants being less similar to the anchor sample than series in the negative set, Series2Vec is trained to predict the similarity between two series in both the temporal and spectral domains through a self-supervised task. Series2Vec relies primarily on the consistency of the unsupervised similarity step, rather than the intrinsic quality of the similarity measurement, without the need for hand-crafted data augmentation. To further enforce the network to learn similar representations for similar time series, we propose a novel approach that applies order-invariant attention to each representation within the batch during training. Our evaluation of Series2Vec on nine large real-world datasets, along with the UCR/UEA archive, shows enhanced performance compared to current state-of-the-art self-supervised techniques for time series. Additionally, our extensive experiments show that Series2Vec performs comparably with fully supervised training and offers high efficiency on datasets with limited labeled data. Finally, we show that the fusion of Series2Vec with other representation learning models leads to enhanced performance for time series classification. Code and models are open-source at \url{https://github.com/Navidfoumani/Series2Vec}.


Introduction
Learning from large time series datasets is important in various fields such as human activity recognition [1], diagnosis based on electronic health records [2], and systems monitoring problems [3]. These applications can generate hundreds to thousands of time series every day, producing large quantities of data that are critical for the performance of various time series tasks. However, obtaining labeled data for large time series datasets can be costly and challenging. Machine learning models trained on large labeled time series datasets tend to perform better than models trained on small or sparsely labeled datasets, or trained without supervision, which often produce subpar results on various time series machine learning tasks [4,5]. Therefore, instead of relying on good quality annotations on large datasets, researchers and practitioners are now turning their attention towards self-supervised representation learning for time series.
Self-supervised representation learning is a subfield of machine learning that aims to learn representations from data without requiring explicit supervision [6]. Unlike supervised learning, where models are trained on labeled data, self-supervised learning methods leverage the inherent structure of the data to learn useful representations in an unsupervised manner. The learned representations can then be used for a variety of downstream tasks such as classification, anomaly detection, and forecasting [1].
Contrastive learning is an effective and popular self-supervised learning method, originally developed for image analysis [7]. In contrastive learning, the model learns to differentiate between similar and dissimilar examples. These methods have been successfully used to improve performance in a variety of learning tasks such as image classification [7], object detection [8,9], and natural language processing [7].
In spite of the research progress in self-supervised approaches in vision and language, this area is in its infancy for time series [1]. In this paper, we propose a new approach to self-supervised learning for time series that is inspired by contrastive learning [7]. A common yet powerful method for contrastive learning with images is to first create synthetic transformations (augmentations) of an image; the model then learns to contrast the image and its transforms against other images in the training data. We believe that this approach works well for images because many learning tasks related to images involve the interpretation of the objects captured in the image. Transformations such as scaling, blurring, and rotation assume that the resulting images will resemble those that would have been generated in the original scenario with changes in camera zoom, stability, focus, or angle.

Fig. 1: A dendrogram comparing the similarity of three time series of different classes and their augmented variants, taken from the BME dataset [15]. The three original raw series are augmented using the strong augmentation technique (jittering and permutation) proposed in TS-TCC [10]. Under Dynamic Time Warping distance, the original series of class 0 is most similar to the augmented series of class 2. Additionally, the augmented series of classes 0 and 1 are quite dissimilar from their original series.
However, there do not appear to be equivalent transformations that can be applied to time series data. Transformations that have been used in contrastive learning for time series, including TS-TCC [10], MCL [11], TS2Vec [4], BTSF [5], and TF-C [12], all carry the risk that the variants of the positive sample might be less similar to the anchor sample than the series in the negative set. For instance, T-Loss [13] uses a subseries as a positive sample for a given anchor sample. In situations where there is a level shift in the anchor sample, the defined positive sample may be less similar to the anchor sample than the series in the negative set, where no level shift exists. TS-TCC [10] uses augmentation techniques such as permutation, which carries the same risk: the permutation of the anchor sample may be very similar to a series in the negative set. Figure 1 shows an example where the augmentation techniques proposed for TS-TCC, using jittering and permutation, produce augmented series that are different (dissimilar under Dynamic Time Warping (DTW) distance [14]) from the original series. The original series of class 0 is more similar to the augmented series of class 2 than to its own augmentation. Additionally, the augmented series of classes 0 and 1 are quite dissimilar to their original series. This represents a failure to generate augmentations that are meaningfully similar to the originals while also sufficiently different to allow the creation of useful representations.

For this reason, we propose Series2Vec, a novel self-supervised method inspired by contrastive learning that instead uses learning similarity as its self-supervised task. Our model utilizes time series similarity measures to assign the target output for the encoder loss, providing a different type of implicit bias that is more suitable for time series analysis than existing pretext methods (pretext refers to the unsupervised task used to generate supervision signals for the target task). This method of creating representations in time series data offers a new and more effective approach to implicit bias encoding.
This method simply aims to provide similar representations for time series that are close to each other in the original feature space, and dissimilar representations for time series that are far from each other:
\[
\mathrm{Sim}_T(x_i, x_j) \geq \mathrm{Sim}_T(x_i, x_k) \implies \mathrm{Sim}_r\big(E_T(x_i), E_T(x_j)\big) \geq \mathrm{Sim}_r\big(E_T(x_i), E_T(x_k)\big)
\]
where $\mathrm{Sim}_T$ is a relevant similarity measure in the time domain, $\mathrm{Sim}_r$ is a relevant similarity measure in the representation domain, $E_T$ is the function from time series to their representations, and $x_i$, $x_j$, and $x_k$ are time series. Since frequency information in time series can be of great importance and is a different/additional source of information, we further extend our model to also learn representations in the frequency domain.
To do so, we propose a novel approach that applies self-attention to each representation within the batch during training. The self-attention mechanism enforces the network to learn similar representations for all similar time series within each batch. One crucial insight motivating this work is the importance of the consistency of the targets, not just their correctness, which enables the model to focus on modeling the sequential structure of time series. Our approach draws inspiration from contrastive methods for self-supervised representation learning; however, Series2Vec benefits from a similarity prediction loss over time series to represent their structure. Notably, it achieves this without the need for hand-crafted data augmentation.
Additionally, we demonstrate that similarity-based representation learning can be used as a complementary technique with other methods, such as self-prediction and contrastive learning, to enhance the performance of time series analysis.
In summary, the main contributions of this work are as follows:
• A novel self-supervised learning framework (Series2Vec) is proposed for time series representation learning, inspired by contrastive learning.
• A time series similarity measure-based pretext task is proposed to assign the target output for the encoder loss, providing a more suitable implicit bias for time series analysis.
• A novel approach is introduced that applies order-invariant self-attention to each representation during training, effectively enhancing the preservation of similarity in the representation domain.
• The Series2Vec framework was evaluated extensively on nine real-world time series datasets, along with the UCR/UEA archive, and displayed improved results compared to existing SOTA self-supervised methods. It was also evaluated when fused with other representation learning models.

Related Work
Recent advances in self-supervised learning have focused on learning representations through pretext tasks, such as solving jigsaw puzzles [16], image colorization [17], and predicting image rotation [18] in the computer vision domain. In the NLP domain, self-supervised models like BERT [19] and GPT-3 [20] have also been successful in learning meaningful representations of language. However, these methods rely on heuristics that may limit the generality of the learned representations. Contrastive learning methods have emerged as an alternative to address this issue, leveraging augmented data to learn invariant representations, such as SimCLR [8] in computer vision and ALBERT [21] in NLP. Self-supervised learning for time series classification can mainly be divided into two groups: contrastive learning and self-prediction. This section delves into these approaches. Additionally, a literature review on time series similarity measures has been conducted and is available in Appendix A for those interested.

Contrastive Learning
Contrastive learning involves a model learning to differentiate between positive and negative time series examples. Scalable Representation Learning (SRL) [13] and Temporal Neighborhood Coding (TNC) [22] apply subsequence-based sampling and assume that distant segments are negative pairs and neighboring segments are positive pairs. TNC takes advantage of the local smoothness of a signal's generative process to define neighborhoods in time with stationary properties, further improving the sampling quality for the contrastive loss function. TS2Vec [4] uses contrastive learning to obtain robust contextual representations for each timestamp in a hierarchical manner. It involves randomly sampling two overlapping subseries from the input and encouraging consistency of contextual representations on the common segment. The encoder is optimized using both a temporal contrastive loss and an instance-wise contrastive loss.
In addition to the subsequence-based methods, there are also other models such as Time-series Temporal and Contextual Contrasting (TS-TCC) [10], Mixing up Contrastive Learning (MCL) [11], and Bilinear Temporal-Spectral Fusion (BTSF) [5] that employ instance-based sampling. TS-TCC uses weak and strong augmentations to transform the input series into two views, and then uses a temporal contrasting module to learn robust temporal representations. The contextual contrasting module is then built upon the contexts from the temporal contrasting module and aims to maximize similarity among contexts of the same sample while minimizing similarity among contexts of different samples [10]. BTSF uses simple dropout as the augmentation method and aims to incorporate spectral information into the feature representation [5]. Similarly, Time-Frequency Consistency (TF-C) [12] is a self-supervised learning method that leverages the frequency domain to achieve better representations. It proposes that the time-based and frequency-based representations, learned from the same time series sample, should be closer to each other in the time-frequency space than representations of different time series samples.

Self-Prediction
The primary objective of self-prediction-based self-supervised models is to reconstruct the input data. Studies have explored using transformer-based self-supervised learning methods for time series classification, following the success of models like BERT [19]. BErt-inspired Neural Data Representations (BENDER) [23] uses the transformer structure to model EEG sequences and shows that it can effectively handle massive amounts of EEG data recorded with differing hardware. Another study, Voice-to-Series with Transformer-based Attention (V2Sa) [24], utilizes a large-scale pre-trained speech processing model for time series classification.
Transformer-based Framework (TST) [25] adapts vanilla transformers to the multivariate time series domain and uses a self-prediction-based self-supervised pre-training approach with masked data. The pre-trained models are then fine-tuned for downstream tasks such as classification and regression. These studies demonstrate the potential of using transformer-based self-supervised learning methods for time series classification.

Method
This section begins by formulating the problem of self-supervised time series representation learning. We then introduce our proposed Series2Vec model architecture, which is designed to effectively learn representations from time series data. We also explain the similarity measures that we use in our approach and how they contribute to the effectiveness of our method. Finally, we describe our pretext method for self-supervised time series representation learning, i.e., self-supervised similarity-preserving, and outline our approach for defining a model that can effectively capture and preserve the underlying similarity within the data.

Problem Definition
In this study, our aim is to tackle the problem of learning a nonlinear embedding function that can effectively map each time series $x_i$ from a given dataset $X$ into a condensed and meaningful representation $r_i \in \mathbb{R}^K$, where $K$ denotes the desired representation dimension. The dataset $X$ comprises $n$ samples, specifically $X = \{x_1, x_2, ..., x_n\}$, where each $x_i$ represents a $d_x$-dimensional time series of length $L$. We denote that $x_i \equiv x_i^T$ represents an input time series sample, and $x_i^F$ represents the discrete frequency spectrum of $x_i$. We define $r_i^T$ as the representation of sample $x_i$ in the time domain, $r_i^F$ as the representation of $x_i$ in the frequency domain, and $r_i$ as the concatenation $[r_i^T, r_i^F]$. To evaluate the quality of our learned representation $r = \{r_1, r_2, ..., r_n\}$, we consider two scenarios based on the availability of labeled data: Linear Probing and Fine-Tuning.

Linear Probing
We assume access to a large volume of unlabeled data $X^U = \{x_i \mid i = 1, ..., n\}$, along with a smaller subset of labeled data $X^L$. First, we pre-train a model without using labels through a self-supervised pretext task. Once the pre-training is complete, we freeze the encoder and add a linear classifier on top of the pre-trained model's output or intermediate representations. This linear classifier can be implemented as a linear layer or logistic regression. The linear classifier is subsequently trained on a downstream task, typically a classification task, utilizing the pre-trained representations as inputs. Linear probing serves as an evaluation method to assess the quality of the learned representations.

Fine-Tuning
We assume that the dataset is fully labeled, denoted as $X^L = \{(x_i, y_i) \mid i = 1, ..., n\}$. Each sample in $X^L$ is associated with a label $y_i \in \{1, ..., C\}$, where $C$ represents the number of classes. We investigate whether leveraging similarity-based representation learning for initialization provides advantages compared to randomly initializing a supervised model. To examine this, we first pre-train the model without using labels through a self-supervised pretext task. Afterward, we train (fine-tune) the entire model for a few epochs using the labeled dataset in a fully supervised manner.

Model Architecture
The overall architecture of Series2Vec is shown in Figure 2. The Series2Vec model architecture proposed in this work is designed to handle both univariate and multivariate time series inputs. However, for the purpose of simplicity, we will illustrate the model using univariate time series in the following descriptions. As shown in Figure 2, the model comprises four main components: a time encoder ($E_T$), a frequency encoder ($E_F$), similarity measuring functions for time and frequency (Section 3.3), and a similarity-preserving loss function (Section 3.4). The encoder blocks map the input time series data into condensed and meaningful representations in both the time and frequency domains. A similarity measuring function calculates the similarity between pairs of input series, providing a quantitative measure of their resemblance. To optimize the encoder blocks, a similarity-preserving loss function is employed. This loss function guides the learning process, encouraging the encoder blocks to learn representations that preserve the similarity relationships between different samples in the dataset in both the time and frequency domains.

Fig. 2: Architecture of Series2Vec. The top module learns the representations in the temporal domain and the bottom module learns the representations in the frequency domain.

For a given input time series sample, denoted as $x_i$, we obtain its corresponding frequency spectrum, $x_i^F$, through a transform operator such as the Fourier Transform [26]. The frequency spectrum captures universal frequency information within the time series data, which has been widely acknowledged as a key component in classical signal processing [26]. Furthermore, recent studies have demonstrated the potential of utilizing frequency information to enhance self-supervised representation learning for time series data [5,12].
The time-domain input $x_i^T$ and the frequency-domain input $x_i^F$ are separately passed into the time and frequency encoders to extract features. The feature extraction process is as follows:
\[
r_i^T = E_T(x_i^T; \theta_T), \qquad r_i^F = E_F(x_i^F; \theta_F)
\]
where $\theta_T$ and $\theta_F$ represent the parameters of the time and frequency encoders, respectively. The encoded representations of $x_i$ are denoted as $r_i^T \in \mathbb{R}^K$ and $r_i^F \in \mathbb{R}^K$. Following the established setup outlined in previous works (e.g., [27,28]), we adopt disjoint convolutions for encoding both temporal and spectral features. These convolutions efficiently capture the temporal and spatial features [27]. To ensure consistent representation sizes, we employ max pooling at the end of the encoding network. This choice guarantees the scalability of our model to different input lengths.
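To illustrate how max pooling at the end of the encoder makes the representation size independent of the input length, the following NumPy sketch applies two valid 1-D convolutions with ReLU and then pools over the time axis. It uses a hypothetical single filter per layer rather than the paper's actual disjoint-convolution encoder with 16 filters, so it is a minimal sketch of the length-invariance property only.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution of series x (length L) with kernel w (length k)."""
    L, k = len(x), len(w)
    return np.array([x[t:t + k] @ w + b for t in range(L - k + 1)])

def encode(x, w1, b1, w2, b2):
    """Two conv layers with ReLU, then global max pooling over time, so the
    output shape does not depend on the input length."""
    h = np.maximum(conv1d(x, w1, b1), 0.0)   # layer 1 + ReLU
    h = np.maximum(conv1d(h, w2, b2), 0.0)   # layer 2 + ReLU
    return h.max()                           # max pooling collapses the time axis
```

Because the pooling collapses the time axis, inputs of different lengths yield representations of identical shape; the actual encoder produces a $K$-dimensional vector by using multiple filters per layer.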

Similarity Measuring Function
Soft-DTW [29] is employed as the similarity function in the time domain. It was proposed as an alternative to DTW, and we use it because an efficient GPU implementation of Soft-DTW is available, which allows our proposed method to be more efficient, to scale, and to run faster on large time series datasets. The distance calculated by Soft-DTW is a continuous and differentiable function.
The formulation for the Soft-DTW distance is given by
\[
\mathrm{SoftDTW}_{\alpha}(x_i^T, x_j^T) = {\min_{\pi \in \mathcal{A}(L, L)}}^{\!\alpha} \sum_{(k,l) \in \pi} \big\| x_{i,k}^T - x_{j,l}^T \big\|^2
\]
where $x_i^T$ and $x_j^T$ are the two time series being compared, $L$ is the length of the time series, $\pi$ is a warping path drawn from the set of admissible alignments $\mathcal{A}(L, L)$, and $\min^{\alpha}$ denotes the soft-minimum with smoothing parameter $\alpha$. The warping path is defined as a function that maps each index of one time series to a corresponding index in the other time series. The goal is to find the warping path that minimizes the sum of the squared distances between the corresponding elements of the two time series. The parameter $\alpha \in [0, 1]$ controls the degree of alignment between the two time series. Smaller values of $\alpha$ result in a more accurate alignment, while larger values lead to a more robust alignment. It is worth noting that setting $\alpha = 0$ makes Soft-DTW and DTW equivalent.
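The recursion behind Soft-DTW can be sketched as a standard dynamic program where the hard minimum over the three predecessor cells is replaced by a soft-minimum. This is an illustrative reference implementation, not the GPU implementation used in the paper.

```python
import numpy as np

def soft_min(values, alpha):
    """Soft-minimum with smoothing alpha; alpha = 0 recovers the hard minimum."""
    if alpha == 0:
        return min(values)
    v = np.asarray(values, dtype=float)
    m = v.min()
    # log-sum-exp form, shifted by the minimum for numerical stability
    return m - alpha * np.log(np.exp(-(v - m) / alpha).sum())

def soft_dtw(x, y, alpha=0.1):
    """Soft-DTW between 1-D series x and y via the soft-min recursion."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + soft_min(
                [D[i - 1, j], D[i, j - 1], D[i - 1, j - 1]], alpha
            )
    return D[n, m]
```

Setting `alpha=0` recovers the hard minimum at each cell and hence classic DTW, consistent with the equivalence noted above.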
For the similarity function in the frequency domain, we use the Euclidean distance, since, unlike the temporal domain where Soft-DTW is employed, the concept of time warping does not apply directly to the frequency domain. The Euclidean distance between two input series $x_i^F$ and $x_j^F$ can be calculated as follows:
\[
d_E(x_i^F, x_j^F) = \sqrt{\sum_{m=1}^{M} \big(x_{i,m}^F - x_{j,m}^F\big)^2}
\]
Here, $x_i^F$ and $x_j^F$ represent the frequency domain representations of the two time series being compared, and $M$ is the number of frequency bins. The Euclidean distance is computed by taking the square root of the sum of squared differences between corresponding frequency components of the two representations.
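The frequency-domain similarity can be sketched as follows. Here the spectrum is obtained as a real-FFT magnitude, which is one plausible choice of transform operator; the paper only requires a discrete frequency spectrum, so this specific transform is an assumption.

```python
import numpy as np

def euclidean_freq(x_i, x_j):
    """Euclidean distance between the magnitude spectra of two series
    (magnitude spectrum via real FFT is an assumed transform choice)."""
    xf_i = np.abs(np.fft.rfft(x_i))
    xf_j = np.abs(np.fft.rfft(x_j))
    return float(np.sqrt(((xf_i - xf_j) ** 2).sum()))
```

One reason warping is unnecessary in this domain: a phase-shifted sinusoid has the same magnitude spectrum as the original, so its frequency-domain distance to the original is numerically zero even though the two series are misaligned in time.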

Self-Supervised Similarity-Preserving
Contrastive learning has been successfully used in computer vision and natural language processing due to the strong constraints present in image and text data. In NLP, syntax and semantics constrain the ordering and meaning of tokens, making it easier to define meaningful variants of the positive samples, e.g., replacing a word with its synonym [20]. Similarly, images can be analyzed based on their subject matter, and transformations such as scaling, blurring, and rotation still allow the same subject to be identified. However, the wide variety of possible sources and processes in time series data makes it more challenging to apply the same constraints and techniques used in computer vision and NLP for learning representations from time series: the variants of the positive sample (such as permutations [10] and subseries selections [13]) all carry the risk of being less similar to the anchor sample than the series in the negative set.
We propose a novel pretext task that is specifically designed to address the unique challenges and characteristics of time series data. Our task aims to model a different type of implicit bias that is more suitable for time series analysis. Our proposed Series2Vec utilizes a similarity measure to align the target output for the encoder loss. The key question now is what approach most effectively captures and preserves this similarity.
To simplify the explanation, we will focus on the time domain and omit the frequency domain. Let us assume that $r_i$ and $r_j$ are the representation vectors for input time series $x_i$ and $x_j$, respectively. Our main objective is to learn similar representations for all similar time series within each batch. To accomplish this, we leverage transformers and make use of the order-invariant property of self-attention mechanisms.
In our approach, each time series within each batch functions as a query and attends to the keys of the other samples in the batch in order to construct its representation. This process allows the representation we seek to capture and aggregate all the relevant information from the input representations of the entire batch. By employing the transformer architecture and utilizing self-attention, we aim to generate comprehensive representations that encapsulate the pertinent characteristics and similarities among the input time series samples.
To the best of our knowledge, our work is the first to introduce the concept of feeding each time series as an input token to transformers in order to learn similarity-based representations. In our approach, we utilize transformers to model the relationships and interactions between the time series within the batch. By treating each time series as a separate input token, we enable the model to capture the fine-grained similarities between different series.
Specifically, transformers map a query and a set of key-value pairs to an output. For an input batch representation $R = \{r_1, r_2, ..., r_B\}$, where $B$ is the batch size, self-attention computes an output series $Z = \{z_1, z_2, ..., z_B\}$, where $z_i \in \mathbb{R}^{d_z}$ is computed as a weighted sum of the input elements:
\[
z_i = \sum_{j=1}^{B} \alpha_{i,j} \, (r_j W^V)
\]
Each coefficient weight $\alpha_{i,j}$ is calculated using a softmax function:
\[
\alpha_{i,j} = \frac{\exp(e_{ij})}{\sum_{k=1}^{B} \exp(e_{ik})}
\]
where $e_{ij}$ is an attention weight from representation $j$ to $i$ and is computed using a scaled dot-product:
\[
e_{ij} = \frac{(r_i W^Q)(r_j W^K)^{\top}}{\sqrt{d_z}}
\]
The projections $W^Q, W^K, W^V \in \mathbb{R}^{K \times d_z}$ are parameter matrices and are unique per layer. Instead of computing self-attention once, Multi-Head Attention (MHA) [30] does so multiple times in parallel, i.e., employing $h$ attention heads.
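The batch-level self-attention above can be sketched in NumPy as a single head. The projection matrices here are random placeholders standing in for trained parameters, so this is a sketch of the mechanism rather than the trained model.

```python
import numpy as np

def batch_self_attention(R, Wq, Wk, Wv):
    """Single-head self-attention where each row of R (one series'
    representation) attends to every representation in the batch."""
    Q, K, V = R @ Wq, R @ Wk, R @ Wv             # (B, d_z) each
    e = Q @ K.T / np.sqrt(K.shape[1])            # scaled dot-product scores
    a = np.exp(e - e.max(axis=1, keepdims=True)) # stable softmax numerator
    a /= a.sum(axis=1, keepdims=True)            # softmax over the batch axis
    return a @ V                                 # Z: (B, d_z)
```

Because there is no positional encoding over the batch axis, the operation is order-invariant in the sense used above: permuting the rows of $R$ permutes the outputs identically, so the representations do not depend on the order of samples within the batch.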
Assuming $z_i, z_j \in \mathbb{R}^{d_z}$ are the output vectors of the transformer for input representations $r_i$ and $r_j \in \mathbb{R}^K$, respectively, the pretext objective we have defined aims to minimize the following loss function:
\[
\mathcal{L}_T = \sum_{i,j} \mathrm{SmoothL1}\big(R_T(z_i, z_j), \ \mathrm{Sim}_T(x_i, x_j)\big)
\]
This loss captures the similarity between the encoded representations $z_i$ and $z_j$ produced by our encoder. It is calculated as the smooth $L_1$ loss [31] between the representation-domain similarity $R_T(z_i, z_j)$ and the similarity function $\mathrm{Sim}_T(x_i, x_j)$. The smooth $L_1$ loss is defined as:
\[
\mathrm{SmoothL1}(a, b) =
\begin{cases}
0.5\,(a - b)^2 & \text{if } |a - b| < 1 \\
|a - b| - 0.5 & \text{otherwise}
\end{cases}
\]
We chose the smooth $L_1$ loss because the literature shows it is less sensitive to outliers than the MSE loss, and in certain scenarios it prevents the issue of exploding gradients [31]. We also found experimentally that it performs better than the MSE loss. The similarity $R_T(z_i, z_j)$ is computed by taking the dot product of the encoded vectors $z_i$ and $z_j$. The similarity $\mathrm{Sim}_T(x_i, x_j)$ is calculated between the time series $x_i$ and $x_j$ using Equation 3.
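A direct transcription of the smooth $L_1$ loss with the standard threshold of 1, quadratic near zero and linear in the tails, which is what damps the influence of outlier similarity targets:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1: 0.5*d^2 for |d| < 1, else |d| - 0.5, with d = pred - target."""
    d = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
```

The two branches meet at $|d| = 1$ with the same value (0.5) and the same slope, so the loss is continuously differentiable, which keeps gradients bounded for large errors.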
In our model, we follow the same process for the frequency domain. The loss function is defined as follows:
\[
\mathcal{L}_F = \sum_{i,j} \mathrm{SmoothL1}\big(R_F(z_i^F, z_j^F), \ \mathrm{Sim}_F(x_i^F, x_j^F)\big)
\]
Here, the similarity $\mathrm{Sim}_F(x_i^F, x_j^F)$ is calculated between $x_i^F$ and $x_j^F$ using Equation 4. The total loss is then calculated as:
\[
\mathcal{L}_{Total} = \mathcal{L}_T + \mathcal{L}_F
\]
Training the encoder using the $\mathcal{L}_{Total}$ loss function, which is based on a time series-specific similarity measure, enables the model to learn a representation of the input data that effectively captures the similarities between the series in each batch. Additionally, time series-specific similarity measures are able to align and compare time series with different time steps and lengths by warping the time axis, making the loss function robust to non-linear variations in the data. This makes the model more robust and less sensitive to small variations in the data, which in turn improves its ability to generalize to unseen time series data. Furthermore, by training the model with a loss function that is based on time series-specific similarity measures, the model is exposed to a wide range of time series variations, such as different time steps, lengths, and irregular intervals, which allows it to learn the underlying patterns in the data that are specific to time series. Time series-specific similarity measures like Dynamic Time Warping (DTW) can handle irregular time intervals, non-stationary time series, and variable-length time series, which can be beneficial when training the model with time series that have these characteristics.
The primary focus of our proposed pretext model is to leverage the similarity information between time series, without being limited by the quality of a specific similarity measure. This allows for flexibility in the choice of similarity measure, as any time series similarity measure can be plugged into the model and used to learn representations. In this paper, we chose a time series-specific similarity measure, Soft-DTW [29] (please refer to Section 3.3 for the reasons why we use this similarity measure). Clearly, our proposed model is not limited to specific similarity measures and can be easily extended to incorporate other similarity measures as well.

Experimental Results
This section presents the experimental results of our study, focusing on the performance evaluation of the Series2Vec model in the downstream task of time series classification. The experiments are divided into three main parts: 1) linear probing, 2) fine-tuning, and 3) an ablation study. Our primary objective is to assess the effectiveness of the learned representation in accurately classifying time series data and to compare Series2Vec's performance against other state-of-the-art models. Additional experiments on the UCR/UEA archive are provided in Appendix C due to space constraints. Here we evaluate models on datasets commonly used in the representation learning literature.

Datasets
To evaluate the performance of our model, we utilize a total of nine publicly available datasets that have been previously used in the literature for time series representation learning [1]. These datasets cover various domains, such as epileptic seizure prediction [32], sleep stage classification [33], and human activity recognition datasets such as [34], PAMAP2 [35], Skoda [36], USC-HAD [37], Opportunity [38], WISDM [39], and WISDM2 [40]. The details of each dataset are presented in Appendix B.

Evaluation Procedure and Parameter Setting
We evaluate model performance using classification accuracy as the main metric, following the time series classification literature. Models are ranked based on their accuracy per dataset, with the highest accuracy receiving a rank of 1 and the lowest rank assigned to the worst performer. In the case of ties, the average rank is calculated. The final step is to compute the average rank of each model across all datasets. This gives a direct general assessment of all the models: the lowest rank corresponds to the method that is the most accurate on average. For the statistical test, we used the Wilcoxon signed-rank test [41]. In our experiments, the Series2Vec model employed two layers of temporal and spatial convolutions [27] to encode temporal and spectral features. The model utilized 16 filters per layer in the temporal and spatial convolution layers. During training, a batch size of 64 was used, and the Adam optimization algorithm [42] was employed. To prevent overfitting, an early stopping method based on the validation loss was implemented. The model is pre-trained for 100 epochs, and logistic regression is then applied to the representations for linear probing.
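The ranking procedure described above can be sketched as follows, assuming an accuracy table with one row per dataset and one column per model (the table itself is a hypothetical example, not results from the paper):

```python
import numpy as np

def average_ranks(acc):
    """acc: (n_datasets, n_models) accuracy table.
    Rank 1 = best accuracy per dataset; tied accuracies share the average rank.
    Returns the mean rank of each model across datasets."""
    n_data, n_models = acc.shape
    ranks = np.zeros_like(acc, dtype=float)
    for d in range(n_data):
        order = np.argsort(-acc[d])             # descending accuracy
        r = np.empty(n_models, dtype=float)
        r[order] = np.arange(1, n_models + 1)   # provisional ranks 1..n
        for v in np.unique(acc[d]):             # average ranks over ties
            mask = acc[d] == v
            r[mask] = r[mask].mean()
        ranks[d] = r
    return ranks.mean(axis=0)
```

The model with the lowest mean rank is the most accurate on average, which is how the final comparison row is read.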
Similar to the transformer-based model for multivariate time series classification (TST) [25] and the default transformer block [30], in our experiments we utilized eight attention heads to capture the diverse features from the input time series. The transformer encoding dimension was set to $d_m = 64$, and the feed-forward network (FFN) in the transformer block expanded the input size by 4x before projecting it back to its original size.
The Soft-DTW parameter $\alpha$, which determines the level of alignment between the two time series, is set to 0.1 as per the original paper's recommendation [29].

Comparison with Baseline Approaches
In order to evaluate the effectiveness of our approach, we conducted an extensive comparison against six state-of-the-art self-supervised methods for time series: TS2Vec [4], TS-TCC [10], TNC [22], TF-C [12], MCL [11], and TST [25]. To ensure a fair comparison, we used the publicly available code for the baseline methods.
Table 1 presents the average accuracy of Series2Vec over five runs, along with other state-of-the-art self-supervised models, for the purpose of comparison. The number in bold for each dataset represents the highest accuracy achieved for that dataset. The last row in Table 1 shows the rank of each model.

Low-Label Regimes
We compared three self-supervised models (Series2Vec, TS2Vec, and TS-TCC) against a supervised model in a low-label regime.
The TNC, TF-C, MCL, and TST models were excluded from this comparison due to their significantly lower accuracy. Figure 3 demonstrates that our proposed Series2Vec model consistently outperforms both the supervised model and the other representation learning models (except on the Sleep dataset, relative to TS-TCC) when fewer than 50 labeled samples per class are available. Each subfigure shows the results for one dataset. This indicates the promise of Series2Vec in scenarios where labeled data is scarce. Notably, Series2Vec exhibits consistent performance across all datasets, which adds to the reliability of our findings. It is worth highlighting that TS-TCC, which uses augmentation techniques, performs similarly to our model on the Sleep dataset; Sleep consists of EEG signals, and enhancing the model's ability to handle noise is especially beneficial in this scenario.

Pre-Training
Our objective here is to evaluate the effectiveness of our model in the pre-training phase. Table 2 presents classification accuracy on each dataset, comparing a model with random initialization against pre-trained Series2Vec. Pre-trained Series2Vec yields an average improvement of 1% in accuracy over random initialization. Notable improvements are observed on specific datasets such as WISDM2, PAMAP2, and WISDM: for WISDM2, Series2Vec achieves an accuracy gain of 2.35% over random initialization, while for PAMAP2 and WISDM the gains are 3.03% and 1.51%, respectively, validating the effectiveness of similarity-based methods for enhanced learning and improved time series classification.

Ablation Study
Component Analysis: To assess the effectiveness of the proposed components of Series2Vec, we compared the full Series2Vec model with three variations, as presented in Table 3. The inclusion of order-invariant self-attention has a significant impact on the model's accuracy, validating our approach of using it to ensure that, for a given time series, the model attends to similar series in the batch. Furthermore, we observed that on datasets recorded with a low sampling rate, such as WISDM2, Skoda, WISDM, and UCI-HAR, adding the frequency domain improves the model's performance. A low sampling rate can make it difficult for the model to capture fine-grained temporal patterns in the data; however, frequency-based representations derived from the FFT capture information about the underlying periodicity and spectral content of the signal.
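As an illustration of why spectral features help here: a periodic component that spans only a few samples per cycle still appears as a sharp peak in the FFT magnitude spectrum. The signal parameters below are illustrative, not drawn from the datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# A series sampled at a low rate: a 2 Hz periodic component at fs = 20 Hz,
# plus noise (only 10 samples per cycle).
fs, n = 20, 200
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 2.0 * t) + 0.1 * rng.standard_normal(n)

# Magnitude spectrum via the real FFT: the kind of spectral representation
# the ablation refers to, exposing periodicity that is hard to read off
# from a coarsely sampled waveform in the time domain.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(n, d=1 / fs)

peak_hz = freqs[spectrum[1:].argmax() + 1]  # skip the DC bin
print(peak_hz)  # the 2 Hz periodicity dominates the spectrum
```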
Complementary Loss Function: We evaluate the performance of our similarity-preserving loss ($\mathcal{L}_{Sim}$) in combination with other objectives, namely the self-prediction loss ($\mathcal{L}_{SP}$) used in TST and the contrastive loss ($\mathcal{L}_{Cons}$) employed in TS-TCC. Table 4 reports the average accuracy over five runs for different combinations of the similarity, contrastive, and self-prediction losses on all nine datasets. Notably, the similarity loss alone surpasses both the self-prediction loss of TST and the contrastive loss of TS-TCC. Moreover, combining self-prediction with similarity-preserving learning yields better results than combining the contrastive and similarity losses. This suggests that self-prediction and similarity learning capture distinct implicit biases, and that their fusion leads to enhanced performance in time series analysis.
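As a hedged illustration of what a batch-level similarity-preserving loss can look like, the sketch below derives soft targets from a precomputed time series distance matrix and scores dot-product similarities of the learned representations against them with a cross-entropy. This is one plausible form of such a loss, not the paper's exact formulation; all names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_loss(reps, dists):
    """Cross-entropy between similarity distributions predicted from learned
    representations and soft targets derived from a distance matrix (e.g.
    pairwise soft-DTW over the batch). Illustrative sketch only."""
    n = len(reps)
    mask = ~np.eye(n, dtype=bool)  # ignore self-similarity
    # Targets: negated distances, softmax-normalised per row, so that
    # closer series receive higher target probability.
    targets = softmax(np.where(mask, -dists, -np.inf), axis=1)
    # Predictions: dot-product similarities between representations.
    preds = softmax(np.where(mask, reps @ reps.T, -np.inf), axis=1)
    return -(targets[mask] * np.log(preds[mask] + 1e-12)).sum() / n
```

A combined objective as in Table 4 would then simply add this term to a self-prediction or contrastive loss with a weighting coefficient.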

Conclusion
This paper proposes a novel self-supervised learning method, Series2Vec, for time series analysis. Series2Vec is inspired by contrastive learning, but instead of using synthetic transformations, it utilizes time series similarity metrics to assign the target output for the encoder loss. This offers a novel and more effective approach to encoding implicit biases, making it well suited to time series analysis. Our experiments show that Series2Vec outperforms existing methods for time series representation learning. Additionally, our results indicate that Series2Vec performs well on datasets with a limited number of labeled samples. Finally, fusing Series2Vec with other representation learning models leads to enhanced performance in time series classification.

Appendix C: Additional Experiments on UCR/UEA
To highlight the strong performance and generalisability of Series2Vec on diverse problems, we compare Series2Vec with the same self-supervised methods used in Section 4.3.1 on the UCR univariate and UEA multivariate time series classification benchmarking archives [3,15]. Figures C1a and C1b show that Series2Vec outperforms all the other methods on these archives. It is significantly more accurate than all methods except TS2Vec, while winning on more datasets. However, Series2Vec is still outperformed by the state-of-the-art time series classification methods on these archives. This is because the archives mainly contain relatively small training sets of fewer than 10,000 examples, significantly smaller than the datasets used elsewhere in this work (see Table B1). Self-supervised techniques usually require large training datasets to generalise and perform well. This highlights both a limitation of current time series classification research, namely the need for larger benchmark datasets, and the room for improving self-supervised techniques.

Fig. 3: Comparison of linear probing with Series2Vec, TS2Vec, TS-TCC, and supervised training on all nine datasets. The x-axis represents the number of labeled samples per class, and the y-axis the corresponding accuracy achieved by each approach.

Fig. C1: Pairwise comparison of Series2Vec with state-of-the-art self-supervised methods. Each cell presents the average difference in accuracy across all datasets, the win/draw/loss counts over the datasets on which Series2Vec obtains higher or lower accuracy, and the p-value from a Wilcoxon signed-rank test. The methods are ranked by their average accuracy across the default folds of (a) the 128 UCR datasets and (b) the 30 UEA datasets, indicated by the values below each method. Bold values indicate that the two methods are significantly different at significance level α = 0.05. The color represents the scale of the average difference in accuracy.

Table 2: Comparison of classification accuracy between random initialization and pre-trained Series2Vec.

Table 3: Component analysis. The variations are as follows: (1) w/o Attention, where the transformer block is removed; (2) w/o Spectral, where only the temporal domain is used as the input feature; and (3) w/o Temporal, where only the frequency representation of the input series is used to generate the representation.

Table 4: $\mathcal{L}_{Sim}$ as a complementary loss function.

Table B1: Description of the datasets used in our experiments.