1 Introduction

Learning from large time series datasets is important in various fields such as human activity recognition (Foumani et al. 2024a), diagnosis based on electronic health records (Rajkomar et al. 2018), and systems monitoring (Bagnall et al. 2018). These applications can generate hundreds to thousands of time series every day, producing large quantities of data that are critical for the performance of various time series tasks. However, obtaining labels for large time series datasets can be costly and challenging. Machine learning models trained on large labeled time series datasets tend to outperform models trained on small or sparsely labeled datasets, or trained without supervision, which typically yield subpar performance on various time series machine learning tasks (Yue et al. 2022; Yang and Hong 2022). Therefore, instead of relying on high-quality annotations for large datasets, researchers and practitioners are turning their attention towards self-supervised representation learning for time series.

Self-supervised representation learning is a subfield of machine learning that aims to learn representations from data without requiring explicit supervision (Goyal et al. 2021). Unlike supervised learning, where models are trained on labeled data, self-supervised learning methods train a model through a pretext task, leveraging the inherent structure of the data to learn useful representations in an unsupervised manner. The learned representations can then be used for a variety of downstream tasks such as classification, anomaly detection, and forecasting (Foumani et al. 2024a).

Contrastive learning is an effective and popular self-supervised learning method, originally developed for image analysis. Contrastive methods are trained by minimizing the distance between the representation of a reference sample (anchor sample) and its positive pairs, while simultaneously increasing the distance between representations of negative pairs. These negative and positive pairs are created through hand-crafted data augmentation (van den Oord et al. 2018). These methods have been successfully used to improve performance in a variety of learning tasks such as image classification (Chen et al. 2020), object detection (He et al. 2020; Grill et al. 2020), and natural language processing (van den Oord et al. 2018).

A common yet powerful approach to contrastive learning with images is to first create synthetic transformations (augmentations) of an image and then train the model to contrast the image and its transforms against other images in the training data. We believe this approach works well for images because many image-related learning tasks involve interpreting the objects captured in the image. Transformations such as scaling, blurring, and rotation assume that the resulting images resemble those that would have been captured in the original scene under changes in camera zoom, stability, focus, or angle.

However, there do not appear to be equivalent transformations for time series data. Transformations that have been used in contrastive learning for time series, including those in TS-TCC (Eldele et al. 2021), MCL (Wickstrøm et al. 2022), TS2Vec (Yue et al. 2022), BTSF (Yang and Hong 2022), and TF-C (Zhang et al. 2022), all carry the risk that the variants of the positive sample may be less similar to the anchor sample than the series in the negative set. In addition, their performance critically depends on the choice of augmentations (Yang and Hong 2022; Zhang et al. 2022). For instance, T-Loss (Franceschi et al. 2019) uses a subseries as a positive sample for a given anchor sample. When there is a level shift in the anchor sample, the defined positive sample may be less similar to the anchor than series in the negative set where no level shift exists. TS-TCC (Eldele et al. 2021) uses augmentation techniques such as permutation, which carries the same risk: the permutation of the anchor sample may be very similar to a series in the negative set. Figure 1a shows an example where the augmentation proposed for TS-TCC, jittering and permutation with the hyper-parameters from the original paper, produces augmented series that are dissimilar, under Dynamic Time Warping (DTW) distance (Sakoe and Chiba 1971), from the original series. The original series of class 0 is more similar to the augmented series of class 2 than to its own augmentation, and the augmented series of classes 0 and 1 are quite dissimilar to their original series. To further illustrate this, we ran the one-nearest-neighbor algorithm with DTW distance (1NN-DTW) on the original training set, the original training set combined with augmented data, and the augmented data alone. As shown in Fig. 1b, the classification accuracy of 1NN-DTW decreases significantly from 0.89 to 0.77 when only the augmented series are used as the training set.

Fig. 1 (a) A dendrogram comparing the similarity of three time series of different classes and their augmented variants, taken from the BME dataset (Dau et al. 2019). The three original raw series are augmented using the strong augmentation (jittering and permutation) proposed in TS-TCC (Eldele et al. 2021). Under DTW distance, the original series of class 0 is most similar to the augmented series of class 2. Additionally, the augmented series of classes 0 and 1 are quite dissimilar from their original series. (b) 1NN-DTW classification accuracy on the BME dataset for different training sets. A significant decrease in accuracy can be observed when only the augmented series are used as the training set

For this reason, we propose Series2Vec, a novel self-supervised method inspired by contrastive learning that uses similarity learning as its self-supervised task. Our model uses time series similarity measures to assign a target output from which the encoder loss is computed. This time series-specific loss function provides a different type of implicit bias from the image-inspired augmentations, such as jittering and permutation, that have previously been used in time series contrastive learning, and offers a new and more effective way of encoding implicit bias when creating time series representations.

This method simply aims to produce similar representations for time series that are close to each other in the original feature space and dissimilar representations for time series that are far from each other:

$$\begin{aligned} Sim_t(\mathbf {x^i},\mathbf {x^j})<Sim_t(\mathbf {x^i},\mathbf {x^k})\implies R_t(\mathbf {E_t(x^i)},\mathbf {E_t(x^j)})<R_t(\mathbf {E_t(x^i)},\mathbf {E_t(x^k)}) \end{aligned}$$
(1)

where \(Sim_t\) is a relevant similarity measure in the time domain, \(R_t\) is a relevant similarity measure in the representation domain, \(\mathbf {E_t}\) is the function mapping time series to their representations, and \(\mathbf {x^i}\), \(\mathbf {x^j}\), and \(\mathbf {x^k}\) are time series. Since frequency information in time series can be of great importance and constitutes an additional source of information, we further extend our model to also learn representations in the frequency domain.

To do so, we propose a novel approach that applies self-attention to each representation within the batch during training. The self-attention mechanism encourages the network to learn similar representations for all similar time series within each batch. Our approach draws inspiration from contrastive learning for self-supervised representation learning; however, Series2Vec benefits from a similarity-prediction loss over time series to represent their structure. Notably, it achieves this without the need for hand-crafted data augmentation. One crucial insight motivating this work is that the unsupervised similarity step is relevant to a wide range of time series analysis tasks, which enables the model to focus on modeling the sequential structure of time series.

Additionally, we demonstrate that similarity-based representation learning can be used as a complementary technique with other self-supervised methods such as self-prediction and contrastive learning to enhance the performance of time series analysis.

In summary, the main contributions of this work are as follows:

  • A novel self-supervised learning framework (Series2Vec) is proposed for time series representation learning, inspired by contrastive learning.

  • A time series similarity measure-based pretext task is proposed to assign the target output for the encoder loss, providing a more suitable implicit bias for time series analysis.

  • A novel approach is introduced that applies order-invariant self-attention to each representation during training, effectively enhancing the preservation of similarity in the representation domain.

The Series2Vec framework was evaluated extensively on nine real-world time series datasets, along with the UCR/UEA archive, and achieved improved results compared to existing state-of-the-art (SOTA) self-supervised methods. It was also evaluated when fused with other representation learning models.

2 Related work

Self-supervised learning for time series classification can mainly be divided into two groups: contrastive learning and self-prediction. This section delves into these approaches, and for a more comprehensive understanding, we recommend that interested readers refer to the recent survey (Foumani et al. 2024a). Additionally, a literature review on time series similarity measures has been conducted and is available in Appendix A for those interested.

2.1 Contrastive learning

Contrastive learning involves model learning to differentiate between positive and negative time series examples. Scalable Representation Learning (SRL) (Franceschi et al. 2019) and Temporal Neighborhood Coding (TNC) (Tonekaboni et al. 2021) apply a subsequence-based sampling and assume that distant segments are negative pairs and neighbor segments are positive pairs. TNC takes advantage of the local smoothness of a signal’s generative process to define neighborhoods in time with stationary properties to further improve the sampling quality for the contrastive loss function. TS2Vec (Yue et al. 2022) uses contrastive learning to obtain robust contextual representations for each timestamp in a hierarchical manner. It involves randomly sampling two overlapping subseries from input and encouraging consistency of contextual representations on the common segment. The encoder is optimized using both temporal contrastive loss and instance-wise contrastive loss.

In addition to the subsequence-based methods, there are also models such as Time-series Temporal and Contextual Contrasting (TS-TCC) (Eldele et al. 2021), Mixing up Contrastive Learning (MCL) (Wickstrøm et al. 2022), and Bilinear Temporal-Spectral Fusion (BTSF) (Yang and Hong 2022) that employ instance-based sampling. TS-TCC uses weak and strong augmentations to transform the input series into two views and then uses a temporal contrasting module to learn robust temporal representations. The contextual contrasting module is then built upon the contexts from the temporal contrasting module and aims to maximize similarity among contexts of the same sample while minimizing similarity among contexts of different samples (Eldele et al. 2021). BTSF uses simple dropout as the augmentation method and aims to incorporate spectral information into the feature representation (Yang and Hong 2022). Similarly, Time-Frequency Consistency (TF-C) (Zhang et al. 2022) is a self-supervised learning method that leverages the frequency domain to achieve better representations. It proposes that the time-based and frequency-based representations learned from the same time series sample should be closer to each other in the time-frequency space than representations of different time series samples. These self-supervised methods have demonstrated the ability to generate high-level semantic representations (Foumani et al. 2024a) by capturing essential features that remain consistent across different views of the data. However, they can also introduce notable biases that may impede performance on specific downstream tasks, or on pretraining tasks with dissimilar data distributions. Additionally, their efficacy is significantly influenced by the selection of augmentation techniques (Yang and Hong 2022; Zhang et al. 2022). To address these drawbacks, we propose a model that utilizes time series similarity measures to assign a target output for learning high-level representations without the need for data augmentation.

2.2 Self-prediction

The main idea behind self-prediction methods is to remove or corrupt parts of the input and train the model to predict or reconstruct the altered content (Foumani et al. 2024a). Studies have explored using transformer-based self-supervised learning methods for time series classification, following the success of models like BERT (Devlin et al. 2019). BErt-inspired Neural Data Representations (BENDER) (Kostas et al. 2021) uses the transformer structure to model EEG sequences and shows that it can effectively handle massive amounts of biosignals data recorded with differing hardware. Similarly, EEG2Rep (Foumani et al. 2024b) introduces a self-prediction approach for self-supervised representation learning from EEG. Two core novel components of this model are outlined: (1) Instead of learning to predict the masked input directly from raw data, EEG2Rep trains to predict the masked input within the latent representation space, and (2) Instead of conventional masking methods, EEG2Rep uses a new semantic subsequence preserving (SSP) method which provides informative masked inputs to guide EEG2Rep to generate rich semantic representations.

Transformer-based Framework (TST) (Zerveas et al. 2021) adapts vanilla transformers to the multivariate time series domain and uses a self-prediction-based self-supervised pre-training approach with masked data. The pre-trained models are then fine-tuned for downstream tasks such as classification and regression. These studies demonstrate the potential of using transformer-based self-supervised learning methods for time series classification. Compared to contrastive methods, self-prediction pretraining tasks require less prior knowledge and exhibit better generalization across various downstream tasks (Foumani et al. 2024b, a). While many of these approaches leverage auto-encoding techniques (Kostas et al. 2021; Zerveas et al. 2021), it is worth noting that auto-encoding can be computationally intensive, and the level of detail needed for series reconstruction and prediction may exceed what is necessary for effective representation learning (Grill et al. 2020). In this paper, we propose a model inspired by contrastive learning to avoid the costly reconstruction step in raw time series space.

3 Method

3.1 Problem definition

In this study, we aim to tackle the problem of learning a nonlinear embedding function that can effectively map each time series \(\mathbf {x^i}\) from a given dataset X into a condensed and meaningful representation \(r^i \in \mathbb {R}^K\), where K denotes the desired representation dimension. The dataset X comprises n samples, \(X=\left\{ \mathbf {x^1},\mathbf {x^2},...,\mathbf {x^n}\right\}\), where each \(\mathbf {x^i}\) is a \(d_x\)-dimensional time series of length L. We use \(\mathbf {x^i} \equiv \mathbf {x_t^i}\) to denote an input time series sample in the time domain and \(\mathbf {x^i_f}\) to denote the discrete frequency spectrum of \(\mathbf {x^i}\). We define \(r^i_t\) as the representation of sample \(\mathbf {x^i}\) in the time domain, \(r_f^i\) as its representation in the frequency domain, and \(r^i\) as the concatenation \([r^i_t,r^i_f]\). These representations can be used in various downstream tasks, such as classification. To evaluate the quality of our learned representations \(\textbf{r}=\{r^1,r^2,...,r^n\}\), we consider two scenarios based on the availability of labeled data: Linear Probing and Fine-Tuning (see Sect. 4).

3.2 Model architecture

The overall architecture of Series2Vec is shown in Fig. 2. The Series2Vec architecture is designed to handle both univariate and multivariate time series inputs; however, for simplicity, we illustrate the model using univariate time series in the following descriptions. As shown in Fig. 2, the model comprises four main components: a time encoder (\(\mathbf {E_t}\)), a frequency encoder (\(\mathbf {E_f}\)), similarity measuring functions for the time and frequency domains (Sect. 3.3), and a similarity-preserving loss function (Sect. 3.4). The encoder blocks map the input time series into condensed and meaningful representations in both the time and frequency domains. The similarity measuring functions calculate the similarity between pairs of input series, providing a quantitative measure of their resemblance. To optimize the encoder blocks, a similarity-preserving loss function is employed. This loss function guides the learning process, encouraging the encoders to learn representations that preserve the similarity relationships between samples in both the time and frequency domains.

For a given input time series sample, denoted as \(\mathbf {x^i}\), we obtain its corresponding frequency spectrum, \(\mathbf {x^i_f}\), through a transform operator such as the Fourier Transformation (Cooley et al. 1969). The frequency spectrum captures universal frequency information within the time series data, which has been widely acknowledged as a key component in classical signal processing (Cooley et al. 1969). Furthermore, recent studies have demonstrated the potential of utilizing frequency information to enhance self-supervised representation learning for time series data (Zhang et al. 2022; Yang and Hong 2022).

The time-domain input \(\mathbf {x^i_t}\) and the frequency-domain input \(\mathbf {x^i_f}\) are separately passed into the time and frequency encoders to extract features. The feature extraction process is as follows:

$$\begin{aligned} r^i_t=\mathbf {E_t}(\mathbf {x^i_t},\theta _t), \quad r^i_f=\mathbf {E_f}(\mathbf {x^i_f},\theta _f) \end{aligned}$$
(2)

where \(\theta _t\) and \(\theta _f\) represent the parameters of the time and frequency encoders, respectively. The encoded representations of \(\mathbf {x^i}\) are denoted as \(r^i_t\in \mathbb {R}^K\) and \(r^i_f\in \mathbb {R}^K\). Following the setup outlined in previous works (e.g., Foumani et al. (2021, 2023)), we adopt disjoint convolutions for encoding both temporal and spectral features; these convolutions efficiently capture temporal and spatial features (Foumani et al. 2021). To ensure consistent representation sizes, we employ max pooling at the end of the encoding network, which guarantees the scalability of our model to different input lengths.
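For illustration, the following is a minimal PyTorch sketch of how the two encoders might be composed. The class name, layer widths, kernel size, and activation are assumptions made for this sketch only; the actual encoders follow the disjoint-convolution design of Foumani et al. (2021), which is not reproduced in detail here.

```python
import torch
import torch.nn as nn

class DisjointConvEncoder(nn.Module):
    """Sketch of E_t / E_f: disjoint (temporal-then-spatial) convolutions followed
    by global max pooling, so any input length maps to a K-dimensional vector."""

    def __init__(self, d_x: int, K: int = 320, channels: int = 64, kernel: int = 8):
        super().__init__()
        # Temporal convolution: slides along time, independently for each variate.
        self.temporal = nn.Conv2d(1, channels, kernel_size=(1, kernel), padding=(0, kernel // 2))
        # Spatial (cross-variate) convolution: mixes the d_x variates at each time step.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=(d_x, 1))
        self.act = nn.GELU()
        self.pool = nn.AdaptiveMaxPool1d(1)        # global max pooling over time
        self.proj = nn.Linear(channels, K)

    def forward(self, x):                          # x: (B, d_x, L)
        h = x.unsqueeze(1)                         # (B, 1, d_x, L)
        h = self.act(self.temporal(h))             # (B, C, d_x, L')
        h = self.act(self.spatial(h))              # (B, C, 1, L')
        h = self.pool(h.squeeze(2)).squeeze(-1)    # (B, C)
        return self.proj(h)                        # (B, K)

# Time- and frequency-domain encoders share the architecture but not the weights.
E_t = DisjointConvEncoder(d_x=3, K=320)
E_f = DisjointConvEncoder(d_x=3, K=320)

x_t = torch.randn(16, 3, 128)                      # 16 series, 3 variates, length 128
x_f = torch.fft.rfft(x_t, dim=-1).abs()            # magnitude spectrum as frequency input
r_t, r_f = E_t(x_t), E_f(x_f)                      # each (16, 320)
```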

Fig. 2 Architecture of Series2Vec. The top module learns the representations in the temporal domain and the bottom module learns the representations in the frequency domain

3.3 Similarity measuring function

Soft-DTW (Cuturi and Blondel 2017) is employed as the similarity function in the time domain. It was proposed as a differentiable alternative to DTW, and we use it because an efficient GPU implementation of Soft-DTW is available (Footnote 1), which allows our proposed method to scale and run faster on large time series datasets. The distance calculated by Soft-DTW is a continuous and differentiable function. The Soft-DTW distance is given by

$$\begin{aligned} \mathcal {S}_T(\mathbf {x^i_t},\mathbf {x^j_t}) = \min _{\pi } \sum _{l=1}^{L} \Vert \mathbf {x^{i,l}_t} - \mathbf {x^{j,\pi (l)}_t}\Vert ^2 e^{-\frac{\alpha }{2}\Vert l-\pi (l)\Vert ^2} \end{aligned}$$
(3)

Where \(\mathbf {x^i_t}\) and \(\mathbf {x^j_t}\) are the two time series being compared, L is the length of the time series, and \(\pi\) is a warping path. The warping path is defined as a function that maps each index of one time series to a corresponding index in the other time series. The goal is to find the warping path that minimizes the sum of the squared distances between the corresponding elements of the two time series. The parameter \(\alpha \in [0,1]\) controls the degree of alignment between the two time series. Smaller values of \(\alpha\) result in a more accurate alignment, while larger values lead to a more robust alignment. It is worth noting that setting \(\alpha =0\) makes Soft-DTW and DTW equivalent.
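For concreteness, the following is a minimal, non-vectorized NumPy sketch of the standard Soft-DTW dynamic program of Cuturi and Blondel (2017), in which the hard minimum of DTW is replaced by a differentiable soft minimum. It is intended only to illustrate the computation: it is not the batched GPU implementation used in our experiments, and the smoothing parameter is named gamma rather than \(\alpha\) above.

```python
import numpy as np

def _soft_min(a, b, c, gamma):
    """Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma)))."""
    vals = -np.array([a, b, c]) / gamma
    m = vals.max()                                   # log-sum-exp stability shift
    return -gamma * (m + np.log(np.exp(vals - m).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW distance between two (possibly multivariate) series.

    x: (L1, d) array, y: (L2, d) array, gamma > 0 (gamma -> 0 recovers DTW).
    """
    L1, L2 = len(x), len(y)
    D = np.full((L1 + 1, L2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, L1 + 1):
        for j in range(1, L2 + 1):
            cost = np.sum((x[i - 1] - y[j - 1]) ** 2)   # squared Euclidean local cost
            D[i, j] = cost + _soft_min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1], gamma)
    return D[L1, L2]

# Toy usage: compare a sine wave to a phase-shifted copy and to a random walk.
t = np.linspace(0, 4 * np.pi, 100)
a = np.sin(t)[:, None]
b = np.sin(t + 0.5)[:, None]
c = np.cumsum(np.random.randn(100))[:, None]
print(soft_dtw(a, b), soft_dtw(a, c))
```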

For the similarity function in the frequency domain, we use the Euclidean distance because, unlike the temporal domain where Soft-DTW is employed, the concept of time warping does not apply directly to the frequency domain. The Euclidean distance between two frequency-domain inputs \(\mathbf {x^i_f}\) and \(\mathbf {x_f^j}\) is calculated as follows:

$$\begin{aligned} \mathcal {S}_F(\mathbf {x^i_f},\mathbf {x_f^j}) = \sqrt{\sum _{k=1}^{M}\Vert \mathbf {x^{i,k}_f}-\mathbf {x^{j,k}_f}\Vert ^2} \end{aligned}$$
(4)

Here, \(\mathbf {x^i_f}\) and \(\mathbf {x_f^j}\) represent the frequency domain representations of two time series being compared, and M is the number of frequency bins. The Euclidean distance is computed by taking the square root of the sum of squared differences between corresponding frequency components of the two representations.
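A short NumPy sketch of Eq. 4 follows. It assumes the magnitude of the real FFT is used as the frequency-domain input; this is one natural choice and is stated here only as an assumption, not as the exact preprocessing used in the paper.

```python
import numpy as np

def frequency_distance(x_i, x_j):
    """Euclidean distance between the magnitude spectra of two series (Eq. 4).

    x_i, x_j: (L, d) arrays of equal length; the real FFT gives M = L // 2 + 1 bins.
    """
    xf_i = np.abs(np.fft.rfft(x_i, axis=0))   # (M, d) magnitude spectrum
    xf_j = np.abs(np.fft.rfft(x_j, axis=0))
    return np.sqrt(np.sum((xf_i - xf_j) ** 2))

# Usage: two noisy sine waves at different frequencies.
t = np.linspace(0, 1, 256, endpoint=False)
x1 = np.sin(2 * np.pi * 5 * t)[:, None] + 0.1 * np.random.randn(256, 1)
x2 = np.sin(2 * np.pi * 20 * t)[:, None] + 0.1 * np.random.randn(256, 1)
print(frequency_distance(x1, x2))
```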

3.4 Self-supervised similarity-preserving

To simplify the explanation, we focus on the time domain and omit the frequency domain. Let \(r^i\) and \(r^j\) be the representation vectors for input time series \(\mathbf {x^i}\) and \(\mathbf {x^j}\), respectively. Our main objective is to learn similar representations for all similar time series within each batch. To accomplish this, we leverage transformers and make use of the order-invariant property of self-attention. In our approach, each time series within a batch acts as a query and attends to the keys of the other samples in the batch in order to construct its representation. This allows each representation to capture and aggregate relevant information from the representations of the entire batch. By employing the transformer architecture and self-attention, we aim to generate richer representations that encapsulate the relative characteristics and similarities among the input time series samples.

To the best of our knowledge, our work is the first to introduce the concept of feeding each time series as an input token to transformers to learn similarity-based representations. In our approach, we utilize transformers to model the relationships and interactions between the time series within the batch. By treating each time series as a separate input token, we enable the model to capture the fine-grained similarities between different series. Specifically, the attention operation in transformers starts with building three different linearly-weighted vectors from the input, known as query, key, and value. Transformers then map a query and a set of key-value pairs to generate an output. For an input batch representation, \(\textbf{R} = \left\{ r^1,r^2,...,r^B\right\}\) where B is the batch size, self-attention computes an output series \(\textbf{Z} =\left\{ z^1,z^2,...,z^B\right\}\) where \(z^i\in \mathbb {R}^{d_z}\) and is computed as a weighted sum of input elements:

$$\begin{aligned} z^i=\sum _{j=1}^B \alpha _{i,j}(r^j W^V) \end{aligned}$$
(5)

Each coefficient weight \(\alpha _{i,j}\) is calculated using a softmax function:

$$\begin{aligned} \alpha _{i,j}=\frac{exp(e_{ij})}{\sum _{k=1}^B exp(e_{ik})} \end{aligned}$$
(6)

where \(e_{ij}\) is an attention weight from representations j to i and is computed using a scaled dot-product:

$$\begin{aligned} e_{ij}=\frac{(r^i W^Q)(r^j W^K)^T}{\sqrt{d_z}} \end{aligned}$$
(7)

The projections \(W^Q, W^K, W^V \in \mathbb {R}^{K \times d_z}\) are parameter matrices and are unique per layer. Instead of computing self-attention once, Multi-Head Attention (MHA) (Vaswani et al. 2017) does so multiple times in parallel, i.e., employing h attention heads.
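The following PyTorch sketch illustrates the idea of treating each representation in the batch as one input token. Since nn.MultiheadAttention keeps the embedding size, a final linear layer stands in for the \(K \times d_z\) projections of Eqs. 5–7; the dimensions and number of heads are assumptions for illustration.

```python
import torch
import torch.nn as nn

K, d_z, B = 320, 128, 16

# Each of the B representations in the batch is one token; there is no positional
# encoding, so the operation is invariant to the order of samples in the batch.
attn = nn.MultiheadAttention(embed_dim=K, num_heads=4, batch_first=True)
proj = nn.Linear(K, d_z)

R = torch.randn(B, K)                        # batch of encoder outputs r^1 ... r^B
tokens = R.unsqueeze(0)                      # (1, B, K): the batch itself is the "sequence"
Z, weights = attn(tokens, tokens, tokens)    # each z^i aggregates over the whole batch
Z = proj(Z.squeeze(0))                       # (B, d_z) output vectors z^1 ... z^B
print(Z.shape, weights.shape)                # torch.Size([16, 128]) torch.Size([1, 16, 16])
```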

Assume \(z^i, z^j \in \mathbb {R}^{d_z}\) are the transformer output vectors for input representations \(r^i, r^j \in \mathbb {R}^{K}\), respectively. The pretext objective we define aims to minimize the following loss function:

$$\begin{aligned} \mathcal {L}_t = \text {smooth}_{L_1} \left( R_t(z^i, z^j), Sim_{t}(\mathbf {x^i}, \mathbf {x^j})\right) \end{aligned}$$
(8)

Equation 8 computes the smooth \(L_1\) loss (Girshick 2015) between the representation-domain similarity \(R_t(z^i, z^j)\) and the time-domain similarity \(Sim_t(\mathbf {x^i}, \mathbf {x^j})\). The smooth \(L_1\) loss is defined as:

$$\begin{aligned} \text {smooth}_{L_1}(x) = {\left\{ \begin{array}{ll} 0.5x^2 &{} \text {if } |x| < 1 \\ |x| - 0.5 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(9)

We chose the smooth \(L_1\) loss because the literature shows it is less sensitive to outliers than the MSE loss and, in certain scenarios, it prevents exploding gradients (Girshick 2015). We also found experimentally that it performs better than the MSE loss. The similarity \(R_t(z^i, z^j)\) is computed by taking the dot product of the encoded vectors \(z^i\) and \(z^j\). The similarity \(Sim_t(\mathbf {x^i}, \mathbf {x^j})\) between the time series \(\mathbf {x^i}\) and \(\mathbf {x^j}\) is calculated using Eq. 3.

In our model, we follow the same process for the frequency domain. The loss function is defined as follows:

$$\begin{aligned} \mathcal {L}_f = \text {smooth}_{L_1} \left( R_f(z_f^i, z_f^j), Sim_{f}(\mathbf {x^i_f}, \mathbf {x_f^j})\right) \end{aligned}$$
(10)

Here, the similarity \(Sim_f(\mathbf {x^i_f}, \mathbf {x_f^j})\) between \(\mathbf {x^i_f}\) and \(\mathbf {x_f^j}\) is calculated using Eq. 4. The total loss is then calculated as:

$$\begin{aligned} \mathcal {L}_{\text {Total}} = \mathcal {L}_t + \mathcal {L}_f \end{aligned}$$
(11)

Training the encoder using the \(\mathcal {L}_{\text {Total}}\) loss function, which is based on a time series-specific similarity measure, enables the model to learn a representation of the input data that effectively captures the similarities between the series in each batch. Additionally, time series-specific similarity measures can align and compare time series with different time steps and lengths by warping the time axis, making the loss function robust to non-linear variations in the data. This makes the model more robust and less sensitive to small variations in the data, which in turn improves its ability to generalize to unseen time series. Furthermore, by training with a loss function based on time series-specific similarity measures, the model is exposed to a wide range of time series variations, such as different time steps, lengths, and irregular intervals, allowing it to learn the underlying patterns that are specific to time series. Similarity measures like DTW can handle irregular time intervals, non-stationary time series, and variable-length time series, which is beneficial when the training data exhibit these characteristics. Refer to Algorithm 1 for a detailed, step-by-step walkthrough of our method.

Algorithm 1 Similarity-based representation learning pseudocode
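Since the algorithm figure is not reproduced here, the following is a hedged PyTorch-style sketch of one training step. The function soft_dtw_batch, the encoders E_t and E_f, and the attention modules attn_t and attn_f are placeholders for the components described above, and details such as pair sampling and similarity normalization are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def train_step(x_t, E_t, E_f, attn_t, attn_f, soft_dtw_batch, optimizer):
    """One Series2Vec-style step: match pairwise dot-product similarities of the
    attended representations to pairwise similarities of the raw series."""
    x_f = torch.fft.rfft(x_t, dim=-1).abs()                  # frequency-domain view

    z_t = attn_t(E_t(x_t))                                   # (B, d_z) order-invariant outputs
    z_f = attn_f(E_f(x_f))

    # Target similarity matrices in the input spaces (no gradients needed).
    with torch.no_grad():
        sim_t = soft_dtw_batch(x_t)                          # (B, B) Soft-DTW, Eq. 3
        sim_f = torch.cdist(x_f.flatten(1), x_f.flatten(1))  # (B, B) Euclidean, Eq. 4

    # Predicted similarities in the representation spaces (dot products).
    r_t = z_t @ z_t.T
    r_f = z_f @ z_f.T

    loss = F.smooth_l1_loss(r_t, sim_t) + F.smooth_l1_loss(r_f, sim_f)   # Eq. 11
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```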

The primary focus of our proposed pretext model is to leverage the similarity information between time series, without being limited by the quality of a specific similarity measure. This allows for flexibility in the choice of similarity measure, as any time series similarity measure can be plugged into the model and used to learn representations. In this paper, we chose a time series-specific similarity measure, Soft-DTW (Cuturi and Blondel 2017) (please refer to Sect. 3.3 for the reason why we used this similarity measure). Our proposed model is not limited to specific similarity measures and has the potential to incorporate other similarity measures as well.

4 Experimental results

This section presents the experimental results of our study, focusing on the performance evaluation of the Series2Vec model on the downstream task of time series classification. The experiments are divided into three main parts: (1) linear probing, (2) fine-tuning, and (3) an ablation study. Our primary objective is to assess the effectiveness of the learned representation in accurately classifying time series data and to compare Series2Vec's performance against other state-of-the-art models. For implementation details and hyperparameters of Series2Vec and the baseline models, please refer to Appendix B. Additional experiments on the UCR/UEA archive are provided in Appendix D due to space constraints. Here we evaluate models on datasets commonly used in the representation learning literature.

4.1 Datasets

To evaluate the performance of our model, we utilize nine publicly available datasets that have previously been used in the literature on time series representation learning (Foumani et al. 2024a). These datasets cover various domains, such as epileptic seizure prediction (Andrzejak et al. 2001), sleep stage classification (Goldberger et al. 2000), and human activity recognition, including UCI-HAR (Anguita et al. 2013), PAMAP2 (Reiss and Stricker 2012), Skoda (Zappi et al. 2012), USC-HAD (Zhang and Sawchuk 2012), Opportunity (Chavarriaga et al. 2013), WISDM (Lockhart et al. 2012), and WISDM2 (Weiss and Lockhart 2012). The details of each dataset are presented in Appendix C.

4.2 Evaluation procedure

Following the literature on time series classification (Fawaz et al. 2019; Yue et al. 2022; Foumani et al. 2023), we evaluate model performance using classification accuracy as the main metric. Models are ranked by accuracy on each dataset, with the most accurate model receiving rank 1 and the least accurate receiving the largest rank; ties receive the average of the tied ranks. Finally, we compute the average rank across all datasets for each model, with the lowest average rank indicating the method with the highest average accuracy across datasets.
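As a small illustration of this ranking scheme with hypothetical accuracy values, ranks can be computed per dataset with ties averaged and then averaged across datasets:

```python
import pandas as pd

# Hypothetical accuracies: rows are datasets, columns are models.
acc = pd.DataFrame(
    {"ModelA": [0.90, 0.75, 0.80], "ModelB": [0.88, 0.75, 0.85], "ModelC": [0.70, 0.60, 0.85]},
    index=["DS1", "DS2", "DS3"],
)
ranks = acc.rank(axis=1, ascending=False, method="average")  # rank 1 = best per dataset
print(ranks.mean())                                          # average rank per model; lower is better
```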

4.3 Linear probing

We assume access to a large volume of unlabeled data \(X^u = \left\{ \mathbf {x^i} \mid i=1,...,n\right\}\), along with a smaller set of labeled samples \(X^l = \left\{ (\mathbf {x^i}, y^i) \mid i=1,...,m\right\}\) (\(m \ll n\)). Each sample in \(X^l\) is associated with a label \(y^i \in \left\{ 1,..., C\right\}\), where C represents the number of classes. First, we pre-train a model without using labels through a self-supervised pretext task. Once the pre-training is complete, we freeze the encoder and add a linear classifier on top of the pre-trained model's output or intermediate representations. This linear classifier can be implemented as a linear layer or logistic regression. The linear classifier is subsequently trained on a downstream task, typically a classification task, utilizing the pre-trained representations as inputs. Linear probing serves as an evaluation method to assess the quality of the learned representations.
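A minimal sketch of this protocol, assuming a frozen PyTorch encoder that maps a batch of series to K-dimensional representations and a scikit-learn logistic regression as the linear classifier:

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(encoder, X_train, y_train, X_test, y_test):
    """Freeze the pre-trained encoder and train only a linear classifier on top."""
    encoder.eval()
    with torch.no_grad():                       # no encoder updates during probing
        r_train = encoder(torch.as_tensor(X_train, dtype=torch.float32)).numpy()
        r_test = encoder(torch.as_tensor(X_test, dtype=torch.float32)).numpy()
    clf = LogisticRegression(max_iter=1000).fit(r_train, y_train)
    return accuracy_score(y_test, clf.predict(r_test))
```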

4.3.1 Comparison with baseline approaches

In order to evaluate the effectiveness of our approach, we conducted an extensive comparison against six state-of-the-art self-supervised methods for time series: TS2Vec (Yue et al. 2022), TS-TCC (Eldele et al. 2021), TNC (Tonekaboni et al. 2021), TF-C (Zhang et al. 2022), MCL (Wickstrøm et al. 2022), and TST (Zerveas et al. 2021). To ensure a fair comparison, we utilized the publicly available code for the baseline methods and employed the same encoder architecture, with identical computational complexity and parameters as previously outlined. Additionally, following the literature (Yue et al. 2022; Eldele et al. 2021), we set the representation dimension to \(K=320\).

Table 1 Comparing self-supervised models: An analysis of average accuracy scores for Series2Vec, TS2Vec, TS-TCC, TNC, MCL, TF-C and TST

Table 1 presents the average accuracy of Series2Vec over five runs, along with other state-of-the-art self-supervised models, for comparison. The bold number for each dataset represents the highest accuracy achieved on that dataset. The last row of Table 1 shows the average rank of each model across all nine datasets. The results indicate that our model, Series2Vec, achieves the best average rank of 1 (significantly more accurate than the other models) and the highest average accuracy of 82.47 among all self-supervised models. The second most accurate model, TS2Vec, obtains an average rank of 3 and an average accuracy of 79.90. TS-TCC follows closely with an average accuracy of 78.07. TST is the worst-performing model, with an average accuracy of 70.83.

Fig. 3 Comparison of 2D t-SNE plots for representations learned by (a) TS-TCC, (b) TS2Vec, and (c) Series2Vec on the Epilepsy dataset

Figure 3 illustrates t-SNE plots of the representations learned by TS-TCC, TS2Vec, and our method on the Epilepsy dataset (TNC, TF-C, MCL, and TST are excluded due to inferior performance). In Fig. 3a and b, the two classes are not easily separable in the learned representation space, leading to low classification accuracy. In contrast, Fig. 3c shows clear separability between the two classes, underscoring the efficacy of Series2Vec in enhancing representations and, consequently, improving classification accuracy.

4.3.2 Low-label regimes

We conducted a comparison between three self-supervised models (Series2Vec, TS2Vec, and TS-TCC) and a supervised model in a low-labeled data regime. The TNC, TF-C, MCL, and TST models were excluded from the comparison due to their significantly lower accuracy compared to the other models. Figure 4 demonstrates that our proposed Series2Vec model consistently outperforms both the supervised model and the other representation learning models (except on the Sleep dataset in comparison to TS-TCC) when the number of labeled data points is limited to fewer than 50. Note that each subfigure shows the results for one dataset. This indicates the promising performance of Series2Vec in scenarios where labeled data are scarce. It is important to highlight that Series2Vec does not exhibit a significant performance advantage over TS-TCC on the Epilepsy dataset. This can be attributed to the presence of low-level noise in EEG data: jittering replicates this effect, allowing classifiers to avoid overfitting to high-frequency effects. Furthermore, Series2Vec demonstrates slightly lower performance than TS-TCC on the Sleep dataset. We believe this is due to the relatively long length of the series (e.g., 3000 time steps), which poses a challenge for our model to accurately represent similarity using only a single similarity loss across both the time and frequency domains.

Fig. 4 Comparison of linear probing with Series2Vec, TS2Vec, TS-TCC, and supervised training on all nine datasets. The x-axis represents the number of labeled samples per class, and the y-axis the corresponding accuracy achieved by each approach

4.4 Fine-tuning

We assume that the dataset X is fully labeled, denoted as \(X = \left\{ (\mathbf {x^i}, y^i) \mid i=1,...,n\right\}\), where each sample is associated with a label \(y^i \in \left\{ 1,..., C\right\}\) and C represents the number of classes. We investigate whether leveraging similarity-based representation learning for initialization provides advantages compared to randomly initializing a supervised model. To examine this, we first pre-train the model without using labels through the self-supervised pretext task. Afterward, we train (fine-tune) the entire model for a few epochs on the labeled dataset in a fully supervised manner.
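A brief sketch of this setup, with hypothetical names, assuming a pre-trained encoder producing K-dimensional representations and a linear classification head trained end-to-end:

```python
import torch.nn as nn

def build_finetune_model(pretrained_encoder, K=320, num_classes=6):
    """Initialize from Series2Vec pre-training, then train end-to-end with labels."""
    model = nn.Sequential(pretrained_encoder, nn.Linear(K, num_classes))
    for p in model.parameters():
        p.requires_grad = True        # unlike linear probing, the encoder is updated
    return model

# Usage (one supervised step):
#   logits = model(x_batch)
#   loss = nn.functional.cross_entropy(logits, y_batch)
```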

Table 2 presents the classification accuracy results for different datasets, comparing the performance of a model with random initialization and pre-trained Series2Vec. The table shows that using pre-trained Series2Vec leads to an average improvement of 1% in accuracy compared to the random initialization. Significant improvements are observed in specific datasets, such as WISDM2, PAMAP2, and WISDM. For WISDM2, Series2Vec achieves an accuracy gain of 2.35% compared to the random initialization. Similarly, for PAMAP2 and WISDM, the accuracy gains are 3.03% and 1.51% respectively, validating the effectiveness of utilizing similarity-based methods for enhanced learning and improved time series classification.

Table 2 Comparison of Classification Accuracy between Random Initialization and Pre-Trained Series2Vec

4.5 Ablation study

Component analysis To assess the effectiveness of the proposed components in Series2Vec, we compared the Series2Vec model with three variations, as presented in Table 3: (1) w/o Attention, where the transformer block is removed; (2) w/o Spectral, where only the temporal domain is used as the input feature; and (3) w/o Temporal, where only the frequency spectrum of the input series is used to generate the representation.

Table 3 Series2Vec Ablation Study: Component Analysis

As shown in Table 3, the inclusion of order-invariant self-attention has a significant impact on the model's accuracy, validating our approach of employing it to ensure that the model attends to similar series in the batch for a given time series. Furthermore, we observed that in datasets recorded with a low sampling rate, such as WISDM2, Skoda, WISDM, and UCI-HAR, employing the frequency domain improves the model's performance. A low sampling rate may make it difficult for the model to capture fine-grained temporal patterns in the data, whereas frequency-based representations derived from the FFT can capture information about the underlying periodicity and spectral content of the signal.

Complementary loss function We evaluate our similarity preserving loss (\(\mathcal {L}_{Sim}\)) performance in combination with other methods such as self-prediction loss (\(\mathcal {L}_{SP}\)) used in TST and contrastive loss (\(\mathcal {L}_{Cons}\)) employed in TS-TCC. Table 4 showcases the average accuracy of five runs for different combinations of similarity, contrastive, and self-prediction loss on all nine datasets. Notably, we find that the similarity loss surpasses the individual performance of self-prediction loss in TST and contrastive loss in TS-TCC. Additionally, the combination of self-prediction and similarity-preserving learning yields superior results compared to the combination of contrastive and similarity loss. This suggests that self-prediction and similarity learning capture distinct implicit biases, and their fusion leads to enhanced performance in time series analysis.

Table 4 \(\mathcal {L}_{Sim}\) as Complementary Loss Function

5 Conclusion

This paper proposes a novel self-supervised learning method, Series2Vec, for time series analysis. Series2Vec is inspired by contrastive learning, but instead of using synthetic transformations, it utilizes time series similarity measures to assign the target output for the encoder loss. This offers a novel and more effective approach to implicit bias encoding, making it more suitable for time series analysis. The experimental results show that Series2Vec outperforms existing methods for time series representation learning. Moreover, our findings indicate that Series2Vec performs well on datasets with a limited number of labeled samples. Finally, fusing the similarity-based loss function with other representation learning models leads to enhanced performance in time series classification. In the future, we will explore incorporating additional similarity measures into the model to better represent the similarity among series. Furthermore, we plan to preprocess the data before calculating similarity to improve the robustness of the pretext targets to noise.