1 Introduction

Neural networks serve as fundamental learning models across disciplines such as machine learning, data mining, and artificial intelligence. Typically, these networks go through a training phase during which their parameters are tuned according to specific learning rules and the data provided. A popular training paradigm involves backpropagation for optimizing an objective function. Once trained, these networks can be deployed for a variety of tasks, including classification, clustering, and regression.

A time series is a real-valued ordered sequence. The task of time series clustering is to assign time series instances to homogeneous groups. It is one of the most important and challenging tasks in time series data mining and has been applied in various fields such as finance (Kumar et al. 2002), biology (Subhani et al. 2010; Fujita et al. 2012), climate (Steinbach et al. 2003), and medicine (Wismüller et al. 2002). In this work, we consider the partitional clustering problem, wherein the given time series instances are grouped into pairwise-disjoint clusters.

Existing time series clustering methods achieve good performance (Paparrizos and Gravano 2015; Petitjean et al. 2011; Li et al. 2019), but since they form clusters based on a single focus, such as shape or point-to-point distance, they are suboptimal for some specific data types. Here, we introduce a novel method named RandomNet for time series clustering using untrained deep neural networks. Different from conventional training methods that adjust network weights (parameters) using backpropagation, RandomNet utilizes different sets of random parameters to extract various representations of the data. By extracting diverse representations, it can effectively handle time series with different characteristics. These representations are clustered; the resulting clusterings are then selected and ensembled to produce the final clustering. This approach ensures that data only needs to pass through the networks once to obtain the final result, obviating the need for backpropagation. Therefore, the time complexity of RandomNet is linear in the number of instances in the dataset, providing a more efficient solution for time series clustering tasks.

Given a neural network, the various sets of parameters in the network can be thought of as performing different types of feature extraction on the input data. As a result, these varied parameters can generate diverse data representations. Some of these representations may be relevant to a clustering task, producing meaningful clusterings, while others may be less useful or entirely irrelevant, leading to less accurate or meaningless clusterings. This concept forms the basis of RandomNet: by combining clustering results derived from all these diverse representations, the meaningful, latent group structure within the data can be discovered. This is because the noise introduced by irrelevant representations tends to cancel out during the ensemble process, whereas the connections provided by relevant representations are strengthened. Therefore, efficient and reliable clustering can be achieved despite the randomness of the network parameters.

To demonstrate the effectiveness of RandomNet, we provide theoretical analysis. The analysis shows that RandomNet has the ability to effectively identify the latent group structure in the dataset as long as the ensemble size is large enough. Moreover, the analysis also provides a lower bound for the ensemble size. Notably, this lower bound is independent of the number of instances or the length of the time series in the dataset, given that the data in the dataset are generated from the same mechanism. This provides the ability to use a fixed, large ensemble size to achieve satisfactory results, offering a practical approach to time series clustering that does not need adjustment for different dataset sizes or time series lengths.

We conduct extensive experiments on all 128 datasets in the well-known UCR time series archive (Dau et al. 2019) and perform statistical analysis on the results. These datasets have different sizes, sequence lengths, and characteristics. The results show that RandomNet has the top performance in the Rand Index compared with other state-of-the-art methods and achieves superior performance across all data types evaluated.

The main contributions of the paper are summarized as follows:

  • We propose RandomNet, a novel method for time series clustering using untrained neural networks with random weights. There is no training or backpropagation in the method.

  • We demonstrate the effectiveness of the proposed method both empirically and theoretically. We conduct extensive experiments on 128 datasets to evaluate the proposed method and provide statistical analysis of the comparison results to show the superiority of our method over the state-of-the-art methods.

  • We demonstrate the efficiency of the proposed method through experimental evaluation on data of varying sizes and time series lengths. The results of linear curve-fitting on the running time indicate that the method has linear time complexity.

2 Background and related work

2.1 Definitions and notations

Definition 1

A time series \(T=[t_1, t_2, \ldots , t_m]\) is an ordered sequence of real-valued data points, where m is the length of the time series.

Definition 2

Given a set of time series \(\{T_i\}_{i=1}^n\) and the number of clusters k, the objective of time series clustering is to assign each time series instance \(T_i\) a group label \(c_j\), where \(j \in \{1, \ldots , k\}\). n is the number of instances in the dataset. We would like the instances in the same group to be similar to each other and dissimilar to the instances in other groups.

2.2 Related work

There has been much work on time series clustering; we categorize existing methods into four groups: raw-data-based methods, feature-based methods, deep-learning-based methods, and others.

Raw-data-based methods. The raw-data-based methods directly apply classic clustering algorithms, such as k-means (MacQueen et al. 1967), to raw time series. The standard k-means algorithm adopts Euclidean distance to measure the dissimilarity of instances and often cannot handle the scale variance, phase shifting, distortion, and noise in time series data. To cope with these challenges, dozens of distance measures for time series data have been proposed.

Dynamic Time Warping (DTW) (Berndt and Clifford 1994) is one of the most popular distance measures; it finds the optimal alignment between two sequences. It is used in DTW Barycenter Averaging (DBA) (Petitjean et al. 2011), which proposes an iterative procedure that refines the centroids to minimize the squared DTW distances from the centroids to the other time series instances. Similarly, K-Spectral Centroid (KSC) (Yang and Leskovec 2011) proposes a distance measure that finds the optimal alignment and scaling for matching two time series. The centroids are computed, based on matrix decomposition, to minimize the distances between the centroids and the instances under this distance measure. Another approach, k-shape (Paparrizos and Gravano 2015), proposes a shape-based distance measure based on the cross-correlation of two time series. The distance measure shifts the two time series to find the optimal matching. Each centroid is obtained by optimizing the squared normalized cross-correlation from the centroid to the instances in the cluster.

Feature-based methods. Feature-based methods transform the time series into flat, unordered features, and then apply classic clustering algorithms to the transformed data.

Zakaria et al. (2012) propose to calculate the distances from a set of short sequences to the time series instances in the dataset and use the distance values as new features for the respective instances. This set of short sequences, called U-shapelets, is found by enumerating all the subsequences in the data to best separate the instances. k-means is then applied to the new features for clustering. In the work by Zhang et al. (2016), instead of enumerating the subsequences, the shapelets are learned by optimizing an objective function with gradient descent.

A recent work (Lei et al. 2019) proposes Similarity PreservIng RepresentAtion Learning (SPIRAL), which samples pairs of time series, calculates their DTW distances, and builds a partially observed similarity matrix. The matrix is an approximation of the pairwise DTW distance matrix of the dataset. The new features are generated by solving a symmetric matrix factorization problem such that the inner product of the new feature matrix approximates the partially observed similarity matrix.

Deep-learning-based methods. Many methods in this category adopt the autoencoder architecture for clustering. In an autoencoder, the low-dimensional hidden-layer output is used as the features for clustering. Among these, Improved Deep Embedded Clustering (IDEC) (Guo et al. 2017) improves the autoencoder by adding an extra layer to the model. It not only employs a reconstruction loss but also optimizes a clustering loss specifically designed to preserve the local structure of the data. This dual-loss strategy can capture both the global structure and local differences, thereby improving the clustering process to better learn the inherent characteristics of the data.

Deep Temporal Clustering (DTC) (Madiraju et al. 2018) specifically addresses time series clustering by using Mean Square Error (MSE) to measure the reconstruction loss, and Kullback–Leibler (KL) divergence to measure clustering loss. Similarly, Deep Temporal Clustering Representation (DTCR) (Ma et al. 2019) adopts MSE for the reconstruction loss, while it uses a k-means objective function to measure the clustering loss. DTCR also employs a fake-sample generation strategy to augment the learning process. Clustering Representation Learning on Incomplete time-series data (CRLI) (Ma et al. 2021) further studies the problem of clustering time series with missing values. It jointly optimizes the imputation and clustering process, aiming to impute more discriminative values for clustering and to make the learned representations possess a good clustering property.

In the broader neural network literature, there is a class of methods known as Extreme Learning Machines (ELM) (He et al. 2014; Peng et al. 2016; Wu et al. 2018) that also uses random weights: a single-layer feed-forward network maps inputs into a new feature space. The hidden-layer weights are set randomly but the output weights are trained. The idea is to find a mapping space where instances of different classes can be separated well.

In the domain of time series classification, ROCKET (Dempster et al. 2020), MiniRocket (Dempster et al. 2021) and MultiRocket (Tan et al. 2022) adopt strategies involving the use of random weights to generate features for classification. They use multiple single-layer convolution kernels instead of a deep network architecture.

Beyond the neural network and clustering fields, several works also adopt randomized features or feature maps (Rahimi et al. 2007; Chitta et al. 2012; Farahmand et al. 2017). However, it is worth noting that all these methods diverge from our proposed approach in their network structures. Moreover, none of these methods incorporates ensemble learning, which forms the core of our approach. To the best of our knowledge, we are the first to propose using a network with random weights in time series clustering.

Other methods. In our previous work (Li et al. 2019), we present a Symbolic Pattern Forest (SPF) algorithm for time series clustering, which adopts Symbolic Aggregate approXimation (SAX) (Lin et al. 2007) to transform time series subsequences into symbolic patterns. Through iterative selections of random symbolic patterns to divide the dataset into two distinct branches based on the presence or absence of the pattern, a symbolic pattern tree is constructed. Repeating this process forms a symbolic pattern forest, the ensemble of which produces the final clustering result.

3 The proposed method

Fig. 1: The overall structure of RandomNet

3.1 Architecture and algorithm

Figure 1 shows the architecture of RandomNet. The method is structured with B branches, each containing a CNN-LSTM block, designed to capture both spatial and temporal dependencies of time series, followed by k-means clustering. Each CNN-LSTM block contains multiple groups of CNN networks and an LSTM network, and each group consists of a one-dimensional convolutional layer, a Rectified Linear Unit (ReLU) layer, and a pooling layer. The output of the CNN networks is flattened. In our experiments, we set the number of CNN groups to \(\log _2{m}\), where m is the length of the time series. We fix the number of filters of the 1D convolution to 8, the filter size to 3, and the pooling size to 2. We set the number of LSTM units to 8. The weights used within the network are randomly chosen from \(\{-1, 0, 1\}\). We opt for this finite parameter set over a continuous interval (e.g., \([-1, 1]\)) to simplify the parameter space.
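To make one branch concrete, the following is a minimal Keras sketch of a CNN-LSTM block whose weights are drawn from \(\{-1, 0, 1\}\). The exact wiring (in particular, where the flattening sits relative to the LSTM) is our assumption and may differ from the released implementation.

```python
import numpy as np
import tensorflow as tf

def random_cnn_lstm_branch(m, n_filters=8, kernel_size=3, pool_size=2,
                           lstm_units=8, seed=None):
    """One RandomNet branch: log2(m) conv/ReLU/pooling groups followed by an
    LSTM, with every weight drawn from {-1, 0, 1} instead of being trained."""
    rng = np.random.default_rng(seed)
    inputs = tf.keras.Input(shape=(m, 1))
    x = inputs
    for _ in range(int(np.log2(m))):           # number of CNN groups = log2(m)
        x = tf.keras.layers.Conv1D(n_filters, kernel_size,
                                   padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPooling1D(pool_size)(x)
    x = tf.keras.layers.LSTM(lstm_units)(x)    # temporal summary of the pooled maps
    model = tf.keras.Model(inputs, x)
    # Overwrite all parameters with random values from the finite set {-1, 0, 1};
    # no backpropagation ever touches them.
    for layer in model.layers:
        layer.set_weights([rng.choice([-1.0, 0.0, 1.0], size=w.shape)
                           for w in layer.get_weights()])
    return model
```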

Each branch produces its own clustering; however, some clusterings might be skewed or deviant due to the inherent randomness of the weights. To alleviate this problem, we propose a selection mechanism that removes any clusterings containing clusters that are either too small or too large.

Concretely, the method sets a lower bound lr and an upper bound ur for the cluster size. The number of instances that violate the bounds in each clustering is counted as the number of violations. For example, suppose a clustering contains two clusters with sizes 40 and 52, respectively. If the lower bound is 5 and the upper bound is 50, then the number of violations for this clustering is \(52-50=2\). The clusterings are sorted by their numbers of violations, and the method selects the top S clusterings for the ensemble, where \(S=\max (zv, sr \times B)\), zv is the number of clusterings with zero violations, sr is a selection rate, and B is the number of branches in the method.
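A minimal sketch of this selection step (Algorithm 2 below). Counting an undersized cluster's shortfall as violations is our reading of the rule, since the paper's example covers only the oversized case.

```python
import numpy as np

def count_violations(labels, k, lower, upper):
    """Violations of one clustering: for each cluster, how far its size falls
    outside [lower, upper] (e.g., sizes (40, 52) with upper 50 give 52-50=2)."""
    sizes = np.bincount(labels, minlength=k)
    return int(sum(max(lower - s, 0) + max(s - upper, 0) for s in sizes))

def select_clusterings(clusterings, k, lower, upper, sr, B):
    """Keep the S clusterings with the fewest violations, where
    S = max(#zero-violation clusterings, sr * B)."""
    v = [count_violations(c, k, lower, upper) for c in clusterings]
    S = max(sum(x == 0 for x in v), int(sr * B))
    order = np.argsort(v, kind="stable")
    return [clusterings[i] for i in order[:S]]
```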

Finally, we ensemble the selected results to form the final clustering. While the diversity of clustering results from a large number of different branches helps reveal various intrinsic patterns in the data, it introduces the challenge of combining these different results into a cohesive, unified clustering. To address this challenge, we adopt the Hybrid Bipartite Graph Formulation (HBGF) (Fern and Brodley 2004) to perform the clustering ensemble. This technique builds a bipartite graph for the clusterings in the ensemble, where the instances and clusters become the vertices. If an instance belongs to a cluster, then there is an edge connecting the two respective vertices in the graph. Partitioning the graph gives a consensus clustering for the ensemble. HBGF has two main advantages. First, it can extract consensus from differences, identifying and strengthening the repeated patterns of grouping across the clustering set. Second, it has linear time complexity, which ensures the scalability of our model for large datasets. In our implementation, we use the Metis library (Karypis and Kumar 1998) to partition the graph.
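The sketch below illustrates the HBGF construction. As a stand-in for the METIS partitioning used in our implementation, it partitions the bipartite graph via a spectral embedding (an SVD of the incidence matrix), so it approximates rather than reproduces the actual ensemble step.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def hbgf(clusterings, k):
    """Consensus clustering via the HBGF bipartite graph: instances on one
    side, all clusters from all selected clusterings on the other."""
    n = len(clusterings[0])
    rows, cols, offset = [], [], 0
    for labels in clusterings:
        rows.extend(range(n))                    # edge: instance -> its cluster
        cols.extend(offset + l for l in labels)
        offset += max(labels) + 1
    A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, offset))
    u, _, _ = svds(A, k=k)                       # spectral embedding of instances
    return KMeans(n_clusters=k, n_init=10).fit_predict(u)
```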

Algorithm 1 gives the pseudo-code of RandomNet. Given a time series dataset \(D=\{T_i\}_{i=1}^n\), a branch number B, a cluster number k, bounds lr and ur, and a selection rate sr, the algorithm outputs a clustering assignment C for the input time series.

Algorithm 1: RandomNet

Algorithm 2: Selection Mechanism

Algorithm 3: Ensemble

In Algorithm 1 Line 4, the parameters in the CNN-LSTM blocks are randomly set from \(\{-1, 0, 1\}\) as previously noted. The data passes through the CNN-LSTM blocks to generate features for each time series in Line 5. Line 6 applies k-means on the features to produce a clustering assignment. Line 7 adds the clustering to the ensemble set. In Line 9, the selection mechanism (Algorithm 2) introduced above is performed on the ensemble set with the user-provided selection rate and bounds. Finally, in Line 10, the ensemble function (Algorithm 3) ensembles the clusterings in SelectedSet and gives the clustering C as the output of the algorithm.
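Putting the pieces together, here is a compact sketch of Algorithm 1 in terms of the helpers sketched above; `random_cnn_lstm_branch`, `select_clusterings`, and `hbgf` are our illustrative names, not the authors' code, and the default bounds follow Sect. 4.1.

```python
from sklearn.cluster import KMeans

def randomnet(D, k, B=800, lr=None, ur=None, sr=0.1):
    """Sketch of Algorithm 1. D is an (n, m) array of z-normalized series."""
    n, m = D.shape
    acs = round(n / k)                         # average cluster size
    lr = 0.3 * acs if lr is None else lr       # default bounds from Sect. 4.1
    ur = 1.5 * acs if ur is None else ur
    ensemble_set = []
    for i in range(B):
        branch = random_cnn_lstm_branch(m, seed=i)          # Line 4: random weights
        feats = branch.predict(D[..., None], verbose=0)     # Line 5: one forward pass
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(feats)  # Line 6
        ensemble_set.append(labels)                         # Line 7
    selected = select_clusterings(ensemble_set, k, lr, ur, sr, B)    # Line 9
    return hbgf(selected, k)                                # Line 10
```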

3.2 Effectiveness of RandomNet

Given the network architecture, its parameters (weights) represent a form of feature extraction from the data and thus produce a kind of representation. With multiple random parameters, we can have multiple representations.

Some representations are relevant to the clustering task. Instances that are similar to each other are more likely to be put in the same cluster under these relevant representations. Other representations are irrelevant to the clustering task. Under these representations, two similar instances may not be assigned to the same cluster.

The intuition is that, in the ensemble, the effects of irrelevant representations cancel each other out, while the effect of relevant representations dominates the ensemble. Inspired by Li et al. (2019), described in the previous section, we provide an effectiveness analysis for RandomNet.

We assume the data contains k distinct clusters which correspond to k different classes. We have the following theorem:

Theorem 1

Assume two instances, \(T_1\) and \(T_2\), are from the same class. If they reside within the same cluster under some relevant representations, then RandomNet assigns these two instances to the same cluster in the final output.

Proof

Let \(\gamma\) denote the percentage of relevant representations among all the representations. In each CNN-LSTM block, if the representation is relevant, we have \(P(C(T_1)=C(T_2))=1\), where \(P(\cdot )\) stands for the probability and \(C(\cdot )\) denotes the clustering assignment. If the representation is irrelevant, the instances are assigned to any of the k clusters randomly. Hence, we can deduce: \(P(C(T_1)=C(T_2))=1/k\), and \(P(C(T_1) \ne C(T_2))=(k-1)/k\). Combining the two cases, we derive that \(P(C(T_1)=C(T_2))= \gamma \times 1 + (1- \gamma ) \times 1/k\) and \(P(C(T_1) \ne C(T_2))= (1- \gamma ) \times (k-1)/k\). It is clear that \(P(C(T_1)=C(T_2))> P(C(T_1) \ne C(T_2))\). Since each block is independent of the others, the law of large numbers (Révész 2014), which states that the average of many independent repetitions of an experiment converges to its expected value, gives:

$$\begin{aligned} Count(C(T_1)=C(T_2))> Count(C(T_1) \ne C(T_2)) \end{aligned}$$
(1)

where we consider a sufficiently large ensemble size and \(Count(\cdot )\) is the count of occurrences. Consequently, in the ensemble result, instances \(T_1\) and \(T_2\) belong to the same cluster. \(\square\)
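This argument can also be checked numerically. Below is a quick Monte Carlo sketch for \(k=2\) (the setting used in Theorem 2 below), assuming 30% of the representations are relevant:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, k, b = 0.3, 2, 800      # 30% relevant representations, k = 2, 800 branches
same = 0
for _ in range(b):
    if rng.random() < gamma:                  # relevant: T1 and T2 always agree
        same += 1
    else:                                     # irrelevant: independent random labels
        same += int(rng.integers(k) == rng.integers(k))
print(same, b - same)   # roughly 520 vs 280: Count(=) dominates Count(!=)
```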

The above analysis assumes we have a large ensemble size, and the following theorem provides a lower bound for the ensemble size. Here, for simplicity, we set \(k=2\).

Theorem 2

Let the ensemble size be b. Then, the lower bound on b needed to provide a good clustering is given by \(-2 \ln \alpha / \gamma ^2\), where \(\gamma\) represents the percentage of relevant representations and \(1-\alpha\) is the confidence level.

Proof

Let Y be a random variable indicating the number of cases where \(C(T_1)=C(T_2)\). The random variable Y follows a binomial distribution:

$$\begin{aligned} P(Y=s) = \left( {\begin{array}{c}b\\ s\end{array}}\right) p^s(1-p)^{b-s} \end{aligned}$$
(2)

where \(p=P(C(T_1)=C(T_2))\). Equation (1) needs to hold with high probability, leading to the following inequality:

$$\begin{aligned} P(Y \le s) = \sum _{i=0}^s \left( {\begin{array}{c}b\\ i\end{array}}\right) p^i(1-p)^{b-i} \le \alpha \end{aligned}$$
(3)

where \(s=b/2\) and \(1-\alpha\) is the confidence level. Applying Hoeffding’s inequality (Hoeffding 1994) to the sample mean \(\bar{Y}=Y/b\), we have \(P(E[\bar{Y}]-\bar{Y} \ge t) \le e^{-2bt^2}\), where \(t \ge 0\). Considering \(E[\bar{Y}]=p\), we have:

$$\begin{aligned} P(E[\bar{Y}]-\bar{Y} \ge t)&= P(bE[\bar{Y}]-b\bar{Y} \ge bt)\end{aligned}$$
(4)
$$\begin{aligned}&= P(Y \le bp-bt) \le e^{-2bt^2} \end{aligned}$$
(5)

Let \(s=bp-bt\), then \(t=(bp-s)/b\), so we get \(P(Y \le s) \le e^{-2(bp-s)^2/b} \le \alpha\). With \(s=b/2\), \(p=\gamma \times 1 + (1- \gamma ) \times 1/2\), we solve the above inequality and derive \(b \ge -2 \ln \alpha / \gamma ^2\). \(\square\)

Here is a concrete example for the bound: suppose we have a confidence level of 99% and we estimate that 30% of the representations are relevant. In this case, \(\alpha =0.01\) and \(\gamma =0.3\), yielding a b value of at least 102.33. From the theorem, one observes that the lower bound is independent of the number of instances in the dataset. This implies that we can maintain a sufficiently large fixed ensemble size to handle inputs of varying sizes, provided the data generation mechanism remains constant. We verify this in the experimental section by using a fixed ensemble size chosen through experiments and varying the number of time series instances that are generated from the same mechanism.
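The bound is a one-line computation; a minimal sketch:

```python
import math

def ensemble_lower_bound(alpha, gamma):
    """Theorem 2: b >= -2 ln(alpha) / gamma^2."""
    return -2 * math.log(alpha) / gamma ** 2

print(ensemble_lower_bound(0.01, 0.3))   # ~102.33, matching the example above
```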

4 Experimental evaluation

4.1 Experimental setup

To evaluate the effectiveness of RandomNet, we run the algorithm on all 128 datasets from the well-known UCR time series archive (Dau et al. 2019). These datasets come from different disciplines with various characteristics. Each dataset in the archive is split into a training set and a testing set. We fuse the two sets and utilize the entire dataset in the experiment. Some of these datasets contain varying-length time series. To ensure that all time series in a dataset have the same length, we append zeros at the end of the shorter series.

For benchmarking purposes, we run kDBA (Petitjean et al. 2011), KSC (Yang and Leskovec 2011), k-shape (Paparrizos and Gravano 2015), SPIRAL (Lei et al. 2019), and SPF (Li et al. 2019) on the same datasets. These methods serve as representatives of the state of the art for time series clustering. Additionally, we incorporate two deep-learning-based methods, Improved Deep Embedded Clustering (IDEC) (Guo et al. 2017) and DTC (Madiraju et al. 2018), for comparison. While DTC is specifically designed for time series data, as discussed in Sect. 2.2, IDEC is a general clustering method. We also compare our method with ROCKET (Dempster et al. 2020) and its variants, MiniRocket (Dempster et al. 2021) and MultiRocket (Tan et al. 2022), since we are interested in how other models that also use random parameters compare to ours. As they are all specifically designed for time series classification, we adapt them to our use case by removing the classifier component and replacing it with k-means; all references to them hereafter pertain to this adapted version. We do not include DTCR (Ma et al. 2019) in the comparison, as we are unable to reproduce the results reported in its paper despite using the code provided by its authors; this issue has been similarly reported by others on the GitHub issue page for the project. We do not include CRLI (Ma et al. 2021) since it is specially designed for incomplete time series data, which is outside the scope of our study. Table 1 provides a concise comparison of our method and the baselines used in the experiments, outlining their applicable data types, method types, main focuses, and time complexity in terms of the number of instances (n) and the length of time series (l). Note that due to the complexity involved in training deep learning models, we do not report the time complexity of the two deep learning methods, DTC and IDEC, which require network training. For a more detailed description of each method, please refer to Sect. 2.2. We provide the experimental evaluation of the time complexity of our model in Sect. 4.6.

Table 1 Comparison of baselines and our method

The source code of kDBA, KSC, and k-shape is obtained from the authors of k-shape. The source code of SPIRAL, SPF, IDEC, DTC, ROCKET, MiniRocket, and MultiRocket is available online. The number of clusters k is set to equal the number of classes in each dataset, and we follow the default parameter settings in the source code.

For the default hyperparameters of RandomNet, the number of branches B is set to 800, the selection rate sr is set to 0.1, the lower bound lr is set to \(0.3 \times acs\), and the upper bound ur is set to \(1.5 \times acs\), where acs refers to the average cluster size, which is computed as \(round(number\_of\_instances/k)\). Detailed hyperparameter selection experiments are elaborated upon in the subsequent section.

We implement RandomNet using Python and TensorFlow 2.1 and use the k-means implementation in the Scikit-learn package (Pedregosa et al. 2011) with default settings. The experiments are run on a node of a batch-processing cluster. The node uses a 2.6 GHz CPU and 64 GB RAM. Given that the process does not involve neural network training, there is no necessity for GPU utilization. The source code of RandomNet is available at: https://github.com/Jackxiini/RandomNet.

All the datasets in the experiments have labels that can be used as ground truth. We use the Rand Index (Rand 1971) to measure the clustering accuracy of the methods under comparison. The Rand Index falls in \([0, 1]\), where a larger value indicates that the clustering matches the actual class relationship better.
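For reference, a minimal sketch of computing the Rand Index with scikit-learn (rand_score is available in scikit-learn 0.24 and later); the Rand Index is the fraction of instance pairs on which the clustering and the labels agree:

```python
from sklearn.metrics import rand_score

truth = [0, 0, 1, 1, 2, 2]   # ground-truth class labels
pred  = [0, 0, 1, 2, 2, 2]   # a clustering assignment
print(rand_score(truth, pred))   # 0.8: 12 of the 15 pairs agree
```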

Fig. 2: Left: Rand Index for varying numbers of branches (100 to 1000). Middle: Rand Index for varying selection rates (0.1 to 1). Right: Rand Index for three pairs of lower and upper bounds, representing a wide interval, an intermediate interval, and a narrow interval

4.2 Hyperparameter analysis

To fine-tune and investigate the influence of hyperparameters on model performance, we conduct a series of experiments. We select 20 datasets from the UCR time series archive (Dau et al. 2019) and run each experiment 10 times and take the average Rand Index as the result.

Number of branches. The number of branches B plays a pivotal role in our model, affecting both the quality of the clustering and the computational efficiency. We test B values ranging from 100 to 1000, in increments of 100, and keep all other settings default.

The left plot of Fig. 2 shows that the average Rand Index improves with an increase in the number of branches until \(B=800\). Increasing the number of branches beyond 800 only results in a rise in running time, without contributing to better clustering quality. Therefore, we set the default B for all datasets to 800.

Selection rate. The selection rate sr controls the lower bound on the number of selected clusterings. We test sr values ranging from 0.1 to 1, in increments of 0.1, and keep all other settings at their defaults.

The middle plot of Fig. 2 shows only slight changes in the average Rand Index as sr changes. Since a larger sr increases the running time of the model, we choose \(sr=0.1\) as the default value.

Lower bound and upper bound. The lower bound lr and the upper bound ur are crucial in determining the number of violations, which affects the quality of the clustering. We evaluate three pairs of lr and ur, (0.1, 1.8), (0.3, 1.5), and (0.5, 1.2), representing a wide interval, an intermediate interval, and a narrow interval, respectively. For simplicity, we present these as multipliers; the actual lower and upper bounds are obtained by multiplying these values with the average cluster size acs. Narrower intervals are more restrictive on the cluster sizes and thus increase the number of violations. We keep all other settings at their defaults.

The right plot of Fig. 2 shows the effects of the wide, intermediate, and narrow intervals on the Rand Index. We observe that an appropriate interval brings better performance. The intermediate interval has the best Rand Index, whereas the wide interval underperforms the other two due to its lax size constraints, which weaken its ability to screen out deviant clusterings. Therefore, we set lr to \(0.3 \times acs\) and ur to \(1.5 \times acs\) by default.

Selection bias test. Since the choice of datasets for hyperparameter tuning may introduce selection bias, we conduct an experiment to show how this choice affects the results. We divide the 128 datasets into five groups of similar size and optimize the three hyperparameters separately on each group. We then apply the optimized hyperparameters to the remaining datasets for each group. For each hyperparameter, we average the Rand Index within each group and obtain the average and the standard deviation of the Rand Index across the five groups. As illustrated in Table 2, the standard deviations of the average Rand Index across the five groups, 0.0073, 0.0097, and 0.0078, are very small, indicating that the selection of datasets for hyperparameter tuning has a minimal effect on the final results. Therefore, we retain the previous selection of 20 datasets in subsequent experiments.

Table 2 Dataset selection bias test for each hyperparameter

4.3 Experimental results

Since we use 20 of the 128 datasets to select the hyperparameters, for the sake of fairness, we remove them in the following comparison and only show the results for the remaining 108 datasets.

Comparison with k-means.

Fig. 3: Critical difference diagram of the comparison of k-means, kmeansE, and RandomNet

As RandomNet uses k-means to generate clustering assignments, we are interested in how they compare. We also run k-means 800 times and use HBGF (Fern and Brodley 2004) to ensemble the results, which is denoted as kmeansE.

We run the methods under comparison on the 108 datasets and record the Rand Index. Figure 3 presents a critical difference diagram (Demšar 2006) for the comparison based on Rand Index. The values adjacent to each method represent the respective average rank (with smaller being better), and the methods connected by a thick bar do not significantly differ at the 95% confidence level. Notably, there is no thick bar present in Fig. 3, suggesting all methods have significant differences from each other.

Table 3 Comparing RandomNet with the state-of-the-art methods on the UCR Archive benchmarks
Table 4 (Continued) Comparing RandomNet with the state-of-the-art methods on the UCR Archive benchmarks
Table 5 Comparison of RandomNet with state-of-the-art methods across seven different time series data types

As seen in the figure, RandomNet significantly outperforms both k-means and kmeansE. It is noteworthy that kmeansE is significantly better than the standard k-means, indicating that employing ensemble methods can substantially improve the performance of time series clustering, even for the naive method that uses the original representation of time series. Comparing RandomNet with kmeansE further demonstrates that using the proposed deep neural network with random parameters for generating representations can indeed enhance the accuracy of k-means clustering and ensembles.

Fig. 4: Critical difference diagram of the comparison with ROCKET, MiniRocket, and MultiRocket

Comparison with ROCKET and its variants.

We compare our method with ROCKET and its variants, MiniRocket and MultiRocket. Note that we remove the classifier components in these ROCKET variants and replace them with k-means to adapt them to our use case. As shown in Fig. 4, RandomNet outperforms the ROCKET variants in terms of average rank and is significantly better than ROCKET and MultiRocket in particular. This reflects the advantage of RandomNet, which is specifically designed for time series clustering, among models based on random parameters. It is worth noting that MiniRocket is the best model among the ROCKET variants; therefore, we keep only MiniRocket in subsequent experiments.

Comparison with the state-of-the-arts. Tables 3 and 4 present the experimental results of RandomNet compared to state-of-the-art methods. The best results for each dataset are highlighted in bold. We provide the average Rand Index and average rank for each method. The \(\dag\) symbol indicates that the dataset is used for hyperparameter selection. Consequently, the results from these datasets have not been included in the computation of the average Rand Index and average rank. As the results illustrate, RandomNet achieves the highest average Rand Index and average rank amongst all the baseline methods.

Figure 5 depicts the critical difference diagram of the comparison between RandomNet and the state-of-the-art methods. The figure demonstrates that RandomNet significantly outperforms k-means, KSC, kDBA, SPIRAL, and two deep learning-based methods, IDEC and DTC. It also shows our proposed method is slightly better than MiniRocket, k-shape and SPF. These results solidify that RandomNet is a state-of-the-art time series clustering method.

To gain insight into the strengths and weaknesses of the 10 methods compared, and to see how each method performs on different types of data, we divide the time series datasets into different categories (e.g., sensor, device, motion, spectro). For each dataset, we rank the results of the 10 methods as in previous comparisons (1 is the best and 10 is the worst), and compute the average ranking of each method for each category. The ranking results are shown in Table 5. Note that we only include categories with at least five datasets. The best-performing method for each category is in bold. Next, we rank these average rankings within each category (e.g., for Device, our method has the ranking of 1 since it has the lowest average rank, whereas KSC has the ranking of 10 since it has the highest average rank). We then average these category-wise rankings and report them in the line Average rank. For example, the rankings of our method for the 7 categories are 1, 1, 1, 3, 2, 1, and 1, respectively, with an average rank of 1.429. The last two lines show the number of categories in which the Rand Index of a method is among the top 1 and top 3, respectively.

Our model achieves the best average rank and is the best in five data types. It is also among the top three in all data types. This demonstrates the superiority of our model compared to other models across diverse data types. This can be attributed to its ability to generate diverse representations and its ensemble mechanism, which effectively cancels out irrelevant representations.

In contrast, other methods exhibit varying performance due to their specific focuses such as local shape or point-to-point distance computation, which may limit their effectiveness to only work on certain data types. For example, k-shape ranks ninth on the device data (where RandomNet ranks first), and SPF achieves an intermediate rank (fourth) on both image and motion data (where RandomNet ranks first and third, respectively). These results indicate that while specific models may perform well in certain data types, their performance can be suboptimal in others due to focus limitations.

Fig. 5: Critical difference diagram of the comparison with state-of-the-art methods

Table 6 Ablation results of RandomNet on 108 UCR datasets

4.4 Ablation study

To verify the effectiveness of each component in RandomNet, we compare the performance of full RandomNet and its four variants on 108 UCR datasets, which are shown in Table 6. The four variants are, 1) RandomNet w/ GRU (replaces LSTM with GRU), 2) RandomNet w/o LSTM (removes LSTM), 3) RandomNet w/o LSTM & ReLU (removes LSTM and ReLU), and 4) RandomNet w/o LSTM & ReLU & pooling (removes LSTM, ReLU and pooling).

The results show that the full RandomNet is better than the four variants in average Rand Index and average rank, reflecting the effectiveness of each part of RandomNet. It is worth noting that pooling is important in the model: removing it significantly increases the running time and decreases the performance.

4.5 Visualizing clusters for different methods

Figure 6 shows the 2D embeddings of the Cylinder-Bell-Funnel (CBF) (Saito and Coifman 1994) dataset using t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm (Van der Maaten and Hinton 2008), as well as cluster assignments by k-means, MiniRocket, and RandomNet compared with the true labels. We can see clearly that k-means and MiniRocket both have difficulty distinguishing the blue and green classes, which correspond to the Bell and the Cylinder classes, respectively.

Upon closer examination, we can see why. Figure 7 shows five instances of the CBF time series and their cluster assignments from k-means, MiniRocket, and RandomNet, respectively. All methods successfully group the red time series (Funnel) into one cluster. However, k-means and MiniRocket inaccurately cluster the blue (Bell) and green (Cylinder) time series, whereas RandomNet identifies the correct clusters. This is because k-means clusters based on Euclidean distance, making it sensitive to misalignment in the time series data (e.g., the blue time series), high dimensionality, and noise. For MiniRocket, the use of a network with random weights results in many class-independent values in its final representation, which is equivalent to adding noise from its last layer to k-means. In contrast, RandomNet uses the selection mechanism and ensemble, which weaken the influence of irrelevant representations and strengthen relevant ones, making the model more robust.

Fig. 6: Clusterings of the CBF dataset visualized using t-SNE for RandomNet (upper right), k-means (lower left), and MiniRocket (lower right), compared with the true labels (upper left)

Fig. 7: Clustering results on five samples from the CBF dataset using k-means (left), MiniRocket (middle), and RandomNet (right)

4.6 Testing the time complexity

In real-world applications, the size of datasets and the length of time series can be huge, making linear time complexity with respect to the number of instances and length of time series an essential characteristic of any practical model. To test the scalability and effectiveness of our proposed method, we use the same mechanism to generate datasets of varying sizes. For different time series lengths, we supplement the original time series (length of 128) with random noise to reach the required length. In this experiment, we use the CBF dataset (Saito and Coifman 1994). For testing linear complexity w.r.t the number of instances, the number of instances is set from 200 to 10,000 with a fixed time series length of 100. For testing linear complexity w.r.t the length of time series, the length is set from 1000 to 10,000 with a fixed dataset size of 120. We run RandomNet 10 times and record the average running time and Rand Index. The outcomes are presented in Fig. 8.
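For reproducibility, here is a sketch of the generating mechanism, based on our reading of the standard CBF generator (Saito and Coifman 1994); the exact parameter ranges are assumptions:

```python
import numpy as np

def cbf(kind, m=128, rng=np.random.default_rng()):
    """One Cylinder-Bell-Funnel series of length m."""
    a = rng.integers(16, 33)          # event onset
    b = a + rng.integers(32, 97)      # event offset (b <= 128 for m = 128)
    t = np.arange(1, m + 1)
    chi = ((t >= a) & (t <= b)).astype(float)
    ramp = (t - a) / (b - a)
    shape = {"cylinder": chi,             # plateau
             "bell": chi * ramp,          # rising ramp
             "funnel": chi * (1 - ramp)}[kind]   # falling ramp
    return (6 + rng.normal()) * shape + rng.normal(size=m)

# A dataset of any size is generated by the same mechanism:
X = np.stack([cbf(k) for k in ["cylinder", "bell", "funnel"] * 100])
```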

Fig. 8: Running time of RandomNet for different numbers of instances (left) and lengths of time series (right)

In the figure, each blue dot represents the average running time for the respective number of instances or length of time series. We perform linear curve-fitting on the results, depicted by the red line. The \(R^2\) values, the coefficients of determination of the fits, are 0.9942 and 0.9756, respectively. Both are close to 1, indicating that the average running time of RandomNet has a strong linear relationship with the number of instances and the length of the time series. Moreover, we also observe stable Rand Index results across varying input sizes, indicating that our model is not sensitive to the size of the input data. Note that since we add a lot of noise (e.g., for the length of 9000, only 1.4% of the time series is non-noise), the Rand Index in the right figure drops significantly. In the next section, we inject a reasonable proportion of noise to analyze noise sensitivity.

Table 1 shows that some other models, such as k-means, SPF, and MiniRocket, share this characteristic of linear complexity w.r.t. dataset size and time series length, but our model is overall more accurate than these methods and has superior performance on all evaluated time series data types.

4.7 Analyzing noise sensitivity

We use three different datasets, SmallKitchenAppliances, ECG200, and FiftyWords, from three different application domains to test the noise sensitivity of the model. These datasets are injected with six levels of random Gaussian noise (scales of 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5). This setting ensures that most values in the time series are valid, unlike in the previous section, where most values are noise. We evaluate the performance of RandomNet against the second-best model, SPF, by running each model 10 times and calculating the average Rand Index.

As illustrated in Fig. 9, while both models exhibit strong resilience to noise, our model is slightly better than SPF. For the SmallKitchenAppliances dataset, the performance of RandomNet is barely affected as the noise level increases, whereas the performance of SPF decreases more noticeably. On the ECG200 dataset, both models experience small fluctuations in performance at different noise levels, indicating that noise affects them similarly in this case. For the FiftyWords dataset, both models remain highly stable and show minimal performance differences despite the introduced noise.

Overall, these observations highlight RandomNet’s competitive ability to handle noise, confirming its effectiveness and robustness in noisy scenarios.

Fig. 9: Sensitivity analysis of RandomNet and SPF across varying noise levels

4.8 Finding the optimal number of clusters

In many real-world data mining scenarios, the true number of clusters (k) within the dataset is unknown, so whether the model has the ability to determine the optimal k is crucial. The Elbow Method is a widely accepted heuristic used in determining the optimal k. It entails plotting the explained variation as a function of k and picking the "elbow" of the curve as the optimal k to use.

We apply the Elbow Method to the clustering performed by both k-means and RandomNet on the CBF dataset, which contains three classes. As shown in Fig. 10, RandomNet can find an obvious “elbow” at \(k=3\), whereas for k-means, it is hard to locate a clear “elbow”.
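A generic sketch of this procedure, with synthetic blob data standing in for the CBF instances (in our setting, one would cluster the time series and plot the within-cluster variation instead):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in with three latent groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("explained variation (inertia)")
plt.show()                 # the bend at k = 3 marks the "elbow"
```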

Fig. 10: Elbow Method test for k-means (left) and RandomNet (right) on the CBF dataset

5 Conclusion and future work

In this paper, we introduce RandomNet, a novel method for time series clustering that utilizes deep neural networks with random parameters to extract diverse representations of the input time series for clustering. The data passes through the network only once, and no backpropagation is involved. The selection mechanism and ensemble in the proposed method cancel out irrelevant representations and strengthen relevant ones to provide reliable clustering. Extensive evaluations conducted across all 128 UCR datasets demonstrate competitive accuracy compared to state-of-the-art methods, as well as superior efficiency. Future research directions may involve integrating more complex or domain-specific network structures into our method. Additionally, incorporating some level of training into the framework could potentially improve performance. We also plan to explore the potential of applying our method to multivariate time series and other data types, such as image data.