Abstract
Neural networks are widely used in machine learning and data mining. Typically, these networks need to be trained, implying the adjustment of weights (parameters) within the network based on the input data. In this work, we propose a novel approach, RandomNet, that employs untrained deep neural networks to cluster time series. RandomNet uses different sets of random weights to extract diverse representations of time series and then ensembles the clustering relationships derived from these different representations to build the final clustering results. By extracting diverse representations, our model can effectively handle time series with different characteristics. Since all parameters are randomly generated, no training is required during the process. We provide a theoretical analysis of the effectiveness of the method. To validate its performance, we conduct extensive experiments on all of the 128 datasets in the well-known UCR time series archive and perform statistical analysis of the results. These datasets have different sizes, sequence lengths, and they are from diverse fields. The experimental results show that the proposed method is competitive compared with existing state-of-the-art methods.
1 Introduction
Neural networks serve as fundamental learning models across disciplines such as machine learning, data mining, and artificial intelligence. Typically, these networks go through a training phase during which their parameters are tuned according to specific learning rules and the data provided. A popular training paradigm involves backpropagation for optimizing an objective function. Once trained, these networks can be deployed for a variety of tasks, including classification, clustering, and regression.
A time series is a real-valued ordered sequence. Time series clustering assigns time series instances to homogeneous groups. It is one of the most important and challenging tasks in time series data mining and has been applied in various fields such as finance (Kumar et al. 2002), biology (Subhani et al. 2010; Fujita et al. 2012), climate (Steinbach et al. 2003), and medicine (Wismüller et al. 2002). In this work, we consider the partitional clustering problem, wherein the given time series instances are grouped into pairwise-disjoint clusters.
Existing time series clustering methods achieve good performance (Paparrizos and Gravano 2015; Petitjean et al. 2011; Li et al. 2019), but since they form clusters based on a single focus, such as shape or point-to-point distance, they are suboptimal for some specific data types. Here, we introduce a novel method named RandomNet for time series clustering using untrained deep neural networks. Different from conventional training methods that adjust network weights (parameters) using backpropagation, RandomNet utilizes different sets of random parameters to extract various representations of the data. By extracting diverse representations, it can effectively handle time series with different characteristics. These representations are clustered; the results from the clusters are then selected and ensembled to produce the final clustering. This approach ensures that data only needs to pass through the networks once to obtain the final result, obviating the need for backpropagation. Therefore, the time complexity of RandomNet is linear in the number of instances in the dataset, providing a more efficient solution for time series clustering tasks.
Given a neural network, the various sets of parameters in the network can be thought of as performing different types of feature extraction on the input data. As a result, these varied parameters can generate diverse data representations. Some of these representations may be relevant to a clustering task, producing meaningful clusterings, while others may be less useful or entirely irrelevant, leading to less accurate or meaningless clustering. This concept forms the basis of RandomNet: by combining clustering results derived from all these diverse representations, the meaningful and latent group structure within the data can be discovered. This is because the noise introduced by irrelevant representations tends to cancel each other out during the ensemble process whereas the connections provided by relevant representations are strengthened. Therefore, efficient and reliable clustering can be achieved despite the randomness of the network parameters.
To demonstrate the effectiveness of RandomNet, we provide theoretical analysis. The analysis shows that RandomNet has the ability to effectively identify the latent group structure in the dataset as long as the ensemble size is large enough. Moreover, the analysis also provides a lower bound for the ensemble size. Notably, this lower bound is independent of the number of instances or the length of the time series in the dataset, given that the data in the dataset are generated from the same mechanism. This provides the ability to use a fixed, large ensemble size to achieve satisfactory results, offering a practical approach to time series clustering that does not need adjustment for different dataset sizes or time series lengths.
We conduct extensive experiments on all 128 datasets in the well-known UCR time series archive (Dau et al. 2019) and perform statistical analysis on the results. These datasets have different sizes, sequence lengths, and characteristics. The results show that RandomNet achieves the best average Rand Index among the compared state-of-the-art methods and ranks among the top three methods in every data category evaluated.
The main contributions of the paper are summarized as follows:
- We propose RandomNet, a novel method for time series clustering using untrained neural networks with random weights. There is no training or backpropagation in the method.
- We demonstrate the effectiveness of the proposed method both empirically and theoretically. We conduct extensive experiments on 128 datasets to evaluate the proposed method and provide statistical analysis of the comparison results to show the superiority of our method over the state-of-the-art methods.
- We demonstrate the efficiency of the proposed method through the experimental evaluation on data of varying sizes and time lengths. The results of linear curve-fitting on the running time indicate that the method has linear time complexity.
2 Background and related work
2.1 Definitions and notations
Definition 1
A time series \(T=[t_1, t_2, \ldots , t_m]\) is an ordered sequence of real-value data points, where m is the length of the time series.
Definition 2
Given a set of time series \(\{T_i\}_{i=1}^n\) and the number of clusters k, the objective of time series clustering is to assign each time series instance \(T_i\) a group label \(c_j\), where \(j \in \{1, \ldots , k\}\). n is the number of instances in the dataset. We would like the instances in the same group to be similar to each other and dissimilar to the instances in other groups.
2.2 Related work
There has been much work on time series clustering, which we categorize into four groups: raw-data-based methods, feature-based methods, deep-learning-based methods, and others.
Raw-data-based methods. The raw-data-based methods directly apply classic clustering algorithms such as k-means (MacQueen et al. 1967) on raw time series. The standard k-means algorithm adopts Euclidean distance to measure the dissimilarity of the instances and often cannot handle the scale-variance, phase-shifting, distortion, and noise in the time series data. To cope with these challenges, dozens of distance measures for time series data have been proposed.
Dynamic Time Warping (DTW) (Berndt and Clifford 1994) is one of the most popular distance measures that can find the optimal alignment between two sequences. It is used in Dynamic time warping Barycenter Averaging (DBA) (Petitjean et al. 2011) which proposes an iterative procedure to refine the centroid in order to minimize the squared DTW distances from the centroids to other time series instances. Similarly, K-Spectral Centroid (KSC) (Yang and Leskovec 2011) proposes a distance measure that finds the optimal alignment and scaling for matching two time series. The centroids are computed, based on matrix decomposition, to minimize the distances between the centroids and the instances under this distance measure. Another approach, k-shape (Paparrizos and Gravano 2015) proposes a shape-based distance measure based on the cross-correlation of two time series. The distance measure shifts the two time series to find the optimal matching. Each centroid is obtained from the optimization of the squared normalized cross-correlation from the centroid to the instances in the cluster.
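To make the alignment idea behind DTW concrete, the following minimal sketch computes a DTW distance with the classic dynamic-programming recurrence. It assumes squared point-wise costs and no warping-window constraint, which is one common formulation and not necessarily the exact variant used by the cited methods; the function name `dtw_distance` is illustrative.

```python
import math

def dtw_distance(a, b):
    """DTW distance between two sequences via dynamic programming.

    Assumption: squared point-wise cost, no warping window,
    square root taken at the end.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                                 cost[i - 1][j - 1])  # both sequences advance
    return math.sqrt(cost[n][m])

x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape as x, delayed by one step
print(dtw_distance(x, y))           # the warping path absorbs the delay
```

Because the alignment may match one point to several, the two sequences may have different lengths, which a point-to-point Euclidean distance cannot handle directly.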
Feature-based methods. Feature-based methods transform the time series into flat, unordered features, and then apply classic clustering algorithms to the transformed data.
Zakaria et al. (2012) propose to calculate the distances from a set of short sequences to the time series instances in the dataset and use the distance values as new features for the respective instances. This set of short sequences, called U-shapelets, is found by enumerating all the subsequences in the data to best separate the instances. K-means is then applied to the new features for clustering. In the work by Zhang et al. (2016), instead of enumerating the subsequences, the shapelets are learned by optimizing an objective function with gradient descent.
A recent work (Lei et al. 2019) proposes Similarity PreservIng RepresentAtion Learning (SPIRAL) to sample pairs of time series to calculate their DTW distances and build a partially-observed similarity matrix. The matrix is an approximation for the pair-wise DTW distances matrix in the dataset. The new features are generated by solving a symmetric matrix factorization problem such that the inner product of the new feature matrix can approximate the partially-observed similarity matrix.
Deep-learning-based methods. Many methods in this category adopt the autoencoder architecture for clustering. In an autoencoder, the low-dimensional hidden-layer output is used as features for clustering. Among these, Improved Deep Embedded Clustering (IDEC) (Guo et al. 2017) improves on the autoencoder by adding an extra layer to the model. It not only employs a reconstruction loss but also optimizes a clustering loss specifically designed to preserve the local structure of the data. This dual loss strategy can capture the global structure and local differences, thereby improving the clustering process to better learn the inherent characteristics of the data.
Deep Temporal Clustering (DTC) (Madiraju et al. 2018) specifically addresses time series clustering by using Mean Square Error (MSE) to measure the reconstruction loss, and Kullback–Leibler (KL) divergence to measure clustering loss. Similarly, Deep Temporal Clustering Representation (DTCR) (Ma et al. 2019) adopts MSE for the reconstruction loss, while it uses a k-means objective function to measure the clustering loss. DTCR also employs a fake-sample generation strategy to augment the learning process. Clustering Representation Learning on Incomplete time-series data (CRLI) (Ma et al. 2021) further studies the problem of clustering time series with missing values. It jointly optimizes the imputation and clustering process, aiming to impute more discriminative values for clustering and to make the learned representations possess a good clustering property.
In the broader neural network literature, there is a class of methods that also use random weights known as the Extreme Learning Machine (ELM) (He et al. 2014; Peng et al. 2016; Wu et al. 2018), which uses a single-layer feed-forward network to map inputs into a new feature space. The hidden layer weights are set randomly but the output weights are trained. The idea is to find a mapping space where instances of different classes can be separated well.
In the domain of time series classification, ROCKET (Dempster et al. 2020), MiniRocket (Dempster et al. 2021) and MultiRocket (Tan et al. 2022) adopt strategies involving the use of random weights to generate features for classification. They use multiple single-layer convolution kernels instead of a deep network architecture.
Beyond the neural network and clustering fields, several works also adopt randomized features or feature maps (Rahimi et al. 2007; Chitta et al. 2012; Farahmand et al. 2017). However, it is worth noting that all these methods diverge from our proposed approach in their network structures. Moreover, none of these methods incorporates ensemble learning, which forms the core of our approach. To the best of our knowledge, we are the first to propose using a network with random weights in time series clustering.
Other methods. In our previous work (Li et al. 2019), we present a Symbolic Pattern Forest (SPF) algorithm for time series clustering, which adopts Symbolic Aggregate approXimation (SAX) (Lin et al. 2007) to transform time series subsequences into symbolic patterns. Through iterative selections of random symbolic patterns to divide the dataset into two distinct branches based on the presence or absence of the pattern, a symbolic pattern tree is constructed. Repeating this process forms a symbolic pattern forest, the ensemble of which produces the final clustering result.
3 The proposed method
3.1 Architecture and algorithm
Figure 1 shows the architecture of RandomNet. The method is structured with B branches, each containing a CNN-LSTM block, designed to capture both spatial and temporal dependencies of time series, followed by k-means clustering. Each CNN-LSTM block contains multiple CNN groups and an LSTM network, and each CNN group consists of a one-dimensional convolutional layer, a Rectified Linear Unit (ReLU) layer, and a pooling layer. The output of the CNN groups is flattened. In our experiments, we set the number of CNN groups equal to \(\log _2{m}\), where m represents the length of the time series. We fix the number of filters of the 1D convolution to 8, the filter size to 3, and the pooling size to 2. We set the number of LSTM units to 8. The weights used within the network are randomly chosen from \(\{-1, 0, 1\}\). We opt for this finite parameter set over a continuous interval (e.g., \([-1, 1]\)) for the purpose of simplifying the parameter space.
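As a rough illustration of how one branch turns a time series into features without any training, the sketch below implements only the stacked CNN groups (1D convolution with 8 filters of size 3, ReLU, max pooling of size 2) with weights drawn from \(\{-1, 0, 1\}\). It is a simplified pure-Python sketch, not the authors' TensorFlow implementation: the LSTM is omitted, and the zero padding, the absence of bias terms, the rounding of \(\log _2{m}\) down to an integer, and the helper names `make_branch` and `extract` are all assumptions made for illustration.

```python
import math
import random

def conv_relu_pool(x, weights):
    """One CNN group: 1D convolution (kernel 3), ReLU, max pooling (size 2).

    x is a list of channels, each a list of floats of equal length.
    weights[f][c][j] is the ternary weight of filter f, input channel c, tap j.
    """
    length = len(x[0])
    out = []
    for filt in weights:
        activ = []
        for t in range(length):
            s = 0.0
            for c, taps in enumerate(filt):
                for j, w in enumerate(taps):
                    idx = t + j - 1          # zero padding at both ends (assumed)
                    if 0 <= idx < length:
                        s += w * x[c][idx]
            activ.append(max(s, 0.0))        # ReLU
        if length > 1:                       # non-overlapping max pooling, window 2
            activ = [max(activ[i], activ[i + 1]) for i in range(0, length - 1, 2)]
        out.append(activ)
    return out

def make_branch(m, rng, n_filters=8, kernel=3):
    """Draw one fixed set of random ternary weights for series of length m."""
    n_groups = max(1, int(math.log2(m)))     # rounding down is an assumption
    branch, in_channels = [], 1
    for _ in range(n_groups):
        branch.append([[[rng.choice((-1, 0, 1)) for _ in range(kernel)]
                        for _ in range(in_channels)]
                       for _ in range(n_filters)])
        in_channels = n_filters
    return branch

def extract(series, branch):
    """Pass one series through all CNN groups of a branch and flatten the output."""
    x = [list(series)]
    for weights in branch:
        x = conv_relu_pool(x, weights)
    return [v for channel in x for v in channel]

rng = random.Random(42)
series_a = [math.sin(0.4 * t) for t in range(16)]
series_b = [math.cos(0.4 * t) for t in range(16)]
branch = make_branch(len(series_a), rng)     # the same weights serve every series
features_a = extract(series_a, branch)
features_b = extract(series_b, branch)
print(len(features_a))
```

The weights are drawn once per branch and reused for every series in the dataset, so that the features of different instances are comparable; a different branch simply draws a different weight set.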
Each branch produces its own clustering; however, some clusterings might be skewed or deviant due to the inherent randomness of the weights. To alleviate this problem, we propose a selection mechanism to remove any clusterings that contain clusters that are either too small or too large.
Concretely, the method sets a lower bound lr and an upper bound ur for the cluster size. For each clustering, the number of instances that violate the bounds is counted as its number of violations. For example, suppose a clustering contains two clusters with sizes 40 and 52, respectively. If the lower bound is 5 and the upper bound is 50, then the number of violations for this clustering is \(52-50=2\). The clusterings are sorted according to the number of violations, and the method selects the S clusterings with the fewest violations for the ensemble. Here, \(S=\max (zv, sr \times B)\), where zv is the number of clusterings with zero violations, sr is a selection rate, and B is the number of branches in the method.
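The selection step can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 2: it assumes that a cluster smaller than the lower bound contributes the shortfall (lower bound minus size) to the violation count, mirroring the stated treatment of oversized clusters, and that \(sr \times B\) is truncated to an integer. The function names are hypothetical.

```python
from collections import Counter

def count_violations(labels, lower, upper):
    """Number of instances by which cluster sizes fall outside [lower, upper].

    Assumption: undersized clusters count (lower - size), by symmetry with
    the oversized case (size - upper) described in the text.
    """
    sizes = Counter(labels).values()
    return sum(max(0, size - upper) + max(0, lower - size) for size in sizes)

def select_clusterings(clusterings, lower, upper, selection_rate):
    """Keep the S clusterings with the fewest violations, S = max(zv, sr * B)."""
    violations = [count_violations(c, lower, upper) for c in clusterings]
    zero_violation = sum(v == 0 for v in violations)
    n_selected = max(zero_violation, int(selection_rate * len(clusterings)))
    order = sorted(range(len(clusterings)), key=lambda i: violations[i])
    return [clusterings[i] for i in order[:n_selected]]

# The example from the text: cluster sizes 40 and 52, bounds 5 and 50.
example = [0] * 40 + [1] * 52
print(count_violations(example, lower=5, upper=50))

clusterings = [example, [0] * 46 + [1] * 46, [0] * 2 + [1] * 90]
selected = select_clusterings(clusterings, lower=5, upper=50, selection_rate=0.5)
print(len(selected))
```

Taking the maximum with zv means that every clustering without violations is kept, even when that exceeds the quota set by the selection rate.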
Finally, we ensemble the results to form the final clustering. While the diversity of clustering results from a large number of different branches helps reveal various intrinsic patterns in the data, it introduces the challenge of combining these different results into a cohesive unified clustering. To address this challenge, we adopt the Hybrid Bipartite Graph Formulation (HBGF) (Fern and Brodley 2004) to perform clustering ensemble. This technique builds a bipartite graph for the clusterings in the ensemble, where the instances and clusters become the vertices. If an instance belongs to a cluster, then there is an edge connecting the two respective vertices in the graph. Partitioning the graph gives a consensus clustering for the ensemble. HBGF has two main advantages. First, it can extract consensus from differences, identifying and strengthening the repeated patterns of grouping across the clustering set. Second, it has linear time complexity, which ensures the scalability of our model for large datasets. In our implementation, we use the Metis library (Karypis and Kumar 1998) to partition the graph.
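The bipartite graph described above can be represented by an instance-by-cluster incidence matrix. The sketch below only builds that matrix and shows how shared cluster vertices encode agreement between instances; the graph partitioning step (done with Metis in the paper) is deliberately left out. The function name and the toy clusterings are illustrative assumptions.

```python
def build_bipartite_incidence(clusterings):
    """Return (A, columns) for the instance/cluster bipartite graph.

    A[i][c] = 1 if instance i is connected to cluster vertex c.
    columns[c] = (clustering index, cluster label) identifying vertex c.
    """
    n = len(clusterings[0])
    columns = []
    for b, labels in enumerate(clusterings):
        for label in sorted(set(labels)):
            columns.append((b, label))
    col_index = {col: j for j, col in enumerate(columns)}
    A = [[0] * len(columns) for _ in range(n)]
    for b, labels in enumerate(clusterings):
        for i, label in enumerate(labels):
            A[i][col_index[(b, label)]] = 1
    return A, columns

def shared_clusters(A, i, j):
    """Number of cluster vertices adjacent to both instance i and instance j."""
    return sum(a * b for a, b in zip(A[i], A[j]))

# Three toy clusterings of four instances (labels are arbitrary per clustering).
clusterings = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
A, columns = build_bipartite_incidence(clusterings)
print(len(columns))              # one vertex per cluster of every clustering
print(shared_clusters(A, 2, 3))  # instances 2 and 3 agree in all three clusterings
print(shared_clusters(A, 0, 2))  # instances 0 and 2 never share a cluster
```

Each instance has exactly one edge per clustering, so the number of edges grows linearly with the number of instances, which is consistent with the linear-time property noted above. Cluster labels need not be aligned across clusterings, since every cluster is its own vertex.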
Algorithm 1 gives the pseudo-code of RandomNet. Given a time series dataset \(D=\{T_i\}_{i=1}^n\), a branch number B, a cluster number k, bounds lr and ur, and a selection rate sr, the algorithm outputs a clustering assignment C for the input time series.
In Algorithm 1 Line 4, the parameters in the CNN-LSTM blocks are randomly set from \(\{-1, 0, 1\}\) as previously noted. The data passes the CNN-LSTM blocks to generate features for each time series in Line 5. Line 6 applies k-means on the features to produce a clustering assignment. Line 7 adds the clustering to the ensemble set. In Line 9, the selection mechanism (Algorithm 2) introduced above is performed on the ensemble set with the user-provided selection rate and bounds. Finally in Line 10, the ensemble function (Algorithm 3) ensembles the clusterings in SelectedSet and gives the clustering C as the output of the algorithm.
3.2 Effectiveness of RandomNet
Given the network architecture, its parameters (weights) represent a form of feature extraction from the data and thus produce a kind of representation. With multiple random parameters, we can have multiple representations.
Some representations are relevant to the clustering task. The instances that are similar to each other are more likely to be put in the same cluster under these relevant representations. Other representations are irrelevant to the clustering task. Under these representations, two similar instances may not be assigned in the same cluster.
The intuition is that, in the ensemble, the effects of irrelevant representations cancel each other out, and the effect of relevant representations dominates the ensemble. Inspired by Li et al. (2019), described in Sect. 2.2, we provide an effectiveness analysis for RandomNet.
We assume the data contains k distinct clusters which correspond to k different classes. We have the following theorem:
Theorem 1
Assume two instances, \(T_1\) and \(T_2\), are from the same class. If they reside within the same cluster under some relevant representations, then RandomNet assigns these two instances to the same cluster in the final output.
Proof
Let \(\gamma\) denote the proportion of relevant representations among all the representations. In each CNN-LSTM block, if the representation is relevant, we have \(P(C(T_1)=C(T_2))=1\), where \(P(\cdot )\) stands for the probability and \(C(\cdot )\) denotes the clustering assignment. If the representation is irrelevant, the instances are assigned to any of the k clusters randomly. Hence, we can deduce: \(P(C(T_1)=C(T_2))=1/k\), and \(P(C(T_1) \ne C(T_2))=(k-1)/k\). Considering the above, overall we can derive that \(P(C(T_1)=C(T_2))= \gamma \times 1 + (1- \gamma ) \times 1/k\) and \(P(C(T_1) \ne C(T_2))= (1- \gamma ) \times (k-1)/k\). Comparing the two expressions gives \(P(C(T_1)=C(T_2))> P(C(T_1) \ne C(T_2))\) whenever \(\gamma > (k-2)/(2(k-1))\); in particular, for \(k=2\) this holds for any \(\gamma > 0\). Each block is independent of the others. The law of large numbers (Révész 2014) states that if we repeat an experiment independently a large number of times, the average of the results obtained from those experiments will converge to the expected value. Therefore, we have:
\[Count(C(T_1)=C(T_2)) > Count(C(T_1) \ne C(T_2)) \qquad (1)\]
where we consider a sufficiently large ensemble size and \(Count(\cdot )\) is the count of occurrences. Consequently, in the ensemble result, instances \(T_1\) and \(T_2\) belong to the same cluster. \(\square\)
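The probability model used in the proof can be checked numerically. The following sketch simulates b independent branches for \(k=2\): with probability \(\gamma\) a branch is relevant and places \(T_1\) and \(T_2\) together, otherwise both receive uniformly random labels. The parameter values and the function name are illustrative assumptions, not values from the paper's experiments.

```python
import random

def simulate_same_cluster_votes(gamma, k, b, rng):
    """Count branches that put T1 and T2 in the same cluster under the proof's model."""
    same = 0
    for _ in range(b):
        if rng.random() < gamma:
            same += 1                                  # relevant: always together
        elif rng.randrange(k) == rng.randrange(k):
            same += 1                                  # irrelevant: together with prob. 1/k
    return same

rng = random.Random(0)
gamma, k, b = 0.3, 2, 20000
same = simulate_same_cluster_votes(gamma, k, b, rng)
expected = gamma + (1 - gamma) / k                     # P(C(T1) = C(T2)) from the proof
print(same / b, expected)
print(same > b - same)                                 # majority of branches agree
```

With a large b, the empirical fraction of agreeing branches settles near \(\gamma + (1-\gamma )/k\), which is the law-of-large-numbers step the proof relies on.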
The above analysis assumes we have a large ensemble size, and the following theorem provides a lower bound for the ensemble size. Here, for simplicity, we set \(k=2\).
Theorem 2
Assume the ensemble size to be b. Then, the lower bound of b needed to provide a good clustering is given by \(-2 \ln \alpha / \gamma ^2\), where \(\gamma\) represents the percentage of relevant representations and \(1-\alpha\) is the confidence level.
Proof
Let Y be a random variable indicating the number of cases where \(C(T_1)=C(T_2)\). The random variable Y follows a binomial distribution:
\[Y \sim B(b, p)\]
where \(p=P(C(T_1)=C(T_2))\). Equation (1) needs to hold with high probability, leading to the following inequality:
\[P(Y > s) \ge 1-\alpha , \quad \text{i.e.,} \quad P(Y \le s) \le \alpha\]
where \(s=b/2\) and \(1-\alpha\) is the confidence level. By applying Hoeffding’s inequality (Hoeffding 1994), we have \(P(E[\bar{Y}]-\bar{Y} \ge t) \le e^{-2bt^2}\), where \(\bar{Y}=Y/b\) and \(t \ge 0\). Considering \(E[\bar{Y}]=p\), we have:
\[P(Y \le (p-t)b) \le e^{-2bt^2}\]
Let \(s=bp-bt\), then \(t=(bp-s)/b\), so we get \(P(Y \le s) \le e^{-2(bp-s)^2/b} \le \alpha\). With \(s=b/2\) and \(p=\gamma \times 1 + (1- \gamma ) \times 1/2\), we have \(bp-s=b\gamma /2\), so the inequality becomes \(e^{-b\gamma ^2/2} \le \alpha\), from which we derive \(b \ge -2 \ln \alpha / \gamma ^2\). \(\square\)
Here is a concrete example for the bound: suppose we have a confidence level of 99% and we estimate that 30% of the representations are relevant. In this case, \(\alpha =0.01\) and \(\gamma =0.3\), which gives \(b \ge 102.34\), i.e., an ensemble of at least 103 branches. From the theorem, one observes that the lower bound is independent of the number of instances in the dataset. This implies that we can maintain a sufficiently large fixed ensemble size to handle inputs of varying sizes, provided the data generation mechanism remains constant. We verify this in the experimental section by using a fixed ensemble size chosen through experiments and varying the number of time series instances that are generated from the same mechanism.
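The arithmetic of this example is easy to reproduce. The short sketch below evaluates the bound \(-2 \ln \alpha / \gamma ^2\) from Theorem 2 (stated for \(k=2\)); the function name is an illustrative choice.

```python
import math

def min_ensemble_size(alpha, gamma):
    """Lower bound on the ensemble size b from Theorem 2 (k = 2)."""
    return -2.0 * math.log(alpha) / gamma ** 2

bound = min_ensemble_size(alpha=0.01, gamma=0.3)
print(bound)             # real-valued bound
print(math.ceil(bound))  # smallest integer ensemble size satisfying it

# The bound grows quickly as the share of relevant representations shrinks.
for gamma in (0.5, 0.3, 0.1):
    print(gamma, math.ceil(min_ensemble_size(0.01, gamma)))
```

Since the bound scales with \(1/\gamma ^2\), halving the proportion of relevant representations roughly quadruples the ensemble size needed for the same confidence level.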
4 Experimental evaluation
4.1 Experimental setup
To evaluate the effectiveness of RandomNet, we run the algorithm on all 128 datasets from the well-known UCR time series archive (Dau et al. 2019). These datasets come from different disciplines with various characteristics. Each dataset in the archive is split into a training set and a testing set. We fuse the two sets and utilize the entire dataset in the experiment. Some of these datasets contain varying-length time series. To ensure that all time series in a dataset have the same length, we append zeros at the end of the shorter series.
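The padding step for varying-length datasets can be sketched as below. This is a minimal illustration that assumes each series is a plain list of floats; the function name is hypothetical.

```python
def pad_to_common_length(dataset, pad_value=0.0):
    """Append zeros to the end of shorter series so all share the maximum length."""
    target = max(len(series) for series in dataset)
    return [list(series) + [pad_value] * (target - len(series)) for series in dataset]

dataset = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
padded = pad_to_common_length(dataset)
print(padded)
```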
For benchmarking purposes, we run kDBA (Petitjean et al. 2011), KSC (Yang and Leskovec 2011), k-shape (Paparrizos and Gravano 2015), SPIRAL (Lei et al. 2019), and SPF (Li et al. 2019) on the same datasets. These methods are used as representatives of the state of the art for time series clustering. Additionally, we include two deep-learning-based methods, Improved Deep Embedded Clustering (IDEC) (Guo et al. 2017) and DTC (Madiraju et al. 2018), for comparison. While DTC is specifically designed for time series data, as discussed in Sect. 2.2, IDEC is a general clustering method. We also compare our method with ROCKET (Dempster et al. 2020) and its variants, MiniRocket (Dempster et al. 2021) and MultiRocket (Tan et al. 2022), since we are interested in how other models that also use random parameters compare to ours. As they are all specifically designed for time series classification, we adapt them to our use case by removing the classifier component and replacing it with k-means. All references to them pertain to this adapted version. We do not include DTCR (Ma et al. 2019) in the comparison, as we are unable to reproduce the results reported in its paper, despite using the code provided by its authors (footnote 1). This issue has been similarly reported by others on the GitHub issue webpage for the project (footnote 2). We do not include CRLI (Ma et al. 2021) since it is specially designed for incomplete time series data, which is outside the scope of our study. Table 1 provides a concise comparison of our method and the baselines used in the experiments, outlining their applicable data types, method types, main focuses, and time complexity in terms of the number of instances (n) and the length of time series (l). Note that due to the complexity involved in training deep learning models, we do not include the time complexity for the two deep learning methods, DTC and IDEC, which require network training.
For a more detailed description of each method, please refer to Sect. 2.2. We provide the experimental evaluation for the time complexity of our model in Sect. 4.6.
The source code of kDBA, KSC, and k-shape is obtained from the authors of k-shape. The source code of SPIRAL (footnote 3), SPF (footnote 4), IDEC (footnote 5), DTC (footnote 6), ROCKET (footnote 7), MiniRocket (footnote 8) and MultiRocket (footnote 9) is available online. The number of clusters k is set to equal the number of classes in the datasets, and we follow the default parameter settings in the source code.
For the default hyperparameters of RandomNet, the number of branches B is set to 800, the selection rate sr is set to 0.1, the lower bound lr is set to \(0.3 \times acs\), and the upper bound ur is set to \(1.5 \times acs\), where acs refers to the average cluster size, which is computed as \(round(number\_of\_instances/k)\). Detailed hyperparameter selection experiments are elaborated upon in the subsequent section.
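As a small worked example of these defaults, the sketch below derives the cluster-size bounds from the dataset size. The dataset size of 200 instances with 4 classes is a made-up illustration, and Python's built-in `round` is assumed to stand in for the rounding used in the paper.

```python
def default_bounds(n_instances, k, lower_mult=0.3, upper_mult=1.5):
    """Return (acs, lr, ur) using the default multipliers 0.3 and 1.5."""
    acs = round(n_instances / k)          # average cluster size
    return acs, lower_mult * acs, upper_mult * acs

acs, lr, ur = default_bounds(n_instances=200, k=4)   # hypothetical dataset
print(acs, lr, ur)

B, sr = 800, 0.1
print(int(sr * B))   # minimum number of clusterings kept by the selection step
```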
We implement RandomNet using Python and TensorFlow 2.1 and use the k-means implementation in the Scikit-learn package (Pedregosa et al. 2011) with default settings. The experiments are run on a node of a batch-processing cluster. The node uses a 2.6 GHz CPU and 64 GB RAM. Given that the process does not involve neural network training, there is no necessity for GPU utilization. The source code of RandomNet is available at: https://github.com/Jackxiini/RandomNet.
All the datasets in the experiments have labels that can be used as ground truth. We use Rand Index (Rand 1971) to measure the clustering accuracy of the methods under comparison. The range of the Rand Index falls in [0, 1] where a large value indicates that the clustering matches the actual class relationship well.
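For reference, the Rand Index counts the fraction of instance pairs on which the clustering and the ground truth agree, meaning both place the pair together or both place it apart. The following is a direct quadratic-time sketch of that definition, written for clarity and not efficiency; the function name is illustrative.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of instance pairs on which the two labelings agree."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred   # both together or both apart
        pairs += 1
    return agree / pairs

# Cluster labels are arbitrary: a relabeled perfect clustering still scores 1.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because only pairwise co-membership matters, the measure does not require matching cluster identifiers to class labels.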
4.2 Hyperparameter analysis
To fine-tune and investigate the influence of hyperparameters on model performance, we conduct a series of experiments. We select 20 datasets from the UCR time series archive (Dau et al. 2019) and run each experiment 10 times and take the average Rand Index as the result.
Number of branches. The number of branches B plays a pivotal role in our model, affecting both the quality of the clustering and the computational efficiency. We test B values ranging from 100 to 1000, in increments of 100, and keep all other settings default.
The left plot of Fig. 2 shows that the average Rand Index improves with an increase in the number of branches until \(B=800\). Increasing the number of branches beyond 800 only results in a rise in running time, without contributing to better clustering quality. Therefore, we set the default B for all datasets to 800.
Selection rate. The selection rate sr controls the lower bound of the number of selected clusterings. We test sr values ranging from 0.1 to 1, in increments of 0.1, and keep all other settings default.
The middle plot of Fig. 2 shows slight changes in the average Rand Index as sr changes. Since a larger sr will increase the running time of the model, we choose \(sr=0.1\) as the default value.
Lower bound and upper bound. The lower bound lr and the upper bound ur determine the number of violations, which affects the quality of clustering. We evaluate three pairs of lr and ur, (0.1, 1.8), (0.3, 1.5), and (0.5, 1.2), representing wide, intermediate, and narrow intervals, respectively. For simplicity, we present these as multipliers; the actual lower and upper bounds are obtained by multiplying these values with the average cluster size acs. Narrower intervals are more restrictive on cluster sizes and thus will increase the number of violations. We keep all other settings as default.
The right plot of Fig. 2 shows the effects of the wide interval, intermediate interval, and narrow interval on the Rand Index. We can observe that appropriate intervals can bring better performance. The intermediate interval has the best Rand Index, whereas the wide interval underperforms the other two because its lax size constraints weaken its ability to detect violations. Therefore, we set lr to \(0.3 \times acs\) and ur to \(1.5 \times acs\) as default.
Selection bias test. Since the dataset selection for hyperparameter tuning may cause selection bias, we conduct an experiment to show how the choice affects the experimental results. We divide the 128 datasets into five similar-sized groups, optimizing three hyperparameters separately for each. We then apply the optimized hyperparameters to the remaining datasets in each group. For each hyperparameter, we average the Rand Index for each group, and obtain the average and the standard deviation of the Rand Index of the five groups. As illustrated in Table 2, the standard deviations of the average Rand Index across the five groups, 0.0073, 0.0097 and 0.0078, are very small, indicating that the selection of datasets for hyperparameter tuning has a minimal effect on the final results. Therefore, we retain the selection of the previous 20 datasets in subsequent experiments.
4.3 Experimental results
Since we use 20 of the 128 datasets to select the hyperparameters, for the sake of fairness, we remove them in the following comparison and only show the results for the remaining 108 datasets.
Comparison with k-means.
As RandomNet uses k-means to generate clustering assignments, we are interested in how they compare. We also run k-means 800 times and use HBGF (Fern and Brodley 2004) to ensemble the results, which is denoted as kmeansE.
We run the methods under comparison on the 108 datasets and record the Rand Index. Figure 3 presents a critical difference diagram (Demšar 2006) for the comparison based on Rand Index. The values adjacent to each method represent the respective average rank (with smaller being better), and the methods connected by a thick bar do not significantly differ at the 95% confidence level. Notably, there is no thick bar present in Fig. 3, suggesting all methods have significant differences from each other.
As seen in the figure, RandomNet significantly outperforms both k-means and kmeansE. It is noteworthy that kmeansE is significantly better than the standard k-means, indicating that employing ensemble methods can substantially improve the performance of time series clustering, even for the naive method that uses the original representation of time series. Comparing RandomNet with kmeansE further demonstrates that using the proposed deep neural network with random parameters for generating representations can indeed enhance the accuracy of k-means clustering and ensembles.
Comparison with ROCKET and its variants.
We compare our method with ROCKET and its variants, MiniRocket and MultiRocket. Note that we remove the classifier components in these ROCKET variants and replace them with k-means to adapt them to our use case. As shown in Fig. 4, RandomNet outperforms all ROCKET variants in terms of average rank, and is significantly better than ROCKET and MultiRocket. This reflects the advantage of RandomNet, which is specially designed for time series clustering, over other models based on random parameters. It is worth noting that MiniRocket is the best model among the ROCKET variants. Therefore, we keep only MiniRocket in subsequent experiments.
Comparison with the state of the art. Tables 3 and 4 present the experimental results of RandomNet compared to state-of-the-art methods. The best results for each dataset are highlighted in bold. We provide the average Rand Index and average rank for each method. The \(\dag\) symbol indicates that the dataset is used for hyperparameter selection. Consequently, the results from these datasets have not been included in the computation of the average Rand Index and average rank. As the results illustrate, RandomNet achieves the highest average Rand Index and the best average rank among all the compared methods.
Figure 5 depicts the critical difference diagram of the comparison between RandomNet and the state-of-the-art methods. The figure demonstrates that RandomNet significantly outperforms k-means, KSC, kDBA, SPIRAL, and two deep learning-based methods, IDEC and DTC. It also shows that our proposed method is slightly better than MiniRocket, k-shape and SPF. These results support the claim that RandomNet is a state-of-the-art time series clustering method.
To gain insight into the strengths and weaknesses of the 10 methods compared and to see how each method performs on different types of data, we divide the time series datasets into categories (e.g. sensor, device, motion, spectro). For each dataset, we rank the results of the 10 methods as in the previous comparisons (1 is the best and 10 is the worst), and compute the average rank of each method for each category. The results are shown in Table 5. We only include categories with at least five datasets, and the best-performing method for each category is in bold. Next, we rank these average ranks within each category (e.g. for Device, our method is ranked 1 since it has the lowest average rank, whereas KSC is ranked 10 since it has the highest average rank). We then average these category-wise ranks and report them in the row Average rank. For example, the ranks of our method for the 7 categories are 1, 1, 1, 3, 2, 1, 1, respectively, giving an average rank of 1.429. The last two rows show the number of categories in which the Rand Index of a method ranks in the top 1 and the top 3, respectively.
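The summary rows of Table 5 can be reproduced from the category-wise ranks with a few lines of arithmetic. The sketch below uses the seven ranks quoted above for our method; the helper `rank_within_category`, its tie-breaking by order of appearance, and the per-category values passed to it are hypothetical illustrations and are not entries of Table 5.

```python
def rank_within_category(avg_rank_by_method):
    """Rank methods in one category by average rank (1 = lowest average rank).

    Ties are broken by order of appearance, which is an assumption.
    """
    ordered = sorted(avg_rank_by_method, key=avg_rank_by_method.get)
    return {method: pos for pos, method in enumerate(ordered, start=1)}


# Category-wise ranks of RandomNet over the 7 categories, as quoted in the text.
randomnet_ranks = [1, 1, 1, 3, 2, 1, 1]
average_rank = sum(randomnet_ranks) / len(randomnet_ranks)
top1_count = sum(r == 1 for r in randomnet_ranks)
top3_count = sum(r <= 3 for r in randomnet_ranks)

# Hypothetical average ranks of three methods in a single category.
example = rank_within_category(
    {"method_a": 3.1, "method_b": 7.9, "method_c": 4.0}
)
```

Here `round(average_rank, 3)` gives 1.429, and the two counts are 5 and 7, matching the statement that the method is best in five categories and among the top three in all seven.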
Our model achieves the best average rank and is the best in five data types. It is also among the top three in all data types. This demonstrates the superiority of our model compared to other models across diverse data types. This can be attributed to its ability to generate diverse representations and its ensemble mechanism, which effectively cancels out irrelevant representations.
In contrast, other methods exhibit varying performance due to their specific focuses such as local shape or point-to-point distance computation, which may limit their effectiveness to only work on certain data types. For example, k-shape ranks ninth on the device data (where RandomNet ranks first), and SPF achieves an intermediate rank (fourth) on both image and motion data (where RandomNet ranks first and third, respectively). These results indicate that while specific models may perform well in certain data types, their performance can be suboptimal in others due to focus limitations.
4.4 Ablation study
To verify the effectiveness of each component in RandomNet, we compare the full RandomNet with four variants on the 108 UCR datasets; the results are shown in Table 6. The four variants are: (1) RandomNet w/ GRU (replaces LSTM with GRU), (2) RandomNet w/o LSTM (removes LSTM), (3) RandomNet w/o LSTM & ReLU (removes LSTM and ReLU), and (4) RandomNet w/o LSTM & ReLU & pooling (removes LSTM, ReLU and pooling).
The results show that the full RandomNet is better than the four variants in both average Rand Index and average rank, reflecting the effectiveness of each part of RandomNet. It is worth noting that pooling is important in the model: removing it significantly increases the running time and degrades performance.
4.5 Visualizing clusters for different methods
Figure 6 shows the 2D embeddings of the Cylinder-Bell-Funnel (CBF) (Saito and Coifman 1994) dataset using t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm (Van der Maaten and Hinton 2008), as well as cluster assignments by k-means, MiniRocket, and RandomNet compared with the true labels. We can see clearly that k-means and MiniRocket both have difficulty distinguishing the blue and green classes, which correspond to the Bell and the Cylinder classes, respectively.
Upon closer examination, we can see why. Figure 7 shows five instances of the CBF time series and their cluster assignments from k-means, MiniRocket, and RandomNet, respectively. All methods successfully group the red time series (Funnel) into one cluster. However, k-means and MiniRocket inaccurately cluster the blue (Bell) and green (Cylinder) time series, whereas RandomNet is able to identify the correct clusters. This is due to k-means’ sensitivity to misalignment in the time series data (e.g. the blue time series), high dimensionality, and noise as it clusters based on Euclidean distances. For MiniRocket, the use of a network with random weights results in many class-independent values in its final representation, which is equivalent to adding noise from its last layer to k-means. In contrast, RandomNet uses the selection mechanism and ensemble, which weakens the influence of irrelevant representation and strengthens relevant representation, making the model more robust.
4.6 Testing the time complexity
In real-world applications, the size of datasets and the length of time series can be huge, making linear time complexity with respect to the number of instances and the length of time series an essential characteristic of any practical model. To test the scalability and effectiveness of our proposed method, we use the CBF dataset (Saito and Coifman 1994) and use the same mechanism to generate datasets of varying sizes. For different time series lengths, we supplement the original time series (length of 128) with random noise to reach the required length. For testing linear complexity w.r.t. the number of instances, the number of instances is set from 200 to 10,000 with a fixed time series length of 100. For testing linear complexity w.r.t. the length of time series, the length is set from 1000 to 10,000 with a fixed dataset size of 120. We run RandomNet 10 times and record the average running time and Rand Index. The outcomes are presented in Fig. 8.
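The length-extension step can be sketched as follows. This is a minimal illustration under stated assumptions: the noise is taken to be zero-mean Gaussian and is appended after the original values, neither of which is specified in the text, and the function name `pad_with_noise` and the placeholder series are hypothetical.

```python
import random


def pad_with_noise(series, target_length, rng, scale=1.0):
    """Extend a series to target_length by appending Gaussian noise.

    Appending at the end and using N(0, scale^2) noise are assumptions.
    """
    if target_length < len(series):
        raise ValueError("target_length is shorter than the series")
    extra = target_length - len(series)
    noise = [rng.gauss(0.0, scale) for _ in range(extra)]
    return list(series) + noise


rng = random.Random(0)           # fixed seed for reproducibility
original = [0.0] * 128           # placeholder for one CBF instance of length 128
padded = pad_with_noise(original, 9000, rng)

# Share of the padded series that comes from the original signal.
signal_fraction = len(original) / len(padded)
```

With a target length of 9000, `signal_fraction` is 128/9000, so only a small fraction of each padded series carries the original signal.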
In the figure, each blue dot represents the average running time for the respective number of instances or length of time series. We perform linear curve-fitting on the results, depicted by the red line. The \(R^2\) value, which is the coefficient of determination of the fit, is 0.9942 for the number of instances and 0.9756 for the length of time series. Both values are close to 1, indicating that the average running time of RandomNet has a strong linear relationship with the number of instances and the length of time series. Moreover, we observe stable Rand Index results across varying input sizes, indicating that our model is not sensitive to the size of the input data. Note, however, that the Rand Index in the right figure drops significantly because of the large amount of added noise (e.g. for the length of 9000, only 1.4% of the time series is non-noise). In the next section, we inject a more reasonable proportion of noise to analyze noise sensitivity.
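The linear fit and its coefficient of determination can be computed with ordinary least squares. The sketch below is a pure-Python illustration; the function name `linear_fit_r2` and the timing values are hypothetical and are not measurements from Fig. 8.

```python
def linear_fit_r2(xs, ys):
    """Least-squares line y = slope * x + intercept and its R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical (input size, running time) pairs, for illustration only.
sizes = [1, 2, 3, 4]
times = [1.1, 1.9, 3.2, 3.8]
slope, intercept, r_squared = linear_fit_r2(sizes, times)
```

An \(R^2\) close to 1 means that a straight line explains nearly all of the variation in running time, which is the criterion used above to argue for linear scaling.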
Table 1 shows that some other models, such as k-means, SPF and MiniRocket, also have linear complexity w.r.t. dataset size and time series length. However, our model is overall more accurate than these methods and has superior performance on all evaluated time series data types.
4.7 Analyzing noise sensitivity
We use three different datasets, SmallKitchenAppliances, ECG200, and FiftyWords, from three different application domains to test the noise sensitivity of the model. These datasets are injected with six levels of random Gaussian noise (scales of 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5). This setting ensures that most values in the time series are valid, unlike in the previous section, where most values are noise. We evaluate the performance of RandomNet against the second-best model, SPF, by running each model 10 times and calculating the average Rand Index.
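The noise-injection protocol can be sketched as follows, assuming that each "scale" is the standard deviation of zero-mean Gaussian noise added independently to every time point; this interpretation, the function name `add_gaussian_noise`, and the toy dataset are assumptions for illustration.

```python
import random

# The six noise levels listed in the text.
NOISE_SCALES = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]


def add_gaussian_noise(dataset, scale, rng):
    """Return a copy of the dataset with N(0, scale^2) noise added per point.

    dataset is a list of time series, each a list of floats.
    Treating scale as a standard deviation is an assumption.
    """
    return [[value + rng.gauss(0.0, scale) for value in series]
            for series in dataset]


rng = random.Random(42)                                   # fixed seed
clean = [[0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, -1.0]]    # toy dataset
noisy_versions = {s: add_gaussian_noise(clean, s, rng) for s in NOISE_SCALES}
```

Each noisy copy would then be clustered 10 times and the Rand Index averaged, as described above.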
As illustrated in Fig. 9, while both models exhibit strong resilience to noise, our model is slightly better than SPF. For the SmallKitchenAppliances dataset, the performance of RandomNet is barely affected as the noise level increases, whereas the performance of SPF decreases more noticeably. On the ECG200 dataset, both models experience small fluctuations in performance at different noise levels, indicating that noise affects them similarly in this case. For the FiftyWords dataset, both models remain highly stable and show minimal performance differences despite the introduced noise.
Overall, these observations highlight RandomNet’s competitive ability to handle noise, confirming its effectiveness and robustness in noisy scenarios.
4.8 Finding the optimal number of clusters
In many real-world data mining scenarios, the true number of clusters (k) within the dataset is unknown, so the ability of a model to determine the optimal k is crucial. The Elbow Method is a widely accepted heuristic for determining the optimal k. It entails plotting the explained variation as a function of k and picking the "elbow" of the curve as the optimal k to use.
We apply the Elbow Method to the clustering performed by both k-means and RandomNet on the CBF dataset, which contains three classes. As shown in Fig. 10, RandomNet can find an obvious “elbow” at \(k=3\), whereas for k-means, it is hard to locate a clear “elbow”.
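The elbow is usually read off the plot by eye, but it can also be located programmatically. The sketch below uses one common heuristic, assumed here for illustration and not taken from the paper: after rescaling both axes to [0, 1], pick the k whose point lies farthest from the straight line joining the first and last points of the curve. The function name `elbow_k` and the cost values are hypothetical.

```python
def elbow_k(ks, costs):
    """Pick the k farthest from the chord joining the curve's end points.

    ks is an increasing list of candidate cluster counts and costs the
    corresponding clustering cost (e.g. within-cluster variation).
    """
    if len(ks) != len(costs) or len(ks) < 3:
        raise ValueError("need at least three (k, cost) pairs")
    x_span = ks[-1] - ks[0]
    y_min, y_max = min(costs), max(costs)
    y_span = y_max - y_min
    if x_span == 0 or y_span == 0:
        raise ValueError("curve is degenerate")
    # Rescale both axes to [0, 1] so neither dominates the distance.
    xs = [(k - ks[0]) / x_span for k in ks]
    ys = [(c - y_min) / y_span for c in costs]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    # The absolute cross product is proportional to the distance from the chord.
    distances = [abs(dx * (y - ys[0]) - dy * (x - xs[0]))
                 for x, y in zip(xs, ys)]
    return ks[distances.index(max(distances))]


# Hypothetical cost curve with a sharp bend at k = 3 (illustration only).
candidate_ks = [1, 2, 3, 4, 5, 6]
toy_costs = [100.0, 60.0, 20.0, 17.0, 15.0, 14.0]
best_k = elbow_k(candidate_ks, toy_costs)
```

On a curve without a clear bend, the distances are all similar and the selected k is unreliable, which corresponds to the case where no clear elbow can be located.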
5 Conclusion and future work
In this paper, we introduce RandomNet, a novel method for time series clustering that utilizes deep neural networks with random parameters to extract diverse representations of the input time series for clustering. The data passes through the network only once, and no backpropagation is involved. The selection mechanism and ensemble in the proposed method cancel out irrelevant representations and strengthen relevant representations to provide reliable clustering. Extensive evaluations conducted across all 128 UCR datasets demonstrate competitive accuracy compared to state-of-the-art methods, as well as superior efficiency. Future research directions may involve integrating more complex or domain-specific network structures into our method. Additionally, incorporating some level of training into the framework could potentially improve performance. We will also explore the potential of applying our method to multivariate time series or other data types, such as image data.
Notes
https://github.com/qianlima-lab/DTCR.
https://github.com/qianlima-lab/DTCR/issues/8.
https://github.com/cecilialeiqi/SPIRAL.
https://github.com/xiaoshengli/SPF.
https://github.com/XifengGuo/IDEC.
https://github.com/FlorentF9/DeepTemporalClustering.
https://github.com/angus924/rocket.
https://github.com/angus924/minirocket.
https://github.com/ChangWeiTan/MultiRocket.
References
Berndt DJ, Clifford J (1994) Using dynamic time warping to find patterns in time series. In: KDD Workshop, vol. 10, pp 359–370. Seattle, WA
Chitta R, Jin R, Jain AK (2012) Efficient kernel clustering using random Fourier features. In: 2012 IEEE 12th international conference on data mining, pp 161–170. IEEE
Dau HA, Bagnall A, Kamgar K, Yeh C-CM, Zhu Y, Gharghabi S, Ratanamahatana CA, Keogh E (2019) The UCR time series archive. IEEE/CAA J Automatica Sinica 6(6):1293–1305
Dempster A, Petitjean F, Webb GI (2020) Rocket: exceptionally fast and accurate time series classification using random convolutional kernels. Data Min Knowl Disc 34(5):1454–1495
Dempster A, Schmidt DF, Webb GI (2021) Minirocket: A very fast (almost) deterministic transform for time series classification. In: Proceedings of the 27th ACM SIGKDD, pp 248–257
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30
Farahmand A-m, Pourazarm S, Nikovski D (2017) Random projection filter bank for time series data. In: NIPS, pp 6562–6572
Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Proceedings of the twenty-first international conference on machine learning, p 36. ACM
Fujita A, Severino P, Kojima K, Sato JR, Patriota AG, Miyano S (2012) Functional clustering of time series gene expression data by Granger causality. BMC Syst Biol 6(1):137
Guo X, Gao L, Liu X, Yin J (2017) Improved deep embedded clustering with local structure preservation. In: IJCAI, pp 1753–1759
He Q, Jin X, Du C, Zhuang F, Shi Z (2014) Clustering in extreme learning machine feature space. Neurocomputing 128:88–95
Hoeffding W (1994) Probability inequalities for sums of bounded random variables. collected Works Wassily Hoeffding 58:409–426
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392
Kumar M, Patel NR, Woo J (2002) Clustering seasonality patterns in the presence of errors. In: Proceedings of the Eighth ACM SIGKDD, pp 557–563. ACM
Lei Q, Yi J, Vaculin R, Wu L, Dhillon IS (2019) Similarity preserving representation learning for time series clustering. In: Proceedings of the 28th international joint conference on artificial intelligence, pp 2845–2851. AAAI Press
Li X, Lin J, Zhao L (2019) Linear time complexity time series clustering with symbolic pattern forest. In: Proceedings of the 28th international joint conference on artificial intelligence, pp 2930–2936. AAAI Press
Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Disc 15(2):107–144
Ma Q, Zheng J, Li S, Cottrell GW (2019) Learning representations for time series clustering. Adv Neural Inf Process Syst 32:3776–3786
Ma Q, Chen C, Li S, Cottrell GW (2021) Learning representations for incomplete time series clustering. Proc AAAI Conf Artif Intell 35(10):8837–8846
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11):2579–2605
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, pp 281–297. Oakland, CA, USA
Madiraju NS, Sadat SM, Fisher D, Karimabadi H (2018) Deep temporal clustering: fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059
Paparrizos J, Gravano L (2015) k-shape: efficient and accurate clustering of time series. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp 1855–1870. ACM
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Peng Y, Zheng W-L, Lu B-L (2016) An unsupervised discriminative extreme learning machine and its applications to data clustering. Neurocomputing 174:250–264
Petitjean F, Ketterlin A, Gançarski P (2011) A global averaging method for dynamic time warping, with applications to clustering. Pattern Recogn 44(3):678–693
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: NIPS, vol. 3, p. 5. Citeseer
Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850
Révész P (2014) The Laws of Large Numbers, vol 4. Academic Press, Cambridge
Saito N, Coifman RR (1994) Local feature extraction and its applications using a library of bases. PhD thesis, Yale University
Steinbach M, Tan P-N, Kumar V, Klooster S, Potter C (2003) Discovery of climate indices using clustering. In: Proceedings of the Ninth ACM SIGKDD, pp 446–455. ACM
Subhani N, Rueda L, Ngom A, Burden CJ (2010) Multiple gene expression profile alignment for microarray time-series data clustering. Bioinformatics 26(18):2281–2288
Tan CW, Dempster A, Bergmeir C, Webb GI (2022) Multirocket: multiple pooling operators and transformations for fast and effective time series classification. Data Min Knowl Disc 36(5):1623–1646
Wismüller A, Lange O, Dersch DR, Leinsinger GL, Hahn K, Pütz B, Auer D (2002) Cluster analysis of biomedical image time-series. Int J Comput Vision 46(2):103–128
Wu L, Chen P-Y, Yen IE-H, Xu F, Xia Y, Aggarwal C (2018) Scalable spectral clustering using random binning features. In: Proceedings of the 24th ACM SIGKDD, pp 2506–2515
Yang J, Leskovec J (2011) Patterns of temporal variation in online media. In: Proceedings of the Fourth ACM international conference on web search and data mining, pp 177–186
Zakaria J, Mueen A, Keogh E (2012) Clustering time series using unsupervised-shapelets. In: 2012 IEEE 12th international conference on data mining, pp 785–794. IEEE
Zhang Q, Wu J, Yang H, Tian Y, Zhang C (2016) Unsupervised feature learning from time series. In: IJCAI, pp 2322–2328
Additional information
Responsible editor: Eamonn Keogh.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, X., Xi, W. & Lin, J. Randomnet: clustering time series using untrained deep neural networks. Data Min Knowl Disc (2024). https://doi.org/10.1007/s10618-024-01048-5
DOI: https://doi.org/10.1007/s10618-024-01048-5