1 Introduction

Cluster analysis, as a core family of unsupervised learning techniques, allows one to discover hidden patterns in data for which the true answer is not known upfront. Its goal is to partition a heterogeneous set of objects into non-overlapping clusters, such that within each cluster any two objects are more related to each other than to objects in other clusters. Given its exploratory nature, clustering nowadays has numerous applications in industry and scientific research, such as biological and medical research (Damian et al. 2007; Zhao et al. 2014; Jääskinen et al. 2014), information technology (Jain et al. 1999; Slonim et al. 2005), signal and image processing (Rubinstein et al. 2013), geology (Juozapavičius and Rapsevicius 2001) and finance (Pavlidis et al. 2006; Bastos and Caiado 2014; Ieva et al. 2016). There exists a rich literature on cluster analysis for random vectors, where the objects to be clustered are sampled from high-dimensional joint distributions, and there is no shortage of such clustering algorithms (Xu and Wunsch 2005). Stochastic processes, however, form quite a different setting from random vectors, since their observations (sample paths) are sampled from process distributions. While cluster analysis on random vectors has developed rapidly, clustering of stochastic processes has received much less attention. Today cluster analysis on stochastic processes deserves increasingly intense study, thanks to its vital importance in many applied areas where the collected data are indexed by time and are especially long. Examples of such time-indexed data include biological data, financial data, marketing data, surface weather data, geological data and video/audio data.

Recall that in the setting of random vectors, a clustering procedure typically consists of two steps:

Step 1 :

One suggests a suitable dissimilarity measure describing the distance between two objects, under which the statement “two objects are close to each other” becomes meaningful.

Step 2 :

One designs a sufficiently accurate and computationally efficient clustering function based on the above dissimilarity measure.

Clustering stochastic processes is performed in a similar way, but new challenges may arise in both Step 1 and Step 2. Intuitively, one can always apply existing random vector clustering approaches to cluster arbitrary stochastic processes, such as non-hierarchical approaches (K-means clustering methods) and hierarchical approaches (agglomerative and divisive methods) (Hartigan 1975), based on “naive” dissimilarity measures (e.g., the Euclidean, Manhattan or Minkowski distance). However, one faces at least two potential risks when applying these approaches to clustering stochastic processes:

Risk 1 :

These approaches might suffer from huge complexity costs, due to the great length of the sample paths. As a result, classical clustering algorithms are often computationally prohibitive (Ieva et al. 2016; Peng and Müller 2008).

Risk 2 :

These approaches might suffer from over-fitting issues. For example, clustering stationary or periodic processes based on the Euclidean distance between their paths, without taking the path properties into account, leads to over-fitting and poorly separated clusters.

In summary, classical dissimilarity measures and clustering strategies may fail when clustering stochastic processes.

Fortunately, the complexity cost and the over-fitting errors of clustering processes can be largely reduced if one exploits the fact that a stochastic process, unlike an arbitrary random vector, often possesses fine path features (e.g., stationarity, Markov property, self-similarity, sparsity, seasonality, etc.). An appropriate dissimilarity measure should then be chosen so as to capture these path features. Clustering is then performed by grouping two sample paths together whenever they are relatively close to each other under that particular dissimilarity measure. Below are some examples from the literature.

Peng and Müller (2008) proposed a dissimilarity measure between two special sample paths of processes. In their setting it is supposed that, for each path, only sparse and irregularly spaced measurements with additional measurement errors are available. Such features occur commonly in longitudinal studies and online trading data. Based on this particular dissimilarity measure, classification and cluster analysis can be performed. Ieva et al. (2016) developed a new algorithm for clustering multivariate and functional data, based on a covariance-based dissimilarity measure. Their attention is focused on the specific case of observations drawn from two populations whose probability distributions have equal means but differ in terms of covariances. Khaleghi et al. (2016) designed consistent algorithms for clustering strict-sense stationary ergodic processes [see the forthcoming Eq. (4) for the definition of strict-sense ergodicity], where the dissimilarity measure is a distance between process distributions. It is worth noting that the consistency of their algorithms is guaranteed thanks to the assumption of strict-sense ergodicity.

In this framework, we aim to design asymptotically consistent algorithms to cluster a general class of stochastic processes, namely wide-sense stationary ergodic processes (see Definition 1 below). Asymptotically consistent algorithms can be obtained in this setting, since covariance stationarity and ergodicity allow each process to exhibit characteristic asymptotic behavior with respect to its sample length, rather than to the total number of paths.

Definition 1

(Wide-sense stationary ergodic process) A stochastic process \(X=\{X_t\}_{t\in T}\) (the time index set T can be either \({\mathbb {R}}_+=[0,+\infty )\) or \({\mathbb {N}}=\{1,2\ldots \}\)) is called wide-sense stationary if its mean and covariance structure are finite and time-invariant: \({\mathbb {E}}(X_t)=\mu \) for all \(t\in T\), and for any finite collection \((X_{i_1},\ldots ,X_{i_r})\), its covariance matrix remains invariant under any time shift \(h>0\):

$$\begin{aligned} {\mathbb {C}}ov(X_{i_1},\ldots ,X_{i_r})= {\mathbb {C}}ov(X_{i_1+h},\ldots ,X_{i_r+h}). \end{aligned}$$

Denote by \(\gamma \) the auto-covariance function of X. Then X is further called weakly ergodic (or wide-sense ergodic) if it is ergodic for the mean and the second-order moment:

  • If X is a continuous-time process (e.g., \(T={\mathbb {R}}_+\)), then it satisfies for any \(s\in {\mathbb {R}}_+\),

    $$\begin{aligned} \frac{1}{h}\int _s^{s+h}X_u\,\mathrm {d}u\xrightarrow [h\rightarrow +\infty ]{a.s.}\mu , \end{aligned}$$

    and

    $$\begin{aligned} \frac{1}{h}\int _s^{s+h}(X_{u+\tau }-\mu )(X_u-\mu )\,\mathrm {d}u\xrightarrow [h\rightarrow +\infty ]{a.s.}\gamma (\tau ),~\text{ for } \text{ all } \tau \in {\mathbb {R}}_+, \end{aligned}$$

    where \(\xrightarrow []{a.s.}\) denotes the almost sure convergence (convergence with probability 1).

  • If X is a discrete-time process (e.g., \(T={\mathbb {N}}\)), then it satisfies for any \(s\in \mathbb N\cup \{0\}\),

    $$\begin{aligned} \frac{X_s+X_{s+1}+\ldots +X_{s+h}}{h+1}\xrightarrow [h\in {\mathbb {N}},~h\rightarrow +\infty ]{a.s.}\mu , \end{aligned}$$

    and

    $$\begin{aligned} \frac{\sum _{u=s}^{s+h}(X_{u+\tau }-\mu )(X_u-\mu )}{h+1}\xrightarrow [h\in {\mathbb {N}},~h\rightarrow +\infty ]{a.s.}\gamma (\tau ),~\text{ for } \text{ all } \tau \in \mathbb N\cup \{0\}. \end{aligned}$$

Wide-sense stationarity and ergodicity together constitute a rather general assumption, at least in the following senses:

  1. 1.

    The assumption that each process is generated by some mean and covariance structure suffices to capture the relevant features of a wide-sense stationary ergodic process. In other words, our algorithms aim to cluster means and auto-covariance functions, not process distributions.

  2. 2.

    The class of wide-sense stationary ergodic processes partially extends that of strict-sense ones. A finite-variance strict-sense stationary ergodic process [see Eq. (4) for its definition] is also wide-sense stationary ergodic. However, strict-sense stationary ergodic stable processes are not wide-sense stationary, because their variances explode (Cambanis et al. 1987; Samorodnitsky 2004).

  3. 3.

    A Gaussian process is fully identified by its mean and covariance structure. Hence a wide-sense stationary ergodic Gaussian process is also strict-sense stationary ergodic.

  4. 4.

    In the clustering problem, the dependency among the sample paths can be arbitrary.

There is a long list of processes which are wide-sense stationary ergodic but not necessarily stationary in the strict sense. The examples of wide-sense stationary ergodic processes below are far from exhaustive.

Example 1

Non-independent White Noise.

Let U be a random variable uniformly distributed over \((0,2\pi )\) and define

$$\begin{aligned} Z(t):=\sqrt{2}\cos (tU),~\text{ for }~t\in {\mathbb {N}}. \end{aligned}$$

The process \(Z=\{Z(t)\}_{t\in {\mathbb {N}}}\) is then a white noise because it verifies

$$\begin{aligned} {\mathbb {E}}(Z(t))=0,~{\mathbb {V}}ar(Z(t))=1~\text{ and }~{\mathbb {C}}ov(Z(s),Z(t))=0,~\text{ for }~s\ne t. \end{aligned}$$

We claim that Z is wide-sense stationary ergodic, which can be obtained by using Kolmogorov's strong law of large numbers, see e.g. Theorem 2.3.10 in Sen and Singer (1993). However Z is not strict-sense stationary since

$$\begin{aligned} (Z(1),Z(2))\ne (Z(2),Z(3))~\text{ in } \text{ law }. \end{aligned}$$

Indeed, it is easy to see that

$$\begin{aligned} 0<{\mathbb {E}}\big (Z(1)^2Z(2)\big )\ne {\mathbb {E}}\big (Z(2)^2Z(3)\big )=0. \end{aligned}$$
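A quick numerical check of the white-noise properties and of the ergodicity for the mean can be carried out as follows. This is a minimal sketch using numpy; the function name sample_Z and all parameter values are our own illustrative choices.

```python
import numpy as np

def sample_Z(n, rng):
    """One sample path (Z(1), ..., Z(n)) with Z(t) = sqrt(2) * cos(t * U), U ~ Uniform(0, 2*pi)."""
    U = rng.uniform(0.0, 2.0 * np.pi)
    t = np.arange(1, n + 1)
    return np.sqrt(2.0) * np.cos(t * U)

rng = np.random.default_rng(0)

# Ensemble check of the white-noise properties: E Z(t) = 0, Var Z(t) = 1, Cov(Z(s), Z(t)) = 0.
paths = np.array([sample_Z(5, rng) for _ in range(100000)])   # independent copies of (Z(1), ..., Z(5))
print(np.round(paths.mean(axis=0), 2))            # should be close to the zero vector
print(np.round(np.cov(paths, rowvar=False), 2))   # should be close to the 5 x 5 identity matrix

# Ergodicity for the mean: the time average of one long path tends to mu = 0.
z = sample_Z(100000, rng)
print(z.mean())                                   # should be close to 0
```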

Example 2

Auto-regressive Models.

It is well-known that an auto-regressive model \(\{Y(t)\}_t\sim AR(1)\) of the form

$$\begin{aligned} Y(t)=aY(t-1)+Z(t),~|a|<1,a\ne 0,~\text{ for }~t\in {\mathbb {N}} \end{aligned}$$
(1)

is wide-sense stationary ergodic. However, it is not necessarily strict-sense stationary ergodic when the joint distributions of the white noise \(\{Z(t)\}_t\) are not invariant under time shifts (e.g., take \(\{Z(t)\}_t\) to be the white noise in Example 1).

Example 3

Increment Process of Fractional Brownian Motion.

Let \(\{B^H(t)\}_t\) be a fractional Brownian motion with Hurst index \(H\in (0,1)\) (see Mandelbrot and van Ness 1968). For each \(h>0\), its increment process \(\{Z^h(t):=B^H(t+h)-B^H(t)\}_t\) is finite-variance strict-sense stationary ergodic (Magdziarz and Weron 2011). As a result it is also wide-sense stationary ergodic. More detail will be discussed in Sect. 4.
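For readers who wish to reproduce this example, the increment process with \(h=1\) (fractional Gaussian noise) can be simulated directly from its autocovariance \(\gamma (k)=\tfrac{1}{2}(|k+1|^{2H}-2|k|^{2H}+|k-1|^{2H})\). Below is a minimal sketch based on a Cholesky factorization of the exact covariance matrix; the function name and parameter values are ours, and this is not necessarily the simulation method used later in the paper.

```python
import numpy as np

def fgn_cholesky(n, H, rng=None):
    """Sample n unit-spacing increments of fractional Brownian motion with Hurst index H
    by a Cholesky factorization of the exact covariance matrix (O(n^3), fine for moderate n)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(n)
    # Autocovariance of fractional Gaussian noise: gamma(k) = 0.5 * (|k+1|^2H - 2|k|^2H + |k-1|^2H).
    gamma = 0.5 * (np.abs(k + 1) ** (2 * H) - 2.0 * np.abs(k) ** (2 * H) + np.abs(k - 1) ** (2 * H))
    cov = gamma[np.abs(k[:, None] - k[None, :])]   # Toeplitz covariance matrix
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)

# The time average and the lag-1 sample autocovariance of a long path should be roughly
# 0 and gamma(1) = 0.5 * (2^{2H} - 2), respectively.
z = fgn_cholesky(2000, H=0.7, rng=np.random.default_rng(1))
print(z.mean(), np.mean((z[1:] - z.mean()) * (z[:-1] - z.mean())), 0.5 * (2 ** 1.4 - 2))
```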

Example 4

Increment Process of More General Gaussian Processes.

Peng (2012) introduced a general class of zero-mean Gaussian processes \(X=\{X(t)\}_{t\in {\mathbb {R}}}\) with stationary increments, whose variogram \(\nu (t):=2^{-1}{\mathbb {E}}(X(t)^2)\) satisfies:

  1. (1)

    There is a non-negative integer d such that \(\nu \) is 2d-times continuously differentiable over \([-2,2]\), but not \(2(d+1)\)-times continuously differentiable over \([-2,2]\).

  2. (2)

    There are 2 real numbers \(c\ne 0\) and \(s_0\in (0,2)\), such that for all \(t\in [-2,2]\),

    $$\begin{aligned} \nu ^{(2d)}(t)=\nu ^{(2d)}(0)+c|t|^{{s_{0}}}+r(t), \end{aligned}$$

    where the remainder r(t) satisfies:

    • \(r(t)=o(|t|^{{s_{0}}})\), as \(t\rightarrow 0\).

    • There are two real numbers \(c' > 0\), \(\omega > s_0\) and an integer \(q > \omega +1/2\) such that r is q-times continuously differentiable on \([-2, 2]\backslash \{0\}\) and for all \(t \in [-2, 2]\backslash \{0\}\), we have

      $$\begin{aligned} |r^{(q)}(t)|\leqslant c'|t|^{\omega -q}. \end{aligned}$$

It is shown that the process X extends fractional Brownian motion and it also has wide-sense (and strict-sense) stationary ergodic increments when \(d+s_0/2\in (0,1)\) (see Proposition 3.1 in Peng 2012).

The problem of clustering processes via their means and covariance structures leads us to formulate our clustering targets in the following way.

Definition 2

(Ground truth G of covariance structures) Let

$$\begin{aligned} G=\big \{G_1,\ldots ,G_\kappa \big \} \end{aligned}$$

be a partitioning of \({\mathbb {N}}=\{1,2,\ldots \}\) into \(\kappa \) disjoint sets \(G_k\), \(k=1,\ldots ,\kappa \), such that two paths \({\mathbf {x}}_i\), \({\mathbf {x}}_j\), \(i,j\in {\mathbb {N}}\), have identical means and covariance structures if and only if i and j belong to the same set \(G_k\) for some \(k=1,\ldots ,\kappa \). Such a G is called the ground truth of covariance structures. We also denote by \(G|_{N}\) the restriction of G to the first N sequences:

$$\begin{aligned} G|_{N}=\big \{G_k\cap \{1,\ldots ,N\}:~k=1,\ldots ,\kappa \big \}. \end{aligned}$$

Our clustering algorithms aim to output the ground truth partitioning G as the sample length grows. Before stating these algorithms, we recall the framework of Khaleghi et al. (2016), which inspired the present work.

1.1 Preliminary results: clustering strict-sense stationary ergodic processes

Khaleghi et al. (2016) considered the problem of clustering strict-sense stationary ergodic processes. The main contribution of Khaleghi et al. (2016) is a set of so-called asymptotically consistent algorithms for clustering processes of that type. We briefly summarize their work below. Depending on how the data are collected, the process clustering problem comes in two settings: the offline setting and the online setting.

Offline setting :

The observations are assumed to be a finite number N of paths:

$$\begin{aligned} {\mathbf {x}}_1 = \Big (X_1^{(1)},\ldots , X_{n_1}^{(1)}\Big ),\ldots ,{\mathbf {x}}_N = \Big (X_1^{(N)},\ldots , X_{n_N}^{(N)}\Big ). \end{aligned}$$

Each path is generated by one of the \(\kappa \) different unknown process distributions. In this case, an asymptotically consistent clustering function should satisfy the following.

Definition 3

(Consistency: offline setting) A clustering function f is consistent for a set of sequences S if \(f(S,\kappa )=G\). Moreover, denoting \(n=\min \{n_1,\ldots ,n_N\}\), f is called strongly asymptotically consistent in the offline sense if with probability 1 from some n on it is consistent on the set S, i.e.,

$$\begin{aligned} {\mathbb {P}}\left( \lim _{n\rightarrow \infty }f(S,\kappa )=G\right) =1. \end{aligned}$$

It is called weakly asymptotically consistent if \(\lim \limits _{n\rightarrow \infty }{\mathbb {P}}(f(S,\kappa )=G)=1\).

Online setting :

In this setting the observations, whose number and lengths grow with time t, are denoted by

$$\begin{aligned} {\mathbf {x}}_1 = \Big (X_1^{(1)},\ldots , X_{n_1}^{(1)}\Big ),\ldots ,{\mathbf {x}}_{N(t)} = \Big (X_1^{(N(t))},\ldots , X_{n_{N(t)}}^{(N(t))}\Big ), \end{aligned}$$

where the index function N(t) is non-decreasing with respect to t.

Then an asymptotically consistent online clustering function is defined below:

Definition 4

(Consistency: online setting) A clustering function is strongly (resp. weakly) asymptotically consistent in the online sense, if for every \(N\in {\mathbb {N}}\) the clustering \(f(S(t),\kappa )|_N\) is strongly (resp. weakly) asymptotically consistent in the offline sense, where \(f(S(t),\kappa )|_N\) is the clustering \(f(S(t),\kappa )\) restricted to the first N sequences:

$$\begin{aligned} f(S(t),\kappa )|_N=\left\{ C\cap \{1,\ldots ,N\}:~C\in f(S(t),\kappa )\right\} . \end{aligned}$$

Khaleghi et al. (2016) discuss in detail the comparison between the offline and online settings: the two settings differ significantly, since using the offline algorithm in the online setting, by simply applying it to the entire data observed at every time step, does not result in an asymptotically consistent algorithm. Studying these two settings separately and independently is therefore necessary and meaningful.

As the main results in Khaleghi et al. (2016), asymptotically consistent clustering algorithms for both offline and online settings are designed. They are then successfully applied to clustering synthetic and real data sets.

Note that in the framework of Khaleghi et al. (2016), a key step is the introduction of the so-called distributional distance (Gray 1988): the distributional distance between a pair of process distributions \(\rho _1\), \(\rho _2\) is defined to be

$$\begin{aligned} d(\rho _1,\rho _2)=\sum _{m,l=1}^{\infty }w_m w_l\sum _{B\in B^{m,l}}\left| \rho _1(B)-\rho _2(B)\right| , \end{aligned}$$
(2)

where:

  • The sets \(B^{m,l}\), \(m,l\ge 1\) are obtained by partitioning \({\mathbb {R}}^m\) into cubes of dimension m with side length \(2^{-l}\) (hence volume \(2^{-ml}\)), starting at the origin.

  • The sequence of weights \(\{w_j\}_{j\ge 1}\) is positive and decreasing to zero. Moreover it should be chosen such that the series in (2) is convergent. The weights are often suggested to give precedence to earlier clusterings, protecting the clustering decisions from the presence of the newly observed sample paths, whose corresponding distance estimates may not yet be accurate. For instance, it is set to be \(w_j=1/j(j+1)\) in Khaleghi et al. (2016).

Further, the distance between two sample paths \({\mathbf {x}}_1\), \({\mathbf {x}}_2\) of stochastic processes is given by

$$\begin{aligned} {\widehat{d}}({\mathbf {x}}_1,{\mathbf {x}}_2)=\sum _{m=1}^{m_n}\sum _{l=1}^{l_n}w_m w_l\sum _{B\in B^{m,l}}|\nu ({\mathbf {x}}_1,B)-\nu ({\mathbf {x}}_2,B)|, \end{aligned}$$
(3)

where:

  • \(m_n,l_n\) (\(\leqslant n\)) can be arbitrary sequences of positive integers increasing to infinity, as \(n\rightarrow \infty \).

  • For a process path \({\mathbf {x}}=(X_1,\ldots ,X_n)\) and a cube \(B\in B^{m,l}\), \(\nu ({\mathbf {x}},B)\) denotes the frequency with which the length-m windows of the path fall into B, computed over the \(n-m+1\) available windows. More precisely,

    $$\begin{aligned} \nu ({\mathbf {x}},B):=\frac{1}{n-m+1}\sum _{i=1}^{n-m+1}\mathbb {1}\{(X_i,\ldots ,X_{i+m-1})\in B\}. \end{aligned}$$
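For illustration, a direct (unoptimized) sketch of how \({\widehat{d}}\) in (3) can be evaluated is given below: for each window length m and precision level l it counts the empirical frequencies of the m-dimensional cubes of side \(2^{-l}\) visited by each path. The function names, the truncation levels and the test data are our own choices.

```python
import numpy as np
from collections import Counter

def nu(x, m, l):
    """Empirical frequencies {cube index: frequency} of the cubes in B^{m,l} (side 2^{-l})
    visited by the length-m windows (X_i, ..., X_{i+m-1}) of the path x."""
    x = np.asarray(x, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(x, m)    # shape (n - m + 1, m)
    cubes = np.floor(windows * 2 ** l).astype(int)              # cube index of each window
    counts = Counter(map(tuple, cubes))
    return {cube: c / len(windows) for cube, c in counts.items()}

def d_hat(x1, x2, m_n, l_n, w=lambda j: 1.0 / (j * (j + 1))):
    """Empirical distributional distance of Eq. (3), truncated at m_n and l_n."""
    total = 0.0
    for m in range(1, m_n + 1):
        for l in range(1, l_n + 1):
            nu1, nu2 = nu(x1, m, l), nu(x2, m, l)
            total += w(m) * w(l) * sum(abs(nu1.get(B, 0.0) - nu2.get(B, 0.0))
                                       for B in set(nu1) | set(nu2))
    return total

# Hypothetical test data: two i.i.d. N(0,1) paths versus an i.i.d. N(0,1) path and an AR(1) path.
rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal(2000), rng.standard_normal(2000)
x3 = np.zeros(2000)
for t in range(1, 2000):
    x3[t] = 0.8 * x3[t - 1] + rng.standard_normal()
print(d_hat(x1, x2, m_n=3, l_n=3), d_hat(x1, x3, m_n=3, l_n=3))  # the second value should be larger
```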

The process X from which \({\mathbf {x}}\) is sampled is called strict-sense ergodic if

$$\begin{aligned} {\mathbb {P}}\left( \lim _{n\rightarrow \infty }\nu ({\mathbf {x}},B)={\mathbb {P}}(X\in B)\right) =1,~\text{ for } \text{ all } B. \end{aligned}$$
(4)

The ergodicity assumption implies that \({\widehat{d}}\) is a strongly consistent estimator of d:

$$\begin{aligned} {\mathbb {P}}\left( \lim _{n\rightarrow \infty }{\widehat{d}}({\mathbf {x}}_1,{\mathbf {x}}_2)=d(\rho _1,\rho _2)\right) =1, \end{aligned}$$

where \(\rho _1,\rho _2\) are the process distributions corresponding to \({\mathbf {x}}_1,{\mathbf {x}}_2\), respectively.

Based on the distances d and their estimates \({\widehat{d}}\), the asymptotically consistent algorithms for clustering stationary ergodic processes in each of the offline and online settings are provided (see Algorithms 1, 2 and Theorems 11, 12 in Khaleghi et al. 2016). Khaleghi et al. (2016) also show that their methods can be implemented efficiently: they are at most quadratic in each of their arguments, and are linear (up to log terms) in some formulations.

1.2 Statistical setting: clustering wide-sense stationary ergodic processes

Inspired by the framework of Khaleghi et al. (2016), we consider the problem of clustering wide-sense stationary ergodic processes. We first introduce the following covariance-based dissimilarity measure, which is one of the main contributions of this paper.

Definition 5

(Covariance-based dissimilarity measure) The covariance-based dissimilarity measure \(d^*\) between a pair of processes \(X^{(1)}\), \(X^{(2)}\) (in fact \(X^{(1)}\), \(X^{(2)}\) denote two covariance structures, each of which may correspond to several different process distributions) is defined as follows:

$$\begin{aligned}&d^*\big (X^{(1)},X^{(2)}\big ):= \sum _{m,l = 1}^{\infty } w_m w_l\nonumber \\&\quad \times {\mathcal {M}}\left( \left( {\mathbb {E}}\big (X_{l\ldots l+m-1}^{(1)}\big ), {\mathbb {C}}ov\big (X_{l\ldots l+m-1}^{(1)}\big )\right) , \left( {\mathbb {E}}\big (X_{l\ldots l+m-1}^{(2)}\big ), {\mathbb {C}}ov\big (X_{l\ldots l+m-1}^{(2)}\big )\right) \right) , \end{aligned}$$
(5)

where:

  • For \(j=1,2\), \(\{X_l^{(j)}\}_{l\in {\mathbb {N}}}\) denotes some path sampled from the process \(X^{(j)}\). We assume that any possible observation of the process \(X^{(j)}\) is a subvector of \(\{X_l^{(j)}\}_{l\in {\mathbb {N}}}\). For \(l'\ge l\ge 1\), we define the shortcut notation \(X_{l\ldots l'}^{(j)}:=(X_{l}^{(j)},\ldots ,X_{l'}^{(j)})\).

  • The function \({\mathcal {M}}\) is defined by: for any \(p_1,p_2,p_3\in {\mathbb {N}}\), any 2 vectors \(v_1,v_2\in {\mathbb {R}}^{p_1}\) and any 2 matrices \(A_1,A_2\in {\mathbb {R}}^{p_2\times p_3}\),

    $$\begin{aligned} {\mathcal {M}}((v_1,A_1),(v_2,A_2)):= \left| v_1-v_2\right| +\rho ^*\left( A_1,A_2\right) . \end{aligned}$$
    (6)
  • The distance \(\rho ^*\) between 2 equal-sized matrices \(M_1,M_2\) is defined to be

    $$\begin{aligned} \rho ^*(M_1,M_2):=\Vert M_1-M_2\Vert _F, \end{aligned}$$
    (7)

    with \(\Vert \cdot \Vert _F\) being the Frobenius norm:

    for an arbitrary matrix \(M=\{M_{ij}\}_{i=1,\ldots ,m; j=1,\ldots ,n}\),

    $$\begin{aligned} \Vert M\Vert _F:=\sqrt{\sum _{i=1}^m\sum _{j=1}^n|M_{ij}|^2}. \end{aligned}$$

    The matrix distance \(\rho ^*\) is inspired by Herdin et al. (2005), where a similar quantity is used to measure the distance between two correlation matrices. Our distance \(\rho ^*\) is, however, a modification of the one in the latter paper: unlike the quantity in Herdin et al. (2005), \(\rho ^*\) is a well-defined metric, as it satisfies the triangle inequality.

  • The sequence of positive weights \(\{w_j\}\) is chosen such that \(d^*(X^{(1)},X^{(2)})\) is finite. Observe that, by wide-sense stationarity, the distances \(|\cdot |\) and \(\rho ^*\) in (5) do not depend on l; as a result we necessarily require

    $$\begin{aligned} \sum _{l=1}^\infty w_l<+\infty . \end{aligned}$$
    (8)

    In practice a typical choice of weights we suggest is \(w_j=1/j(j+1)\), \(j=1,2,\ldots \). This is because, for most of the well-known covariance stationary ergodic processes (causal ARMA(p, q) processes, increments of fractional Brownian motion, etc.), the auto-covariance functions are absolutely summable: denoting by \(\gamma _X\) the auto-covariance function of \(\{X_t\}_{t}\),

    $$\begin{aligned} \sum _{h=-\infty }^{+\infty }\left| \gamma _X(h)\right| <+\infty . \end{aligned}$$
    (9)

    Ślęzak (2017) pointed out that (9) is a sufficient condition for \(\{X_t\}\) being mean-ergodic. However (9) does not necessarily imply that \(\{X_t\}\) is covariance-ergodic. It becomes a sufficient and necessary condition if \(\{X_t\}\) is Gaussian. Therefore subject to (9), taking \(w_j=1/j(j+1)\), we obtain for any integer \(N>0\),

    $$\begin{aligned}&\sum _{m,l = 1}^{N} w_m w_l \left| {\mathbb {E}}\left( X_{l\ldots l+m-1}^{(1)}\right) -{\mathbb {E}}\left( X_{l\ldots l+m-1}^{(2)}\right) \right| \nonumber \\&\quad = \sum _{m,l = 1}^{N} w_m w_l \sqrt{m}|\mu _1-\mu _2|=|\mu _1-\mu _2|\sum _{m,l = 1}^{N}\frac{1}{\sqrt{m}(m+1)l(l+1)}\nonumber \\&\quad \leqslant |\mu _1-\mu _2|\sum _{m,l = 1}^{+\infty }\frac{1}{\sqrt{m}(m+1)l(l+1)}<+\infty , \end{aligned}$$
    (10)

    with \(\mu _j={\mathbb {E}}\big (X_1^{(j)}\big )\), for \(j=1,2\); and

    $$\begin{aligned}&\sum _{m,l = 1}^{N} w_m w_l \rho ^*\left( {\mathbb {C}}ov\Big (X_{l\ldots l+m-1}^{(1)}\Big ),{\mathbb {C}}ov\Big (X_{l\ldots l+m-1}^{(2)}\Big )\right) \nonumber \\&\quad \leqslant \sum _{m,l = 1}^{N} w_m w_l \sqrt{2\sum _{k_1=1}^m\sum _{k_2=1}^m\left( \gamma _X(|k_1-k_2|)\right) ^2}\nonumber \\&\quad = \sum _{m,l = 1}^{N} w_m w_l \sqrt{2\sum _{q=-(m-1)}^{m-1}(m-|q|)\left( \gamma _X(|q|)\right) ^2}\nonumber \\&\quad \leqslant \sum _{m,l = 1}^{N} w_m w_l \sqrt{2m\sum _{q=-(m-1)}^{m-1}\left( \gamma _X(|q|)\right) ^2}\nonumber \\&\quad \leqslant c\sum _{m,l = 1}^{N} \frac{\sqrt{2m}}{m(m+1)l(l+1)}\leqslant c\sum _{m,l = 1}^{+\infty } \frac{\sqrt{2m}}{m(m+1)l(l+1)}<+\infty , \end{aligned}$$
    (11)

    where the constant \(c=\sum _{q=-\infty }^{\infty }|{\mathbb {C}}ov(X_1,X_{1+|q|})|<+\infty \). Therefore combining (10) and (11) leads to

    $$\begin{aligned} d^*\big (X^{(1)},X^{(2)}\big )<+\infty . \end{aligned}$$

    Hence \(d^*(X^{(1)},X^{(2)})\) in (5) is well-defined.

In (5) and (6) we see that the behavior of the dissimilarity measure \(d^*\) is jointly determined by the Euclidean distance between the means and the matrix distance between the covariance matrices. If the means of the processes \(X^{(1)}\) and \(X^{(2)}\) are known a priori to be equal, the distance \(d^*\) simplifies to:

$$\begin{aligned} d^*\big (X^{(1)},X^{(2)}\big )= \sum _{m,l = 1}^{\infty } w_m w_l\rho ^*\left( {\mathbb {C}}ov\big (X_{l\ldots l+m-1}^{(1)}\big ), {\mathbb {C}}ov\big (X_{l\ldots l+m-1}^{(2)}\big )\right) . \end{aligned}$$
(12)

Note that this dissimilarity measure can be applied to self-similar processes, since they are all zero-mean (see Sect. 3).
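As a concrete illustration, the following sketch evaluates a truncation of (5) for two zero-mean AR(1) covariance structures with \(\gamma _j(h)=a_j^{|h|}/(1-a_j^2)\); since the summand does not depend on l, the sum over l factors out. The truncation level and parameter values are arbitrary choices of ours.

```python
import numpy as np

def ar1_cov(a, m):
    """m x m covariance matrix of a zero-mean stationary AR(1) process with parameter a
    and unit-variance noise: gamma(h) = a^{|h|} / (1 - a^2)."""
    idx = np.arange(m)
    return a ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - a ** 2)

def d_star_truncated(a1, a2, M=50, w=lambda j: 1.0 / (j * (j + 1))):
    """Truncation of Eq. (5) at m, l <= M for two zero-mean AR(1) covariance structures;
    the mean term vanishes and sum_l w_l factors out of the sum over m."""
    w_l_sum = sum(w(l) for l in range(1, M + 1))
    total = sum(w(m) * np.linalg.norm(ar1_cov(a1, m) - ar1_cov(a2, m), 'fro')  # rho*(Cov_1, Cov_2)
                for m in range(1, M + 1))
    return w_l_sum * total

print(d_star_truncated(0.3, 0.3))   # 0: identical covariance structures
print(d_star_truncated(0.3, 0.7))   # > 0: distinct covariance structures
```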

Next we provide a consistent estimator of \({d^*}(X^{(1)},X^{(2)})\). For \(1\leqslant l\leqslant n\) and \(m\leqslant n-l+1\), define \(\mu ^*({X_{l\ldots n}}, m)\) to be the empirical mean of a process X's sample path \((X_l,\ldots ,X_n)\):

$$\begin{aligned} \mu ^*(X_{l\ldots n},m) :=\frac{1}{n-m-l+2} \sum _{i = l} ^{n-m+1} (X_i~\ldots ~X_{i+m-1})^T, \end{aligned}$$
(13)

and define \(\nu ^*({X_{l\ldots n}}, m)\) to be the empirical covariance matrix of \((X_l,\ldots ,X_n)\):

$$\begin{aligned} \nu ^*(X_{l\ldots n},m):= & {} \frac{1}{n-m-l+2} \sum _{i = l} ^{n-m+1} (X_i~\ldots ~X_{i+m-1})^T(X_i~\ldots ~X_{i+m-1})\nonumber \\&-\,\mu ^*(X_{l\ldots n},m)\mu ^*(X_{l\ldots n},m)^T, \end{aligned}$$
(14)

where \(M^T\) denotes the transpose of the matrix M.

Recall that the notion of wide-sense ergodicity is given in Definition 1. The ergodicity theorem concerns what information can be derived from an average over time about the ensemble average at each point of time. For a wide-sense stationary ergodic process X, either continuous-time or discrete-time, the following statement holds: every empirical mean \(\mu ^*(X_{l\ldots n},m)\) is a strongly consistent estimator of the mean vector \({\mathbb {E}}(X_{l\ldots l+m-1})\); and every empirical covariance matrix \(\nu ^*(X_{l\ldots n},m)\) is a strongly consistent estimator of the covariance matrix \({\mathbb {C}}ov(X_{l\ldots l+m-1})\) under the Frobenius norm, i.e., for all \(m\ge 1\), we have

$$\begin{aligned} {\mathbb {P}}\left( \lim _{n\rightarrow \infty }\left| \mu ^*(X_{l\ldots n},m)-{\mathbb {E}}(X_{l\ldots l+m-1})\right| =0\right) =1 \end{aligned}$$

and

$$\begin{aligned} {\mathbb {P}}\left( \lim _{n\rightarrow \infty }\left\| \nu ^*(X_{l\ldots n},m)-{\mathbb {C}}ov(X_{l\ldots l+m-1})\right\| _F=0\right) =1. \end{aligned}$$

Next we introduce the empirical covariance-based dissimilarity measure \(\widehat{d}^{*}\), serving as a consistent estimator of the covariance-based dissimilarity measure \(d^*\).

Definition 6

(Empirical covariance-based dissimilarity measure) Given two sample paths \({\mathbf {x}}_j=(X_1^{(j)},\ldots ,X_{n_j}^{(j)})\), \(j=1,2\), let \(n=\min \{n_1,n_2\}\). We define the empirical covariance-based dissimilarity measure between \({\mathbf {x}}_1\) and \({\mathbf {x}}_2\) by

$$\begin{aligned}&\widehat{d}^{*}({\mathbf {x}}_{1},{\mathbf {x}}_{2}):=\sum _{m= 1}^{m_n} \sum _{l= 1}^{n-m+1} w_m w_l\nonumber \\&\quad \times {\mathcal {M}}\left( \left( \mu ^*(X^{(1)}_{l\ldots n},m), \nu ^*(X^{(1)}_{l\ldots n},m)\right) ,\left( \mu ^*(X^{(2)}_{l\ldots n},m), \nu ^*(X^{(2)}_{l\ldots n},m)\right) \right) . \end{aligned}$$
(15)

The empirical covariance-based dissimilarity measure between a sample path \({\mathbf {x}}_i\) and a process \(X^{(j)}\) (\(i,j\in \{1,2\}\)) is defined by

$$\begin{aligned}&\widehat{d}^{*}({\mathbf {x}}_{i},X^{(j)}):=\sum _{m= 1}^{m_n} \sum _{l= 1}^{n-m+1} w_m w_l\nonumber \\&\quad \times {\mathcal {M}}\left( \left( \mu ^*(X^{(i)}_{l\ldots n},m), \nu ^*(X^{(i)}_{l\ldots n},m)\right) ,\left( {\mathbb {E}}\left( X_{l\ldots l+m-1}^{(j)}\right) , {\mathbb {C}}ov\left( X_{l\ldots l+m-1}^{(j)}\right) \right) \right) . \end{aligned}$$
(16)

Unlike the dissimilarity measure \(d^*\) which describes some distance between stochastic processes, the empirical covariance-based dissimilarity measure is some distance between two sample paths (finite-length vectors). We will show in the forthcoming Lemma 1 that \(\widehat{d}^{*}\) is a consistent estimator of \(d^*\).

Two observed sample paths may have distinct lengths \(n_1, n_2\); therefore in (15) we compute the distances between their subsequences of length \(n=\min \{n_1,n_2\}\). In practice we usually take \(m_n=\lfloor \log n\rfloor \), the integer part of \(\log n\).
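A direct (and deliberately unoptimized) implementation of (15) might read as follows; it recomputes the sliding-window estimators (13)–(14) for every pair (m, l), so it is only meant to make the definition concrete. All names and the hypothetical test data are ours.

```python
import numpy as np

def d_hat_star(x1, x2, w=lambda j: 1.0 / (j * (j + 1))):
    """Empirical covariance-based dissimilarity measure of Eq. (15)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n = min(len(x1), len(x2))
    m_n = max(1, int(np.floor(np.log(n))))
    total = 0.0
    for m in range(1, m_n + 1):
        W1 = np.lib.stride_tricks.sliding_window_view(x1[:n], m)   # rows (X_i, ..., X_{i+m-1})
        W2 = np.lib.stride_tricks.sliding_window_view(x2[:n], m)
        for l in range(1, n - m + 2):                              # l = 1, ..., n - m + 1
            stats = []
            for W in (W1[l - 1:], W2[l - 1:]):
                mu = W.mean(axis=0)                                # Eq. (13)
                cov = W.T @ W / W.shape[0] - np.outer(mu, mu)      # Eq. (14)
                stats.append((mu, cov))
            (mu1, c1), (mu2, c2) = stats
            # M((mu1, c1), (mu2, c2)) = |mu1 - mu2| + rho*(c1, c2), see Eqs. (6)-(7).
            total += w(m) * w(l) * (np.linalg.norm(mu1 - mu2) + np.linalg.norm(c1 - c2, 'fro'))
    return total

# Hypothetical test data: two AR(1) paths with the same parameter versus different parameters.
rng = np.random.default_rng(4)
def ar1_path(a, n):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = a * x[t - 1] + rng.standard_normal()
    return x
x1, x2, x3 = ar1_path(0.3, 1000), ar1_path(0.3, 1000), ar1_path(0.8, 1000)
print(d_hat_star(x1, x2), d_hat_star(x1, x3))   # the second value should be noticeably larger
```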

It is easy to verify that both \(d^*\) and \(\widehat{d}^{*}\) satisfy the triangle inequalities, thanks to the fact that both the Euclidean distance and \(\rho ^*\) satisfy the triangle inequalities. More precisely, the following holds.

Remark 1

Thanks to (7) and the definitions of \(d^*\) [see (5)] and \(\widehat{d}^{*}\) [see (15)], we see that the triangle inequality holds for the covariance-based dissimilarity measure \(d^*\), as well as for its empirical estimate \(\widehat{d}^{*}\). Therefore for arbitrary processes \(X^{(i)},~i = 1,2,3\) and arbitrary finite-length sample paths \({\mathbf {x}}_{i},~i=1,2,3\), we have

$$\begin{aligned} d^*\big (X^{(1)},X^{(2)}\big )\le & {} d^*\big (X^{(1)},X^{(3)}\big ) + d^*\big (X^{(2)},X^{(3)}\big ),\\ \widehat{d^*}({\mathbf {x}}_{1},{\mathbf {x}}_{2})\le & {} \widehat{d}^{*}({\mathbf {x}}_{1},{\mathbf {x}}_{3}) + \widehat{d}^{*}({\mathbf {x}}_{2},{\mathbf {x}}_{3}),\\ \widehat{d}^{*}\big ({\mathbf {x}}_{1},X^{(1)}\big )\le & {} \widehat{d}^{*}\big ({\mathbf {x}}_{1},X^{(2)}\big ) + d^*\big (X^{(1)},X^{(2)}\big ). \end{aligned}$$

Remark 1, together with the fact that the processes are weakly ergodic, leads to Lemma 1 below, which is key to demonstrating that our clustering algorithms in the forthcoming section are asymptotically consistent.

Lemma 1

Given two paths

$$\begin{aligned} {\mathbf {x}}_{\mathbf {1}}=\left( X_1^{(1)},\ldots ,X_{n_1}^{(1)}\right) \quad \text{ and } \quad \mathbf {x_2}=\left( X_1^{(2)},\ldots ,X_{n_2}^{(2)}\right) , \end{aligned}$$

sampled from the wide-sense stationary ergodic processes \(X^{(1)}\) and \(X^{(2)}\) respectively, we have

$$\begin{aligned} {\mathbb {P}}\left( \lim _{n_1,n_2 \rightarrow \infty } \widehat{d}^{*}\left( {\mathbf {x}}_{1},{\mathbf {x}}_{2}\right) = d^*\left( X^{(1)},X^{(2)}\right) \right) =1 \end{aligned}$$
(17)

and

$$\begin{aligned} {\mathbb {P}}\left( \lim _{n_i \rightarrow \infty } \widehat{d}^{*}\left( {\mathbf {x}}_{i},X^{(j)}\right) = d^*\left( X^{(1)},X^{(2)}\right) \right) =1,~\text{ for }~i,j\in \{1,2\},~i\ne j. \end{aligned}$$
(18)

Proof

We take \(n=\min \{n_1,n_2\}\). To show (17) holds it suffices to prove that for arbitrary \(\varepsilon >0\), there is an integer \(N>0\) such that for any \(n\ge N\), with probability 1,

$$\begin{aligned} \left| \widehat{d}^{*}\left( {\mathbf {x}}_{1},{\mathbf {x}}_{2}\right) - d^*(X^{(1)},X^{(2)})\right| <\varepsilon . \end{aligned}$$

Define the sets of indexes

$$\begin{aligned} S_1(n)=\left\{ (m,l)\in {\mathbb {N}}^2:~m\leqslant m_n,~l\leqslant n-m+1\right\} ~\text{ and }~S_2(n)={\mathbb {N}}^2\backslash S_1(n). \end{aligned}$$

For convenience we also denote by

$$\begin{aligned} V\Big (X_{l\ldots l+m-1}^{(j)}\Big ):=\left( {\mathbb {E}}\left( X_{l\ldots l+m-1}^{(j)}\right) ,{\mathbb {C}}ov\left( X_{l\ldots l+m-1}^{(j)}\right) \right) \end{aligned}$$
(19)

and

$$\begin{aligned} \widehat{V}\Big (X_{l\ldots n}^{(j)},m\Big ):=\left( \mu ^*\left( X_{l\ldots n}^{(j)},m\right) ,\nu ^*\left( X_{l\ldots n}^{(j)},m\right) \right) , \end{aligned}$$
(20)

for \((m,l)\in {\mathbb {N}}^2\) and \(j=1,2\). By using the definitions of \(d^*\) [see (5)], of \(\widehat{d}^{*}\) [see (15)] and the triangle inequality

$$\begin{aligned} \left| \sum \limits _{i\in I}a_i\right| \leqslant \sum \limits _{i\in I}|a_i|,~\text{ for } \text{ any } \text{ indexes } \text{ set } I \text{ and } \text{ any } \text{ real } \text{ numbers } a_i'\hbox {s}, \end{aligned}$$

we obtain

$$\begin{aligned}&\left| \widehat{d}^{*}({\mathbf {x}}_{1}, {\mathbf {x}}_{2}) - d^*\big (X^{(1)}, X^{(2)}\big )\right| \nonumber \\&\quad = \biggl | \sum _{(m,l)\in S_1(n)} w_m w_l \left( {\mathcal {M}}\left( \widehat{V}(X_{l\ldots n}^{(1)},m),\widehat{V}(X_{l\ldots n}^{(2)},m)\right) \right. \nonumber \\&\qquad - \sum _{(m,l)\in S_1(n)\cup S_2(n)} w_m w_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) \biggl | \nonumber \\&\quad \leqslant \biggl | \sum _{(m,l)\in S_1(n)} w_m w_l \left( {\mathcal {M}}\left( \widehat{V}(X_{l\ldots n}^{(1)},m),\widehat{V}(X_{l\ldots n}^{(2)},m)\right) \right. \nonumber \\&\qquad \left. - {\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) \right) \biggl | \nonumber \\&\qquad +\sum _{(m,l)\in S_2(n)}w_mw_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) \nonumber \\&\quad \leqslant \sum _{(m,l)\in S_1(n)} w_m w_l \biggl |{\mathcal {M}}\left( \widehat{V}(X_{l\ldots n}^{(1)},m),\widehat{V}(X_{l\ldots n}^{(2)},m)\right) \nonumber \\&\qquad - {\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) \biggl | \nonumber \\&\qquad +\sum _{(m,l)\in S_2(n)}w_mw_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) . \end{aligned}$$
(21)

Next note that the metric \({\mathcal {M}}\) satisfies the following triangle inequality:

$$\begin{aligned}&\biggl |{\mathcal {M}}\left( \widehat{V}(X_{l\ldots n}^{(1)},m),\widehat{V}(X_{l\ldots n}^{(2)},m)\right) - {\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) \biggl |\nonumber \\&\quad \le {\mathcal {M}}\left( \widehat{V}(X_{l\ldots n}^{(1)},m),{V}(X_{l\ldots l+m-1}^{(1)})\right) + {\mathcal {M}}\left( {\widehat{V}}(X_{l\ldots n}^{(2)},m), V(X_{l\ldots l+m-1}^{(2)})\right) . \end{aligned}$$
(22)

It follows from (21) and (22) that

$$\begin{aligned}&\left| \widehat{d}^{*}({\mathbf {x}}_{1}, {\mathbf {x}}_{2}) - d^*\big (X^{(1)}, X^{(2)}\big )\right| \nonumber \\&\quad \leqslant \sum _{(m,l)\in S_1(n)} w_m w_l \biggl ({\mathcal {M}}\left( \widehat{V}(X_{l\ldots n}^{(1)},m),V(X_{l\ldots l+m-1}^{(1)})\right) \nonumber \\&\qquad + {\mathcal {M}}\left( {\widehat{V}}(X_{l\ldots n}^{(2)},m), V(X_{l\ldots l+m-1}^{(2)})\right) \biggl ) \nonumber \\&\qquad +\sum _{(m,l)\in S_2(n)}w_mw_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) . \end{aligned}$$
(23)

Next we show that the right-hand side of (23) converges to 0 as \(n\rightarrow \infty \). First observe that the weights \(\{w_m\}_{m\ge 1}\) have been chosen such that

$$\begin{aligned} \sum _{m,l = 1}^{\infty } w_m w_l {\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}),V(X_{l\ldots l+m-1}^{(2)})\right) <+\infty . \end{aligned}$$
(24)

Then for arbitrary fixed \(\varepsilon >0\), we can find an index J such that for \(n\ge J\),

$$\begin{aligned} \sum _{(m,l)\in S_2(n)} w_mw_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}),V(X_{l\ldots l+m-1}^{(2)})\right) \le \frac{\varepsilon }{3}. \end{aligned}$$
(25)

Next, the weak ergodicity of the processes \(X^{(1)}\) and \(X^{(2)}\) implies that: for each \((m,l)\in {\mathbb {N}}^2\), \({\widehat{V}}(X_{l\ldots n}^{(j)},m)\) (\(j=1,2\)) is a strongly consistent estimator of \(V(X_{l\ldots l+m-1}^{(j)})\), under the metric \({\mathcal {M}}\), i.e., with probability 1,

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathcal {M}}\left( {\widehat{V}}(X_{l\ldots n}^{(j)},m),~ V(X_{l\ldots l+m-1}^{(j)})\right) =0. \end{aligned}$$
(26)

Thanks to (26), for any \((m,l)\in S_1(J)\), there exists some \(N_{m,l}\) (which depends on m, l) such that for all \(n\ge N_{m,l}\), we have, with probability 1,

$$\begin{aligned} {\mathcal {M}}\left( {\widehat{V}}(X^{(j)}_{l\ldots n},m),~V(X_{l\ldots l+m-1}^{(j)})\right) \le \frac{\varepsilon }{3w_m w_l\#S_1(J)},~\text{ for }~j = 1, 2, \end{aligned}$$
(27)

where \(\# A\) denotes the number of elements included in the set A. Denote by \(N_J=\max \limits _{(m,l)\in S_1(J)}N_{m,l}\). Then observe that, for \(n\ge \max \{N_J,J\}\),

$$\begin{aligned}&\sum _{(m,l)\in S_2(n)}w_mw_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) \nonumber \\&\quad \leqslant \sum _{(m,l)\in S_2(J)}w_mw_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) . \end{aligned}$$
(28)

It results from (23), (28), (27) and (25) that, for \(n\ge \max \{N_J,J\}\),

$$\begin{aligned}&\left| \widehat{d}^{*}({\mathbf {x}}_{1}, {\mathbf {x}}_{2}) - d^*\big (X^{(1)}, X^{(2)}\big )\right| \nonumber \\&\quad \le \sum _{(m,l)\in S_1(n)} w_m w_l {\mathcal {M}}\left( {\widehat{V}}(X^{(1)}_{l\ldots n},m),V(X_{l\ldots l+m-1}^{(1)})\right) \nonumber \\&\qquad +\sum _{(m,l)\in S_1(n)} w_m w_l {\mathcal {M}}\left( {\widehat{V}}(X^{(2)}_{l\ldots n},m),V(X_{l\ldots l+m-1}^{(2)})\right) \nonumber \\&\qquad +\sum _{(m,l)\in S_2(J)}w_mw_l{\mathcal {M}}\left( V(X_{l\ldots l+m-1}^{(1)}), V(X_{l\ldots l+m-1}^{(2)})\right) \nonumber \\&\quad \leqslant \frac{\varepsilon }{3}+\frac{\varepsilon }{3}+\frac{\varepsilon }{3}=\varepsilon , \end{aligned}$$

which proves (17). The statement (18) can be proved analogously. \(\square \)

2 Asymptotically consistent clustering algorithms

2.1 Offline and online algorithms

In this section we introduce the asymptotically consistent algorithms for clustering offline and online datasets, respectively. We explain how the two algorithms work, and prove that both are asymptotically consistent. It is worth noting that the asymptotic consistency of our algorithms relies on the assumption that the number of clusters \(\kappa \) is known a priori. The case of unknown \(\kappa \) has been studied in Khaleghi et al. (2016) for the problem of clustering strict-sense stationary ergodic processes; in the setting of wide-sense stationary ergodic processes, this problem remains open.

Algorithm 1 below presents the pseudo-code for clustering offline datasets. It is a centroid-based clustering approach, and one of its main features is the farthest two-point initialization. The algorithm selects the first two cluster centers by picking the two “farthest” observations among all observations (Lines 1–3), under the empirical dissimilarity measure \(\widehat{d}^{*}\). Then each next cluster center is chosen to be the observation farthest from all the previously assigned cluster centers (Lines 4–6). Finally the algorithm assigns each remaining observation to its nearest cluster (Lines 7–11).

[Algorithm 1 pseudo-code figure]
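In case the pseudo-code figure does not render here, the following sketch (our own transcription) summarizes the procedure just described; it operates on a precomputed symmetric matrix D of pairwise empirical dissimilarities, e.g., \(D[i,j]=\widehat{d}^{*}({\mathbf {x}}_i,{\mathbf {x}}_j)\).

```python
import numpy as np

def offline_cluster(D, kappa):
    """Offline clustering with farthest two-point initialization (a sketch of Algorithm 1).
    D: symmetric N x N matrix of pairwise empirical dissimilarities between the N paths.
    Returns the list of center indexes and an array of labels in {0, ..., kappa - 1}."""
    D = np.asarray(D, dtype=float)
    # Lines 1-3: the two farthest observations become the first two cluster centers.
    c1, c2 = np.unravel_index(np.argmax(D), D.shape)
    centers = [int(c1), int(c2)]
    # Lines 4-6: each next center is the observation farthest from all previously chosen centers.
    for _ in range(2, kappa):
        centers.append(int(np.argmax(D[:, centers].min(axis=1))))
    # Lines 7-11: assign every observation to its nearest cluster center.
    labels = np.argmin(D[:, centers], axis=1)
    return centers, labels
```

Since each center is at distance 0 from itself, the centers are automatically assigned to their own clusters in the last step.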

We point out that Algorithm 1 differs from Algorithm 1 in Khaleghi et al. (2016) in two respects:

  1. 1.

    As mentioned previously, our algorithm relies on the covariance-based dissimilarity \(\widehat{d}^{*}\), in lieu of the process distributional distances.

  2. 2.

    Our algorithm uses two-point initialization, while Algorithm 1 in Khaleghi et al. (2016) randomly picks one point as the first cluster center. The latter initialization was proposed for use with k-means clustering by Katsavounidis et al. (1994). Algorithm 1 in Khaleghi et al. (2016) requires \(\kappa N\) distance calculations, while our algorithm requires \(N(N-1)/2\) distance calculations. To reduce the computational complexity of our algorithm, one may replace our two-point initialization with the one in Khaleghi et al. (2016); however, there are two reasons why we recommend our initialization:

Reason 1 :

In the forthcoming Sect. 4.1, our empirical comparison to Khaleghi et al. (2016) shows that the 2-point initialization turns out to be more accurate in clustering than the 1-point initialization.

Reason 2 :

Concerning the complexity cost, there is a trade-off. On the one hand, the two-point initialization requires more calculations than the one-point initialization; on the other hand, in our covariance-based dissimilarity measure \(\widehat{d}^{*}\) defined in (15), the matrix distance \(\rho ^*\) requires \(m_n^2\) computations of Euclidean distances, while the distance \( \sum _{B\in B^{m,l}}|\nu ({\mathbf {x}}_1,B)-\nu ({\mathbf {x}}_2,B)|\) given in (3) requires at least \(n_1+n_2-2m_n+2\) computations of Euclidean distances [see Eq. (33) in Khaleghi et al. 2016]. Note that we take \(m_n=\lfloor \log n\rfloor \) (\(\lfloor \cdot \rfloor \) denotes the integer part) throughout this framework. Therefore the computational cost of the covariance-based dissimilarity \(\widehat{d}^{*}\) makes the overall complexity of Algorithm 1 quite competitive with the algorithm in Khaleghi et al. (2016), especially when the path lengths \(n_i\), \(i=1,\ldots ,N\), are relatively large, or when the database of all pairwise distance values is at hand.

Next we present the clustering algorithm for the online setting. As mentioned in Khaleghi et al. (2016), recently observed paths are regarded as unreliable observations, for which sufficient information has not yet been collected and for which the estimates of the covariance-based dissimilarity measure are not yet accurate. Consequently, farthest-point initialization would not work in this case; and clustering all available data at once results not only in mis-clustering the unreliable paths, but also in clustering incorrectly those for which sufficient data are already available. The strategy is presented in Algorithm 2 below: clustering based on a weighted combination of several clusterings, each obtained by running the offline algorithm (Algorithm 1) on a different portion of the data.

More precisely, Algorithm 2 works as follows. Suppose the number of clusters \(\kappa \) is known. At time t, a sample S(t) is observed (Lines 1–2), and the algorithm iterates over \(j= \kappa ,\ldots ,N(t)\), where at each iteration Algorithm 1 is used to cluster the first j paths in S(t) into \(\kappa \) clusters (Lines 6–7). For each cluster, the center is selected as the observation having the smallest index in that cluster, and the centers are ordered by increasing index (Line 8). The minimum inter-cluster distance \(\gamma _j\) (see Cesa-Bianchi and Lugosi 2006) is calculated as the minimum distance \(\widehat{d}^{*}\) between the \(\kappa \) cluster centers obtained at iteration j (Line 9). Finally, every observation in S(t) is assigned to the nearest cluster, based on the weighted combination of the distances between this observation and the candidate cluster centers obtained at each iteration over j (Lines 14–17).

[Algorithm 2 pseudo-code figure]

In Algorithm 2, \(\beta (j)\) denotes a function of j giving the value chosen for the weight \(w_j\). Remark that, in the online setting, our algorithm requires the same number of distance calculations as Algorithm 2 in Khaleghi et al. (2016); both are bounded by \({\mathcal {O}}(N(t)^2)\). Even with the two-point initialization, our Algorithm 2 thus retains an advantage in overall computational cost. Finally we note that both Algorithm 1 and Algorithm 2 require \(\kappa \ge 2\); when \(\kappa \) is known, this restriction is not a practical issue.
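A sketch of this weighted-combination procedure, reusing the offline_cluster helper from the Algorithm 1 sketch above, is given below; D is the pairwise dissimilarity matrix of the N(t) currently observed paths, and the normalization by \(\eta \) is omitted since dividing all scores by \(\eta =\sum _j w_j\gamma _j\) does not change the argmin. All names are ours.

```python
import numpy as np

def online_cluster(D, kappa, beta=lambda j: 1.0 / (j * (j + 1))):
    """Online clustering step at a given time t (a sketch of Algorithm 2).
    D: N(t) x N(t) matrix of pairwise empirical dissimilarities between the observed paths;
    beta(j) plays the role of the weight w_j. Returns a label in {0, ..., kappa - 1} per path."""
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    scores = np.zeros((N, kappa))
    for j in range(kappa, N + 1):
        # Run the offline algorithm (Algorithm 1) on the first j paths.
        _, labels = offline_cluster(D[:j, :j], kappa)
        # Each cluster's candidate center is its smallest path index; order the centers increasingly.
        centers_j = sorted(int(np.flatnonzero(labels == k)[0]) for k in range(kappa))
        # gamma_j: minimum inter-cluster distance among the kappa candidate centers.
        gamma_j = min(D[a, b] for a in centers_j for b in centers_j if a < b)
        # Accumulate the weighted combination of distances to the candidate centers.
        scores += beta(j) * gamma_j * D[:, centers_j]
    return np.argmin(scores, axis=1)
```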

2.2 Consistency and computational complexity of the algorithms

In this section we prove the asymptotic consistency of Algorithms 1 and 2, stated in the two theorems below.

Theorem 1

Algorithm 1 is strongly asymptotically consistent (in the offline sense), provided that the true number \(\kappa \) of clusters is known, and each sequence \({\mathbf {x}}_i,~i = 1,\ldots ,N\) is sampled from some wide-sense stationary ergodic process.

Proof

Similar to the idea used in the proof of Theorem 11 in Khaleghi et al. (2016), to prove the consistency statement we will need Lemma 1 to show that if the sample paths in S are long enough, the sample paths that are generated by the same process covariance structure are “closer” to each other than to the rest. Therefore, the sample paths chosen as cluster centers are each generated by a different covariance structure, and since the algorithm assigns the rest to the closest clusters, the statement follows. More formally, let \(n_{\min }\) denote the shortest path length in S:

$$\begin{aligned} n_{\min }:= \min \left\{ n_i:~i=1,\ldots ,N\right\} . \end{aligned}$$

Denote by \(\delta _{\min }\) the minimum non-zero covariance-based dissimilarity measure between any 2 covariance structures:

$$\begin{aligned} \delta _{\min }:= \min \left\{ d^*\left( X^{(k)},X^{(k')}\right) :~k,k'\in \{1,\ldots ,\kappa \},~k\ne k'\right\} . \end{aligned}$$
(29)

Fix \(\varepsilon \in (0, \delta _{\min }/4)\). Since there are a finite number N of observations, by Lemma 1 there is \(n_0\) such that for \(n_{\min }\ge n_0\) we have

$$\begin{aligned} \max _{\begin{array}{c} l \in \{1,\ldots ,\kappa \} \\ i \in G_l \cap \left\{ 1,\ldots ,N \right\} \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_i, X^{(l)}\right) \le \varepsilon , \end{aligned}$$
(30)

where \(G_l,~l = 1,\ldots ,\kappa \) denote the covariance structure ground-truth partitions given by Definition 2.

On one hand, by using (30), the triangle inequality (see Remark 1) and the fact that

$$\begin{aligned} \max _{i\in I}(a_i+b_i)\leqslant \max _{i\in I}a_i+\max _{i\in I}b_i \end{aligned}$$

for any indexes set I and any real numbers \(a_i\)’s and \(b_i\)’s, we obtain

$$\begin{aligned}&\max _{\begin{array}{c} l \in \{1,\ldots ,\kappa \} \\ i,j \in G_l \cap \left\{ 1,\ldots ,N \right\} \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_i, {\mathbf {x}}_j\right) \nonumber \\&\quad \le \max _{\begin{array}{c} l \in \{1,\ldots ,\kappa \} \\ i,j \in G_l \cap \left\{ 1,\ldots ,N \right\} \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_i, X^{(l)}\right) +\max _{\begin{array}{c} l \in \{1,\ldots ,\kappa \} \\ i,j \in G_l \cap \left\{ 1,\ldots ,N \right\} \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_j, X^{(l)}\right) \nonumber \\&\quad = \max _{\begin{array}{c} l \in \{1,\ldots ,\kappa \} \\ i \in G_l \cap \left\{ 1,\ldots ,N \right\} \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_i, X^{(l)}\right) +\max _{\begin{array}{c} l \in \{1,\ldots ,\kappa \} \\ j \in G_l \cap \left\{ 1,\ldots ,N \right\} \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_j, X^{(l)}\right) \nonumber \\&\quad \leqslant 2 \varepsilon < \frac{\delta _{\min }}{2}. \end{aligned}$$
(31)

On the other hand, by using the triangle inequality (see Remark 1), (29) and (30), we have for \(n_{\min }\ge n_0\),

$$\begin{aligned}&\min _{\begin{array}{c} k,k'\in \{1,\ldots ,\kappa \},k\ne k'\\ i \in G_k \cap \left\{ 1,\ldots ,N \right\} \\ j \in G_{k'} \cap \left\{ 1,\ldots ,N \right\} \end{array}} \widehat{d}^{*}({\mathbf {x}}_i, {\mathbf {x}}_j) \nonumber \\&\quad \ge \min _{\begin{array}{c} k,k'\in \{1,\ldots ,\kappa \},k\ne k'\\ i \in G_k \cap \left\{ 1,\ldots ,N \right\} \\ j \in G_{k'} \cap \left\{ 1,\ldots ,N \right\} \end{array}}\left\{ d^*\left( X^{(k)}, X^{(k')}\right) - \widehat{d}^{*}\left( {\mathbf {x}}_i, X^{(k)}\right) - \widehat{d}^{*}\left( {\mathbf {x}}_j, X^{(k')}\right) \right\} \nonumber \\&\quad \ge \delta _{\min }-2\varepsilon > \frac{\delta _{\min }}{2}. \end{aligned}$$
(32)

In words, (31) together with (32) indicates that the sample paths in S generated by the same covariance structure are closer to each other than to the rest of the sample paths. Consequently, by (31) and (32), for \(n_{\min }\ge n_0\), as long as fewer than \(\kappa \) cluster centers have been chosen, there remains a sample path whose distance to all previously chosen centers exceeds \(\delta _{\min }/2\), i.e.,

$$\begin{aligned} \max _{i = 1,\ldots ,N} \min _{k = 1,\ldots ,\kappa -1} \widehat{d}^{*} ({\mathbf {x}}_i, {\mathbf {x}}_{c_k})>\frac{\delta _{\min }}{2}, \end{aligned}$$
(33)

where the \(\kappa \) cluster centers’ indexes \(c_1,\ldots ,c_\kappa \) are given by Algorithm 1 as

$$\begin{aligned} (c_1,c_2) := {\mathop {{{\,\mathrm{argmax}\,}}}\limits _{i,j = 1,\ldots ,N,~i<j}}\widehat{d}^{*} ({\mathbf {x}}_i, {\mathbf {x}}_{j}), \end{aligned}$$

and

$$\begin{aligned} c_k :={\mathop {{{\,\mathrm{argmax}\,}}}\limits _{i = 1,\ldots ,N}} \displaystyle \min _{j = 1,\ldots ,k-1} \widehat{d}^{*} ({\mathbf {x}}_i, {\mathbf {x}}_{c_j}),~k = 3,\ldots ,\kappa . \end{aligned}$$

Hence, the indexes \(c_1,\ldots , c_{\kappa }\) will be chosen to index the sample paths generated by different process covariance structures. Then by (31) and (32), each remaining sample path will be assigned to the cluster center corresponding to the sample path generated by the same process covariance structure. Finally Theorem 1 results from (31), (32) and (33). \(\square \)

Theorem 2

Algorithm 2 is strongly asymptotically consistent (in the online sense), provided the true number of clusters \(\kappa \) is known, and each sequence \({\mathbf {x}}_i, i \in {\mathbb {N}}\) is sampled from some wide-sense stationary ergodic process.

Proof

The idea of the proof is similar to that of Theorem 12 in Khaleghi et al. (2016). The main difference between the two proofs comes from the fact that our covariance-based dissimilarity measure \(\widehat{d}^{*}\) is not bounded by a constant. Although this is not explicit in the pseudo-code of Algorithm 2, the quantities \(\gamma _j\) and \(\eta \) depend on t; we therefore write \(\gamma _j^t:=\gamma _j\) and \(\eta ^t:=\eta \) throughout this proof. In the first step, by using the triangle inequality we can show that for any \(t>0\) and any \(N\in {\mathbb {N}}\),

$$\begin{aligned}&\sup _{\begin{array}{c} j\in \{1,\ldots ,N\}\\ k\in \{1,\ldots ,\kappa \} \end{array}}\widehat{d}^{*} \left( {\mathbf {x}}_{j}^t ,X^{(k)} \right) \leqslant \sup _{\begin{array}{c} j\in \{1,\ldots ,N\}\\ k\in \{1,\ldots ,\kappa \} \end{array}}\left( d^*\left( X^{(k)}, X^{(k_j')}\right) +\widehat{d}^{*}\left( {\mathbf {x}}_{j}^t, X^{(k_j')}\right) \right) \nonumber \\&\quad \leqslant \sup _{\begin{array}{c} j\in \{1,\ldots ,N\}\\ k\in \{1,\ldots ,\kappa \} \end{array}}d^*\left( X^{(k)}, X^{(k_j')}\right) +\sup _{\begin{array}{c} j\in \{1,\ldots ,N\}\\ k\in \{1,\ldots ,\kappa \} \end{array}}\widehat{d}^{*}\left( {\mathbf {x}}_{j}^t, X^{(k_j')}\right) \nonumber \\&\quad = \sup _{\begin{array}{c} j\in \{1,\ldots ,N\}\\ k\in \{1,\ldots ,\kappa \} \end{array}}d^*\left( X^{(k)}, X^{(k_j')}\right) +\sup _{j\in \{1,\ldots ,N\}}\widehat{d}^{*}\left( {\mathbf {x}}_{j}^t, X^{(k_j')}\right) , \end{aligned}$$
(34)

where for each j, \(k_j'\) is chosen such that \({\mathbf {x}}_j^t\) is sampled from the process covariance structure \(X^{(k_j')}\). On one hand, let

$$\begin{aligned} \delta _{\max }:= \max \left\{ d^*\left( X^{(k)},X^{(k')}\right) :~k,k'\in \{1,\ldots ,\kappa \},~k\ne k'\right\} , \end{aligned}$$
(35)

then the first term on the right-hand side of (34) can be bounded by the constant \(\delta _{\max }\), which neither depends on t nor on N:

$$\begin{aligned} \sup _{\begin{array}{c} j\in \{1,\ldots ,N\}\\ k\in \{1,\ldots ,\kappa \} \end{array}}d^*\left( X^{(k)}, X^{(k_j')}\right) \leqslant \delta _{\max }. \end{aligned}$$
(36)

On the other hand, since \({\mathbf {x}}_j^t\) is sampled from \(X^{(k_j')}\), by using the weak ergodicity (see Lemma 1), for \(j=1,\ldots ,N\), with probability 1,

$$\begin{aligned} \lim _{t\rightarrow \infty }\widehat{d}^{*}\left( {\mathbf {x}}_{j}^t, X^{(k_j')}\right) =0. \end{aligned}$$

This, together with the fact that a convergent sequence is bounded, implies that for each \(j\in \{1,\ldots , N\}\) there is \(b_j\) (not depending on t) such that

$$\begin{aligned} \widehat{d}^{*}\left( {\mathbf {x}}_{j}^t, X^{(k_j')}\right) \leqslant b_j,~\text{ for } \text{ all } t\ge 0. \end{aligned}$$

Therefore the second term on the right-hand side of (34) can be bounded as:

$$\begin{aligned} \sup _{j\in \{1,\ldots ,N\}}\widehat{d}^{*}\left( {\mathbf {x}}_{j}^t, X^{(k_j')}\right) \leqslant \max \{b_1,\ldots ,b_N\}. \end{aligned}$$
(37)

Let

$$\begin{aligned} B(N):=\delta _{\max }+\max \{b_1,\ldots ,b_N\}. \end{aligned}$$
(38)

It is important to point out that B(N) depends only on N but not on t. It follows from (34), (36), (37) and (38) that

$$\begin{aligned} \sup _{\begin{array}{c} j\in \{1,\ldots ,N\}\\ k\in \{1,\ldots ,\kappa \} \end{array}}\widehat{d}^{*} \left( {\mathbf {x}}_{j}^t ,X^{(k)} \right) \leqslant B(N). \end{aligned}$$
(39)

Let \(\delta _{\min }\) be the one given in (29). Fix \(\varepsilon \in (0, \delta _{\min }/4)\). By using (8), we can choose some \(J>0\) so that

$$\begin{aligned} \sum _{j = J+1}^\infty w_j\leqslant \varepsilon . \end{aligned}$$
(40)

Recall that in the online setting, the length \(n_i(t)\) of the \(i\hbox {th}\) sample path grows with time, for each i. Therefore, by the wide-sense ergodicity (see Lemma 1), for every \(j \in \{1,\ldots ,J\}\) there exists some \(T_1(j)>0\) such that for all \(t \ge T_1(j)\) we have

$$\begin{aligned} \max _{\begin{array}{c} k \in \{1,\ldots ,\kappa \} \\ i \in G_k \cap \left\{ 1,\ldots ,j \right\} \end{array} } \widehat{d}^{*}\left( {\mathbf {x}}_i^t, X^{(k)}\right) \leqslant \varepsilon . \end{aligned}$$
(41)

For \(k = 1,\ldots ,\kappa \), define \(s_k(N(t))\) to be the index of the first path in S(t) sampled from the covariance structure \(X^{(k)}\), i.e.,

$$\begin{aligned} s_k(N(t)) := \min \left\{ i \in G_k \cap \{1,\ldots ,N(t)\} \right\} . \end{aligned}$$
(42)

Note that \(s_k(N(t))\) depends only on N(t). Then denote

$$\begin{aligned} m(N(t)) := \max _{k \in \{1,\ldots ,\kappa \}} s_k(N(t)). \end{aligned}$$
(43)

By Theorem 1 for every \(j \in \{m(N(t)),\ldots ,J\}\) there exists some \(T_2(j)\) such that \(\text{ Alg1 }(S(t)|_j, \kappa )\) is asymptotically consistent for all \(t \ge T_2(j)\), where \(S(t)|_j = \left\{ {\mathbf {x}}_1^t, \ldots , {\mathbf {x}}_j^t \right\} \) denotes the subset of S(t) consisting of the first j sample paths. Let

$$\begin{aligned} T:= \max _{\begin{array}{c} i=1,2\\ j \in \{1,\ldots ,J\} \end{array}} T_i(j). \end{aligned}$$

Recall that, by the definition of m(N(t)) in (43), \(S(t)|_{m(N(t))}\) contains sample paths from all \(\kappa \) distinct covariance structures. Therefore, similar to obtaining (32), for all \(t \ge T\), we use the triangle inequality, (29) and (41) to obtain

$$\begin{aligned}&\min _{\begin{array}{c} k,k'\in \{1,\ldots ,\kappa \}\\ k\ne k' \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_{c_{k}^{m(N(t))}}^t , {\mathbf {x}}_{c_{k'}^{m(N(t))}}^t \right) \nonumber \\&\quad \ge \min _{\begin{array}{c} k,k'\in \{1,\ldots ,\kappa \}\\ k\ne k' \end{array}}\left( d^*\left( X^{(k)}, X^{(k')}\right) -\left( \widehat{d}^{*}\left( {\mathbf {x}}_{c_k^{m(N(t))}}^t, X^{(k)}\right) + \widehat{d}^{*}\left( {\mathbf {x}}_{c_{k'}^{m(N(t))}}^t, X^{(k')}\right) \right) \right) \nonumber \\&\quad \ge \delta _{\min } - 2\varepsilon \ge \frac{\delta _{\min }}{2}. \end{aligned}$$
(44)

From Algorithm 2 (see Lines 9, 11) we see

$$\begin{aligned} \eta ^t : = \sum _{j=1}^{N(t)}w_j\gamma _j^t,\quad \text{ with }\quad \gamma _j^t:=\min _{\begin{array}{c} k,k'\in \{1,\ldots ,\kappa \}\\ k\ne k' \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t,{\mathbf {x}}_{c_{k'}^j}^t\right) . \end{aligned}$$

Hence, by (44), for all \(t\ge T\),

$$\begin{aligned} \eta ^t \ge \frac{w_{m(N(t))} \delta _{\min }}{2}. \end{aligned}$$
(45)

For \(j\in \{ J+1,\ldots ,N(t)\}\), by the triangle inequality and (39), we have for all \(t\ge T\),

$$\begin{aligned}&\gamma _j^t=\min _{\begin{array}{c} k,k'\in \{1,\ldots ,\kappa \}\\ k\ne k' \end{array}} \widehat{d}^{*}\left( {\mathbf {x}}_{c_{k}^{j}}^t , {\mathbf {x}}_{c_{k'}^{j}}^t \right) \nonumber \\&\quad \leqslant \min _{\begin{array}{c} k,k'\in \{1,\ldots ,\kappa \}\\ k\ne k' \end{array}}\left( d^*\left( X^{(k)}, X^{(k')}\right) +\left( \widehat{d}^{*}\left( {\mathbf {x}}_{c_k^{j}}^t, X^{(k)}\right) + \widehat{d}^{*}\left( {\mathbf {x}}_{c_{k'}^{j}}^t, X^{(k')}\right) \right) \right) \nonumber \\&\quad \le \delta _{\max } + 2B(N(t)). \end{aligned}$$
(46)

Denote by

$$\begin{aligned} M(N(t)):=\delta _{\max } + 2B(N(t)), \end{aligned}$$

then (46) can be interpreted as: for all \(t\ge T\),

$$\begin{aligned} \sup _{j\in \{J+1,\ldots ,N(t)\}}\gamma _j^t\leqslant M(N(t)). \end{aligned}$$
(47)

By (39), (45) and (47), for every \(k \in \{1,\ldots ,\kappa \}\) we obtain

$$\begin{aligned}&\frac{1}{\eta ^t}\sum _{j=1}^{N(t)} w_j \gamma _j^t\widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)} \right) \nonumber \\&\quad =\frac{1}{\eta ^t}\sum _{j=1}^{J} w_j \gamma _j^t\widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)} \right) +\frac{1}{\eta ^t}\sum _{j=J+1}^{N(t)} w_j \gamma _j^t\widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)} \right) \nonumber \\&\quad \le \frac{1}{\eta ^t} \sum _{j=1}^{J} w_j \gamma _j^t \widehat{d}^{*} \left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)} \right) + \frac{2B(N(t))M(N(t))}{w_{m(N(t))}\delta _{\min }}\sum _{j=J+1}^{N(t)}w_j\nonumber \\&\quad = \frac{1}{\eta ^t} \sum _{j=1}^{m(N(t))-1} w_j \gamma _j^t \widehat{d}^{*} \left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)} \right) +\frac{1}{\eta ^t} \sum _{j=m(N(t))}^{J} w_j \gamma _j^t \widehat{d}^{*} \left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)} \right) \nonumber \\&\qquad +\, \frac{2B(N(t))M(N(t))\varepsilon }{w_{m(N(t))}\delta _{\min }}. \end{aligned}$$
(48)

Next we provide upper bounds for the first two terms on the right-hand side of (48). On one hand, by the definition of m(N(t)), the sample paths in \(S(t)|_j\) for \(j = 1,\ldots ,m(N(t)) - 1\) are generated by at most \(\kappa -1\) out of the \(\kappa \) process covariance structures. Therefore for each \(j \in \{1,\ldots ,m(N(t))-1\}\) there exists at least one pair of distinct cluster centers that are generated by the same process covariance structure. Consequently, by (41) and the definition of \(\eta ^t\), for all \(t \ge T\) and \(k\in \{1,\ldots ,\kappa \}\),

$$\begin{aligned} \frac{1}{\eta ^t} \sum _{j=1}^{m(N(t))-1} w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)}\right) \le \frac{\varepsilon }{\eta ^t} \sum _{j=1}^{m(N(t))-1} w_j \gamma _j^t \le \varepsilon . \end{aligned}$$
(49)

On the other hand, since the clusters are ordered in the order of appearance of the distinct covariance structures, we have \({\mathbf {x}}_{c_l^j}^t = {\mathbf {x}}_{s_l(N(t))}^t\) for all \(j = m(N(t)),\ldots ,J\) and \(l = 1,\ldots ,\kappa \), where the index \(s_l(N(t))\) is defined in (42). Therefore, by (41) and the definition of \(\eta ^t\), for all \(t \ge T\) and every \(l=1,\ldots ,\kappa \) we have

$$\begin{aligned} \frac{1}{\eta ^t} \displaystyle \sum _{j=m(N(t))}^{J}w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_{c_l^j}^t,X^{(l)}\right) = \widehat{d}^{*} \left( {\mathbf {x}}_{s_l(N(t))}^t ,X^{(l)}\right) \frac{1}{\eta ^t}\sum _{j=m(N(t))}^{J} w_j \gamma _j^t \le \varepsilon . \end{aligned}$$
(50)

Combining (48), (49), (50) and (41) we obtain, for \(t\ge T\),

$$\begin{aligned} \frac{1}{\eta ^t} \displaystyle \sum _{j=1}^{N(t)} w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)}\right) \leqslant \varepsilon \left( 2 + \frac{2B(N(t))M(N(t))}{w_{m(N(t))}\delta _{\min }}\right) \end{aligned}$$
(51)

for all \(k = 1,\ldots ,\kappa \).

Now we explain how to use (51) to prove the asymptotic consistency of Algorithm 2. Consider an index \(i \in G_{k'}\) for some \(k' \in \{1,\ldots ,\kappa \}\). Then on one hand, using (49) and (50), we get for \(k\in \{1,\ldots ,\kappa \}\), \(k\ne k'\),

$$\begin{aligned}&\frac{1}{\eta ^t} \sum _{j=1}^{N(t)} w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_i^t,{\mathbf {x}}_{c_k^j}^t\right) \nonumber \\&\quad \ge \frac{1}{\eta ^t} \sum _{j=1}^{N(t)} w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_i^t ,X^{(k)}\right) - \frac{1}{\eta ^t} \sum _{j=1}^{N(t)} w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)}\right) \nonumber \\&\quad \ge \frac{1}{\eta ^t} \sum _{j=1}^{N(t)} w_j \gamma _j^t \left( d^*\left( X^{(k)},X^{(k')}\right) - \widehat{d}^{*}\left( {\mathbf {x}}_i^t, X^{(k')} \right) \right) \nonumber \\&\qquad -\, \frac{1}{\eta ^t} \sum _{j=1}^{N(t)} w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_{c_k^j}^t ,X^{(k)}\right) \nonumber \\&\quad \ge \delta _{\min } - 2\varepsilon \left( 2 + \frac{2B(N(t))M(N(t))}{w_{m(N(t))} \delta _{\min }}\right) . \end{aligned}$$
(52)

On the other hand, for any \(N\in {\mathbb {N}}\), by using the wide-sense ergodicity, there is T(N) such that for all \(t\ge T(N)\),

$$\begin{aligned} \max _{\begin{array}{c} k \in \{1,\ldots ,\kappa \} \\ i \in G_k \cap \left\{ 1,\ldots ,N \right\} \end{array} } \widehat{d}^{*}\left( {\mathbf {x}}_i^t, X^{(k)}\right) \leqslant \varepsilon . \end{aligned}$$
(53)

Since \(\varepsilon \) can be chosen arbitrarily small, it follows from (52) and (53) that

$$\begin{aligned} {\mathop {{{\,\mathrm{argmin}\,}}}\limits _{k \in \{1,\ldots ,\kappa \}}} \frac{1}{\eta ^t} \sum _{j=1}^{N(t)} w_j \gamma _j^t \widehat{d}^{*}\left( {\mathbf {x}}_i^t ,{\mathbf {x}}_{c_k^j}^t\right) = k' \end{aligned}$$
(54)

holds almost surely for all \(i=1,\ldots ,N\) and all \(t\ge \max \{T,T(N)\}\). Theorem 2 is proved. \(\square \)

We next discuss the complexity costs of the two algorithms above.

  1.

    For the offline setting, our Algorithm 1 requires \(N(N-1)/2\) calculations of \(\widehat{d}^{*}\), against \(\kappa N\) calculations of \(\widehat{d}\) in the offline algorithm of Khaleghi et al. (2016). In each evaluation of \(\widehat{d}^{*}\), the matrix distance \(\rho ^*\) consists of \(m_n^2\) calculations of Euclidean distances. Iterating over m and l in \(\widehat{d}^{*}\), we see that at most \({\mathcal {O}}(nm_n^3)\) computations of Euclidean distances are needed, against \({\mathcal {O}}(nm_n/|\log s|)\) computations of \({\hat{d}}\) for the offline algorithm in Khaleghi et al. (2016) (a sketch of this computation is given after this list), where

    $$\begin{aligned} s=\min _{\begin{array}{c} X_i^{(1)}\ne X_j^{(2)} \\ i\in \{1,\ldots ,n_1\};j\in \{1,\ldots ,n_2\} \end{array}}\left| X_i^{(1)}-X_j^{(2)}\right| . \end{aligned}$$

    It is known that an efficient searching algorithm can be utilized to determine s, with at most \({\mathcal {O}}(n\log (n))\) (\(n=\min \{n_1,n_2\}\)) computations. Therefore our Algorithm 1 is computationally competitive with the one in Khaleghi et al. (2016).

  2.

    For the online setting, a discussion similar to that in Khaleghi et al. (2016), Section 5.1, applies. There it is shown that the computational complexity of the updates of \(\widehat{d}^{*}\) for both our Algorithm 2 and the online algorithm in Khaleghi et al. (2016) is at most \({\mathcal {O}}(N(t)^2+N(t)\log ^3n(t))\) (here we take \(m_{n(t)}=\lfloor \log n(t)\rfloor \)). Therefore the overall difference in computational complexity between the two algorithms is reflected by the complexity of computing \(\widehat{d}^{*}\) and \(\widehat{d}\) (see Point 1).
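
To make the complexity count in Point 1 concrete, the following minimal Python sketch evaluates a dissimilarity of the form of \(\widehat{d}^{*}\) between two sample paths. The helper names `nu_star` and `rho_star` are our own stand-ins: we assume, as in the display for \(\widehat{d^{**}}\) in Sect. 3, that \(\widehat{d}^{*}\) is a weighted double sum over m and l of \(\rho ^*\) applied to empirical covariance matrices \(\nu ^*\), and we take \(\rho ^*\) to be an entrywise squared distance; the precise definitions are given earlier in the paper, and this is not the authors' MATLAB implementation.

```python
import numpy as np

def w(j):
    # Summable weights; the simulations in Sect. 4 use w_j = 1/(j(j+1)).
    return 1.0 / (j * (j + 1))

def nu_star(x, m, l):
    # Assumed form of nu^*: empirical m x m covariance matrix of the
    # m-dimensional sliding windows of the segment x_l, ..., x_n (l is 1-based).
    seg = np.asarray(x, dtype=float)[l - 1:]
    windows = np.array([seg[i:i + m] for i in range(len(seg) - m + 1)])
    return np.atleast_2d(np.cov(windows, rowvar=False, bias=True))

def rho_star(A, B):
    # Assumed form of rho^*: entrywise squared (Euclidean-type) matrix distance.
    return float(np.sum((A - B) ** 2))

def d_star_hat(x, y):
    # Weighted double sum over m = 1..m_n and l = 1..n-m+1; the rho^* calls
    # alone account for the O(n * m_n^3) Euclidean-distance count in Point 1.
    n = min(len(x), len(y))
    m_n = max(1, int(np.floor(np.log(n))))
    total = 0.0
    for m in range(1, m_n + 1):
        for l in range(1, n - m + 2):
            total += w(m) * w(l) * rho_star(nu_star(x[:n], m, l),
                                            nu_star(y[:n], m, l))
    return total

# Example: d_star_hat(np.random.randn(200), np.random.randn(200))
```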

2.3 Efficient dissimilarity measure

Kleinberg (2003) presented a set of three simple properties that a good clustering function should satisfy: scale-invariance, richness and consistency. He further demonstrated that no clustering function can satisfy all three properties simultaneously. He pointed out, as one particular example, that centroid-based clustering does not satisfy the consistency property (note that this is a different concept from our asymptotic consistency). In this section we show that, although the consistency property cannot be achieved, there is another criterion for the efficiency of a dissimilarity measure in a particular setting: the so-called efficient dissimilarity measure.

Definition 7

(Efficient dissimilarity measure) Assume that the sample paths in \(S=\{{\mathbf {x}}(\xi ):~\xi \in {\mathcal {H}}\}\) (\({\mathcal {H}}\subset {\mathbb {R}}^q\) for some \(q\in {\mathbb {N}}\)) are indexed by a set of real-valued parameters \(\xi \). Then a clustering function is called efficient if its dissimilarity measure d satisfies the following: there exists \(c>0\) such that for any \({\mathbf {x}}(\xi _1),{\mathbf {x}}(\xi _2)\in S\),

$$\begin{aligned} d({\mathbf {x}}(\xi _1),{\mathbf {x}}(\xi _2))=c\Vert \xi _1-\xi _2\Vert , \end{aligned}$$

where \(\Vert \cdot \Vert \) denotes some norm defined over \({\mathbb {R}}^q\).
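
For instance (an illustrative example of ours, not from the original text): if the paths are indexed by a scalar parameter \(\xi \in {\mathcal {H}}\subset {\mathbb {R}}\) and the dissimilarity measure doubles the parameter gap,

$$\begin{aligned} d({\mathbf {x}}(\xi _1),{\mathbf {x}}(\xi _2))=2|\xi _1-\xi _2|, \end{aligned}$$

then d is efficient with \(c=2\) and \(\Vert \cdot \Vert \) the absolute value on \({\mathbb {R}}\).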

Mathematically, an efficient dissimilarity measure is a metric induced by some norm. Clustering processes based on an efficient dissimilarity measure is then equivalent to clustering under classical distances in \({\mathbb {R}}^q\), such as the Euclidean, Manhattan or Minkowski distance. The latter setting has well-known advantages in cluster analysis. For example, the Euclidean distance performs well on datasets that contain compact or isolated clusters (Jain and Mao 1996; Jain et al. 1999); the Manhattan distance can be used when the clusters are hyper-rectangular in shape (Xu and Wunsch 2005); the Minkowski distance, which includes the Euclidean and Manhattan distances as particular cases, can be utilized in more general clustering problems (Wilson and Martinez 1997). There is a rich literature comparing these three distances and discussing their respective advantages and drawbacks; we refer to Hirkhorshidi et al. (2015) and the references therein.

In the next section we present a detailed example to show how to improve the efficiency of our consistent algorithms for clustering self-similar processes with wide-sense stationary ergodic increments.

3 Self-similar processes and logarithmic transformation

In this section we introduce a non-linear transformation of the covariance matrices in \(\widehat{d}^{*}\), in order to improve the efficiency of clustering. This transformation is based on the logarithmic function. We use one example to explain how it works: we show that this transformation maps \(\widehat{d}^{*}\) to a covariance-based dissimilarity measure close to an efficient one, when applied to clustering self-similar processes.

Definition 8

(Self-similar process, see Samorodnitsky and Taqqu (1994)) A process \(X^{(H)}=\{X_t^{(H)}\}_{t\in T}\) (e.g., \(T={\mathbb {R}}\) or \({\mathbb {Z}}\)) is self-similar with index \(H\in (0,1)\) if, for all \(n\in {\mathbb {N}}\), all \(t_1,\ldots ,t_n\in T\), and all \(c\ne 0\) such that \(ct_i\in T\) (\(i=1,\ldots ,n\)),

$$\begin{aligned} \Big (X_{t_1}^{(H)},\ldots ,X_{t_n}^{(H)}\Big ){\mathop {=}\limits ^{law}}\Big (|c|^{-H}X_{ct_1}^{(H)},\ldots ,|c|^{-H}X_{ct_n}^{(H)}\Big ). \end{aligned}$$

It can be shown that a self-similar process necessarily has zero mean and that its covariance structure is indexed by its self-similarity index H, in the following way (Embrechts and Maejima 2000).

Theorem 3

Let \(\big \{X_t^{(H)}\big \}_{t\in T}\) be a zero-mean self-similar process with index \(H\in (0,1)\) and with wide-sense stationary ergodic increments. Assume \({\mathbb {E}}|X_1^{(H)}|^2<+\infty \), then for any \(s,t\in T\),

$$\begin{aligned} {\mathbb {C}}ov\left( X_s^{(H)}, X_t^{(H)} \right) =\frac{{\mathbb {E}}|X_1^{(H)}|^2}{2}\left( |s|^{2H}+|t|^{2H}-|s-t|^{2H}\right) . \end{aligned}$$

The corollary below follows.

Corollary 1

Let \(\{X_t^{(H)}\}_{t\in T}\) be a zero-mean self-similar process with index H and weakly stationary increments. Assume \({\mathbb {E}}|X_1^{(H)}|^2<+\infty \). For \(h>0\) small enough, define the increment process \(Z_h^{(H)}(s)=X_{s+h}^{(H)}-X_s^{(H)}\), then for \(s,t\in T\) such that \(s-t\ge h\), we have

$$\begin{aligned} {\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t) \right) =\frac{{\mathbb {E}}|X_1^{(H)}|^2}{2} \left( (s-t-h)^{2H}+(s-t+h)^{2H}-2(s-t)^{2H} \right) .\nonumber \\ \end{aligned}$$
(55)

Applying the mean value theorem three times to (55) leads to

$$\begin{aligned}&{\mathbb {C}}ov\left( Z_h^{(H)}(s), Z_h^{(H)}(t) \right) =H{\mathbb {E}} |X_1^{(H)}|^2\left( (v_1^{(H)})^{2H-1}-(v_2^{(H)})^{2H-1}\right) h\nonumber \\&\quad =H(2H-1){\mathbb {E}} |X_1^{(H)}|^2(v^{(H)})^{2H-2}h, \end{aligned}$$
(56)

for some \(v_1^{(H)}\in (s-t,s-t+h)\), \(v_2^{(H)}\in (s-t-h,s-t)\) and \(v^{(H)}\in (v_2^{(H)},v_1^{(H)})\). We see that \({\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \) is a non-linear function of H. Next we look for a function g such that \(g\left( {\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \right) \) depends linearly on H. To this end we introduce the following \(\log ^*\)-transformation: for \(x\in {\mathbb {R}}\), define

$$\begin{aligned} \log ^*(x):={{\,\mathrm{sgn}\,}}(x)\log |x|=\left\{ \begin{array}{l@{\quad }l} \log (x)&{}\text{ if } x>0;\\ -\log (-x)&{}\text{ if } x<0;\\ 0&{}\text{ if } x=0. \end{array}\right. \end{aligned}$$

The introduction of the \(\log ^*\)-transformation is driven by the following two motivations:

Motivation 1 :

The \(\log ^*\) function transforms the current dissimilarity measure into one that depends “linearly” on the index H.

Motivation 2 :

The value \(\log ^*(x)\) preserves the sign of x, so that a larger distance between x and y yields a larger distance between \(\log ^*(x)\) and \(\log ^*(y)\).

Applying \(\log ^*\)-transformation to the covariances of \(Z_h^{(H)}\) given in (56), we obtain

$$\begin{aligned}&\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \right) \\&\quad ={{\,\mathrm{sgn}\,}}(2H-1)\left( (2H-2)\log v^{(H)}+\log h+\log (H|1-2H|{\mathbb {V}}ar(X_1^{(H)}))\right) . \end{aligned}$$

When \(v^{(H)}\) and h are small, the terms \(\log v^{(H)}\) and \(\log h\) are large in absolute value, so \(\log (H|1-2H|{\mathbb {V}}ar(X_1^{(H)}))\) becomes negligible. Thus we can write

$$\begin{aligned} \log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \right) \approx {{\,\mathrm{sgn}\,}}(2H-1)\left( (2H-2)\log v^{(H)}+\log h\right) . \end{aligned}$$

In conclusion,

  • When \(H_1,H_2\in (0,1/2]\) or \(H_1,H_2\in [1/2,1)\), the quantity \(\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \right) \) is “approximately linear” in H on \((0,1/2]\) or on \([1/2,1)\).

    Using the approximation \(\log v^{(H_1)}\approx \log v^{(H_2)}\) for \(H_1,H_2\in (0,1/2]\) or \(H_1,H_2\in [1/2,1)\), we have

    $$\begin{aligned}&\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_1)}(s),Z_h^{(H_1)}(t)\right) \right) -\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_2)}(s),Z_h^{(H_2)}(t)\right) \right) \\&\quad \approx 2{{\,\mathrm{sgn}\,}}(2H_1-1)(H_1-H_2)\log v^{(H_1)}. \end{aligned}$$
  • When \(H_1\in (0,1/2]\) and \(H_2\in (1/2,1)\), the difference \(\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_1)}(s),Z_h^{(H_1)}(t)\right) \right) -\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_2)}(s),Z_h^{(H_2)}(t)\right) \right) \) turns out to be relatively large, because we have

    $$\begin{aligned}&\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_1)}(s),Z_h^{(H_1)}(t)\right) \right) -\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_2)}(s),Z_h^{(H_2)}(t)\right) \right) \\&\quad \approx -(2H_1-2)\log v^{(H_1)}-(2H_2-2)\log v^{(H_2)}\\&\quad \ge 2(2-H_1-H_2)\min \left\{ \log v^{(H_1)},\log v^{(H_2)}\right\} . \end{aligned}$$

Taking advantage of the above facts we define the new empirical covariance-based dissimilarity measure (based on the definition (12)) to be

$$\begin{aligned} \widehat{d^{**}}({\mathbf {z}}_{1},{\mathbf {z}}_{2}):=\sum _{m= 1}^{m_n} \sum _{l= 1}^{n-m+1} w_m w_l \rho ^*\left( \nu ^{**}(Z^{(H_1)}_{l\ldots n},m), \nu ^{**} (Z^{(H_2)}_{l\ldots n},m)\right) , \end{aligned}$$

where \(\nu ^{**}(Z^{(H_1)}_{l\ldots n},m)\) is the empirical covariance matrix \(\nu ^*(Z^{(H_1)}_{l\ldots n},m)\) of \(Z_h^{(H_1)}\) with each of its entries transformed by \(\log ^*\): for an arbitrary real-valued matrix \(M=\{M_{i,j}\}_{i=1,\ldots ,m;~j=1,\ldots ,n}\), define

$$\begin{aligned} \log ^*M:=\left\{ \log ^*M_{ij}\right\} _{i=1,\ldots ,m;~j=1,\ldots ,n}. \end{aligned}$$

Then we have

$$\begin{aligned} \nu ^{**}(Z^{(H_1)}_{l\ldots n},m):=\log ^*\left( \nu ^{*}(Z^{(H_1)}_{l\ldots n},m)\right) . \end{aligned}$$
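
The \(\log ^*\)-transformation is straightforward to implement. The following Python sketch (with our own helper name, not the authors' MATLAB code) applies \(\log ^*\) entrywise to a real matrix, as in the definition of \(\log ^*M\) above; replacing \(\nu ^*\) by \(\log ^*(\nu ^*)\) in the earlier complexity sketch of \(\widehat{d}^{*}\) then yields \(\widehat{d^{**}}\).

```python
import numpy as np

def log_star(x):
    # Elementwise log*(x) = sgn(x) * log|x|, with the convention log*(0) = 0.
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nonzero = x != 0
    out[nonzero] = np.sign(x[nonzero]) * np.log(np.abs(x[nonzero]))
    return out

# Toy covariance matrix: the transform acts entry by entry.
V = np.array([[2.0, -1.5],
              [-1.5, 2.0]])
print(log_star(V))
# [[ 0.6931... -0.4054...]
#  [-0.4054...  0.6931...]]
```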

Now, given two wide-sense stationary ergodic processes \(X^{(1)}\), \(X^{(2)}\), we choose \(\{w_j\}_{j\in {\mathbb {N}}}\) to satisfy

$$\begin{aligned} \sum _{m,l=1}^{\infty } w_m w_l \rho ^*\left( \log ^*(V_{l,l+m-1}(X^{(1)})),\log ^*(V_{l,l+m-1}(X^{(2)}))\right) <+\infty , \end{aligned}$$
(57)

where we denote by

$$\begin{aligned} V_{l,l+m-1}(X^{(1)}):={\mathbb {C}}ov\left( X_l^{(1)},\ldots ,X_{l+m-1}^{(1)}\right) . \end{aligned}$$

Then define the \(\log ^*\)-transformation of the covariance-based dissimilarity measure to be

$$\begin{aligned} d^{**}(X^{(1)},X^{(2)}):=\sum _{m,l=1}^{\infty } w_m w_l \rho ^*\left( \log ^*(V_{l,l+m-1}(X^{(1)})),\log ^*(V_{l,l+m-1}(X^{(2)}))\right) . \end{aligned}$$
(58)

Using the fact that \(\log ^*\) is continuous over \({\mathbb {R}}\backslash \{0\}\) and the wide-sense ergodicity of \(Z_h^{(H)}\), we have the following convergence:

$$\begin{aligned} \widehat{d^{**}}({\mathbf {z}}_{1},{\mathbf {z}}_{2})\xrightarrow [n\rightarrow \infty ]{a.s.}d^{**}\big (Z_h^{(H_1)},Z_h^{(H_2)}\big ). \end{aligned}$$

Unlike \(\widehat{d}^{*}\), the dissimilarity measure \(\widehat{d^{**}}\) is approximately linear with respect to the self-similarity index H. Indeed, it is easy to see that

$$\begin{aligned} \widehat{d^{**}}({\mathbf {z}}_{1},{\mathbf {z}}_{2})\sim \left\{ \begin{array}{l@{\quad }l} |H_1-H_2|<1,&{}\text{ for } H_1,H_2\in (0,1/2]\hbox { or }H_1,H_2\in [1/2,1);\\ 2(2-H_1-H_2)>1,&{}\text{ for } H_1\in (0,1/2)\hbox { and }H_2\in [1/2,1), \end{array}\right. \end{aligned}$$
(59)

where \(H_1,H_2\) correspond to the self-similarity indexes of \(X^{(H_1)},X^{(H_2)}\) respectively. In fact, from (59) we can say that \(\widehat{d^{**}}\) satisfies Definition 7 in a wide sense: it depends approximately linearly on \(|H_1-H_2|\) when \(H_1,H_2\) belong to the same interval among \((0,1/2]\) and \([1/2,1)\); it is approximately larger than \(|H_1-H_2|\) when \(H_1,H_2\) belong to different intervals among \((0,1/2]\) and \([1/2,1)\). This fact allows our asymptotically consistent algorithms to be more efficient when clustering self-similar processes with weakly stationary increments having different values of H. In Sect. 4.2 we provide an example of clustering using our consistent algorithms with and without the \(\log ^*\)-transformation, when the observed paths are from a well-known self-similar process with stationary increments: fractional Brownian motion.

4 Simulation and empirical study

This section is devoted to applying our clustering algorithms to several synthetic and real-world datasets. It is worth noting that, in our statistical setting, the auto-covariance functions are assumed to be unavailable, so the prior choice of the weights \(w_j\) presents a trade-off between the convergence of the dissimilarity measure and its practical use. On one hand, a slowly decaying choice (e.g. \(w_j=1/j(j+1)\)) risks yielding a divergent dissimilarity measure \(d^*\) [see (5)]. On the other hand, a rapidly decaying choice (e.g. \(w_j=1/j^3(j+1)^3\)) effectively makes use of only the first few observations in the sample paths. We believe that the first issue is a minor one in practice, because for most wide-sense stationary ergodic processes (especially Gaussian ones) taking \(w_j=1/j(j+1)\) leads to a convergent \(d^*\). Also, in practice, instead of (5) one may use

$$\begin{aligned} d^*\left( X^{(1)},X^{(2)}\right) := \sum _{m,l = 1}^{N} w_m w_l \rho ^*\left( V_{l,l+m-1}(X^{(1)}),V_{l,l+m-1}(X^{(2)})\right) , \end{aligned}$$

for some N large enough.

Therefore, throughout this section we take \(w_j=1/j(j+1)\) and \(m_n=\lfloor \log n\rfloor \) (recall that \(\lfloor \cdot \rfloor \) denotes the floor function) in the covariance-based dissimilarity measure \(\widehat{d}^{*}\). Next we explain how the offline and online datasets in this simulation study are prepared.

  • [Offline dataset simulation] For each scenario, we simulate 5 groups of sample paths, each consisting of 10 paths with length \(N(t)=5t\), for the time steps \(t=1,2,\ldots ,50\). Algorithm 1 is performed over 100 such scenarios, and the misclassification rate is calculated.

  • [Online dataset simulation] For each scenario, we simulate 5 groups of sample paths. Let the total number of sample paths be \(N(t) = 30 + \lfloor (t-1)/10 \rfloor \) at each time step t. That is, there are 6 sample paths in each of the 5 groups when \(t=1\), and the number of sample paths in each group increases by 1 every 10 time steps. For \(i=1,2,\ldots \), the \(i\hbox {th}\) sample path in each group has length \(n_i(t) = 5[t-(i-6)^+]\), where \(x^+ = \max (x,0)\).

We then apply the proposed clustering algorithms in both offline and online settings and determine their corresponding misclassification rates. These misclassification rates are used to illustrate the asymptotic consistency of our clustering algorithms, or to compare the performance of our clustering approaches with others. Recall that the misclassification rate (i.e. mean clustering error rate, see Section 6 in Khaleghi et al. 2016) is obtained by dividing the number of misclassified paths by the total number of paths in each scenario, and then averaging these fractions over all scenarios:

$$\begin{aligned} p:=Ave\left( \frac{\#~\text{ of } \text{ misclassified } \text{ sample } \text{ paths }}{\#~\text{ of } \text{ total } \text{ sample } \text{ paths } \text{ collected }}\right) . \end{aligned}$$

More precisely, let \((C_1,\ldots ,C_\kappa )\) denote the ground truth clusters of the N sample paths \({\mathrm {x}}_1,\ldots , {\mathrm {x}}_N\). We define the ground truth cluster labels by

$$\begin{aligned} L_k = \underbrace{(k,\ldots ,k)}_{\# C_k~\text{ times }},~\text{ for }~k=1,\ldots ,\kappa . \end{aligned}$$

Let \((l_1,\ldots ,l_N)\) denote the cluster labels of \(({\mathrm {x}}_1,\ldots ,{\mathrm {x}}_N)\) output by some clustering approach. Then the misclassification rate p of this approach is computed by

$$\begin{aligned} p=\min _{\begin{array}{c} \sigma \in S_{\kappa }\\ (\pi _1,\ldots ,\pi _N)=(L_{\sigma (1)},\ldots ,L_{\sigma (\kappa )}) \end{array}}\frac{\sum \limits _{i=1}^N{\mathbb {1}_{\{\pi _i\ne l_i\}}}}{N}, \end{aligned}$$
(60)

where \(S_\kappa \) denotes the group of all permutations of the set \(\{1,\ldots ,\kappa \}\); that is, \((\pi _1,\ldots ,\pi _N)\) is the ground-truth labeling in which each cluster label k is replaced by \(\sigma (k)\).

For example, in one scenario of 7 sample paths, if the ground truth cluster labels of \(({\mathrm {x}}_1,\ldots ,{\mathrm {x}}_7)\) satisfy

$$\begin{aligned} \left( L_1, L_2, L_3\right) = ((1,1),(2),(3,3,3,3)), \end{aligned}$$

while the cluster labels output by the clustering algorithm for \(({\mathrm {x}}_1,\ldots ,{\mathrm {x}}_7)\) are given by

$$\begin{aligned} \left( l_1,\ldots ,l_7\right) = \left( 2,1,1,2,3,2,1\right) , \end{aligned}$$

then according to Eq. (60), the misclassification rate is 4/7. This can be explained as follows: at least 4 label changes are needed for the output cluster labels to match the relabeled ground-truth labels (1, 1, 3, 2, 2, 2, 2):

$$\begin{aligned} l_1\leftarrow {} 1;~l_3\leftarrow {} 3;~l_5\leftarrow {} 2;~l_7\leftarrow {} 2. \end{aligned}$$

We provide a MATLAB implementation of the misclassification rate [see Eq. (60)] publicly online as misclassify_rate.m.
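
For readers who prefer Python, the following sketch computes the same quantity, interpreting the minimization in (60) as ranging over relabelings of the ground-truth clusters (which reproduces the worked example above); it is not the authors' misclassify_rate.m.

```python
import numpy as np
from itertools import permutations

def misclassification_rate(true_labels, output_labels, kappa):
    # Minimum fraction of mismatches over all relabelings sigma of {1, ..., kappa}
    # applied to the ground-truth cluster labels, as in Eq. (60).
    true_labels = np.asarray(true_labels)
    output_labels = np.asarray(output_labels)
    N = len(true_labels)
    best = N
    for sigma in permutations(range(1, kappa + 1)):
        relabeled = np.array([sigma[k - 1] for k in true_labels])
        best = min(best, int(np.sum(relabeled != output_labels)))
    return best / N

# Worked example from the text: ground truth (1,1,2,3,3,3,3), output (2,1,1,2,3,2,1).
p = misclassification_rate([1, 1, 2, 3, 3, 3, 3], [2, 1, 1, 2, 3, 2, 1], kappa=3)
print(p)  # 0.5714..., i.e. 4/7
```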

4.1 Clustering non-Gaussian discrete-time stochastic processes

In Khaleghi et al. (2016) a simulation study on a non-Gaussian strictly stationary ergodic discrete-time stochastic process (see also Shields (1996)) was performed. Since this process has a finite covariance structure, it is also wide-sense stationary ergodic. As a result we can test our clustering algorithms on the same dataset and compare their performance with that in Khaleghi et al. (2016). Recall that this process \(\{X_t\}_{t\in {\mathbb {N}}}\) is generated in the following way. Fix some irrational-valued parameter \(\alpha \in (0,1)\).

  • Step 1. Draw a uniform random number \(r_0 \in [0,1]\).

  • Step 2. For each index \(i=1,2,\ldots ,N\):

    • Step 2.1. Define \(r_i = r_{i-1} + \alpha - \lfloor r_{i-1} + \alpha \rfloor .\)

    • Step 2.2. Define \(X_i = {\left\{ \begin{array}{ll} 1 &{} \text {when } r_i > 0.5, \\ 0 &{} \text {otherwise}. \end{array}\right. } \)

We simulate 5 groups of sample paths of \(\{X_t\}_{t\in {\mathbb {N}}}\) indexed by the irrational values \(\alpha _1 = 0.31\ldots \), \(\alpha _2 = 0.33\ldots \), \(\alpha _3 = 0.35\ldots \), \(\alpha _4 = 0.37\ldots \), \(\alpha _5 = 0.39\ldots \) (each \(\alpha _i\), \(i=1,\ldots ,5\), is represented by a longdouble with a long mantissa, see Khaleghi et al. (2016)), respectively. A sketch of this generator is given below.
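
A minimal Python sketch of this generator (function names are ours; the study's actual data were produced with long-mantissa irrational \(\alpha _i\), truncated here for readability):

```python
import numpy as np

def simulate_path(alpha, N, rng=None):
    # Steps 1-2: irrational rotation r_i = r_{i-1} + alpha (mod 1),
    # thresholded at 0.5 to produce the binary observations X_i.
    rng = np.random.default_rng() if rng is None else rng
    r = rng.uniform(0.0, 1.0)                 # Step 1: r_0 ~ Uniform[0, 1]
    X = np.empty(N, dtype=int)
    for i in range(N):
        r = (r + alpha) % 1.0                 # Step 2.1
        X[i] = 1 if r > 0.5 else 0            # Step 2.2
    return X

# Five groups of 10 paths each (offline setting: length 5t with t up to 50).
alphas = [0.31, 0.33, 0.35, 0.37, 0.39]       # truncated irrational values
paths = [[simulate_path(a, N=250) for _ in range(10)] for a in alphas]
```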

4.1.1 Offline dataset

We demonstrate the asymptotic consistency of Algorithm 1 by conducting offline clustering on the simulated offline datasets of \(\{X_i\}_{i\in {\mathbb {N}}}\).

The solid blue line in Fig. 1 illustrates the asymptotic consistency of Algorithm 1 through the fact that its misclassification rate decreases as time t increases. Compared to the simulation study on the same dataset in Khaleghi et al. (2016), the misclassification rate provided by our proposed algorithm converges at a comparable speed (see Figure 2 in Khaleghi et al. 2016), even though Algorithm 1 aims to cluster “covariance structures” rather than “process distributions”.

The dot-dashed red line in Fig. 1 presents the performance of Algorithm 2 and compares its misclassification rates with those of Algorithm 1. Applied to the offline dataset, the offline algorithm’s misclassification rates are consistently lower than those of the online algorithm; i.e., the offline clustering algorithm performs better than the online one when dealing with offline datasets.

Fig. 1

The graph compares the misclassification rates of Algorithms 1 and 2 applied to the offline dataset of non-Gaussian discrete-time processes. 100 runs are performed at each time step t to compute the misclassification rate

4.1.2 Online dataset

In our simulated online datasets the number of sample paths and the length of each sample path increase as t increases. This setting mimics situations such as modeling financial asset prices, where new assets are launched over time. The offline and online clustering algorithms are applied at each time t with 100 runs; their misclassification rates at each time t are then obtained.

Figure 2 compares the misclassification rates of the offline and online algorithms applied to the online dataset described above. The periodic pattern, in which the misclassification rate of the offline algorithm increases every 10 time steps, matches the timing of adding new observations: the misclassification rate spikes whenever new observations arrive. We observe that the misclassification rate of the online algorithm is overall lower than that of the offline algorithm on this dataset, reflecting the advantage of the online algorithm over the offline one when new observations are expected to arrive. It is worth pointing out that our online setting is different from the one in Khaleghi et al. (2016); therefore the two clustering results are not comparable.

Finally, all the MATLAB code that reproduces the main conclusions in this subsection is publicly available online.

Fig. 2

The graph compares the misclassification rates of Algorithms 1 and 2 applied to the online dataset of non-Gaussian discrete-time processes. 100 runs are performed at each time step t to compute the misclassification rate

4.2 Clustering fractional Brownian motions

In this section, we present the performance of the proposed offline (Algorithm 1) and online (Algorithm 2) methods on a synthetic dataset sampled from continuous-time Gaussian processes. The wide-sense stationary ergodic processes that we choose are the first-order increment processes of fractional Brownian motions (see Mandelbrot and van Ness 1968). Denote by \(\{B^H(t)\}_{t\ge 0}\) a fractional Brownian motion with Hurst index \(H\in (0,1)\). It is well known that \(B^H\) is a zero-mean self-similar Gaussian process with self-similarity index H and with covariance function

$$\begin{aligned} {\mathbb {C}}ov\left( B^H(s),B^H(t)\right) =\frac{1}{2}\left( s^{2H}+t^{2H}-|s-t|^{2H}\right) ,~\text{ for } s,t\ge 0. \end{aligned}$$
(61)

Fix \(h>0\) and define its increment process (with time step h) to be

$$\begin{aligned} Z_h^{(H)}(t)=B^H(t+h)-B^H(t),~\text{ for } t\ge 0. \end{aligned}$$

\(Z_h^{(H)}\) is also called fractional Gaussian noise. Using the covariance function (61) we obtain the auto-covariance function of \(Z_h^{(H)}\) below: for \(\tau \ge 0\),

$$\begin{aligned} \gamma (\tau )={\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(s+\tau )\right) =\frac{1}{2}\left( |\tau +h|^{2H}+|\tau -h|^{2H}-2|\tau |^{2H}\right) . \end{aligned}$$
(62)
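
For completeness (a short verification of ours), (62) follows from (61) and the bilinearity of the covariance:

$$\begin{aligned}&{\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(s+\tau )\right) \\&\quad ={\mathbb {C}}ov\left( B^H(s+h),B^H(s+\tau +h)\right) -{\mathbb {C}}ov\left( B^H(s+h),B^H(s+\tau )\right) \\&\qquad -\,{\mathbb {C}}ov\left( B^H(s),B^H(s+\tau +h)\right) +{\mathbb {C}}ov\left( B^H(s),B^H(s+\tau )\right) \\&\quad =\frac{1}{2}\left( |\tau +h|^{2H}+|\tau -h|^{2H}-2|\tau |^{2H}\right) , \end{aligned}$$

where each term is evaluated using (61) and the terms \((s+h)^{2H}\), \((s+\tau +h)^{2H}\), \(s^{2H}\) and \((s+\tau )^{2H}\) cancel.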

Recall that for stationary Gaussian processes such as \(Z_h^{(H)}\), strict ergodicity can be fully characterized in terms of the auto-covariance function \(\gamma \); i.e., the following result (Maruyama 1970; Ślęzak 2017) provides a necessary and sufficient condition for a stationary Gaussian process to be strictly ergodic.

Theorem 4

(Strict ergodicity of Gaussian processes) A continuous-time Gaussian stationary process X is strictly ergodic if and only if

$$\begin{aligned} \lim _{t\rightarrow \infty }\frac{1}{t}\int _0^t|\gamma _X(u)|\,\mathrm {d}u=0, \end{aligned}$$
(63)

where \(\gamma _X\) denotes the auto-covariance function of X.

In view of (62) we can deduce that the auto-covariance function \(\gamma \) of \(Z_h^{(H)}\) satisfies (63). This together with Theorem 4 yields that \(Z_h^{(H)}\) is second-order strict-sense stationary ergodic, so it is also wide-sense stationary ergodic.
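
This can also be checked numerically. The sketch below (our own helper names) evaluates the time average in (63) for the auto-covariance (62); the average visibly decreases towards 0 as t grows.

```python
import numpy as np

def gamma_fgn(tau, H, h=1.0):
    # Auto-covariance of Z_h^{(H)} from Eq. (62).
    tau = np.abs(np.asarray(tau, dtype=float))
    return 0.5 * (np.abs(tau + h) ** (2 * H) + np.abs(tau - h) ** (2 * H)
                  - 2.0 * tau ** (2 * H))

def ergodicity_average(t, H, h=1.0, num=200000):
    # Approximates (1/t) * integral_0^t |gamma(u)| du by a uniform-grid average.
    u = np.linspace(0.0, t, num)
    return float(np.mean(np.abs(gamma_fgn(u, H, h))))

for t in (10.0, 100.0, 1000.0):
    print(t, ergodicity_average(t, H=0.7))   # decays towards 0 as t increases
```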

To test our algorithms we simulate \(\kappa =5\) groups of independent fractional Brownian paths, with the \(i\hbox {th}\) group containing 10 paths as \(\{B^{H_i}(1/n),\ldots ,B^{H_i}((n-1)/n),B^{H_i}(1)\}\), for the self-similarity indexes

$$\begin{aligned} H_1=0.3,~H_2=0.4,~\ldots ,~H_5=0.7. \end{aligned}$$

Remark that clustering a zero-mean fractional Brownian motion \(B^H\) is equivalent to clustering its increments \(Z_{1/n}^{(H)}(t)=B^{H}(t+1/n)-B^H(t)\). These 50 observed paths of \(Z_{1/n}^{(H)}(t)\), each of length 150, compose an offline dataset and an online one. The clustering algorithms are applied to the dataset at each time step t, and 100 runs are made to compute the misclassification rates. We use the offline (resp. online) clustering algorithm to cluster the offline (resp. online) dataset. The purpose is to compare the algorithms with and without the \(\log ^*\)-transformation; a sketch of one standard way to generate such paths is given below.
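
As an illustration, one standard way to generate such paths (not necessarily the procedure used by the authors) is to draw the fractional Gaussian noise vector from its exact covariance (62) via a Cholesky factorization and then cumulate:

```python
import numpy as np

def simulate_fbm(H, n, rng=None):
    # Build the covariance of the increments Z_{1/n}^{(H)} on the grid from
    # Eq. (62) with h = 1/n, draw a Gaussian vector, and cumulate to obtain
    # B^H(1/n), ..., B^H(1).
    rng = np.random.default_rng() if rng is None else rng
    h = 1.0 / n
    tau = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) * h
    cov = 0.5 * ((tau + h) ** (2 * H) + np.abs(tau - h) ** (2 * H)
                 - 2.0 * tau ** (2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))   # small jitter for stability
    fgn = L @ rng.standard_normal(n)                  # increments Z_{1/n}^{(H)}
    return np.cumsum(fgn)                             # fBm values on the grid

# Five groups of 10 paths each, H = 0.3, 0.4, ..., 0.7, each path of length 150.
paths = {H: [simulate_fbm(H, n=150) for _ in range(10)]
         for H in (0.3, 0.4, 0.5, 0.6, 0.7)}
```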

Figure 3 compares, through the behavior of the misclassification rates as time increases, two algorithms: one using the dissimilarity measure \(\widehat{d}^{*}\), the other using the dissimilarity measure \(\widehat{d^{**}}\). We conclude that both algorithms, with and without the \(\log ^*\)-transformation, are asymptotically consistent. However, in both offline and online settings, the covariance-based algorithms with the \(\log ^*\)-transformation (dashed red lines) have 30% lower misclassification rates on average than the algorithms without the \(\log ^*\)-transformation (solid blue lines). This simulation study supports the use of the \(\log ^*\)-transformed covariance-based dissimilarity measure when the underlying observations have a nonlinear, in particular power-type, covariance structure, such as observations sampled from self-similar processes.

The MATLAB code used in this subsection is provided publicly online.

Fig. 3

The top graph illustrates the misclassification rates of the offline algorithm applied to offline datasets of increments of fractional Brownian motions. The bottom graph plots the misclassification rates of the online algorithm applied to online datasets

4.3 Clustering AR(1) processes: non strict-sense stationary ergodic

To show that our algorithms can be applied to clustering non strict-sense stationary ergodic processes, we consider a simulation study on the non-Gaussian AR(1) process \(\{Y(t)\}_t\) defined in Example 2, Eq. (1). We then conduct the cluster analysis with \(\kappa =5\), and specify the values of a in Eq. (1) as

$$\begin{aligned} a_1 = -0.4, \ a_2 = -0.15, \ a_3 = 0.1, \ a_4 = 0.35, \ a_5 = 0.6. \end{aligned}$$

We mimic the procedure in Sect. 4.2 to generate the offline and online datasets of \(\{Y(t)\}_t\). Figure 4 illustrates the convergence of both the offline and online algorithms under the different dataset settings.

All the MATLAB code that reproduces the main conclusions in this subsection is publicly available online.

Fig. 4

The top graph plots the misclassification rates of the (\(\log ^*\)) covariance-based dissimilarity measure as time increases, using the offline and online algorithms on the offline dataset. The bottom graph shows the misclassification rates of both algorithms on the online dataset

4.4 Application to the real world: clustering global equity markets

4.4.1 Data and methodology

In this section we apply the clustering algorithms to real-world datasets. The application consists in dividing the equity markets of major economic entities in the world into subgroups. In financial economics, researchers usually cluster global equity markets according to either geographical region or the development stage of the underlying economic entities. The reasoning behind these clustering methods is that entities with smaller geographical distance and closer development levels engage in more bilateral economic activities. Impacted by similar economic factors, entities with less “distance” tend to have higher correlation in stock market performance. This correlation then measures the level of “comovement” of stock market indexes on the global capital market.

However, globalization is breaking down the barriers of region and development level. For instance, in 2016 China became the largest trading partner of the U.S. (besides the EU). China is not a regional neighbor of the U.S., and is categorized as a developing country by the World Bank, as opposed to the U.S., a developed country.

We cluster the equity markets in the world according to the empirical covariance structure of their performance, using Algorithms 1 and 2 as proposed in this paper. Then we compare our clustering results with the traditional clustering methodologies. The index constituents of the MSCI ACWI (All Country World Index) are selected as the sample data. Each observation is a sample path representing the historical monthly returns of the underlying economic entity. Empirical studies have shown that these index returns exhibit the “long memory” path feature, hence they can be modeled by self-similar processes such as fractional Brownian motions (see e.g. Comte and Renault 1998; Bianchi and Pianese 2008). Therefore, similarly to Sect. 4.2, we may cluster the increments of the index returns with the \(\log ^*\)-transformed dissimilarity measure \(\widehat{d^{**}}\). The MSCI ACWI is the leading global equity market index and has $3.2 billion in underlying market capitalization. The MSCI ACWI contains 23 developed markets and 24 emerging markets from 4 regions: Americas, EMEA (Europe, Middle East and Africa), Pacific and Asia. Table 1 lists all markets included in this empirical study. We exclude the Greek market due to its bankruptcy after the global financial crisis.

We construct offline and online datasets starting from different dates. The offline dataset starts from Jan. 30, 2009, to exclude the financial crisis period in 2007 and 2008. This is because, during a global stock market crisis, the (downside) performance of equity markets is contagious and thus blurs the cluster analysis. The online dataset starts on Jan. 31, 1989, and thus covers the 1997 Asian financial crisis, the burst of the dot-com bubble in the early 2000s and the 2007 subprime mortgage crisis. Another key feature is that 14 markets have been added to the MSCI ACWI index (at different times) since 1989, including 1 developed market and 13 emerging markets. Therefore, the case where new time series are observed is handled by the online dataset.

Table 1 The categories of major equity markets in the MSCI ACWI (All Country World Index)

4.4.2 Clustering results

We compare the clustering outcomes on both the offline and online datasets with the partitions suggested by geographical region (4 groups) and by development level (2 groups). The factor yielding the lower misclassification rate is regarded as the one that contributes most to the covariance-based dissimilarity measure; in other words, it is the factor that drives the clustering of stock markets most significantly.

Table 2 shows that the misclassification rates by development level are significantly and consistently lower than those by geographical region, for both algorithms (offline and online) and both datasets (offline and online). The clustering results suggest that geographical distance is less dominant than the development level of the underlying economic entities when analyzing groups of equity markets.

Table 2 The misclassification rates of clustering algorithms on datasets, comparing to clusters suggested by geographical region and development levels

The overall minimum of the misclassification rate occurs when we use the online algorithm on the offline dataset. Table 3 presents the detailed clustering outcome in this case. In each group, the correctly and incorrectly categorized equity markets are listed respectively. For instance, the China (Mainland) market is correctly categorized along with the other emerging markets, while the Austrian market, although a developed market in the MSCI ACWI, is assigned to the group in which most of the equity markets are emerging markets. The misclassified markets in the emerging group are Austria, Finland, Italy, Norway and Spain; the misclassified markets in the developed group are Malaysia, the Philippines, Taiwan, Chile and Mexico. These empirical results suggest that several capital markets have irregular post-crisis performance, which blurs the barrier between emerging and developed markets.

Table 3 The clustering outcome of equity markets using offline dataset (starting from Jan. 30, 2009) and online algorithm

The contribution of this real-world cluster analysis is twofold. First, we identified the principal force behind the structural differences in global capital markets, which potentially predicts the “comovement” pattern of future index performance. Second, we provided new evidence on the impact of globalization in breaking down geographical barriers between economic entities.

5 Conclusion and future perspectives

Inspired by Khaleghi et al. (2016), we introduce the problem of clustering wide-sense stationary ergodic processes. A new covariance-based dissimilarity measure is proposed to obtain asymptotically consistent clustering algorithms for both offline and online settings. The recommended algorithms are competitive for at least two reasons:

  1.

    Our algorithms are applicable to clustering a wide class of stochastic processes, including any strict-sense stationary ergodic process with finite covariance structure.

  2.

    Our algorithms are efficient in terms of computational complexity. In particular, a so-called \(\log ^*\)-transformation is introduced to improve the efficiency of clustering self-similar processes.

The above advantages are supported by the simulation studies on non-Gaussian discrete-time processes, fractional Brownian motions and non-Gaussian, non strict-sense stationary ergodic AR(1) processes, as well as by a real-world application: clustering global equity markets. MATLAB implementations of our clustering algorithms are provided publicly online.

Finally, we note that the clustering framework proposed in this paper focuses on the case where the true number of clusters \(\kappa \) is known. The case where \(\kappa \) is unknown remains open and is left to future research. Another interesting problem is that many stochastic processes are not wide-sense stationary but are closely related to wide-sense stationarity. For example, a self-similar process does not necessarily have wide-sense stationary increments, but its Lamperti transformation is strict-sense stationary (Lamperti 1962); locally asymptotically self-similar processes are generally not self-similar, but their tangent processes are self-similar (Boufoussi et al. 2008). Our cluster analysis sheds light on clustering such processes. These topics are left for future research.