Abstract
We introduce a new unsupervised learning problem: clustering wide-sense stationary ergodic stochastic processes. A covariance-based dissimilarity measure together with asymptotically consistent algorithms is designed for clustering offline and online datasets, respectively. We also suggest a formal criterion for the efficiency of dissimilarity measures, and discuss an approach to improve the efficiency of our clustering algorithms when they are applied to particular types of processes, such as self-similar processes with wide-sense stationary ergodic increments. Clusterings of synthetic and real-world data are provided as examples of applications.
1 Introduction
Cluster analysis, as a core category of unsupervised learning techniques, allows one to discover hidden patterns in data where the true answer is not known upfront. Its goal is to partition a heterogeneous set of objects into non-overlapping clusters, such that within each cluster any two objects are more related to each other than to objects in other clusters. Given its exploratory nature, clustering nowadays has numerous applications in various fields of both industry and scientific research, such as biological and medical research (Damian et al. 2007; Zhao et al. 2014; Jääskinen et al. 2014), information technology (Jain et al. 1999; Slonim et al. 2005), signal and image processing (Rubinstein et al. 2013), geology (Juozapavičius and Rapsevicius 2001) and finance (Pavlidis et al. 2006; Bastos and Caiado 2014; Ieva et al. 2016). There is a rich literature on cluster analysis for random vectors, where the objects to be clustered are sampled from high-dimensional joint distributions, and there is no shortage of such clustering algorithms (Xu and Wunsch 2005). Stochastic processes, however, form quite a different setting from random vectors, since their observations (sample paths) are sampled from process distributions. While cluster analysis for random vectors has developed rapidly, clustering of stochastic processes has received much less attention. Today cluster analysis of stochastic processes deserves increasingly intense study, owing to its vital importance in many applied areas where the collected data are indexed by time and are especially long. Examples of such time-indexed data include biological, financial, marketing, surface weather, geological and video/audio data.
Recall that in the setting of random vectors, the process of clustering often consists of two steps:
 Step 1 :

One suggests a suitable dissimilarity measure to describe the distance between two objects, under which “two objects are close to each other” becomes meaningful.
 Step 2 :

One designs a sufficiently accurate and computationally efficient clustering function based on the above dissimilarity measure.
Clustering stochastic processes proceeds in a similar way, but new challenges may arise in both Step 1 and Step 2. Intuitively, one can always apply existing random-vector clustering approaches to arbitrary stochastic processes, such as non-hierarchical approaches (K-means clustering methods) and hierarchical approaches (agglomerative and divisive methods) (Hartigan 1975), based on “naive” dissimilarity measures (e.g., the Euclidean, Manhattan or Minkowski distances). However, one faces at least two potential risks when applying these approaches to clustering stochastic processes:
 Risk 1 :

These approaches may suffer from huge complexity costs, due to the great length of the sample paths. As a result, classical clustering algorithms are often computationally prohibitive (Ieva et al. 2016; Peng and Müller 2008).
 Risk 2 :

These approaches may suffer from overfitting. For example, clustering stationary or periodic processes based on the Euclidean distance between paths, without taking their path properties into account, will result in overfitting and poor clusters.
In summary, classical dissimilarity measures and clustering strategies may fail when clustering stochastic processes.
Fortunately, the complexity cost and the overfitting errors of clustering processes can be largely reduced if one is aware of the fact that, unlike an arbitrary random vector, a stochastic process often possesses fine path features (e.g., stationarity, the Markov property, self-similarity, sparsity, seasonality, etc.). An appropriate dissimilarity measure should then be chosen so as to capture these path features. Clustering is then performed by grouping two sample paths together if they are relatively close to each other under that particular dissimilarity measure. Below are some examples provided in the literature.
Peng and Müller (2008) proposed a dissimilarity measure between two special sample paths of processes. In their setting it is supposed that, for each path, only sparse and irregularly spaced measurements with additional measurement errors are available. Such features occur commonly in longitudinal studies and online trading data. Based on this particular dissimilarity measure, classification and cluster analysis can be performed. Ieva et al. (2016) developed a new algorithm for clustering multivariate and functional data, based on a covariance-based dissimilarity measure. Their attention is focused on the specific case of a set of observations from two populations whose probability distributions have equal means but differ in terms of covariances. Khaleghi et al. (2016) designed consistent algorithms for clustering strict-sense stationary ergodic processes [see the forthcoming Eq. (4) for the definition of strict-sense ergodicity], where the dissimilarity measure is a distance between process distributions. It is worth noting that the consistency of their algorithms is guaranteed by the assumption of strict-sense ergodicity.
In this framework, we aim to design asymptotically consistent algorithms for clustering a general class of stochastic processes, namely wide-sense stationary ergodic processes (see Definition 1 below). Asymptotically consistent algorithms can be obtained in this setting, since covariance stationarity and ergodicity allow the processes to exhibit characteristic asymptotic behavior with respect to their length, rather than with respect to the total number of paths.
Definition 1
(Wide-sense stationary ergodic process) A stochastic process \(X=\{X_t\}_{t\in T}\) (the time index set T can be either \({\mathbb {R}}_+=[0,+\infty )\) or \({\mathbb {N}}=\{1,2,\ldots \}\)) is called wide-sense stationary if its mean and covariance structure are finite and time-invariant: \({\mathbb {E}}(X_t)=\mu \) for any \(t\in T\), and for any subset \((X_{i_1},\ldots ,X_{i_r})\), its covariance matrix remains invariant under any time shift \(h>0\):
$$\begin{aligned} {\mathbb {C}}ov\big (X_{i_1+h},\ldots ,X_{i_r+h}\big )={\mathbb {C}}ov\big (X_{i_1},\ldots ,X_{i_r}\big ). \end{aligned}$$
Denote by \(\gamma \) the autocovariance function of X. Then X is further called weakly ergodic (or wide-sense ergodic) if it is ergodic for the mean and the second-order moment:

If X is a continuoustime process (e.g., \(T={\mathbb {R}}_+\)), then it satisfies for any \(s\in {\mathbb {R}}_+\),
$$\begin{aligned} \frac{1}{h}\int _s^{s+h}X_u\,\mathrm {d}u\xrightarrow [h\rightarrow +\infty ]{a.s.}\mu , \end{aligned}$$and
$$\begin{aligned} \frac{1}{h}\int _s^{s+h}(X_{u+\tau }-\mu )(X_u-\mu )\,\mathrm {d}u\xrightarrow [h\rightarrow +\infty ]{a.s.}\gamma (\tau ),~\text{ for } \text{ all } \tau \in {\mathbb {R}}_+, \end{aligned}$$where \(\xrightarrow []{a.s.}\) denotes almost sure convergence (convergence with probability 1).

If X is a discretetime process (e.g., \(T={\mathbb {N}}\)), then it satisfies for any \(s\in \mathbb N\cup \{0\}\),
$$\begin{aligned} \frac{X_s+X_{s+1}+\ldots +X_{s+h}}{h+1}\xrightarrow [h\in {\mathbb {N}},~h\rightarrow +\infty ]{a.s.}\mu , \end{aligned}$$and
$$\begin{aligned} \frac{\sum _{u=s}^{s+h}(X_{u+\tau }-\mu )(X_u-\mu )}{h+1}\xrightarrow [h\in {\mathbb {N}},~h\rightarrow +\infty ]{a.s.}\gamma (\tau ),~\text{ for } \text{ all } \tau \in {\mathbb {N}}\cup \{0\}. \end{aligned}$$
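As a numerical illustration of the discrete-time conditions above (not part of the original definition; the AR(1) model, its parameters, and the tolerances are our own illustrative choices), the time averages of one long simulated path should approach the ensemble mean and autocovariance:

```python
import numpy as np

# Simulate a long path of a causal AR(1) process X_t = phi*X_{t-1} + Z_t
# with i.i.d. standard normal innovations (our own illustrative choice).
rng = np.random.default_rng(0)
phi, n = 0.5, 200_000
z = rng.normal(size=n)
x = np.empty(n)
x[0] = z[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + z[t]

# Time averages (s = 0) appearing in Definition 1:
mu_hat = x.mean()                         # should approach mu = 0
tau = 1
gamma1_hat = np.mean(x[tau:] * x[:-tau])  # should approach gamma(1)

# Theoretical values for this AR(1): mu = 0, gamma(tau) = phi**tau / (1 - phi**2).
```

For this model the time averages indeed settle near \(\mu =0\) and \(\gamma (1)=\phi /(1-\phi ^2)\), consistent with weak ergodicity.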
Wide-sense stationarity and ergodicity constitute a very general assumption, at least in the following senses:

1.
The assumption that each process is generated by some mean and covariance structure is sufficient for capturing all the features of a wide-sense stationary ergodic process. In other words, our algorithms intend to cluster means and autocovariance functions, not process distributions.

2.
Wide-sense stationary ergodic processes partially extend strict-sense ones. A finite-variance strict-sense stationary ergodic process [see Eq. (4) for its definition] is also wide-sense stationary ergodic. However, strict-sense stationary ergodic stable processes are not wide-sense stationary, because their variances explode (Cambanis et al. 1987; Samorodnitsky 2004).

3.
A Gaussian process is fully identified by its mean and covariance structure. Hence a wide-sense stationary ergodic Gaussian process is also strict-sense stationary ergodic.

4.
In the clustering problem, the dependency among the sample paths can be arbitrary.
There is a long list of processes which are wide-sense stationary ergodic, but not necessarily stationary in the strict sense. The list of examples of wide-sense stationary processes below is not exhaustive.
Example 1
Non-independent White Noise.
Let U be a random variable uniformly distributed over \((0,2\pi )\) and define
The process \(Z=\{Z(t)\}_{t\in {\mathbb {N}}}\) is then a white noise because it verifies
We claim that Z is wide-sense stationary ergodic, which can be shown using Kolmogorov's strong law of large numbers; see e.g. Theorem 2.3.10 in Sen and Singer (1993). However Z is not strict-sense stationary since
Indeed, it is easy to see that
Example 2
Autoregressive Models.
It is well-known that an autoregressive model \(\{Y(t)\}_t\sim AR(1)\) of the form:
is wide-sense stationary ergodic. However, it is not necessarily strict-sense stationary ergodic when the joint distributions of the white noise \(\{Z(t)\}_t\) are not invariant under time shifts (e.g., take \(\{Z(t)\}_t\) to be the white noise in Example 1).
Example 3
Increment Process of Fractional Brownian Motion.
Let \(\{B^H(t)\}_t\) be a fractional Brownian motion with Hurst index \(H\in (0,1)\) (see Mandelbrot and van Ness 1968). For each \(h>0\), its increment process \(\{Z^h(t):=B^H(t+h)-B^H(t)\}_t\) is finite-variance strict-sense stationary ergodic (Magdziarz and Weron 2011). As a result it is also wide-sense stationary ergodic. More detail will be discussed in Sect. 4.
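For reference, the autocovariance of the unit-step increment process in this example (fractional Gaussian noise, the case \(h=1\)) has a well-known closed form, which the short sketch below computes; the function name is our own.

```python
import numpy as np

def fgn_autocov(tau, H):
    """Autocovariance gamma(tau) of the unit-step fBm increments
    Z(t) = B^H(t+1) - B^H(t):
    gamma(tau) = 0.5 * (|tau+1|^{2H} - 2|tau|^{2H} + |tau-1|^{2H})."""
    tau = np.abs(tau)
    return 0.5 * ((tau + 1.0) ** (2 * H)
                  - 2.0 * tau ** (2 * H)
                  + np.abs(tau - 1.0) ** (2 * H))
```

In particular \(\gamma (0)=1\), \(\gamma (\tau )=0\) for \(\tau \ge 1\) when \(H=1/2\) (independent increments), and \(\gamma (\tau )>0\) for \(H>1/2\) (long-range dependence).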
Example 4
Increment Process of More General Gaussian Processes.
Peng (2012) introduced a general class of zero-mean Gaussian processes \(X=\{X(t)\}_{t\in {\mathbb {R}}}\) having stationary increments. Its variogram \(\nu (t):=2^{-1}{\mathbb {E}}(X(t)^2)\) satisfies:

(1)
There is a non-negative integer d such that \(\nu \) is 2d-times continuously differentiable over \([-2,2]\), but not \(2(d+1)\)-times continuously differentiable over \([-2,2]\).

(2)
There are two real numbers \(c\ne 0\) and \(s_0\in (0,2)\), such that for all \(t\in [-2,2]\),
$$\begin{aligned} \nu ^{(2d)}(t)=\nu ^{(2d)}(0)+c|t|^{{s_{0}}}+r(t), \end{aligned}$$where the remainder r(t) satisfies:

\(r(t)=o(|t|^{{s_{0}}})\), as \(t\rightarrow 0\).

There are two real numbers \(c' > 0\), \(\omega > s_0\) and an integer \(q > \omega +1/2\) such that r is q-times continuously differentiable on \([-2, 2]\backslash \{0\}\) and for all \(t \in [-2, 2]\backslash \{0\}\), we have
$$\begin{aligned} \left| r^{(q)}(t)\right| \leqslant c'|t|^{\omega -q}. \end{aligned}$$

It is shown that the process X extends fractional Brownian motion, and that it has wide-sense (and strict-sense) stationary ergodic increments when \(d+s_0/2\in (0,1)\) (see Proposition 3.1 in Peng 2012).
The problem of clustering processes via their means and covariance structures leads us to formulating our clustering targets in the following way.
Definition 2
(Ground truth G of covariance structures) Let
be a partition of \({\mathbb {N}}=\{1,2,\ldots \}\) into \(\kappa \) disjoint sets \(G_k\), \(k=1,\ldots ,\kappa \), such that the means and covariance structures of \({\mathbf {x}}_i\), \(i\in {\mathbb {N}}\) are identical if and only if \(i\in G_k\) for some \(k=1,\ldots ,\kappa \). Such a G is called the ground truth of covariance structures. We also denote by \(G_{N}\) the restriction of G to the first N sequences:
Our clustering algorithms aim to output the ground-truth partition G as the sample length grows. Before stating these algorithms, we introduce the inspiring framework of Khaleghi et al. (2016).
1.1 Preliminary results: clustering strict-sense stationary ergodic processes
Khaleghi et al. (2016) considered the problem of clustering strict-sense stationary ergodic processes. The main result of Khaleghi et al. (2016) is the so-called asymptotically consistent algorithms for clustering processes of that type. We briefly state their work below. Depending on how the information is collected, the stochastic process clustering problem consists of two models: the offline setting and the online setting.
 Offline setting :

The observations are assumed to be a finite number N of paths:
$$\begin{aligned} {\mathbf {x}}_1 = \Big (X_1^{(1)},\ldots , X_{n_1}^{(1)}\Big ),\ldots ,{\mathbf {x}}_N = \Big (X_1^{(N)},\ldots , X_{n_N}^{(N)}\Big ). \end{aligned}$$Each path is generated by one of the \(\kappa \) different unknown process distributions. In this case, an asymptotically consistent clustering function should satisfy the following.
Definition 3
(Consistency: offline setting) A clustering function f is consistent for a set of sequences S if \(f(S,\kappa )=G\). Moreover, denoting \(n=\min \{n_1,\ldots ,n_N\}\), f is called strongly asymptotically consistent in the offline sense if, with probability 1, it is consistent on the set S from some n on, i.e.,
It is called weakly asymptotically consistent if \(\lim \limits _{n\rightarrow \infty }{\mathbb {P}}(f(S,\kappa )=G)=1\).
 Online setting :

In this setting the observations, whose number and lengths grow with time t, are denoted by
$$\begin{aligned} {\mathbf {x}}_1 = \Big (X_1^{(1)},\ldots , X_{n_1}^{(1)}\Big ),\ldots ,{\mathbf {x}}_{N(t)} = \Big (X_1^{(N(t))},\ldots , X_{n_{N(t)}}^{(N(t))}\Big ), \end{aligned}$$where the index function N(t) is nondecreasing with respect to t.
Then an asymptotically consistent online clustering function is defined below:
Definition 4
(Consistency: online setting) A clustering function is strongly (resp. weakly) asymptotically consistent in the online sense if for every \(N\in {\mathbb {N}}\) the clustering \(f(S(t),\kappa )_N\) is strongly (resp. weakly) asymptotically consistent in the offline sense, where \(f(S(t),\kappa )_N\) is the clustering \(f(S(t),\kappa )\) restricted to the first N sequences:
A detailed comparison of the offline and online settings is given in Khaleghi et al. (2016): the two settings differ significantly, since using the offline algorithm in the online setting, by simply applying it to the entire data observed at every time step, does not result in an asymptotically consistent algorithm. Therefore studying these two settings separately and independently is necessary and meaningful.
As the main results in Khaleghi et al. (2016), asymptotically consistent clustering algorithms for both offline and online settings are designed. They are then successfully applied to clustering synthetic and real data sets.
Note that in the framework of Khaleghi et al. (2016), a key step is the introduction of the so-called distributional distance (Gray 1988): the distributional distance between a pair of process distributions \(\rho _1\), \(\rho _2\) is defined to be
$$\begin{aligned} d(\rho _1,\rho _2):=\sum _{m,l=1}^{\infty }w_m w_l\sum _{B\in B^{m,l}}\left| \rho _1(B)-\rho _2(B)\right| , \end{aligned}$$(2)
where:

The sets \(B^{m,l}\), \(m,l\ge 1\) are obtained by partitioning \({\mathbb {R}}^m\) into cubes of dimension m and volume \(2^{-ml}\), starting at the origin.

The sequence of weights \(\{w_j\}_{j\ge 1}\) is positive and decreasing to zero. Moreover, it should be chosen such that the series in (2) converges. The weights give precedence to earlier clusterings, protecting the clustering decisions from newly observed sample paths whose corresponding distance estimates may not yet be accurate. For instance, \(w_j=1/j(j+1)\) is chosen in Khaleghi et al. (2016).
Further, the distance between two sample paths \({\mathbf {x}}_1\), \({\mathbf {x}}_2\) of stochastic processes is given by
$$\begin{aligned} {\widehat{d}}({\mathbf {x}}_1,{\mathbf {x}}_2):=\sum _{m=1}^{m_n}\sum _{l=1}^{l_n}w_m w_l\sum _{B\in B^{m,l}}\left| \nu ({\mathbf {x}}_1,B)-\nu ({\mathbf {x}}_2,B)\right| , \end{aligned}$$(3)
where:

\(m_n,l_n\) (\(\leqslant n\)) can be arbitrary sequences of positive integers increasing to infinity, as \(n\rightarrow \infty \).

For a process path \({\mathbf {x}}=(X_1,\ldots ,X_n)\) and an event B, \(\nu ({\mathbf {x}},B)\) denotes the frequency with which the event B occurs over the \(n-m_n+1\) time windows. More precisely,
$$\begin{aligned} \nu ({\mathbf {x}},B):=\frac{1}{n-m_n+1}\sum _{i=1}^{n-m_n+1}\mathbb {1}\{(X_i,\ldots ,X_{i+m_n-1})\in B\}. \end{aligned}$$
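This frequency estimator is easy to transcribe numerically; the sketch below represents the cell B as a half-open axis-aligned cube (this encoding and all names are our own choices, not from the original text).

```python
import numpy as np

def freq(x, lower, upper, m):
    """Fraction of the n - m + 1 overlapping windows of length m
    (m playing the role of m_n) that fall in the cube
    B = [lower_1, upper_1) x ... x [lower_m, upper_m)."""
    x = np.asarray(x, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(x, m)  # (n-m+1, m)
    inside = np.all((windows >= lower) & (windows < upper), axis=1)
    return inside.mean()
```

For instance, on the path (0.1, 0.2, 0.3, 0.9), the window length m = 1 and the cell [0, 0.5) give a frequency of 3/4.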
The process distribution \(\rho \) from which \({\mathbf {x}}\) is sampled is called strict-sense ergodic if, for all \(m,l\in {\mathbb {N}}\) and \(B\in B^{m,l}\),
$$\begin{aligned} \nu \big ((X_1,\ldots ,X_n),B\big )\xrightarrow [n\rightarrow +\infty ]{a.s.}\rho (B). \end{aligned}$$(4)
The assumption that the processes are ergodic implies that \({\widehat{d}}\) is a strongly consistent estimator of d:
where \(\rho _1,\rho _2\) are the process distributions corresponding to \({\mathbf {x}}_1,{\mathbf {x}}_2\), respectively.
Based on the distance d and its estimate \({\widehat{d}}\), asymptotically consistent algorithms for clustering stationary ergodic processes in each of the offline and online settings are provided (see Algorithms 1, 2 and Theorems 11, 12 in Khaleghi et al. 2016). Khaleghi et al. (2016) also show that their methods can be implemented efficiently: they are at most quadratic in each of their arguments, and are linear (up to log terms) in some formulations.
1.2 Statistical setting: clustering wide-sense stationary ergodic processes
Inspired by the framework of Khaleghi et al. (2016), we consider the problem of clustering wide-sense stationary ergodic processes. We first introduce the following covariance-based dissimilarity measure, which is one of the main contributions of this paper.
Definition 5
(Covariance-based dissimilarity measure) The covariance-based dissimilarity measure \(d^*\) between a pair of processes \(X^{(1)}\), \(X^{(2)}\) (in fact \(X^{(1)}\), \(X^{(2)}\) denote two covariance structures, each of which may contain different process distributions) is defined as follows:
where:

For \(j=1,2\), \(\{X_l^{(j)}\}_{l\in {\mathbb {N}}}\) denotes some path sampled from the process \(X^{(j)}\). We assume that all possible observations of the process \(X^{(j)}\) form a subset of \(\{X_l^{(j)}\}_{l\in {\mathbb {N}}}\). For \(l'\ge l\ge 1\), we define the shortcut notation \(X_{l\ldots l'}^{(j)}:=(X_{l}^{(j)},\ldots ,X_{l'}^{(j)})\).

The function \({\mathcal {M}}\) is defined by: for any \(p_1,p_2,p_3\in {\mathbb {N}}\), any two vectors \(v_1,v_2\in {\mathbb {R}}^{p_1}\) and any two matrices \(A_1,A_2\in {\mathbb {R}}^{p_2\times p_3}\),
$$\begin{aligned} {\mathcal {M}}((v_1,A_1),(v_2,A_2)):= \left| v_1-v_2\right| +\rho ^*\left( A_1,A_2\right) . \end{aligned}$$(6) 
The distance \(\rho ^*\) between two equal-sized matrices \(M_1,M_2\) is defined to be
$$\begin{aligned} \rho ^*(M_1,M_2):=\Vert M_1-M_2\Vert _F, \end{aligned}$$(7)with \(\Vert \cdot \Vert _F\) being the Frobenius norm:
for an arbitrary matrix \(M=\{M_{ij}\}_{i=1,\ldots ,m; j=1,\ldots ,n}\),
$$\begin{aligned} \Vert M\Vert _F:=\sqrt{\sum _{i=1}^m\sum _{j=1}^nM_{ij}^2}. \end{aligned}$$Introduction to the matrices distance \(\rho ^*\) is inspired by Herdin et al. (2005). The matrices distance given in Herdin et al. (2005) is used to measure the distance between 2 correlation matrices. However, our distance \(\rho ^*\) is a modification of the one in the latter paper. Indeed, unlike Herdin et al. (2005), \(\rho ^*\) is a welldefined metric distance, as it satisfies the triangle inequalities.

The sequence of positive weights \(\{w_j\}\) is chosen such that \(d^*(X^{(1)},X^{(2)})\) is finite. Observe that, by wide-sense stationarity, the distances \(|\cdot |\) and \(\rho ^*\) in (5) do not depend on l; as a result we necessarily require
$$\begin{aligned} \sum _{l=1}^\infty w_l<+\infty . \end{aligned}$$(8)In practice a typical choice of weights we suggest is \(w_j=1/j(j+1)\), \(j=1,2,\ldots \). This is because, for most of the well-known covariance stationary ergodic processes (causal ARMA(p, q), increments of fractional Brownian motion, etc.), the autocovariance functions are absolutely summable: denoting by \(\gamma _X\) the autocovariance function of \(\{X_t\}_{t}\),
$$\begin{aligned} \sum _{h=-\infty }^{+\infty }\left| \gamma _X(h)\right| <+\infty . \end{aligned}$$(9)Ślęzak (2017) pointed out that (9) is a sufficient condition for \(\{X_t\}\) being mean-ergodic. However, (9) does not necessarily imply that \(\{X_t\}\) is covariance-ergodic; it becomes a sufficient and necessary condition if \(\{X_t\}\) is Gaussian. Therefore, subject to (9) and taking \(w_j=1/j(j+1)\), we obtain for any integer \(N>0\),
$$\begin{aligned}&\sum _{m,l = 1}^{N} w_m w_l \left| {\mathbb {E}}\left( X_{l\ldots l+m-1}^{(1)}\right) -{\mathbb {E}}\left( X_{l\ldots l+m-1}^{(2)}\right) \right| \nonumber \\&\quad = \sum _{m,l = 1}^{N} w_m w_l \sqrt{m}\left| \mu _1-\mu _2\right| =\left| \mu _1-\mu _2\right| \sum _{m,l = 1}^{N}\frac{1}{\sqrt{m}(m+1)l(l+1)}\nonumber \\&\quad \leqslant \left| \mu _1-\mu _2\right| \sum _{m,l = 1}^{+\infty }\frac{1}{\sqrt{m}(m+1)l(l+1)}<+\infty , \end{aligned}$$(10)with \(\mu _j={\mathbb {E}}\big (X_1^{(j)}\big )\), for \(j=1,2\); and
$$\begin{aligned}&\sum _{m,l = 1}^{N} w_m w_l \rho ^*\left( {\mathbb {C}}ov\Big (X_{l\ldots l+m-1}^{(1)}\Big ),{\mathbb {C}}ov\Big (X_{l\ldots l+m-1}^{(2)}\Big )\right) \nonumber \\&\quad \leqslant \sum _{m,l = 1}^{N} w_m w_l \sqrt{2\sum _{k_1=1}^m\sum _{k_2=1}^m\left( \gamma _X(k_1-k_2)\right) ^2}\nonumber \\&\quad = \sum _{m,l = 1}^{N} w_m w_l \sqrt{2\sum _{q=-(m-1)}^{m-1}(m-|q|)\left( \gamma _X(q)\right) ^2}\nonumber \\&\quad \leqslant \sum _{m,l = 1}^{N} w_m w_l \sqrt{2m\sum _{q=-(m-1)}^{m-1}\left( \gamma _X(q)\right) ^2}\nonumber \\&\quad \leqslant c\sum _{m,l = 1}^{N} \frac{\sqrt{2m}}{m(m+1)l(l+1)}\leqslant c\sum _{m,l = 1}^{+\infty } \frac{\sqrt{2m}}{m(m+1)l(l+1)}<+\infty , \end{aligned}$$(11)where the constant \(c=\sum _{q=-\infty }^{\infty }\left| {\mathbb {C}}ov(X_1,X_{1+|q|})\right| <+\infty \). Therefore combining (10) and (11) leads to
$$\begin{aligned} d^*\big (X^{(1)},X^{(2)}\big )<+\infty . \end{aligned}$$Hence \(d^*(X^{(1)},X^{(2)})\) in (5) is well-defined.
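To make the series defining \(d^*\) concrete, the sketch below evaluates a truncated version for two zero-mean AR(1) covariance structures (the AR(1) choice, the truncation level M, and the helper names are ours). By wide-sense stationarity the summand does not depend on l, so the l-sum factors out.

```python
import numpy as np

def ar1_cov(phi, m):
    """Covariance matrix of m consecutive values of a causal AR(1)
    with unit-variance innovations: gamma(k) = phi**|k| / (1 - phi**2)."""
    k = np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    return phi ** k / (1 - phi ** 2)

def d_star_truncated(phi1, phi2, M=50):
    """Truncated version of the covariance-based dissimilarity (5) for two
    zero-mean AR(1) structures, with weights w_j = 1/(j(j+1)).  The mean
    term vanishes, so only the rho* (Frobenius) term remains."""
    w = lambda j: 1.0 / (j * (j + 1))
    l_sum = sum(w(l) for l in range(1, M + 1))   # l-sum factors out
    total = 0.0
    for m in range(1, M + 1):
        diff = np.linalg.norm(ar1_cov(phi1, m) - ar1_cov(phi2, m))  # rho*
        total += w(m) * diff
    return l_sum * total
```

As expected, identical covariance structures are at distance zero, while structures with different AR coefficients are strictly separated.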
From (5) and (6) we see that the behavior of the dissimilarity measure \(d^*\) is jointly explained by the Euclidean distance between the means and the matrix distance between the covariance matrices. If the means of the processes \(X^{(1)}\) and \(X^{(2)}\) are known a priori to be equal, the distance \(d^*\) can be simplified to:
Note that this dissimilarity measure can be applied to self-similar processes, since they are all zero-mean (see Sect. 3).
Next we provide a consistent estimator of \({d^*}(X^{(1)},X^{(2)})\). For \(1\leqslant l\leqslant n\) and \(m\leqslant n-l+1\), define \(\mu ^*({X_{l\ldots n}}, m)\) to be the empirical mean of a process X's sample path \((X_l,\ldots ,X_n)\):
and define \(\nu ^*({X_{l\ldots n}}, m)\) to be the empirical covariance matrix of \((X_l,\ldots ,X_n)\):
where \(M^T\) denotes the transpose of the matrix M.
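A plausible window-based implementation of these two estimators (a sketch under the assumption that \(\mu ^*\) and \(\nu ^*\) average over the overlapping length-m windows of the path; all function names are ours):

```python
import numpy as np

def window_stack(x, m):
    """All overlapping windows of length m from the path x, as rows."""
    return np.lib.stride_tricks.sliding_window_view(np.asarray(x, float), m)

def mu_star(x, m):
    """Empirical mean vector mu*(x, m): average of the length-m windows."""
    return window_stack(x, m).mean(axis=0)

def nu_star(x, m):
    """Empirical covariance matrix nu*(x, m): average of the outer products
    of the centered windows, i.e. centered^T centered / (#windows)."""
    w = window_stack(x, m)
    centered = w - w.mean(axis=0)
    return centered.T @ centered / len(w)
```

For a constant path, \(\mu ^*\) returns the constant vector and \(\nu ^*\) the zero matrix, as expected.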
Recall that the notion of wide-sense ergodicity is given in Definition 1. The ergodic theorem concerns what information can be derived from an average over time about the ensemble average at each point of time. For a wide-sense stationary ergodic process X, either continuous-time or discrete-time, the following statement holds: every empirical mean \(\mu ^*(X_{l\ldots n},m)\) is a strongly consistent estimator of the path mean \({\mathbb {E}}(X_{l\ldots l+m-1})\), and every empirical covariance matrix \(\nu ^*(X_{l\ldots n},m)\) is a strongly consistent estimator of the covariance matrix \({\mathbb {C}}ov(X_{l\ldots l+m-1})\) under the Frobenius norm, i.e., for all \(m\ge 1\), we have
and
Next we introduce the empirical covariance-based dissimilarity measure \(\widehat{d}^{*}\), serving as a consistent estimator of the covariance-based dissimilarity measure \(d^*\).
Definition 6
(Empirical covariance-based dissimilarity measure) Given two processes' sample paths \({\mathbf {x}}_j=(X_1^{(j)},\ldots ,X_{n_j}^{(j)})\), \(j=1,2\), let \(n=\min \{n_1,n_2\}\). We define the empirical covariance-based dissimilarity measure between \({\mathbf {x}}_1\) and \({\mathbf {x}}_2\) by
The empirical covariance-based dissimilarity measure between a sample path \({\mathbf {x}}_i\) and a process \(X^{(j)}\) (\(i,j\in \{1,2\}\)) is defined by
Unlike the dissimilarity measure \(d^*\), which describes a distance between stochastic processes, the empirical covariance-based dissimilarity measure is a distance between two sample paths (finite-length vectors). We will show in the forthcoming Lemma 1 that \(\widehat{d}^{*}\) is a consistent estimator of \(d^*\).
Two observed sample paths may have distinct lengths \(n_1, n_2\); therefore in (15) we compute the distances between their subsequences of length \(n=\min \{n_1,n_2\}\). In practice we usually take \(m_n=\lfloor \log n\rfloor \), the floor of \(\log n\).
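A plausible numerical sketch of \(\widehat{d}^{*}\), consistent with the surrounding description (truncating the double series at \(m_n=\lfloor \log n\rfloor \), using \(w_j=1/j(j+1)\), and estimating means and covariances from the overlapping length-m windows of each path; all names and details are our own assumptions):

```python
import math
import numpy as np

def _windows(x, m):
    return np.lib.stride_tricks.sliding_window_view(np.asarray(x, float), m)

def d_star_hat(x1, x2):
    """Hedged sketch of the empirical dissimilarity: for m = 1..floor(log n),
    n = min path length, compare the window-based empirical mean vectors
    (Euclidean distance) and covariance matrices (Frobenius distance)."""
    n = min(len(x1), len(x2))
    m_n = max(1, math.floor(math.log(n)))
    w = lambda j: 1.0 / (j * (j + 1))
    total = 0.0
    for m in range(1, m_n + 1):
        w1, w2 = _windows(x1[:n], m), _windows(x2[:n], m)
        mean_term = np.linalg.norm(w1.mean(axis=0) - w2.mean(axis=0))
        c1 = np.cov(w1, rowvar=False).reshape(m, m)
        c2 = np.cov(w2, rowvar=False).reshape(m, m)
        total += w(m) * (mean_term + np.linalg.norm(c1 - c2))
    return total
```

The sketch vanishes on identical paths and separates paths whose empirical means differ, mirroring the behavior required of \(\widehat{d}^{*}\).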
It is easy to verify that both \(d^*\) and \(\widehat{d}^{*}\) satisfy the triangle inequality, thanks to the fact that both the Euclidean distance and \(\rho ^*\) satisfy it. More precisely, the following holds.
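The metric properties of \(\rho ^*\) are easy to check numerically; a minimal sketch (the example matrices are our own choices):

```python
import numpy as np

def rho_star(m1, m2):
    """rho*(M1, M2) = ||M1 - M2||_F, a genuine metric on equal-sized
    matrices, as in Eq. (7)."""
    return np.linalg.norm(np.asarray(m1) - np.asarray(m2), ord="fro")

a = np.array([[1.0, 0.2], [0.2, 1.0]])
b = np.array([[1.0, 0.8], [0.8, 1.0]])
c = np.zeros((2, 2))
# Identity of indiscernibles: rho*(a, a) = 0.
# Triangle inequality: rho*(a, c) <= rho*(a, b) + rho*(b, c).
```

Here \(\rho ^*(a,b)=\sqrt{2\cdot 0.6^2}\), and the triangle inequality holds for any such triple, since the Frobenius norm is a genuine matrix norm.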
Remark 1
Thanks to (7) and the definitions of \(d^*\) [see (5)] and \(\widehat{d}^{*}\) [see (15)], we see that the triangle inequality holds for the covariance-based dissimilarity measure \(d^*\), as well as for its empirical estimate \(\widehat{d}^{*}\). Therefore for arbitrary processes \(X^{(i)},~i = 1,2,3\) and arbitrary finite-length sample paths \({\mathbf {x}}_{i},~i=1,2,3\), we have
Remark 1, together with the fact that the processes are weakly ergodic, leads to Lemma 1 below, which is the key to demonstrating that the clustering algorithms in the forthcoming section are asymptotically consistent.
Lemma 1
Given two paths
sampled from the wide-sense stationary ergodic processes \(X^{(1)}\) and \(X^{(2)}\) respectively, we have
and
Proof
We take \(n=\min \{n_1,n_2\}\). To show (17) holds it suffices to prove that for arbitrary \(\varepsilon >0\), there is an integer \(N>0\) such that for any \(n\ge N\), with probability 1,
Define the sets of indexes
For convenience we also denote by
and
for \((m,l)\in {\mathbb {N}}^2\) and \(j=1,2\). By using the definitions of \(d^*\) [see (5)], of \(\widehat{d}^{*}\) [see (15)] and the triangle inequality
we obtain
Next note that the metric \({\mathcal {M}}\) satisfies the following triangle inequality:
It follows from (21) and (22) that
Next we show that the righthand side of (23) converges to 0 as \(n\rightarrow \infty \). First observe that the weights \(\{w_m\}_{m\ge 1}\) have been chosen such that
Then for arbitrary fixed \(\varepsilon >0\), we can find an index J such that for \(n\ge J\),
Next, the weak ergodicity of the processes \(X^{(1)}\) and \(X^{(2)}\) implies that for each \((m,l)\in {\mathbb {N}}^2\), \({\widehat{V}}(X_{l\ldots n}^{(j)},m)\) (\(j=1,2\)) is a strongly consistent estimator of \(V(X_{l\ldots l+m-1}^{(j)})\) under the metric \({\mathcal {M}}\), i.e., with probability 1,
Thanks to (26), for any \((m,l)\in S_1(J)\), there exists some \(N_{m,l}\) (which depends on m, l) such that for all \(n\ge N_{m,l}\), we have, with probability 1,
where \(\# A\) denotes the number of elements included in the set A. Denote by \(N_J=\max \limits _{(m,l)\in S_1(J)}N_{m,l}\). Then observe that, for \(n\ge \max \{N_J,J\}\),
It results from (23), (28), (27) and (25) that, for \(n\ge \max \{N_J,J\}\),
which proves (17). The statement (18) can be proved analogously. \(\square \)
2 Asymptotically consistent clustering algorithms
2.1 Offline and online algorithms
In this section we introduce asymptotically consistent algorithms for clustering offline and online datasets, respectively. We explain how the two algorithms work, and prove that both are asymptotically consistent. It is worth noting that the asymptotic consistency of our algorithms relies on the assumption that the number of clusters \(\kappa \) is known a priori. The case of unknown \(\kappa \) has been studied in Khaleghi et al. (2016) for the problem of clustering strict-sense stationary ergodic processes; in the setting of wide-sense stationary ergodic processes, however, this problem remains open.
Algorithm 1 below presents the pseudocode for clustering offline datasets. It is a centroid-based clustering approach. One of its main features is the farthest two-point initialization. The algorithm selects the first two cluster centers by picking the two “farthest” observations among all observations (Lines 1–3), under the empirical dissimilarity measure \(\widehat{d}^{*}\). Then each subsequent cluster center is chosen to be the observation farthest from all the previously assigned cluster centers (Lines 4–6). Finally, the algorithm assigns each remaining observation to its nearest cluster (Lines 7–11).
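The three stages just described can be sketched as follows (a sketch, not the paper's exact pseudocode; the dissimilarity is passed in as a function, and all variable names are ours):

```python
import itertools

def offline_cluster(paths, kappa, dist):
    """Farthest two-point initialization followed by nearest-center
    assignment.  `dist` is any dissimilarity between two paths (e.g. an
    empirical d*-type estimate); requires kappa >= 2.  Returns, for each
    path, the index of its cluster center."""
    n = len(paths)
    d = {(i, j): dist(paths[i], paths[j])
         for i, j in itertools.combinations(range(n), 2)}
    dd = lambda i, j: 0.0 if i == j else d[(min(i, j), max(i, j))]
    # Lines 1-3: the two farthest observations become the first centers.
    c1, c2 = max(d, key=d.get)
    centers = [c1, c2]
    # Lines 4-6: each next center is farthest from all existing centers.
    while len(centers) < kappa:
        rest = [i for i in range(n) if i not in centers]
        centers.append(max(rest,
                           key=lambda i: min(dd(i, c) for c in centers)))
    # Lines 7-11: assign every observation to its nearest center.
    return [min(centers, key=lambda c: dd(i, c)) for i in range(n)]
```

With scalar "paths" and the absolute difference as dissimilarity, the four observations (0.0, 0.1, 5.0, 5.1) split into the two expected clusters.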
We point out that Algorithm 1 differs from Algorithm 1 in Khaleghi et al. (2016) in two respects:

1.
As mentioned previously, our algorithm relies on the covariance-based dissimilarity \(\widehat{d}^{*}\), in lieu of the process distributional distances.

2.
Our algorithm uses two-point initialization, while Algorithm 1 in Khaleghi et al. (2016) randomly picks one point as the first cluster center. The latter initialization was proposed for use with k-means clustering by Katsavounidis et al. (1994). Algorithm 1 in Khaleghi et al. (2016) requires \(\kappa N\) distance calculations, while our algorithm requires \(N(N-1)/2\) distance calculations. It is important to point out that, to reduce the computational complexity of our algorithm, it is fine to replace our two-point initialization with the one in Khaleghi et al. (2016). However, there are two reasons why we recommend our approach to initialization:
 Reason 1 :

In the forthcoming Sect. 4.1, our empirical comparison with Khaleghi et al. (2016) shows that the two-point initialization turns out to be more accurate in clustering than the one-point initialization.
 Reason 2 :

Concerning the complexity cost, there is a trade-off: on one hand, the two-point initialization requires more calculation steps than the one-point initialization; on the other hand, in our covariance-based dissimilarity measure \(\widehat{d}^{*}\) defined in (15), the matrix distance \(\rho ^*\) requires \(m_n^2\) computations of Euclidean distances, while the distance \(\sum _{B\in B^{m,l}}\left| \nu ({\mathbf {x}}_1,B)-\nu ({\mathbf {x}}_2,B)\right| \) given in (3) requires at least \(n_1+n_2-2m_n+2\) computations of Euclidean distances [see Eq. (33) in Khaleghi et al. 2016]. Note that we take \(m_n=\lfloor \log n\rfloor \) (\(\lfloor \cdot \rfloor \) denotes the floor) throughout this framework. Therefore the computational complexity of the covariance-based dissimilarity \(\widehat{d}^{*}\) makes the overall complexity of Algorithm 1 quite competitive with the algorithm in Khaleghi et al. (2016), especially when the path lengths \(n_i\), \(i=1,\ldots ,N\) are relatively large, or when a database of all distance values is at hand.
Next we present the clustering algorithm for the online setting. As mentioned in Khaleghi et al. (2016), one regards recently observed paths as unreliable observations, for which sufficient information has not yet been collected and for which the estimates of the covariance-based dissimilarity measure are not accurate enough. Consequently, farthest-point initialization would not work in this case; and clustering all available data results not only in misclustering the unreliable paths, but also in clustering incorrectly those for which sufficient data are already available. The strategy is presented in Algorithm 2 below: clustering is based on a weighted combination of several clusterings, each obtained by running the offline algorithm (Algorithm 1) on a different portion of the data.
More precisely, Algorithm 2 works as follows. Suppose the number of clusters \(\kappa \) is known. At time t, a sample S(t) is observed (Lines 1–2), and the algorithm iterates over \(j= \kappa ,\ldots ,N(t)\), where at each iteration Algorithm 1 is utilized to cluster the first j paths in S(t) into \(\kappa \) clusters (Lines 6–7). For each cluster its center is selected as the observation having the smallest index in that cluster, and the centers' indexes are ordered increasingly (Line 8). The minimum inter-cluster distance \(\gamma _j\) (see Cesa-Bianchi and Lugosi 2006) is calculated as the minimum distance \(\widehat{d}^{*}\) between the \(\kappa \) cluster centers obtained at iteration j (Line 9). Finally, every observation in S(t) is assigned to the nearest cluster, based on the weighted combination of the distances between this observation and the candidate cluster centers obtained at each iteration over j (Lines 14–17).
In Algorithm 2, \(\beta (j)\) denotes a function indexed by j, which is the value chosen for the weight \(w_j\). Note that in the online setting, our algorithm requires the same number of distance calculations as Algorithm 2 in Khaleghi et al. (2016); both are bounded by \({\mathcal {O}}(N(t)^2)\). Using the 2-point initialization, our Algorithm 2 thus gains an advantage in overall computational cost. Finally we note that both Algorithm 1 and Algorithm 2 require \(\kappa \ge 2\). When \(\kappa \) is known, this restriction is not a practical issue.
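To make the offline procedure concrete, here is a minimal Python sketch of the control flow described above for Algorithm 1: 2-point initialization (the two mutually farthest paths become the first centers), farthest-point selection of the remaining centers, and nearest-center assignment. The dissimilarity matrix is assumed precomputed; this is an illustration under those assumptions, not the paper's pseudocode.

```python
def offline_cluster(dist, kappa):
    """Sketch of the offline algorithm with 2-point initialization.
    dist: symmetric N x N matrix of pairwise dissimilarities.
    Returns a list of cluster labels in {0, ..., kappa-1}."""
    n = len(dist)
    # 2-point initialization: the two mutually farthest paths.
    c1, c2 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: dist[p[0]][p[1]])
    centers = [c1, c2]
    # Farthest-point iteration for the remaining kappa - 2 centers.
    while len(centers) < kappa:
        centers.append(max(range(n),
                           key=lambda i: min(dist[i][c] for c in centers)))
    # Assign every path to its nearest center.
    return [min(range(kappa), key=lambda k: dist[i][centers[k]])
            for i in range(n)]
```

On well-separated toy data (e.g. dissimilarities given by distances between points clustered around 0, 10 and 20), the sketch recovers the three groups.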
2.2 Consistency and computational complexity of the algorithms
In this section we prove the asymptotic consistency of Algorithms 1 and 2, stated in the two theorems below.
Theorem 1
Algorithm 1 is strongly asymptotically consistent (in the offline sense), provided that the true number \(\kappa \) of clusters is known, and each sequence \({\mathbf {x}}_i,~i = 1,\ldots ,N\) is sampled from some wide-sense stationary ergodic process.
Proof
Similar to the idea used in the proof of Theorem 11 in Khaleghi et al. (2016), to prove the consistency statement we will need Lemma 1 to show that if the sample paths in S are long enough, the sample paths that are generated by the same process covariance structure are “closer” to each other than to the rest. Therefore, the sample paths chosen as cluster centers are each generated by a different covariance structure, and since the algorithm assigns the rest to the closest clusters, the statement follows. More formally, let \(n_{\min }\) denote the shortest path length in S:
Denote by \(\delta _{\min }\) the minimum nonzero covariance-based dissimilarity measure between any two covariance structures:
Fix \(\varepsilon \in (0, \delta _{\min }/4)\). Since there are a finite number N of observations, by Lemma 1 there is \(n_0\) such that for \(n_{\min }\ge n_0\) we have
where \(G_l,~l = 1,\ldots ,\kappa \) denote the ground-truth covariance structure partitions given by Definition 2.
On one hand, by using (30), the triangle inequality (see Remark 1) and the fact that
for any indexes set I and any real numbers \(a_i\)’s and \(b_i\)’s, we obtain
On the other hand, by using the triangle inequality (see Remark 1), (29) and (30), we have for \(n_{\min }\ge n_0\),
In words, (31) together with (32) indicates that the sample paths in S that are generated by the same covariance structure are closer to each other than to the rest of the sample paths. Then by (31) and (32), for \(n_{\min }\ge n_0\), each sample path is necessarily “close” enough to its cluster center, i.e.,
where the \(\kappa \) cluster centers’ indexes \(c_1,\ldots ,c_\kappa \) are given by Algorithm 1 as
and
Hence, the indexes \(c_1,\ldots , c_{\kappa }\) will be chosen to index the sample paths generated by different process covariance structures. Then by (31) and (32), each remaining sample path will be assigned to the cluster center corresponding to the sample path generated by the same process covariance structure. Finally Theorem 1 results from (31), (32) and (33). \(\square \)
Theorem 2
Algorithm 2 is strongly asymptotically consistent (in the online sense), provided the true number of clusters \(\kappa \) is known, and each sequence \({\mathbf {x}}_i, i \in {\mathbb {N}}\) is sampled from some wide-sense stationary ergodic process.
Proof
The idea of the proof is similar to that of Theorem 12 in Khaleghi et al. (2016). The main difference between the two proofs stems from the fact that our covariance-based dissimilarity measure \(\widehat{d}^{*}\) is not bounded by a constant. Although not mentioned in the pseudocode of Algorithm 2, the notations \(\gamma _j\) and \(\eta \) depend on t; we therefore write \(\gamma _j^t:=\gamma _j\) and \(\eta ^t:=\eta \) throughout this proof. In the first step, by using the triangle inequality we can show that for any \(t>0\), any \(N\in {\mathbb {N}}\),
where for each j, \(k_j'\) is chosen such that \({\mathbf {x}}_j^t\) is sampled from the process covariance structure \(X^{(k_j')}\). On one hand, let
then the first term on the right-hand side of (34) can be bounded by the constant \(\delta _{\max }\), which depends neither on t nor on N:
On the other hand, since \({\mathbf {x}}_j^t\) is sampled from \(X^{(k_j')}\), by using the weak ergodicity (see Lemma 1), for \(j=1,\ldots ,N\), with probability 1,
This, together with the fact that a convergent sequence is bounded, implies that for each \(j\in \{1,\ldots , N\}\) there is \(b_j\) (not depending on t) such that
Therefore the second term on the right-hand side of (34) can be bounded as:
Let
It is important to point out that B(N) depends only on N but not on t. It follows from (34), (36), (37) and (38) that
Let \(\delta _{\min }\) be the one given in (29). Fix \(\varepsilon \in (0, \delta _{\min }/4)\). By using (8), we can choose some \(J>0\) so that
Recall that in the online setting, the \(i\hbox {th}\) sample path's length \(n_i(t)\) grows with time, for each i. Therefore, by the wide-sense ergodicity (see Lemma 1), for every \(j \in \{1,\ldots ,J\}\) there exists some \(T_1(j)>0\) such that for all \(t \ge T_1(j)\) we have
For \(k = 1,\ldots ,\kappa \), define \(s_k(N(t))\) to be the index of the first path in S(t) sampled from the covariance structure \(X^{(k)}\), i.e.,
Note that \(s_k(N(t))\) depends only on N(t). Then denote
By Theorem 1 for every \(j \in \{m(N(t)),\ldots ,J\}\) there exists some \(T_2(j)\) such that \(\text{ Alg1 }(S(t)_j, \kappa )\) is asymptotically consistent for all \(t \ge T_2(j)\), where \(S(t)_j = \left\{ {\mathbf {x}}_1^t, \ldots , {\mathbf {x}}_j^t \right\} \) denotes the subset of S(t) consisting of the first j sample paths. Let
Recall that, by the definition of m(N(t)) in (43), \(S(t)_{m(N(t))}\) contains sample paths from all \(\kappa \) distinct covariance structures. Therefore, similar to obtaining (32), for all \(t \ge T\), we use the triangle inequality, (29) and (41) to obtain
From Algorithm 2 (see Lines 9, 11) we see
Hence, by (44), for all \(t\ge T\),
For \(j\in \{ J+1,\ldots ,N(t)\}\), by the triangle inequality and (39), we have for all \(t\ge T\),
Denote by
then (46) can be interpreted as: for all \(t\ge T\),
By (39), (45) and (47), for every \(k \in \{1,\ldots ,\kappa \}\) we obtain
Next we provide upper bounds for the first two terms on the right-hand side of (48). On one hand, by the definition of m(N(t)), the sample paths in \(S(t)_j\) for \(j = 1,\ldots ,m(N(t))-1\) are generated by at most \(\kappa -1\) out of the \(\kappa \) process covariance structures. Therefore for each \(j \in \{1,\ldots ,m(N(t))-1\}\) there exists at least one pair of distinct cluster centers that are generated by the same process covariance structure. Consequently, by (41) and the definition of \(\eta ^t\), for all \(t \ge T\) and \(k\in \{1,\ldots ,\kappa \}\),
On the other hand, since the clusters are ordered in the order of appearance of the distinct covariance structures, we have \({\mathbf {x}}_{c_l^j}^t = {\mathbf {x}}_{s_l(N(t))}^t\) for all \(j = m(N(t)),\ldots ,J\) and \(l = 1,\ldots ,\kappa \), where the index \(s_l(N(t))\) is defined in (42). Therefore, by (41) and the definition of \(\eta ^t\), for all \(t \ge T\) and every \(l=1,\ldots ,\kappa \) we have
Combining (48), (49), (50) and (41) we obtain, for \(t\ge T\),
for all \(l = 1,\ldots ,\kappa \).
Now we explain how to use (51) to prove the asymptotic consistency of Algorithm 2. Consider an index \(i \in G_{k'}\) for some \(k' \in \{1,\ldots ,\kappa \}\). Then on one hand, using (49) and (50), we get for \(k\in \{1,\ldots ,\kappa \}\), \(k\ne k'\),
On the other hand, for any \(N\in {\mathbb {N}}\), by using the wide-sense ergodicity, there is T(N) such that for all \(t\ge T(N)\),
Since \(\varepsilon \) can be arbitrarily chosen, it follows from (52) and (53) that
holds almost surely for all \(i=1,\ldots ,N\) and all \(t\ge \max \{T,T(N)\}\). Theorem 2 is proved. \(\square \)
We now discuss the complexity costs of the two algorithms above.

1.
For the offline setting, our Algorithm 1 requires \(N(N-1)/2\) calculations of \(\widehat{d}^{*}\), against \(\kappa N\) calculations of \(\widehat{d}\) in the offline algorithm in Khaleghi et al. (2016). In each \(\widehat{d}^{*}\), the matrix distance \(\rho ^*\) consists of \(m_n^2\) calculations of Euclidean distances. Then, iterating over m and l in \(\widehat{d}^{*}\), at most \({\mathcal {O}}(nm_n^3)\) computations of Euclidean distances are needed, against \({\mathcal {O}}(nm_n/\log s)\) computations of \(\widehat{d}\) for the offline algorithm in Khaleghi et al. (2016), where
$$\begin{aligned} s=\min _{\begin{array}{c} X_i^{(1)}\ne X_j^{(2)} \\ i\in \{1,\ldots ,n_1\};~j\in \{1,\ldots ,n_2\} \end{array}}\left| X_i^{(1)}-X_j^{(2)}\right| . \end{aligned}$$It is known that an efficient search algorithm can be utilized to determine s, with at most \({\mathcal {O}}(n\log (n))\) (\(n=\min \{n_1,n_2\}\)) computations. Therefore our Algorithm 1 is computationally competitive with the one in Khaleghi et al. (2016).

2.
For the online setting, a discussion similar to that of Khaleghi et al. (2016), Section 5.1 applies. It shows that the computational complexity of the updates of \(\widehat{d}^{*}\), for both our Algorithm 2 and the online algorithm in Khaleghi et al. (2016), is at most \({\mathcal {O}}(N(t)^2+N(t)\log ^3n(t))\) (here we take \(m_{n(t)}=\lfloor \log n(t)\rfloor \)). Therefore the overall difference in computational complexity between the two algorithms is reflected by the complexity of computing \(\widehat{d}^{*}\) and \(\widehat{d}\) (see Point 1).
2.3 Efficient dissimilarity measure
Kleinberg (2003) presented a set of three simple properties that a good clustering function should have: scale-invariance, richness and consistency. Further, he demonstrated that no clustering function satisfies all three properties simultaneously. He pointed out, as one particular example, that centroid-based clustering does not satisfy the above consistency property (note that this is a different concept from our asymptotic consistency). In this section we show that, although the consistency property is not satisfied, there exists another criterion for the efficiency of a dissimilarity measure in a particular setting: the so-called efficient dissimilarity measure.
Definition 7
(Efficient dissimilarity measure) Assume that the samples \(S=\{{\mathbf {x}}(\xi ):~\xi \in {\mathcal {H}}\}\) (\({\mathcal {H}}\subset {\mathbb {R}}^q\) for some \(q\in {\mathbb {N}}\)) are such that all the paths \({\mathbf {x}}(\xi )\) are indexed by a set of real-valued parameters \(\xi \). Then a clustering function is called efficient if its dissimilarity measure d satisfies: there exists \(c>0\) such that for any \({\mathbf {x}}(\xi _1),{\mathbf {x}}(\xi _2)\in S\),
where \(\Vert \cdot \Vert \) denotes some norm defined over \({\mathbb {R}}^q\).
Mathematically, an efficient dissimilarity measure is a metric induced by some norm. Clustering processes based on an efficient dissimilarity measure is then equivalent to clustering under classical distances in \({\mathbb {R}}^q\), such as the Euclidean, Manhattan, or Minkowski distance. The latter setting has well-known advantages in cluster analysis. For example, the Euclidean distance performs well on datasets that contain compact or isolated clusters (Jain and Mao 1996; Jain et al. 1999); when the shape of the clusters is hyper-rectangular (Xu and Wunsch 2005), the Manhattan distance can be used; the Minkowski distance, which includes the Euclidean and Manhattan distances as particular cases, can be utilized for general clustering problems (Wilson and Martinez 1997). There is a rich literature comparing these three distances through discussion of their advantages and drawbacks. We refer to Shirkhorshidi et al. (2015) and the references therein.
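As a reminder, the three classical distances just mentioned are all instances of the Minkowski distance of order p, which the following one-liner makes explicit:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p >= 1 between two real vectors:
    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

For instance, `minkowski([0, 0], [3, 4], 2)` recovers the Euclidean distance 5 and `minkowski([0, 0], [3, 4], 1)` the Manhattan distance 7.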
In the next section we present an example showing how to improve the efficiency of our consistent algorithms when clustering self-similar processes with wide-sense stationary ergodic increments.
3 Self-similar processes and logarithmic transformation
In this section we introduce a nonlinear transformation of the covariance matrices in \(\widehat{d}^{*}\), in order to improve the efficiency of clustering. This transformation is based on the logarithmic function. We use one example to explain how it works: we show that the transformation maps \(\widehat{d}^{*}\) to a covariance-based dissimilarity measure close to an efficient one, when applied to clustering self-similar processes.
Definition 8
(Self-similar process, see Samorodnitsky and Taqqu (1994)) A process \(X^{(H)}=\{X_t^{(H)}\}_{t\in T}\) (e.g., \(T={\mathbb {R}}\) or \({\mathbb {Z}}\)) is self-similar with index \(H\in (0,1)\) if, for all \(n\in {\mathbb {N}}\), all \(t_1,\ldots ,t_n\in T\), and all \(c\ne 0\) such that \(ct_i\in T\) (\(i=1,\ldots ,n\)),
It can be shown that a self-similar process necessarily has zero mean and that its covariance structure is determined by its self-similarity index H, in the following way (Embrechts and Maejima 2000).
Theorem 3
Let \(\big \{X_t^{(H)}\big \}_{t\in T}\) be a zero-mean self-similar process with index \(H\in (0,1)\) and with wide-sense stationary ergodic increments. Assume \({\mathbb {E}}\big |X_1^{(H)}\big |^2<+\infty \); then for any \(s,t\in T\),
The corollary below follows.
Corollary 1
Let \(\{X_t^{(H)}\}_{t\in T}\) be a zero-mean self-similar process with index H and weakly stationary increments. Assume \({\mathbb {E}}\big |X_1^{(H)}\big |^2<+\infty \). For \(h>0\) small enough, define the increment process \(Z_h^{(H)}(s)=X_{s+h}^{(H)}-X_s^{(H)}\); then for \(s,t\in T\) such that \(|s-t|\ge h\), we have
Applying the mean value theorem three times to (55) leads to
for some \(v_1^{(H)}\in (|s-t|,|s-t|+h)\), \(v_2^{(H)}\in (|s-t|-h,|s-t|)\) and \(v^{(H)}\in (v_2^{(H)},v_1^{(H)})\). We see that the term \({\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \) is a nonlinear function of H. Next we look for a function g such that \(g\left( {\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \right) \) depends linearly on H. To this end we introduce the following \(\log ^*\)-transformation: for \(x\in {\mathbb {R}}\), define
The introduction of the \(\log ^*\)-transformation is driven by the following two motivations:
Motivation 1:

The \(\log ^*\) function transforms the current dissimilarity measure into one which depends “linearly” on the variable H.
Motivation 2:

The value \(\log ^*(x)\) preserves the sign of x, with the consequence that a larger distance between x and y yields a larger distance between \(\log ^*(x)\) and \(\log ^*(y)\).
Applying the \(\log ^*\)-transformation to the covariances of \(Z_h^{(H)}\) given in (56), we obtain
When \(v^{(H)}\) and h are small, the terms \(\log v^{(H)}\) and \(\log h\) are large in absolute value, so \(\log \left( H|1-2H|\,{\mathbb {V}}ar\big (X_1^{(H)}\big )\right) \) becomes negligible. Thus we can write
In conclusion,

When \(H_1,H_2\in (0,1/2]\) or \(H_1,H_2\in [1/2,1)\), the term \(\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H)}(s),Z_h^{(H)}(t)\right) \right) \) is “approximately linear” in H on \((0,1/2]\) or on \([1/2,1)\).
Using the approximation \(\log v^{(H_1)}\approx \log v^{(H_2)}\) for \(H_1,H_2\in (0,1/2]\) or \(H_1,H_2\in [1/2,1)\), we have
$$\begin{aligned}&\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_1)}(s),Z_h^{(H_1)}(t)\right) \right) -\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_2)}(s),Z_h^{(H_2)}(t)\right) \right) \\&\quad \approx 2{{\,\mathrm{sgn}\,}}(2H_1-1)(H_1-H_2)\log v^{(H_1)}. \end{aligned}$$ 
When \(H_1\in (0,1/2]\) and \(H_2\in (1/2,1)\), the difference of the \(\log ^*\)-transformed covariances turns out to be relatively large, because we have
$$\begin{aligned}&\left| \log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_1)}(s),Z_h^{(H_1)}(t)\right) \right) -\log ^*\left( {\mathbb {C}}ov\left( Z_h^{(H_2)}(s),Z_h^{(H_2)}(t)\right) \right) \right| \\&\quad \approx \left| (2H_1-2)\log v^{(H_1)}-(2H_2-2)\log v^{(H_2)}\right| \\&\quad \ge 2(2-H_1-H_2)\min \left\{ \left| \log v^{(H_1)}\right| ,\left| \log v^{(H_2)}\right| \right\} . \end{aligned}$$
Taking advantage of the above facts, we define the new empirical covariance-based dissimilarity measure (based on the definition (12)) to be
where \(\nu ^{**}(Z^{(H_1)}_{l\ldots n},m)\) is the empirical covariance matrix of \(Z_h^{(H_1)}\), \(\nu ^*(Z^{(H_1)}_{l\ldots n},m)\), with each of its coefficients transformed by \(\log ^*\): letting \(M=\{M_{i,j}\}_{i=1,\ldots ,m;~j=1,\ldots ,n}\) be an arbitrary real-valued matrix, define
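The defining equation (57) of \(\log ^*\) is not reproduced in this excerpt; one form consistent with the sign-preservation remark and the derivation (58) is the signed logarithm sketched below, which should be treated as an assumption rather than the paper's verbatim definition. The entrywise transform then produces \(\nu ^{**}\) from \(\nu ^*\).

```python
import math

def log_star(x):
    """Hypothetical log* transform (the paper's Eq. (57) is not shown
    here): sign-preserving logarithm, log*(x) = sgn(x) * log|x| for
    x != 0 and log*(0) = 0."""
    if x == 0.0:
        return 0.0
    return math.copysign(math.log(abs(x)), x)

def log_star_matrix(mat):
    """Apply log* entrywise to a covariance matrix, mimicking the
    passage from nu* to nu**."""
    return [[log_star(v) for v in row] for row in mat]
```

Under this assumed form, \(\log ^*(\pm e)=\pm 1\) and the transform is antisymmetric, \(\log ^*(-x)=-\log ^*(x)\).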
Then we have
Now, given two wide-sense stationary ergodic processes \(X^{(1)}\), \(X^{(2)}\), we choose \(\{w_j\}_{j\in {\mathbb {N}}}\) to satisfy
where we denote by
Then we define the \(\log ^*\)-transformation of the covariance-based dissimilarity measure to be
Using the fact that \(\log ^*\) is continuous over \({\mathbb {R}}\backslash \{0\}\) and the weak ergodicity of \(Z_h^{(H)}\), we obtain the following version of ergodicity:
Unlike \(\widehat{d}^{*}\), the dissimilarity measure \(\widehat{d^{**}}\) is approximately linear with respect to the self-similarity index H. Indeed, it is easy to see that
where \(H_1,H_2\) correspond to the self-similarity indexes of \(X^{(H_1)},X^{(H_2)}\) respectively. In fact, from (59) we can say that \(\widehat{d^{**}}\) satisfies Definition 7 in the wide sense: it depends approximately linearly on \(|H_1-H_2|\) when \(H_1,H_2\) belong to the same one of the two intervals \((0,1/2]\) and \([1/2,1)\); it is approximately larger than \(|H_1-H_2|\) when \(H_1,H_2\) belong to different intervals. This fact allows our asymptotically consistent algorithms to be more efficient when clustering self-similar processes with weakly stationary increments having different values of H. In Sect. 4.2 we provide an example of clustering using our consistent algorithms with and without the \(\log ^*\)-transformation, when the observed paths are from a well-known self-similar process with stationary increments: fractional Brownian motion.
4 Simulation and empirical study
This section is devoted to applying our clustering algorithms to several synthetic and real-world datasets. It is worth noting that, in our statistical setting, the autocovariance functions are supposed to be unavailable, so the prior choice of the weights \(w_j\) presents a trade-off between the convergence of the dissimilarity measure and practical applicability. On one hand, a low rate of convergence (e.g. \(w_j=1/j(j+1)\)) risks yielding a divergent dissimilarity measure \(d^*\) [see (5)]. On the other hand, a high rate of convergence (e.g., \(w_j=1/j^3(j+1)^3\)) will only make use of the first few observations in the sample paths. We believe that the first issue is a minor one in practice, because for most wide-sense stationary ergodic processes (especially Gaussian ones) taking \(w_j=1/j(j+1)\) leads to a convergent \(d^*\). Also, in practice, instead of (5) it is fine to use
for some N large enough.
Therefore, throughout this section we take \(w_j=1/j(j+1)\) and \(m_n=\lfloor \log n\rfloor \) (recall that \(\lfloor \cdot \rfloor \) denotes the floor function) in the covariance-based dissimilarity measure \(\widehat{d}^{*}\). Next we explain how the offline and online datasets are prepared in this simulation study.
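With this choice the weights telescope, since \(1/j(j+1)=1/j-1/(j+1)\), so truncating the sum at N as suggested above keeps total weight exactly \(N/(N+1)\). A quick check:

```python
def weight(j):
    """Weight sequence w_j = 1 / (j (j + 1)) used throughout Sect. 4."""
    return 1.0 / (j * (j + 1))

def truncated_mass(n):
    """Total weight kept when the sum over j is truncated at N = n;
    by telescoping this equals N / (N + 1), hence tends to 1."""
    return sum(weight(j) for j in range(1, n + 1))
```

For example, truncating at \(N=100\) retains \(100/101\approx 99\%\) of the total weight.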

[Offline dataset simulation] For each scenario, we simulate 5 groups of sample paths, each consisting of 10 paths of length \(n(t)=5t\), for the time steps \(t=1,2,\ldots ,50\). Algorithm 1 is performed over 100 such scenarios, and the misclassification rate is calculated.

[Online dataset simulation] For each scenario, we simulate 5 groups of sample paths. Let the total number of sample paths be \(N(t) = 30 + \lfloor (t-1)/10 \rfloor \) at each time step t. That is, there are 6 sample paths in each of the 5 groups when \(t=1\), and the number of sample paths in each group increases by 1 each time t increases by 10. For \(i=1,2,\ldots \), the \(i\hbox {th}\) sample path in each group has length \(n_i(t) = 5[t-(i-6)^+]\), where \(x^+ = \max (x,0)\).
We then apply the proposed clustering algorithms in both offline and online settings, and determine their corresponding misclassification rates. These misclassification rates are utilized to intuitively illustrate the asymptotic consistency of our clustering algorithms, and to compare the performance of our clustering approaches with others. Recall that the misclassification rate (i.e. mean clustering error rate, see Section 6 in Khaleghi et al. 2016) is obtained by dividing the number of misclassified paths by the total number of paths per scenario, then averaging these fractions:
More precisely, let \((C_1,\ldots ,C_\kappa )\) denote the ground-truth clusters of the N sample paths \({\mathrm {x}}_1,\ldots , {\mathrm {x}}_N\). We define the ground-truth cluster labels by
Let \((l_1,\ldots ,l_N)\) denote the cluster labels of \(({\mathrm {x}}_1,\ldots ,{\mathrm {x}}_N)\) output by some clustering approach. Then the misclassification rate p of this approach is computed by
where \(S_\kappa \) denotes the group of all possible permutations over the set \(\{1,\ldots ,\kappa \}\).
For example, in one scenario of 7 sample paths, if the ground-truth cluster labels of \(({\mathrm {x}}_1,\ldots ,{\mathrm {x}}_7)\) satisfy
while the cluster labels output by the clustering algorithm for \(({\mathrm {x}}_1,\ldots ,{\mathrm {x}}_7)\) are given by
then according to Eq. (60), the misclassification rate is 4/7. This can be explained as follows: at least 4 label changes are needed for the output cluster labels to match the ground-truth ones (1, 1, 3, 2, 2, 2, 2):
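The computation in Eq. (60) can be sketched in Python by brute force over label permutations (the authors' released implementation is in MATLAB; the labels used in the usage note below are illustrative, not the paper's example):

```python
from itertools import permutations

def misclassification_rate(truth, output):
    """Rate in the spirit of Eq. (60): the smallest fraction of label
    disagreements over all permutations of the output cluster labels.
    Brute force over kappa! permutations; fine for small kappa."""
    labels = sorted(set(truth) | set(output))
    best = len(truth)
    for perm in permutations(labels):
        relabel = dict(zip(labels, perm))
        errors = sum(1 for t, o in zip(truth, output) if relabel[o] != t)
        best = min(best, errors)
    return best / len(truth)
```

For instance, with ground truth (1, 1, 3, 2, 2, 2, 2) and output (1, 1, 1, 2, 2, 3, 3) the best relabeling leaves 3 mismatches, giving a rate of 3/7.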
We provide the implementation of the misclassification rate [see Eq. (60)] in MATLAB publicly online as misclassify_rate.m.^{Footnote 1}
4.1 Clustering non-Gaussian discrete-time stochastic processes
In Khaleghi et al. (2016) a simulation study on a non-Gaussian strictly stationary ergodic discrete-time stochastic process (see also Shields (1996)) was performed. Since this process has a finite covariance structure, it is also wide-sense stationary ergodic. As a result we can test our clustering algorithms on the same dataset and compare their performance to that in Khaleghi et al. (2016). Recall that this process \(\{X_t\}_{t\in {\mathbb {N}}}\) is generated in the following way. Fix some irrational-valued parameter \(\alpha \in (0,1)\).

Step 1. Draw a uniform random number \(r_0 \in [0,1]\).

Step 2. For each index \(i=1,2,\ldots ,N\):

Step 2.1. Define \(r_i = r_{i-1} + \alpha - \lfloor r_{i-1} + \alpha \rfloor .\)

Step 2.2. Define \(X_i = {\left\{ \begin{array}{ll} 1 &{} \text {when } r_i > 0.5, \\ 0 &{} \text {otherwise}. \end{array}\right. } \)

We simulate 5 groups of sample paths \(\{X_t\}_{t\in {\mathbb {N}}}\) indexed by the irrational values \(\alpha _1 = 0.31\ldots \), \(\alpha _2 = 0.33\ldots \), \(\alpha _3 = 0.35\ldots \), \(\alpha _4 = 0.37\ldots \), \(\alpha _5 = 0.39\ldots \) (each \(\alpha _i\), \(i=1,\ldots ,5\), is simulated using a long double with a long mantissa, see Khaleghi et al. (2016)), respectively.
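Steps 1 to 2.2 above translate directly into Python; the irrational \(\alpha \) used in the test below (the golden-ratio conjugate) is chosen purely for illustration and is not one of the paper's \(\alpha _i\):

```python
import math

def simulate_process(alpha, n, r0):
    """Generate X_1, ..., X_n following Steps 1-2.2 above, given the
    uniform draw r0 in [0, 1] and an irrational alpha in (0, 1)."""
    x, r = [], r0
    for _ in range(n):
        # Step 2.1: r_i is the fractional part of r_{i-1} + alpha.
        r = (r + alpha) - math.floor(r + alpha)
        # Step 2.2: threshold at 0.5 to produce a binary observation.
        x.append(1 if r > 0.5 else 0)
    return x
```

In practice `r0` would come from a uniform random draw (Step 1); it is an explicit argument here so that runs are reproducible.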
4.1.1 Offline dataset
We demonstrate the asymptotic consistency of Algorithm 1 by conducting offline clustering on the simulated offline datasets of \(\{X_i\}_{i\in {\mathbb {N}}}\).
The solid blue line in Fig. 1 illustrates the asymptotic consistency of Algorithm 1 through the fact that its misclassification rate decreases as time t increases. Compared to the simulation study over the same dataset in Khaleghi et al. (2016), the misclassification rate provided by our proposed algorithm converges at a comparable speed (see Figure 2 in Khaleghi et al. 2016), even though Algorithm 1 aims to cluster “covariance structures” rather than “process distributions”.
The dot-dashed red line in Fig. 1 presents the performance of Algorithm 2 and compares its misclassification rates with those from Algorithm 1. Applied to the offline dataset, the offline algorithm's misclassification rates are consistently lower than the online algorithm's; i.e., the offline clustering algorithm performs better than the online one when dealing with offline datasets.
4.1.2 Online dataset
In our simulated online datasets the number of sample paths and the length of each sample path increase as t increases. This setting mimics situations such as modeling financial asset prices, where new assets are launched at each time step. The offline and online clustering algorithms are applied at each time t with 100 runs, and their misclassification rates at each time t are then obtained.
Figure 2 compares the misclassification rates of the offline and online algorithms applied to the online dataset described above. The periodic pattern in which the offline algorithm's misclassification rate jumps every 10 time steps matches the timing of adding new observations; that is, the misclassification rate spikes whenever new observations arrive. We observe that the misclassification rate of the online algorithm is overall lower than that of the offline algorithm on this dataset, reflecting the advantage of the online algorithm over the offline one when new observations are expected to occur. It is worth pointing out that our online setting is different from the one in Khaleghi et al. (2016); therefore the two clustering results are not comparable.
Finally, all the codes in MATLAB that reproduce the main conclusions in this subsection can be found publicly online.^{Footnote 2}
4.2 Clustering fractional Brownian motions
In this section, we present the performance of the proposed offline (Algorithm 1) and online (Algorithm 2) methods on a synthetic dataset sampled from continuous-time Gaussian processes. The wide-sense stationary ergodic processes that we choose are the first-order increment processes of fractional Brownian motions (see Mandelbrot and van Ness 1968). Denote by \(\{B^H(t)\}_{t\ge 0}\) a fractional Brownian motion with Hurst index \(H\in (0,1)\). It is well-known that \(B^H\) is a zero-mean self-similar Gaussian process with self-similarity index H and with covariance function
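For the reader's convenience, the standard fBm covariance that (61) refers to, under the unit-variance normalization \({\mathbb {V}}ar\,B^H(1)=1\) (an assumption in this restatement), reads

$$\begin{aligned} {\mathbb {C}}ov\left( B^H(s),B^H(t)\right) =\frac{1}{2}\left( s^{2H}+t^{2H}-|s-t|^{2H}\right) ,\quad s,t\ge 0. \end{aligned}$$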
Fix \(h>0\), define its increment process (with time variation h) to be
\(Z_h^{(H)}\) is also called fractional Gaussian noise. Using the covariance function (61) we obtain the autocovariance function of \(Z_h^{(H)}\) below: for \(\tau \ge 0\),
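Under the same unit-variance normalization (assumed here), the autocovariance (62) of \(Z_h^{(H)}\) can be sketched and sanity-checked in Python; for \(H=1/2\) the increments are independent, so \(\gamma (\tau )=0\) whenever \(\tau \ge h\):

```python
def fgn_autocov(tau, h, hurst):
    """Autocovariance of fractional Gaussian noise Z_h^{(H)} under the
    assumed normalization Var B^H(1) = 1, i.e.
    gamma(tau) = ((tau+h)^{2H} - 2 tau^{2H} + |tau-h|^{2H}) / 2."""
    two_h = 2.0 * hurst
    return (abs(tau + h) ** two_h - 2.0 * abs(tau) ** two_h
            + abs(tau - h) ** two_h) / 2.0
```

The sign of the long-range correlation also comes out as expected: positive for \(H>1/2\), negative for \(H<1/2\).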
Recall that for stationary Gaussian processes such as \(Z_h^{(H)}\), strict ergodicity can be fully expressed in the language of the autocovariance function \(\gamma \); i.e., the following result (Maruyama 1970; Ślęzak 2017) provides a sufficient and necessary condition for a stationary Gaussian process to be strictly ergodic.
Theorem 4
(Strict ergodicity of Gaussian processes) A continuoustime Gaussian stationary process X is strictly ergodic if and only if
where \(\gamma _X\) denotes the autocovariance function of X.
In view of (62) we can deduce that the autocovariance function \(\gamma \) of \(Z_h^{(H)}\) satisfies (63). This together with Theorem 4 yields that \(Z_h^{(H)}\) is second-order strict-sense stationary ergodic, so it is also wide-sense stationary ergodic.
To test our algorithms we simulate \(\kappa =5\) groups of independent fractional Brownian paths, with the \(i\hbox {th}\) group containing 10 paths of the form \(\{B^{H_i}(1/n),\ldots ,B^{H_i}((n-1)/n),B^{H_i}(1)\}\), for the self-similarity indexes
Remark that clustering a zero-mean fractional Brownian motion \(B^H\) is equivalent to clustering its increments \(Z_{1/n}^{(H)}(t)=B^{H}(t+1/n)-B^H(t)\). These 50 observed paths of \(Z_{1/n}^{(H)}(t)\), each of length 150, compose an offline dataset and an online one. The clustering algorithms are applied to the dataset at each time step t, and 100 runs are made to compute the misclassification rates. We use the offline (resp. online) dataset clustering algorithm to cluster the offline (resp. online) dataset. The purpose is to compare the algorithms with and without the \(\log ^*\)-transformation.
Figure 3 presents the comparison of the two algorithms, one using the dissimilarity measure \(\widehat{d}^{*}\) and the other using \(\widehat{d^{**}}\), based on the behavior of the misclassification rates as time increases. We conclude that both algorithms, with and without the \(\log ^*\)-transformation, are asymptotically consistent. However, in both offline and online settings, the covariance-based dissimilarity algorithms with the \(\log ^*\)-transformation (dashed red lines) have 30% lower misclassification rates on average than the algorithms without it (solid blue lines). This simulation study demonstrates the benefit of the \(\log ^*\)-transformed covariance-based dissimilarity measure when the underlying observations have a nonlinear, especially power-type, covariance structure, as with observations sampled from self-similar processes.
The codes in MATLAB used in this subsection are provided publicly online.^{Footnote 3}
4.3 Clustering AR(1) processes: non-strict-sense stationary ergodic
To show that our algorithms can be applied to clustering non-strict-sense stationary ergodic processes, we consider a simulation study on the non-Gaussian AR(1) process \(\{Y(t)\}_t\) defined in Example 2, Eq. (1). We then conduct the cluster analysis with \(\kappa =5\), and specify the values of a in Eq. (1) as
We mimic the procedure in Sect. 4.2 to generate the offline and online datasets of \(\{Y(t)\}_t\). Figure 4 illustrates the consistent convergence behavior of the offline and online algorithms under the different dataset settings.
All the codes in MATLAB that reproduce the main conclusions in this subsection can be found publicly online.^{Footnote 4}
4.4 Application to the real world: clustering global equity markets
4.4.1 Data and methodology
In this section we apply the clustering algorithms to real-world datasets. The application involves dividing the equity markets of major economic entities in the world into different subgroups. In financial economics, researchers usually cluster global equity markets according to either the geographical region or the development stage of the underlying economic entities. The reasoning behind these clustering methods is that entities with smaller geographical distance and closer development levels engage in more bilateral economic activities. Impacted by similar economic factors, entities with less “distance” tend to have higher correlation in stock market performance. This correlation then measures the level of “comovement” of stock market indexes on the global capital market.
However, globalization is breaking the barriers of region and development level. For instance, in 2016 China became the largest trading partner of the U.S. (besides the EU).^{Footnote 5} China is not a regional neighbor of the U.S., and is categorized as a developing country by the World Bank, in contrast to the U.S. as a developed country.
We cluster the equity markets in the world according to the empirical covariance structure of their performance, using Algorithms 1 and 2 as proposed in this paper. Then we compare our clustering results with the traditional clustering methodologies. The index constituents of the MSCI ACWI (All Country World Index) are selected as the sample data. Each observation is a sample path representing the historical monthly returns of the underlying economic entity. Empirical studies have shown that these index returns exhibit the “long memory” path feature, hence they can be modeled by self-similar processes such as fractional Brownian motions (see e.g. Comte and Renault 1998; Bianchi and Pianese 2008). Therefore, similarly to Sect. 4.2, we may cluster the increments of the index returns with the \(\log ^*\)-transformed dissimilarity measure \(\widehat{d^{**}}\). MSCI ACWI is the leading global equity market index and has $3.2 billion in underlying market capitalization.^{Footnote 6} MSCI ACWI contains 23 developed markets and 24 emerging markets from 4 regions: Americas, EMEA (Europe, Middle East and Africa), Pacific and Asia. Table 1 lists all markets included in this empirical study. We exclude the Greek market due to its bankruptcy after the global financial crisis.
We construct both offline and online datasets starting from different dates. The offline dataset starts on Jan. 30, 2009, to exclude the financial crisis period of 2007 and 2008: during a global stock market crisis, the (downside) performance of equity markets is contagious and thus blurs the cluster analysis. The online dataset starts on Jan. 31, 1989, and covers the 1997 Asian financial crisis, the 2003 dot-com bubble and the 2007 subprime mortgage crisis. Another key feature is that 14 markets have been added to the MSCI ACWI index (at different times) since 1989, including 1 developed market and 13 emerging markets. The case where new time series are observed is therefore handled by the online dataset.
4.4.2 Clustering results
We compare the clustering outcomes of both offline and online datasets with the partitions suggested by region (4 groups) and by development level (2 groups). The factor yielding the lowest misclassification rate is the one contributing the most to the covariance-based dissimilarity measure; in other words, it is the factor that drives the clustering of stock markets with the most significant impact.
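Since the reference groupings (region, development level) carry arbitrary label names, a misclassification rate of this kind must be computed under the best matching of predicted clusters to reference groups. A hypothetical sketch (the function name and the brute-force matching over label permutations are illustrative assumptions, not the paper's procedure):

```python
from itertools import permutations

def misclassification_rate(pred, truth):
    # Lowest error rate over all one-to-one matchings of predicted
    # cluster labels to reference group labels. Assumes there are no
    # more predicted clusters than reference groups.
    labels_p, labels_t = sorted(set(pred)), sorted(set(truth))
    best = len(pred)
    for perm in permutations(labels_t, len(labels_p)):
        mapping = dict(zip(labels_p, perm))
        errors = sum(mapping[p] != t for p, t in zip(pred, truth))
        best = min(best, errors)
    return best / len(pred)
```

For the small numbers of clusters used here (2 or 4 groups), the brute-force search over permutations is negligible; larger label sets would call for an assignment algorithm instead.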
Table 2 shows that the misclassification rates by development level are significantly and consistently lower than those by geographical region, for both algorithms (offline and online) and both datasets (offline and online). The clustering results suggest that geographical distance is less dominant than the development level of the underlying economic entities when analyzing different groups of equity markets.
The global minimum of the misclassification rate occurs when the online algorithm is applied to the offline dataset. Table 3 presents the detailed clustering outcome in this case, listing the correctly and incorrectly categorized equity markets in each group. For instance, the China (Mainland) market is correctly categorized along with the other emerging markets, while the Austrian market, although a developed market in the MSCI ACWI, is assigned to the group where most equity markets are emerging ones. The misclassified markets in the emerging group are Austria, Finland, Italy, Norway and Spain; those in the developed group are Malaysia, Philippines, Taiwan, Chile and Mexico. These empirical results suggest that several capital markets have irregular post-crisis performance, which blurs the boundary between emerging and developed markets.
The contribution of this real-world cluster analysis is twofold. First, we explored and determined the principal force that creates structural differences in global capital markets, which potentially predicts the “co-movement” pattern of future index performance. Second, we provided new evidence on the impact of globalization in breaking down geographical barriers between economic entities.
5 Conclusion and future perspectives
Inspired by Khaleghi et al. (2016), we introduce the problem of clustering wide-sense stationary ergodic processes. A new covariance-based dissimilarity measure is proposed to obtain asymptotically consistent clustering algorithms for both offline and online settings. The recommended algorithms are competitive for at least two reasons:

1.
Our algorithms are applicable to clustering a wide class of stochastic processes, including any strict-sense stationary ergodic processes whose covariance structures are finite.

2.
Our algorithms are efficient in terms of computational cost. In particular, a so-called \(\log^*\)-transformation is introduced to improve the efficiency of clustering self-similar processes.
The above advantages are supported by the simulation study on non-Gaussian discrete-time processes, fractional Brownian motions and non-Gaussian, non-strict-sense stationary ergodic AR(1) processes, as well as by a real-world application: clustering global equity markets. MATLAB implementations of our clustering algorithms are publicly available online.
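For readers unfamiliar with the \(\log^*\) notation, one common convention defines the iterated logarithm as the number of times the logarithm must be applied before the value drops to 1 or below; the paper's exact definition of the \(\log^*\)-transformation may differ, so the sketch below is only indicative of the notation:

```python
import math

def log_star(x):
    # Iterated logarithm under one common convention: count how many
    # times the natural log must be applied until the value is <= 1.
    count = 0
    while x > 1.0:
        x = math.log(x)
        count += 1
    return count
```

The appeal of such a transformation is its extremely slow growth, which compresses the range of large dissimilarity values.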
Finally we note that the clustering framework proposed in our paper focuses on the case where the true number of clusters \(\kappa \) is known. The case where \(\kappa \) is unknown remains open and is left to future research. Another interesting direction concerns the many stochastic processes that are not wide-sense stationary but are closely related to wide-sense stationarity. For example, a self-similar process does not necessarily have wide-sense stationary increments, but its Lamperti transformation is strict-sense stationary (Lamperti 1962); locally asymptotically self-similar processes are generally not self-similar, but their tangent processes are (Boufoussi et al. 2008). Our cluster analysis sheds light on clustering such processes; these topics are also left for future research.
Notes
5. Source: U.S. Department of Commerce, Census Bureau, Economic Indicators Division.
6. As of June 30, 2017, as reported on September 30, 2017 by eVestment, Morningstar and Bloomberg.
References
Bastos, J. A., & Caiado, J. (2014). Clustering financial time series with variance ratio statistics. Quantitative Finance, 14(12), 2121–2133.
Bianchi, S., & Pianese, A. (2008). Multifractional properties of stock indices decomposed by filtering their pointwise Hölder regularity. International Journal of Theoretical and Applied Finance, 11(06), 567–595.
Boufoussi, B., Dozzi, M., & Guerbaz, R. (2008). Path properties of a class of locally asymptotically self similar processes. Electronic Journal of Probability, 13(29), 898–921.
Cambanis, S., Hardin, C. J., & Weron, A. (1987). Ergodic properties of stationary stable processes. Stochastic Processes and their Applications, 24(1), 1–18.
Cesa-Bianchi, N., & Lugosi, G. (2006). Prediction, learning, and games. Cambridge: Cambridge University Press.
Comte, F., & Renault, E. (1998). Long memory in continuous-time stochastic volatility models. Mathematical Finance, 8(4), 291–323.
Damian, D., Orešič, M., Verheij, E., et al. (2007). Applications of a new subspace clustering algorithm (COSA) in medical systems biology. Metabolomics, 3(1), 69–77.
Embrechts, P., & Maejima, M. (2000). An introduction to the theory of self-similar stochastic processes. International Journal of Modern Physics B, 14(12), 1399–1420.
Gray, R. M. (1988). Probability, random processes, and ergodic properties. Berlin: Springer.
Hartigan, J. A. (1975). Clustering algorithms. New York: Wiley.
Herdin, M., Czink, N., Ozcelik, H., & Bonek, E. (2005). Correlation matrix distance, a meaningful measure for evaluation of nonstationary MIMO channels. In IEEE 61st vehicular technology conference, 2005 (Vol. 1, pp. 136–140).
Shirkhorshidi, A. S., Aghabozorgi, S., & Wah, T. Y. (2015). A comparison study on similarity and dissimilarity measures in clustering continuous data. PLoS ONE, 10(12), e0144059.
Ieva, F., Paganoni, A. M., & Tarabelloni, N. (2016). Covariance-based clustering in multivariate and functional data analysis. Journal of Machine Learning Research, 17, 1–21.
Jääskinen, V., Parkkinen, V., Cheng, L., & Corander, J. (2014). Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model. Statistical Applications in Genetics and Molecular Biology, 13(1), 105–121.
Jain, A. K., & Mao, J. (1996). A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Transactions on Neural Networks, 7, 16–29.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31(3), 264–323.
Juozapavičius, A., & Rapsevicius, V. (2001). Clustering through decision tree construction in geology. Nonlinear Analysis: Modelling and Control, 6(2), 29–41.
Katsavounidis, I., Kuo, C. J., & Zhang, Z. (1994). A new initialization technique for generalized Lloyd iteration. IEEE Signal Processing Letters, 1(10), 144–146.
Khaleghi, A., Ryabko, D., Mary, J., & Preux, P. (2016). Consistent algorithms for clustering time series. Journal of Machine Learning Research, 17(3), 1–32.
Kleinberg, J. M. (2003). An impossibility theorem for clustering. Advances in Neural Information Processing Systems (NIPS), 15, 463–470.
Lamperti, J. W. (1962). Semi-stable stochastic processes. Transactions of the American Mathematical Society, 104, 62–78.
Magdziarz, M., & Weron, A. (2011). Ergodic properties of anomalous diffusion processes. Annals of Physics, 326, 2431–2443.
Mandelbrot, B., & van Ness, J. W. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4), 422–437.
Maruyama, G. (1970). Infinitely divisible processes. Theory of Probability and Its Applications, 15(1), 1–22.
Pavlidis, N. G., Plagianakos, V. P., Tasoulis, D. K., & Vrahatis, M. N. (2006). Financial forecasting through unsupervised clustering and neural networks. Operational Research, 6(2), 103–127.
Peng, J., & Müller, H. G. (2008). Distance-based clustering of sparsely observed stochastic processes, with applications to online auctions. The Annals of Applied Statistics, 2(3), 1056–1077.
Peng, Q. (2012). Uniform Hölder exponent of a stationary increments Gaussian process: Estimation starting from average values. Statistics & Probability Letters, 81(8), 1326–1335.
Rubinstein, M., Joulin, A., Kopf, J., & Liu, C. (2013). Unsupervised joint object discovery and segmentation in internet images. In The IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1939–1946).
Samorodnitsky, G. (2004). Extreme value theory, ergodic theory and the boundary between short memory and long memory for stationary stable processes. The Annals of Probability, 32(2), 1438–1468.
Samorodnitsky, G., & Taqqu, M. S. (1994). Stable non-Gaussian random processes: Stochastic models with infinite variance. New York: Chapman & Hall.
Sen, P. K., & Singer, J. M. (1993). Large sample methods in statistics. New York: Chapman & Hall Inc.
Shields, P. C. (1996). The ergodic theory of discrete sample paths, Graduate Studies in Mathematics (Vol. 13). Providence: American Mathematical Society.
Ślęzak, J. (2017). Asymptotic behaviour of time averages for non-ergodic Gaussian processes. Annals of Physics, 383, 285–311.
Slonim, N., Atwal, G. S., Tkačik, G., & Bialek, W. (2005). Information-based clustering. PNAS, 102(51), 18297–18302.
Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. JAIR, 6, 1–34.
Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645–678.
Zhao, W., Zou, W., & Chen, J. J. (2014). Topic modeling for cluster analysis of large biological and medical datasets. BMC Bioinformatics, 15, S11.
Acknowledgements
We gratefully thank the editor Dr. João Gama and three anonymous referees for their careful reading of our manuscript and their many insightful comments and suggestions.
Peng, Q., Rao, N. & Zhao, R. Covariance-based dissimilarity measures applied to clustering wide-sense stationary ergodic processes. Mach Learn 108, 2159–2195 (2019). https://doi.org/10.1007/s10994-019-05818-x
Keywords
 Cluster analysis
 Wide-sense stationary ergodic processes
 Covariance-based dissimilarity measure
 Self-similar processes