1 Introduction

The main goal of time series clustering is to divide a set of time series into homogeneous groups. Such problems arise in many fields of application, ranging from climatology and geology to health sciences and failure detection, to mention just a few. Recent surveys of the field can be found in Aghabozorgi et al. [1] and Caiado et al. [5]. As these works point out, one of the key issues is that similarity among time series can be measured in many ways, each leading to a different cluster formation. While most of the dissimilarity measures proposed in the literature use raw data or features extracted from the time series, they rarely take into account the dependence between series when forming clusters. Alonso and Peña [2] and Alonso et al. [4] propose a measure that captures linear dependencies and use it for grouping time series and for estimating factorial models with group structure, respectively. This measure, called generalized cross-correlation, uses all auto- and cross-correlations up to a specified lag. However, it takes into account the entire range of values of the time series whereas, in this paper, we are interested in the dependence on extreme values, that is, values that exceed certain thresholds. Therefore, one of the challenges in clustering time series is to select measures that properly capture extremal dependence between series and then to perform the clustering based on those dependencies. Extreme values in a time series are often indicative of significant events or trends, so accurately capturing this dependence can be crucial in real-world problems such as meteorology, computer security or the stock market. The underlying idea is that time series exhibiting similar patterns in their extreme values are more likely to be close than those with dissimilar patterns.

Clustering of time series by measures of the behavior of their extremes is a topic of active research that is gaining importance. Early works in this area include Scotto et al. [26] and Scotto et al. [28], who proposed a time series clustering framework based on extreme value theory and classification techniques, in which the distance between time series is assessed through the L\(_2\)-Wasserstein distance between their posterior predictive distributions of return values for a predefined return period. An example of application to European daily temperature series can be found in Scotto et al. [27]. Clustering techniques based on tail dependence coefficients have also been proposed in the literature. To this end, De Luca and Zuccolotto [10] consider a dissimilarity measure based on the lower pairwise tail dependence coefficient and a copula-based function to group time series with an association between low values. Furthermore, in Durante et al. [8] the pairwise tail dependence coefficients are computed via a non-parametric procedure. An extension to copula functions with time-varying parameters is proposed by De Luca and Zuccolotto [11], and Yang et al. [30] considered a further extension based on jump tail dependence coefficients. A double clustering procedure based on the lower and upper tail dependence coefficients was introduced by De Luca and Zuccolotto [12]. More about copula-based measures of association can be found in De Luca and Zuccolotto [13]. A different methodology was adopted by D’Urso et al. [9], who proposed an approach based on extreme features of time series using static parameters estimated from the Generalized Extreme Value (GEV) distribution. It is also worth mentioning that metrics based on quantile autocovariance functions (QAF) have been introduced recently: Lafuente-Rego and Vilar [20] and Lafuente-Rego et al. [21] proposed dissimilarity measures based on the QAF. However, their approaches do not take into account the dependence among time series.

The rest of the paper is organized as follows. In order to be self-contained, Sect. 2 introduces the required concepts of time series and clustering; building on them, Sect. 3 proposes three similarity measures that capture extremal dependence; these measures are empirically evaluated on different simulated scenarios and on a set of daily maximum temperature series across Europe in Sect. 4. Conclusions and possible extensions are presented in Sect. 5.

2 Time series and clustering

In this section, the book of Hamilton [16] has been used as the basis for the main definitions and properties of time series, and Hartigan [17], Ripley [23] and Rousseeuw [24] for clustering procedures.

2.1 Introduction to time series

A time series is a collection or sample of observations indexed by the time of occurrence. Usually the collected data begin and end at particular dates or times. Viewing time as extending infinitely in both directions, the observed sample can be seen as a finite segment of a doubly infinite sequence

$$\begin{aligned} \{y_t\}^{\infty }_{-\infty } = \{\dots ,y_{-1},y_0,y_1,\dots ,y_T,y_{T+1},\dots \}. \end{aligned}$$

If \(\{y_t\}^{\infty }_{-\infty }\) is a sequence of uncorrelated random variables with zero mean and constant variance, then the time series is said to be white noise.

Furthermore, if neither the mean nor the autocovariances depend on t, then the time series is said to be weakly stationary. Moreover, a process is said to be strictly stationary if, for any values \(j_1,\dots ,j_n\), the joint distribution of \(y_{t+j_1},\dots , y_{t+j_n}\) depends only on the intervals separating the time indexes \(j_1,\dots ,j_n\) but not on the time index t. The most widely used models to describe the temporal dependence structure of a weakly stationary process are the autoregressive moving-average (ARMA) models.

Since ARMA processes focus on short-term dependencies, extreme values are difficult to capture or predict with them. There have been several attempts to overcome this limitation; in particular, the Max-ARMA model is a popular alternative in Extreme Value Theory (EVT). Davis and Resnick [7] introduced the Max-ARMA process as follows.

Definition 2.1

(Max-ARMA process) Let \(\{\varepsilon _t\}\) be a sequence of independent and identically distributed unit Fréchet random variables, that is, \(\mathbb {P}(\varepsilon _t<u) = \exp (-1/u)\) for \(u>0\) and 0 otherwise. Then, a max-autoregressive moving average process of order (p, q), denoted Max-ARMA(p, q), is defined as follows:

$$\begin{aligned} X_t = \max \left( \sum _{i=1}^p \phi _i X_{t-i}, \text { }\varepsilon _t+\sum _{i=1}^q \theta _i \varepsilon _{t-i}\right) , \end{aligned}$$

where \(\phi _i\), \(i=1,\dots ,p\), and \(\theta _i\), \(i=1,\dots ,q\), are real constants.

Although the previous definition allows arbitrary orders, in practice the most used order is (1, 0). The following example shows trajectories and autocorrelation functions of max-AR processes (Fig. 1).

Example 2.1

Let \(X_t\) and \(Y_t\) be two max-autoregressive processes of order (1, 0), with \(\phi _x=0.95\) and \(\phi _y=0.5\), that is

$$\begin{aligned} X_t = \max \left( 0.95\, X_{t-1},\ \varepsilon _t\right) ,\qquad Y_t = \max \left( 0.5\, Y_{t-1},\ \eta _t\right) , \end{aligned}$$

where \(\varepsilon _t\) and \(\eta _t\) are i.i.d. unit Fréchet random variables. Despite having the same order, the autoregressive coefficient can have a great influence on the series: \(\phi \) governs how quickly an extreme value shrinks. Thus, for a high \(\phi \), the correlation between \(X_t\) and \(X_{t-1}\) is high, while for a low \(\phi \) it is low (see Fig. 2).
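As an illustrative sketch (ours, not the authors' code), such trajectories can be simulated by drawing unit Fréchet innovations via the inverse transform \(F^{-1}(u)=-1/\log u\) and applying the max-recursion:

```python
import numpy as np

def simulate_max_ar1(phi, n, rng):
    """Simulate X_t = max(phi * X_{t-1}, eps_t) with i.i.d. unit
    Frechet innovations eps_t."""
    eps = -1.0 / np.log(rng.uniform(size=n))  # inverse-transform sampling
    x = np.empty(n)
    x[0] = eps[0]
    for t in range(1, n):
        x[t] = max(phi * x[t - 1], eps[t])
    return x

rng = np.random.default_rng(0)
x = simulate_max_ar1(0.95, 1000, rng)  # extremes decay slowly
y = simulate_max_ar1(0.50, 1000, rng)  # extremes shrink quickly
```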

Fig. 1: Simulated max-autoregressive processes with \(\phi _x=0.95\) and \(\phi _y=0.5\)

Fig. 2: Sample ACF for \(X_t\) and \(Y_t\)

2.2 Similarity measures

As mentioned in the Introduction, similarity measures play a key role in this research. The main idea is to quantify the similarity between two objects (two time series in the present work). In the same way that the distance between two points in an n-dimensional space can be measured in many ways, so can the similarity between two time series. Before introducing the concept of a similarity measure that captures time series dependency, we define cross-correlation.

2.2.1 Cross-correlation measure

The definition of the cross-correlation measure appears in different forms in the literature. Here, we follow Paparrizos and Gravano [22], since their paper motivates one of the measures proposed in Sect. 3.

Let \(X=(x_1,\dots ,x_T)\) and \(Y=(y_1,\dots ,y_T)\) be two time series whose expectation and variance exist. The normalized cross-correlation can be defined in three ways, namely

$$\begin{aligned} \text {NCC}_b(X,Y)&:=\frac{CC_k(X,Y)}{T}, \\ \text {NCC}_u(X,Y)&:=\frac{CC_k(X,Y)}{T-|k-T|} \end{aligned}$$

and

$$\begin{aligned} \text {NCC}_c(X,Y):=\frac{CC_k(X,Y)}{\sqrt{R_0(X,X)\cdot R_0(Y,Y)}}, \end{aligned}$$

where \(k\in \{1,\dots ,2T-1\}\), \(CC_{k}(X,Y):=R_{k-T}(X,Y)\), and \(R_w(X,Y)\) is computed as

$$\begin{aligned} R_w(X,Y)={\left\{ \begin{array}{ll} \sum \limits _{l=1}^{T-w} x_{l+w} y_{l}, \quad w\ge 0 \\ R_{-w}(Y,X), \quad w<0 \end{array}\right. }. \end{aligned}$$
(1)
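For concreteness, here is a minimal (non-optimized) Python sketch of \(R_w\) and the coefficient-normalized sequence \(\text {NCC}_c\); in practice the full sequence is computed with the FFT, as in Paparrizos and Gravano [22]:

```python
import numpy as np

def R(x, y, w):
    """R_w(X, Y) of Eq. (1): truncated sum of lagged cross-products."""
    if w < 0:
        return R(y, x, -w)
    T = len(x)
    return float(np.dot(x[w:], y[:T - w]))

def ncc_c(x, y):
    """Coefficient-normalized cross-correlation NCC_c for every
    shift k = 1, ..., 2T - 1 (i.e., w = k - T)."""
    T = len(x)
    denom = np.sqrt(R(x, x, 0) * R(y, y, 0))
    return np.array([R(x, y, k - T) / denom for k in range(1, 2 * T)])
```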

The point is that the normalization of the data and of the cross-correlation measure can have a great impact on the resulting cross-correlation sequence; consequently, the creation of a distance measure is not a trivial task. Next, we define a dependence-based similarity measure.

Definition 2.2

(Dependence-Based Similarity Measure) A function \(C: \mathbb {R}^T \times \mathbb {R}^T \longrightarrow \mathbb {R}\) is called a dependence-based similarity measure if it verifies the following properties:

  • \(C(X,Y)\in [0,1]\), \( \forall X, Y\in \mathbb {R}^T\).

  • \(C(X,Y)=1\) if, and only if, there is an exact linear relation between the two series.

  • \(C(X,Y)=0\) if, and only if, all the cross-correlation coefficients between the two time series are zero.

In practice, it is usual to work instead with dissimilarity measures, which indicate to what extent the data objects differ from each other. In the present work we consider the dissimilarity measure as the complement of the similarity measure: \(C(x_t,y_t)=0\) means total similarity while \(C(x_t,y_t)=1\) means total dissimilarity.

2.3 Clustering procedures

A clustering of a set \(\mathcal {X}\) of n objects is a partition of \(\mathcal {X}\) into non-empty, disjoint and exhaustive classes, called clusters. Several unsupervised classification algorithms are used, such as the k-means algorithm or hierarchical clustering. Since the data are unlabeled, the main criterion for classification in almost all the algorithms in use is a distance matrix. In other words, to decide whether a point belongs to a cluster, one measures how far that point is from the other points in the cluster. Again, the key is how that distance or similarity is measured.

2.3.1 Hierarchical clustering

One of the most widely used algorithms to perform unsupervised classification is hierarchical clustering; see, for instance, Hartigan [17]. This method has two different versions:

  • Agglomerative: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

  • Divisive: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In this type of clustering, a dendrogram is used to illustrate the arrangement of the clusters produced by the corresponding analysis, and the number of clusters is typically estimated from it (it is not required to set the number of clusters from the beginning). Since clusters may contain more than one data point, it is necessary to define a distance between clusters. For this purpose, the most frequently used methods are the following:

  • Single linkage: The distance between two clusters is defined as the minimum distance between two points that belong to different clusters. This linkage tends to produce chain-like clusters.

  • Complete linkage: The distance between two clusters is defined as the maximum distance between two points that belong to different clusters. It tends to produce more compact clusters.

  • Average linkage: The distance between two clusters is defined as the average of the distances between all pairs of points that belong to different clusters. The dendrogram produced with this linkage might be the easiest one to interpret.

Determining the best linkage method in hierarchical clustering can be a subjective process and often depends on the specific goals and characteristics of the data. In order to compare the obtained dendrograms, one needs to measure how similar these results are. For this purpose, the cophenetic distance between two observations that have been clustered is defined as the inter-group dissimilarity at which the two observations are first combined into a single cluster. Let \(\mathcal {X} = (X_1,\dots ,X_n)\) be the original set of n objects and \(\mathcal {T}=\{T_1,\dots ,T_n\}\) the obtained dendrogram, i.e. a simplified model in which data that are ‘close’ have been grouped into a hierarchical tree. Consider the distance between the ith and jth observations, \(d(X_i,X_j)\), and the distance \(t(X_i,X_j)\) defined by the dendrogram between the model points \(T_i\) and \(T_j\), i.e. the height of the node at which \(T_i\) and \(T_j\) are first joined together. Then, letting \(\overline{d}\) be the average of the \(d(X_i,X_j)\) and \(\overline{t}\) the average of the \(t(X_i,X_j)\), the cophenetic correlation coefficient, \(\text {CCC}\), is defined as:

$$\begin{aligned} \text {CCC}:=\frac{\sum \nolimits _{i<j}(d(i,j)-\overline{d})(t(i,j)-\overline{t})}{\sqrt{\sum \nolimits _{i<j}(d(i,j)-\overline{d})^2 \sum \nolimits _{i<j}(t(i,j)-\overline{t})^2}}. \end{aligned}$$
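As an illustration (a sketch of ours, assuming the pairwise dissimilarities from one of the measures of Sect. 3 are already stored in a square matrix D), the dendrogram and the CCC can be obtained with SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def dendrogram_and_ccc(D, method="single"):
    """Build a dendrogram from a symmetric dissimilarity matrix D
    (zero diagonal) and return it with its cophenetic correlation."""
    d = squareform(D)              # condensed vector of the d(i, j)
    Z = linkage(d, method=method)  # agglomerative clustering
    ccc, _ = cophenet(Z, d)        # corr. between d(i, j) and t(i, j)
    return Z, ccc
```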

2.3.2 Selection of k

After obtaining clustering results for several numbers of clusters, k, it is sometimes necessary to select an ‘optimal’ k value. For this purpose, we use the silhouette index proposed by Rousseeuw [24].

Definition 2.3

(Silhouette index) Let \(\{C_1,\dots ,C_k\}\) be the predicted clusters for some set of objects \(\mathcal {X}=(X_1,\dots ,X_n)\), and d a given dissimilarity measure. Then, if \(X_i \in C_l\), its silhouette index is defined as follows:

$$\begin{aligned} s_i:= \frac{b_i-a_i}{\max (b_i, a_i)}, \end{aligned}$$

where

$$\begin{aligned} a_i:= \frac{1}{|C_l|}\sum \limits _{X_j \in C_l} d(X_i, X_j)\qquad \text {and}\qquad b_i:= \min _{s \ne l}\left\{ \frac{1}{|C_s|}\sum \limits _{X_j \in C_s} d(X_i, X_j)\right\} . \end{aligned}$$

From the definition we easily see that \(s_i\in [-1,1]\) for each object \(X_i\). A value \(s_i=1\) implies that the ‘within’ dissimilarity \(a_i\) is much smaller than the smallest ‘between’ dissimilarity \(b_i\); therefore, we can say that \(X_i\) is well-clustered. When \(s_i\) is close to zero, \(a_i\) and \(b_i\) are approximately equal, and it is not clear whether \(X_i\) should have been assigned to \(C_l\) or to some other cluster. A negative value means that \(X_i\) is not well-clustered.
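A sketch of how the mean silhouette of a partition can be computed from a precomputed dissimilarity matrix, reusing the dendrogram Z from the previous sketch (note that scikit-learn averages \(a_i\) over \(|C_l|-1\) neighbours, a slight variant of Definition 2.3):

```python
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import silhouette_score

def mean_silhouette(Z, D, k):
    """Cut the dendrogram Z into k clusters and return the mean
    silhouette index, using the precomputed dissimilarities D."""
    labels = fcluster(Z, t=k, criterion="maxclust")
    return silhouette_score(D, labels, metric="precomputed")

# Example: pick the k in 3..12 with the largest mean silhouette.
# best_k = max(range(3, 13), key=lambda k: mean_silhouette(Z, D, k))
```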

2.3.3 K-means clustering

The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized; see, for instance, Ripley [23]. For this, a centroid is computed for each cluster as the mean of the points assigned to it. In this way, objects within the same cluster are as similar as possible, whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In the k-means algorithm, the number of clusters, k, must be set from the beginning. These are the main steps of the algorithm:

  • Select (randomly) k objects from the data set to serve as the initial centroids for the clusters.

  • Assign each data point to its closest centroid, where closest is defined using a distance between the object and the cluster mean.

  • Compute the new mean value of each cluster and repeat iteratively the second step until the cluster assignments stop changing.

In order to avoid dependence on the first centroids, the previous three steps are repeated a number of times and the solution with the lowest total within-cluster variation is selected.

The k-means procedure is one of the most popular partitional clustering methods. Others, like k-medoids clustering, follow the same idea but change the concept of centroid: instead of considering the mean point of a cluster, the point with the maximum similarity to all the other points in the cluster is selected. This yields a more robust clustering, less sensitive to noise and outliers than k-means, as the sketch below illustrates.
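A minimal sketch of the k-medoids idea on a precomputed dissimilarity matrix (a full PAM implementation performs swap-based updates instead):

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Minimal k-medoids on a precomputed dissimilarity matrix D:
    alternate nearest-medoid assignment and medoid update."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:  # keep the old medoid if a cluster empties
                within = D[np.ix_(members, members)].sum(axis=1)
                new[j] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels
```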

3 Similarity measures to capture extremal dependence

One of the disadvantages of unsupervised classification of time series is that the outcome is very sensitive to the choice of distance or similarity measure. Therefore, defining a similarity measure that distinguishes the main properties of the sequences is crucial. Among the distance metrics used in clustering algorithms, each exhibits advantages and weaknesses: the Euclidean distance (ED) is the most interpretable, the Dynamic Time Warping distance (DTW) is usual for nonlinear mappings, while the shape-based distance (SBD, explained below) accounts for offsets in amplitude and phase and obtains better results in some other cases. At the same time, the linear alignment of ED can lead to large deviations, the non-linear alignment of DTW can mismatch similar shapes, and SBD restricts itself to shift-invariance and may miss other dependencies. In Sect. 4, several similarity measures will be applied to simulated and real-life data.

3.1 Generalized cross correlation

In the literature, several methods have been proposed to cluster time series; in most cases, a distance measure between two series is built from their univariate features, such as the raw values, the periodograms or the autoregressive representation of the process. Alonso and Peña [2] proposed to use the pairwise linear dependency for clustering, based on the cross-correlation coefficients between the series, and defined a general measure of linear association called the Generalized Cross Correlation (GCC).

The generalized cross-correlation similarity measure compares the determinants of the correlation matrices up to some lag k of the bivariate vector formed by two univariate time series, \(x_t\) and \(y_t\). Suppose, without loss of generality, that the series are standardized so that \(E(x_t)=E(y_t)=0\) and \(E(x_t^2)=E(y_t^2)=1\). Write \(\eta _{xx}(h)=E(x_{t-h}x_t)=\eta _x(h)\) and \(\eta _{xy}(h)=E(x_{t-h}y_t)=\eta _{yx}(-h)\). The linear dependency at lag h is represented by the following matrix:

$$\begin{aligned} {\textbf {R}}(h)=\begin{pmatrix} \eta _x(h) & \eta _{xy}(h) \\ \eta _{yx}(h) & \eta _y(h) \end{pmatrix} \end{aligned}$$

and in the same way, the linear dependency for lags from zero to k can be summarized by putting together the \((k+1)\) square matrices of dimension two that describe the dependency at each lag:

$$\begin{aligned} {\textbf {R}}_k=\begin{pmatrix} {\textbf {R}}(0) & \dots & {\textbf {R}}(k) \\ \vdots & \ddots & \vdots \\ {\textbf {R}}(-k) & \dots & {\textbf {R}}(0) \end{pmatrix}. \end{aligned}$$

Notice that this matrix is symmetric and non-negative definite, and that it corresponds to the covariance matrix of the vector stationary process

$$\begin{aligned} (x_t,y_t,x_{t-1},y_{t-1},\dots ,x_{t-k},y_{t-k})'. \end{aligned}$$

We can arrange the components of the previous vector as \((Y_{t,k}',X_{t,k}')'\), where \(Y_{t,k} = (y_t, y_{t-1},\dots ,y_{t-k})\) and \(X_{t,k} = (x_t, x_{t-1},\dots ,x_{t-k})\). The ‘new’ vector has the following correlation matrix:

$$\begin{aligned} {\textbf {R}}_{yx,k}=\begin{pmatrix} {\textbf {R}}_{yy,k} & {\textbf {C}}_{xy,k}^T \\ {\textbf {C}}_{xy,k} & {\textbf {R}}_{xx,k} \end{pmatrix}. \end{aligned}$$

From the above matrix, we will define the generalized cross correlation as:

$$\begin{aligned} \text {GCC}(x_t,y_t):= 1-\left( \frac{|{\textbf {R}}_{yx,k}|}{|{\textbf {R}}_{xx,k}||{\textbf {R}}_{yy,k}|}\right) ^{1/(k+1)}. \end{aligned}$$

This measure satisfies the conditions of a dependence-based similarity measure (Alonso and Peña [2]). Note that it captures linear dependencies over the entire range of values of the time series; however, it can be modified to focus on the extreme values. One way is to capture linear dependencies only for extremal values, which is the idea explained in Sect. 3.1.1 below. Another option is to substitute the cross-correlation measure with other measures that capture the dependency of extremal values, and this is the motivation for introducing the concepts of extremogram and cross-extremogram.
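A direct NumPy sketch of the GCC from its definition, using sample correlations in place of the population matrices (an illustration of ours, not the authors' implementation):

```python
import numpy as np

def gcc(x, y, k):
    """GCC(x, y) = 1 - (|R_yx,k| / (|R_xx,k| |R_yy,k|))^(1/(k+1)),
    with the blocks estimated by sample correlations."""
    T = len(x)
    # Rows j = 0..k hold the series at lag j, aligned on t = k..T-1
    Y = np.array([y[k - j: T - j] for j in range(k + 1)])
    X = np.array([x[k - j: T - j] for j in range(k + 1)])
    Rfull = np.corrcoef(np.vstack([Y, X]))   # (2k+2) x (2k+2) matrix
    Ryy = Rfull[: k + 1, : k + 1]
    Rxx = Rfull[k + 1:, k + 1:]
    ratio = np.linalg.det(Rfull) / (np.linalg.det(Ryy) * np.linalg.det(Rxx))
    return 1.0 - ratio ** (1.0 / (k + 1))
```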

3.1.1 Binary-GCC measure

We have seen that the GCC, based on cross-correlation coefficients, captures the linear dependency among series. The approach now is to capture the extremal dependency using the GCC. For this purpose, we introduce an extension of the GCC applied to the series of extreme-value indicators, the so-called Binary-GCC (BGCC). The modification is to consider a binary series instead of the whole time series: when the time series exhibits an extreme value, a 1 is assigned, and 0 otherwise. Then, we define the BGCC as follows:

$$\begin{aligned} \text {BGCC}(x_t,y_t) = \text {GCC}(\hat{I}_{q}(x_t),\hat{I}_q(y_t)), \end{aligned}$$

where

$$\begin{aligned} \hat{I}_q: \mathbb {R}^n&\longrightarrow \mathbb {R}^n \nonumber \\ (x_1,\dots , x_n)&\longmapsto (1_{\{x_1>P_q\}}, \dots , 1_{\{x_n>P_q\}}), \end{aligned}$$
(2)

and \(P_q\) represents the q-th percentile of the series and \(1_{A}\) denotes the indicator function. Notice that, since the GCC verifies all the conditions needed to be a similarity measure, the BGCC also fulfills them.
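Continuing the previous sketch, the BGCC simply applies the GCC to the exceedance-indicator series of (2), with the threshold taken as the empirical q-th percentile (extreme values coded as 1, following the convention in the text):

```python
import numpy as np

def binarize(x, q):
    """Eq. (2): indicator series, 1 where x exceeds its empirical
    q-th percentile and 0 otherwise."""
    return (x > np.quantile(x, q)).astype(float)

def bgcc(x, y, k, q=0.9):
    """Binary GCC: the GCC of the exceedance-indicator series."""
    return gcc(binarize(x, q), binarize(y, q), k)
```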

3.2 Shape-based distance

A measure related to the GCC is the Shape-Based Distance (SBD) proposed by Paparrizos and Gravano [22]. Instead of depending on the whole cross-correlation vector, only the maximum value of \(\text {NCC}_c(x_t,y_t)\) is considered; hence, it is not sensitive to cases in which the cross-correlation values are small for some lags. Moreover, regardless of the data normalization, the coefficient normalization is used, which gives values between \(-1\) and 1 and divides the cross-correlation sequence by the geometric mean of the autocovariances of the individual sequences. It is defined as follows:

$$\begin{aligned} \text {SBD}(x_t,y_t):= 1-\max _{k}\left( \frac{CC_{k}(x_t,y_t)}{\sqrt{R_0(x_t,x_t)\cdot R_0(y_t,y_t)}}\right) , \end{aligned}$$

which takes values in [0, 2], where 0 indicates perfect similarity. It should be noted that the SBD measure, like the DTW measure, focuses on the alignment of the time series; therefore, time series with high negative cross-correlation are considered far from each other.
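A compact sketch of the SBD (the reference implementation of [22] computes the cross-correlation sequence via the FFT for efficiency):

```python
import numpy as np

def sbd(x, y):
    """SBD(x, y) = 1 - max_k NCC_c(x, y): values in [0, 2], where 0
    means identical shape up to amplitude scaling and shift."""
    denom = np.sqrt(np.dot(x, x) * np.dot(y, y))  # R_0(x,x), R_0(y,y)
    cc = np.correlate(x, y, mode="full")          # all 2T - 1 shifts
    return 1.0 - cc.max() / denom
```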

The following two subsections introduce the necessary concepts to modify the SBD measure so that it is capable of capturing dependencies on the extreme values of the time series.

3.2.1 Extremogram

Davis and Mikosch [14] introduced a kind of asymptotic autocovariance function for sequences of extreme values, called the extremogram. Roughly speaking, it is an analogue of the correlogram for measuring dependence in extremes, and it is defined as follows:

$$\begin{aligned} \gamma _{AB}(h):=\lim _{n\rightarrow \infty }n\cdot \text {cov}\left( I_{\{a_n^{-1} X_0\in A\}}, I_{\{a_n^{-1} X_h\in B\}}\right) , \end{aligned}$$

where \(a_n\) is a normalization sequence and A and B are two fixed sets bounded away from zero. Notice that, by definition, the limit is assumed to exist. Davis and Mikosch [14] and Davis et al. [6] studied the conditions for the existence of the limit, taking into account the tail behaviour of the time series and the choices of A and B, since the extremal dependence may depend on these sets.

3.2.2 Cross-extremogram

Following the idea of cross-correlation measure, extremal dependence between two or more series for a particular number of lags can be measured using the cross-extremogram.

Fig. 3: Simulated sample paths for max-AR dependent processes

Fig. 4: Sample extremogram of time series \(X_t\) and \(Y_t\)

Definition 3.1

(Cross-extremogram) Let \(F_1\) and \(F_2\) be the marginal distributions of \(X_t\) and \(Y_t\), so that the transformed series \(\hat{X}_t=G_1(X_t)\) and \(\hat{Y}_t=G_2(Y_t)\), where \(G_i(z)=-1/\log (F_i(z))\), have unit Fréchet marginals \(F(x)=\exp \{-1/x\}\), \(x>0\). Then, the cross-extremogram is defined by

$$\begin{aligned} \nu _{A,B}(h):= \lim \limits _{x\rightarrow \infty } P(x^{-1} \hat{Y}_{h}\in B | x^{-1} \hat{X}_{0}\in A),\qquad h>0 \end{aligned}$$

where A and B are sets bounded away from zero.

Davis and Mikosch [14] proposed sample estimators of these extremal dependencies between two or more data sequences. For a sequence of observations \(X_1,\dots ,X_n\), the sample extremogram is computed as

$$\begin{aligned} \widehat{\gamma }_{A,B} (k)=\frac{\sum \nolimits _{t=1}^{n-k} I_{\{ a_m^{-1} X_{t+k} \in B,\text { }a_m^{-1} X_t\in A\}}}{\sum \nolimits _{t=1}^n I_{\{a_m^{-1}X_t\in A\}}}, \end{aligned}$$

where \(a_m\) is the \((1-1/m)\)-quantile of the stationary distribution of \(X_t\). For the estimator to be consistent, it is required that \(m=m_n\rightarrow \infty \) with \(m/n\rightarrow 0\) as \(n\rightarrow \infty \). In practice, a large value of m is used.

Similarly to the sample extremogram, the authors proposed the sample cross-extremogram for a bivariate time series \((X_t,Y_t)\) as

$$\begin{aligned} \widehat{\nu }_{A,B}(k) = \frac{\sum \nolimits _{t=1}^{n-k} I_{\{a_{m,Y}^{-1} Y_{t+k}\in B, a_{m,X}^{-1} X_t\in A\}}}{\sum \nolimits _{t=1}^{n} I_{\{a_{m,X}^{-1} X_t \in A \}}}, \end{aligned}$$

where \(a_{m,X}\) and \(a_{m,Y}\) are, in practice, replaced by large empirical quantiles of the samples of \(X_t\) and \(Y_t\), respectively. A and B can be any sets bounded away from zero, but in practice they are of the form \(A = B = (1, \infty )\).
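For \(A = B = (1,\infty )\), the sample cross-extremogram reduces to a ratio of counts of joint exceedances. A sketch, assuming (as above) that the thresholds are taken as high empirical quantiles:

```python
import numpy as np

def cross_extremogram(x, y, k, q=0.95):
    """Sample cross-extremogram for A = B = (1, inf): fraction of
    exceedances of x's threshold that are followed, k steps later,
    by an exceedance of y's threshold."""
    ax, ay = np.quantile(x, q), np.quantile(y, q)  # empirical thresholds
    n = len(x)
    joint = np.sum((x[: n - k] > ax) & (y[k:] > ay))
    return joint / np.sum(x > ax)
```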

Example 3.1

Let \((X_t,Y_t)\) be a bivariate time series modelled by the following dependent max-ARMA processes of order (1, 0):

$$\begin{aligned} {\left\{ \begin{array}{ll} X_t =\max (\alpha _{x,1} X_{t-1}, \alpha _{y,1} Y_{t-1}, Z_t^x) \\ Y_t = \max (\alpha _{y,2} Y_{t-1}, \alpha _{x,2} X_{t-1}, Z_t^y) \end{array}\right. }, \end{aligned}$$

where \(\alpha _{j,i}\in (0,1)\) for \(j\in \{x,y\}\) and \(i\in \{1,2\}\), and \(Z_t^x\) and \(Z_t^y\) are unit Fréchet random variables. Figure 3 displays a realization of the time series with \(n=1000\), \(\alpha _{x,1}=\alpha _{y,2}=0.95\) and \(\alpha _{x,2}=\alpha _{y,1}=0.6\). Figures 4 and 5 show the extremograms of \(X_t\) and \(Y_t\) and the cross-extremograms of \((X_t,Y_t)\), respectively.

Fig. 5: Sample cross-extremogram for \((X_t,Y_t)\)

Now, considering the same simulated series, let us compute the \(\text {BGCC}_q\) values for several thresholds q. As q increases, the extreme values concentrate around the maximum peak, whereas a small q implies extreme values spread throughout the series. The values obtained for lags \(k=0,\dots ,9\) are the following:

q \ k    0      1      2      3      4      5      6      7      8      9
0.90     0.52   0.36   0.30   0.27   0.25   0.24   0.23   0.23   0.22   0.22
0.95     0.83   0.70   0.63   0.60   0.57   0.56   0.55   0.54   0.53   0.53
0.99     1      1      1      1      1      1      1      1      1      1

When \(q=0.99\), only 10 points out of 1000 are considered extreme values. Thus, if those 10 values are concentrated in the same peak in both series, the GCC coefficient for some k is 1.

Example 3.2

Let us now consider \(X_t\) and \(Y_t\), two independent max-AR processes defined as

$$\begin{aligned} {\left\{ \begin{array}{ll} X_t = \max (\alpha _x X_{t-1},Z_t^x) \\ Y_t = \max (\alpha _y Y_{t-1},Z_t^y) \end{array}\right. }. \end{aligned}$$

The extremogram values of the simulated time series are similar to those of the dependent series of the previous example, but the cross-extremogram values decrease significantly, as shown in Fig. 6.

Fig. 6: Cross-extremogram values for independent \((X_t,Y_t)\)

In this case, the values of \(\text {BGCC}_q\) are close to zero for any \(k<50\) and \(q=0.9, 0.95, 0.99\).

These examples show that both the cross-extremogram and the \(\text {BGCC}_q\) are able to differentiate between independent and extreme-dependent time series.

3.2.3 Extreme-SBD measure

Next, we define a similarity measure that captures extremal dependence: the idea is to introduce the cross-extremogram values in the SBD in place of the cross-correlation.

Definition 3.2

(ESBD measure) Let \(\{x_t\}\) and \(\{y_t\}\) be two stationary time series and \(\hat{\nu }_{x_t,y_t}(k)\) the estimated cross-extremogram for lag k, then

$$\begin{aligned} \text {ESBD}(x_t,y_t):= 1-\max _k \left( \hat{\nu }_{x_t,y_t}(k)\right) , \end{aligned}$$

is called Extreme Shape Based Distance.

Note that, for two identical sequences in which the percentiles defining extremes coincide, the maximum cross-extremogram value is attained at lag \(k=0\), making the ESBD equal to zero. Similarly, for two independent sequences it is possible to select percentiles such that \((a_{m,Y}^{-1} Y_{t+k}\in B)\cap (a_{m,X}^{-1}X_t\in A) = \emptyset \), making the cross-extremogram equal to zero and the ESBD equal to one.
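Reusing the cross_extremogram sketch above, the ESBD is one minus the largest estimated cross-extremogram over a set of lags (the choice of maximum lag here is an illustrative assumption of ours):

```python
def esbd(x, y, max_lag=50, q=0.95):
    """ESBD(x, y) = 1 - max_k of the sample cross-extremogram."""
    nu = [cross_extremogram(x, y, k, q) for k in range(max_lag + 1)]
    return 1.0 - max(nu)
```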

3.2.4 Binary-SBD measure

Recall that the SBD measure considers the maximum value of \(\text {NCC}_c\) over a set of lags, computed from the entire time series. Analogously to the BGCC, it is also possible to apply the SBD to the binary time series counterpart. Thus, we define the Binary-SBD (BSBD) measure as

$$\begin{aligned} \text {BSBD}(x_t,y_t):= \text {SBD}(\hat{I}_{q}(x_t),\hat{I}_q(y_t)), \end{aligned}$$

where \(\hat{I}_{q}\) is defined in (2). SBD was calculated using the implementation of [25].

4 Application of clustering by extremal dependence

4.1 Time series clustering on simulated scenarios

This section aims at applying the proposed measures to the clustering of max-autoregressive processes. As a reference, we have taken the scenarios considered by Alonso et al. [3]. A set of fifteen time series is generated from the max-AR process \(X_{i,t} = \max (\phi _i X_{i,t-1}, \epsilon _{i,t})\), where \(\phi _i=0.9\) for \(i \in \{1,\dots , 4\}\cup \{11,\dots ,15\}\) and \(\phi _i = 0.2\) for \(i = 5,\dots , 10\). Furthermore, cross-dependency is introduced through the innovations \(\epsilon _{i,t}\), which are Fréchet white noise random variables with a dependence structure differing according to each scenario. Following [3], seven scenarios with three or seven clusters are defined by indicating the non-null cross-correlations. It is important to mention that when the simulated time series form a cluster, the dependencies are simulated in two different ways, namely a chain-dependent structure or a fully dependent structure. The former implies that each time series is related to its nearest neighbors but is independent of the rest; that is, a time series A is close to B and B is close to C, yet A is far from C. In the latter, the cluster exhibits a strong dependency structure among all its time series. All the other possible cross-correlations are assumed to be zero (see Fig. 7). The seven scenarios are described below:

Fig. 7: Set of simulated scenarios. Each time series is represented by a node and non-zero cross-correlations are represented by edges between nodes

  • Scenario 1: Four chain dependent; six chain dependent and five independent time series (7 groups).

  • Scenario 2: Four chain dependent; six chain dependent and five chain dependent time series (3 groups).

  • Scenario 3: Four chain dependent; six full dependent and five independent time series (7 groups).

  • Scenario 4: Four chain dependent; six full dependent and five chain dependent time series (3 groups).

  • Scenario 5: Four full dependent; six full dependent and five independent time series (7 groups).

  • Scenario 6: Four full dependent; six full dependent and five chain dependent time series (3 groups).

  • Scenario 7: Four full dependent; six full dependent and five full dependent time series (3 groups).

One of the main reasons for doing simulations is that they allow us to evaluate the unsupervised classification procedure in a context where the true group of each observation is known. That is, they allow us to evaluate whether the grouping algorithms and the measures that we use are capable of grouping the series with dependencies (both full and chain). The set of evaluation measures considered in this study is listed below.

  • Adjusted Rand Index (ARI): All pair-wise comparisons between the original cluster labels and the predicted clusters are considered here (see, e.g. [19]). The ARI is a corrected version of the so-called Rand Index (RI), which is defined as the number of pairs of items on which the two clusterings agree (placed in the same cluster in both, or in different clusters in both), divided by the total number of pairs of items. It ranges from 0 to 1, with a value of 1 indicating a perfect match and a value of 0 indicating no match. The ARI adjusts the RI by subtracting its expected value under the null hypothesis of random clustering,

    $$\begin{aligned} \text {ARI}:= \frac{\text {RI}-E[\text {RI}]}{\max \text {RI}- E[\text {RI}]}. \end{aligned}$$

    More formally, let \(C=\{c_1,\dots ,c_s\}\) and \(C' = \{c'_1,\dots ,c'_r\}\) be two clusterings having s and r clusters, respectively, and form the \(r\times s\) contingency table whose (i, j)th element, \(n_{ij}\), is the cardinality of \(c_i \cap c'_j\). The ARI is computed as

    $$\begin{aligned} \text {ARI}(C, C') = \frac{\sum \nolimits _{i=1}^s\sum \nolimits _{j=1}^r \left( {\begin{array}{c}n_{ij}\\ 2\end{array}}\right) - \left[ \sum \nolimits _{i=1}^s \left( {\begin{array}{c}a_{i}\\ 2\end{array}}\right) \sum \nolimits _{j=1}^r \left( {\begin{array}{c}b_{j}\\ 2\end{array}}\right) \right] \Big / \left( {\begin{array}{c}n\\ 2\end{array}}\right) }{\frac{1}{2} \left[ \sum \nolimits _{i=1}^s \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) +\sum \nolimits _{j=1}^r\left( {\begin{array}{c}b_j\\ 2\end{array}}\right) \right] - \left[ \sum \nolimits _{i=1}^s \left( {\begin{array}{c}a_i\\ 2\end{array}}\right) \sum \nolimits _{j=1}^r \left( {\begin{array}{c}b_j\\ 2\end{array}}\right) \right] \Big / \left( {\begin{array}{c}n\\ 2\end{array}}\right) }, \end{aligned}$$

    where n is the number of objects in the data set, \(a_i = \sum \nolimits _{j=1}^r n_{i,j}\) and \(b_j = \sum \nolimits _{i=1}^s n_{i,j}\).

  • Separation Index (SI): The SI quantifies how well the predicted groups are separated from each other. It takes values in the interval [0, 1], where 1 implies total separation and 0 means no separation. It relies on the pairwise distances (those considered for clustering) from every point to the closest point not belonging to the same cluster, and on the distances between cluster centers. More than one variant of the Separation Index can be found in the literature (see, e.g. [18]), although in general it formalises separation in a way that is less sensitive to ambiguous points. The most common method uses the mean inter-cluster distance and the mean intra-cluster distance: the former is the average distance between the centroid of each cluster and the centroids of the other clusters, and the latter is the average distance between the data points within a cluster and that cluster's centroid. Notice that the SI is independent of the number of clusters and of the true clusters, so it is often used in conjunction with other measures of clustering quality.

  • Normalized Gamma Index (\(\overline{\Gamma }\)): This index measures the correlation between the distance matrix, \(\pmb {D}\), and the predicted clusters. To this end, a matrix \(\pmb {C}\) is computed, whose (i, j)th entry is 1 if the i-th and the j-th elements are classified in the same cluster, and 0 otherwise. The index is defined as

    $$\begin{aligned} \overline{\Gamma }(\pmb {C},\pmb {D}):= \frac{Corr(\pmb {c},\pmb {d})+1}{2}, \end{aligned}$$

    where \(\pmb {c} = vec(\pmb {C})\) and \(\pmb {d} = vec(\pmb {D})\). Note that the closer the value is to one, the better the selected clustering (see, e.g. Halkidi et al. [15]). A sketch of how these indexes can be computed is given after this list.
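A sketch of both indexes (for \(\overline{\Gamma }\) we code \(\pmb {c}\) with the different-cluster indicator, the complement of the \(\pmb {C}\) above, so that values near one correspond to good partitions, matching the stated interpretation):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score  # ARI(true, predicted)

def normalized_gamma(D, labels):
    """(Corr(c, d) + 1) / 2 over the upper triangle, with c = 1 when
    a pair falls in different clusters (so ~1 means good agreement)."""
    labels = np.asarray(labels)
    iu = np.triu_indices_from(D, k=1)
    d = D[iu]
    c = (labels[:, None] != labels[None, :])[iu].astype(float)
    return (np.corrcoef(c, d)[0, 1] + 1.0) / 2.0
```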

In the simulation study, both hierarchical clustering using single linkage (Hier) and partitional k-means clustering (Part) are performed considering different similarity measures. Single linkage has been selected in hierarchical clustering because it captures chain dependency, as shown in Alonso and Peña [2]. Firstly, for each scenario, time series of length \(T=1000\) are simulated using max-AR processes. Afterwards, clustering is performed using each of the following similarity measures: BGCC, BSBD and ESBD, for \(q=0.9\), 0.95 and 0.99. This procedure is repeated five times, so as to capture the variability and uncertainty. The obtained results can be found in Tables 1 (Adjusted Rand Index), 2 (Separation Index) and 3 (\(\overline{\Gamma }\) index) for \(q = 0.9\). The tables for \(q = 0.95\) and 0.99 are shown in the Appendix.

Table 1 Clustering performance evaluation with Adjusted Rand Index at scenarios S.1–S.7, for \(T = 1000\) and \(q=0.90\)
Table 2 Clustering performance evaluation with Separation Index at scenarios S.1–S.7, for \(T = 1000\) and \(q=0.90\)

It is important to notice that several parameters might have a strong influence on the procedure, such as the max-AR coefficients, the number of observed values in each time series and the value of q. The results in Tables 1, 2 and 3 lead us to the following conclusions:

  • In almost all cases the hierarchical clustering procedure performs better than the partitional one for the three indexes, or at least is not worse. This is expected, since the hierarchical algorithm classifies based only on the dissimilarity matrix, while the partitional one relies on centroids, which might not be representative in this case. In the sequel, we will focus on hierarchical clustering.

  • As far as the scenarios are concerned, the 7th provides the best results, since it exhibits the highest correlations.

  • In the sense of the \(\overline{\Gamma }\) index, when a chain-dependent structure is added the results clearly worsen compared with totally independent structures (see Fig. 8): the results for the 4th and 6th scenarios are worse than those for the 3rd and 5th scenarios, respectively.

  • Regarding the comparison of the similarity measures, as expected, the results obtained using BSBD and ESBD are very similar, although in general BSBD might give slightly better results (see Fig. 8). On the other hand, BGCC clearly separates clusters much better than BSBD and ESBD, but might take lower \(\overline{\Gamma }\) index values in the first scenarios. In general, we would say BGCC is the similarity measure that works best in this simulation exercise, since its ARI and SI values for any scenario (except the third one) are always one or very close to one.

Table 3 Clustering performance evaluation with \(\overline{\Gamma }\) at scenarios S.1–S.7, for \(T = 1000\) and \(q=0.90\)
Fig. 8: ARI, Separation Index and \(\overline{\Gamma }\) for hierarchical clustering by similarity measure and scenario when \(q=0.9\)

The same analysis was performed using \(q = 0.95\) and \(q=0.99\) and similar conclusions are obtained (see Appendix).

Fig. 9: SI values by similarity measure, k and q

Fig. 10: \(\overline{\Gamma }\) values by similarity measure, k and q

Fig. 11: Mean silhouette scores by number of clusters for BSBD, ESBD and BGCC when \(q=0.9\)

Fig. 12: European map clustering results for \(q=0.9\) and \(k=6\), using ESBD

4.2 Application to real data

In this subsection, the proposed procedure is applied to a real-world data set of daily maximum temperatures. Given data from a large number of stations, the objective is to classify the stations according to the dependencies among their extreme temperature values.

4.2.1 Dataset

The dataset was obtained from the European Climate Assessment & Dataset.Footnote 1 This site offers a wide list of European stations, as well as a list of the sources used to produce the blended series for each station. The data are divided into subsets such as daily maximum temperature, daily cloud cover or daily humidity. As mentioned before, in this study we work with daily maximum temperatures. The main objective is to perform hierarchical clustering using the proposed measures BGCC, BSBD and ESBD.

This dataset contains records from 5804 stations all around Europe, and the number of years of data varies for each station. Therefore, for the analysis, we selected the most recent 3000 days of data for each station, up to the data download date of April 22nd, 2022. However, considering all the stations simultaneously may be computationally intensive in terms of memory usage and the number of calculations required. Consequently, 500 stations meeting the following conditions were randomly selected:

  • There are more than 3000 days of observations.

  • The missing data percentage is less than 1%.

4.2.2 Hierarchical clustering at daily maximum temperatures

In the previous section, we concluded that partitional clustering performs worse than hierarchical clustering when aiming to capture extremal dependency; therefore, only the hierarchical one is applied. Once the data are selected, the parameters that affect the result are the number of clusters, the similarity measure and the threshold separating extreme values from the rest of the data. Consequently, it is interesting to consider several settings and then compare all of them to see which case works better. Here, the considered parameters are \(k\in \{3,\dots ,12\}\) (number of clusters) and \(q \in \{0.9,0.95,0.99\}\) (thresholds), so with the three similarity measures the clustering procedure is performed \(10\times 3\times 3 = 90\) times. Moreover, hierarchical clustering is performed using single, average and complete linkage.

The output contains the separation index, the gamma correlation coefficient and the obtained clusters in each case, together with the data used to obtain those results. All the obtained values can be found in Table 10 in the Appendix. Figures 9 and 10 display the separation index and the \(\overline{\Gamma }\) index for the 90 combinations. The following conclusions are obtained:

  • As shown in Fig. 9, BGCC is clearly the best option in the sense of SI compared with the other two similarity measures. Furthermore, BSBD separates clusters better than ESBD.

  • Regarding \(\overline{\Gamma }\), BGCC exhibits worse values than the other two measures for \(q=0.9\), which show quite similar values to each other. Finally, ESBD provides slightly better results than BSBD.

Notice that as the number of clusters increases, the SI decreases while \(\overline{\Gamma }\) increases. At this point, meteorological knowledge might be required to select the best clustering result. Undoubtedly, covariates like geographic location, climate or geological environment can affect the final classification.

Since we have obtained the results for several candidate values of k and the dissimilarity matrices for BGCC, BSBD and ESBD, the silhouette index of any object (station) can be computed. In order to select an optimal k value, the mean silhouette index for each k is calculated; see Fig. 11 for details.

For both BSBD and ESBD, \(k=6\) maximizes the mean silhouette value. In the case of BGCC the value \(k=7\) is selected.

Moreover, the entries of the cophenetic correlation matrix lead us to conclude that the obtained results for the three measures are highly similar.

After selecting an optimal k value for each case, we plot the classification on the European map, where each point represents a station and its color the predicted cluster. Figure 12 shows the results for ESBD, \(k=6\) and \(q=0.9\). Clearly, the extremal values go together with the geographical location of each station, and this is correctly captured by the implemented procedure. Figures 15 and 16 in the Appendix display the results using the other similarity measures. Since we have used average linkage, the centroids of each cluster (which roughly match geographical centroids) are separated as much as possible. Notice that, for example, when using the BSBD or ESBD measure, some small changes can be appreciated, like Scandinavia and the UK being separated from Central Europe. In general, the difference between the three measures is small, changing only the classification of a few points (see Figs. 15 and 16 in the Appendix).

5 Conclusions and extensions

In this paper, three new similarity measures that capture extremal dependence are proposed and discussed in detail. The results obtained in different scenarios show their effectiveness in capturing extremal dependency between time series.

One of the main conclusions is that several parameters may influence the results of the clustering analysis, namely the threshold used to consider a value as extreme, the number of clusters and the sample size. Moreover, in Sect. 4.1 it is shown that capturing chain dependencies is the most challenging scenario for the majority of parameter combinations. Furthermore, in Sect. 4.2 we conclude that the results obtained using the three similarity measures are quite similar. On the other hand, hierarchical clustering performs better than partitional clustering for extremal-dependence-based clustering. Further studies would assess the performance of other clustering methods, like fuzzy clustering as in Durante et al. [8] or spectral clustering (see, e.g. Von Luxburg [29]), using the proposed similarity measures.

Fig. 13: ARI, SI and \(\overline{\Gamma }\) indexes for \(q=0.95\)

Another possible extension involves the study of measures that are less dependent on the selection of the percentile, such as \(\max \nolimits _{q \ge q_0} \text {ESBD}_q(x_t, y_t)\), or a measure that uses several values of q, such as \(\sum \nolimits _{q \in Q} \text {ESBD}_q(x_t, y_t)\), where Q is a finite set of percentiles. It may also be interesting to study measures based on the series \(x_t I_q(x_t)\) instead of the binary series \(I_q(x_t)\). Note that the Pearson, Spearman, and Kendall correlation coefficients return the same results when we use binary series, but may capture different features when we use the series \(x_t I_q(x_t)\).

Table 4 Clustering performance evaluation with Adjusted Rand Index at scenarios S.1–S.7, for \(T = 1000\) and \(q=0.95\)
Table 5 Clustering performance evaluation with Separation Index at scenarios S.1–S.7, for \(T = 1000\) and \(q=0.95\)
Table 6 Clustering performance evaluation with \(\overline{\Gamma }\) at scenarios S.1–S.7, for \(T = 1000\) and \(q=0.95\)
Fig. 14: ARI, SI and \(\overline{\Gamma }\) indexes for \(q=0.99\)