1 Introduction

Nowadays, in many fields of study, much of the data collected and analyzed can be considered as functions \(x_i(t)\), \(i=1,\ldots ,n\), \(t \in {\mathcal {I}}\), where \({\mathcal {I}}\) is an interval in \({\mathbb {R}}\): growth curves, weather variables, the evolution of markets, and so on. This has been triggered by recent technological developments that enable a large volume of data to be collected and analyzed in a short period of time. Functional Data Analysis (FDA) arises when this information is studied through the analysis of curves or functions. A complete overview of FDA can be found in the monographs of Ramsay and Silverman (2005) and Ferraty and Vieu (2006), while some interesting reviews of functional data can be found in Horváth and Kokoszka (2012), Hsing and Eubank (2015), and Wang et al. (2016).

The main drawback when working with functional and multivariate data, unlike in one dimension, is the lack of a total order. Thus, a traditional challenge in FDA and in multivariate analysis is to provide an ordering within a sample of curves that enables the definition of order statistics such as ranks and L-statistics. In this sense, Tukey (1975) introduced the concept of statistical depth, which provides a center-outward ordering for multivariate data. Some other definitions can be found in Oja (1983), Liu (1990), and Zuo (2003). This concept was extended to functional data, leading to different definitions of functional depth. See, for example, (Vardi and Zhang 2000; Fraiman and Muniz 2001; Cuevas et al. 2006; Cuesta-Albertos and Nieto-Reyes 2008; López-Pintado and Romo 2009, 2011), and (Sguera et al. 2014).

More recently, Franco-Pereira et al. (2011) proposed the epigraph and the hypograph indexes in order to measure the “extremality” of a curve with respect to a set of curves, and to provide an alternative ordering to the one given by statistical depth.

The combination of these two indexes has already been exploited: Arribas-Gil and Romo (2014) proposed the outliergram for outlier detection, Martín-Barragán et al. (2018) defined a functional boxplot, and Franco-Pereira and Lillo (2020) contributed a homogeneity test for functional data. These works show how the epigraph index, the hypograph index and the band depth provide useful information about both the shape and the magnitude of the curves. The main idea of this work is to use the epigraph and the hypograph indexes to reduce an infinite-dimensional problem to a multivariate one in which multivariate clustering techniques can be applied. The joint use of these indexes on the original curves and on their derivatives largely characterizes the curves in the sample, providing an ordering of the curves from top to bottom or vice versa.

When studying a high volume of data, there is an increased need to classify the data into groups without any extra information, since this classification makes them easier to manipulate. Clustering is one of the most widely used techniques within unsupervised learning, and has been extensively studied for multivariate data. Some of the most frequently used procedures are distance-based techniques such as hierarchical clustering (see Sibson 1973; Defays 1977; Sokal and Michener 1958; Lance and Williams 1967; Ward 1963 for different hierarchical clustering procedures) and k-means clustering (introduced by MacQueen 1967). Since k-means is probably the most frequently used clustering method in the literature, different variations have been introduced; see Ben-Hur et al. (2001) and Dhillon et al. (2004).

Clustering functional data is a challenging problem since it involves working in an infinite-dimensional space. Different approaches have been considered in the literature. In Jacques and Preda (2014), functional clustering techniques are classified into four categories: (1) raw-data methods, which treat the functional data set as a multivariate one and apply clustering techniques for multivariate data (Boullé 2012); (2) filtering methods, which first represent the functional data in a basis and then apply clustering techniques to the resulting coefficients (Abraham et al. 2003; Rossi et al. 2004; Peng et al. 2008; Kayano et al. 2010); (3) adaptive methods, where dimensionality reduction and clustering are performed at the same time (James and Sugar 2003; Jacques and Preda 2013; Giacofci et al. 2013; Traore et al. 2019); and (4) distance-based methods, which apply a clustering technique based on a distance specific to functional data (Tarpey and Kinateder 2003; Ieva et al. 2013; Martino et al. 2019). Recent works that cannot easily be classified into any of these categories are Romano et al. (2017), which introduced a method for clustering spatially dependent functional data; Zambom et al. (2019), which proposed a new method applying k-means, where each element is assigned to a cluster based on a combination of a hypothesis test of parallelism and a test for equality of means; and Schmutz et al. (2020), which presented a new strategy for clustering functional data based on applying model-based techniques after a principal component analysis. Based on the previous classification, the methodology proposed in this paper could be considered both filtering and adaptive, since dimensionality reduction is performed by applying the epigraph and the hypograph indexes after representing the data in a basis.

The paper is organized as follows. In Sect. 2, the epigraph and the hypograph indexes are introduced, as well as their relation with the band depth. The methodology for clustering functional data sets based on these indexes is explained in Sect. 3. In Sect. 4, this methodology is examined through an extensive simulation study, and the results are compared to those obtained with some existing procedures for clustering functional data. In Sect. 5, the applicability of our procedure is illustrated through some real data sets. A discussion and some concluding remarks are finally presented in Sect. 6.

2 Preliminaries: The epigraph, the hypograph and the band depth

Let \(C({\mathcal {I}})\) be the space of continuous functions defined on a compact interval \({\mathcal {I}}\). Consider a stochastic process X with sample paths in \(C({\mathcal {I}})\) and distribution \(F_{X}\). The graph of a function x in \(C({\mathcal {I}})\) is \(G(x) = \{(t,x(t)): t \in {\mathcal {I}} \}.\) Then, the epigraph (epi) and the hypograph (hyp) of x are defined as follows:

$$\begin{aligned} epi(x)&=\{(t,y) \in {\mathcal {I}} \times {\mathbb {R}}: y \ge x(t)\},\\ hyp(x)&=\{(t,y) \in {\mathcal {I}} \times {\mathbb {R}}: y \le x(t)\}. \end{aligned}$$

Franco-Pereira et al. (2011) defined two indexes based on these two concepts. Given a sample of curves \(\{x_1(t),\ldots ,x_n (t)\}\), the epigraph index of a curve x (\(\text {EI}_n(x)\)) is defined as one minus the proportion of curves in the sample that are totally included in its epigraph. Analogously, the hypograph index of x (\(\text {HI}_n(x)\)) is the proportion of curves totally included in the hypograph of x.

$$\begin{aligned} \text {EI}_n(x)&=1-\frac{\sum _{i=1}^n I\{G(x_i)\subseteq epi(x)\}}{n}= 1-\frac{\sum _{i=1}^n I\{E_{i,x}\}}{n},\\ \text {HI}_n(x)&=\frac{\sum _{i=1}^n I\{G(x_i)\subseteq hyp(x)\}}{n}= \frac{\sum _{i=1}^n I\{H_{i,x}\}}{n}, \end{aligned}$$

where \(E_{i,x}=\{x_i(t)\ge x(t) \text { for all } t\in {\mathcal {I}}\}\), \(H_{i,x}= \{x_i(t)\le x(t) \text { for all } t \in {\mathcal {I}}\}\), and \(I\{A\}\) is 1 if A is true and 0 otherwise.

Their population versions are given by:

$$\begin{aligned} \text {EI}(x,F_X)\equiv \text {EI}(x)&= 1-P(G(X) \subseteq epi(x)) =1-P(X(t)\ge x(t) \text { for all } t \in {\mathcal {I}}),\\ \text {HI}(x,F_X)\equiv \text {HI}(x)&= P(G(X) \subseteq hyp(x)) =P(X(t)\le x(t) \text { for all } t\in {\mathcal {I}}). \end{aligned}$$

Franco-Pereira et al. (2011) argued that, when the curves in the sample are extremely irregular, with many intersections, the modified versions of these indexes are recommended. If \({\mathcal {I}}\) is regarded as a time interval, the modified epigraph index of x (\(\text {MEI}_n(x)\)) can be defined as one minus the proportion of time the curves of the sample are in the epigraph of x, i.e., the proportion of time they are above x. Analogously, the modified hypograph index of x (\(\text {MHI}_n(x)\)) can be considered as the proportion of time the curves in the sample are below x.

$$\begin{aligned} \text {MEI}_n(x) =1- \sum _{i=1}^n \frac{ \lambda (\{t \in {\mathcal {I}}: x_i(t) \ge x(t)\})}{n \lambda ({\mathcal {I}})}, \end{aligned}$$
(1)
$$\begin{aligned} \text {MHI}_n(x) = \sum _{i=1}^n \frac{ \lambda (\{t \in {\mathcal {I}}: x_i(t) \le x(t)\})}{n \lambda ({\mathcal {I}})}, \end{aligned}$$
(2)

where \(\lambda \) stands for the Lebesgue measure on \({\mathbb {R}}\).
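For illustration, the sample versions of these indexes can be computed directly on discretized curves, approximating \(\lambda (\cdot )/\lambda ({\mathcal {I}})\) by the proportion of grid points. The following is a minimal sketch in R; the function names are ours, not those of any package:

```r
## Sample indexes for discretized curves; X is an n x T matrix whose rows are
## the curves evaluated on a common grid of T points.
EI <- function(X) sapply(seq_len(nrow(X)), function(k)
  1 - mean(apply(X, 1, function(xi) all(xi >= X[k, ]))))

HI <- function(X) sapply(seq_len(nrow(X)), function(k)
  mean(apply(X, 1, function(xi) all(xi <= X[k, ]))))

## MEI: one minus the average proportion of grid points at which each curve
## of the sample lies above x.
MEI <- function(X) sapply(seq_len(nrow(X)), function(k)
  1 - mean(sweep(X, 2, X[k, ], "-") >= 0))

MHI <- function(X) sapply(seq_len(nrow(X)), function(k)
  mean(sweep(X, 2, X[k, ], "-") <= 0))
```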

Although these definitions are applicable to an arbitrary curve, from now on the curve x will be considered as a curve of the sample, since the methodology proposed here is based on the computation of these indexes on the sample curves. Note that, since the graph of any curve x is contained in both its epigraph and its hypograph, the following holds when \(x_i=x\):

$$\begin{aligned} \lambda (\{t \in {\mathcal {I}}: x_i(t) \ge x(t)\}) = \lambda ({\mathcal {I}}) = \lambda (\{t \in {\mathcal {I}}: x_i(t) \le x(t)\}). \end{aligned}$$

Applying this condition to (1) and (2), we obtain

$$\begin{aligned} \text {MEI}_n(x) =1- \left( \sum _{\begin{array}{c} i=1 \\ x_i \ne x \end{array}}^n \frac{ \lambda (\{t \in {\mathcal {I}}: x_i(t) \ge x(t)\})}{n \lambda ({\mathcal {I}})} + \frac{1}{n}\right) , \end{aligned}$$
(3)

and

$$\begin{aligned} \text {MHI}_n(x) =\sum _{\begin{array}{c} i=1 \\ x_i \ne x \end{array}}^n \frac{ \lambda (\{t \in {\mathcal {I}}: x_i(t) \le x(t)\})}{n \lambda ({\mathcal {I}})} + \frac{1}{n}. \end{aligned}$$
(4)

Moreover, if \(x_i(t)\ne x(t)\) for almost every \(t\in {\mathcal {I}}\) (i.e., the two curves coincide at most on a set of Lebesgue measure zero), then

$$\begin{aligned} \lambda (\{t \in {\mathcal {I}}: x_i(t) \le x(t)\}) + \lambda (\{t \in {\mathcal {I}}: x_i(t) \ge x(t)\}) = \lambda ({\mathcal {I}}). \end{aligned}$$

Now, substituting this into (4), we can write:

$$\begin{aligned} \text {MHI}_n(x)&= 1-\frac{1}{n}- \sum _{\begin{array}{c} i=1 \\ x_i \ne x \end{array}}^n \frac{ \lambda (\{t \in {\mathcal {I}}: x_i(t) \ge x(t)\})}{n \lambda ({\mathcal {I}})} + \frac{1}{n} \\&{\mathop {=}\limits ^{(3)}} \text {MEI}_n(x)+\frac{1}{n}. \end{aligned}$$

Finally, the following relation between the two modified versions of the epigraph and the hypograph indexes is obtained, concluding that they are linearly dependent:

$$\begin{aligned} \text {MHI}_n(x)-\text {MEI}_n(x)=\frac{1}{n}. \end{aligned}$$

Note that this equality does not hold in Franco-Pereira and Lillo (2020) because the way in which the data is considered in the homogeneity test differs from the perspective given in this paper.
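This linear dependence is easy to verify numerically with the sketch above: for random trajectories, which coincide only on a null set, MHI and MEI differ by exactly 1/n at every curve of the sample.

```r
set.seed(1)
X <- matrix(rnorm(50 * 100), nrow = 50)  # 50 rough random "curves" on a grid
range(MHI(X) - MEI(X))                   # both endpoints equal 1/50
```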

One may wonder why the epigraph and the hypograph indexes are considered for summarizing the information of a functional sample instead of the band depth (López-Pintado and Romo 2009). In the following, the band depth will be obtained as a combination of the epigraph and the hypograph indexes. Therefore, these indexes are able to summarize the information provided by this depth.

First of all, some definitions are recalled. Consider the band in \({\mathbb {R}}^2\) delimited by two curves \(x_i\) and \(x_j\) as

$$\begin{aligned} b(x_i,x_j)=\{(t,y) \in {\mathcal {I}} \times {\mathbb {R}}: \min \{x_i(t),x_j(t)\} \le y \le \max \{x_i(t),x_j(t)\}\}. \end{aligned}$$

Then, the band depth of x (López-Pintado and Romo 2009), \(\text {BD}_n(x)\), is the proportion of bands \(b(x_i, x_j)\), determined by two curves \(x_i\), \(x_j\) in the sample, that contain the whole graph of x:

$$\begin{aligned} \text {BD}_n(x) =\frac{\sum _{i=1}^{n-1}\sum _{j=i+1}^{n} I\{G(x) \subset b(x_i,x_j)\}}{\left( {\begin{array}{c}n\\ 2\end{array}}\right) }= \frac{\sum _{i=1}^{n-1}\sum _{j=i+1}^{n} I\{B_{i,j,x}\}}{\left( {\begin{array}{c}n\\ 2\end{array}}\right) }, \end{aligned}$$

where

$$\begin{aligned} B_{i,j,x}= \{\min \{x_i(t),x_j(t)\} \le x(t) \le \max \{x_i(t),x_j(t)\} \text { for all } t \in {\mathcal {I}}\}. \end{aligned}$$

The Lebesgue measure can also be used instead of the indicator function, yielding a more flexible definition of the band depth. The modified band depth of x, \(\text {MBD}_n(x)\), is given by:

$$\begin{aligned} \text {MBD}_n(x) = \frac{ \sum _{i=1}^{n-1}\sum _{j=i+1}^{n} \frac{\lambda ( MB _{i,j,x})}{\lambda ({\mathcal {I}})}}{\left( {\begin{array}{c}n\\ 2\end{array}}\right) }, \end{aligned}$$

where

$$\begin{aligned} MB _{i,j,x}= \{t \in {\mathcal {I}}: \min \{x_i(t),x_j(t)\} \le x(t) \le \max \{x_i(t),x_j(t)\}\}. \end{aligned}$$

Now, the relation between band depth and the indexes is given by

$$\begin{aligned} \text {BD}_n(x)= 2\text {HI}_n(x)+\frac{2}{n-1} \text {EI}_n(x)-\frac{2n}{n-1} \text {EI}_n(x) \text {HI}_n(x), \end{aligned}$$

and the relation between the modified band depth and the modified epigraph index is given by

$$\begin{aligned} \text {MBD}_n(x)&= \frac{1}{n}+2\text {MEI}_n(x)-\frac{2n}{n-1}{\text {MEI}_n(x)}^2 \\&\quad -\frac{2}{n(n-1)}\sum _{i=1}^{n-1}\sum _{j=1}^n\left( \frac{\lambda ( ME _{i,x}\cap ME _{j,x}) }{\lambda ({\mathcal {I}})}- \frac{\lambda ( ME _{i,x})\,\lambda ( ME _{j,x} )}{\lambda ({\mathcal {I}})^2}\right) , \end{aligned}$$

where \( ME _{i,x} =\{t \in {\mathcal {I}}: x_i(t)\ge x(t)\}.\) The proof of the first equality is below. The proof of the second one can be found in Arribas-Gil and Romo (2014), but note that they omit “1 -” in the definition of the MEI. Also note that, in order to obtain these equations, x is considered to belong to the sample of curves.

As stated before, the band depth, the epigraph and the hypograph indexes can be written as:

$$\begin{aligned} \text {BD}_n(x)&=\frac{\sum _{i=1}^{n-1}\sum _{j=i+1}^{n} I\{B_{i,j,x}\}}{\left( {\begin{array}{c}n\\ 2\end{array}}\right) },\\ \text {EI}_n(x)&=1-\frac{1}{n}\sum _{i=1}^n I\{E_{i,x}\}, \quad \text {and} \\ \text {HI}_n(x)&=\frac{1}{n}\sum _{i=1}^n I\{H_{i,x}\}. \end{aligned}$$

Note that \(\sum _{i=1}^{n-1}\sum _{j=i+1}^{n} I\{B_{i,j,x}\}\) counts the bands containing the graph of x: the bands formed by x itself and any other curve of the sample, plus the bands determined by two curves different from x that contain x. Likewise, \(\sum _{i=1}^n I\{E_{i,x}\}\) is the number of curves that lie above x plus one (x itself), and \(\sum _{i=1}^n I\{H_{i,x}\}\) is the number of curves that lie below x plus one.

It holds that

$$\begin{aligned} I\{B_{i,j,x}\}=I\{E_{i,x}\}I\{H_{j,x}\}+I\{H_{i,x}\}I\{E_{j,x}\}, \end{aligned}$$

if \(x\ne x_i\) and \(x\ne x_j\), and \(I\{B_{i,j,x}\}=1\) otherwise.

Thus,

$$\begin{aligned} \sum _{i=1}^{n-1}\sum _{j=i+1}^{n} I\{B_{i,j,x}\}&=\sum _{i=1}^n I\{E_{i,x}\}\sum _{i=1}^n I\{H_{i,x}\}\\&\quad -\sum _{i=1}^n I\{E_{i,x}\}-\sum _{i=1}^n I\{H_{i,x}\}+n. \end{aligned}$$

Now, since \(\sum _{i=1}^n I\{E_{i,x}\}=n(1-\text {EI}_n(x))\), and \(\sum _{i=1}^n I\{H_{i,x}\} =n\text {HI}_n(x)\), we have

$$\begin{aligned} \text {BD}_n(x) =2\text {HI}_n(x)+\frac{2}{n-1} \text {EI}_n(x) -\frac{2n}{n-1} \text {EI}_n(x) \text {HI}_n(x). \end{aligned}$$

3 Clustering functional data through the epigraph and the hypograph indexes

The proposed methodology for clustering functional data is a four-step method, as illustrated in Fig. 1. In what follows, we will refer to this method as EHyClus.

Fig. 1 Scheme of the EHyClus method

Step 1 (S1) consists of smoothing the data. This is recommended because observed functional data are usually contaminated with noise, and smoothing is also needed to compute derivatives; for these reasons, it is common to smooth the data when working with curves. Cubic B-spline bases have been used here, although any other functional basis could be applied. Since the first and the second derivatives of the data are considered, a cubic B-spline basis is the most natural option. In order to choose the number of basis functions, a sensitivity study was carried out. The corresponding results are shown in the Supplementary Material, Section 1, and depend on the data sets considered, as happens in almost all studies in the literature. Here, all the data sets described in Sect. 4 have been used. The results show that moderate changes in the number of basis functions do not play a crucial role, although the best results are obtained with between 30 and 40 basis functions. After the data set is transformed, the second step (S2) is to apply the epigraph and the hypograph indexes (and their modified versions) to the basis-transformed data, as well as to their derivatives, obtaining a multivariate data set. As explained in Sect. 2, the modified epigraph and hypograph indexes are linearly dependent. Because of that, MHI is discarded, since it provides no extra information.
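As an illustration of S1 and S2, the following sketch smooths a matrix of discretized curves with a cubic B-spline basis using the 'fda' package and evaluates the smoothed curves and their first two derivatives, to which the index functions sketched in Sect. 2 can then be applied. The helper name, the toy data and the number of basis functions are our own choices, not the authors' code:

```r
library(fda)

## Y: T x n matrix (columns are observed curves), argvals: grid of T points.
smooth_step <- function(Y, argvals, nbasis = 35) {
  basis <- create.bspline.basis(rangeval = range(argvals),
                                nbasis = nbasis, norder = 4)  # cubic splines
  fdobj <- smooth.basis(argvals, Y, basis)$fd
  list(curves = t(eval.fd(argvals, fdobj)),          # n x T smoothed curves
       d1 = t(eval.fd(argvals, fdobj, Lfdobj = 1)),  # first derivatives
       d2 = t(eval.fd(argvals, fdobj, Lfdobj = 2)))  # second derivatives
}

## Toy usage: 20 noisy curves observed at 30 grid points
argvals <- seq(0, 1, length.out = 30)
Y <- replicate(20, sin(2 * pi * argvals) + rnorm(30, sd = 0.1))
sm <- smooth_step(Y, argvals, nbasis = 15)
Z <- cbind(EI = EI(sm$curves), HI = HI(sm$curves), MEI = MEI(sm$curves))
```

Applying the same indexes to `sm$d1` and `sm$d2` completes the multivariate data set with up to nine columns.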

From now on, the term curves will refer to the smoothed curves, and data will refer to the complete data set containing the curves and their first and second derivatives. Then, different subsets of the data are used to apply multivariate clustering techniques (S3). Finally, the fourth step (S4) consists of obtaining a final clustering partition into a previously fixed number of groups. In general, as explained in Rendón et al. (2011), clustering validity approaches can be divided into two categories: external and internal criteria. The first type of method requires the ground truth to obtain a final result, while the second uses intrinsic information of the data to achieve a solution. For evaluating the goodness of the classification, three different external validation strategies are applied in this work: Purity, F-measure and Rand Index (RI), which are fully explained in Manning et al. (2009), and Rendón et al. (2011).

Purity is the proportion of elements that are classified correctly. The F-measure is the harmonic mean of the precision and the recall values for each cluster. The precision of a cluster coincides with its purity coefficient, and the recall of a cluster is the proportion of observations of a given class that are correctly assigned to it. The Rand Index can be viewed as the percentage of correct decisions made by the algorithm over all pairs of observations. All these indexes take values in \(\left[ 0, 1\right] \), and the higher the value, the better the classification. The Adjusted Rand Index (ARI) could be considered instead of the RI. The ARI is a corrected version of the RI that accounts for agreement obtained by chance. In this work we have considered RI instead of ARI because the ARI is not always a number between 0 and 1 and thus has a different scale from Purity and F-measure, the other validity measures considered here.
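For reference, these external measures admit compact implementations; the following sketch (helper names are ours) computes Purity and the Rand Index from the true labels and a cluster assignment:

```r
purity <- function(truth, cluster) {
  tab <- table(cluster, truth)             # contingency table
  sum(apply(tab, 1, max)) / length(truth)  # majority class within each cluster
}

rand_index <- function(truth, cluster) {
  ## proportion of pairs of observations on which the two partitions agree
  same_t <- outer(truth, truth, "==")
  same_c <- outer(cluster, cluster, "==")
  mean(same_t[upper.tri(same_t)] == same_c[upper.tri(same_c)])
}
```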

In steps S1 to S3, the following procedure is carried out to obtain the clustering partitions to which the external validation criteria explained above may be applied.

A functional data problem is converted into a multivariate one by applying the indexes. Thus, some “information” is lost. Applying these indexes, not only to the original curves but also to their first and second derivatives, allows one to take advantage of the shape/magnitude/amplitude of the curves in the sample. It seems clear that these three attributes play an important role in functional clustering, and these indexes can provide a great deal of information in this regard.

Thereafter, data and indexes are combined to obtain a data set where a multivariate clustering technique is later applied.

Since considering all possible combinations of data and indexes without any restriction leads to a vast number of options, 18 different combinations of data and indexes are considered.

Now, the notation used for presenting the results in the tables is explained. The combinations are represented as (b).(c), where (b) stands for the data, with ‘\(\_\)’ representing the curves, ‘d’ the first derivatives and ‘d2’ the second derivatives, and (c) represents the indexes. The 18 different combinations come from applying all the indexes in (c) to all the data in (b). All possible combinations are listed in the Supplementary Material, and some examples are shown in Table 1.

Table 1 Representation and description of the combinations of data and indexes

When the curves are extremely irregular, the epigraph and the hypograph indexes may take values very close to 1 and 0, respectively. This fact causes the indexes to lose discriminatory capacity to differentiate between clusters and also induces computational problems and “ill-conditioned” problems in which singular or near-singular matrices are involved. Combinations leading to these kinds of problems have been avoided in our study.

Finally, the multivariate clustering technique to be applied has to be chosen from among the following: hierarchical clustering, using different criteria for calculating the similarity between clusters (single linkage, complete linkage, average linkage, centroid linkage and Ward's method); k-means and its kernel-based variants, such as kernel k-means (kkmeans); and other approaches imposing fewer restrictions on the structure of the data, such as spectral clustering (spc) and support vector clustering (svc).

For hierarchical methods, the Euclidean distance has been considered. On the other hand, when implementing k-means, Euclidean and Mahalanobis distances have been used. In order to apply the Mahalanobis distance, data is rescaled using the Cholesky decomposition of the variance matrix before running k-means with the Euclidean distance (see Redko et al. 2019). Moreover, when the method uses a kernel space, three different kernels have been applied: Gaussian, polynomial and linear.
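A minimal sketch of the rescaling described above: with the Cholesky factor of the sample covariance, Euclidean distances on the transformed data coincide with Mahalanobis distances on the original data, so standard k-means can be reused (the helper name is ours):

```r
mahalanobis_kmeans <- function(Z, k) {
  R <- chol(cov(Z))     # upper-triangular Cholesky factor: cov(Z) = R'R
  Zw <- Z %*% solve(R)  # rescaled data: Euclidean distance = Mahalanobis
  kmeans(Zw, centers = k, nstart = 25)
}
```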

4 Simulation study

With the current EHyClus version, all the cluster partitions derived from applying a clustering technique to the 18 combinations are computed. Then, the external validation criteria explained in Sect. 3 are employed to choose the best one.

A simulation study has been carried out in order to evaluate the performance of EHyClus and to compare it to some other existing methodologies in the literature. Specifically, EHyClus has been compared to five different methodologies fully explained in Yassouridis and Leisch (2017), and it has also been contrasted with two recent ones: the distance-based k-means procedure (functional k-means) introduced by Martino et al. (2019) and the test-based k-means from Zambom et al. (2019).

The first methodology in Yassouridis and Leisch (2017), baseclust, consists in smoothing the data and applying k-means to the resulting data set. It can be easily computed in R by fitting a B-spline basis and then applying k-means. The second one, fitfclust, also smooths the data but, instead of applying k-means, assumes that the data come from Gaussian distributions and applies the EM algorithm. The third one, distclust, is a distance-based approach built on Karhunen-Loève (K-L) expansions. The fourth method, modelcf, is again based on such expansions, but assigns each curve to its closest subspace. These three methods are provided in the R package ‘funcy’. Finally, the fifth method, curvclust, applies the EM algorithm to a wavelet basis representation, and is provided in the R package ‘curvclust’.

Martino et al. (2019) propose a k-means algorithm with a generalized Mahalanobis distance for functional data, \(d_{\rho }\), previously defined in Ghiglietti and Paganoni (2017), where the value of \(\rho \) has to be set in advance.

Zambom et al. (2019) propose a methodology based on hypothesis tests combined with k-means, considering four different possibilities for initializing the clusters: at random, one iteration of k-means, one iteration of a hierarchical method (Ward's method with Euclidean distance), or one iteration of k-means\(++\) (Vassilvitskii and Arthur 2006).

Each simulated scenario is composed of previously known groups generated from different processes. Each scenario is simulated 100 times, and the eight methodologies described above are applied each time. For each method, the averages over the 100 runs of each validation criterion (Purity, F-measure, Rand Index) and of the execution time are calculated, so each evaluated clustering partition is summarized by these four values. This section is divided into three parts: the first two are organized by the number of clusters, and the last one summarizes the results obtained in all the simulated scenarios.

4.1 Simulation study A: Two clusters

Three different simulation groups of scenarios will be studied in this section. The first one consists of eight different scenarios previously considered in Flores et al. (2018), and Franco-Pereira and Lillo (2020). The second one consists of two scenarios introduced in Martino et al. (2019), and the third one is based on the data presented in Tucker et al. (2013).

First, the data simulated in the first group of scenarios is described. Consider eight functional data sets defined on [0, 1], with continuous trajectories in that interval, which are realizations of a stochastic process X. Each curve has 30 equidistant observations in the interval [0, 1]. We generate 100 functions: 50 from Model 1 and 50 from Model i, \(i=2,\ldots ,9\), obtaining eight different functional data sets.

Model 1.:

This is the reference group for all the scenarios. It is generated by a Gaussian process

$$\begin{aligned} X_1(t)=E_1(t)+e(t), \end{aligned}$$

where \(E_1(t)=30t^{ \frac{3}{2}}(1-t)\) is the mean function and e(t) is a centered Gaussian process with covariance function

$$\begin{aligned} Cov(e(t_i),e(t_j))=0.3 \exp (-\frac{|t_i-t_j |}{0.3}). \end{aligned}$$

The rest of the models are obtained from the first one by perturbing the generation process.

The first three models introduce changes in the mean, while the covariance function remains unchanged. The changes in the mean increase from Model 2 to Model 4.

Model 2.:

\(X_2(t)=30t^{\frac{3}{2}}(1-t)+0.5+e(t).\)

Model 3.:

\(X_3(t)=30t^{\frac{3}{2}}(1-t)+0.75+e(t).\)

Model 4.:

\(X_4(t)=30t^{\frac{3}{2}}(1-t)+1+e(t).\)

The next two samples are obtained by rescaling the error process e(t), which amounts to multiplying the covariance function by a constant.

Model 5.:

\(X_5(t)=30t^{\frac{3}{2}}(1-t)+2 \ e(t).\)

Model 6.:

\(X_6(t)=30t^{\frac{3}{2}}(1-t)+0.25 \ e(t).\)

Model 7.:

This set is obtained by adding to \(E_1(t)\) a centered Gaussian process h(t) whose covariance function is given by \( Cov(h(t_i),h(t_j))=0.5 \exp (-\frac{|t_i-t_j|}{0.2})\). In this case, \(X_7(t)=30t^{\frac{3}{2}}(1-t)+ h(t).\)

The next two samples are obtained by changing the mean function.

Model 8.:

\(X_8(t)=30t{(1-t)}^2+ h(t).\)

Model 9.:

\(X_9(t)=30t{(1-t)}^2+ e(t).\)

From now on, the eight resulting data sets will be referred to as scenarios.
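As an example, one realization of scenario S 1-4 can be generated as follows; this is a sketch under the definitions above, with helper names of our own:

```r
simulate_s14 <- function(n_per_group = 50, n_obs = 30) {
  t <- seq(0, 1, length.out = n_obs)
  Sigma <- 0.3 * exp(-abs(outer(t, t, "-")) / 0.3)     # covariance of e(t)
  R <- chol(Sigma)
  gp <- function(n) matrix(rnorm(n * n_obs), n) %*% R  # n sample paths of e(t)
  m1 <- 30 * t^(3 / 2) * (1 - t)                       # mean of Model 1
  rbind(sweep(gp(n_per_group), 2, m1, "+"),            # Model 1
        sweep(gp(n_per_group), 2, m1 + 1, "+"))        # Model 4: mean + 1
}
X <- simulate_s14()  # 100 x 30 matrix, one curve per row
```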

The data is smoothed using a cubic B-spline basis in order to remove noise and to be able to use the derivatives of the data (S1 in Fig. 1). Then, each scenario is simulated 100 times and, each time, EHyClus is applied (S1-S4 in Fig. 1). The mean Purity, F-measure, Rand Index (RI) and execution time (ET) are used as criteria to choose the best model. Table 2 presents these results for the top 10 combinations obtained for scenario S 1-4; the remaining tables are deferred to the Supplementary Material. In these tables, each row summarizes the process carried out over the 100 realizations, denoted by (a).(b).(c), where (a) is the name of the clustering strategy (a hierarchical method, k-means, support vector clustering, kernel k-means or spectral clustering), and (b).(c) represents the choice of data and indexes, as in Table 1, with (b) the data employed and (c) the indexes applied to that data.

The smoothed functions and the first and second derivatives of curves generated from S 1-4 are shown in Fig. 2. These figures confirm that the original curves discriminate between the two clusters best. This can also be noticed when applying the indexes (see Figs. 3 and 4). By construction, Model 4 equals Model 1 plus a constant shift, so their derivatives follow the same distribution; in this scenario, considering the derivatives therefore does not provide any extra information. Nevertheless, since no prior information is assumed, they are included in the process. At first sight, one could think this might mislead the method, but the combination finally used is the one with the highest validation criterion. In this case, the top 10 combinations when applying EHyClus are shown in Table 2, and all of them contain the original curves. Hence, the combinations that do not provide information to distinguish between groups are discarded from the top partitions.

In Figs. 3 and 4, the represented data sets have two columns obtained from combining data and indexes, but when applying EHyClus, the resulting data set can have up to nine such columns (three indexes applied to the curves and to their first and second derivatives).

Fig. 2 A sample generated from S 1-4. Original data (left panel), first and second derivative curves (center and right panels, respectively)

Fig. 3 Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the original data simulated from Models 1 and 4 (left panel), first derivatives (center panel) and second derivatives (right panel)

Fig. 4 A sample generated from S 1-4. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel)

Finally, the best combination here turns out to be the one obtained when applying k-means on the epigraph, the hypograph and the modified epigraph indexes of the original data (kmeans._.EIHIMEI).

Table 2 Top 10 mean results for S 1-4 considering Euclidean distance (gray), Mahalanobis distance (pink), a polynomial kernel (blue), kernel k-means for initialization (green) and k-means for initialization (orange)

These results are first compared to those obtained from applying the functional k-means and test-based k-means techniques, shown in Tables 3 and 4. In the case of functional k-means, each row represents a different distance: the generalized Mahalanobis distance (\(d_{\rho }\)), the truncated Mahalanobis distance (\(d_k\)) or the Euclidean distance (\(L^2\)). For test-based k-means, each row stands for a different initialization. For the first procedure, the \(L^2\) distance provides the best RI, 0.847, which is close to the results obtained with small values of \(\rho \). Nevertheless, for \(\rho \) equal to 0.02 the execution time is twice that of the \(L^2\) distance.

Table 3 Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure (Martino et al. 2019) with truncated Mahalanobis distance, generalized Mahalanobis distance and \(L^2\) distance to simulated data from S 1-4
Table 4 Mean values of Purity, F-measure, Rand Index and execution time for the test-based k-means procedure (Zambom et al. 2019) with four different initializations to simulated data from S 1-4

When applying test-based k-means, the method is not able to distinguish between the two groups, since all the metrics lead to values close to 0.5 in all cases.

Moreover, results from the five different methodologies considered in Yassouridis and Leisch (2017) appear in Table 5.

The best partition in terms of metrics and ET is the one given by EHyClus (RI=0.868, ET=0.00213), followed by baseclust (RI=0.857, ET=0.00625). The remaining methodologies do not provide competitive results compared to EHyClus. The good results achieved in terms of the three metrics and execution time thus support the proposed methodology as a very good alternative to the existing ones for clustering functional data.

Table 5 Mean values of Purity, F-measure, Rand Index and execution time for the five different procedures described in Yassouridis and Leisch (2017) to simulated data from S 1-4

For the other seven scenarios, whose results appear in the Supplementary Material, EHyClus obtains good results in terms of metrics and execution times. These results are discussed in terms of RI and ET in Sect. 4.3.

On the other hand, in order to extend the simulation study, the functional data in Martino et al. (2019) is considered. These data sets were specially created for testing clustering techniques for functional data, and consist of two functional samples defined in [0, 1], with continuous trajectories that are generated by independent stochastic processes in \(L^2(I)\). Each curve has 150 equidistant observations in the interval [0, 1]. We generate 100 functions, 50 from Model 10 and 50 from Model i, \(i=11,12\), obtaining two different functional samples. These two scenarios will be referred to as S 10-11 and S 10-12, respectively.

The three different models defined for these simulations are specified below.

Model 10. The first 50 functions are generated as follows:

$$\begin{aligned} X_{10}(t)=E_2(t)+ \sum _{k=1}^{100} Z_k\sqrt{\rho _k}\theta _k(t), \end{aligned}$$

where \(E_2(t)=t(1-t)\) is the mean function, \(\{Z_k, k=1,\ldots ,100\}\) are independent standard normal variables, and \(\{ \rho _k,k\ge 1 \}\) is a sequence of positive real numbers defined as

$$\begin{aligned} \rho _k = \left\{ \begin{array}{ll} \frac{1}{k+1} & \text {if } k \in \{1,2,3\}, \\ \frac{1}{{(k+1)}^2} & \text {if } k \ge 4, \end{array} \right. \end{aligned}$$

in such a way that the values of \(\rho _k\) decrease faster for \(k\ge 4\), so that most of the variance is explained by the first three principal components. The sequence \(\{\theta _k, k\ge 1\}\) is an orthonormal basis of \(L^2(I)\) defined as

$$\begin{aligned} \theta _k(t) = \left\{ \begin{array}{ll} I_{[0,1]}(t) & \text {if } k=1, \\ \sqrt{2}\sin {(k\pi t)}\,I_{[0,1]}(t) & \text {if } k \ge 2,\ k \text { even},\\ \sqrt{2}\cos {((k-1)\pi t)}\,I_{[0,1]}(t) & \text {if } k \ge 3,\ k \text { odd}, \end{array} \right. \end{aligned}$$

where \(I_A(t)\) stands for the indicator function of set A.

The next two models are defined in the same way, changing in each case the term added to \(E_2(t)\) in Model 10. Moreover, the standard normal variables generated for these two models are drawn independently of those of Model 10.

Model 11. \(X_{11}(t)=E_3(t)+ \displaystyle \sum _{k=1}^{100} Z_k\sqrt{\rho _k}\theta _k(t),\) where \(E_3(t)= E_2(t)+\displaystyle \sum _{k=1}^3\sqrt{\rho _k}\theta _k(t).\)

Model 12. \(X_{12}(t)=E_4(t)+ \displaystyle \sum _{k=1}^{100} Z_k\sqrt{\rho _k}\theta _k(t),\) where \(E_4(t)= E_2(t)+\displaystyle \sum _{k=4}^{100}\sqrt{\rho _k}\theta _k(t).\)
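For concreteness, a single trajectory from Model 10 can be generated from its truncated Karhunen-Loève expansion as follows (a sketch; the function name is ours):

```r
model10_curve <- function(t, K = 100) {
  k <- seq_len(K)
  rho <- ifelse(k <= 3, 1 / (k + 1), 1 / (k + 1)^2)
  theta <- sapply(k, function(kk) {
    if (kk == 1) rep(1, length(t))
    else if (kk %% 2 == 0) sqrt(2) * sin(kk * pi * t)  # k even
    else sqrt(2) * cos((kk - 1) * pi * t)              # k odd, k >= 3
  })                                     # length(t) x K basis matrix
  t * (1 - t) + drop(theta %*% (rnorm(K) * sqrt(rho)))
}
t <- seq(0, 1, length.out = 150)
x10 <- model10_curve(t)
```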

As before, data is considered after being smoothed with a cubic B-spline basis in order to remove noise and to be able to use its first and second derivatives.

The corresponding smoothed curves simulated from S 10-12 often cross, as do their derivatives (Fig. 5). When applying the epigraph and the hypograph indexes to these sets of curves, the difference between groups is again negligible (Fig. 6). Nevertheless, looking at Fig. 7, when applying the MEI the difference between groups is much clearer. The best result from EHyClus (see Table 6) is achieved by applying kernel k-means with a polynomial kernel to the modified epigraph index of the first and second derivatives, obtaining an RI of 0.919. Moreover, applying the same technique to the same set of data (first and second derivatives) but adding the epigraph and hypograph indexes yields the same RI. This combination is not shown graphically since six different variables are involved. This particular case exemplifies that considering EI, HI and MEI at the same time generally gives equal or better results than working with any of their subsets. Thus, it would not be necessary to try all the combinations of indexes, but only the union of the three. Nevertheless, in order to see the differences, all the possibilities previously described have been considered.

Fig. 5 A sample generated from S 10-12. Original data (left panel), first and second derivative curves (center and right panels, respectively)

Fig. 6 Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the original data simulated from Models 10 and 12 (left panel), first derivatives (center panel) and second derivatives (right panel)

Fig. 7 A sample generated from S 10-12. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel)

Table 6 Top 10 mean results for S 10-12 considering Euclidean distance (gray), a polynomial kernel (blue), kernel k-means for initialization (green) and k-means for initialization (orange)

These results are compared to those obtained by applying the functional k-means procedure (Table 7). In this case, the best distance is the generalized Mahalanobis distance with a large value of \(\rho \), \(\rho =1e+08\), reaching an RI of 0.718, which is small compared to the value of 0.919 obtained with EHyClus. In addition, EHyClus spends 0.00423 s on its best combination, while functional k-means spends 7.9055 s for the best choice of \(\rho \).

Table 7 Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure (Martino et al. 2019) with truncated Mahalanobis distance, generalized Mahalanobis distance and \(L^2\) distance to simulated data from S 10-12

When applying test-based k-means to this type of simulated data (see Table 8) and the five methods described in Yassouridis and Leisch (2017) (Table 9), none of them are able to distinguish between groups, obtaining values close to 0.5 for all metrics.

Table 8 Mean values of Purity, F-measure, Rand Index and execution time for the test-based k-means procedure (Zambom et al. 2019) with four different initializations to simulated data from S 10-12
Table 9 Mean values of Purity, F-measure, Rand Index and execution time for the five different procedures described in Yassouridis and Leisch (2017) to simulated data from S 10-12

To sum up, EHyClus leads to the best results in terms of metrics and execution time, improving the RI by 0.2 points over functional k-means, the second-best method in terms of the considered metrics, while also being faster.

Finally, when dealing with functional data, it is important to investigate not only magnitude and amplitude changes in the generated curves, but also phase changes due to possible misalignment of the data. For that, the models in Tucker et al. (2013) are considered. Consider a functional sample of continuous trajectories defined on the interval \([-6,6]\), which are realizations of a stochastic process. Each curve has 150 equidistant observations in the interval \([-6, 6]\). We generate 42 functions: 21 from Model 13 and 21 from Model 14.

Model 13. This model presents phase variation. It is generated as \(X_{13}(t)=z_1 \exp (\frac{-{(t-a)^2}}{2})\), where \(z_1\) and a are independent normal variables \({\mathcal {N}}(1,0.05^2)\) and \({\mathcal {N}}(0,1.25^2)\), respectively.

Model 14. This model does not present phase variation, and it is combined with the previous one in order to check whether the new methodology is able to distinguish between the two populations. The data is generated as \(X_{14}(t)=z_2 \exp (\frac{-{(t-2.5)^2}}{2})\), where \(z_2\) follows a \({\mathcal {N}}(1.5,0.1^2)\).

The combination of these two models will be referred to as S 13-14. As before, a cubic B-spline basis is used to smooth the data and to calculate its derivatives. Figures 8 and 9 represent the curves and the groups obtained when applying the modified epigraph index to the different data available. In this case, as the curves present phase variation, there are many intersections between them. This is why the modified versions of the indexes are essential for summarizing the information.
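A minimal sketch generating one sample from S 13-14 under the model definitions above:

```r
t <- seq(-6, 6, length.out = 150)
m13 <- t(replicate(21, rnorm(1, 1, 0.05) *
                       exp(-(t - rnorm(1, 0, 1.25))^2 / 2)))  # phase variation
m14 <- t(replicate(21, rnorm(1, 1.5, 0.1) * exp(-(t - 2.5)^2 / 2)))
X <- rbind(m13, m14)  # 42 x 150 matrix of curves
```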

Fig. 8 A sample generated from S 13-14. Original data (left panel), first and second derivative curves (center and right panels, respectively)

Fig. 9 A sample generated from S 13-14. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel)

The top 10 combinations of EHyClus appear in Table 10. All of them use the modified epigraph index, since it provides the most informative insights. These results are compared to those obtained with the other seven approaches, which appear in Tables 11, 12 and 13.

Table 10 Top 10 mean results for S 13-14 considering Euclidean distance (gray), a polynomial kernel (blue), kernel k-means for initialization (green), and k-means for initialization (orange)

EHyClus and fitfclust are the methodologies with the highest RI. Nevertheless, the former obtains an ET for its best combination of 0.00033 s, while the latter spends 1.1612 s. Thus, even though there could be some misalignment in one of the classes, EHyClus is able to distinguish between the two populations, and it improves the computational cost with respect to the existing approaches.

4.2 Simulation Study B: More than two clusters

In this case, four data sets are considered: three with three groups and one with six.

First, three scenarios with three groups each are considered. This simulation study previously appeared in Zambom et al. (2019). Each data set is composed of 150 curves, 50 belonging to each of the three clusters. Each curve has 100 equidistant observations in the interval \([0, \frac{\pi }{3}]\).

Table 11 Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure (Martino et al. 2019) with truncated Mahalanobis distance, generalized Mahalanobis distance and \(L^2\) distance to simulated data from S 13-14
Table 12 Mean values of Purity, F-measure, Rand Index and execution time for the test-based k-means procedure (Zambom et al. 2019) with four different initializations to simulated data from S 13-14
Table 13 Mean values of Purity, F-measure, Rand Index and execution time for the five different procedures described in Yassouridis and Leisch (2017) to simulated data from S 13-14

Each scenario is composed of 50 functions from each of three different models, each of the form

$$\begin{aligned} X(t) = Y(t)+\epsilon . \end{aligned}$$

The nine different models are defined as follows:

Model 15.:

\(X_{15}(t) = \frac{1}{1.3}\sin (1.3 t)+t^3+a+0.3+\epsilon _1\)

Model 16.:

\(X_{16}(t) = \frac{1}{1.2}\sin (1.3 t)+t^3+a+1+\epsilon _1\)

Model 17.:

\(X_{17}(t) = \frac{1}{4}\sin (1.3 t)+t^3+a+0.2+\epsilon _1\)

Model 18.:

\(X_{18}(t) = \sin (1.5 \pi t)+\cos (\pi t^2)+b+1.1+\epsilon _1\)

Model 19.:

\(X_{19}(t) = \sin (1.7 \pi t)+\cos (\pi t^2)+b+1.5+\epsilon _1\)

Model 20.:

\(X_{20}(t) = \sin (1.9 \pi t)+\cos (\pi t^2)+b+2.2+\epsilon _1\)

Model 21.:

\(X_{21}(t) = \frac{1}{1.8}\exp (1.1 t)-t^3+a+\epsilon _2\)

Model 22.:

\(X_{22}(t) = \frac{1}{1.7}\exp (1.4 t)-t^3+a+\epsilon _2\)

Model 23.:

\(X_{23}(t) = \frac{1}{1.5}\exp (1.5 t)-t^3+a+\epsilon _2\)

where \(a\sim U(\frac{-1}{4}, \frac{1}{4})\), \(b\sim U(\frac{-1}{2}, \frac{1}{2})\), and \(\epsilon _1\sim N(2,0.4^2)\), \(\epsilon _2\sim N(2,0.4^2)\).

Hence, S 15-16-17 is composed of 50 functions from Model 15, 50 functions from Model 16 and 50 from Model 17. S 18-19-20 and S 21-22-23 are created in an analogous way.

The data from S 15-16-17 is shown in Fig. 10, where it is clear that the functions in green and red intertwine considerably. When applying the epigraph and the hypograph indexes (Fig. 11), it is possible to identify two well-distinguished bunches of curves, because the green curves overlap the red and blue ones. Nevertheless, when considering the modified epigraph index (Fig. 12), the differences between the three groups become much more evident.

Fig. 10 A sample generated from S 15-16-17. Original data (left panel), first and second derivative curves (center and right panels, respectively)

Fig. 11 Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the original data simulated from Models 15, 16 and 17 (left panel), first derivatives (center panel) and second derivatives (right panel)

Fig. 12 A sample generated from S 15-16-17. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel)

EHyClus with k-means and both Euclidean and Mahalanobis distances, applied to the modified epigraph index of the original data and the first and second derivatives, leads to the joint best result in Table 14 (RI=0.977 and ET of almost 0.003 s). When considering functional k-means, the best method in Table 15 is the one with a small value of \(\rho \), \(\rho =0.001\) (RI=0.928 and ET=6.50802 s). When applying test-based k-means, the best result in Table 16 is obtained when initializing the process with k-means\(++\) (RI=0.944 and ET=0.99653 s). Results for the other five methodologies appear in Table 17, with baseclust being the only one obtaining competitive results compared to the previously mentioned ones (RI=0.945 and ET=0.02291 s).

In summary, EHyClus leads to the best result in terms of RI and ET.

Table 14 Top 10 mean results for S 15-16-17 considering Euclidean distance (gray), Mahalanobis distance (pink), kernel k-means for initialization (green) and k-means for initialization (orange)
Table 15 Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure (Martino et al. 2019) with truncated Mahalanobis distance, generalized Mahalanobis distance and \(L^2\) distance to simulated data from S 15-16-17
Table 16 Mean values of Purity, F-measure, Rand Index and execution time for the test-based k-means procedure (Zambom et al. 2019) with four different initializations to simulated data from S 15-16-17
Table 17 Mean values of Purity, F-measure, Rand Index and execution time for the five different procedures described in Yassouridis and Leisch (2017) to simulated data from S 15-16-17

Results for S 18-19-20 and S 21-22-23 are shown in the Supplementary Material.

One last scenario, from Yassouridis and Leisch (2017), with six different groups is studied. Consider a functional sample defined on [0, 1], with continuous trajectories in that interval, which are realizations of a stochastic process. Each curve is measured at 15 equidistant observations in the interval [0, 1]. We generate 60 functions: 10 from each of Models 24 to 29.

Model 24.:

\(X_{24}(t)=x^2(t)+ e_2(t)\) where \(e_2(t)\) is a centered Gaussian process with standard deviation 0.3.

Model 25.:

\(X_{25}(t)=x^2(t)+ e_2(t)\)

Model 26.:

\(X_{26}(t)=\sqrt{x(t)}+ e_2(t)\)

Model 27.:

\(X_{27}(t)=\sin (2 \pi x(t))+ e_2(t)\)

Model 28.:

\(X_{28}(t)=-x^2(t)+ e_2(t)\)

Model 29.:

\(X_{29}(t)=x(t)-1+ e_2(t)\)

Data generated from Models 24 to 29 (S 24-25-26-27-28-29) is shown in Fig. 13. As the curves intertwine considerably, it seems natural that the best EHyClus result will include the modified epigraph index. The results obtained with this methodology, as well as those obtained when applying functional k-means, test-based k-means, and the five methodologies of Yassouridis and Leisch (2017) for benchmarking, are presented in Tables 18, 19, 20 and 21, respectively. The best result is obtained with baseclust, with an RI of 0.909 and an ET of 0.00519 s. The differences with respect to EHyClus are small, although the RI of baseclust is slightly higher.

Fig. 13 A sample generated from S 24-25-26-27-28-29. Original data (left panel), first and second derivative curves (center and right panels, respectively)

Table 18 Top 10 mean results for S 24-25-26-27-28-29 considering Euclidean distance (gray), Mahalanobis distance (pink), kernel k-means for initialization (green) and k-means for initialization (orange)
Table 19 Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure (Martino et al. 2019) with truncated Mahalanobis distance, generalized Mahalanobis distance and \(L^2\) distance to simulated data from S 24-25-26-27-28-29
Table 20 Mean values of Purity, F-measure, Rand Index and execution time for the test-based k-means procedure (Zambom et al. 2019) with four different initializations to simulated data from S 24-25-26-27-28-29
Table 21 Mean values of Purity, F-measure, Rand Index and execution time for the five different procedures described in Yassouridis and Leisch (2017) to simulated data from S 24-25-26-27-28-29

4.3 Simulation summary

Table 22 summarizes the results obtained in all the simulated scenarios, including those deferred to the Supplementary Material. This table presents the RI and ET obtained with EHyClus, with the two methodologies of Martino et al. (2019) and Zambom et al. (2019), and with the five approaches explained in Yassouridis and Leisch (2017). Each row represents one scenario, with the highest RI and the smallest ET shown in bold. Note that the ET for EHyClus in Table 22 corresponds to the best combination of indexes, data and method, since EHyClus is intended to be used with one combination fixed in advance. At this stage, however, all the possibilities are computed and the best one is chosen by comparing all the results.

If the aim of the methodology were to compute all the possibilities and to choose among them with an internal criterion, a global time would be required. In that case, the number of combinations to try could be reduced with no major difference in the results. Nevertheless, even considering all the possibilities and a global ET, this approach is competitive with the others. To illustrate this, the global ET for S 1-4 when applying EHyClus, functional k-means and test-based k-means has been computed, obtaining 2.4991 s, 2.5069 s and 1.1545 s, respectively. It is important to note that EHyClus computes more than 200 combinations, while the other two methodologies compute fewer than 10 alternatives. Overall, the differences in time are small, but the number of trials is completely different; if the same number of possibilities were computed with the three approaches, the differences in time would be much more noticeable.

Table 22 Best RI and execution time in seconds (in brackets) for each simulated data set considering EHyClus and seven more different approaches

EHyClus reaches the best RI in nine out of the fifteen scenarios and, concerning partial ET, it obtains the best results in all but one. Analysing the differences between EHyClus and the other approaches, one can notice that in the scenarios where EHyClus is not the best strategy, its RI is very close to the best one, whereas when it achieves the best RI the margin is much larger, as is the difference in ET. Finally, baseclust can be considered the approach with the most competitive results in terms of RI and ET compared to EHyClus. Nevertheless, in the cases where it does not achieve the best result, the differences can be large; see scenarios S 10-12 and S 21-22-23. In conclusion, EHyClus can be considered a good clustering alternative in terms of metrics, outperforming the existing approaches in terms of ET.

5 Application to real data

In this section, EHyClus is applied to two different real data sets that have been fully studied in the literature: The Berkeley Growth Study data set and the Canadian Weather data set.

5.1 Case study: Berkeley Growth Study data set

EHyClus has been applied to a popular real data set in the FDA literature: the Berkeley Growth study. This is a classical data set included in Ramsay and Silverman (2005) and available in the ‘fda’ R-package. It contains the heights of 93 children aged from 1 to 18 (54 girls and 39 boys).

A cubic B-spline basis has been fitted to the curves. Figure 14 shows differences in shape between the two groups. Accordingly, when applying the hypograph and epigraph indexes (Fig. 15) and their modified versions (Fig. 16), the two groups exhibit different behaviours despite some overlap.

Fig. 14 Growth curves (girls in green and boys in blue) for the original data (left panel), the first derivatives (center panel) and the second derivatives (right panel)

Fig. 15 Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the growth curves original data (left panel), first derivatives (center panel) and second derivatives (right panel)

Fig. 16 Growth curves. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel)

The best result in Table 23 is obtained with several clustering methods: kernel k-means with a polynomial kernel, and k-means with Euclidean and Mahalanobis distances, using the three indexes (EI, HI, MEI) applied to the first and second derivatives. The resulting clustering partition correctly classifies all boys, but misclassifies 3 girls as boys. The partition is thus very accurate.

When applying the functional k-means procedure (Table 24), the largest Purity coefficient is 0.850, obtained with a large value of \(\rho \), \(\rho = 1e+08\). In addition, the differences in ET compared to EHyClus are evident.

Furthermore, the test-based k-means technique (Table 25) achieves its best result with the k-means initialization, obtaining a Purity coefficient of 0.817. EHyClus has also been compared to the five approaches explained in Yassouridis and Leisch (2017) in Table 26, which obtain worse results than the previous ones.

In summary, EHyClus obtains the best result in terms of the three different metrics and in terms of execution time.

5.2 Case study: Canadian weather data set

Another popular real data set in the FDA literature, also included in Ramsay and Silverman (2005) and in the ‘fda’ R-package, is the Canadian weather data set. This data set contains the daily temperature from 1960 to 1994 at 35 different Canadian weather stations grouped into 4 different regions: Arctic (3), Atlantic (15), Continental (12) and Pacific (5).

Table 23 Top 10 mean results for growth data set considering Euclidean distance (gray), Mahalanobis distance (pink), a Gaussian kernel (yellow), a polynomial kernel (blue), kernel k-means for initialization (green) and k-means for initialization (orange)
Table 24 Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure with truncated Mahalanobis distance, generalized Mahalanobis distance and \(L^2\) distance to data from growth data set
Table 25 Mean values of Purity, F-measure, Rand Index and execution time for the test-based k-means procedure with four different initializations to data from growth data set
Table 26 Mean values of Purity, F-measure, Rand Index and execution time for the five different procedures described in Yassouridis and Leisch (2017) to data from growth data set

EHyClus has been applied to this data set after smoothing with a cubic B-spline basis. The first and second derivatives by themselves do not provide much more information, as shown in Fig. 17. Nevertheless, when applying the indexes and considering them together (Fig. 18), the groups can be distinguished more clearly.

Fig. 17 Canadian weather curves. Original data, first and second derivative curves

Fig. 18 Canadian weather curves. Epigraph and hypograph index on the original data (left panel), the modified epigraph index on the original data and first derivatives (center panel) and the modified epigraph index on the first and second derivatives (right panel)

The best configuration of clustering method, indexes and data is support vector clustering initialized with k-means, applied to the three indexes (EI, HI, MEI) of the original data and its second derivatives (see Table 27). In this case, EHyClus provides a Purity value of 0.714, an F-measure of 0.604 and a Rand Index of 0.729. These results indicate that the final configuration of the groups is roughly accurate, in the sense that the groups obtained by EHyClus have a similar number of elements to the true ones. Nevertheless, the F-measure is smaller because, focusing on the configuration of a specific group, although the number of observations in the group seems correct, some observations inside that group are misclassified. For example, looking at the Pacific group in Table 28, 6 observations are assigned instead of 5, but only 2 of the 6 are truly Pacific observations.

For the functional k-means procedure (Table 29), the best result is obtained with the truncated Mahalanobis distance (RI=0.784, F-measure=0.613). Among the test-based k-means variants (Table 30), the best result is obtained with the hierarchical clustering initialization (RI=0.764, F-measure=0.613). Baseclust, fitfclust, distclust and curvclust do not provide competitive results in terms of RI compared to the other three approaches. Modelcf, with an RI of 0.745, is the only approach from the five in Yassouridis and Leisch (2017) obtaining a comparable RI, although it is not the best approach (see Table 31).

Regarding execution time, the only alternative to EHyClus is baseclust, but it obtains the worst results in all the classification metrics. Thus, EHyClus appears to be the best alternative.

Table 27 Top 10 mean results for the Canadian weather data set considering Euclidean distance (gray), Mahalanobis distance (pink), a Gaussian kernel (yellow), a polynomial kernel (blue), kernel k-means for initialization (green) and k-means for initialization (orange)
Table 28 Confusion matrix obtained from comparing real classification to that obtained with our proposal
Table 29 Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure with truncated Mahalanobis distance, generalized Mahalanobis distance and \(L^2\) distance to data from the Canadian weather data set
Table 30 Mean values of Purity, F-measure, Rand Index and execution time for the test-based k-means procedure with four different initializations to simulated data from the Canadian weather data set
Table 31 Mean values of Purity, F-measure, Rand Index and execution time for the five different procedures described in Yassouridis and Leisch (2017) to data from the Canadian weather data set

6 Discussion

In summary, this paper proposes EHyClus, a new methodology for clustering functional data that is competitive with the existing ones. EHyClus is based on transforming a functional problem into a multivariate one through the epigraph and hypograph indexes and their modified versions, followed by multivariate clustering techniques. It has been compared to seven different clustering procedures, outperforming them in most cases in terms of classification metrics and execution time. Finally, the code needed to carry out this analysis and to apply EHyClus is available in the GitHub repository: https://github.com/bpulidob/EHyClus.

Further research is needed to implement an automatic, data-driven procedure able to choose the best combination based on the intrinsic characteristics of the data. Currently, the best combination is obtained using external validation methods: all the combinations are created, all the clustering methods are applied, and one of them is finally chosen based on these metrics. To improve this procedure, these choices may be carried out as independent processes, as illustrated in Fig. 19.

Fig. 19 The three choices to be made during the proposed procedure

The choice of the best combination for a given data set without knowing the ground truth is a question for future research. Two possibilities arise in this line. One consists of using internal validation indexes, based on the intrinsic information of the data (Manning et al. 2009). The Calinski-Harabasz index (Calinski and Harabasz 1974), the Silhouette index (Rousseeuw 1987) and the Davies-Bouldin index (Davies and Bouldin 1979) are examples of such validation measures. The other consists of studying in depth the a priori information provided by each combination; for this purpose, statistical techniques such as the generalized variance proposed in Wilks (1932) could be considered. This strategy would lead to a vast reduction in the number of combinations to process, also reducing the execution time needed to complete the process. Moreover, these two approaches could be combined if the clustering method is not fixed in advance.
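As an illustration of the first possibility, an internal criterion such as the average silhouette width can rank combinations without labels; a sketch using the 'cluster' package (the helper name is ours):

```r
library(cluster)

## Average silhouette width of a k-means partition of an index data set Z.
avg_sil <- function(Z, k) {
  cl <- kmeans(Z, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, dist(Z))[, "sil_width"])
}
## The combination of data and indexes maximizing avg_sil would be selected.
```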

Fixing the number of clusters in advance is now generally accepted, and this paper has dealt with the issue in that way; the same approach was followed by Martino et al. (2019), Zambom et al. (2019), and Yassouridis and Leisch (2017). The choice of the number of groups may constitute the object of future studies; see, for example, Akhanli and Hennig (2020).

Some experiments have been carried out with the Silhouette index in order to determine the number of clusters. Each scenario has been simulated 100 times, and the results are given in Table 32. In view of the results, this strategy seems to work well when the number of clusters is two; however, for three groups, the results were unacceptable. As an alternative, the R package ‘NbClust’ has also been considered (see Charrad et al. 2012), but the results were not as good as expected, so further research is still needed.

Table 32 Distribution of the number of clusters suggested when applying Silhouette for each Scenario simulated 100 times

6.1 Supplementary information

This article is accompanied by a supplementary file.