A fast epigraph and hypograph-based approach for clustering functional data

Clustering techniques for multivariate data are useful tools in Statistics that have been fully studied in the literature. However, the literature on clustering methodologies for functional data is still limited. Our proposal consists of a clustering procedure for functional data that relies on techniques for clustering multivariate data. The idea is to reduce a functional data problem to a multivariate one by applying the epigraph and hypograph indexes to the original curves and to their first and/or second derivatives. The information carried by the functional data is thereby transferred to the multivariate context, while remaining informative enough for the usual multivariate clustering techniques to be effective. The performance of this new methodology is evaluated through a simulation study and is also illustrated on real data sets. The results are compared to those of other clustering procedures for functional data.


Introduction
Nowadays, in several fields of study, many of the data that are collected and analysed can be considered as functions x_i(t), i = 1, ..., n, t ∈ I, where I is an interval in R; for example, growth curves, weather variables or the evolution of financial markets. This is due to the fact that current technological development makes it possible to analyse large volumes of data in a short period of time. Functional Data Analysis (FDA) arises when this information is studied through the analysis of curves or functions. A complete overview of functional data analysis can be found in Ramsay and Silverman (2005) and Ferraty and Vieu (2006), while some interesting reviews of functional data can be found in Horváth and Kokoszka (2012a), Hsing and Eubank (2015) and Wang et al. (2016).
The main problem when working with functional and multivariate data is that there does not exist a total order as in one dimension. Thus, a traditional challenge in FDA and in multivariate analysis is to provide an ordering within a sample of curves that enables the definition of order statistics such as ranks and L-statistics. In this sense, Tukey (1975) introduced the concept of statistical depth, which provides a center-outward ordering for multivariate data. Some other definitions can be found in Oja (1983), Liu (1990) and Zuo (2003). This concept was extended to functional data, giving rise to several definitions of functional depth. See, for example, Vardi and Zhang (2000), Fraiman and Muniz (2001), Cuevas et al. (2006), Cuesta-Albertos and Nieto-Reyes (2008), López-Pintado and Romo (2009), López-Pintado and Romo (2011), and Sguera et al. (2014).
More recently, Franco-Pereira et al. (2011) proposed the epigraph and the hypograph indexes which, instead of dealing with the "centrality" of a sample of curves, allow to measure their "extremality". The combination of these indexes has already been exploited: Arribas-Gil and Romo (2014) proposed the outliergram for outlier detection, Martín-Barragán et al. (2018) defined a boxplot for functional data and Franco-Pereira and Lillo (2019) contributed a homogeneity test for functional data. The main idea of combining the epigraph and the hypograph indexes is to summarize the information provided by a functional sample into a vector in R^2, or in R^d, d ∈ N, if the derivatives of the functions are also considered.
When studying high-volume data, the need to classify the data into groups without any extra information increases, since grouped data become easier to handle.
Clustering is one of the most widely used unsupervised learning techniques, and it has been fully studied for multivariate data. Among the most frequently used procedures are the distance-based techniques, such as hierarchical clustering (see Sibson (1973), Defays (1977), Sokal and Michener (1958), Lance and Williams (1967) and Ward (1963) for different hierarchical clustering procedures) and k-means clustering (introduced by MacQueen (1967)). Given that k-means is probably the most widely used clustering method in the literature, different variations of it have been introduced; see Ben-Hur et al. (2001) and Dhillon et al. (2004).
Clustering functional data is a challenging problem since it involves working with an infinite-dimensional space. Different approaches have been considered in the literature.
In Jacques and Preda (2014) the functional clustering techniques are classified into four categories: raw data methods, which consist in treating the functional data set as a multivariate one and applying the clustering techniques studied for multivariate data (Boullé (2012)); filtering methods, which first apply a basis expansion to the functional data and then use clustering techniques on the obtained coefficients (Abraham et al. (2003), Rossi et al. (2004), Peng, Müller, et al. (2008), Kayano et al. (2010)); adaptive methods, where dimensionality reduction and clustering are performed at the same time (James and Sugar (2003), Jacques and Preda (2013), Giacofci et al. (2013), Traore et al. (2019)); and distance-based methods, which apply a distance-based clustering technique with a specific distance for functional data (Tarpey and Kinateder (2003), Ieva et al. (2011), Martino et al. (2019)). Recent works that develop different strategies for clustering functional data are Zambom et al. (2019), which proposes a new method applying k-means, assigning each element to one cluster or another based on the combination of a hypothesis test of parallelism and a test for equality of means, and Schmutz et al. (2020).

The paper is organized as follows. In Section 2, the epigraph and the hypograph indexes are introduced, and the methodology for clustering functional data sets based on these indexes is explained in Section 3. Some other techniques for clustering functional data are presented in Section 4. In Section 5 we present the results of an extensive simulation study in which our proposed methodology is compared to the existing procedures and, in Section 6, we illustrate its applicability through some real data sets. In Section 7 we address an important question, the choice of the number of clusters prior to the application of any clustering methodology, and in Section 8 a brief discussion and the main conclusions of the paper are presented.
Preliminaries: The epigraph and the hypograph indexes

Let C(I) be the space of continuous functions defined on a compact interval I. Consider a stochastic process X with sample paths in C(I) and distribution F_X. The graph of a function x in C(I) is G(x) = {(t, x(t)), t ∈ I}. Then, the epigraph (epi) and the hypograph (hyp) of x are defined as follows:

epi(x) = {(t, y) ∈ I × R : y ≥ x(t)},    hyp(x) = {(t, y) ∈ I × R : y ≤ x(t)}.

Taking into account the information that can be obtained from these graphs, Franco-Pereira et al. (2011) defined two indexes based on these two concepts. Given a sample of curves {x_1(t), ..., x_n(t)}, the epigraph index of a curve x, EI_n(x), is defined as one minus the proportion of curves in the sample that are totally included in its epigraph:

EI_n(x) = 1 − (1/n) Σ_{i=1}^n 1{x_i(t) ≥ x(t), ∀ t ∈ I}.

Analogously, the hypograph index of x, HI_n(x), is the proportion of curves totally included in the hypograph of x:

HI_n(x) = (1/n) Σ_{i=1}^n 1{x_i(t) ≤ x(t), ∀ t ∈ I}.
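For curves observed on a common grid, these sample indexes reduce to counting, for each curve, how many curves in the sample lie entirely at or above it (epigraph) or at or below it (hypograph). A minimal sketch in Python, assuming the n curves are stored as the rows of an array with one column per grid point:

```python
import numpy as np

def epigraph_index(X):
    """Sample epigraph index of every curve in X (n_curves x n_points):
    EI_n(x) = 1 - proportion of curves lying entirely at or above x."""
    n = X.shape[0]
    # above[i, j] is True when curve i is >= curve j at every grid point
    above = (X[:, None, :] >= X[None, :, :]).all(axis=2)
    return 1.0 - above.sum(axis=0) / n

def hypograph_index(X):
    """HI_n(x): proportion of curves lying entirely at or below x."""
    n = X.shape[0]
    below = (X[:, None, :] <= X[None, :, :]).all(axis=2)
    return below.sum(axis=0) / n
```

For three non-crossing curves ordered from lowest to highest, the lowest curve gets EI = 0 (everything is in its epigraph) and the highest gets HI = 1, matching the "extremality" interpretation above.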
Their population versions are given by EI(x) = 1 − P(G(X) ⊂ epi(x)) and HI(x) = P(G(X) ⊂ hyp(x)). Franco-Pereira et al. (2011) argued that when the curves in the sample are extremely irregular, having lots of intersections, the modified versions of these indexes are more convenient. If I is considered as a time interval, the modified epigraph index of x, MEI_n(x), can be defined as one minus the proportion of time that the curves of the sample are in the epigraph of a given curve, i.e., the proportion of time that the curves of the sample are above it. Analogously, the modified hypograph index of x, MHI_n(x), can be considered as the proportion of time the curves in the sample are below a given curve:

MEI_n(x) = 1 − (1/(n λ(I))) Σ_{i=1}^n λ({t ∈ I : x_i(t) ≥ x(t)}),    (1)

MHI_n(x) = (1/(n λ(I))) Σ_{i=1}^n λ({t ∈ I : x_i(t) ≤ x(t)}),    (2)

where λ stands for the Lebesgue measure on R.
Note that, since the graph of any curve x is contained in both its epigraph and its hypograph, the following relation holds for every curve x_i in the sample:

λ({t ∈ I : x_i(t) ≥ x(t)}) + λ({t ∈ I : x_i(t) ≤ x(t)}) = λ(I) + λ({t ∈ I : x_i(t) = x(t)}).    (3)

Applying this condition to (1) and (2), we obtain

MHI_n(x) = MEI_n(x) + (1/(n λ(I))) Σ_{i=1}^n λ({t ∈ I : x_i(t) = x(t)}).    (4)

Moreover, if x = x_j for some j in the sample and x coincides with each x_i, i ≠ j, only on a set of measure zero, then

Σ_{i=1}^n λ({t ∈ I : x_i(t) = x(t)}) = λ(I),    (5)

and applying this into (4) we can write:

MHI_n(x) = MEI_n(x) + 1/n.    (6)

We thus obtain a relation between the two modified versions of the epigraph and the hypograph indexes, leading to the conclusion that both are linearly dependent. Note that this equality does not hold in Franco-Pereira and Lillo (2019) because the way in which the data are considered in the homogeneity test differs from the perspective given in this paper.
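The modified indexes replace the all-points condition with the fraction of the domain (here the grid, standing in for the Lebesgue measure) on which the inequality holds. The sketch below also checks numerically the linear relation MHI_n(x_j) = MEI_n(x_j) + 1/n for curves in the sample, which holds when distinct curves coincide only on sets of measure zero:

```python
import numpy as np

def modified_epigraph_index(X):
    """MEI_n(x): 1 minus the mean proportion of time the sample curves
    lie at or above x (time measured by the grid)."""
    n = X.shape[0]
    # fraction of grid points where curve i >= curve j
    frac_above = (X[:, None, :] >= X[None, :, :]).mean(axis=2)
    return 1.0 - frac_above.sum(axis=0) / n

def modified_hypograph_index(X):
    """MHI_n(x): mean proportion of time the sample curves lie at or below x."""
    n = X.shape[0]
    frac_below = (X[:, None, :] <= X[None, :, :]).mean(axis=2)
    return frac_below.sum(axis=0) / n

# with continuous random curves, ties between distinct curves have
# probability zero, so MHI = MEI + 1/n holds exactly on the grid
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
mei = modified_epigraph_index(X)
mhi = modified_hypograph_index(X)
```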
Clustering functional data through the epigraph and the hypograph indexes

The four steps of the proposed methodology for clustering functional data are illustrated in Figure 1.
Figure 1: Scheme of the proposed methodology for clustering functional data.
Step 1 (S1) consists of smoothing the data. This is recommended since the process underlying the data precludes abrupt changes in value, so it is common to smooth the sample curves when working with functions. We have used a cubic B-spline basis, but any other functional basis could have been used. After the data set is transformed, the second step (S2) is to apply the epigraph and the hypograph indexes (and their modified versions) to the basis-transformed data, as well as to their derivatives, obtaining a multivariate data set. Then, the most informative combination of data is selected and a multivariate clustering technique is applied in the third step of the process (S3). Finally, the fourth step (S4) consists of obtaining a final clustering partition into the previously set number of groups.
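The four steps can be sketched end to end. The smoother and the k-means initialization below are simplified stand-ins (a moving average instead of a cubic B-spline basis, and a deterministic seeding of centers), so this is an illustration of the S1-S4 flow rather than the paper's implementation:

```python
import numpy as np

def smooth(X, window=5):
    """S1: crude moving-average smoother (a stand-in for the cubic
    B-spline basis used in the paper)."""
    kernel = np.ones(window) / window
    return np.array([np.convolve(x, kernel, mode="same") for x in X])

def indexes(X):
    """S2: epigraph and hypograph indexes of every curve, as two features."""
    n = X.shape[0]
    above = (X[:, None, :] >= X[None, :, :]).all(axis=2)
    below = (X[:, None, :] <= X[None, :, :]).all(axis=2)
    return np.column_stack([1 - above.sum(0) / n, below.sum(0) / n])

def kmeans(Y, k, n_iter=100):
    """S3: plain Lloyd's k-means; centers seeded by spreading along the
    first feature (deterministic, good enough for a sketch)."""
    order = np.argsort(Y[:, 0])
    centers = Y[order[np.linspace(0, len(Y) - 1, k).astype(int)]].astype(float)
    for _ in range(n_iter):
        labels = ((Y[:, None] - centers[None]) ** 2).sum(2).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Y[labels == j].mean(0)
    return labels

# S1-S4 on a toy sample: two groups of curves differing in level
t = np.linspace(0, 1, 30)
rng = np.random.default_rng(1)
low = np.sin(2 * np.pi * t) + rng.normal(0, 0.1, (25, 30))
high = np.sin(2 * np.pi * t) + 2 + rng.normal(0, 0.1, (25, 30))
X = np.vstack([low, high])
labels = kmeans(indexes(smooth(X)), k=2)  # S4: final clustering partition
```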
Along the procedure, it is necessary to make three choices, as represented in Figure 2. These choices are explained below.
Figure 2: The three choices to be made during the proposed procedure.
Choice of the data and/or their derivatives (E1). Once dealing with smoothed functions, it is possible to consider their derivatives. When the indexes are applied to the data (S2), we move from a functional data problem to a multivariate one; thus, we lose some "information". Applying the indexes not only to the original functions but also to their first and second derivatives allows us to take advantage of the shape of the functions in the sample. When one poses a discrimination problem with functions, it is clear that the shape of these functions must be taken into account, and the epigraph and the hypograph indexes have the property of reflecting the shape of the data, as shown in Martín-Barragán et al. (2018). From now on, the term data will be used to refer to the original functional sample together with the corresponding first and second derivatives.
Choice of the best combination of indexes (E2). As explained in Section 2, the modified epigraph and hypograph indexes are linearly dependent. Because of that, MHI will be discarded when obtaining the multivariate data set through the indexes, since it would not provide "extra information".
The combinations considered thus involve the epigraph index, the hypograph index and the modified version of the epigraph index applied to all the data.
Table 1: Representation and explanation of the combinations of data and indexes.
When curves are extremely irregular, the epigraph and the hypograph indexes take values very close to 1 and 0, respectively. This causes the indexes to lose discriminatory capacity to differentiate between clusters and also induces computational problems: a singular or near-singular matrix, often referred to as "ill-conditioned", causes trouble in many statistical data analyses. To solve this issue, a condition on the variability of the indexes is imposed on each combination of indexes to be considered in the clustering process, where the combination is arranged as a matrix Y of shape (number of curves) × (number of indexes).
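The exact variability condition is not reproduced in the text above, so the following check is only a plausible stand-in: discard a combination of indexes whenever some column of Y is near-constant, which is precisely what makes the index matrix ill-conditioned.

```python
import numpy as np

def has_enough_variability(Y, tol=1e-4):
    """Stand-in variability filter (the paper's exact condition is not
    reproduced here): accept the index matrix Y (n_curves x n_indexes)
    only if every column has a standard deviation above tol, i.e. no
    index is near-constant across the sample."""
    return bool((Y.std(axis=0) > tol).all())
```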
Choice of the multivariate clustering method (E3). We have considered the following: hierarchical clustering techniques using different criteria for calculating similarity between clusters (single linkage, complete linkage, average linkage, centroid linkage), and Ward's method. Besides, we have also considered the k-means method and its variants based on a feature space induced by a kernel, namely kernel k-means (kkmeans), spectral clustering (spc) and support vector clustering (svc).
For hierarchical methods, the Euclidean distance has been considered. When implementing k-means, both Euclidean and Mahalanobis distances have been used. In order to apply the Mahalanobis distance, the data are rescaled using the Cholesky decomposition of the variance matrix before running k-means with the Euclidean distance (see Redko et al. (2019)). Moreover, when the method uses a kernel space we have applied three types of kernels: Gaussian, polynomial and linear.
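The Cholesky rescaling can be sketched as follows: whitening the data with the Cholesky factor of the covariance matrix makes the Euclidean distance between the transformed points equal to the Mahalanobis distance between the original ones, so a standard Euclidean k-means can then be run unchanged.

```python
import numpy as np

def whiten(Y):
    """Rescale Y so that Euclidean distances on the result equal
    Mahalanobis distances on the original data.

    If S = L L^T is the Cholesky factorization of the covariance
    matrix, then ||L^{-1}(y - y')|| is the Mahalanobis distance
    between y and y'."""
    S = np.cov(Y, rowvar=False)
    L = np.linalg.cholesky(S)
    return np.linalg.solve(L, (Y - Y.mean(axis=0)).T).T
```

The whitened data have identity covariance, which is exactly what makes plain k-means behave as Mahalanobis k-means.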
In summary, a clustering partition is obtained once we choose the data to be considered (E1), the combination of indexes to apply (E2) and the multivariate clustering method (E3).
Once the methodology has been described, it is necessary to apply an external validation criterion in order to evaluate how it works. We have considered three different multivariate clustering evaluation criteria to rank the results of the algorithms: Purity, F-measure and the Rand Index (RI), which are fully explained in Manning et al. (2009) and Rendón et al. (2011).
Purity of a cluster measures the fraction of the cluster size accounted for by the most repeated class. Because of that, purity scores positively the case in which there is a unique group in the clustering partition, while the F-measure penalizes it by focusing on the overlap between the obtained and the real classifications: a small F-measure value indicates that the clustering partition has only one class. The Rand Index applies a different approach: instead of counting single elements, RI counts pairs of elements that are correctly or incorrectly classified. All these indexes take values in [0, 1], and the higher the value, the better the classification.
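Two of the three criteria are easy to state precisely in code (the F-measure, which needs per-class precision and recall, is omitted to keep the sketch short):

```python
import numpy as np
from itertools import combinations

def purity(pred, true):
    """Fraction of points belonging to the majority class of their cluster."""
    total = 0
    for c in np.unique(pred):
        total += np.bincount(true[pred == c]).max()
    return total / len(true)

def rand_index(pred, true):
    """Proportion of pairs on which the two partitions agree
    (placed together in both, or separated in both)."""
    agree, pairs = 0, 0
    for i, j in combinations(range(len(true)), 2):
        agree += (pred[i] == pred[j]) == (true[i] == true[j])
        pairs += 1
    return agree / pairs
```

Note that both scores are invariant to relabeling the clusters: a perfect partition with swapped labels still scores 1, while a one-cluster partition of two balanced classes gets purity 0.5 and RI 1/3.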

The benchmark for clustering functional data
There are several studies on clustering functional data, which show the interest this topic arouses. Here, we have considered two recent works: the distance-based k-means procedure (functional k-means) appearing in Martino et al. (2019) and the test-based k-means from Zambom et al. (2019).
In Martino et al. (2019), a clustering procedure is proposed based on k-means with the generalized Mahalanobis distance d_ρ, previously defined in Ghiglietti and Paganoni (2017), where the value of ρ has to be set in advance.
Let X_i, X_j, i, j ∈ N, be two realizations of a stochastic process X. Then

d_ρ(X_i, X_j) = sqrt( Σ_{l=1}^∞ d²_{M,l}(X_i, X_j) h_l(ρ) ),

where d_{M,l} stands for the contribution of the Mahalanobis distance to this generalized one, defined as

d²_{M,l}(X_i, X_j) = ⟨X_i − X_j, ϕ_l⟩² / λ_l.

Here, ϕ_l and λ_l stand for the eigenfunctions and eigenvalues of the covariance kernel, and h_l(ρ) is a sequence of functions, indexed by a given value ρ, that makes the generalized Mahalanobis distance verify the conditions for being a distance. These functions involve a function g_ρ(s), introduced to deal with the situation in which d_{M,l}(X_i, X_j) exp(−λ_l s) is finite but not integrable for every s.
The values of ρ considered in their paper, and which we have considered in our simulation study for comparison, are ρ_1 = 0.001, ρ_2 = 0.02, ρ_3 = 1, ρ_4 = 100 and ρ_5 = 10^8. The data are smoothed with a B-spline basis before applying k-means. Besides, they compare the performance of their procedure with the results of k-means applied with the truncated Mahalanobis distance d_K, which fixes a number K of principal components and is defined as

d_K(X_i, X_j) = sqrt( Σ_{l=1}^K d̂²_{M,l}(X_i, X_j) ),

where d̂²_{M,l} stands for the empirical version of d²_{M,l}. They also compare their results to those obtained from k-means applied with the L2 distance, defined as follows (see Horváth and Kokoszka (2012b)):

d(X_i, X_j) = sqrt( ∫_I (X_i(t) − X_j(t))² dt ),

where I is a compact interval. They have developed the 'gmfd' R package associated with the cited paper, which will be used to compare the two methodologies.

Zambom et al. (2019) propose a methodology based on a hypothesis test applied within k-means, where the cluster centers are initialized in four different ways: at random, with one iteration of k-means, with one iteration of a hierarchical method (Ward's method with Euclidean distance) or with one iteration of k-means++ (Vassilvitskii and Arthur (2006)).
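As a point of reference, the L2 distance between two discretized curves can be approximated with the trapezoidal rule; this is a generic sketch, independent of the 'gmfd' implementation:

```python
import numpy as np

def l2_distance(x, y, t):
    """Trapezoidal-rule approximation of the L2 distance
    d(x, y) = sqrt( integral over I of (x(t) - y(t))^2 dt )
    for curves x, y sampled at the grid points t."""
    d2 = (x - y) ** 2
    return np.sqrt(np.sum((d2[1:] + d2[:-1]) / 2 * np.diff(t)))
```

For example, for x(t) = t and y(t) = 0 on [0, 1] the exact distance is sqrt(1/3).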
At each step of the k-means algorithm, the allocation of each curve to a cluster is based on a hypothesis test performed as the combination of two test statistics.
Let X_i(t_i^j), i = 1, ..., n, j = 1, ..., r_i, be the realizations of a stochastic process X, and {c_p(t_i^j), p = 1, ..., K, i = 1, ..., n, j = 1, ..., r_i} the set of estimated values of the cluster centers at grid point t_i^j, where K is the previously selected number of clusters. The first statistic measures the proximity between the curve and the cluster centers by testing for parallelism. The residuals ξ_i^{jp} of the ith curve with respect to the pth cluster center are computed, and (ξ_i^{jp}, t_i^j) are considered as data from a one-way ANOVA design, with ξ_i^{jp} being the observation at "level" t_i^j. As two or more observations per factor level are required, each cell t_i^j is augmented by including the ξ_i^{jp} corresponding to the (m − 1)/2, m odd, nearest grid points on either side.
The window size m determines the number of neighbors included in each augmented ANOVA level. Then, the test statistic for parallelism is computed as the absolute value of the standardized ANOVA statistic. The second statistic tests for equality of means using the t test statistic for differences in averages, where V is an unbiased estimator of the variance of the involved averages.
Finally, they propose allocating the ith curve to the pth cluster center by combining the two previously defined statistics, where I(·) denotes the indicator function and γ determines the rejection-tail threshold for the test statistics.
Selecting the number of clusters prior to the application of any clustering technique is an interesting topic that is still open. These two recent techniques fix the number of clusters in advance. Choosing the number of clusters is a limitation that has been studied, and it remains a work in progress to be further developed. Moreover, these techniques are computationally expensive, which is another important limitation. In order to find a competitive and faster alternative, in the following section we compare our methodology to these two.

Simulation study
We have carried out a simulation study in order to evaluate our methodology and to compare it with the functional k-means and the test-based k-means strategies.
Each simulated scenario consists of a previously known number of groups coming from different processes. Each scenario is simulated 100 times for each of the three methodologies. In each iteration, we compute the average of the three validation criteria (Purity, F-measure and Rand Index) and the mean execution time of all the methods. We have divided this section into two parts depending on the previously known number of clusters.
The code necessary to reproduce the simulations is available at https://github.com/bpulidob/Functional-clustering-via-multivariate-clustering.

Simulation study A: Two clusters
Two different groups of simulation scenarios will be studied in this section. The first one consists of eight different scenarios that have been previously considered in Flores et al.
First we describe how we simulate the data of the first group of scenarios. Consider eight functional samples defined on [0, 1], which have continuous trajectories in that interval and which are the realizations of a stochastic process X. Each curve has 30 equidistant observations in the interval [0, 1]. We generate 100 functions, 50 from Model 1 and 50 from Model i, i = 2, ..., 9, obtaining eight different functional data sets.
Model 1. This is the set of functions considered in all eight data sets for generating the first 50 functions. It is generated by a Gaussian process X(t) = E_1(t) + e(t), where E_1(t) is the mean function and e(t) is a centered Gaussian process with covariance Cov(e(t_i), e(t_j)) = 0.3 exp(−|t_i − t_j|/0.3). The rest of the models are obtained from the first one by perturbing the generation process.
The first three models contain changes in the mean, while the covariance matrix does not change.Changes in the mean are presented in increasing order from Model 2 to Model 4.
The next two samples are obtained by multiplying the covariance matrix by a constant.
Model 7. This set is obtained by adding to E_1(t) a centered Gaussian process h(t) whose covariance is given by Cov(h(t_i), h(t_j)) = 0.5 exp(−…). The next two samples are obtained with a different mean function.
From now on, the eight resulting data sets will be referred to as scenarios, where S 1-2 refers to the combination of Models 1 and 2, S 1-3 to the combination of Models 1 and 3, and so on.
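Curves of the Model 1 type can be simulated directly from the finite-dimensional Gaussian distribution on the grid. Both the mean function E_1(t) = 30 t (1 − t)^{3/2} and the exponential covariance 0.3 exp(−|s − t|/0.3) below are assumed forms for illustration; the paper takes the exact models from Flores et al.

```python
import numpy as np

def gaussian_sample(n, m=30, scale=0.3, length=0.3, seed=0):
    """Draw n curves X(t) = E1(t) + e(t) on m equidistant points of [0, 1],
    where e(t) is a centered Gaussian process with exponential covariance
    scale * exp(-|s - t| / length). E1 is an assumed mean function."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, m)
    E1 = 30 * t * (1 - t) ** 1.5                     # assumed mean function
    cov = scale * np.exp(-np.abs(t[:, None] - t[None, :]) / length)
    e = rng.multivariate_normal(np.zeros(m), cov, size=n)
    return t, E1 + e
```

Perturbed models (shifted means, rescaled covariances) are then obtained by changing `E1` or `scale`, mirroring how Models 2 to 9 perturb Model 1.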
We smooth the data using a cubic B-spline basis in order to remove noise and to use the derivatives of the data (S1 in Figure 1). Then, we simulate each of the scenarios 100 times and apply, each time, our whole clustering strategy (S1-S4 in Figure 1). The three choices of the process (E1, E2, E3 in Figure 2) are carried out for each simulated data set. The mean Purity, F-measure and Rand Index (RI), which are used as criteria to choose the best model, as well as the execution time (ET), are shown for scenario S 1-4 in Table 2. The rest of the tables are deferred to the Supplementary Material. In these tables, each row summarizes the process carried out for the 100 realizations. An important fact is that the first and second derivatives of the two groups are the same because of the nature of the functions; thus, in this scenario the first and second derivatives are not important for classification when they are considered together. Besides, the combinations including the original data turn out to be more informative (Table 2), where the best result, with a RI of 0.860, is obtained by applying support vector clustering to the epigraph and hypograph indexes of the original data (svc..EIHI).
These results are compared to those obtained by applying the functional k-means and test-based k-means techniques, which are shown in Tables 3 and 4, respectively. The L2 distance provides the best RI, 0.847, which is close to that of the methods with a small value of ρ. Nevertheless, when considering ρ equal to 0.02, the execution time is double that of the L2 distance. When applying test-based k-means, the method is not able to distinguish between the two groups, since all the considered metrics take values close to 0.5 in all cases.
Our methodology, besides obtaining the best values in terms of metrics, is also the fastest: in this case, more than 300 times faster than the best functional k-means configuration. The difference in execution times is key to claiming that the proposed methodology is a very good alternative to the existing procedures for clustering functional data.
Results for the other seven scenarios, whose tables appear in the Supplementary Material, are competitive in terms of metrics and always achieve better execution times.
On the other hand, in order to extend the simulation study, data generated as explained in Martino et al. (2019) are considered, since this simulation setting was specifically created for testing clustering techniques for functional data.
This type of simulated data consists of two functional samples defined on [0, 1], with continuous trajectories, generated by independent stochastic processes in L2(I). In this case, each curve has 150 equidistant observations in the interval [0, 1]. We generate 100 functions, 50 from Model 10 and 50 from Model i, i = 11, 12, obtaining two different functional samples. These two scenarios will be referred to as S 10-11 and S 10-12, respectively.
The three different models defined for these simulations are explained below.
Model 10. The first 50 functions are generated as X(t) = E_2(t) + Σ_{k=1}^{100} Z_k √ρ_k θ_k(t), where E_2(t) = t(1 − t) is the mean function, {Z_k, k = 1, ..., 100} are independent standard normal variables, and {ρ_k, k ≥ 1} is a sequence of positive real numbers whose values are chosen to decrease more quickly when k ≥ 4 in order to have most of the variance explained by the first three principal components.
The sequence {θ_k, k ≥ 1} is an orthonormal basis of L2(I). The next two models are defined in the same way, changing in each case the term added to E_2(t) in Model 10. Moreover, the standard normal variables generated for these two models differ from those of Model 10.
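A Karhunen-Loève-style generator matching this description is sketched below. The specific eigenvalue sequence (ρ_k = 1/(k+1) for k ≤ 3 and 1/(k+1)² afterwards) and the Fourier basis θ_k are assumptions for illustration, since Martino et al. (2019) give the exact choices:

```python
import numpy as np

def model10_sample(n, m=150, K=100, seed=0):
    """Curves X(t) = t(1 - t) + sum_k sqrt(rho_k) Z_k theta_k(t) on a
    grid of m points. rho_k and theta_k below are assumed forms: the
    eigenvalues decrease faster from k = 4 on, so the first three
    components carry most of the variance."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, m)
    k = np.arange(1, K + 1)
    rho = np.where(k <= 3, 1.0 / (k + 1), 1.0 / (k + 1) ** 2)
    # sqrt(2) sin(k pi t) is an orthonormal basis of L2[0, 1]
    theta = np.sqrt(2) * np.sin(np.pi * k[:, None] * t[None, :])
    Z = rng.standard_normal((n, K))
    return t, t * (1 - t) + (Z * np.sqrt(rho)) @ theta
```

Models 11 and 12 would then be obtained by adding a different term to the mean E_2(t) while keeping the same expansion structure.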
As before, we smooth the data with a cubic B-spline basis to remove noise and to be able to use the first and second derivatives of the data.
When the data simulated from S 10-12 are smoothed, the curves intersect frequently, and so do their derivatives (Figure 6). When applying the epigraph and the hypograph indexes to these sets of curves, the difference between groups is again negligible (Figure 7). Nevertheless, looking at Figure 8, when applying the MEI the difference between groups becomes much clearer. The best result in Table 5 is achieved by applying kernel k-means with a polynomial kernel to the generalized epigraph index of the first and second derivatives, obtaining a RI of 0.919. Moreover, when applying the same technique to the same set of data (first and second derivatives) but adding the epigraph and hypograph indexes, the same RI is obtained. This combination is not shown since it involves six different variables.
These results are compared to those obtained by applying the functional k-means procedure (Table 6). In this case, the best distance is the Mahalanobis distance with a large value of ρ, ρ = 10^8, obtaining a RI of 0.718, which is small compared to the value of 0.919 obtained with our strategy. Besides, our methodology spends 0.00423 seconds per iteration, while their procedure spends 7.9055 seconds. When applying test-based k-means to this type of simulated data, the results in Table 7 show that none of the initialization strategies is able to distinguish between the groups, with values close to 0.5 for all metrics.
We conclude that the new methodology obtains the best results in terms of both metrics and execution time.

Simulation Study B: Three clusters
In this case, we consider three different scenarios coming from three different groups. This simulation study previously appeared in Zambom et al. (2019). The nine different models (Models 13 to 21) involve curve-specific random coefficients ε_1 ∼ N(2, 0.4²) and ε_2 ∼ N(2, 0.4²). This way, S 13-14-15 is composed of 50 functions from Model 13, 50 functions from Model 14 and 50 from Model 15; S 16-17-18 is composed of 50 functions from each of Models 16, 17 and 18; and S 19-20-21 is created in the same way from Models 19, 20 and 21.
The data considered for S 13-14-15 are shown in Figure 9. We observe that the functions in green and red intertwine considerably. Moreover, when applying the epigraph and the hypograph indexes (Figure 10) there is a clear difference between two of the groups, but the green one overlaps with the other two. Nevertheless, when considering the modified epigraph index (Figure 11), the differences between the three groups become much more evident. Applying our methodology using k-means, with either the Euclidean or the Mahalanobis distance, to the generalized epigraph index of the first and second derivatives yields the best result in Table 8 (RI = 0.983 and ET = 0.003 seconds for both distances). When considering functional k-means, the best method in Table 9 is the one with a small value of ρ, ρ = 0.001 (RI = 0.928 and ET = 6.50802 seconds). Test-based k-means also attains accurate results. In summary, the three methodologies provide accurate results, but the proposed procedure is the one obtaining the best values in terms of metrics and execution time.
Results obtained for S 16-17-18 and S 19-20-21 are shown in the Supplementary Material. For these two scenarios, our methodology attains the best execution time in both cases, obtaining competitive results in terms of metrics.
As reflected in Figure 12, the shapes of the two groups are different, and when applying the hypograph and epigraph indexes (Figure 13) and the modified version (Figure 14), the two groups show different behaviours even though the obtained values appear to overlap. The best result for our methodology is obtained with support vector clustering initialized with kernel k-means, applied to the modified epigraph index of the original data and its second derivatives. The obtained Rand Index is 0.719, while the F-measure takes a smaller value, 0.510. This means that, while the final configuration of groups is accurate, some groups are better classified than others. For example, the Pacific area cluster contains 6 elements when the correct number is 5; at first sight this seems a good classification, but only two of its elements are correctly classified (Table 15). In general, this classification obtains results close to the real ones (ET = 0.10655, Table 14).
When considering the functional k-means procedure (Table 16), the best result is obtained with the truncated Mahalanobis distance (RI = 0.784, F-measure = 0.613). For test-based k-means (Table 17), the best result is obtained with a hierarchical clustering initialization.

Choosing the number of clusters

In Sections 5 and 6, the number of clusters was set in advance. Nevertheless, choosing the correct number of clusters before applying a clustering technique is a challenge. In Martino et al. (2019) and Zambom et al. (2019), the number of clusters is fixed before performing the classification.
To overcome this problem, we have considered the Silhouette index. Let x_i be one of the considered points, let a(x_i) be the average distance of x_i to all other points in its cluster, and let b(x_i) be the lowest average distance of x_i to the points of any cluster of which x_i is not a member. Then

s(x_i) = (b(x_i) − a(x_i)) / max{a(x_i), b(x_i)}.

The silhouette index ranges from −1 to 1, where a positive value means that the object is well matched to its own cluster and a negative value means that it is badly matched to its own cluster. The average silhouette gives a global measure of the quality of the clustering configuration: the more positive, the better the configuration. Thus, we choose the number of clusters as the one providing the greatest average silhouette.
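The silhouette criterion can be sketched as follows, using pairwise Euclidean distances on the multivariate index features; candidate numbers of clusters are then compared by their average silhouette. This is a simplified sketch, not the exact implementation used in the paper:

```python
import numpy as np

def mean_silhouette(Y, labels):
    """Average silhouette s(x) = (b(x) - a(x)) / max(a(x), b(x)) over
    all points, with a = mean distance to own cluster and b = lowest
    mean distance to any other cluster."""
    n = len(Y)
    D = np.sqrt(((Y[:, None] - Y[None, :]) ** 2).sum(-1))  # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        a = D[i, own & (np.arange(n) != i)].mean() if own.sum() > 1 else 0.0
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()
```

Partitioning two well-separated blobs into their two natural groups scores close to 1, while splitting each blob further drives the average silhouette down, which is what lets the criterion pick the number of clusters.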
We simulate 100 times each of the scenarios considered in Section 5 and use the average silhouette to obtain the optimal number of clusters. Table 18 reports the number of times each candidate number of clusters is selected. In this case we have considered numbers between 2 and 6, but the list could be any other. When the real number of clusters is two, the correct number of clusters is always obtained; thus, we can conclude that this procedure works well for two clusters. However, this strategy is not consistent for three clusters, and it is necessary to look for an alternative.
We have also considered 30 different indexes for multivariate clustering available in the R package 'NbClust', which are fully explained in Charrad et al. (2012). With all these indexes, we have not found techniques that are consistent for both two and three clusters. Thus, this is still an open research line.
Schmutz et al. (2020) present a new strategy for clustering functional data based on applying model-based techniques after a principal component analysis is performed. In this paper we propose a new technique for clustering functional data based on the use of the epigraph and the hypograph indexes and their modified versions. The idea is to transform a functional data problem into a multivariate one and then use the well-known techniques for multivariate clustering.
Each clustering strategy is denoted by (a).(b).(c), where (a) represents the name of the considered strategy (a hierarchical method, k-means, support vector clustering, kernel k-means or spectral clustering), and (b).(c) represents the choice of data and indexes, as represented in Table 1, where (b) is the name of the employed data and (c) the indexes applied to the corresponding data. The smoothed functions and the first and second derivatives of data generated from S 1-4 are shown in Figure 3. It is clear that, in this case, the original data better discriminate the two clusters. When the epigraph and the hypograph indexes (Figure 4) and the modified epigraph index of different combinations of variables (Figure 5) are applied, this is also noticeable, since the combinations which better distinguish between the two clusters are those including the original data. In these figures only the combinations of two variables are shown; however, combinations of up to nine variables are possible.

Figure 3: A sample generated from S 1-4. Original data, first and second derivative curves.

Figure 4: Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the original data simulated from Models 1 and 4 (left panel), first derivatives (center panel) and second derivatives (right panel).

Figure 5: A sample generated from S 1-4. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel).

Figure 6: A sample generated from S 10-12. Original data, first and second derivative curves.

Figure 7: Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the original data simulated from Models 10 and 12 (left panel), first derivatives (center panel) and second derivatives (right panel).

Figure 8: A sample generated from S 10-12. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel).

Figure 10: Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the original data simulated from Models 13, 14 and 15 (left panel), first derivatives (center panel) and second derivatives (right panel).

Figure 11: A sample generated from S 13-14-15. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel).

Figure 12: Growth curves (girls in green and boys in blue) for the original data (left panel), the first derivatives (center panel) and the second derivatives (right panel).

Figure 13: Scatter plots of the epigraph index (EI) and the hypograph index (HI) of the growth curves' original data (left panel), first derivatives (center panel) and second derivatives (right panel).

Figure 14: Growth curves. Scatter plots of different combinations of MEI. Original data and first derivatives (left panel), original data and second derivatives (center panel) and first and second derivatives (right panel).

Figure 15: Canadian weather curves. Original data, first and second derivative curves.

Figure 16: Canadian weather curves. Epigraph and hypograph index on the original data (left panel), the generalized epigraph index on the original data and first derivatives (center panel) and the generalized epigraph index on the first and second derivatives (right panel).
Table 18: Distribution of the number of clusters suggested when applying Silhouette for each scenario simulated 100 times.

8 Discussion

In this paper, we propose a new methodology for clustering functional data that is competitive with respect to the existing ones and significantly better in terms of execution time. Our methodology is based on converting a functional problem into a multivariate one through the use of the epigraph and hypograph indexes, their generalized versions, and multivariate clustering techniques. It has been compared to two recent procedures for clustering functional data, outperforming them in most cases and in all cases with regard to execution time. Finally, the code needed to carry out this analysis and to apply our technique is available in the GitHub repository: https://github.com/bpulidob/Functional-clustering-via-multivariate-clustering. In the new proposal, we have set the number of clusters in advance. A strategy for choosing the number of clusters has nevertheless been tried in Section 7, without obtaining a technique that is consistent for both two and three clusters. Thus, setting the number of clusters prior to applying the clustering technique remains an open question for further research.
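The overall pipeline described above can be sketched end to end under illustrative assumptions: finite-difference derivatives in place of the smoothed ones, the modified epigraph index as the only feature per block of information, and k-means as the multivariate clustering step. Function names and the simulated curves are chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def mei(X):
    """Modified epigraph index: average proportion of time the sample curves
    lie above each curve (rows = curves on a common grid)."""
    n = X.shape[0]
    return (X[None, :, :] >= X[:, None, :]).mean(axis=2).sum(axis=1) / n

def cluster_curves(X, n_clusters=2, seed=0):
    # Approximate first and second derivatives by finite differences.
    dX = np.gradient(X, axis=1)
    d2X = np.gradient(dX, axis=1)
    # Multivariate representation: one index per block of information.
    features = np.column_stack([mei(X), mei(dX), mei(d2X)])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)

# Two groups of noisy sine curves shifted vertically; the index on the
# original curves separates them.
t = np.linspace(0, 1, 50)
rng = np.random.default_rng(2)
X = np.vstack([np.sin(2 * np.pi * t) + rng.normal(0, 0.1, (20, 50)),
               np.sin(2 * np.pi * t) + 3 + rng.normal(0, 0.1, (20, 50))])
labels = cluster_curves(X)
```

Since each curve is reduced to a few index values before clustering, the multivariate step is cheap, which is consistent with the execution-time advantage reported above.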
In functional k-means, each row represents a different distance: generalized Mahalanobis distance (dρ), truncated Mahalanobis distance (dk) and Euclidean distance (L2). In test based k-means, each row represents Euclidean distance (pink), a Gaussian kernel (yellow), a polynomial kernel (blue), kernel k-means for initialization (green) and k-means for initialization (orange).

Table 3: Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure (Martino et al. (2019)) with truncated Mahalanobis distance, generalized Mahalanobis distance and L2 distance applied to simulated data from S 1-4.

Table 5: … kernel k-means for initialization (green) and k-means for initialization (orange).

Table 6: Mean values of Purity, F-measure, Rand Index and execution time for the functional k-means procedure (Martino et al. (2019)) with truncated Mahalanobis distance, generalized Mahalanobis distance and L2 distance applied to simulated data from S 10-12.

Table 7: Mean values of Purity, F-measure, Rand Index and execution time for the test based k-means procedure (Zambom et al. (2019)) with four different initializations applied to simulated data from S 10-12. For Model 21 (X21), the result in Table 10 is obtained when initializing the process with k-means++ (RI = 0.944 and ET = 0.99653).

When applying functional k-means (Table 12), the greatest Purity coefficient, equal to 0.850, is obtained with a large value of ρ (ρ = 1e+08). Besides obtaining better metric coefficients, our methodology reaches an execution time almost 400 times smaller than that of the functional k-means strategy. Furthermore, the test based k-means technique (Table 13) obtains its best result with k-means initialization, reaching a Purity coefficient of 0.817. In summary, our methodology obtains the best results in terms of the three metrics as well as execution time.
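The evaluation metrics used throughout these comparisons can be computed from a contingency table of true versus predicted labels; the following is a minimal sketch with toy labels chosen for the example (the Purity definition is the standard majority-class one).

```python
from sklearn.metrics import rand_score
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    """Fraction of points assigned to the majority true class of their cluster."""
    cm = contingency_matrix(y_true, y_pred)  # rows = true classes, cols = clusters
    return cm.max(axis=0).sum() / cm.sum()

# Toy labels: one point of the first class ends up in the wrong cluster.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
p, r = purity(y_true, y_pred), rand_score(y_true, y_pred)
print(round(p, 3), round(r, 3))  # 0.833 0.667
```

Both metrics take values in [0, 1] with 1 indicating a perfect match, which is why higher Purity and Rand Index values are read as better clusterings in the tables above.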

Table 11: Mean results for the growth data set considering Euclidean distance (gray), …