Supervised classification of curves via a combined use of functional data analysis and tree-based methods

Technological advancement has led to the development of tools that collect vast amounts of data, usually recorded at time stamps or arriving over time, e.g. data from sensors. Common ways of analysing this kind of data also involve supervised classification techniques; however, despite constant improvements in the literature, learning from high-dimensional data is always a challenging task due to many issues such as, for example, dealing with the curse of dimensionality and looking for a trade-off between complexity and accuracy. Nowadays, research in functional data analysis (FDA) and statistical learning is very active in addressing these drawbacks. This study offers a supervised classification strategy that combines FDA and tree-based procedures. Specifically, we introduce functional classification trees, functional bagging, and functional random forest, exploiting the functional principal components decomposition as a tool to extract new features and build functional classifiers. In addition, we introduce new tools to support the understanding of the classification rules, such as the functional empirical separation prototype, the functional predicted separation prototype, and the leaves' functional deviance. Furthermore, we suggest some possible solutions for choosing the number of functional principal components and functional classification trees to be implemented in the supervised classification procedure. This research aims to provide an approach to improve the accuracy of the functional classifier, serve the interpretation of the functional classification rules, and overcome the classical drawbacks due to the high-dimensionality of the data. An application on a real dataset regarding daily electrical power demand shows the functioning of the supervised classification proposal. A simulation study with nine scenarios highlights the performance of this approach and compares it with other functional classification methods.
The results demonstrate that this line of research is exciting and promising; indeed, in addition to the benefits of the suggested interpretative tools, we exceed the previously established accuracy records on a dataset available online.


Introduction
Today, dimensionality reduction and classification techniques are among the most used strategies for dealing with the enormous amount of data we can collect every day. The reason is that technological progress has led to the evolution of instruments to store and manage vast amounts of data continuously. Indeed, we can gather data through phones, computers, or sensors with different purposes, e.g. to monitor the environment, health, temperature, pressure, and earthquakes. Hence, the hunt for statistical methodologies to examine this variety of data is crucial. For this reason, dimensionality reduction and supervised and unsupervised classification techniques for high-dimensional data have assumed an increasingly important role in many sectors such as medicine, multimedia processing, environmental monitoring, industrial quality control, speech processing, robotics, and many other fields.
Despite constant improvements in the literature, learning from high-dimensional data is always a challenging task due to many issues: the sampling units are often observed at a finite set of time points that may be irregularly spaced and differ across individuals; computation may be time-consuming and algorithm convergence difficult because of possible local minima; a trade-off must be sought between complexity/interpretability and accuracy; and the curse of dimensionality. All these issues are of great relevance, but the last-mentioned drawback certainly deserves special attention as the number of observations grows. Generally, the difficulties in classification caused by high-dimensional data are referred to as the curse of dimensionality (COD). In today's big data world, COD can refer to several potential issues that arise when data has many dimensions. The most common consequences are data sparsity, difficulty selecting a model (and interpreting causal relationships) among multiple possibilities, multicollinearity, and distance concentration.
One of the most widespread approaches to dealing with high-dimensional data and resolving some of these drawbacks is functional data analysis (FDA). The theory and practice of statistical methods in situations where temporal sequences of data can be suitably represented by functions (instead of real numbers or vectors) are often referred to as FDA (Ramsay and Silverman 2002;Ferraty and Vieu 2003;Ramsay and Silverman 2005;Ferraty 2011). This topic has become very popular during the last decades and is now a major research field in statistics. Dealing with functional data significantly impacts statistical thinking and methods, changing how we represent, model, and predict data. The basic idea of FDA is to deal directly with the function generating the data instead of the sequence of observations, and thus to treat observed data functions as single entities (Ramsay and Silverman 2005).
All the benefits and motivations of using FDA have been highlighted in the extensive and recent literature, e.g. the use of instruments such as derivatives (Ferraty and Vieu 2006), the advantage of a non-parametric approach without very restrictive assumptions (Cuevas 2014), and the possibility to reduce the dimensionality of the data and exploit additional critical sources of pattern and variation (Ramsay and Silverman 2005). Therefore, in recent years, we are witnessing a solid development of the methodological literature on FDA that seeks to reproduce, in a functional key, a large part of traditional statistics (see e.g. Ramsay and Silverman 2005;Ferraty and Vieu 2006;Aguilera and Aguilera-Morillo 2013;Bongiorno and Goia 2019;Maturo et al. 2019b). In addition, there is continuous growth of original applications, as well as investigations of high-dimensional time-series and proposals to solve specific problems in peculiar contexts exploiting functional tools (see e.g. Zanin Zambom et al. 2018;Maturo 2018;Fortuna et al. 2018;Maturo et al. 2019c, a;Carcenac and Redif 2019).
Based on the technique used to represent the functional data, many possible solutions may exist to reduce the dimensionality of the data and represent curves. The functional principal component decomposition (FPCD) is one of the most used approaches, and it is considered in many studies (see e.g. Ferraty and Vieu 2003;Ramsay and Silverman 2005;Ocana et al. 2007; Febrero-Bande and de la Fuente 2012; Aguilera and Aguilera-Morillo 2013). FPCD allows us to display the functions by a linear combination of a reduced set of functional principal components (FPCs), and thus the functional data can be rewritten as a decomposition on an orthonormal basis by maximizing the variance. The advantage of this approach is that it finds a lower-dimensional representation, preserving the maximum amount of information from the original data. Recently, many traditional statistical techniques have been extended to functional data by exploiting this dimensionality reduction technique.

Background
Among the most studied topics in the FDA literature, there are certainly those of unsupervised and supervised classification. Notably, in this paper, we focus on supervised learning from high-dimensional data treated with the FDA approach. Functional supervised classification is a very attractive methodological matter that consists of creating a classification rule based on the curves observed on a training set. Clearly, the labels of this grouping variable are known in the training phase. The goal, as always in supervised classification, is to predict the classes of new curves, for which the group labels are unknown, with the best possible accuracy. A second objective is to adequately interpret the generated classification rule and assess feature importance.
In the literature regarding supervised classification in the FDA framework, many approaches have been proposed, e.g. Logistic Classifier, k-Nearest Neighbour Classifier, Maximum Depth Classifier, Kernel Classifier (see e.g. Febrero-Bande and de la Fuente 2012). Despite continuing developments on the subject, to date, the fields of supervised and also semi-supervised classification applied to curves are still lively, and we continuously observe new developments in the literature (see e.g. Cuevas et al. 2007;Preda et al. 2007; Aguilera-Morillo et al. 2012;Escabias et al. 2014;Cuevas 2014;Gregorutti et al. 2015;Belli and Vantini 2020).
Therefore, research on this topic is still in progress but, surprisingly, the literature on possible combinations between FDA and tree-based classifiers (Hastie et al. 2009) is underdeveloped, and few investigations are available. Some previous studies dealt with such a problem and followed very different approaches from a methodological and applicative perspective. Yu and Lambert (1999) proposed to adopt spline trees for functional data, with an application to time-of-day patterns for customers who place international calls. Another original idea was proposed by Balakrishnan and Madigan (2006), who suggested building functional decision trees looking for possible candidate splitting curves via clustering. Nerini and Ghattas (2007) focused on the problem of building a regression tree via FDA when the response variable is a probability density function. Fan et al. (2010) dealt with the problem of functional data classification for temporal gene expression data with kernel-induced random forests. Gregorutti et al. (2015) concentrated on the problem of assessing variables' importance when dealing with the combined use of FDA and tree-based methods. Möller et al. (2016) proposed a classification approach based on a random forest procedure for functional covariates by using different mean values computed at different time windows over the whole domain. Haouij et al. (2018) suggested an extension of the random forest approach via a wavelet basis with an application to driver's stress level classification. Rahman et al. (2019) investigated the possibility of building a classifier for dose-response predictions in which the outcome is a curve. Finally, Belli and Vantini (2020) focused on constrained convex optimization to extract multiple weighted integral features from the input functions and determine binary splits of trees trained using functional inputs.
Hence, research on this topic is very lively and promising in several respects. However, there are still many aspects that can be explored and developed, e.g. the improvement of the accuracy of functional classifiers using appropriate features, the introduction of understandable graphical tools for the interpretation of classification rules, the use of appropriate simulation studies, and the creation of other ad-hoc tools to be extended to the functional case, e.g. possible rules to look for the optimal number of functional principal components when dealing with supervised classification tasks.

Aims and scopes
This paper presents a functional supervised classification approach to deal with high-dimensional data in a temporal domain via the combined use of FDA and statistical learning techniques. Of course, the proposed approach can also be extended to the case of functions represented in a domain different from time. Specifically, in this study, we combine FPCs decomposition and tree-based methods focusing on the scalar-on-function classification problem. Thus, we introduce the so-called "Functional Classification Trees" (FCTs), Functional Bagging (FBAG), and Functional Random Forest (FRF).
The main goal is to propose a method to exploit both the benefits of FDA and tree-based techniques (Hastie et al. 2009) to classify high-dimensional data that can be expressed through curves. The idea of using FPCs as features to train an FRF classifier is very appealing both from the perspective of dimensionality reduction and from the point of view of working with uncorrelated features.
Moreover, this research presents new tools to support the interpretation of the functional classification rules and assess the terminal nodes, namely, the functional empirical separation prototype (FESP) and the functional predicted separation prototype (FPSP). The former attempts to immediately interpret the split rule in terms of existing curves, i.e. a prototype which coincides with a truly existing curve and thus with a "true" shape in the time domain.
The third original aspect is to consider the functional variability of the leaves to capture the "impurity" of the final nodes from an additional perspective, different from the classical approach to evaluating impurity. The classical measures of impurity, e.g. the Gini or Shannon indices, consider the distribution of the classes within each node: the more homogeneous the node, the purer it is. The classical impurity measures are used to split the nodes but, additionally, this paper also proposes the functional variability in the final nodes as an instrument to understand whether the curves with the same predicted class in the same terminal node are similar or not in the time domain. High functional variability in a leaf can be a consequence of an impure node in the classical sense but can also be a clue to the presence of sub-patterns, i.e. curves with the same label but presenting different shapes.
The fourth original aspect is given by some proposals to select the number of FPCs and FCTs in the forest. Moreover, after introducing the possibility of evaluating the accuracy of the functional classifier using the functional training set, bootstrap, or cross-validation, this paper focuses on the problem of using a test set in a functional framework.
An application on real data regarding the Daily Italian Electrical Power Demand (DIPD) is presented to illustrate the functioning of the supervised classification proposal, with particular attention to its meaning from a visualization viewpoint. Finally, this study presents a simulation with nine different scenarios to compare different functional classification approaches.
This study highlights interesting results from various perspectives. First, focusing on the classifier's performance in terms of accuracy, we improve on the previous accuracy records for the DIPD dataset. Second, the idea of exploiting the FPCD to extract new features for training a functional classifier is attractive because it purifies the classification rules of noise and unimportant characteristics of curves over the whole domain. The latter aspect leads to a more stable classifier when applied to a test set because it helps reduce overfitting. Then, the depiction of the functional classification trees through our interpretative tools appears very intuitive and helpful in understanding the meaning of each split leading to subsets of the original dataset. Subsequently, intending to reduce the variance of a single functional classification tree and decorrelate functional trees in the perspective of an ensemble, the purpose of extending bagging and random forest to the case of functional data proves to be quite appealing and worthy of further investigation. In summary, this line of research, in our opinion, turns out to be very promising in a world that is increasingly dominated by large amounts of data that we continuously collect from different devices in many fields of application.
This research is structured as follows. Section 2 introduces the basic classical concepts of FDA, and displays functional classification trees, functional bagging, and functional random forest. Section 3 shows an application to the DIPD dataset and deepens the proposed interpretative instruments. Section 4 illustrates a simulation study under nine different scenarios. Section 5 proposes some suggestions to select the number of FPCs and FCTs in the forest. The paper ends with Sect. 6, which contains the discussion, conclusions, and possible future perspectives of this research line.

Functional data analysis (FDA)
The basic idea of FDA is to handle data functions as single objects. Nevertheless, in practical applications, functional data are often observed as series of point data, and thus the function expressed by z = f(x) reduces to a record of discrete observations denoted by the T pairs (x_j, z_j), where x_j ∈ ℝ and z_j are the values of the function computed at the points x_j, j = 1, 2, ..., T (Ramsay and Silverman 2005). Generalizing the reference framework, we consider that a functional variable X is a random variable assuming values in a functional space Ξ, and a functional data set is a sample x_1, ..., x_N, also denoted x_1(t), ..., x_N(t), drawn from a functional variable X (Ferraty and Vieu 2003).
Focusing our attention on the case of a Hilbert space with a metric d(·, ·) associated with a norm, so that d(x_1(t), x_2(t)) = ‖x_1(t) − x_2(t)‖, and where the norm ‖·‖ is associated with an inner product ⟨·, ·⟩, so that ‖x(t)‖ = ⟨x(t), x(t)⟩^{1/2}, we can obtain as a specific case the space L² of real square-integrable functions defined on T by ⟨x_1(t), x_2(t)⟩ = ∫_T x_1(t) x_2(t) dμ(t), where μ is the Lebesgue measure on T. Therefore, if x(t) ∈ L², a basis function system is a set of known functions φ_j(t) that are linearly independent of each other and which span L² (Ramsay and Silverman 2005).
The first step in FDA is to convert the observed values z_{i1}, z_{i2}, ..., z_{iT} for each unit i = 1, 2, ..., N to a functional form, where z_{ij}, j = 1, 2, ..., T, is the observation of the statistical unit i at the instant of time j. The most common approach to estimating the functional datum is the basis approximation. Depending on the characteristics of the curves, various basis systems can be adopted. A common approach is to obtain the functions using a finite representation in a fixed basis system (Ramsay and Silverman 2005) as follows:

x_i(t) = Σ_{s=1}^{S} c_{is} φ_s(t),    (1)

where c_i = (c_{i1}, ..., c_{iS})ᵀ (i = 1, 2, ..., N) is the vector of coefficients defining the linear combination and φ_s(t) is the s-th basis function, from a subset of S < ∞ functions that can be used to approximate the full basis expansion.
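As an illustrative sketch of the fixed-basis fit in Eq. 1, the coefficients c_i can be obtained by least squares on a common grid. The Fourier basis, the toy data, and all names below are assumptions for demonstration, not part of the original study:

```python
import numpy as np

def fourier_basis(t, S):
    """Evaluate S Fourier basis functions phi_s(t) on a grid t in [0, 1]."""
    Phi = np.ones((len(t), S))
    for s in range(1, S):
        freq = (s + 1) // 2
        if s % 2 == 1:
            Phi[:, s] = np.sin(2 * np.pi * freq * t)
        else:
            Phi[:, s] = np.cos(2 * np.pi * freq * t)
    return Phi

# Toy records z_ij: N = 5 curves observed at T = 100 time points
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
Z = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((5, 100))

Phi = fourier_basis(t, S=7)                    # T x S design matrix
C, *_ = np.linalg.lstsq(Phi, Z.T, rcond=None)  # S x N coefficient vectors c_i
Z_hat = (Phi @ C).T                            # smoothed curves on the grid
```

The same scheme applies to any fixed basis (e.g. B-splines) once the design matrix Phi is built.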
Another prevalent approach consists in exploiting a data-driven basis instead of a fixed basis system. The most used technique is the Functional Principal Components (FPCs) decomposition. The latter leads to a dimensionality reduction whilst preserving the maximum amount of information from the original data (Ramsay and Silverman 2005; Aguilera and Aguilera-Morillo 2013; Febrero-Bande and de la Fuente 2012). In this case, the functional data can be approximated as follows:

x_i(t) ≈ Σ_{k=1}^{K} ξ_{ik} f_k(t),    (2)

where K is the total number of FPCs and ξ_{ik} is the score of the generic FPC f_k for the generic function x_i (i = 1, 2, ..., N).
By truncating this representation at the first p FPCs, we can obtain an approximation of the sample curves, whose explained variance is given by Σ_{k=1}^{p} λ_k, where λ_k is the variance of the k-th functional principal component. The FPC approximation is constructed in such a way that the variance explained by the k-th FPC decreases as k increases. Particularly, when dealing with high-dimensional data, this dimensionality reduction technique is essential for explaining the main features of the data by a reduced set of uncorrelated FPCs. This approach is clearly an extension of classical PCA. Indeed, in this context, if we assume that the observed curves are centred so that the sample mean is equal to 0, the i-th FPC scores are given by

ξ_{ik} = ∫_T x_i(t) f_k(t) dt,    (3)

where the weight function f_k is obtained by maximizing the variance, solving

f_k = argmax_f Var[ ∫_T x_i(t) f(t) dt ]    (4)

and

‖f‖² = ∫_T f(t)² dt = 1,  ∫_T f(t) f_j(t) dt = 0,  j = 1, ..., k − 1.    (5)

Proximity measures among statistical units play a critical role in classification. Surely, according to the chosen distance, contrasting results can be achieved. Thus, the choice of a proximity measure depends on the nature of the data and the purpose of the specific research. In the context of FDA, different metrics and semi-metrics can be used; however, limiting our consideration to the case of the L²-space, the most commonly employed distances between functional elements are the following (Ramsay and Silverman 2005; Ferraty and Vieu 2006; Febrero-Bande and de la Fuente 2012; Jacques and Preda 2014). The L²-distance is the most used and can be computed as follows:

d(x_1, x_2) = √( ∫_T (x_1(t) − x_2(t))² w(t) dt ),    (6)

where w(t) is a strictly positive weight function, and the observed points on each curve are equally spaced. Often, the semi-metric of the r-order derivatives of two curves, e.g. x_1(t) and x_2(t), can be considered because it provides useful information depending on the scope of the study. It can be calculated as follows:

d_r(x_1, x_2) = √( ∫_T (x_1^{(r)}(t) − x_2^{(r)}(t))² dt ),    (7)

where x_1^{(r)}(t) and x_2^{(r)}(t) are the r-th derivatives of x_1(t) and x_2(t), respectively.
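A discretised version of the FPC decomposition, its scores, and the explained-variance truncation can be sketched via the SVD of the centred data matrix. The toy data and the 95% cut-off below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 100)
dt = t[1] - t[0]
N = 30
# Toy sample: two dominant modes of variation plus small noise
X = (rng.standard_normal((N, 1)) * np.sin(2 * np.pi * t)
     + 0.5 * rng.standard_normal((N, 1)) * np.cos(2 * np.pi * t)
     + 0.05 * rng.standard_normal((N, 100)))

Xc = X - X.mean(axis=0)                  # centre so the sample mean is 0
U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)

lam = sv ** 2 / (N - 1) * dt             # lambda_k: variance of the k-th FPC
f = Vt / np.sqrt(dt)                     # eigenfunctions with unit L2 norm
scores = Xc @ f.T * dt                   # xi_ik, a Riemann sum for Eq. (3)

explained = np.cumsum(lam) / lam.sum()
p = int(np.searchsorted(explained, 0.95)) + 1  # smallest p with >= 95% variance
```

On this toy example the first two components should capture nearly all the variance, so p stays small while `scores @ f` reconstructs the centred curves.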
Finally, the semi-metric of the FPCs is particularly interesting when researchers need dimensionality reduction and wish to interpret similarity among functional data according to different parts of the domain. Another benefit of this way of computing similarity/dissimilarity among functional objects is that such a measure excludes noise and only considers the most important sources of variability. The semi-metric of the FPCs is given by:

d_p(x_1, x_2) = √( Σ_{k=1}^{p} (ξ_{1k} − ξ_{2k})² ),    (8)

where ξ_{i,k} are the coefficients of the expansion, and f_k is the k-th orthonormal eigenfunction.
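The two distances most relevant here can be sketched in a few lines; the grid, curves, and score vectors below are illustrative assumptions:

```python
import numpy as np

t = np.linspace(0, 1, 100)
dt = t[1] - t[0]
x1 = np.sin(2 * np.pi * t)
x2 = np.sin(2 * np.pi * t + 0.3)

# L2-distance with constant weight w(t) = 1, as a Riemann sum
d_L2 = float(np.sqrt((((x1 - x2) ** 2) * dt).sum()))

# FPC semi-metric between two curves, given their first p = 3 scores
# (xi1, xi2 are assumed to come from an FPC decomposition as above)
xi1 = np.array([1.2, -0.4, 0.1])
xi2 = np.array([0.9, -0.2, 0.3])
d_fpc = float(np.sqrt(((xi1 - xi2) ** 2).sum()))
```

The FPC semi-metric is simply a Euclidean distance in score space, which is why it discards the noise carried by the discarded components.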

Supervised classification via a combined use of FDA and tree-based methods
In the functional classification framework, the aim is to predict the class or label Y of an observation X taking values in a separable metric space (Ξ, d) . Therefore, our approach is designed for functional data of the form {y i , x i (t)} , with a predictor curve x i (t) , t ∈ T , and y i being the (scalar) response value observed at sample i = 1, ..., N . The classification of a new observation x from X is carried out by constructing a mapping f ∶ Ξ ⟶ {0, 1, ..., U} , called a "classifier", which maps x into its predicted label and whose probability of error is given by P{f (X) ≠ Y}.
The continuous domain T can be of different types, such as time, space, or other parameters. In this context, we focus on the time domain, but the approach can be easily extended to other parameters. In theory, the response could be either categorical or numerical, leading to classification or regression problems, respectively. However, in this study, we focus on a scalar-on-function classification problem. In particular, we concentrate on functional classification trees, and thus we combine FDA and tree-based classifiers. In the following subsections, we introduce functional classification trees (FCTs), functional bagging (FBAG), and functional random forest (FRF).


Functional Classification Trees (FCTs)
Decision trees (DTs) are a supervised learning technique that predicts response values by learning decision rules derived from the features. They can be used in both regression and classification contexts; for this reason, they are sometimes also referred to as Classification And Regression Trees (CART). Detailed information on DTs can be found in many previous works (Hyafil and Rivest 1976;Quinlan 1986;Hastie et al. 2009). The starting point of the proposed approach is that DTs can be extended to the FDA framework by exploiting the coefficients of a basis representation as new features to train the functional classifier. The latter method is indicated as the "Functional Classification Tree (FCT)" approach.
It follows that, in the case of a fixed basis system like that in Eq. 1, the features' matrix is given by

C = (c_{is}), i = 1, ..., N, s = 1, ..., S,    (10)

where the generic element c_{is} is the coefficient of the i-th curve (i = 1, ..., N) relative to the s-th (s = 1, ..., S) basis function φ_s(t) involved in the linear combination.
Instead, in the case of a data-driven basis system like that in Eq. 2, the features' matrix is given by

(ξ_{ik}), i = 1, ..., N, k = 1, ..., K,    (11)

where ξ_{ik} is the score of the i-th curve (i = 1, ..., N) relative to the k-th functional principal component f_k(t) (k = 1, ..., K).
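Assuming the FPC score matrix of Eq. 11 is available, a minimal FCT-FPCs can be sketched with an off-the-shelf CART implementation (scikit-learn is one possible choice here; the toy data and names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the N x K score matrix of Eq. 11 and the class labels
rng = np.random.default_rng(2)
scores = rng.standard_normal((30, 5))
y = (scores[:, 0] > 0).astype(int)      # toy labels driven by the first FPC

fct = DecisionTreeClassifier(criterion="gini", random_state=0)
fct.fit(scores, y)                      # an FCT-FPCs: a CART trained on scores

# The root split exposes the FPC index k0 and the score threshold xi_0k0
k0 = int(fct.tree_.feature[0])
threshold = float(fct.tree_.threshold[0])
```

Each internal node of the fitted tree is thus a rule of the form ξ_{ik} < ξ_{0k}, which is exactly the object interpreted in the next subsections.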
The most challenging part of these approaches is the search for a rational interpretation (in terms of functional data) of the classification rules. For this reason, the choice between the two methods depends on several considerations, e.g. the classifier's performance in terms of accuracy and also its interpretability. Moreover, these procedures can be used both when the functions are obtained by smoothing high-frequency data in the time domain and when they depend on other specific parameters. Therefore, the interpretation must also be made accordingly.
In the following sections, we give special attention to Eq. 2 and thus also to Eq. 11. In other words, we consider the data-driven basis method as the "gold standard" for obtaining new features useful to our functional classification purpose. The latter is indicated as the "Functional Classification Trees with Functional Principal Components (FCTs-FPCs)" approach. Eq. 1, and hence Eq. 10, are also quite useful to train a functional classifier; however, given the superiority of the approach based on FPCs, we use the fixed basis system only for a comparison in terms of classification performance in Sect. 3.3. The reason for this choice is widely discussed in Sect. 6. We adopt B-splines as the fixed basis system, and the approach can be referred to as "Functional Classification Trees with B-splines (FCTs-Bsplines)". The FCTs-FPCs approach consists of recursive binary partitions of the feature space into rectangular regions (terminal nodes or leaves) composed of sets of functions x_i(t) ∈ X. To build the FCTs-FPCs, an optimal binary partition is provided at each step of the algorithm, based on the optimization of a cost criterion (e.g. the decrease of node impurity via the Gini index or the Shannon–Wiener index) (Hastie et al. 2009;Therneau et al. 2019). The algorithm begins with the complete functional data set composed of the coefficients (scores) of the FPC decomposition obtained using Eq. 11 and continues until the terminal leaves are obtained. Having obtained the best split in one node, the data are partitioned into two nodes; the rule is replicated to achieve the most suitable binary separation on all resulting nodes. Typically, a large FCT is produced at the beginning, which is then pruned according to an optimization criterion, e.g. focusing on the performance on a test set to look for an acceptable trade-off between complexity and accuracy.
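The grow-then-prune step described above can be sketched, for instance, with cost-complexity pruning selected against a held-out test set (the specific pruning criterion, data, and names are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Toy FPC scores and labels; stand-ins for the matrix of Eq. 11
rng = np.random.default_rng(6)
scores = rng.standard_normal((200, 5))
y = (scores[:, 0] - scores[:, 2] > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(scores, y, random_state=0)

# Grow a large (unpruned) FCT, then enumerate cost-complexity pruning levels
big = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
alphas = big.cost_complexity_pruning_path(Xtr, ytr).ccp_alphas

# Pick the pruning level with the best test-set accuracy
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(Xtr, ytr)
     for a in alphas),
    key=lambda m: m.score(Xte, yte),
)
```

Larger `ccp_alpha` values yield smaller trees, so the selected tree trades a little training fit for better generalisation, mirroring the complexity/accuracy trade-off discussed above.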
Hence, given that the scores of the linear combination are used as new features to predict the response Y, the interpretation of FCTs-FPCs is totally different from that of classical DTs. Indeed, the values of the splits should be interpreted according to the part of the domain that the single FPC f_k(t) mostly represents and the scores' thresholds ξ_{0k} considered. Note that the value "0" (instead of i) in the subscript of ξ_{0k} indicates that the threshold identified for the score relating to a specific FPC f_k(t) becomes a fixed value to separate the set of curves into two subsamples (child nodes). For example, considering the first split rule, i.e. the separation rule of the root node, we have that ξ_{0k_0} is the threshold value related to FPC f_{k_0}(t). Therefore, all the curves satisfying the condition ξ_{ik_0} < ξ_{0k_0} form a subgroup, whereas all the remaining functions, i.e. those satisfying the condition ξ_{ik_0} ≥ ξ_{0k_0}, enter the other subset.
To make the split rule better understandable in the functional context, we introduce two tools: the Functional Predicted Separation Prototype (FPSP) and the Functional Empirical Separation Prototype (FESP). They are both representable as curves, and as such they help to perceive the split rule from a functional perspective.

The Gini index is a measure of heterogeneity for categorical variables: the lower the value of the index, the more homogeneous the observations in a node. It can be computed as follows:

G_r = Σ_{i=1}^{F} f_{ri} (1 − f_{ri}),    (12)

where f_{ri} represents the proportion of training observations in the r-th region that are from the i-th class, and F is the number of modalities of Y. When we are interested in decreasing the impurity of a node, our goal is to reduce the value of this heterogeneity index in the nodes obtained as a result of the split rule. The Shannon–Wiener entropy index is also an index of heterogeneity for categorical variables; thus, the higher the value of the index, the more heterogeneous the observations in a node. It can be calculated as follows:

D_r = − Σ_{i=1}^{F} f_{ri} log f_{ri},    (13)

where f_{ri} represents the proportion of training observations in the r-th region that are from the i-th class.

The FPSPs are given by each theoretical separation rule generated by every single split of the FCT-FPCs:

FPSP_z(t) = Σ_{k ∈ Ω} ξ_{0k} f_k(t),    (14)

where Ω = {k_{z_1}, ..., k_{z_Z}} is the set of the FPCs f_k(t) selected in the classification rule path until the split of the z-th node (z = 1, ..., Z). The generic intermediate node that generates a separation is therefore indicated with z, and the total number of these intermediate nodes is identified with Z. Hence, an FPSP can be associated with every intermediate node and, of course, also with the root node (FPSP_1(t) is the FPSP of the root node). The limit of the FPSPs is that they do not provide, from an interpretative point of view, a real beneficial element for understanding the different split rules. These functional prototypes are very often characterized by very "flat" trends with minimal functional variability. Thus, they do not help to perceive the meaning of the splitting rules in terms of the time domain and of real curves in the data.
For this reason, we propose the FESP. The latter is defined as the curve existing in the training dataset that is closest (based on the FPC semi-metric) to the split rule delivered by the FPSP, and can be defined as follows:

FESP_z(t) = x^{(z)}(t) = argmin_{x_i(t) ∈ X} d(x_i(t), FPSP_z(t)),    (15)

where x^{(z)}(t) is the empirical curve existing in the functional dataset which is the closest to the functional predicted separation prototype of the z-th node. Hence, the FESP is a curve representing the functional empirical split rule produced by the binary partition of a node split, and it is very useful to understand, especially from a graphical perspective, the separation rule over the time domain. It follows that we can detect and plot a FESP_z(t) for any split z existing in our FCT-FPCs. Therefore, we can always identify a FESP_z(t) associated with each FPSP_z(t). This approach helps in understanding the different levels of separation that occurred in the FCT-FPCs.
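A minimal sketch of the FESP computation: given the score matrix and the threshold scores along a split path, the FESP is the training curve minimising the FPC semi-metric to the FPSP. All values and names below are illustrative assumptions:

```python
import numpy as np

# Toy FPC scores xi_ik for the training curves
rng = np.random.default_rng(3)
scores = rng.standard_normal((30, 5))

path_fpcs = [0, 2]                    # FPCs used along the path Omega (assumed)
xi_0 = np.array([0.3, -0.1])          # threshold scores xi_0k (assumed)

# FPC semi-metric between every training curve and the FPSP, restricted to
# the FPCs appearing in the path
d = np.sqrt(((scores[:, path_fpcs] - xi_0) ** 2).sum(axis=1))
fesp_idx = int(d.argmin())            # index of the FESP curve x^(z)(t)
```

Plotting the curve `x_{fesp_idx}(t)` then gives the empirical separation rule in the time domain, which is the visual tool the text advocates.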
In summary, the FESP attempts to provide an immediate interpretation of the split rule in terms of existing curves, i.e. with a true shape in the time domain. For this reason, we prefer the FESP which, although poor in mathematical properties, has a substantial impact, from an interpretation and visualization point of view, on the understanding of what happens at each split of the FCT. On the contrary, the FPSP, which is instead the (functional) theoretical separation rule, does not have the same interpretative impact because it assumes shapes that are difficult to relate to the curves being separated. For this reason, FPSPs are clearly essential for splitting, but not as useful as FESPs in terms of visualization and understanding of what happens in the time domain.
Another exciting aspect that deserves to be evaluated is the functional variability of the terminal nodes after pruning the FCT-FPCs to bypass overfitting. Each leaf is regularly composed of a set of curves. When we prune the FCTs-FPCs to avoid overfitting, these terminal nodes are often not pure: leaves can be composed of functions that belong to different outcome classes. The purity of the terminal nodes of a pruned FCT-FPCs may always be evaluated with the classical indexes, such as the Gini or the Shannon–Wiener entropy index (Eqs. 12 and 13) (see e.g. Hastie et al. 2009;Therneau et al. 2019). In any case, we can exploit a fascinating source of additional information in the functional case: the functional deviance of the leaves. The latter is a measure of variability that provides knowledge about the dispersion of the set of curves in the leaf l over the whole reference domain, and can be expressed as:

DEV_l(t) = Σ_{i=1}^{n_l} (x_i(t) − x̄_l(t))²,    (16)

where DEV_l(t) is the functional deviance of the generic l-th leaf, n_l is the number of curves in the l-th leaf (l = 1, ..., L), and x̄_l(t) is the functional mean of the l-th leaf. High values of DEV_l(t) at specific points of the domain provide information on a time interval in which the curves belonging to a terminal node are very far from the functional mean of the leaf and therefore differ from each other. In other words, this would be an indication that we have a terminal node composed of diverse functions, regardless of the label of the response variable. Conversely, a very low and "flat" DEV_l(t) indicates a leaf with very similar curves and thus a more homogeneous terminal node.
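The functional deviance of a leaf can be sketched directly on a common time grid; the toy leaf below is an illustrative assumption:

```python
import numpy as np

# A toy leaf of n_l = 8 curves sharing a common shape plus noise
rng = np.random.default_rng(4)
t = np.linspace(0, 1, 100)
leaf = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((8, 100))

mean_l = leaf.mean(axis=0)                   # functional mean of the leaf
dev_l = ((leaf - mean_l) ** 2).sum(axis=0)   # DEV_l(t) over the whole domain

# A flat, low DEV_l(t) indicates a homogeneous leaf; localised peaks flag
# time intervals where sub-patterns may hide within the same predicted class.
```

Plotting `dev_l` against `t` gives exactly the leaf-level diagnostic described above.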
Willing to express the functional variability of the terminal nodes in terms of the total functional deviance of the root node, we can also calculate the following ratio:
$$ \frac{\mathrm{DEV}_l(t)}{\mathrm{DEV}(t)}, \qquad (15) $$
where $\mathrm{DEV}(t) = \sum_{i=1}^{N} \big( x_i(t) - \bar{x}(t) \big)^2$ is the functional deviance of the root node and $\bar{x}(t)$ is its functional mean. This tool can be useful to assess the quality of the partition in terms of the reduction of the total functional deviance.

Functional Bagging (FBAG)
In classical DTs, modest changes in the data may lead to very different DTs and thus to different classification rules and interpretations. A useful technique to reduce this variance is to create an ensemble of DTs using bagging (see e.g. Hastie et al. 2009; James et al. 2013). Classical bagging (bootstrap aggregating) is an ensemble extension of classical decision trees (Breiman 1996): with the goal of reducing the variance of a single classification tree, the basic idea is to grow many classification trees and combine them into a final classification rule.

A similar procedure can be extended to the case of functional data, leading to Functional Bagging (FBAG), i.e. functional bootstrap aggregating. Focusing on the FPC scores as features to build the single FCTs-FPCs on each bootstrap replicate, we denote this approach FBAG-FPCs. Similarly, it could be extended to Eq. 1, leading to functional bagging using B-splines (FBAG-Bsplines). However, in the following, we focus on the former approach.
Analytically, let us assume the FBAG-FPCs consists of $H$ trees, $h = 1, \ldots, H$, where $H$ is chosen to be a large number. The $h$-th tree is grown on a random subset of the training set, i.e. a bootstrap sample $D^*_h = \{x^{(h)}_s(t),\; s = 1, \ldots, N\}$ of the same size $N$ as the original data set, obtained by sampling the curves with replacement. It is straightforward to replace $x^{(h)}_s(t)$ with its expansion in terms of the FPC basis as in Eq. 2. Thus, the set of curves present in the $h$-th bootstrap sample $D^*_h$ is called, from now on, the "in-bag curves sample" (IBCs) and is used to build the single $h$-th FCT-FPCs. Instead, the "out-of-bag curves sample" (from now on, OOBCs) is composed of the remaining curves, i.e. those statistical units that are not present in $D^*_h$. Then, we train $H$ FCTs-FPCs using the $H$ bootstrapped functional training sets to get $\hat{f}^*_h(x_i(t))$, the class predicted by the $h$-th FCT for the curve $i$. Afterwards, we take all the $H$ predictions of the FCTs-FPCs to obtain the final prediction for the $i$-th curve via the so-called "majority vote", i.e. the overall prediction for each new $i$-th curve is the most commonly occurring class among the $H$ predictions of the $H$ different FCTs-FPCs. Formally, let $Y$ be a categorical response with $U + 1$ classes ($u = 0, 1, \ldots, U$); thus, an estimate of the probability that the $i$-th curve belongs to the $u$-th class can be given by
$$ \hat{p}_u\big(x_i(t)\big) = \frac{1}{H} \sum_{h=1}^{H} I\big(\hat{f}^*_h(x_i(t)) = u\big), \qquad (16) $$
where $I$ is an indicator variable taking value 1 if the class predicted by the FCT is $u$ and 0 otherwise. Then, the decision rule for the overall bagging prediction of the curve $i$ is given by
$$ \hat{y}_i = \arg\max_{u \in \{0, 1, \ldots, U\}} \hat{p}_u\big(x_i(t)\big). \qquad (17) $$
Because each FCT-FPCs is grown deep and is not pruned (Breiman 2004), each tree has low bias but high variance; averaging these $H$ FCTs-FPCs diminishes the variance. This approach gains in accuracy with respect to a single FCT-FPCs because it combines hundreds of trees. Increasing $H$ will not lead to overfitting, as stressed by Breiman (2004). In practice, we use a value of $H$ that is large enough for the test error to have settled down.
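The FBAG idea can be sketched in a few lines; this is an illustrative Python stand-in, not the authors' implementation (which relies on R). Ordinary PCA on the discretized curves plays the role of the FPCA scores, the curves and labels are synthetic, and scikit-learn's bagged trees perform the bootstrap-and-majority-vote aggregation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy functional data: 100 curves on a common grid of 50 points, two classes.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
X = np.vstack([np.sin(2 * np.pi * t) + rng.normal(0, 0.3, (50, 50)),
               np.cos(2 * np.pi * t) + rng.normal(0, 0.3, (50, 50))])
y = np.repeat([0, 1], 50)

# Step 1: component scores from the discretized curves (PCA as a rough
# stand-in for a proper FPCA decomposition).
scores = PCA(n_components=5).fit_transform(X)         # K = 5 scores per curve

# Step 2: grow H deep (unpruned) trees on bootstrap replicates of the scores;
# predictions are aggregated by majority vote, and the out-of-bag curves give
# an internal accuracy estimate.
fbag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                         oob_score=True, random_state=1).fit(scores, y)
print(round(fbag.oob_score_, 2))
```

The `oob_score_` attribute corresponds to the OOBCs-based accuracy estimation discussed later in the paper.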

Functional random forest (FRF)
Classical Random Forest (RF) (Ho 1998) is one of the most efficient machine learning algorithms and is a particular case of bagging for decision trees. It consists of applying bagging to the data and random subsampling of the predictor variables at each split. This implies that, at each splitting step of the tree algorithm, a random sample of predictors is selected as split candidates from the full set of predictors. This improves on classical bagging because it yields a classifier that is not strongly influenced by the correlation among trees, which would otherwise all be dominated by the most discriminating variable. Following the same reasoning, the limit of FBAG-FPCs is that the FCTs-FPCs are correlated, and thus the desirable reduction of variance is not as high because the FCTs are not independent. Indeed, in FBAG-FPCs, even if we bootstrap the functional dataset, most of the FCTs-FPCs (sometimes all of them) will be very similar because they are often (sometimes always) dominated, at the top of the trees, by the same FPCs that best discriminate between the classes of the outcome. It follows that the higher the correlation among FCTs, the lower the reduction in variance.
Functional Random Forest using FPCs (from now on, FRF-FPCs) provides an improvement over FBAG-FPCs by way of a small tweak that decorrelates the FCTs and reduces the variance when we average them. Each time a split in a single FCT-FPCs is considered, a random selection of m FPCs is chosen as split candidates from the full set of the K FPCs. It follows that, when m < K, we have FRF-FPCs, whereas when m = K, FBAG-FPCs = FRF-FPCs. Following this approach, the FCTs-FPCs in the forest will be less correlated because the most important FPCs will not always be the features at the top of the FCTs-FPCs determining the first important separation rules.
In summary, FRF-FPCs is an extension of FBAG-FPCs because, in addition to considering different FCTs on bootstrap replicates of the functional dataset, it selects a random sample of m predictors at each split in the tree-building process. A general rule of thumb is to select, as size of the subset of FPCs, a value of m ≈ √K. Therefore, at each split in the FCT, the algorithm is not even allowed to consider a majority of the available FPCs. Indeed, on average, (K − m)/K of the splits will not even contemplate a given FPC. In this way, FRF-FPCs decorrelates the FCTs-FPCs, making the average of the FCTs-FPCs less variable and hence more reliable. Thus, the difference between FBAG-FPCs and FRF-FPCs depends on the choice of m.
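A minimal sketch of the FRF idea under the same assumptions as before (PCA scores on synthetic discretized curves standing in for FPCA, scikit-learn instead of the authors' R code): setting `max_features="sqrt"` applies the m ≈ √K rule at each split, while `max_features=None` would recover the bagging case m = K:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-group functional data differing in magnitude (hypothetical).
rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
X = np.vstack([t + rng.normal(0, 0.5, (60, 50)),
               t + 2 + rng.normal(0, 0.5, (60, 50))])
y = np.repeat([0, 1], 60)

scores = PCA(n_components=9).fit_transform(X)         # K = 9 scores per curve

# m = sqrt(K) score features are drawn as split candidates at each node;
# the trees are grown deep and left unpruned, as in the FRF procedure.
frf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                             oob_score=True, random_state=2).fit(scores, y)
print(round(frf.oob_score_, 2))
```

Comparing this run with `max_features=None` illustrates how restricting the candidate FPCs at each split decorrelates the trees.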
Following the same reasoning as Eq. 17, the decision rule for the overall random forest prediction is given by
$$ \hat{y}_i = \arg\max_{u \in \{0, 1, \ldots, U\}} \frac{1}{H} \sum_{h=1}^{H} I\big(\hat{f}^*_h(x_i(t)) = u\big). \qquad (18) $$
It is worth noticing that, differently from FCTs-FPCs, in the case of FRF-FPCs the FCTs in the forest are not pruned. As remarked by Breiman (2004), in the classical (non-functional) random forest framework, overfitting might occur in a single tree (which is why we generally use pruning), but it is mitigated in FRF-FPCs by the use of the bootstrap and of feature sampling at each step. For this reason, in this context, the forest is composed of non-pruned FCTs, whereas in Sect. 3 we use pruning when considering a single FCT as a functional classifier.


Estimation of the misclassification error rates of the functional tree-based classifiers
As in the non-functional framework, the estimation of the misclassification error rate (or the accuracy) of a functional classifier can be carried out through different strategies, e.g. using the functional training set, cross-validation, the bootstrap, or a functional test set. The most trivial way of computing the misclassification error rate of our functional classifiers FCT-FPCs, FBAG-FPCs, and FRF-FPCs would be to use the same functional training set. In this case, we can assess the rate of correct classifications by simply matching the predictions to the real labels of the outcome. This procedure gives the accuracy of the functional classifiers, and the misclassification error follows directly. Ordinarily, employing this approach, we would observe what is denominated the "apparent error", whose usefulness is very weak: testing a model on the same dataset that generated it has no practical advantage. In fact, by overfitting the data, we could reach 100% accuracy.
Similarly, we can refer to cross-validation or the bootstrap (e.g. using the OOBCs mentioned above) (Hastie et al. 2009; James et al. 2013) without losing a large part of the statistical units useful for training just to test the results. Generally, these approaches are very useful when the size of the dataset is limited. Indeed, we are able to train our functional models without dropping the information contained in the portion of the data that we would waste when producing an ad-hoc test set by breaking the original full dataset.
Concentrating on the blended use of FDA and tree-based techniques for functional classification purposes, the use of a functional test set is of particular interest. Indeed, when we aim to adopt a functional test set to estimate the misclassification error rate of our functional classifier, we must represent the test curves using the same basis system we applied to represent the functional training set. In the event we have used a fixed basis system (see Eq. 1), e.g. a B-spline basis, we only need to approximate the test functions using the same fixed basis system, i.e. with the same number and order of B-splines. This procedure ensures that the scores of the linear combination we have used for the training set curves are comparable with those we will use with the test set curves, because they refer to the same basis functions. Instead, if we use a data-driven basis system such as the FPCs of the training set (see Eq. 2), the representation of the test functions is less straightforward. Indeed, we cannot derive FPCs from the test functions because this would lead to a new basis system that is inconsistent with the one obtained from the FPCA performed on the training set. This condition would lead to scores that refer to a collection of FPCs different from those derived from the training data, making the scores incomparable. For this reason, we need to represent the functions of the test group in terms of the fixed basis system given by the FPCs previously defined using the functional training set. In other words, we need to project the new functions of the test set onto the FPC space generated by the training set to find the proper scores.
Following this approach, the $k$-th principal component score of the $j$-th test curve is given by
$$ \hat{\xi}_{jk} = \int_T x^c_j(t)\, \hat{\psi}_k(t)\, dt, \qquad j = 1, \ldots, M, $$
where the weight functions $\hat{\psi}_k$ are obtained by performing the FPCA on the training set, the $x^c_j$ are the centred functions of the test set (obtained by subtracting the sample mean of the training curves), and $M$ is the total number of curves in the test set. Note that we can think of the $\hat{\psi}_k$ as eigenfunctions, which provide precisely the ordinates of the functional principal components at each point of the domain. Therefore, when we compute the scores to approximate the test set functions using the FPCs derived from the training set as a fixed basis system, we can also assess our functional classifiers' misclassification error rate using a functional test set.
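This projection step can be sketched on discretized curves (a hypothetical NumPy example with random data; the SVD right singular vectors play the role of the discretized weight functions): the crucial points are that the basis is estimated from the training set only, and that the test curves are centred with the training mean before being projected:

```python
import numpy as np

rng = np.random.default_rng(3)
X_train = rng.normal(size=(40, 50))    # 40 training curves on a 50-point grid
X_test = rng.normal(size=(10, 50))     # 10 new test curves on the same grid

# "FPCA" on the training set via the SVD of the centred data matrix:
# the rows of Vt are the discretized weight functions psi_k.
train_mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - train_mean, full_matrices=False)
psi = Vt[:3]                           # first K = 3 weight functions

# Centre the test curves with the TRAINING mean, then project them onto the
# training basis to obtain scores comparable with the training scores.
test_scores = (X_test - train_mean) @ psi.T        # shape (10, 3)
print(test_scores.shape)
```

Refitting a decomposition on the test curves instead would produce scores in a different basis, which is exactly the inconsistency the text warns against.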

The meaning of the splitting rules of a FCT-FPCs performed using the DIPD dataset
In the following, we limit our attention to Eq. 2, and thus we consider the functional principal components as the basis system to reconstruct the original curves. The proposed approach is applied to the DIPD dataset, which was derived from twelve monthly electrical power demand time series from Italy and was first adopted in the paper "Intelligent Icons: Integrating Lite-Weight Data Mining and Visualization into GUI Operating System" (Keogh et al. 2006). The DIPD dataset is available at https://www.timeseriesclassification.com/. The training group comprises 67 signals, while the test group comprises 1029 signals. The classification task is to distinguish days from October to March (inclusive) from days from April to September. Figure 1 illustrates the 67 original signals of the functional training set. The black signals represent the Daily Italian Power Demand (DIPD) from October to March, whereas the red curves denote the DIPD from April to September. Therefore, the basic idea is to predict, based on the characteristics of the curves, whether a new curve refers to the October-March or the April-September class. Figure 1 also displays the FPC decomposition of the functional training set. The plot considers the first twenty FPCs, and the variability explained by each FPC is shown in the legend. The first two FPCs explain about 80% of the total variability. In this framework, the traditional ways of choosing the number of FPCs are not useful. Indeed, the FPCs that explain less variability are often decisive in discriminating the classes of the outcome and thus are essential in the construction of the FCT-FPCs. In fact, the first FPC, which by construction captures most of the variability, is rarely decisive in FCTs-FPCs. As Fig. 1 remarks, each FPC explains different parts of the time domain differently. Figure 2 presents the FCT-FPCs built using the features of the functional training set.
The criterion used to evaluate the impurity of the nodes is the Gini index (see Eq. 12). We can observe that, in the FCT-FPCs, the features are the scores of the FPCs. The cut on a specific value of a FPC score determines the split of a node.
Among all the possible FPCs and all the possible split values of the scores, the one that maximizes the decrease in node impurity is chosen. The most crucial FPC in our FCT is the second FPC. In particular, for a threshold value of 0.0098 on the score of the second FPC, the best separation of the curves is obtained based on the purity criterion of the node. Subsequently, the first and sixth FPCs play an essential role in constructing the FCT, with threshold values of the scores equal to 2.6 and 0.82, respectively. The second FPC is the most important feature in our functional classifier. As expected, the first FPC is not essential for discriminating, as it captures a variability common to many curves of different classes. Figure 4 represents the first step in elucidating the sense of splitting at a specific value of the coefficient of a FPC. Figure 4(a) simply shows the second FPC, which is used for the first split rule. Figure 4(b) illustrates the distribution of the scores of the second FPC and the threshold established by the first rule of the FCT. In Fig. 4(c), the curves are coloured based on the original labels of the training groups, together with the two prototypes: the FPSP, which has little interpretative power, and the FESP, which instead suggests precisely where the cut takes place. In Fig. 4(d), we re-propose the prototypes, but plotted on the curves whose colour is given by the classes predicted using only the first decision rule; therefore, the red and black colours are not the same as in Fig. 4(c), because now they are given by the predicted labels (not the original ones). In practice, the number of curves with colours not corresponding to Fig. 4(c) indicates the misclassification error committed when classifying using only the first decision rule. As mentioned in Sect. 2.2.1, the FESP is used as an interpretative tool for the functional separation rules.
In fact, the FESP is the curve existing in the functional training set that is closest to the predicted functional separation rule, based on the semi-metric of the functional principal components (see Eq. 9). To better illustrate the meaning of the FESP of the first split rule identified by the FCT-FPCs, Fig. 5 is proposed. Figure 5(a) presents the full functional dataset with the original groups and the FESP of the first split (root node). The understanding of "what is happening" following the classification rule dictated by the FCT-FPCs is immediate. The FESP helps to capture, on the basis of the original functions, how the separation takes place over the whole time domain. In particular, it helps to detect those parts of the domain that are crucial to the separation rule. Following the same reasoning proposed to find the first FESP, we can identify the FESPs for all the split rules dictated by the FCT-FPCs. Thus, since the FCT operates three splits, we have three empirical separation prototypes, and each works on a different subset. Figure 5(b) represents the second split rule, based on the first FPC, and thus the yellow curve is the second FESP. The second FESP is shown on the reduced set of the data (that is, only those curves satisfying the condition of the first split and not leading to a terminal node, i.e. $\hat{\xi}_2 < 0.0098$). Figure 5(c) represents the last split rule, based on the sixth FPC, and thus the orange curve is the third FESP. The third FESP is displayed on the subset of curves satisfying the double condition $\hat{\xi}_2 < 0.0098$ and $\hat{\xi}_1 \geq 2.6$. Figures 5(a-c) are accompanied by histograms showing the distributions of the scores and the cut made by the FCT rule. Figure 5(d) offers a synthetic version of all the separation rules of the FCT in terms of FESPs, to better capture, over the whole domain, all the splitting rules according to the real curves available in the training data.
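Locating an FESP can be sketched in a few lines (a hypothetical example with random scores): given the FPC scores of the training curves and of the theoretical separation prototype (FPSP), the FESP is simply the training curve minimizing the FPC-score semi-metric distance:

```python
import numpy as np

rng = np.random.default_rng(6)
train_scores = rng.normal(size=(67, 5))   # FPC scores of 67 training curves
fpsp_scores = rng.normal(size=5)          # scores of the theoretical prototype

# Semi-metric based on the FPC scores: Euclidean distance in score space.
d = np.sqrt(((train_scores - fpsp_scores) ** 2).sum(axis=1))
fesp_index = int(np.argmin(d))            # index of the closest real curve
print(fesp_index)
```

The curve at `fesp_index` is a real observed function, which is why it can be drawn directly in the time domain as an interpretable separation prototype.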
Figure 6 summarizes all the aspects of the splitting rules generated by the FCT-FPCs in terms of FESPs and functional means of the subgroups produced by each splitting rule. In addition, the leaves' functional deviance is shown below the terminal nodes' charts to capture the dispersion of each leaf over the entire domain. The second leaf has no variability because it is composed of only one curve. In addition to the classical measures of the impurity of the terminal node, the functional deviance of the leaves can be used to judge their quality in different parts of the time domain in terms of dispersion around the functional mean of the single leaf (see Eq. 14). Figure 6 highlights that the last leaf is characterized by high functional variability because it is composed of curves with very different shapes, especially in the last part of the domain, regardless of the purity of the node. This indicates that there are sub-patterns inside the terminal node, although the predicted class is the same.

FBAG-FPCs and FRF-FPCs performed using the DIPD dataset
In this section, we extend the FCT-FPCs to the context of FBAG-FPCs and FRF-FPCs. Therefore, we provide an ensemble of FCTs-FPCs as a functional classifier to predict the response of a binary grouping variable. Figure 7 illustrates the first nine FCTs-FPCs obtained using the FBAG-FPCs procedure. As discussed in Sect. 2.2.3 to motivate the subsequent introduction of FRF-FPCs, in FBAG-FPCs all the FCTs-FPCs are always dominated by the second FPC, despite a different FCT being built for each bootstrap replicate. Regardless of the curves included in the bootstrap sample, the most critical feature for discriminating the functional dataset always remains the second FPC, even if the threshold value of the score may vary. This condition leads to a limited reduction in variance due to the correlation of the various FCTs-FPCs, which are very similar. Figure 8 provides the first four FCTs-FPCs obtained using the FRF-FPCs algorithm. This figure points out that the FCTs-FPCs in the forest are not all governed by the second FPC. Indeed, at the top of the FCTs, there are often different FPCs, and thus the proposed approach helps to decorrelate the FCTs and obtain a much more pronounced reduction in variance compared to FBAG-FPCs. Figure 9 illustrates the importance of the FPCs in the FRF-FPCs classifier based on the mean decrease accuracy index (Hastie et al. 2009; Therneau et al. 2019). Unquestionably, the second FPC is the most important in discriminating the classes of the outcome; however, FPCs no. 6, 1, and 14 also play an influential role in enhancing the accuracy of the functional classifier. This consideration is quite significant because the FPCs explaining a small part of the variability of the phenomenon are often essential for improving the classifier's performance; effectively, they can capture those peculiarities that the first FPCs do not catch.
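A mean-decrease-accuracy importance of the kind shown in Figure 9 can be sketched as follows (synthetic data; scikit-learn's `permutation_importance` and PCA scores are used as stand-ins for the paper's R workflow): each FPC score is permuted in turn, and the resulting drop in accuracy measures that FPC's importance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Two hypothetical groups differing mainly in the amplitude of a sine component.
rng = np.random.default_rng(4)
t = np.linspace(0, 1, 50)
X = np.vstack([np.outer(rng.normal(0, 1, 60), np.sin(2 * np.pi * t)),
               np.outer(rng.normal(3, 1, 60), np.sin(2 * np.pi * t))]) \
    + rng.normal(0, 0.2, (120, 50))
y = np.repeat([0, 1], 60)

scores = PCA(n_components=8).fit_transform(X)
frf = RandomForestClassifier(n_estimators=200, random_state=4).fit(scores, y)

# Mean-decrease-accuracy importance: permute one FPC score at a time and
# record the average drop in accuracy over n_repeats shuffles.
imp = permutation_importance(frf, scores, y, n_repeats=10, random_state=4)
ranking = np.argsort(imp.importances_mean)[::-1]   # most important FPC first
print(ranking[0])
```

In this toy setup the class signal sits in one dominant component, so the ranking is trivial; on real curves the interest lies precisely in low-variance FPCs obtaining high importance.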

Performance results of the FRF-FPCs classifier applied to the Daily Italy Power Demand dataset
In this section, we present the results of the application of FRF-FPCs to the DIPD dataset and some comparisons with different functional classifiers. For brevity, we do not report all the results of the single FCT-FPCs and FBAG-FPCs, as we can consider the FRF-FPCs the best version of the presented approach, which, starting from the single FCT-FPCs and passing through the FBAG-FPCs, leads to the final FRF-FPCs. The results of the functional classifiers in terms of accuracy and misclassification error rate are listed in Tables 1 and 2, which consider the accuracy computed using the OOB estimation on the training set and that on the test set, respectively. In addition, a comparison with other functional classification techniques already implemented in the literature is presented in Tables 3, 4, and 5. Table 1 displays the accuracy of the FRF-FPCs classifier applied to the training subset of the DIPD dataset. Different sizes of the forest and different numbers of FPCs are adopted. The accuracy on the training set is computed using the OOBCs estimation procedure (see Sect. 2.2.2). The best accuracy on the training set is 98.51%, achieved using 14 FPCs and 13 or 28 FCTs. On the other hand, Table 2 concentrates on the results of the FRF-FPCs classifier applied to the test set (see Sect. 2.3). Also in this case, the accuracy is shown for different sizes of the forest and different numbers of FPCs. The best accuracy on the test set is 97.57%, with 11 FPCs and 29 FCTs. The FRF-FPCs classifier performs better than previously used methods. In particular, the most interesting result is that the world record in terms of accuracy reached for this dataset was 97.03% (see also https://www.timeseriesclassification.com/). Therefore, the FRF-FPCs classifier, on this dataset, exceeds the existing world record.
Table 3 proposes the results of a similar classification procedure, but based on the coefficients of the B-spline decomposition (Eq. 1). The accuracy of the FRF-B-spline classifier for different sizes of the forest and a fixed number of B-splines is calculated both on the training and test sets. The best accuracy on the training set is 95.52%, with a forest of 35 trees. The best accuracy on the test set is 97.18%, with 43 trees.
To compare the FRF-FPCs approach to the most recent and widespread methods to classify functional data, we provide the results of different approaches implemented in the fda.usc R package (Febrero-Bande and de la Fuente 2012). In particular, Table 4 displays the results of the functional classification using the k-nn classifier for different values of the parameter k. The best accuracy on the training set is 98.51% with 3, 5, 7, and 9 nearest neighbours. Instead, the best accuracy on the test set is 96.31% with 9 or 11 nearest neighbours. Moreover, Table 5 shows the results of the functional classification using the functional depth classifiers implemented in the R package fda.usc (Febrero-Bande and de la Fuente 2012). The best accuracy on the training set is 98.51% with the depth measure "mode", and the best accuracy on the test set is 95.63% with the depth measure "mode".
In summary, the FRF-FPCs classifier achieves the best results in terms of accuracy when we focus on the test set.

Simulation results
In order to show the performance of the proposed functional classifiers, we adjust and adapt different models already considered by Cuevas et al. (2007); Preda et al. (2007); Taiwo Ojo et al. (2021) for generating functional data with peculiar shapes. In particular, we provide nine scenarios, of which the first seven consider the case of a binary outcome. We generate 200 functions in each of these scenarios: both the training and test sets are composed of 100 curves with balanced classes. The eighth simulation instead considers a three-class supervised classification problem; in this case, we generate 300 curves with balanced classes. The last simulation displays a four-class case in which we have 100 observations for each class. In all cases, the data are equally split into training and test sets, and the domain is composed of 50 time observations for each curve. For each simulation, we compare the FRF-FPCs approach, the functional k-nn, and five versions of functional depth classifiers available in Febrero-Bande and de la Fuente (2012). The different scenarios are obtained using the following simulations.

Simulation 1. We consider the following two functional data generating models to obtain two groups, which differ mainly according to their magnitude. Group 1 is generated by the main model $X_i(t) = \mu t + e_i(t)$, whereas group 2 is generated by $X_i(t) = \mu t + q k_i + e_i(t)$, where $t \in [0, 1]$, $e_i(t)$ is a Gaussian process with zero mean and covariance function $\gamma(s, t) = \alpha e^{-\beta |t - s|^{\nu}}$, $k_i \in \{-1, 1\}$, and $q$ is a constant controlling how far the curves in group 2 are from the mean function of group 1. Figure 10 shows the simulated data obtained fixing $\mu = 4$, $q = 3$, $\alpha = 1$, $\beta = 1$, and $\nu = 1$.
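Simulation 1 can be sketched as follows (a NumPy example; the names mu, q, alpha, beta, nu are our labels for the constants in the text, since the original Greek symbols were lost, and should be read as assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 50)                  # 50 time observations per curve
mu, q, alpha, beta, nu = 4.0, 3.0, 1.0, 1.0, 1.0

# Zero-mean Gaussian process with exponential covariance alpha*exp(-beta|t-s|^nu).
cov = alpha * np.exp(-beta * np.abs(t[:, None] - t[None, :]) ** nu)
noise = rng.multivariate_normal(np.zeros(50), cov, size=200)

# Group 1: mu*t + e_i(t); group 2 adds a random vertical shift q*k_i, k_i in {-1, 1}.
k = rng.choice([-1, 1], size=100)
group1 = mu * t + noise[:100]
group2 = mu * t + q * k[:, None] + noise[100:]
print(group1.shape, group2.shape)
```

Stacking the two groups and splitting them evenly reproduces the 100-curve training and test sets with balanced classes described above.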
Simulation 2. We consider two functional data generating models to obtain two groups that differ because one has a rather constant trend while the other is characterized by peaks. Group 1 is generated by $X_i(t) = \mu t + e_i(t)$, while group 2 is given by $X_i(t) = \mu t + q k_i I_{T_i \le t \le T_i + l} + e_i(t)$, with $t \in [0, 1]$, $e_i(t)$ a Gaussian process with zero mean and covariance function of the form $\gamma(s, t) = \alpha e^{-\beta |t - s|^{\nu}}$, $k_i \in \{-1, 1\}$ with $P(k_i = -1) = P(k_i = 1) = 0.5$, $q$ a constant controlling how far the curves in group 2 are from the mass of the curves in group 1, $T_i$ a uniform random variable in an interval $[a, b] \subset [0, 1]$, $I$ an indicator function, and $l$ a constant specifying for how much of the domain the curves of group 2 are away from the mean function. In this simulation, we set $\mu = 4$, $q = 8$, $a = 0.1$, $b = 0.9$, $l = 0.05$, $\alpha = 1$, $\beta = 1$, and $\nu = 1$.
Simulation 3. Group 1 is generated by the model $X_i(t) = \mu t + e_i(t)$ and group 2 by $X_i(t) = \mu t + q k_i I_{T_i \le t} + e_i(t)$, where $t \in [0, 1]$, $e_i(t)$ is a Gaussian process with zero mean and covariance of the form $\gamma(s, t) = \alpha e^{-\beta |t - s|^{\nu}}$, $k_i \in \{-1, 1\}$ with $P(k_i = -1) = P(k_i = 1) = 0.5$, $I$ is an indicator function, $q$ is a constant controlling how far the curves in group 2 are from the mass of group 1, and $T_i$ is a uniform random variable in an interval $[a, b] \subset [0, 1]$. This simulation allows us to obtain two groups that differ in their magnitude only in a specific part of the time domain. For our purpose, we fix $\mu = 8$, $q = 3$, $a = 0.5$, $b = 0.9$, $\alpha = 1$, $\beta = 1$, and $\nu = 1$.
Simulation 4. We consider the following two functional data generating models to obtain two groups, which differ mainly according to their amplitude. The main model is of the form $X_i(t) = a_{1i} \sin \theta + a_{2i} \cos \theta + e_i(t)$. To obtain group 2, we refer to the model $X_i(t) = (b_{1i} \sin \theta + b_{2i} \cos \theta)(1 - u_i) + (c_{1i} \sin \theta + c_{2i} \cos \theta) u_i + e_i(t)$, where $t \in [0, 1]$, $\theta \in [0, 2\pi]$, $a_{1i}, a_{2i}$ follow a uniform distribution in an interval $[a_1, a_2]$, $b_{1i}, b_{2i}$ follow a uniform distribution in an interval $[b_1, b_2]$, $c_{1i}, c_{2i}$ follow a uniform distribution in an interval $[c_1, c_2]$, $u_i$ follows a Bernoulli distribution, and $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form $\gamma(s, t) = \alpha e^{-\beta |t - s|^{\nu}}$. Figure 10 shows the simulated data obtained fixing $a_1 = 1$, $a_2 = 8$, $b_1 = 1.5$, $b_2 = 2.5$, $c_1 = 5$, $c_2 = 10.5$, $\alpha = 1$, $\beta = 1$, and $\nu = 1$.

Simulation 5. Group 1 is generated by the model $X_i(t) = \mu t + e_i(t)$ and group 2 by $X_i(t) = \mu t + \tilde{e}_i(t)$, where $t \in [0, 1]$, and $e_i(t)$ and $\tilde{e}_i(t)$ are Gaussian processes with zero mean and covariance function of the form $\gamma(s, t) = \alpha e^{-\beta |t - s|^{\nu}}$. Figure 10 shows the simulated data obtained with $\mu = -24$, $\alpha = 1$, $\beta = 1$, and $\nu = 1$ for the Gaussian process $e_i(t)$; instead, $\alpha = 5$, $\beta = 2$, and $\nu = 0.5$ are the parameters selected for the Gaussian process $\tilde{e}_i(t)$. Simulation 5 allows us to obtain quite overlapping curves, but with small differences in their shape.
Simulation 6. We consider the following two functional data generating models to obtain two groups, which have a small difference in magnitude and also in shape, but only in a portion of the time domain. Group 1 is obtained using the model $X_i(t) = \mu t + e_i(t)$, whereas group 2 is generated by a second model in which $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form $\gamma(s, t) = \alpha e^{-\beta |t - s|^{\nu}}$, $u$ follows a Bernoulli distribution with $P(u = 1) = 0.5$, $q$, $r$, $z$ and $w$ are constants, and $v$ follows a uniform distribution in $[a, b]$. The two groups are obtained setting the following parameters: $\mu = 8$, $q = 1.8$, $a = 0.45$, $b = 0.55$, $\alpha = 1$, $\beta = 1$, $\nu = 1$, $r = 0.02$, $z = 90$, and $w = 2$.

Simulation 7. Group 1 is generated by the model $X_i(t) = k \sin(r \pi t) + e_i(t)$ and group 2 by $X_i(t) = k \sin(r \pi t + v) + e_i(t)$, where $t \in [0, 1]$, $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form $\gamma(s, t) = \alpha e^{-\beta |t - s|^{\nu}}$, and $k$, $r$, $v$ are constants. The goal of this simulation is to generate periodic functions with slightly different shapes between the curves of the two groups. To this purpose, we set $r = 20$ and the remaining parameters equal to one.
Simulation 8. The aim of simulation 8 is to test the classifiers on a supervised classification problem with three classes. To this end, we propose the model of simulation 6 adapted to obtain three different groups. The first group is obtained setting $\mu = 0$; for the second model, we set $q = 1.8$, $a = 0.45$, $b = 0.45$, $\alpha = 1$, $\beta = 1$, $\nu = 1$, $r = 0.02$, $z = 90$, and $w = 2$. The third group is obtained setting $\mu = 1$, $q = 0.8$, $a = 0.65$, $b = 0.65$, $\alpha = 1$, $\beta = 1$, $\nu = 1$, $r = 0.02$, $z = 90$, and $w = 2$.
Simulation 9. Simulation 9 proposes a four-class classification problem. Also in this case, we propose an extension of the model used in simulation 6, with appropriate adjustments to obtain groups with a certain recognition complexity. The first group is obtained setting $\mu = 0$; for the second model, we set $q = 1.8$, $a = 0.45$, $b = 0.45$, $\alpha = 1$, $\beta = 1$, $\nu = 1$, $r = 0.02$, $z = 90$, and $w = 2$. The third group is obtained setting $\mu = -2$, and the fourth one is generated fixing $q = 1.8$, $a = 0.15$, $b = 0.15$, $\alpha = 0.8$, $\beta = 0.8$, $\nu = 1$, $r = 0.01$, $z = 90$, and $w = 4$. Figure 10 illustrates the nine datasets generated using the models described above.
Table 6 exhibits the best results obtained by each classifier in every simulation scenario. The numbers of FCTs, FPCs, and nearest neighbours are commented on below, and additional details are provided as supplementary materials of this paper. The performances are indicated as percentages of accuracy. FCT-FPCs is compared to some functional classifiers implemented in the R package fda.usc (Febrero-Bande and de la Fuente 2012). In particular, we refer to F-K-nn, which implements the functional k-nn classifier; depth.RP, which computes the Random Projection depth (Cuevas et al. 2007); depth.mode, which implements the modal depth (Cuevas et al. 2007); depth.RT, which implements the Random Tukey depth (Cuesta-Albertos and Nieto-Reyes 2008); depth.FM, which computes the integration of a univariate depth along the x axis (Fraiman and Muñiz 2001) and is also known as Integrated Depth; and depth.RPD, which provides a depth measure based on random projections possibly using several derivatives (Cuevas et al. 2007). In the first seven scenarios, the test and training sets are made up of 100 observations each.
Therefore, in the accuracy measurements, no decimals other than 0 are observed. Limiting the attention to the accuracy of the functional classifiers computed on the test set, the first scenario highlights the superiority of the FRF-FPCs classifier, which reaches 98% accuracy on the test set with 15 FPCs and 14 FCTs in the forest. The functional k-nn is also very accurate, with 95% accuracy with k = 1. In this simulation, the functional depth-based classifiers also achieve satisfactory results. In the second scenario, the FRF-FPCs classifier is clearly superior to the others, with an accuracy of 97%, i.e. 21 percentage points above the second-best classifier, the functional k-nn with k = 1. Functional classifiers based on depth do not perform well in the second scenario; in fact, they reach a maximum of 50% accuracy. The FRF-FPCs achieves very high performances in many circumstances, with 14 or 15 FPCs and different numbers of FCTs. The other classifiers are unable to capture isolated peaks in small portions of the time domain (see Fig. 10 - Scenario 2). In the third scenario, both the FRF-FPCs and the functional k-nn perform well. The FRF-FPCs reaches 98% accuracy on the test set with 14 and 15 FPCs and 39, 43, and 59 trees; the functional k-nn attains a maximum accuracy of 95% with k = 1. Even in scenario 4, the same classifiers consistently achieve the best performance. The best accuracy of the FRF-FPCs is 92% with 5 FPCs and 11 FCTs, whereas the highest result of the k-nn is 88% with k = 7. RPD also reaches 88% accuracy in this case. Scenario 5 shows the largest performance gap between the FRF-FPCs classifier and all the other classifiers: the FRF-FPCs perfectly recognizes all the curves in the test set with 12 FPCs and 23, 45, and 57 FCTs, while the k-nn reaches 53% with k = 1 and the depth-based classifiers only reach 50%.
Hence, the other classifiers are incapable of detecting the labels of curves that overlap in the central area with a group that has a slightly different shape and variability (see Fig. 10 - Scenario 5). In scenario 6, both the FRF-FPCs and the k-nn have an accuracy of 98%. The FRF-FPCs achieves this result with just 2 or 3 FPCs and different numbers of FCTs in the forest, while the k-nn needs k = 1. Evidently, the simulation of this scenario provides quite simple patterns to recognize; in fact, the mode and RPD methods also reach 96% and 90%, respectively. Scenario 7 is the simplest to recognize, and indeed most classifiers have an accuracy of 100%. The FRF-FPCs classifier offers a perfect classification with any number of FPCs from 2 to 14 and very few FCTs in the forest (2, 3, or 4 are often sufficient). Scenario 8 offers a classification problem with three classes. In this case, the best classifier is the k-nn because, fixing k = 1, it achieves 96% accuracy. The FRF-FPCs, on the other hand, shows 94% accuracy with 4 FPCs and 9 FCTs. Simulation 9 is a four-class classification problem with overlapping curves of very different shapes. The latter is the most challenging classification problem; indeed, the best classifier is the FRF-FPCs, with an accuracy of just 72% with 4 FPCs and 9 FCTs.

On the selection of the number of FPCs and FCTs for the FRF-FPCs classifier
The choice of the "optimal" number of FPCs and FCTs for functional supervised classification is certainly a very complex and challenging topic. The choice of an optimal number of FPCs has no theoretical solution, but some considerations can help define a decision criterion. Unfortunately, knowing a priori, with certainty, the number of FPCs to consider for supervised classification would be like knowing a priori the optimal number of nearest neighbours for the k-nn classifier: the data can offer some hints, but we only have certainty about the optimal number after testing the classifier. The number of FPCs can, however, have a heuristic solution based on considerations and evidence that emerge from empirical analyses and theoretical reasoning.
The first consideration is that, in this context, the choice of the number of FPCs is significantly different from the classic case, where the goal may be to explain at least 70-80% of the variability of the phenomenon. We could not even be satisfied with describing 90-95%, because the 5-10% we lose could be essential for the performance of the functional classifier in recognizing functions with particular patterns in specific parts of the time domain.
The second consideration is that very often, unlike the classic PCA or FPCA, the first FPC is not very influential for classification purposes because it often captures a variability common to all curves. Very often, the following FPCs play a fundamental role, even if they explain a small portion of the total variability.
In light of these considerations, our decision rule cannot be based only on the explained variability but must also rely on other concerns of an empirical nature, e.g. the number of curves under study, the performance of the classifier on the training group and a possible validation group, and the number of classes to predict. Even this last aspect turns out to be relevant because, as can be seen from the simulations, when the number of classes to be predicted increases, the number of FPCs that are important in the classification procedure also increases, particularly when the groups are not similar (see Scenario n.9 in the supplementary material). At this point, to choose an adequate number of FPCs, we suggest three possible heuristic criteria.
First, when we have, for each group, a number of labelled curves large enough to allow us to create a separate training set and validation set, we can train the classifier on the training set and refine the number of FPCs by observing the accuracy on the validation set for different numbers of FPCs. Clearly, we will pick the number of FPCs that guarantees maximum accuracy on the validation set.
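This first criterion amounts to a simple grid search over candidate numbers of FPCs. The sketch below is illustrative only, not the authors' code: `evaluate` stands in for a hypothetical routine that trains the FRF-FPCs classifier with a given number of FPCs and returns its accuracy on the validation set.

```python
# Minimal sketch of the validation-based criterion (assumed names throughout).
# `evaluate(n)` is a hypothetical callable: train with n FPCs, return
# validation accuracy in [0, 1].

def select_n_fpcs(evaluate, candidate_fpcs):
    """Return the candidate number of FPCs with the highest validation accuracy."""
    scores = {n: evaluate(n) for n in candidate_fpcs}
    best = max(scores, key=scores.get)
    return best, scores

# Toy usage with a made-up accuracy profile peaking at 11 FPCs:
toy_accuracy = {5: 0.81, 8: 0.88, 11: 0.95, 14: 0.93, 17: 0.92}
best, scores = select_n_fpcs(toy_accuracy.get, toy_accuracy)
# best == 11
```

In practice `evaluate` would refit the whole forest for each candidate, so the candidate grid is usually kept coarse.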
A second feasible heuristic solution is to observe the plot of the average accuracy of the FRF-FPCs classifier (computed using the OOB estimation on the training set) versus the number of FPCs; in this case, for a given number of FPCs, each ordinate of the curve is obtained by averaging the accuracy values obtained while varying the number of FCTs. This curve is particularly informative because it measures the discriminatory contribution of a particular set of FPCs to the accuracy of the classifier and, moreover, is minimally affected by fluctuations due to chance. In fact, averaging, for a fixed number of FPCs, over a number of FCTs ranging from 2 to 60 (as in our simulation study; see the supplementary material) substantially compensates for the fluctuations due to chance. Hence, we believe this curve has real informative power for choosing the number of FPCs to consider. From an empirical point of view, it is interesting to note that, very often, there is a correspondence between the peak of this average curve (solid blue curve in the simulations) and high values of the maximum accuracy on the test set (dotted cyan curve in all simulations). Therefore, an interesting criterion could be to choose a number of FPCs (or an interval) in which a peak of the average accuracy curve is observed. This choice guarantees a high accuracy value, even on the test set.
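The averaging scheme behind this curve can be sketched as follows. This is a hedged illustration, not the authors' implementation: `oob_acc` is an assumed input, a grid of OOB accuracies indexed by pairs (number of FPCs, number of FCTs).

```python
# Sketch of the average-accuracy curve: for each number of FPCs, average the
# OOB accuracy over all forest sizes, then look for the peak of the curve.
from collections import defaultdict

def average_accuracy_curve(oob_acc):
    """Average OOB accuracy over the number of FCTs, for each number of FPCs."""
    by_fpcs = defaultdict(list)
    for (n_fpcs, _n_fcts), acc in oob_acc.items():
        by_fpcs[n_fpcs].append(acc)
    return {n: sum(v) / len(v) for n, v in sorted(by_fpcs.items())}

# Toy grid: accuracy peaks around 3 FPCs regardless of forest size.
toy = {(2, 10): 0.80, (2, 20): 0.82,
       (3, 10): 0.91, (3, 20): 0.93,
       (4, 10): 0.88, (4, 20): 0.90}
curve = average_accuracy_curve(toy)
peak = max(curve, key=curve.get)   # peak == 3
```

Plotting `curve` against the number of FPCs reproduces the solid blue curve described above, and the peak (or a plateau around it) is the suggested choice.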
A third and final empirical criterion could be to choose the number of FPCs at which a sudden decrease in the plot of the average OOB accuracy on the training set is observed (an idea similar to the classic scree plot criterion). This last criterion is particularly suitable in the application to the DIPD data. Figure 11(a) shows that the maximum accuracy on the test set (the peak of the dotted cyan curve) is associated with 11 FPCs, i.e. the number of FPCs at which the average OOB accuracy (solid blue curve) shows a sudden decrease. In other words, starting from FPC n.12, the discriminating power begins to be low on average, and thus we can omit the following FPCs. In this case, the second method is also an excellent choice because, although it does not lead to the maximum accuracy peak, it guarantees a very high level of precision.
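This scree-plot-like rule can be sketched as a scan for the first "sudden decrease" in the average accuracy curve. The sketch is illustrative; in particular, the drop threshold is an assumed tuning value, not one prescribed by the method.

```python
# Sketch of the scree-like criterion: keep the last number of FPCs before the
# first drop in average OOB accuracy larger than `threshold` (assumed value).

def scree_cutoff(avg_curve, threshold=0.05):
    """Return the number of FPCs right before the first drop > threshold."""
    items = sorted(avg_curve.items())
    for (n_prev, acc_prev), (_n_next, acc_next) in zip(items, items[1:]):
        if acc_prev - acc_next > threshold:
            return n_prev
    return items[-1][0]  # no sudden drop observed: keep them all

# Toy curve mimicking the DIPD pattern: high up to 11 FPCs, then a sharp fall.
toy_curve = {9: 0.95, 10: 0.96, 11: 0.96, 12: 0.85, 13: 0.84}
cutoff = scree_cutoff(toy_curve)   # cutoff == 11
```

As with the classic scree plot, the threshold is best chosen by eye from the plot rather than fixed in advance; the function above only automates an obvious elbow.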
The choice of the number of FCTs in the forest, on the other hand, is not very different from the classical case. The basic idea is that the number of FCTs needs to be sufficiently large to stabilize the error rate. More FCTs provide more robust and stable error estimates and variable importance measures; however, the computation time grows linearly with the number of FCTs. Therefore, once a certain threshold in the number of FCTs is exceeded, the improvement of the classifier decreases as the number of FCTs increases, i.e. at a certain point, the benefit in prediction performance from learning more FCTs becomes lower than the cost in computation time of learning them. A further consideration is that the optimal number of FCTs in the FRF depends on the number of curves in the dataset: the more curves in the data, the more FCTs are needed. Therefore, the practical solution for choosing the number of FCTs in the forest can be based on the plot of the average accuracy of the FRF-FPCs classifier, computed with the OOB estimation on the training set, versus the number of FCTs in the forest.

Fig. 11 Accuracy vs number of FPCs, and accuracy vs number of functional classification trees

Figure 11(b) and all the corresponding images of the simulations in the supplementary material highlight that, from a certain number of FCTs onwards, the value of the average accuracy calculated with the OOB procedure applied to the training set (solid blue curve) stabilizes. Furthermore, the average accuracy calculated on the test set (solid cyan curve) generally stabilizes at an approximately equal number of FCTs. Consequently, a reasonable choice is to pick the number of FCTs (or a range of values) after which the plot of the average accuracy tends to flatten out.
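The flattening rule for the forest size can likewise be sketched in a few lines. This is an assumed formalization, not the authors' code: `tol` is a hypothetical tolerance for deciding when consecutive changes in the OOB accuracy curve are small enough to call the curve flat.

```python
# Sketch of the flattening rule: pick the smallest forest size from which the
# average OOB accuracy changes by less than `tol` between consecutive grid
# points (tol is an assumed tuning value).

def stabilization_point(acc_by_n_fcts, tol=0.005):
    """Smallest number of FCTs from which the accuracy curve stays flat."""
    items = sorted(acc_by_n_fcts.items())
    for i, (n, _) in enumerate(items):
        tail = items[i:]
        if all(abs(a2 - a1) < tol
               for (_, a1), (_, a2) in zip(tail, tail[1:])):
            return n
    return items[-1][0]

# Toy curve: accuracy rises quickly, then flattens from 30 FCTs onward.
toy = {10: 0.85, 20: 0.91, 30: 0.940, 40: 0.941, 50: 0.941}
n_fcts = stabilization_point(toy)   # n_fcts == 30
```

Requiring the whole tail of the curve to be flat, rather than a single small step, guards against picking a forest size inside a temporary plateau.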
Indeed, the maximum accuracy values on the test set always occur after the plot of the average accuracy calculated with the OOB on the training set approximately stabilizes.

Discussion and conclusions
In today's society, thanks to technological advances, we are able to collect huge amounts of data. Many of these data, especially in the medical, environmental, and industrial fields, come from sensors that produce high-frequency observations, e.g. for monitoring pollution, climate, heart activity, and brain activity. Consequently, in the last few decades, many methods have been developed to deal with this kind of data because traditional statistical methods can fail. One of the most important problems is the so-called curse of dimensionality, which is linked with many problematic aspects of dealing with high-dimensional data.
This research proposes a classification approach for high-dimensional data that combines FDA with some of the most recent tree-based machine learning techniques. In particular, we focused on the application of FCTs-FPCs, FBAG-FPCs, and FRF-FPCs to data expressed through curves. The idea of FRF is quite recent: few papers are available in the literature, and the approaches are considerably different. For example, Möller et al. (2016) proposed an approach to extract features based on the mean of the function within fixed intervals of the domain, whereas El Haouij et al. (2019) and Gregorutti et al. (2015) focused on the wavelet basis decomposition in specific frameworks. Our approach is quite different and concentrates on a data-driven basis, with the introduction of some new tools that help interpret results over the whole domain. The proposed approach has multiple objectives. The first is connected with the introduction of FDA, because it permits a robust dimensionality reduction with an interpretation linked to different parts of the time domain and an immediate understanding (Febrero-Bande and de la Fuente 2012). In addition, FDA provides additional information on the behaviour and variability of the curves over time (Ferraty and Vieu 2006). Finally, treating functions as single objects allows us to exploit concepts of similarity between statistical units, which are very attractive because they take into account only the essential characteristics of the functions (Ramsay and Silverman 2005).
The second objective is linked to the introduction of FCTs combined with FDA. FCTs are powerful supervised classification tools and have the double advantage of having good accuracy and easy interpretation. By extending this approach to the FDA through FPCs and introducing the concepts of FESP, FPSP, and functional variance of the leaves, a straightforward reading of the classification rules in the functional field is obtained.
Third, extending the FCT-FPCs approach to FBAG-FPCs and FRF-FPCs allows us to obtain excellent results in terms of accuracy and variance reduction. Indeed, the most remarkable result, beyond the functional interpretation and the additional tools proposed, is the performance of the FRF-FPCs classifier in terms of accuracy. Surprisingly, in many datasets, the FRF-FPCs exceeded the accuracy of all the other approaches.
Finally, the simulation studies lead to some interesting conclusions. The first FPC is quite important in distinguishing the groups only when they are well separated. In general, when the groups are not well separated, the following FPCs play a fundamental role. Moreover, as the number of classes to be recognized increases, so does the number of FPCs that are important in the FRF-FPCs. When the groups particularly overlap or sub-patterns exist within a group, the FRF-FPCs classifies better than the other methods.
In examining the results of the application to the DIPD data, we also compared the FRF-FPCs scheme with a classification based on B-splines. The latter was only mentioned in the methodological part because the extension of Eq. 1 is quite immediate. The central point is that a functional random forest based on a fixed basis system is not as efficient as one based on a data-driven basis, for essentially three reasons. First, a data-driven basis system adapts better, by definition, to our data, capturing the variability and, therefore, the essential information we need. Second, the interpretation of a fixed basis system is less natural and more complicated when considering the time domain. Finally, the accuracy of the functional classifier based on a fixed basis is definitely lower than that of the approach using a data-driven basis.
The proposed line of research, i.e. combining FDA and tree-based supervised classification methods, is exciting and promising, and many future developments are possible. The most immediate is to extend this approach to different types of basis systems, such as Fourier or wavelet bases, or to consider other types of distance or new interpretative tools. Another challenging line of research would undoubtedly be to improve the interpretation of the FRF-FPCs by building a representative FCT. This plan is crucial if we aim to fully exploit the interpretative power of the ensemble of functional classification trees. Indeed, each FCT-FPCs has its own classification rule, and we take the majority vote to predict the class of a new statistical unit. It would therefore be attractive to reach a consensus rule that allows us to understand how most FCTs-FPCs "think".