Abstract
In the era of big data, an ever-growing volume of information is recorded, either continuously over time or sporadically at distinct time points. Functional Data Analysis (FDA) stands at the cutting edge of this data revolution, offering a powerful framework for handling and extracting meaningful insights from such complex datasets. Currently available FDA methods often encounter challenges, especially when dealing with curves of varying shapes, largely because these methods depend strongly on data approximation as a key step of the analysis. In this work, we propose a free knots spline estimation method for functional data with two penalty terms and demonstrate its performance by comparing the results of several clustering methods on simulated and real data.
1 Introduction
In recent years, there has been a rapid increase in the availability of large data resulting from the observation of a phenomenon along a continuous domain, such as time, space, or frequency. The growing need to analyze data of this type, characterized by an intrinsic functional nature, is the basis for the development of the field of Functional Data Analysis (FDA) (Ferraty and Vieu 2006; Ramsay and Silverman 1997). Functional data analysis typically involves two key steps: first, representing discrete observations as functions, and then applying statistical methods. The functional representation step helps to reduce noise and fluctuations in the data and plays a key role in FDA because it lays the foundation for all subsequent statistical analyses. Several methods and techniques can be employed to perform functional representation; common approaches include smoothing splines, wavelet analysis, Fourier analysis, principal component analysis, and interpolation methods. The choice of the functional representation method depends on the nature of the data, the specific goals of the analysis, and the assumptions made about the underlying structure of the functions. Each method has its advantages and limitations, and the selection should be based on the characteristics of the data and the objectives of the analysis.

To address the efficient representation and analysis of functional data with various shapes, we focus on the problem of representing functional data using free knots spline estimation. In recent years, this problem has been addressed by several authors. Gervini (2006) proposes free knots regression spline estimators for the mean and the variance components of a sample of curves, showing that free knots splines estimate salient features of the functions (such as sharp peaks) more accurately than smoothing splines.
Inspired by Bayesian modelling, Denison et al. (2002) and Dimatteo et al. (2001) propose estimating various curves using piecewise polynomials. The former establishes a joint distribution over the number and positions of the knots defining the piecewise polynomials and employs reversible jump Markov Chain Monte Carlo methods for posterior computation. The latter, based on a chain marginalised over the knot number and locations, provides a method for inference about the regression coefficients, and functions of them, in both normal and non-normal models. Among others, a novel knot selection algorithm for nonparametric regression using regression splines is introduced by Ferraty and Vieu (2006). Recently, Basna et al. (2022), building upon a previous proposal of Nassar and Podgórski (2021), introduce a data-driven approach that uses a machine learning-style algorithm to select knots efficiently and employs orthogonal spline bases (splinets) to represent sparse functional data. In this work, generalizing the 'balanced-discrepancy principle' proposed by Sivananthan (2016) to the functional data analysis framework, we propose a simple double penalization criterion to improve the smoothing process with free knots spline approximation for various types of curve shapes. This approach is designed to improve the precision and overall effectiveness of fitting spline curves to diverse datasets. The paper is organized as follows. Section 2 introduces the mathematical foundations of smoothing with a roughness penalty. In Sect. 3, we illustrate free knots spline estimation and its construction with a double penalization criterion. In Sect. 4 we illustrate the performance of the proposed penalization criterion on simulated data, conducting an extensive comparative analysis to evaluate which approximation method is most suitable for functional data in terms of clustering performance.
Section 5 shows the performance of a clustering method on COVID-19 pandemic data for 30 countries by comparing the free knots spline with and without the double penalization criterion.
2 Smoothing functional data
Let \(\{x_i(t), \; i = 1,..., n\}\) be a sample of real-valued functions on a compact interval T related to a functional variable X. The sample curves can be considered realizations of a stochastic process \(X = \{X(t): t \in T \}\) whose sample functions belong to the Hilbert space \(L^2(T )\) of square integrable functions with the usual inner product \((f,g) = \int _T f(s) g (s) ds\).
In real applications, \(x_i(t)\) often cannot be observed directly, but may be collected discretely over time points \(\{t_1,...,t_h\} \subset T\). FDA aims to reconstruct the true functions from the observed data using the basis function approach, which reduces the estimation of the function \(x_i(t)\) to the estimation of a finite linear combination of known basis functions \(\phi _k(t), \; k=1 \ldots n_{\mathcal {B}}\),

\( x_i(t) = \sum _{k=1}^{n_{{\mathcal {B}}}} c_k^{i}\, \phi _k(t), \)
where \({{\textbf {c}}}^i \in {\mathbb {R}}^{n_{{\mathcal {B}}}}\) is a vector of coefficients. The choice of the basis and of the dimension \(n_{{\mathcal {B}}}\) depends strongly on the nature of the data. The best-known bases are the Fourier basis, B-splines, polynomials, exponentials, and wavelets. The Fourier basis is used when the data have a cyclical nature, the exponential basis when the data show exponential growth, and the B-spline basis is the most common choice when the data do not have a strong cyclical trend. There are different ways of obtaining the basis coefficients depending on the kind of observations. Generally, observed data are contaminated by random noise, which can be viewed as random fluctuations around a smooth trajectory, or as actual errors in the measurements.
The observed data are collected in a matrix \({\textbf {Y}} \in {\mathbb {R}} ^ {h \times n}\) whose elements are:

\( y_{ji} = x_i(t_j) + \epsilon _{ji}, \quad j=1,\ldots ,h, \; i=1,\ldots ,n, \)
where \(\varvec{\epsilon }_{i}\) is an unobserved error term whose components are independent and identically distributed random variables with zero mean and constant variance \( \sigma ^2 \). In this case an appropriate way to estimate the basis coefficients from the data is by using a least squares approximation

\( \min _{{\varvec{C}}} \Vert {\textbf {Y}} - \varvec{\Phi }^T {\varvec{C}} \Vert _F^2, \)
with \( \varvec{\Phi } \in {\mathbb {R}}^{{n_{{\mathcal {B}}} \times h }}\), \({\varvec{C}} \in {\mathbb {R}}^{ {n_{{\mathcal {B}}} } \times n}\), where \(\phi _{kj}=\phi _k({{\textbf {t}}}_j)\) and \(c_{ki}\) is the kth component of \({{\textbf {c}}}^{i}\).
The smoothness is implicitly controlled by the number of basis functions, \({n_{{\mathcal {B}}}}\). If we assume that \(n_{{\mathcal {B}}} \le h\) and \(rank(\varvec{\Phi })=n_{{\mathcal {B}}}\), then

\( \hat{{\varvec{C}}} = (\varvec{\Phi } \varvec{\Phi }^T)^{-1} \varvec{\Phi } {\textbf {Y}}, \)
and the matrix \( \hat{ {\textbf {Y}}} \) of the approximated values is:

\( \hat{{\textbf {Y}}} = \varvec{\Phi }^T (\varvec{\Phi } \varvec{\Phi }^T)^{-1} \varvec{\Phi } {\textbf {Y}}. \)
Increasing \({n_{{\mathcal {B}}}}\) leads to overfitting and generates fitted curves that are overly wiggly. One way to overcome this drawback is to use a roughness penalty term.
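As a concrete illustration, the unpenalized least squares fit described above can be sketched with SciPy's B-spline tools. The test function, noise level, and uniform knot grid below are hypothetical choices made for the sketch, not values taken from the paper.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

# Hypothetical example: recover a smooth curve from noisy discrete samples.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)                   # observation points t_1..t_h
x_true = np.sin(2.0 * t) * np.log(t + 0.5)      # unknown smooth trajectory
y = x_true + rng.normal(0.0, 0.1, t.size)       # noisy observations

order = 4                                        # cubic B-splines (order = degree + 1)
interior = np.linspace(t[0], t[-1], 12)[1:-1]    # 10 uniform interior knots
knots = np.concatenate(([t[0]] * order, interior, [t[-1]] * order))

spl = make_lsq_spline(t, y, knots, k=order - 1)  # least-squares coefficients c
x_hat = spl(t)                                   # fitted values, i.e. Phi^T c
```

Here the number of basis functions is the length of the knot vector minus the spline order, so increasing the interior grid plays the role of increasing \(n_{{\mathcal {B}}}\).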
2.1 Smoothing functional data with a roughness penalty
The roughness penalty approach defines a measure of the roughness of the fitted function x using the differential operator of order \(m \ge 1\)

\( PEN_m(x) = \int _T [D^m x(t)]^2 \, dt, \)
where \(D^m x(t) = \frac{d^m}{dt^m} x(t) \). This allows one to measure the closeness of x(t) to a polynomial of order m. The most commonly used roughness penalty has \(m=2\), which keeps the curvature of the curve under control, that is, it controls the variability of the slope of the curve (O’Sullivan 1986).
Often, there is a need for a wider class of measures of deviation. Especially when there is periodicity in the data or an exponential trend, it is not sufficient to use the integrated squared mth derivative, because it can only penalize deviations from polynomials. More generally, a measure of roughness is then given by

\( PEN_L(x) = \int _T [L x(t)]^2 \, dt, \)
where L is the linear differential operator (Ramsay and Silverman 1997)

\( Lx(t) = \alpha _0(t)\, x(t) + \alpha _1(t)\, D x(t) + \cdots + \alpha _m(t)\, D^m x(t), \)
where \(\alpha _0(t),...,\alpha _m(t) \) are coefficient functions depending on t. Obviously \(PEN_m(x)\) is a special case of \(PEN_L(x)\) with \(\alpha _0(t)=\alpha _1(t)=...=\alpha _{m-1}(t)=0\) and \(\alpha _m(t)=1\). In the following we will assume the \(\alpha _i(t)\) to be constant. Then, the penalized least squares approximation is given by

\( \min _{{\varvec{C}}} \Vert {\textbf {Y}} - \varvec{\Phi }^T {\varvec{C}} \Vert _F^2 + \lambda \, \textrm{tr}({\varvec{C}}^T {{\textbf {R}}}_L {\varvec{C}}), \)
with \({{\textbf {R}}}_L=\alpha _0 {{\textbf {R}}}_{0}+\alpha _1 {{\textbf {R}}}_{1}+...+\alpha _m {{\textbf {R}}}_{m}\) the discretization of \(PEN_L(x)\), where \( {{\textbf {R}}}_{l} \in {\mathbb {R}}^{{n_{{\mathcal {B}}}} \times {n_{{\mathcal {B}}}}}\) (\(0\le l\le m)\):

\( ({{\textbf {R}}}_{l})_{jk} = \int _I D^l \phi _j(t)\, D^l \phi _k(t) \, dt, \)
where I is a suitable interval containing the data. The regularization parameter \( \lambda >0\) manages the compromise between fidelity to the data and the roughness of the function: the smaller \( \lambda \) is, the closer the estimate is to the least squares estimate, tending to interpolate the observed points; the larger \( \lambda \) is, the flatter the smooth function tends to be. This parameter can be calibrated through generalised cross validation (GCV) (Ramsay and Silverman 1997). In this paper we consider the first- and second-order roughness terms in order to control both the trend and the variability of the slope of the curve, and to avoid placing an excessive burden on the cost of the penalty matrix. Then the penalized least squares problem becomes

\( \min _{{\varvec{C}}} \Vert {\textbf {Y}} - \varvec{\Phi }^T {\varvec{C}} \Vert _F^2 + \lambda _1 \, \textrm{tr}({\varvec{C}}^T {{\textbf {R}}}_{1} {\varvec{C}}) + \lambda _2 \, \textrm{tr}({\varvec{C}}^T {{\textbf {R}}}_{2} {\varvec{C}}). \qquad (7) \)
The solution of (7) can be computed by solving the linear system arising from the first-order optimality conditions,

\( (\varvec{\Phi } \varvec{\Phi }^T + \lambda _1 {{\textbf {R}}}_{1} + \lambda _2 {{\textbf {R}}}_{2})\, {\varvec{C}} = \varvec{\Phi } {\textbf {Y}}. \)
The expression for the fitted data \(\hat{{\textbf {Y}}}\) is:

\( \hat{{\textbf {Y}}} = \varvec{\Phi }^T (\varvec{\Phi } \varvec{\Phi }^T + \lambda _1 {{\textbf {R}}}_{1} + \lambda _2 {{\textbf {R}}}_{2})^{-1} \varvec{\Phi } {\textbf {Y}}. \)
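A minimal sketch of the doubly penalized solve is given below. As a simplifying assumption, the integral roughness matrices \({{\textbf {R}}}_{1}\), \({{\textbf {R}}}_{2}\) are replaced by discrete difference analogues (the P-spline simplification); in practice the integral-based matrices defined above would be used.

```python
import numpy as np

def penalized_coefficients(Phi, Y, lam1, lam2):
    """Solve (Phi Phi^T + lam1*R1 + lam2*R2) C = Phi Y, with R1, R2 built from
    first- and second-order coefficient differences as discrete stand-ins for
    the integral roughness matrices (an assumption of this sketch)."""
    n_b = Phi.shape[0]
    D1 = np.diff(np.eye(n_b), 1, axis=0)    # first-order difference matrix
    D2 = np.diff(np.eye(n_b), 2, axis=0)    # second-order difference matrix
    R1, R2 = D1.T @ D1, D2.T @ D2
    H = Phi @ Phi.T + lam1 * R1 + lam2 * R2
    return np.linalg.solve(H, Phi @ Y)      # coefficients C, one column per curve
```

For \(\lambda _1=\lambda _2=0\) the function reduces to the plain least squares solution of the previous section.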
3 Free knots spline
Thanks to their compact support and fast computation, as well as their ability to create smooth approximations of non-periodic data, B-splines are a common choice in the functional data framework. The approximation by splines can be significantly improved if knots are allowed to be free rather than fixed a priori at given locations (Gervini 2006).
It is well-known that the primary advantage of free knots spline over smoothing splines is their greater flexibility in modelling data, as they allow for a better adaptation of the curve’s shape to the specific characteristics of the data.
A free knots spline is a spline in which the positions of the knots are treated as parameters to be estimated from the data. Adjusting the positions of the knots allows the shape of the fitted function to adapt to that of the target function.
This section considers smooth estimators with knot selection for the approximation of a given curve. Existing approaches are based on individual smoothing of the sample curves, followed by cross-sectional averaging and the calculation of the covariance (Rice and Silverman 1991; Ramsay and Silverman 1997). These methods, however, do not borrow strength from the whole available dataset in the smoothing phase. A further drawback is that analytical expressions for optimal knot locations, or even for the general characteristics of optimal knot distributions, are not easy to derive. We introduce free knots spline estimators that avoid individual smoothing. We show that this approach, applied to the methods seen in the previous section (smoothing splines with one parameter and with two parameters), often produces better estimators than smoothing splines (Ramsay and Silverman 1997), in which knots are fixed a priori and equally spaced, at the cost of a modest increase in computational complexity. In this section we introduce the algorithm of optimal knot placement. Given a vector of knots \(\varvec{\tau } \in \Re ^p\) with \(a<\tau _1<\tau _2<\ldots<\tau _p<b\), the Jupp transformation \({\textbf {k}}\) of \(\varvec{\tau }\), \( {\textbf {k}}=J( \varvec{\tau })\), is defined componentwise as

\( k_i = \log \frac{\tau _{i+1}-\tau _i}{\tau _i-\tau _{i-1}}, \quad i=1,\ldots ,p, \)
where \( {\tau }_0 = a\) and \({\tau }_{p+1} = b \). This one-to-one transformation maps constrained, increasing knot vectors \(\varvec{\tau }\) to unconstrained, unordered vectors \({\textbf {k}}\), which has some practical and theoretical advantages (Jupp 1978).
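The transformation and its inverse (needed to map optimized vectors back to knots) can be sketched as follows, using the componentwise log-ratio form given above:

```python
import numpy as np

def jupp(tau, a, b):
    """Jupp transform: ordered interior knots tau in (a, b) -> unconstrained k,
    with k_i = log((tau_{i+1} - tau_i) / (tau_i - tau_{i-1}))."""
    full = np.concatenate(([a], np.asarray(tau, float), [b]))
    d = np.diff(full)                       # consecutive knot gaps, all > 0
    return np.log(d[1:] / d[:-1])

def jupp_inverse(k, a, b):
    """Invert the transform: rebuild gaps via d_{j+1} = d_j * exp(k_j),
    then rescale so that the gaps sum to b - a."""
    d = np.concatenate(([1.0], np.exp(np.cumsum(k))))
    d *= (b - a) / d.sum()
    return a + np.cumsum(d)[:-1]
```

Equally spaced knots map to \({\textbf {k}}={\textbf {0}}\), so the unconstrained origin corresponds to the uniform knot layout.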
Note that for each fixed set of knots, the class of such splines is a linear space of functions with \((p+r) = {n_{{\mathcal {B}}}}\) free parameters, where r is the order of the spline and p is the number of interior knots.
Let \(\varvec{\phi }({\textbf {t}},{\textbf {k}}) \in {\mathbb {R}}^{{n_{{\mathcal {B}}}}}\) be the vector of B-spline basis functions corresponding to a set of knots \({\textbf {k}}\) and \(\varvec{\Phi }({\textbf {k}})\) the \(h \times {n_{{\mathcal {B}}}} \) matrix whose jth row is \( \varvec{\phi }({\textbf {t}}_j,{\textbf {k}})^T\). We find the coefficients of the linear expansion and the vector of the optimal knots by minimizing the penalized least squares problem (Note 1):

\( \min _{{\varvec{C}},\,{\textbf {k}}} \Vert {\textbf {Y}} - \varvec{\Phi }({\textbf {k}}) {\varvec{C}} \Vert _F^2 + \lambda _1 \, \textrm{tr}({\varvec{C}}^T {{\textbf {R}}}_{1}({\textbf {k}}) {\varvec{C}}) + \lambda _2 \, \textrm{tr}({\varvec{C}}^T {{\textbf {R}}}_{2}({\textbf {k}}) {\varvec{C}}). \qquad (10) \)
An alternating minimization algorithm can be used to solve problem (10); at each iteration the problem is split into two optimization subproblems. A closed-form solution can be obtained for the minimization with respect to \({\textbf {C}}\) with \({\textbf {k}}\) fixed. The optimal value can be obtained by solving the system with coefficient matrix

\( {\textbf {H}}({\textbf {k}}) = \varvec{\Phi }({\textbf {k}})^T \varvec{\Phi }({\textbf {k}}) + \lambda _1 {{\textbf {R}}}_{1}({\textbf {k}}) + \lambda _2 {{\textbf {R}}}_{2}({\textbf {k}}). \)
Proposition 1
If \(\varvec{\Phi }({\textbf {k}})\) has full rank, then the matrix \({\textbf {H({\textbf {k}})}}\) is SPD; moreover, letting \(\sigma ({\textbf {H}})\) denote the set of eigenvalues of the matrix \({\textbf {H({\textbf {k}})}}\), we have \(\sigma ({\textbf {H}})\subseteq [\sigma ^-, \sigma ^+]\), with
Proof
\({\textbf {H({\textbf {k}})}}\) is the sum of \(\varvec{\Phi }({\textbf {k}})^T \varvec{\Phi }({\textbf {k}})\), which is SPD, and of other matrices that are nonnegative definite (O’Sullivan 1986); the first statement then follows. For the second statement we recall that if \({\textbf {A}}\) and \({\textbf {B}}\) are real symmetric matrices, then \({\textbf {A + B}}\) has real eigenvalues, and every eigenvalue \(\lambda ({\textbf {A+B}})\) satisfies

\( \lambda _{min}({\textbf {A}}) + \lambda _{min}({\textbf {B}}) \le \lambda ({\textbf {A+B}}) \le \lambda _{max}({\textbf {A}}) + \lambda _{max}({\textbf {B}}). \)
\(\square \)
Proposition 1 suggests using a Cholesky factorization to compute

\( \hat{{\varvec{C}}}({\textbf {k}}) = {\textbf {H}}({\textbf {k}})^{-1} \varvec{\Phi }({\textbf {k}})^T {\textbf {Y}}. \)
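A minimal sketch of the SPD solve via Cholesky factorization, as suggested by the proposition (numerically stable and roughly half the cost of a general LU solve):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_spd(H, B):
    """Solve H C = B for a symmetric positive definite matrix H by computing
    the Cholesky factor of H once and back-substituting for all columns of B."""
    return cho_solve(cho_factor(H), B)
```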
The minimization with respect to \({\textbf {k}}\) with \({\textbf {C}}\) fixed gives the following nonlinear optimization problem:
To solve (11), we apply a knot addition algorithm that produces knot sequences of increasing dimension. We define the functionals \(f_j: {\mathbb {R}}^j \rightarrow {\mathbb {R}} \) as follows:
We now present the gradual node addition algorithm (Gervini 2006).
Gradual node addition algorithm

Initialization

- Choose an ordered grid \( F_1 = \{s_1^1,...,s_N^1\} \subset (a,b) \).
- Compute \( J_1 = \{J(\{s_1^1\}),...,\ J(\{s_N^1\}) \} \).
- Find \(\tilde{{\textbf {k}}}_1 = \underset{{\textbf {k}} \in J_1}{{\text {argmin}}} f_1({\textbf {k}}) \).
- Compute \(\hat{{\textbf {k}}}_1\), the solution of (11) with \(p=1\), using the Gauss-Newton algorithm with \( \tilde{{\textbf {k}}}_1 \) as the starting point.
- Set \( \hat{\varvec{\tau }}_1 = J^{-1} (\hat{{\textbf {k}}}_1) \).

Forward addition

For \(i=2,...,p\):

- Choose an ordered grid \( F_i = \{s_1^i,...,s_N^i\} \subset (a,b) \).
- Compute \( J_i = \{ J \bigl ( \hat{{\varvec{\tau }}}_{i-1}\bigcup \{s_1^i\} \bigr ),..., J\bigl ( \hat{{\varvec{\tau }}}_{i-1}\bigcup \{s_N^i\} \bigr ) \} \).
- Find \(\tilde{{\textbf {k}}}_i = \underset{{\textbf {k}} \in J_i}{{\text {argmin}}} f_i({\textbf {k}}) \).
- Compute \(\hat{{\textbf {k}}}_i\), the solution of (11) with \(p=i\), using the Gauss-Newton algorithm with \( \tilde{{\textbf {k}}}_i \) as the starting point.
- Set \( \hat{\varvec{\tau }}_i = J^{-1} (\hat{{\textbf {k}}}_i) \).
Although there is no guarantee that this (or any other) algorithm will find the global minimizer of (11), we have found that it works well in practice. In our simulations and examples, knots have been added in the "right" order (Gervini 2006; Jupp 1978). This matters for model selection, since the optimal number of knots \(p\) is never known in practice and is chosen on the basis of the intermediate knot sequences.
4 Computational experiments
In this section, we present some computational results obtained with our algorithm on both synthetic data and data from a real-world application. More precisely, we present applications of three different clustering methods to evaluate the benefits, in terms of cluster detection, of free knots spline estimation with two penalty terms with respect to free knots splines and free knots splines with one penalty term. To simplify the notation, we will use FS0 to indicate the traditional free knots spline approximation, FS1 the free knots spline approximation with a single penalty term on the second derivative, and FS2 the free knots spline approximation with a double penalty term.
We consider the classical k-means method for functional data (Hartigan and Wong 1979), a model-based clustering method (Bouveyron et al. 2014), and a hierarchical agglomerative clustering method (Ramsay et al. 2015; Tucker et al. 2013), which we will refer to by the respective R functions kmeans.fd, funFEM, and fdahclust.
4.1 Synthetic datasets
In the simulated scenario, four groups of 50 functions each were generated as in Chen et al. (2014) using the following functions as average functions
-
\( y(t)= -2 \sin (t-1)\log (t+0.5)\),
-
\( y(t)= 2\cos (t) \log (t+0.5) \),
-
\( { y(t)= -0.3 t^2 e^{-0.5 t} \sin (t - 0.5)}\),
-
\( {y(t)= 1.5 e^{-0.3 t} \log (t + 1) \cos (2t) } \),
i.e. functions that resemble a negative sine (or cosine) wave with a regular oscillation period, multiplied either by the natural logarithm of \((t + 0.5)\), which modulates the amplitude of the curve, causing it to progressively decrease as t increases, or by products of exponential, power, and logarithmic factors, which make the curves strongly nonlinear and intricate.
Errors drawn from a normal distribution with zero mean are added to each curve; the random errors are generated so as to incorporate heterogeneity in the variance along the functional curves. Each function was sampled at a common set of 50 randomly chosen points in the interval [0, 5].
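The simulated scenario can be sketched as follows; the four mean functions are those listed above, while the heteroscedastic noise profile \(0.1\,(1 + t/5)\) is an assumption of the sketch, since the exact variances are not reported.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 5.0, 50))          # common set of 50 random points in [0, 5]

means = [                                        # the four group mean functions
    lambda t: -2.0 * np.sin(t - 1.0) * np.log(t + 0.5),
    lambda t: 2.0 * np.cos(t) * np.log(t + 0.5),
    lambda t: -0.3 * t**2 * np.exp(-0.5 * t) * np.sin(t - 0.5),
    lambda t: 1.5 * np.exp(-0.3 * t) * np.log(t + 1.0) * np.cos(2.0 * t),
]

# 50 noisy curves per group; noise std grows along t to mimic the
# heterogeneity of variance described in the text (assumed profile).
curves = np.array([
    m(t) + rng.normal(0.0, 0.1 * (1.0 + t / 5.0), t.size)
    for m in means for _ in range(50)
])                                               # shape (200, 50)
```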
Each group, as can be seen in Fig. 1, is characterized by differences in terms of amplitude, shape, and complexity.
The first step in approximating the data was to determine the regularization parameters through cross validation: for FS2 we chose the values \( \lambda _1 = 10^{-7}\) and \(\lambda _2 = 10^{-5}\), obtained by minimizing the GCV with respect to \(\lambda _1\) and \(\lambda _2\) on an \(L \times L\) grid with \( L= \{10^{l }, \; l=-8,\ldots ,4 \} \), and \(\lambda _2 = 10^{-4}\) for FS1.
The number of basis functions for the approximation, selected via cross validation, is set to 22. This simulation compares FS0, FS1 and FS2. The graphical results are shown in Fig. 2, and the numerical results are presented in Table 1.
The performances of the different approximation methods were measured using the classic Integrated Sum of Squared Errors (ISSE) and a local version of it defined on the tails of the curves. The ISSE is a statistical metric used to assess the goodness of fit of a regression or interpolation model to observed data. It is particularly useful in scenarios where the dependent variable is a continuous function of an independent variable, as in functional data analysis. Formally, it can be expressed as:

\( ISSE = \int _a^b [y(t) - \hat{y}(t)]^2 \, dt, \)
where:
-
y(t) is the observed value of the dependent variable at time t,
-
\({\hat{y}}(t)\) is the predicted value of the dependent variable at time t,
-
[a, b] is the integration interval over the entire data range.
Similarly to the traditional ISSE, we define a Local ISSE (ISSE\(_{inf}\), ISSE\(_{sup}\)) with the aim of quantifying the fit in the initial and final portions of the curves.
The expression of the ISSE is the same, but the boundaries of the integration interval are adjusted to reflect the specific regions of interest. This allows one to focus on the initial and final tails of the curve rather than the entire curve. The choice of the regions for calculating the Integrated Sum of Squared Errors depends on the analysis objectives and on the nature of the functional data or curves under examination. In our context, cross validation was used to test the model on different parts of the curves and to identify the regions in the tails. Degrees of freedom (df), the Integrated Sum of Squared Errors, and the GCV scores are used to evaluate the quality of both the overall and the local model fit. The results in Table 1 show that the lowest ISSE values were obtained using the free knots splines incorporating two regularization terms, both on the entire interval and on the tails.
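From sampled curves, the ISSE and its local variants can be approximated with the trapezoidal rule; the 10% tail fractions shown in the comments are an illustrative assumption, since the paper selects the tail regions by cross validation.

```python
import numpy as np

def isse(y, y_hat, t, a=None, b=None):
    """Integrated sum of squared errors over [a, b], approximated from the
    sampled curves with the trapezoidal rule; a and b default to the full
    observation range, and restricting them gives the local (tail) variants."""
    a = t[0] if a is None else a
    b = t[-1] if b is None else b
    mask = (t >= a) & (t <= b)
    e2 = (y[mask] - y_hat[mask]) ** 2
    tt = t[mask]
    return 0.5 * np.sum(np.diff(tt) * (e2[1:] + e2[:-1]))

# Tail versions, e.g. on the first and last 10% of the domain (assumed cuts):
# ISSE_inf: isse(y, y_hat, t, b=t[0] + 0.1 * (t[-1] - t[0]))
# ISSE_sup: isse(y, y_hat, t, a=t[-1] - 0.1 * (t[-1] - t[0]))
```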
The advantages obtained by using the dual penalty approach compared to free knots spline are even more evident if we look at the clustering.
To determine the number of clusters, we used the Elbow method. In cluster analysis, the Elbow method is a heuristic used to identify the optimal number of clusters in a given dataset. It involves plotting the explained variation as a function of the number of clusters and selecting the 'elbow point' of the curve as the optimal number of clusters. The results in Fig. 3 show that, according to the index values, the optimal cluster number is c = 4. By applying kmeans.fd clustering to the functional data, we can ascertain the number of curves assigned to each cluster for the methods under consideration.
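The Elbow heuristic can be sketched as follows. The synthetic coefficient vectors below (4 well-separated groups of 50 vectors with 22 components, matching the basis dimension in the text) are a hypothetical stand-in for the fitted basis coefficients of the real curves.

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Hypothetical stand-in data: 4 well-separated groups of coefficient vectors.
rng = np.random.default_rng(2)
centers = rng.normal(0.0, 5.0, (4, 22))
X = np.vstack([c + rng.normal(0.0, 0.3, (50, 22)) for c in centers])

# Mean within-cluster distortion for c = 1..8; the 'elbow' is where the
# decrease flattens, which for these data happens at c = 4.
distortion = [kmeans(X, c, seed=1)[1] for c in range(1, 9)]
```

In practice one plots `distortion` against c and picks the value after which further clusters yield only marginal improvement.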
As reported in Table 2 and in Fig. 4, many more misclassifications are observed for both FS0 and FS1 than for FS2.
The approximation that yields the best clusters is the one using FS2.
In conclusion, introducing two regularization terms into the free knots spline approximation method leads to better approximation results, which also benefit further analyses such as clustering. Indeed, when data exhibit very similar shapes, clustering typically works regardless of the approximation; the problem arises when dealing with data of highly dissimilar shapes, as in the simulated data considered here, which a conventional method fails to capture.
To consolidate the results obtained from clustering the curves approximated by the free knots spline, we show that no other functional data clustering method yields the same or better results. From now on we no longer consider FS1, since adding a single regularization term does not improve the analysis through the use of clustering.
The results presented in Table 3 for all the clustering methods show that the best classification is achieved by applying clustering to curves approximated with FS2. Nevertheless, none of the classification results is comparable to the classification obtained by the kmeans.fd method.
We can draw the same conclusions by evaluating the results obtained with the previous clustering methods over 150 simulations. In order to compare the performances, we use the Adjusted Rand index (ARI), measuring the agreement between the obtained partitions and the true group structure over the 150 simulations. Clustering can be thought of as a series of decisions whose goal is to group two individuals into the same cluster if and only if they are similar. A "true positive" (TP) decision correctly assigns two similar individuals to the same cluster, while a "true negative" (TN) decision correctly assigns two dissimilar individuals to different clusters. On the other hand, a "false positive" (FP) decision incorrectly assigns two dissimilar individuals to the same cluster, and a "false negative" (FN) decision incorrectly assigns two similar individuals to different clusters.
The Rand index (RI) is defined as:

\( RI = \frac{TP + TN}{TP + TN + FP + FN} \)
and falls within the range between 0 and 1. A value of 0 indicates that the two data clusterings do not agree on any pair of points, while a value of 1 means that the data clusterings are identical. However, the RI may not be close to 0 even when category labels are randomly assigned, which is a potential issue.
To address this problem, the Adjusted Rand index (ARI) is introduced, defined as:

\( ARI = \frac{RI - E[RI]}{\max (RI) - E[RI]}, \)

where \(E[RI]\) is the expected value of the RI under random labelling.
The ARI ranges between \(-1\) and 1, with values closer to 1 indicating better clustering results. From Table 4 we can observe that, for all the clustering methods, using FS2 in the functional approximation leads to a higher ARI compared to FS0.
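Both indices can be computed directly from the four pair counts defined above; the ARI line uses the pair-counting reformulation of the chance-corrected formula. The sketch assumes non-degenerate partitions (neither clustering puts all individuals in a single cluster).

```python
from itertools import combinations

def rand_indices(labels_a, labels_b):
    """Pair-counting Rand index RI = (TP + TN) / (TP + TN + FP + FN), and the
    chance-corrected ARI written directly in terms of the four pair counts
    (equivalent to (RI - E[RI]) / (max RI - E[RI]))."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            tp += 1                      # similar pair, grouped together
        elif not same_a and not same_b:
            tn += 1                      # dissimilar pair, kept apart
        elif same_b:
            fp += 1                      # dissimilar pair, grouped together
        else:
            fn += 1                      # similar pair, kept apart
    ri = (tp + tn) / (tp + tn + fp + fn)
    ari = 2.0 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return ri, ari
```

Note that the indices depend only on which pairs are grouped together, so relabelling the clusters leaves both values unchanged.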
4.2 Application: real datasets
In this section, we aim to provide further confirmation of the validity and competitiveness of the proposed method by examining a real-world dataset.
We examine the application of clustering to the COVID-19 dataset on "New cases in different countries from 2020 to 2021", specifically COVID-19 pandemic data for 30 countries. In this context, our goal is to demonstrate the feasibility of analyzing pandemic models through time series clustering, using free knots splines with two regularization terms to approximate the functional data and find taxonomies within the data. The data used in this paper were sourced from the World Health Organization (WHO) and the COVID-19 Data Hub (Guidotti 2022; Guidotti and Ardia 2020). These databases are known for their transparency, open accessibility, and high credibility, ensuring a high level of accuracy.
As a significant number of countries experienced substantial outbreaks in March 2020, we selected this month as the starting point for our sequential data. The endpoint of our dataset corresponds to November 30, 2021. As in Luo et al. (2023), to mitigate the impact of factors such as varying population size, land area, population density, and population mobility across countries, which can lead to significant differences in the magnitude of daily new COVID-19 cases, we first standardized the raw data as follows:

\( y_{i,t}^* = \frac{y_{i,t} - \bar{y}_{i}}{s_{i}}, \)
where \( y_{i,t}^*\) represents the normalized number of daily new cases in country \(i\) on day \(t\), \(y_{i,t}\) is the number of daily new cases in country \(i\) on day \(t\), \(\bar{y}_{i}\) is the mean of daily new cases in country \(i\) over the whole study period, and \(s_{i}\) is the standard deviation of daily new cases in country \(i\) over the whole study period.
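The standardization amounts to z-scoring each row of the countries-by-days matrix; the use of the sample standard deviation (`ddof=1`) is an assumption of the sketch.

```python
import numpy as np

def standardize_rows(Y):
    """Per-country z-scoring: y*_{i,t} = (y_{i,t} - mean_i) / s_i, applied to
    each row of the countries-by-days matrix Y (sample std is assumed)."""
    mu = Y.mean(axis=1, keepdims=True)
    sd = Y.std(axis=1, ddof=1, keepdims=True)
    return (Y - mu) / sd
```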
The data obtained can be seen in Fig. 5. The next step was to take as regularization parameters the values \( \lambda _1= 10^{-3}\) and \(\lambda _2= 10^{-7}\) for FS2 and \(\lambda _2= 10^{-4} \) for FS1, obtained through generalized cross-validation on a pre-established grid.
From Fig. 7 and Table 5, which provide comparisons between the three approximation methods, we can conclude that, as in the previous cases, the double penalization approximation performs better in the extreme regions, ensuring that valuable information is not lost.
The advantages we gain from using the double penalization approximation method are observed in the application of clustering. To select the number of clusters, we apply the Elbow method. From Fig. 6, we can determine that the appropriate number of clusters is 4.
The clustering method we use is kmeans.fd. The first analysis we perform is clustering on data approximated using FS2.
Figure 8 displays the time series of daily new cases for each country based on the clusters. We can visually observe that the model varies significantly for countries in different clusters.
We can observe that the pattern undergoes significant changes across countries within distinct clusters. The countries in Cluster 1 exhibit a consistent trend, followed by a sudden surge in cases from January 2022 to March 2022. Conversely, countries in Cluster 3 demonstrate a decline in new cases during the same period.
The countries in Cluster 2 show an increase in cases from the beginning of January 2021, followed by a decrease starting in February. During the same period, countries in Cluster 3 exhibited a consistent trend, followed by a rise in cases in March 2021. It can be observed that the pattern of pandemic development in Cluster 3 is often at the opposite pace to that of Cluster 4.
In general, each cluster of countries exhibits distinct characteristics that set their pandemic patterns apart from those of the other clusters.
Table 6 shows the specific clustering results for the 30 countries. We can see that all Cluster 1 and Cluster 4 countries are located in Asia, while most Cluster 3 countries are located in different parts of Europe.
For this reason one might think that geographical location influences the clustering. However, geographical proximity is not a decisive factor: consider, for instance, Cluster 2.
All the considerations made so far concern clustering on data approximated using FS2 (Fig. 8, upper panel, and Table 6). We can now analyze the results of clustering on data approximated using FS0 and FS1. Figure 8 (center and lower panels) and Tables 7 and 8, which are identical, display the different clusters for FS0 and FS1: the clusters are not well defined, and the curves are not correctly allocated to them, making the data challenging to read and interpret. The considerations previously made are no longer as clear in this case: in Clusters 1 and 2 we observe a similar trend, especially towards the end of 2021, and similar observations can be made for Clusters 3 and 4, which also exhibit similar trends. Therefore, in the case of FS0 and FS1, no specific characteristics emerge to characterize each cluster.
5 Conclusions
In this work, we considered the use of free knots splines in the context of functional data estimation, analyzing, in particular, the impact of the roughness regularization term. More specifically, we compared two different penalty regularization schemes, namely a standard scheme with a one-parameter roughness term controlling the boundedness of the variability of the function (O’Sullivan 1986) and a two-penalty scheme which controls both monotonicity and smoothness. Our numerical experiments indicate that, compared to the free knots spline without any penalty, little changes when using the one-parameter scheme, while the two-parameter scheme shows notable improvements. The most significant advantages, however, appear in the data analysis phase based on the functional approximation obtained. In particular, when applied to simulated data, our method shows an improved ability to recover the underlying groups, highlighting a clearer clustering structure. A promising direction emerges from the non-linear effects linked to the selection of the knots in the initial basis. The incorporation of machine learning methods to enhance the analytical approach opens a different perspective for continuous advancements in the domain of functional data analysis.
Data Availability
No datasets were generated or analysed during the current study.
Notes
Note that for \(\lambda _1=\lambda _2=0\) (10) becomes the classic free knots spline problem.
References
Basna, R., Nassar, H., Podgórski, K.: Data driven orthogonal basis selection for functional data analysis. J. Multivar. Anal. 189, 1048–1068 (2022). (ISSN 0047-259X)
Bouveyron, C., Côme, E., Jacques, J.: The discriminative functional mixture model for the analysis of bike sharing systems. HAL n.01024186 (2014)
Chen, H., Reiss, P.T., Tarpey, T.: Optimally weighted l2 distance for functional data. Biometrics 70, 516–25 (2014)
Denison, D.G.T., Mallick, B.K., Smith, A.F.M.: Automatic Bayesian curve fitting. J. R. Stat. Soc. Ser. B Stat Methodol. 60(2), 333–350 (2002). https://doi.org/10.1111/1467-9868.00128
Dimatteo, I., Genovese, C.R., Kass, R.E.: Bayesian curve-fitting with free-knot splines. Biometrika 88(4), 1055–1071 (2001). https://doi.org/10.1093/biomet/88.4.1055
Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York (2006). (ISBN 978-0-387-30369-7)
Gervini, D.: Free-knot spline smoothing for functional data. J. R. Stat. Soc. Ser. B 68(4), 671–687 (2006)
Guidotti, E.: A worldwide epidemiological database for COVID-19 at fine-grained spatial resolution. Sci. Data 9(1), 1–7 (2022)
Guidotti, E., Ardia, D.: COVID-19 data hub. J. Open Source Softw. 5(51), 2376 (2020)
Hartigan, J.A., Wong, M.A.: A k-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)
Jupp, D.L.B.: Approximation to data by splines with free knots. SIAM J. Numer. Anal. 15, 328–343 (1978)
Luo, Z., Liu, N., Zhang, L., Wu, Y.: Time series clustering of COVID-19 pandemic-related data. Data Sci. Manage. 6(2), 79–87 (2023). https://doi.org/10.1016/j.dsm.2023.03.003
Nassar, H., Podgórski, K.: Empirically driven orthonormal bases for functional data analysis. In: Vermolen, F.J., Vuik, C. (eds.) Numerical Mathematics and Advanced Applications, ENUMATH 2019—European Conference, Lecture Notes in Computational Science and Engineering, pp. 773–783. Springer Science and Business Media B.V., United States (2021). https://doi.org/10.1007/978-3-030-55874-1_76
O’Sullivan, F.: A statistical perspective on ill-posed inverse problems. Stat. Sci. 1(4), 502–518 (1986)
Ramsay, J.O., Silverman, B.: Functional Data Analysis. Springer, New York (1997)
Ramsay, J.O., Marron, J.S., Sangalli, L.M., Srivastava, A.: Functional data analysis of amplitude and phase variation. Stat. Sci. 30(4), 468–484 (2015)
Rice, J.A., Silverman, B.W.: Estimating the mean and covariance structure nonparametrically when the data are curves. J. R. Stat. Soc. 53(1), 233–243 (1991)
Sivananthan, S.: Multi-penalty regularization in learning theory. J. Complex. 36, 141–165 (2016)
Tucker, J.D., Wu, W., Srivastava, A.: Generative models for functional data using phase and amplitude separation. Comput. Stat. Data Anal. 61, 50–66 (2013)
Acknowledgements
This work was partially supported by the Italian Ministry of University and Research (MIUR), PRIN 2022 Project "Numerical Optimization with Adaptive Accuracy and Applications to Machine Learning", Grant no. 2022N3ZNAX, PNRR PRIN 2022 Project "A multidisciplinary approach to evaluate ecosystems resilience under climate change", Grant no. P2022WC2ZZ, and by Istituto Nazionale di Alta Matematica - Gruppo Nazionale per il Calcolo Scientifico (INdAM-GNCS).
Funding
Open access funding provided by Università degli Studi della Campania Luigi Vanvitelli within the CRUI-CARE Agreement.
Author information
Contributions
All the authors collaborated in writing the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
De Magistris, A., De Simone, V., Romano, E. et al. Roughness regularization for functional data analysis with free knots spline estimation. Stat Comput 34, 165 (2024). https://doi.org/10.1007/s11222-024-10474-w