1 Introduction

In recent years, there has been a rapid increase in the availability of large data sets resulting from the observation of a phenomenon along a continuous domain, such as time, space, or frequency. The growing need to analyze data of this type, characterized by an intrinsic functional nature, is the basis for the development of the field of Functional Data Analysis (FDA) (Ferraty and Vieu 2006; Ramsay and Silverman 1997). Functional data analysis typically involves two key steps: first, representing discrete observations as functions, and then applying statistical methods. The functional representation step helps to reduce noise and fluctuations in the data and plays a key role in FDA because it sets the foundation for subsequent statistical analyses. Several methods can be employed to perform the functional representation; common approaches include smoothing splines, wavelet analysis, Fourier analysis, principal component analysis, and interpolation methods. The choice of the functional representation method depends on the nature of the data, the specific goals of the analysis, and the assumptions made about the underlying structure of the functions. Each method has its advantages and limitations, and the selection should be based on the characteristics of the data and the objectives of the analysis. To address the efficient representation and analysis of functional data with various shapes, we focus on the problem of representing functional data using free knots spline estimation.

In recent years, the problem of representing functional data with free knots splines has been addressed by several authors. Gervini (2006) proposes free knots regression spline estimators for the mean and the variance components of a sample of curves, showing that free knots splines estimate salient features of the functions (such as sharp peaks) more accurately than smoothing splines. Inspired by a Bayesian model, Denison et al. (2002) and Dimatteo et al. (2001) propose estimating various curves using piecewise polynomials. The former establishes a joint distribution over the number and positions of knots defining the piecewise polynomials and employs reversible jump Markov Chain Monte Carlo methods for posterior computation. The latter, based on a marginalised chain on the knot number and locations, provides a method for inference about the regression coefficients and functions of them in both normal and non-normal models. Among others, a novel knot selection algorithm is introduced by Ferraty and Vieu (2006) for nonparametric regression using regression splines. Recently Basna et al. (2022), building upon a previous proposal of Nassar and Podgórski (2021), introduce a data-driven approach that uses a machine-learning-style algorithm to select knots efficiently and employs orthogonal spline bases (splinets) to represent sparse functional data.

In this work, generalizing the ‘balanced-discrepancy principle’ proposed by Sivananthan (2016) to the functional data analysis framework, we propose a simple double penalization criterion to improve the smoothing process with free knots spline approximation for various types of curve shapes. This approach is designed to improve the precision and overall effectiveness of fitting spline curves to diverse datasets.

The paper is organized as follows. Section 2 introduces the mathematical foundations of smoothing with a roughness penalty. In Sect. 3, we illustrate free knots spline estimation and its construction with a double penalization criterion. In Sect. 4 we illustrate the performance of the proposed penalization criterion on simulated data; an extensive comparative analysis is conducted to evaluate which approximation method is most suitable for functional data in terms of clustering performance. Section 5 shows the performance of a clustering method on COVID-19 pandemic data for 30 countries, comparing free knots splines and free knots splines with the double penalization criterion.

2 Smoothing functional data

Let \(\{x_i(t), \; i = 1,..., n\}\) be a sample of real-valued functions on a compact interval T related to a functional variable X. The sample curves can be considered realizations of a stochastic process \(X = \{X(t): t \in T \}\) whose sample functions belong to the Hilbert space \(L^2(T )\) of square integrable functions with the usual inner product \((f,g) = \int _T f(s) g (s) ds\).

In real applications, \(x_i(t)\) often cannot be observed directly, but is collected discretely over time points \(\{t_1,...,t_h\} \subset T\). FDA aims to reconstruct the true functions from the observed data using the basis function approach, which reduces the estimation of the function \(x_i(t)\) to the estimation of a finite linear combination of known basis functions \(\phi _k(t), \; k=1, \ldots, n_{\mathcal {B}}\)

$$\begin{aligned} {x}_i(t) = \sum _{k=1}^{n_{{\mathcal {B}}}} {\textbf {c}}_{k}^i {\phi }_k({t}), \, \, \, i = 1,...,n \end{aligned}$$

where \({{\textbf {c}}}^i \in {\mathbb {R}}^{n_{{\mathcal {B}}}}\) is a vector of coefficients. The choice of the basis and of its dimension \(n_{{\mathcal {B}}}\) depends strongly on the nature of the data. The best-known bases are the Fourier basis, B-splines, polynomials, exponentials, and wavelets. The Fourier basis is used when the data have a cyclical nature, the exponential basis when the data show exponential growth, and the B-spline basis is the most common choice when the data do not have a strong cyclical trend. There are different ways of obtaining the basis coefficients depending on the kind of observations. Generally the observed data are contaminated by random noise that can be viewed as random fluctuations around a smooth trajectory, or as actual errors in the measurements.

The observed data is collected in a matrix \({\textbf {Y}} \in {\mathbb {R}} ^ {h \times n}\) whose elements are:

$$\begin{aligned} y_{i,j}={x}_i({t}_j)+ \varvec{\epsilon }_{i}^j, \;\;\; j=1,...,h;\; i=1,...,n\end{aligned}$$

where the \(\varvec{\epsilon }_{i}^j\) are unobserved error terms, independent and identically distributed random variables with zero mean and constant variance \( \sigma ^2 \). In this case an appropriate way to estimate the basis coefficients from the data is a least squares approximation

$$\begin{aligned} \hat{{\textbf {C}}} = \underset{{{\textbf {C}}}}{{\text {argmin}}} \left\| {\textbf {Y}} -\varvec{\Phi }^T {\textbf {C}}\right\| ^2_2, \end{aligned}$$

with \( \varvec{\Phi } \in {\mathbb {R}}^{{n_{{\mathcal {B}}} \times h }}\) and \({\varvec{C}} \in {\mathbb {R}}^{ {n_{{\mathcal {B}}} } \times n}\), where \(\Phi _{kj}={\phi }_k({t}_j)\) and \(C_{ki}={{\textbf {c}}}_k^{i}\).

The smoothness is implicitly controlled by the number of basis functions, \({n_{{\mathcal {B}}}}\). If we assume that \(n_{{\mathcal {B}}} \le h\) and \(rank(\varvec{\Phi })=n_{{\mathcal {B}}}\) then

$$\begin{aligned} \hat{{\textbf {C}}} = {(\varvec{\Phi }\varvec{\Phi }^T)}^{-1} \varvec{\Phi }{} {\textbf {Y}}. \end{aligned}$$
(1)

and the matrix \( \hat{ {\textbf {Y}}} \) of the approximation values is:

$$\begin{aligned} \hat{{\textbf {Y}}} = \varvec{\Phi }^{T} \hat{{\textbf {C}}} = \varvec{\Phi }^{T} {(\varvec{\Phi }\varvec{\Phi }^T)}^{-1} \varvec{\Phi }{} {\textbf {Y}}. \end{aligned}$$
(2)

Increasing \({n_{{\mathcal {B}}}}\) leads to overfitting and generates an estimated curve \(x\) that is overly wiggly. One way to overcome this drawback is to use a roughness penalty term.
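As an illustration of Eqs. (1)-(2), the following Python sketch (our own illustrative helpers, not part of the original paper) evaluates a clamped B-spline basis at the observation points and computes the unpenalized coefficients and fitted values; note that the basis matrix is stored here as \(h \times n_{{\mathcal {B}}}\), i.e. the transpose of the \(\varvec{\Phi }\) used in the text.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(t_eval, knots, order=4):
    """Evaluate the n_B B-spline basis functions at the points t_eval.

    `knots` is the full clamped knot vector and order = degree + 1
    (order=4 gives cubic splines).  Returns an (h, n_B) matrix, i.e. the
    transpose of the Phi matrix used in the text."""
    degree = order - 1
    n_basis = len(knots) - order
    basis = np.empty((len(t_eval), n_basis))
    for k in range(n_basis):
        coef = np.zeros(n_basis)
        coef[k] = 1.0                      # isolate the k-th basis function
        basis[:, k] = BSpline(knots, coef, degree)(t_eval)
    return basis

def ols_smooth(Y, t_obs, knots, order=4):
    """Least-squares coefficients (Eq. 1) and fitted values (Eq. 2).

    Y has shape (h, n): one column per observed curve."""
    Phi = bspline_basis(t_obs, knots, order)          # (h, n_B)
    C_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)   # least squares problem of Eq. (1)
    return C_hat, Phi @ C_hat                         # fitted values, Eq. (2)
```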

2.1 Smoothing functional data with a roughness penalty

The roughness penalty approach defines a measure of the roughness of the fitted function x using the differential operator of order \(m \ge 1\)

$$\begin{aligned} PEN_m(x) = \int {\left[ D^mx(s)\right] }^2 ds. \end{aligned}$$
(3)

where \(D^m x(t) = \frac{d^m}{dt^m} x(t) \). This allows one to measure the closeness of x(t) to a polynomial of order m. The most commonly used roughness penalty corresponds to \(m=2\), which keeps the curvature of the curve under control, that is, the variability of its slope (O’Sullivan 1986).

Often, there is a need for a wider class of measures of deviation. In particular, when the data exhibit periodicity or an exponential trend, the integrated squared mth derivative is not sufficient, since it can only penalize deviations from polynomials. More generally, a measure of roughness is then given by

$$\begin{aligned} PEN_L(x) = \int {\left[ Lx(s)\right] }^2 ds, \end{aligned}$$
(4)

where L is the linear differential operator defined as in Ramsay and Silverman (1997)

$$\begin{aligned} Lx(t)=\alpha _0(t) x(t)+\alpha _1(t) D^{1}x(t)+\cdots + \alpha _{m-1}(t) D^{m-1} x (t)+ D^{m}x(t), \end{aligned}$$

where \(\alpha _0(t),...,\alpha _m(t) \) are the coefficient functions depending on t. Obviously \(PEN_m(x)\) is a special case of \(PEN_L(x)\) with \(\alpha _0(t)=\alpha _1(t)=...=\alpha _{m-1}(t)=0\) and \(\alpha _m(t)=1\). In the following we will assume the \(\alpha _i(t)\) to be constant. Then, the penalized least squares approximation is given by

$$\begin{aligned} \hat{{\textbf {C}}} = \underset{{{\textbf {C}}}}{{\text {argmin}}} \left\| {\textbf {Y}} -\varvec{\Phi }^T {\textbf {C}}\right\| ^2_2 + \lambda {\textbf {C}}^T {{\textbf {R}}}_L {\textbf {C}}, \end{aligned}$$
(5)

with \({{\textbf {R}}}_L=\alpha _0 {{\textbf {R}}}_{0}+\alpha _1 {{\textbf {R}}}_{1}+\cdots +\alpha _m {{\textbf {R}}}_{m}\) the discretization of \(PEN_L(x)\), where the matrices \( {{\textbf {R}}}_{l} \in {\mathbb {R}}^{{n_{{\mathcal {B}}}} \times {n_{{\mathcal {B}}}}}\) (\(0\le l\le m\)) have entries:

$$\begin{aligned} ({{\textbf {R}}}_{l})_{ij} = \int _{I} D^l \phi _i(s) \, D^l \phi _j(s) \, ds, \end{aligned}$$
(6)

where I is a suitable interval containing the data. The regularization parameter \( \lambda >0\) manages the compromise between fidelity to the data and the roughness of the function: the smaller \( \lambda \) is, the closer the estimate is to the least squares estimate and the more it tends to interpolate the observed points; the larger \( \lambda \) is, the flatter the smooth function tends to be. This parameter can be calibrated through generalised cross validation (GCV) (Ramsay and Silverman 1997). In this paper we consider the first- and second-order roughness terms, in order to control both the trend and the variability of the slope of the curve while avoiding an excessive cost for building the penalty matrix. The penalized least squares problem then becomes

$$\begin{aligned} \hat{{\textbf {C}}} = \underset{{{\textbf {C}}}}{{\text {argmin}}} \left\| {\textbf {Y}} -\varvec{\Phi }^T {\textbf {C}}\right\| ^2_2 + \lambda _1 {\textbf {C}}^T {{\textbf {R}}}_1 {\textbf {C}} + \lambda _2 {\textbf {C}}^T {{\textbf {R}}}_2 {\textbf {C}}, \qquad \lambda _1 \ge 0,\ \lambda _2 \ge 0. \end{aligned}$$
(7)

The solution of (7) can be computed by solving the linear system arising from the first-order optimality conditions

$$\begin{aligned} \hat{{\textbf {C}}} = {\left( \varvec{\Phi } \varvec{\Phi }^T + \lambda _1{\textbf {R}}_1 + \lambda _2 {\textbf {R}}_2\right) }^{-1} \varvec{\Phi } {\textbf {Y}}. \end{aligned}$$
(8)

The expression for the fitted data \(\hat{{\textbf {Y}}}\) is:

$$\begin{aligned} \hat{{\textbf {Y}}} = \varvec{\Phi }^T{\left( \varvec{\Phi } \varvec{\Phi }^T + \lambda _1{\textbf {R}}_1 + \lambda _2 {\textbf {R}}_2\right) }^{-1} \varvec{\Phi } {\textbf {Y}} \end{aligned}$$
(9)
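A possible numerical realisation of Eqs. (6)-(9) is sketched below: the roughness matrices \({\textbf {R}}_1\) and \({\textbf {R}}_2\) are approximated by trapezoidal integration of products of basis-function derivatives, and the penalized coefficients are obtained by solving the linear system of Eq. (8). This is an illustrative sketch reusing the hypothetical bspline_basis helper introduced above; curves are again stored along the columns of Y.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import trapezoid

def roughness_matrix(knots, order, deriv, grid):
    """R_l of Eq. (6): (R_l)_{ij} = integral of D^l phi_i(s) * D^l phi_j(s) ds,
    approximated with the trapezoidal rule on a fine grid."""
    degree = order - 1
    n_basis = len(knots) - order
    D = np.empty((len(grid), n_basis))
    for k in range(n_basis):
        coef = np.zeros(n_basis)
        coef[k] = 1.0
        D[:, k] = BSpline(knots, coef, degree).derivative(deriv)(grid)
    # integrate every product D^l phi_i * D^l phi_j over the grid at once
    return trapezoid(D[:, :, None] * D[:, None, :], grid, axis=0)

def penalized_smooth(Y, t_obs, knots, lam1, lam2, order=4):
    """Double-penalty estimate of Eq. (8) and fitted values of Eq. (9)."""
    Phi = bspline_basis(t_obs, knots, order)            # (h, n_B)
    grid = np.linspace(t_obs.min(), t_obs.max(), 400)
    R1 = roughness_matrix(knots, order, 1, grid)
    R2 = roughness_matrix(knots, order, 2, grid)
    H = Phi.T @ Phi + lam1 * R1 + lam2 * R2             # Eq. (8), transposed layout
    C_hat = np.linalg.solve(H, Phi.T @ Y)
    return C_hat, Phi @ C_hat                           # Eq. (9)
```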

3 Free knots spline

Thanks to their compact support and fast computation, as well as their ability to produce smooth approximations of non-periodic data, B-splines are a common choice in the functional data framework. Spline approximations can be significantly improved if the knots are allowed to be free rather than fixed a priori (Gervini 2006).

It is well known that the primary advantage of free knots splines over smoothing splines is their greater flexibility in modelling data, as they allow the shape of the curve to adapt better to the specific characteristics of the data.

A free knots spline is a spline in which the positions of the knots are treated as parameters to be estimated from the data. Adjusting the positions of the knots allows the shape of the fitted function to adapt to the target function.

This section considers spline estimators with knot selection for the approximation of a given curve. Existing approaches are based on individual smoothing of the sample curves, followed by cross-sectional averaging and the computation of the covariance (Rice and Silverman 1991; Ramsay and Silverman 1997). These methods, however, do not borrow strength from the whole dataset during the smoothing phase. A further drawback is that analytical expressions for optimal knot locations, or even for the general characteristics of optimal knot distributions, are not easy to derive. We introduce free knots spline estimators that avoid individual smoothing. We show that this approach, applied to the methods seen in the previous section (smoothing splines with one and with two penalty parameters), often produces better estimators than smoothing splines (Ramsay and Silverman 1997), in which the knots are fixed in advance and equally spaced, at the cost of a modest increase in computational complexity. We now introduce the optimal knot placement algorithm. Given a vector of knots \(\varvec{\tau } \in {\mathbb {R}}^p\) with \(a<\tau _1<\tau _2<\cdots<\tau _p<b\), the Jupp transformation \({\textbf {k}}\) of \(\varvec{\tau }\), \( {\textbf {k}}=J( \varvec{\tau })\), is defined componentwise as

$$\begin{aligned} {k}_i = \log \frac{\tau _{i+1} - \tau _i}{\tau _i - \tau _{i-1}}, \qquad i=1,...,p \end{aligned}$$

where \( {\tau }_0 = a\) and \({\tau }_{p+1} = b \). This one-to-one transformation maps constrained, increasing knot vectors \(\varvec{\tau }\) onto unconstrained, unordered vectors \({\textbf {k}}\), which has some practical and theoretical advantages (Jupp 1978).
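A minimal Python sketch of the Jupp transform and its inverse, under the convention \(\tau _0 = a\), \(\tau _{p+1} = b\) used above (the inverse rebuilds the knot gaps from the cumulative products of \(e^{k_i}\)):

```python
import numpy as np

def jupp(tau, a, b):
    """Jupp transform: k_i = log((tau_{i+1} - tau_i) / (tau_i - tau_{i-1}))
    with tau_0 = a and tau_{p+1} = b."""
    ext = np.concatenate(([a], np.asarray(tau, float), [b]))
    gaps = np.diff(ext)
    return np.log(gaps[1:] / gaps[:-1])

def jupp_inverse(k, a, b):
    """Map an unconstrained vector k back to an increasing knot vector in (a, b).

    Successive gaps are proportional to the cumulative products of exp(k_i),
    which reproduces the defining ratios and keeps a < tau_1 < ... < tau_p < b."""
    w = np.concatenate(([1.0], np.cumprod(np.exp(np.asarray(k, float)))))
    gaps = w / w.sum() * (b - a)
    return a + np.cumsum(gaps)[:-1]
```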

Note that for each fixed set of knots, the class of such splines is a linear space of functions with \((p+r) = {n_{{\mathcal {B}}}}\) free parameters, where r is the order of the spline and p is the number of knots.

Let \(\varvec{\phi }({\textbf {t}},{\textbf {k}}) \in {\mathbb {R}}^{{n_{{\mathcal {B}}}}}\) be the vector of B-spline basis functions corresponding to a set of knots \({\textbf {k}}\) and \(\varvec{\Phi }({\textbf {k}})\) the \({n_{{\mathcal {B}}}} \times h \) matrix whose j-th column is \( \varvec{\phi }({\textbf {t}}_j,{\textbf {k}})\). We find the coefficients of the linear expansion and the vector of optimal knots by minimizing the penalized least squares problem:

$$\begin{aligned} (\hat{{\textbf {C}}},\hat{{\textbf {k}}}) = \underset{({{\textbf {C}}},{{\textbf {k}}}) \in {\mathbb {R}}^{n_{{\mathcal {B}}} \times n}\times {\mathbb {R}}^p}{{\text {argmin}}} \left\| {\textbf {Y}} -\varvec{\Phi }({\textbf {k}})^T{\textbf {C}}\right\| ^2_2 + \lambda _1 {\textbf {C}}^T{\textbf {R}}_1 {\textbf {C}} + \lambda _2 {\textbf {C}}^T{\textbf {R}}_2 {\textbf {C}}, \qquad \lambda _1 \ge 0,\ \lambda _2 \ge 0 \end{aligned}$$
(10)

An alternating minimization algorithm can be used to solve problem (10): at each iteration the problem is split into two optimization subproblems. A closed form solution can be obtained for the minimization with respect to \({\textbf {C}}\) with \({\textbf {k}}\) fixed; the optimal value is obtained by solving the linear system with coefficient matrix

$$\begin{aligned} {{\textbf {H}}}({{\textbf {k}}})= \varvec{\Phi }({\textbf {k}}) \varvec{\Phi }({\textbf {k}})^T + \lambda _1 {\textbf {R}}_1 + \lambda _2 {\textbf {R}}_2. \end{aligned}$$

Proposition 1

If \(\varvec{\Phi }({\textbf {k}})\) is full rank, the matrix \({\textbf {H}}({\textbf {k}})\) is symmetric positive definite (SPD) and, letting \(\sigma ({\textbf {H}})\) be the set of eigenvalues of \({\textbf {H}}({\textbf {k}})\), we have \(\sigma ({\textbf {H}})\subseteq [\sigma ^-, \sigma ^+]\), with

$$\begin{aligned}\sigma ^-&=\sigma _{min}(\varvec{\Phi }({\textbf {k}}) \varvec{\Phi }({\textbf {k}})^T) + \lambda _1 \sigma _{min}({\textbf {R}}_1)+ \lambda _2 \sigma _{min}({\textbf {R}}_2) \\ \sigma ^+&=\sigma _{max}(\varvec{\Phi }({\textbf {k}}) \varvec{\Phi }({\textbf {k}})^T) + \lambda _1 \sigma _{max}({\textbf {R}}_1)+ \lambda _2 \sigma _{max}({\textbf {R}}_2) \end{aligned}$$

Proof

\({\textbf {H}}({\textbf {k}})\) is the sum of \(\varvec{\Phi }({\textbf {k}}) \varvec{\Phi }({\textbf {k}})^T\), which is SPD, and of other matrices that are non-negative definite (O’Sullivan 1986); the first statement then follows. For the second statement we recall that if \({\textbf {A}}\) and \({\textbf {B}}\) are real symmetric matrices, then \({\textbf {A + B}}\) has real eigenvalues, and the following inequalities hold

$$\begin{aligned} \sigma _{min}({\textbf {A}})+\sigma _{min}({\textbf {B}}) \le \sigma _{min}({\textbf {A + B}}) \le \sigma _{max}({\textbf {A + B}}) \le \sigma _{max}({\textbf {A}})+\sigma _{max}({\textbf {B}}). \end{aligned}$$

\(\square \)

Proposition 1 suggests using the Cholesky factorization to compute

$$\begin{aligned} \hat{{\textbf {C}}}({\textbf {k}}) = {{\textbf {H}}}({{\textbf {k}}})^{-1} \varvec{\Phi }({\textbf {k}}) {\textbf {Y}}. \end{aligned}$$
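Following Proposition 1, the coefficient update for a fixed knot vector can be implemented with a Cholesky factorization of \({\textbf {H}}({\textbf {k}})\). A sketch, assuming the hypothetical bspline_basis, roughness_matrix and jupp_inverse helpers from the earlier sketches:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def coefficients_given_knots(Y, t_obs, k, a, b, lam1, lam2, order=4):
    """Solve H(k) C = Phi(k) Y exploiting the SPD structure of H(k).

    k is a Jupp-transformed vector of p interior knots; the full clamped
    knot vector is rebuilt with `order` repeated boundary knots, so that
    n_B = p + order free parameters remain, as noted in the text."""
    tau = jupp_inverse(k, a, b)
    knots = np.concatenate((np.repeat(a, order), tau, np.repeat(b, order)))
    Phi = bspline_basis(t_obs, knots, order)                # (h, n_B)
    grid = np.linspace(a, b, 400)
    H = (Phi.T @ Phi
         + lam1 * roughness_matrix(knots, order, 1, grid)
         + lam2 * roughness_matrix(knots, order, 2, grid))
    return cho_solve(cho_factor(H), Phi.T @ Y)
```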

The minimization with respect to \({\textbf {k}}\) with \({\textbf {C}}\) fixed gives the following nonlinear optimization problem:

$$\begin{aligned} \hat{{\textbf {k}}} = \underset{{\textbf {k}} \in {\mathbb {R}}^p}{{\text {argmin}}} \left\| {\textbf {Y}} - \varvec{\Phi }({\textbf {k}})^T{\textbf {H}}({\textbf {k}})^{-1} \varvec{\Phi }({\textbf {k}}){\textbf {Y}}\right\| ^2 \end{aligned}$$
(11)

To solve (11), we apply a knot addition algorithm that produces knot sequences of increasing dimension. We define the functional \(f_j: {\mathbb {R}}^j \rightarrow {\mathbb {R}} \) as follows:

$$\begin{aligned} f_j({\textbf {k}}) = \left\| {\textbf {Y}} - \varvec{\Phi }({\textbf {k}})^T{\textbf {H}}({\textbf {k}})^{-1} \varvec{\Phi }({\textbf {k}}) {\textbf {Y}}\right\| ^2 \, \, \, 1 \le j \le p. \end{aligned}$$
(12)

We now present the gradual knot addition algorithm (Gervini 2006).

                  Gradual knot addition algorithm

Initialization

      Choose an ordered grid \( F_1 = \{s_1^1,...,s_N^1\} \subset (a,b) \).

      Compute \( J_1 = \{J(\{s_1^1\}),...,\ J(\{s_N^1\}) \} \).

      Find \(\tilde{{\textbf {k}}}_1 = \underset{{\textbf {k}} \in J_1}{{\text {argmin}}} f_1({\textbf {k}}) \)

      Compute \(\hat{{\textbf {k}}}_1\) solution of (11) with \(p=1\) using the Gauss-Newton algorithm with \( \tilde{{\textbf {k}}}_1 \) as the starting point.

      \( \hat{\varvec{\tau }}_1 = J^{-1} (\hat{{\textbf {k}}}_1) \)

Forward addition

For \(i=2,...,p\)

      Choose an ordered grid \( F_i = \{s_1^i,...,s_N^i\} \subset (a,b) \).

      Compute

      \( J_i = \{ J \bigl ( \hat{{\varvec{\tau }}}_{i-1}\cup \{s_1^i\} \bigr ),..., J\bigl ( \hat{{\varvec{\tau }}}_{i-1}\cup \{s_N^i\} \bigr ) \} \).

      Find \(\tilde{{\textbf {k}}}_i = \underset{{\textbf {k}} \in J_i}{{\text {argmin}}} f_i({\textbf {k}}) \)

      Compute \(\hat{{\textbf {k}}}_i\) solution of (11) with \(p=i\) using the Gauss-Newton algorithm with \( \tilde{{\textbf {k}}}_i \) as the starting point.

      \( \hat{\varvec{\tau }}_i = J^{-1} (\hat{{\textbf {k}}}_i) \) .

Although there is no guarantee that this (or any other) algorithm will find the global minimizer of (11), we have found that it works well in practice. In our simulations and examples, knots have been added in the "right" order (Gervini 2006; Jupp 1978). This is important for model selection, since the optimal number of knots \(p\) is never known in practice and is chosen on the basis of the sequence of intermediate knot sets.
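The knot addition procedure can be prototyped as below. This is a simplified sketch that reuses the hypothetical helpers from the previous sketches and replaces the Gauss-Newton step with a generic quasi-Newton refinement (SciPy's BFGS), so it illustrates the structure of the algorithm rather than reproducing the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_criterion(k, Y, t_obs, a, b, lam1, lam2, order=4):
    """f_j(k) of Eq. (12): residual sum of squares of the penalized fit."""
    C = coefficients_given_knots(Y, t_obs, k, a, b, lam1, lam2, order)
    tau = jupp_inverse(k, a, b)
    knots = np.concatenate((np.repeat(a, order), tau, np.repeat(b, order)))
    Phi = bspline_basis(t_obs, knots, order)
    return float(np.sum((Y - Phi @ C) ** 2))

def knot_addition(Y, t_obs, a, b, p, lam1, lam2, grid_size=30, order=4):
    """Forward knot addition: at step i every grid point is tried as the new
    knot, the best candidate is kept, and all i knots are then refined by a
    local optimizer started from the Jupp transform of the candidate."""
    grid = np.linspace(a, b, grid_size + 2)[1:-1]       # interior candidate grid
    tau_hat = np.array([])
    for _ in range(p):
        candidates = [np.sort(np.append(tau_hat, s)) for s in grid
                      if s not in tau_hat]
        k_tilde = min((jupp(c, a, b) for c in candidates),
                      key=lambda k: fit_criterion(k, Y, t_obs, a, b,
                                                  lam1, lam2, order))
        res = minimize(fit_criterion, k_tilde,
                       args=(Y, t_obs, a, b, lam1, lam2, order),
                       method="BFGS")                   # stand-in for Gauss-Newton
        tau_hat = jupp_inverse(res.x, a, b)
    return tau_hat
```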

4 Computational experiments

In this section, we present computational results obtained with our algorithm both on synthetic functional data and on data from a real-world application. More precisely, we apply three different clustering methods to evaluate the benefits, in terms of cluster detection, of free knots spline estimation with two penalty terms compared with free knots splines without penalty and with a single penalty term. To simplify the notation, we will use FS0 to indicate the traditional free knots spline approximation, FS1 the free knots spline approximation with a single penalty term on the second derivative, and FS2 the free knots spline approximation with a double penalty term.

Fig. 1 Simulated data

We consider the classical k-means method for functional data (Hartigan and Wong 1979), a model-based clustering method (Bouveyron et al. 2014) and a hierarchical agglomerative clustering method (Ramsay et al. 2015; Tucker et al. 2013), which we will refer to by the names of the corresponding R functions: kmeans.fd, funFEM and fdahclust, respectively.

Fig. 2 Comparison of FS0, FS1 and FS2 for data approximation in four cases

4.1 Synthetic datasets

In the simulated scenario, four groups of 50 functions each were generated as in Chen et al. (2014) using the following functions as average functions

  • \( y(t)= -2 \sin (t-1)\log (t+0.5)\),

  • \( y(t)= 2\cos (t) \log (t+0.5) \),

  • \( { y(t)= -0.3 t^2 e^{-0.5 t} \sin (t - 0.5)}\),

  • \( {y(t)= 1.5 e^{-0.3 t} \log (t + 1) \cos (2t) } \),

i.e. functions that resemble a negative sine (or cosine) wave with a regular oscillation period, multiplied either by the natural logarithm of \((t + 0.5)\), which modulates the amplitude of the curve as t increases, or by a combination of logarithmic, power and exponential factors, which makes the curve strongly nonlinear and intricate.

Errors drawn from a normal distribution with zero mean are added to each curve; the random errors are generated so as to incorporate heterogeneity in the variance along the functional curves. Each function was sampled at a common set of 50 randomly chosen points in the interval [0, 5].

Each group, as can be seen in Fig. 1, is characterized by differences in terms of amplitude, shape, and complexity.

The first step in approximating the data was to determine the regularization parameters through cross validation: for FS2 we chose \( \lambda _1 = 10^{-7}\) and \(\lambda _2 = 10^{-5}\), obtained by minimizing the GCV with respect to \(\lambda _1\) and \(\lambda _2\) on an \(L \times L\) grid with \( L= \{10^{l }, \; l=-8,\ldots ,4 \} \), and for FS1 we chose \(\lambda _2 = 10^{-4}\).
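For illustration, a GCV-based grid search of this kind can be sketched as follows; the \(10^{l}\times 10^{l}\) grid follows the text, while the exact GCV formula used by the authors is not reported here, so the classical \(h\,\mathrm {SSE}/(h-\mathrm {df})^2\) form is assumed (Y, Phi, R1, R2 as in the earlier sketches).

```python
import numpy as np

def gcv_score(Y, Phi, R1, R2, lam1, lam2):
    """Assumed GCV score summed over the sample curves:
    sum_i  h * SSE_i / (h - df)^2, with df = trace of the hat matrix."""
    h = Phi.shape[0]
    S = Phi @ np.linalg.solve(Phi.T @ Phi + lam1 * R1 + lam2 * R2, Phi.T)
    df = np.trace(S)                          # effective degrees of freedom
    sse = np.sum((Y - S @ Y) ** 2, axis=0)    # one SSE per curve
    return np.sum(h * sse / (h - df) ** 2)

def gcv_grid_search(Y, Phi, R1, R2, exponents=range(-8, 5)):
    """Return the (lam1, lam2) pair minimizing GCV on the 10^l x 10^l grid."""
    grid = [10.0 ** l for l in exponents]
    return min(((l1, l2) for l1 in grid for l2 in grid),
               key=lambda pair: gcv_score(Y, Phi, R1, R2, *pair))
```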

The number of basis functions for the approximation, selected via cross validation, is set to 22. This simulation compares FS0, FS1 and FS2: the graphical results are shown in Fig. 2, and the numerical results are presented in Table 1.

The performances of the different approximation methods were measured using the classical Integrated Sum of Square Errors (ISSE) and a local version defined on the tails of the curves. The ISSE is a statistical metric used to assess the goodness of fit of a regression or interpolation model to observed data. It is particularly useful in scenarios where the dependent variable is a continuous function of an independent variable, as in functional data analysis. Formally, it can be expressed as:

$$\begin{aligned} \text {ISSE} = \int _{a}^{b} \left[ y(t) - {\hat{y}}(t) \right] ^2 \, dt \end{aligned}$$

where:

  • y(t) is the observed value of the dependent variable at time t,

  • \({\hat{y}}(t)\) is the predicted value of the dependent variable at time t,

  • [a, b] is the integration interval covering the entire data range.

Similarly to the traditional ISSE, we define a local ISSE (ISSE\(_{inf}\), ISSE\(_{sup}\)) with the aim of quantifying the fit in the initial and final portions of the curves.

The expression of the local ISSE is the same, but the boundaries of the integration interval are adjusted to reflect the specific regions of interest. This allows one to focus on the initial and final tails of the curve rather than on the entire curve. The choice of the regions for calculating the Integrated Sum of Squared Errors depends on the analysis objectives and the nature of the functional data or curves under examination. In our context, cross validation was used to test the model on different parts of the curves and to identify the regions in the tails. Degrees of freedom (df), the Integrated Sum of Square Errors, and the GCV scores are used to evaluate the quality of both the overall and the local model fit. The results in Table 1 show that the lowest ISSE values were obtained using the free knots splines incorporating two regularization terms, both on the entire interval and on the tails.
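A small sketch of how the ISSE and its local variants can be approximated from values sampled on a grid (trapezoidal rule; the tail boundaries are left as inputs, since in the text they are identified via cross validation):

```python
import numpy as np
from scipy.integrate import trapezoid

def isse(y_obs, y_hat, t, lower=None, upper=None):
    """Integral of (y(t) - y_hat(t))^2 approximated by the trapezoidal rule.

    Restricting [lower, upper] to an initial or final portion of the domain
    gives the local ISSE_inf / ISSE_sup of the text."""
    t = np.asarray(t, float)
    mask = np.ones_like(t, dtype=bool)
    if lower is not None:
        mask &= t >= lower
    if upper is not None:
        mask &= t <= upper
    diff = np.asarray(y_obs)[mask] - np.asarray(y_hat)[mask]
    return trapezoid(diff ** 2, t[mask])
```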

Table 1 Comparison of FS0, FS1 and FS2

The advantages obtained by using the double penalty approach compared to the plain free knots spline are even more evident if we look at the clustering results.

Fig. 3 Evaluation of the number of clusters in the simulated dataset. The optimal number is indicated by the dotted line

To determine the number of clusters, we used the Elbow method. In cluster analysis, the Elbow method is a heuristic used to identify the optimal number of clusters in a given dataset. It involves plotting the explained variation as a function of the number of clusters and selecting the ‘elbow point’ of the curve as the optimal number of clusters. The results in Fig. 3 show that, according to the index values, the optimal cluster number is c = 4. By applying kmeans.fd clustering to the functional data, we can then ascertain the number of curves assigned to each cluster for each of the methods under consideration.
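A common way to produce the elbow plot is to run the clustering for an increasing number of clusters and record the total within-cluster variation. The sketch below uses scikit-learn's k-means on the basis coefficients of the fitted curves, a simple surrogate for the functional k-means used in the paper:

```python
from sklearn.cluster import KMeans

def elbow_curve(coeffs, k_max=8, seed=0):
    """Total within-cluster sum of squares (inertia) for k = 1, ..., k_max.

    `coeffs` holds one row per curve (e.g. the basis coefficients of its
    spline approximation); the 'elbow' of the returned sequence is taken as
    the number of clusters."""
    return [KMeans(n_clusters=k, n_init=10, random_state=seed)
            .fit(coeffs).inertia_
            for k in range(1, k_max + 1)]
```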

Table 2 Comparison between clustering methods with FS0, FS1 and FS2: cluster cardinality and false positives (FP)
Fig. 4 Clustering structure: Cluster 1 in red, Cluster 2 in blue, Cluster 3 in green, and Cluster 4 in orange. The first column represents the synthetic datasets, the second column the data approximated using FS0 segmented into clusters, the third column the data approximated using FS1 segmented into clusters, and the last column the data approximated using FS2 segmented into clusters. (Color figure online)

The results reported in Table 2 and in Fig. 4 show that, with respect to FS2, many more misclassifications are observed for both FS0 and FS1.

The approximation that yields the best clusters is the one using FS2.

In conclusion, introducing two regularization terms into the free knots spline approximation method leads to better approximation results also for subsequent analyses such as clustering. Indeed, when the data exhibit very similar shapes, clustering typically works regardless of the approximation; the problem arises when dealing with data of highly dissimilar shapes, as in the case of the simulated data, which a conventional method fails to capture.

To consolidate the results obtained by clustering the curves approximated with the free knots splines, we verify that no other functional data clustering method yields the same or better results. From now on we no longer consider FS1, since adding a single regularization term does not improve the cluster analysis.

Table 3 Comparison between clustering methods with FS0 and FS2: cluster cardinality and false positives (FP)

The results presented in Table 3 for all the clustering methods show that the best classification is achieved by applying clustering to curves approximated with FS2. Nevertheless, none of the classification results is comparable to the classification obtained by the kmeans.fd method.

We can draw the same conclusions by evaluating the results obtained with the previous clustering methods over 150 simulations. To compare the performances, we use the Adjusted Rand index (ARI), which measures the agreement between the recovered partition and the true group structure. Clustering can be thought of as a series of decisions whose goal is to group two individuals into the same cluster if and only if they are similar. A “true positive” (TP) decision correctly assigns two similar individuals to the same cluster, while a “true negative” (TN) decision correctly assigns two dissimilar individuals to different clusters. On the other hand, a “false positive” (FP) decision incorrectly assigns two dissimilar individuals to the same cluster, and a “false negative” (FN) decision incorrectly assigns two similar individuals to different clusters.

The Rand index (RI) is defined as:

$$\begin{aligned} RI = \frac{TP + TN}{TP + FP + FN + TN} \end{aligned}$$

and falls within the range between 0 and 1. A value of 0 indicates that the two clusterings do not agree on any pair of points, while a value of 1 means that the clusterings are identical. However, the RI may not be close to 0 when category labels are assigned at random, which can be misleading.

Table 4 Comparison of clustering methods with FS0 and FS2, based on 150 simulations

To address this problem, the Adjusted Rand index (ARI) is introduced, defined as:

$$\begin{aligned}ARI = \frac{RI - E[RI]}{\max (RI) - E[RI]}.\end{aligned}$$

The ARI ranges between \(-1\) and 1, with values closer to 1 indicating better clustering results. From Table 4 we observe that, for all the clustering methods, using FS2 in the functional approximation leads to a higher ARI than using FS0.
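Both indices are available in standard libraries; a minimal example with hypothetical label vectors:

```python
from sklearn.metrics import rand_score, adjusted_rand_score

true_labels  = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth groups
found_labels = [0, 0, 1, 2, 2, 2]   # hypothetical clustering output

print(rand_score(true_labels, found_labels))           # Rand index
print(adjusted_rand_score(true_labels, found_labels))  # chance-corrected ARI
```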

4.2 Application: real datasets

In this section, we aim to provide further confirmation of the validity and competitiveness of the proposed method by examining a real-world dataset.

We examine the application of clustering to the COVID-19 dataset of new daily cases in different countries from 2020 to 2021, specifically pandemic data for 30 countries. In this context, our goal is to demonstrate the feasibility of analyzing pandemic patterns through time series clustering, using free knots splines with two regularization terms to approximate the functional data and to find taxonomies within the data. The data used in this paper were sourced from the World Health Organization (WHO) and the COVID-19 Data Hub (Guidotti 2022; Guidotti and Ardia 2020). These databases are known for their transparency, open accessibility, and high credibility, ensuring a high level of accuracy.

As a significant number of countries experienced substantial outbreaks in March 2020, we selected this month as the starting point of our time series. The endpoint of our dataset is November 30th, 2021. As in Luo et al. (2023), to mitigate the impact of factors such as varying population size, land area, population density, and population mobility across different countries, which can lead to significant differences in the magnitude of daily new COVID-19 cases, we first standardized the raw data as follows:

$$\begin{aligned} y_{i,t}^* = \frac{y_{it}- \bar{y}_{i}}{s_{i}} \;\;\;\; t=1,2,...,T \;\;\;\; i = 1,2,...,N \end{aligned}$$

where \( y_{i,t}^*\) represents the standardized number of daily new cases in country \(i\) on day \(t\), \(y_{it}\) is the number of daily new cases in country \(i\) on day \(t\), \(\bar{y}_{i}\) is the mean of daily new cases in country \(i\) over the whole study period, and \(s_{i}\) is the standard deviation of daily new cases in country \(i\) over the whole study period.
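The standardization can be carried out country by country with a few lines of NumPy (rows indexed by country, columns by day):

```python
import numpy as np

def standardize_by_country(Y):
    """Row-wise z-scores: each country's series is centred by its mean and
    scaled by its standard deviation over the whole study period."""
    Y = np.asarray(Y, float)
    mean = Y.mean(axis=1, keepdims=True)
    std = Y.std(axis=1, ddof=1, keepdims=True)
    return (Y - mean) / std
```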

Fig. 5 New cases of COVID-19 in different countries from 2020 to 2021

The standardized data are shown in Fig. 5. The next step was to select the regularization parameters, \( \lambda _1= 10^{-3}\) and \(\lambda _2= 10^{-7}\) for FS2 and \(\lambda _2= 10^{-4} \) for FS1, obtained through generalised cross-validation on a pre-established grid.

Table 5 Numerical results

From Fig. 7 and Table 5, which compare the three approximation methods, we can conclude that, as in the previous cases, the double penalization approximation performs better in the extreme regions, ensuring that valuable information is not lost.

Fig. 6 Evaluation of the number of clusters in the COVID-19 dataset. The optimal number is indicated by the dotted line

Fig. 7 COVID-19 data smoothed by FS0, FS1 and FS2

The advantages gained from the double penalization approximation method are again observed when applying clustering. To select the number of clusters, we apply the Elbow method: from Fig. 6, we determine that the appropriate number of clusters is 4.

The clustering method we use is kmeans.fd. The first analysis we perform is clustering on the data approximated using FS2.

Figure 8 displays the time series of daily new cases for each country, grouped by cluster. We can visually observe that the pattern varies significantly for countries in different clusters.

We can observe that the pattern undergoes significant changes across countries within distinct clusters. The countries in Cluster 1 exhibit a consistent trend, followed by a sudden surge in cases from January 2022 to March 2022. Conversely, countries in Cluster 3 demonstrate a decline in new cases during the same period.

The countries in Cluster 2 show an increase in cases from the beginning of January 2021, followed by a decrease starting in February. During the same period, countries in Cluster 3 exhibited a consistent trend, followed by a rise in cases in March 2021. It can be observed that the pattern of pandemic development in Cluster 3 is often at the opposite pace to that of Cluster 4.

In general, each cluster of countries exhibits distinct characteristics that set their pandemic patterns apart from those of the other clusters.

Table 6 Specific clustering results for daily new cases in 30 countries, in the case of FS2
Fig. 8 K-means clustering results using FS2 (top panel), FS0 (center panel) and FS1 (bottom panel)

Table 7 Specific clustering results for daily new cases in 30 countries, in the case of FS0
Table 8 Specific clustering results for daily new cases in 30 countries, in the case of FS1

Table 6 shows the specific clustering results for the 30 countries. We can see that all Cluster 1 and Cluster 4 countries are located in Asia, while most of the Cluster 3 countries are located in different parts of Europe.

For this reason, one might think that geographical location influences the clustering. However, geographical proximity is not a decisive factor: Cluster 2 is a case in point.

All the considerations made so far concern clustering on data approximated using FS2, shown in Fig. 8 (top panel) and Table 6. We now analyze the results of clustering on data approximated using FS0 and FS1. Figure 8 (center and bottom panels) and Tables 7 and 8, which are identical, display the corresponding clusters: the clusters are not well defined and the curves are not correctly allocated to them, making the results difficult to read and interpret. The considerations made previously are not as clear in this case: Clusters 1 and 2 show a similar trend, especially towards the end of 2021, and similar observations can be made for Clusters 3 and 4. Therefore, in the case of FS0 and FS1, no specific characteristics that distinguish each cluster are found.

5 Conclusions

In this work, we considered the use of free knots splines in the context of functional data estimation, analyzing, in particular, the impact of the roughness regularization term. More specifically, we compared two different penalty regularization schemes, namely a standard scheme with a one-parameter roughness term controlling the variability of the function (O’Sullivan 1986), and a two-penalty regularization scheme that controls both the slope and the curvature of the function. Our numerical experiments suggest that, compared to the free knots spline without any penalty, little changes when using the one-parameter scheme, while the two-parameter scheme shows notable improvements. The most significant advantages, however, appear in the data analysis phase based on the resulting functional approximation. In particular, when applied to simulated data, our method shows an improved ability to detect group structure, highlighting a clearer clustering configuration. A promising research direction emerges from the non-linear effects linked to the selection of the knots in the initial basis. The incorporation of machine learning methods to enhance the analytical approach opens a different perspective for continuous advancements in the domain of functional data analysis.