Adaptive LASSO estimation for functional hidden dynamic geostatistical models

We propose a novel model selection algorithm based on a penalized maximum likelihood estimator (PMLE) for functional hidden dynamic geostatistical models (f-HDGM). These models employ a classic mixed-effect regression structure with embedded spatiotemporal dynamics to model georeferenced data observed in a functional domain. Thus, the regression coefficients are functions. The algorithm simultaneously selects the relevant spline basis functions and regressors that are used to model the fixed effects. In this way, it automatically shrinks to zero irrelevant parts of the functional coefficients or the entire function for an irrelevant regressor. The algorithm is based on an adaptive LASSO penalty function, with weights obtained by the unpenalised f-HDGM maximum likelihood estimators. The computational burden of maximisation is drastically reduced by a local quadratic approximation of the log-likelihood. A Monte Carlo simulation study provides insight in prediction ability and parameter estimate precision, considering increasing spatiotemporal dependence and cross-correlations among predictors. Further, the algorithm behaviour is investigated when modelling air quality functional data with several weather and land cover covariates. Within this application, we also explore some scalability properties of our algorithm. Both simulations and empirical results show that the prediction ability of the penalised estimates are equivalent to those provided by the maximum likelihood estimates. However, adopting the so-called one-standard-error rule, we obtain estimates closer to the real ones, as well as simpler and more interpretable models. Supplementary Information The online version contains supplementary material available at 10.1007/s00477-023-02466-5.


Introduction
With the growing availability of geo-referenced data in high spatial and temporal resolutions, geostatistical applications increasingly require efficient algorithms to select relevant regressors among a large set of candidates.In this context, statistical methods for such high-resolution spatial data often suffer from the so-called Big-N-problem, in which the time complexity of estimation algorithms grows polynomially with an order greater than 2 when the number of locations is increasing and traditional methods are often not computationally feasible (cf.Katzfuss, 2017a; Katzfuss and Cressie, 2011).To reduce the complexity of such models, various approaches have been used, some of which are based on inducing sparsity in the spatial covariance matrix (Furrer et al., 2006;Kaufman et al., 2008;Stein, 2013;Furrer et al., 2016).Some other approaches are related to the precision matrix, either using a graphical least absolute shrinkage and selection operator (LASSO) approach (Krock et al., 2021a,b), or a sparse Cholesky factors approach based on the Vecchia approximations (Stein et al., 2004;Kang and Katzfuss, 2021;Schäfer et al., 2021) and on multi-resolution approximations of Gaussian processes (Katzfuss, 2017b;Jurek and Katzfuss, 2021).In particular, Vecchia approximation can be efficently used to peform high-dimensional spatiotemporal filtering (Jurek and Katzfuss, 2022a) and spatiotemporal smoothing (Jurek and Katzfuss, 2022b).Low-rank covariance matrices have been also considered, including fixed-rank kriging and penalised methods (Banerjee et al., 2008;Cressie and Johannesson, 2008;Chang et al., 2010a;Hsu et al., 2012a;Cressie et al., 2010).Eventually, combined approaches, like the so-called full-scale approximation of the covariance matrix have been proposed (Sang and Huang, 2012).
In addition, the spatiotemporal data may be defined in a functional domain because of their characteristics, e.g., vertical atmospheric profiles in climate studies, (Fassò et al., 2018); off-shore coastal profile measurements for beach monitoring, (Otto et al., 2021).A functional data approach may also be used to reduce the dimensionality of high-frequency temporal observations.For example, Ignaccolo et al. (2008) considered the time series of air quality measurements at many stations as functional observations.Also, to understand the bike-sharing system, Piter et al. (2022) considered daily 5-min usage profiles of a bike-sharing system as daily functional observations.Due to the spatial nature of the underlying process, further applications can be found in environmetrics (e.g., Franco-Villoria and Ignaccolo, 2017;Ignaccolo et al., 2013;Giraldo et al., 2011), medicine (e.g., Aristizabal et al., 2019), econometrics (e.g., Pineda-Ríos et al., 2019).
For joint estimation and model selection, we will consider penalised estimation procedures.
In general, the use of penalised regression approaches for the selection of relevant variables has the advantage of greater stability of the solutions than in the classic backward and forward selection methods (Breiman, 1996;Bondell et al., 2010).In this regard, we refer to the review paper of Müller and Welsh (2010) on model selection curves, in which (1) different loss functions and penalty terms are extensively discussed for the selection of relevant covariates and (2) the loss functions are studied as functions of the penalty term, which allows for the exploration of their stability, instead of evaluating the function at single values of the penalty multiplier.
Furthermore, methods of selecting covariates have been developed based on penalised methods in spatial and spatiotemporal settings.For instance, Wang and Zhu (2009) suggested a penalised least squares approach for spatial regression models; Cai and Maiti (2020), for spatial autoregressive models; and Gonella et al. (2022), for conditional autoregressive models.
For additive spatial models including potential nonlinear effects, Nandy et al. (2017) developed a weighted penalised least squares estimator.Alternatively, penalised maximum likelihood estimators (PMLE) are considered.For instance, Zhu et al. (2010) suggested PMLE for linear models with spatially correlated errors.Chu et al. (2011a) and Chu et al. (2011b) additionally reduced the model's complexity by combining covariance tapering and a PMLE for spatial and spatiotemporal settings, respectively.
Following the least absolute shrinkage and selector operator (LASSO) methodology (Tibshirani, 1996), Reyes et al. (2012) proposed a spatiotemporal adaptive LASSO algorithm for linear regression models with spatiotemporal neighbourhood structures.The estimation strategy involved both the penalised least squares and PMLE approaches.Other examples of penalised regression for spatiotemporal data are in Al-Sulami et al. (2019), in which an adaptive LASSO method was developed to simultaneously identify and estimate spatiotemporal lag interactions in the context of a data-driven semiparametric nonlinear model.Furthermore, Safikhani et al. (2020) considered LASSO methods for generalised spatiotemporal autoregressive models.The estimators are obtained by a modified version of the penalised least squares that accommodates hierarchical group LASSO-type penalties.In addition to these contributions, there are several application-oriented papers that combine classic LASSO approaches and geostatistical models in multi-step procedures (e.g., Fassò et al., 2021;Ye et al., 2011;Pejović et al., 2018).
Penalised methods are also commonly applied in the context of functional data analysis, such as penalised splines (see Silverman and Ramsay, 2002;Claeskens et al., 2009).These methods usually regularise the smoothness of the estimated functions by penalising the integrated second derivatives.In this way, many basis functions can be used, thus avoiding the typical overfit resulting from unpenalised estimation methods.In this paper, we will not consider these smoothness penalties, but we will focus on model selection for spatiotemporal models.
In general, spline basis functions are widely used tools in geostatistics for the spatiotemporal interpolation of environmental phenomena (see Hofierka et al., 2002;Xiao et al., 2016;Chang et al., 2010b;Hsu et al., 2012b, for group-LASSO approaches in this context).Also, spatial and spatiotemporal model selection has been addressed in a Bayesian framework (see e.g., Katzfuss and Cressie, 2012;Carroll et al., 2016a,b;Lawson et al., 2017;Carroll et al., 2018) but it will not be the focus of this paper.
In this paper, we efficiently select the relevant regressors of the functional spatiotemporal model known as the functional hidden dynamics geostatistical model (f-HDGM).This model decomposes the random process into a fixed-effects part (that includes external regressors) and a random-effects part.The latter component separates spatial and temporal dependence into two different terms, namely a stationary geostatistical process and a Markovian temporal process.That is, the spatiotemporal dependence must be separable (see Huang et al. (2007) for a comparison of different spatiotemporal models).In a multivariate framework, the model and, in particular, the maximum-likelihood estimation procedure were introduced by Calculli et al. (2015).The procedure is based on maximising the expected likelihood function with respect to the parameters using an expectation-maximization (EM) algorithm.This approach is based on the so-called state-space representation and the related Kalman Filter algorithm.For a recent review of Kalman filtering for spatiotemporal models, see Ferreira et al. (2022), Jurek andKatzfuss (2021), and Jurek and Katzfuss (2022b).Also, extensions for non linear and non Gaussian spatiotemporal data, such as the Ensemble Kalman Filter, have been considered by Katzfuss et al. (2016Katzfuss et al. ( , 2020)).The multivariate HDGM approach was then generalised by Wang et al. (2021) to functional spatiotemporal models, which are implemented in the MATLAB software package D-STEM.
We propose a LASSO procedure for the selection of the functional coefficients of f-HDGM.
More precisely, the suggested penalised maximum-likelihood approach with an adaptive LASSO penalty simultaneously estimates the functional coefficients and selects relevant regressors.In the functional setting, these variables are the regressive coefficients of the spline bases.Thus, in addition to fully including or excluding regressors, we also find the relevant parts (for the Bspline bases) or frequencies (for the Fourier bases) when only some of the functional coefficients are set to zero.For instance, Otto et al. (2021) showed that major storm floods have an effect only on specific parts of the coastal profiles, that is, those affected by high waves during a flood.
It is important to note that geostatistical applications are prone to cross-correlated regressors due to their spatial nature.As Zhao and Yu (2006) pointed out, the classic LASSO approach would generally not be selection-consistent in this case.Also, Simon and Tibshirani (2012) showed that when the covariates are correlated, the group-LASSO estimator (cf.Yuan and Lin (2006)), which assumes orthonormal data within each group, performs poorly in selecting the relevant features.This motivated the choice of an adaptive LASSO penalty, which led to selection-consistent estimators in the case of cross-correlated regressors (see Zou, 2006) and Zou and Li (2008).
The aforementioned method was especially promoted by the Agrimonia project (https://agrimonia.net/),which aims at building spatiotemporal models for the relation between livestock and air pollution.In such a contexts, the number of regressors, which includes meteorology, land use, socio-economic variables, and their interactions, may become quite large.
The remainder of the paper is structured as follows.In Section 2, we briefly introduce the considered functional geostatistical model.In Section 3, we present the new penalised maximum likelihood approach using an adaptive LASSO penalty.In Section 4, we present the results of an extensive Monte Carlo simulation of three simulation settings.In Section 5, we illustrate how our proposed estimation algorithm can be used to model daily air quality profiles in Lombardy, a region in Northern Italy, during the coronavirus disease 2019 (COVID-19) pandemic.Finally, Section 6 concludes this paper and identifies potential topics for future research.

The functional model
The f-HDGM and the corresponding software D-STEM (Fassò et al., 2018) are designed to handle functional data {Y s,t (h) : s ∈ D, t = 1, . . ., T } defined on the interval That is, Y s,t (h) : H → R can be observed at any h ∈ H for any given location s in the spatial domain D and for any given discrete time t.Although the spatial domain D is continuous, we will observe such data on n spatial points in an irregular grid S = {s 1 , ..., s n }.Similarly, we will observe the data for each function at a discrete set of points h 1 , ..., h q , where both h i and q may depend on s i and t.These observations are denoted by vectors {y s,t = (Y s,t (h 1 ), ..., Y s,t (h q )}.
Our data set will composed of N = nT functional data.
To account for the spatial and temporal dependence, we model the process using a hidden dynamics geostatistical process that separates all regressive effects from the spatiotemporal interrelations.More precisely, f-HDGM is defined by where the fixed-effects component µ s,t (h), the random-effects component ω s,t (h) and the modelling errors variance σ 2 (h) = V AR[ε s,t (h)] are modelled using splines.Let B k,a (h) be the k-th of the K a basis functions of component a = µ, ω, σ.
In Equation 1, the mean, or the fixed-effects component, is a linear regression model in the functional domain.That is, where X s,t,j (h) denotes the functional observations of the j-th regressor.For the generic j-th regressor, by multiplying of the spline basis matrix by the coefficients β j = (β j1 , . . ., β jKµ ) , we obtain the functional coefficients shown in Figure 1.In Section 3, we propose an adaptive LASSO procedure to penalize these regression coefficients.In a nutshell, whether all entries of the vector β j are shrunk to zero or not, we can select the relevant regressors.That is if β j contains only zeros, then, the j-th regressor is removed from the model.Moreover, if β j is only partly shrunk to zero, we can select the relevant parts and knots of the j-th regressor in the functional domain.
In Equation 1, the spatiotemporal dependence is modelled by the functional random effects ω s,t (h), given by In Equation 3, the latent component z s,t = (z s,t,1 , . . ., z s,t,Kω ) follows a temporal Markovian process, as follows: with a spatially correlated K ω -dimensional Gaussian Process η s,t .In this way, temporal dependence is separated from spatial dependence, which is included in η s,t , while temporal effects are covered by the transition matrix G.The spatial Gaussian process is assumed to have a zero mean and diagonal covariance matrix: where ρ is a valid spatial correlation function for locations s and s based on an exponential model with parameters {θ 1 , . . ., θ Kω } and scaling factors {v 1 , . . ., v Kω }, that is, the variances.
Both G and V (η s,t ) are assumed to be diagonal, to reduce complexity (i.e., by removing the temporal dependence between different parts along the functional domain).
Eventually, the model errors are assumed to be independent and identically normally distributed across space and time, but the error variance may vary across the functional domain as follows: Let β = (β 1 , ..., β p ) be the stacked vector of the spline coefficient vectors of the fixedeffects model and ϑ = {G, V, θ, ν, σ 2 } be the set of all coefficients of the random effects and the error term.Moreover, let H denote the Hessian matrix of the model's log-likelihood.The full set of parameters (β, ϑ, H) is estimated by maximising the expected log-likelihood of the model using an EM algorithm.Let (β 0 , ϑ 0 ) denote the maximum likelihood estimates of (β, ϑ) , and let H 0 = H(β 0 , ϑ 0 ) denote the Hessian matrix computed at the ML solutions (β 0 , ϑ 0 ) .Notice that the estimates β 0 is a consistent estimator of β (see Calculli et al., 2015).The EM algorithm used for computation is implemented in the D-STEM software (Finazzi and Fassò, 2014;Wang et al., 2021) within the Matlab environment.
Estimating the parameters of f-HDGM can be computationally demanding.Following Wang et al. (2021), we reduced the computational time with the following two approximations.
In the first approximation the variance-covariance matrix of the parameters is computed using an approximated approach.This task is performed by fixing a threshold for the overall improvement in the variance-covariance matrix computation (see Section 2.5 of Wang et al., 2021).The second approximation concerns a spatial partitioning approach.According to Stein (2013), we divided the complete dataset into k groups (based on the geodesic distance) of size r, and we assumed that the data in the different groups were not correlated.This implies a factorised likelihood function and the possibility of computing the E-step in parallel.As a result, the computational complexity was reduced from O(T n 3 b 3 ) to O(T kr 3 b 3 ) (see Section 2.4 of Wang et al., 2021).Moreover, if the computing infrastructure can handle k parallel processes, the computing time may be reduced to O(T r 3 b 3 ).
3 Spatiotemporal adaptive LASSO estimation for functional coefficients In this section, we suggest an adaptive LASSO approach to select (1) the relevant regressors, (2) the relevant sections of the functional coefficients and (3) the relevant knots of the fixed-effects functional model µ s,t (h).The emphasis is on modelling the relationship between the covariates and the response variable.Therefore, the parameters of the random-effects components are kept unpenalised.Moreover, for regularised regression approaches, the covariance matrix of the model errors is usually not penalised (e.g., Fan and Li, 2001;Tibshirani, 1996).Spatiotemporal parameters could also be included in the penalised procedure (e.g., see Bondell et al., 2010, for random-effect shrinking in linear mixed models).However, in this case, the shrinkage target should be adjusted to the specific empirical case.Indeed, while the temporal dependence parameter matrix G could be shrunk to zero, i.e., in the case of temporal independence, a zero shrinkage target is not meaningful for the variance parameters and the range parameter of the spatial dependence θ.Regarding the intercept or constant term of the model, we follow the strategy originally proposed by Tibshirani (1996), of standardising all the covariates and centering the dependent variable before applying the penalised regression.Such pre-processing step allows for the omission of the intercept term in the adaptive LASSO optimisation (see Section 2.2 of Hastie et al., 2015).Note that, without standardisation, the penalised solutions would depend on the original units used to measure the predictors.Thus, standardisation ensures that the penalty is applied equally to all predictors in terms of the unit variance of all the predictors.Moreover, the estimates can easily be back-transformed with the sample mean and covariance matrix that were used for the standardisation (Lee, 2015).As we are dealing with spatiotemporal data, we standardised both the response variable and the covariates with respect to their overall mean and overall standard deviation.That is, we standardised the observations using the 24-hour sample mean and sample standard deviation.
We followed a penalised maximum-likelihood estimation strategy for the fixed-effect coefficients using the following equation: with the logarithmic likelihood function L and a penalty function f .To reduce the computational burden, we locally approximated the full model log-likelihood in (7) around the unpenalised ML estimates using a local quadratic approximation (Jennrich and Sampson, 1976;Longford, 1987), as follows: where ∇L(β 0 ) is the score function evaluated at the ML solution and H 0 = ∇ 2 L(β 0 ).Similar computational solutions involving local approximation of the likelihood have been proposed by Zou and Li (2008) for obtaining penalised estimates of the parameters in Generalised Linear Models (GLM) via the one-step sparse estimator by Fan and Li (2001)  We now consider the penalty function f (β) in Equation ( 7).Motivated by the oracle properties of the adaptive LASSO estimates (Zou, 2006;Bondell et al., 2010), we use an adaptive penalty for the likelihood of the functional HDGM.Because the observed data are supposed to be correlated in space and time (due to the natural ordering of the spatiotemporal data), it is important that the algorithm be selection-consistent (Zhang, 2010) even in the case of correlated observations.However, this may not often be the case for classic LASSO approaches (see Zhao and Yu, 2006, among others, for conditions of selection consistency).Thus, we suggest an adaptive LASSO penalty that has the desired property of selection consistency, as shown by Zou (2006); Zou and Li (2008); Huang et al. (2008).
We propose the following estimator with an adaptive LASSO penalty for the fixed-effects coefficients of f-HDGM: where λ is the regularisation parameter, and the penalty weights w i are chosen as the inverse initial ML estimates, that is, w i = 1 β 0,i γ , with γ = 1 for all i.To increase or diminish the influence of these initial estimates, γ ≥ 0 could also be chosen differently.Generally, to obtain the oracle properties, the penalty parameter λ should be of order √ n (see Zou and Li, 2008).
In the next paragraph, we will go into further details on the selection of λ.
The algorithm used to solve Equation ( 9) is based on the BFGS quasi-Newton method over the non-zero coefficients, that is, the so called active set, with the initial values being β 0 .
The algorithm requires limited computation effort, as the time consuming computation of the Hessian matrix H is done only once.
This penalised procedure shrinks irrelevant coefficients to zero.Because this applies to all basis functions separately -we do not impose that all coefficients associated with one regressor must be shrunk to zero simultaneously as for a block LASSO approach -it is possible to select the relevant sections of the functional coefficients and exclude the irrelevant knots.However, the basis functions may overlap to some extent.This implies that the height of the functional coefficient at a given point in the functional domain (i.e., the sum of the weighted basis functions at a given point) is determined by several coefficients.If only some of such coefficients are zero, the functional coefficient is not zero.Hence, typically, smooth transitions shrunken to zero can be observed in the functional domain, depending on the number and location of the knots (i.e., the fewer knots there are, the smoother the estimated function is).This further encourages the use of an adaptive LASSO penalty, which leads to asymptotically unbiased estimates (see Zou, 2006).
The penalty parameter λ is determined by minimising the prediction errors obtained from a random K-fold cross-validation (CV) study.For this reason, let D = {y s,t (h), s ∈ S, t = 1, ..., T } be the set of all available functional observations, and let D 1 , ..., D K be a random partition of D, which is made by randomly assigning N/k observation to each group D i with i = 1, . . ., K.
For each subset D i , the penalised estimation is performed for a certain predefined sequence of λ, including when λ = 0.In particular, the parameters are estimated according to Equation (9) using data in D − D i .Then, the data in D i , are used to evlauate the out-of-sample prediction performance given by the root-mean-square error (RMSE i ) and the mean absolute error (MAE i ).
Eventually, the overall performance measures, say RM SE(λ) and M AE(λ), are obtained by averaging RM SE i and M AE i for i = 1, 2, . . ., K. The optimal λ * may be chosen by minimising one of the following four CV criteria: arg min λ RM SE(λ), arg min λ M AE(λ) and the two corresponding one-standard-error rules (see Hastie et al., 2017Hastie et al., , 2015)).In the final step, the β vector is re-estimated with all observations D at the optimal penalty parameter λ * .
The procedure is synthesised using the pseudo-code in Algorithm 1.
Algorithm 1 Adaptive LASSO estimation for functional hidden dynamic geostatistical models for λ = 0, . . ., λ max on an exponentially decaying grid do 4: Estimate β (P M LE) (λ) in Equation (9) using D − D i and initialise it with β 0 5: Compute the fitted values (kriging) using β (P M LE) (λ) 6: Compute the RM SE i (λ) and M AE i (λ) using the observations in D i 7: end for 8: end for  1.The settings represent the following situations of interest in geostatistical models.First, we regarded a multiple regression model as a benchmark approach (i.e., all temporal and spatial dependence parameters are chosen such that the resulting process is independent in space and time).Second, we considered the case of a dependent variable that is correlated across space and time but with uncorrelated regressors.
Third, cross-correlation among the regressors was considered in Setting III, which became the most challenging for model selection, but also the most realistic one.In particular, we consider cross-correlation ranging from moderate (0.5) to strong (0.9).The spatial dependence is exponentially decreasing with θ = 50 km, so the correlation is below 0.37 after a distance of 50 km.
To represent a realistic spatial setting, we took the coordinates from the data that is used in the following empirical sections.More precisely, the coordinates of the spatial locations refer to the atmospheric monitoring sites that belong to the network of ARPA Lombardia, the regional agency in charge of air quality monitoring in Lombardy.Its network consists of 84 ground stations distributed over its regional territory (see the paper by Maranzano, 2022, for an overview on ARPA Lombardia's monitoring system, the specifications of the measurement stations and available data provided by the region agency).Regarding the temporal resolution, we considered that the data were observed over 365 days, with each day represents the functional domain of 24 hours.For each of these settings, 500 Monte Carlo replications were simulated with the same geographical set-up.Thus, in total, 365 × 24 = 8760 hourly data were simulated for 15 locations 500 times.
For the functional interpolation, we considered a simple set-up of cubic B-spline basis functions with 7 knots.To analyse the performance of the algorithm in selecting relevant parts across the functional domain, we considered functional regression coefficients, that is 1 at the start of a day and going smoothly to 0, as shown in Figure 1.Therefore, we set the coefficients of the first four bases equal to one, and to zero the remaining three.
The penalty term sequence was generated according to an exponentially decaying grid, starting from λ min = 10 −5 up to λ max = 0.5.As we will show in the simulation results, for a value of λ greater than 0.50, all the penalised coefficients shrunk to 0. We added as the first value of the sequence the zero penalty, that is λ = 0, that corresponds to the unpenalised maximum likelihood solution.In total, we considered 101 different values.To identify the optimal value of λ, we performed a 10-fold random cross-validation across space and time.(1, 1, 1, 1, 0, 0, 0) over a functional domain from 0 to 24.
For all the considered settings, we observed similar and reasonable results.For brevity, we will discuss only the detailed results of the spatiotemporal correlated observations with correlated covariates simulation setup, that is Setting III.The results of Setting I, Setting II and Setting III are reported in Appendix 3 to 5.
The major insights from the simulations are summarised as follows.First, Figure 2 shows the average RMSE and MAE across the Monte Carlo replications and the optimal λ values obtained after evaluating various criteria.Specifically, we considered the minimum error and one-standard-error from the minimum error.The upper panels show the MAEs (left) and the RMSEs (right) for all the values of λ that were considered in this study, and the lower panels report the values that correspond to a reasonable range of λ near the optimum solutions.
Moreover, we depict the CV variability with the error bars, computed as the standard error of the sample average.Both the RMSE and MAE plots suggest that the optimal value of λ, that is, the minimiser of RMSE or MAE, is different from the MLE solution.Overall, both the RMSE and MAE showed smooth patterns.For the values of the penalty term λ greater than 0.03 (i.e., log(λ) > −3.50), all the coefficients were shrunk to 0 and the RMSE and MAE stabilised around 2.7 and 2.1, respectively.Considering the one standard error rule, in both cases, the optimal values corresponded to λ * 1SE RM SE = λ * 1SE M AE = 1.11 × 10 −4 (orange and pink vertical lines).However, the prediction performance did not significantly different from the MLE solution.In Figure 3, we plotted the empirical distribution (i.e., the box-plot) across the simulations of each fixed-effects coefficient for λ = λ * min RM SE (upper panel) and λ = λ * 1SE RM SE (lower panel).In both cases, the following are very noticeable: (1) the coefficients are estimated very close to their actual value, which indicates that the penalised estimators are approximately unbiased and (2) the variability for null coefficients is considerably smaller than for coefficients equal to 1. Similar considerations can be inferred from the further materials in Appendix 5.
In particular Figure 11 reports the average functional pattern of each coefficient across the 24-hours, whereas Table 4, reports the average estimate and the sampling variability of each coefficients across the simulations.Eventually, we can see from Figure 4 that the average estimated coefficients are smoothly shrunk towards 0, where true zero coefficients are exactly 0 for log(λ) > −3.5, whereas the remaining positive coefficients are close to 1, which is their actual value.

Application to air quality in Lombardy
The proposed model selection algorithm is now applied to air quality data recorded during the COVID-19 pandemic in Lombardy (see Figure 5).Airborne pollution in Lombardy has attracted considerable research interest for many years.With the COVID-19 emergence, many researchers have become increasingly interested in the short-term effects of lockdowns on air quality (Cameletti, 2020;Collivignarelli et al., 2020;Lovarelli et al., 2020;Fassò et al., 2021).
All the research results showed that the reduced mobility imposed by the government induced a positive effect on air quality, significantly reducing the concentrations of airborne pollutants directly related to traffic, such as nitrogen, particulate matters and benzene.However, particulate matter, did not appear to have been significantly reduced and persists at values similar to pre-pandemic levels (Cameletti, 2020;Granella et al., 2021;ARPA Lombardia, 2020).This is seen to be because of the strong increases (up to +31%) in the concentration of ammonia, the main source of emission of particulate matter, In the Lombardy countryside due to the non-interruption of agricultural activities therein (Lovarelli et al., 2021) We model hourly nitrogen dioxide (NO 2 ) concentrations obtained by the ARPA Lombardia monitoring network (Maranzano, 2022)    ral plain and urban plain areas).The station locations are shown in Figure 5.Moreover, we consider the concentration throughout the day as a functional observation.For a first overview of the intraday variations, we show the regional daily distribution of NO 2 concentrations in the functional box plot in Figure 6.The 24-hour profile clearly shows the intra-day variability in the means and variances.In particular, there are strong differences between the NO 2 concentrations at night and day.They are in accordance with anthropogenic activities -that is, very high concentrations are seen during rush hours, between 8 and 11 and between 17 and 23.
To explain the airborne pollutant concentrations, we considered a set of nine meteorological and land cover variables: temperature ( • C), precipitation (mm), relative humidity (%), atmospheric pressure (Pa), eastward and northward component of the wind (m/s), geopotential height (m 2 /s 2 ) and high and low vegetation covering (measured as one-half of the total green leaf area per unit horizontal ground surface area, cf.Sabater, 2019).Since the variables present different scales and ranges, we standardized both the response variable and the covariates with respect to their overall 24-hour mean and standard deviation.The total number of observations was n × T = 185472 for each variable.
To account for the natural daily cycle, in Equation ( 1), we set t as the day, whereas h is the time across the day.Hence, periodic Fourier basis functions with b bases and a support between 0 and 24 were used.Recall that by construction, Fourier splines require an odd number of bases, and their interpretation depends on the frequency of their calculation.In fact, except for the first basis, the following basis pairs were calculated at increasing seasonal frequencies.The algorithm was initialised by estimating the HDGM parameters, that is, the regression coefficients, the variance-covariance matrix and the spatiotemporal dynamics, using the unpenalised MLE.After estimating the full model, we applied the penalised likelihood model selection algorithm using an exponentially decaying grid of penalty coefficients λ that ranged from λ min = 10 −4 to λ max = 0.50.We also included a value of λ = 0 for the unpenalised estimates.

Scalability of the algorithm
We consider now the behaviour of our approach for increasing model complexity when applied to real world data.We are interested in computational costs and the algorithm behaviour.
To do this, we consider varying numbers of Fourier bases b for each covariate and the related size of the variance-covariance matrix of the parameters.Also, we consider the impact of approximation methods for the fixed-effect coefficients (i.e., β 0 ) and the Hessian matrix (i.e., H 0 ) of the initial unpenalised f-HDGM.In particular, we consider spatial partitioning with groups varying from k = 1 (no spatial partitioning) to k = 5, and the approximated computation of the Hessian matrix, as described in Section 2.
We want to examine the algorithm's ability to select only the relevant seasonal frequencies of the Fourier interpolation by shrinking irrelevant frequencies towards 0. Thus, we test the following occurrences as the complexity increases: (1) the increases in the number of fixed zero coefficients, and (2) the more frequent setting to 0 of the coefficients associated with high Fourier frequencies.These facts would be consistent with the observed values of the NO 2 concentrations (Figure 6), whose intra-day behaviour is fairly smooth and shows two peaks, which mean that from a modelling perspective, a small number of seasonal frequencies (low complexity) could be expected.A total of 17 models were evaluated.For each model, we considered the computation time for the three main phases of the algorithm -that is, the initial model estimation with the EM algorithm, the computation of the variance-covariance matrix of the parameters, and the penalised likelihood algorithm.For model complexity we considered three scenarios: b = 5, b = 7 and b = 9.In the first case, the total number of the model's parameters was b × 14 =70; in the second case, is 98, and in the third case is 126.The numbers of fixed-effect parameters were 50, 70, and 90, respectively.The results are summarised in Figures 7 and 8.The detailed results for all the models considered are given in Table 1 of Appendix 1.
Figure 7 shows the computational costs of the penalised likelihood algorithm as a function of the model complexity, the spatial partitioning and the adoption of an approximation for the variance-covariance matrix.The main results are summarised as follows: • Variance-covariance matrix approximation (upper left panel of Figure 7): the approximated computation of the covariance matrix decreased its computation time of around 66%, which in turns reduced the overall computation time by around 25%.This holds independently from the model complexity.Of course, the penalisation algorithm is not affected.
• Spatial partitioning (upper right panel of Figure 7): the application of a spatial partitioning reduced the initial D-STEM computation time by 30% to 50% and the penalisation phase by up to 38% .Moreover, it reduced the overall time of more than 30%.The variance-covariance matrix computation was not affected.The time gain became negligible when the number of groups increased (i.e.k ≥ 4).
• Model complexity (left panels of Figure 7): when b was reduced from 9 to 7 basis functions, the computation time of all the phases significantly increased, independent of the approximation of the variance-covariance matrix.In particular, the penalised likelihood estimation decreased by up to 68% and the overall computation by 45%.When b = 5 instead of b = 7, the gain was still evident but less pronounced.
Concerning the cross-validated model error, both MAE and RMSE were affected only by the model complexity (lower right panel of Figure 7).Indeed, independent of the approximation of the covariance matrix or of the spatial partitioning, both MAE and RMSE decreased as the number of basis functions increased for all four criteria used to define the optimum λ * .In Figure  with the adaptive LASSO.Previously, we stated that at an increasing number of fixed-effect coefficients, the number of irrelevant coefficients would increase and thus, would be set to 0 by the algorithm.Figure 8 clearly shows that this statement is correct.Indeed, when the number of basis functions is large, the overall proportion of zero coefficients increased up to 25% of the total.The graph on the right in Figure 8 examines the excluded coefficients in detail and shows that the highest frequencies (corresponding to β 6 to β 9 ) were the most frequently removed by LASSO.This is consistent with what was observed with the NO 2 concentrations: since the response variable exhibited a very smooth intra-day pattern, the number of seasonal frequencies required to model the relationship with the covariates was significantly reduced.
This result allows us to state that the adaptive LASSO algorithm proposed in this study can be a very useful tool for identifying the most relevant frequencies as it is precise in its selection and implementable even in contexts with large data sets.If time computing time is not an issue, using a higher number of frequencies (e.g., b = 9 in our case) provides better forecasting performance (i.e., lower RMSE and MAE) while avoiding an excessive number of non-zero coefficients.

Penalised estimates
The numerical estimates showed very limited variability across the models and exhibited weak sensitivity to the approximations used in the estimation (i.e., spatial partitioning and Hessian matrix approximation).As shown in the preceding section, the impact of our adaptive LASSO procedure became more pronounced as the number of fixed-effects coefficients increases.Thus, we report and comment on the empirical results obtained considering the model with the lowest prediction error among those using b = 9 basis functions.Specifically, we consider the case with an approximate variance-covariance matrix and without spatial partitioning.
In Figure 9, we depict the behavior of the average cross-validation MAE (left panels) and RMSE (right panels) for increasing values of the penalty term λ.Both criteria highlight improvements in prediction errors when the adaptive LASSO was used to estimate the parameters.
Both MAE and RMSE show a monotonic pattern apart from little sampling variability; and for large values of the penalty term (i.e., log(λ) > −4) all the covariates were dropped.However, the one-standard-error-rule provided more parsimonious models with prediction errors not sig- nificantly different from the optimal ones.The estimated coefficients that correspond to the unrestricted model and the models associated with the four penalty terms considered are shown in Figure 10.
In Figure 11, we show the 24-hour estimated functional coefficients for each variable.
The black lines correspond to the unpenalised ML solution; the green lines to the optimal λ w.r.t RMSE; the grey lines, to the optimal λ w.r.unchanged during the day.However, the penalty seemed to mitigate the temperature effect at peak hours (10 a.m. and 8 p.m.).Rainfall always showed a negative effect on the NO 2 concentrations, especially in the evening and just before dawn.In both moments, the effect reached the minimum peaks.Unlike the temperature, whose daily pattern varied slightly as the penalty increased, for higher values of λ, the precipitation diminished its effect and tended to flatten slightly towards 0. Considering the one-standard-error rule, between 7 a.m. and 5 p.m., the curve flattened to a constant negative value without being exactly 0. For the same λ values, the two negative peak periods were greatly mitigated.These elements confirm the important role of temperature and rainfall in mitigating NO 2 concentrations, which is highlighted in literature (e.g., Fassò et al., 2021).Relative humidity presented some very interesting findings.
First, its effect was null at around midnight, slightly negative at night before dawn and strongly positive in the daylight hours.Moreover, the penalisation appeared to produce no effect on the intra-day behaviour.This is consistent with the fact that whereas temperature and rainfall showed more complex patterns during the day, relative humidity already exhibited a simple pattern and did not need further smoothing.
Both the effects of atmospheric pressure and geopotential height (used as a proxy of elevation) depended on the moment of the day that was being considered.In both cases, the estimates showed a positive effect at the start and at the end of the day and a negative effect in the afternoon.However, in the case of elevation, the functional coefficient in the early and late hours was very close to zero, and in the central hours, it deviated significantly from 0 regardless of the penalty used.For both variables, penalisation did not play a significant role, the difference between penalised and non-penalised curves approached 0 and the infra-daily dynamics seemed to be stable.
Also, the U (eastward) and V (northward) components of wind showed a time-varying behaviour across the day.In both cases, the effect on the NO 2 concentrations was positively estimated during the early stage of the day, especially between 5 a.m. and 10 a.m., and it strongly weakened in the afternoon and at night, reaching values very close to 0 between 3 p.m. and 8 p.m.The morning positive estimate indicated that winds from the East (the Adriatic Sea) and the North (the Alps) reduced NO 2 concentrations, while those from the West and the South had a stagnating impact on the local NO 2 concentrations.However, the cleaning effect was limited to the early part of the day.The shrinkage effect induced by the penalisation algorithm was more pronounced in the eastward component than in the northward component.In fact, we noticed that the effect of the eastward component was strongly smoothed in the morning hours, and the coefficient was cancelled during the afternoon.The northward component, although also smoothed, showed a significantly positive effect in the early hours of the day.
Finally, we saw that penalisation generated a remarkable influence on the two land cover variables, that is the high vegetation and low vegetation indices.Both variables were heavily squeezed towards 0 even for the contained values of λ until they became almost 0 for the values associated with the one-standard-error rule.Similar results were presented in Fassò et al. (2021), in which the effect of the same covariates on the NO 2 , PM 10 and PM 2.5 concentrations in Lombardy was estimated to be close to 0, and thus not statistically significant, with the exception of the most urbanised areas.

Conclusions and future developments
In this paper, we introduced an adaptive LASSO estimator for functional hidden dynamic geostatistical models.This new estimation approach based on penalised maximum-likelihood estimation can be used to efficiently select relevant covariates in functional geostatistical models.
Our proposal may be useful in environmental policy assessment.Special cases are agricultural policies considered by the mentioned Agrimonia project, air quality assessment (Fassò et al., 2021), traffic policy (Maranzano et al., 2020), sustainable development (Wang et al., 2016) and energy policies (Yuan et al., 2018).
Moreover, the algorithm can be successfully applied to identify only relevant parts of the coefficients across the functional domain.From a computational perspective, we showed that the estimation can be efficiently implemented as a local quadratic approximation around the maximum of the expected likelihood function.To find this maximum, the EM algorithm implemented in the D-STEM software can be used (see Wang et al., 2021).Then, we used a BFGS quasi-Newton iterative method to optimise the penalised function.
We analysed the performance of this estimation procedure through a Monte Carlo simulation study based on three settings with increasing level of complexity and representative of common applied contexts.To be precise, we considered settings where only parts of the functional coefficients had zero effects and where the regressors were cross-correlated and driven by spatiotemporal dynamics, as is often observed in geostatistical applications.Eventually, we applied the penalisation algorithm to an empirical example of air quality assessment.In particular, we modeled the N O 2 concentrations observed in Lombardy during the COVID-19 pandemic period in Spring 2020.In addition to showing the direct effects of the adaptive LASSO algorithm on the estimated coefficients, we provided a study of the scalability of the algorithm when applied to real world data.In particular, we showed that even with high model complexity, the computation time (both of the penalised likelihood and overall) can greatly benefit from approximations in model estimation, leaving performance essentially unaffected.
This paper focused on model selection in functional data contexts, performed using an adaptive LASSO penalisation algorithm.However, further extensions can be pursued.Indeed, the smoothness of the estimated functional coefficients can also be of high interest in such applications because too large a number of spline bases leads to over-fitting and non-smooth estimated effects.A further penalty term based on the integrated second derivatives could counter these effects.Thus, an elastic net structure that includes the smoothness penalty and an adaptive LASSO penalty is a very interesting topic for future research.Eventually, using the results of Simon and Tibshirani (2012), the standardised group-LASSO estimator can be easily extended to spatiotemporal functional models by optimising a penalized likelihood function with quadratic approximation and assuming that the spline basis functions associated with each covariate are a group.

9:
Compute RM SE(λ) and M AE(λ) and select λ * according to one of the four CV criteria 10: Estimate the full model with λ * on D to obtain the final parameters βλ * 4 Monte Carlo simulation study To evaluate the performance of the model selection algorithm, we performed a Monte Carlo simulation study based on three settings, labelled as Setting I, Setting II, and Setting III.The three schemes are summarised in Table

Figure 1 :
Figure 1: True functional coefficients with seven basis functions and coefficients equal β =
from 1 March, 2020, to 31 May, 2020, that is T = 92 days.The monitoring network has n = 84 ground stations, which are classified by type (background, industrial, rural and traffic stations) and zone (metropolitan, mountain, ru-

Figure 3 :
Figure 3: Box plot of the estimated coefficients across 500 simulations at λ * = λ min RM SE (upper panel) and at λ * = λ 1SE RM SE (lower panel) for Setting III.

Figure 4 :
Figure 4: Average estimated coefficients for different values of λ in Setting III.The positive coefficients are drawn in blue, and the zero coefficients are depicted by the red dashed lines.
For example, if the number of bases was b = 5, the pair formed by the fourth and fifth bases would have twice the frequency of the second and third pair.The use of Fourier bases ensures the continuity of the last hour of a day to the first hour of the consecutive day.According to the number of basis functions, the total number of parameters to estimate is equal to b × 14, In particular, b × 10 parameters are associated with the covariates and the functional intercept; b × 3 with spatiotemporal dynamics, and b with residual component variances.

Figure 5 :
Figure 5: Physical map of Po Valley (upper left panel) and Lombardy (upper right panel) and the ARPA Lombardia air quality monitoring network by type of station (lower left panel) and type of area (lower right panel).

Figure 7 :
Figure 7: Computation time and cross-validation errors across the models.Computation time of each phase by model complexity with and without spatial partitioning (upper left panel); computation time of each phase by increasing level of spatial partitioning (upper right panel); computation time of each phase by increasing level of spatial partitioning and model complexity (lower left panel); RMSE and MAE by model complexity and increasing spatial partitioning (lower right panel).

Figure 8 :
Figure 8: Percentage of zero coefficients across the models.Percentage of zero coefficients by model complexity when applying a spatial partitioning with k = 2 groups (left panel); percentage of zero coefficients by basis function (coefficients) when applying a spatial partitioning with k = 2 groups.

Figure 9 :
Figure 9: MAE against the logarithm of the penalty term λ (left panels) and RMSE against the logarithm of the penalty term λ (right panels).The horizontal lines represent the values of MAE and RMSE for key values of log(λ), the optimal (λ * RM SE and λ * M AE ), 1-SE optimal values (λ * 1SE RM SE and λ * 1SE M AE ).The bottom panels are details near the optimum.

Figure 11 :
Figure 11: Estimated functional β coefficients for differently selected optimal penalty parameters.

Table 1 :
Specification of the simulation settings.
t MAE; the orange lines, to the 1SE optimal λ w.r.t RMSE; and the pink lines, to the 1SE optimal λ w.r.t MAE.The estimated coefficients associated with temperature always exhibited negative values, particularly in the late afternoon and evening hours.The patterns obtained for different values of λ did not show large discrepancies and tended to overlap throughout the day, leaving the overall dynamics