Penalised inference for lagged dependent regression in the presence of autocorrelated residuals

This paper deals with linear models for a time-dependent response and explanatory variables in a high-dimensional setting. We account for the time dependency in the data by explicitly adding autoregressive terms to the response variable in the model together with an autoregressive process for the residuals. We present a penalized likelihood approach for parameter estimation and discuss its theoretical properties. Finally, we show the successful application of the proposed methodology on simulated data and on two real applications, where we model air pollution and stock market indices, respectively. We provide an implementation of the method in the R package DREGAR, freely available on CRAN, http://CRAN.R-project.org/package=DREGAR.


Introduction
This paper deals with fitting a time series-regression model using l 1 regularized inference. In the context of linear models, l 1 penalized approaches have received great interest in recent years as they allow performing variable selection and parameter estimation simultaneously for any data, including high-dimensional datasets, where classical approaches for parameter estimation break down, e.g. [9,13,19,21,25]. In [25], it is shown that a model where penalties are adapted to each individual regressor enjoys oracle properties. Most of the advances in regularized regression models have been for the case of independent and identically distributed data. A recent line of research has concentrated on regularized models in time dependent frameworks. Amongst these, [22] showed the successful application of l 1 penalised inference in the context of autocorrelated residuals for a fixed order, by proposing the model θ j t− j + e t and studied its asymptotic properties. We refer later to this model as REGAR(q). Nardi and Rinaldo [15] studied the theoretical properties of a regularized autoregressive process on Y t for both low and high dimensional cases, whereas [1,14,20] studied the l 1 estimation of vector autoregressive models. Recently, [6] proposed an alternative to l 1 penalisation for vector autoregressive (VAR) models in the high-dimensional framework. In [1,6,15,20], no exogenous variables are included in the model. In contrast to this, [12] studied the asymptotic properties of adaptive lasso in high dimensional time series models when the number of exogenous variables increases as a function of the number of observations and [17] considered the case of VAR models with exogenous variables. While both models cover a lagged regression in the presence of exogenous variables, they do not consider autocorrelated residuals. Recently, [23] proposed an extension of the model in [22] by adding a moving average term as follows We refer later to this model as REGARMA(p,q). Similar to [22], they proved the consistency of the model in low-dimensional cases. Despite the generality of this model, considering an ARMA process for the errors results in a complex model with a challenging implementation.
In this paper, we propose to account for the time dependency in the data by explicitly adding autoregressive terms of the response variable in the linear model, as in [15], as well as an autocorrelated process for residuals, as in [22], in order to capture complex dynamics parsimoniously. In particular, given fixed orders p and q, we propose the model We name the resulting model DREGAR(p,q). The model is essentially a double autoregressive model with unbalanced weights for response and explanatory variables. To show this, we rewrite the model in (1), the REGARMA model of [23] and the REGAR model of [22] using the backward shift operator L: REGARMA : L(θ )y t = L(θ )x t β + L(φ)e t , where L(.) represents a stationary polynomial of L and L(θ )L(φ) represents a special case of an AR( p + q) process. From these equations, one can see how REGAR and REGARMA impose the same autoregressive structure on both response and covariates, whereas DREGAR assumes different structures on each of them. We found this aspect to be particularly advantageous on a number of analyses of real datasets, where DREGAR fits the data better than competitive models, with two of these examples reported in Sect. 6. In contrast to REGAR and DREGAR, REGARMA contains a moving average process on the errors. The MA component, however, induces a higher level of complexity in the parameter estimation and in the proofs of the theoretical results. All three models belong to the general framework of ARMAX [11], which is common in the system identification and signal processing literature [16] where inference is typically performed in the low-dimensional case. The focus of this paper is on statistical inference for the DREGAR model, in particular for the high-dimensional case where maximum likelihood estimation fails. In particular, we devise a penalised likelihood approach for parameter estimation, in the same spirit to the REGAR and REGARMA contributions. In Sect. 2, we formulate the model and present an l 1 penalized likelihood approach for the estimation of the parameters. In Sect. 3, we prove the asymptotic properties of the model DREGAR(p,0). In Sect. 4, we discuss the implementation of DREGAR. A simulation study, given in Sect. 5, will accompany the theoretical results. In Sect. 6 we apply the model to two real datasets, one on air pollution and another on the stock market, and compare the fit of the model with REGARMA and REGAR models. Finally, we draw some conclusions in Sect. 7.

L 1 penalised estimation
The general form of DREGAR consists of a lagged response variable, covariates and autocorrelated residuals. Consider the following Gaussian DREGAR model of order p and q, where x t is the tth row of the design matrix containing r predictors, X T ×r ; {y t } and { t } follow stationary time series processes, that is all roots of the polynomials 1 − p i=0 φ i L i and 1 − q i=0 θ i L i are unequal and outside the unit circle; e t , t = 1, . . . , T are independent and identically normally distributed noises with mean zero and known finite fourth moments, and p + q < T . Moreover, we assume that the error and explanatory variables are mutually independent for all time points. To remove the constants from the model we follow the literature on regularized models, e.g. [7,21], and standardize the covariates to have zero mean and unit variance and the response to zero mean.

Matrix representation of the model
For convenience, we write the model in matrix representation. Let H = (H ( p) , H (q) , X ) be a n × ( p + q + r ) matrix including lags of response (H ( p) ), residuals (H (q) ), and explanatory variables (X ). Let Θ = (φ, θ, β) denote the vector of corresponding parameters, e = (e T 0 +1 , e T • +2 , . . . , e T ) be the vector of errors, T • = p + q and n = T − T • , as previously defined. Then, in matrix form, the model can be written as Y = H Θ + e and the l 1 penalized conditional likelihood given the first T 0 observations is equivalent to Similarly, the adaptive form of the model is given by where the parameters are given by

Theoretical properties of the model
In order to study the theoretical properties of DREGAR and adaptive-DREGAR, we define the true coefficients by Θ • = (β • , φ • , θ • ) and assume that some of the coefficients are zero. The indices of non-zero coefficients in each group of coefficients, β, φ and θ , are denoted by s 1 , s 2 and s 3 respectively, whereas s c 1 , s c 2 , s c 3 are the complementary sets and contain the indices of zero coefficients. We also define β • s 1 , φ • s 2 , θ • s 3 and their corresponding (DREGAR) estimations byβ s 1 ,φ s 2 ,θ s 3 . Similarly, adaptive-DREGAR estimations are denoted byβ * s 1 ,φ * s 2 ,θ * s 3 . Finally, different combinations of model parameters are going to be used, with obvious meaning, in particular , ) .

Assumptions
To prove the theoretical properties of the estimators, in line with the literature, we make use of the following assumptions: (a) The response variable is assumed to be stationary and ergodic with finite second order moment. Further, we assume that the two polynomials 1− p i=1 φ i L i and 1− q i=1 θ i L i have all the roots unequal and outside the unit circle. (b) Covariates are assumed to be mutually independent of each other and of the error term.
Additionally, we assume that x .s , s = 1, . . . , r are generated from stationary and ergodic processes. (c) e t s are i.i.d Gaussian random variables with finite fourth moments. (d) 1 n X X a.s → E(X X ) < ∞ and max 1≤i≤r x i x i < ∞.
The first three assumptions guarantee that the mean and variance of the whole system remain unchanged over time. The last assumption guarantees the existence and convergence of the sample moments.

Theoretical properties of l 1 penalized DREGAR(p,0)
In this section we focus on the theoretical properties of l 1 penalized DREGAR estimators.
In particular, we concentrate on the theoretical properties of DREGAR(p,0) as we can prove that there is asymptotically no bias in this model. This model differs from REGAR(p) [22] as it imposes an autoregressive process on the response whereas REGAR(p) considers the case of autocorrelated residuals (i.e. similar to a DREGAR(0,q) model). In the next two sections, we distinguish the cases of DREGAR(p,0) and adaptive-DREGAR(p,0), respectively.

Asymptotic properties of DREGAR(p,0)
Note that k n reaches the minimum at √ n(Θ − Θ • ). From (2), The last two terms have limits: As for the first term: which is equivalent to From left to right, we prove that the first term in (7) is bounded and the next two terms follow (asymptotically) normal distributions: Similar calculations to [5] show that (8) tends to δ U B δ where U B is the covariance matrix of (X , H ( p) ), which is bounded (O(1)). Defining S n as a function of n, and using assumptions [a−d] and the central limit theorem for martingales result in Substituting all results in Eq. (5), Up to now, we have proved k n (δ) [8,9]. This follows from (1). The proof of the theorem is completed.
This theorem shows that the DREGAR estimator has the Knight and Fu [9] asymptotic property and it implies that the tuning parameters in Q n (Θ) do not shrink to zero at the speed faster than n −1/2 . In the proof of Theorem 1, the errors must be independent and identically distributed and we do not make a specific assumption about the type of distribution. In other words, the central limit theorem for martingale guarantees the convergence to the normal distribution.
As shown in [9], minimizing l 1 penalized likelihood in the linear model leads to unavoidable bias in the estimates of the non-zero parameters. In the following remark, we show this also in the context of the DREGAR model.
assuming that there are enough observations and that the minimizer k(δ) correctly identifies the coefficients, that is, u = 0 and v = 0, then, k(δ) must satisfy where U B 1:r is the first r rows of U B corresponding to the r covariates. From the final equation, DREGAR(p,0) suffers an asymptotic bias, provided the tuning parameter is positive. In other words, lasso regularization of DREGAR(p,0) is not asymptotically consistent. In the next section we discuss the adaptive-DREGAR(p,0) where a fixed level penalty term is replaced by a weighted (adaptive) one. We show that under certain conditions adaptive-DREGAR(p,0) is consistent and enjoys the oracle property.

Adaptive DREGAR(p,0) model
Recall from Eq. (3) that parameter estimation in adaptive-DREGAR(p,q) involves the minimization of is parameter space. To prove the asymptotic property of adaptive-DREGAR(p,0) we follow [9,22] and define, where a n and b n are maximum and minimum penalties for significant and insignificant coefficients respectively.
Proof Let α n = n −1/2 + a n , and {Θ • + α n δ : δ ≤ d, δ = (u, v) } be a ball around Θ • . Then for δ = d we have (Using triangular inequality) : The last equation holds because of the decreasing speed of α n . On the other hand, similar calculations to Theorem 1 results in Because (11) dominates (10), then for any given η > 0, there is a large enough constant d such that This result shows that with probability at least 1 − η, there is a local minimiser in the ball {Θ • + α n δ : δ ≤ d} and as a result a minimiser Q * n (Θ) such that Θ * − Θ • = O p (α n ) (see [22,Lemma 1], [5]).
The proof is completed.
Theorem 3 implies that there exist a √ n-consistent local minimiser Q * n (Θ), when tuning parameters (for significant variables) in DREGAR(p,0) converge to zero at the speed faster than n −1/2 .
In the next step we prove that under the case where the tuning parameter associated with insignificant variables in DREGAR(p,0) shrink to zero at a speed slower than n −1/2 , then their associate coefficients will be estimated exactly equal to zero with probability tending to 1. Further, in the next theorem we show that by increasing the penalties on the zero parameters at a certain speed, the probability of these coefficients to be estimated exactly zero tends to one.
Proof This proof follows from the fact that Q * n (Θ * ) must satisfy where U i is the ith row of U B and i ∈ s c 1 . The second term in (12) is a direct result of adding a ±X β, ±H ( p) φ to L n (Θ * ). By using the central limit theorem, the first term in Eq. (12), t e t x ti , is of order O p (n 1/2 ) and the second term is O p (n 1/2 ). Furthermore, both terms are dominated by nλ * i since b n √ n → ∞ (expansion of [9,22]). Then the sign of ∂ Q * n (Θ * ) ∂β i is dominated by the sign ofβ * i , from whichβ * i = 0 in probability. Analogously, we can show that Pr(φ * s c 2 ) p → 1. The proof is completed.
where U 0 is the sub-matrix U B corresponding to Θ • 1 , andΘ * 1 corresponds to non-zero elements ofΘ * .
Proof From Theorems 3 and 4, one can conclude that Pr(Θ * 2 = 0) p → 1. Thus, the minimiser . This implies that the lasso estimatorΘ * 1 satisfies the following equation From Theorem 3,Θ * 1 is a √ n-consistent estimator. Thus a Taylor expansion of the above equation yields where p() is the tuning function i∈s 1 For n sufficiently large p(Θ * 1 ) = p(Θ • 1 ). Thus, Theorem 5 implies that, adaptive DREGAR(p,0) is asymptotically an oracle estimator provided a n tends to zero at the speed faster than √ n (or a n √ n → 0) and simultaneously b n increases at the speed slower than √ n (or b n √ n → ∞).

Implementation
The formulation of the model lends itself naturally to its implementation, in contrast to other time series models such as [23]. As the model contains residuals, which are unknown, we apply a two-step optimization procedure Repeating steps 1 and 2 iteratively provides the solution to DREGAR. The tuning parameters λ, γ and τ can be chosen by K-fold cross-validation or by an information criterion such as AIC, BIC or eBIC. For our model these are given by: where par is the number of non-zero estimated parameters. For the simulation and real data analyses, we use eBIC which was found to have a good performance by [4]. For adaptive-DREGAR, we use λ * = ω/|β|, γ * = ω/|φ| and τ * = ω/|θ|, withβ,φ andθ the unpenalized or lasso estimations of the parameters. We assume the same ω for both terms, so that we can simplify the problem to the ordinary adaptive-lasso problem, and select this tuning parameter by one of the criteria mentioned above.
A final choice for model selection is setting the orders p and q. We propose two general approaches to choose the optimal orders: (a) setting an upper bound P and Q and choosing the model that achieves the minimum eBIC inside the grid, (b) setting an upper bound P and Q and letting the model choose the optimal orders by keeping or eliminating the coefficients under l 1 sparsity constraints. In the second approach, the fitting is based on n = T − (P + Q) time points, whereas in the first approach, the number of time points depends on the orders, p and q. Then a rule of thumb is to use the first approach when the number of observations is low and choose the second approach when there are enough observations.
The method is implemented in the R package DREGAR, available in CRAN http://CRAN. R-project.org/package=DREGAR.

Simulation study
We design a simulation study to compare the (adaptive) DREGAR with (adaptive) REGAR [22] and (adaptive) lasso [25]. To this end we propose the following configuration: 1. Generate the design matrix, X , using a stationary Gaussian process with r = 100 and T = 50, 100, 1000. That is, high-dimensionality is considered in terms of the number of exogenous variables. 2. Set 90% of β coefficients to zero. Assign unequal random numbers in (−1, 1) to the non-zero coefficients. 3. Generate data from the DREGAR(2,2) model with σ 2 = 0.5, 1 and 1.5. 4. Sample 1500 data points from the above model so that the first 50, 100 or 1000 observations are used for parameter estimation (training set) and the rest n = 1500 − T points are left for evaluating the model performance (test set). 5. Select tuning parameters by minimizing eB I C and fix the maximum orders P and Q to 3 (i.e. allowing also for variable selection for φ and θ ).
We repeat each combination of models 100 times and calculate mean squared error ofβ and the prediction mean squared error, defined by whereŷ test is calculated using the two steps discussed in the implementation. We compare DREGAR (3,3) with lasso and with a DREGAR(0,6) model, which has the same number of parameters as DREGAR (3,3) and is the closest in the DREGAR family to a REGAR model [22]. Figures 1 and 2 show overall how adaptive DREGAR dominates lasso and REGAR for low and high-dimensional problems in terms of both prediction error and MSE ofβ. Table 1 shows the full set of results for PMSE with a better performance of DREGAR across the range of simulations. 6 Real data illustration 6

.1 Analysis of air pollution data
In this section, we show the performance of the model on the National Mortality, Morbidity and Air Pollution Study (NMMAPS) dataset. This dataset is publicly available from http://www.ihapss.jhsph.edu/data/NMMAPS/ and contains daily mortality, air pollution, and weather data for 108 cities in the US from January 1, 1987 to December 31, 2000. The variables include six indicators for mortality (total non-accidental, cardiovascular disease, respiratory, pneumonia, chronic obstructive pulmonary disease, accidental), , sulphur dioxide (SO 2 ), nitrogen dioxide (NO 2 )) as well as three indicators of weather (temperature (T), dew point temperature (D) and relative humidity (H)). Similar to [23] we study the relationship between ground level of ozone and indicators of air pollution and weather conditions in Chicago in 1995. Differently to [23], we take the effect of carbon monoxide (CO) into account. The covariates in the model consist of NO 2 , SO 2 , CO, PM10, temperature and relative humidity as well as all two-ways interactions. We show the interactions by initials, for instance NS represents the interaction between NO 2 and SO 2 . A total number of 365 observations and 21 covariates are included  in the analysis. All covariates and response are normalized to zero mean and unit variance. Figure 3 shows a difficult choice for the maximum orders P and Q. We therefore follow the second approach in Sect. 4 and propose P = 5 and Q = 5 and let the model's parameter inference guide the best orders. The parameters are estimated using the adaptive algorithm in Sect. 4, setting a maximum of 50 iterations and selecting the tuning parameters by minimizing eBIC.
We compare the optimal DREGAR(p,q) model with alternative models of similar complexity or natural subclasses. In particular, we consider DREGAR(p+q,0), DREGAR(p,0), DREGAR(0,q+p) and DREGAR(0,q). Note that the last two are the closest models to [22] in the DREGAR family. In addition, we consider standard non-dynamic models, namely adaptive-lasso and elastic-net. For the latter, we choose the optimal proportion of norms α over a range of 100 values. We compare the models on the basis of a number of commonly used criteria: eBIC, AIC, Quasi-likelihood Information Criteria (QIC) [18] and Consistent AIC (CAIC) [3]. Table 2 provides a detailed illustration of the parameter estimates as well as information on the comparison of the models. Time series coefficients in the middle-bottom of the table propose an order of four and three for DREGAR as well as DREGAR(1,0) and DREGAR(0,3) for the other models, suggesting that the maximum order of 5 for p and q is sufficient.   However, we should stress that the two analyses are not directly comparable, since we consider an additional variable, CO, which shows a significant effect on the ozone ground level and a non-zero effect for the interaction with weather indicators, namely CT and CS. We further report the Ljung-Box test [2] statistics in the bottom of the Table 2. With the exception of lasso, elastic-net, DREGAR(5,0) and DREGAR(10,0), all other models show good fitting, i.e. no evidence against the white noise assumption. Figure 4 displays the scatterplot of fitted versus observed response for lasso and DREGAR (4,3), the residuals from the DREGAR(4,3) mode and the corresponding sample ACF and PACF. The small curvature in the scatter plot, mentioned also by [23], can be an indication of a particular weather condition that results in an interaction between primary pollutants. The residuals' ACF and PACF suggest that the residuals are white noise as confirmed also by the p-value of the Ljung-Box test (0.61). Finally, we have also compared the fit of the best DREGAR model, DREGAR(4,3), with a DREGAR(0,7) model (the closest to a REGAR(7) model), in order to assess the benefit in having different autoregressive structures for the response and the predictors, a unique feature of the model that we propose in this paper. Without penalising the coefficients, the maximum likelihood for DREGAR(4,3) is −1106.884 and that of DREGAR(0,7) is −1110.832, suggesting an improved fit for the DREGAR(4,3) model.

Analysis of stock market data
For the second real application we take an example from the stock market. Data are collected from yahoo finance (https://finance.yahoo.com) and contain 251 closing prices for 30 indices in the DowJones market in 2015. We take the IBM index as the response and the remaining   For the information criteria, the asterisk denotes the minimum We apply first differences of the log-prices to get stationary returns [10]. Figure 5 shows low orders of auto-correlations in the residuals as typical of financial data.
Adaptive DREGAR(5,5), DREGAR(10,0), DREGAR(0,10), DREGAR(5,0) and DRE-GAR(0,5) are applied to the data and the tuning parameters are selected using eBIC. In addition, we consider adaptive-lasso and elastic net as well as GARCH, which is typically used for financial data. For GARCH, we use the R package rugarch and choose the optimal model by searching among all models with maximum orders (2,2). The models are compared on the basis of eBIC, AIC, CAIC, QIC, Ljung-Box statistic and sparsity. Table 3 shows that DREGAR(5,5) performs very well compared to other methods with respect to eBIC, AIC and CAIC as well as sparsity. Fitting adaptive DREGAR (5,5) to data results in an order of 3 for the dynamic term and an order of 4 for the residuals. So the final selected model is DREGAR (3,4), suggesting that a maximum order of 5 for p and q is adequate also for this dataset. Amongst the top selected predictors, there are: MSFT (coefficient 0.3), HPQ (0.23), VZ (0.20), MMM (0.14), MRK (0.13) and CVX (0.10). Figure 6(top) shows observed versus fitted response for lasso and DREGAR (3,4). From this figure, DREGAR(3,4) has a better fit compared to lasso in terms of the correlation between the observed and fitted values (ρ DREGAR (3,4) = 0.811, ρ lasso = 0.804). Finally, the sample ACF and PACF at the bottom of Fig. 6 confirm the results from the Ljung-Box statistic, showing that the residuals from DREGAR (3,4) behave like white noise.

Conclusion
This paper addressed the problem of dynamic regression in the presence of autocorrelated residuals by proposing an extension of the regression model of [22] with the inclusion of lags of the response. We showed that adding this dynamic term results in a structure more similar to a general ARMAX model than REGAR [22] and REGARMA [23] and with fewer difficulties in parameter estimations than REGARMA. Further, we proposed an l 1 penalized likelihood approach for variable selection for both regression and time-dependent coefficients and studied its theoretical properties. We proposed two iterative algorithms for parameter estimation and provided an R package that contains the implementations and simulation from the model. Finally, we show the applicability of the model and comparison with existing approaches in the simulation study as well as two real data applications.
Future work could extend the methods presented in this paper by estimating DREGAR coefficients using penalties that strike a trade-off between l 1 and l 2 norms, such as elastic net. We expect these methods to work well, as the l 2 penalty imposes less weight on small coefficients compared to the l 1 penalty. Such an extension is also expected to work well in the presence of correlation among the predictors. Moreover, it would be interesting to add GARCH-type errors to the model, similar to a recent contribution to the literature for the REGARMA model [24]. Finally, it would be of interest to extend the methodology to non-linear and non-stationary cases.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.