Semi-automated simultaneous predictor selection for Regression-SARIMA models

Deciding which predictors to use plays an integral role in deriving statistical models in a wide range of applications. Motivated by the challenges of predicting events across a telecommunications network, we propose a semi-automated, joint model-fitting and predictor selection procedure for linear regression models. Our approach can model and account for serial correlation in the regression residuals, produces sparse and interpretable models and can be used to jointly select models for a group of related responses. This is achieved through fitting linear models under constraints on the number of non-zero coefficients using a generalisation of a recently developed Mixed Integer Quadratic Optimisation approach. The resultant models from our approach achieve better predictive performance on the motivating telecommunications data than methods currently used by industry.

traditional (manual ) approaches for building statistical models are often infeasible for the ever-increasing volumes of data. Automating these approaches is thus necessary, and will allow principled statistical methods to continue being at the forefront of business practice.
The work in this article is motivated by modelling challenges faced by an industrial collaborator. The data we consider consists of daily event observations from multiple locations within a telecommunications network. The task of interest is to develop models for how the rates of these events depend on a range of external factors so as to better understand the physical relationship between the network and external influences. The number of such predictors of events considered for a model in this setting can be in the tens or hundreds, and often it is natural to choose candidates within groups of predictors. Whilst historically practitioners have fitted such models by hand, this is costly. The statistical challenge in this context is therefore to fit sparse and interpretable models for the responses, whilst accounting for the serial correlation in the data and ensuring we borrow information across the response variables. This modelling task needs to be accomplished with minimal human input.
A body of work in the statistical literature is devoted to predictor selection in univariate response models, see for example, Hocking (1976); Tibshirani (1996); Zou and Hastie (2005); Bertsimas et al (2016) and Hastie and Tibshirani (2017) and the references therein. Hastie et al (2008) provide an accessible review of many of these methods. In the multiple response setting, Breiman and Friedman (1997) and Srivastava and Solanky (2003) have shown that simultaneous model estimation has advantages over individual modelling procedures. Turlach et al (2005), Similia and Tikka (2007) and Simon et al (2013) consider selecting variables for the multi-response models used by Breiman and Friedman (1997) and Srivastava and Solanky (2003).
In our setting, due to the grouped nature and large number of potential predictors, it is natural to consider combinatorial approaches to predictor selection. To this end, in this article we propose a multivariate re-sponse implementation of the so-called best subset problem (Miller, 2002), and perform predictor selection via a generalisation of the Mixed Integer Quadratic Optimisation (MIQO) model approach of Bertsimas et al (2016). To the best of our knowledge, addressing the task of simultaneous predictor selection for multiple separate linear regression models via a MIQO formulation has not been considered in the literature.
Our approach is to fit the same model form for each response variable, but allow for the coefficients associated with a particular predictor to vary across each model. We expand the scope of the original MIQO formulation to automatically fit such a model in the presence of a known serial correlation structure for the time series of responses by considering more general regression seasonal autoregressive integrated moving average (Reg-SARIMA) models, and propose an iterative procedure that alternates between learning the serial correlation structure and fitting the model. We find that a more accurate specification of the model for the regression residuals can lead to a significant reduction in the variance of the predictor selection routine. Using the generalised least squares objective (Rao and Toutenburg, 1999) we can improve inference and predictor selection accuracy.
To improve model sparsity, our approach can also shrink the coefficients associated with a particular predictor to a common value if desired. The model fitting can be performed under constraints that avoid including highly correlated predictors, which increases the interpretability of the final models. Hence with our proposed semi-automated procedure, we reduce the human input by modelling characteristics of the response variables, instead of determining subjective pre-processing steps to remove this variation. The only user input needed is through choosing an appropriate set of initial predictors and potential non-linear transformations of these variables. Here, we estimate the serial correlation by pre-specifying a suitable list of time series models, although iterative approaches adopted by Hyndman and Khandakar (2008) could be incorporated very easily. Our implementation is computationally feasible for hundreds of predictors and multiple response variables.
This article is structured as follows. In Section 2 we review pertinent literature for predictor selection and propose how to use the formulations of Bertsimas and King (2016) to develop an automated modelling procedure. In Section 3 we introduce our multi-response MIQO formulation and extensions that can improve the performance of the models. In particular, Section 3.2 outlines our two-step procedure which can perform predictor selection whilst accounting for serial correlation in the data. Section 4 highlights the advantages of our approach over standard methods in the literature through a simulation study. We apply our approach to a motivating data application in Section 5 before concluding the article in Section 6.

Problem statement and existing approaches
In this section we first review the standard linear regression model and existing methods for choosing suitable predictors. We then outline how we propose to automate modelling for one response variable and show how expert opinion can be incorporated into the model. The linear regression model is able to describe the relationship between a response variable, Y and dependent variables, X 1 , . . . , X P as follows: where η is assumed to be normally distributed, η ∼ N (0, σ 2 η ). If the set of predictors X := {X 1 , . . . , X P } is known, the coefficients β = [β 1 , . . . , β P ] can be estimated with the standard ordinary least squares (OLS) estimatê (2) When P is large and X contains redundant predictors, OLS estimates can be unsatisfactory. Prediction accuracy can be improved by shrinking or setting some of the coefficients to zero (Hastie et al, 2008). Setting coefficients to zero removes the corresponding predictors from (1), leading to simpler, more interpretable models. Throughout this article we refer to the number of non-zero coefficients in the model as the model sparsity, which we denote by k. The regression model above assumes a linear relationship between predictors and a response variable but this may not be suitable (Rawlings et al, 1998). For instance, in our motivating example some telecommunication events are caused by long periods of heavy rainfall, causing underground cables to flood. Exponential smoothing can be applied to daily precipitation measurements to provide a surrogate predictor for ground water levels. This introduces the question of how best to choose the smoothing parameter. One option is to obtain such surrogate predictors for a grid of smoothing parameters; this both substantially increases the number of potential predictors to choose from, and can lead to highly correlated predictors. We note here that in other contexts, different transformed variables could be appropriate, for example models which include lagged predictors.
In this article, we focus on subset selection methods that attempt to choose the set of k predictors that give the smallest value of the residual sum of squares (2). A number of classical subset methods are described in detail by Hocking (1976). The forward-stepwise routine is the current algorithm of choice for selecting predictors by our industrial collaborator. This algorithm is usually initialised with an intercept term (the null model), and iteratively adds the predictor which most improves the least squares objective. This gives a fitted model with k predictors, for some k ∈ {1, . . . , P }. However, for any k ≥ 2 the model produced by stepwise methods is not guaranteed to be the best model with k predictors in terms of minimising the least squares objective. Despite the resultant sub-optimal models and issues raised by many authors, e.g. Beale (1970); Mantel (1970);Hocking (1976); Berk (1978), fast and easy implementation of these algorithms may explain why they remain popular.
Finding the model with sparsity k which minimises the least squares objective is known as the best subset problem (Miller, 2002). This optimisation problem is non-convex, and can be computationally challenging to solve when we have many predictors available. However, Bertsimas et al (2016) show that by appropriately formulating the problem and using recent developments in optimisation algorithms it is possible to perform best subset selection with hundreds of potential predictors and thousands of observations. Bertsimas et al (2016) also show that best subset selection tends to produce sparser and more interpretable models than more computationally efficient procedures such as the LASSO (Tibshirani, 1996).
2.1 Automated predictor selection procedure Automated model selection procedures limit an analyst's control over the output. Consequently, we do not seek a fully automated approach, but one that can produce sensible outputs with minimal user input for potentially hundreds of predictors. We thus propose a semi-automated procedure where an analyst supplies a suitable set of predictors, with which we use best subset selection to automatically choose the best model using this set.
We formulate the problem of choosing the best model as a Mixed Integer Quadratic Optimisation (MIQO) as suggested by Bertsimas et al (2016). The MIQO formu-lation with sparsity k solves the minimisation The binary variable z p takes the value 1 if predictor X p is used in the model and zero otherwise. Special ordered set constraints (3b) allow only one of 1 − z p , or β p , to be non-zero. Constraint (3c) controls the sparsity of the models by restricting the maximum number of predictors to k. The value k can be chosen with model selection criteria such as the AIC (Akaike, 1973) or BIC (Schwarz, 1978). Alternatively, cross-validation methods can be used (see e.g. Stone (1974)). The MIQO optimisation problems can be solved efficiently with modern optimisation solvers such as Gurobi (Gurobi Optimization, 2019). Good feasible solutions can be obtained for models with sparsity k + 1 using the optimal solution with sparsity k. By modifying the right-hand side of constraint (3c) Gurobi can automatically use the previous optimal solution to 'warmstart' the solver.
Treatment of correlated predictors. Similar to Bertsimas and King (2016), we can easily have additional constraints to the MIQO formulation, for example, to avoid including highly correlated predictors within our model. Specifically, we can add the constraint z p +z s ≤ 1, ∀ (p, s) ∈ HC := {(p, s) : Cor(X p , X s ) > ρ}.
(4) Constraint (4) allows at most one of z p or z s into the model for all pairs of highly correlated variables, specified by the set HC. Here, ρ can be seen as the maximum pairwise correlation between predictors that will be permitted to enter a model.
Incorporating expert knowledge. In many settings, expert knowledge may suggest predictors that must be present in the model. For example, it may be suitable to account for known outliers or other known external influences. Let the set J denote the indices of predictors that must be present in the model. This can be enforced by adding the constraint Expert knowledge may also suggest how the predictors should effect the response variables. For example, some predictors may be known to have a positive effect on the response variables (see e.g. Section 5). We propose to include this expert knowledge as follows. Let the sets P and N denote the sets of predictor indices that should have positive and negative effects on the response variables respectively. Then the constraints β p ≥ 0 ∀p ∈ P, and β p ≤ 0 ∀p ∈ N , ensure that the coefficients take the correct sign according to expert opinion, or the corresponding predictors are excluded from the models. As well as aiding context-specific interpretability of the models, an additional advantage of enforcing sign constraints is that we have observed that it speeds up the optimisation.
Including transformations of predictors. In Section 1 we discussed the need to determine the best parameter for a set of non-linear transformations of a predictor.
To ensure the best parameters are found in terms of minimising the least squares objective, we can use the following constraints. Let T i denote the set of predictors obtained by applying a non-linear transformation to a predictor over a grid of values. Then the constraints will ensure at most one of the predictors from each group T i will appear in the model.
Ensuring model sparsity. In our motivating application, as well as many other contexts, sparse models are desired to illustrate the strongest effects of a few predictors. Hence for computational reasons we suggest setting a maximum model sparsity k max . The choice of k max could be somewhat arbitrary. However, in our formulation the value k max can be determined automatically by using constraints of the form (4) and (5). These constraints suggests that there exists a maximum level of model sparsity where at least one constraint (4) or (5) will be violated if an additional predictor is included into the model. State-of-the-art optimisation solvers, such as Gurobi (Gurobi Optimization, 2019) will inform the user if an optimisation formulation is infeasible. We propose modifying the sparsity constraint (3c) as follows: If k > k max , a feasible solution to the modified best subset problem does not exist and the solver will inform the user of an infeasible optimisation model; larger predictor subsets can hence be discounted. In practice, an additional choice to reduce computation is to set a maximum runtime of the solver, as suggested by Bertsimas et al (2016). Often this finds the optimal solution quickly, but may take hours to provide the certificate of optimality.
3 Simultaneous predictor selection for systems of linear regression models Interpretability and consistency of models is important in industry. If a model is difficult to interpret then it is of limited use for practitioners trying to understand the dynamics of the system of interest. When models contradict expert opinion or take very different forms for a number of related response variables, the reliability of the models may be questioned. We now describe our proposed extension to the best subset formulation (3) to simultaneously select predictors and obtain models for multiple related response variables to ensure consistency in the selected predictors for each response variable.

MIQO formulation for multiple response variables
Consider estimating regression models for M response variables, where we assume that these response variables are suitable for joint analysis. We write the system of models as where η m ∼ N (0, σ 2 m ), and β m,p ∈ R for p = 1, . . . , P , m = 1, . . . , M .
Here, we assume that each response variable has a unique realisation of the P predictor variables. For example, suppose predictor X 1 corresponds to precipitation, then predictor X m,1 corresponds to the precipitation for response Y m . Let S m denote the set of selected predictors for response m. The current procedure used by our industrial collaborator often produces models where S m1 = S m2 , contrary to expert opinion. This motivates the following formulation, which we call the Simultaneous Best Subset (SBS) problem: subject to M m=1 S m ≤ k.

The union
M m=1 S m gives the selected predictors across all models: if all models contain the same predictors, then each model may have up to k predictors present.
As well as consistency in predictor selection, some similarity in the coefficients β 1,p , . . . , β M,p may be expected in considering multiple response variables. We can penalise for large dissimilarities in the coefficients by introducing dummy variablesβ 1 , . . . ,β P and adding the following penalty to the objective appearing in (7): The tuning parameter λ must be determined. For large λ the penalty (8) will dominate the objective and force the solver to encourage β 1,p , . . . , β M,p close toβ p , for p = 1, . . . , P . In practice, a suitable range of λ must be chosen. In what follows, we use a sequence of λ values equally spaced on the log scale between 0 and 2g k , where g k is the value of the objective of the solution to the SBS problem (7) with sparsity k. We have observed that coefficients become more stable for large values of λ, and that the coefficients β 1,p , . . . , β M,p become sufficiently close toβ p for p = 1, . . . , P when λ = 2g k . The number of binary variables in the optimisation model need not increase when simultaneously estimating multiple regression models -the number stays at P , the number of predictor variables. However, the number of constraints in the optimisation must be increased to ensure a feasible solution of (7) is obtained. To this end, we use the SOS − 1 constraints for p = 1, . . . , P, m = 1, . . . , M . These constraints, along with the sparsity constraint (3c), ensure that no more than k predictors are present across each of the M regression models. Analogous to Section 2.1, to prevent pairs of highly correlated predictors we define the set of highly correlated predictors HC in this setting as pairs (p, s) ∈ {1, . . . , P } × {1, . . . , P } such that By using the constraints of the form (4), we prevent any model in the system (6) containing pairs of predictors with correlation that exceeds ρ.

Extension to serially correlated data
Fitting linear regression models to time-ordered data often produces models where the observed residuals appear serially correlated (Brockwell and Davis, 2002). To remedy this issue, in this section we propose a two-step algorithm, similar in spirit to that of Cochrane and Orcutt (1949) that implements a predictor selection step to a generalised least squares (GLS) transform of the data. In what follows, we give an example of the GLS transform, before describing how we incorporate predictor selection.
Suppose we have a response variable Y and predictors X 1 , . . . , X P , and suppose the true model for the relationship between the response and predictors is In this setting, the regression residuals η t are serially correlated. Ignoring serial correlation in observed residuals not only mis-specifies the model but ignores potentially valuable information. Minimising the least squares objective (2) no longer gives the most efficient estimator for the regression coefficients (Rao and Toutenburg, 1999). Providing (10b) is stationary (see Brockwell and Davis, 2002) we can write (10) as a regression model with residuals that are not serially correlated via where L denotes the backward-shift operator such that Lη t = η t−1 . The linear filter can be applied to the response and predictor variables to obtain transformations of the original variables. In other words, the original variables can be writtenỸ t = Yt 1−φL andX t,p = Xt,p 1−φL . We show empirically in Section 4.2 that predictor selection accuracy can be improved by transforming the response and predictor variables appropriately. In general, neither the predictor variables present in the model or the serial correlation structure of the regression residuals are known. We assume a general regression seasonal autoregressive integrated moving average (Reg-SARIMA) model of the form where and propose the following two-step algorithm to determine the best predictors and autocorrelation structure of the regression residuals. First we seek suitable predictors for the model. We fix the sparsity k and use the data (Y 1 , X 1,1 , . . . , X 1,P ), . . . , (Y M , X M,1 , . . . , X M,P ) to determine a suitable set of predictors by solving the SBS problem. Given initial estimates of the coefficientŝ β k,0 1,1 , . . . ,β k,0 M,P , we then obtain the observed residuals for each model We need to estimate the serial correlation structure of the regression residuals. Given a list L of suitable SARIMA models, these models can be fit to the observed regression residualsη k,0 m,t for m = 1, . . . , M and the best SARIMA model identified for each m, for example, based on an appropriate information criterion. We require the transformed data for m = 1, . . . , M.
Consider fitting the SARIMA model (12b) to obtain the observed model errorsˆ m,t , This process can be applied to (13) and (14) to obtaiñ Y m,t andX m,p,t for m = 1 . . . , M and p = 1, . . . , P . Lastly, the predictors can be re-selected by solving the SBS problem again with the filtered data,Ỹ m,t and X m,p,t . This procedure can be iterated until convergence in the regression estimates, selected predictors, and the models for serial correlation. If the procedure does not converge quickly an upper limit to the number of iterations can also be considered. However, we have observed that convergence often occurs after two iterations. The pseudo-code for our two-step procedure, Two-stage Simultaneous Predictor Selection (SPS2) is shown in given in Algorithm 1.

Performance on simulated data
In this section we investigate the properties of our simultaneous predictor selection approach. In particular, we perform a number of simulations investigating how our SBS model compares to applying the standard best subset approach to estimate each linear regression Alg. 1: Pseudo-code for the two-step subset selection algorithm (SPS2) allowing for serial correlation. model separately. We compare our simultaneous estimation procedure to the LASSO (Tibshirani, 1996) and elastic net (Zou and Hastie, 2005). We also compare our approach to an alternative simultaneous estimation procedure: we modify the Simultaneous Variable Selection approach of Turlach et al (2005) to estimate the system of linear models (6). We call this approach SVS-m, the modified SVS approach; further algorithmic details of this procedure can be found in Appendix A.
We generate data from Model (6) where we fix the regression coefficients as for all m.
The predictors and residuals are simulated as follows: where The particular values of the residual variance, σ 2 η and predictor correlation, ρ will be clarified in each simulation. When the correlation between predictors is large, the predictors X 17 , X 18 , and X 19 become hard to distinguish and hence accurately selecting the correct generating predictors is challenging. We use P = 35 predictor variables as provably optimal solutions can be obtained within seconds for sparse models (see Appendix B).
In the simulations that follow we solve the SBS problem with M = 1, 5, 10, 20, 35, increasing the number of regression models used for simultaneous predictor selection and coefficient estimation. Note that M = 1 corresponds to the best subset approach of Miller (2002). In a simulation of size N , we record the number of times each application of the SBS approach recovers the true subset by applying the SBS approach with the sparsity set to the true value, k = 3. We also record the mean squared error of the regression coefficients given by whereβ m,p is the estimate of β m,p . This measure will penalise large deviations from the true coefficients and take account of potential variation as we change the value of M . Unless specified otherwise, we do not apply shrinkage as we wish to demonstrate the gains from simultaneous selection only.

Effect of correlated predictors
We start by investigating how predictor correlation affects selection accuracy for the best subset method, and how this improves for simultaneous predictor selection as the number of jointly-estimated models increases. We generate N = 1000 synthetic datasets using the specification (15) and fix σ 2 η = 1. Figure 1a shows the selection accuracy for simultaneous subset selection with differing values of M . We see that for the best subset method (M = 1), the accuracy deteriorates rapidly as the predictor correlation, ρ, exceeds 0.5. However, simultaneous predictor selection increases the correlation threshold at which selection accuracy deteriorates to 0.87 with just five models. Consequently, the mean squared error in coefficient estimates decreases, as can be seen from Figure 1b. Selection accuracy is seen to improve further with a greater number of models estimated simultaneously.
We also investigate the performance of SBS with increasing residual variance, σ 2 η for differing values of M and data length, T ; as one might expect, with increasing residual variance it is much harder to recover the true predictors. For reasons of brevity, these results are deferred to Appendix B.

Simultaneous shrinkage
The coefficients obtained from minimising the least squares objective with highly correlated predictors can suffer from high variance. As such, the variation in selected predictors for the best subset method is also high, as shown in Section 3.1, mirroring the observations by Hastie et al (2008). To investigate the effect of shrinking coefficients for each predictor towards a common value, we fix M = 5 and simulate T = 750 observations for each response variable and their associated predictors from the model (15). We split the data randomly into two sets, using 500 observations for each response variable as a training set to estimate the models. The remaining 250 observations are used to determine the predictive accuracy of the models. We fix ρ = 0.95 and σ 2 η = 2 and again consider when k = 3. to show the effects on in-sample and out-of-sample prediction error. Figure 2 shows the MSE for both scenarios over a range of increasing penalty values, λ. By penalising the differences in β 1,p , . . . , β M,p for p = 1, . . . , P , we bias the estimates of the regression coefficients, increasing the in-sample error (see Figure 2a). However this leads to improved out-of-sample prediction error (see Figure  2b) as information is shared across regression models by shrinking the coefficients for each predictor to a common value. Figure 3 shows shows trace plots of the regression coefficients (for one simulated dataset) for each of the five response variables in the system, as the value of the simultaneous shrinkage penalty increases. The horizontal lines show the coefficients of predictors X 17 ,X 18 , and X 19 .
As the penalty increases, the simultaneous best subset changes. Despite seeking the best subset of predictors given the true level of sparsity, the true predictors are not initially selected upon solving the SBS problem. Two of the three predictors are correctly identified although the estimates for each model are rather far from the truth. A spurios (zero) predictor is also selected with relatively large coefficients for some of the models (indicated by non-zero coefficients for β m,21 and β m,27 ). As the strength of the joint shrinkage is increased, the noisy predictor leaves for the true third predictor, reenters the models, upon being replaced finally for the true third predictor again. At this point, the coefficients for all three predictors in each of the regression models   appear significantly closer to the true values in comparison to the solutions obtained upon solving the initial SBS problem.

Performance on serially correlated data
In Section 3.2 we motivated the need to consider autocorrelated regression residuals in predictor selection problems. In this section we demonstrate that we can recover both the true predictors and correlation structure of the regression residuals using the two-step algorithm described in Section 3.2. To this end, we simulate data from Model (6) but now impose a correlation structure on the residuals, taking the form with e m,t ∼ N (0, 1), i.e. the residuals η m,t follow an AR(1) or SARIMA(1,0,0)(0,0,0,0) model. The predictors and regression coefficients are the same as those in Section 3.1. Our industrial collaborator often observes large changes in the predictors that are selected when the number of observations available changes only slightly. For N = 50 datasets of length T = 600 simulated under model (16), we apply our two-step algorithm (with k = 3) to each simulated dataset a total of six times: first we use the first 500 datapoints, then first 520 and so on until all 600 points are used. We highlight the predictors selected in each application with and without using the two-step algorithm in Figure 4. The selected predictors for each of the simulated datasets are shown within each set of vertical lines. From left to right the vertical triplet of dots indicate the selected predictors for T = 500, 520, . . . , 600 within each set of vertical bars.
For the standard selection procedure, the variation of selected predictors within each dataset is quite alarming as well as the range of predictors across different simulated datasets, reflecting the sensitivity to data length as experienced by our industrial collaborator. This is shown in Figure 4a. In comparison, using the two-step algorithm (Figure 4b) we observe much less variation in the selected predictors. Further, the algorithm selects the true predictors in many cases.
We now investigate how well we can recover the true correlation structure of the regression residuals.
Recall that the correct model order from specification (16) is (1, 0, 0)(0, 0, 0). Figure 5 shows that the model order was correctly identified for a particular simulation if '.' appears on each row, or the value of the order (p, d, q)(P, D, Q) chosen if it were mis-specified.
From Figure 5, we see that correct values were chosen for the majority of values of the six model orders (p, d, q) and (P, D, Q). We observe that at least one autoregressive parameter was used (p ≥ 1) for each dataset, sometimes erroneously using more or including another term, however this is often the case with model selection criteria such as the AIC or BIC. Modifying the penalty used to select the regression residual model may improve accuracy of selecting these models.

Comparison to other approaches
In this simulation we replicate the scenario that motivated our SBS approach. In particular, we simulate series with five blocks of highly correlated predictors. A block of predictors is denoted The predictors are simulated as for b = 1, . . . , 5. We vary the positions of the active predictors relative to their blocks and the values of the regression coefficients. The regression coefficients take the form Our primary goal is to compare SBS to current methods in the literature. We apply the elastic net using the glmnet package (Zou and Hastie, 2018) implemented in the R statistical software (R Core Team, 2019), over the default values, α = 0, 0.01, . . . , 1 and for 100 values of λ to produce a model for each m = 1, . . . , 5. We train each model with T = 500 observations and then use the mean squared prediction error on a 250 observation held-out test set to select the best elastic net model for each m = 1, . . . , 5. We also compare our results to a forward stepwise algorithm using the standard step function (R Core Team, 2019) for each m, selecting the best model by AIC. We also apply the modified SVS approach (SVS-m), as well as a variant with the regression coefficients constrained to be positive which we denote SVS − m + . We select the models fit by the simultaneous procedures by considering the simultaneous mean squared prediction error defined in (17).  Fig. 4: Comparison of the predictors selected using the standard approach and two-step iterative approach: (a) standard procedure, ignoring autocorrelation in the regression residuals (unfiltered covariate selection); (b) twostep procedure SPS2 (filtered covariate selection).  For each of the selected models we record the following performance measures averaged over N = 50 datasets across each of the models for the M response variables: -The average number of predictors (model sparsity), k = M m=1 P p=1 1 βm,p =0 . -The mean squared prediction error on a 250 observation held-out validation set.
-The number of models containing the true subset of predictors. -The number of models that included at least one negative coefficient.
The average model sparsity will help inform the interpretability of the models, whilst the prediction error allows us to compare the performance numerically. The average number of models containing the true subset indicates the accuracy of each method as a predictor selector. By counting the number of models with negative coefficients, we can compare how often our industrial collaborator may have obtained misleading models. Note that the elastic net uses 100 values of both α and λ which fits 1000 elastic net models to each response variable.
The summary measures of all approaches are shown in Table 1. Our proposed SBS approach produces the sparsest models aided by the transformation constraints (5), with the average sparsity being slightly lower than the true sparsity. The most likely cause of this is due to not selecting predictor 2 (the relative value of the coefficient is small in comparison to the other predictors). The only method able to recover the true subset was our SBS approach in half of the simulations. The SBS and SVS-m + techniques always include coefficients with positive values and were the only approaches which did so. All other methods included at least one negative coefficient in a high number of models. Figure 6 shows the average estimate for the regression coefficients for each of the methods in the study. With the exception of predictor 2, the SBS method appears to give unbiased estimates. Underestimating β m,2 is likely caused by the small coefficient value where the predictor was not included. The other methods tend to underestimate all of the coefficients which may be expected since they are all shrunk towards zero.
We have also investigated computational aspects (e.g. runtime) of our SBS approach when varying the number of response variables, M . For reasons of brevity we do not include this here, but further details can also be found in Appendix B.

Telecommunications data study
We now demonstrate our proposed methodology on a dataset provided by our industrial collaborator. In our motivating application, the total number of daily events in a telecommunications network are recorded by type and location within the network. Each type of event may be influenced by a different set of predictors. For the dataset we consider here, location corresponds to a geographic location, but more detailed information such as the location within the network is available in other applications. We use three response variables of the same type (denoted R1, R2 and R3) from regions in the network considered to be suitable for joint modelling. Urban or rural classifications may help determine if response variables are suitable for joint modelling. There are a total of 1396 daily observations, corresponding to about 3 years 9 months of data.
We use five groups of predictor variables. Motivated by the remarks in Section 2.1 in relation to weather variables, the first four groups of predictors are derived from transformations applied to the following predictors: Group 1: Humidity: The mean relative humidity (gm −3 ) over a 24-hour period. Group 2: Wind speed: The maximum recorded wind speed (mph) within a 24-hour period. Group 3: Precipitation: The total amount of rainfall (mm) within a 24-hour period. Group 4: Lightning: The total number of lighting strikes within a 24-hour period.
The particular base transformation we consider is exponential smoothing, defined by where we set x 1,s = x 1,t . In equation (18) the tuning parameter α is used to adjust how much the time series x t,p is smoothed: a value of α close to 1 will produce a time series very close to the original, whilst a value of α close to 0 will produce a time series that evolves much more slowly. We apply the transformation to the predictors above for a range of values of α, with the particular number and values chosen to sufficiently capture the non-linear effects for each predictor (guided by our industrial collaborator). Note that due to the nature of the telecommunication events, all potential predictors should have a positive relationship to the response variables. The last group relates to indicator variables to adjust for calendar effects which are likely to influence the event data. In particular we include three indicator variables, corresponding to the Christmas bank holiday (Christmas day and Boxing day); 27th December until New Year's Day; and any other bank holiday 1 . We present three methods for modelling the event data. The first method (denoted Automated) is our simultaneous predictor selection approach for multiple response variables, using our two-step procedure SPS2 to estimate a model for the regression residuals. The Individual Automated approach uses the two-step procedure of the first method, but is applied to each response variable separately. Consequently Individual Automated cannot take advantage of simultaneous predictor selection; we present this method to highlight the gains in a simultaneous predictor selection approach. Finally, Current is the procedure adopted by our industrial collaborator, included as a baseline comparison. This method removes the weekly seasonality and calendar effects from the response variables as part of a data pre-processing step, as these are not thought to   be attributed to the effects of the predictors of interest (hence the bank holiday group of predictors is not considered for Current). Data pre-processing choices can be subjective, as well as being time-consuming and therefore costly. Furthermore, such pre-processing does not allow joint estimation of the external predictor, bank holiday effects and seasonality. Our two-step procedure for fitting a Reg-SARIMA model allows seasonality to be incorporated directly into the model specification which is iteratively updated as the predictor coefficients are refined. By modelling seasonality we can obtain more accurate estimates of prediction uncertainty and completely remove the need to pre-process the data by including calendar effects as indicator variables.
The estimated regression coefficients for the three approaches are given in Table 2. An immediate observation from Table 2 is that the models produced by the automated, two-step procedures (Automated and Individual Automated methods) are much sparser than those produced by the Current approach, not considering the calendar effects. Furthermore, all coefficients for the weather predictors produced from Automated and Individual Automated methods are positive, which, as outlined before, would be expected in this context for the telecommunications event data. In contrast, the Current method includes highly correlated predictors, from the same group, and with opposing effects; for example, all six transformed variables of predictor 3 are included. Both large negative and large positive coefficients appear for the predictor variables from Group 3 for the Current method. This reflects the behavior of the least squares estimator discussed by Hastie et al (2008) which motivated the use of the ridge penalty (Hoerl and Kennard, 1970). Using simultaneous predictor selection and constraining the sign of the coefficients we are able to select the single best transformation of predictor 3.
The mean squared errors for 14-day ahead predictions for the three methods are given in Table 3. The prediction accuracy is significantly reduced using the Automated and Individual Automated approaches that produce Reg-SARIMA models, rather than using a preprocessing step (Current). Recall that the Reg-SARIMA methods model the seasonality and calendar effects explicitly rather than remove it. They also describe the effects of other predictors. By selecting predictors simultaneously, the Automated approach provides more accurate forecasts of the response variables. We can see from Table 2 that different predictors from Groups 1 and 3 are chosen in comparison to Regions 1 and 2.
To determine whether the SARIMA models produced by the Automated method have adequately captured the autocorrelation and seasonality within the data we can inspect the sample autocorrelation and sample partial autocorrelation functions of the model Group 1 (humidity)     Table 3: Mean squared error for 14 day ahead predictions for each of the three response variables and the three methods described in the text.
errors. The sample autocorrelation functions for the Automated and Current are shown in Figure 7. The plots show that there is very little significant unmodelled autocorrelation left in the residuals for the Automated technique, demonstrating that modelling the regression residuals as a SARIMA process accounts for most of the temporal correlation (full model specifications for the Automated procedure can be found in Appendix C). In contrast, the Current method appears to violate the typical regression assumptions of independent regression residuals as there is significant remaining autocorrelation at many lags in the regression residuals for all three response variables. Similar conclusions can be drawn from the plots for the sample partial autocorrelation functions; these are shown in Figure 8.
When serial correlation in the regression residuals is ignored the standard errors for each of the regression coefficients may be severely underestimated (Rawlings et al, 1998). This would raise suspicions about the sig-nificance of any predictor in the model. Further, prediction intervals are likely to be too narrow. Our observations mirror this tendency -the standard errors of the regression coefficients for the three response variables produced from the Automated and Current methods are shown in Appendix C.

Concluding remarks
Motivated by an industrial problem we have proposed a procedure to help automate the modelling process of telecommunications data. More specifically, we have developed a MIQO model to solve the simultaneous best subset problem to select predictors when jointly modelling multiple response variables. We have incorporated predictor selection within a two-step procedure, that iterates between selecting predictors for a regression model and modelling the serial correlation of the regression residuals. Automation is achieved by placing constraints in the MIQO formulation to ensure sensible models are produced, and by eliminating the need to pre-process the data through modelling calendar affects and seasonality.
We have shown that predictor selection accuracy can be improved by simultaneously selecting predictors for multiple response variables. Selection accuracy and coefficient estimation can further be improved by shrinkage. The shrinkage we introduced is specifically designed for settings when joint estimation of models is considered -in contrast to LASSO-like penalties that shrink coefficients towards zero, our shrinkage method forces coefficients between models to a common value.
Whilst not relevant for our dataset, an interesting avenue for future research would be to investigate the impact of modelling the regression residuals simultaneously. For example, in other settings modelling the regression residuals as a vector autoregression (VAR) could explain both temporal and cross correlations between the regression residuals between multiple responses. We anticipate that prediction error may be reduced further as well as give a consistent form for the regression residuals between responses.
Acknowledgements Lowther gratefully acknowledges financial support from EPSRC and BT via the STOR-i Centre for Doctoral Training.
A Python implementation of the methods in this article will be publicly available in due course at https://github.com/aaronplowther/sps.

A Algorithmic details of the SVS-m procedure
In this section we outline the Simultaneous Variable Selection (SVS) problem and Convex Quadratic Program (CQP) used to solve it by Turlach et al (2005). The SVS problem was developed as an exploratory data analysis tool to determine suitable predictors for multiple response regression models. We modify the CQP to produce estimates for multiple regression models simultaneously.
The problem proposed by Turlach et al (2005) to acheive simultaneous predictor selection for multi-response models is We propose a slight modification of this problem more suited for predictor selection in multiple separate linear regression models, as follows: Here, we allow a realisation of predictor p for each response variable Y m , for m = 1, . . . , M . The convex quadratic program formulated by Turlach et al (2005) where u M ∈ R M with each entry equal to 1, and z ∈ R P give auxiliary variables. We modify formulation (21)  x m,p,t β m,p 2 subject to The final step is to determine the maximum value t. We set t max = M m=1 P p=1 |β m,p | whereβ m,p minimises M m=1 T t=1 y m,t − P p=1 x m,p,t β m,p 2 . All coefficients which are solutions to formulation (22) are non zero. We apply the same heuristic proposed by Turlach et al (2005) to determine those that should be zero. Let All coefficients β m,p for p / ∈ I are set to zero.

B Additional simulation results
This appendix provides additional simulation results for our proposed algorithm described in the main article. We also summarise some computational aspects of the algorithm.

B.1 Simulated performance of SBS for increasing residual variance
As the variance of the regression residuals increases the variation in the response variables is increasingly attributed to randomness rather than changes in the predictors. This makes it much harder to recover the true predictors. To investigate this, we simulate N = 1000 synthetic datasets and compare the predictor selection accuracy and mean squared error of the regression coefficients between the best subset method, applied independently to the M regression models, and our simultaneous best subset implementation. The level of sparsity, k = 3, is set to the true model sparsity.  for different values of jointly-modelled responses, M . With ρ = 0.95 we observe the deterioration in selection accuracy as σ 2 η increases and how this is improved by the SBS approach. In particular, Figure 9 shows that the best subset method is unable to recover the true predictors for σ 2 η ≥ 3. Improvements in predictor selection can be achieved by increasing M , and there appears to be improved consistency in the SBS approach over the best subset approach as M increases.
We also compare the selection accuracy for SBS applied with M response variables each with T observations in Figure  10 to the best subset method (M = 1) with M T observations. Our gain in selection accuracy is not quite as high as previously observed. However, in many applications of interest (for example our motivating telecommunications setting) it is not always possible to increase the number of observations. Best subset accuracy, changing residual variance (ρ = 0.95) T=500 T=2500 T=5000 T=10000 T=17500 Fig. 10: Selection accuracy of SBS as the residual variance σ 2 η increases for increasing dataset length, T .

B.2 Computational aspects
We now illustrate the total runtime of the solver in worst case scenarios. In this simulation study, all data is simulated from the multi-response regression model where the predictors have the specification X ∼ MVN P (0, Σ) where Σ ∈ R P ×P := Σ i,j = 0.25 |i−j| .
The number of response variables, M and the number of predictors, P will be made clear where relevant. The regression coefficients are given by β m,p = 1, if p = 1, 3, 5, 0, otherwise, for m = 1, . . . , M.
In each simulation we use T = 500 observations. Firstly, we investigate the impact of increasing M on our formulation for the SBS problem with P = 30 predictors. The number of continuous variables in the SBS problem increases linearly with M and causes the total solve time to increase quadratically. This can be seen in Figure 11.
There is a near linear trend for the solve time on the square root scale. As the number of predictors, P increases, the number of continuous and binary variables increase linearly. For a fixed k, the possible number of predictor combinations increases exponentially. The total solve times are shown

C Additional model details from the telecommunications data study
In this section we provide the full specification of Reg-SARIMA models produced by the Automated method for the three response variables. More specifically, the model coefficients, standard errors, and lower and upper quartiles (LQ and UQ respectively) of the model coefficients for the Automated approach are shown in Tables 4 -6. For the Current method, the results are shown in Tables 7 to 9.       Table 9: Model summary for Response 3 from the standard approach used by our industrial collaborator.