1 Introduction

The use of statistical models to drive business efficiency is becoming increasingly widespread (Provost and Fawcett 2013). Consequently, organisations are recording more and more data for subsequent analysis (see Katal et al. (2013) or Jordan and Mitchell (2015) for a review of current modelling challenges in this area). As a result, traditional (manual) approaches for building statistical models are often infeasible for the ever-increasing volumes of data. Automating these approaches is thus necessary and will allow principled statistical methods to remain at the forefront of business practice.

Our work is motivated by challenges faced by an industrial collaborator. In various parts of the business, diagnostic applications rely on the interpretability of models to guide investment or improvement programmes that correct for the impact of important predictors. In these applications, e.g. modelling building-level energy consumption, accurate demand predictions allow effective capacity planning and efficient maintenance scheduling.

In this article, we focus on one such application representative of a typical industrial modelling challenge. The data we consider consist of daily events from multiple locations within a telecommunications network. Telecommunications events are often influenced by external predictors, for example, weather variables. The relationship between the predictors and the observed response variables is often complex and nonlinear, and the number of candidate predictors considered for a model in this setting can be in the tens or hundreds. Often, a single candidate must be chosen from within each group of similar predictors; for example, including multiple predictors pertaining to one particular weather variable in a model hinders interpretability. To build trust in the models with stakeholders outside the modelling team, it is also important to produce models that do not contradict expert knowledge. Due to technological and organisational changes, models often have to be refit, making the laborious task of manually fitting models increasingly unmanageable.

The statistical challenge in this context is therefore to fit sparse and interpretable models for the responses, whilst accounting for the serial correlation in the data and ensuring we borrow information across the response variables to produce a unique set of predictors across all responses. This modelling task needs to be accomplished with minimal human input.

Pooling information across response variables is by no means new. There are many methods that can be used to model data with multiple responses such as described above. For example, spatiotemporal regression models (see, for example, Stroud et al. 2001) can explain correlation in time and space, but impose too specific a correlation structure for the breadth of applications in our industrial setting. Multitask learning can be applied to neural networks to leverage knowledge for multiple related tasks (Caruana 1997; Duong et al. 2015). However, this technique is more appropriate in settings where different training sets and predictors are available for each response; in addition, whilst neural networks can be effective at capturing nonlinear effects, the resultant models are often difficult to interpret. Similarly, reduced rank regression (Izenman 1975; Reinsel and Velu 2013) exploits correlation amongst multiple response variables in multiresponse regression to determine good linear combinations of the predictors. However, this is not ideal as it loses the interpretability of the predictor effects and fits multiresponse models where we wish to fit multiple models simultaneously. In contrast, regression seasonal autoregressive integrated moving average (Reg-SARIMA) models are able to explain the effects of predictors on a response variable, capture temporal correlation, and are easily explained due to their linear nature. Nonlinear effects of predictors can be included by transforming the observed predictors. Provided the models are sparse, they are often interpretable. Hence, we restrict our attention to selecting predictors simultaneously within such models.

A body of work in the statistical literature is devoted to predictor selection in univariate regression models; see, for example, Hocking (1976), Tibshirani (1996), Zou and Hastie (2005), Bertsimas et al. (2016) and Hastie and Tibshirani (2017) and the references therein. Hastie et al. (2008) provide an accessible review of many of these methods. In the multivariate response setting, it has been shown that simultaneous model estimation has advantages over individual modelling procedures (see, for example, Breiman and Friedman 1997; Srivastava and Solanky 2003). Predictor selection for multivariate response models has been considered by Turlach et al. (2005), Similä and Tikka (2007) and Simon et al. (2013).

Recall that in our industrial setting, we would like to choose candidates from groups of predictors, and the number of potential predictors is large; it is thus natural to consider combinatorial approaches to predictor selection. We propose a multivariate response implementation of the so-called best subset problem (Miller 2002) and perform predictor selection via a generalisation of the Mixed Integer Quadratic Optimisation (MIQO) approach of Bertsimas et al. (2016) to fit sparse regression models to all responses simultaneously. Bertsimas and King (2016) have shown that by using binary optimisation variables, it can be effortless to impose constraints on the selected predictors with some guarantee on desirability of the models obtained.

We expand the scope of the original MIQO formulation to fit such models automatically in the presence of a known serial correlation structure for the time series of responses by considering more general Reg-SARIMA models, and we propose an iterative procedure that alternates between learning the serial correlation structure and fitting the model. We find that a more accurate specification of the model for the regression residuals can lead to a significant reduction in the variance of the predictor selection routine. Using the generalised least squares objective (Rao and Toutenburg 1999), we can improve both model fit and predictor selection accuracy.

To improve model sparsity, our approach can also shrink the coefficients associated with a particular predictor to a common value if desired. The model fitting can be performed under constraints that avoid including highly correlated predictors, which increases the interpretability of the final models. Hence, with our proposed semi-automated procedure, we reduce the human input by modelling characteristics of the response variables, instead of relying on subjective pre-processing steps to remove such variation. The only user input needed is the choice of an appropriate set of initial predictors and potential nonlinear transformations of these variables. Here, we estimate the serial correlation by pre-specifying a suitable list of time series models, although iterative approaches such as that adopted by Hyndman and Khandakar (2008) could be incorporated very easily. Our implementation is computationally feasible for hundreds of predictors and multiple response variables; the optimisation problems we formulate can be solved with a number of common optimisation solvers, see Kronqvist et al. (2019) for a comprehensive discussion of such solvers.

This article is structured as follows. In Sect. 2, we review pertinent literature for predictor selection and propose how to use the formulations of Bertsimas and King (2016) to develop an automated modelling procedure. In Sect. 3, we introduce our multiresponse MIQO formulation and extensions that can improve the performance of the models. In particular, Sect. 3.2 outlines our two-step procedure which can perform predictor selection whilst accounting for serial correlation in the data. Section 4 highlights the advantages of our approach over standard methods in the literature through a simulation study. We apply our approach to a motivating data application in Sect. 5 before concluding the article in Sect. 6.

2 Problem statement and existing approaches

In this section, we first review the standard linear regression model and existing methods for choosing suitable predictors. We then outline how we propose to automate modelling for one response variable and show how expert opinion can be incorporated into the model.

The linear regression model describes the relationship between a response variable, Y, and predictor variables, \(X_1, \ldots , X_P\), as follows:

$$\begin{aligned} Y = \sum _{p=1}^{P} X_{p} \beta _p + \eta , \end{aligned}$$
(1)

where \(\eta \) is assumed to be normally distributed, \(\eta \sim N(0,\sigma ^2_\eta )\). If the set of predictors \(\mathcal {X}:=\{X_1, \ldots , X_P\}\) is known, the coefficients \(\varvec{\beta } = [\beta _1, \dots , \beta _P]\) can be estimated with the standard ordinary least squares (OLS) estimate

$$\begin{aligned} \hat{\varvec{\beta }}_\mathrm{OLS} = \mathop {{{\,\mathrm{arg\,min}\,}}}\limits _{\varvec{\beta }} \left\{ \sum _{t=1}^{T} \left( y_t - \sum _{p=1}^{P}X_{t,p} \beta _{p} \right) ^2 \right\} . \end{aligned}$$
(2)

When P is large and \(\mathcal {X}\) contains redundant predictors, OLS estimates can be unsatisfactory. Prediction accuracy can be improved by shrinking or setting some of the coefficients to zero (Hastie et al. 2008). Setting coefficients to zero removes the corresponding predictors from (1), leading to simpler, more interpretable models. Throughout this article, we refer to the number of nonzero coefficients in the model as the model sparsity, which we denote by k.

The regression model above assumes a linear relationship between predictors and a response variable, but this may not be suitable (Rawlings et al. 1998). For instance, in our motivating example, some telecommunication events are caused by long periods of heavy rainfall, causing underground cables to flood. Exponential smoothing can be applied to daily precipitation measurements to provide a surrogate predictor for groundwater levels. This introduces the question of how best to choose the smoothing parameter. One option is to obtain such surrogate predictors for a grid of smoothing parameters; this both substantially increases the number of potential predictors to choose from and can lead to highly correlated predictors. We note here that in other contexts, different transformed variables could be appropriate, for example, models which include lagged predictors.

In this article, we focus on subset selection methods that attempt to choose the set of k predictors that gives the smallest value of the residual sum of squares (2). A number of classical subset methods are described in detail by Hocking (1976). The forward-stepwise routine is the current algorithm of choice for selecting predictors by our industrial collaborator. This algorithm is usually initialised with an intercept term (the null model) and iteratively adds the predictor which most improves the least squares objective. This gives a fitted model with k predictors, for some \(k\in \{1,\ldots ,P\}\). However, for any \(k \ge 2\), the model produced by stepwise methods is not guaranteed to be the best model with k predictors in terms of minimising the least squares objective. Despite the resulting sub-optimal models and the issues raised by many authors, e.g. Beale (1970), Mantel (1970), Hocking (1976) and Berk (1978), the fast and easy implementation of these algorithms may explain why they remain popular.
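To make the baseline concrete, the following is a minimal sketch of the forward-stepwise routine in Python. The function name is ours, and we assume an intercept column is already included in X if desired; this is not our collaborator's implementation.

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy forward selection: starting from the null model, repeatedly add
    the predictor that most reduces the residual sum of squares (RSS)."""
    T, P = X.shape
    selected = []          # indices of predictors currently in the model
    for _ in range(k):
        best_rss, best_p = np.inf, None
        for p in range(P):
            if p in selected:
                continue
            Xs = X[:, selected + [p]]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_p = rss, p
        selected.append(best_p)   # greedy step: no guarantee of global optimality
    return selected
```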

Finding the model with sparsity k which minimises the least squares objective is known as the best subset problem (Miller 2002). This optimisation problem is non-convex and can be computationally challenging to solve when we have many predictors available. However, Bertsimas et al. (2016) show that by appropriately formulating the problem and using recent developments in optimisation algorithms, it is possible to perform best subset selection with hundreds of potential predictors and thousands of observations. Bertsimas et al. (2016) also show that best subset selection tends to produce sparser and more interpretable models than more computationally efficient procedures such as the LASSO (Tibshirani 1996), particularly when the signal-to-noise ratio is high.

2.1 Automated predictor selection procedure

Automated model selection procedures limit an analyst's control over the output. Consequently, we do not seek a fully automated approach, but one that can produce sensible outputs with minimal user input for potentially hundreds of predictors. We thus propose a semi-automated procedure in which an analyst supplies a suitable set of predictors, from which best subset selection automatically chooses the best model.

We formulate the problem of choosing the best model as a Mixed Integer Quadratic Optimisation (MIQO) program as suggested by Bertsimas et al. (2016). The MIQO formulation with sparsity k solves the following optimisation problem

$$\begin{aligned} \min _{\varvec{\beta }, \varvec{z}} \sum _{t=1}^{T} \left( y_t - \sum _{p=1}^{P} X_{t,p} \beta _{p} \right) ^2&\quad \text {subject to} \end{aligned}$$
(3a)
$$\begin{aligned} \sum _{p=1}^P z_p \le k,&\end{aligned}$$
(3b)
$$\begin{aligned} (1-z_p, \beta _{p}) \in \mathcal {SOS}_1,&\quad p = 1, \ldots , P, \end{aligned}$$
(3c)
$$\begin{aligned} z_p \in \{0,1\},\ \beta _p \in \mathbb {R},&\quad p = 1, \ldots ,P. \end{aligned}$$
(3d)

The binary variable, \(z_p\), indicates whether the predictor, \(X_p\), is present in the model or not. Constraint (3b) limits the number of predictors allowed to enter a model to at most k. The value k can be chosen with model selection criteria such as the AIC (Akaike 1973) or BIC (Schwarz 1978). Alternatively, selection can be based on cross-validation [see, for example, Stone (1974)]. Predictor \(X_p\) is excluded from a model if the corresponding coefficient, \(\beta _p\), is zero. The coefficients of the excluded predictors are controlled by the special ordered set constraints (of type 1) (3c), which ensure that at most one of \(1-z_p\) and \(\beta _p\) is nonzero. Alternatively, it is possible to control the maximum absolute value of the coefficients and ensure the coefficients of the excluded predictors are zero using so-called Big-M constraints

$$\begin{aligned} -M z_p \le \beta _p \le M z_p, \quad \text {for} \ p=1, \ldots , P. \end{aligned}$$

The parameter in these constraints, M, can be estimated using data-driven approaches, see Bertsimas et al. (2016) for more discussion. In some problems, Big-M constraints can lead to improvements in the time to solve the optimisation problems as seen in Soltysik and Yarnold (2010).
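As an illustration, the sketch below formulates problem (3) with Big-M constraints using Gurobi's Python interface. The function name, the use of gurobipy rather than any particular solver, and the fixed `big_m` value are our assumptions; in practice M should be estimated from the data as discussed above.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def best_subset_miqo(X, y, k, big_m=10.0):
    """Best subset selection as an MIQO, in the spirit of Bertsimas et al. (2016)."""
    T, P = X.shape
    model = gp.Model("best_subset")
    beta = model.addVars(P, lb=-GRB.INFINITY, name="beta")
    z = model.addVars(P, vtype=GRB.BINARY, name="z")
    # Objective (3a): residual sum of squares
    resid = [y[t] - gp.quicksum(X[t, p] * beta[p] for p in range(P))
             for t in range(T)]
    model.setObjective(gp.quicksum(r * r for r in resid), GRB.MINIMIZE)
    # Sparsity constraint (3b)
    model.addConstr(z.sum() <= k, name="sparsity")
    # Big-M constraints linking beta_p and z_p (in place of the SOS-1 constraints (3c))
    for p in range(P):
        model.addConstr(beta[p] <= big_m * z[p])
        model.addConstr(beta[p] >= -big_m * z[p])
    model.optimize()
    return (np.array([beta[p].X for p in range(P)]),
            np.array([int(round(z[p].X)) for p in range(P)]))
```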

Following Bertsimas and King (2016), we now discuss how different constraints in the MIQO formulation can be used to ensure that we obtain desirable models from our procedure.

Treatment of correlated predictors      Similar to Bertsimas and King (2016), we can easily add constraints to the MIQO formulation, for example, to avoid including highly correlated predictors within our model. Specifically, we can add the constraint

$$\begin{aligned} z_p + z_s \le 1, \ \forall \ (p,s) \in \mathcal {HC}:= \{(p,s) : |\text {Cor}(X_p,X_s)| > \rho \}. \end{aligned}$$
(4)

Constraint (4) allows at most one of \(X_p\) or \(X_s\) into the model if the absolute correlation between them exceeds \(\rho \). Including this constraint helps to induce sparsity. By including a single predictor from a pair of highly correlated predictors, we reduce the complexity of the model without a significant loss in explanatory power of the response variable. This is particularly true in cases where the absolute pairwise correlation is close to 1.

Incorporating expert knowledge      In many settings, expert knowledge may suggest predictors that must be present in the model. For example, it may be suitable to account for known outliers or other known external influences. Let the set \(\mathcal {J}\) denote the indices of predictors that must be present in the model. This can be enforced by adding the constraint

$$\begin{aligned} z_p = 1, \quad \forall \ p \in \mathcal {J}. \end{aligned}$$

Expert knowledge may also suggest how the predictors should affect the response variables. For example, some predictors may be known to have a positive effect on the response variables (see, for example, Sect. 5). We propose to include this expert knowledge as follows. Let the sets \(\mathcal {P}\) and \(\mathcal {N}\) denote the sets of predictor indices that should have positive and negative effects on the response variables, respectively. Then, the constraints

$$\begin{aligned} \beta _{p} \ge 0 \quad \forall p \in \mathcal {P}, \quad \text {and} \quad \beta _{p} \le 0 \quad \forall p \in \mathcal {N}, \end{aligned}$$

ensure that the coefficients take the correct sign according to expert opinion, or the corresponding predictors are excluded from the models. As well as aiding context-specific interpretability of the models, an additional advantage of enforcing sign constraints is that we have observed that it speeds up the optimisation.

Including transformations of predictors      Earlier in this section, we discussed the need to determine the best parameter for a set of nonlinear transformations of a predictor. To ensure the best parameters are found in terms of minimising the least squares objective, we can use the following constraints. Let \(\mathcal {T}_i\) denote the set of predictors obtained by applying a nonlinear transformation to a predictor over a grid of parameter values. Then, the constraints

$$\begin{aligned} \sum _{p \in \mathcal {T}_i} z_p \le 1, \quad \text {for} \ \mathcal {T}_1, \ldots , \mathcal {T}_I, \end{aligned}$$
(5)

will ensure at most one of the predictors from each group \(\mathcal {T}_i\) will appear in the model.
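The constraints above translate directly into a few lines on top of the MIQO sketch from Sect. 2.1; `model`, `beta` and `z` are the gurobipy objects from that sketch, and the argument names here are illustrative.

```python
import gurobipy as gp

def add_interpretability_constraints(model, beta, z, hc_pairs, must_include,
                                     pos_idx, neg_idx, transform_groups):
    for p, s in hc_pairs:                 # constraint (4): highly correlated pairs
        model.addConstr(z[p] + z[s] <= 1)
    for p in must_include:                # expert knowledge: forced predictors
        model.addConstr(z[p] == 1)
    for p in pos_idx:                     # sign constraints from expert opinion
        model.addConstr(beta[p] >= 0)
    for p in neg_idx:
        model.addConstr(beta[p] <= 0)
    for group in transform_groups:        # constraint (5): one transform per group
        model.addConstr(gp.quicksum(z[p] for p in group) <= 1)
```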

3 Simultaneous predictor selection for systems of linear regression models

Interpretability and consistency of models are important in industry. If a model is difficult to interpret, then it is of limited use for practitioners trying to understand the dynamics of the system of interest. When models contradict expert opinion or take very different forms for a number of related response variables, the reliability of the models may be questioned. We now describe our proposed extension to the best subset formulation (3) to simultaneously select predictors and obtain models for multiple related response variables to ensure consistency in the selected predictors for each response variable.

3.1 MIQO formulation for multiple response variables

Consider estimating regression models for M response variables, where we assume that these response variables are suitable for joint analysis. We write the system of models as

$$\begin{aligned} \begin{aligned}&Y_{1} = \sum _{p=1}^{P} X_{1,p} \beta _{1,p} + \eta _{1}, \\&\ \vdots \qquad \vdots \\&Y_{M} = \sum _{p=1}^{P} X_{M,p} \beta _{M,p} + \eta _{M}, \end{aligned} \end{aligned}$$
(6)

where \(\eta _m\sim N(0,\sigma ^2_m)\) and \(\beta _{m,p} \in \mathbb {R}\) for \(p=1,\ldots ,P\), \(m=1,\ldots ,M\).

Here, we assume that each response variable has a unique realisation of the P predictor variables. For example, suppose predictor \(X_1\) corresponds to precipitation, then predictor \(X_{m,1}\) corresponds to the precipitation for response \(Y_m\). Let \(\mathcal {S}_m\) denote the set of selected predictors for response m. The current procedure used by our industrial collaborator often produces models where \(\mathcal {S}_{m_1} \ne \mathcal {S}_{m_2}\), contrary to expert opinion. This motivates the following formulation, which we call the Simultaneous Best Subset (SBS) problem:

$$\begin{aligned}&\min _{\varvec{\beta }} \sum _{m=1}^M \sum _{t=1}^{T} \left( y_{m,t} - \sum _{p=1}^{P} X_{m,t,p} \beta _{m,p} \right) ^2, \nonumber \\&\text {subject to} \quad \left| \bigcup _{m=1}^{M} \mathcal {S}_m \right| \le k. \end{aligned}$$
(7)

The union \(\bigcup _{m=1}^{M} \mathcal {S}_m\) gives the selected predictors across all models: if all models contain the same predictors, then each model may have up to k predictors present.

Note that whilst penalisation techniques such as the group lasso (Yuan and Lin 2006) could be modified to select predictors grouped by response to produce sparse models, this approach would not naturally incorporate the other constraints that feature in our setting, which we discuss next.

As well as consistency in predictor selection, some similarity in the coefficients \(\beta _{1,p}, \ldots , \beta _{M,p}\) may be expected when considering multiple response variables. Using an \(l_2\) penalty, we can penalise large differences in the coefficients, thereby shrinking them towards some common value. By adding P auxiliary optimisation variables, \(\bar{\beta }_1, \ldots , \bar{\beta }_P\), we can add the following penalty to the objective appearing in (7):

$$\begin{aligned} \mathcal {P}(\varvec{\beta }) = \lambda \sum _{m=1}^{M} \sum _{p=1}^{P} (\bar{\beta }_p - \beta _{m,p})^2. \end{aligned}$$
(8)

The tuning parameter \(\lambda \) must be determined. For large \(\lambda \), the penalty (8) will dominate the objective and force \(\beta _{1,p}, \ldots , \beta _{M,p}\) to lie close to \(\bar{\beta }_p\), for \(p=1,\ldots ,P\). In simulation studies, we have found that \(\lambda = 2 g_k\), where \(g_k\) denotes the objective value of the SBS problem (7) with sparsity k, is large enough to force the coefficients sufficiently close together. In Sect. 4, we use a sequence of \(\lambda \) values equally spaced on a log scale in the interval \([0,2 g_k]\).

Note that an alternative to the term (8) would be to use an \(l_1\) penalty. This would have the effect of setting the coefficients across models to the same value exactly, for high enough \(\lambda \).
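In a gurobipy implementation of the SBS problem with coefficient variables keyed by (m, p), the penalty (8) amounts to adding P auxiliary variables and augmenting the objective. This is a sketch under the assumption that `model` already carries the least squares objective (7); the function name is ours.

```python
import gurobipy as gp
from gurobipy import GRB

def add_simultaneous_shrinkage(model, beta, M, P, lam):
    """Augment the SBS objective with penalty (8):
    lam * sum_m sum_p (beta_bar_p - beta_{m,p})^2."""
    beta_bar = model.addVars(P, lb=-GRB.INFINITY, name="beta_bar")
    model.update()                        # make the current objective queryable
    penalty = gp.quicksum((beta_bar[p] - beta[m, p]) * (beta_bar[p] - beta[m, p])
                          for m in range(M) for p in range(P))
    model.setObjective(model.getObjective() + lam * penalty, GRB.MINIMIZE)
    return beta_bar
```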

The number of binary variables in the optimisation model need not increase when simultaneously estimating multiple regression models—the number stays at P, the number of predictor variables. However, the number of constraints in the optimisation must be increased to ensure a feasible solution of (7) is obtained. To this end, we use the \(\mathcal {SOS}_1\) constraints

$$\begin{aligned} (1-z_p, \beta _{m,p}) \in \mathcal {SOS}_1, \end{aligned}$$
(9)

for \(p=1,\ldots ,P, m=1,\ldots ,M\). These constraints, along with the sparsity constraint (3b), ensure that no more than k predictors are present across each of the M regression models. The equivalent Big-M constraints in this setting are

$$\begin{aligned} -M z_p \le \beta _{m,p} \le M z_p, \end{aligned}$$
(10)

for \(p=1,\ldots ,P\), \(m=1, \dots , M\), where the product of the Big-M parameter and the binary variables controls the inclusion of predictors. Provided M is large enough, an optimal solution to the SBS problem will be obtained. However, in what follows we use the \(\mathcal {SOS}_1\) constraints (9) since we avoid the problem of specifying M in (10), which is preferable in our application.

Analogous to Sect. 2.1, to prevent pairs of highly correlated predictors, we define the set of highly correlated predictors \(\mathcal {HC}\) in this setting as the pairs \((p,s)\in \{1,\ldots ,P\} \times \{1,\ldots ,P\}\), with \(p \ne s\), such that

$$\begin{aligned} \sum _{m=1}^{M} \mathbb {1}_{|\mathrm{cor}(X_{m,p}, X_{m,s})|> \rho } > 0, \end{aligned}$$

that is, pairs whose absolute correlation exceeds \(\rho \) for at least one of the M responses.

By using the constraints of the form (4), we prevent any model in the system (6) containing pairs of predictors with absolute correlation that exceeds \(\rho \).
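Computationally, the set \(\mathcal {HC}\) can be built with a single pass over the M predictor matrices; a minimal sketch (function name ours):

```python
import numpy as np

def highly_correlated_pairs(X_list, rho):
    """Pairs (p, s), p < s, whose absolute correlation exceeds rho for at
    least one of the M responses. X_list holds one (T x P) matrix per response."""
    P = X_list[0].shape[1]
    hc = set()
    for Xm in X_list:
        C = np.corrcoef(Xm, rowvar=False)
        for p in range(P):
            for s in range(p + 1, P):
                if abs(C[p, s]) > rho:
                    hc.add((p, s))
    return sorted(hc)
```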

Ensuring model sparsity      In our motivating application, as well as many other contexts, sparse models are desired to illustrate the strongest effects of a few predictors. Hence, for computational reasons, we suggest setting a maximum model sparsity \(k_{\text {max}}\). The choice of \(k_{\text {max}}\) could be somewhat arbitrary. However, in our formulation, the value \(k_{\text {max}}\) can be determined automatically by using constraints of the form (4) and (5). These constraints imply that there exists a maximum level of model sparsity beyond which at least one constraint of the form (4) or (5) will be violated if an additional predictor is included in the model. State-of-the-art optimisation solvers, such as Gurobi (Gurobi Optimization 2019), will inform the user if an optimisation formulation is infeasible. We propose modifying the sparsity constraint (3b) as follows:

$$\begin{aligned} \sum _{p=1}^{P} z_p = k. \end{aligned}$$

If \(k > k_{\text {max}}\), a feasible solution to the modified best subset problem does not exist and the solver will inform the user of an infeasible optimisation model; larger predictor subsets can hence be discounted. In practice, an additional choice to reduce computation is to set a maximum runtime of the solver, as suggested by Bertsimas et al. (2016). Often this finds the optimal solution quickly, but may take hours to provide the certificate of optimality. Good feasible solutions can be obtained for models with sparsity \(k+1\) using the optimal solution with sparsity k. By modifying the right-hand side of constraint (3b) as above, modern optimisation solvers such as Gurobi can automatically use the previous optimal solution to “warmstart” the solver.
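Continuing the gurobipy sketches above, a sweep over sparsity levels with the modified equality constraint might look as follows; the constraint handle and loop structure are illustrative, and we assume the model was built as in the earlier sketch but without the \(\le k\) sparsity constraint, relying on the solver's built-in reuse of the previous solution as a warm start.

```python
from gurobipy import GRB

# `model`, `z`, `P` are as in the earlier best-subset sketch.
sparsity = model.addConstr(z.sum() == 1, name="sparsity_eq")  # modified (3b)
solutions = {}
for k in range(1, P + 1):
    sparsity.RHS = k        # raise the required sparsity; the previous optimum
    model.optimize()        # can seed the solver as a warm start
    if model.Status == GRB.INFEASIBLE:
        break               # k > k_max: constraints (4)/(5) cannot all hold
    solutions[k] = [p for p in range(P) if z[p].X > 0.5]
```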

3.2 Extension to serially correlated data

Fitting linear regression models to time-ordered data often produces models where the observed residuals appear serially correlated (Brockwell and Davis 2002). To remedy this issue, in this section, we propose a two-step algorithm, similar in spirit to that of Cochrane and Orcutt (1949), that applies a predictor selection step to a generalised least squares (GLS) transform of the data. In what follows, we give an example of the GLS transform, before describing how we incorporate predictor selection.

Suppose we have a response variable Y and predictors \(X_1, \ldots , X_P\), and suppose the true model for the relationship between the response and predictors is

$$\begin{aligned} Y_t&= \sum _{p=1}^{P} X_{t,p} \beta _p + \eta _t \quad \text {where} \end{aligned}$$
(11a)
$$\begin{aligned} \eta _t&= \phi \eta _{t-1} + e_{t}. \end{aligned}$$
(11b)

In this setting, the regression residuals \(\eta _t\) are serially correlated. Ignoring serial correlation in the observed residuals not only mis-specifies the model but also discards potentially valuable information. Minimising the least squares objective (2) no longer gives the most efficient estimator for the regression coefficients (Rao and Toutenburg 1999). Provided (11b) is stationary (see Brockwell and Davis 2002), we can write (11) as a regression model with residuals that are not serially correlated via

$$\begin{aligned} (1 - \phi L) Y_t = \sum _{p=1}^{P} (1 - \phi L) X_{t,p} \beta _p + e_t, \end{aligned}$$
(12)

where L denotes the backward-shift operator such that \(L \eta _t = \eta _{t-1}\). The linear filter can be applied to the response and predictor variables to obtain transformations of the original variables. In other words, the transformed variables can be written \(\tilde{Y}_t = (1 - \phi L) Y_t\) and \(\tilde{X}_{t,p} = (1 - \phi L) X_{t,p}\). We show empirically in Sect. 4.4 that predictor selection accuracy can be improved by transforming the response and predictor variables appropriately.
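For the AR(1) case, the filter \((1 - \phi L)\) is one line with scipy; a sketch, where the first observation is filtered against an implicit zero initial value:

```python
import numpy as np
from scipy.signal import lfilter

def ar1_filter(v, phi):
    """Apply (1 - phi * L) to a series: returns v_t - phi * v_{t-1}."""
    return lfilter([1.0, -phi], [1.0], np.asarray(v, dtype=float))

# Example: whiten the response and each predictor column with the same filter.
# y_tilde = ar1_filter(y, phi_hat)
# X_tilde = np.column_stack([ar1_filter(X[:, p], phi_hat) for p in range(X.shape[1])])
```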

In general, neither the predictor variables present in the model nor the serial correlation structure of the regression residuals is known. We assume a general Reg-SARIMA model of the form

$$\begin{aligned} Y_{m,t} = \sum _{p=1}^{P} X_{m,p,t} \beta _{m,p} + \eta _{m,t}, \end{aligned}$$
(13a)

where

$$\begin{aligned} \eta _{m,t} = \frac{\theta _m(L) \Theta _m(L^s)}{\nabla ^{d_m} \nabla _s^{D_m} \phi _m(L) \Phi _m(L^s)} \epsilon _{m,t}, \end{aligned}$$
(13b)

where we assume that the residual series \(\eta _{m,t}\) are independent across m. Note that in some settings, this assumption may not be appropriate; see Sect. 6 for more discussion.

We propose the following two-step algorithm to determine the best predictors and autocorrelation structure of the regression residuals. First, we seek suitable predictors for the model. We fix the sparsity k and use the data \( (Y_1, X_{1,1}, \ldots , X_{1,P}), \ldots , (Y_M, X_{M,1}, \ldots , X_{M,P}) \) to determine a suitable set of predictors by solving the SBS problem. Given initial estimates of the coefficients \(\hat{\beta }_{1,1}^{k,0}, \ldots , \hat{\beta }_{M,P}^{k,0}\), we then obtain the observed residuals for each model

$$\begin{aligned} \hat{\eta }^{k,0}_{m,t} = y_{m,t} - \sum _{p=1}^{P} X_{m,p,t} \hat{\beta }_{m,p}^{k,0}. \end{aligned}$$

We need to estimate the serial correlation structure of the regression residuals. Given a list \(\mathcal {L}\) of suitable SARIMA models, these models can be fit to the observed regression residuals \(\hat{\eta }^{k,0}_{m,t}\) for \(m=1, \ldots , M\) and the best SARIMA model identified for each m, for example, based on an appropriate information criterion. We require the transformed data

$$\begin{aligned}&\frac{\nabla ^{\hat{d}_m} \nabla _s^{\hat{D}_m} \hat{\phi }_m(L) \hat{\Phi }_m(L^s)}{\hat{\theta }_m(L) \hat{\Theta }_m(L^s)} Y_{m,t} = \tilde{Y}_{m,t} \quad \text {and} \end{aligned}$$
(14)
$$\begin{aligned}&\frac{\nabla ^{\hat{d}_m} \nabla _s^{\hat{D}_m} \hat{\phi }_m(L) \hat{\Phi }_m(L^s)}{\hat{\theta }_m(L) \hat{\Theta }_m(L^s)} X_{m,p,t} = \tilde{X}_{m,p,t}, \end{aligned}$$
(15)

for \(m=1,\ldots ,M.\)

Consider fitting the SARIMA model (13b) to obtain the observed model errors \(\hat{\epsilon }_{m,t}\),

$$\begin{aligned} \hat{\eta }_{m,t} \frac{\nabla ^{\hat{d}_m} \nabla _s^{\hat{D}_m} \hat{\phi }_m(L) \hat{\Phi }_m(L^s)}{\hat{\theta }_m(L) \hat{\Theta }_m(L^s)} = \hat{\epsilon }_{m,t}. \end{aligned}$$

This filter can be applied as in (14) and (15) to obtain \(\tilde{Y}_{m,t}\) and \(\tilde{X}_{m,p,t}\) for \(m=1,\ldots ,M\) and \(p=1,\ldots ,P\). Lastly, the predictors can be re-selected by solving the SBS problem again with the filtered data, \(\tilde{Y}_{m,t}\) and \(\tilde{X}_{m,p,t}\). This procedure can be iterated until convergence in the regression estimates, the selected predictors and the models for serial correlation. If the procedure does not converge quickly, an upper limit on the number of iterations can also be imposed. However, we have observed that convergence often occurs after two or three iterations. The pseudo-code for our two-step procedure, Two-stage Simultaneous Predictor Selection (SPS2), is given in Algorithm 1.

[Algorithm 1: pseudo-code for the Two-stage Simultaneous Predictor Selection (SPS2) procedure]
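A compact sketch of the SPS2 loop is given below for the special case of pure autoregressive residual models (AR is the SARIMA special case with no differencing, MA or seasonal terms; the general case would apply the full filter in (14) and (15)). The helper `solve_sbs`, which stands in for the SBS MIQO of Sect. 3.1 and is assumed to return an (M, P) coefficient array, is hypothetical.

```python
import numpy as np
from scipy.signal import lfilter
from statsmodels.tsa.ar_model import AutoReg

def sps2_ar(Y, X, k, max_lag=7, n_iter=3):
    """SPS2 sketch: alternate between (i) simultaneous predictor selection on
    filtered data and (ii) fitting AR models to the regression residuals.
    Y: (M, T) responses; X: (M, T, P) predictors."""
    M, T, P = X.shape
    Yf, Xf = Y.astype(float).copy(), X.astype(float).copy()
    for _ in range(n_iter):                    # often converges in 2-3 iterations
        beta = solve_sbs(Yf, Xf, k)            # hypothetical SBS MIQO wrapper
        for m in range(M):
            resid = Y[m] - X[m] @ beta[m]      # observed regression residuals
            fits = [AutoReg(resid, lags=l).fit() for l in range(1, max_lag + 1)]
            best = min(fits, key=lambda f: f.aic)       # residual model by AIC
            ar_poly = np.concatenate(([1.0], -best.params[1:]))  # phi(L)
            Yf[m] = lfilter(ar_poly, [1.0], Y[m])       # filter response, cf. (14)
            for p in range(P):                          # ...and predictors, cf. (15)
                Xf[m, :, p] = lfilter(ar_poly, [1.0], X[m, :, p])
    return beta
```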

4 Simulation study

In this section, we investigate the properties of our simultaneous predictor selection approach. In particular, we perform a number of simulations investigating how our SBS model compares to applying the standard best subset approach to estimate each linear regression model separately. We compare our simultaneous estimation procedure to the LASSO (Tibshirani 1996) and elastic net (Zou and Hastie 2005). We also compare our approach to an alternative simultaneous estimation procedure: we modify the Simultaneous Variable Selection approach of Turlach et al. (2005) to estimate the system of linear models (6). We call this approach \(\texttt {SVS}_{\texttt {m}}\), the modified SVS approach; further algorithmic details of this procedure can be found in “Appendix A”.

We generate data from Model (6) where we fix the regression coefficients as

$$\begin{aligned} \beta _{m,p} = {\left\{ \begin{array}{ll} 0.3, \quad &{} \text {for} \ p = 17, \\ 1, \quad &{} \text {for} \ p = 18, \\ 0.6, \quad &{} \text {for} \ p = 19, \\ 0, \quad &{} \text {otherwise}, \end{array}\right. } \quad \text {for all} \ m. \end{aligned}$$

The predictors and residuals are simulated as follows:

$$\begin{aligned} \varvec{X}_{m,t} \sim \text {MVN}_{35}(\varvec{0}, \varvec{\Sigma }_{x}), \quad \eta _{m,t} \sim \text {N}(0, \sigma _{\eta }^2), \quad \text {where } (\varvec{\Sigma }_{x})_{i,j} = \rho ^{|i-j|}. \end{aligned}$$
(16)

The particular values of the residual variance, \(\sigma _{\eta }^2\), and predictor correlation, \(\rho \), will be specified in each simulation. When the correlation between predictors is large, the predictors \(X_{17}\), \(X_{18}\) and \(X_{19}\) become hard to distinguish, and hence, accurately selecting the correct generating predictors is challenging. We use \(P=35\) predictor variables as provably optimal solutions can be obtained within seconds for sparse models (see “Appendix B”).
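For reference, the simulation design (16) can be reproduced in a few lines; the function name and seed are ours.

```python
import numpy as np

def simulate_system(M, T, rho, sigma2, P=35, seed=1):
    """Draw from model (6) under specification (16): AR(1)-structured predictor
    covariance and i.i.d. Gaussian residuals; true predictors are 17, 18, 19."""
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(P), np.arange(P)))
    beta = np.zeros(P)
    beta[[16, 17, 18]] = [0.3, 1.0, 0.6]     # 0-indexed positions of 17, 18, 19
    X = rng.multivariate_normal(np.zeros(P), Sigma, size=(M, T))   # (M, T, P)
    Y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=(M, T))
    return Y, X, beta
```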

[Fig. 1: Performance of SBS as \(\rho \), the correlation between predictors, increases for different numbers of jointly modelled response variables, M: a selection accuracy; b mean squared error of regression coefficient estimates]

To evaluate performance of our proposed technique, we record the mean squared error of the regression coefficients given by

$$\begin{aligned} \text {MSE}(\varvec{\beta }) = \frac{1}{MP} \sum _{m=1}^{M} \sum _{p=1}^{P} \left( \beta _{m,p} - \hat{\beta }_{m,p} \right) ^2, \end{aligned}$$

where \(\hat{\beta }_{m,p}\) is the estimate of \(\beta _{m,p}\). This measure will penalise large deviations from the true coefficients and take account of potential variation as we change the value of M. Unless specified otherwise, we do not apply shrinkage as we wish to demonstrate the gains from simultaneous selection only.

For the simulations relating to the two-step procedure of Sect. 3.2, we are also interested in model fit and predictive ability. We hence evaluate performance with the mean squared prediction error

$$\begin{aligned} \text {MSE}(\hat{\varvec{y}})= \frac{1}{MT} \sum _{m=1}^{M} \sum _{t=1}^{T} \left( y_{m,t} - \sum _{p=1}^{P} X_{m,p,t} \hat{\beta }_{m,p} \right) ^2, \end{aligned}$$
(17)

where as above \(\hat{\beta }_{m,p}\) is the estimate of the regression coefficient \(\beta _{m,p}\). This quantity gives a measure of average fidelity to the simulated datasets.

We now investigate the performance of simultaneous best subset selection. In particular, we examine effects of correlated predictors and shrinkage on model selection capability.

4.1 Evaluation of simultaneous predictor selection

In the simulations that follow, we solve the SBS problem with \(M=1,5,10,20,35\), increasing the number of regression models used for simultaneous predictor selection and coefficient estimation. Note that \(M=1\) corresponds to the best subset approach of Miller (2002). In a simulation of size N, we record the number of times the SBS approach recovers the true subset when the sparsity is set to the true value, \(k=3\).

We start by investigating how predictor correlation affects selection accuracy for the best subset method, and how this improves for simultaneous predictor selection as the number of jointly estimated models increases. We generate \(N=1000\) synthetic datasets using specification (16) and fix \(\sigma _{\eta }^2 = 1\).

Figure 1a shows the selection accuracy for simultaneous subset selection with differing values of M. We see that for the best subset method (\(M=1\)), the accuracy deteriorates rapidly as the predictor correlation, \(\rho \), exceeds 0.5. However, simultaneous predictor selection increases the correlation threshold at which selection accuracy deteriorates to 0.87 with just five models. Consequently, the mean squared error in coefficient estimates decreases, as can be seen from Fig. 1b. Selection accuracy is seen to improve further with a greater number of models estimated simultaneously.

We also investigate the performance of SBS with increasing residual variance, \(\sigma ^2_{\eta }\) for differing values of M and data length, T; as one might expect, with increasing residual variance, it is much harder to recover the true predictors. For reasons of brevity, these results are deferred to “Appendix B”.

4.2 Simultaneous shrinkage

The coefficients obtained from minimising the least squares objective with highly correlated predictors can suffer from high variance. As such, the variation in selected predictors for the best subset method is also high, as shown in Sect. 4.1, mirroring the observations of Hastie et al. (2008). To investigate the effect of shrinking the coefficients for each predictor towards a common value, we fix \(M = 5\) and simulate \(T=750\) observations for each response variable and their associated predictors from model (16). We split the data randomly into two sets, using 500 observations for each response variable as a training set to estimate the models. The remaining 250 observations are used to determine the predictive accuracy of the models. We fix \(\rho =0.95\) and \(\sigma _{\eta }^2 = 2\) and again set \(k=3\) to show the effects on in-sample and out-of-sample prediction error.

[Fig. 2: Mean squared error when coefficient shrinkage across the M regression models is imposed: a in-sample error; b out-of-sample error]

[Fig. 3: Trace plot of the regression coefficients \(\varvec{\beta }_1, \ldots , \varvec{\beta }_5\) (from left to right), as the shrinkage parameter \(\lambda \) is increased, penalising dissimilarities in the coefficients]

Figure 2 shows the MSE for both scenarios over a range of increasing penalty values, \(\lambda \). By penalising the differences in \(\beta _{1,p}, \ldots , \beta _{M,p}\) for \(p=1,\ldots ,P\), we bias the estimates of the regression coefficients, increasing the in-sample error (see Fig. 2a). However, this can lead to improved out-of-sample prediction error (see Fig. 2b) as information is shared across regression models by shrinking the coefficients for each predictor to a common value.

Figure 3 shows trace plots of the regression coefficients (for one simulated dataset) for each of the five response variables in the system, as the value of the simultaneous shrinkage penalty increases. The horizontal lines show the coefficients of predictors \(X_{17},X_{18}\), and \(X_{19}\).

As the penalty increases, the simultaneous best subset changes. Despite seeking the best subset of predictors given the true level of sparsity, the true predictors are not initially selected upon solving the SBS problem. Two of the three predictors are correctly identified, although the estimates for each coefficient are rather far from the truth. A spurious (zero) predictor is also selected, with relatively large coefficients for some of the models (indicated by nonzero coefficients for \(\beta _{m,21}\) and \(\beta _{m,27}\)). As the strength of the joint shrinkage is increased, the spurious predictor is replaced by the true third predictor, briefly re-enters the models, and is finally replaced by the true third predictor once more. At this point, the coefficients for all three predictors in each of the regression models are significantly closer to the true values in comparison with the solutions obtained upon solving the initial SBS problem.

For large k, we have observed that when many spurious predictors are present in the model, our shrinkage operator can push the coefficients of the spurious regressors towards zero, see Chapter 7 in Lowther (2019) for more details.

[Table 1: Comparative performance of the predictor selection algorithms using the measures described in the text]

4.3 Comparison to other approaches

In this simulation, we replicate the scenario that motivated our SBS approach. In particular, we simulate series with five blocks of highly correlated predictors. A block of predictors is denoted \(\varvec{X}_{(b)} = [X_{(b,1)}, \ldots , X_{(b,N_b)}]\). The predictors are simulated as

$$\begin{aligned} \varvec{X}_{(b)} \sim \text {MVN}_{b+4}(\varvec{0}, \Sigma _{(b)}), \ \text {with}\ {\Sigma _{(b)}}_{i,j} := 0.95^{|i-j|}, \end{aligned}$$

for \(b=1,\ldots , 5\). We vary the positions of the active predictors relative to their blocks and the values of the regression coefficients. The regression coefficients take the form

$$\begin{aligned} \beta _{m,p} = {\left\{ \begin{array}{ll} 1, &{}\quad \text {if} \ p = 30, \\ 0.775,&{} \quad \text {if} \ p = 25, \\ 0.55, &{}\quad \text {if} \ p = 14, \\ 0.325, &{}\quad \text {if} \ p = 5, \\ 0.1, &{}\quad \text {if} \ p = 2, \\ 0, &{}\quad \text {otherwise} \end{array}\right. } \quad \text {for} \ m=1,\ldots ,5. \end{aligned}$$

Our primary goal is to compare SBS to current methods in the literature. We apply the elastic net using the glmnet package (Zou and Hastie 2018) implemented in the R statistical software (R Core Team 2019), over the default grid \(\alpha =0,0.01,\ldots ,1\) and 100 values of \(\lambda \), to produce a model for each \(m=1,\ldots ,5\). We train each model with \(T=500\) observations and then use the mean squared prediction error (17) on a 250 observation held-out test set to select the best elastic net model for each \(m=1,\ldots ,5\). We also compare our results to a forward-stepwise algorithm using the step function (R Core Team 2019) for each m, selecting the best model by AIC. We also apply the modified SVS approach (\(\texttt {SVS}_{\texttt {m}}\)), as well as a variant in which the regression coefficients are constrained to be positive, which we denote \(\texttt {SVS}_{\texttt {m}^{+}}\). We select the models fit by the simultaneous procedures by considering the simultaneous mean squared prediction error defined in (17).

For each of the selected models, we record the following performance measures averaged over \(N=50\) datasets across each of the models for the M response variables:

  • The average number of predictors (model sparsity), \(\hat{k} = \frac{1}{M}\sum _{m=1}^{M} \sum _{p=1}^{P}\mathbb {1}_{\beta _{m,p} \ne 0}\).

  • The mean squared prediction error on a 250 observation held-out validation set.

  • The number of models containing the true subset of predictors.

  • The number of models that included at least one negative coefficient.

The average model sparsity will help inform the interpretability of the models, whilst the prediction error allows us to compare the performance numerically. The average number of models containing the true subset indicates the accuracy of each method as a predictor selector. By counting the number of models with negative coefficients, we can compare how often our industrial collaborator may have obtained misleading models. Note that the elastic net grid comprises 101 values of \(\alpha \) and 100 values of \(\lambda \), so over 10,000 elastic net models are fitted to each response variable.

The summary measures of all approaches are shown in Table 1. Our proposed SBS approach produces the sparsest models, aided by the transformation constraints (5), with the average sparsity being slightly lower than the true sparsity. The most likely cause is the failure to select predictor 2, whose coefficient is small relative to those of the other predictors. Our SBS approach was the only method able to recover the true subset, doing so in half of the simulations. The SBS and \(\texttt {SVS}_{\texttt {m}^{+}}\) techniques always produced coefficients with positive values and were the only approaches to do so. All other methods included at least one negative coefficient in a high number of models.

[Fig. 4: Average estimate of the regression coefficients for each of the methods considered]

Figure 4 shows the average estimate of the regression coefficients for each of the methods in the study. With the exception of predictor 2, the SBS method appears to give unbiased estimates. The underestimation of \(\beta _{m,2}\) is most likely explained by its small coefficient value, meaning the predictor is often not selected. The other methods tend to underestimate all of the coefficients, which may be expected since they are all shrunk towards zero.

We have also investigated computational aspects (e.g. runtime) of our SBS approach when varying the number of response variables, M. For reasons of brevity, we do not include this here, but further details can also be found in “Appendix B”.

4.4 Performance on serially correlated data

In Sect. 3.2, we motivated the need to consider autocorrelated regression residuals in predictor selection problems. In this section, we demonstrate that we can recover both the true predictors and correlation structure of the regression residuals using the two-step algorithm described in Sect. 3.2. To this end, we simulate data from Model (6) but now impose a correlation structure on the residuals, taking the form

$$\begin{aligned} \eta _{m,t} = 0.9\, \eta _{m,t-1} + e_{m,t} \quad \text {for} \ m = 1,\ldots ,5, \end{aligned}$$
(18)

with \(e_{m,t}\sim N(0,1)\), i.e. the residuals \(\eta _{m,t}\) follow an AR(1), or SARIMA(1, 0, 0)(0, 0, 0), model. The predictors and regression coefficients are the same as those at the start of Sect. 4. Our industrial collaborator often observes large changes in the predictors that are selected when the number of observations available changes only slightly. For \(N=50\) datasets of length \(T=600\) simulated under model (18), we apply our two-step algorithm (with \(k=3\)) to each simulated dataset a total of six times: first, we use the first 500 datapoints, then the first 520, and so on until all 600 points are used.

We highlight the predictors selected in each application with and without using the two-step algorithm in Fig. 5. The selected predictors for each of the simulated datasets are shown within each set of vertical lines. From left to right, the vertical triplet of dots indicates the selected predictors for \(T=500, 520, \ldots , 600\) within each set of vertical bars.

[Fig. 5: Comparison of the predictors selected using the standard approach and two-step iterative approach: a standard procedure, ignoring autocorrelation in the regression residuals (unfiltered covariate selection); b two-step procedure SPS2 (filtered covariate selection)]

For the standard selection procedure, the variation in selected predictors within each dataset, as well as the range of predictors selected across different simulated datasets, is quite alarming, reflecting the sensitivity to data length experienced by our industrial collaborator. This is shown in Fig. 5a. In comparison, using the two-step algorithm (Fig. 5b), we observe much less variation in the selected predictors. Further, the algorithm selects the true predictors in many cases.

We now investigate how well we can recover the true correlation structure of the regression residuals. Recall that the correct model order from specification (18) is (1, 0, 0)(0, 0, 0). Figure 6 indicates, for each simulation, whether the model order was correctly identified: a “.” appears on each row if so, and otherwise the value of the mis-specified order, (p, d, q)(P, D, Q), is shown.

[Fig. 6: Indication of whether the true SARIMA model order was identified by the SPS2 algorithm for each of \(N=50\) datasets simulated from model (18)]

From Fig. 6, we see that correct values were chosen for the majority of the six model orders (p, d, q) and (P, D, Q). We observe that at least one autoregressive parameter was used (\(p \ge 1\)) for each dataset, sometimes erroneously including additional terms; however, such over-selection is common with model selection criteria such as the AIC or BIC. Modifying the penalty used to select the regression-residual model may improve the accuracy of selecting these models.

5 Telecommunications data study

We now demonstrate our proposed methodology on an example dataset provided by our industrial collaborator. In our motivating application, the total number of daily events in a telecommunications network is recorded by type and location within the network. Each type of event may be influenced by a different set of predictors. For the dataset we consider here, location corresponds to a geographic location, but more detailed information such as the location within the network is available in other applications. We use three response variables of the same type (denoted R1, R2 and R3) from regions in the network considered to be suitable for joint modelling. Urban or rural classifications may help determine whether response variables are suitable for joint modelling. There are a total of 1396 daily observations, corresponding to about 3 years 9 months of data.

We use five groups of predictor variables. Motivated by the remarks in Sect. 2.1 in relation to weather variables, the first four groups of predictors are derived from transformations applied to the following predictors:

Group 1 (Humidity): the mean relative humidity (g m\(^{-3}\)) over a 24-h period.

Group 2 (Wind speed): the maximum recorded wind speed (mph) within a 24-h period.

Group 3 (Precipitation): the total amount of rainfall (mm) within a 24-h period.

Group 4 (Lightning): the total number of lightning strikes within a 24-h period.

[Table 2: Regression coefficients for our proposed SPS2 method (Automated), our proposed SPS2 procedure applied to individual responses (Individual Automated) and the current implementation used by our industrial collaborator (Current)]

The particular base transformation we consider is exponential smoothing, defined by

$$\begin{aligned} x_{t,s} = \alpha x_{t,p} + (1-\alpha ) x_{t-1,s} \quad \text{ for }\ t=2,\dots ,T, \end{aligned}$$
(19)

where we set \(x_{1,s}=x_{1,p}\). In Eq. (19), the tuning parameter \(\alpha \) adjusts how much the time series \(x_{t,p}\) is smoothed: a value of \(\alpha \) close to 1 will produce a time series very close to the original, whilst a value of \(\alpha \) close to 0 will produce a time series that evolves much more slowly. We apply the transformation to the predictors above for a range of values of \(\alpha \), with the particular number and values chosen to sufficiently capture the nonlinear effects for each predictor (guided by our industrial collaborator). Note that due to the nature of the telecommunication events, expert knowledge suggests that all potential predictors should have a positive relationship with the response variables. The last group comprises indicator variables to adjust for calendar effects which are likely to influence the event data. In particular, we include three indicator variables, corresponding to the Christmas bank holiday (Christmas Day and Boxing Day); 27 December until New Year's Day; and any other bank holiday.
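A short sketch of the recursion (19) and the resulting transformation group; the \(\alpha\) grid and the placeholder rainfall series are illustrative rather than those agreed with our collaborator.

```python
import numpy as np

def exp_smooth(x, alpha):
    """Exponentially smoothed surrogate, Eq. (19):
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}, initialised with s_1 = x_1."""
    s = np.empty(len(x), dtype=float)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

# One smoothed series per alpha forms a transformation group T_i, from which
# constraint (5) selects at most one member.
precip = np.random.default_rng(0).gamma(2.0, 3.0, size=365)   # placeholder rainfall
group = {a: exp_smooth(precip, a) for a in (0.1, 0.3, 0.5, 0.7, 0.9)}
```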

We present three methods for modelling the event data. The first method (denoted Automated) is our simultaneous predictor selection approach for multiple response variables, using our two-step procedure SPS2 to estimate a model for the regression residuals. The Individual Automated approach uses the two-step procedure of the first method, but is applied to each response variable separately. Consequently, Individual Automated cannot take advantage of simultaneous predictor selection; we present this method to highlight the gains in a simultaneous predictor selection approach. Finally, Current is the procedure adopted by our industrial collaborator, included as a baseline comparison. This method removes the weekly seasonality and calendar effects from the response variables as part of a data pre-processing step, as these are not thought to be attributable to the effects of the predictors of interest. (Hence, the bank holiday group of predictors is not considered for Current.) Data pre-processing choices can be subjective, as well as being time-consuming and therefore costly. Furthermore, such pre-processing does not allow joint estimation of the external predictor, bank holiday effects and seasonality. Our two-step procedure for fitting a Reg-SARIMA model allows seasonality to be incorporated directly into the model specification, which is iteratively updated as the predictor coefficients are refined. By modelling seasonality, we can obtain more accurate estimates of prediction uncertainty and completely remove the need to pre-process the data by including calendar effects as indicator variables.

The estimated regression coefficients for the three approaches are given in Table 2. An immediate observation from Table 2 is that the models produced by the automated, two-step procedures (the Automated and Individual Automated methods) are much sparser than those produced by the Current approach, even though the latter does not consider the calendar effects. Furthermore, all coefficients for the weather predictors produced by the Automated and Individual Automated methods are positive, which, as outlined before, would be expected in this context for the telecommunications event data. In contrast, the Current method includes highly correlated predictors, from the same group, and with opposing effects; for example, all six transformed variables of predictor 3 are included. Both large negative and large positive coefficients appear for the predictor variables from Group 3 for the Current method. This reflects the behaviour of the least squares estimator discussed by Hastie et al. (2008), which motivated the use of the ridge penalty (Hoerl and Kennard 1970). Using simultaneous predictor selection and constraining the sign of the coefficients, we are able to select the single best transformation of predictor 3.

[Table 3: Mean squared error for 14-day ahead predictions for each of the three response variables and the three methods described in the text]
[Fig. 7: The sample autocorrelation for the fitted model errors for each of the three response variables for the Automated method (top) and the Current method (bottom)]

The mean squared errors for 14-day ahead predictions for the three methods are given in Table 3. The prediction error is significantly reduced using the Automated and Individual Automated approaches that produce Reg-SARIMA models, rather than using a pre-processing step (Current). Recall that the Reg-SARIMA methods model the seasonality and calendar effects explicitly rather than removing them. They also describe the effects of other predictors. By selecting predictors simultaneously, the Automated approach provides more accurate forecasts of the response variables. Indeed, we can see from Table 2 that, when the responses are modelled individually, different predictors from Groups 1 and 3 are chosen for Regions 1 and 2.

To determine whether the SARIMA models produced by the Automated method have adequately captured the autocorrelation and seasonality within the data, we can inspect the sample autocorrelation and sample partial autocorrelation functions of the model errors. The sample autocorrelation functions for the Automated and Current methods are shown in Fig. 7.

The plots show that there is very little significant unmodelled autocorrelation left in the residuals for the Automated technique, demonstrating that modelling the regression residuals as a SARIMA process accounts for most of the temporal correlation. (Full model specifications for the Automated procedure can be found in “Appendix C”.) In contrast, the Current method appears to violate the typical regression assumptions of independent regression residuals as there is significant remaining autocorrelation at many lags in the regression residuals for all three response variables. Similar conclusions can be drawn from the plots for the sample partial autocorrelation functions; these are shown in Fig. 8.

[Fig. 8: The sample partial autocorrelation for the fitted model errors for each of the three response variables for the Automated method (top) and the Current method (bottom)]

When serial correlation in the regression residuals is ignored, the standard errors for each of the regression coefficients may be severely underestimated (Rawlings et al. 1998). This would raise suspicions about the significance of any predictor in the model. Further, prediction intervals are likely to be too narrow. Our observations mirror this tendency—the standard errors of the regression coefficients for the three response variables produced from the Automated and Current methods are shown in “Appendix C”.

Our algorithm has also been applied to other, larger datasets in Chapter 4 of Lowther (2019) without a significant increase in runtime. Our Automated approach consistently produces sparser models with interpretable coefficients when compared with the current industrial approach, improving prediction accuracy, especially on short-term horizons. As such, for the datasets considered, our procedure does not suffer from the overfitting that has been observed for best subset methods when, for example, the signal-to-noise ratio is low (see, for example, Hastie and Tibshirani 2017; Mazumder et al. 2017; Hazimeh and Mazumder 2018). In other contexts, this may be the case; we leave this as an area for future work.

6 Concluding remarks

Motivated by an industrial problem, we have proposed a procedure to help automate the modelling process of telecommunications data. More specifically, we have developed a MIQO model to solve the simultaneous best subset problem to select predictors when jointly modelling multiple response variables. We have incorporated predictor selection within a two-step procedure that iterates between selecting predictors for a regression model and modelling the serial correlation of the regression residuals. Automation is achieved by placing constraints in the MIQO formulation to ensure sensible models are produced, and by eliminating the need to pre-process the data through modelling calendar effects and seasonality.

We have shown that predictor selection accuracy can be improved by simultaneously selecting predictors for multiple response variables. Selection accuracy and coefficient estimation can further be improved by shrinkage. The shrinkage we introduced is specifically designed for settings when joint estimation of models is considered—in contrast to LASSO-like penalties that shrink coefficients towards zero, our shrinkage method forces coefficients between models to a common value.

Whilst the main focus of this article was on developing an automated approach to selecting sparse multiresponse models, an interesting avenue for future research would be to investigate the impact of modelling the regression residuals simultaneously. For example, in other settings, modelling the regression residuals as a vector autoregression could explain both temporal correlation and cross-correlation in the regression residuals across multiple responses. We anticipate that prediction error may be reduced further, giving a consistent form for the regression residuals across responses.

Finally, we note here that the problems discussed in this article are of moderate size. When the problem scale becomes large, more sophisticated optimisation methods tailored to the MIQO approach (see, for example, Hazimeh and Mazumder 2018; Xie and Deng 2018) may be more appropriate. It would be interesting to investigate the potential for sparse learning using these solvers within an automated approach such as proposed here.