1 Introduction

Multi-objective optimization is an important methodology when we face conflicting objectives (see Das and Dennis 1998; Handi et al. 2007). Portfolio analysis, for example, can be formulated in terms of multi-objective programming instead of the classical mean-variance approach. The multi-criteria decision-making nature of the problem has been emphasized by many authors (e.g. Mavrotas et al. 2008; Xidonas and Psarras 2009; Xidonas et al. 2009a, b, 2010a, b; Steuer et al. 2005, 2006a, b, 2007a, b; Zopounidis and Doumpos 2002; Zopounidis 1999; Hurson and Zopounidis 1993, 1995, 1997; Spronk and Hallerbach 1997; Zeleny 1977, 1981, 1982; Colson and Zeleny 1979, 1980). For detailed reviews of other solution methods, see Awasthi and Omrani (2019), Duan et al. (2018), Dubey et al. (2015), Gharaei et al. (2019a, 2019b, 2019c), Giri and Bardhan (2014), Giri and Masanta (2018), Hao et al. (2018), Kazemi et al. (2018), Rabbani et al. (2019, 2020), Sarkar and Giri (2018), Sayyadi and Awasthi (2018a, 2018b), Shah et al. (2020), Shakarabi et al. (2019), Tsao (2015) and Yin et al. (2016).

Narula and Wellington (2007) consider a multi-criteria formulation in regression with a single explanatory variable. Their motivation is different from this paper's, as they want to minimize the sum of squared and absolute errors simultaneously, or to minimize the sum of absolute errors and the maximum absolute error simultaneously. We are not aware of further applications or extensions of this method. Hwang et al. (2010) propose using regression in a special context (collaborative filtering in engineering) to obtain weights that then feed into a multi-criteria analysis; their experimental results show that the proposed approach outperforms the single-criterion collaborative filtering method. Priya and Venkatesh (2012) follow the same approach, but first they use regression and principal components to identify important objectives and then apply the Analytic Hierarchy Process. For a variation of this technique see Nilashi et al. (2016).

This paper is based on the idea that obtaining parameter estimates in regression is, indeed, a multi-criteria decision making problem, but the objectives should include criteria that can deal with the standard problems of regression, viz. autocorrelation, heteroskedasticity, possible nonlinearities, out-of-sample forecasting, as well as endogeneity (correlation between errors and explanatory variables). It is known that the Ordinary Least Squares (OLS) estimator is consistent even when autocorrelation and heteroskedasticity are present, but inconsistent when we have nonlinearities and endogeneity. When autocorrelation, heteroskedasticity and the other problems are absent, the OLS estimator is known to be the best linear unbiased estimator (BLUE), a property that holds for finite samples (whereas consistency is associated with infinitely large data sets). In practice, researchers have used a number of criteria to obtain parameter estimates. The OLS estimator is known as the \(L_{2}\) estimator, as it minimizes the sum of squared residuals. The \(L_{1}\) estimator minimizes the sum of absolute errors and is popular when researchers want to mitigate the problem of outliers. The \(L_{\infty }\) estimator minimizes (with respect to \(\beta \)) the maximum absolute error and depends sensitively on outliers (Stam 1997). In the operations research community, the practical problems associated with OLS are, to a large extent, ignored, despite the fact that in small or finite samples, autocorrelation, heteroskedasticity, and the other problems mentioned can have a significant effect on estimates and, thus, on measurement and interpretation.

This becomes even more important when we realize that even “simple” violations of the assumptions in OLS (like autocorrelation and heteroskedasticity) can be viewed as misspecification errors: when a heteroskedastic or autocorrelated variable is erroneously omitted from the regression model, the OLS estimates of the parameters are biased and inconsistent, but the residuals can be informative about these problems as the omitted variable becomes part of the error term. Of course, both autocorrelation and heteroskedasticity can be present in a missing variable, so a systematic investigation of the residuals is called for. In practically every empirical situation, tests are performed for autocorrelation and heteroskedasticity and, if problems are found, the standard errors of OLS are replaced by Heteroskedasticity and Autocorrelation Consistent (HAC) standard errors. HAC retains the OLS estimates and corrects only the standard errors. This practice, however, ignores the fact that, as a rule rather than an exception, autocorrelation and heteroskedasticity may be due to omitted variables which happen to be autocorrelated and / or heteroskedastic. In such instances, the OLS estimator is biased and inconsistent, and the application of HAC standard errors is misguided. To summarize, autocorrelation and heteroskedasticity may, in fact, signal misspecification problems.

From the empirical viewpoint, this problem is important as, in most applications, the measurement of effects and the interpretation of coefficients may, in fact, be compromised under conditions of misspecification. From another point of view, datasets with outliers pose a serious challenge in regression analysis and many solution techniques have been proposed (e.g. Panagopoulos et al. 2019, and Zioutas et al. 2009). Mielke and Berry (1997) have proposed \(L_{1}\) regression when errors are generated from fat-tailed or outlier-producing distributions, which are common in operations research. Moreover, these authors developed a chance-corrected goodness-of-fit measure between observed and predicted values. Dielman and Rose (1997) proposed \(L_{1}\) regression with autocorrelated errors. Bowlin et al. (1984) compare Data Envelopment Analysis and regression approaches to efficiency measurement, which shows the importance of the estimation procedure in efficiency analysis: without model problems, efficiency estimates will be accurate enough (see also Ouenniche and Carrales 2018). However, under heteroskedasticity and / or autocorrelation, such estimates will not even be consistent. Desai and Bharati (1998) investigated whether the predictive power of economic and financial variables can be enhanced if regression is replaced by feedforward neural networks with back-propagation of error. These authors find that the neural network forecasts are conditionally efficient with respect to the linear regression forecasts. In fact, this finding may be due to misspecification of functional form or other diagnostic failures in the regressions, and it reinforces our arguments. More recent approaches (Wang and Zhu 2018) employ support vector regression for financial time series forecasting. The authors applied their technique to forecast the S&P500 and the NASDAQ market indices with promising results.

Stam (1997) correctly argued that: “there is a need to forge a link between researchers active in statistical discriminant analysis and researchers in the area of \(L_{p}\)-norm classification. Such a link would be beneficial for both groups. Particularly, \(L_{p}\)-norm classification may well be of considerable interest to researchers in areas where nonparametric classification analysis is traditionally used successfully, such as discrete variable classification, mixed variable classification, and in application areas which are often susceptible to data analytical problems, such as medical diagnosis, psychology, marketing, financial analysis, engineering and pattern recognition. Without reaching out, the \(L_{p}\)-norm classification field will remain limited to a small group of researchers with interesting new methodologies that are hardly used where they may be most needed” (pp. 28–29). This statement illustrates the importance of alternative criteria, other than OLS, in the context of linear (or nonlinear) models, with an emphasis on potential applications in the fields mentioned by Stam (1997). This paper contributes to this general research agenda by focusing not only on \(L_{p}\)-norm regression but also on providing estimators that are robust to potential problems such as autocorrelation, heteroskedasticity, endogeneity, and nonlinearity.

2 The model

In this paper, we consider regression models of the form:

$$\begin{aligned} y_{t}=x'_{t}\beta +u_{t},t=1,\ldots ,T, \end{aligned}$$
(1)

where \(x_{t}\in {\mathbb {R}}^{k}\) is a vector of explanatory variables, \(\beta \in {\mathbb {R}}^{k}\) is a vector of coefficients (parameters) and \(u_{t}\) is an error term whose properties are not specified for the moment, except that \(E(u_{t}|x_{t})=0\). Let \(y=[y_{1},\ldots ,y_{T}]'\) and \(X=[x'_{t},t=1,\ldots ,T]\). Common regression problems include (i) heteroskedasticity, (ii) autocorrelation, (iii) misspecified functional form, (iv) outliers, and (v) relatively unacceptable out-of-sample forecasting performance. The modern econometric approach to heteroskedasticity and autocorrelation is to retain the same Ordinary Least Squares (OLS) coefficients but provide so-called robust standard errors derived from Heteroskedasticity and Autocorrelation Consistent (HAC) covariance matrices. This practice is justified when the model is correctly specified, that is, when there are no important omitted variables, the functional form is correctly specified, etc. In such cases, under heteroskedasticity and / or autocorrelation, the OLS estimator remains consistent, so the OLS estimates are reliable but the OLS standard errors are biased. More often than not, researchers are not comfortable with the assumption of correct specification in terms of the variables included and the linearity assumption in (1). In such cases, the residuals are informative about misspecification problems. For example, if the omitted variables are heteroskedastic and / or autocorrelated, standard diagnostic tests will reveal the existence of heteroskedasticity and / or autocorrelation. This, in fact, shows that heteroskedasticity and / or autocorrelation are not merely “nuisances” that can be dealt with using robust HAC standard errors. The diagnostic tests actually provide guidance as to the problems of the model specification itself. In this paper, we propose a multi-criteria decision-making approach to regression by developing an estimator which minimizes, simultaneously, the sum of squared errors and an \(L_{p}\) objective (as measures of fit, for some \(p>0\)), as well as the extent of heteroskedasticity, autocorrelation, nonlinearity in the functional form, outliers, and out-of-sample forecast errors.

In practical applications we face several econometric problems which can be summarized as follows:

Autocorrelation: This happens when:

$$\begin{aligned} u_{t}=\sum _{l=1}^{L}\gamma _{l}u_{t-l}+\varepsilon _{t},t=1,\ldots ,T, \end{aligned}$$
(2)

where, typically, \(\varepsilon _{t}\sim iid(0,\sigma ^{2})\), and L is the number of lags in the autoregressive process.

Heteroskedasticity: This is the case when:

$$\begin{aligned} E(u_{t}^{2}|x_{t})\equiv var(u_{t}|x_{t})=f(x_{t};\delta ), \end{aligned}$$
(3)

for some function \(f(\cdot )\) and parameters \(\delta \in {\mathbb {R}}^{K}\).

Nonlinearity: When nonlinear functions of the explanatory variables have been omitted from (1).

Endogeneity: When the assumption \(E(u_{t}|x_{t})=0\) is violated.

Failure in out-of-sample forecasting: When actual and predicted values out-of-sample (or in a hold-out sample) are not “close” enough.

Autocorrelation and heteroskedasticity are not considered as problems, per se, as one can always use “robust”-HAC standard errors. However, this practice of correcting standard errors but not ordinary least squares (OLS) estimates is misguided. Autocorrelation and heteroskedasticity, more often than not, are, in reality, specification problems and, as such, they indicate misspecification in some direction(s). Therefore, they really call for re-examining the specification in (1). Moreover, we often have to deal with the problem of outliers.

In this paper, we want to provide an estimator of the parameters that satisfies multiple criteria: specifically, in addition to minimizing the sum of squared residuals in (1), we also need to minimize, simultaneously, the presence of autocorrelation, heteroskedasticity, misspecification arising from nonlinearities, endogeneity, and failure in out-of-sample forecasting.

3 The multi-objective nature of regression problems

Suppose the model in (1) has been estimated using a certain technique (not necessarily OLS) and the resulting estimates are \({\hat{\beta }}\). Then, autocorrelation can be tested via the hypothesis \(H_{0}:\gamma _{1}=\cdots =\gamma _{L}=0\) under the specification in (2) where \({\hat{u}}_{t}=y_{t}-x'_{t}{\hat{\beta }}\) is used in place of \(u_{t}\).

Heteroskedasticity is traditionally tested using White’s form:

$$\begin{aligned} h(\hat{u_{t}})=\delta _{0}+\delta '_{1}x_{t}+\delta '_{2}vech(x'_{t}\otimes x_{t})+\xi _{t}, \end{aligned}$$
(4)

for some error term \(\xi _{t}\). Moreover, \(\otimes \) is the Kronecker product of two vectors, viz. \(a\otimes b=[a_{i}b,i=1,\ldots ,dim(a)]\), and vech removes all duplicate elements. This is a second order approximation to an arbitrary variance function, and \(h(\hat{u_{t}})={\hat{u}}_{t}^{2}\).

Functional form misspecification is traditionally tested using Ramsey’s Regression Specification Error Test (RESET). If \({\hat{y}}_{t}=x'_{t}{\hat{\beta }}\), the RESET test uses the regression:

$$\begin{aligned} {\hat{u}}_{t}=a_{0}+a_{1}{\hat{y}}_{t}^{2}+a_{2}{\hat{y}}_{t}^{3}+a_{3}{\hat{y}}_{t}^{4}+\zeta _{t}, \end{aligned}$$
(5)

with an error term \(\zeta _{t}\). If \(a_{1}=a_{2}=a_{3}=0\) we conclude there is no neglected nonlinearity.

Endogeneity is trickier and cannot be tested using OLS regressions (notice that all the above tests can be effectively implemented using OLS residuals and fitted values). All we can do, practically, is to find a vector of instruments (say \(z_{t}\in {\mathbb {R}}^{d_{z}}\)) such that \(E(u_{t}|z_{t})=0\) and \(E(z_{t}x_{t}')\ne \mathbf {0}\), and implement the so-called Instrumental Variables (IV) estimator or the Generalized Method of Moments (GMM) estimator.

Out-of-sample forecasts, say \(\{{\hat{y}}_{T+h},h=1,\ldots ,H\}\), can be compared to the actual values \(y_{T+h}\) using a metric such as the root-mean-squared error (RMSE), the mean absolute error (MAE), etc. Finally, outliers can be avoided by adopting a norm other than \(L_{2}\) when minimizing the average deviation of \(u_{t}(\beta )=y_{t}-x'_{t}\beta \), as we explain in the next section.
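
All of the diagnostic quantities above can be computed from the residuals of any candidate estimate. The following sketch is not the authors' code; it is a minimal illustration in Python/NumPy, assuming the design matrix X contains a constant in its first column, and the helper names (`_ols`, `diagnostics`, the default lag length `L=2`) are our own choices for exposition.

```python
# Minimal sketch (not the authors' code): computing the diagnostic coefficients
# gamma (autocorrelation, eq. (2)), delta (White regression, eq. (4)) and a
# (RESET, eq. (5)) for a candidate beta, plus the maximum absolute t-statistic
# and maximum R^2 of these auxiliary regressions (used later in the paper).
# Assumptions: y is (T,), X is (T,k) with a constant in its first column.
import numpy as np

def _ols(y, X):
    """Helper: OLS coefficients, t-statistics and R^2 of an auxiliary regression."""
    XtXi = np.linalg.pinv(X.T @ X)
    b = XtXi @ X.T @ y
    e = y - X @ b
    T, k = X.shape
    s2 = (e @ e) / (T - k)
    tstat = b / np.sqrt(np.diag(s2 * XtXi))
    r2 = 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)
    return b, tstat, r2

def diagnostics(y, X, beta, L=2):
    u = y - X @ beta                       # residuals u_t(beta)
    yhat = X @ beta
    T = len(u)
    # (2): regress u_t on u_{t-1}, ..., u_{t-L}
    lags = np.column_stack([u[L - l:T - l] for l in range(1, L + 1)])
    g, t_g, r2_g = _ols(u[L:], np.column_stack([np.ones(T - L), lags]))
    # (4): White regression of u_t^2 on x_t and the unique cross-products of x_t
    Xnc = X[:, 1:]                         # non-constant regressors
    iu, ju = np.triu_indices(Xnc.shape[1])
    cross = np.column_stack([Xnc[:, i] * Xnc[:, j] for i, j in zip(iu, ju)])
    d, t_d, r2_d = _ols(u ** 2, np.column_stack([np.ones(T), Xnc, cross]))
    # (5): RESET regression of u_t on powers of the fitted values
    a, t_a, r2_a = _ols(u, np.column_stack([np.ones(T), yhat ** 2, yhat ** 3, yhat ** 4]))
    gamma, delta, a_nl = g[1:], d[1:], a[1:]          # drop intercepts
    t_theta = np.max(np.abs(np.concatenate([t_g[1:], t_d[1:], t_a[1:]])))
    R2_theta = max(r2_g, r2_d, r2_a)
    return gamma, delta, a_nl, t_theta, R2_theta
```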

4 Multi-criteria OLS and IV

In the absence of endogeneity, we propose a new formulation of OLS which explicitly takes into account autocorrelation, heteroskedasticity, possible nonlinearity, and out-of-sample forecasts, as follows:

$$\begin{aligned} \min _{\beta }\left\{ T^{-1}\sum _{t=1}^{T}(y_{t}-x_{t}'\beta )^{2},\left( T^{-1}\sum _{t=1}^{T}|y_{t}-x'_{t}\beta |^{p}\right) ^{1/p},\delta '\delta ,a'a,\gamma '\gamma ,H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} .\nonumber \\ \end{aligned}$$
(6)

The first criterion is the usual OLS criterion for fit. The second minimizes the \(L_{p}\)-norm (for example, \(p=1\) provides the Least Absolute Deviations (LAD) estimator and \(p\rightarrow \infty \) provides the maximum absolute residual, i.e. the Chebyshev norm). Here \(\delta =[\delta '_{1},\delta _{2}']'\), \(a=[a_{1},a_{2},a_{3}]'\), and \(\gamma \) is defined by (2). Therefore, \(\delta '\delta \) deals with heteroskedasticity, \(\gamma '\gamma \) with autocorrelation, and \(a'a\) with problems of neglected nonlinearity, while the last criterion minimizes the average absolute error of out-of-sample forecasting using a hold-out sample \(\{y_{T+1},...,y_{T+H}\}\). To avoid heteroskedasticity, autocorrelation, and nonlinearity, ideally, we should have \(\delta '\delta =\gamma '\gamma =a'a=0\).
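
As an illustration only, the criterion vector in (6) can be assembled as follows, reusing the hypothetical `diagnostics()` helper sketched in Section 3; `(y_out, X_out)` denotes the hold-out sample of size H and `p` the \(L_{p}\) norm.

```python
# Sketch of the criterion vector in (6) for a candidate beta. The diagnostics()
# helper is the hypothetical one sketched in Section 3; (y_out, X_out) is the
# hold-out sample of size H used for the out-of-sample criterion.
import numpy as np

def criteria_ols(beta, y, X, y_out, X_out, p=1.0, L=2):
    u = y - X @ beta
    f_ls = np.mean(u ** 2)                               # least-squares fit
    f_lp = np.mean(np.abs(u) ** p) ** (1.0 / p)          # L_p criterion
    gamma, delta, a, _, _ = diagnostics(y, X, beta, L=L)
    f_fc = np.mean(np.abs(X_out @ beta - y_out))         # out-of-sample MAE
    # heteroskedasticity, nonlinearity, autocorrelation, forecasting
    return np.array([f_ls, f_lp, delta @ delta, a @ a, gamma @ gamma, f_fc])
```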

Moreover, we may have problems with outlying observations which are taken care of by using \(p=1\) or similar. In addition, if a set of linear restrictions must be imposed, the minimization above is subject to:

$$\begin{aligned} R\beta \le b, \end{aligned}$$
(7)

where R is a \(J\times k\) matrix of coefficients representing the J restrictions, and b is a \(J\times 1\) vector of constants.

If endogeneity is thought to be a problem, and instruments \(z_{t}\in {\mathbb {R}}^{d_{z}}\) are available, the problem in (6) can be reformulated as follows:

$$\begin{aligned}&\min _{\beta }\left\{ T^{-1}\sum _{t=1}^{T}({\tilde{y}}_{t}-{\tilde{x}}'_{t}\beta )^{2},\left( T^{-1}\sum _{t=1}^{T}|{\tilde{y}}_{t}-{\tilde{x}}'_{t}\beta |^{p}\right) ^{1/p}, \right. \nonumber \\&\quad \left. \delta '\delta ,a'a,\gamma '\gamma ,H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} , \end{aligned}$$
(8)

where Z is the \(T\times d_{z}\) matrix of instrumental variables, \({\tilde{y}}_{t}=z_{t}y_{t}\), \({\tilde{x}}_{t}=z_{t}x'_{t}\). A more convenient expression results if we express (1) in vector form:

$$\begin{aligned} y=X\beta +u. \end{aligned}$$
(9)

Given the matrix of instruments, we have:

$$\begin{aligned} Z'y=Z'X\beta +Z'u, \end{aligned}$$
(10)

where by definition \(E(Z'u)=\mathbf {0}\). This can be written as

$$\begin{aligned} Z'y=Z'X\beta +e, \end{aligned}$$
(11)

where \(e=Z'u\). The IV estimator is \({\hat{\beta }}_{IV}=(Z'X)^{-1}Z'y\) when \(d_{z}=k\). When \(d_{z}>k\), one can apply OLS to (11), which gives:

$$\begin{aligned} {\hat{\beta }}_{IV}=(X'ZZ'X)^{-1}X'ZZ'y. \end{aligned}$$
(12)

Since

$$\begin{aligned} cov(e)\propto (Z'Z)^{-1}, \end{aligned}$$
(13)

the Generalized Instrumental Variables Estimator (GIVE) is the Generalized Least Squares (GLS) estimator applied to (10):

$$\begin{aligned} {\hat{\beta }}_{GIVE}=[X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y. \end{aligned}$$
(14)

Of course, using (8) is more transparent. If we define \({\tilde{u}}_{t}(\beta )={\tilde{y}}_{t}-{\tilde{x}}'_{t}\beta \) and \({\tilde{u}}(\beta )=[{\tilde{u}}_{t}(\beta ),t=1,...,T]'\), the GIVE form of (8) is:

$$\begin{aligned}&\min _{\beta }\left\{ T^{-1}{\tilde{u}}(\beta )'(Z'Z)^{-1}{\tilde{u}}(\beta ),\left( T^{-1}\sum _{t=1}^{T}|{\tilde{u}}_{t}(\beta )|^{p}\right) ^{1/p},\right. \nonumber \\&\quad \left. \delta '\delta ,a'a,\gamma '\gamma ,H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} , \end{aligned}$$
(15)

possibly subject to (7). To obtain \(\delta ,a,\gamma \), which implicitly depend on the estimator of \(\beta \), we perform the regressions in (4), (5), and (2). Similarly, for the hold-out sample we define \({\hat{y}}_{T+1:T+H}=x_{T+1:T+H}'\beta \).
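
A corresponding sketch for the IV case follows; it is again only an illustration under our own conventions. The stacked moment vector \({\tilde{u}}(\beta )=Z'(y-X\beta )\) enters the quadratic form in (15), and we read \(|{\tilde{u}}_{t}(\beta )|\) as the Euclidean norm of \(z_{t}u_{t}(\beta )\) for the \(L_{p}\) criterion (one possible interpretation).

```python
# Sketch of the criterion vector in (15)/(19) for a candidate beta, given a
# T x d_z instrument matrix Z. The reading of |u~_t(beta)| as the Euclidean
# norm of z_t * u_t(beta) is our own interpretation, made only for illustration.
import numpy as np

def criteria_iv(beta, y, X, Z, y_out, X_out, p=1.0, L=2):
    T = len(y)
    u = y - X @ beta
    m = Z.T @ u                                          # u~(beta) = Z'u(beta)
    f_give = (m @ np.linalg.solve(Z.T @ Z, m)) / T       # first criterion in (15)
    u_tilde_t = Z * u[:, None]                           # rows: z_t * u_t(beta)
    f_lp = np.mean(np.linalg.norm(u_tilde_t, axis=1) ** p) ** (1.0 / p)
    gamma, delta, a, t_theta, R2_theta = diagnostics(y, X, beta, L=L)
    f_fc = np.mean(np.abs(X_out @ beta - y_out))         # out-of-sample MAE
    return np.array([f_give, f_lp, delta @ delta, a @ a, gamma @ gamma,
                     R2_theta, t_theta, f_fc])           # eight objectives as in (19)
```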

Additionally, one may wish to avoid OLS altogether in (15) and use Rousseeuw’s (1984) least median of squares (LMS) technique:

$$\begin{aligned}&\min _{\beta }\left\{ \mathrm {\underset{ t=1,...,T }{median}}({\tilde{y}}_{t}-{\tilde{x}}_{t}'\beta )^{2},\right. \nonumber \\&\quad \left. \delta '\delta ,a'a,\gamma '\gamma ,H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} . \end{aligned}$$
(16)

This formulation begins directly with LMS and then proceeds with autocorrelation, heteroskedasticity, neglected non-linearity, endogeneity, and out-of-sample forecasting. Relative to (15), (16) deals with endogeneity in a somewhat clumsy way, as it does not take into account the covariance of the errors given by (13). Choosing between (15) and (16) is an empirical issue that we try to resolve on the basis of Monte Carlo simulations.

Suppose now \(\theta =[\delta ',a',\gamma ']'\) denotes the corresponding parameters for heteroskedasticity, neglected nonlinearity and autocorrelation. It may not be enough to make the Euclidean norm \(||\theta ||\) as small as possible, as in (6), (15) or (16). What is of interest is to be able to accept the hypothesis \(H:\theta =\mathbf {0}\). To this end, we would also like to minimize, within (6), (15), and (16), the maximum coefficient of determination (\(R^{2}\)) of the diagnostic regressions. If the maximum coefficient of determination in regressions (2), (4), and (5) is denoted by \(R_{\theta }^{2}\), then the modified multi-criteria IV problems can be stated as follows:

$$\begin{aligned}&\min _{\beta }\left\{ T^{-1}{\tilde{u}}(\beta )'(Z'Z)^{-1}{\tilde{u}}(\beta ),\left( T^{-1}\sum _{t=1}^{T}|{\tilde{u}}_{t}(\beta )|^{p}\right) ^{1/p},\right. \nonumber \\&\quad \left. \delta '\delta ,a'a,\gamma '\gamma ,R_{\theta }^{2},H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} , \end{aligned}$$
(17)
$$\begin{aligned}&\min _{\beta }\left\{ \mathrm {\underset{ t=1,...,T }{median}}({\tilde{y}}_{t}-{\tilde{x}}_{t}'\beta )^{2},\delta '\delta ,a'a,\gamma '\gamma ,R_{\theta }^{2},H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} . \end{aligned}$$
(18)

Additionally, if we target statistical insignificance of \(\delta ,a,\gamma \) we can consider the maximum absolute value of the t-statistics for these coefficients, which we denote by \(t_{\theta }\). Therefore, we can modify (17) and (18) as follows:

$$\begin{aligned}&\min _{\beta }\left\{ T^{-1}{\tilde{u}}(\beta )'(Z'Z)^{-1}{\tilde{u}}(\beta ),\left( T^{-1}\sum _{t=1}^{T}|{\tilde{u}}_{t}(\beta )|^{p}\right) ^{1/p},\right. \nonumber \\&\quad \left. \delta '\delta ,a'a,\gamma '\gamma ,R_{\theta }^{2},t_{\theta },H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} , \end{aligned}$$
(19)
$$\begin{aligned}&\min _{\beta }\left\{ \mathrm {\underset{ t=1,...,T }{median}}({\tilde{y}}_{t}-{\tilde{x}}_{t}'\beta )^{2},\delta '\delta ,a'a,\gamma '\gamma ,t_{\theta },R_{\theta }^{2},H^{-1}\sum _{h=1}^{H}|{\hat{y}}_{T+h}-y_{T+h}|\right\} . \end{aligned}$$
(20)

Therefore, we have eight objectives in (19) and seven objectives in (20), possibly subject to (7). Additional restrictions that we may want to impose are as follows:

$$\begin{aligned} \begin{array}{c} -2\le t_{\theta }\le 2,\\ R_{\theta }^{2}\le 0.10,\\ H^{-1}\sum _{h=1}^{H}|\tfrac{{\hat{y}}_{T+h}-y_{T+h}}{y_{T+h}}|\le 0.05. \end{array} \end{aligned}$$
(21)

The expression in the last constraint is the Mean Absolute Relative Error (MARE). The other two constraints imply that (i) the maximum absolute value of the t-statistics in the diagnostic regressions, denoted by \(t_{\theta }\), is less than the 95% critical value (1.96, which is close to 2), and (ii) the maximum coefficient of determination in the diagnostic regressions is less than 10%. The coefficient of determination of a diagnostic regression shows its explanatory power; if it is large, then the presence of autocorrelation, heteroskedasticity or a misspecified functional form cannot be excluded. Clearly, in practice, we want to avoid this.
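
A small helper, under the same illustrative assumptions as before, can check whether a candidate solution satisfies the side constraints in (21):

```python
# Sketch: checking the side constraints (21) for a candidate beta, reusing the
# hypothetical diagnostics() helper; MARE is computed on the hold-out sample.
import numpy as np

def satisfies_constraints(beta, y, X, y_out, X_out, L=2):
    _, _, _, t_theta, R2_theta = diagnostics(y, X, beta, L=L)
    mare = np.mean(np.abs((X_out @ beta - y_out) / y_out))
    return (t_theta <= 2.0) and (R2_theta <= 0.10) and (mare <= 0.05)
```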

Under these constraints, a solution may not exist, so one may want to remove (21) and examine the values of these quantities at a Pareto optimal solution. We examine the behavior of the estimators in both (19) and (20). We use 10,000 simulations. In all cases we have \(T+H\) observations, where \(H=20\) is the length of the hold-out sample.

5 Solution technique

Due to the presence of absolute values, the multi-criteria problems in (19) and (20) are not differentiable, so finding a Pareto optimal solution is difficult and clearly not available in closed form. For the solution technique, we rely heavily on Tsionas (2017), although we avoid the use of cumbersome Sequential Monte Carlo or Particle Filtering techniques. A significant advantage of the technique proposed here is that it delivers (posterior) standard deviations of the parameters of interest and, therefore, confidence bands can be constructed as well. This is in contrast to other multi-objective optimization techniques which, if applied to the present problem, would not deliver measures of statistical uncertainty such as standard errors. Standard errors would have to be computed via sub-sampling or bootstrap methods, which would increase the complexity of existing multi-objective methods and, at any rate, would put them on par with our MCMC technique for Bayesian inference.

Suppose \(F(\beta )=(F_{1}(\beta ),...,F_{n}(\beta ))'\in {\mathbb {R}}^{n}\), \({\underline{F}}=({\underline{F}}_{1},...,{\underline{F}}_{n})'\in {\mathbb {R}}^{n}\). The objective is to solve the problem:

$$\begin{aligned} \min _{\beta \in X\subseteq {\mathbb {R}}^{k}}\;F(\beta ), \end{aligned}$$
(22)

where X incorporates all restrictions on \(\beta \). As in Qu et al. (2014) we settle for global Pareto optimality, meaning that \(\beta ^{*}\) is a solution if and only if there does not exist \(\beta \in X\) such that \(F(\beta )\le F(\beta ^{*})\) and \(F(\beta )\ne F(\beta ^{*})\).

In multi-criteria decision making the problem is:

$$\begin{aligned} \min _{\beta \in X}\;\sum _{i=1}^{n}\alpha _{i}F_{i}(\beta ), \end{aligned}$$
(23)

for a certain vector of Pareto weights \(\alpha =(\alpha _{1},...,\alpha _{n})'\) which belongs to the unit simplex, \(S=\{\alpha \in {\mathbb {R}}^{n}:\alpha _{i}\ge 0,\;i=1,...,n,\;\sum _{i=1}^{n}\alpha _{i}=1\}\). In this problem, we can obtain a solution for any given set of \(\alpha \)s. Moreover, if we solve (23) for a range of values of \(\alpha \in S\) we can trace out the Pareto frontier. Therefore, we proceed as follows. Problem (23) is equivalent (for h large enough, see below) to finding the mean of the following posterior distribution (see Footnote 1):

$$\begin{aligned} p(\beta |\alpha ,h,F)=\frac{\exp \left\{ -h\sum \alpha _{i}F_{i}(\beta )\right\} p(\beta )}{\int _{X}\exp \left\{ -h\sum \alpha _{i}F_{i}(b)\right\} p(b)db}, \end{aligned}$$
(24)

for a given positive constant h and a prior for \(\beta \), denoted by \(p(\beta )\). For this prior, it is reasonable to assume:

$$\begin{aligned} \beta \sim {\mathcal {N}}_{k}({\hat{\beta }}_{OLS},\varphi s^{2}(X'X)^{-1}), \end{aligned}$$
(25)

where \({\hat{\beta }}_{OLS}=(X'X)^{-1}X'y\), \({\hat{u}}=y-X{\hat{\beta }}_{OLS}\), \((T-k)s^{2}={\hat{u}}'{\hat{u}}\), and \(\varphi =10\). This prior depends on the data so, strictly speaking, it is not a pure, coherent Bayes prior. Nevertheless, it is reasonable in our context and can, in fact, be given an empirical Bayes interpretation. Here, we re-emphasize that we can condition on the \(\alpha \)s. Then, the posterior mean is:

$$\begin{aligned} {\overline{\beta }}=\frac{\int _{X}\beta \cdot \exp \left\{ -h\sum \alpha _{i}F_{i}(\beta )\right\} p(\beta )d\beta }{\int _{X}\exp \left\{ -h\sum _{i=1}^{n}\alpha _{i}F_{i}(b)\right\} p(b)db}. \end{aligned}$$
(26)

This result goes back to Pincus (1968) and it is known that h must be “large”. If we consider the non-normalized posterior:

$$\begin{aligned} p(\beta ,h|\alpha ,F)\propto \exp \left\{ -h\sum _{i=1}^{n}\alpha _{i}F_{i}(\beta )\right\} p(\beta )p(h), \end{aligned}$$
(27)

for a certain prior p(h) then h becomes part of the parameter vector. For example, we can use a gamma prior of the form:

$$\begin{aligned} p(h)\propto h^{a-1}\exp \left\{ -ch\right\} ,\;h>0,\;a,c>0, \end{aligned}$$
(28)

where the parameters a and c can be chosen so that the prior mean \(E(h)=\frac{a}{c}\) is small and the prior variance \(Var(h)=\frac{E(h)}{c}\) is also small. For example, we can set \(a=0.01\) and \(c=\frac{a}{100}\). In this way, we do not have to consider different values of h, although it might be useful to perform a sensitivity analysis. An alternative is to integrate h analytically out of (27) using (28) to obtain:

$$\begin{aligned} p(\beta |\alpha ,F)\propto p(\beta )\left\{ c+\sum _{i=1}^{n}\alpha _{i}F_{i}(\beta )\right\} ^{-a}. \end{aligned}$$
(29)

Further analytical integration with respect to \(\alpha \) (when \(\alpha \) is unknown and assigned a prior distribution) is not possible. Therefore, the posterior mean has to be computed numerically. Here, we use a standard random-walk Metropolis algorithm (a well-known Markov Chain Monte Carlo [MCMC] technique). This MCMC technique produces a set of draws \(\{\beta ^{(s)},s=1,...,S\}\) that converges in distribution (as S increases) to the posterior whose non-normalized density is (29). To construct this sequence, suppose we have \(\beta ^{(s)}\) and we generate a candidate, say \(\beta ^{c}\), as follows:

$$\begin{aligned} \beta ^{c}\sim {\mathcal {N}}(\beta ^{(s)},\tau s^{2}(X'X)^{-1}), \end{aligned}$$
(30)

where \(\tau \) is a certain parameter. Then we accept the candidate, that is \(\beta ^{(s+1)}=\beta ^{c}\) with the Metropolis-Hastings probability:

$$\begin{aligned} \min \left\{ 1,\,\frac{p(\beta ^{c}|\alpha ,F)}{p(\beta ^{(s)}|\alpha ,F)}\right\} , \end{aligned}$$
(31)

otherwise we set \(\beta ^{(s+1)}=\beta ^{(s)}\). We select \(\tau \) so that the acceptance rate is between 20 and 30% during the “burn-in” phase, as the optimal acceptance rate for a multivariate normal posterior is close to 24%.
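
The sampler just described can be sketched as follows. This is an illustration, not the authors' implementation: `F` stands for any of the criterion vectors sketched above (e.g. the hypothetical `criteria_iv`), `alpha` is a point on the unit simplex, and the tuning constants `a`, `c`, `phi` (\(\varphi \)) and `tau` (\(\tau \)) follow the choices discussed in this section.

```python
# Sketch of the random-walk Metropolis sampler targeting the non-normalized
# posterior (29). F(beta) returns the vector of criteria (e.g. the hypothetical
# criteria_iv above, with the data fixed inside a closure); alpha are the Pareto
# weights. Prior (25) is centred at OLS with covariance phi * s^2 * (X'X)^{-1}.
import numpy as np

def log_post(beta, F, alpha, prior_mean, prior_prec, a=0.01, c=0.0001):
    scal = alpha @ F(beta)                               # sum_i alpha_i F_i(beta)
    dev = beta - prior_mean
    return -a * np.log(c + scal) - 0.5 * dev @ prior_prec @ dev  # log of (29), up to a constant

def rw_metropolis(F, alpha, y, X, n_draws=15000, burn=5000, tau=0.1, phi=10.0, seed=0):
    rng = np.random.default_rng(seed)
    T, k = X.shape
    XtXi = np.linalg.inv(X.T @ X)
    b_ols = XtXi @ X.T @ y                               # prior mean and starting value
    s2 = np.sum((y - X @ b_ols) ** 2) / (T - k)
    prior_prec = np.linalg.inv(phi * s2 * XtXi)          # inverse of the prior covariance (25)
    prop_chol = np.linalg.cholesky(tau * s2 * XtXi)      # proposal covariance, eq. (30)
    beta, lp = b_ols.copy(), log_post(b_ols, F, alpha, b_ols, prior_prec)
    draws = []
    for s in range(n_draws):
        cand = beta + prop_chol @ rng.standard_normal(k) # candidate from (30)
        lp_c = log_post(cand, F, alpha, b_ols, prior_prec)
        if np.log(rng.uniform()) < lp_c - lp:            # accept with probability (31)
            beta, lp = cand, lp_c
        if s >= burn:
            draws.append(beta.copy())
    draws = np.array(draws)
    return draws.mean(axis=0), draws.std(axis=0), draws  # posterior mean, sd, draws
```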

On an average personal computer, the algorithm takes no more than a few minutes of wall-clock time to perform a full posterior analysis in samples of size 1,000. Convergence is examined using two techniques: (i) Geweke’s (1992) t-statistic for convergence of posterior means, and (ii) running separate chains starting from randomly chosen initial conditions. Ten such chains are used here and the computation is implemented in parallel, so this check adds little to total computation time.

Convergence of MCMC algorithms like the ones described here can be tested by starting the algorithm from different sets of initial conditions and testing convergence using Geweke’s (1992) diagnostic which is asymptotically distributed (in the number of MCMC draws) as standard normal. Using the settings reported here and starting from ten randomly selected sets of initial conditions, Geweke’s z-test indicated that our MCMC chains are not incompatible with the hypothesis of convergence.

6 Monte Carlo study

We consider the following model. First, \(k=5\), so that we have five regressors. The specification is \(y_{t}=1+x_{t1}+x_{t2}+x_{t3}+x_{t4}+x_{t3}x_{t4}+u_{t},\,u_{t}\sim iid\mathcal {\,N}(0,0.1^{2}),t=1,\ldots ,T\). The regressors \(x_{t3},x_{t4}\) are omitted from the estimated model, and \(x_{t1}\) and \(x_{t4}\) are generated as follows:

$$\begin{aligned} x_{t1}= & {} 0.8x_{t-1,1}+u_{t}+e,\,e_{t}\sim iid\,{\mathcal {N}}(0,0.1^{2}), \end{aligned}$$
(32)
$$\begin{aligned} x_{t4}= & {} \frac{1}{1+\exp \left( -\tfrac{x_{t3}-\min x_{t3}}{\max x_{t3}-\min x_{t3}}\right) }+|x_{t1}+x_{t1}^{2}|v_{t}+u_{t},\,v_{t}\sim iid\,{\mathcal {N}}(0,1). \end{aligned}$$
(33)

Equation (32) implies that \(x_{t1}\) is autocorrelated. Equation (33) implies that \(x_{t4}\) is heteroskedastic and depends on \(x_{t3}\) and \(x_{t1}\) in a nonlinear way; in fact, the first term is a sigmoid. Notice that, to allow for endogeneity, both (32) and (33) depend on \(u_{t}\).

By omitting \(x_{t3}\), \(x_{t4}\) and the interaction \(x_{t3}x_{t4}\), all the problems we mentioned are simultaneously present. We construct five instruments as follows:

$$\begin{aligned} z_{ti}=\beta _{i1}x_{t1}+\beta _{i2}x_{t2}+\varepsilon _{ti},\,\varepsilon _{ti}\sim iid\mathcal {\,N}(0,0.1^{2}),\,i=1,\ldots ,5, \end{aligned}$$
(34)

and \(\beta _{i1},\beta _{i2},\,i=1,\ldots ,5\), are generated from a uniform distribution in \(\left[ -1,1\right] \) (see Footnote 2). We implement MCMC using 15,000 passes, the first 5,000 of which are discarded to mitigate possible start-up effects. The initial conditions are obtained from OLS. We assume that all Pareto weights are equal.
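
For concreteness, one replication of this design can be generated as in the sketch below. This is not the authors' code: the generation of \(x_{t2}\) and \(x_{t3}\) is not fully spelled out above, so i.i.d. standard normal draws are used for them as a placeholder assumption, and a constant column is added to the instrument set.

```python
# Sketch of one Monte Carlo replication under (32)-(34). The generation of
# x_{t2} and x_{t3} is not fully specified in the text, so i.i.d. standard
# normal draws are used here as a placeholder assumption.
import numpy as np

def simulate(T, H=20, seed=0):
    rng = np.random.default_rng(seed)
    n = T + H
    u = rng.normal(0.0, 0.1, n)                          # u_t ~ N(0, 0.1^2)
    x2 = rng.standard_normal(n)                          # placeholder assumption
    x3 = rng.standard_normal(n)                          # placeholder assumption
    x1 = np.zeros(n)
    for t in range(1, n):                                # eq. (32): autocorrelated, endogenous
        x1[t] = 0.8 * x1[t - 1] + u[t] + rng.normal(0.0, 0.1)
    s = (x3 - x3.min()) / (x3.max() - x3.min())
    v = rng.standard_normal(n)
    x4 = 1.0 / (1.0 + np.exp(-s)) + np.abs(x1 + x1 ** 2) * v + u   # eq. (33)
    y = 1.0 + x1 + x2 + x3 + x4 + x3 * x4 + u            # true data-generating process
    X = np.column_stack([np.ones(n), x1, x2])            # estimated model omits x3, x4
    B = rng.uniform(-1.0, 1.0, size=(5, 2))              # eq. (34): instrument coefficients
    Z = np.column_stack([np.ones(n)] +                   # constant column: our assumption
                        [B[i, 0] * x1 + B[i, 1] * x2 + rng.normal(0.0, 0.1, n)
                         for i in range(5)])
    return y[:T], X[:T], Z[:T], y[T:], X[T:]             # estimation and hold-out samples
```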

From the results in Table 1, it turns out that OLS is always biased, as expected, and its RMSE remains approximately constant for all sample sizes. The multi-criteria OLS does better, but it is still biased and inconsistent. In contrast, both multi-criteria IV techniques have much lower bias and RMSE, which decrease as the sample size increases, showing that they have great potential in large samples. Autocorrelation is eliminated in approximately 75–80% of Monte Carlo samples, while heteroskedasticity is eliminated in over 80% of the samples. The multi-criteria OLS is not equally successful and, of course, this can be attributed to the fact that it does not deal with the endogeneity problem. From Fig. 1, it is evident that multi-criteria IV provides estimates which are much closer to the true values compared to OLS and multi-criteria OLS. On the basis of bias and RMSE alone, there is little ground to choose between the multi-criteria IV estimators in (19) and (20). However, from Table 1, it is evident that the average mean absolute relative error is much lower for the multi-criteria IV estimator in (19).

Table 1 Monte Carlo results

Sampling distributions of the different estimators are reported in Fig. 1. Notably the sampling distribution of multi-criteria OLS is non-normal even in large samples. Multi-criteria IV has a non-normal sampling distribution only when \(T=100\).

Fig. 1 Sampling distributions

To examine more closely the behavior of the two IV estimators, we present the sampling distributions of \(t_{\theta }\), \(R_{\theta }^{2}\) and MARE in Fig. 2.

Fig. 2 Sampling distributions of \(t_{\theta }\), \(R_{\theta }^{2}\) and MARE

Maximum absolute t-statistics are below 1.96 for multi-criteria IV-1 (the estimator in (19)) but exceed this critical value in nearly 1% of the samples for multi-criteria IV-2 (the estimator in (20)). Values of \(R_{\theta }^{2}\) are less than 0.10 in both cases (although lower for multi-criteria IV-1). Finally, the sampling distribution of MARE is concentrated around 0.70% for multi-criteria IV-1. Although there is some concentration around this value for multi-criteria IV-2, its sampling distribution allows for values exceeding 8%. From this point of view, multi-criteria IV-1 performs better than multi-criteria IV-2.

To visualize the Pareto front, we assume a simple model where:

$$\begin{aligned} \begin{array}{c} y_{t}=\beta x_{t}+u_{t},t=1,\ldots ,T,\\ u_{t}=\rho u_{t-1}+\varepsilon _{t},\,\varepsilon _{t}\sim iid\mathcal {\,N}(0,0.1^{2}), \end{array} \end{aligned}$$
(35)

where \(\beta =1\), \(x_{t}\sim iid\mathcal {\,N}(0,1)\), and \(\rho =0.7\). The weight for the OLS criterion is \(\lambda \in (0,1)\) and the weight for autocorrelation is \(1-\lambda \). As \(\lambda \rightarrow 1\) we obtain OLS, and as \(\lambda \rightarrow 0\) we focus exclusively on autocorrelation. As the latter case does not make sense, we restrict \(\lambda \) to [0.20, 1] and we examine 100 points in this interval. We also examine the heteroskedastic case where

$$\begin{aligned} \begin{array}{c} u_{t}|\sigma _{t}\sim {\mathcal {N}}(0,\sigma _{t}^{2}),t=1,\ldots ,T,\\ \sigma _{t}^{2}=\exp \left( 0.1+x_{t}+x_{t}^{2}\right) . \end{array} \end{aligned}$$
(36)

In both cases the sample size is \(T=100\). The Pareto front is presented in panel (a) of Fig. 3, where b denotes the estimate of \(\beta \).

Fig. 3 Pareto front

As the choice of \(\lambda \) is not obvious from panel (a), in panel (b) we report MARE across different values of \(\lambda \) for the autocorrelation and heteroskedasticity cases. Out-of-sample prediction is implemented using a hold-out sample of size 20. Clearly, MARE attains its minimum at a value of \(\lambda \) close to 0.55 for autocorrelation and close to 0.47 for heteroskedasticity, showing that some balance between in-sample fit and autocorrelation / heteroskedasticity is required. The Pareto weights are not far from 0.5, at least in this example. It is possible that in some cases MARE does not attain a minimum for the chosen values of \(\lambda \). In such cases, one can use leave-one-out cross-validation, similarly to bandwidth selection in non-parametric estimation.
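
For the two-criterion example in (35), the Pareto front of panel (a) can be traced by scanning the weight \(\lambda \) over a grid on [0.20, 1], as in the sketch below. For brevity, the scalarized objective is minimized with a simple numerical optimizer instead of the MCMC sampler; this is purely illustrative.

```python
# Sketch: tracing the two-criterion Pareto front for model (35). For each weight
# lam on a grid over [0.20, 1] we minimize lam * (OLS criterion) + (1 - lam) *
# gamma'gamma, where gamma comes from the autocorrelation regression (2) of the
# residuals. A scalar optimizer replaces the MCMC sampler purely for brevity.
import numpy as np
from scipy.optimize import minimize_scalar

def gamma_sq(u, L=1):
    lags = np.column_stack([u[L - l:len(u) - l] for l in range(1, L + 1)])
    g, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(u) - L), lags]),
                            u[L:], rcond=None)
    return g[1:] @ g[1:]                                  # gamma'gamma

def pareto_front(y, x, grid=np.linspace(0.20, 1.0, 100)):
    rows = []
    for lam in grid:
        obj = lambda b: lam * np.mean((y - b * x) ** 2) + (1 - lam) * gamma_sq(y - b * x)
        b = minimize_scalar(obj).x
        u = y - b * x
        rows.append((lam, b, np.mean(u ** 2), gamma_sq(u)))
    return np.array(rows)    # columns: lambda, b, OLS criterion, gamma'gamma
```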

7 Empirical application

To illustrate the usefulness of the new techniques, we use the same data-construction methodology as in Wang and Zhu (2010). We use data for 2,014 trading days from 11/June/2011 to 12/June/2019, covering nearly eight years, for the NASDAQ index. From the daily closing prices of the NASDAQ index, Wang and Zhu (2010) proposed computing the 5-day (weekly), 10-day (biweekly), 20-day (monthly), and 50-day (quarterly) moving averages, which are technical indicators widely used by traders. Let \(P_{t}\) denote the closing price on day t and \(A_{t,T}\) be the T-day moving average on day t, where \(A_{t,T}\) is computed as follows:

$$\begin{aligned} A_{t,T}=T^{-1}\sum _{k=t-T+1}^{t}P_{k},\,T=5,\,10,\,20,\,50. \end{aligned}$$
(37)

Next, the moving average log-return \(R_{t,T}\) (including the daily log-return) for each day t, is as follows:

$$\begin{aligned} R_{t,T}=\ln \frac{A_{t,T}}{A_{t-T,T}},\,T=5,\,10,\,20,\,50. \end{aligned}$$
(38)

In turn, we have five time series: (1) \(R_{t,1}\), (2) \(R_{t,5}\), (3) \(R_{t,10}\), (4) \(R_{t,20}\), and (5) \(R_{t,50}\). The dependent variable is \(y_{t}=R_{t+1,1}\), the next day’s log-return. We then construct the input features: for the jth time series, we extract \(p_{j}\) data points. Specifically, let

$$\begin{aligned} \mathbf {x}_{t}^{j}=\left[ R_{t-(p_{j}-1)T_{j},T_{j}},R_{t-(p_{j}-2)T_{j},T_{j}},\ldots ,R_{t,T_{j}}\right] ,\,j=1,\ldots ,5, \end{aligned}$$
(39)

where \(T_{1}=1\), \(T_{2}=5\), \(T_{3}=10\), \(T_{4}=20\), \(T_{5}=50\). The overall input features for day t are:

$$\begin{aligned} \mathbf {x}_{t}=[\mathbf {x}_{t}^{1},\mathbf {x}_{t}^{2},\mathbf {x}_{t}^{3},\mathbf {x}_{t}^{4},\mathbf {x}_{t}^{5}]. \end{aligned}$$
(40)

As in Wang and Zhu (2010), \(\mathbf {x}_{t}^{1}\) and \(\mathbf {x}_{t}^{2}\) capture the short-term (daily and weekly) behavior of the market, while \(\mathbf {x}_{t}^{4}\) and \(\mathbf {x}_{t}^{5}\) capture the long-term (monthly and quarterly) trends. Moreover, “[i]t is not clear a priori which features are important for predicting the next day return, neither how they should be combined to predict” (Wang and Zhu 2010, p. 110). The problem with the specification of the explanatory variables in (40) is that, since they are constructed from functions of lagged values of \(y_{t}\), they introduce autocorrelation as well as heteroskedasticity in the error term. The error term is also expected to be heteroskedastic because of the well-known stylized fact that second-order moments of financial data are time-varying.
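
The construction in (37)–(40) can be sketched as follows from the vector of daily closing prices. The number of lagged moving-average returns per series, \(p_{j}\), is not pinned down in the excerpt above, so a common value `p` is used here as an assumption; the code is illustrative only.

```python
# Sketch of the feature construction in (37)-(40) from daily closing prices P.
# The per-series lag counts p_j are not pinned down above, so a common value p
# is used here as an assumption.
import numpy as np

def build_features(P, windows=(1, 5, 10, 20, 50), p=3):
    P = np.asarray(P, dtype=float)
    n = len(P)
    # (37): T-day moving averages A_{t,T} (window T = 1 gives the price itself)
    A = {T: np.convolve(P, np.ones(T) / T, mode="valid") for T in windows}
    # (38): moving-average log-returns R_{t,T} = ln(A_{t,T} / A_{t-T,T})
    R = {T: np.log(A[T][T:] / A[T][:-T]) for T in windows}
    start = max(2 * T - 1 + (p - 1) * T for T in windows)  # first day with all features
    X, y = [], []
    for t in range(start, n - 1):
        feats = []
        for T in windows:                                  # (39): p lagged R_{.,T} values
            i = t - 2 * T + 1                              # position of R_{t,T} in R[T]
            feats.extend(R[T][i - (p - 1) * T: i + 1: T])
        X.append(feats)                                    # (40): stacked input features
        y.append(np.log(P[t + 1] / P[t]))                  # dependent variable R_{t+1,1}
    return np.array(X), np.array(y)
```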

To compare the various alternatives, we consider a buy-and-hold strategy, an artificial neural network (ANN; see Footnote 3), OLS and MCDMR for the last 100 trading days, which were not used in estimation.

Fig. 4 Performance for 100 last trading days for NASDAQ

In the upper left panel of Fig. 4, we present NASDAQ log-returns. In the upper right panel we compare buy-and-hold, ANN, OLS, and MCDMR. OLS, clearly, does not do well relative to buy-and-hold and ANN. MCDMR, for the most part, does quite well in terms of cumulative returns, even when compared to ANN. In the lower panel we report results from multi-criteria IV along with 95% Bayes probability intervals, which can be computed easily once MCMC draws from the posterior in (29) are available. It turns out that multi-criteria IV delivers rather tight error bounds, so its performance in terms of cumulative returns is statistically meaningful.

8 Managerial implications

That linear regression is used in many managerial decision-making processes is well known and beyond any doubt. Stam (1997) succinctly pointed out its relevance in discrete variable classification, mixed variable classification, and in application areas which are often susceptible to data analytical problems, such as medical diagnosis, psychology, marketing, financial analysis, engineering and pattern recognition. In this paper, we provide an estimator of the parameters in linear regression or instrumental-variable models that satisfies multiple criteria: in addition to minimizing the sum of squared residuals in (1), we also minimize, simultaneously, the presence of autocorrelation, heteroskedasticity, misspecification arising from nonlinearities, endogeneity, and failure in out-of-sample forecasting. These problems are commonly encountered in applications, but they are addressed, for the most part, in an ad hoc way. Formulating linear regression estimation as a multi-criteria decision-making problem addresses them in a common and principled framework, while also permitting the user to assign different importance to different objectives, if so desired. Managers often need to understand how a process works by using the framework of linear regression. We have argued that autocorrelation and heteroskedasticity are not incidental problems that can be treated in a mechanical way by using, for example, so-called robust standard errors, which are now part of commonly available statistical software. Rather, autocorrelation and heteroskedasticity indicate misspecification of the model, as it is quite likely that autocorrelated and / or heteroskedastic variables have been omitted from the model, becoming part of the error term. Our new multi-criteria decision making approach to OLS and IV regression is found to perform well in a Monte Carlo study. An application to NASDAQ daily returns shows that the cumulative predicted returns are higher than those of buy-and-hold strategies and even artificial neural networks, whereas least-squares regression fails to deliver acceptable results.

These particular aspects of the model as it relates to forecasting performance are encouraging and deserve further investigation in future research.

9 Concluding remarks

In this paper, we consider OLS and IV regression. We argue that specification problems related to autocorrelation, heteroskedasticity, neglected non-linearity, unsatisfactory out-of-sample performance and endogeneity can be addressed in the context of multi-criteria optimization. We show that the new technique performs well: it minimizes all these problems simultaneously and effectively eliminates them for the most part. Markov Chain Monte Carlo techniques are used to perform the computations. An application to NASDAQ daily returns shows that cumulative predicted returns are higher than those of buy-and-hold strategies and even an artificial neural network, whereas OLS regression fails to deliver acceptable results. As such, the new techniques are likely to be of interest to practitioners in most applied fields dealing with estimation, misspecification and interpretation of regression models. In particular, the method may be relevant in portfolio optimization and prediction, as it minimizes the effects of regression problems simultaneously. Although we use Bayesian analysis and MCMC methods to solve the multi-criteria OLS and IV problems, it is possible to use several other techniques (see Footnote 1). The downside is that the computation of standard errors and confidence bands is then not straightforward, and it is likely that the use of the sub-sampling bootstrap and other variants becomes imperative, thus increasing computational complexity and timing. From this point of view, as simulation techniques are required anyway, the use of Bayesian MCMC may be more straightforward in practice.

The new techniques allow dealing with outliers in a straightforward way, as appropriate norms of errors are introduced among the objectives in multi-criteria OLS / IV. Out-of-sample fit is also taken into account so that the techniques deliver the best predictions possible when commonly encountered regression problems are also taken into account.

In terms of future research, several problems are open. First, a generalization to Generalized Method of Moments estimation is possible, to deal with specification problems of regressions. Second, it would be worthwhile to address the same problems in non-linear regression models, which seems quite easy. Third, methods for selecting valid instruments in practical situations should be developed. One such technique, in the big-data context, is provided by Bai and Ng (2010).