Introduction

The purpose of this paper is to introduce and encourage the use of a theoretically motivated and intuitive robust regression estimator for asset factor models, as a standard practice complement to the universal use of least squares to fit factor models. While the robust estimator proposed is equally useful for cross section and time series factor models, we focus in this paper on demonstrating the estimator’s value in the context of time series factor models.

In data-oriented terms, a robust regression estimator is one that is not much influenced by outliers and that provides a good fit to the bulk of the data. Although least squares (LS) is the best linear unbiased estimator (BLUE) and, in the case of normally distributed errors, a maximum-likelihood estimator (MLE), the minimization of a quadratic function of the residuals leaves least squares quite lacking in such robustness. A small fraction of highly influential outliers often leads to least squares regression conclusions that are quite misleading about the relationship between stock returns and explanatory factors, whereas a robust regression that controls for outliers can lead to a clear understanding of the relationship for the large majority of the data.

Over several decades, a body of knowledge about the statistical properties of, and computational methods for, various robust regression estimators has accumulated in the academic statistics research literature, with the books by Hampel et al. (1986), Rousseeuw and Leroy (1987), the Huber and Ronchetti (2009) second edition of Huber (1981), and Maronna et al. (2019) being the primary references. This good news is counterbalanced, however, by the bad news that today there are simply too many different types of robust regression estimators, including not only the maximum-likelihood type M-estimators and MM-estimators discussed herein, but also other distinct families of estimators such as the Rousseeuw least-trimmed-squares (LTS) estimator discussed in Rousseeuw and Leroy (1987) and used by Knez and Ready (1997), and bounded-influence estimators. Furthermore, even the best known and most frequently used family of regression M-estimators has far too many variants based on different choices of optimization loss function. For example, the SAS software product offers 10 M-estimator variants with the names Andrews, Bisquare, Cauchy, Fair, Hampel, Huber, Logistic, Median, Talworth, and Welsch.Footnote 1 It is therefore not surprising that there is currently no consensus whatsoever concerning which robust regression estimator one should use, even within the family of regression M-estimators. Correspondingly, there is no common benchmark method for comparison purposes, as there is in the case of least squares. It seems clear that the existence of a large number of robust regression alternatives that are not well justified theoretically, along with the lack of a common benchmark, has been and continues to be a serious impediment to the routine use of robust regression in many quantitative fields, and in asset management in particular.

As a solution to the conundrum of too many robust regression estimators, we introduce a theoretically well-justified robust regression estimator called an mOpt estimator, originally introduced in Chapter 5 of Maronna et al. (2019) and the Konis and Martin (2021) article, and extensively demonstrate the method’s efficacy in fitting time series factor models. The mOpt estimator is obtained by replacing the quadratic function of regression residuals with a special bounded function of robustly scaled residuals, and its mathematical foundation was provided in the important technical papers of Yohai and Zamar (1997) and Svarc et al. (2002). Like all good robust regression methods and unlike least squares (LS), the mOpt estimator results in good factor model fits to most of the data and clearly identifies outliers. The mOpt method also results in accurate t-statistics that are representative of a large majority of the data. The mOpt estimator has the property that it minimizes the maximum estimator bias over a family of distributions that contains the normal distribution and an infinite number of non-normal outlier generating distributions, subject to a constraint of specified high normal distribution efficiency. A high normal distribution efficiency constraint means that the robust regression coefficient estimates will have variances that are not much larger than those of least squares (LS) when the regression model has normally distributed errors. From a data point of view, the mOpt estimator has an appealing intuitive representation as a weighted least squares estimator based on a smooth weighting function that is zero for robustly scaled regression residuals that are greater than 3.0 in magnitude. Data with scaled residuals greater than 3.0 in magnitude are deemed to be outliers, and in the idealized case of normally distributed data such points are declared to be outliers only very infrequently.

For the applications of robust regression estimators in single-factor time series models there are a number of relevant prior research results. The first two of these were Sharpe (1971) and Cornell and Dietrich (1978), both of whom focused on mean-absolute deviation (MAD) estimators, and based on their empirical studies concluded that the MAD estimator does not yield much improvement over LS. Subsequent papers by Chan and Lakonishok (1992), Mills and Coutts (1996) and Cloete et al. (2002) established superior performance of several regression quantile based beta estimators in the presence of fat-tailed non-normality and outliers, but those estimators have seen little common use. Martin and Simin (2003) introduced robust beta M-estimators similar to the one we discuss herein, and showed that they result in attractive beta point estimates and forecasts. The very interesting paper by Genton and Ronchetti (2008) focuses on robust forecasting of beta that combines Vasicek (1973) shrinkage with a regression M-estimator variant.Footnote 2 Finally, Bailer et al. (2011a) extends the work of Martin and Simin (2003) to a large cross section of U.S. stocks, and is a point of departure for the robust single factor models portion of the current paper.

It is curious that, in spite of the existing published research on robust estimation of CAPM betas, the literature is quite silent on the topic of robust fitting of multifactor models.Footnote 3 In particular, there is no mention whatsoever of robust regression in any of the many empirical asset pricing papers that use time series factor models, such as Fama and French (1993), Carhart (1997), Fama and French (2015), Hou and Dijk (2018), and Hou et al. (2020). This is also the case in the extensive and growing literature on anomalies, such as Fama and French (2008), Hou et al. (2020), and Feng et al. (2020), and the use of robust time series factor model fitting for research on anomalies is overdue.

The remainder of the paper is organized as follows. Section 2 discusses the mOpt robust regression method, and points out a basic lack of robustness of the well-known Huber robust regression method. Section 3.1 focuses on single factor CAPM beta model estimation for the universe of liquid stocks in the CRSP (Center for Research in Security Prices, LLC) database from 1963 to 2018, for which extensive empirical analysis of the distribution of LS versus robust mOpt betas reveals frequent substantial adverse influence of outliers on the LS betas. Section 3.2 focuses on mOpt and LS fits of the Fama-French 3-factor (FF3) and the Fama-French-Carhart 4-factor (FFC4) multifactor models to stock returns, using classical t-statistics and adjusted R-squared inference values for the LS fits and robust versions thereof for the mOpt fits. The results show that the mOpt robust fit is better than the LS fit for the FF3 model, and for the FFC4 model the mOpt fit is substantially better than that of LS. Section 3.2 also introduces a new robust test for significant differences between mOpt and LS coefficients and demonstrates its usefulness in comparing robust mOpt fits of the FF3 and FFC4 models. Section 4 introduces a robust final prediction error (RFPE) criterion for model selection, and demonstrates its efficacy. Section 5 provides a brief discussion and summary. The Appendices contain supplemental details concerning the mOpt robust regression method.

Efficient bias robust regression for time series factor models

We begin in Sect. 2.1 by defining the time series factor models to which we will apply our efficient bias robust mOpt regression estimator. Then Sect. 2.2 discusses general maximum-likelihood type M-estimators, which are defined in terms of a loss function \(\rho (x)\) applied to scaled prediction residuals, and presents the asymptotic covariance matrix formula for the coefficient estimates, which is used in defining mOpt efficiency relative to LS and in computing M-estimator coefficient standard errors. Section 2.3 introduces the intuitively appealing bounded loss function \(\rho _{\mathrm{mOpt}}(x)\) that defines the mOpt regression M-estimator, and shows that the estimator has a weighted least squares representation with data-dependent weights, where the latter are generated by a smooth weight function that is zero for scaled prediction residuals that are larger than 3.0 in absolute value. Section 2.4 discusses the important mOpt estimator bias robustness property of minimizing the maximum asymptotic bias for distributions in a Tukey-Huber family of two-component mixture distributions, one component of which is a normal distribution and the other of which is an arbitrary distribution function, many choices of which generate outliers that give rise to M-estimator bias. Finally, Sect. 2.5 shows that the well-known Huber robust regression estimator, which is defined in terms of an unbounded function \(\rho _{\mathrm{Huber}}(x)\), lacks bias robustness in comparison to mOpt.

Time series factor models

Throughout this paper we assume that asset returns are generated by a time series factor model of the form

$$\begin{aligned} r_{t}&\,=\,\beta _{0}+f_{1,t}\beta _{1}+f_{2,t}\beta _{2}+\cdots +f_{K,t}\beta _{K}+s\epsilon _{t}\nonumber \\&=\tilde{\mathbf {f}}_{t}^{\prime }\varvec{\upbeta }+s\epsilon _{t}\nonumber \\&=\left( 1,\mathbf {f}_{t}^{\prime }\right) \varvec{\upbeta }+s\epsilon _{t},\;t=1,2,\ldots ,T \end{aligned}$$
(1)

where the \(r_{t}\) are the (typically excess) returns of a specific asset at time t, \(\varvec{\upbeta }=(\beta _{0},\beta _{1},\ldots ,\beta _{K})^{\prime }\) is a \(K+1\) dimensional vector of unknown regression coefficients, the \(\epsilon _{t}\) are the regression errors, and

$$\begin{aligned} \mathbf {f}_{t}\,= \,& {} \left( f_{1,t},f_{2,t},\ldots f_{K,t}\right) ^{\prime },\;t=1,2,\ldots ,T \end{aligned}$$
(2)

is a time series of K dimensional random factor returns.

Examples of the above time series factor model include: (1) the CAPM model where \(K=1\), \(r_{t}=r_{t}^{e}\) is the asset return in excess of a risk-free rate, and \(f_{1,t}=r_{M,t}^{e}\) is the market return in excess of a risk-free rate; and (2) the Fama-French 3-factor model (FF3) where \(K=3\), with \(f_{1,t}=r_{M,t}^{e}\), \(f_{2,t}\) the small-minus-big (SMB) factor return, and \(f_{3,t}\) the high-minus-low (HML) factor return. The Fama-French-Carhart 4-factor model (FFC4) adds the Carhart (1997) momentum factor to the FF3 model, and other multifactor models include the Fama and French (2015) 5-factor model (FF5) and the Hou et al. (2015) q-factor model. When dealing with a cross section of N assets, say stocks, one replaces \(r_{t},\beta _{0},\beta _{1},\ldots ,\beta _{K},\epsilon _{t}\) with \(r_{it},\beta _{i0},\beta _{i1},\ldots ,\beta _{iK},\epsilon _{it}\) for \(i=1,2,\ldots ,N\).
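To make the notation concrete, the following Python sketch (our illustration, not from the paper; all numerical values are hypothetical) simulates T returns from a three-factor version of model (1) and recovers \(\varvec{\upbeta }\) by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 500, 3                                 # sample size, number of factors
beta = np.array([0.001, 1.0, 0.5, -0.3])      # (beta_0, beta_1, ..., beta_K), hypothetical
s = 0.02                                      # residual scale

F = rng.normal(0.0, 0.04, size=(T, K))        # factor returns f_t
F_tilde = np.column_stack([np.ones(T), F])    # tilde f_t = (1, f_t')'
r = F_tilde @ beta + s * rng.standard_normal(T)

# Least squares recovers the coefficients up to sampling error
beta_ls, *_ = np.linalg.lstsq(F_tilde, r, rcond=None)
```

With normally distributed errors and no outliers, as here, LS is entirely adequate; the point of the paper is what happens when that assumption fails.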

General regression M-estimators

A regression M-estimator \(\hat{\varvec{\upbeta }}\) for the model (1) is defined as a solution to the minimization problem

$$\begin{aligned} \hat{\varvec{\upbeta }}=\underset{\mathbf {\varvec{\upbeta }}}{\mathrm{argmin}}\left[ \sum _{t=1}^{T}\rho \left( \frac{r_{t}-\tilde{\mathbf {f}}_{t}^{\prime }\varvec{\upbeta }}{\hat{s}}\right) \right] \end{aligned}$$
(3)

where \(\rho (x)\) is a symmetric rho function for which \(\rho (0)=0\) and \(\rho (x)\) is non-decreasing for \(x>0\). The \(\hat{s}\) above is a robust estimate of the scale parameter s, which is computed prior to the above optimization as described in the last paragraph of Sect. 2.3. The corresponding estimating equation obtained by differentiating the above summation expression with respect to \(\mathbf {\varvec{\upbeta }}\) is:

$$\begin{aligned} \sum _{t=1}^{T}\tilde{\mathbf {f}}_{t}\psi \left( \frac{r_{t}-\tilde{\mathbf {f}}_{t}^{\prime }\hat{\varvec{\upbeta }}}{\hat{s}}\right) =0 \end{aligned}$$
(4)

where \(\psi (x)=\rho ^{\prime }(x)\) is the corresponding psi function, and

$$\begin{aligned} \hat{\epsilon }_{t}=\frac{r_{t}-\tilde{\mathbf {f}}_{t}^{\prime }\hat{\varvec{\upbeta }}}{\hat{s}} \end{aligned}$$
(5)

are the robustly scaled regression residuals.

Robust regression M-estimators were originally introduced by Huber (1973), and are extensively discussed in Chapters 4 and 5 of Maronna et al. (2019). The least squares estimator \(\hat{\varvec{\upbeta }}_{LS}\) is a special non-robust case of an M-estimator obtained with the quadratic rho function \(\rho (x)=x^{2}/2\) and correspondingly \(\psi (x)=x\). The least absolute deviation (LAD) estimator \(\hat{\varvec{\upbeta }}_{LAD}\) is obtained when \(\rho (x)=|x|\).Footnote 4

Regression M-estimator covariance matrix

It can be shown that under rather general conditions an M-estimator \(\hat{\varvec{\upbeta }}=\hat{\varvec{\upbeta }}(\psi )\) based on a sample of size T is asymptotically normal with a \((K+1)\times (K+1)\) covariance matrix

$$\begin{aligned} \mathbf {V}_{\tilde{\mathrm {f}}}(\psi )=\mathrm {v}(\psi ,F_{0})\mathbf{M }_{\tilde{\mathrm {f}}}^{-1} \end{aligned}$$
(6)

where

$$\begin{aligned} \mathbf{M }_{\tilde{\mathrm {f}}}=\mathrm {E}\left( \tilde{\mathbf {f}}_{t}\tilde{\mathbf {f}}_{t}^{\prime }\right) =\left( \begin{array}{cc} 1 &{} \varvec{\upmu }_{\mathrm {f}}^{\prime }\\ \varvec{\upmu }_{\mathrm {f}} &{} \mathbf {M}_{\mathbf {f}} \end{array}\right) \end{aligned}$$
(7)

is the moment matrix of the \(\tilde{\mathbf {f}}_{t}\) in (1), the formula for \(\mathrm {v}(\psi ,F_{0})\) is given by (8), \(\mathbf {M}_{\mathbf {f}}\) is the moment matrix of the factor returns vector \(\mathbf {f}_{t}\), and \(\varvec{\upmu }_{\mathrm {f}}=E(\mathbf {f}_{t})\).

The scalar quantity \(\mathrm {v}(\psi ,F_{0})\) is the variance of an M-estimator of location introduced by Huber (1964), who derived the expression

$$\begin{aligned} {\mathrm {v}}(\psi ;F_{0})=s^{2}\dfrac{{E}_{F_{0}}\psi ^{2} \left( \epsilon _{1}\right) }{{E}_{F_{0}}^{2}\psi ^{\prime }\left( \epsilon _{1}\right) } \end{aligned}$$
(8)

where \(F_{0}\) is a standard distribution of the \(\epsilon _{t}\) in (1), for example a standard normal or a standard t-distribution. A formal statement of the asymptotic normality of a regression M-estimator with asymptotic covariance matrix of the form (6) is provided in Section 5.3.1.3 of Maronna et al. (2019), and a proof of the result is provided in Section 10.2.2.

Use of a standard matrix inversion formula for partitioned matrices allows one to express the inverse of the moment matrix (7) in the form

$$\begin{aligned} \mathbf{M }_{\tilde{\mathrm {f}}}^{-1}=\left( \begin{array}{cc} 1+\varvec{\upmu }_{\mathrm {f}}^{\prime }\mathbf {C}_{\mathrm {f}}^{-1}\varvec{\upmu }_{\mathrm {f}} &{} -\varvec{\upmu }_{\mathrm {f}}^{\prime }\mathbf {C}_{\mathrm {f}}^{-1}\\ -\mathbf {C}_{\mathrm {f}}^{-1}\varvec{\upmu }_{\mathrm {f}} &{} \mathbf {C}_{\mathrm {f}}^{-1} \end{array}\right) \end{aligned}$$
(9)

where \(\mathbf {C}_{\mathrm {f}}\) is the covariance matrix of the factor returns \(\mathbf {f}_{t}^{\prime }=(f_{1,t},f_{2,t},\ldots f_{K,t})\).Footnote 5 For a time series factor model (1) without an intercept \(\beta _{0}\) the right-hand side of the above expression reduces to \(\mathbf {C}_{\mathrm {f}}^{-1}\).

It is important to note that the regression M-estimator covariance matrix (6) depends on the choice of rho function \(\rho (x)\), and hence the corresponding psi function \(\psi (x)\), only through the scalar expression (8). In the special case of the LS estimator where \(\psi _{LS}(x)=x\) and \(F_{0}\) is a standard normal distribution \(\Phi\), \({\mathrm {v}}(\psi _{LS};\Phi )=s^{2}\), which is the variance of \(s\epsilon _{1}\). Standard errors for the regression M-estimator coefficients are obtained in the usual way by dividing estimates of the diagonal elements of (6) by the sample size T and taking the square root of the results. For details see Appendix A3.
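As a numerical sanity check on (6)–(9) in the least squares case, where \(\mathrm {v}(\psi _{LS};\Phi )=s^{2}\), the following sketch (our illustration, with a single hypothetical factor) compares the standard errors implied by (6) to Monte Carlo standard deviations of the LS coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
T, s = 400, 0.03
mu_f, sd_f = 0.005, 0.04                      # factor mean and SD (hypothetical)
beta = np.array([0.0, 1.0])

# Moment matrix (7) of tilde f = (1, f)' and covariance (6) with v = s^2
M = np.array([[1.0, mu_f], [mu_f, mu_f**2 + sd_f**2]])
se_theory = np.sqrt(s**2 * np.diag(np.linalg.inv(M)) / T)

# Monte Carlo standard deviations of the LS coefficient estimates
est = np.empty((3000, 2))
for i in range(est.shape[0]):
    f = rng.normal(mu_f, sd_f, T)
    X = np.column_stack([np.ones(T), f])
    r = X @ beta + s * rng.standard_normal(T)
    est[i], *_ = np.linalg.lstsq(X, r, rcond=None)
se_mc = est.std(axis=0)
```

The two agree to within Monte Carlo error; by (9) the slope standard error is \(s/(\sqrt{T}\,\sigma _{f})\).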

The mOpt robust regression estimator

The robust regression estimator that we use throughout the paper, and recommend in general, is a regression M-estimator based on the rho function \(\rho _{\mathrm{mOpt}}(x)\) and psi function \(\psi _{\mathrm{mOpt}}(x)\) shown in Fig. 1. We call the estimator an mOpt estimator because it is motivated by an efficiency constrained optimal bias robustness property described in Sect. 2.4. The rho and psi functions in Fig. 1 have the following intuitive appeal for fitting factor models. The actual distribution of the \(\epsilon _{t}\) in (1) is typically approximately normal in the central part of the distribution, but has an unknown fat-tailed and typically skewed distribution outside the central region. Correspondingly, the rho function is quadratic for a large portion of the central region \((-3,+3)\), where it behaves like the least squares loss function, and is constant outside the interval \((-3,+3)\), where it is indifferent to the size of the scaled regression residuals in (3). This behavior represents an agnostic assignment of equal losses to all pairs \((r_{t},\tilde{\mathbf {f}}_{t}^{\prime })\) for which the absolute value of the robustly scaled residuals is larger than 3.00. Correspondingly, the mOpt psi function has value zero for all such pairs \((r_{t},\tilde{\mathbf {f}}_{t}^{\prime })\), which therefore have no influence on the regression parameter estimates \(\hat{\varvec{\upbeta }}_{\mathrm{mOpt}}\). An M-estimator psi function that is zero outside some finite interval is called a redescending psi function in the robust statistics literature.Footnote 6

Fig. 1

Bounded \(\rho _{\mathrm{mOpt}}(x)\) and corresponding \(\psi _{\mathrm{mOpt}}(x)\) for which \(\psi _{\mathrm{mOpt}}(x)=0\) and \(\rho _{\mathrm{mOpt}}(x)\) is constant for \(|x|\ge 3.00\)

A general formula for \(\psi _{\mathrm{mOpt}}(x)\) that depends on a tuning parameter c is given in Appendix A1, where it is seen that \(\psi _{\mathrm{mOpt}}(x)=0\) for \(|x|\ge c\). Since the rho function \(\rho _{\mathrm{mOpt}}(x)\) is the integral of the psi function \(\psi _{\mathrm{mOpt}}(x)\), it follows that the rho function has a constant value for \(|x|\ge c\). The parameter c controls the normal distribution efficiency of the mOpt estimator, where efficiency is defined as the ratio of the asymptotic variance of the least squares estimator to that of the mOpt estimator, often expressed as a percent. A formula for the normal distribution efficiency of the mOpt estimator in terms of the tuning constant c is given in Appendix A2. Since an LS estimator has the minimum possible variance for normal distributions, an mOpt estimator has an efficiency less than 100%. The choice of the tuning constant c thus controls a trade-off between normal distribution efficiency and robustness toward outliers. Increasing the value of c results in the rho function \(\rho _{\mathrm{mOpt}}(x)\) becoming closer to the least squares quadratic rho function, with correspondingly higher normal distribution efficiency but less robustness toward outliers, and conversely. Throughout this paper, as in Fig. 1, the mOpt estimator is based on the choice \(c=3.0\), which results in a normal distribution efficiency of 95%.

For the 95% efficient mOpt estimator that we use throughout this paper, and recommend in general, the ratio of the asymptotic mOpt variance to the least squares variance is 1.0526 (to 4 digits), which means that the ratio of asymptotic standard deviations is 1.026 (to 3 digits). This suggests that in practice the standard errors (SE) of an mOpt estimator will be only about 2.6% larger than those of the least squares estimator. Thinking of the normal distribution excess standard error (SE) as an “insurance premium” to be paid in order to provide protection against bias and inflated variability due to outlier generating non-normal distributions, an SE premium of 2.6% seems quite “inexpensive.” We propose that a 95% normal distribution efficiency mOpt estimator be used as a benchmark for evaluating the performance of not only an LS alternative, but also other robust regression estimator alternatives.
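The quoted figures follow directly from the 95% efficiency, as this small check confirms:

```python
import math

efficiency = 0.95                      # normal-distribution efficiency of mOpt
var_ratio = 1.0 / efficiency           # asymptotic variance ratio, mOpt to LS
sd_ratio = math.sqrt(var_ratio)        # the standard error "insurance premium"

print(round(var_ratio, 4), round(sd_ratio, 3))   # prints 1.0526 1.026
```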

Iteratively reweighted least squares and MM-estimator

A regression M-estimator psi function \(\psi (x)\) defines a weight function

$$\begin{aligned} w(x)=\dfrac{\psi (x)}{x} \end{aligned}$$
(10)

that allows one to express the solution \(\hat{\varvec{\upbeta }}\) of the estimating Eq. (4) in the form of a weighted least squares (WLS) estimate. To see that this is the case, note that (4) can be written in the form

$$\begin{aligned} \sum _{t=1}^{T}w_{t}\tilde{\mathbf {f}}_{t}\left( r_{t}-\tilde{\mathbf {f}}_{t}^{\prime }\hat{\varvec{\upbeta }}\right) =0 \end{aligned}$$
(11)

where the \(w_{t}\) are the data-dependent weights:

$$\begin{aligned} w_{t}\,=\,w_{t}(\hat{\varvec{\upbeta }};\,\hat{s})=w\left( \frac{r_{t}-\tilde{\mathbf {f}}_{t}^{\prime }\hat{\varvec{\upbeta }}}{\hat{s}}\right) . \end{aligned}$$
(12)

Thus the estimate \(\hat{\varvec{\upbeta }}\) may be expressed in the nonlinear weighted least squares (WLS) mathematical form:

$$\begin{aligned} \hat{\varvec{\upbeta }}=\left( \sum _{t=1}^{T}w_{t}\left( \hat{\varvec{\upbeta }};\,\hat{s}\right) \tilde{\mathbf {f}}_{t}\tilde{\mathbf {f}}_{t}^{\prime }\right) ^{-1}\left( \sum _{t=1}^{T}w_{t} (\hat{\varvec{\upbeta }};\,\hat{s})\tilde{\mathbf {f}}_{t}r_{t}\right) . \end{aligned}$$
(13)

The weight function \(w_{\mathrm{mOpt}}(x)=\psi _{\mathrm{mOpt}}(x)/x\) for our mOpt estimator \(\hat{\varvec{\upbeta }}_{\mathrm{mOpt}}\) is shown in Fig. 2.

Fig. 2

The mOpt weight function (\(c=3.00\))

The mOpt weight function gives a weight of 1 to all sufficiently small robustly scaled residuals \((r_{t}-\tilde{\mathbf {f}}_{t}^{\prime }\hat{\varvec{\upbeta }}_{\mathrm{mOpt}})/\hat{s}\), and smoothly transitions to zero weight for scaled residuals whose absolute value is greater than 3.00. All asset and factor returns pairs \((r_{t},\tilde{\mathbf {f}}_{t}^{\prime })\) whose scaled residuals have absolute values larger than 3.0 are said to be rejected. In this regard, the mOpt robust regression estimator has the highly intuitive appeal that it is a robust, smoothed version of the non-robust classical 3-sigma edit rule, according to which all data values in a sample that differ from the sample mean by at least 3 times the sample standard deviation are deleted. Under the assumptions that the residuals in (1) are normally distributed and the estimates \(\hat{s}\) and \(\hat{\varvec{\upbeta }}_{\mathrm{mOpt}}\) are equal to the true parameter values, the probability that a pair \((r_{t},\tilde{\mathbf {f}}_{t}^{\prime })\) is rejected by the mOpt estimator is the probability that a standard normal random variable is larger than 3.00 in magnitude, which is only 0.27%. Thus, such data pairs are rejected only very rarely in the highly idealized case of normally distributed residuals.
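The 0.27% figure is simply the two-sided normal tail probability beyond 3, which can be checked with the Python standard library alone:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_reject = 2.0 * (1.0 - Phi(3.0))       # P(|Z| > 3) for Z ~ N(0, 1)
print(round(100.0 * p_reject, 2))       # prints 0.27 (percent)
```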

An important aspect of the mOpt WLS version (13) is that it lends itself to the iteratively-reweighted least squares (IRWLS) computation, whose mathematical representation is

$$\begin{aligned} \hat{\varvec{\upbeta }}^{k+1}=\left( \sum _{t=1}^{T}w_{t}(\hat{\varvec{\upbeta }}^{k};\,\hat{s})\tilde{\mathbf {f}}_{t}\tilde{\mathbf {f}}_{t}^{\prime }\right) ^{-1}\left( \sum _{t=1}^{T}w_{t}(\hat{\varvec{\upbeta }}^{k};\,\hat{s})\tilde{\mathbf {f}}_{t}r_{t}\right) ,\,k=0,1,2,\ldots \end{aligned}$$
(14)

with an initial estimate \(\hat{\varvec{\upbeta }}^{0}\). It is shown in Section 9.1 of Maronna et al. (2019) that for M-estimators based on rho functions with a similar shape as \(\rho _{\mathrm{mOpt}}(x)\), the IRWLS algorithm converges to a solution of the estimating Eq. (4). However, such a solution may yield only a local minimum of the objective function in (3).

The above IRWLS algorithm needs a highly robust initial estimate \(\hat{\varvec{\upbeta }}^{0}\) that helps ensure that a global minimum of (3) is achieved. An estimator that serves this purpose well is an S-estimator, which was introduced by Rousseeuw and Yohai (1984) and is described in Section 5.4.1 of Maronna et al. (2019). An S-estimator is a special form of M-estimator that jointly estimates the regression coefficient \(\varvec{\upbeta }\) and the error term scale parameter s, and has a high breakdown point but a low normal distribution efficiency of about 29%. The S-estimator regression coefficient estimate is used as the initial estimate of the IRWLS algorithm (14), and the S-estimator scale estimate \(\hat{s}\) is used in computing the robustly scaled prediction residuals given by (5). The breakdown point (BP) of an estimator is the largest fraction of data outliers whose values can tend to infinity without taking the estimator to the boundary of the parameter space, which is infinity in the case of regression, and a high-breakdown point estimator is one with a breakdown point of 0.5, or approximately so.Footnote 7 An M-estimator that uses a high BP but inefficient initial estimator is called an MM-estimator, whose theoretical properties were developed by Yohai (1987). The regression estimator used for the examples in this paper, and recommended in general, is an mOpt MM-estimator, but for simplicity we just refer to it as the mOpt estimator.Footnote 8 The mOpt estimator may be computed using the function lmrobdetMM in the RobStatTM R package that is available at CRAN https://cran.r-project.org/web/packages/RobStatTM/index.html.
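A minimal Python sketch of the IRWLS iteration (14) follows. It is our illustration, not the RobStatTM implementation: the redescending Tukey bisquare weight (with the same zero-weight cutoff at 3) stands in for \(w_{\mathrm{mOpt}}\), whose formula is in Appendix A1, and for simplicity the initial estimate is LS with a MAD residual scale, whereas the paper uses an S-estimator for both; the shortcut suffices here because the example has only moderate leverage.

```python
import numpy as np

def w_bisquare(x, c=3.0):
    # Redescending weight: near 1 for small |x|, exactly 0 for |x| >= c.
    # Illustrative stand-in for w_mOpt, with the same cutoff at 3.
    u = np.minimum(np.abs(x) / c, 1.0)
    return (1.0 - u**2) ** 2

def irwls(r, F_tilde, s_hat, beta_init, n_iter=50):
    beta = beta_init.copy()
    for _ in range(n_iter):
        w = w_bisquare((r - F_tilde @ beta) / s_hat)      # weights as in (12)
        WF = F_tilde * w[:, None]
        beta = np.linalg.solve(WF.T @ F_tilde, WF.T @ r)  # WLS step as in (14)
    return beta

rng = np.random.default_rng(2)
T = 200
f = rng.standard_normal(T)
r = 1.0 * f + 0.5 * rng.standard_normal(T)   # true beta_0 = 0, beta_1 = 1
r[np.argsort(f)[-10:]] += 15.0               # 10 outliers at the largest f_t

F_tilde = np.column_stack([np.ones(T), f])
beta_ls, *_ = np.linalg.lstsq(F_tilde, r, rcond=None)

# Crude robust scale from the initial residuals (the paper uses an S-estimate)
res = r - F_tilde @ beta_ls
s_hat = 1.4826 * np.median(np.abs(res - np.median(res)))
beta_rob = irwls(r, F_tilde, s_hat, beta_ls)
```

The outliers pull the LS slope far above the true value of 1, while the reweighted fit assigns them zero weight and recovers a slope near 1.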

Efficient bias robustness of the mOpt regression estimator

Here, we describe an efficient bias robustness property of the mOpt regression estimator that does not exist for any other robust regression estimator, and in particular does not exist for any of the 10 SAS M-estimators mentioned in Sect. 1. The bias robustness optimality of the mOpt estimator is obtained with respect to the Tukey-Huber family of distributions

$$\begin{aligned} F_{\gamma }(r,\mathbf{f })=(1-\gamma )N\left( r-\mathbf {f}^{\prime }\varvec{\upbeta }\right) G(\mathbf{f })+\gamma H(r,\mathbf{f }),0\le \gamma <0.5 \end{aligned}$$
(15)

for \((r_{t},\mathbf{f }_{t})\), where N is a normal distribution for the regression model errors \(s\epsilon _{t}\), \(G(\mathbf{f })\) is a fixed multivariate distribution function of factor returns \(\mathbf{f }_{t}\), and \(H(r,\mathbf{f })\) is an unrestricted joint distribution.Footnote 9 When \(\gamma =0\) the distribution \(F_{0}(r,\mathbf{f })\) generates data with normally distributed errors and the MLE of \(\varvec{\upbeta }\) is the least squares estimator, but when \(\gamma >0\) many distributions \(H(r,\mathbf{f })\) result in \(F_{\gamma }(r,\mathbf{f })\) generating outliers with probability \(\gamma\). Correspondingly, when \(\gamma >0\) both LS and M-estimators will be biased for some \(H(r,\mathbf{f })\). The optimality problem is to find a regression M-estimator, called Opt, that minimizes the maximum asymptotic bias with respect to all possible \(H(r,\mathbf{f })\), subject to a constraint of user specified high normal distribution efficiency. The latter constraint ensures that the Opt performance is only slightly worse than that of least squares when \(\gamma =0\) in the Tukey-Huber regression model. Yohai and Zamar (1997) solved this problem locally for small \(\gamma\), and the mOpt estimator we discuss here is a very slight modification of the Opt estimator, needed to ensure convergence of the IRWLS algorithm in Sect. 2.3.1. Later, Svarc et al. (2002) solved the global optimality problem for all \(0<\gamma <0.5\), and they show that the Yohai and Zamar local Opt solution is a remarkably good approximation to the global solution.
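As a concrete illustration of the family (15), the following sketch (our construction; the choices of G, H, and \(\gamma\) are arbitrary and purely illustrative) generates data that follow the nominal normal regression model with probability \(1-\gamma\) and come from an outlier-generating H with probability \(\gamma\).

```python
import numpy as np

rng = np.random.default_rng(3)
T, gamma = 100_000, 0.05
beta0, beta1, s = 0.0, 1.0, 1.0              # hypothetical model parameters

# Nominal component: f ~ G (standard normal here), normal regression errors
f = rng.standard_normal(T)
r = beta0 + beta1 * f + s * rng.standard_normal(T)

# Contaminating component H: unrestricted in (15); here a tight cluster
# at (f, r) = (10, -10), chosen purely for illustration
contaminated = rng.random(T) < gamma
n_out = contaminated.sum()
f[contaminated] = 10.0 + 0.1 * rng.standard_normal(n_out)
r[contaminated] = -10.0 + 0.1 * rng.standard_normal(n_out)
```

Each observation is an outlier with probability \(\gamma\), and a clustered H like this one biases both LS and non-robust M-estimators, which is exactly the bias the Opt criterion minimizes in the worst case.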

The Huber estimator's lack of robustness

It was shown by Martin et al. (1989) that the asymptotic bias of an M-estimator with an unbounded rho function \(\rho (x)\) is unbounded with respect to all possible \(H(r,\mathbf{f })\) in the Tukey-Huber family (15), but such bias is bounded for a bounded \(\rho (x)\). Thus, it is not surprising that the Opt estimator discovered later by Yohai and Zamar (1997) has a bounded rho function, as does the mOpt variant. On the other hand, the well-known Huber (1973) regression estimator based on the rho function

$$\begin{aligned} \rho _{\mathrm{Huber}}(x;c)={\left\{ \begin{array}{ll} 0.5x^{2}, &{} |x|\le c\\ c|x|-\frac{1}{2}c^{2}, &{} |x|>c \end{array}\right. } \end{aligned}$$
(16)

is the leading example of a regression M-estimator based on an unbounded rho function. The corresponding Huber psi function is the monotonic function:Footnote 10

$$\begin{aligned} \psi _{\mathrm{Huber}}(x;c)={\left\{ \begin{array}{ll} x, &{} |x|\le c\\ c\cdot \mathrm {sgn}(x), &{} |x|>c\,. \end{array}\right. } \end{aligned}$$
(17)

Unlike the redescending psi function of the mOpt estimator, the monotonic Huber psi function has the undesirable consequence that it does not reject outliers. The Huber rho and psi functions are displayed in Fig. 3 for the tuning constant choice \(c=1.345\), which results in an estimator with 95% normal distribution efficiency.

Fig. 3

Huber Rho and Psi function for \(c=1.345\)
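The 95% efficiency at \(c=1.345\) can be verified from (8): with \(s=1\) and \(F_{0}=\Phi\), the normal expectations of \(\psi _{\mathrm{Huber}}\) have closed forms, and the efficiency is the reciprocal of \(\mathrm {v}(\psi ;\Phi )\). The check below is our own, using only the Python standard library.

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def huber_efficiency(c):
    # For Z ~ N(0,1):  E[psi'(Z)] = P(|Z| <= c), and
    # E[psi(Z)^2] = E[Z^2 1{|Z| <= c}] + c^2 P(|Z| > c), both in closed form
    e_dpsi = 2.0 * Phi(c) - 1.0
    e_psi2 = (2.0 * Phi(c) - 1.0 - 2.0 * c * phi(c)
              + 2.0 * c * c * (1.0 - Phi(c)))
    return e_dpsi**2 / e_psi2     # = 1 / v(psi; Phi) when s = 1

print(round(huber_efficiency(1.345), 3))   # prints 0.95
```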

The Huber regression estimator is quite popular, partly because of the seminal min-max variance robustness property established for estimation of a location parameter in Huber (1964) and inherited by the Huber (1973) regression M-estimator, but even more so because, unlike the non-convex function \(\rho _{\mathrm{mOpt}}(x;c)\), the function \(\rho _{\mathrm{Huber}}(x;c)\) is convex. Consequently, the Huber regression M-estimator may be easily computed using readily available convex optimization software.Footnote 11 Nonetheless, the Huber estimator can have arbitrarily large bias for distributions in the Tukey-Huber family, and this theoretical property has quite practical relevance, as is demonstrated by examples in Sect. 3.1, where outliers cause the Huber regression estimator to be as badly biased as least squares (see Fig. 5).

While the theoretical demonstration in Martin et al. (1989) of unbounded bias for regression M-estimators with unbounded rho functions is highly technical, the following simple heuristic argument shows why the unbounded rho function and monotonic psi function of the Huber regression estimator make it vulnerable to arbitrarily large bias due to outliers of unbounded size. Consider the single factor model (18) with intercept \(\beta _{0}=0\), arbitrary \(\beta _{1}\), and residual scale parameter \(s=1\). In that case the Huber M-estimator equation is:

$$\begin{aligned} \sum _{t=1}^{T}f_{t}\psi _{\mathrm{Huber}}\left( r_{t}-\hat{\beta _{1}}f_{t}\right) =0. \end{aligned}$$

Suppose that the factor return at a given time, say for example \(f_{1}\) at \(t=1\), takes on an arbitrarily large absolute value with a corresponding arbitrary asset return value \(r_{1}\). For any pair of values \(r_{1}\) and \(f_{1}\) with \(r_{1}\ne \hat{\beta _{1}}f_{1}\) and \(f_{1}\) arbitrarily large, the term \(f_{1}\psi _{\mathrm{Huber}}(r_{1}-\hat{\beta _{1}}f_{1})\) will be non-zero and arbitrarily large, thereby dominating \(\sum _{t=2}^{T}f_{t}\psi _{\mathrm{Huber}}(r_{t}-\hat{\beta _{1}}f_{t})\). Consequently, the solution of the above estimating equation will in the limit, as \(f_{1}\) gets arbitrarily large, be the solution of

$$\begin{aligned} f_{1}\psi _{\mathrm{huber}}\left( r_{1}-\hat{\beta _{1}}f_{1}\right) =0 \end{aligned}$$

which is \(\hat{\beta _{1}}=r_{1}/f_{1}\). Since this estimate can take on any value determined by the ratio \(r_{1}/f_{1}\), \(\hat{\beta _{1}}\) can have an unbounded bias \(\mathrm {B}=r_{1}/f_{1}-\beta _{1}\). Note that this does not happen in the case of the redescending psi function \(\psi (x)=\psi _{\mathrm{mOpt}}(x)\) because for fixed \(r_{1}\) the term \(f_{1}\psi _{\mathrm{mOpt}}(r_{1}-\hat{\beta _{1}}f_{1})\) will be zero for arbitrarily large \(|f_{1}|\).
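This limiting behavior is easy to reproduce numerically. The following minimal sketch (our own illustration, not the paper's code) fits the slope-only model by iteratively reweighted least squares, using the Huber psi and, as a stand-in for the redescending \(\psi _{\mathrm{mOpt}}\), the Tukey bisquare psi; the median-of-slopes starting value, the MAD scale estimate, and the tuning constants are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def psi_huber(u, c=1.345):
    # monotonic Huber psi
    return np.clip(u, -c, c)

def psi_bisquare(u, c=4.685):
    # redescending Tukey bisquare psi, a stand-in for psi_mOpt
    return np.where(np.abs(u) <= c, u * (1.0 - (u / c) ** 2) ** 2, 0.0)

def irls_slope(f, r, psi, beta, s, n_iter=100):
    # slope-only M-estimate, solved by iteratively reweighted least squares
    for _ in range(n_iter):
        u = (r - beta * f) / s
        safe_u = np.where(u == 0.0, 1.0, u)
        w = np.where(u == 0.0, 1.0, psi(safe_u) / safe_u)  # weights psi(u)/u
        beta = np.sum(w * f * r) / np.sum(w * f * f)
    return beta

T = 104                                  # two years of weekly returns
f = rng.normal(0.0, 0.02, T)             # factor (market) returns
r = 1.0 * f + rng.normal(0.0, 0.01, T)   # true slope beta_1 = 1, zero intercept
f[0], r[0] = 5.0, 15.0                   # planted leverage outlier with r_1/f_1 = 3

beta_ls = np.sum(f * r) / np.sum(f * f)  # LS slope, pulled toward r_1/f_1
start = np.median(r[1:] / f[1:])         # crude robust starting slope
res = r - start * f
s = 1.4826 * np.median(np.abs(res - np.median(res)))  # MAD residual scale
beta_huber = irls_slope(f, r, psi_huber, beta_ls, s)
beta_bisq = irls_slope(f, r, psi_bisquare, start, s)
```

With the planted leverage pair, both the LS and Huber slopes end up near \(r_{1}/f_{1}=3\), while the redescending fit rejects the outlier and stays near the true slope of 1, exactly as the heuristic argument predicts.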

Robust time series factor models

The mOpt robust regression fitting method can be applied in a wide variety of time series factor models, including single-factor index models and multi-factor models such as the macro-economic models of Chen et al. (1986), the style analysis models introduced by Sharpe (1988) and discussed in Sharpe (1992), and an increasingly wide variety of empirical asset pricing factor models. Early examples of the latter are the Fama and French (1993) three-factor model (FF3) and the Fama-French-Carhart four-factor model (FFC4) that adds the Carhart (1997) momentum factor to the FF3 model. More recently introduced empirical asset pricing models include the Fama and French (2015) five-factor model (FF5), and the q-factor and augmented q-factor models discussed in Hou et al. (2015) and Hou et al. (2020). See also Feng et al. (2020) who refer to the wide variety of factor models as “The Factor Zoo”. Given the large number of time series factor models that currently exist in the literature, and the likelihood of more in the future, the potential value of mOpt robust fitting of these models is considerable.

We illustrate mOpt versus LS fitting first for the single factor model that is used extensively by practitioners and financial data service providers to compute least squares CAPM beta estimates. For the CAPM betas we show first by a few striking examples, and then by extensive empirical analysis of the cross sections of mOpt and LS beta estimates based on two-year intervals of weekly returns, that LS beta estimates are often considerably biased by outliers that have little influence on the mOpt estimates. In particular, we show that LS and mOpt betas differ in absolute value by at least 0.3 for roughly 26% of microcap stocks, 14% of smallcaps and 7% of bigcaps, and differ by at least 0.5 for roughly 12% of microcap stocks, 5% of smallcaps and 2% of bigcaps. Then we present striking multifactor model applications of the mOpt versus LS fits of stock returns for the FF3 model and the FFC4 model. We also introduce a test for a significant difference between mOpt and LS coefficient estimates, and apply it to both the single factor and the multifactor models. Finally, we introduce a new robust model selection criterion called RFPE as an alternative to the non-robust AIC based method, and illustrate the use of RFPE versus AIC in selecting a best subset of the FFC4 model factors.

Single factor time series models

A single factor model has the form

$$\begin{aligned} r_{t}=\alpha +\beta f_{t}+s\epsilon _{t},\quad t=1,2,\ldots ,T \end{aligned}$$
(18)

where \(r_{t}\) is an equity return and \(f_{t}\) a factor return that is typically a market proxy or an active manager’s index benchmark such as the S&P500, Russell 1000, Russell 2000, or Russell 3000, among many others. When \(f_{t}\) is a market return \(r_{M,t}\), and both the equity return and market return are excess returns relative to a risk-free rate, this is the time series version of the CAPM.

Here, we extend the robust beta M-estimators study of Bailer et al. (2011b) in the following ways: (a) we use the improved mOpt estimator based on the analytic form of \(\psi _{\mathrm{mOpt}}(x)\) and its integral \(\rho _{\mathrm{mOpt}}(x)\), rather than a polynomial approximation of these functions; (b) we demonstrate by example a robustness inadequacy of the Huber M-estimator rho and psi functions discussed in Sect. 2.5; (c) we extend the study period from 1963–2009 to 1963–2018, and compute mOpt and LS betas using weekly returns for cross sections of liquid stocks on 28 contiguous two-year intervals. We define liquid stocks as those that have at least 100 non-zero returns in a two-year interval. The numbers of such stocks in each two-year window vary from a minimum of 1,725 for 1963-1964 to a maximum of 7,126 for 1996-1998, with an average of 4,207 stocks over all two-year intervals.
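The liquidity screen just described is a simple count filter; a minimal sketch (assuming a weeks-by-stocks return matrix, with function and argument names of our own choosing):

```python
import numpy as np

def liquid_mask(weekly_returns, min_nonzero=100):
    """Flag 'liquid' stocks: columns (stocks) of a (weeks x stocks) return
    matrix with at least `min_nonzero` non-zero returns in the window."""
    weekly_returns = np.asarray(weekly_returns)
    return (weekly_returns != 0).sum(axis=0) >= min_nonzero
```

The mask can then be used to subset the cross section of stocks before computing betas on each two-year window.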

We begin with the two examples in Fig. 4 that display the results of computing robust mOpt and LS beta estimates using two years of weekly returns for the stocks with tickers OFG and DD. The legend gives the beta estimate values with their standard errors in parentheses. The dotted lines, parallel to the solid line mOpt fit, define strips outside of which the stock and market returns pairs are rejected by the mOpt estimator, and as such have no influence on the fit. The rejected data points, which we define to be outliers, are indicated by the open circle symbols. For the OFG stock results in the left-hand plot, the mOpt estimator rejects 8 outliers, 6 of them barely so. In this example the mOpt robust method provides a very good fit to the bulk of the data in spite of the huge outlier in the upper-right part of the figure, but the LS estimator is quite adversely influenced by that outlier and fails to provide a good fit to the bulk of the data. The difference of 2.26 between the LS and mOpt beta estimates of OFG would be of concern to any financial analyst or portfolio manager. Furthermore, the 2.26 difference in the OFG LS and mOpt slopes is a little over 9 times the mOpt robust standard error value of 0.25, indicating that the difference between the LS and mOpt beta estimates is very highly significant for OFG.

Fig. 4
figure 4

mOpt and LS fits for OFG and DD

The results for the stock with ticker DD in the right-hand plot of Fig. 4, in which there is a huge outlier in the lower left corner of the plot, are dramatically different from those in the left-hand plot in that the mOpt and LS fits are essentially identical for the stock DD. That huge outlier is due to the October 19, 1987 (“Black Monday”) U.S. market crash, when the Dow Jones fell 22.6%. The mOpt robust estimator does not reject that outlier because it is consistent with the mOpt fit of DD returns to the market returns for the rest of the data. This is a general property of the mOpt estimator that makes good sense.

Huber estimator lack of robustness examples

The lack of robustness of the Huber regression estimator discussed in Sect. 2.5 manifests itself in Huber beta estimates having slopes that are sometimes quite close to those of the LS model fits. This behavior is illustrated in the two plots of Fig. 5, where the Huber estimates are almost the same as the LS estimates, and therefore provide no protection against outlier bias. It is the vulnerability to leverage outliers that causes the Huber estimate to be so close to the LS estimate in the figure, where a leverage outlier in the present context is loosely defined to be a market return and stock return pair for which the market return is extreme and the data pair is not consistent with a robust fit to the bulk of the data.Footnote 12 In this regard, the October 19, 1987 market return in the lower-left of the right-hand plot of Fig. 4 is a market return outlier but it is not a leverage outlier, since the market return and stock return pair is consistent with the robust fit to the bulk of the data.

While the Huber beta estimate is sometimes close to the LS beta estimate, as in the extreme case of the nearly equal betas in Fig. 5, it turns out that the Huber beta estimate is also sometimes close to or nearly equal to the mOpt beta estimate. In order to quantify the extent to which the Huber beta estimate is close to the LS beta estimate, we computed the percent of the stocks in each of the 28 cross sections studied for which the Huber beta is closer to the LS beta than to the mOpt beta. The percent ranges from a minimum of 34% to a maximum of 47%, which we regard as a substantial lack of robustness of the Huber estimator. Consequently, the Huber estimator should only be used with full recognition of this limitation, and we strongly recommend instead the mOpt estimator when considering a robust regression estimator in practice.

Fig. 5
figure 5

mOpt, Huber and LS fits

The cross section distribution of robust mOpt and LS betas

The existence of differences between robust mOpt betas and LS betas, such as those discussed above, motivated us to carry out a detailed empirical analysis of the behavior of these two beta estimates for large cross sections of U.S. stocks from the CRSP database. Specifically, we computed mOpt and LS betas using two years of weekly returns of “liquid” stocks during each of the 28 contiguous two-year intervals during 1963 to 2018. Here, we take liquid stocks to be all those in the CRSP database that have at least 100 non-zero returns in a two-year interval. Then, we split the mOpt and LS beta estimates into microcap, smallcap and bigcap groups, where we follow Fama and French (2008) and Hou et al. (2020) in using the 20th and 50th percentiles of the NYSE capitalization data as breakpoints.

Figure 6 shows the time series of the resulting biennial microcap, smallcap and bigcap group counts. The rapid growth of microcap stocks from the early 1980’s to the late 1990’s is striking, as is the abrupt subsequent reversal to declining numbers of microcaps that was precipitated by the dot-com bubble collapse. From the early 1990’s until just before the onset of the financial crisis in 2007, the relative growth of microcap, smallcap and bigcap stocks was such that they had approximately constant percentages of the market during that time period. Then, after the momentary decrease in the number of smallcap and bigcap stocks and increase in the number of microcap stocks during the 2007-2008 crisis, the counts of the three cap groups were fairly constant. The dominant numbers of microcap stocks from the mid 1980’s onward are of interest here because microcap stock returns are more volatile and prone to outliers than smallcap and bigcap stock returns, and this leads one to anticipate that the adverse influence of outliers on LS beta estimates will be most severe and frequent for microcap stocks, somewhat less frequent for smallcaps, and even less frequent for bigcaps.

With the preceding anticipated behavior in mind, we turn to the Fig. 7 display of the boxplot representation of the cross section distribution of mOpt robust betas for each contiguous two-year interval between 1963 and 2018. The solid dot in each boxplot represents the cross section median of the betas for each two-year interval, the empty circle symbol represents the average of the betas, and the ends of the boxes are located at the lower and upper quartiles of the data.

Fig. 6
figure 6

Biennial liquid CRSP stocks market capitalization group counts and percentages

Fig. 7
figure 7

mOpt robust betas cross section distribution boxplots computed using two-year intervals of liquid weekly returns of stocks in the CRSP database

The most striking features of the robust mOpt betas boxplots in Fig. 7 are: (a) They reveal interactions between beta and firm size, with almost all mOpt median betas less than 1 for microcaps, a little more than one half of mOpt median betas less than 1 for smallcaps, and most betas quite close to 1 for bigcaps; (b) the temporal patterns of the distributions of mOpt betas for the microcap and smallcap groups are strikingly similar from 1963 through the end of the collapse of the dotcom bubble in 2002, and over that time interval both groups have most median betas less than 1, and often considerably so. Then from 2002 through 2018 the microcap and smallcap mOpt betas have similar increasing temporal patterns, with the medians of the microcaps remaining below 1 and settling around 0.75 toward the end of that interval, while during that time period the medians of the smallcaps drift to slightly above 1; (c) the mOpt beta medians of the bigcap group only deviate substantially from 1 during 1963–1964, 1995–1996, and most substantially during the 1999–2002 period of the dotcom bubble collapse and the initiation of the subsequent rebound. The temporal patterns of the cross section distributions of the bigcap and smallcap stocks are strikingly similar from 1999 through 2018.

Figure 8 displays the boxplots of the paired differences between LS and mOpt betas, i.e., LS beta minus mOpt beta for each stock, with horizontal dotted lines at +0.3 and −0.3. These lines were chosen for the following reasons: (a) Data values that fall outside the “whiskers” of a boxplot are by convention deemed outliers, which are plotted as individual points. For the bigcap boxplots in the lower panel, the average locations of the whiskers are at about +0.3 and −0.3, and so beta differences larger than +0.3 and smaller than −0.3 for bigcaps are considered to be LS minus mOpt beta outliers. We then use +0.3 and −0.3 as boundary lines for the smallcap and microcap groups as well, because those choices clearly reveal the increased dispersion, skewness and outliers of the LS minus mOpt beta distributions relative to the bigcaps; (b) We believe that a broad range of consumers of beta estimates, who have the market beta value of 1 as a reference point, will find that LS minus mOpt beta differences with absolute values greater than 0.3 are a useful alert that outliers are likely to be influencing the LS estimate, and that the asset returns data should be examined for influential outliers, and for possible firm financial conditions or market conditions that may give rise to such outliers.

It is clear in Fig. 8 that the cross section distributions of the paired differences between the LS and mOpt betas tend to be concentrated on positive values to various degrees, with this effect most substantial for microcaps for all 28 two-year intervals during 1963 to 2018, and most striking for the 1975–1976, 1987–1988, 2007–2008 intervals associated with major market events, where well over 25% of LS betas are larger than the mOpt robust betas by at least 0.3.Footnote 13 For smallcaps, the clear tendency for LS and mOpt beta differences to be concentrated on positive values in the early years diminishes over time, and is essentially absent after 2002, and a similar comment applies to the bigcap beta differences.

Fig. 8
figure 8

LS minus mOpt beta differences cross section distribution boxplots using the same returns data as for Fig. 7

We also notice from Fig. 8 that for each two-year interval there exist at least some and often many stocks for which the absolute difference between LS and mOpt beta estimates exceeds not only the value 0.3, but also the value 0.5. In order to quantify the frequency of such differences, we tabulated for each two-year interval the percent of stocks in each market cap group and in the market for which the absolute difference of LS versus mOpt betas exceeds thresholds of 0.3 and 0.5. We then computed the average of these percentages over all the two-year intervals between 1963 and 2018, and display the results in the Cap Group Percent columns of Table 1. The results clearly reveal that the problem of outlier bias in LS betas is most frequent for microcaps, less so for smallcaps, relatively infrequent for bigcaps, and significant for the overall market. The market wide exceedance frequencies of 18.1% and 7.5% for thresholds 0.3 and 0.5, respectively, are results that practitioners should be aware of. The Positive/Negative Exceedances ratios in Table 1, which range from 2.7 to 4.2, reflect the positive skewness of the distribution of LS minus mOpt betas.
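The exceedance percentages and positive/negative exceedance ratios of the kind reported in Table 1 are straightforward to compute from paired beta estimates; a sketch with function and variable names of our own choosing:

```python
import numpy as np

def exceedance_stats(beta_ls, beta_rob, threshold):
    """Percent of stocks whose LS-minus-robust beta difference exceeds
    `threshold` in absolute value, plus the positive/negative exceedance ratio."""
    d = np.asarray(beta_ls) - np.asarray(beta_rob)
    pct = 100.0 * np.mean(np.abs(d) > threshold)
    pos = np.sum(d > threshold)
    neg = np.sum(d < -threshold)
    ratio = pos / neg if neg > 0 else float("inf")
    return pct, ratio
```

Applied per two-year window and per cap group, and then averaged over windows, this yields the entries of Table 1.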

Table 1 1963–2018 average percent of stocks in marketcap groups and in the market for which the absolute differences of LS and mOpt betas exceed thresholds of 0.3 and 0.5, along with ratios of positive to negative exceedances

One may argue that relative differences between LS and mOpt betas are more meaningful than absolute differences, which is reasonable enough when both betas are larger than 1. However, when both betas are small, relative differences are not so meaningful, e.g., for LS and mOpt betas of 0.2 and 0.05 the relative difference is 3.0. So in Table 2 we display exceedance percentages for the absolute difference between LS and mOpt betas relative to the mOpt beta, not only for all mOpt estimates, but also only for mOpt betas whose absolute values are at least 0.5. It is not surprising that the cap group percentages are substantially smaller when we only look at mOpt betas with absolute value larger than 0.5. We also find the interesting fact that the values in the “|mOpt Beta| > 0.5” rows of Table 2 are rather similar overall to those in Table 1.

Table 2 1963–2018 average percent of stocks in market cap groups and the market for which an absolute difference of LS and mOpt betas relative to the mOpt beta exceed thresholds of 0.3 and 0.5, along with ratios of positive to negative exceedances

Thin trading consideration

The downward bias of the median behavior of the mOpt beta estimates, which is also present for the LS estimates, raises the question of the extent to which this is due to thin trading bias. We studied this issue using the Dimson (1979) method, corrected as in Davidson and Josev (2005), for both the mOpt and LS beta estimators. It turns out that this correction indeed reduces the thin trading bias, but it also inflates the variability of the betas, and correspondingly results in larger exceedance percentages than those in Table 1. Our study of a robust Dimson method needs to be extended to include the Scholes and Williams (1977) correction method.
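For reference, the Dimson idea is to regress the stock return on the contemporaneous and lagged (and, in the full method, leading) market returns and sum the slope estimates. A minimal LS sketch with lags only, omitting leads and the Davidson and Josev (2005) adjustment used in the paper:

```python
import numpy as np

def dimson_beta(r, f, n_lags=1):
    """Dimson-style thin-trading beta: regress the stock return on the
    contemporaneous and lagged market returns (with intercept) and sum
    the market slopes. Sketch only: leads are omitted."""
    T = len(r)
    cols = [np.ones(T - n_lags)]
    for k in range(n_lags + 1):
        cols.append(f[n_lags - k : T - k])  # f_{t}, f_{t-1}, ..., f_{t-n_lags}
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, r[n_lags:], rcond=None)
    return coef[1:].sum()  # sum of the market slope coefficients
```

When price adjustment is spread over adjacent periods, the summed slopes recover the full beta that the contemporaneous-only regression understates.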

Multifactor time series models

Our examples in this section are focused on fitting the FF3 and FFC4 time series models to an asset returns time series \(r_{t}\). The FFC4 model has the form

$$\begin{aligned} r_{t}^{e}=\alpha +f_{MKT,t}^{e}\beta _{1}+f_{SMB,t}\beta _{2}+f_{HML,t}\beta _{3} +f_{MOM,t}\beta _{4}+\epsilon _{t} \end{aligned}$$
(19)

where \(r_{t}^{e}\) is a time series of the asset excess returns relative to a risk-free rate, \(f_{MKT,t}^{e}\) is a time series of market excess returns, \(f_{SMB,t}\) are the returns of the Fama-French “small minus big” (SMB) size factor portfolio, \(f_{HML,t}\) are the returns on the “high minus low” (HML) value factor portfolio, \(f_{MOM,t}\) are the returns on the Carhart (1997) momentum factor portfolio, and \(\alpha\) is an intercept. The FF3 model is obtained by dropping the term \(f_{MOM,t}\beta _{4}\).Footnote 14
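A least squares fit of a multifactor model such as (19), with classical t-statistics of the kind reported in Table 3, can be sketched in a few lines (LS only; the robust mOpt fit and robust t-statistics would come from a robust regression package such as RobStatTM, and the function name here is our own):

```python
import numpy as np

def ols_tstats(y, F):
    """LS fit of a multifactor model with intercept: returns the coefficient
    vector (intercept first) and classical t-statistics."""
    T, K = F.shape
    X = np.column_stack([np.ones(T), F])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (T - K - 1)          # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)          # classical coefficient covariance
    t = beta / np.sqrt(np.diag(cov))
    return beta, t
```

Here `F` would hold the MKT, SMB, HML and MOM factor return columns and `y` the asset excess returns.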

It is common practice to use time series factor models such as the FF3 and FFC4 models for analysis where \(r_{t}\) is a portfolio return. However, the models can also be used for individual stocks with returns \(r_{t}\), where for example the goal might be to select stocks that have desired exposures, or lack thereof, to one or more of the factors. Here we examine the mOpt and LS fits of the FF3 and FFC4 models to the weekly excess returns of the stock with ticker FNB for the year 2008. Figure 9 displays the pairwise scatter plots of the FNB, MKT, SMB, HML and MOM data. Given the turbulence of the 2007-2008 market crisis, it is not surprising to see bivariate outliers, some of which may also be three- or four-dimensional outliers.

Fig. 9
figure 9

Pairwise scatter plots of FNB 2008 weekly returns and corresponding MKT, SMB, HML, MOM factors

The results of the LS and mOpt fits of the FF3 and FFC4 models to the FNB returns are displayed in Table 3, which contains coefficient estimates, standard and robust t-statistics, and standard and robust adjusted R-squared values.Footnote 15 Since we are no longer in the context of new factor significance discovery, we use the usual absolute t-value threshold of 1.96 for a t-statistic to be significant. For both models the intercept estimates of both the LS and mOpt fits are equal to zero to two significant digits, and correspondingly the t-statistics values are all quite insignificant. The main differences between the FF3 LS and mOpt model fits are that the mOpt fit has a slightly higher robust adjusted R-squared, and for the HML factor the mOpt slope is a little over twice the LS slope and correspondingly the mOpt t-statistic value is considerably larger than that of the LS estimate.

For the FFC4 model the LS fit t-statistics for the HML and MOM factors are now both insignificant, leaving only MKT.RF and SMB as significant factors. The mOpt fit t-statistics, in contrast, are significant for all factors except HML, whose t-statistic is quite insignificant, and the mOpt MOM slope is negative. This latter result is not surprising in view of the fact, illustrated in Fig. 9, that the HML and MOM factors are negatively correlated, and MOM replaces HML in the FFC4 model. Table 3 also reveals that the FFC4 mOpt fit has the highest adjusted R-squared among the LS and mOpt fits of both the FF3 and FFC4 models, and overall, the results in the table lead to the sense that the FFC4 model is preferred to the FF3 model. However, a proper robust model selection method is needed for this purpose, and we discuss this in the next section.

Table 3 mOpt and LS fits of FF3 and FFC4 models to FNB 2008 weekly returns

It is useful to compare the mOpt and LS fits of the FFC4 model using the Fig. 10 model fit residuals normal QQ plots (quantile-quantile plots), where the one on the left is for the LS fit and the one on the right is for the mOpt fit. The dotted lines in the figure are the point-wise 95% confidence intervals. The QQ plot for the LS fit in the left panel has a number of data points in the central region that deviate from a straight line and are near or outside the dotted lines for the bulk of the data, which indicates that the model fit is not very good, and there is no indication of residuals outliers. On the other hand, the majority of the data points in the mOpt QQ plot in the right panel fall along a straight line that is well within the dotted lines, thereby indicating a good model fit for most of the data, and the extreme regions of the QQ plot reveal a number of positive and negative residuals outliers that are well outside the dotted lines. This is the typical behavior of the residuals from an mOpt robust estimate relative to an LS estimate, i.e., the robust regression fit is very good for the bulk of the data in the middle of the distribution and outliers are clearly revealed, but this is not the case for an LS regression.

Fig. 10
figure 10

Normal qqplots of residuals from robust LS and mOpt fits of the FFC4 model to FNB 2008 weekly returns

In addition to the above visual assessment of the comparative LS and mOpt robust fits, and the standard statistical assessment provided in Table 3, one can use a significance test statistic to test for a non-zero difference between the LS and mOpt fits. For estimators \(\hat{\varvec{\upbeta }}_{LS}\) and \(\hat{\varvec{\upbeta }}_{\mathrm{mOpt}}\) that converge in probability to the true parameter vector \(\varvec{\upbeta }=(\beta _{1},\ldots ,\beta _{K})\), it was shown by Maravina and Martin (2012) that the difference \(\varvec{\triangle }\hat{\varvec{\upbeta }}=\hat{\varvec{\upbeta }}_{LS}-\hat{\varvec{\upbeta }}_{\mathrm{mOpt}}\) converges in distribution to a zero mean multivariate normal distribution with covariance matrix

$$\begin{aligned} \mathbf {V}_{{\varvec{\triangle }\hat{\varvec{\upbeta }}}}=\delta ^{2} \mathbf {C}_{\mathbf {f}}^{-1} \end{aligned}$$
(20)

where \(\mathbf {C}_{\mathbf {f}}\) is the covariance matrix of factor returns appearing in (9), and

$$\begin{aligned} \delta ^{2}=\mathrm {E}\left( \dfrac{s\psi _{\mathrm{mOpt}}(u/s)}{\mathrm {E}\psi _{\mathrm{mOpt}}^{\prime }(u/s)}-u\right) ^{2} \end{aligned}$$
(21)

where \(u=s\epsilon _{1}\) in (18).Footnote 16 It follows that the quadratic form \(\varvec{\triangle }\hat{\varvec{\upbeta }}^{\prime }\mathbf {V}_{\varvec{\triangle }\hat{\varvec{\upbeta }}}^{-1}\varvec{\triangle }\hat{\varvec{\upbeta }}\) has an asymptotic chi-squared distribution with K degrees of freedom, and the significance test statistic

$$\begin{aligned} \mathrm {D}_{K}=\varvec{\triangle }\hat{\varvec{\upbeta }}^{\prime } \left( \dfrac{\widehat{\delta ^{2}}}{T}\hat{\mathbf {C}}_{\mathbf {f}}^{-1}\right) ^{-1} \varvec{\triangle }\hat{\varvec{\upbeta }} \end{aligned}$$
(22)

has an approximate chi-squared distribution with K degrees of freedom. For our applications here where we use two years of weekly returns, a normal distribution approximation will suffice for \(\varvec{\triangle }\hat{\varvec{\upbeta }}\), and hence a chi-squared approximation for \(\mathrm {D}_{K}\) will suffice for computing p values.
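A sketch of the test computation, given the two fitted coefficient vectors together with estimates of \(\mathbf {C}_{\mathbf {f}}\) and \(\delta ^{2}\): the statistic is the quadratic form of the coefficient difference in the metric of the inverse of its estimated covariance matrix from (20), compared here against standard 5% chi-squared critical values (the function name and interface are our own):

```python
import numpy as np

CHI2_CRIT_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}  # chi-squared 95th percentiles

def ls_vs_robust_test(beta_ls, beta_rob, C_f_hat, delta2_hat, T):
    """Test statistic for the LS-minus-robust coefficient difference:
    the quadratic form of the difference in the inverse of its estimated
    covariance (delta2/T) * inv(C_f), asymptotically chi-squared(K)."""
    d = np.asarray(beta_ls, dtype=float) - np.asarray(beta_rob, dtype=float)
    V = (delta2_hat / T) * np.linalg.inv(C_f_hat)  # covariance of the difference
    D = d @ np.linalg.solve(V, d)
    return D, D > CHI2_CRIT_5PCT[len(d)]
```

A p value could equally be computed from the chi-squared survival function; the fixed 5% critical values keep the sketch dependency-free.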

The p values obtained by applying the above test statistic for the FF3 and FFC4 models are displayed in Table 4. For the FF3 model, the overall coefficient vector difference \(\varvec{\triangle }\hat{\varvec{\upbeta }}\) and the HML coefficient difference are both significant, with common p values of 0.000 to three digits. This is not too surprising in view of the large difference between the LS and mOpt HML coefficients in Table 3, but neither of the MKT nor SMB factor LS and mOpt coefficient differences is significant. On the other hand, for the FFC4 model the overall \(\varvec{\triangle }\hat{\varvec{\upbeta }}\) difference is quite significant with a p value of 0.011, and the MOM factor difference is highly significant with a p value of 0.003, but none of the differences in the MKT, SMB and HML factors are significant.

Table 4 Tests for differences between the mOpt and LS overall model and individual coefficients

Robust time series model selection

While the robust regression results in Table 3 suggest that the FFC4 is a better factor model than the FF3, use of a proper model selection method that penalizes for over-fitting is needed when searching for a “best” model. The normal distribution theory based Akaike information criterion (AIC) introduced by Akaike (1973) is commonly used for this purpose. In the case of time series linear regression models with sample size T and number of variables \(p\le p_{\mathrm{max}}\), the AIC model selection method chooses the model order p that minimizes \(T\mathrm {log}(\hat{\sigma })+p\), where \(\hat{\sigma }^{2}\) is the average of the squared regression residuals. The AIC criterion lacks robustness in two ways: the first is the lack of robustness of the least squares coefficient estimates used to compute the residuals, and the second is that outliers can adversely influence the average of the squared residuals.Footnote 17 We propose to replace AIC with a robust final prediction error (RFPE) method that makes natural use of the mOpt robust regression method.

The RFPE method was motivated by the final prediction error (FPE) method introduced by Akaike (1969) in the context of estimating the order of an autoregressive time series. The FPE method chooses the autoregression order k that minimizes

$$\begin{aligned} \mathrm {FPE}(k)&=\hat{\sigma }_{k}^{2}\dfrac{T+k}{T-k}\nonumber \\&\approx \hat{\sigma }_{k}^{2}\left( 1+2\dfrac{k}{T}\right) \,\mathrm {for}\,k\ll T \end{aligned}$$
(23)

where \(\hat{\sigma }_{k}^{2}\) is the average of the squared residuals from the least-squares fit. For linear regression models the FPE criterion is used by considering a maximal linear model that has K predictor variables, computing \(\mathrm {FPE}(k)\) for subset models indexed by \(C_{k}\) with \(k\le K\) predictor variables, and finding the subset \(C_{k}^{*}\) that minimizes \(\mathrm {FPE}(k)\). However, the FPE criterion lacks robustness for the same reasons that the AIC does.

The RFPE robust model selection method is defined as follows using the mOpt regression estimator in an intuitive manner.Footnote 18 Let

$$\begin{aligned} \hat{\mathbf {\varvec{\upbeta }}}_{C_{k}}=\underset{\mathbf {\varvec{\upbeta }}}{\mathrm{argmin}}\left[ \sum _{t=1}^{T}\rho _{\mathrm{mOpt}}\left( \frac{r_{t} -\tilde{\mathbf {f}}_{C_{k},t}^{\prime }\varvec{\upbeta }}{\hat{s}}\right) \right] \end{aligned}$$
(24)

where the robust residuals scale estimate \(\hat{s}\) is computed for the maximal model \(C_{\mathrm{max}}\) with K predictor variables \(\tilde{\mathbf {f}}_{C_{\mathrm{max}},t}\). This provides a smallest robust scale, which serves as an equitable reference across models with different predictor variables. With robust residuals for the subset model \(C_{k}\) defined as

$$\begin{aligned} \hat{\epsilon }_{C_{k},t}=r_{t}-\tilde{\mathbf {f}}_{C_{k},t}^{\prime } \hat{\mathbf {\varvec{\upbeta }}}_{C_{k}},\,t=1,2,\ldots ,T. \end{aligned}$$

The RFPE is then defined as

$$\begin{aligned} \mathrm {RFPE}(C_{k})=\dfrac{1}{T}\sum _{t=1}^{T}\rho _{\mathrm{mOpt}}\left( \frac{\hat{\epsilon }_{C_{k},t}}{\hat{s}}\right) +\dfrac{k}{T}\dfrac{{{A}}}{{{B}}} \end{aligned}$$
(25)

where

$$\begin{aligned} A= & {} \dfrac{1}{T}\sum _{t=1}^{T}\psi _{\mathrm{mOpt} }^{2}\left( \frac{\hat{\epsilon }_{C_{k},t}}{\hat{s}}\right) ,\nonumber \\ B= & {} \dfrac{1}{T}\sum _{t=1}^{T}\psi _{\mathrm{mOpt}}^{\prime }\left( \frac{\hat{\epsilon }_{C_{k},t}}{\hat{s}}\right) . \end{aligned}$$
(26)

Note that if \(\rho _{\mathrm{mOpt}}(r)\) is replaced with the least squares rho function \(\rho (r)=r^{2}\), then \(\psi (r)=2r\), \(\psi ^{\prime }(r)=2\), and \(\mathrm {RFPE}(C_{k})\) is the same as the approximate version of (23).Footnote 19
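The RFPE computation in (25)-(26) is straightforward once rho, psi and its derivative are available. The sketch below substitutes the Tukey bisquare family for the mOpt functions (an illustrative stand-in, since the analytic mOpt forms are not reproduced here), with rho normalized to a maximum of 1:

```python
import numpy as np

C = 4.685  # bisquare tuning constant, a stand-in for the mOpt tuning

def rho(u):
    # Tukey bisquare rho, normalized so rho -> 1 for |u| >= C
    t = np.clip((u / C) ** 2, 0.0, 1.0)
    return 1.0 - (1.0 - t) ** 3

def psi(u):
    # derivative of rho
    t = (u / C) ** 2
    return np.where(t <= 1.0, 6.0 * u / C**2 * (1.0 - t) ** 2, 0.0)

def psi_prime(u):
    # derivative of psi
    t = (u / C) ** 2
    return np.where(t <= 1.0, 6.0 / C**2 * (1.0 - t) * (1.0 - 5.0 * t), 0.0)

def rfpe(resid, s, k):
    """RFPE of a k-variable subset model: mean rho of scaled robust residuals
    plus the (k/T)(A/B) penalty of (25)-(26), with s the maximal-model scale."""
    u = np.asarray(resid) / s
    A = np.mean(psi(u) ** 2)
    B = np.mean(psi_prime(u))
    return np.mean(rho(u)) + (k / len(u)) * A / B
```

The subset with the smallest `rfpe` value is preferred, with the penalty term growing in the subset size k.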

Ideally, one would like to choose a best model by evaluating \(\mathrm {RFPE}(C_{k})\) for all subsets of the full model \(C_{K}\). But this is often infeasible due to the computing time required by a very large number of subsets unless K is rather small. However, a backward step-wise RFPE selection method is feasible for the mOpt estimator, and is available in the R package RobStatTM.Footnote 20 We illustrate the use of the method for the problem of finding a best subset of the FFC4 factors for modeling the FNB stock returns, with the results shown in Table 5.
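The backward step-wise logic can be sketched generically, with the model-quality criterion passed in as a function. In the usage lines below the criterion is a lookup table containing the two RFPE values 0.221 and 0.217 reported in Table 5; all other subset values are hypothetical placeholders:

```python
def backward_stepwise(factors, criterion):
    """Backward step-wise selection: starting from the full factor set,
    repeatedly drop the single factor whose removal most reduces the
    criterion, stopping when no deletion improves on the current set."""
    current = list(factors)
    best = criterion(current)
    while len(current) > 1:
        trials = [(criterion([f for f in current if f != g]), g) for g in current]
        val, drop = min(trials)
        if val >= best:
            break
        best = val
        current.remove(drop)
    return current, best

# Usage: RFPE values 0.221 (full model) and 0.217 (drop HML) are from Table 5;
# the 0.25 default for all other subsets is a hypothetical placeholder.
table = {
    frozenset({"MKT", "SMB", "HML", "MOM"}): 0.221,
    frozenset({"MKT", "SMB", "MOM"}): 0.217,
}
best_set, best_val = backward_stepwise(
    ["MKT", "SMB", "HML", "MOM"], lambda s: table.get(frozenset(s), 0.25)
)
```

With these values the procedure deletes HML and then stops, reproducing the Table 5 outcome.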

Table 5 RFPE backward step wise model selection from FNB ~ MKT + SMB + HML + MOM

The Full Model columns of the table show the RFPE value 0.221 of the full FFC4 model in the All row, and subsequent rows show the RFPE values for each of the 3-factor models obtained by deleting each of the MKT, SMB, HML and MOM factors one at a time. Since the RFPE value 0.217 obtained by deleting the HML factor is the smallest RFPE among all four deletions, and is smaller than the full four-factor model RFPE value 0.221, the HML factor is deleted, which results in the new All row value of 0.217 in the MKT+SMB+MOM columns of the table. Since none of the single factor deletions of MKT, SMB and MOM results in a smaller RFPE value than 0.217, the process stops with those three factors as the best subset of the four FFC4 factors. This result is quite consistent with the FFC4-mOpt row of Table 3, where the HML factor is quite insignificant.

We also used least-squares based AIC model selection for the FFC4 factors, and the results are shown in Table 6. The AIC method results in deleting the MOM factor, leaving the three factors MKT, SMB and HML as the best model. This is in reasonable agreement with the LS fit of the full FFC4 set of factors to FNB, which finds MOM quite insignificant and HML close to significant with a t-statistic value of 1.8, but in disagreement with the above RFPE best set of factors, where MOM is retained instead of HML. This shows that AIC, like LS, suffers from the adverse influence of outliers and cannot serve as a reliable method for model selection.

Table 6 AIC backward step wise model selection from FNB ~ MKT + SMB + HML + MOM

While the above example is quite simple, it suggests that routine RFPE selection of time series factor model factors could be considerably useful. Among the many possible applications, one would be to model portfolio or asset style drift over time using RFPE model selection on a moving time window.

Summary and discussion

We introduced the theoretically well-justified mOpt robust regression method in the context of time series factor models. The efficacy of the mOpt regression in providing a good fit to the bulk of the data was demonstrated for both single factor and multifactor models where LS fits are adversely influenced by outliers. We also showed that the well-known Huber robust regression estimator is lacking in bias robustness, both theoretically and in applications.

For the case of single factor model CAPM betas, we compared the mOpt and LS estimates for two-year intervals of weekly returns of liquid stocks from the CRSP database for 1963 to 2018. The LS estimator's lack of robustness toward outliers is manifested in absolute differences between the mOpt and LS beta estimates of at least 0.3 for about 18% of the liquid stocks in the market, and of at least 0.5 for about 7.5% of those stocks. Our analysis showed that the lack of robustness of the LS estimator is revealed most frequently for microcap stocks, less frequently for smallcaps, and considerably less frequently for bigcaps. Such differences can be of concern to a wide range of users of beta estimates.

For the Fama-French three-factor model (FF3) and the Fama-French-Carhart four-factor model (FFC4) fits to weekly stock returns that we studied, the mOpt estimator continues to produce good fits to the bulk of the data that differ significantly from LS fits. An important feature of mOpt is that, in addition to robust coefficients, it computes robust t-statistics. For the FF3 model, the robust t-statistics lead to the same significant Market, SMB and HML factors as the classical t-statistics of the LS fit. But for the FFC4 model, the robust t-statistics result in significant Market, SMB and MOM factors, whereas the classical LS t-statistics lead to significant Market and SMB factors only. We also introduced a new test of significant difference between mOpt and LS estimates, and illustrated its usefulness in the context of FF3 and FFC4 model fitting. Finally, the new Robust Final Prediction Error (RFPE) model selection criterion produces better fitting factor models than the well-known normal distribution based AIC method, which is quite non-robust toward outliers.

The above results are only first steps in understanding the use of mOpt robust regression estimators, robust t-statistics, and RFPE model selection for time series multifactor models. Additional directions for further research include the following. The first is to carry out an in-depth empirical study of the relative performance of mOpt versus LS fits of the FF3 and FFC4 models, and of RFPE versus AIC model selection, for cross sections of liquid stocks, similar to what we have done for single factor models herein. A second step will be to carry out a similar empirical study for other increasingly popular multifactor models, e.g., the Fama and French (2015) five-factor model and the Hou et al. (2020) augmented q-factor model. Furthermore, it will be important to assess the efficacy of robust mOpt versus non-robust LS fits of multifactor models in evaluating the character and performance of funds, where the influence of individual stock return outliers is mitigated by the portfolio nature of the funds, but the fund returns as well as the factors can nonetheless contain outliers. Finally, we have developed a very natural way to robustify the Vasicek (1973) Bayesian beta estimates using the mOpt estimator, and it remains to study its empirical efficacy for cross sections of returns.

The results presented in this paper lead us to strongly recommend standard use of the mOpt robust regression estimator as a complement to least squares for time series factor model research and applications, as well as a standard against which to evaluate alternative proposed robust regression methods.