Machine learning techniques for cross-sectional equity returns’ prediction

We compare the performance of the linear regression model, which is the current standard in science and practice for cross-sectional stock return forecasting, with that of machine learning methods, i.e., penalized linear models, support vector regression, random forests, gradient boosted trees and neural networks. Our analysis is based on monthly data on nearly 12,000 individual stocks from 16 European economies over almost 30 years from 1990 to 2019. We find that the prediction of stock returns can be decisively improved through machine learning methods. The outperformance of individual (combined) machine learning models over the benchmark model is approximately 0.6% (0.7%) per month for the full cross-section of stocks. Furthermore, we find no model breakdowns, which suggests that investors do not incur additional risk from using machine learning methods compared to the traditional benchmark approach. Additionally, the superior performance of machine learning models is not due to substantially higher portfolio turnover. Further analyses suggest that machine learning models generate their added value particularly in bear markets when the average investor tends to lose money. Our results indicate that future research and practice should make more intensive use of machine learning techniques with respect to stock return prediction.


Introduction
Recently, Gu et al. (2020) conducted a much noted comparative analysis of machine learning methods for stock return prediction by synthesizing the body of empirical asset pricing literature within the field of machine learning. They provide evidence for a clear rejection of the ordinary least squares (OLS) benchmark in favor of machine learning methods in terms of statistical performance and investors' economic performance. The OLS benchmark represents the typical approach in one of two basic strands of the empirical literature on stock return prediction. Specifically, cross-sectional stock return predictability research (e.g., Fama and French 2008;Lewellen 2015;Gu et al. 2020) runs cross-sectional regressions of future stock returns on a handful of lagged stock characteristics. The second strand of literature, i.e., time-series stock return predictability research, does not forecast the cross-section but the time-series of returns. This literature typically tries to forecast stock indices employing macroeconomic predictors. 1 Attributable to the underperformance of the linear benchmark in their study, Gu et al. (2020) recommend using machine learning techniques to overcome the severe limitations of commonly applied methods.
The paper from Gu et al. (2020) is the first to comprehensively use ML methods in cross-sectional predictability research. Following this seminal paper, numerous concurrent papers have emerged that contribute to the literature by investigating the generality of the conclusions derived in Gu et al. (2020). This is achieved by applying their research design to other stock markets (see, e.g., Tobek and Hronec 2021;Drobetz and Otto 2021;Leippold et al. 2022;Liu et al. 2022;Rubesam 2022;Lalwani and Meshram 2022, for international applications and applications to individual regions or countries) or other asset classes (see, e.g., Bianchi et al. 2021, for applications to bond returns). Another strand of the literature (see, e.g., Leippold et al. 2022, for a list of exemplary papers) is dedicated to numerous additional refinements of the basic algorithms surveyed in Gu et al. (2020). 2 Our empirical analysis contributes to the line of research featured in the aforementioned papers by deviating from their research design in several key aspects to further investigate the generality of the conclusion to reject the OLS benchmark in favor of machine learning methods in cross-sectional predictability research. More specifically, we devote our attention to three distinct issues: (1) overfitting and irrelevant predictors, (2) missing data, and (3) a U.S. bias. In the following, we discuss these issues and how they can be overcome. First, the aforementioned literature typically relies on large predictor sets. For example, the predictor set in Gu et al. (2020) includes 94 firm characteristics and interactions of each firm characteristic with eight macroeconomic time-series from 1 3 Machine learning techniques for cross-sectional equity… Welch and Goyal (2008) and 74 industry sector dummy variables. This results in a set of 900+ predictors. Following (Gu et al. 2020;Leippold et al. 2022) even extend this predictor set (i.e., 1160 predictors) while relying on a smaller cross-section (i.e., 3,900 stocks) and time-series (i.e., a study period from 2000 to 2020). Tobek and Hronec (2021) and Bianchi et al. (2021) also use more than 100 predictors when investigating stocks internationally and bonds, respectively. Such large predictor sets and the enhanced flexibility of machine learning methods over more traditional prediction techniques such as the OLS benchmark come at the risk of overfitting the data, which may put machine learning methods at a disadvantage. However, one can also argue that some machine learning methods can handle irrelevant predictors while OLS cannot, which may put (some) machine learning methods at an advantage. For a comparison at eye level, we provide the OLS benchmark and machine learning methods with the same and a relatively sparse set of only relevant predictor variables to prevent overfitting and avoid irrelevant predictors. 3 Our set of predictors consists of beta (Sharpe 1964), market capitalization (Banz 1981), the book-to-market-equity ratio (Rosenberg et al. 1985), momentum (Jegadeesh and Titman 1993), investment (Titman et al. 2004) and operating profitability (Novy-Marx 2013), all of which form the basis for well-known factor models such as the Fama and French (1993) three-factor model, the Carhart (1997) four-factor model and the Fama and French (2015) five-factor model. Second, the treatment of missing data is another potential problem affecting the findings derived in the previous literature. More specifically, if a characteristic is missing, then it is typically replaced by the cross-sectional mean or median, which is zero as stock characteristics are rank-transformed and mapped into a [− 1,1] interval. 4 In this vein, Cismondi et al. (2013) argue that in cases where missing data can range up to 50% or more, imputing the data is incorrect, as it might create unrealistic states of the process being modeled. Afifi and Elashoff (1966) even argue that imputing the mean yields unbiased estimates if and only if the data follow a multivariate normal distribution and the data are missing at random. Given that this is likely not the case for financial market and accounting data, imputing missing data has received much attention in recent research by Freyberger et al. (2021); Cahan et al. (2022); Bryzgalova et al. (2022); Beckmeyer and Wiedemann (2022). To reduce the unintended impact of missing data, we select the aforementioned predictors so that they are available over the entire period and there 1 3 Machine learning techniques for cross-sectional equity… turnover. The analysis of bull and bear markets suggests that machine learning models generate their added value particularly in bear markets when the average investor tends to lose money. Lastly, we find that forecast combinations provide the most robust forecasts, on average. More specifically, forecast combinations consisting of only nonlinear methods consistently outperform forecast combinations that also consider linear models. Thus, we find that investors and researchers alike should prioritize machine learning techniques over the commonly applied benchmark approach when engaging in stock return predictions.
The remainder of the paper is organized as follows. Section 2 discusses the differences between time-series and cross-sectional stock return predictability. Section 3 formulates the OLS benchmark and the machine learning methods applied. Section 4 describes the data and provides descriptive statistics on the predictor variables and the dependent variable (stock returns) from the approximately 12,000 European firms included in this study. Sections 5 and 6 report the main and additional results of the performance comparisons based on statistical (model R 2 ) and economic (economic gains to investors from portfolio strategies) analyses. Section 7 concludes the paper.

Related literature
In finance research, and even more so in non-finance research such as operations research, the time-series approach dominates the cross-sectional approach when it comes to forecasting asset returns. Against this background, in this section, we aim to shed light on the differences between the two approaches discussed in recent literature: • Engelberg et al. (2022): "Financial researchers have examined the predictability of stock returns for over a century (e.g., Gibson 1906) and a large literature has documented evidence of predictability in the cross-section of stock returns. A separate literature has examined the predictability of the equity risk-premium using time-series predictive variables. To date, these two literatures have evolved relatively independently." • Dong et al. (2022): "The first examines whether firm characteristics can predict the cross-sectional dispersion in stock returns. These studies identify numerous equity market anomalies (e.g., Fama and French 2015;Harvey et al. 2016;McLean and Pontiff 2016;Hou et al. 2018). The second line of research investigates the time-series predictability of the aggregate market excess return based on a variety of economic and financial variables, such as valuation ratios, interest rates, and inflation (Nelson 1976;Campbell 1987;French 1988, 1989;Pástor and Stambaugh 2009). Studies in this vein attempt to shed light on the variables that affect the equity risk premium." • Gu et al. (2020): "The first strand models differences in expected returns across stocks as a function of stock-level characteristics, and is exemplified by Fama and French (2008) and Lewellen (2015). The typical approach in this literature runs cross-sectional regressions of future stock returns on a few lagged stock charac-teristics. The second strand forecasts the time-series of returns and is surveyed by Welch and Goyal (2008), Koijen and van Nieuwerburgh (2011), and Rapach and Zhou (2013)." In summary, prediction of the time-series of a single asset such as the aggregate market returns is related to the time-series predictability research. The typical approach is to forecast a single time-series (e.g., a specific MSCI index) based on other time-delayed time-series (e.g., economic and financial variables, such as valuation ratios, interest rates, and inflation). Predicting the cross-sectional dispersion in stock returns is related to cross-sectional predictability. The main approach is to forecast the returns of multiple assets at once and at one point in time (e.g., returns from all stocks in the US at time t) using other, time-delayed, cross-sectional information (e.g., market equity, momentum, market beta from all stocks in the US at time t-1). Following (Rapach and Zhou 2020) and based on a linear regression model, one can also make a more formal distinction between the two approaches: • The time-series predictability approach considers the following time-series multiple regression model for an individual asset for t = 1, ..., T , where r is the return on a broad stock market index in excess of the risk-free return, T is the number of time-series observations, x j,t−1 are the lagged predictor variables, and t is a zero-mean disturbance term. • The cross-sectional predictability approach considers the following cross-sectional multiple regression model for a certain month for i = 1, ..., N , where z i,j is the jth lagged characteristic of firm i and N is the number of available firms. 7 In this paper, we are interested in predicting the cross-sectional dispersion in stock returns. However, in terms of cross-sectional predictability, Gu et al. (2020) is the first paper to comprehensively use ML methods and compare them against the standard OLS benchmark in this line of research (e.g., Fama and French 2008;Lewellen 2015;Gu et al. 2020). Against this background, we also rely on OLS as our "natural" benchmark. (1)

3
Machine learning techniques for cross-sectional equity…

Methodology
In our empirical study, we are dealing with a panel of stocks, where months are indexed as t = 1, ..., T and stocks are indexed as i = 1, ..., N . Accordingly, the future stock return r of asset i at month t + 1 can be defined in general terms as: 8 with where current expectations about future returns are expressed as a function g(⋅) of a vector of predictor variables x i,t (i.e., firm characteristics). Specifically, the predictor variables are defined as a P-dimensional vector x i,t = x i,t,1 , … , x i,t,P . Hence, the aim of the following methods is to provide an estimate ĝ(⋅) of g(⋅).

Simple and penalized linear regression
Throughout the recent decades of cross-sectional asset pricing research, the common strand in the literature on stock return predictability has been to model conditional expectations about future returns g(⋅) as a linear function of stock-level characteristics and a parameter vector = 1 , … , P (e.g., Fama and French 2008;Lewellen 2015;Green et al. 2017): The parameter vector can be obtained by (pooled) OLS, minimizing the l 2 objective function: where N is the number of stocks in the cross-section and T is the number of months in the estimation period. We refer to this approach as the OLS benchmark. To reduce the impact of overfitting, we also add a penalty term to the objective function: Regularization techniques, such as penalty terms, are most commonly used in the context of large feature spaces. Although the problem of overfitting is only of minor importance for the linear model given the small feature set, it can still occur and affect performance. For the sake of completeness, we follow prior literature and append the linear model by the popular elastic net penalty, which takes the form: The two nonnegative hyperparameters of the elastic net penalty, and , nest some special cases. controls the penalty strength; hence, for = 0 , the model reduces to the standard OLS. The parameter controls the ratio of lasso versus ridge penalty, where the cases = 0 and = 1 correspond to lasso and ridge regression, respectively. Lasso imposes sparsity in the coefficients and may lead to some coefficients being exactly zero. Ridge regression is a dense modeling technique that prevents coefficients from becoming unduly large in magnitude by shrinking them closer to zero without imposing a weight of exactly zero. In addition to lasso ( = 0 ) and ridge regressions ( = 1 ), we consider elastic net ( = 0.5 ), which represents a compromise between the two.
In our empirical application, we determine the parameter using grid search (see Table 11 in appendix for the ranges) on the validation data.

Support vector machine
Support vector regression (SVR) is an extension of the basic principle underlying support vector machines developed by Vapnik (1995), originally designed for classification. SVR generally estimates a linear regression function of the following form: where w is a vector of weights and b is a constant. The objective of SVR is to estimate a function that provides a good balance between model complexity and goodness of fit by introducing an error margin ( -insensitive tube) to which the cost function is insensitive in approximating the returns. This may be interpreted as the tolerance for minuscule deviations up to . The -insensitive loss is defined as (cf. Vapnik 1995): By introducing two slack variables i,t+1 and * i,t+1 for positive and negative deviations from the -insensitive loss function, respectively, the objective function can be written in primal form as:

3
Machine learning techniques for cross-sectional equity… where the optimization problem is defined as the sum of aggregate errors exceeding the -boundary plus l 2 regularization that encourages a simple model to prevent overfitting of the data. The nonnegative constant C controls how strong deviations from should be penalized. 9 Note that C can be interpreted as an inverse regularization parameter, as it controls the tradeoff between error tolerance and complexity. Two notable cases are C = 0 , for which SVR approximates a constant function, and C → ∞ , = 0 , for which SVR essentially estimates an l 1 -regression. By means of Lagrangian theory, the above minimization problem can be represented in dual form as: where i,t and * i,t are nonnegative Lagrangian multipliers for each observation x i,t . SV are the indices of nonzero Lagrangian multipliers, where the corresponding x i,t are the support vectors. Accordingly, the first line of equation (12) considers pairwise combinations of panel observations if they are in the set of support vectors. Support vectors are observations that correspond to nonzero Lagrangian multipliers, i.e., either i,t or * i,t is nonzero. The dual form enables the introduction of nonlinearity by replacing the inner product x ′ i,t x j,k , i.e., a linear mapping, with a kernel function K(x i,t , x j,k ) , which respectively. and * denote errors with magnitude greater than the | | threshold 9 By convention, the SVR formulation typically uses an (inverse) regularization parameter. The objective may also be formulated as The relation between and C is C = 1 .

3
performs a nonlinear mapping of the inputs. Figure 1 illustrates a SVR with a linear (a) and a nonlinear (b) kernel. Conveniently, the introduction of a kernel function allows all calculations to be performed in the original feature space. We apply the most popular radial basis function (RBF), which takes the form Lu et al. (2009), Rasekhschaffe and Jones (2019): This kernel function measures the similarity of firm characteristics between firmobservations and, consequently, how they affect the regression function. Here, is a bandwidth parameter that controls the influence of observations. In our empirical application, we determine parameters , and using grid search (see Table 11 in appendix for the ranges) on the validation data.

Tree-based methods: random forest and gradient boosted regression trees
Regression trees are fully nonparametric and based on recursive binary splits, which enables them to account for high-level interactions. With respect to our application, regression trees group observations based on firm characteristics such that observations within a group are similar in terms of future returns. A tree 'grows' by performing a series of binary splits based on cutoffs of the firm characteristics at each branching point. Consequently, each split adds an additional layer of depth capable of incorporating more complex interaction effects. Starting with all observations, the tree successively divides the feature space into two rectangular partitions. The firm characteristic and its cutoff value are chosen to provide the best fit in terms of forecasting error. The resulting rectangular regions in the predictor space approximate the unknown function g(⋅) as the mean of the respective partition. The splitting procedure continues on one or both regions, resulting in increasingly smaller rectangles until a stopping criterion is met, e.g., depth of the tree (J) or no improvement by additional splits.  Figure 2 illustrates a regression tree of depth J = 2 based on exemplary firm characteristics char1 and char2. 10 Figure 2a presents the tree architecture, and Fig. 2b shows the representation in the rectangular feature space. Initially, all observations are divided based on the firm characteristic char1 with a cutoff value of 0.7. Firmmonth observations with a value below that threshold are assigned to the left branch, while all other observations are assigned to the right branch. The left node is not split up any further, resulting in R 1 , which is also called the "leaf node". The right node represents another decision node, where the remaining firm-month observations are additionally divided on firm characteristic char2 with a threshold value of 0.6, resulting in R 2 and R 3 . Thus, the observations are divided into regions R 1 , R 2 and R 3 , as shown in Fig. 2b. The prediction of a tree is the simple average of the response within a leaf node.
Formally, the unknown function g(⋅) is approximated by a regression tree as: where R m is one of the M leaf nodes, I is an indicator function of the firm characteristics identifying observations that belong to a particular region and c m is the associated average response. At each decision node, the algorithm aims to find the firm characteristic and its respective value that minimize the forecasting error. Finding a global optimal solution is generally computationally infeasible (cf. Hastie et al. 2017). Therefore, we follow (Breiman et al. 1984) and proceed with a greedy algorithm, in which the optimization is only performed locally at each individual decision node, i.e., solving the following optimization problem: with where c l and c r are the average responses of the left and right partitions, which are conditional on variable j and its value s subject to the minimization. Put differently, at each decision node, the split variable j and its associated split value s are to be found with the aim of locally minimizing the forecasting error on each split. Regression trees are widely adopted in the ML literature because (as visualized in Fig. 2) the sequence of binary decision rules allows a very simple and intuitive interpretation. Additionally, the inherent structure is ideal for representing multiway interaction effects and nonlinearities. However, regression trees are rarely used individually, as they are unstable with respect to changes in the input data and subject to severe overfitting problems. Unlike OLS, a regression tree can heavily overfit even with few variables, and with sufficient depth, a perfect fit can be achieved. In the limiting case, each leaf node contains only one observation. To alleviate the drawbacks of individual regression trees, we consider popular ensemble algorithms, i.e., random forests and gradient boosting, each of which combines many individually weak trees.
Random forest (RF) is based on the concept of bootstrap aggregation, in which the variance of an estimator is reduced by averaging over multiple independent and identically distributed (i.i.d.) bootstrapped samples. Thus, in the random forest, several B trees are trained independently on bootstrapped samples. In addition, at each decision node of a tree, only a random subset of p < P predictors is considered, reducing both the correlation and variance of the estimator (cf. Breiman 2001). The output of a random forest is provided by: While trees are fit independently in the random forest algorithm, gradient boosted regression trees (GBRT) are estimated in an adaptive manner to reduce the bias. Therefore, B trees are estimated sequentially and combined additively to form an ensemble prediction. At each iteration b, a new tree ĝ tree b is fit on the residuals from the previous iteration and added to the ensemble. However, the contribution of each individual tree is controlled by the learning rate 0 < < 1 (shrinkage) to prevent overfitting of the residuals. Finally, the estimation function of GBRT is given by: In our empirical application, we determine parameters (J, p) and (J, , B) for RF and GBRT, respectively using grid search (see Table 11 in the appendix for the ranges) on the validation data. Machine learning techniques for cross-sectional equity…

Neural networks
Neural networks are successfully applied in numerous scientific and practical applications. Their flexibility draws from the concept to derive features from linear combinations of the input space and to model the response as a (nonlinear) function of these features (cf. Hastie et al. 2017). We focus on traditional feedforward neural networks, which generally consist of an input layer, one or more hidden layers, and an output layer. Each layer is represented by (computational) units called neurons. Figure 3a gives a graphical representation of a general feedforward neural network. Starting with the input layer, which represents the firm characteristics, the information is carried through the hidden layers via interconnected neurons and finally aggregated to form an estimate of the future returns (represented by the output layer). Each of the neurons performs very simple operations individually, i.e., combining information from the previous layer and applying an (nonlinear) activation function to the aggregate signal (see Fig. 3b). More generally, the output of an L-layer neural network (including the output layer) with K l ∀1 ≤ l ≤ L neurons per layer can be thought of as a chain of functions (cf. Goodfellow et al. 2016), which recursively calculates the activations. Accordingly, the l-th layer's activations l , which is a vector of length equal to the number of nodes, are given by: where W l is a K l × K (l−1) matrix of weights transmitting the outputs from layer (l − 1) to layer l, where they then aggregate to form the activation a. Accordingly, l = 0 corresponds to the input layer with a 0 being the raw vector of firm characteristics, x i,t , and f 1 , … , f L are univariate transformations of the weighted signal from the previous layer plus a bias. W = W 1 , … , W L and b = b 1 , … , b L are parameter sets of weight matrices and bias vectors, respectively, determined in the model calibration.
In a regression task, the output layer L performs a linear combination, yielding the final output: There are numerous architectural design options when structuring a neural network, e.g., number of neurons, number of layers, and activation functions. Zhang et al. (1998, p. 42) state that "[...] the design of an ANN is more of an art than a science". However, there is no clear guidance on how to determine an appropriate network architecture. Therefore, we take a pragmatic approach by choosing a single hidden layer and the hyperbolic tangent as an activation function: Hornik et al. (1989) and Cybenko (1989) proved universal approximator properties for single hidden layer feedforward neural networks with sigmoid activations.
The number of hidden neurons is chosen based on a rule-of-thumb for single hidden layer neural networks derived by Wong (1991), i.e., twice the number of input nodes.
As stated before, in a neural network, W and b are the parameters of interest. Their estimates, however, are solutions of a nonconvex optimization problem. One of the most widely used algorithms for training neural networks are variants of stochastic gradient descent (SGD). SGD iteratively approaches the (local) minimum of the l 2 objective function by approximating its gradient on random subsets of the training data and adapting the parameters accordingly. We use the popular Adaptive Moment estimation (Adam) by Kingma and Ba (2014). A critical tuning parameter in the learning algorithm is the learning rate lr, which controls the step size of each parameter update. First, to mitigate the risk of overfitting, we impose an l 2 penalty (weight decay), which shrinks the weights between neurons toward zero without imposing exact zero. Second, we stop the training of our network early if the error on the validation set does not improve for ten consecutive iterations. Third, we build ensembles of size ten and average their outcome, as the parameter initialization and stochastic training introduce sources of randomness. Hansen and Salamon (1990) provide evidence that ensembles not only stabilize results but also improve generalization. In our empirical application, we determine the learning rate lr and l 2 penalty strength using grid search (see Table 11 in the appendix for the ranges) on the validation data.

Estimation, tuning and testing
We provide all models with the same data, i.e., a ten-year rolling period of monthly data, and predict returns for the months of the subsequent year for every firm. Stated differently, we use only past information to predict future (out-of-sample) returns. Unlike the OLS model, the performance of most machine learning algorithms depends heavily on the choice of hyperparameters. A commonly used strategy is to divide the in-sample data into disjoint subsets (training and validation) to mimic a pseudo out-of-sample setting. The hyperparameters are then chosen to maximize the performance on the validation set. Therefore, we split each in-sample window into seven years of training and three years of validation and perform a grid search to find an appropriate set of hyperparameters. 11 Fig. 4 This figure illustrates the rolling estimation scheme employed in this study 11 By retaining the temporal ordering of the data, we mimic a realistic prediction scenario. Figure 4 illustrates our sample splitting scheme. More precisely, we estimate the respective model with a set of hyperparameters from the parameter grid (see Table 11) on the training set. Then, we generate predictions for the validation set and evaluate the performance. After determining the best set of hyperparameters, we re-estimate the model on the full ten-year period of training and validation data to incorporate the most recent information. However, data for actual forecasting never enter the model during the estimation procedure.
To assess the predictive performance of our models, we evaluate them in three ways. We first deploy a statistical analysis to quantify the models' predictive accuracy, i.e., how well predicted returns reflect realized returns. A common metric to do so is the out-of-sample (pseudo) R 2 , which is defined as: where r i,t denotes the model's prediction and r i,t denotes the historical average for asset i at time t. This definition measures the proportional reduction in forecasting errors of the predictive model relative to the historical average. Thus, R 2 > 0 indicates that the predictive model is more accurate than the historical average in terms of forecasting errors. Note that unlike the conventional R 2 , this formulation can also be negative. While the historical average is appropriate in terms of a time-series analysis, it leads to an inflated R 2 in the case of a panel of stock returns. Therefore, we set r to zero, which yields a stricter threshold for proper forecasting models, and we follow this suggestion. 12 R 2 represents a measure that provides information about the quality of forecasts from a predictive model compared to that from a constant model. However, if we want to compare among different models, then examining R 2 alone does not allow us to infer whether one model is significantly superior to another. Therefore, we perform the popular (Diebold and Mariano 1995) (DM) test, which is designed to compare the predictive accuracy of two models. More precisely, it tests for the null of equal prediction errors H 0 ∶ E[d t ] = 0 , where d t is the loss differential between two forecasts at time t. The DM test is applicable in the case of serial correlation. However, stock returns are prone to be strongly correlated in the cross-section while only being weakly correlated across time. To alleviate this issue, we average loss differential d t across assets, thus resulting in a time-series with weak serial correlation. The test statistic is defined as follows: Second, we analyze our forecasts in terms of economic gains. Leitch and Tanner (1991) argue that traditional statistical metrics are only loosely related to a forecast's achievable profit. Thus, we consider forecast-based portfolio sorts to assess the economic performance of our models. At the end of each month in the testing period (January 2001 to December 2019), we assign all available stocks with equal weight to ten portfolios using cross-sectional decile breakpoints based on the return prediction for the next month. In addition, we construct a long-short portfolio reflecting a zero-net investment by buying the top decile portfolio and selling the bottom decile portfolio to capture the aggregate effect. Third, we use the entire cross-section of stocks to estimate our models. Stocks of small firms, however, follow different dynamics in terms of their covariances, liquidity, and expected returns (cf. Kelly et al. 2019). More specifically, Hou et al. (2018) show that much of the predictability detected in the previous literature is found in the smallest stocks. While microcaps represent only approximately 3% of the total market capitalization of the NYSE/Amex/NASDAQ universe, they account for approximately 60% of the number of stocks. Due to high costs in trading these stocks (e.g., Israel and Moskowitz 2013), predictability (in terms of economic performance) in small stocks is more apparent than it is real. To investigate the degree to which the performance depends on firm size, we successively remove stocks below the respective deciles of the cross-sectional market capitalization.

Data
Previous studies have documented the existence of a U.S. bias in research in economics (e.g., Das et al. 2013) and finance (e.g., Karolyi 2016). In this regard, recent literature (e.g., Harvey et al. 2016) argues for a higher statistical threshold (t-statistic of 3 rather than 2) in empirical studies on the U.S. stock market (22) DM =d 1,2d ∼ N(0, 1), with as well as a stronger focus on other asset classes and stock markets. We follow this demand by investigating the European stock market. More specifically, we use all stocks (approximately 12,000) from the countries in the MSCI Developed Markets Europe Index 13 . We retrieve monthly data from January 1990 to December 2019 from Datastream and Worldscope. The data retrieval starts with the identification of common equity stocks using Datastream's constituent lists (research lists, Worldscope lists and dead lists). Specifically, for every country in the MSCI Developed Europe Index, we use its constituent lists and eliminate any duplicates. As a result, we obtain one remaining list for every country. To each of these lists, we apply generic as well as country-specific screens to eliminate noncommon equity stocks. Moreover, to all stocks remaining from this screening procedure, we then apply dynamic screens to account for data errors. The procedure described above is established in the academic literature and described extensively (e.g., in Ince and Porter 2006;Campbell et al. 2010;Griffin et al. 2010Griffin et al. , 2011Karolyi et al. 2012;Annaert et al. 2013;Fong et al. 2017). Our predictors are beta, market capitalization, the book-to-market equity ratio, momentum, investment, and operating profitability. These predictors are calculated Table 1 Descriptive statistics and characteristics of quantile portfolios formed from monthly sorts of stocks on predictor variables Panel A reports time-series averages of the cross-sectional mean (Avg), standard deviation (Std), and sample size (N) for all stocks from the countries in the MSCI Developed Markets Europe on Datastream and Worldscope from January 1990 to December 2019. The sample includes all stocks with valid monthly values for return (ret), the characteristics from the Fama and French (1993) three-factor model, the Carhart (1997) four-factor model and the Fama and French (2015) five-factor model: beta (beta), log market capitalization (size, in millions ), book-to-market equity (be2me), momentum (mom) operating profitability (op), and investment (inv). Panel B reports returns of quintile portfolios formed from monthly sorts of stocks on predictor variables. The five portfolios are defined as follows: For each month from 1990 to 2019, we assign all stocks from the countries in the MSCI Developed Markets Europe to five portfolios using quintile breakpoints from the cross-section of the end-of-month predictor variable values and calculate the equal-weighted monthly percent returns for the next month. To analyze the aggregate effect of a predictor variable on stock returns, we take a long position in portfolio 5 and a short position in portfolio 1. The numbers in columns 1 to 5 report the average of the equally weighted returns. pVal(5-1) reports the p-value (in percent) from a t-test of the null hypothesis of zero average from the Datastream and Worldscope variables (datatype in brackets) total return index (RI), market equity (MV), book equity (WC03501), total asset growth (WC08621) and operating income (WC01250). We assume that accounting data are available to the public six months after the fiscal year ends to avoid look-ahead bias. Beta (beta) is the exposure of a stock to the market factor as derived from the capital asset pricing model (CAPM). According to the CAPM, stock returns are expected to increase as the stock's beta increases. For each month, we estimate the beta of a stock based on the daily stock returns from the previous 12 months to avoid any look-ahead bias. Following the established literature, returns are expected to exhibit an inverse relationship with firms' market capitalization (size) (Banz 1981) and a positive association with the book-to-market-equity ratio (be2me) (Rosenberg et al. 1985). Momentum (mom) is defined as the total return over the prior 12 months, excluding the last month. Prior literature finds evidence of a positive relation between momentum and stock returns (Jegadeesh and Titman 1993;Carhart 1997). Investment (inv) is growth in total assets, and operating profitability (op) is operating income scaled by the book value of equity. Following (Titman et al. 2004) and (Novy-Marx 2013), stock returns are expected to increase as investment decreases and operating profitability increases, respectively. In addition to exposure to the market factor (beta), market capitalization and the book-to-market-equity ratio are proxies for sensitivities to common unobservable risk factors in the Fama and French (1993) three-factor model. Furthermore, momentum is a surrogate for the sensitivity to a common unobservable risk factor in the Carhart (1997) four-factor model. Lastly, investment and operating profitability proxy for sensitivities to common unobservable risk factors in the Fama and French (2015) five-factor model. We report time-series averages of cross-sectional statistics, specifically, the mean, standard deviation, and number of observations, of the monthly predictor variables and stock returns in Panel A of Table 1. For comparability, we include only firmmonth observations with nonmissing returns in our analysis, resulting in an average sample size of 1,755 firms per month. We find that the average monthly stock return is 0.88% and the average cross-sectional standard deviation in monthly stock returns is 8.86%. To investigate the predictive ability of an individual variable, we further assign all stocks to five portfolios for each month from 1990 to 2019 using quintile breakpoints from the cross-section of monthly predictor variable values and calculate the equal-weighted monthly percent returns for the next month. Panel B of Table 1 reports the average monthly returns of the five portfolios. To analyze the aggregate effect of the predictor variable on stock returns, we take a long position in portfolio 5 and a short position in portfolio 1. Column "pVal (5-1)" reports the p-value (as a percentage) from a t-test against the null hypothesis of zero average return of the "5-1" portfolio. As expected, we find a positive relation between stock returns and momentum, the book-to-market-equity ratio and operating profitability and a negative relation between stock returns and investment for the European stock market. Contradicting the CAPM, we find a negative relation between stock returns and beta, which is consistent with empirical evidence provided by previous studies (e.g., Frazzini and Pedersen 2014). Similarly, we find no relationship between firm size and stock returns, which is also consistent with recent empirical findings (e.g., van Dijk 2011).
In our empirical study, we further account for missing values, different scales and extreme observations. To address missing values and maintain the number of return observations, we replace missing characteristics with their cross-sectional median if the respective returns are available. However, in contrast to much of the current cross-sectional predictability literature, we have observations on each characteristic in every month of our sample and do not fill entire cross-sections with values of zero. Different scales and extreme observations are dealt with by cross-sectionally rank normalizing firm characteristics. Table 2 presents monthly out-of-sample R 2 for the employed methods and different levels of market liquidity. Column 1 displays the levels of market liquidity established by removing stocks with the prior month's market values below the cross-sectional decile breakpoints corresponding to q. Columns 2 to 9 correspond to the R 2 values as a percentage achieved by each individual model. Positive values indicate better predictive accuracy compared to a naive constant zero forecast.

Predictive accuracy
For the full cross-section of stocks (first row), OLS generates an R 2 of 0.43%. After dropping firms with market values below cross-sectional thresholds in increments of ten percent, the predictive power gradually decreases to − 0.32% for the largest stocks. Penalized linear methods, i.e., Lasso, Ridge and Enet, perform on a very similar, yet slightly worse, level than OLS. This result is not surprising, as we are using only a small set of well-established, and essentially uncorrelated firm characteristics. The predictive accuracy of SVR is consistently below that of OLS. A potential source of this underperformance is that distance-based algorithms such  Table 1 indicates that this may be the case for size, as the return differences across portfolios are only of minuscule amplitude. Additionally, SVRs are highly sensitive to the choice of hyperparameters, as Probst et al. (2019) identify a much higher tunability of SVR compared to that of RF or GBRT. RF, GBRT and NN, however, exhibit a remarkable improvement over OLS for the broad market (0.58%, 0.60% and 0.67%, respectively). With respect to persistence, they exhibit a diminishing pattern in predictability toward larger stocks similar to that of OLS. This pattern is in line with Hou et al. (2018) showing that much of the predictability detected in the previous literature is attributable to the smallest stocks. GBRT and NN, however, are able to achieve both (1) consistently greater R 2 than OLS and (2) consistently positive values for R 2 . The differences range between 0.46 and 0.56 percentage points for GBRT and NN, respectively, indicating that the predictive accuracy can be improved by (choosing the appropriate) machine learning methods. Table 3 presents the p values for the DM test on the full sample of stocks (Panel A) and for the top decile of market capitalization (Panel B). We display the lower triangle only, in which row models correspond to forecast errors ê 1 in equation (22); thus, significant p-values indicate that the row model outperforms the column model. 14

3
The ten portfolios are defined as follows: At every month in our testing sample from January 2000 to December 2019, we sort all stocks by their expected return for the next month and assign them to the corresponding decile portfolio. To analyze the aggregate effect of a certain model on stock returns, we take a long position in portfolio 10 (High) and a short position in portfolio 1 (Low). Avg (Std, Shp) is the average (standard deviation, Sharpe ratio) of equally weighted returns. Pred is the average expected portfolio return of a model. pVal(H-L) reports the p value (in percent) from a t-test against the null hypothesis of zero average return (column Avg). pVal (MR) reports the p value (in percent) from a one-tailed monotonic relationship test in averages (Avg), standard deviations (Std), and Sharpe ratios (Shp) of the ten portfolios. The direction of the test depends on the sign of the difference between the values in the High and Low portfolios The DM test provides support for the empirical results derived from the R 2 values presented in Table 2. The first conclusion from Panel A of Table 3 is that predictions from penalized linear methods and SVR do not provide statistically significant improvements over OLS. More specifically, SVR and Enet even exhibit a significantly inferior predictive ability at the 5% level. Tree-based methods and NN produce statistically significant improvements over linear methods and SVR at least at the 5% level. For any pairwise comparison among tree-based methods and NN, the null of equal predictive accuracy cannot be rejected. For Panel B of Table 3, we find that the p-values of penalized linear methods and SVR generally decrease compared to OLS, indicating that they experience relative improvement when predicting returns of large cap stocks. Conversely, the superiority of RF over linear models and SVR disappears, indicating that its predictive performance is driven by picking up small-scale inefficiencies present in microcap dynamics. For GBRT and NN, the null hypothesis against not only OLS but also against all other models can still be rejected, indicating that their outperformance is present throughout all market segments.

Economic gains
From an investor's perspective, statistical performance is only of subordinate importance. In addition, Leitch and Tanner (1991) argue that statistical metrics are only loosely related to a forecast's achievable profit. Next, we assess the economic performance of our models by portfolio sorts. At the end of each month, we generate return forecasts for the next month with each model. After that, we assign stocks to one of ten portfolios based on decile breakpoints of the forecasts. Finally, to capture the aggregate effect, we construct a zero-net-investment portfolio that simultaneously buys (sells) the stocks with the highest (lowest) expected returns, i.e., decile 10 and 1, respectively. Table 4 depicts the results using the full sample. Note that Fig. 5 This figure shows cumulative logarithmic returns of prediction-based portfolios for different models and the market (Mkt). The strategy is implemented as follows: In every month of our testing sample from January 2000 to December 2019, we sort all stocks by their expected return for the next month and take a long position in stocks corresponding to the top decile of expected returns all models share a monotonically increasing relationship in realized returns (column 'Avg') and in the Sharpe ratio (column 'Shp'), which is statistically significant at the 1% level after testing for monotonicity between portfolios 1 to 10 using the Patton and Timmermann (2010) test (row 'pVal(MR)'). From an investor's perspective, this property is highly desirable, as higher returns are, thus, relatively cheaper in terms of risk than low returns. Moreover, we see significantly positive spreads (row 'H-L') at the 1% significance level for all models, indicating that, generally, all models are effective in capturing the cross-sectional dispersion of returns (row 'pVal(H-L)'). Upon examining the return spreads, we find penalized linear methods and SVR to be on eye-level with OLS, with Lasso, Enet and SVR resulting in slightly larger spreads. NN and tree-based methods outperform OLS by a quite substantial margin, with a surplus in return spreads of up to 64 basis points for RF. The zero-netinvestment strategy for the predictions of RF yields an extraordinary return of 2.38% per month (32.61% annualized), while GBRTs provide the best risk-return-tradeoff with a Sharpe ratio of 0.56 per month (1.94 annualized). Even if SVR exhibits a negligible improvement in terms of return spread, it has a unique characteristic from a risk perspective, as it is the only model with a significantly monotonic increasing relationship in returns while having a significantly monotonic decreasing relationship with risk, i.e., higher returns are associated with lower risk, which is extremely compelling from an investor's perspective. Figure 5 presents cumulative performance plots corresponding to the long portfolios (portfolio 'H') of Table 4 along with the cumulative market return. The market return for Europe is retrieved from Kenneth French's website. 15 Note that all models outperform the market index and Table 5 This table reports the  return of a prediction-based  zero-investment strategy formed  from different models The strategy is implemented as follows: At every month in our testing sample from January 2000 to December 2019, we sort all stocks bytheir expected return for the next month and take a long position in stocks corresponding to the top decile and a short position in stocks corresponding to the bottom decile of predictions. We repeat this procedure for several levels of liquidity by removing stocks with market capitalization (Size) less than the cross-sectional quantile q. Columns 2-9 then show realized returns of a zero-investment strategy based on predictions formed by the model shown in the column header

3
Machine learning techniques for cross-sectional equity… that tree-based methods dominate all other models by a large margin. Consistent with Leitch and Tanner (1991), we find that the statistical performance reported in the previous subsection does not necessarily translate into economic performance. Table 5 presents return spreads analogous to row 'H-L' in Table 4 for the different levels of market capitalization. In Table 2, the predictive accuracy gradually decreases for larger stocks. For the return spreads we generally find a similar pattern; however, most models exhibit a peak (economic) performance after cutting off the smallest 20%. This level of market liquidity is often considered free of microcaps (e.g., Lewellen 2015). A possible interpretation might be that the dynamics of those stocks are neither very persistent nor predictable. Tree-based methods, on the other hand, experience a sharp drop in economic performance after cutting off the bottom decile of market capitalization, again indicating that their outperformance, to a large extent, relies on capturing microcap dynamics. Further increasing levels of market capitalization are accompanied by decreasing return spreads, a pattern consistent with the concept of market efficiency, assuming that the information environment for large firms is richer and more efficient than that for small firms. Penalized linear methods are again close to OLS. The outperformance of tree-based methods is mostly present when considering the full sample. While outperformance vanishes for RF afterward, GBRTs maintain a slight surplus over OLS toward reducing the sample to the largest stocks. NN and SVR consistently achieve higher return spreads than OLS across all market segments. While the results for NN are in line with the previous findings, the results for SVR come as quite of a surprise given its low predictive accuracy (cf. Table 2). Specifically, after cutting off the bottom decile of market capitalization, it exhibits the highest returns for the zero-net-investment strategy overall with remarkable stability. The evaluation thus far provides evidence that machine learning can lead to better results than the traditional OLS model, not only statistically but also economically. 1 3 6 Additional results

Transaction costs
Transaction costs impair the profitability of any investment. Higher turnover rates of investment strategies can easily nullify their overperformance in terms of raw returns. Therefore, we aim to examine whether the stronger performance of machine learning models compared to OLS is driven by substantially higher turnovers. However, realistic transaction costs are not always available, cumbersome to use or expensive to acquire (cf. Lesmond et al. 1999). Additionally, Collins and Fabozzi Monthly predictions are formed for our testing sample from January 2000 to December 2019 for different models and levels of market liquidity in terms of market capitalization. Columns correspond to the models. Rows correspond to different levels of market liquidity, which we establish by removing stocks with market capitalization (Size) less than the cross-sectional quantile q. The results are presented separately for bull and bear markets, i.e., when the trailing cumulative 12-month excess return on the market portfolio in months t-12 to t-1 is positive and negative, respectively where w i,t is the weight of stock i in the portfolio at time t. Table 6 reports the average monthly turnover in percent for all models and levels of market liquidity. In line with Gu et al. (2020), we find that turnover rates for tree-based methods are generally approximately 20% higher than those for OLS. As the outperformance of tree-based methods appeared to be meaningful only when including microcaps, we conclude that their potential is limited by the higher turnover since microcaps are inherently more difficult and costlier to trade. Penalized linear methods, SVR and NN, on the other hand, have turnover rates comparable to those of OLS. Especially on the largest stocks, where transaction costs are generally smaller, SVR and NN still achieve a spread of up to 20 basis points above OLS. We conclude that the marginally higher turnovers are unlikely to explain the pronounced outperformance or, to put it differently, outperformance is unlikely to disappear in the light of transaction costs.

Performance within subperiods
Inspired by a number of studies suggesting that returns may differ across market states (e.g., Cooper et al. 2004;Wang and Xu 2015), we evaluate the performance of the models separately for bull and bear markets. Following (Zaremba et al. 2020), we define these states as subperiods of positive (bull market) and negative (bear market) total excess return on the market portfolio during the last 12 months. The results for the predictive accuracy and economic gains from the individual models are reported in Tables 7 and 8, respectively. The first conclusion from Table 7 is that return predictability is generally higher in bull markets (Panel A) than in bear markets (Panel B). While we encounter decreasing predictability with firm size in the full panel (cf. Table 2), we find that in bull markets, the predictability in terms of R 2 is greater when excluding very small stocks. For bear markets, the R 2 values are generally negative and monotonically decreasing in firm size. Among the machine learning methods, NN produces the highest R 2 and shows the highest outperformance compared to OLS in both bull and bear markets.
With regard to the economic gains reported in Table 8, we find that differences are visible, although not dramatic, during times of a general upward trend (Panel A). In 16 The turnover of a zero-net-investment portfolio falls between 0% (no turnover at all) and 200% (a full reallocation of both the high and low portfolios). To put it in perspective, a strategy based only on mom generates a turnover of approximately 160% over the sample period.
bear markets, however, the differences become more apparent. In particular, SVR and NN exhibit consistently higher returns compared to OLS. Even for the 10% largest stocks, the difference of approximately 40 basis points is quite staggering. The results presented here suggest that machine learning does not add exceptional value over OLS during periods of a general upward trend. However, in downward markets-which is when investors usually lose money-these models generate considerable value.

Forecast combinations
Lastly, we consider forecast combinations. In general, forecast combination acts as a tool for risk diversification across individual models while, in many cases, simultaneously improving performance (Timmermann 2006). Some of the models presented herein (i.e., RF, GBRT, and NN) utilize the benefits of building ensembles as an integral part of the model definition itself. Several studies have highlighted the benefits of forecast combinations and proposed different methods with respect to choosing optimal model weights (e.g., Bates and Granger 1969;Winkler and Makridakis 1983). However, here, we use the forecast combinations as a simple proxy to test whether linear methods contain information complementary to that of the machine learning models. To do this, we form combinations from 1) all models and 2) only nonlinear methods (SVR, RF, GBRT, NN). If there is additional value, we would expect 1), i.e., the combination of all models, to provide superior predictions. Therefore, we restrict ourselves to simple combination schemes that often show better results than more sophisticated methods involving additional parameter estimates (Timmermann 2006). The forecast combination takes the general form: with where w h t is the weight of model h at time t and H is the total number of models. MSE refers to the in-sample mean-squared error, and R refers to the ranking of MSE (ascending order) among all models. Table 9 reports results on the predictive accuracy of forecast combinations. The left-hand panel considers all models in the combination, while the right-hand panel excludes linear methods. We find that the predictive accuracy is consistently higher when relying only on nonlinear methods, irrespective of the concrete combination scheme used. With respect to the weighting-schemes, inverse rankweighting appears to provide slightly worse results, while the other two are on equal footing. Most importantly, we exhibit two key take-aways: First, all combinations outperform the single use of OLS. Second, and in contrast, none of the combinations is superior to the strongest machine learning model, NN. However, the best model may not be known ex ante; thus, the combination of various models might still reduce model selection risk. Table 10 reports return spreads based on the forecast combinations. The results generally support the statistical findings. However, we find that the economic performance improves over the best individual model, SVR, up to Size > 0.7 . Overall, the performance of the conventional OLS approach can be improved when combined with models from the machine learning literature. Moreover, our results raise the Table 8 This table reports the return of a prediction-based zero-investment strategy from the models The strategy is implemented as follows: At every month in our testing sample from January 2000 to December 2019, we sort all stocks by their expected return for the next month and take a long position in stocks corresponding to the top decile and a short position in stocks corresponding to the bottom decile of predictions. We repeat this procedure for several levels of liquidity by removing stocks with market capitalization (Size) less than the cross-sectional quantile q. Columns 2-9 then show the realized returns of a zero-investment strategy based on predictions formed by the model shown in the column header. The results are presented separately for bull and bear markets, i.e., when the trailing cumulative 12-month excess return on the market portfolio in months t-12 to t-1 is positive and negative, respectively The strategy is implemented as follows: At every month in our testing sample from January 2000 to December 2019, we sort all stocks by their expected return for the next month and take a long position in stocks corresponding to the top decile and a short position in stocks corresponding to the bottom decile of predictions. We repeat this procedure for several levels of liquidity by removing stocks with market capitalization less than the cross-sectional quantile q. Columns 2-4 and 5-7 then show the realized returns of a zero-investment strategy based on predictions formed by the forecast combination shown in the header question of whether the commonly applied purely linear model is even necessary for building expectations about future stock returns, as the combination of nonlinear methods provides superior performance both statistically and economically while individual model selection risk is reduced.

Conclusion
In this paper, we provide an eye-level comparison of the commonly applied linear regression approach in stock return predictions (Lewellen 2015) and machine learning methods, i.e., penalized linear methods, support vector regression, tree-based methods, and neural networks. We find that the nonlinear machine learning methods in particular can provide both statistically and economically meaningful performance gains over the conventional OLS approach, thereby revealing that machine learning methods are not more vulnerable than OLS to model breakdowns. 17 In our analysis, GBRTs exhibit the highest outperformance, of approximately 0.6% per month, based on the entire crosssection of European stocks, which decreases to 0.1% per month when only stocks of the largest ten percent of firms are considered. Overall, we find that NN provides the most reliable predictions in terms of both statistical and economic performance, as its superior performance is robust over all market segments. In addition, we find that the economic performance gains are not attributable to substantially higher turnover of stocks. Analyzing bull and bear markets separately, we find that the value of NN is mostly generated in downward trending markets when investors generally lose money. As forecast combinations have frequently been found to produce better forecasts than methods based on the ex ante best individual forecasting models (Timmermann 2006), we investigate the performance of forecast combinations and find that the combination of only nonlinear models exhibits the performance of the best individual model, which is good news, as it allows investors to reduce model selection risk by diversification. We further find that all forecast combinations in this study outperform the sole use of OLS.
Our results are derived based on a small set of well-established predictor variables borrowed from the asset pricing literature. However, as the performance of the (nonlinear) machine learning methods is generally higher than that of OLS, we strongly recommend a more intensive use of machine learning models to predict stock returns in future research and practice. Beyond our considerations, future research should also examine the use of machine learning techniques with respect to long-term stock returns (as considered in Kyriakou et al. 2021;Mammen et al. 2019) more closely.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. ∈ 10 −4 , 10 −1 ∈ 10 −4 , 10 −1 = 0.5 ∈ 10 −4 , 10 −1 ∈ 10 −1 , 10 3 C ∈ 10 −1 , 10 3 ∈ 10 −3 , 10 −1 act. = tanh lr ∈ 10 −3 , 10 −2 ∈ 10 −5 , 10 −3