There is little doubt the research landscape is rapidly changing. Many researchers are pushing for greater transparency and open science practices in general (Munafò et al., 2017; Nosek, Ebersole, DeHaven, & Mellor, 2018). There has been a corresponding call for more meaningful statistics (Cumming, 2014; Kruschke & Liddell, 2018), cumulative approaches to research (Cumming, 2014; Schmidt & Oh, 2016), and greater reliance on data visualization (Fife, 2020; Fife & Rodgers, 2021; Tay, Parrigon, Huang, & LeBreton, 2016). While some push for stricter standards of research practices (Nelson, Simmons, & Simonsohn, 2018), others (e.g., Fife & Rodgers, 2021) advocate that we broaden our perspective on what constitutes scientific research to include greater use of exploratory data analysis (EDA).

EDA is a data analytic philosophy that emphasizes “listening” to one’s data. Tukey (1986), the father of EDA and a vocal advocate for it, suggested that confirmatory research is akin to a prosecutor who places hypotheses on trial. EDA, on the other hand, is more like a detective, hunting for clues and letting the evidence speak for itself. As such, while confirmatory research is hypothesis-driven, EDA tends to be hypothesis-generating.

There are several reasons to expand the use of EDA. First, few applied researchers are actually ready to conduct confirmatory research, since it requires a detailed analysis plan that fully anticipates all analytic strategies without any deviation. In the words of McArdle, “… it can be said that exploratory analyses predominate our actual research activities. To be more extreme, we can assert there is actually no such thing as a true confirmatory analysis of data, nor should there be” (McArdle, 2012, p. 405).

Second, while most researchers have EDA intentions, some might mistakenly utilize CDA tools (Fife & Rodgers, 2021). Fife and Rodgers (2021) distinguished between “tools” and “intentions.” There are many tools that were developed under the EDA paradigm that may have a place in CDA. For example, residual analyses were developed under the EDA paradigm (as a way to look for patterns the researcher failed to model), but they can be used in CDA analyses as well (e.g., to verify/demonstrate that the researcher met assumptions). On the other hand, CDA tools (e.g., hypothesis tests) are probably not appropriate for analyses with EDA intentions, since their probability distributions rely on many assumptions not likely to be met when doing EDA (e.g., that the sample size was planned in advance, that multiple tests are corrected for multiple comparisons).

For example, one frequently misused and abused CDA tool is multiple regression. Researchers routinely perform multiple tests of significance with dozens of variables to identify a hypothesis that is supported by the data. However, the p values associated with these significance tests have no probabilistic meaning without protections in place (e.g., corrections for multiple comparisons, adherence to distributional assumptions; Cramer et al., 2016; Fife & Rodgers, 2021). Even with these protections, multiple regression is a poor tool for exploratory research, particularly when many variables are involved, mostly due to its tendency to capitalize on chance (see, e.g., McNeish, 2015).

As the research landscape evolves, we hope it becomes more friendly toward EDA analyses, and more receptive to EDA tools as well. One such tool is random forest (RF), a machine learning algorithm well equipped to handle the shortcomings of multiple regression. RF has several advantages over traditional approaches. First, RF models reduce (but do not eliminate; Gashler, Giraud-Carrier, & Martinez, 2008; Segal, 2004) overfitting (Breiman, 2001). Second, these models (and the classification and decision trees upon which they are based) are nonparametric, providing more flexibility when statistical assumptions are untenable (Malley, Kruppa, Dasgupta, Malley, & Ziegler, 2012; Steinberg & Colla, 1995). Third, RF natively detects interaction and nonlinear effects without requiring the user to explicitly model these relationships (Ryo & Rillig, 2017; Touw et al., 2013). Finally, RF can be used in situations where the number of variables far exceeds the number of subjects (Breiman, 2001; Matsuki, Kuperman, & Van Dyke, 2016).

In this paper, we hope to accomplish multiple goals. First, we describe how RF works and why it offers several advantages over traditional statistical models. Second, we discuss its strengths and limitations and address common misconceptions. Finally, we conclude by discussing several novel and/or underutilized applications of RF in psychology, in hopes of guiding researchers in how best to utilize this statistical tool.

How random forest algorithms work

In this section, we intentionally provide a nontechnical overview of the RF procedure in an attempt to make RF less mystical and/or daunting. In addition, we have found that one need not understand the technical nuances of RF in order to capitalize on its strengths and use it to make interesting discoveries. For those interested in a more technical treatment of RF, see Strobl, Malley, and Tutz (2009), as well as Chapter 8 of James, Witten, Hastie, and Tibshirani (2013).

Fig. 1 Example of a decision tree. This fictitious decision tree attempts to predict whether someone will use random forest based on their statistics anxiety and years of programming. Rectangles are called “nodes.” The dotted circles are the predictions of the model, with the sample size of those meeting each condition listed underneath the fitted predictions

Decision trees

The basic unit of analysis for RF models is a decision tree. Figure 1 shows an example of a decision tree, which is aimed at determining whether an individual will choose to use an RF model for their personal research. The top box, or “node,” asks whether the individual has statistics anxiety. For those who do, we predict they will not use RF. Model predictions are indicated by dotted circles, with the number of individuals who meet those conditions written below the circles (e.g., 44 individuals reported having statistics anxiety). The second node asks how many years of experience the individual has in computer programming. For those with more than 3 years of experience, the model predicts they will use RF for their research. For those with less, the model predicts they will not. Nodes early in the decision tree that branch into other nodes (e.g., statistics anxiety) are called “branching nodes” or “internal nodes,” while nodes that do not branch into other nodes (e.g., years of programming) are called “terminal nodes” or “leaf nodes.” (Note: the predictions, indicated by circles, are not considered nodes.)

Decision trees have been used for decades to model statistical relationships. Given a dataset, a computer algorithm decides where in the tree the nodes fall and determines the optimal cutoff for each split (e.g., 3 versus 8 years of programming experience). From this single decision tree, the computer generates predictions. In this example, those with more than 3 years of experience in computer programming who do not report anxiety about statistics will use RF, while those who do not meet both conditions will not. As with any other statistical model, we can identify how well we classify individuals. Specifically, when we speak of classification accuracy at the node level, we call it “node purity.”
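To make this concrete, below is a minimal sketch of fitting a decision tree like the one in Fig. 1 using the rpart package in R. The data frame, variable names, and data-generating rule are hypothetical, invented purely for illustration.

```r
# Minimal sketch of a decision tree like Fig. 1 (hypothetical data)
library(rpart)

set.seed(1)
d <- data.frame(
  anxiety = factor(rep(c("yes", "no"), each = 50)),
  years_programming = c(round(runif(50, 0, 4)), round(runif(50, 0, 10)))
)
d$uses_rf <- factor(ifelse(d$anxiety == "no" & d$years_programming > 3, "yes", "no"))

# method = "class" for categorical outcomes; use method = "anova" for
# continuous outcomes such as income (Fig. 2)
tree <- rpart(uses_rf ~ anxiety + years_programming, data = d, method = "class")
tree  # prints the splits, the chosen cutoffs, and the terminal-node predictions
```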

Fig. 2 Another example of a decision tree, though this example includes a continuous outcome variable (income). This fictitious decision tree attempts to predict someone’s income from their industry (entertainment, science and technology, and service) and their self-efficacy score. As before, dotted circles represent the predictions and the numbers below the circles indicate how many individuals matched those conditions

This example can be extended to continuous outcomes. Suppose one wanted to predict an individual’s income based on the following variables: a) their type of industry (science and technology, service, or entertainment), and b) their level of self-efficacy. As before, a computer can optimally compute the cutoffs for each variable and the levels within each variable. However, with continuous outcomes, the predictions for each individual are no longer binary; rather, individuals are assigned a value as a prediction (e.g., an income of $63,690). Prediction accuracy might then be evaluated using the sum of squared residuals, as is done in a simple regression model. Also similar to regression, the fit of the model can be visualized with scatterplots. Figure 2 shows the decision tree of this model, while Fig. 3 shows the corresponding fit of the model overlaid on a scatterplot. Notice that the fit of the model is nonlinear, and that the decision tree allows the fit of the model to “bend” with the data, as appropriate. Likewise, notice how the different prediction lines suggest an interaction effect. Specifically, income increases with self-efficacy for the science and technology and service industries, but not for entertainment. Although we never explicitly asked the decision tree to model an interaction, it was detected through the analysis; the tree simply and innately generated different predictions for different combinations of the variables.

Fig. 3 This plot shows raw data and predictions from the decision tree shown in Fig. 2. The lines visually reflect the predictions from the decision tree. These lines are nonlinear and suggest an interaction, even though the model never explicitly modeled nonlinear or interaction effects

Random forest

RF utilizes multiple decision trees, often in the hundreds, if not thousands, hence the name random forest. Each tree in the forest will utilize a different and randomly selected set of observations and predictor variables, which is where the term “random” comes from. RF will randomly sample participants. By default, it samples with replacement (i.e., the sample is “bootstrapped”), though one can easily sample without replacement. Typically, the sample size is set to 67% of the entire sample, and this 67% is used to calibrate the decision tree; although the algorithm defaults to 67%, this can be modified to include more or less of the sample. The remaining 33% are reserved for cross-validation in what is called the “out of bag” (OOB) sample (Breiman, 2001). Each OOB observation is passed through each tree to generate predictions, and the prediction accuracy of the model is then evaluated based on the OOB sample. RF has repeatedly been shown to outperform regression models (both logistic and standard regression; Couronné, Probst, & Boulesteix, 2018; Kirasich, Smith, & Sadler, 2018; Muchlinski, Siroky, He, & Kocher, 2016).
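The sketch below shows how this resampling scheme appears in the randomForest package for R, continuing the hypothetical data frame `d` from the earlier decision-tree sketch; the commented line shows how the per-tree sample size could be changed.

```r
# Sketch of RF's resampling scheme using the randomForest package
# (continuing the hypothetical data frame `d` from the earlier sketch)
library(randomForest)

fit <- randomForest(
  uses_rf ~ anxiety + years_programming,
  data = d,
  ntree = 500,     # number of trees in the forest
  replace = TRUE   # bootstrap rows (the default); FALSE samples without replacement
)

# the per-tree sample size can be modified, e.g.:
# randomForest(uses_rf ~ ., data = d, sampsize = floor(0.67 * nrow(d)))

# OOB error: each observation is predicted only by trees that never saw it
fit$err.rate[fit$ntree, "OOB"]
```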

In addition to randomly sampling individuals, RF will also randomly sample variables. By default, RF samples \(\sqrt{m}\) variables when the outcome is categorical and m/3 when the outcome is numeric, where m is the total number of variables. For example, suppose one has five predictor variables of income: industry, self-efficacy, years of education, parental socioeconomic status (SES), and geographic region. In this case, m, the total number of variables, equals five and the algorithm will sample \(\sqrt{5} \approx 2\) variables. So, the first tree might sample self-efficacy and parental SES, while the second tree might sample parental SES and geographic region. (This random sampling of variables technically happens at the node level, not at the tree level.) The number of variables selected can also be modified to sample more or fewer variables.
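In the randomForest package, this setting is controlled by the mtry argument, as in the following sketch; the income dataset below is hypothetical, loosely mirroring the five-predictor example above.

```r
# Hypothetical income data with m = 5 predictors
library(randomForest)

set.seed(2)
n <- 200
d2 <- data.frame(
  industry      = factor(sample(c("entertainment", "science_tech", "service"), n, replace = TRUE)),
  self_efficacy = rnorm(n),
  years_educ    = rnorm(n),
  parental_ses  = rnorm(n),
  region        = factor(sample(c("north", "south", "east", "west"), n, replace = TRUE))
)
d2$income <- 50000 + 8000 * d2$self_efficacy + rnorm(n, 0, 10000)

# numeric outcome: default mtry = floor(m/3); categorical: floor(sqrt(m))
fit_default <- randomForest(income ~ ., data = d2)
fit_wide    <- randomForest(income ~ ., data = d2, mtry = 4)  # more variables per split
```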

Once variables have been sampled, the algorithm then builds a decision tree from this subset of predictors. As with regular decision trees, the computer determines the hierarchy of the tree, the optimal cutoff values for the node splits, and the predicted scores for each individual. Given that each tree contains a different set of predictors, each tree may produce different predictions for each row in the dataset. Additionally, because each tree uses only a subset of the variables, there is never any concern about running out of degrees of freedom to test the model.

Once RF constructs multiple decision trees, one can assess the “variable importance” (VI) of each predictor in the model. Various measures of VI exist, but each attempts to evaluate how the fit of the model is improved by the inclusion of each individual predictor. One such measure is the mean decrease in impurity (also known as the “Gini index”). This index simply compares the prediction accuracy of decision trees before and after the inclusion of the predictor of interest. For example, if we omitted the “Years of programming” node in Fig. 1 and simply categorized all those without statistics anxiety as “Yes,” 11 individuals who were classified as “No” would now be classified as “Yes.” If we compare the accuracy of this prediction to the prediction where the model includes the “Years of programming” node, that would tell us how important “Years of programming” is in predicting the outcome. If we were then to do that with all variables across all trees (and weight this by the number of times each variable is used for splitting), that would give us the “mean decrease in impurity,” or Gini index.
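In the randomForest package, this measure is computed automatically for classification forests; a one-line sketch, continuing the hypothetical classification fit from the earlier sketch:

```r
# Mean decrease in impurity (Gini index) for the hypothetical
# classification forest `fit` from the earlier sketch;
# type = 2 requests the impurity-based measure
importance(fit, type = 2)
```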

While computationally easy to assess, the mean decrease in impurity suffers from a major limitation. Specifically, variables with many possible values (e.g., continuous variables) have inflated estimates of VI relative to variables with few possible values (e.g., one’s gender classification; Strobl, Boulesteix, Zeileis, & Hothorn, 2007). Put differently, two variables with identical predictive accuracy will have different VI estimates if one has more unique values.

An alternative measure of VI, called “permutation VI,” does not suffer from the same bias. This measure is also called “mean decrease in accuracy,” but we will call it “permutation VI” so as to not confuse it with mean decrease in impurity. This approach works by randomly shuffling OOB participants’ scores on a given variable. For example, Table 1 shows an example where scores are shuffled. In this case, the first OOB person had a score of five, but after permutation was given the score associated with the second individual (a two). This shuffling breaks any association between the shuffled variable and the other variables in the dataset. The algorithm can then assess the accuracy of the model before versus after shuffling. If there is a large difference in OOB predictions before versus after shuffling the scores for a particular variable, we can conclude that variable has a strong association with the outcome.
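A minimal hand-rolled sketch of this idea follows, using a set-aside holdout sample rather than the per-tree OOB samples that randomForest uses internally; `d2` is the hypothetical income dataset from the earlier mtry sketch.

```r
# Hand-rolled sketch of permutation importance using a holdout sample
# (randomForest does this internally, tree by tree, on the OOB samples)
library(randomForest)

train   <- d2[1:150, ]
holdout <- d2[151:200, ]
fit_inc <- randomForest(income ~ ., data = train)

baseline_mse <- mean((holdout$income - predict(fit_inc, holdout))^2)

shuffled <- holdout
shuffled$self_efficacy <- sample(shuffled$self_efficacy)  # permute one variable
shuffled_mse <- mean((shuffled$income - predict(fit_inc, shuffled))^2)

shuffled_mse - baseline_mse  # large increase in error = important variable

# the built-in equivalent: randomForest(..., importance = TRUE), then
# importance(fit, type = 1)  ("mean decrease in accuracy")
```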

Table 1 Simulated dataset where the self-efficacy scores were permuted (or shuffled). Permutation breaks any association between the other variables (e.g., income, in this case) and the shuffled variable

For binary outcomes, the permutation VI score indicates the average change in OOB error (before versus after shuffling) across all trees. For example, if a variable’s permutation VI is 0.3, that means that, relative to the shuffled scores, the unshuffled scores produced OOB error rates that were lower by 30 percentage points. For continuous outcomes, the permutation VI score represents the difference in the sum of squared errors between the shuffled and unshuffled datasets.

Another advantage of the permutation VI measure is that missing data are handled naturally (Hapfelmeier, Hothorn, Ulm, & Strobl, 2012). If an individual is missing a score on a particular variable (say, years of education), the shuffling of scores will simply assign that missing value to another individual.

The disadvantage of permutation VI is that it is computationally expensive. The computer must shuffle scores for every variable, across every tree in which it appears. Yet this disadvantage becomes increasingly less burdensome as computers become more powerful.

These VI measures allow researchers to winnow down a large list of variables into a smaller subset of contenders for further exploration. This could be done ad hoc (e.g., by choosing to investigate the top three variables) or more concretely. For example, Genuer, Poggi, and Tuleau-Malot (2010) utilized the mean decrease in impurity VI to develop an objective variable selection algorithm. This algorithm works by generating multiple forests so one can estimate the variability in VI estimates, which can then be used to determine which variables have VIs that exceed chance. This algorithm can be found in the VSURF package in R (Genuer, Poggi, & Tuleau-Malot, 2019).
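A hedged sketch of running this algorithm follows, assuming the VSURF() interface and result slots as documented in Genuer et al. (2019); `d2` is the hypothetical income dataset from above.

```r
# Sketch of objective variable selection with VSURF (hypothetical data `d2`);
# interface assumed from Genuer, Poggi, & Tuleau-Malot (2019)
library(VSURF)

sel <- VSURF(income ~ ., data = d2)
sel$varselect.interp  # indices of the variables retained for interpretation
```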

In summary, RF models generate hundreds or thousands of decision trees to produce aggregated predictions, all the while natively detecting interactions and nonlinear patterns. Because these models utilize random sampling of variables, the models circumvent the n < p problem and reduce overfitting. These characteristics, we will show, represent a serious advantage for doing many types of research.

Common misconceptions

In the previous section, we outlined several advantages of RF models. Given that psychological research frequently encounters nonlinear patterns (Hayes, Laurenceau, Feldman, Strauss, & Cardaciotto, 2007; Helmich et al., 2020; Lord & Novick, 1968; Mattei, 2014), violated assumptions (Micceri, 1989; Skidmore & Thompson, 2013; Van Horn et al., 2012), and interactions (Cronbach, 1975), it is somewhat surprising to see RF models so infrequently used. We suspect the reason is that researchers hold common misconceptions about RF models.

The first, and perhaps most common, misconception is that RF models are inappropriate when one is examining a small number of predictor variables. We suspect this misconception stems from the fact that RF was initially designed to handle very large numbers of predictor variables (e.g., more variables than observations, as with genetic or biomedical research). While this is true, it does not mean RF models cannot be used with a small number of variables, especially when one wants to leverage RF’s ability to detect interactions and/or nonlinear effects.

Another misconception we have encountered is that one must have a large sample size to utilize RF models. In reality, RF models do no worse than traditional models (e.g., regression) and likely do much better. Estimates of cross-validation accuracy (i.e., OOB error) will reflect the uncertainty associated with smaller sample sizes. Granted, if the sample size is small enough and/or the signal in the data is weak enough, these OOB estimates will report that the model struggles to predict the outcome of interest. In this sense, RF models do struggle with smaller sample sizes. However, we see this as a feature of RF, not a limitation. For example, in order to determine whether certain variables contribute to a prediction model, it is better for the model to admit difficulty (e.g., through imprecise predictions) than to capitalize on chance patterns as linear models might.

While these common misconceptions can be easily dismissed, there are some genuine limitations of RF models. We will outline and address these limitations in the next section.

Limitations of RF models

RF models are “black box” algorithms

Single decision trees can be easily and intuitively interpreted visually if there are not too many nodes. However, the predictions of individual decision trees are highly unstable. Depending on the number of predictors within an analysis, the predictor variable chosen for the first branching node can differ across two single decision trees. Moreover, this difference likely alters the entire structure of the tree, and the terminal node predictions of the two decision trees might vary drastically (Strobl et al., 2009).

RF addresses this instability by aggregating terminal node predictions across hundreds or thousands of decision trees. Although this process is the very reason for the advantages of RF, limitations remain. Specifically, as RF is a computer-generated machine learning model, it is fundamentally a “black box” algorithm (Breiman, 2001), making it difficult to interpret the model itself. It would be infeasible, for example, to visually display all the decision trees generated from the random forest. More specifically, while regression models yield a simple algebraic equation following analysis, RF does not. In order to predict new observations, one would save the hundreds or thousands of single decision trees into a computer program and feed new data through the saved program.

However, while the fitted algorithm cannot be easily conceptualized, the output of the model can, because users can visualize the predictions of a RF model. To do so, one could feed the RF model a new dataset containing a wide range of predictor values. RF will then pass the new data through the forest to generate predictions. Of course, many statistical programs will do this sort of prediction automatically. As an example, Fig. 4 displays the same data as shown in Fig. 3, but with the fits of a RF model instead of the fits of a single decision tree.
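For example, one might build a grid of predictor values, pass it through the forest, and plot the predictions over the raw data; the sketch below uses the hypothetical income dataset `d2` from earlier and is purely illustrative.

```r
# Sketch: visualize RF predictions by feeding the model a grid of values
library(randomForest)
library(ggplot2)

fit2 <- randomForest(income ~ self_efficacy + industry, data = d2)

grid <- expand.grid(
  self_efficacy = seq(min(d2$self_efficacy), max(d2$self_efficacy), length.out = 50),
  industry      = factor(levels(d2$industry), levels = levels(d2$industry))
)
grid$predicted <- predict(fit2, newdata = grid)

# raw data as points, RF predictions as lines (one line per industry)
ggplot(d2, aes(self_efficacy, income, color = industry)) +
  geom_point(alpha = 0.3) +
  geom_line(data = grid, aes(y = predicted))
```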

One convenient tool for visualizing RF models is Flexplot (Fife, 2021), a software application available in R, JASP, and Jamovi. Flexplot is designed to visualize statistical models and has many native functions for visualizing RF models. Flexplot can easily visualize a few variables, though visualizing more than that can be less intuitive. For guidance, see Fife, Longo, Correll, and Tremoulet (2021) for the JASP version of Flexplot and Fife (2021) for the R version. Or, for a systematic strategy for visualizing multivariate data with Flexplot, see Fife and Mendoza (2021). An alternative to multivariate visualizations is to plot what are called “marginal” relationships. For example, one might use an added variable plot to visualize the relationship between self-efficacy and income, holding all other variables constant.
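A hedged sketch of the Flexplot workflow in R follows, assuming the flexplot() and compare.fits() interfaces described in Fife (2021); `d2` and `fit2` are the hypothetical objects from the previous sketch.

```r
# Sketch of visualizing a RF model with Flexplot (interfaces assumed
# from Fife, 2021); `d2` and `fit2` come from the earlier sketch
library(flexplot)

# raw data: self-efficacy on the X-axis, paneled by industry
flexplot(income ~ self_efficacy | industry, data = d2)

# overlay the RF model's predictions on the raw data
compare.fits(income ~ self_efficacy | industry, data = d2, model1 = fit2)
```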

Fig. 4 This shows the same data displayed in Fig. 3, but the fits are from a RF model

In addition to visualizing predictions, one could also inspect variable importance metrics. As mentioned previously, VI metrics are intuitive to interpret and do not require users to peek within the black box or to visualize multivariate relationships. For the previous dataset, industry has a VI of 28.584, which can be roughly interpreted to mean that, relative to a model that excludes industry, the model with industry is approximately $28,584 closer to predicting one’s actual income. Additionally, the VI for self-efficacy is 13.544.

RF models lack distributional theory

A second disadvantage of RF is that these algorithms lack statistical distribution theory. While RF’s nonparametric approach is advantageous when assumptions are violated, it also makes it difficult to make sophisticated statistical inferences. For example, one cannot derive a p value or a confidence interval from a population distribution. One can obtain substitutes with resampling procedures and, from these substitutes, derive confidence intervals or make other sorts of inferences. However, these inferences are tied to the data at hand. That said, past research has shown that RF predictions have parametric characteristics and can often be used for statistical inferences (McAlexander & Mentch, 2020), including prediction intervals (Zhang, Zimmerman, Nettleton, & Nordman, 2019). Also, RF performs quite well in making inferences beyond the data (Fox et al., 2017; Gao, Wen, & Zhang, 2019; Lu et al., 2016).

Despite this limitation, we rarely see any reason to use RF as the final step in the research process, since statistics are tools that allow us to make ever-more-precise mathematical statements about theory. In this sense, nonparametric models are simply “hacks”; they allow us to temporarily acquire an answer to a question in such a way that we don’t deceive ourselves (e.g., by violating a statistical assumption). Arguably, the ultimate goal of research is to have precise mathematical parametric models, but often these nonparametric models are a necessary pit stop. RF models, while rarely (if ever) the final destination of a theory’s journey, do assist in moving from imprecise nonparametric answers to specific parametric formulations. We will discuss examples of this process in later sections.

In short, RF models detect interactions, model relationships nonparametrically, and reduce overfitting. While they are considered a “black box” algorithm, their predictions can be easily visualized (e.g., by using Flexplot), particularly when one limits visualizations to only a few variables. Additionally, RF models are best considered a pit stop toward parametric modeling, rather than the final step in theoretical development.

Having covered the strengths, misconceptions, and limitations of RF models, we now turn to the crux of our paper. As we hope to show, RF models can be used in novel ways. In the following section, we identify a few strategies one might use to leverage the strengths and advantages of RF in psychological research.

Common, uncommon, and novel applications of RF models

The variety of applications of RF continues to expand in psychological research. In this section, we discuss a number of different strategies for using RF models. For each approach, we describe an overall strategy and demonstrate this strategy with applied examples.

Variable selection then parametric modeling

Perhaps the most common reason people use RF models is for variable selection. For example, RF models have been used in psychology to predict correlates of nonsuicidal self-injury (Ammerman, Jacobucci, & McCloskey, 2018), smoking behaviors (Kitsantas, Moore, & Sly, 2007), utilization of psychiatric services (Rossi, Amaddeo, Sandri, & Tansella, 2005), adherence to HIV testing (Pan, Liu, Metsch, & Feaster, 2017), and use of Internet-based psychotherapeutic treatment for depression and anxiety (Wallert et al., 2018).

Often, researchers are not (or at least not presently) interested in developing theoretical explanations of psychological phenomena. Rather, researchers may wish to describe a phenomenon (Mõttus et al., 2020), or to winnow down a large list of candidate variables to a smaller subset of viable predictors. While algorithms exist for doing this in multiple regression (i.e., the various stepwise regression methods), these methods cross-validate poorly (Smith, 2018). RF can be used instead; its ensemble of decision trees improves cross-validation accuracy.

One limitation to this approach is that it treats VI measures as if they represent the final stage of analysis. As we said previously, RF models are best considered data analytic pit stops; they are powerful tools that offer valuable insight into how we might then utilize parametric models, particularly when paired with visualizations. In the following section, we illustrate how one might leverage VI measures as a means of gaining additional insights.

Example and overall strategy

When attempting to identify a small number of variables from a large set of candidates, we recommend the following strategy (a code sketch illustrating these steps appears after the list):

1. Enter all variables of interest into a RF model and compute variable importance.

2. Sort the variables in terms of VI.

3. Select a small number of candidate variables to visualize (e.g., the top four variables as measured by VI).

4. Visualize the RF predictions for this small group of candidate variables. Visualizing multivariate data can be tricky, though we suggest Fife and Mendoza (2021) or Fife (2021) for simple multivariate visualization strategies.

5. Use the visuals from Step 4 to select appropriate variables for parametric modeling. The preceding step might suggest a certain variable is not helpful to the model, that two variables interact, or that a nonlinear pattern exists. That step will guide the choice of parametric model.
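Below is a compact sketch of Steps 1 through 5 in R; the data frame `ideation_data`, its outcome `ideation`, and the candidate predictors are hypothetical.

```r
# Sketch of the five-step strategy (hypothetical data frame `ideation_data`
# with outcome `ideation` and several candidate predictors)
library(randomForest)

# Steps 1-2: fit the forest, compute permutation VI, and sort
fit <- randomForest(ideation ~ ., data = ideation_data, importance = TRUE)
vi  <- sort(importance(fit, type = 1)[, 1], decreasing = TRUE)

# Step 3: keep a small number of candidates (e.g., the top four)
top4 <- names(vi)[1:4]
top4

# Step 4: visualize the RF predictions for those candidates (e.g., Flexplot)
# Step 5: translate what the visuals reveal (interactions, nonlinearity)
# into a parametric model, e.g.:
# lm(ideation ~ depression * social_support + age + I(age^2), data = ideation_data)
```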

When one uses this strategy, one retains many of the benefits of RF (i.e., cross-validation, native interaction detection, and native nonlinear detection) without the disadvantages (i.e., a “black box” algorithm).

Table 2 Variable importance for the top four variables in the simulated suicidal ideation dataset

As an example, we simulated a dataset that contained the outcome variable of suicidal ideation and eight predictor variables: locus of control, depression, age, gender, parental income, grades, social support, and parental history of depression. The data were simulated in such a way that age had a nonlinear relationship with suicidal ideation, and depression and social support interacted.

We began by computing VI for each of the predictor variables. Table 2 shows VI for the variables social support, age, depression, and locus of control. The next step was to plot these variables using Flexplot. This is an iterative process that may require dozens of plots to disentangle the relationships in the data. For example, one might choose to place social support on the X-axis in one plot and put depression/LOC in panels. Subsequently, the user might place depression on the X-axis and LOC/age in panels. Each visual represents a different “view” or “angle” of the multivariate relationship that might reveal different features of that relationship.

When visualizing these plots, the user is seeking to identify evidence of nonlinear patterns and/or interaction effects, since these are most difficult to grasp without visuals. For those interested in understanding how to use multivariate visualizations to detect nonlinear/interaction effects, we recommend Fife and Mendoza (2021). The end result of this process yielded the three distinct relationships illustrated in Fig. 5. The top plots show the interaction between depression and social support. Notice that for those reporting lower levels of depression, there is a weak relationship between social support and suicidal ideation. However, the relationship is stronger for those reporting more severe levels of depression.

Fig. 5 Fits of the RF model for social support/depression (top plot), age (bottom left), and locus of control (bottom right)

The bottom-left plot shows the nonlinear relationship between suicidal ideation and age. Notice suicidal ideation increases from ages 12 to roughly 18, plateaus until age 22, and then trends downward. Finally, the bottom-right plot shows there is almost no relationship between locus of control and suicidal ideation. Since this had the smallest VI within the top four predictor variables, there is little reason to visualize the remaining variables in the dataset.

The visuals in Fig. 5 suggest the following parametric model:

$$\text{Suicidal ideation} = \text{Age} + \text{Age}^{2} + \text{Depression} + \text{Social support} + \text{Depression}\times \text{Social support}$$

This parametric model was fit to the data, then visualized in Fig. 6. The red lines show the fit of the regression model, while the blue lines show the fit of the RF model. The two sets of predictions are quite similar, at least near the center of the data, which suggests the parametric model captures the most important elements of the nonparametric model.
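A sketch of fitting this parametric model in R, with the same hypothetical column names as the earlier workflow sketch:

```r
# The parametric model suggested by the RF visuals (hypothetical names)
parametric <- lm(
  ideation ~ age + I(age^2) + depression * social_support,
  data = ideation_data
)
summary(parametric)  # the depression:social_support term carries the interaction
```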

Fig. 6 Fits of a regression model (red line) and the RF model (blue line) for the age (left plot) and social support/depression (right plot) relationships

Nonparametric modeling

In order to tie statistical models to distributional properties, models make several key assumptions: normality, independence, constant variance, linearity, and homogeneity of regression. The latter two assumptions are particularly problematic for linear models, and RF is well equipped to handle their violation.

Detecting nonlinearity

Standard statistical models assume linear relationships between the predictors and the outcome. If one encounters a nonlinear relationship, linear models can be rigged to fit some limited nonlinear relationships (e.g., we can add a squared predictor to a linear model to get nonlinear predictions). However, if the appropriate function is not linear in its parameters (e.g., exponential, logarithmic, logit), linear models will fail. RF models, on the other hand, can fit patterns from any nonlinear relationship: exponential, logarithmic, logistic, polynomial, etc. When researchers begin analyses by visualizing RF models, they can then attempt to identify the appropriate nonlinear function and, if they wish, formalize the mathematical relationship.

Fig. 7 Random forest predictions (left) and predictions from a model using the Michaelis–Menten equation (right) of psychological distress. Fife (2020) initially used a RF model to identify the nature of the MI/distress relationship. Using that information, he subsequently fit the data using the MM equation

Fife (2020) utilized this exact approach when modeling the relationship between mental illness and psychological distress. Upon visually inspecting RF predictions, Fife was able to identify an appropriate mathematical function, called the Michaelis–Menten (MM) equation. The data were refit with a Bayesian MM model, and the results were replicated using the same MM model on an independent dataset (see a video explaining this process at https://youtu.be/5BpmktmvgIA). Figure 7 shows the model fit with RF (left) and the MM equation (right).
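As a sketch of this last step, base R’s self-starting Michaelis–Menten function can fit such a curve. Note that Fife (2020) used a Bayesian MM model; the code below is a frequentist stand-in with hypothetical column names (mi for mental illness).

```r
# Frequentist sketch of the MM fit using base R's self-starting function;
# Fife (2020) fit a Bayesian version. `mi` and `distress` are hypothetical
# column names in a data frame `mi_data`
mm_fit <- nls(distress ~ SSmicmen(mi, Vm, K), data = mi_data)
summary(mm_fit)
# Vm: the asymptote (maximum distress); K: the MI value at which
# distress reaches half its maximum
```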

To be clear, we reiterate that RF models are rarely an end unto themselves. Rather, they can serve as an important step in helping researchers shift from nonparametric to parametric modeling. The results of a RF algorithm can help guide researchers in how to add theoretical and/or statistical precision to their models.

Interaction detection

Another important assumption of traditional statistical models is the assumption of homogeneity of regression slopes. This assumption states that if any interactions do exist between variables, they have been explicitly modeled (see Fife & Mendoza, 2021; Gelman & Hill, 2006). If interactions exist that have not been modeled, estimates will be biased and the conclusions gleaned will be misleading. Gelman and Hill (2006) noted that violating the assumption of homogeneity of regression (or “additivity,” in their terminology) is one of the most egregious violations.

As an example, suppose a researcher would like to perform a multiple regression. They might evaluate a main effect while also controlling for a number of different variables in order to determine the variance attributable to the main effect. However, as we mentioned previously, multiple regression assumes all “controlled” effects do not interact with the variable of interest. This is termed the “homogeneity of regression” assumption. When it is violated, any conclusions gleaned from main effects could be extremely misleading. For example, consider Fig. 8, which shows the simulated results of a factorial ANOVA that contains a crossover interaction. If one were to estimate the main effect of treatment, one would conclude that there is no effect of treatment. This would be a misleading conclusion. Rather, the main effect of treatment depends on gender.

Fig. 8 A crossover interaction of the gender/treatment relationship. If one wished to estimate the effect of treatment on depression, after controlling for gender, the model would suggest there is no treatment effect. Clearly there is a treatment effect; it simply depends on gender. Standard statistical models assume no interactions (except for those that are explicitly modeled). RF models do not make this assumption

The homogeneity of regression assumption applies to all linear models (e.g., multiple regression, factorial ANOVA, ANCOVA), and these models are not robust to violations (Fife & Mendoza, 2021; Gelman & Hill, 2006). Yet researchers routinely utilize these models without assessing the viability of this assumption (Fife & Mendoza, 2021). This inattention to such a critical assumption is understandable; even for simple analyses, modeling interactions requires a large number of terms. For example, with only four variables, one would need to model 16 terms (one intercept, four main effects, six two-way interactions, four three-way interactions, and one four-way interaction). Few studies will have large enough sample sizes to precisely estimate the size of these effects. Clearly, it is much easier to assume a main effects model and hope there are no interactions present, but the risks are great.

A better alternative is to utilize RF models. RF will natively detect interactions without explicitly modeling them. This saves effort on the part of the researcher, who would otherwise have to, by hand, choose which interactions should be estimated.

Additionally, in traditional regression, when all interaction terms are explicitly modeled, the researcher must then study large tables to determine whether each interaction is worth keeping in the model. The researcher could utilize p values to dismiss interaction terms, but because of multiple testing, these p values have no probabilistic meaning and are thus prone to bias. Researchers could instead utilize effect size measures (e.g., semi-partial R2) to make these decisions, but these are also prone to overfitting and extremely sensitive to multicollinearity. No matter the metric one uses, the process is time-consuming, difficult, and prone to capitalizing on chance. On the other hand, with RF models, the user need not decide from lengthy tables which terms to keep; the user can simply inspect VI metrics. If a variable is important, either as a main effect or as part of an interaction, it will be reflected in the VI metric.

Finally, estimating interactions in traditional regression is very difficult and requires very large sample sizes. Moreover, doing so requires degrees of freedom the researcher may be unable to spare. On the other hand, RF models do not “spend” degrees of freedom (Mentch & Zhou, 2019; Rodgers, 2019) to estimate interactions. Instead, modeling interactions is a natural part of the fitting process.

Overall, linear models are not robust to violations of the assumption of homogeneity of regression and evaluating the viability of this assumption is cumbersome and prone to imprecision. Taken together, we think RF should be the default method for modeling many multivariate relationships, particularly when more than two or three variables are used. Much like residual dependence plots are used for assessing homoscedasticity/linearity in regression, perhaps RF modeling can be a first step in evaluating the homogeneity of regression slopes assumption in multiple regression.

Unfortunately, using RF models for the purpose of interaction detection is neither well known nor common. However, Kitsantas et al. (2007) utilized classification trees to identify whether a small set of predictors, including social risks, health risks, and peer smoking, interacted with one another in predicting intentions to smoke. They discovered that the effects of social and health risks were highly dependent on peer smoking behavior.

Example analysis

Table 3 ANOVA summary table of a simulated dataset investigating the effect of treatment on intentions to smoke, after controlling for peer and parent smoking

We will briefly illustrate this strategy with another simulated dataset. Suppose one were interested in modeling the efficacy of a smoking prevention program on adolescents’ intentions to smoke while controlling for peer and parent smoking. Further suppose one simply fit the model without assessing the homogeneity of regression assumption. Table 3 shows an ANOVA summary table of this model, which suggests the treatment was ineffective at reducing intentions to smoke. However, suppose we instead modeled the data using a RF algorithm.
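A sketch of the two competing analyses, with hypothetical column names in a data frame `smoke_data`:

```r
# The main-effects ANOVA versus the RF model (hypothetical data)
library(randomForest)

main_effects <- aov(intentions ~ treatment + peer + parent, data = smoke_data)
summary(main_effects)  # as in Table 3: treatment looks ineffective

rf_fit <- randomForest(intentions ~ treatment + peer + parent, data = smoke_data)
# visualizing rf_fit (as in Fig. 9) reveals the treatment-by-peer interaction
```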

Fig. 9 This plot shows the fits from a RF model that predicts smoking intentions from peer smoking (X-axis), parent smoking (panels), and treatment condition (colors). The RF model picks up on the nonlinear effect of peer smoking on intentions, as well as the interaction between treatment and peer smoking

Figure 9 displays predictions from this model. The RF model suggests that, without treatment, adolescents increase their intentions to smoke the more their peers smoke. For the treatment group, on the other hand, peer influence seems to have no effect on intentions to smoke. In other words, the model that (incorrectly) assumed linearity and homogeneity of regression committed a serious Type II error.

As we mentioned previously, RF models best serve as a guide for determining what sort of parametric model is best. For this particular dataset, the RF model suggests we might add a quadratic term to the peer effect, as well as an interaction. If we did this, we might plot the fits of the modified regression model, as in Fig. 10.

Fig. 10 This plot shows the predictions from a parametric model where a quadratic term was added to the peer/intention-to-smoke relationship, as well as an interaction term

Assessing parameters of a Monte Carlo simulation

This final strategy is more on the technical side and will likely only be of interest to statisticians. This application uses RF models to identify important parameters in a Monte Carlo simulation. Monte Carlo simulations are powerful techniques often used by statisticians to identify how various statistical procedures perform under different conditions. For example, one might wish to identify how nonlinear parameters bias R2 values. To do so, a researcher might perform a Monte Carlo simulation in which the researcher varies different numeric conditions, such as the degree of nonlinearity (e.g., by modifying the beta weight of the quadratic term), the size of the linear component (e.g., by modifying the beta weight for the linear term), the sample size, and the number of covariates.

When varying these parameters, it is common to pick a few values that span a reasonable range, for example, specifying the degree of nonlinearity as standardized regression weights of −0.6, −0.3, 0, +0.3, or +0.6. Subsequently, researchers might report results via a large table that shows the average bias under every possible condition. Alternatively, some might build an ANOVA that estimates the mean Type I error rate from each of the levels of the parameter values, such as the nonlinear component, linear component, sample size, and number of covariates.

A better alternative to this approach is to sample parameters from a liberal range of values. For example, rather than selecting regression weights of −0.6, −0.3, etc., one could instead sample from a uniform distribution that ranges from −0.6 to +0.6. In other words, every iteration of the Monte Carlo will yield unique parameter values. One can then use RF in the same way the ANOVA is commonly used. The advantage of this approach is that RF will identify the important parameters and whether their importance derives from nonlinear relationships and/or interaction effects.
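A sketch of the proposed strategy follows: sample the simulation parameters from continuous ranges, record the outcome of interest, then use RF to identify which parameters matter. All settings below are illustrative, not a recommendation for any particular simulation design.

```r
# Sketch: sample Monte Carlo parameters from continuous ranges, then use
# RF to find the influential ones (all settings are illustrative)
library(randomForest)

one_rep <- function() {
  n  <- sample(50:500, 1)        # sample size
  b1 <- runif(1, -0.6, 0.6)      # linear component
  b2 <- runif(1, -0.6, 0.6)      # nonlinear (quadratic) component
  x  <- rnorm(n)
  y  <- b1 * x + b2 * x^2 + rnorm(n)
  r2 <- summary(lm(y ~ x))$r.squared  # fit the (misspecified) linear model
  c(n = n, b1 = b1, b2 = b2, r2 = r2)
}

set.seed(3)
results <- as.data.frame(t(replicate(1000, one_rep())))

# which simulation parameters drive R^2, and how?
mc_fit <- randomForest(r2 ~ n + b1 + b2, data = results, importance = TRUE)
importance(mc_fit, type = 1)  # permutation VI for each parameter
```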

As of this writing, we know of no published articles that utilize this strategy. However, the strategy seems promising and may be of use to future researchers.

Discussion

The research landscape is rapidly changing. As a result, analysts are becoming more inclined to expand their statistical toolbox. It is our hope this expansion will include greater use of exploratory data analysis, including the use of RF models. RF models offer several advantages that are particularly relevant to psychologists, including the ability to detect interactions and nonlinear effects, as well as an impressive ability to avoid overfitting. While RF models are underutilized, we hope this paper has been a step toward their more widespread adoption. In this paper, we have attempted to provide a simple explanation of RF models, addressed common misconceptions, and highlighted the limitations of RF models, including their “black box” nature and their lack of distributional theory.

We have also provided several applied examples to show how to leverage RF’s strengths and overcome its limitations. The key to this endeavor is to visualize the predictions of RF models. Visuals can reveal insights into the statistical information RF models are able to capture as well as suggest parametric alternatives one might pursue. We have provided an appendix (Appendix A) that demonstrates how to perform simple RF analyses and visualizations.

To be clear, we do not suggest RF models replace existing methods. Rather, we suggest the appropriate procedure depends on the analyst’s intentions. RF models are best suited when researchers have exploratory intentions, in which case RF models serve as a pit stop toward more sophisticated confirmatory methods. Alternatively, RF models can be used in more confirmatory research to check the viability of standard parametric assumptions (e.g., linearity and homogeneity of regression). In either case, we hope to see greater use of this powerful tool.