1 Introduction

In recent years, the so-called replication crisis has shaken the social sciences in general and psychology in particular (e.g., Shrout and Rodgers 2018). Several replication projects (e.g., Aarts et al. 2015; Camerer et al. 2018) showed that many published effects cannot be replicated and urged a reform of research practices. Replicability is not only a problem within the (confirmatory) framework of hypothesis testing, which is mainly affected by p-hacking, publication bias and underpowered studies (Asendorpf et al. 2013), but also crucial for exploratory analyses that shape entire research areas. One prominent example of such an analysis is exploratory factor analysis (EFA), which is widely used to assess the dimensionality and structure of (psychological) constructs (Goretzko et al. 2019) and plays a major role in questionnaire development and test construction. Determining the number of factors to retain in EFA is “likely to be the most important decision a researcher will make” (Zwick and Velicer 1986), because its implications are extremely far-reaching. The most prominent example in psychological research might be the dimensionality of personality: although it is widely agreed that personality can be described by the five-factor model (“BIG5”, e.g., Costa Jr and McCrae 1992), several studies reported difficulties in replicating this structure (e.g., Thalmayer et al. 2011).

Therefore, when conducting an EFA and determining the number of factors that should be retained, the goal of replicability should be considered alongside the goal of approximating the data generating process (Preacher et al. 2013). Common factor retention criteria such as the Scree-Test (Cattell 1966), the Kaiser-Guttman rule (Kaiser 1960) and parallel analysis (PA; Horn 1965), as well as modern approaches like the comparison data (CD) approach (Ruscio and Roche 2012) or the empirical Kaiser criterion (EKC; Braeken and van Assen 2017), were developed primarily to serve the approximation goal and focus less on the replication goal. While PA has become something of a gold standard for factor retention (Fabrigar et al. 1999; Goretzko et al. 2019), both CD and EKC showed higher accuracies in simulation studies for some data conditions (Auerswald and Moshagen 2019). However, the EFA literature clearly lacks the focus on replicability called for by Preacher et al. (2013) or Osborne and Fitzpatrick (2012). For this reason, we want to evaluate the relationship between replicability in the context of factor retention and the robustness of common criteria against sampling errors. Hence, we propose bootstrapping as a practical way to assess the robustness of a retention criterion’s solution. Bootstrapping has already been used in the context of structural equation modeling to estimate standard errors of parameters (Nevitt and Hancock 2001), to evaluate model fit statistics (Hancock and Liu 2012) and to obtain corrected p-values for the model test in cases where multivariate normality is violated (Nevitt and Hancock 2001). Zientek and Thompson (2007) suggested using bootstrapping in the context of EFA as well to obtain more stable results or to evaluate replicability. As they did not focus on determining the number of factors (in fact they applied a bootstrapped Kaiser criterion, which has been shown to overestimate the number of factors, e.g., Jackson 1993), we evaluate the usefulness of bootstrapping for factor retention more closely in this study.

1.1 Replicability and robustness

Throughout this paper, we discuss the replicability of the factor retention process and the robustness of different factor retention criteria. The latter does not refer to robust statistics in the technical sense (robustness against outliers and violated distributional assumptions; for further reading, see Huber 1981), but rather describes the ability of a method to ignore noise (sampling error) and to provide estimates (here: the suggested number of factors) that do not change with minor, unimportant changes in the data (in the respective sample).

Replicability is understood in its narrowest sense in this article: the number of factors found in one sample should be replicated in another sample drawn from the same population (in this study we evaluate both a within-person replication, assessing the dimensionality at different time points, and a between-person replication, comparing several samples from the same population). When it comes to the replicability of factor structures across populations (e.g., cross-cultural research, where measurement invariance is needed), a broader definition of replicability is used. We focus on the replicability of the number of factors among samples of the same population, since replicable solutions in this narrow sense are a necessary basis for broader replication as well. In other words, if a factor structure (or, more precisely, the number of factors) is not replicable in samples from the same population, it will not be replicable across populations.

1.2 Factor retention criteria

In our study, we use four different factor retention criteria: PA, CD, EKC and a new machine learning approach, a tuned xgboost model (XGB; for the xgboost implementation, see Chen and Guestrin 2016; Chen et al. 2018; for the tuned XGB model, see Goretzko and Bühner 2020). In addition, the Bayesian Information Criterion (BIC; Schwarz 1978) is used for comparison.

1.2.1 Parallel analysis

PA (Horn 1965) is based on a comparison of the empirical eigenvalues of the correlation matrix with eigenvalues of simulated or resampled data (for a comparison of these different implementations, see Lim and Jahng 2019). The traditional version of PA, for example, compares each empirical eigenvalue with the mean of the corresponding eigenvalues of S simulated data sets: the first empirical eigenvalue is compared to the mean of the S first eigenvalues, the second empirical eigenvalue to the mean of the S second eigenvalues, and so on. PA suggests retaining factors as long as the empirical eigenvalue is greater than its reference eigenvalue.
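To make the procedure concrete, the following minimal R sketch implements the traditional mean-eigenvalue version of PA described above with simulated normal data; the function and variable names are ours, and this is not the psych implementation used in the analyses below.

```r
# Minimal sketch of traditional parallel analysis (mean-eigenvalue rule).
# `data` is a numeric matrix of N observations on p variables.
parallel_analysis <- function(data, S = 100) {
  N <- nrow(data); p <- ncol(data)
  emp_eigen <- eigen(cor(data), only.values = TRUE)$values
  # eigenvalues of S data sets of uncorrelated standard normal variables
  ref_eigen <- replicate(S, {
    sim <- matrix(rnorm(N * p), nrow = N, ncol = p)
    eigen(cor(sim), only.values = TRUE)$values
  })
  ref_mean <- rowMeans(ref_eigen)   # mean of the j-th simulated eigenvalues
  # retain factors as long as the empirical eigenvalue exceeds its reference
  # (a row-wise quantile, e.g., the 95% quantile, could be used instead of the mean)
  sum(cumprod(emp_eigen > ref_mean))
}
```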

1.2.2 Empirical Kaiser criterion

EKC (Braeken and van Assen 2017) is a descendant of the Kaiser-Guttman rule or eigenvalue-greater-one rule (Kaiser 1960). Instead of comparing the empirical eigenvalues to one (and retaining all factors whose associated eigenvalue is greater than one), EKC takes the sample size \(N\), the number of manifest variables \(p\) and the strength of the preceding factors into account when calculating a reference eigenvalue \({l}_{j}^{REF}\) for each empirical eigenvalue \({\lambda }_{j}\):

\({l}_{j}^{REF}=\max\left[\frac{p-\sum_{k=0}^{j-1}{\lambda }_{k}}{p-j+1}{\left(1+\sqrt{\frac{p}{N}}\right)}^{2},\;1\right]\) with \({\lambda }_{0}=0\).

EKC suggests retaining as many factors as there are eigenvalues greater than their respective reference eigenvalues \({l}_{j}^{REF}\).
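A minimal R sketch of this rule, computing the reference eigenvalues exactly as in the formula above (the function name is ours, not an existing package implementation):

```r
# Sketch of the empirical Kaiser criterion based on the formula above.
ekc <- function(data) {
  N <- nrow(data); p <- ncol(data)
  lambda <- eigen(cor(data), only.values = TRUE)$values
  l_ref <- numeric(p)
  for (j in 1:p) {
    # lambda_0 = 0, so the sum of preceding eigenvalues is 0 for j = 1
    prev_sum <- if (j == 1) 0 else sum(lambda[1:(j - 1)])
    l_ref[j] <- max((p - prev_sum) / (p - j + 1) * (1 + sqrt(p / N))^2, 1)
  }
  # retain factors as long as the empirical eigenvalues exceed their references
  sum(cumprod(lambda > l_ref))
}
```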

1.2.3 Comparison data

CD (Ruscio and Roche 2012) is a variant of PA that integrates the simulation of comparison data (comparable to the simulated data in classical PA) and the model comparison perspective of structural equation modeling. For each possible number of factors, a population is simulated whose correlation matrix reproduces the empirical correlation matrix as closely as possible. Then numerous samples (the comparison data sets) are drawn from these populations and the root-mean-squared error (RMSE) between the sample eigenvalues and the empirical eigenvalues is calculated. Accordingly, if 100 samples are drawn, 100 RMSE values are calculated. The RMSE values of a one-factor model are compared to those of a two-factor model using a non-parametric Mann–Whitney-U significance test. This way n-factor models are compared to (n + 1)-factor models until no significant improvement (with regard to the RMSE values) is detected and n factors are retained.
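The following schematic R sketch illustrates only the decision loop; `rmse_for_k_factors()` is a hypothetical placeholder for Ruscio and Roche's comparison-data generation (simulating a k-factor population that reproduces the empirical correlation matrix and returning one RMSE per comparison data set), which is not reproduced here.

```r
# Schematic sketch of the CD decision rule; rmse_for_k_factors() is a
# hypothetical placeholder returning a vector of RMSE values (one per
# comparison data set) for a k-factor population.
cd_decision <- function(data, k_max = 8, n_comparison = 500, alpha = 0.30) {
  k <- 1
  rmse_k <- rmse_for_k_factors(data, k, n_comparison)
  while (k < k_max) {
    rmse_k1 <- rmse_for_k_factors(data, k + 1, n_comparison)
    # one-sided Mann-Whitney U test: do k + 1 factors reduce the RMSE?
    p_value <- wilcox.test(rmse_k1, rmse_k, alternative = "less")$p.value
    if (p_value >= alpha) break   # no significant improvement: stop at k factors
    k <- k + 1
    rmse_k <- rmse_k1
  }
  k
}
```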

1.2.4 Machine learning approach

XGB (Goretzko and Bühner 2020) is a completely new approach to the issue of factor retention that combines data simulation and machine learning (ML) modeling. The idea of this method is to simulate various data sets that reflect all important data conditions of an application context and to calculate features for these simulated data sets that may be related to the dimensionality of the underlying factor structure, such as eigenvalues or matrix norms of the correlation matrix. Since the true number of factors is known for the simulated data sets, it is then possible to train an ML model that “learns” the relation between the extracted features and the number of factors and that can therefore be used to predict the dimensionality in EFA. Goretzko and Bühner (2020) provided a trained xgboost model that is able to predict the number of factors based on 184 different features (the xgboost algorithm is a rather complex ML algorithm consisting of numerous decision trees that are sequentially fitted to the residuals of the previous trees, a simplified description of the boosting idea, and governed by many hyperparameters that influence how these trees are grown and combined; for more details, see Chen and Guestrin 2016).
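To give an impression of this feature-based logic, the sketch below computes a small, purely illustrative subset of such features in R; the actual model by Goretzko and Bühner (2020) relies on 184 features and a pre-trained xgboost model, neither of which is reproduced here.

```r
# Illustrative subset of features a data set could be summarized by before
# being passed to a trained classification model (feature names are ours).
extract_features <- function(data) {
  R <- cor(data)
  eigenvalues <- eigen(R, only.values = TRUE)$values
  c(
    N              = nrow(data),
    p              = ncol(data),
    eigenvalue_1   = eigenvalues[1],
    eigenvalue_2   = eigenvalues[2],
    prop_var_first = eigenvalues[1] / ncol(data),   # share of variance of the 1st component
    frobenius_norm = norm(R, type = "F"),           # a matrix norm of the correlation matrix
    avg_abs_cor    = mean(abs(R[lower.tri(R)]))     # average absolute correlation
  )
}
# A trained gradient boosting model would then map such a feature vector to a
# predicted number of factors; the pre-trained model itself is not shown here.
```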

1.3 Bootstrapping

Bootstrapping is a resampling strategy that was developed to assess the uncertainty of estimates when analytical solutions are not available (Efron and Tibshirani 1994). The ordinary non-parametric bootstrap is based on repeated case resampling. A bootstrap sample is created by drawing \({n}_{obs}\) times (where \({n}_{obs}\) is the number of observations in the respective data set) from the empirical data with replacement, which means that every observation \(i\) (\(i\in \{1,...,{n}_{obs}\}\)) can appear in a particular bootstrap sample multiple times, while not every observation will be drawn in each sample. When repeating this procedure \(B\) times, one obtains \(B\) different bootstrap samples that consist of different subsets of the empirical data. Based on these \(B\) bootstrap samples, it is now possible to estimate a parameter of interest \(B\) times, which yields \(B\) estimates (an empirical distribution of the estimation function) that can be used, for example, to quantify the standard error of this estimate.

Assume we are interested in an estimate of the population mean of a certain variable and have four observations of this variable, \({x}^{\top }=(1,2,3,4)\); then our point estimate would be \(\overline{x}=2.5\). Using the five bootstrap samples \({x}_{1}^{*\top }=(2,2,1,1)\), \({x}_{2}^{*\top }=(3,1,4,3)\), \({x}_{3}^{*\top }=(4,4,4,2)\), \({x}_{4}^{*\top }=(1,2,3,3)\) and \({x}_{5}^{*\top }=(2,4,3,1)\), we obtain five estimates (the bootstrap means \(1.5\), \(2.75\), \(3.5\), \(2.25\) and \(2.5\)) that can be used to build a \(60\%\)-confidence interval ([\(2.25\); \(2.75\)], based on the \(20\%\)- and \(80\%\)-percentiles of the bootstrap estimates, i.e., dropping the smallest and the largest of the five means).
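The toy example can be reproduced with a few lines of base R (the object names are ours):

```r
# Reproducing the toy example above.
x <- c(1, 2, 3, 4)
mean(x)                                    # point estimate: 2.5

boot_samples <- list(c(2, 2, 1, 1), c(3, 1, 4, 3), c(4, 4, 4, 2),
                     c(1, 2, 3, 3), c(2, 4, 3, 1))
boot_means <- sapply(boot_samples, mean)   # 1.50 2.75 3.50 2.25 2.50
sort(boot_means)[c(2, 4)]                  # drop lowest/highest 20%: 2.25 and 2.75

# In practice the bootstrap samples are drawn rather than fixed:
B <- 1000
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
quantile(boot_means, probs = c(0.20, 0.80))  # percentile interval
sd(boot_means)                               # bootstrap standard error of the mean
```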

Transferred to the issue of replicability or robustness of factor retention criteria, bootstrapping allows us to assess the influence of (small) changes in the empirical data on the outcome of these criteria. Accordingly, we expect that small and/or few changes in the suggested factor solutions across bootstrap samples indicate (closer) replicability. In addition, when comparing criteria, it may be preferable to use those that show minor differences between the bootstrap samples and thus promise more robust solutions. However, replicability has no value in itself, of course. Robust and replicable factor solutions that misrepresent the underlying relations at the population level are by no means desirable. Yet when comparing factor retention criteria that showed comparably good performances in simulation studies focusing on the approximation goal, that is, on recovering the “true” dimensionality of the latent variable model assumed to generate the data (e.g., Auerswald and Moshagen 2019), it might be reasonable to trust the more robust criterion. In this study, we evaluate the robustness (and the replicability) of factor retention criteria on real empirical data to complement the findings of Monte Carlo simulation studies, as both the replication goal and the approximation goal should be given consideration in the factor retention process.

2 Methods

To illustrate how to use bootstrapping for the evaluation of the robustness of factor retention criteria and to investigate the relation between robustness and potential replicability, we used three different samples of the Big Five Structure Inventory (BFSI; Arendasy 2009) that were provided by Stachl et al. (2018) and collected within the Phonestudy project (first data set: Schoedel et al. 2018; second data set: Schuwerk et al. 2019; third data set: Stachl et al. 2017), as well as four samples of the 10 Item Big Five Inventory (BFI-10; Rammstedt et al. 2013) that were collected within the GESIS panel (GESIS 2018). The BFSI consists of \(300\) items that measure the typical five factors (openness, emotional stability/neuroticism, extraversion, conscientiousness and agreeableness), each of which can be described by six facets. We evaluated the \(60\) items assigned to each factor separately, focusing on the dimensionality of the respective trait (e.g., determining how many facets can be found for extraversion). In contrast, the BFI-10 consists of \(10\) items that also measure these five factors, but without further facets. Accordingly, we evaluated the dimensionality of the questionnaire as a whole and applied the retention criteria to all ten items.

The first sample of the BFSI contains \(N=312\) observations, the second sample \(N=256\) observations and the third sample \(N=120\) observations. Since the Phonestudy data were collected for mobile sensing studies using smartphone logging data, the participants are comparably young (mean age: \({M}_{1}= 24\), \({M}_{2}= 23\), \({M}_{3} = 24\)) and well educated due to the recruitment procedures in academic contexts. In the case of the BFI-10, we have one set of participants closely representing the German population, who were asked to fill out the questionnaire four times (waves bd, cd, dd, ed of the panel), so our four samples predominantly consist of the same persons (sample sizes are \({N}_{1}=4888\), \({N}_{2}=4249\), \({N}_{3}=3797\), \({N}_{4}=3448\), using only complete cases of the BFI-10 items in each wave). Since all the Phonestudy data were collected in a comparable study setting, the three data sets can be interpreted as different cohorts of the same study and therefore used for a replication attempt. The questions for the four different waves of the panel study were largely the same and the instructions did not vary between measurements, so the BFI-10 data are repeated measures that can be used for a within-person replication study. These two projects (Phonestudy and GESIS panel) were chosen for this study because they represent two different replication contexts (within-person and between-person). While within-person replicability speaks for the reliability of the questionnaire (related to the idea of retest reliability), a successful between-person replication can be seen as an indicator of factorial validity.

2.1 Aim of the analysis

In this paper, above all, we want to evaluate the relation between the robustness of a factor retention solution (defined as the stability of the suggested number of factors across bootstrap samples) and its replicability in empirical data sets. Furthermore, by applying different factor retention criteria to the empirical data sets, we are able to compare them with regard to their robustness and replicability. If robustness is a good indicator of replicability, it might be advisable to rely on robust factor retention criteria rather than on those that show little stability (provided that the robust criterion also shows high accuracy in simulation studies).

3 Data analysis

For all \(19\) data sets (four BFI-10 samples and three BFSI samples with five factors each), we assessed the dimensionality with PA (default settings in the psych package in R (Revelle 2018), using the \(95\%\) quantile of the random eigenvalue distribution and the Minres algorithm as extraction method), CD (default settings with \(\alpha =0.30\) for the internal Mann–Whitney-U tests and \(500\) simulated data sets for the “comparison” approach), EKC and XGB, as well as with a model comparison approach using the BIC. Afterwards, \(100\) bootstrap samples (ordinary non-parametric bootstrapping as described above) were drawn for each data set (using the boot package, Canty and Ripley 2019), and all four factor retention criteria as well as the BIC approach were applied to each of these bootstrap samples.
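A minimal sketch of this bootstrap loop in base R is given below; `retention_criterion()` stands for any of the criteria sketched above, whereas the reported analyses used the respective package implementations and the boot package.

```r
# Apply a factor retention criterion to B case-resampled versions of the data.
bootstrap_retention <- function(data, retention_criterion, B = 100) {
  n_obs <- nrow(data)
  replicate(B, {
    boot_sample <- data[sample(n_obs, size = n_obs, replace = TRUE), , drop = FALSE]
    retention_criterion(boot_sample)
  })
}
# e.g., boot_factors <- bootstrap_retention(bfi10_wave1, ekc)  # bfi10_wave1 is hypothetical
```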

We compared the range of proposed solutions between data sets and between retention criteria, and evaluated whether robust solutions (less fluctuation across bootstrap samples) were promising with regard to the replication purpose (in other words, we evaluated the link between robustness, indicated by the stability across bootstrap samples, and replicability). We used each wave of the panel data as a replication data set for the previous one. In the case of the BFSI, the second data set (\(N=256\)) was used as the replication data set for the first (\(N=312\)), and the third data set (\(N=120\)) as the replication data set for the second. To quantify the relationship between the stability across bootstrap samples and actual replicability, we introduced two robustness metrics (the volatility of the solutions across bootstrap samples, i.e., their standard deviation, and a rate of consistency, i.e., the percentage of bootstrap samples that yielded the same result as the empirical data set) and used them as independent variables in a generalized linear model (GLM; Nelder and Wedderburn 1972) predicting the probability of exact replication (logistic regression) as well as in a second GLM predicting the absolute error of replication (Poisson regression).
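Assuming `boot_factors` holds the numbers of factors suggested for the bootstrap samples of one data set and `empirical_factors` the suggestion for the original data set (both hypothetical objects here), the two robustness metrics could be computed as follows:

```r
# The two robustness metrics derived from the bootstrap solutions (sketch).
volatility       <- sd(boot_factors)                               # SD across bootstrap samples
rate_consistency <- 100 * mean(boot_factors == empirical_factors)  # % matching the empirical solution
```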

We used R (Version 4.0.0; R Core Team 2018) and the R-packages data.table (Version 1.12.8; Dowle and Srinivasan 2018), and papaja (Version 0.1.0.9997; Aust and Barth 2018) for all our analyses and the preparation of the manuscript.

4 Results

4.1 BFI-10

The application of the four retention criteria (XGB, PA, CD and EKC) to the four BFI-10 data sets mostly yielded one-factor solutions. XGB, CD and EKC suggested one factor in all four cases, while PA proposed three factors for the first BFI-10 data set and two factors for the third. Moreover, EKC and XGB provided one-factor solutions for all \(100\) bootstrap samples of all four original data sets (\(4*100\) data sets), whereas CD did so in \(94\%\), \(98\%\), \(96\%\) and \(95\%\) of the cases. PA showed the highest volatility across the bootstrap samples and frequently contradicted its own solution for the original data set. The BIC-based model comparison approach, on the contrary, suggested five factors across all four samples and all bootstrap samples. Table 1 shows the solutions of the four retention criteria and the BIC for the four initial BFI-10 data sets as well as summary statistics for the respective bootstrap samples.

Table 1 Suggested number of factors of the four retention criteria and the BIC approach, means and standard deviations of the suggested number of factors across Bootstrap samples as well as percentages of Bootstrap samples with the same factor solution as the respective Empirical BFI-10 data set

PA showed the lowest replicability: for the first wave (\(BF{I}_{1}\)) PA suggested three factors, for the second wave one factor, for the third wave two factors and for the fourth wave one factor. Across the 100 bootstrap samples of the first wave, PA yielded 2.98 factors on average (\(SD = 0.887\)), yet only 40 of these bootstrap samples yielded three factors as in the empirical data set. The percentages of bootstrap samples for which PA implied the same dimensionality as for the empirical data set were 48, 44 and 29 for the second, third and fourth wave, respectively. This so-called rate of consistency was higher for all other factor retention criteria, which also showed perfect replication rates.

4.2 BFSI

Since the three BFSI data sets consisted of far fewer observations (\(N=312\); \(256\); \(120\)), yet many more variables (\(p=60\) compared with \(p=10\) in the case of the BFI-10), the factor retention results were considerably more volatile than those for the BFI-10 data. Mostly six facets per factor were suggested, but the results varied with the retention criterion, the data set and the respective factor. BIC, EKC and XGB tended to show fewer differences between the bootstrapped solutions, whereas CD yielded the highest variance (or standard deviation) across the bootstrap samples for all combinations of data sets and factors.

Table 2 shows the solutions of the four retention criteria and the BIC for the five factors openness, conscientiousness, extraversion, neuroticism and agreeableness of the three empirical BFSI data sets separately, as well as summary statistics for the respective bootstrap samples. XGB showed the highest replicability, as it suggested six facets for the openness and the conscientiousness factor for all three data sets, while yielding six facets for two data sets and seven facets for one data set for the extraversion, neuroticism and agreeableness factors. None of the other factor retention criteria provided the same estimate of the number of facets across all three data sets for any factor; PA, for example, suggested three different numbers of facets for every factor except neuroticism.

Table 2 Suggested number of factors of the four retention criteria and the BIC approach, means and standard deviations of the suggested number of factors across Bootstrap samples as well as percentages of Bootstrap samples with the same factor solution as the respective Empirical BFSI data set

4.3 Robustness and replicability

We used a GLM with binomial family and logit link to model whether the number of factors was exactly replicated in the next data set, comparing the results of the first BFI-10 data set with those of the second, the results of the second with those of the third, the results of the third with those of the fourth, and proceeding analogously for the three BFSI data sets. We modeled the relation between robustness and replicability for all factor retention criteria combined, so that five (the four criteria plus the BIC approach) times 13 (the number of replications considered: three replications of the BFI-10 data and two replications for each of the five factors of the BFSI data), that is, 65 instances were used for the analysis. The standard deviation of the suggested number of factors across the respective \(100\) bootstrap samples as well as the percentage of bootstrap solutions equal to the outcome of the initial data set (referred to as the rate of consistency) served as independent variables in our model. Both the standard deviation and this rate of consistency can be seen as measures of robustness of the proposed factor solution. The absolute difference in the suggested number of factors between two consecutive data sets (e.g., the first BFSI data set compared with the second BFSI data set) served as a second measure of “replicability” of the proposed factor solutions (absolute replication error). A second GLM with Poisson family and log link was used for this dependent variable, analogous to the first model, with the standard deviation of the bootstrapped factor retention solutions and the rate of consistency as independent variables.
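A sketch of these two models in R, assuming a hypothetical data frame `d` with one row per criterion and replication instance (65 rows):

```r
# d is assumed to contain:
#   replicated  - 1 if the suggested number of factors was exactly replicated, 0 otherwise
#   abs_error   - absolute difference in suggested factors between consecutive data sets
#   boot_sd     - SD of the suggested number of factors across the bootstrap samples
#   consistency - % of bootstrap samples matching the original solution
glm_exact <- glm(replicated ~ boot_sd + consistency,
                 family = binomial(link = "logit"), data = d)   # probability of exact replication
glm_error <- glm(abs_error ~ boot_sd + consistency,
                 family = poisson(link = "log"), data = d)      # absolute replication error
summary(glm_exact)
summary(glm_error)
```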

The results of the GLM analyses support the descriptive observation that factor retention criteria that were more stable across bootstrap samples were more likely to yield replicable results. With respect to exact replication (the first GLM), higher standard deviations of the suggested number of factors across the bootstrap samples were associated with a lower probability of replication [\(b=-0.69\), 95% CI \((-3.14, 1.21)\), \(z=-0.65\), \(p=0.517\)], whereas the percentage of bootstrap samples with the same solution as the initial data set (rate of consistency) was positively linked to this probability [\(b=0.05\), 95% CI \((0.02, 0.09)\), \(z=3.09\), \(p=0.002\)]. Results of the second GLM indicated that the higher the standard deviations of the proposed number of factors across the bootstrap samples, the less accurate the replication, illustrated here by a positive association with the dependent variable [\(b=0.66\), 95% CI \((0.13, 1.14)\), \(z=2.57\), \(p=0.010\)]. An increasing rate of consistency was associated with a smaller deviation between the numbers of factors proposed for two consecutive data sets [\(b=-0.02\), 95% CI \((-0.03, -0.01)\), \(z=-2.81\), \(p=0.005\)].

4.4 Comparing the criteria

Both the standard deviation of the bootstrap results and the rate of consistency can be used to compare the retention criteria with regard to their robustness against sampling errors. While for the BFI-10 data, BIC, EKC and XGB had a rate of consistency of \(100\%\) and thus no variance in the bootstrap results, all criteria were much more volatile for the BFSI data sets, which can be explained by the far smaller sample sizes and the higher number of items (\(p=60\) vs. \(p=10\)).

EKC provided the most robust results (smallest mean and median standard deviation as well as the highest mean and median rates of consistency). In terms of replicability, however, XGB yielded better results on average (highest replicability rate with 61.54% and the smallest mean absolute difference between the numbers of factors suggested for consecutive data sets, i.e., the smallest mean replication error: 0.38). PA had the lowest mean and median rate of consistency as well as the worst replicability rate of 7.69%. CD yielded the most volatile results (highest mean and median standard deviation across the bootstrap samples), which can be linked to its highest mean replication error (especially driven by the conscientiousness factor of the BFSI data sets, see Table 2). Table 3 provides an overview of these robustness and replicability measures for the four retention criteria as well as the BIC approach.

Table 3 Means and medians of standard deviations of the suggested number of factors and rates of consistency over all data sets for the four factor retention criteria and the BIC approach as well as the means of both replicability measures (dependent variables of the GLM analyses)

5 Discussion

The present study examines the relationship between the robustness of factor retention criteria and the replicability of their solutions. Bootstrapping of the initial empirical data sets was chosen as an easy-to-use method to evaluate the robustness (or stability) of the factor retention process, which seems to be a good proxy for replicability. The study results showed some promising patterns, since, in specific cases, criteria with high robustness tended to show higher replicability rates and provided more consistent results across the data sets used for the replication.

Higher robustness and replicability rates were recorded for the BFI-10 panel data, which can be explained by the much larger sample sizes compared with the BFSI data. Several authors have discussed this relationship between robustness and sample size for EFA in general (e.g., Osborne and Fitzpatrick 2012), and various simulation studies showed the need for larger samples to achieve higher accuracy/precision in EFA (see Goretzko et al. 2019 for an overview or MacCallum et al. 1999 for a simulation study). Regarding factor retention criteria, Auerswald and Moshagen (2019), among others, found that they consistently perform better at larger sample sizes, and although their focus lay on the approximation goal and not on the replication goal, it seems reasonable to assume that larger sample sizes also benefit the replicability of factor retention criteria. It was striking that (except PA for two waves) all factor retention criteria (though not the BIC approach) suggested one factor for all four BFI-10 data sets, even though the BFI-10 claims to measure the “BIG5” with two indicators per factor. This small overdetermination (two manifest variables per latent factor) prevents a comprehensive confirmatory analysis of the factorial structure (Reilly 1995) and is seen as too small for EFA as well (e.g., Fabrigar et al. 1999), so the factorial validity of the BFI-10 remains unclear. However, for our study the questionnaire can still be used for the evaluation of the robustness and replicability of the factor retention process (the results should not be interpreted against the background of theoretical considerations, though).

Comparing the retention criteria, EKC and XGB provided more robust and replicable results on average than PA and CD. These advantages with regard to the replicability goal are in line with the higher overall accuracy of both XGB and EKC in the extensive simulation study of Goretzko and Bühner (2020). Although we do not know the true dimensionality, since this study is based on empirical data, the result patterns strengthen confidence in the numbers of factors suggested by XGB and EKC rather than in the solutions PA and CD produced. However, the results of XGB seem to be more in line with the theoretical assumptions of the BFSI, namely six facets per factor, than the results of the EKC. In practice, of course, replications of EFA results are usually conducted using confirmatory factor analysis (CFA), so the factor retention process itself is typically not replicated. However, when the number of factors suggested by a factor retention criterion is not replicable, which means that the initial factor solution is not replicable, the CFA model with the same number of factors cannot be expected to show an acceptable fit to the data. Thus, the method applied to determine the number of factors should yield replicable results, as the suggested number of factors needs to be “correct” for both the initial analysis and the subsequent replication.

The study should be considered purely descriptive, as the number of observations for the GLM analyses is rather small (\(N=65\)). As mentioned above, this small number leads to insufficient statistical power and does not allow cross-validation. With an \(\alpha\)-level of five percent, three out of four coefficients of interest would be classified as significant anyway. However, this does not necessarily mean that the true effects are large enough for our power to have been sufficient. We therefore refrain from interpreting the hypothesis tests for the GLM coefficients. Nonetheless, from a descriptive point of view, a positive relationship between the robustness and the replicability of factor retention criteria can be assumed. Both the face validity of the result patterns in Tables 1 and 2 and the signs of the GLM parameter estimates, which met our expectations, indicate that robustness and replicability are positively related. The empirical data sets had quite different characteristics (BFI-10 data with large \(N\) and small \(p\), BFSI data with small \(N\) and rather large \(p\)), particularly with regard to the replication context. The panel data (BFI-10 data) consist of the same participants, making it a within-person replication scenario, while in the BFSI data sets different cohorts were sampled, making it a between-person replication.

Therefore, bootstrapping can be used to assess the robustness of the factor retention process against small data changes and seems to be a good proxy for replicability in the narrowest sense, that is, replicability in samples from the same population (see also the section “Replicability and robustness”). This robustness of the factor retention (or rather its replicability within a population) is a necessary but not sufficient condition for replications across populations (generalizability of the factor structure), which is of interest in contexts such as questionnaire development for cross-cultural comparisons (i.e., measurement invariance across populations). Since this study focuses solely on the factor retention process, it is also worth mentioning that the replicability of a factor structure goes beyond replicating the number of factors (even though replicability of the dimensionality is the basis for a successfully replicated factor structure): inter-factor correlations, loading patterns and factor scores have to be considered as well. Zientek and Thompson (2007) suggested bootstrapped EFA or PCA for these evaluations, which of course can be combined with bootstrapped factor retention.

6 Conclusion

The present study demonstrates a positive relation between the robustness of factor retention criteria and the replicability of their solutions. Using bootstrap samples of the empirical data set, it is possible to evaluate the robustness of a given solution, either by looking at the standard deviation of the bootstrap solutions or by computing the rate of consistency. We want to encourage researchers to include bootstrapping in their analyses, since individual point estimates of the number of factors based on one empirical data set do not reflect the uncertainty of this estimate and its possible vulnerability to sampling error. This idea aims in the same direction as splitting the empirical data set and evaluating the factor retention criteria on both subsets in order to gain confidence in the stability of the proposed factor solution (Fabrigar et al. 1999; Goretzko et al. 2019). Relying on bootstrap samples instead of splitting the empirical data may be a better option for small samples, i.e., cases in which subsamples become too small for factor analytic methods, even though bootstrapping also benefits from larger samples and yields more trustworthy results with increasing sample sizes. When evaluating the robustness of the criteria, a comparison among them is imperative, because the stability measures cannot be interpreted in absolute terms (if all bootstrap samples provide the same solution, the standard deviation would be \(0\) and the rate of consistency would be \(100\%\)). Both Fabrigar et al. (1999) and Goretzko et al. (2019) also recommend comparing methods and evaluating combinations of criteria, as suggested by Auerswald and Moshagen (2019). Ultimately, users of EFA should not only focus on the goal of approximation, but also on the goal of replication, for which bootstrapping and the evaluation of the robustness of factor solutions might be a good start. However, it has to be stated again that replicability should not be an end in itself, since replicating under- or overfactoring is not desirable at all. Accordingly, comparing the robustness of different factor retention criteria should always be accompanied by a reference to simulation studies (such as Auerswald and Moshagen 2019 or Goretzko and Bühner 2020) that evaluate the accuracy of the respective factor retention methods. Thus, we recommend using bootstrapping to assess the robustness of several factor retention methods that have shown high accuracy in simulation studies with data conditions similar to the respective empirical data. The results of the factor retention criteria showing the highest robustness (e.g., the highest rate of consistency) can be seen as more trustworthy and should be given priority when combining the results of the different methods. When combining the results (or setting the number of factors according to a specific criterion), the suggested number of factors based on the empirical data set should be used, as all methods yielded higher numbers of factors for the bootstrap samples on average (which could be seen as a sign of overfactoring).