Neuropsychological tests are designed to measure cognitive functions, which may be impaired by brain disorders like Alzheimer’s or Parkinson’s disease, traumatic brain injury, or stroke. The tests neuropsychologists use are often assigned to various cognitive domains such as executive function, memory, or attention.

There are many reasons for establishing domains of cognitive functions and for assigning tests to these domains. The first reason may be that a clinician suspects problems in a specific cognitive domain for a particular patient and wants to select tests from this domain to administer. For example, if a patient comes in with subjective memory complaints, memory could be investigated further by selecting tests from this domain. The second reason may be that a clinician wants to determine whether a particular patient is suffering from impairment in a single domain or in multiple domains. In the literature on mild cognitive impairment (MCI), for example, single-domain and multi-domain MCI are considered separate diagnoses, with separate prognoses (Petersen, 2004). The third reason may be that a clinician or researcher wants to use composite scores on cognitive domains as an outcome measure rather than separate test scores. This method can reduce noise from individual measurement instruments (but see Lezak, Howieson, Bigler, & Tranel, 2012, p. 159). These composite scores may be calculated by summing the scores of individual tests that belong to a particular domain, as is done in the calculation of the Perceptual Reasoning or Verbal Comprehension index scores. A more sophisticated approach is to obtain estimates of a latent variable through factor analysis or item response theory analysis of a single domain and use scores on the latent variable as an outcome measure (Gross et al., 2015). The fourth reason may be to establish the validity of a particular test. If a researcher designs a new test intended to measure memory, he or she can test whether its scores correlate highly with other tests in the memory domain and less highly with tests from other domains. Domains can therefore be used to demonstrate both convergent and divergent validity. The fifth reason may be to handle missing value problems, such as those encountered in the Advanced Neuropsychological Diagnostics Infrastructure (ANDI) project (De Vent et al., 2016a). In this project, healthy control group data from many neuropsychological studies are combined to provide a normative database to be used in normative comparisons. However, this database has many missing values, since not all tests were administered in all studies. The best solution to this missing value problem is to fit a factor model to the data (Agelink van Rentergem, De Vent, Schmand, Murre, & Huizenga, 2018; Cudeck, 2000). However, this requires knowledge of the best-fitting factor model, which still needs to be determined.
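To illustrate the third reason, the sketch below contrasts the two ways of scoring a domain on simulated data: a simple composite of standardized scores versus factor scores from a one-factor model. All names and numbers are illustrative, and lavaan is used here only as an example tool, not as the method of any particular study.

```r
library(lavaan)

# Simulate three "memory" tests that share one latent ability
set.seed(1)
n <- 500
ability <- rnorm(n)
dat <- data.frame(vlt  = 0.8 * ability + rnorm(n, sd = 0.6),
                  sr_i = 0.7 * ability + rnorm(n, sd = 0.7),
                  sr_d = 0.7 * ability + rnorm(n, sd = 0.7))

# (1) Composite score: mean of standardized test scores
composite <- rowMeans(scale(dat))

# (2) Latent variable estimate: one-factor model, then factor scores
fit    <- cfa("memory =~ vlt + sr_i + sr_d", data = dat, std.lv = TRUE)
latent <- lavPredict(fit)[, "memory"]

cor(composite, latent)   # typically very high, but the weighting differs
```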

Although domains have many uses, the idea of domains of cognitive functions is not without problems. There is a lack of consensus on which tests belong to which domain, because there are many reasonable ways to assign tests to domains. For example, the Trail Making Test B (TMT B), in which one has to draw a line from labeled circles 1 to A to 2 to B to 3 and so on, is particularly difficult to assign. Because it involves drawing with a pencil and the outcome measure is the time to completion, one could assign it to the domain of psychomotor speed, along with tests like the Grooved Pegboard. However, because Trail Making Test B performance depends in large part on how attentive the person is, one could assign it to the domain of attention as well, along with tests like the Continuous Performance Test. Moreover, because it involves shifting back and forth between letters and numbers, one could assign it to the domain of executive functions, along with the Stroop Interference test.

There is also a lack of consensus on how many domains there are. For example, there are many tests that aim to assess memory in neuropsychology. Whether a single memory domain is sufficient, or whether more domains are necessary, is a matter of debate (Delis, Jacobson, Bondi, Hamilton, & Salmon, 2003). Measures of memory could be divided into measures of an immediate recall domain and a delayed recall domain, or into measures of a visuospatial memory domain and a verbal memory domain. Of course, one could also argue that separate domains are necessary for immediate visuospatial recall and delayed visuospatial recall.

A factor analysis can provide some clarity through quantification of which model best describes the correlations between tests. Up to this point, we have used the term domain interchangeably with the term latent variable. Domain is the more common term in clinical practice, used to refer to a family of tests to which a particular test belongs. Latent variable is the more common term in research, used to refer to a variable that is measured by inferring it from the correlations between test variables. Domain and latent variable are therefore not strictly synonymous. Because we perform a latent variable analysis, we use the term latent variable to be precise and consistent throughout the rest of the article. We will outline next how the latent variables that result from a latent variable analysis depend on the method and sample of a study.

First, the factor structure that is found will depend on the tests that are selected. For example, if a test like Trail Making Test B is administered together with tests that measure executive functioning, Trail Making Test B may load on a common executive functioning factor because it has elements of shifting. However, if more speeded tests are administered, Trail Making Test B may instead load on a different latent variable, processing speed, together with the other speeded measures. Therefore, the domain to which a test seems to belong depends on the battery of tests used. Consequently, comparisons across studies with different batteries of tests become necessary.

Second, age can affect the factor structure that is found in a study, because age affects scores on almost all neuropsychological measures. Therefore, in a sample with a large age range, variables may become correlated simply because they are all affected by age: elderly people generally score lower on all variables, and young people generally score higher. If age is not appropriately accounted for, fitting a factor model to a sample with a large age range can provide support for a single “cognitive” factor on which some participants (the elderly) score poorly and others (the young) score well. One solution would be to study the factor structure in a sample that is homogeneous in age. However, since studying a single age group limits generalizability, an appropriate alternative is to include age in the analysis.
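A small simulation (ours, purely illustrative) makes this argument concrete: two test scores that are independent apart from a shared age effect appear substantially correlated until age is partialed out.

```r
set.seed(2)
n   <- 1000
age <- runif(n, 20, 90)
test1 <- -0.03 * age + rnorm(n)   # both tests decline with age,
test2 <- -0.03 * age + rnorm(n)   # but are otherwise independent

cor(test1, test2)                 # clearly positive (about 0.25)

# Partialing out age removes the spurious association
cor(resid(lm(test1 ~ age)), resid(lm(test2 ~ age)))   # near zero
```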

Third, similar to the age range effect, there can be a confounding effect of level of education in factor analysis. There is generally a large effect of education on neuropsychological test scores. Again, this may lead to the conclusion that, to explain correlations between tests, we need just a single “cognitive” factor on which some participants (those with little education) score poorly and others (those with much education) score well. Such a single factor due to education would not be found in samples with very similar educational backgrounds, such as college students. However, since neuropsychological test results need to generalize beyond groups such as college students, it may again be more appropriate to correct for the effect of education in the analysis.

Fourth, domains can differ depending on the sample used. This is especially true for samples of patients with very specific deficits. A delayed recall test can become uncorrelated with other memory tests if delayed memory specifically is impaired by disorder or injury. Therefore, the structure of domains is ideally studied separately for healthy groups and for different clinical groups. Results so far have shown that the factor structure shows large commonalities across many different clinical groups (Bowden, Cook, Bardenhagen, Shores, & Carstairs, 2004; Park et al., 2012; Schretlen et al., 2013), but it cannot be assumed that this is the case for all disorders.

Fifth, to obtain stable results for a factor analysis, many participants have to be tested on multiple tests. The amount of variance that is explained by latent variables may be low in neuropsychology, and examining many latent variables increases the required sample size (MacCallum, Widaman, Zhang, & Hong, 1999). However, obtaining a large sample size for a battery of neuropsychological tests is costly. This limits the number of participants that can be tested in a study, or limits the size of the battery that can be administered to a large number of participants.

Our goal is to establish how neuropsychological tests should be assigned to domains. We will do so using a factor analytic approach, comparing different factor models that have been formulated in the literature. We will use the results of multiple studies to achieve a broad range of neuropsychological tests, and we will correct for the effects of demographic variables, including age and level of education. We will study healthy adults, so that the factor models are not confounded by sample differences in clinical status. Last, by combining different studies, we arrive at a much larger sample size than is possible in a single study.

First, we will perform a factor analysis of neuropsychological tests by applying a meta-analytic framework that allows for structural equation models to be fitted to summary statistics (Cheung & Chan, 2005). Specifically, this method pools correlation matrices from multiple studies to arrive at a single correlation matrix. To this correlation matrix, multiple models can be fitted, which allows us to compare the fit of neuropsychological factor models that have been formulated in the literature. Second, we will conduct a factor analysis of data from the ANDI normative database (De Vent et al., 2016a). This database contains raw data from healthy control participants from multiple studies conducted in the Netherlands and Belgium, not included in our first analysis.

To summarize, neuropsychology would benefit from clarity on the number and type of cognitive domains and on which tests belong to which cognitive domains. This would facilitate test selection, diagnosis of single-domain and multi-domain disorders, calculation of composite scores, neuropsychological research into the construct validity of tests, and normative comparisons given the aggregated ANDI database.

Study 1: Factor Meta-Analysis

Methods

Literature Search

A systematic literature search was conducted using PsycINFO and MEDLINE for articles that contained a factor analysis of neuropsychological tests in healthy adults. Factor analyses were chosen since studies conducting a factor analysis generally recruit a large sample and administer a large battery of tests. The search strategy was developed in PsycINFO (see Appendix 1 for the syntax) because PsycINFO is particularly well suited for searching psychological tests. The search strategy for MEDLINE was based on the PsycINFO search strategy. The search strategy consisted of the following key concepts: factor analysis-related terms, specific neuropsychological test-related terms, and general neuropsychology-related terms. Deduplication of results was done using RefWorks, and screening of results for inclusion was done using Rayyan (Ouzzani, Hammady, Fedorowicz, & Elmagarmid, 2016).

Exclusion Criteria

The goal was to obtain from each study a healthy adult sample correlation matrix containing both neuropsychological tests and demographic variables. Articles were excluded if a) fewer than two tests of interest were used, b) an adult sample was not studied, c) they were published before 1997, d) a cognitively healthy sample was not studied, e) test administration was manipulated or otherwise differed from typical administration, or f) they were included in the ANDI database. Criterion c was chosen because the required correlation matrices had to be requested from the original authors, which becomes less feasible for older publications. Criterion d entailed that we did not include groups with psychiatric or neurological disorders (e.g., bipolar disorder or epilepsy), with disorders that could interfere with test administration (e.g., hearing loss), or with conditions that were studied for their cognitive implications (e.g., HIV). Criterion e excluded studies in which manipulations (e.g., transcranial magnetic stimulation) were applied to participants during testing, or in which novel, often computerized, versions of test batteries were used. This last choice was made because these novel versions are less familiar and less thoroughly validated than the common versions. Criterion f preserved the independence of the analyses done in study 1 and study 2 of the present article.

Tests

A complete list of variables that were considered of interest is given in Appendix 2. We included search terms for each of these variables. We intended to collect correlations for several tests from each of the models’ domains. However, not every combination of variables was present in the correlation matrices that were analyzed. Therefore, even though we included search terms for, for example, the Rey Complex Figure Test, Judgement of Line Orientation, Tower of London, and Wisconsin Card Sorting Test, and we requested correlations for all these tests, these tests did not overlap completely with all other common test variables. For twelve test variables, correlations were available with every other variable. To increase the number of usable correlations, different versions of the same test variables were combined (see Table 1). These versions may not be completely parallel, since there may be differences in test administration and scoring rules. However, the current analysis assumes that, although there may be mean differences between versions, the correlations with other test variables will not differ. This issue is addressed in study 2.

Table 1 Included test variables

Contacting Authors

With a few exceptions (e.g., Adrover-Roig, Sesé, Barceló, & Palmer, 2012), articles and their supplementary materials did not contain the correlation matrix, including both the tests and the demographic variables, that was necessary for this study. Therefore, corresponding authors of included studies were contacted. If a researcher appeared multiple times as corresponding author among the included studies, a single recent article that included a large selection of tests was chosen. If these corresponding authors agreed to share a correlation matrix, they were asked whether they would be willing to share the correlation matrices for their other articles as well. If authors did not respond, they were reminded after a period of 2–3 weeks.

The authors were sent a list of the variables of interest: the test variables collected in their study, along with age, sex, and level of education. There was no specific hypothesis for the influence of sex on the factor structure, but we chose to correct for its influence as well because this is common in neuropsychology (Testa, Winicki, Pearlson, Gordon, & Schretlen, 2009). Level of education was scored differently in different studies, sometimes using a seven-point scale, sometimes using years of education. This issue is discussed in more depth in the discussion section and is addressed in study 2. Authors were requested to send a correlation matrix of these variables for the cognitively healthy sample within their data. If they were unsure whether their participants qualified as cognitively healthy, possibilities for exclusion criteria within their data were discussed. For example, if measurements from the Mini-Mental State Examination (MMSE; Folstein, Folstein, & McHugh, 1975) and Clinical Dementia Rating (CDR; Morris, 1993) had been taken in their study, participants with MMSE scores below 24 and CDR scores above 0 could be removed before the correlation matrix was computed. Since these exclusion criteria depended on what the authors had available in their data, this was an ad hoc procedure.
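As a hypothetical illustration of this ad hoc screening (the data frame and variable names are ours, not from any contributing study), the cutoffs above amount to:

```r
# Retain participants who screen as cognitively healthy: MMSE of 24 or
# higher and a CDR of 0, then compute the correlation matrix to be shared
healthy  <- subset(study_data, MMSE >= 24 & CDR == 0)
r_matrix <- cor(healthy[, vars_of_interest], use = "pairwise.complete.obs")
```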

Publication bias was not considered a threat. Publication bias exists when the publication, and subsequent inclusion, of studies is contingent on the size of the effect that is of interest in the meta-analysis (Cheung & Vijayakumar, 2016; Easterbrook, Gopalan, Berlin, & Matthews, 1991). Here, the risk of publication bias is negligible because the retrieved information consists of correlations rather than effect sizes of group comparisons.

Analysis

The analysis was carried out using R (R Core Team, 2016). First, for each study, the correlation matrix was converted to a partial correlation matrix by partialing out the influence of age, sex, and level of education using the psych package (Revelle, 2008). This method allows for study-specific correction for age, sex, and level of education.
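A minimal sketch of this step uses psych::partial.r(), which accepts a correlation matrix together with the names of the variables to correlate and the variables to partial out. The matrix `R_full` and the variable names are illustrative.

```r
library(psych)

tests  <- c("TMT_A", "TMT_B", "LetterFluency")  # test variables
covars <- c("age", "sex", "education")          # variables to partial out

# R_full: a study's full correlation matrix over tests and demographics
pcor <- partial.r(R_full, x = tests, y = covars)
```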

A factor meta-analysis of the partial correlation matrices was conducted using the metaSEM package (Cheung, 2015; see Electronic Supplementary Material 2 for a CHC example script). This factor meta-analysis consisted of two steps (Cheung & Chan, 2005; Jak, 2015). First, the partial correlation matrices were pooled into a single weighted partial correlation matrix, weighting by each study’s total number of participants after exclusion. The number of participants per study was recorded as the lowest number of participants available for any given correlation within the study, since sample sizes may differ across correlations due to missing data. Therefore, the total number of participants included here is an underestimate of the actual number of participants, and the confidence intervals could in fact be tighter for some correlations. Second, using the weighted partial correlation matrix as input, different factor models that have been described in the literature were compared (Lee, 1986; Levin, 1987; McDonald, 1978). For each model, fit was evaluated by RMSEA, SRMR, CFI, AIC, and BIC, using the rules of thumb outlined in Schermelleh-Engel, Moosbrugger, and Müller (2003) to decide what constitutes lack of fit, acceptable fit, and good fit.
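As an illustration of these two steps, a minimal metaSEM sketch is given below. Here `pcor_list` and `n_vec` are assumed to hold the study-specific partial correlation matrices and sample sizes, and the RAM matrices `A1`, `S1`, and `F1` encode one candidate factor model (a toy construction of these matrices is sketched after the model-specification paragraph below). The actual analysis script is in Electronic Supplementary Material 2.

```r
library(metaSEM)

# Stage 1: pool the study-specific partial correlation matrices into one
# weighted matrix; method = "FEM" is the fixed-effects pooling that was
# ultimately used here (a random-effects variant, method = "REM", did not
# converge in this application; see Results)
stage1 <- tssem1(pcor_list, n_vec, method = "FEM")
summary(stage1)

# Stage 2: fit one candidate factor model, written as RAM matrices, to the
# pooled matrix; summary() reports the fit indices (RMSEA, SRMR, CFI, ...)
stage2 <- tssem2(stage1, Amatrix = A1, Smatrix = S1, Fmatrix = F1)
summary(stage2)
```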

Candidate Factor Models

Factor models that were broad enough to span all neuropsychological tests were selected from the literature. This excludes factor models that describe correlations between tests from just a single domain (e.g., Huizinga, Dolan, & van der Molen, 2006). The first model was a model with a single latent variable on which all variables loaded. Verhaeghen and Salthouse (1997) used a single factor model in a meta-analysis of correlations of neuropsychological test scores and found that a large part of the variance in test scores can be construed as variance on a single common latent variable. The fit of the one factor model can be used as a reference to judge the fit of more complex models.

The second and third models came from the chapter structure of the clinical neuropsychology reference works by Strauss, Sherman, and Spreen (2006) and Lezak, Howieson, Bigler, and Tranel (2012). Although these works contain no explicit factor model that has been empirically tested, the neuropsychological tests are categorized into separate chapters. Therefore, they give a good impression of which tests belong together in the eyes of clinical neuropsychologists. In Strauss, Sherman, and Spreen (2006), the chapters containing the included tests were “General cognitive functioning,” “Executive Functions,” “Memory,” “Orientation and attention,” and “Language.” In Lezak, Howieson, Bigler, and Tranel (2012), the chapters containing the included tests were “Attention,” “Memory,” “Executive Functions,” and “Verbal functions and language skills.” The difference between the two was that Digit Span and Coding fall under “General cognitive functioning” in Strauss, Sherman, and Spreen (2006) and under “Orientation and attention” in Lezak, Howieson, Bigler, and Tranel (2012).

The fourth and fifth models were based on the opinion of experts. The fourth model was based on the domains used in Gross et al. (2015). Gross et al. (2015) assigned tests to “Memory”, “Executive functioning” and “Rest” domains on the basis of expert opinion. Of the currently included tests, only the Boston Naming Test fell in the “Rest” category. The fifth model was based on a survey of clinical neuropsychologists (Hoogland et al., 2017). Twenty experts were asked to rate, on a seven-point Likert scale, how well test variables assess cognitive functioning on a particular domain. For the twelve tests included here, the relevant domains were “Language,” “Attention and working memory,” “Memory,” and “Executive function.” For the factor model used, all mean ratings were above 4.85 on the seven-point scale indicating a large degree of confidence that these variables should be assigned to these domains.

The sixth model was based on the recommendations made by Larrabee (2014), who divided tests into six domains on the basis of a review of the literature. This domain specification was explicitly intended to help clinicians compose a battery of tests that assesses cognitive abilities from different domains. The four domains covering the included tests were “Verbal symbolic abilities,” “Attention or working memory,” “Processing speed,” and “Learning and memory—verbal and visual.”

The seventh and eighth models were two variants of the Cattell-Horn-Carroll (CHC) model as described by Jewsbury, Bowden, and Duff (2016). The CHC model was developed in intelligence research rather than in clinical neuropsychology (Floyd, Bergeron, Hamilton, & Parra, 2010; McGrew, 2009; Schneider & McGrew, 2018). Although the CHC model is dominant and has a broad influence on the development of IQ tests, there is no single CHC model, and the model has developed over time. The CHC model started as a combination of the three-stratum Carroll model, which includes a higher-order factor g, and the Cattell-Horn model, which splits cognitive functioning into fluid and crystallized parts, Gf and Gc (McGrew, 2009; Schneider & McGrew, 2018). Since then, the CHC model has been adapted many times and now broadly includes nine different factors (Keith & Reynolds, 2010; McGrew, 2009; Schneider & McGrew, 2018). For a review of how the Cattell-Horn-Carroll model came into existence, and how it has evolved over time, see Schneider and McGrew (2018). Jewsbury, Bowden, and Duff (2016) formulated one specific version of this CHC model in which the higher-order factor g is not included. Therefore, it can be argued that their model does not include Carroll’s original contribution. However, in accordance with Jewsbury et al., we still characterize this version as the CHC model. Jewsbury, Bowden, and Duff (2016) demonstrated that the CHC model as they formulate it fits well in each of the nine neuropsychological datasets they studied, with only minor adaptations for each dataset. One addition to the CHC model, as it was adapted to clinical neuropsychology, was the inclusion of Fluency as a separate latent variable (Jewsbury & Bowden, 2016; Schneider & McGrew, 2018).

The factors for the included tests were the same across the two variants of the CHC model: “Acquired knowledge or crystallized ability,” “Processing speed,” “Long-term memory encoding and retrieval,” “Working memory,” and “Word fluency.” In the first variant, Trail Making Test B measures “Processing speed.” In the second variant, Trail Making Test B measures both “Processing speed” and “Working memory.”

All factor model specifications are given in Table 2. As factor names, we have chosen to follow the naming conventions from the sources of the different factor models. Because we could not include all variables that were included in the original models, some other factor names might be more apt. For example, the factor “Long-term memory encoding and retrieval” from the CHC model might also be called “Memory,” as the tests that fall under this factor in the CHC model are almost the same as those that fall under the factor called “Memory” in other models.

Table 2 Factor model specifications of the candidate models for study 1. Tests that load on the same latent variable share a letter. Some tests load on multiple latent variables in the Hoogland and CHC models

Each factor model consisted of factor loadings describing the relationship between the tests and the latent variables, residual variances of the test variables, and covariances between latent variables, which were all freely estimated. The covariances between latent variables can be interpreted as correlations, because all latent variable variances were fixed to 1.
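To make this specification concrete, the following sketch builds the RAM matrices `A1`, `S1`, and `F1` referenced in the earlier metaSEM sketch, for a toy model with four tests and two latent variables. The structure, labels, and start values are illustrative, not the article’s actual models.

```r
library(metaSEM)

p <- 4; k <- 2   # toy example: 4 observed tests, 2 latent variables
# Free loadings ("start*label"); tests 1-2 load on F1, tests 3-4 on F2
Lambda <- matrix(c(".3*l1", ".3*l2", "0", "0",
                   "0", "0", ".3*l3", ".3*l4"),
                 nrow = p, ncol = k)
# A (asymmetric paths): zeros among observed variables, loadings in the
# observed-by-latent block, and no paths into the latent variables
A1 <- rbind(cbind(matrix(0, p, p), Lambda),
            matrix(0, k, p + k))
A1 <- as.mxMatrix(A1)
# S (symmetric): free residual variances for the tests; latent variances
# fixed to 1, so the free latent covariance is a correlation
Psi <- Diag(paste0(".2*e", 1:p))
Phi <- matrix(c("1", ".3*cor12", ".3*cor12", "1"), nrow = k, ncol = k)
S1  <- bdiagMat(list(Psi, Phi))
S1  <- as.mxMatrix(S1)
# F (filter): select the observed variables
F1 <- create.Fmatrix(c(rep(1, p), rep(0, k)))
```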

Results

Sample

From the literature search, 3259 sources were identified. After deduplication, 2520 distinct sources remained. These were judged against the exclusion criteria by inspection of the title, the abstract, and the description of the tests and measures that is provided in PsycINFO. After this step, 330 articles were selected, of which the full texts were obtained. Seven articles were excluded because the full text could not be retrieved, so a total of 323 articles were eligible for inclusion. After e-mailing the corresponding authors, 60 correlation matrices were obtained from 57 studies. A list of contributing studies is provided in the supplemental material. Horvat et al. (2014) provided four separate correlation matrices from four countries.

From these correlation matrices, tests were selected that were administered together in multiple studies. This limited the number of tests to the twelve described in the methods section. Five studies included none or only one of the selected tests and were therefore not included in the final analysis (Burns, Nettelbeck, & McPherson, 2009; DeYoung, Peterson, & Higgins, 2005; Kafadar, 2012; Sternäng, Lövdén, Kabir, Hamadani, & Wahlin, 2016; Thibeau, McFall, Wiebe, Anstey, & Dixon, 2016). The PRISMA diagram is given in Fig. 1.

Fig. 1 PRISMA diagram

All correlations of test variables were scrutinized for miscoding. One source showed aberrant correlations that could not be explained: in one correlation matrix, Trail Making Test B was positively correlated with other, unspeeded, tests (an oddity noted in the original publication; Royall, Bishnoi, & Palmer, 2015). Correlations with the Trail Making Test B variable were removed for this study. Motivating plots for this removal are provided in Appendix 3, along with the analysis that did include these correlations.

The final sample consisted of 60,398 participants and 55 correlation matrices. Study characteristics are given in Appendix 4, along with those correlation matrices for which we received explicit permission to share them here (49 out of 55). The correlations with age, sex, and level of education were partialed out from each correlation matrix. Because multiple studies provided data for some correlations, we attempted to estimate the between-study variability in these correlations by introducing one or more variance components to the model. However, at every attempt, this led to non-convergence of the first part of the analysis, establishing the pooled weighted partial correlation matrix. Therefore, the pooled weighted partial correlation matrix was established using a fixed-effects meta-analysis approach. The pooled partial correlation matrix is given in Table 3.

Table 3 Pooled partial correlation matrix

Model Fit

The results of the model comparison between candidate models are given in Table 4. The Hoogland et al. (2017), Lezak, Howieson, Bigler, and Tranel (2012), and Strauss, Sherman, and Spreen (2006) models did not converge. Therefore, the fit measures for these models cannot be interpreted with confidence and are not reported. With respect to relative fit, the AIC and BIC indicate that the two variants of the CHC model fit better than the other models.

Table 4 Model comparison results

With respect to absolute fit, the fit measures generally agree about the ordering of the models as well. All RMSEA values indicate good fit (all RMSEA < 0.05), except for the one factor model, where the RMSEA value indicates acceptable fit (RMSEA < 0.08). The SRMR values indicate a lack of fit for the five simplest models (SRMR > 0.10), and acceptable fit for the CHC models and the model by Larrabee (0.05 < SRMR < 0.10). The CFI values indicate a lack of fit for the one factor model (CFI < 0.95), acceptable fit for the model used by Gross et al. (0.95 < CFI < 0.97), and good fit for the other models (CFI > 0.97).

The best-fitting CHC model is depicted in Fig. 2, in which the correlations between latent variables are also provided. Because Trail Making Test A and B are measured in time to completion, these variables, and the “Processing Speed” factor that they loaded on, are reverse coded. Therefore, the negative correlations between “Processing Speed” and the other latent variables should be interpreted such that better “Processing Speed” is associated with better scores on the other latent variables.

Fig. 2 CHC 2 model for the twelve tests included in study 1. For each combination of latent variables, the correlation is given

Discussion

From this factor meta-analysis, we can conclude that the two CHC models provide the best fit. The two CHC models themselves do not differ by much, but all fit measures agree that the second model, with the extra cross-loading of Trail Making Test B on “Working memory,” fits better.

Therefore, we conclude that for the tests used here, the correlations between test variables are best described by five cognitive domains, namely “Acquired knowledge or crystallized ability,” “Processing speed,” “Long-term memory encoding and retrieval,” “Working memory,” and “Word fluency.” We also conclude that some test variables load on multiple domains.

The factor meta-analysis framework has several advantages in that it allows for the analysis of a large number of tests and a very large number of participants. Using the partial correlation matrices rather than the raw correlation matrices allowed us to correct for the effects of age, sex, and level of education.

However, there are a number of limitations to this analysis. First, different versions of tests were used as if they are parallel. This choice was made to arrive at a greater degree of test overlap between studies. Whether a test with a different name constitutes a different version or a different test altogether is to some degree subjective. For some tests, there is empirical evidence that there is a high correlation between test scores which makes it reasonable for the present goal to consider them versions of the same test. For example, the sum scores of the California Verbal Learning Test (CVLT), the Hopkins Verbal Learning Test (HVLT), and the Auditory Verbal Learning Test (AVLT) are highly correlated (for example, Lacritz & Cullum, 1998, report r = 0.74 between CVLT and HVLT; Stallings, Boake, & Sherer, 1995, report r = 0.83 between CVLT and AVLT; an anonymous reviewer reports r = 0.68 between HVLT and AVLT), even though there are differences between test versions in stimuli, test administration, the number of repetitions, and so on. For the Color Trails and Trail Making Test, the pooling of data may be more contentious since the tests are more dissimilar (but see Dugbartey, Townes, & Mahurin, 2000; Lee & Chan, 2000). The assumption here was that the correlations between these variables and other test variables do not change due to these differences. This assumption may not be tenable.

Second, there were differences in education scales and education systems between studies. As argued in the introduction, it is necessary to remove the confounding influence of education. However, the contributing studies used different ways of coding level of education, which means that the correction in the form of the partial correlation was different between studies as well. Also, even if two studies used the same scale such as years of education, such a scale may have a different interpretation in different countries (UNESCO, 2011; De Vent et al., 2016b).

Third, there was some overlap between the studies that were used in Jewsbury, Bowden, and Duff (2016) and the studies that were included in this factor meta-analysis. Thus, the sample that was used to develop the model was not completely distinct from the sample used to evaluate its performance. To study whether the results were biased towards the model as specified by Jewsbury, Bowden, and Duff (2016), we reran the analysis without two datasets that were included in their analysis (Duff et al., 2006; Bowden, Cook, Bardenhagen, Shores, & Carstairs, 2004). Without these studies, the Gross, CHC 1, and CHC 2 models did not converge (in addition to the Hoogland and Lezak models, which did not converge before). Of the remaining models (Strauss, one factor, and Larrabee), the Larrabee model fitted best. The two excluded studies had a wide range of tests as well as large samples, and therefore played an important part in stabilizing the results. The analyses presented here and in Jewsbury, Bowden, and Duff (2016) were therefore not independent, which could have artificially improved the performance of the CHC model.

To address these issues, in the next study, the factor models will be fitted to raw data from the Netherlands and Belgium, combined in the ANDI database. This database allows us to use a single test version for every variable, and to use a single standardized education scale. Also, because raw data are available, we can directly incorporate the influence of demographic variables on test variables, rather than using the more indirect approach of partialing out these variables from the correlations. Last, this is a completely different sample of studies from the samples used in study 1 and the samples used by Jewsbury, Bowden, and Duff (2016).

Study 2: Factor Analysis of the ANDI Database

Methods

Sample

The construction and composition of the ANDI database are described elsewhere (De Vent et al., 2016a). This database includes data from studies that were conducted in the Netherlands and Belgium. For the data used in the present analysis, the number of included studies was 54, with a total of 11,881 participants. In study 2, full information maximum likelihood (FIML) was used for model estimation. FIML allows for estimation with missing data and allows us to estimate covariances between tests that are implied by the factor analysis, even when pairs of variables are never jointly observed (Agelink van Rentergem, De Vent, Schmand, Murre, & Huizenga, 2018; Cudeck, 2000). This was not possible in study 1, where raw data were not available. In study 2, we were able to estimate almost exactly the same model with a much smaller number of participants and a smaller number of studies than in study 1. Without using FIML to deal with missing data, fewer variables would have been amenable to inclusion in study 2.
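Study 2 itself was fitted in Mplus (see Model Changes below). Purely as an illustration of what FIML estimation looks like, a minimal analogous specification in R with the lavaan package is sketched here; the data set `andi_like_data` and the variable names are hypothetical stand-ins.

```r
library(lavaan)

# Toy two-factor model; std.lv = TRUE fixes latent variances to 1, so the
# latent covariance is a correlation
model <- '
  speed  =~ coding + tmt_a + tmt_b
  memory =~ vlt_total + story_imm + story_del
'
# missing = "fiml" invokes full information maximum likelihood: every
# participant contributes through whichever variables they completed
fit <- cfa(model, data = andi_like_data, std.lv = TRUE, missing = "fiml")
summary(fit, fit.measures = TRUE)   # RMSEA, SRMR, CFI, AIC, BIC
```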

All test variables were transformed to normality using Box-Cox transformations (Box & Cox, 1964; the selected power transformations are reported on andi.nl/en) in order to meet parametric assumptions and to speed up convergence, and were demographically corrected and standardized (De Vent et al., 2016a). Note that factor analyses can also be performed on non-normally distributed data, but in the ANDI project it was required to transform data to normality. For the demographic corrections for level of education, we used a seven-point scale that is commonly used in Dutch neuropsychology (Verhage, 1964). This scale is comparable to the International Standard Classification of Education (UNESCO, 2011).
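As an illustration of this transformation step, the sketch below applies a Box-Cox power transformation to a single simulated, strictly positive score vector using MASS::boxcox; the actual power parameters used in ANDI are those reported on andi.nl/en.

```r
library(MASS)

set.seed(3)
x <- rexp(500, rate = 1/20) + 5   # simulated skewed, positive scores

# Profile the Box-Cox log-likelihood over lambda and take the maximum
bc     <- boxcox(x ~ 1, lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Apply the power transformation (log when lambda is essentially 0),
# then standardize, mirroring the transform-and-standardize pipeline
x_bc <- if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
z    <- as.numeric(scale(x_bc))
```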

Tests

In study 2, the same test variables were included as in study 1. To remove the influence of test versions differing between studies, we included a single version for every test. Digit Span Forwards and Backwards were not included since there were too few data for these variables for any specific version. Story Recall Immediate Recall and Story Recall Delayed Recall referred to Rivermead Behavioural Memory Test Stories Immediate Recall and Delayed Recall. Semantic Fluency referred to the Animals version of Semantic Fluency. Coding referred to WAIS-III Digit Symbol-Coding. Verbal Learning Test referred to the Rey Auditory Verbal Learning Test.

Model Changes

Because of the removal of the Digit Span subtests, the two versions of the CHC model collapse into a single version without “Working memory.” The remaining factors were “Acquired knowledge or crystallized ability,” “Processing speed,” “Long-term memory encoding and retrieval,” and “Word fluency.” As in study 1, factor loadings and covariances between latent variables were freely estimated. All latent variable variances were fixed to 1, so the covariances between latent variables can be interpreted as correlations. Residual variances of the tests were freely estimated as well. There was an exception for the three models that included factors with a single indicator (Strauss, Lezak, and Gross; Table 5). For these single indicators, factor loadings were constrained to equal 1 and residual variances were constrained to equal 0.
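In lavaan-style syntax (an illustration only; the article used Mplus), the single-indicator convention described above could be written as follows, with `boston_naming` as a hypothetical standardized indicator:

```r
# Single-indicator factor: loading fixed to 1, residual variance fixed to 0,
# so the latent variable is equated with the observed (standardized) score
model_single <- '
  naming =~ 1 * boston_naming
  boston_naming ~~ 0 * boston_naming
'
```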

Table 5 Factor model specifications of the candidate models for Study 2

The models were fitted using Mplus (Muthén & Muthén, 2012; see Electronic Supplementary Material 2 for a CHC example input). As in study 1, we attempted to include variance components within the model, since the data are nested within studies and are not strictly independent. The standard approach is to include the different levels within the analysis, estimating a variance component at each level in a multilevel SEM model (Hox, Maas, & Brinkhuis, 2010). However, when variance components were included, none of the models converged, presumably due to the high percentage of missing data within the ANDI database (De Vent et al., 2016a). Therefore, variance components were not included for the results given below.

Fit was evaluated by RMSEA, SRMR, CFI, AIC, and BIC, using the rules of thumb outlined in Schermelleh-Engel, Moosbrugger, and Müller (2003) to decide what constitutes lack of fit, acceptable fit, and good fit.
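For reference, a small helper encoding these rules of thumb (as applied in both studies; cutoffs per Schermelleh-Engel et al., 2003) might look like this:

```r
# Classify each fit index as good / acceptable / lack of fit
classify_fit <- function(rmsea, srmr, cfi) {
  c(RMSEA = if (rmsea <= 0.05) "good" else if (rmsea <= 0.08) "acceptable" else "lack of fit",
    SRMR  = if (srmr  <= 0.05) "good" else if (srmr  <= 0.10) "acceptable" else "lack of fit",
    CFI   = if (cfi   >= 0.97) "good" else if (cfi   >= 0.95) "acceptable" else "lack of fit")
}
classify_fit(rmsea = 0.04, srmr = 0.06, cfi = 0.98)
#>       RMSEA        SRMR         CFI
#>      "good" "acceptable"      "good"
```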

Results

The Strauss model did not converge. The CHC model converged, but produced a warning indicating a negative residual variance, which may indicate misspecification if the negative variance is large (Kolenikov & Bollen, 2012). However, the variance was not significantly different from 0, θ = −0.032, z = −0.581, p = 0.561.
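This Wald test can be reconstructed from the reported values alone; a quick check in R:

```r
theta <- -0.032                # estimated residual variance
z     <- -0.581                # reported Wald z statistic
se    <- theta / z             # implied standard error, about 0.055
p     <- 2 * pnorm(-abs(z))    # two-sided p value, about 0.561
```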

The results of the model comparison between candidate models are given in Table 6. As in study 1, the fit measures for the model that did not converge could not be interpreted and are not reported. With respect to relative fit, the AIC and BIC indicate that the CHC model fits better than the other models.

Table 6 Model comparison results

All RMSEA values indicate good fit (all RMSEA < 0.05), except for the one factor model, for which the RMSEA indicates acceptable fit (RMSEA < 0.08). The SRMR values indicate a lack of fit for the one factor, Gross, and Hoogland models (SRMR > 0.10), and acceptable fit for the Lezak, Larrabee, and CHC models (0.05 < SRMR < 0.10). The CFI values indicate a lack of fit for all models (CFI < 0.95), except for the CHC model, for which fit was good (CFI > 0.97).

Next, we compared the CHC model fitted in study 2 to the CHC model fitted in study 1 to determine whether the factor structure was stable across the two analyses. The methods used in the two studies were dissimilar: correlation matrices were analyzed in study 1, whereas actual test scores were analyzed in study 2. Because the scale of factor loadings and residual variances depends on the scale of the outcome measure, it is not warranted to compare factor loadings or residual variances between studies. However, the correlations between latent variables can be compared. To make the models comparable, the CHC model without the “Working Memory” latent variable from study 2 was fitted to the meta-analytic data from study 1 without Digit Span Forwards and Digit Span Backwards. The model is depicted in Fig. 3, in which the correlations between latent variables are also provided. As in study 1, the “Processing Speed” factor is reverse coded. The correlations were in the same direction in both studies, and were lower in the second study. This could be due to the more appropriate demographic corrections in study 2: regression-based corrections of the raw data were used rather than a partial correlation approach, and level of education was coded on the same seven-point scale for all included samples.

Fig. 3 CHC model for the ten tests included in study 2. For each combination of latent variables, the correlation is given for the meta-analytic data in roman type, and for the ANDI data in italic type

General Discussion

In this article, we sought to establish the cognitive domains that are measured by neuropsychological tests. Cognitive domains are used by neuropsychologists to make decisions on which tests to administer to a particular patient, to determine whether a disorder affects a single domain or multiple domains, to calculate composite scores of different tests belonging to the same domain, and to validate new tests that are designed to measure a particular cognitive function.

We compared several neuropsychological factor models that have been formulated in the literature. First, we performed a factor meta-analysis of correlation matrices, using the meta-analytic structural equation modeling (MASEM) framework (Cheung & Chan, 2005). Second, the different factor models were fitted to raw data from the ANDI database (De Vent et al., 2016a). Both analyses included a range of neuropsychological tests and a very large sample, and accounted for the effects of age, sex, and level of education. Using these two different methods and samples, the same result was obtained: the Cattell-Horn-Carroll (CHC) model best described the data.

In the introduction, we formulated the aims of this study: to decide how many domains exist, what these domains are, and which cognitive tests belong to which domain. We set out to examine the factor structure of all common neuropsychological test variables, but were restricted by the availability of data to the twelve most frequently used test variables. From the factor analysis, we can conclude that the CHC model is an appropriate categorization for these twelve variables. For the tests that were considered in this article, the CHC model consists of five intercorrelated factors: “Acquired knowledge or crystallized ability,” “Long-term memory encoding and retrieval,” “Processing speed,” “Working memory,” and “Word fluency.” The Boston Naming Test and Story Recall variables loaded on the first factor. The Verbal Learning Test variables and Story Recall variables loaded on the second factor. Coding (Digit Symbol Substitution) and Trail Making Test Parts A and B loaded on the third factor. The Digit Span variables and Trail Making Test Part B loaded on the fourth factor. Letter Fluency and Semantic Fluency loaded on the fifth factor.

The CHC model has four unique aspects compared to the other models fitted in this article. First, Letter Fluency and Semantic Fluency are typically either paired with the Boston Naming Test to form a “Language” factor (Larrabee) or considered “Executive Functioning” tests (Strauss, Lezak, Gross, Hoogland). In the CHC model as formulated by Jewsbury, Bowden, and Duff (2016), a separate factor is estimated for these fluency tests (Schneider & McGrew, 2018; Jewsbury & Bowden, 2016). Second, the Boston Naming Test is typically either a constituent of a “Verbal” factor (Larrabee, Hoogland) or considered separate from the other tests included here (Strauss, Lezak, Gross). In the CHC model, the Boston Naming Test is paired with the Story Recall variables to form the “Acquired knowledge or crystallized ability” factor. Third, the Digit Span variables are typically paired with Coding (Strauss, Lezak, Gross, Hoogland) and Trail Making Test Part A (Lezak, Gross, Hoogland). In the best-fitting CHC model, the Digit Span variables formed a separate factor and were not paired with any of these variables. Fourth, all other models except Hoogland had no cross-loadings, that is, all variables belonged to only one domain. The best-fitting CHC model had three cross-loadings, with Trail Making Test Part B measuring both “Working memory” and “Processing speed,” and Story Recall Immediate and Delayed Recall measuring both “Acquired knowledge or crystallized ability” and “Long-term memory encoding and retrieval.”

In our analyses of the CHC model, the Boston Naming Test was the only measure that was an indicator of only “Acquired knowledge or crystallized ability,” since Story Recall also functions as an indicator of “Long-term memory encoding and retrieval.” However, the Boston Naming Test is not a pure measure of crystallized ability in Jewsbury, Bowden, and Duff’s (2016) specification of the CHC model; it is also an indicator of “Visuo-spatial ability,” alongside tests like the Rey Complex Figure Test and Judgement of Line Orientation. In practice, the Boston Naming Test is used specifically for identifying naming deficits in patients with aphasia or dementia (Lezak, Howieson, Bigler, & Tranel, 2012, p. 551), and is not used to characterize individual differences in crystallized ability or visuo-spatial ability in cognitively healthy participants.

One caveat concerning the goodness-of-fit of the CHC model is its number of cognitive domains relative to the number of variables that were included. If there are many latent variables, and a latent variable has indicators from only a single measurement instrument, this latent variable may not represent a cognitive domain but simply the variance that is unique to that particular method. In each of the models examined in this article, two test variables that belong to the same test were always assumed to measure the same latent variable. Therefore, this issue, sometimes called method variance, is not unique to the CHC model. In fact, no factor in the CHC model has indicators belonging to only a single instrument. Therefore, of the models specified here, in combination with this selection of variables, the CHC results should be relatively less determined by method variance. However, more data from a broader selection of tests are needed.

The present article lends support to the recent endorsement of the CHC model by Jewsbury, Bowden, and Duff (2016). Their study, in which separate analyses were run for every dataset, provided a better coverage of spatial ability and fluid reasoning domains. However, the current study adds to the Jewsbury et al. findings in several ways. First, in two studies we were able to perform a single analysis of multiple datasets, thereby yielding a very large sample size. Second, the fit of the CHC model was good even though we corrected for age, sex, and level of education, which could have distorted earlier analyses. Third, we compared the CHC model to various alternatives, and even among those alternatives, the CHC model provided the best fit. Therefore, this article provides strong evidence for the CHC model.

The fact that the CHC model fits better than other models has a number of consequences for neuropsychology. First, a consequence of the cross-loadings in the CHC model is that it corroborates the view that tests generally measure more than one domain. For test selection, this does not mean that these are bad tests to administer, but rather that they can be informative for multiple domains at once. For example, if a low score on Trail Making Test Part B is observed, this could indicate impairment of “Processing speed” if observed with a low score on Trail Making Test Part A, or indicate impairment of “Working memory” if observed with a low score on Digit Span.

Second, the result has implications for the distinction between single-domain and multi-domain disorders. These disorders have typically been defined with reference to the expert-opinion domains, that is, “Executive Functions,” “Memory,” “Attention,” and so on (Petersen, 2004). Given the results, it seems better to work instead with “Long-term memory encoding and retrieval,” “Acquired knowledge or crystallized ability,” “Processing speed,” “Working memory,” and “Word fluency.” Application of the single-domain and multi-domain criteria with these domains would be straightforward. However, it is not clear whether the results that have been obtained in studies using the traditional domain definitions (e.g., Ganguli et al., 2010; Libon et al., 2010) also hold under the CHC domain definitions. It could be worthwhile to return to already published data and apply the criteria using the CHC domains, to study their prognostic value in comparison to that of the criteria using the traditional domains. One domain that is important for diagnosis in the traditional model is the “Memory” domain, which is used to define amnestic variants of disorders (Tabert et al., 2006). In the CHC model, the “Long-term memory encoding and retrieval” domain could be used for the same purpose, since the tests that load on the traditional “Memory” factor all load on this factor as well.

Third, by calculating composite scores for a particular cognitive domain, one assumes that differences between people in their test scores are due to differences in their latent ability on this cognitive domain, that is, that the cognitive domain is unidimensional (Borsboom, 2008). This is done, for example, in the calculation of an “Executive functioning” composite score (e.g., Gross et al., 2015), where one implicitly assumes that individual variation on Trail Making Test Part B, Coding, and Digit Span Backwards is due to individual variation in executive functioning. From a practical standpoint, calculating a composite score may be a useful form of data reduction, and the resulting latent variable may correspond well with what is known from the literature (Preacher & MacCallum, 2003). Therefore, from a constructivist perspective, there is something to be said for calculating such a composite score (Borsboom, 2008). From a statistical perspective, though, an “Executive functioning” composite score does not seem to represent a unitary construct. The variables that are typically assigned to the “Executive Functioning” domain are spread out over three domains in the best-fitting CHC model (“Processing speed,” “Working memory,” and “Word fluency”), suggesting that unidimensionality is violated.

Fourth, it should be recognized that in both analyses, all latent variables in the CHC model were correlated. The influences of age and level of education, which could have artificially produced such correlations, were partialed out. Therefore, although tests in neuropsychological practice are designed to measure well-separable cognitive domains, these domains do not in fact seem completely separable. This could be due to the design of the tests: perhaps tests have not been designed such that they specifically measure individual variation in “Working memory” without also measuring variation in “Processing speed.” However, it could also be due to the nature of cognitive functioning: all cognitive functions could be so deeply intertwined that it is not possible to measure one without the other (Van der Maas et al., 2006).

Although this article emphasized the differences between factor models, we can as easily focus on their commonalities. As one reviewer pointed out, the Lezak model is highly similar to all other models. This is not surprising, as the test variables included here were devised with certain measurement goals in mind. For example, Story Recall and the Verbal Learning Test were devised for the measurement of memory, and they were subsumed under the same latent variable in all models considered here. Therefore, the differences between models necessarily concern variables like the Boston Naming Test and Fluency, which are less easily categorized. None of the models derived from the literature was proven wrong, although the one factor model did fit considerably worse than the other models.

It is important to recognize the limitations of our results. First, the goal was to establish a factor model for cognitively healthy participants, although some participants included in the analyses may not have been cognitively healthy. Some of the contributing studies did not have the explicit goal of excluding pathology, but instead aimed to obtain a representative sample from the population. This is true for both studies 1 and 2.

Second, we should be careful not to overgeneralize the results to other samples. Tests loading on the same latent variable are not necessarily redundant measures of that latent variable in all samples. For example, immediate recall and delayed recall on the Verbal Learning Tests were found to be indicators of the same latent variable in the CHC model. However, it has been argued that immediate and delayed recall are not interchangeable tests in clinical practice, since the function of one may be disrupted by disorder or injury while the other remains intact (Delis, Jacobson, Bondi, Hamilton, & Salmon, 2003). Many neuropsychological tasks show particular sensitivities to specific disorders. For example, aphasia may have specific effects on language-based tasks that have little to do with the higher-order factor structure proposed here. If neuropsychological tests measured different latent variables in different populations, this would invalidate standard neuropsychological assessment practice. However, investigations into measurement invariance, which statistically evaluate whether the same factor structure holds for different populations (Widaman & Reise, 1997), have shown that the same factor structure holds for many populations in neuropsychology (Bowden, Cook, Bardenhagen, Shores, & Carstairs, 2004; Park et al., 2012; Schretlen et al., 2013). Therefore, measurement invariance seems to be the rule rather than the exception.

Third, in working with correlations between variables, the underlying assumption is that the relations between variables can be captured by a linear model. This assumption may not be tenable for every pair of variables. Especially the relation between age and some test variables may be better characterized by a curve than by a straight line, with accelerated deterioration for the very old (Salthouse, 2009). Non-linearity may affect partial correlations in a variety of ways, such as masking partial correlations or introducing spurious partial correlations (Vargha, Bergman, & Delaney, 2013). However, although non-linear effects are found, the effects are still largely linear for most ages (rather than, for example, following an inverted U-shape). Therefore, the effect of the violation of the linearity assumption would be a small undercorrection for the effect of age for the very old.

Fourth, only parts of the models that were described in the introduction were tested in this study. For example, far from all variables that are mentioned in Lezak, Howieson, Bigler, and Tranel (2012) could be included, and not all latent variables from their overview could be represented in the latent structure of the model. This was the case for all models. After all, just twelve test variables were included in study 1 and ten in study 2, whereas many more test variables are used in clinical neuropsychology. However, these ten to twelve variables are not a random selection from the field: they are the most commonly used and form the basis of many neuropsychological assessments.

Fifth, there was a high correlation between two latent variables of the CHC model, “Long-term memory encoding and retrieval” and “Acquired knowledge or crystallized ability.” Jewsbury, Bowden, and Duff (2016) ran into similar issues for some datasets, wherein factors were so highly correlated that there were problems in fitting the model (described in their supplemental material). Although the correlation between these latent variables in study 1 is statistically different from 1 (95% CI: 0.95–0.99), we do not consider these latent variables to be distinguishable with the current selection of variables. We therefore considered collapsing these two latent variables into one. However, such exploratory adaptations to the model could overfit this particular selection of variables and would not fit our confirmatory setup. As Jewsbury, Bowden, and Duff (2016) show, these two factors are meaningful when considering other selections of tests.

To our knowledge, MASEM has not previously been utilized in a large-scale project like this, but has been limited to the structure of small sets of variables. One other source of large-scale data that would be amenable to this type of study, specifically to study the CHC model, would be the datasets that Carroll used to study the structure of intelligence, collected in the Human Cognitive Abilities project (McGrew, 2009). Furthermore, with correlation matrices from newly published studies, the present meta-analysis could be extended to include other variables and to include variance parameters on the correlations. To facilitate such an analysis, we provide correlation matrices in the appendix. We recommend that, as a rule, correlation matrices be shared publicly in articles or in supplemental materials, to facilitate the type of meta-analysis presented here.

To conclude, in two independent large-scale analyses, the Cattell-Horn-Carroll (CHC) model best described the structure of neuropsychological test domains.