Despite an increasing push for consideration of participant demographics, generalizability remains a contemporary threat to psychological research and the validity of neuropsychological assessment. The National Institutes of Health (NIH, 2022) has strengthened requirements to disaggregate effects by race/ethnicity and gender. Still, recent randomized controlled trials (RCTs) indicate a persistent lack of demographic reporting, with White participants making up 75% or more of the reported samples (De Jesús-Romero et al., 2022, preprint). Nevertheless, recent studies suggest improvements in the reporting of sample characteristics within both neuropsychology (Medina et al., 2021) and pediatric psychology (Raad et al., 2008). Given changes in recommendations and norms for reporting sample demographics, an important consideration is how this information can be used to understand and further test the generalizability of findings and assessments in pediatric neuropsychology.

Discussions of generalizability in psychology, and neuropsychology specifically, are not new (Drotar, 1994; Okazaki & Sue, 1998; Sears, 1986). Yet, as evident from related initiatives taken by journals such as Neuropsychology (Yeates, 2022), this issue continues to be a contemporary threat that warrants attention. Indeed, the agenda of the Relevance 2050 Initiative by the American Academy of Clinical Neuropsychology (AACN) estimates that 60% of the American population will be “un-testable” by 2050 using current neuropsychological assessment strategies. Thus, we aim to discuss the role of historical influences on generalizability in neuropsychological assessment and to synthesize best practices for researchers and clinicians, including how to overcome several barriers to advancing generalizability in the field of pediatric neuropsychological assessment.

We define generalizability as the extent to which outcomes of research with certain samples are applicable to the general human population, or across multiple populations and contexts. To concretely evaluate generalizability, samples should include sufficient representation of the population of interest (e.g., avoiding too narrow of a data base, Sears, 1986). In the context of neuropsychological testing, if a test does not specify sample demographics, the clinician is left to wonder whether the test is a good fit for their patient. A lack of generalizability in assessment research can also result in a lack of test fairness or diagnostic error for clients who are underrepresented in research, contributing to compounded stigmatization in the healthcare system. We situate generalizability as a consideration that is inherently intertwined with the scientific method: one that applies at multiple points in the research process and aids in understanding a phenomenon more broadly.

In the following sections, we provide a summary of past discussions of generalizability in order to situate our proposed recommendations for pediatric neuropsychology within the larger field of psychology. We begin with a broad overview of generalizability in psychology, then move to evidence suggesting that generalizability remains a contemporary threat to pediatric neuropsychology research, and, finally, provide strategies to overcome common obstacles to advancing generalizability.

Historical Considerations for Psychology and Neuropsychology

Neuropsychology and psychology in general are never wholly detached from historical trends within the field. Thus, evaluating generalizability and sample demographics must integrate knowledge of how the field has approached these issues in the past. Psychology has been built from a narrow knowledge base driving both active and passive misrepresentations and mistreatment of several stigmatized groups (Guthrie, 2004; Henry, 2008; Sears, 1986).

Notably, stigmatized groups who have been passively and actively harmed by assessment or research are more likely to have medical mistrust that reflects these injustices (Jaiswal & Halkitis, 2019) and reduced cultural safety (Gale et al., 2022). These relationships with the field feed into treatment-seeking barriers and treatment disparities (Haag et al., 2019; Rivera-Mindt et al., 2010). For example, this exclusive knowledge base has contributed to gaps in understanding and treating Black women’s mental health (see Spates, 2012 for a review). More broadly, foundational “facts” developed by a narrow knowledge base may not generalize, evidenced by misdiagnosis disparities across race (e.g., Gara et al., 2012; Kiselica et al., 2021). Thus, it is crucial to recognize environmental and cultural factors that may influence test outcomes in diverse clients (Ardila, 2005).

The early development and application of intelligence testing is a prime example of the active historical harms of assessment and the repercussions of failing to account for generalizability. As outlined in Boake (2002) and Wasserman (2018), French psychologists Alfred Binet and Theodore Simon developed early intelligence tests to identify children in need of special education services, namely those whose scores fell below that of an average child of their age. The test was translated and adapted for use in the USA by psychologist Henry Herbert Goddard, gaining immense popularity and wide usage in schools and other settings, including influencing immigration decisions and policy. Goddard’s conclusions quickly contributed to the use of intelligence assessments to support eugenics. The movement advocated restraining certain populations from having children through institutional isolation or sexual sterilization. Hence, families who had a child with low intelligence scores were stigmatized and deemed genetically flawed. Furthermore, in 1913, the intelligence test was used with immigrants at Ellis Island, where performance that largely reflected English proficiency was labeled as intellectual functioning, without considering that individuals lacked formal education and/or any exposure to the English language.

Similarly, in the early 1900s, intelligence tests developed by the American Psychological Association Committee on Psychological Examination of Recruits highlighted racial disparities in test scores. This bolstered the claim that intelligence is influenced not only by genetics but also by race, fueling immigration and sterilization laws that marginalized Black Americans and immigrant populations in the USA. For example, in 1924, the state of Virginia allowed compulsory sterilization of people with low intelligence, and North Carolina established an official Eugenics Board whose sterilization decisions remained racialized well into the 1970s (Schoen, 2005). The harm of eugenics was thus legitimized by early intelligence testing and psychological science. Indeed, this recent history informs how individuals interact with neuropsychological testing and how intelligence testing has evolved over time.

Today, intelligence tests such as the Wechsler Intelligence Scale for Children (WISC-V; Wechsler, 2014) and the Woodcock-Johnson Tests of Cognitive Abilities (WJ-IV; Schrank et al., 2014) are frequently used in both research and clinical endeavors (e.g., to assess learning disabilities and special education service eligibility, Kamphaus et al., 2000; Kranzler et al., 2020). These tests can be administered alone but are often part of a comprehensive neuropsychological battery, for example, to determine the functional consequences of a traumatic brain injury (Kinsella et al., 1995). Although intelligence tests have utility in assigning numeric values to a construct, it is necessary to consider the historical harms of the tests and account for a client’s socio-historical background when interpreting results.

The mistrust and harm done in the name of science have lasting implications: people of color are less willing than European Americans to participate in developmental research due to mistrust (e.g., Freimuth et al., 2001). In the context of historical harms, we turn to reviewing literature on when and why generalizability is an important consideration with a focus on how a lack of attention to these factors threatens neuropsychology research and practice.

When and Why Does Generalizability Matter?

One implication of psychology’s exclusionary knowledge base is that many foundational studies are rooted in understanding processes in Western, educated, industrialized, rich, and democratic (WEIRD) participants (Arnett, 2009; Henrich et al., 2010). Henrich et al. (2010) reviewed scientific research across behavioral sciences to assess whether researchers can reasonably generalize from studies using WEIRD samples. They found that addressing research questions using data drawn from a narrow sub-population significantly undermines generalizability. Following from this, an important question is when and why a lack of generalizability is an issue.

Notably, there may be some situations for which generalizability is not a primary concern for a study. If the research objective is limited to interpretation of only the sample population, then generalizability may not be relevant and/or the research question may be best answered using WEIRD samples, which can require fewer resources and allow traditional data collection methods (see Mook, 1983). For instance, in qualitative research or case study analysis, the inference process is not applicable, and generalizability is not a relevant goal (Myers, 2000). In pediatric neuropsychology, this may also occur with highly specialized assessments (e.g., ADHD assessments in 8th-grade children in the USA). However, even within this example’s specific cultural context and age group, racial/ethnic disparities in test sensitivity are present, making generalizability a valid concern (Morgan et al., 2013). When considering the field of pediatric neuropsychology, researchers and clinicians typically seek to make inferences from a sample about a broader population. Thus, a vital question is which populations are represented in a sample and whether this aligns with the populations to whom the findings are claimed to extend.

We argue that universalism—the idea that a basic psychological process should apply across populations—is often implied or assumed, even when a test development study includes narrow samples. For example, in an examination of 5000 published articles across psychology, titles rarely mentioned regional details when samples were from the USA relative to other WEIRD and non-WEIRD regions (Cheon et al., 2020). In addition, researchers in the USA were more likely to qualify race or ethnicity in titles for research with minoritized samples (e.g., Asian Americans) than for research with White samples (Cheon et al., 2020; Roberts & Mortenson, 2022). Failing to specify these characteristics in published work may reflect an assumption among researchers, and potentially among reviewers and editors, that effects found in White U.S. populations will generalize to humankind more broadly, making tests of generalizability in these findings seem low priority. Illuminating the focus on WEIRD, U.S.-centric samples has led to an expanding body of work that tests the generalizability of what were assumed to be universal truths. Below, we outline the benefits of considering generalizability and review some key approaches to issues of generalizability in neuropsychology.

Threats and Promises of Generalizability in Neuropsychology

As mentioned previously, drawing inferences is an important goal of researchers. A finding that is not generalizable (e.g., neural patterns in attentional control; Hedden et al., 2008) could suggest that the effect applies to only specific populations or groups. And yet, these constraints on generalizability are not always discussed. There are two major implications when a field’s effects or assessments are representative of only a small number of people, particularly without acknowledgment.

First, a lack of generalizability operates as a threat to understanding humankind more broadly. A clear limitation in this regard is that failing to test the generalizability of a given finding severely constrains knowledge of psychological phenomena. For instance, consider an example outlined by Haeffel and Cobb (2022) in testing the cognitive vulnerability to depression hypothesis across five populations. Despite existing theorizing on depression that centers negative cognitions as a potential driver of depressive symptomology (e.g., Alloy et al., 1999), they did not find evidence that the association between cognitive vulnerability and depressive symptoms generalized to Honduran and Nepali participants. They suggest that cognitive vulnerability may confer depression risk in some countries, but not others. Without such an examination, an understanding of how cognitive factors are related to depression may remain incomplete. Relatedly, ungeneralizable assessments and measurements impact researchers’ and clinicians’ ability to capture the construct in question. An inability to measure a construct in different populations serves as a direct challenge to adequately research the construct.

Second, a lack of generalizability is a threat from a practical standpoint. People have a right to valid, reliable assessment, and an ungeneralizable field restricts access to this right. When treatments or assessments lack applicability, populations who are underrepresented in research are more likely to be deprived of adequate services. Carrying forward the example of negative cognitions, a treatment that focuses on reframing negative cognitions (e.g., as in CBT, Kaufman et al., 2005) may not be suitable across cultures. Applying ungeneralizable assessments could thus lead to erroneous conclusions or diagnoses. These conclusions and treatment disparities are exacerbated by limited sampling of racial minority populations during test norming, as acknowledged by the American Academy of Clinical Neuropsychology (AACN, 2021). Consequently, generalizability becomes a social justice issue in that tools for efficacious treatment must serve all groups, not just WEIRD samples. Thus, a lack of generalizability can operate as a direct threat to the field of neuropsychology.

In contrast, knowing and appropriately considering the generalizability of findings serve to benefit neuropsychology. For instance, knowledge of whether a finding generalizes can improve cultural competency in assessment and practice, subsequently directing resources toward treatments that are highly efficacious across multiple populations. This illustrates the promise of considering generalizability. Testing generalizability can advance both theory and practice if a finding generalizes or fails to generalize. For instance, the lack of generalizability regarding negative cognitions illustrates a potential gap to be explored. From a scientific viewpoint, greater consideration of whether effects generalize can help us understand the why, or mechanisms, behind a phenomenon.

Approaches to Addressing Generalizability in Neuropsychological Assessment

One strategy to address the lack of representation and subsequent potential lack of generalizability in neuropsychological research has been the implementation of demographic norm adjustments. At face value, applying different norms for different populations appears to address the fact that assessments may differentially capture constructs across groups, thus improving diagnostic accuracy. However, there are several major concerns with this approach.

The American Academy of Clinical Neuropsychology’s (AACN) 2021 position statement highlights severe shortcomings in how neuropsychology has historically attempted to address issues of generalizability using race-stratified norms. Notably, race norming erases the diverse drivers of existing disparities and reinforces myths of innate biological differences across people of different races. In reality, there are no discrete differences in genetic variation across racial groups, and conceptualizations of race vary widely across regions, pointing to race as a social category (Smedley & Smedley, 2005). Despite this, race norming was often used to further marginalize Black people in the USA. For instance, in the case of the National Football League (NFL), race norms for dementia systematically placed Black players at a disadvantage for warranted compensation (Possin et al., 2021). The AACN’s recommendations therefore emphasize accounting for environmental factors such as socio-economic status, access to education, disparities in nutrition, the historical impact of racism and discrimination, and cultural differences during test interpretation and overall assessment. Further, mistrust is a predictor of cognitive test performance and other neuropsychological testing outcomes (Terrell et al., 1981). Yet, race norming fails to account for the fact that medical mistrust is more likely among minoritized groups, such as Black Americans, who have experienced stigma and harm from the medical system.

Past offensive applications of race norming illuminate flaws in demographically stratified norms more broadly. Drawing from intersectional and historical perspectives, issues of generalizability do not simply arise from a need to adapt a measure from one population to another. Rather, it is important to consider socio-contextual factors that impact the measure. For example, in the case of race-stratified norms, AACN outlined how the use of race in norming reinforces a notion of biological differences across racial groups. Yet, the observed differences are not due to internal factors or the demographic category. Rather, there is more evidence that factors such as who developed the measure, which populations it was validated on, and what cultural knowledge it draws from can cause variation in outcomes (see also Gould, 1996 for a similar perspective on intelligence testing). In other words, demographically stratified norms imply that differences exist due to group categories (e.g., race or gender), promoting a narrow view of what causes assessment differences within that domain. One consequence of this practice is reduced attention to why these differences exist and in what ways the assessment fails to generalize.

We propose two important considerations: understanding not only whether a construct generalizes but also how it generalizes. From an inferential statistics perspective, testing generalizability is akin to asking whether a significant effect is observed across multiple samples and thus generalizes to a broader population. The assumption that the sample represents the population is therefore ingrained in statistical inference. In practice, however, multiple samples are needed to provide cumulative support for generalizability (see also Bonett, 2012). This testing process is further complicated by a frequent lack of statistical power and by unequal group sizes when samples include minoritized groups. Thus, another equally important question is “how”: Is the strength of the effect equivalent across samples? Does the measurement capture the construct in the same way across samples? These latter questions are less frequently examined but remain important components of generalizability. Notably, psychological measurement is rooted in the idea that we can adequately measure internal or behavioral processes. If measurement differs across samples and the ability to detect a significant effect is weakened for some group comparisons, this can impede sufficient tests of generalizability and reduce understanding of the population-level construct of interest.
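The power complication noted above can be made concrete with a small simulation. The sketch below uses hypothetical values (a 0.5 SD group difference, a total N of 200) not drawn from any cited study; it estimates how often a true group difference is detected when the total sample size is fixed but one group, such as a minoritized subsample, is small.

```python
# Illustrative simulation: at a fixed total N, unequal group sizes reduce
# power to detect a true group difference. All values are hypothetical.
import math
import random

def simulate_power(n1, n2, effect_size=0.5, n_sims=2000, crit=1.96, seed=42):
    """Estimate the fraction of simulated studies in which a true mean
    difference of `effect_size` SDs is detected (|Welch t| > crit)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        g1 = [rng.gauss(effect_size, 1.0) for _ in range(n1)]
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        m1, m2 = sum(g1) / n1, sum(g2) / n2
        v1 = sum((x - m1) ** 2 for x in g1) / (n1 - 1)
        v2 = sum((x - m2) ** 2 for x in g2) / (n2 - 1)
        t = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)  # Welch t statistic
        if abs(t) > crit:  # normal approximation to the critical value
            hits += 1
    return hits / n_sims

# Same total N = 200 and same true effect; only the group balance differs.
balanced = simulate_power(100, 100)   # both groups well represented
unbalanced = simulate_power(180, 20)  # e.g., a small minoritized subsample
print(balanced > unbalanced)  # prints True: the balanced design detects the effect more often
```

Under these assumed numbers, the balanced design detects the effect in roughly nine of ten simulated studies while the unbalanced design does so far less often, illustrating why group comparisons involving small minoritized subsamples are frequently underpowered.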

The Beck Depression Inventory (BDI) provides a clear illustration of why generalizability matters in these two ways. While developed primarily for clinical use, the BDI is widely used across populations. A consequence of its wide use and established reliability is that empirical work using the BDI rarely reports sample-specific reliability information (e.g., only 7.5% of 1200 articles, Yin & Fan, 2000). And yet, the reliability of BDI scores is meaningfully impacted by who is responding to the measure; Yin and Fan (2000) found that BDI reliability was substantially lower among participants who abused substances than among non-clinical participants. Lower reliability can lead to an incomplete understanding of the antecedents and treatment efficacy of depression. For instance, if people who abuse substances tend to show lower reliability in depression scores when using the BDI, it would be more difficult to meaningfully interpret pre- and post-test scores following depression interventions. A substantial change in an individual’s score from pre- to post-test could represent true change or chance change due to measurement error. Similarly, null findings could reflect measurement noise or an inability to capture depressive symptomology. Thus, inconsistent reliability estimates across samples can substantially complicate the conclusions of research testing whether the same factors predict depression across populations.
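One common way to formalize this point is the reliable change index (RCI; Jacobson & Truax, 1991), which scales the band of chance pre-to-post change by score reliability. The sketch below is a minimal illustration, assuming a hypothetical score SD of 10 and two illustrative reliability values rather than figures from any cited BDI sample.

```python
# Sketch of the reliable change index (RCI) logic: lower score reliability
# widens the band of pre-to-post change attributable to measurement error
# alone. The SD and reliabilities below are hypothetical.
import math

def reliable_change_threshold(sd, reliability, z=1.96):
    """Smallest pre-to-post change unlikely (~95%) to reflect chance alone."""
    se_measurement = sd * math.sqrt(1 - reliability)  # standard error of measurement
    se_difference = math.sqrt(2) * se_measurement     # SE of a difference score
    return z * se_difference

high_rel = reliable_change_threshold(sd=10, reliability=0.90)  # ~8.8 points
low_rel = reliable_change_threshold(sd=10, reliability=0.70)   # ~15.2 points

# The same 10-point improvement is interpretable only under higher reliability:
print(10 > high_rel, 10 > low_rel)  # prints: True False
```

In other words, if reliability drops from .90 to .70 in a given subgroup, the change required to distinguish true improvement from measurement error nearly doubles, which is exactly why unreported sample-specific reliability complicates pre/post interpretation.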

In sum, generalizability feeds into neuropsychological assessment and practice at multiple levels: from the development and validation of assessment and client–clinician interactions to diagnosis and treatment. Yet, there are scattered guidelines with which to apply knowledge of these issues and several barriers to applying recommended best practices to advance generalizability of the field. With this in mind, we turn to synthesizing recommendations and best practices at each of these levels. We focus on methodological and statistical considerations in assessment research and then move to discuss how clinicians can integrate these into assessment and treatment. In contrast to strategies that focus on only one level of this process—such as adapting assessment scoring using demographic-stratified norms—we advocate for a multipronged approach that integrates generalizability into all stages of the research process.

Advancing Generalizability in Pediatric Neuropsychology

Moving toward a more generalizable field necessitates a shift in research methods and values, additional resources to recruit underrepresented populations or to increase access for rural or low-income sample pools, and efforts to counteract the mistrust created by prior harms of research on minoritized communities (i.e., avoiding transactional research and cultural incompetence when working with diverse samples and clients). In the context of pediatric research, several additional challenges come into play: recruitment often requires additional resources relative to adult samples, and participation barriers include dual consent/assent as well as family resources (e.g., circumstances such as limited morning availability of a caregiver to accompany children to a lab study). Recent advances in methodological approaches, combined with work on cultural competency in neuropsychology practice, provide several avenues through which generalizability can be integrated and evaluated.

In this section, we review concrete recommendations to assist in effectively testing and reporting treatment differences across sample demographics and in evaluating whether a given assessment generalizes across populations. Given the unique barriers to these various recommendations, incorporating every recommendation may be infeasible. Rather, we aim to provide researchers and clinicians with several strategies to overcome some barriers and increase generalizability, with the aim of cumulative advances in the field of pediatric neuropsychology. To synthesize these best practices, we reviewed literature on generalizability considerations in neuropsychology and in developmental and clinical psychology, with a focus on novel approaches to these long-standing problems. Table 1 provides an overview of the recommendations, which focus on best practices that serve researchers and clinicians in three main areas of pediatric neuropsychological assessment:

  1. Research design: reporting, measuring, and evaluating sample demographics appropriately; strategies for increasing sample and researcher diversity

  2. Analytic approaches: best practices for testing group differences within and across samples; statistical considerations for reliability and effect size

  3. Evaluation and application of findings: cultural competency in applying assessment; evaluating the generalizability of a scale; ways to look beyond numerical scores

Table 1 Best practice recommendations for researchers and clinicians in advancing and evaluating generalizability

Research Design

At the research design stage, several practices can increase the information that aids in evaluating generalizability in neuropsychological assessment. Notably, scholars have long called for population-based and theory-driven research to advance generalizability (Drotar, 1994). That is, prior to collecting data, researchers should actively consider the population of interest and design studies with those considerations in mind. In addition, theory testing relies on clearly specifying such populations, including whether a given theory is assumed to generalize to other groups. Later in the research design stage, recruitment decisions can be closely tied to target sample demographics to improve generalizability. Sample demographics are typically defined as quantitative assessments of sample characteristics including, but not limited to, age, gender/sex, race/ethnicity, and socioeconomic status. Thus, our first two recommendations in Table 1 concern increasing sample (and researcher) diversity along with measuring and reporting sample demographics.

First, in terms of increasing sample and researcher diversity in published literature, we recommend that research on pediatric neuropsychological assessment continue to expand recruitment strategies. Expanding recruitment strategies involves intentionality of sample selection (i.e., evaluating who the target sample is and whether that aligns with the research question or is a sample of convenience) and intentionality of recruitment outlets (i.e., multimethod recruitment strategies that reach multiple communities/labs). Related to issues of sample diversity, one commonly cited barrier to representative samples is that underrepresented populations are “hard to reach.” One reframing of this barrier is that underrepresented populations are labeled hard to reach because research with these samples requires a shift from traditional methods used with Western or White samples (e.g., convenience sampling, Sears, 1986; see also Lange, 2019). This is supported by a review of top psychology journals: White first authors were less likely than first authors of color to publish research with participants of color (Roberts et al., 2020). In addition, psychology often excludes theoretical approaches that challenge dominant norms about how research should be conducted (Buchanan et al., 2021; Settles et al., 2020). Thus, White researchers and other researchers of privilege likely continue to employ research practices that exclude underrepresented samples in the field of neuropsychology. Greater variation in research and recruitment methods can move pediatric neuropsychology toward a more generalizable science. We review three commonly proposed strategies—remote platforms, team science, and community-based research—and discuss their feasibility for use in pediatric neuropsychology.

Online crowdsourcing platforms (e.g., Amazon’s Mechanical Turk, Prolific) have been proposed as one way to increase the generalizability of non-clinical samples relative to student samples (see Henrich et al., 2010; Ledgerwood et al., 2017). These are not without flaws, however, and the age and abilities of pediatric samples restrict researchers’ options. Most online recruitment platforms exclude participants under 18 years of age, and many young participants may be unable to read written instructions or follow complex instructions and text-based stimuli. The COVID-19 pandemic, however, brought several remote platforms geared toward pediatric samples. These platforms have the potential to reduce barriers to recruitment of underrepresented populations (e.g., Lookit, Nelson & Oakes, 2021; Scott & Schulz, 2017; and TheChildLab, Sheskin & Keil, 2018). While early work points to some validity of online protocols for neuropsychological testing/teleneuropsychology in adults (e.g., Requena-Komuro et al., 2022; Saban & Ivry, 2021) and pediatric samples (Hamner et al., 2022; Salinas et al., 2020), additional work is needed to validate and establish protocols for online testing.

Drawing from discussions on the promises of team science (Hall et al., 2018) and increased funding for interdisciplinary collaborations (e.g., NIH BRAIN Initiative, Insel et al., 2013; Koroshetz et al., 2018), we believe increased collaborations and large-scale projects on understanding generalizability of assessments provide a promising path forward to increasing both sample and researcher diversity. Notably, greater reliance on team science would enhance investigations that allow for large-scale, in-person data collection with more representative samples. Multiple labs can participate and diversify the use of community samples to levels that would be unfeasible for one lab to reach.

For instance, developmental psychologists have called for the creation of organizations like the Psychological Science Accelerator (PSA; Moshontz et al., 2018) aimed at consolidating resources, establishing shared infrastructure, and increasing research outreach by providing participants with internet access (see Sheskin et al., 2020). The ManyBabies Consortium provides additional support for the validity of this approach to advancing generalizability. Past work demonstrates that this consortium increases the diversity of infant research, with data that reflect the population of interest for most infant studies (“human infants”; Byers-Heinlein et al., 2020; Visser et al., 2022). Furthermore, it provides support for national research across underfunded labs, thus enhancing the diversity of both teams and samples. In addition to increasing sample diversity, team-based science has the potential to counteract uniformity biases by incorporating more perspectives, which feed into greater diversity in what research questions are asked and how constructs are measured. To our knowledge, no existing organization serves this function within pediatric neuropsychology.

A third promising method is community-based participatory research (CBPR). This technique involves collaboration with people who have lived experience, recruited via outreach (community events, marketing, and word of mouth). Participants’ input is not only incorporated but actively drives the path of the research. Although this approach has been used with adults (O’Bryant et al., 2021), it has not been discussed at length in the context of pediatric neuropsychology. We recommend that clinicians and researchers consider adapting this approach to pediatric neuropsychology, informed by best practices in the health field more broadly (see Israel et al., 2005 for a review). Parents play an important role in encouraging their child’s participation, and thus CBPR could dismantle mistrust and offer logistical flexibility. Trust in clinicians, parent satisfaction with pediatric neuropsychological evaluations, and parental implementation of report recommendations have been noted as pressing issues for the field and vary across demographic factors including income (Elias et al., 2021; Gross et al., 2001; for systematic reviews, see Fisher et al., 2022; Spano et al., 2021). While reducing barriers to participation for diverse groups requires stepping outside the comfort of convenience sampling, CBPR provides a well-supported option.

With respect to our second recommendation in Table 1, recent reports suggest greater attention to participant demographics in neuropsychology research (e.g., Raad et al., 2008), but demographic reporting can continue to improve. Several current best practices exist for reporting and measuring participant demographics. The Journal Article Reporting Standards recommend reporting demographic characteristics in both the abstract and the method section, describing selection procedures, and noting whether exclusion criteria may inadvertently restrict demographic characteristics (Appelbaum et al., 2018). As an example of current recommended best practices in measurement, Call et al. (2022) outline a social justice approach to measuring and using demographic data, including recommendations such as avoiding the term “Other” in measurement (e.g., using “Not listed” or open-ended responses when multiple-choice options are inadequate; Cameron & Stinson, 2019) and informing participants of why and how their demographic data will be used. When possible, it is also best practice to use terms that align with how individuals describe themselves (e.g., adults with autism in English-speaking countries generally prefer identity-first language, whereas adults with autism in Dutch-speaking countries prefer person-first language; Buijsman et al., 2022; see also Monk et al., 2022). In the context of pediatric samples, careful attention should be given to how children and adolescents describe themselves relative to caregiver-ascribed labels.

Because social categories differ across cultures, demographic measurement should be considered at the design stage to determine what is appropriate for the context. There may also be instances where collection of demographic information is limited (e.g., when using data from medical records or brief intake information). Thus, our goal is not to establish a one-size-fits-all recommendation for measuring demographics; multiple inclusive measures exist, and the right choice depends on the context and groups being assessed. Rather, we raise these examples to demonstrate the breadth of research in this area. A general approach to assessing participant demographics is first to consider the context or region of the sample, second to identify the relevant demographics based on past research, and third to develop items that appropriately assess those demographics, accounting for cultural norms and changes in language over time.

Going a step further, reporting sample demographics is often understood as simply including participant characteristics within a method section, which leaves out a vital connection: contextualizing findings within those demographics. It has been proposed that publications include sections outlining constraints on generalizability in the context of sample characteristics (e.g., see Simons et al., 2017). Two areas in which contextualization of samples is severely lacking in psychology more generally are naming the roles of whiteness and U.S.-centrism in research findings. As discussed previously, findings with White participants in the USA are often discussed with generic language that implies White = humankind (Roberts & Mortenson, 2022). As another example, Salter and Adams (2013) outline how APA’s bias-free language guidelines exclude discussion of terminology for White people. These exclusions reinforce the idea that race and racism are marginal constructs that do not apply to White Americans, rendering the implications of a White-, Western-centered neuropsychology less visible. Reporting racial demographics in titles and abstracts even when the sample is primarily White helps dismantle this, yet this strategy alone falls short (see also Remedios, 2022). A more generalizable science requires interrogating the positionality of power among both researchers and samples. Said another way, research conducted on White or Western samples, by White or Western researchers, can enhance generalizability by not only reporting sample information but actively making its implications visible (e.g., through positionality statements or a section on constraints on generalizability). Within pediatric neuropsychology, these additions can provide vital information to clinicians about potential limits on the applicability of findings.

In sum, diversifying samples and research teams, as well as sustained reporting of sample demographics, has the potential to increase generalizability and direct attention to the constraints of WEIRD samples. We discussed several obstacles to increasing sample diversity and strategies to overcome them in the context of research design and reporting, notably team science collaborations and computer-based implementations. We recommend that journal editors, grant reviewers, and journal reviewers encourage researchers to report their sample demographics and limits of generalizability. In addition, clinicians should conscientiously assess sample demographics to calibrate their confidence in generalizing research findings. We next turn to recommendations for analytic approaches that seek to illuminate group differences or constraints on generalizability.

Analytic Approaches

We recommend two broad analytic best practices for advancing generalizability in neuropsychological research. First, as summarized in Table 1, researchers should consider testing and reporting effects across potentially informative sample demographics. Measurement invariance can be used to aid in this practice and assess the extent to which a construct is psychometrically equivalent across different contexts such as between-group or across-time comparisons (see Putnick & Bornstein, 2016 for a review of measurement invariance in psychology). Researchers typically seek to establish measurement invariance prior to testing for group comparisons to rule out the possibility that differences are due to psychometric differences (e.g., groups interpreting the items in different ways). Thus, establishing invariance is a way to increase confidence in the validity of interpreting group differences.

Measurement invariance is often tested using multigroup confirmatory factor analysis. Multiple levels of invariance can be met, ranging from configural invariance (whether the model or basic factor structure of the scale holds across conditions), metric invariance (whether the items contribute to the construct in the same way across conditions or show equivalent factor loadings), scalar invariance (equivalence of item intercepts, meaning that the overall levels of items are the same across conditions), to strict invariance (the items’ errors or residuals are equivalent). Most scholars argue for meeting scalar invariance at a minimum to compare mean level differences across groups, although strict invariance may be necessary for comparing observed composite scores (Cheung & Rensvold, 2002; Steinmetz, 2013; van de Schoot et al., 2012).
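To make the stakes concrete, the following minimal simulation (in Python, with hypothetical numbers, not a full multigroup CFA) illustrates why scalar invariance matters: a single item intercept that shifts in one group produces an apparent difference in observed composite scores even when the two groups have identical latent means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Both groups share an identical latent trait distribution (true difference = 0).
theta_a = rng.normal(0.0, 1.0, n)
theta_b = rng.normal(0.0, 1.0, n)

loadings = np.array([0.8, 0.7, 0.6])
intercepts_a = np.array([0.0, 0.0, 0.0])
# Scalar non-invariance: group B's third item has a shifted intercept,
# e.g., the item is interpreted as more severe in that group.
intercepts_b = np.array([0.0, 0.0, 0.5])

def simulate_items(theta, intercepts):
    noise = rng.normal(0.0, 0.5, (theta.size, loadings.size))
    return intercepts + np.outer(theta, loadings) + noise

composite_a = simulate_items(theta_a, intercepts_a).mean(axis=1)
composite_b = simulate_items(theta_b, intercepts_b).mean(axis=1)

# Observed composite means differ (by roughly 0.5 / 3) even though the
# latent means are identical, so a naive mean comparison is misleading.
print(round(composite_b.mean() - composite_a.mean(), 2))
```

In practice such intercept differences would be detected by comparing the fit of a scalar-invariant model against a metric-invariant one; the sketch only shows the artifact those tests guard against.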

Measurement invariance provides a pathway to testing generalizability across populations. For instance, Avila et al. (2020) had over 6,000 participants complete a neuropsychological battery and tested invariance across ethnicity/race and sex/gender. Their results suggest full invariance for most tests, except for language domain tests (see also Tuokko et al., 2009, for similar findings regarding language function tests such as verbal ability). Although invariance is typically discussed in neuropsychology publications focused on scale psychometrics, increased reporting regardless of a study’s aims is advisable. At an empirical level, more routine reporting of measurement invariance would aid in understanding the extent to which certain scales generalize across populations and in determining whether testing for between-group differences is appropriate. Structural equation modeling approaches comparing goodness-of-fit indicators are most common for testing invariance, but item response theory is another, increasingly employed, strategy (Millsap, 2010; Reise et al., 1993). Researchers looking to conduct invariance testing and clinicians looking to evaluate the invariance of scales can refer to van de Schoot et al. (2012) for a practical checklist. Other scholars have developed comprehensive tutorials in the freely available statistical program R (Fischer & Karl, 2019; Hirschfeld & Von Brachel, 2014).

However, an important limitation in measurement invariance and testing differences across groups is that group differences across demographics are generally used as a proxy for something (e.g., discrimination or cultural values). Thus, testing group differences across demographics can overlook what the demographic characteristics are being used as a proxy for and whether that construct should be measured instead. Notably, proper invariance testing approaches may also necessitate greater use of supplementary materials (to report invariance information) and/or open data practices (to provide the data for others to conduct invariance testing) in journals.

Another key obstacle to properly testing group differences is that research on underrepresented populations tends to rely on smaller sample sizes, owing to limited infrastructure for recruiting these populations. This can easily lead to an erroneous conclusion: observing a significant association in one group but not in another and concluding that this is evidence for a lack of generalizability. In reality, the discrepancy may simply reflect sufficient power to detect the effect in the overrepresented group and reduced statistical power for the same test in the smaller group. For instance, in a reference to unpublished work, Haeffel and Cobb (2022) suggest evidence of a lack of generalizability for some cognitive theories of depression. To illustrate when evidence for this type of claim would be invalid, consider one analytic approach: such work could sample across multiple countries and test the association between cognitive vulnerability and depression within each subsample, finding a significant association in most countries but no significant association among Nepali and Honduran participants. If data collection in Nepal and Honduras included far fewer participants than in the other countries, significance in one group and a lack of significance in another may reflect not a failure of the effect to replicate but the limitations of null hypothesis testing and sample size.
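The statistical point can be illustrated with a short simulation (hypothetical numbers, unrelated to the unpublished work discussed above): an identical true correlation is detected essentially always in a large group but missed roughly half the time in a small one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
RHO = 0.30  # identical true effect in both groups

def is_significant(n, rho, rng, alpha=0.05):
    # Draw n (x, y) pairs with true correlation rho and test the association.
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return stats.pearsonr(x, y)[1] < alpha

reps = 1000
power_large = np.mean([is_significant(400, RHO, rng) for _ in range(reps)])
power_small = np.mean([is_significant(40, RHO, rng) for _ in range(reps)])

# With n = 400 the effect is detected in nearly every replication; with
# n = 40 it is missed roughly half the time, despite the identical effect.
print(power_large, power_small)
```

A "significant here, non-significant there" pattern across these two groups is therefore expected about half the time under a perfectly generalizable effect.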

When invariance testing or sufficient sample sizes are unfeasible, researchers who employ this type of comparison should, at a minimum, examine effect sizes (e.g., r²) and provide additional information about the equivalence of the two effects. As an example of best practice in this area, McCaffrey et al. (2023) report effect sizes alongside p-values, examine mean score equivalency, and account for the role of lower statistical power in their comparisons. We thus advocate reporting effect sizes for all comparisons regardless of significance, as listed in Table 1 (see also Schatz et al., 2005). This practice would support more meaningful interpretation of comparisons beyond significance and provide information for future sample size determinations. Lab collaborations, a strategy discussed earlier, would also make testing for group differences with appropriate sample sizes more feasible.
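One concrete way to provide information about the equivalence of two effects is the classic Fisher r-to-z test for independent correlations; the numbers below are hypothetical and chosen to mirror the power scenario above.

```python
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))  # two-tailed
    return z, p

# Hypothetical results: similar effect sizes but very different sample sizes,
# so one correlation reaches significance on its own and the other does not.
z, p = compare_correlations(0.32, 400, 0.28, 40)
print(f"z = {z:.2f}, p = {p:.2f}")  # the two effects do not differ reliably
```

Reporting this comparison alongside the two effect sizes makes clear that "significant in group 1, not in group 2" is not itself evidence that the effects differ.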

In our second analytic recommendation, we advocate that reliability information should always be examined at a sample level, rather than presumed based on previous work. As illustrated with the BDI, the reliability of assessments is a sample-specific property and can vary according to the individuals being assessed. While psychometric-focused studies can provide valuable information regarding assessment constraints, this is also something that can be emphasized in empirical work. Considering the reliability of tests in a sample is important because lower reliability can result in lower statistical power. In addition, comparing the effect sizes of two identical studies has little meaning when the reliability of measures differs (Parsons et al., 2019). In assessing the generalizability of effects, reliability can be a component of interest (i.e., how a scale generalizes across populations) or a confounder (i.e., impairing comparisons of group scores).

When assessments are developed and validated with specific populations, their reliability may be lower for the global majority. This can in turn exacerbate the likelihood of null findings due to lower statistical power, compounded by the typically smaller samples for minoritized populations. Said another way, simply comparing the significance of hypothesis tests across groups is insufficient and may reflect method-specific variation (in terms of differences in reliability or group sizes). At an empirical level, researchers should thus calculate and report appropriate reliability information for all primary scales (e.g., Nazaribadie et al., 2021, report both prior and sample-specific reliability information for the Wisconsin Card Sorting Test).
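As a sketch of what sample-specific reliability reporting involves, the following computes Cronbach's alpha (one common, though not the only, reliability estimate) for the same simulated six-item scale administered to two samples that differ only in measurement noise.

```python
import numpy as np

def cronbach_alpha(items):
    """Sample-specific Cronbach's alpha for an (n_respondents, k_items) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# The same instrument administered to two simulated samples: reliability is
# a property of the scores in a sample, not a fixed attribute of the test.
rng = np.random.default_rng(2)
theta = rng.normal(size=1000)
precise_sample = theta[:, None] + rng.normal(scale=0.5, size=(1000, 6))
noisy_sample = theta[:, None] + rng.normal(scale=2.0, size=(1000, 6))

print(round(cronbach_alpha(precise_sample), 2))  # high reliability (~.96)
print(round(cronbach_alpha(noisy_sample), 2))    # much lower (~.60)
```

Under classical test theory, an observed correlation is attenuated by roughly the square root of the product of the two measures' reliabilities, so a reliability gap like the one above can masquerade as a difference in effect size, which is the comparability problem raised by Parsons et al. (2019).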

In sum, we have put forth two analytic approaches that would aid in generalizability considerations across neuropsychological assessment: further analyses that test for group differences, when feasible given the sample, and testing reliability of tests in all empirical work. Yet, the statistical validity discussed above is not an endpoint; results of assessments, even if internally valid and reliable, should consider the clients’ larger context (e.g., see Gale et al., 2022). Our final set of recommendations focuses on applying knowledge of generalizability in clinical practice and evaluating research.

Evaluation and Application of Findings

Thus far, our reviewed recommendations have focused on best practices to improve information to advance or test for generalizability (e.g., reporting reliability/demographic information, testing for group effects, increased use of novel methods and collaborative data), with few implications for evaluating generalizability from an applied perspective. Now, we discuss some ways in which clinicians and reviewers/editors of research can aid in generalizability evaluation and properly apply such findings.

In evaluating the generalizability of assessments, systematic reviews and meta-analyses can provide valuable information to clinicians, to the extent that null findings are included. Kopp et al. (2019), for example, provide evidence that the split-half reliability of the Wisconsin Card Sorting Test (WCST) had likely been underestimated in previous work with non-clinical adult samples. Their findings with neurological inpatients demonstrated high split-half reliability for the WCST, suggesting that the assessment maintains reliability with clinical samples but should be used with caution in non-clinical samples. This has clear implications for clinical use, and a researcher could integrate this knowledge by selecting alternative measures of cognitive flexibility for a non-clinical sample. In pediatric contexts, reviews focused on particular cognitive domains, such as executive functioning in children, can also be fruitful. For example, Souissi et al. (2022) comprehensively reviewed the leading theories of executive functioning and how it is measured in children. They further make recommendations for researchers and clinicians on how to better assess this domain using a multifaceted approach (performance-based tests along with ecologically valid measurement) and accounting for contextual (e.g., SES) as well as cultural (e.g., bilingualism) factors. Systematic reviews can thus provide a wealth of information for clinicians to integrate and present a snapshot of which tests might be most generalizable or appropriate for a given population. The review above (Souissi et al., 2022) also suggests tangible practices for applying this knowledge to assessment. Large-scale reviews can therefore both synthesize research gaps and inform clinical practice.

From a research point of view, improving generalizability is more likely when there are structural aids to achieve and incentivize the strategies outlined in the sections above. Neuropsychology shows promise here: reports from neuropsychological test batteries are often published regardless of statistical significance or novelty (Lange, 2019), which could mean there is minimal publication bias in scale evaluation literature. And, before using neuropsychological assessments with pediatric samples, assessments are often carefully evaluated across age and validated in target age groups (e.g., Pearson, 2018). However, across psychology, publishing is a persistent pressure. And, while many of the strategies to increase sample size or expand sample diversity address cost limitations, they do not address the resource of time. Using these methods may take greater time and effort on the part of researchers, and more structures that incentivize the time to make this shift are needed (e.g., special issues on generalizable methods). A lack of incentives can thus limit the availability of information from which to evaluate generalizability.

Another important consideration regarding structural incentives is that null findings should be as publishable as novel ones, in both high-impact and specialty journals, to sustain structures that incentivize reporting null findings and extensions. Pre-registration and registered reports are two approaches aimed at minimizing bias against null findings (Gonzales & Cunningham, 2015). Although much discussion has focused on how individual researchers can implement pre-registration, these approaches can also be incentivized at the journal level to reduce systemic or peer review bias against null findings and toward novelty (e.g., special registered report submission tracks). Similarly, reviewers can place more weight on the importance of sample diversity regardless of research question novelty, or encourage sample size justification with WEIRD samples. That is, generalizability considerations could receive more value if they are viewed not as an incremental contribution but as a novel approach, even when the research question has been asked before with narrow samples.

From a clinician point of view, applying these findings in practice requires a holistic approach to assessment. Our final two recommendations in Table 1 focus on the applicability of research on generalizability. During client and collateral interviews, clinicians should strive to paint a comprehensive picture of a client’s socio-historical background, with special attention to historical reasons for treatment-seeking gaps and mistrust accumulated over the years. Cultural values and norms unique to the client should be thoroughly understood by the clinician to account for variability in test scores attributable to ethnic-cultural and linguistic diversity. Information regarding the socio-cultural and historical background of a child is of paramount importance not only for better case conceptualization but also for evaluating assessment fit in terms of both cultural loading and linguistic demands (for recommendations, see Pedraza et al., 2020, and the ECLECTIC approach; Fujii, 2018). This information should guide the clinician’s choice of assessments that will be representative of the client’s neuropsychological functioning.

In addition to evaluating psychometrics and invariance testing, clinicians should thus pay close attention to the sampling population of the test and stay informed on recent literature in line with Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014). Failing to acknowledge factors such as cultural loading and using inappropriate tests introduces systematic biases, including barriers in test takers’ abilities to demonstrate their true level of functioning. Misrepresenting current neuropsychological function can further lead to misdiagnosis and call into question the overall validity of neuropsychological assessment among children and families. This underscores the ultimate takeaway for clinicians, which is to understand the numbers from the tests in a multilayered context of symptoms, cultural heritage, linguistic diversity, and broader socio-economic factors.

One area we identify as particularly important in the coming years is the treatment of linguistically diverse individuals (see also Manly, 2008). Many of the papers reviewed above showed limitations in the generalizability of verbal or language-related tests and in the interpretation of instructions. When working with linguistically diverse children and adolescents, clinicians are advised to use interpreters who are well trained in both cultural influences on neuropsychological assessment and bilingual psychometrics. Whenever possible, testing should be performed in the client’s dominant language with appropriately translated tests and norms (e.g., with Spanish-speaking clients, use of the WISC-V Spanish version is appropriate).

Another recommendation for clinicians is to evaluate the psychometric properties of the assessments they use and to consider the assessment context on an individual basis. In particular, the testing context can evoke stress and heightened vigilance, particularly among members of stigmatized groups (e.g., due to the potential for discrimination and threat). Stress and cognitive reserve are independently associated with neuropsychological performance in healthy older adults (Cabral et al., 2016). And, while past research shows stress can, counterintuitively, reduce Stroop interference through reduced cue utilization (Booth, 2019), Richeson and Trawalter (2005) demonstrate that interracial contact can increase Stroop interference due to self-regulatory demands. This has implications for record keeping during assessment and suggests it is important to consider whether the test administrator shares social identities with the child and whether those dynamics may have influenced performance. It also makes the case for repeat testing when resources allow and for expanding the diversity of test administrators.

Evaluating an Assessment: the Test of Memory Malingering

One of the most well-validated and frequently used performance validity tests (PVTs) is the Test of Memory Malingering (TOMM; Schroeder et al., 2016; Tombaugh, 1996), used by 75–78% of North American neuropsychologists (Martin et al., 2015). The normative data for the TOMM were drawn from a community sample of English speakers aged 16–84 years with 8–21 years of education, in addition to a clinical sample. Similarly, Rees et al. (1998) conducted a series of experiments validating the TOMM in several groups (college students, people with traumatic brain injury, and cognitively impaired older adults), which broadens the range over which the TOMM can be generalized. However, these early validation studies failed to report any ethnicity-related information for their samples, which significantly compromised a recent meta-analysis of the TOMM (Martin et al., 2019).

Given the widespread use of the TOMM, it has now been cross-culturally validated in diverse samples (Grabyan et al., 2017; Wisdom et al., 2012), including Romanian (Crisan & Erdodi, 2022) and Singaporean clinical samples (Chan et al., 2021). However, results regarding the influence of demographics on TOMM performance are mixed. Nijdam-Jones et al. (2019) analyzed TOMM performance in a Colombian sample with varying years of education and ages ranging from 18 to 84 years. Participants who lacked formal education scored significantly lower than participants with formal education (12 years or more) and showed invalid performance on TOMM Trial 1. In addition to education, age also influenced TOMM performance: younger participants had significantly higher and more valid TOMM scores than older participants. And, in another study with participants from seven Latin American countries, age (18–95 years) and education continued to influence TOMM performance (Rivera et al., 2015).

While the TOMM has been examined in pediatric samples, there has been limited targeted work on demographic differences (e.g., language fluency) in younger samples. Some work with pediatric samples suggests a lack of validity for children younger than 5 years of age with ADHD, particularly in the Retention trial (Schneider et al., 2014). And, although this study showed that TOMM performance was affected by age (4–7 years old), 93% of the sample reached adult-comparable performance by Trial 2, which is often used for interpretation. In terms of other age effects, children aged 5–7 performed better than those aged 4 on Trials 2 and 3, though there was no difference between these age groups on Trial 1. Similarly, a recent study examining associations between TOMM scores and demographics found no evidence for differential effects across age, gender, ethnicity, and English proficiency, among the many other demographics assessed (Ku et al., 2020). Moreover, a cross-cultural study (USA and Cyprus) has also supported the robustness of the TOMM to variability in demographics (Constantinou & McCaffrey, 2003). However, the influence of demographics on TOMM performance in pediatric samples was not the primary research question in these articles; hence, further research on the validity of TOMM use is warranted.

The purpose of reviewing this literature is to accentuate the importance of reporting the full extent of the sample demographics used to validate a test so that clinicians can evaluate the test’s applicability to their client. While TOMM validity has been extensively researched across samples, we see two paths through which generalizability can be improved. First, we highly recommend diversifying samples. According to the American Academy of Clinical Neuropsychology’s 2050 initiative, testers will be unequipped to serve 60% of the population due to an increase in non-primary English speakers and non-European Americans. Hence, it is imperative to increase sample diversity to broaden the generalizability horizon and facilitate health equity. Second, the TOMM as a PVT taps into non-credible effort, and past work suggests that effort can vary as a function of how cultures are valued (or devalued) in the context of the Westernization of neuropsychological assessment. For instance, in interviews with Māori participants in New Zealand, two prevalent themes were the experience of cultural invisibility and a cultural divide between the client and neuropsychologists (Dudley et al., 2014). Clients who feel low cultural safety may feel less motivated to put in credible effort, therefore showing poorer performance on the TOMM and increasing the likelihood of adverse conclusions (e.g., malingering). Thus, although knowing whether non-credible effort is an issue matters, it is also practically meaningful to know why it might be an issue. One recent study aimed at bridging this gap showed that Pediatric PVT Suite (PdPVTS; McCaffrey et al., 2020) pass/fail classifications were not correlated with gender or race/ethnicity (McCaffrey et al., 2023). This might be a fruitful area for future research, which would provide avenues to improve neuropsychological testing instruction and preparation, integrate cultural variation in values, and counteract medical distrust.


The generalizability of findings in neuropsychology has implications both for adequate theory building (e.g., understanding of human behavior) and practice (e.g., appropriate diagnostic information). Despite increasing calls for cultural competence in neuropsychology, additional resources are needed to expand the research available to inform culturally competent practice. One piece of this involves the generalizability of assessments and research findings. To improve the extent to which the field incorporates multiple perspectives and samples, several practices are needed from research design to analysis to clinical practice and diagnosis.

To aid in this goal, we reviewed a set of concrete recommendations that we believe would enhance the generalizability of neuropsychology, with a focus on assessment and measurement. Importantly, our recommendations are far from exhaustive and, as mentioned previously, incorporating every recommendation may be unfeasible. Furthermore, generalizability is not a static endpoint but a continuous process: the practices a field engages in must continue to develop alongside a continually changing society. As an example, at present there are virtually no data addressing the normative inclusion of variations in gender identity; instead, cisgender is the default. What is unclear is whether this affects the utility of testing for non-cisgender individuals, including differences in mean scores, variability in scores, measurement properties, equivalence in content, or comfort with the testing environment. While this is a growing area (e.g., one recent study found neuropsychologists tend to underutilize affirming practices for LGBTQ+ clients; Correro et al., 2022), research will need to examine these and many other issues. As our understanding and practices evolve, clinicians and researchers will have to consider many previously neglected factors that may prove critical to the generalizability of any given psychological construct. We aimed to provide a foundational knowledge base and toolbox from which researchers and clinicians can draw, subsequently improving the field’s approach to these issues.