Introduction

Test fairness is a fundamental principle of test development and refers to the extent to which a test measures the construct of interest and not characteristics of a specific group. Test fairness is linked to test bias, which is defined as empirically established systematic error in a test score. When the systematic error is attributable to a cultural/ethnic variable, the test is considered culturally biased (Newman et al., 2007; Reynolds et al., 2021).

The potential cultural bias of standard neuropsychological and psychoeducational tests has been an area of discussion and research for decades. As performance validity assessment has become a standard component of psychoeducational and neuropsychological evaluations (Sweet et al., 2021), research on the possible influence of cultural variables on performance validity tests has been developing. This research, to date, points to the relevance of several culture-associated variables for performance validity tests, including English proficiency, bilingualism vs. monolingualism, educational attainment, and country of origin. For example, non-English-speaking individuals with low educational attainment may have higher failure rates at standard cut scores on some freestanding performance validity tests, specifically the Test of Memory Malingering, the Rey 15-Item Memory Test, and the b Test. Bilingual individuals, with research primarily examining Spanish–English speakers, may have higher failure rates on the Rey 15-Item Test, Warrington Recognition Memory Test, Rey Word Recognition Test, and Dot Counting Test. The Genuine Memory Impairment Profile from the Word Memory Test has been supported in research involving ethnically and linguistically diverse individuals, whereas the use of its primary validity indices in isolation has not. Embedded performance validity measures, namely Reliable Digit Span and the Digit Span Age-Corrected Scaled Score, have been found to have higher failure rates at standard cut scores among Hispanic and Native American individuals, bilingual (Spanish/English) individuals, and individuals with low educational attainment (Salazar et al., 2021; see Strutt & Stinson, 2022 for a review).

Limited research has examined the impact of race on performance validity tests, with findings generally indicating no differences in the performance of racially diverse samples on many performance validity tests. Hood et al. (2022) found that clinically referred Black and White American adults performed generally comparably across a number of freestanding and embedded performance validity tests. Importantly, the two groups were equated for age and educational level. In a pediatric sample (Bosworth & Dodd, 2020), Black, Hispanic, and Asian youth with a history of mild traumatic brain injury performed comparably on the Non-Verbal Medical Symptom Validity Test (NV-MSVT). There was also no impact of gender on NV-MSVT performance.

Performance validity assessment is necessary for determining the accuracy of neuropsychological and psychoeducational test data and should be a component of every evaluation, including pediatric evaluations (Emhoff et al., 2018; Sweet et al., 2021). It is essential that performance validity tests are culturally fair, which necessitates validation within culturally diverse samples. The current study examined the impact of race and gender on the Pediatric Performance Validity Test Suite (PdPVTS; McCaffrey et al., 2020). The PdPVTS comprises five computer-based tests that were specifically designed to assess performance validity in children and adolescents. The principles of universal design were kept at the forefront in developing a test battery that is accessible to youth of varying physical and cognitive capabilities and that utilizes culturally and gender-neutral test stimuli. PdPVTS performance was examined across racial groups and gender. Given the attention to cultural variables in developing the test stimuli, we expected that racial and gender groups would perform comparably across the PdPVTS.

Methods

Participants

A general population sample of n = 838 examinees was collected. Sampling was stratified across age, gender, racial/ethnic group, geographic region, and parental education level (PEL) in order to match the demographic composition of the US population. Data collection took place in the fall of 2017, and thus the demographic composition of the sample was matched to the 2017 US Census data (2017 American Community Survey; United States Census Bureau, 2017). During the development of the PdPVTS, the objective was to start with eight tests and to eliminate those that performed less well, arriving at a final set of five tests that maximized sensitivity and specificity in discerning performance validity. Because of this design, not all examinees received each of the final tests in the PdPVTS; therefore, the total number of examinees in the general population sample who completed a given PdPVTS test ranged from n = 431 to n = 563.

Examinees were excluded from the sample if they had any uncorrected hearing, visual, or motor impairments that might affect their ability to use a tablet or other touchscreen device to complete the PdPVTS. Examinees were also excluded if they could not communicate verbally. As this was a general population sample, examinees were excluded if they had been diagnosed with any psychological, neurological, behavioral, or learning-related disorder. Examinees were also excluded if the examiners observed any anomalies in their interaction with the tablet or with any of the tests in the PdPVTS (n = 2).

Given that the primary objective of the current investigation was to compare performance on the PdPVTS between genders and among racial/ethnic groups, demographically matched samples were selected to facilitate each set of comparisons while controlling for confounding demographic variables. To compare male and female respondents, subsamples were created for each test in the PdPVTS that were matched on age group, race/ethnicity, and parental education level. To compare Black respondents to White respondents, and to compare Hispanic respondents to White respondents, subsamples were created for each test that were matched on age group and gender only, due to sample size restrictions. Where possible, the same matched samples were used across multiple tests in the PdPVTS. The demographic composition of the male vs. female matched samples is summarized in Table 1, and the demographic characteristics of the Black vs. White and Hispanic vs. White matched subsamples are summarized in Tables 2 and 3, respectively.

Table 1 Demographic characteristics of the male vs. female matched samples
Table 2 Demographic characteristics of the Black vs. White matched samples
Table 3 Demographic characteristics of the Hispanic vs. White matched samples

Materials and Procedure

The PdPVTS was administered on an iOS or Windows tablet device, or on a Windows desktop or laptop computer. Examinees were seated beside the examiner, who guided each examinee through the instructions for the practice screens and for each of the formal tests. The PdPVTS comprises four visual tests and one verbal test: Find the Animal (visual scanning and classification), Matching (visual recognition), Shape Learning (visual recognition), Silhouettes (visual organization), and Story Questions (verbal recognition). Each test took 3 to 5 min to administer, for a total administration time of no more than 25 min.

Data Analyses

The aim of the present study was to examine the fairness of the PdPVTS by considering whether there were differences in PdPVTS outcomes (pass vs. fail) and PdPVTS scores (mean total score) across gender and racial/ethnic groups. Two sets of analyses were used: an adverse impact approach to consider whether the rates of PdPVTS outcomes differed between groups, and an equivalence testing approach to explore differences in PdPVTS total scores and to establish whether any differences between groups were meaningful.

Adverse Impact

An adverse impact analysis (for an overview, see Biddle, 2017) was used to consider the rates at which male vs. female, Black vs. White, and Hispanic vs. White examinees passed the PdPVTS. Adverse impact analysis is typically used in contexts where examinees are being selected for specific opportunities (e.g., employment, housing, resource allocation) and there is a desire to understand whether a legally protected or minority group is being selected at a meaningfully lower rate than a non-protected group. The test or criterion used to facilitate selection is said to have an adverse impact when it leads to a protected group being selected at a rate that is 80% or less of the rate for the non-protected group (i.e., the 4/5 rule; see Hough et al., 2001 for an overview). In the current study, we considered the ratio of examinees passing the PdPVTS to determine whether there were any meaningful differences in the rates at which different gender and racial/ethnic groups passed the PdPVTS and, consequently, whether the PdPVTS could have an adverse impact. Fisher's exact test was then used to consider more formally whether there were statistically significant differences in pass/fail frequencies between groups.
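
To make the 4/5 rule concrete, the brief Python sketch below computes an adverse impact ratio from pass counts; the function name and the counts are illustrative assumptions, not data or software from the current study.

def adverse_impact_ratio(pass_focal, n_focal, pass_ref, n_ref):
    """Ratio of the focal (protected) group's pass rate to the reference
    group's pass rate; values below 0.80 flag potential adverse impact
    under the 4/5 rule."""
    return (pass_focal / n_focal) / (pass_ref / n_ref)

# Illustrative counts only
ratio = adverse_impact_ratio(pass_focal=47, n_focal=55, pass_ref=51, n_ref=56)
print(f"Adverse impact ratio: {ratio:.2f}")  # values >= 0.80 satisfy the 4/5 rule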

Evidence of Equivalence Between Groups

Two sets of analyses were used to examine for meaningful differences between the demographic groups of interest. First, Mann–Whitney U tests were used to explore whether there were significant differences in mean total PdPVTS scores between groups. It was hypothesized that there would be no significant differences in mean total score between groups. The Two One-Sided Tests (TOST) procedure was then used to determine whether there was evidence of equivalence across groups. The TOST procedure involves determining what constitutes the Smallest Effect Size of Interest (SESOI) and using this effect size to establish upper and lower equivalence bounds. The observed data are then compared against each of the two bounds using two one-sided t-tests: one testing the null hypothesis that the effect is at least as large as the upper bound and the other testing the null hypothesis that the effect is at least as small as the lower bound. If both null hypotheses can be rejected, the observed effect falls within the equivalence bounds and the groups of interest can be considered practically equivalent (see Lakens, 2017 and Lakens et al., 2018 for an overview of the TOST approach). All TOST pairs used in the current study employed a SESOI of Cohen's d = 0.49. This SESOI was selected for multiple reasons, including that a d value of 0.49 corresponds to approximately half a standard deviation, which is consistent with the commonly used Minimal Important Difference criterion (see Copay et al., 2007 for an overview). Moreover, in keeping with the Neyman–Pearson approach, a SESOI of d = 0.49 balances the risk of Type I and Type II errors by yielding a reasonably high level of statistical power, given the sometimes modest cell sizes used in the current study (Lakens et al., 2018).
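
For readers unfamiliar with the mechanics of TOST, the following Python sketch implements the procedure for two independent samples, converting a Cohen's d bound into raw-score units via the pooled standard deviation. It is a minimal illustration under standard independent-samples t-test assumptions, not the analysis software used in the current study.

import numpy as np
from scipy import stats

def tost_ind(x1, x2, d_bound=0.49):
    """Two one-sided tests (TOST) for two independent samples, with
    equivalence bounds given as a Cohen's d. Returns the larger of the
    two one-sided p values; equivalence is supported when it is < alpha."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    # Pooled SD and standard error of the mean difference
    sp = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / df)
    se = sp * np.sqrt(1 / n1 + 1 / n2)
    diff = x1.mean() - x2.mean()
    delta = d_bound * sp  # equivalence bound in raw-score units
    # H0a: the true difference is at least as large as the upper bound
    p_upper = stats.t.cdf((diff - delta) / se, df)
    # H0b: the true difference is at least as small as the lower bound
    p_lower = stats.t.sf((diff + delta) / se, df)
    return max(p_upper, p_lower)

# Simulated example: two groups drawn from the same distribution
rng = np.random.default_rng(0)
p = tost_ind(rng.normal(24, 3, 60), rng.normal(24, 3, 55))
print(f"TOST p = {p:.3f}")  # p < .05 supports practical equivalence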

Results

Adverse Impact

To enable the adverse impact analyses, the mean total PdPVTS scores (see Tables 4, 5, and 6 for descriptive statistics) were first generated and compared against the established age-adjusted cut scores (McCaffrey et al., 2020; see Table 7) to establish whether an examinee had passed or failed. The pass rates were then used to create ratios expressing the proportion of females passing the PdPVTS relative to males, along with the proportions of Black and Hispanic examinees passing the PdPVTS relative to White examinees (see Table 8). As predicted, for all tests in the PdPVTS and for each comparison of interest, the ratios were greater than 0.80, indicating no evidence of adverse impact associated with the PdPVTS. To confirm that there were no statistically significant differences in pass/fail rates between demographic groups, a series of Fisher's exact tests was used to compare pass/fail frequencies between groups for each test. None of the Fisher's exact tests was significant (p ranged from 0.12 to 0.99; see Tables 4, 5, and 6).
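
As a sketch of this classification-and-testing workflow, the Python fragment below assigns pass/fail by comparing a total score to an age-band cut score and then applies Fisher's exact test to a 2 × 2 table of pass/fail counts. The cut scores and counts are placeholders (the actual age-adjusted values appear in Table 7), and the sketch assumes that higher scores indicate valid performance.

import numpy as np
from scipy.stats import fisher_exact

# Placeholder age-adjusted cut scores keyed by (min_age, max_age);
# the actual values appear in Table 7.
CUT_SCORES = {(5, 6): 18, (7, 11): 21, (12, 18): 23}

def passed(total_score, age):
    """Classify an examinee as pass (True) or fail (False), assuming
    higher total scores indicate valid performance."""
    for (lo, hi), cut in CUT_SCORES.items():
        if lo <= age <= hi:
            return total_score >= cut
    raise ValueError(f"No cut score defined for age {age}")

# Illustrative 2 x 2 table (rows: groups; columns: pass, fail counts)
table = np.array([[52, 4],
                  [49, 7]])
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact p = {p_value:.3f}")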

Table 4 Descriptive statistics, pass rate, and Fisher’s exact test for male vs. female
Table 5 Descriptive statistics, pass rate, and Fisher’s exact test for Black vs. White
Table 6 Descriptive statistics, pass rate, and Fisher’s exact test for Hispanic vs. White
Table 7 Age-adjusted cut scores by test
Table 8 Adverse impact ratios

Evidence of Equivalence Between Groups

Mann–Whitney U tests revealed that, when comparing male to female examinees, there was a significant difference in mean total score between groups for the PdPVTS Story Questions among examinees aged 12 to 18 (U = 1799, p = 0.039); however, the effect size was small (r = − 0.18; see Table 9). No other significant differences between males and females were observed. When comparing Black to White examinees, there was a significant difference in mean total score between groups for the Story Questions among examinees aged 7 to 11 (U = 972.5, p = 0.008); however, the effect size was small (r = − 0.26; see Table 10). No other significant differences between Black and White examinees were observed. When comparing Hispanic to White examinees, no significant differences between groups were observed (p ranged from 0.160 to 0.797; see Table 11).
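
The r effect sizes reported above can be obtained from the normal approximation to the U statistic (r = Z / sqrt(N)); a minimal Python sketch, ignoring tie corrections for simplicity, follows.

import numpy as np
from scipy.stats import mannwhitneyu

def mwu_with_r(x1, x2):
    """Mann-Whitney U test plus the r effect size, where Z is computed
    from the large-sample normal approximation to U (no tie correction)."""
    u, p = mannwhitneyu(x1, x2, alternative="two-sided")
    n1, n2 = len(x1), len(x2)
    mu_u = n1 * n2 / 2.0                            # mean of U under H0
    sd_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)  # SD of U under H0
    z = (u - mu_u) / sd_u
    return u, p, z / np.sqrt(n1 + n2)               # r = Z / sqrt(N)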

Table 9 Mann–Whitney U tests for male vs. female
Table 10 Mann–Whitney U tests for Black vs. White
Table 11 Mann–Whitney U tests for Hispanic vs. White

Given that there was little evidence of differences in mean total score between groups, the TOST procedure was used to consider whether evidence of equivalence between groups could be established. When comparing male to female examinees, evidence of equivalence (i.e., the null hypothesis was rejected for both one-sided t-tests) was observed for all tests (p < 0.01) except the Story Questions for examinees aged 12 to 18 (p = 0.124; see Table 12). When comparing Black to White examinees, evidence of equivalence was observed for the Matching, Silhouettes, and Story Questions tests for examinees aged 12 to 18 (p < 0.05; see Table 13). When comparing Hispanic to White examinees, evidence of equivalence was observed for all tests (p < 0.01) except the Story Questions for examinees aged 12 to 18 (p = 0.207; see Table 14). Ultimately, evidence of equivalence was observed for most of the TOST pairs that were run. Instances where evidence of equivalence was not observed were generally associated with smaller cell sizes and comparatively lower statistical power.

Table 12 Two One-Sided Tests (TOST) exploring equivalence between males vs. females
Table 13 Two One-Sided Tests (TOST) exploring equivalence between Black vs. White examinees
Table 14 Two One-Sided Tests (TOST) exploring equivalence between Hispanic vs. White examinees

Discussion

When tests are used to classify individuals into specific groups, and especially when some of those classifications can result in adverse consequences (e.g., loss of compensation for injuries or denial of disability for those classified as giving suboptimal effort on the PdPVTS), it is critical to ensure that the obtained classifications are not associated with nominal cultural variables such as gender and race/ethnicity. Fairness in such classifications must be considered empirically. Herein, we defined fairness in terms of adverse impact, operationalized as adverse classification, that is, having “failed” a PdPVTS test, and examined such failure rates across gender and race/ethnicity. In addition to examining classification rates and accuracy, we also examined mean score equivalence.

In every instance, classification/failure rates showed no adverse impact across the nominal variables of gender and race/ethnicity. In most, though not all, cases, this was accompanied by mean score equivalence across groups. Story Questions, the only verbal measure among the suite of five tests that make up the PdPVTS, did not always demonstrate mean score equivalence. Moreover, evidence of equivalence could not be established for the Find the Animal, Shape Learning, and Story Questions (ages 7–11) tests when comparing Black vs. White examinees. However, the lack of equivalence evidence for Story Questions was largely due to the comparatively smaller sample sizes and lower power, and the lack of equivalence evidence for some of the Black vs. White comparisons also appears to stem from a lack of power. The inference that low statistical power was the main reason evidence of equivalence could not be established for select comparisons is supported by the observation that all of the mean differences between groups and the associated effect sizes were quite small; accordingly, there was no evidence of adverse impact in the classification of individuals as passing or failing any of the tests, at any age level. That said, follow-up with larger samples is needed to confirm this interpretation. Ultimately, the empirical evidence presented in the current investigation suggests that examiners may use the PdPVTS with confidence that pass/fail classifications will not be associated with gender or race/ethnicity for Black, White, and Hispanic examinees. A lack of adverse impact in classification is critically important for all PVTs and in all settings; when choosing a PVT, examiners should consider the existing evidence related to adverse impact in the classification rates of their chosen instrument across nominal variables such as gender and ethnicity.

Conclusions

As performance validity tests continue to play a more central role in psychoeducational and neuropsychological assessment, there is a clear need to ensure that PVTs are fair tests, in that classification/failure rates are similar across culturally diverse groups. Previous research has demonstrated that several well-established freestanding and embedded PVTs (e.g., the Rey 15-Item Test, Warrington Recognition Memory Test, Rey Word Recognition Test, Dot Counting Test, Reliable Digit Span, and Digit Span Age-Corrected Scaled Score) have higher failure rates associated with particular demographic attributes. The current study provides strong evidence that PdPVTS pass/fail classification rates are not associated with gender or race/ethnicity for Black, White, and Hispanic examinees. Moreover, there was strong evidence of equivalence in mean PdPVTS scores between the gender and racial/ethnic groups of interest. That said, evidence of equivalence could not be established for all PdPVTS tests when comparing racial/ethnic groups, likely owing to the comparatively smaller sample sizes for the non-White groups. Future studies should therefore consider equivalence between racial/ethnic groups in mean PdPVTS scores using larger, representative samples.