Performance Validity Test Failure in the Clinical Population: A Systematic Review and Meta-Analysis of Prevalence Rates

Performance validity tests (PVTs) are used to measure the validity of the obtained neuropsychological test data. However, when an individual fails a PVT, the likelihood that failure truly reflects invalid performance (i.e., the positive predictive value) depends on the base rate in the context in which the assessment takes place. Therefore, accurate base rate information is needed to guide interpretation of PVT performance. This systematic review and meta-analysis examined the base rate of PVT failure in the clinical population (PROSPERO number: CRD42020164128). PubMed/MEDLINE, Web of Science, and PsycINFO were searched to identify articles published up to November 5, 2021. Main eligibility criteria were a clinical evaluation context and utilization of stand-alone and well-validated PVTs. Of the 457 articles scrutinized for eligibility, 47 were selected for systematic review and meta-analyses. Pooled base rate of PVT failure for all included studies was 16%, 95% CI [14, 19]. High heterogeneity existed among these studies (Cochran's Q = 697.97, p < .001; I2 = 91%; τ2 = 0.08). Subgroup analysis indicated that pooled PVT failure rates varied across clinical context, presence of external incentives, clinical diagnosis, and utilized PVT. Our findings can be used for calculating clinically applied statistics (i.e., positive and negative predictive values, and likelihood ratios) to increase the diagnostic accuracy of performance validity determination in clinical evaluation. Future research is necessary with more detailed recruitment procedures and sample descriptions to further improve the accuracy of the base rate of PVT failure in clinical practice. Supplementary Information: The online version contains supplementary material available at 10.1007/s11065-023-09582-7.


Introduction
Neuropsychological assessment guides diagnostics and treatment in a wide range of clinical conditions (e.g., traumatic brain injury, epilepsy, functional neurological disorder, attention deficit hyperactivity disorder, multiple sclerosis, or mild cognitive impairment). It is therefore important that neuropsychological test results accurately represent a patient's actual cognitive abilities. However, personal factors such as a lack of task engagement or malingering can invalidate a patient's test performance (Schroeder & Martin, 2022). When invalid performance is not properly identified, clinicians risk attributing abnormally low scores to cognitive impairment, potentially leading to misdiagnosis and ineffective or even harmful treatments (e.g., Roor et al., 2016; van der Heide et al., 2020). Consequently, performance invalidity is not only relevant to diagnostics, but also extends to treatment efficacy (Roor et al., 2022).
Various tests are available for determining invalid performance on cognitive tests (for an overview, see Soble et al., 2022). Performance validity tests (PVTs) can be specifically designed to measure performance validity (i.e., stand-alone PVTs) or empirically derived from standard cognitive tests (i.e., embedded indicators). Overall, the psychometric properties of stand-alone PVTs have been found to be superior to those of embedded PVTs (Miele et al., 2012; Soble et al., 2022). Using well-researched stand-alone PVTs, the meta-analysis of Sollman and Berry (2011) found an aggregated mean specificity of 0.90 and a mean sensitivity of 0.69. This finding is typical for stand-alone PVTs, for which empirical cutoff scores are chosen at a specificity of ≥ 0.90 to minimize the misclassification of valid cognitive test performance as invalid (i.e., a maximum 10% false-positive rate).
Importantly, sensitivity and specificity should never be interpreted in isolation from other clinical metrics like base rates (Lange & Lippa, 2017). To determine the positive and negative predictive value of a PVT score, the base rate of the condition (here: performance invalidity) needs to be considered (Richards et al., 2015). Using Bayes' rule, the likelihood that PVT failure is indeed indicative of performance invalidity can be calculated based upon: (1) the base rate of invalid performance in the specific population of that individual; (2) the score of a PVT; (3) the sensitivity; and (4) the specificity of the utilized PVT (Dandachi-FitzGerald & Martin, 2022; Tiemens et al., 2020). Ignoring Bayes' rule potentially leads to overdiagnosis of invalid performance when the base rate of invalidity is low and to underdiagnosis when the base rate is high. Therefore, it is essential that base rate information is available for each PVT in a specific clinical context and, ideally, for specific clinical patient groups (Schroeder et al., 2021a).
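The Bayes' rule computation described above can be sketched as follows. This is an illustrative sketch, not a tool from the reviewed literature; the base rates in the usage lines are hypothetical, chosen only to show how strongly the predictive values depend on the assumed base rate.

```python
def predictive_values(base_rate, sensitivity, specificity):
    """Positive and negative predictive values of a PVT via Bayes' rule.

    base_rate   -- assumed prevalence of invalid performance in this context
    sensitivity -- P(PVT failure | invalid performance)
    specificity -- P(PVT pass | valid performance)
    """
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    true_neg = specificity * (1 - base_rate)
    false_neg = (1 - sensitivity) * base_rate
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# A typical stand-alone PVT (sensitivity 0.69, specificity 0.90) evaluated at
# two hypothetical base rates of invalid performance:
ppv_low, npv_low = predictive_values(0.08, 0.69, 0.90)    # PPV ≈ 0.375
ppv_high, npv_high = predictive_values(0.30, 0.69, 0.90)  # PPV ≈ 0.747
```

With the same test, the chance that a failure is a true positive roughly doubles when the assumed base rate rises from 8% to 30%, which is exactly why context-specific base rate information matters.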
Early surveys amongst non-forensic clinical neuropsychologists reported an expectation that only 8% of general clinical referrals would produce invalid test results (Mittenberg et al., 2002). Over the last two decades, research on validity issues in clinical practice has increased significantly, and neuropsychologists have become more aware of the need to identify invalid test performance (Merten & Dandachi-FitzGerald, 2022; Sweet et al., 2021). These factors probably contributed to the nearly doubled median reported base rate of 15% across clinical contexts and settings in a more recent survey (Martin & Schroeder, 2020). However, research examining empirically derived base rates of invalidity in clinical settings has lagged behind.
To address this issue, McWhirter et al. (2020) undertook a systematic review to examine PVT failure in clinical populations. Their main finding was that PVT failure rates were common, exceeding 25% for some PVTs and clinical groups. However, their study has been criticized on several aspects. First, as Kemp and Kapur (2020) noted, McWhirter et al. did not distinguish between stand-alone and (psychometrically inferior) embedded PVTs. Second, McWhirter et al. (2020) included studies that examined PVT failure in patients with dementia and intellectual disabilities, two groups in which PVT use is strongly discouraged due to unacceptably high false-positive rates when using the standard cutoffs (Larrabee et al., 2020; Lippa, 2018; Merten et al., 2007). Third, although studies in which ≥ 50% of the patient sample was involved in litigation or seeking welfare benefits were excluded, other types and lower rates of external gain incentives were not characterized. Therefore, external incentives that increase PVT failure rates in patients engaged in standard clinical evaluations (Schroeder et al., 2021b) may have contributed to their reported PVT failure rates. Importantly, McWhirter et al. (2020) summarized data on PVT failure based upon a literature search and data extraction performed by one author, without considering the quality of included studies or calculating a weighted average to obtain a more precise estimate.
The current meta-analysis is designed to address these gaps and to improve the quality of reported PVT failure findings in clinical patient groups. The main aim of the present study is to provide comprehensive information regarding the base rate of PVT failure to facilitate its interpretation in clinical practice. We calculated pooled estimates of the base rate of PVT failure across the type of clinical context, distinct clinical patient groups, the potential for external incentives, and per PVT.

Search Strategy
This meta-analysis was conducted in accordance with the updated Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines (PRISMA; Page et al., 2021). A review protocol was registered at inception on PROSPERO (ID: CRD42020164128). The protocol was slightly modified to further improve the quality of included studies. Specifically, only stand-alone PVTs were included that met the restrictive selection criteria per Sollman and Berry (2011), and one additional database was searched. Electronic databases (PubMed/MEDLINE, Web of Science, and PsycINFO) were comprehensively searched using multiple terms for performance validity and neuropsychological assessment (see Online Resource 1 for detailed search strategies). Finally, we chose to focus solely on the base rate of PVT failure without also addressing its impact on treatment outcome. The final search was conducted on November 5, 2021.

Study Selection
All studies in this systematic review and meta-analysis were performed in a clinical evaluation context of adult patients (18+ years of age), using standard/per-manual administration procedures and cutoffs for the five stand-alone performance validity tests (PVTs) from Sollman and Berry (2011). These five PVTs are: the Word Memory Test (WMT; Green, 2003), the Medical Symptom Validity Test (MSVT; Green, 2004), the Test of Memory Malingering (TOMM; Tombaugh, 1996), the Victoria Symptom Validity Test (VSVT; Slick et al., 1997), and the Letter Memory Test (LMT; Inman et al., 1998). Based upon Grote et al. (2000), a higher cutoff was used for the hard items of the VSVT in patients with medically intractable epilepsy. All studies were original, peer-reviewed, and published in English. Studies were excluded if they examined PVT failure rate in a non-clinical context (i.e., forensic/medico-legal context, data generated for research purposes). Studies that only addressed PVT failure in a sample already selected upon initially passing/failing a PVT were equally excluded (typically known-groups designs). Studies performed on patients diagnosed with intellectual disability or dementia were excluded, as well as studies with a small (sub)sample size (N < 20). Finally, we chose to exclude studies of Veterans/military personnel since the distinction between clinical and forensic evaluations is difficult to make within the context of the Veterans Affairs (VA) system (Armistead-Jehle & Buican, 2012).
Unique patient samples were ensured by carefully screening for similar samples used in different studies. When multiple studies examined the same patient sample, the study with the largest sample size was included or, when sizes were equal, the most recent paper.

Data Collection and Extraction
References resulting from the searches in PubMed/MEDLINE, Web of Science, and PsycINFO were imported into a reference manager (EndNote X8). After automatic duplicate removal, one of the investigators (JR) manually removed the remaining duplicate references. First, a single rater (JR) screened all titles and abstracts for broad suitability and eligibility. Doubtful references were discussed with a second rater (MP). If doubts remained, references were included for full-text scrutiny. Second, two independent raters (MP and JR) reviewed the remaining full texts based on the mentioned inclusion and exclusion criteria, using the online systematic review tool Rayyan (Ouzzani et al., 2016). The interrater reliability was substantial (Cohen's κ = 0.63), with 89.83% agreement. A sizable number of studies failed to clearly state information needed for inclusion in the current study, which contributed to the suboptimal agreement between the two independent raters. Therefore, corresponding authors were contacted when additional information was required (e.g., regarding clinical context, utilized PVT cutoff, number of subjects that were administered and failed a PVT, or language/version of the utilized PVT). Non-responders were reminded twice, and if no author response was elicited, studies were excluded. Discrepancies were resolved by discussion with a third and fourth reviewer (BD and RP). Finally, one investigator (JR) extracted relevant information from the included full-text articles, such as setting, sample size, mean age, and utilized PVT(s), according to a standardized data collection form (see Online Resource 2).

Statistical Analyses
Statistical analysis was performed using MetaXL version 5.3 (www.epigear.com), a freely available add-in for meta-analysis in Microsoft Excel. Independence of effect sizes, a critical assumption in random-effects meta-analyses, was examined by checking if and how many studies used multiple, potentially inter-correlated PVTs in the same patient sample (Cheung, 2019). The frequency of PVT failure from the individual studies was pooled into the meta-analysis using a double-arcsine transformation, and back-transformation was performed to report the pooled prevalence rates. We chose this transformation method to stabilize variance in the analysis; the double-arcsine transformation has been shown to be preferable to logit transformation or no transformation in the calculation of pooled prevalence rates (Barendregt et al., 2013). All analyses were performed using the random-effects model since it allows between-study variation of PVT failure. Forest plots were used to visualize the pooled prevalence of PVT failure, with 95% confidence intervals [CIs]. Where possible, subgroup analyses were performed to examine whether the base rate of PVT failure was related to specific clinical contexts, distinct patient groups, utilized PVT, and the presence of potential external gain. To further establish the generalizability of our study findings, the consistency across the included studies was assessed using Cochran's Q-test (Higgins et al., 2003). For the Q-test, a p-value < 0.10 was considered to indicate statistically significant heterogeneity between studies. Because the number of included studies impacts the Q-test, we additionally evaluated the inconsistency index I² (Higgins & Thompson, 2002). An I² value over 75% would tentatively be classified as a "high" degree of between-study variance (Higgins et al., 2003). Since I² is a relative measure of heterogeneity and its value depends on the precision of included studies, we also calculated tau squared (τ²). This measure quantifies the variance of the true effect sizes underlying our data, with larger values suggesting greater between-study variance (Borenstein et al., 2017).
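The pooling step described above can be sketched as follows. This is a simplified, fixed-weight illustration of the Freeman-Tukey double-arcsine approach with an approximate back-transformation, not MetaXL's exact random-effects implementation, and the study counts are hypothetical.

```python
import math

def double_arcsine(x, n):
    # Freeman-Tukey double-arcsine transform of x PVT failures in n patients
    return math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))

def pooled_prevalence(studies):
    """Pool per-study failure counts on the transformed scale.

    studies -- list of (failures, sample_size) tuples
    Uses inverse-variance weights (the transform's variance is ~ 1/(n + 0.5))
    and a simple sin^2 back-transformation to return a pooled proportion.
    """
    ts = [double_arcsine(x, n) for x, n in studies]
    ws = [n + 0.5 for _, n in studies]
    t_bar = sum(w * t for w, t in zip(ws, ts)) / sum(ws)
    return math.sin(t_bar / 2) ** 2

# Hypothetical studies as (PVT failures, sample size); larger studies get
# proportionally more weight in the pooled estimate.
studies = [(5, 50), (12, 80), (30, 200)]
pooled = pooled_prevalence(studies)  # roughly 0.14-0.15 for these counts
```

The transformation stabilizes the variance of proportions near 0 or 1, which is why it is preferred over pooling raw proportions when failure rates are low.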

Study Quality
An adapted version of the Prevalence Critical Appraisal Tool of the Joanna Briggs Institute (Munn et al., 2015) was used to rate the quality of all included studies. Amongst the currently available tools, it addresses the most important items related to methodological quality when determining prevalence (Migliavaca et al., 2020). Three study quality domains were assessed: selection bias (items 1, 2, and 4), sample size/statistics (items 3 and 5), and attrition bias (item 6; see Online Resource 3 for a detailed description).
The Doi plot and LFK index are relatively new graphical and quantitative methods that were used for detecting publication bias (Furuya-Kanamori et al., 2018). These analyses were also implemented in MetaXL. Contrary to the scatter plot of precision used in a more standard funnel plot to examine publication bias, the Doi plot uses a quantile plot, providing a better visual representation of normality (Wilk & Gnanadesikan, 1968). If the trials are not affected by publication bias, a symmetric inverted funnel is created, with a Z-score closest to zero at its tip. The LFK index then quantifies the two areas under the Doi plot. Its interpretation is based on the a priori concern about positive or negative publication bias. Since we were concerned about possible positive publication bias, an LFK index > 1 was considered consistent with positive publication bias. Even with a limited number of included studies, the LFK index has better sensitivity than the more standard Egger's test (Furuya-Kanamori et al., 2018).

Literature Search
Figure 1 gives an overview of the search and selection process. Of the 13,587 identified abstracts, 457 (3.4%) were included for full-text scrutiny. We contacted the first authors of 37 studies for additional information, and 30 authors responded. This resulted in 47 observational studies of PVT failure in the clinical context, with a total sample size of n = 6,484.

Characterization of Included Studies
Table 1 reports study characteristics, including clinical context, clinical patient group, and sample size. Most studies were performed in a medical hospital (k = 25), with others in an epilepsy clinic (k = 7), psychiatric institute (k = 6), rehabilitation clinic (k = 4), and private practice (k = 2). Three studies (6.2%) did not specify clinical context. In 15/47 (31.9%) of the studies, prevalence of PVT failure was reported for heterogeneous patient samples. The majority of the studies (32/47; 68.1%) reported PVT failure rates for one or multiple diagnostic subgroups. The diagnostic (sub)groups consisted of patients with traumatic brain injury (TBI) in most studies (k = 10), followed by patients with epilepsy (k = 9), patients with psychogenic non-epileptic seizures (PNES; k = 5), patients seen for attention deficit hyperactivity disorder (ADHD) assessment (k = 4), patients with mild cognitive impairment (MCI; k = 4), patients with multiple sclerosis (MS; k = 2), and patients with Parkinson's disease (k = 2). Severity of TBI was not always specified or was poorly defined. The remaining diagnostic (sub)groups (i.e., sickle cell disease, Huntington's disease, patients with substance use-related disorders (SUD), inpatients with depression, and patients with memory complaints) were examined in single studies. In more than half of the included studies (25/47; 53.2%), the language (proficiency) of the included patient sample was not reported. Potential external gain was not mentioned in 12/47 (25.5%) studies, and the remaining studies varied greatly in how they addressed its presence. Of the remaining 35 studies, only seven (i.e., 20%) specified how external gain was examined (Domen et al., 2020; Eichstaedt et al., 2014; Galioto et al., 2020; Grote et al., 2000; Rhoads et al., 2021a; Williamson et al., 2012; Wodushek & Domen, 2020). In most studies (28/35; 80%), the way the authors examined this variable (e.g., by checking the patient's medical record or querying patients about potential incentives present during the assessment procedure) was not specified. Moreover, in only 4/35 (11.4%) studies, subjects were excluded when external gain incentives (e.g., a workers' compensation claim) were present (Dandachi-FitzGerald et al., 2020; Davis & Millis, 2014; Merten et al., 2007; Wodushek & Domen, 2020).
The TOMM was the most frequently administered PVT (k = 18), followed by the WMT (k = 17), the MSVT (k = 9), the VSVT (k = 6), and the LMT (k = 1). Only 4/47 (8.5%) studies employed two PVTs (none used more than two PVTs that fulfilled the inclusion and exclusion criteria); the other 43/47 (91.5%) studies used one PVT. In two of the four studies reporting two PVTs, the same PVTs were not administered to all participants. Harrison et al. (2021) administered the MSVT to 648 patients and the WMT to 1810 patients, and Krishnan and Donders (2011) administered the TOMM to 39 patients and the WMT to 81 patients. Inclusion in these studies was, amongst others, based upon failing one PVT. Furthermore, these studies did not report the number of subjects that were administered both PVTs. Therefore, it is unclear to what extent the reported PVT failure rates in these studies are influenced by potential dependence. In the two other studies reporting two PVTs (i.e., Cragar et al., 2006; Merten et al., 2007), all patients were administered both PVTs. The total number of subjects in these two studies that reported two likely dependent effect sizes was n = 76, which is 1.2% of the total of n = 6,484 patients from all 47 studies. We therefore argue that the reported effect sizes from the 47 included studies are (largely) independent.

Methodological Quality Assessment
A summary of the methodological quality of the included studies for determining prevalence is provided in Online Resource 4. No study was rated as having high quality; all had limitations in at least one of the three prespecified domains (selection bias, attrition bias, and sample size/statistical analyses). Most studies had a study sample that addressed the target population (k = 41, 87.2%), whereas only a minority described relevant assessment and patient characteristics (k = 15, 31.9%). The majority of included studies failed to clearly state how patients were recruited (k = 27, 57.4%). Eleven studies (23.4%) had an inadequate response rate. The majority of the studies used appropriate statistical analyses (k = 41, 87.2%) but inadequate sample sizes (k = 39, 83.0%).
The shape of the Doi plot showed slight asymmetry (see Online Resource 5), and the LFK index (1.09) revealed minor asymmetry, indicative of potential positive publication bias.

Rate of PVT Failure in Clinical Patients
The pooled prevalence of PVT failure across all 47 included studies was 16%, 95% CI [14, 19]. Significant between-study heterogeneity and high between-study variability existed (Cochran's Q = 697.97, p < 0.001; I² = 91%; τ² = 0.08), as also reflected in the large 95% CIs of the individual studies (see Fig. 2). The high I² statistic indicates that the variation in reported PVT failure is likely a result of true heterogeneity rather than chance.

Subgroup Analyses based upon Clinically Relevant Characteristics
To facilitate the interpretation of PVT failure in clinical practice, subgroup analyses were performed for clinically relevant characteristics associated with performance validity (Table 2).
It is important to emphasize that some of these findings are based upon relatively small numbers of studies (i.e., k = 2 or 4), potentially impacting the stability of the reported estimates.

False-Positive Scrutinization
Although we excluded studies that examined PVT failure rates in patients with dementia or intellectual disability a priori, the included studies might still comprise patient samples with other conditions or combinations of characteristics that make them highly susceptible to false-positive PVT failure classification. Therefore, and in line with clinical guidelines (Sweet et al., 2021), we first examined the included studies for the risk of unacceptably low specificity rates when applying standard PVT cutoffs, and two studies were identified. First, PVT performance in the subsample of severely ill, mostly inpatient, schizophrenia spectrum patients from Gorissen et al. (2005) was significantly correlated with negative symptoms and general psychopathology. Second, the MCI subjects from Martins and Martins (2010) were of advanced age, Spanish speaking, and had the lowest formal schooling of all included studies (i.e., 71.4% had less than 6 years of formal education). These cultural/language factors in combination with low formal schooling are associated with unacceptably low specificity rates when applying standard PVT cutoffs (Robles et al., 2015; Ruiz et al., 2020). Exclusion of the subsample of patients with schizophrenia in the Gorissen et al. (2005) study and of the Martins and Martins (2010) study led to a pooled prevalence of PVT failure of 15% (95% CI [13, 18]; Cochran's Q = 573.73, p < 0.01; I² = 89%; τ² = 0.07). However, after exclusion of these patient samples, between-study heterogeneity and between-study variability were still high, as indicated by a significant Cochran's Q statistic and a high I² statistic. Further subgroup analyses were performed in the remaining studies (k = 46; see Table 1). Based upon Cochran's Q, heterogeneity of pooled PVT failure rates was significant in patients with PNES, (m)TBI, and epilepsy, and in subjects seen for ADHD assessment. Non-significant heterogeneity in pooled PVT failure rates was found in patients with Parkinson's disease, MCI, and MS. Based upon the I² statistic, variability of base rate estimates of PVT failure was low in patients with MCI and MS, and (moderately) high for the other diagnostic patient groups. This suggests that for studies in patients with MCI and MS, the pooled PVT failure rates are more homogeneous. However, since these calculations are based upon small numbers of studies, these findings should be interpreted with caution (von Hippel, 2015).

External Gain Incentives
In the four studies in which patients with potential external gain incentives were excluded from analysis, the pooled prevalence of PVT failure was as low as 10% (95% CI [5, 15]; Cochran's Q = 9.17, p = 0.10; I² = 45%; τ² = 0.02). For the 42 remaining studies that did not report having actively excluded clinical patients with potential external gain incentives before reporting PVT failure, however, the pooled prevalence of reported PVT failure was 16% (95% CI [13, 19]; Cochran's Q = 560.93, p < 0.001; I² = 90%; τ² = 0.07). Although Cochran's Q statistic indicated that heterogeneity of pooled PVT failure rates was high in both groups, inconsistency was lower in the studies where patients with external gain were excluded from analysis.

PVT
The pooled prevalence of PVT failure was highest for patients examined with the WMT (25%, 95% CI [19, 32]) and lowest for patients examined with the TOMM.

Discussion
This systematic review and meta-analysis examined the prevalence of PVT failure in the context of routine clinical care. Based on extracted data from all 47 studies, involving 6,484 patients seen for clinical assessment, the pooled prevalence of PVT failure was 16%, 95% CI [14, 19]. Excluding two studies of patient samples in which applying standard PVT cutoffs would likely lead to false-positive classification resulted in a pooled PVT failure rate of 15%, 95% CI [13, 18]. This number corresponds with the median estimated base rate of invalid performance in clinical settings reported in a recent survey amongst 178 adult-focused neuropsychologists (Martin & Schroeder, 2020). Our empirical findings confirm PVT failure in a sizeable minority of patients seen for clinical neuropsychological assessment.
Another key finding is that reported PVT failure rates vary significantly amongst the included studies (i.e., 0-52.2%). This variability is likely due to (1) sample characteristics, such as clinical setting, clinical diagnosis, and potential external incentives, and (2) the sensitivity and specificity of the PVT used. Pooled PVT failure was highest (i.e., 27%, 95% CI [15, 40]) in patients seen in private practice. The pooled PVT failure rates for the other settings (i.e., epilepsy clinic, psychiatric institute, medical hospital, and rehabilitation clinic) varied between 13 and 19%. The Sabelli et al. (2021) study had the largest private practice sample (N = 326), consisting of relatively young mTBI patients referred for neuropsychological evaluation. Since only 2/47 of the included studies were conducted in the private practice setting, the Sabelli et al. (2021) study, with a PVT failure rate of 31.9%, was a major contributor to the higher pooled PVT failure rate in this setting. Of interest, potential external incentives were not mentioned in that study. Therefore, potential external gain incentives may have been present and may have driven the relatively high PVT failure rate, rather than the assessment context per se. Unsurprisingly, but now clearly objectified, studies that excluded patients with potential external gain incentives had a significantly lower pooled PVT failure rate compared to studies where these subjects (potentially) remained in the analysis (i.e., 10%, 95% CI [5, 15] versus 16%, 95% CI [13, 19], respectively). However, although it is known that the presence of external incentives links directly to PVT failure in clinical evaluations (e.g., Schroeder et al., 2021b), little over a quarter of the included studies failed to mention the presence of external gain incentives. Moreover, even when external gain incentives were known to be present, only a minority of studies excluded these subjects from further analyses. Pooled PVT failure rates were
highest for patients diagnosed with PNES (i.e., 33%, 95% CI [24, 43]), patients seen for ADHD assessment (i.e., 17%, 95% CI [11, 23]), and patients with (m)TBI (i.e., 17%, 95% CI [10, 25]), with pooled PVT failure ranging between 6 and 13% for the other diagnostic groups (i.e., MS, epilepsy, MCI, and Parkinson's disease). These findings contrast with McWhirter et al. (2020), who reported that PVT failure rates in subjects with functional neurological disorders (such as PNES) were no higher than in MCI or epilepsy. Likely, our strict inclusion and exclusion criteria, inclusion of only well-validated stand-alone PVTs, and application of meta-analysis led to a more precise estimate of PVT failure across diagnostic groups. Our findings also indicate that pooled PVT failure rates for the MCI, MS, and Parkinson's disease diagnostic groups are more homogeneous than those of PNES, (m)TBI, and patients seen for ADHD assessment. The higher levels of heterogeneity in these latter groups could indicate that other factors that likely impact PVT failure were present, such as external gain incentives, variation in diagnostic criteria, and bias in patient selection. Finally, pooled failure rates varied across the utilized PVTs in line with their respective sensitivity/specificity ratios in correctly identifying invalid performance. The WMT is known for its relatively high sensitivity (Sollman & Berry, 2011), which likely resulted in the highest pooled failure rate amongst the examined stand-alone PVTs. The lowest pooled failure rate, for the TOMM, is probably related to its high specificity (Martin et al., 2020).
Our findings indicate that in addition to PVT psychometric properties (i.e., sensitivity and specificity), the clinical setting, the presence of external gain incentives, and the clinical diagnosis impact pooled PVT failure rates. The clinician should therefore consider these factors when interpreting PVT results. Consider, for example, a well-researched stand-alone PVT with a sensitivity of 0.69 and a specificity of 0.90, administered to two different clinical patients. The first patient is diagnosed with epilepsy and wants to be approved to return to work (i.e., no external gain incentive for invalid performance). If this patient failed the PVT (base rate of PVT failure of 10%, see Table 2), the likelihood that the PVT failure is indeed a true positive (i.e., the positive predictive value, PPV) would be 43%. The second patient is also diagnosed with epilepsy but has a pending disability application because the patient does not believe he/she is able to return to work (i.e., a potential external gain incentive for invalid performance). If this patient failed the same PVT (base rate of PVT failure of 16%, see Table 2), the PPV would be 57%.
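The two PPVs in this example follow directly from Bayes' rule; as a quick arithmetic check, assuming the stated sensitivity of 0.69 and specificity of 0.90:

```python
def ppv(base_rate, sens=0.69, spec=0.90):
    # positive predictive value of a single failed PVT at a given base rate
    return sens * base_rate / (sens * base_rate + (1 - spec) * (1 - base_rate))

print(round(ppv(0.10), 2))  # 0.43, the patient without external gain incentives
print(round(ppv(0.16), 2))  # 0.57, the patient with a potential external incentive
```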
Of importance, although a PPV increase from 0.43 to 0.57 is substantial, the latter is still not sufficient to determine performance validity. Therefore, in line with general consensus, multiple independent validity tests should be employed (Sherman et al., 2020; Sweet et al., 2021). By chaining the positive likelihood ratios (LRs) of multiple failed PVTs, the diagnostic probability of invalid performance (or PPV) is increased and the diagnostic error is decreased (for an explanation of how to chain likelihood ratios, see Larrabee, 2014; Larrabee, 2022). Note that while considerable weight should be placed on the psychometric evaluation of performance validity, the clinician should also include other test and extra-test information (e.g., degree of PVT failure, (in)consistency of the clinical presentation) to draw conclusions about the validity of an individual patient's neuropsychological assessment (Dandachi-FitzGerald & Martin, 2022; Larrabee, 2022; Sherman et al., 2020).
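Chaining likelihood ratios can be sketched as follows: prior odds are multiplied by each failed PVT's positive likelihood ratio (LR+ = sensitivity / (1 − specificity)), and the posterior odds are converted back to a probability. The sketch assumes independent PVTs with the same hypothetical operating characteristics used in the example above.

```python
def chained_posterior(base_rate, likelihood_ratios):
    """Posterior probability of invalid performance after several PVT results.

    base_rate         -- prior probability of invalid performance
    likelihood_ratios -- one LR+ per failed (assumed independent) PVT
    """
    odds = base_rate / (1 - base_rate)   # convert prior probability to odds
    for lr in likelihood_ratios:
        odds *= lr                       # each failed PVT multiplies the odds
    return odds / (1 + odds)             # convert posterior odds back

lr_plus = 0.69 / (1 - 0.90)  # LR+ = 6.9 for sensitivity 0.69, specificity 0.90
one_failure = chained_posterior(0.15, [lr_plus])            # ~ 0.55
two_failures = chained_posterior(0.15, [lr_plus, lr_plus])  # ~ 0.89
```

The jump from roughly 0.55 to roughly 0.89 with a second independent failure illustrates why consensus guidelines recommend multiple validity indicators rather than a single test.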
Strengths of the present study are its strict inclusion/exclusion criteria, ensuring accurate PVT results. Unfortunately, none of the studies fulfilled all components of the three predefined quality criteria (selection bias, attrition bias, and adequate sample size/statistics) for determining prevalence. Although we excluded studies with fewer than 20 subjects, most of the remaining studies still had relatively small sample sizes, increasing the likelihood of sampling bias and heterogeneity. Moreover, only 20/47 studies reported an appropriate recruitment method (e.g., consecutive referrals or a good census) necessary for determining the base rate of PVT failure. Additionally, diagnostic criteria varied across studies, limiting the generalizability of their calculated PVT failure base rates. Also, the way potential external gain incentives were examined and defined varied significantly. Surprisingly, in just over one quarter of the included studies, potential external gain incentives were not mentioned at all, even though such incentives may have been present. Finally, although language (proficiency) and cultural factors relate to PVT failure (Robles et al., 2015; Ruiz et al., 2020), these factors were not mentioned in more than half of the included studies in our meta-analysis.
Additional empirical research is necessary to advance knowledge of performance validity test failure in clinical populations. An important first step in future research should be to provide comprehensive details regarding study design, such as recruitment procedure, clinical setting, and demographic/descriptive information (e.g., cultural factors, age, language and language proficiency, level of education).
A second improvement would be to form comparable and homogeneous patient samples by specifying diagnostic criteria and providing a detailed specification of how external gain incentives were examined (e.g., querying the patient for potential external gain incentives, such as pending litigation or disability procedures; Schroeder et al., 2021a). Since administration of multiple PVTs is recommended (Sweet et al., 2021), future studies, and specifically meta-analyses, should consider using advanced statistical techniques (e.g., three-level meta-analyses) to handle non-independent effect sizes (Cheung, 2019).
In conclusion, the current meta-analysis demonstrates that PVT failure occurs in a substantial minority of patients seen for routine clinical care. Type of clinical context, patient characteristics, presence of external gain incentives, and psychometric properties of the utilized PVT were found to impact the rate of PVT failure. Our findings can be used for calculating clinically applied statistics (i.e., PPV/NPV and LRs) in everyday practice to increase the diagnostic accuracy of performance validity determination. Future studies using detailed recruitment procedures and sample characteristics, such as external gain incentives and language (proficiency), are needed to further improve and refine knowledge about the base rates of PVT failure in clinical assessments.
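As a minimal illustration of such clinically applied statistics, the sketch below computes the PPV and NPV of a single PVT at the pooled 16% base rate reported here. The 70% sensitivity and 90% specificity figures are hypothetical and not drawn from any specific PVT.

```python
def predictive_values(base_rate, sensitivity, specificity):
    """Compute PPV and NPV via Bayes' theorem for a single test
    administered at a given base rate of the target condition."""
    tp = sensitivity * base_rate                   # true positives
    fp = (1.0 - specificity) * (1.0 - base_rate)   # false positives
    fn = (1.0 - sensitivity) * base_rate           # false negatives
    tn = specificity * (1.0 - base_rate)           # true negatives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# Hypothetical PVT (70% sensitivity, 90% specificity) evaluated at
# the pooled 16% base rate from this meta-analysis.
ppv, npv = predictive_values(0.16, 0.70, 0.90)
print(round(ppv, 2), round(npv, 2))  # 0.57 0.94
```

The example shows how strongly the PPV depends on the base rate: at a low base rate even a fairly specific test yields a modest PPV, which is why context-specific base rate estimates are needed for accurate interpretation.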

c Results of the "no clinically obvious cognitive impairment" subgroup (n = 24). d Demographics for the total sample (n = 77) were not mentioned; therefore, we chose to display the demographics of the non-compensation subgroup (n = 41). e Demographics for the total sample (n = 90) were not mentioned; therefore, we chose to display the demographics of the WMT fail subgroup (n = 32). f VSVT hard items cutoff per Grote et al., 2000 in epilepsy (sub)sample

Fig. 2 Forest plot of the 47 included studies estimating the pooled prevalence of PVT failure in the clinical setting. CI = confidence interval. Note: Weights are from random effects analysis

Table 1
Summary details for individual studies that reported the prevalence of PVT failure in clinical patients. *In case not all subjects in the patient sample received a given PVT, the proportion is shown in this column (…/…) a

Table 2
ADHD attention deficit hyperactivity disorder; CI confidence interval; MCI mild cognitive impairment; MS multiple sclerosis; (m)TBI (mild) traumatic brain injury; (nv)MSVT (non-verbal) Medical Symptom Validity Test; PNES psychogenic non-epileptic seizures; PVT performance validity test; TOMM Test of Memory Malingering; VSVT Victoria Symptom Validity Test; WMT Word Memory Test. *After exclusion of the subsamples of subjects with a probable risk of false-positive PVT failure classification (i.e., k = 46)