Introduction

Exaggeration or outright fabrication of symptoms occurs in a high proportion of neuropsychological evaluations, especially in forensic settings, where the external incentives to appear impaired are often substantial (Bush et al., 2005). Thus, it is crucial to thoroughly assess the credibility of self-reported symptoms in order to ensure the validity of diagnoses and treatment recommendations based on the results of psychometric testing.

Methods used to assess symptom validity may vary depending on the context of the evaluation and the type of symptoms presented. Additionally, the American Psychological Association (2013) has published guidelines to assist practitioners in their forensic evaluations. Among them, Guideline 9.02 (Use of Multiple Sources of Information) explicitly recommends, “Forensic practitioners ordinarily avoid relying solely on one source of data” (APA, 2013, p. 15).

Performance and Symptom Validity Assessment

In the context of neuropsychological assessments, the credibility of the clinical presentation must be further divided into performance and symptom validity (Larrabee, 2012). The former refers to the extent to which scores on performance-based measures of cognitive abilities reflect the examinee’s true ability level; the latter refers to the extent to which self-reported symptoms accurately capture the examinee’s level of emotional distress. Although the two constructs are related, they ultimately measure conceptually distinct aspects of the examinee’s neuropsychological profile (Bianchini et al., 2014; Gervais et al., 2007, 2011; Merten et al., 2022; Richman et al., 2006; Tarescavage et al., 2013; Tylicki et al., 2021; Young, 2020). Therefore, they are assessed using different types of instruments: performance versus symptom validity tests (PVTs vs SVTs).

Naturally, PVTs and SVTs use different detection strategies, following the measurement paradigms established to assess cognitive ability (i.e., performance-based tasks) and emotional functioning (i.e., self-reported symptom inventories), respectively. PVTs are designed to detect implausibly low scores on measures of cognitive abilities, whereas SVTs are designed to detect implausibly high scores on measures of psychological symptoms (Giromini et al., 2022). The most common detection mechanisms used by PVTs are the method of threshold (an unusually low level of performance – below that commonly observed in credible patients with genuine impairment) and measures of compelling inconsistency (combinations of scores incompatible with known patterns of neurological deficits). In contrast, SVTs are designed to detect a tendency to endorse extremely rare symptoms or indiscriminate endorsement of all symptoms. Both patterns of response are interpreted as the examinees’ tendency (whether deliberate or not) to exaggerate the true level of their emotional distress (Rogers & Bender, 2018).

The different detection mechanisms used by PVTs and SVTs predict a weak correlation between these two types of tests. In fact, SVTs tend to correlate more strongly with other SVTs than with PVTs and vice versa (Giromini et al., 2022), and this method variance predicts that the outcomes of SVTs and PVTs administered to a given examinee will be largely unrelated. Recent empirical investigations largely supported this prediction (Sabelli et al., 2021; Shura et al., 2022; Van Dyke et al., 2013), and there is consensus that symptom and performance validity should be assessed separately (Sweet et al., 2021). Similarly, researchers calibrating or cross-validating SVTs should establish criterion groups based on the outcomes of other SVTs, not PVTs (Gegner et al., 2022).

Over the past few decades, empirical research on both free-standing (Boone et al., 2002a, b; Green, 2003, 2004; Nelson et al., 2006; Pearson, 2009; Slick et al., 1997; Tombaugh, 1996) and embedded PVTs has proliferated (Martin et al., 2015): older instruments have been continuously recalibrated (Boucher et al., 2023; Deloria et al., 2021; Johnson et al., 2012; Sugarman & Axelrod, 2015; Whiteside et al., 2015) while new measures and cutoffs are being introduced (Abeare et al., 2021a, b; Erdodi et al., 2016; Langeluddecke & Lucas, 2003; Rai et al., 2019; Sawyer et al., 2017; Schroeder & Marshall, 2010). Assessors have a wide range of PVTs to choose from. Although there is a comparable number of well-established free-standing SVTs, both interview-based measures [e.g., the Structured Interview of Reported Symptoms (SIRS; Rogers et al., 1992; SIRS-2; Rogers et al., 2010); the Miller Forensic Assessment of Symptoms Test (M-FAST; Miller, 2001)] and self-report measures [e.g., the Structured Inventory of Malingered Symptoms (SIMS; Smith & Burger, 1997); the Inventory of Problems – 29 (IOP-29; Viglione & Giromini, 2020; Viglione et al., 2017); the Self-Report Symptom Inventory (SRSI; Merckelbach et al., 2018)], the most commonly used SVTs are embedded within comprehensive personality inventories [the Minnesota Multiphasic Personality Inventory (MMPI-2; Butcher et al., 2001; MMPI-2-RF; Ben-Porath & Tellegen, 2008) or the Personality Assessment Inventory (PAI; Morey, 1991)].

Validity scales embedded within brief symptom inventories are less common, and the existing ones have a limited post-publication empirical evidence base (Roth et al., 2005). In contrast, nowadays the most commonly used neuropsychological tests of cognitive ability contain embedded PVTs. It may be no coincidence that the trend of developing embedded SVTs within shorter self-report inventories started with instruments commonly used by neuropsychologists (Abeare et al., 2021b; Cutler et al., 2022; Shwartz et al., 2020; Silva, 2021; Vanderploeg et al., 2014).

The use of SVTs is highly recommended in all psychological evaluations (Sherman et al., 2020; Sweet et al., 2021). However, to date, SVTs remain underutilized, regardless of the assessment context (Merten & Merckelbach, 2013; Nelson et al., 2019; Plohmann & Merten, 2013; Sharland & Gfeller, 2007; Tierney et al., 2021). The limited range of available SVTs that are quick and easy to administer and score may be a practical barrier inhibiting their widespread use. In contrast, the research on the integrated use of several different PVTs (Boone, 2009; Erdodi, 2019, 2021, 2023; Larrabee, 2008, 2014; Larrabee et al., 2019) is robust and has morphed into clear guidelines on multivariate models. Additionally, there is still little evidence on how many SVTs should be administered to properly assess the credibility of symptom report, and how many failures are required to deem a response set invalid (Sherman et al., 2020). Given these significant knowledge gaps, there is a clear need for more research on SVTs.

Trauma-Related Symptoms Validity Assessment

SVTs may differ in terms of the type of symptoms being evaluated. Some scales were designed to assess a broad spectrum of symptoms, without any specificity to a given category of psychopathology [e.g., the Negative Impression Management scale of the PAI (NIMPAI); the Infrequency scale of the MMPI-2 (FMMPI-2)]; others focus on specific symptom clusters (e.g., psychiatric, somatic, cognitive). The SVTs embedded within the Trauma Symptom Inventory – Second Edition (TSI-2; Briere, 2011) are examples of the latter. The TSI-2 is a self-report inventory designed to assess symptoms and behaviors following trauma of various kinds (e.g., sexual and/or physical assault, domestic violence, physical confrontation, torture, car accident, multiple victim events, health care incident, witnessing violence, traumatic loss, and early experiences of child neglect or abuse).

In addition, the TSI-2 contains two validity scales: Response Level (RL) and Atypical Responses (ATR). High scores on either of these scales raise concerns about the validity of the profile. RL was designed to monitor a tendency to deny symptoms that most respondents generally endorse. In contrast, ATR was designed to monitor a tendency to (over)endorse trauma-related symptoms that are only rarely reported by others, including those with significant post-traumatic symptomatology. High scores on the ATR may indicate (a) general over-estimation of symptoms, (b) specific over-estimation of PTSD-related items, (c) a random response style, or (d) very high levels of genuine distress (Palermo & Brand, 2019).

Consequently, the underlying problem with this scale is that over-endorsing items may be interpreted either as an attempt at gross symptom exaggeration/factitious complaints or as the experience of symptoms at a greater intensity than others. In clinical and forensic settings, the recommended cutoff for non-credible presentation is ≥ 15 (Briere, 2011). In research settings (i.e., members of the general population assessed in a non-clinical context), the recommended cutoff is ≥ 8 (Gray et al., 2010).

A review of the ATR’s item content reveals a mixture of different detection mechanisms: rare symptom endorsement combined with neurologically (global amnesia) or physiologically (inability to meet basic needs for prolonged periods of time) implausible levels of impairment. At face value, reporting a high frequency of these symptoms seems incompatible with genuine distress, unless the associated psychopathology is correlated with severe cognitive deficits that interfere with the examinees’ ability to objectively evaluate their level of functioning (i.e., clinically significant impairments in reality testing). Arguably, the latter scenario should still be classified as a subtype of non-credible reporting – although perhaps of a different etiology (Merten & Merckelbach, 2013). In other words, there are no salient, face-valid a priori reasons to justify the need for a highly conservative cutoff on the ATR.

The TSI-2 has been standardized and validated on a representative sample of the U.S. population (n = 678). The Professional Manual reports variable internal consistency (α = 0.76–0.94) and test–retest reliability with a one-week interval (r = 0.76–0.93; Briere, 2011). However, post-publication research on the effectiveness of the TSI-2 at distinguishing coached simulators from patients with genuine dissociative disorders found that it underperformed compared to both the Trauma Index of the SIRS-2 (Brand et al., 2014) and the Infrequency-Psychopathology scale (Fp) of the MMPI-2 (Palermo & Brand, 2019). Other studies reported incremental utility of the TSI-2 above and beyond other SVTs or study-specific predictors in distinguishing individuals attempting to feign PTSD from honest responders (Efendov et al., 2008; Elhai et al., 2005).

After examining PAI and TSI-2 scores in coached PTSD simulators and credible patients with PTSD, Gray and colleagues (2010) found that both the PAI and the TSI-2 successfully differentiated between the two groups, but the NIMPAI outperformed the ATR. Taken together, these findings, along with a recent review of the available literature on the effectiveness of the ATR scale (Ales & Erdodi, 2021), suggest the need for further research to evaluate its clinical and forensic utility. In addition, there is no information on the ATR’s differential predictive power using SVTs vs PVTs as criterion measures. Although peri-traumatic dissociation is a common occurrence (Azoulay et al., 2020; Holeva & Tarrier, 2001; Ursano et al., 1999) with verifiable neurobehavioral (Daniels et al., 2012) and genetic correlates (Koenen et al., 2005), the validity of retrospective self-reports of peri-traumatic dissociation has been called into question (Candel & Merckelbach, 2004). Therefore, there is a clear need for objectively verifying claims of memory deficits associated with traumatic events.

Present Study

This study was designed to address this gap in the research literature. The classification accuracy of the ATR was computed against both SVTs and PVTs as criterion measures to empirically evaluate its differential predictive power. More importantly, we collected data from clinical patients with identifiable external incentives to appear impaired – an important factor in the study of motivated exaggeration of symptoms and deficits (Boskovic, 2020; McDermott, 2012; Peace & Richards, 2014). Based on previous reports (Sabelli et al., 2021; Shura et al., 2022), we hypothesized that the ATR would produce a superior classification accuracy against SVTs compared to PVTs. In addition, we predicted that the optimal cutoff would be closer to that proposed by Gray et al. (2010; ≥ 8) as opposed to the one proposed by Briere (2011; ≥ 15).

Methods

Participants

Data were collected from a consecutive case sequence of 99 files retrieved from the archives of a clinical neuropsychologist practicing in the Greater Toronto Area in Ontario, Canada. Patients were assessed in the context of a motor vehicle collision to provide an independent medicolegal evaluation of their neuropsychological and adaptive functioning. Inclusion criteria were 1) a full administration of the TSI-2 and PAI; 2) age between 18 and 69 (adults); and 3) being born in Canada (to control for limited English proficiency as a confounding variable; Ali et al., 2022; Boskovic et al., 2020; Crişan et al., 2023a, b; Dandachi-FitzGerald et al., 2023a, b; Erdodi & Lajiness-O’Neill, 2012; Erdodi et al., 2017b). The majority of the sample was female (63.6%) and right-handed (92.9%). Mean age was 42.5 years (SD = 14.2); mean level of education was 12.7 years (SD = 2.6). Overall intellectual functioning (MFSIQ = 93.5, SD = 12.8) and single-word reading level (M = 92.2, SD = 13.2) were in the average range.

All patients were involved in litigation around the motor vehicle collision that prompted the referral for neuropsychological assessment. The majority of patients (77) sustained an uncomplicated mild TBI [Glasgow Coma Scale (GCS) > 13; loss of consciousness < 30 min; peritraumatic amnesia < 1 h; and negative neuroradiological findings], followed by complicated (positive neuroradiological findings) mild TBI (10), severe (3) and moderate (2) TBI. All patients were assessed in the post-acute stage of recovery (> 3 months post injury for mild TBI and > 12 months post injury for moderate/severe TBI). Coincidentally, the same proportion of the sample (42.9%) reported clinically significant PTSD symptoms on the PAI and on the TSI-2.

Measures

Trauma Symptom Inventory – Second Edition (TSI-2)

The TSI-2 consists of 136 items and measures a wide range of complex psychopathology across the lifespan (e.g., post-traumatic stress, dissociation, somatization, insecure attachment styles, reduced self-capacity, and wide-ranging dysfunctional behaviors) organized into 12 clinical scales, 12 subscales, four factors, and two validity scales (Table 1). The TSI-2 instructs the examinee to read each item carefully and rate how often the symptom was experienced in the past six months on a scale ranging from 0 to 3. The TSI-2 assesses acute or chronic trauma-related symptomatology. T-scores represent linear transformations of raw scores (M = 50, SD = 10). Higher scores represent higher levels of symptomatology. T-scores between 60 and 64 are considered problematic (i.e., above average symptoms, with potential clinical implications); a T-score ≥ 65 is considered clinically elevated (i.e., high levels of symptoms that constitute a major clinical problem).
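For reference, this is the standard linear T-score transformation, where μ and σ denote the normative raw-score mean and standard deviation:

$$ T = 50 + 10 \cdot \frac{X - \mu}{\sigma} $$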

Table 1 TSI-2 factors, scales and subscales

Personality Assessment Inventory (PAI)

The PAI offers four validity scales to determine whether the profile emerging from the test accurately represents the individual’s distress, and to assess any potential biases in delivering responses. The NIMPAI is a 9-item scale specifically designed to detect whether the individual attempts to present a more negative picture of their symptoms. It comprises items on bizarre symptoms that are rarely endorsed in both clinical and non-clinical samples. Thus, the NIMPAI may be considered a measure of over-estimation of pathology driven by pessimism and/or intentional over-estimation of distress (Morey, 1991). In the second edition of the PAI Professional Manual, Morey (2007) proposed that T-scores < 74 suggest little response distortion, whereas T-scores between 74 and 84 suggest some exaggeration. Additionally, Hawes and Boccaccini (2009) conducted an extensive meta-analysis examining different PAI cutoffs. They found that a NIMPAI T-score cutoff of ≥ 81 yielded the highest overall classification rate (.79), while preserving relatively strong sensitivity (.73) and specificity (.83), and thus suggested that future PAI validity studies report classification results using the optimal cutoffs identified by their meta-analysis. In the current study, the NIMPAI (at a cutoff of T ≥ 81) served as the legacy criterion SVT (i.e., domain-congruent measure) for evaluating the ATR’s classification accuracy.

Beck Depression Inventory – Second Edition (BDI-II)

The BDI-II (Beck et al., 1996) is a 21-item self-report measure assessing the presence and severity of depressive symptoms over the past two weeks. The BDI-II provides a total score covering two symptom spectra: the somatic-affective and the cognitive. The former captures somatic-affective manifestations of depression such as loss of interest, loss of energy, changes in sleep and appetite, agitation, and crying; the latter targets cognitive manifestations such as pessimism, guilt, and self-criticism. A recent study demonstrated that a cutoff of ≥ 38 on the BDI-II is specific to non-credible symptom report (Fuermaier et al., 2023a). As such, the BDI-II’s new embedded validity indicator was employed as an alternative SVT. A BDI-II total raw score of ≥ 38 was used to operationalize symptom exaggeration within this study.

SVT-2

The NIMPAI (invalid defined as T ≥ 81) and BDI-II (invalid defined as ≥ 38) were combined into a multivariate criterion measure labeled SVT-2, consistent with methodological recommendations by Sherman et al. (2020). The classification accuracy of the ATR was evaluated across two alternative multivariate cutoffs. On the SVT-2A, invalid responding was defined as failing either of the two components (liberal cutoff). In contrast, on the SVT-2B, invalid responding was defined as failing both of the components (conservative cutoff).
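As a minimal sketch (in Python; function and variable names are illustrative, not taken from the authors’ materials), the two aggregation rules can be expressed as follows:

```python
def svt2(nim_t: float, bdi_total: int) -> dict:
    """Combine the two symptom validity criteria into the SVT-2.

    nim_t:     NIM T-score from the PAI (invalid if >= 81)
    bdi_total: BDI-II total raw score (invalid if >= 38)
    """
    nim_fail = nim_t >= 81
    bdi_fail = bdi_total >= 38
    return {
        "SVT-2A": nim_fail or bdi_fail,   # liberal: fail either component
        "SVT-2B": nim_fail and bdi_fail,  # conservative: fail both components
    }

# Example: NIM T = 85 with BDI-II = 30 fails the liberal but not the conservative criterion
print(svt2(85, 30))  # {'SVT-2A': True, 'SVT-2B': False}
```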

Test of Memory Malingering (TOMM)

The TOMM (Tombaugh, 1996) is one of the most commonly used free-standing PVTs worldwide (Dandachi-FitzGerald et al., 2013; Martin et al., 2015; Sharland & Gfeller, 2007; Slick et al., 2004; Uiterwijk et al., 2021). It is based on the visual forced-choice recognition paradigm, using pictures of common objects represented by black-and-white single line drawings. Its first trial (TOMM-1) was originally intended as a learning trial but has subsequently been validated as a free-standing PVT in its own right. A liberal cutoff of ≤ 43 demonstrated high specificity to non-credible responding (Ashendorf et al., 2004; Erdodi, 2022; Greve et al., 2006; Jones, 2013; Kulas et al., 2014; Rai & Erdodi, 2021), but a recent meta-analysis endorsed the use of a more conservative cutoff of ≤ 41 (Martin et al., 2020). Therefore, invalid performance on the TOMM-1 within this study was operationalized as a raw score of ≤ 41.

Validity Index Five (VI-5)

Next, a composite measure of performance validity (VI-5) was created by aggregating five embedded PVTs. Each component was dichotomized along published cutoffs (Table 2). The value of the VI-5 is the number of its components failed by a given patient. As such, it ranges from 0 (all five PVTs passed) to 5 (all five PVTs failed). A VI-5 score ≥ 2 was used as the multivariate cutoff for invalid performance (Larrabee, 2014).

Table 2 Components of the VI-5, cutoffs, failure rates and references
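A minimal sketch of the VI-5 aggregation logic described above (the actual components and cutoffs are those listed in Table 2; the pass/fail inputs below are illustrative):

```python
def vi5(component_failures: list[bool]) -> tuple[int, bool]:
    """Sum dichotomized embedded PVT outcomes into the VI-5.

    component_failures: five booleans, True = failed the published cutoff.
    Returns the VI-5 score (0-5) and whether it meets the
    multivariate cutoff for invalid performance (>= 2 failures).
    """
    assert len(component_failures) == 5
    score = sum(component_failures)
    return score, score >= 2

# Example: two of five embedded PVTs failed -> VI-5 = 2, classified invalid
print(vi5([True, False, True, False, False]))  # (2, True)
```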

Erdodi Index Seven (EI-7)

Finally, another validity composite was developed using an alternative aggregation method following the template developed by Erdodi (2019). Each embedded PVT was recoded onto a four-point ordinal scale, where 0 is defined by a score that cleared the most liberal cutoff and suggests valid performance; 3 is a score that failed the most conservative cutoff, with 1 and 2 representing in-between levels of failure (Table 3). The value of the EI-7 is obtained by summing the recoded components. As such, it ranges from 0 (all components passed) to 21 (all components failed at the most conservative cutoff). An EI-7 ≤ 1 is considered an incontrovertible Pass, as it reflects at most one marginal failure. EI-7 values in the 2–3 range are considered Borderline, as they indicate at most three marginal failures, which constitutes insufficient overall evidence of globally invalid performance (Pearson, 2009). However, an EI-7 score ≥ 4 represents either at least four marginal failures, which is associated with a < 5% cumulative failure rate (Pearson, 2009), or at least two failures at conservative cutoffs. Either of these combinations provides sufficient psychometric evidence of non-credible responding. Therefore, this level of performance (≥ 4) was considered an overall Fail in this study. The EI model has been extensively validated in different samples and against a variety of criterion measures (Abeare et al., 2021a, 2022b; An et al., 2019; Boucher et al., 2023; Erdodi, 2023; Erdodi et al., 2019a; Holcomb et al., 2022b; Tyson et al., 2023), demonstrating strong classification accuracy and robustness to moderate/severe TBI (Erdodi & Abeare, 2020; Erdodi et al., 2019b). Independent replications confirmed its clinical utility (Robinson et al., 2023; Tyson & Shahein, 2023).

Table 3 Components of the EI-7 and base rates of failure at given cutoffs
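A minimal sketch of the EI-7 scoring logic, assuming the seven components have already been recoded onto the 0–3 ordinal scale described above (the component-specific cutoffs are those listed in Table 3):

```python
def ei7(recoded_components: list[int]) -> tuple[int, str]:
    """Sum ordinally recoded embedded PVTs (0-3 each) into the EI-7.

    Returns the EI-7 total (0-21) and its interpretive band:
    <= 1 Pass, 2-3 Borderline, >= 4 Fail.
    """
    assert len(recoded_components) == 7
    assert all(0 <= c <= 3 for c in recoded_components)
    total = sum(recoded_components)
    if total <= 1:
        band = "Pass"
    elif total <= 3:
        band = "Borderline"
    else:
        band = "Fail"
    return total, band

# Example: two failures at the most conservative cutoffs (3 + 3) -> EI-7 = 6, overall Fail
print(ei7([3, 3, 0, 0, 0, 0, 0]))  # (6, 'Fail')
```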

The parallel use of the TOMM-1, VI-5 and EI-7 provides alternative conceptualizations of performance validity [i.e., free-standing versus embedded PVTs; the traditional dichotomous (Pass/Fail) versus ordinal components (levels of failure)]. As such, they represent an engineered method variance for the validation of the ATR. In the absence of a gold standard measure of the credibility of the clinical presentation, using a variety of instruments/aggregation methods affords an opportunity to examine classification accuracy across changing psychometric definitions of invalid response sets.

Procedure

Data were collected and curated by the first author. Patient files were irreversibly de-identified at the source: no personal information was recorded for research purposes. The project was approved by the Research Ethics Board of the university listed as the last author’s institutional affiliation. APA guidelines regulating research involving human participants were followed throughout the process.

Data Analysis

Descriptive statistics [M, SD, base rates of failure (BRFail)] were reported as relevant. Inferential statistics included receiver operating characteristic (ROC) curves [area under the curve (AUC) with corresponding 95% confidence intervals (CIs)] and chi-square tests of independence. Sensitivity, specificity, and overall correct classification (OCC; the sum of true positives and true negatives divided by N) were calculated using standard formulas, as shown below. Effect size estimates were expressed as Ф2. Although the interpretation of the magnitude of the association is context-dependent, an effect of .40 is considered to be at the upper limit of values typically observed in psychosocial and biomedical research (Rosnow & Rosenthal, 2003).
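For reference, these statistics follow the standard definitions (TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively):

$$ \text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}, \qquad \text{OCC} = \frac{TP + TN}{N} $$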

For most clinical instruments, the benchmark value for sensitivity and specificity is .80 (Gregory, 2013). However, given the delicate nature of symptom and performance validity, specificity is prioritized over sensitivity, with ≥ .90 being the lower limit (i.e., a false positive rate of ≤ .10; Boone, 2013; Chafetz, 2022). Therefore, instead of optimizing cutoffs to achieve a balance between sensitivity and specificity, the latter is prioritized, allowing the former to fall where it may. In PVT research, this typically produces a sensitivity hovering around .50, while specificity is fixed at .90. This seemingly inevitable trade-off between sensitivity and specificity has been labeled the Larrabee limit (Erdodi et al., 2014; Crişan et al., 2021).
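The specificity-first logic described above can be illustrated with a minimal sketch (in Python; the data and function name are illustrative, not drawn from the study): candidate cutoffs are scanned from liberal to conservative, and the most liberal cutoff whose specificity clears the standard is retained, letting sensitivity fall where it may.

```python
import numpy as np

def specificity_first_cutoff(scores, invalid, min_spec=0.90):
    """Return the most liberal cutoff whose specificity clears min_spec,
    letting sensitivity 'fall where it may' (the Larrabee limit trade-off).

    scores:  validity test scores (higher = more evidence of invalidity)
    invalid: criterion classification (True = non-credible responding)
    """
    scores = np.asarray(scores)
    invalid = np.asarray(invalid, dtype=bool)
    for cutoff in np.sort(np.unique(scores)):
        flagged = scores >= cutoff
        specificity = np.mean(~flagged[~invalid])  # TN / (TN + FP)
        sensitivity = np.mean(flagged[invalid])    # TP / (TP + FN)
        if specificity >= min_spec:
            return cutoff, sensitivity, specificity
    return None  # no cutoff reached the specificity standard

# Toy data: credible cases cluster at low scores, non-credible cases higher
scores  = [3, 4, 5, 5, 6, 7, 8, 9, 10, 12]
invalid = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(specificity_first_cutoff(scores, invalid))
```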

Results

The ATR was a significant predictor of the SVT-2A (AUC = .73; 95% CI: .63-.83). A cutoff of ≥ 7 failed to approximate the specificity standard (.78). The next cutoff (≥ 8) produced an acceptable combination of sensitivity (.43) and specificity (.89) at .710 OCC. Raising the cutoff to ≥ 9 achieved high specificity (.95) at a reasonable cost to sensitivity (.35) and no change in OCC. Further increasing the cutoff to ≥ 10 reached the point of diminishing returns (.24 sensitivity at .96 specificity and .677 OCC). Perfect specificity but low sensitivity (.16) was observed at ≥ 11 (Table 4). At ≥ 15, sensitivity was very low (.05).

Table 4 Classification accuracy of the ATR Scale of the TSI-2 across various cutoffs and criterion measures

The ATR was an even stronger predictor of the SVT-2B (AUC = .85; 95% CI: .74-.96). Once again, a cutoff of ≥ 7 failed to approximate the specificity standard (.78), but ≥ 8 produced a good combination of sensitivity (.68) and specificity (.89) at .821 OCC. Raising the cutoff to ≥ 9 achieved improved specificity (.92) at an acceptable cost to sensitivity (.53) and OCC (.786). Further increasing the cutoff to ≥ 10 resulted in high specificity (.95) but a further decline in sensitivity (.37) and OCC (.750). Perfect specificity was achieved at ≥ 13 at low sensitivity (.21). Predictably, sensitivity was even lower at ≥ 15 (.11).

In sharp contrast to the analyses above, the ATR was a non-significant predictor of TOMM-1 (BRFail = 45.4%; AUC = .53, 95% CI: .41-.64). Therefore, classification accuracy was not computed. However, the ATR was a significant predictor of the VI-5 (AUC = .64, 95% CI: .52-.77). Once again, a cutoff of ≥ 7 failed to achieve minimum specificity (.79), as did the next level of cutoff (.83 specificity). The first cutoff to reach .90 specificity was ≥ 9, at .37 sensitivity and .690 OCC. Making the cutoff more conservative (≥ 10) resulted in high specificity (.94) but low sensitivity (.29) and OCC (.678). Further raising the cutoff to ≥ 13 resulted in increased specificity (.96), but a notable loss in sensitivity (.18) and OCC (.644). Sensitivity further declined at ≥ 15 (.04), with slight improvement in specificity (.99).

The ATR was also a significant predictor of the EI-7 (AUC = .69, 95% CI: .56-.82). Once again, ≥ 7 and ≥ 8 failed to achieve minimum specificity (.80-.85). However, the next cutoff (≥ 9) produced high specificity (.93), albeit at low sensitivity (.26) and OCC (.657). Raising the cutoff to ≥ 10 resulted in the predictable trade-off between sensitivity (.19) and specificity (.95). Making the cutoff even more conservative (≥ 13) achieved perfect specificity but low (.15) sensitivity (Table 5). Predictably, sensitivity was even lower at ≥ 15 (.04).

Table 5 Classification accuracy of the ATR Scale of the TSI-2 across various cutoffs and criterion measures

Next, the relationship between self-reported trauma symptoms and the outcome of SVTs and PVTs was examined. Trauma was operationalized as the T-score on the Anxiety Related Distress scale of the PAI (ARDPAI) [categorized as none (< 60), mild (60–69), moderate (70–89) and severe (≥ 90)] and the Posttraumatic Stress factor (PTSTSI-2) on the TSI-2 [categorized as none (< 55), mild (55–64), moderate (65–74) and severe (≥ 75)]. On ARDPAI, a strong linear relationship emerged for NIMPAI (invalid defined as T ≥ 81), BDI-II (invalid defined as ≥ 38) and ATR ≥ 9 (p < .001, Ф2: .229-.297; very large effects). The trend extended to SVT-2A (p < .001, Ф2 = .248, very large effect) and was accentuated on SVT-2B (p < .001, Ф2 = .406, very large effect). However, none of the contrasts were significant using BRFail on PVTs (p: .335-.930).

On PTSTSI-2, a notably stronger linear relationship emerged for NIMPAI (invalid defined as T ≥ 81), BDI-II (invalid defined as ≥ 38) and ATR ≥ 9 (p < .001, Ф2: .377-.501; extremely large effects). The trend extended to SVT-2A (p < .001, Ф2 = .375, very large effect) and was further accentuated on SVT-2B (p < .001, Ф2 = .569, extremely large effect). However, the only significant contrast using BRFail on PVTs emerged on the EI-7 (p = .032, Ф2 = .133, medium effect; Table 6).

Table 6 Failure rate (%) on various SVTs and PVTs as a function of self-reported Anxiety Related Distress (ARDPAI) and Posttraumatic Stress Factor (PTSTSI-2) severity
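As a minimal sketch of how one such contrast can be computed (in Python; the counts are illustrative, not the study’s data, and Ф2 is assumed here to be the mean-square contingency χ²/N):

```python
import numpy as np
from scipy.stats import chi2_contingency

def ard_severity(t_score: float) -> str:
    """Bin an ARD T-score from the PAI into the four severity levels used above."""
    if t_score < 60:
        return "none"
    if t_score < 70:
        return "mild"
    if t_score < 90:
        return "moderate"
    return "severe"

# Toy 2x4 contingency table (rows: SVT passed/failed; columns: none/mild/moderate/severe)
table = np.array([[20, 12, 8, 2],
                  [2, 4, 10, 14]])
chi2, p, dof, expected = chi2_contingency(table)
phi2 = chi2 / table.sum()  # mean-square contingency (assumed definition of Ф2)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}, phi2 = {phi2:.3f}")
```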

Finally, the BRFail on ATR cutoffs that showed the best overall classification accuracy (≥ 8, ≥ 9 and ≥ 10) were compared across patients with low head injury severity (i.e., uncomplicated mild TBI) and patients with high head injury severity (i.e., complicated mild, moderate and severe TBI). There was no difference in BRFail as a function of TBI severity (p: .824-.963; Table 7). Likewise, comparable BRFail was observed on the criterion measures [SVT-2A and SVT-2B (p: .461-.492) as well as the TOMM-1, VI-5, and EI-7 (p: .117-.508)].

Table 7 Failure rate (%) on various SVTs and PVTs as a function of TBI severity

Discussion

Overview of the Results

Assessing the credibility/validity of self-reported symptoms is of paramount importance in clinical and forensic settings (Sweet et al., 2021), and the ATR of the TSI-2 is one of the relatively few SVTs available to professionals working in the field of psychological injury and law (Giromini et al., 2022). Unfortunately, empirical research on its efficacy has been relatively sparse and inconclusive (Ales & Erdodi, 2022). Therefore, the current study was designed to empirically evaluate its classification accuracy against a commonly used SVT and a series of PVTs in a consecutive case sequence of 99 patients referred for neuropsychological evaluations in the context of motor vehicle collisions. We predicted that the ATR would produce a superior classification accuracy against SVTs compared to PVTs and that the optimal cutoff would be closer to ≥ 8 (Gray et al., 2010) than ≥ 15 (Briere, 2011).

Both hypotheses were supported by the data. The default cutoff (≥ 15) grossly underestimated the prevalence of non-credible symptom report (2%) within this sample relative to other SVTs (34–40%) or PVTs (40–45%). Given the strong correlation between BRFail, sensitivity and specificity (Dandachi-FitzGerald & Martin, 2022; Rai et al., 2023), it is not surprising that ATR ≥ 15 produced consistently poor classification accuracy (driven by dismal sensitivity) against both versions of the SVT-2 (.05-.11 sensitivity at 1.00 specificity and .624-.773 OCC) and the VI-5/EI-7 (.03-.04 sensitivity at .98–1.00 specificity and .598-.612 OCC). In contrast, at a BRFail of 25.3%, ATR ≥ 8 approximated the specificity standard (.89) against SVT-2; at a BRFail of 19.2%, ATR ≥ 9 produced a good combination of sensitivity (.35-.53) and specificity (.92-.95), at .710-.786 OCC. Similarly, ATR ≥ 9 was specific (.90-.93) to invalid performance on measures of cognitive ability, albeit at low sensitivity (.26-.37). Therefore, ≥ 9 emerged as the optimal cutoff on the ATR that provides a reasonable balance between high (≥ .90) specificity and sensitivity (.26-.53) using both SVTs and PVTs as criterion measures. It should be noted, however, that an ATR ≥ 9 still only detects between a quarter and half of the sample with independent psychometric evidence of non-credible clinical presentation.

Clinical/Forensic Implications

Taken together, these results converge in a number of practical conclusions: 1) The default cutoff of ≥ 15 provides a highly biased estimate of the prevalence of invalid symptom report, detecting only 5–11% of the profiles identified as non-credible by other SVTs. Therefore, its use in clinical and forensic settings cannot be justified due to unacceptably high (90–95%) false negative rates. 2) Alternative cutoffs offer an opportunity to recalibrate the classification of the ATR and provide a more balanced trade-off between sensitivity and specificity. The more liberal cutoff of ≥ 8, although it technically fell short of the .90 specificity standard (.89 against both versions of the SVT-2), provided the single best OCC, correctly classifying between 71 and 82% of the sample. As such, it can be considered the first level of failure, and has the potential to serve as a screening cutoff (i.e., help rule in non-credible symptom report). The next level of cutoff (≥ 9) had uniformly high specificity (.90-.95) against a range of SVTs and PVTs as criterion measures. Finally, an ATR score ≥ 10 was associated with consolidated specificity (.95-.96), indicating a level of symptom report that is likely invalid. 3) Although the ATR was a weaker predictor of PVTs relative to SVTs as criterion measures (consistent with our prediction and the results of previous research), the specificity of ≥ 9 was invariant to the type and composition of the criterion measure. In other words, failing this cutoff suggests a globally invalid clinical presentation, consistent with earlier reports that sufficiently extreme response styles override the modality specificity effect (Rai & Erdodi, 2021) and become significant predictors of invalid presentation in different domains of clinical assessment (Holcomb et al., 2022a). 4) Elevations on the ARDPAI and PTSTSI-2 were associated with symptom overreport on other scales. The majority of patients (70–100%) with extreme scores (T ≥ 90 and ≥ 80, respectively) had independent evidence of symptom magnification. 5) Elevations on the ARDPAI and PTSTSI-2 were unrelated to PVT outcomes, suggesting that the credibility of self-reported PTSD symptoms and cognitive deficits observed on performance-based tests may be orthogonal and, therefore, should be evaluated independently (Sabelli et al., 2021).

Mathematically, an ATR cutoff of ≥ 9 allows an examinee to endorse all of the items at the first severity level above Never, or half of the items at the severity level above that, and still have the response set deemed valid. Phenomenologically, the ATR’s item content [neurologically (global amnesia) or physiologically (inability to meet basic needs for prolonged periods of time; medically unexplained severe dysfunction of the autonomic and/or peripheral nervous system) implausible severe impairments that are incompatible with normal functioning] seems to consist of a series of pathognomonic signs of non-credible symptom report. In other words, a qualitative review of the symptoms used to determine the validity of the response set suggests that ≥ 9 constitutes a sufficiently conservative demarcation line between credible and non-credible response sets. Assessors can invoke this argument in defense of their interpretation of a score in the failing range on the ATR, on top of the classification accuracy statistics.
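To make the arithmetic explicit (the ATR comprises eight items, each rated 0–3): endorsing all eight items at the lowest non-zero severity level, or four items one level above that, both sum to 8, just below the ≥ 9 cutoff:

$$ 8 \times 1 = 8 < 9 \qquad \text{and} \qquad 4 \times 2 = 8 < 9 $$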

In addition, results displayed in Table 6 reveal that the PTSTSI-2 is more susceptible to contamination by non-credible symptom report compared to the ARDPAI. A larger proportion of variance in T-scores was explained by failing both univariate (38–50% versus 23–30%) and multivariate (38–57% versus 25–41%) SVTs on the PTSTSI-2 relative to ARDPAI. Likewise, while a trivial (and statistically non-significant) amount of variance was captured by PVT failures on the ARDPAI (1–4%), these values were higher (6–13%) on the PTSTSI-2. If these findings were to be replicated by future research, they could inform both test selection and interpretation.

Given that 92–100% of the patients who scored T ≥ 75 on the PTSTSI-2 also had strong psychometric evidence of non-credible clinical presentation, these results implicitly validate this cutoff as an emerging alternative embedded SVT within the TSI-2. Our findings suggest that extreme scores (i.e., T ≥ 75) on the PTSTSI-2 are specific to invalid symptom report. Therefore, a score in this range should be interpreted with caution. Namely, alternative explanations (i.e., invalid responding) should be ruled out before considering such a score evidence of genuinely elevated posttraumatic stress. Naturally, future cross-validation research is needed to determine the generalizability of these results to other samples with different clinical characteristics, using different criterion grouping methods.

Results in the Context of Previous Research

Although there are no universally accepted standards for assessing the efficacy of an SVT, the results of recent meta-analytic studies suggest that widely used embedded SVTs, such as the validity scales of the MMPI-2-RF or the validity scales of the PAI, typically yield classification accuracy statistics similar to those observed for the ATR in our study. For example, a frequently cited meta-analysis by Sharf et al. (2017) found that in assessing feigned mental disorders, the Fp-r of the MMPI-2-RF (arguably one of the most effective SVTs currently available; Burchett & Bagby, 2022; Giromini et al., 2022) has an average specificity of .92 and an average sensitivity of .45, at the commonly used cutoff of T ≥ 80 (a seemingly invariant trade-off between sensitivity and specificity dubbed the Larrabee limit; Crişan et al., 2021; Erdodi et al., 2014). From this point of view, the ATR appears to be a promising SVT at the alternative cutoff score of ≥ 9.

On the other hand, it may be premature to conclude that the ATR is about as valid as the other commonly used SVTs. Indeed, results suggest that at ≥ 9, the ATR identified between a third and half of the patients who failed the SVT-2, indicating non-credible symptom report. At the same time, ATR ≥ 9 demonstrated a level of sensitivity to elevations in ARDPAI and PTSTSI-2 that is comparable to that of the NIMPAI ≥ 81. This is a remarkable performance given that ARDPAI and NIMPAI are part of the same instrument. Combined with the fact that the ATR was also a significant predictor of PVT failures, it provides further support for the hypothesis that the ATR is sensitive to diffuse signs of non-credible clinical presentation that transcend domains (symptom versus performance validity) and instruments.

Likewise, the ATR and the NIMPAI explained a similar proportion of the variance in self-reported symptom severity on the ARDPAI (23% and 27%, respectively). However, the ATR explained a notably higher proportion of the variance than the NIMPAI in self-reported symptom severity on the PTSTSI-2 (50% versus 38%, respectively). Therefore, at least in terms of non-credible PTSD symptoms, the ATR demonstrated comparable sensitivity to the NIMPAI (Table 6).

These findings contradict earlier reports by Gray and colleagues (2010) that the NIMPAI outperformed the ATR in differentiating coached PTSD simulators and credible patients with PTSD. The discrepancy between the two studies reinforces the importance of calibrating SVTs on real-world patients with suspected symptom over-report who operate under significant financial incentives, rather than on experimental malingerers. The incentive structure in lab-based studies (i.e., participation is rewarded rather than the ability to mimic credible impairment; Abeare et al., 2021a, b; Erdal, 2012; Rai et al., 2019; van Helvoort et al., 2019) is meaningfully different from that in high-stakes medicolegal settings. The ultimate purpose of SVTs is to accurately detect non-credible symptom report in applied clinical or forensic settings (Fuermaier et al., 2023b).

The validity scales of the MMPI instruments have been the subject of meta-analytic studies summarizing the results of dozens of empirical studies (e.g., Ingram & Ternes, 2016; Rogers et al., 2003; Sharf et al., 2017), as have the validity scales of the PAI (e.g., Hawes & Boccaccini, 2009; Kurtz & McCredie, 2022). The SIMS and the IOP-29, two other widely used SVTs, have also been extensively researched and have been the subject of quantitative literature reviews (e.g., Giromini & Viglione, 2022; Shura et al., 2022) and extensive meta-analytic studies (e.g., Puente-López et al., 2023; van Impelen et al., 2014). In contrast, there are still a limited number of studies to date on the efficacy of the ATR. Until the knowledge base on this relatively rarely studied SVT consolidates, assessors should exercise appropriate caution when interpreting its results.

Consistent with emerging empirical findings (Holcomb et al., 2022a; Sabelli et al., 2021), the ATR showed better classification accuracy against the SVT-2 as a criterion than against PVTs (the VI-5 and EI-7). This finding is not surprising, as symptom validity and performance validity are commonly conceptualized as related but ultimately distinct constructs (Blavier et al., 2023; De Boer et al., 2023; Giromini et al., 2020; Larrabee, 2012; Merten et al., 2020; Sabelli et al., 2021). Indeed, as Giromini et al. (2022) pointed out, “The optimal criterion variables in SVT research are SVTs, or maybe SVTs combined with PVTs, but not PVTs alone” (p. 13). Therefore, additional research using other SVTs to further investigate the efficacy of the ATR in detecting symptom invalidity would be beneficial.

The ATR Scale as a Screening Tool

In this context, it is remarkable that ATR ≥ 9 had similar specificity against both SVTs and PVTs (although lower sensitivity to the latter), suggesting that the ATR taps a common source of non-credible presentation affecting both performance on cognitive tests and the pattern of self-reported symptoms (Bianchini et al., 2005, 2014; Gervais et al., 2007, 2011; Merten et al., 2022; Richman et al., 2006; Tarescavage et al., 2013; Tylicki et al., 2021; Young, 2020). If replicated by future research, this feature may uniquely position the ATR scale to serve as a brief (potentially stand-alone) screener for the credibility of self-reported symptoms and deficits. Although administering the full TSI-2 in a setting in which assessors operate under high volume pressures would not be practical, the eight items that define the score on the ATR can be administered and scored in under one minute. In other words, the score on the ATR can serve as a quick and rough estimate of symptom validity to inform downstream decisions about further in-depth assessment or treatment planning, similar to the screening function the Mini-Mental State Examination (Folstein et al., 1975) serves for the presence/absence of cognitive deficits (Erdodi et al., 2020; Mitchell, 2009, 2017; Tsoi et al., 2015).

Equally important, failing the ATR was unrelated to TBI severity, as was the case for all of the criterion measures. This negative finding can serve to pre-empt attempts to discount scores in the failing range on the ATR (or other SVTs and PVTs) by invoking contamination by genuine and severe trauma. Although the emotional salience of a motor vehicle accident cannot be accurately captured by physical parameters (force of the impact, whether airbags deployed, amount of damage to the car, etc.) alone, in the context of motor vehicle collisions, psychological trauma and TBI severity likely have an inverted U-shaped relationship. In other words, the intuitive positive linear relationship between these two factors eventually reverses: once the impact results in a sufficiently severe TBI, the significant peritraumatic amnesia typically associated with such an injury effectively erases accident-related memories. However, this may not hold true for other traumatic experiences involving irreversible, tragic losses (e.g., death of a loved one during the accident) that induce a secondary trauma unrelated to the experiential aspects of the collision.

Limitations

Results should be interpreted in the context of the study’s limitations. First, the sample was relatively small and restricted to a single region of Canada and to a medicolegal context, so additional replications with larger samples of patients from different geographic locations (Lichtenstein et al., 2019), with different clinical characteristics, assessed in non-forensic medical contexts are needed before the TSI-2 ATR can be fully endorsed as an all-purpose SVT. Second, and somewhat related, because the percentage of non-credible presentations in real-world clinical settings is likely greater than zero but typically lower than the 40% observed in the present sample (Young, 2015; Young et al., 2016), criterion-group studies tend to include a large number of credible cases but a smaller number of non-credible cases.

Given the influence of BRFail on classification accuracy (Dandachi-FitzGerald & Martin, 2022; Rai et al., 2023), differences in the prevalence of invalid profiles should be taken into account when interpreting divergent findings in future reports. Although the prevalence of non-credible presentation was remarkably consistent across domains (SVTs and PVTs) and instruments (SVT-2, TOMM-1, VI-5 and EI-7), as well as with previous estimates (Czornik et al., 2022; Larrabee et al., 2009; Merten et al., 2009, 2020; Richman et al., 2006), given the ubiquitous presence of significant external incentives to appear impaired within our sample, it is likely higher than what is typical in clinical settings (Merten et al., 2016; Puente-López et al., 2023; Young, 2015; Young et al., 2016). Therefore, the ATR’s classification accuracy and predictive power may differ in settings with different clinical characteristics and motivation status.

Finally, the present sample was restricted to patients born in Canada to control for the potential confounding effect of linguistic and cultural diversity (Boskovic et al., 2020; Crişan et al., 2023a; Dandachi-FitzGerald et al., 2023a; Erdodi & Lajiness-O’Neill, 2012). Future research examining the classification accuracy of the ATR in examinees with limited English proficiency (LEP) would greatly advance the knowledge base of symptom validity assessment. Although some instruments proved remarkably robust to LEP, concerns about this threat to the validity of SVT scores rightfully persist (Crişan, 2023).

Like all research based on criterion groups, our study provides stronger evidence on specificity than sensitivity (Chafetz, 2022). Another limitation common to all criterion-group studies is that the internal validity of our research design needs to be critically evaluated in the absence of a gold-standard method for establishing credible vs non-credible responding. That is, although we attempted to increase the internal validity of our study by using psychometrically sound PVTs and SVTs, it is possible that our criterion groups were themselves contaminated by classification errors. Notably, one of the SVTs (BDI-II) is a recently introduced measure of symptom validity (Fuermaier et al., 2023a, b) with no independent replication (although there is previous support for its sensitivity to non-credible symptom report; Wiggins et al., 2012). Nevertheless, despite these and possibly other limitations, our study contributes valuable empirical data from a real-world medicolegal sample that provide unique insights into the utility of the ATR in assessing symptom validity.

Conclusions

Overall, the ATR demonstrated its potential to serve as an effective SVT in medico-legal (and potentially general clinical) settings. However, the cutoff (≥ 15) proposed by Briere (2011) proved prohibitively conservative in the present sample, and grossly underestimated the prevalence of non-credible responding as operationalized by the SVT-2 (2.0% versus 34–40%). In combination with external incentives to appear impaired due to active engagement in personal injury litigation and a consistently high BRFail on both SVTs and PVTs (40–45%) within this sample, the most plausible interpretation of the very low BRFail on the ATR at ≥ 15 is that such a highly conservative cutoff artificially suppresses the detection of invalid symptom report. Using this cutoff in survivors of motor vehicle collisions is essentially giving examinees a Pass (Erdodi, 2023). In other words, the choice of cutoff can strongly influence the outcome of the evaluation, lowering the likelihood of detecting invalid response sets below what most assessors consider reasonable. Results suggest that the decision to use the default cutoff (≥ 15) on the ATR essentially sets the false negative rate to around 90–95%.

Conversely, even though the cutoff (≥ 8) recommended by Gray and colleagues (2010) falls short of the .90 specificity standard, it may be useful for screening purposes. As a compromise, a cutoff of ≥ 9 seems to provide sufficient specificity to both invalid symptom report and cognitive performance (.90-.95), while maximizing sensitivity (.26-.53). These findings are consistent with trends observed in PVTs: post-publication research often reveals that the originally proposed cutoffs were overly conservative, and more liberal cutoffs would optimize overall classification accuracy (Ashendorf et al., 2021; Erdodi et al., 2018, 2023; Martin et al., 2020; Poynter et al., 2019).

As is the case with all measures of performance and symptom validity, assessors should not rely on the ATR as the sole indicator to establish the credibility of the entire clinical presentation (APA, 2013, p. 15). However, the ATR can serve as an effective screener of symptom validity, providing the first valuable data point in the assessment process. Its potential for providing incremental information when combined with other SVTs is worth investigating further. Given that the ATR only contains eight items that are quick and easy to administer and score, clinicians operating under time constraints may choose to administer it on its own as a stand-alone SVT – provided that a thorough assessment of trauma-related distress is not a central/immediate goal of the evaluation.