Modern validity assessment is important in evaluating the credibility of patients’ self-report of symptomatology and functional impairment. Symptom validity tests (SVTs) aim to detect the presence of possible response bias, in particular overreporting of psychological problems or other forms of invalid response patterns (such as irrelevant or careless responding). SVTs are either freestanding instruments or scales embedded into more complex personality or symptom inventories. For a recent review, see the special issue 1/2022 of Psychological Injury and Law and the introduction by Giromini et al. (2022). The credibility of neuropsychological test profiles is tapped by performance validity tests (PVTs). Current guidelines recommend including multiple SVTs and PVTs in an evaluation protocol to reach a conclusion regarding symptom validity (Sherman et al., 2020; Sweet et al., 2021). The detailed analysis of the state-of-the-art of SVT research given by Sherman et al. (2020) leads to the conclusion that much more research into the diagnostic capacity of available self-report validity measures is needed.

Despite a generally elevated prevalence of noncredible symptom reports in forensic contexts, criminal forensic evaluations and examinations in prison settings play only a minor role in the ever-growing literature on validity research. Yet, the inclusion of symptom and performance validity tests in such contexts appears to be as much indicated as in civil forensic evaluations (Fazio & Denney, 2018). Secondary gain expectations and hidden agendas may play an essential role there and determine the presence of distorted symptom presentations.

Results of validity testing are important to evaluate the credibility of symptom report. However, whether or not positive results are due to an act of malingering is a question going beyond what can be answered by tests. For prison populations, Resnick and Knoll (2018) emphasized that the label of “malingerer” may have dire and long-lasting consequences for an inmate with genuine psychopathology (e.g., disciplinary actions, denial of needed treatment, disregard for future complaints). The authors distinguished between illicit reasons for distorted symptom presentations (such as obtaining medications to abuse or sell, avoiding disciplinary action, or being transferred to better living accommodations) and what may be considered adaptive coping strategies (e.g., an inmate may seek a more protected environment of a mental health unit when harassed by other inmates).

Some other aspects of validity assessment in prison settings appear to be noteworthy. (1) There seems to be a difference in prevalence rates indicative of a higher risk of noncredible responding in pre-trial and pre-sentence criminal defendants when compared to inmates serving their sentence (Denney & Fazio, 2021), particularly in pre-trial defendants with high psychopathy scores (Cima et al., 2009). This probably results from an intent to feign mitigating circumstances (which may reduce the length of a possible sentence) or to be considered incompetent to stand trial. (2) Obtaining prescription drugs appears to be a frequent motivation for overreporting symptoms (Resnick & Knoll, 2018). (3) A low level of education has been associated with a higher risk of noncredible responding, possibly due to less sophisticated decision-making abilities coupled with a limited capacity to successfully feign mental illness (Cornell & Hawk, 1989; Norris & May, 1998).

The prevalence of noncredible presentations in neuropsychological examinations was estimated to reach 23% for criminal-forensic cases (Mittenberg et al., 2002). Ardolf et al. (2007) analyzed pre-trial and pre-sentence data from the neuropsychological assessment of 105 male examinees and obtained rates of 21.9% for definite malingering, 32.4% for probable malingering, and 21.9% for possible malingering. Vitacco et al. (2007) found a prevalence of probable malingering of 21% within a sample of 118 forensic patients undergoing competence-to-stand-trial evaluation in a Midwestern forensic hospital. McDermott et al. (2013) performed a study with participants from an inpatient psychiatric hospital and a jail facility. They analyzed data from 879 patients found incompetent to stand trial and sent to a state hospital for treatment, and 473 inmates seeking psychiatric services. Using the Miller Forensic Assessment of Symptoms Test (Miller, 2001) and the Atypical Presentation Scale (Gothard et al., 1995) for the patient sample, and the Structured Interview of Reported Symptoms (SIRS; Rogers et al., 1992) for inmates, the authors reported base rates of noncredible responding of 17.5% and 64.5%, respectively. Fazio et al. (2015) used the classification criterion of failure on at least two freestanding performance validity tests (PVTs) and reported a prevalence of 51.9% of malingered neurocognitive dysfunction among 109 male inmates evaluated for competency to stand trial.

However, in their review of research into the detection of malingering of intellectual disability in criminal contexts, Salekin and Doane (2009) analyzed a variety of commonly used symptom and performance validity tests and concluded that none of them appeared to be sufficiently robust against false positives in that context.

Using the Self-Report Symptom Inventory (SRSI; Merten et al., 2016), van Helvoort et al. (2019) found no elevated pseudosymptom endorsement in forensic psychiatric inpatients. These authors included only participants without a history of proven or suspected invalid responding in the past. This appeared to demonstrate that the presence of genuine psychopathology per se did not cause SVT failure.

A systematic review of validity testing in criminal forensic contexts and prison populations is still missing and factors responsible for the large variability in validity test failure rates have not been systematically studied. Yet, there appears to be no doubt that validity tests should be regularly included in assessment protocols in order to enable the assessor to distinguish more reliably between credible and noncredible symptom presentations (e.g., Gottfried & Glassmire, 2016).

The current study primarily aimed to analyze the concurrent validity of three different SVTs in a sample of prison inmates: the Self-Report Symptom Inventory (SRSI; Merten et al., 2016), the Structured Inventory of Malingered Symptomatology (SIMS; Smith & Burger, 1997; Widows & Smith, 2005; Portuguese version by Simões et al., 2017a), and the Symptom Validity Scale–Version 2 (EVS-2; Simões et al., 2017b). Their discriminant validity was also analyzed, using a general measure of psychopathology, the Brief Symptom Inventory (BSI; Derogatis & Spencer, 1982; Portuguese version by Canavarro, 2007). It was hypothesized that statistically significant strong correlations would be found between all three SVTs and that weaker correlations would be found between each SVT and the BSI.

SVTs should be largely insensitive to sociodemographic variables, but past research has partly obtained results which point to the contrary (e.g., Cornell & Hawk, 1989; Giger & Merten, 2013; Merten et al., 2016; Norris & May, 1998; Schoemaker et al., 2019; Silva, 2022). Also, the results of SVTs should be largely independent of the presence of genuine psychopathology as long as the examinees respond honestly. To minimize mistakes in the decision-making process, it is important to know the possible effects of sociodemographic, legal, and health-related factors on validity test results. Yet, the number of studies addressing these issues is limited. Both Cornell and Hawk (1989) and Norris and May (1998) found a tendency for examinees clinically classified as “malingerers” to have a lower level of education when compared to “nonmalingerers.” The latter authors also found a lower mean age for the group classified as malingerers. Giger and Merten (2013) used multiple regression analyses on data obtained from 100 Swiss native speakers of German. The data indicated the effects of both age and verbal intelligence on several SVTs and PVTs, with overall lower performances at older ages and higher performance for participants with higher verbal intelligence. In the same sample, an analysis of the SRSI results found a significant correlation with educational level, with less educated examinees endorsing a higher number of pseudosymptoms (Merten et al., 2016). More recently, Schoemaker et al. (2019) also found a significant correlation between the presence of response bias and a low level of education.

In light of these data, we anticipated a statistically significant difference in the number of endorsed bogus symptoms depending on the participants’ age, educational level, and conviction status (pre-trial vs. post-trial). Specifically, we expected participants with a lower educational level as well as participants on remand to endorse more bogus symptoms when compared to those with higher education and post-trial status. Regarding age, we expected a statistically significant difference in bogus symptom endorsement between younger and older participants and aimed to test if our results would support Norris and May’s (1998) or Giger and Merten’s (2013) findings. Using data collected from three prison establishments, comprising a pooled sample of 240 male inmates, we studied the possible effects of these sociodemographic and legal variables on SVT results.



An estimate of the total prison population initially eligible for the study gave a number of N = 894. The large majority of the prison inmates were male, so the study was confined to participants of male gender, of whom 297 were approached for participation by the prison staff, on the basis of the following exclusion criteria: (i) difficulties understanding the Portuguese language, (ii) illiteracy or limited Portuguese reading ability, (iii) clinically obvious cognitive impairment, (iv) an educational level below the first cycle (i.e., under four years of formal schooling). A number of inmates (n = 49) refused to participate, and for some of the participants (n = 8), the evaluations had to be discontinued because of self-declared or observed difficulties in understanding Portuguese to a sufficient degree or because of limited reading abilities (although they had initially been included as potentially eligible participants).

The final sample for the analyses consisted of N = 240 male inmates, recruited from the Coimbra (n = 85), Guarda (n = 100), and Aveiro (n = 55) prison establishments. The pooled sample was aged 18 to 78 years (M = 40.0; SD = 11.8), mostly of Portuguese nationality (93.8%; n = 225). The remaining participants were of Angolan (n = 6), Brazilian (n = 2), French (n = 2), Venezuelan, Cape Verdean, Spanish, Luxembourgish, and Sao Tomean origin (n = 1 each). In sum, 97.9% of the participants were from countries with Portuguese as the official language. In terms of marital status, 131 respondents reported being single (54.6%), 62 were married (25.8%), and 46 were divorced (19.2%), with only one respondent being widowed (0.4%). At the time of assessment, 29 participants had completed the first cycle of education (first to fourth grade of primary education; 12.1%); 50 had completed the second cycle (fifth and sixth grades of basic education; 20.8%); 102 had completed the third cycle (seventh to ninth grade of basic education; 42.5%); 48 had attained a secondary level of education (tenth to twelfth grade; 20.0%); and 11 had attended higher education (university) or earned an academic degree (4.6%). Of the 140 respondents surveyed (participants from the Coimbra and Aveiro prison establishments), 44 had been remanded for pre-trial detainment (18.3% of the pooled sample; 31.4% of the subsample), while 96 had been convicted and were serving their sentence (40.0% of the pooled sample; 68.6% of the subsample). No information on conviction status was available for the Guarda prison establishment.


A sociodemographic and legal data questionnaire was created for the present study to gather sociodemographic (age, nationality, marital status, household, educational level, and occupation) and legal information (e.g., conviction status, criminal background, and length of imprisonment sentence).

Results on the Portuguese version of the SRSI (Merten et al., 2016, 2022) were available for the total sample (N = 240). The SRSI is a self-report SVT composed of 107 dichotomous items which combine potentially genuine psychopathological symptoms with bogus symptoms. The instrument aims to detect noncredible symptom endorsement on a spectrum of “soft” psychopathology. Thus, it is focused on symptoms often claimed in social and civil law litigation where adjustment, depressive, anxiety, somatoform, and pain disorders are dominant. One hundred items are distributed between two main scales: the Genuine Symptoms scale and the Pseudosymptoms scale. Both main scales comprise five subscales of ten items each which represent different symptom domains (cognitive, depressive, pain, nonspecific somatic, and PTSD/anxiety symptoms, vs. cognitive, motor, sensory, pain, and anxiety/depression pseudosymptoms). Along with two initial warming-up items, five additional items were constructed to check for gross inconsistencies potentially caused by careless responding.

Two different cutoff scores are recommended for use in most assessment contexts. Pseudosymptom endorsement > 6 indicates a possible noncredible symptom report at the screening level (with an estimate of 83% sensitivity and 91% specificity), while a cutoff score > 9 is recommended for standard use (with an estimate of 62% sensitivity and 96% specificity). Numerous studies in different countries have provided evidence for good psychometric properties and classification results of the SRSI (e.g., Boskovic et al., 2018, 2019; Giger & Merten, 2019; Merten et al., 2022; van Helvoort et al., 2019, for a summary report of research findings). In Portugal, several unpublished master’s theses were conducted with the SRSI, both in community samples (Domingues, 2019; Dwarkadas, 2018) and in forensic settings (Pinheiro, 2019; Venâncio, 2021; Silva, 2022).
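
The two-tier cutoff logic described above can be illustrated with a minimal sketch. The function name and the returned labels are hypothetical and chosen for illustration only; they are not part of the SRSI manual, and a score above a cutoff indicates a possibly noncredible symptom report, not malingering.

```python
# Cutoffs as described in the text; names and labels are illustrative assumptions.
SCREENING_CUTOFF = 6  # more than 6 endorsed pseudosymptoms: screening level
STANDARD_CUTOFF = 9   # more than 9 endorsed pseudosymptoms: standard use


def classify_srsi(pseudosymptom_count: int) -> str:
    """Return a coarse label for an SRSI Pseudosymptoms total (0 to 50 items)."""
    if not 0 <= pseudosymptom_count <= 50:
        raise ValueError("SRSI Pseudosymptoms totals range from 0 to 50")
    if pseudosymptom_count > STANDARD_CUTOFF:
        return "above standard cutoff"
    if pseudosymptom_count > SCREENING_CUTOFF:
        return "above screening cutoff only"
    return "below both cutoffs"
```

Because both cutoffs are strict inequalities, a total of exactly 6 or exactly 9 does not count as exceeding the respective cutoff.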

Results on the SIMS (Widows & Smith, 2005; Portuguese version by Simões et al., 2017a) were available for a partial sample (n = 190; all participants from the Coimbra and Aveiro prisons, and half the participants from the Guarda prison). The SIMS is one of the most widely used freestanding SVTs. The instrument is composed of 75 dichotomous items describing bizarre, extreme, or atypical symptoms, evenly grouped into four dimensions of psychopathology—psychosis, neurological impairment, amnestic disorders, and low intelligence, with 15 items each. The fifth scale, affective disorders, comprises genuine depressive symptoms and uses the principle of indiscriminate overreporting of common health complaints (as known from the Fake Bad Scale; Lees-Haley et al., 1991).

The instrument is mostly considered as a screening measure for noncredible symptom report (see Shura et al., 2021 and van Impelen et al., 2014, for two meta-analytic studies). The original cutoff score of > 14 endorsed items has been repeatedly met with criticism, prompting several authors to suggest higher cutoff scores of 16, 19, 21, or even 24 in order to reduce false positives (e.g., van Impelen et al., 2014; Wisdom et al., 2010). The Portuguese version of the SIMS was studied both in population-based and forensic samples and demonstrated good concurrent and discriminant validity (Simões et al., 2017a). A cutoff score of 16 (with estimated values ranging from 72 to 100% for its sensitivity and from 59 to 100% for its specificity; cf. van Impelen et al., 2014) is recommended for the Portuguese version as well as for a number of other foreign language versions.

Results on the EVS-2 (Simões et al., 2017b) were available for a partial sample (n = 104; all participants from the Aveiro prison and half the participants from the Guarda prison, with one exclusion because the inmate discontinued participation). The EVS-2 is a 48-item questionnaire with dichotomous response format developed in Portugal between 2010 and 2015. Starting with a pool of 385 statements established from a review on existing malingering detection scales, those items found to be relevant were reformulated. The resulting 48 items can be grouped into three dimensions of feigned or exaggerated symptoms: emotional disorders (14 items), psychosis (20 items), and cognitive disorders (14 items). Total scores above the established cutoff score of > 17 endorsed items are suggestive of a noncredible symptom report. A ROC analysis yielded an estimated 64% sensitivity and 99% specificity when using SIMS scores > 24 as external criterion.

Various validation studies were carried out with forensic and population-based samples, obtaining respectable to excellent reliability values and good concurrent validity results which demonstrated the instrument’s potential as a validity measure (Simões et al., 2017b).

Results on the BSI (Derogatis & Spencer, 1982; Portuguese version by Canavarro, 2007) were available for a partial sample (n = 155; participants from the Guarda and Aveiro prisons). The BSI is a self-report measure of psychopathological symptoms which comprises 53 items encompassing nine symptom domains: somatization, obsession-compulsion, interpersonal sensitivity, depression, anxiety, hostility, phobic anxiety, paranoid ideation, and psychoticism. Responses are given on a five-point Likert scale rating the degree to which symptoms have bothered the examinee during the previous week (from “not at all” to “extremely”). In addition to each dimension’s individual scores, the instrument yields three global indices of psychopathological symptomatology: the Global Severity Index, the Positive Symptom Distress Index, and the Positive Symptom Total.

The Portuguese version of the BSI obtained moderate to strong intercorrelations between all indices and subscales (Pearson’s correlations ranging from 0.38 to 0.91), and good discriminant, predictive, and concurrent validity results (Canavarro, 2007). Scores above a cutoff score of 1.7 on the Positive Symptom Distress Index are interpreted as indicating the presence of relevant psychopathology (Canavarro, 2007).


The data presented here stem from three different subsamples studied by Pinheiro (2019), Silva (2022), and Venâncio (2021) in the context of their master’s theses, in three different Portuguese prison establishments. For the purpose of this article, the relevant data were merged. The examinations were conducted by the above-named authors. Testing was individual and took place in rooms designated for inmate consultation in two of the correctional facilities, while for the third correctional facility, it took place in groups of four to eight respondents, in the visitation area (due to time and space limitations there). After obtaining approval from the Directorate-General for Reintegration and Prison Services and from the directors of each prison, potential candidates for participation were approached by the prison staff, on the basis of the exclusion criteria described above. Volunteers were fully informed that all information was dealt with confidentially and solely used for scientific purposes and that no information about individual test results would be given to the prison authorities or the medical and therapeutic staff.

Following data collection, test scoring for each individual sample was handled by the authors of the master’s theses. Statistical analysis for the current study was conducted with the use of IBM SPSS Statistics Version 27.0.1 for Windows.


Descriptive Statistics

The results in Table 1 show a substantially higher mean of endorsed genuine symptoms (M = 16.27, Mdn = 15, SD = 10.60) than pseudosymptoms (M = 4.43, Mdn = 2, SD = 7.21) on the SRSI. Anxiety was the most commonly endorsed domain of genuine symptoms, while pain was the least common. For pseudosymptoms, mental was the most prevalent domain, and motor was the least prevalent. SIMS total results showed a mean of 11.27 endorsed items (Mdn = 9, SD = 9.14), with affective disorders as the most prevalent psychopathological dimension and psychosis as the least prevalent. As for the EVS-2 total results, a mean of 7.72 endorsed items (Mdn = 3.00, SD = 6.79) was obtained, with emotional disorders as the dimension with most commonly endorsed problematic symptoms, and cognitive disorders as the dimension with the lowest number of endorsed items.

Table 1 Descriptive analysis, central tendency, and dispersion of the SRSI, the SIMS, and the EVS-2 in Portuguese prison settings (N = 240)

There was a wide range of individual raw scores in the distributions, with a minimum of one or zero endorsed items for the SRSI Genuine Symptoms and Pseudosymptoms scales as well as for the SIMS and EVS-2 total scores. In contrast, very high symptom endorsement was observed for some participants, reaching up to scores of 46 and 48 items (out of 50) for the SRSI scales, 56 (out of 75) for the SIMS, and 34 (out of 48) for the EVS-2.

SVT Concurrent Validity

As expected, Spearman nonparametric correlations revealed a strong association between all symptom validity measures (rho ranging from 0.72 to 0.79, p < 0.001; see Table 2). These data establish concurrent validity estimates for the three SVTs. Separate rank correlations also revealed moderate to strong associations between SVTs and summary symptom scales from the BSI and the SRSI Genuine Symptoms total scale (rho estimates from 0.43 to 0.81; Table 3).

Table 2 Estimated values of rank correlation coefficients between the SRSI total pseudosymptoms, SIMS total scores, and EVS-2 total scores
Table 3 Estimated values of rank correlation coefficients between the SVT results and the BSI symptom scales
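
For readers unfamiliar with rank correlations, the following sketch shows one way Spearman's rho can be computed: tied scores receive the average of their ranks, and rho is the Pearson correlation of the two rank vectors. It is a plain-Python illustration, not the SPSS routine used in the study, and the function names are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

Average ranks matter here because questionnaire totals are integer counts, so ties between examinees are frequent.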

SVT Failure Base Rates and ROC Analyses

Table 4 shows the number of positive SVT results using the screening and standard cutoff scores for the SRSI (> 6 and > 9 endorsed pseudosymptoms, respectively), two different cutoff scores for the SIMS (> 16 and > 24 endorsed items), and the recommended cutoff score for the EVS-2 (> 17 endorsed items). Base rates of SVT failure varied from 7.9 to 19.2% depending on the measure and applied cutoff score. The more conservative cutoff scores of the SRSI and the SIMS yielded failure rates of 13.3% and 7.9% respectively. On the EVS-2, 9.6% of the participants scored above the recommended cutoff score. No gold standard was defined to distinguish between honest respondents and participants with noncredible symptom endorsement; thus, true and false positive rates are unknown.

Table 4 Number of positive cases and base failure rates for all SVTs, with different cutoff scores

Separate ROC analyses were performed on the scores of (1) the SRSI pseudosymptoms scale using the SIMS cutoff score > 24 as external criterion, (2) the SRSI pseudosymptoms scale using the EVS-2 cutoff score > 17 as external criterion, (3) the SIMS using the SRSI cutoff score > 9 as external criterion, (4) the SIMS using the EVS-2 cutoff score > 17 as external criterion, (5) the EVS-2 using the SRSI cutoff score > 9 as external criterion, and (6) the EVS-2 using the SIMS cutoff score > 24 as external criterion. All six models performed with excellent predictive accuracy, yielding area under the curve (AUC) values over 0.9 (respectively, 0.961, 0.954, 0.960, 0.978, 0.968, and 0.920, at a 95% confidence interval; see Figs. 1, 2, 3, 4, 5, and 6). Table 5 shows the sensitivity and specificity estimates of the screening and standard cutoff scores for the SRSI, two different cutoff scores for the SIMS, and the recommended cutoff score for the EVS-2, based on each model.

Fig. 1 Receiver operating characteristics curve of total SRSI Pseudosymptoms scores with SIMS > 24 cutoff score as external criterion
Fig. 2 Receiver operating characteristics curve of total SRSI Pseudosymptoms scores with EVS-2 > 17 cutoff score as external criterion
Fig. 3 Receiver operating characteristics curve of total SIMS scores with SRSI > 9 cutoff score as external criterion
Fig. 4 Receiver operating characteristics curve of total SIMS scores with EVS-2 > 17 cutoff score as external criterion
Fig. 5 Receiver operating characteristics curve of total EVS-2 scores with SRSI > 9 cutoff score as external criterion
Fig. 6 Receiver operating characteristics curve of total EVS-2 scores with SIMS > 24 cutoff score as external criterion

Table 5 Sensitivity and specificity values for all SVTs, with recommended cutoff scores, based on the obtained ROC models
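
The logic of using one SVT's cutoff as the external criterion for another SVT can be sketched as follows. The AUC is computed here as the proportion of criterion-positive/criterion-negative pairs in which the criterion-positive examinee has the higher score (ties counted as one half), which is equivalent to the area under the empirical ROC curve. The scores below are hypothetical and are not data from the study; the function names are likewise assumptions.

```python
def split_by_criterion(scores, criterion_positive):
    """Separate scores into criterion-positive and criterion-negative cases."""
    pos = [s for s, c in zip(scores, criterion_positive) if c]
    neg = [s for s, c in zip(scores, criterion_positive) if not c]
    if not pos or not neg:
        raise ValueError("need at least one case in each criterion group")
    return pos, neg


def auc(scores, criterion_positive):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos, neg = split_by_criterion(scores, criterion_positive)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, criterion_positive, cutoff):
    """Sensitivity and specificity of the rule 'score > cutoff'."""
    pos, neg = split_by_criterion(scores, criterion_positive)
    sensitivity = sum(s > cutoff for s in pos) / len(pos)
    specificity = sum(s <= cutoff for s in neg) / len(neg)
    return sensitivity, specificity


# Hypothetical totals for seven examinees (illustration only).
srsi_pseudo = [2, 4, 12, 30, 11, 1, 7]
sims_total = [5, 10, 26, 40, 20, 3, 30]

# External criterion: SIMS total above 24, as in the first model in the text.
criterion = [s > 24 for s in sims_total]

area = auc(srsi_pseudo, criterion)
sens, spec = sensitivity_specificity(srsi_pseudo, criterion, cutoff=9)
```

Since the criterion is itself another self-report SVT rather than an independent gold standard, the resulting estimates describe agreement between instruments rather than true classification accuracy.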

Influence of Sociodemographic and Legal Variables on SVT Results

The Effect of Age

Participants were first divided into three age groups for a more homogeneous distribution: 20 to 35 years old, 36 to 50 years old, and 51 to 70 years old. Results of the SVTs for the three groups are presented in Table 6. Rank correlations between age and the three SVT raw scores were computed. A statistically significant association was obtained only for the SIMS total scores (rho = 0.171, p = 0.020; weak association), while no statistically significant associations were found between age and the SRSI Pseudosymptoms scale or the EVS-2. Table 7 shows the results of the chi-square analyses on the SVT failure rates for each cutoff score according to age. The analyses were also carried out with Monte Carlo simulation; because the results did not differ in terms of statistical significance, only the chi-square values are reported. Statistically significant results were only found for the SIMS cutoff score > 16.

Table 6 Results of the three SVTs according to age and estimated values of rank correlation coefficients between each SVT and age
Table 7 Chi-square analyses of the failure rates for the three SVTs, according to age
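
A chi-square test of failure rates across groups, with a simulation-based p value, can be sketched as below. This is one common approach (randomly reassigning the pass/fail labels while keeping group sizes and the overall number of failures fixed); it is an assumption that this resembles the Monte Carlo option used in the study, and it is not the SPSS implementation. All names and the example data are hypothetical.

```python
import random


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one case")
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def monte_carlo_chi_square(groups, failed, n_sim=2000, seed=0):
    """Return (observed statistic, simulated p value) for group x failure."""
    rng = random.Random(seed)
    levels = sorted(set(groups))

    def table_for(labels):
        # One row per group: [count not failed, count failed].
        counts = {g: [0, 0] for g in levels}
        for g, f in zip(groups, labels):
            counts[g][1 if f else 0] += 1
        return [counts[g] for g in levels]

    observed = chi_square(table_for(failed))
    labels = list(failed)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(labels)  # margins stay fixed, cell counts vary
        if chi_square(table_for(labels)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p value strictly above zero.
    return observed, (hits + 1) / (n_sim + 1)


# Hypothetical example: two age groups with very different failure rates.
example_groups = ["younger"] * 10 + ["older"] * 10
example_failed = [True] * 10 + [False] * 10
stat, p_value = monte_carlo_chi_square(example_groups, example_failed,
                                       n_sim=500, seed=1)
```

A simulated p value is mainly useful when expected cell counts are small, which is plausible here given the low failure rates in some groups.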

The Effect of Educational Level

Participants were divided into five groups, according to their educational level: first cycle (first to fourth grade of primary education), second cycle (fifth and sixth grades of basic education), third cycle (seventh to ninth grade of basic education), secondary education (tenth to twelfth grade), and higher education (university or academic degree). Rank correlations between educational level and the three SVT raw scores were computed. Table 8 shows statistically significant associations for all SVTs as well as for the SRSI Genuine Symptoms scale (rho ranging from − 0.162 to − 0.283; weak associations). There is a clear tendency for an overall lower failure rate in higher educational levels when compared to lower educational levels, as a higher rate of bogus symptom as well as genuine symptom endorsement is shown in lower educational levels. The results of the chi-square analyses on the SVT failure rates for each cutoff score according to educational level are displayed in Table 9, showing no statistical significance. These analyses were also carried out with Monte Carlo simulation; because there was no difference in terms of statistical significance, only the chi-square results are reported.

Table 8 Results of the three SVTs according to educational level and estimated values of rank correlation coefficients between each SVT and education level
Table 9 Chi-square analyses of the failure rates for the three SVTs, according to educational level

The Effect of Conviction Status

Rank biserial correlations between conviction status and the three SVT raw scores were computed. The results show a statistically significant difference in the SIMS total results between inmates who were remanded for pre-trial detention and inmates who were convicted and serving their sentence, representing a weak correlational effect (rho = − 0.232, p = 0.006; see Table 10). No statistically significant differences were found for the SRSI Pseudosymptoms scale or the EVS-2 results between the two groups (respectively, p = 0.059 and p = 0.913).

Table 10 Results of the three SVTs according to conviction status and estimated values of rank correlation coefficients between each SVT and conviction status
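
A rank-biserial correlation between a dichotomous grouping variable and an ordinal score can be sketched as the proportion of between-group pairs in which the first group scores higher minus the proportion in which it scores lower. This is a minimal illustration under the assumption that the coefficient is computed in this pairwise form; the sign depends on which group is entered first, and the example values are hypothetical, not study data.

```python
def rank_biserial(group_a, group_b):
    """Rank-biserial correlation from all between-group pairs.

    Positive values mean group_a tends to score higher than group_b.
    """
    if not group_a or not group_b:
        raise ValueError("both groups need at least one score")
    higher = sum(1 for a in group_a for b in group_b if a > b)
    lower = sum(1 for a in group_a for b in group_b if a < b)
    return (higher - lower) / (len(group_a) * len(group_b))


# Hypothetical SVT totals for two small groups (illustration only).
remand = [4, 6, 9]
convicted = [1, 5, 6]
r_rb = rank_biserial(remand, convicted)
```

Tied pairs contribute to neither count, so heavily tied score distributions pull the coefficient toward zero.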


Due to a growing awareness of the need to include validity measures in clinical and forensic psychological assessment contexts, the aim of the present study was to test the SRSI, the SIMS, and the EVS-2 in a Portuguese prison setting. We collected data on their concurrent validity and analyzed the impact of specific sociodemographic and legal variables on their results. Combining the data obtained from three correctional facilities, a pooled sample of 240 male inmates was used to test the instruments’ psychometric properties. Concurrent validity was analyzed through correlations among the three SVTs, failure base rates were compared, and predictive accuracy was tested by means of ROC analyses.

As expected, the strong associations identified between all three SVTs (rho from 0.72 to 0.79) established their concurrent validity. These results were fully in line with the results of previous studies (e.g., Merten et al., 2016, 2022; Simões et al., 2017b).

The resulting SVT failure rate varied between 7.9 and 19.2%, depending on the instrument and the cutoff score used, but using only conservative cutoff scores reduced this range, resulting in rates from 7.9 to 13.3%. These numbers were more in line with those found by some previous studies for the prevalence of symptom overreporting in criminal settings (Cima et al., 2009; McDermott et al., 2013; Vitacco et al., 2007) and those proposed by Young (2015) in his review of forensic disability-related assessments than with those reported by McDermott et al. (2013) specifically for general offender samples in prison settings (64.5%). In the context of the present study, distinguishing between true positive results and false positives was not possible. Within the study design, we had no means of defining an independent and robust gold standard for distinguishing between true honest responders and participants with noncredible symptom endorsement. This is certainly a major limitation of the study and must be kept in mind when interpreting the results.

An investigation of the instruments’ predictive accuracy through ROC analyses yielded AUC results over 0.9 and overall high sensitivity (48 to 100%) and specificity (87 to 99%) estimates for the selected cutoff scores for every model run, which highlights the high concurrent validity of the three instruments.

According to expectations, the correlations found between the SVTs and the BSI global indices (rho from 0.43 to 0.68) were significantly weaker than those obtained when correlating the SVTs with one another, demonstrating discriminant validity. Despite these results, there were statistically significant positive correlations found with the BSI as well as a high correlation between SVTs and SRSI Genuine Symptoms endorsement. These results indicate a vulnerability of clinical symptom scales to produce elevated or extreme scores in patients who engage in overreporting, an inference which is supported by the results of previous studies (Boskovic et al., 2019; Dwarkadas, 2018; Giger & Merten, 2019; Pinheiro, 2019; Venâncio, 2021). This fact is also well known from extreme scores on the Beck Depression Inventory (BDI and BDI-II; Beck et al., 1996) signaling gross symptom exaggeration rather than genuine very severe degrees of depression (Czornik et al., 2022; Fuermaier et al., 2023; Groth-Marnat, 1990; Lees-Haley, 1989; Merten et al., 2020). However, in a study with a sample of forensic inpatients (all diagnosed with personality disorders and/or substance abuse problems and/or psychotic disorders), van Helvoort et al. (2019) demonstrated that symptom validity measures were not inevitably elevated in the presence of genuine psychopathology, as long as patients with previously proven manipulated symptom reports were excluded from the analyses. Additional studies with SVTs are needed, preferably with more detailed psychopathology measures and using samples with genuine symptoms but which are carefully checked for hidden or overt agendas for manipulation. The question of the robustness of existing SVTs to the presence of genuine symptomatology will remain high on the agenda of future research.

A number of studies have supported the assumption that SVTs, across different instruments, are insensitive to sociodemographic variables such as age, gender, and educational background (Giger & Merten, 2013). Other investigations, however, have obtained contradictory results (Cornell & Hawk, 1989; Giger & Merten, 2013; Norris & May, 1998), possibly linking less sophisticated decision-making abilities and reduced proficiency at successfully feigning psychopathology with higher probabilities of symptom overreporting. Rank correlation coefficients in our study revealed statistically significant associations of the SIMS total score with age (rho = 0.171, p = 0.020) and conviction status (rho = −0.232, p = 0.006), and of all three SVT scores with educational level (rho ranging from −0.162 to −0.283). Based on our data, the probability of SVT failure was substantially higher at lower education levels (roughly 13 to 21% for the first cycle and 11 to 20% for the second cycle) than at higher education levels (approximately 4 to 7% for secondary education and 0 to 11% for higher education). This result was consistent across all three SVTs. It is, however, not possible to ascertain whether the primary cause of the higher failure rates in lower education groups lies in the comprehension of questionnaire items. The complex structure of some SIMS items, with double negation, conditional sentences, and other potentially ambiguous contents, may be at the root of distorted scores. While it is possible that participants with lower education generally engage more frequently in overreporting (reflected by true positive SVT scores), other potential causes should be considered, such as lower familiarity with questionnaire formats or an unsophisticated approach to impression management.

Chi-square analyses of the association between SVT failure rates and educational level yielded no statistically significant results, which may be due to the relatively small size of some groups and the overall low failure rates.
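The point about small groups and low failure rates can be illustrated with a short sketch that computes the Pearson chi-square statistic for a failure-by-education table and lists the cells whose expected count falls below 5, the usual rule of thumb for when the chi-square approximation becomes unreliable. The counts, the row labels, and the function names below are invented for illustration; they are not the study data.

```python
# Hypothetical counts per education level: [failed SVT, passed SVT].
# These numbers are invented for illustration and are NOT the study data.
counts = [
    [6, 34],   # first cycle
    [5, 35],   # second cycle
    [2, 38],   # secondary education
    [1, 19],   # higher education
]


def chi_square_expected(table):
    """Return the Pearson chi-square statistic and the table of expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return stat, expected


def low_cells(expected, minimum=5.0):
    """List (row, column) positions whose expected count is below the minimum."""
    return [
        (i, j)
        for i, row in enumerate(expected)
        for j, value in enumerate(row)
        if value < minimum
    ]


stat, expected = chi_square_expected(counts)
df = (len(counts) - 1) * (len(counts[0]) - 1)
print(f"chi-square = {stat:.2f} on {df} degrees of freedom")
print("cells with expected count < 5:", low_cells(expected))
```

In this invented table, every expected count in the failure column is below 5 and the statistic stays well under the 5% critical value for three degrees of freedom (7.815), even though the observed failure rates differ visibly between groups. With tables of this kind, an exact test may be a more suitable choice.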

Regarding age groups, findings were less consistent. Only the SIMS displayed a clear pattern, with a higher probability of SVT failure among older examinees (approximately 5% for ages 20 to 35, 9% for ages 36 to 50, and 13% for ages 51 to 70), in line with the results obtained by Giger and Merten (2013). In chi-square analyses of the association between SVT failure rates and age, only the SIMS with a cutoff score of > 16 reached statistical significance; no other SVT or cutoff score did. Lastly, although confidentiality was assured, we assumed that prison participants, naturally inclined to distrust the system, would most likely feel the need to safeguard themselves or would still believe they might benefit from faking or exaggerating symptoms. The confidentiality assurance may have mitigated the effect of conviction status on SVT results, so that it might be smaller than in a genuine psychological assessment conducted for the court or by prison personnel, but we expected it to still be present. Our analysis confirmed this for the SRSI and the SIMS (approximately 16 to 18% probability of SVT failure for inmates on remand versus approximately 7 to 13% for convicted inmates), whereas the EVS-2 showed the reverse pattern. It should be noted that the EVS-2 was administered to a much smaller number of convicted inmates than the other two SVTs (because it was not included in the protocol for the Coimbra sample), which might explain this inconsistency in the results.

These results appear to suggest that SVTs may, to some degree, be sensitive to the examinees’ age and educational background, which should warrant further analyses in future studies. The differences in results between inmates on remand and inmates who are serving their sentence are also worthy of further investigation. Presumably, the perception that a mental illness diagnosis may help their case, representing attenuating circumstances that may alleviate a possible sentence or even render them incompetent to stand trial, motivates the behavior in pre-trial cases.

In summary, the present study obtained results which support the adequate psychometric properties of the SRSI, the SIMS, and the EVS-2 and established failure base rates for a Portuguese prison context. The higher failure rate in participants with very low educational backgrounds signals that questionnaire-based assessment may be flawed when the respondents have limited resources. In such cases, a higher rate of false-positive results may occur. Evidence for this hypothesis stems from a study by Graue et al. (2007) with the SIMS. The authors found a high rate of positive test scores not only in patients with limited intellectual capacity but also in a comparison group of ten community volunteers with a mean full-scale Wechsler Adult Intelligence Scale IQ of only 80.7 (SD = 9.1) and similarly low IQ scores predicted by their performance on the Wechsler Test of Adult Reading (M = 80.2, SD = 8.9). The mean SIMS score of the presumably honest controls in the Graue et al. (2007) study was as high as 18.3 (SD = 11.0).
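Why a higher false-positive rate matters in a setting with a low base rate of overreporting can be shown with Bayes' rule. The sketch below computes the positive predictive value (the probability that a failed SVT reflects actual overreporting) from sensitivity, specificity, and base rate. All input values are hypothetical and chosen only for illustration; they are not estimates from the present study or from Graue et al. (2007).

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """Bayes' rule: P(overreporting | positive SVT result)."""
    true_positives = sensitivity * base_rate
    false_positives = (1.0 - specificity) * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: same sensitivity and base rate, two specificity levels.
# The lower specificity stands in for a subgroup in which honest respondents
# fail the test more often (for example, because items are misunderstood).
for specificity in (0.95, 0.85):
    ppv = positive_predictive_value(sensitivity=0.80, specificity=specificity, base_rate=0.10)
    print(f"specificity={specificity:.2f} -> PPV={ppv:.2f}")
```

With these assumed values, a drop in specificity from 0.95 to 0.85 lowers the positive predictive value from 0.64 to about 0.37, so that most positive results in the lower-specificity subgroup would be false positives.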

Several limitations of the present study deserve comment. First, the sample comprised exclusively male participants, which limits the generalizability of the results to the female or nonbinary inmate population. Second, a larger sample would have been ideal, allowing for better representation of the extreme classes of age and educational level. Moreover, results on the SIMS and the EVS-2 were only available for a subsample of participants. Third, systematic data on the presence of mental disorders were not available for the sample. It would be interesting to include this factor in a follow-up study and investigate its effect on the SVTs included in the current study. Results from a previous study by van Helvoort et al. (2019) suggest that genuine psychopathology per se does not lead to elevated pseudosymptom endorsement as long as known sources of potential noncredible symptom claims are excluded. Fourth, the factor of low education needs further clarification. Psychometric measures of reading ability and intelligence would enhance studies into the important question of how far down common SVTs can be applied without producing an elevated number of false positives.

Given the limitations described above, future studies should expand participant selection to include non-male respondents and, ideally, increase sample sizes so that strata based on sociodemographic and legal variables are better represented. Because age, educational level, and conviction status were significantly associated with SVT results, future research should consider including them in sociodemographic and legal questionnaires.