Lying on the Dissection Table: Anatomizing Faked Responses

Research has shown that even experts cannot detect faking above chance, but recent studies have suggested that machine learning may help in this endeavor. However, faking differs between faking conditions, previous efforts have not taken these differences into account, and faking indices have yet to be integrated into such approaches. We reanalyzed seven data sets (N = 1,039) with various faking conditions (high and low scores, different constructs, naïve and informed faking, faking with and without practice, different measures [self-reports vs. implicit association tests; IATs]). We investigated the extent to which and how machine learning classifiers could detect faking under these conditions and compared different input data (response patterns, scores, faking indices) and different classifiers (logistic regression, random forest, XGBoost). We also explored the features that classifiers used for detection. Our results show that machine learning has the potential to detect faking, but detection success varies between conditions from chance levels to 100%. There were differences in detection (e.g., detecting low-score faking was better than detecting high-score faking). For self-reports, response patterns and scores were comparable with regard to faking detection, whereas for IATs, faking indices and response patterns were superior to scores. Logistic regression and random forest worked about equally well and outperformed XGBoost. In most cases, classifiers used more than one feature (faking occurred over different pathways), and the features varied in their relevance. Our research supports the assumption of different faking processes and explains why detecting faking is a complex endeavor.

Attempting to detect faking seems comparable to a pathologist's work when attempting to clarify the cause of sudden death. Both endeavors are important and time-consuming and must take various circumstances into account. Indicators may depend on the circumstances under which the deed occurred (e.g., Röhner et al., 2013), an enormous pool of data must be evaluated to answer the question, and incorrect decisions can have severe consequences. And obviously, both efforts are based on the assumption that transgressors leave traces that will unveil them.
Recent research has suggested that people use different approaches when they fake on psychological measures (e.g., Bensch et al., 2019). Thus, they may also leave different traces. As faking is multifold, its detection is still a challenge, and even experts often fail to detect fakers above chance (e.g., Fiedler & Bluemke, 2005). In this study, we reanalyzed seven data sets by using machine learning to investigate whether artificial intelligence can help to detect faking when faking occurs under different conditions.

Faking: An Unresolved Problem
In research and in applied settings, psychologists test hypotheses, explore behavior, and provide diagnoses. To do so, they typically have to rely on the sincerity of the people who participate in psychological assessments. Thus, an important quality criterion of psychological measures is their non-fakeability (e.g., Moosbrugger & Kelava, 2020). But an immense body of research has shown that people are able to fake on psychological measures (e.g., Birkeland et al., 2006; Viswesvaran & Ones, 1999). Even going beyond classical tests, measures that had originally been considered to be immune to faking (e.g., Implicit Association Tests; IATs; Greenwald et al., 1998) have turned out to be fakeable (e.g., Röhner et al., 2011; Röhner & Lai, 2021). As faking results in changes in test scores and rank orders, it is a serious problem that can impair the validity of tests (e.g., Salgado, 2016; see Ziegler et al., 2012, for an overview), and this impairment of validity may be higher for construct validity than for criterion validity (e.g., Ones & Viswesvaran, 1998; Ziegler & Buehner, 2009).

Faking Detection as a Solution?
The goal of detecting faked scores in psychological measurement has been pursued for more than 100 years now (Sackett et al., 2017). A variety of approaches have been tested, including the implementation of scales that aim to measure the tendency to create favorable impressions (e.g., Paulhus, 2002) or the inspection of response latencies (e.g., Holden & Lambert, 2015). So far, though, none of these procedures has become widely accepted. Some procedures have been criticized for carrying their own risks (e.g., erroneously suspecting people high in conscientiousness to be fakers; Uziel, 2010, see also Röhner & Schütz, 2020). Others can only be applied to a very restricted group of measures (e.g., Röhner et al., 2013), or their applicability depends on measurement conditions (e.g., Röhner & Holden, 2021). Apparently, it is not as easy to detect faking as one might assume at first glance.

The Complexity of Faking
Faking is affected by a complex interplay of conditions (e.g., Goffin & Boyd, 2009; Tett & Simonet, 2011; see also Röhner & Schütz, 2019) and can be pursued via different pathways (e.g., Bensch et al., 2019; Röhner et al., 2013). Faking detection is based on the idea that fakers leave telltale traces. However, if faking can be done in various ways and is impacted by conditions, faking detection is a complex endeavor in which different faking conditions have to be taken into account.

The Impact of Measures
Faking varies between measures (e.g., Röhner et al., 2011; Ziegler et al., 2007). For example, faking on self-reports includes decoding the items and choosing one's responses according to the impression one wants to make (e.g., faking good vs. faking bad). By contrast, faking on IATs involves decoding the measurement procedure, which is based on reaction times (and error values [i.e., correct or erroneous responses]), and manipulating one's reaction times (and error values) to achieve the desired impression (e.g., Röhner et al., 2013). Consequently, various theoretical approaches have suggested that faking on IATs is more difficult, and thus less possible, than faking on self-reports (see, e.g., De Houwer, 2006). In line with this argument, research has found more evidence of faking on self-reports than on IATs (e.g., Röhner et al., 2011; Steffens, 2004).

The Impact of Faking Direction
Several studies have demonstrated that faking depends on the requested faking direction (e.g., faking good vs. faking bad, Bensch et al., 2019; faking high scores vs. low scores, Röhner et al., 2013). 1 Typically there is more evidence of faking when low scores are faked than when high scores are faked (e.g., Röhner et al., 2011; Viswesvaran & Ones, 1999).

The Impact of Knowledge
Faking depends on whether people have knowledge about measurement procedures and whether they are provided with strategies on how to fake (i.e., informed faking) or not (i.e., naïve faking; Röhner et al., 2013). 2 It has been argued that informed faking improves people's ability to fake (e.g., Raymark & Tafero, 2009; Snell et al., 1999). This idea has received empirical support (Röhner et al., 2011), and there was more evidence of faking when participants had prior information than when they were naïve (e.g., Röhner et al., 2013).

The Impact of Practice
Practice with faking on a specific measure can impact faking on that measure. There is more evidence of faking when participants are able to practice faking compared with when they are not (e.g., Röhner et al., 2011).

The Impact of Constructs
Research has indicated that faking also depends on the construct that fakers are attempting to fake. Differences in face validity have been shown to impact faking (Bornstein et al., 1994) and might explain why constructs that have more face validity than others are related to stronger faking behavior. Some studies have shown that the better participants can understand what is being measured, the more they are able to fake (e.g., McFarland & Ryan, 2000). However, the results of studies that have explored the impact of constructs have been less clear than the results of studies on other faking conditions. For example, Steffens (2004) demonstrated more faking on extraversion than on conscientiousness in IATs and self-reports, whereas Birkeland et al. (2006), who investigated only self-reports, demonstrated more faking on conscientiousness than on extraversion. However, the face validity of measures should not vary that strongly. Thus, this difference cannot be explained by face validity alone. Because there has been a lot of variation in other faking conditions that impact faking in these previous studies, it is not possible to ultimately explain such differences. Most likely, various constructs impact faking differently under different conditions. Therefore, the possibility that constructs impact faking should be considered.
To sum up, fakers will leave different traces under different faking conditions. When aiming to conduct research on faking detection, it is necessary to include the abovementioned conditions.

Large Quantities of Data
Whereas the idea to investigate response patterns in order to identify faking goes back to Zickar et al. (2004), Calanna et al. (2020) recently showed that the use of response patterns (i.e., all of a participant's responses; e.g., all answers to all items on a self-report) outperforms the use of scores (e.g., the test score from a self-report) in faking detection. Apparently, there is relevant information in response patterns that is not mirrored by scores (e.g., Kuncel & Borneman, 2007; Kuncel & Tellegen, 2009). Thus, to identify fakers, it seems necessary to compare various patterns of faked and not-faked responses. Consequently, large quantities of data have to be analyzed. Depending on the respective measure, data matrices quickly become very large (e.g., the IAT response pattern of a single participant includes about 250 reaction times and about 250 response values [i.e., erroneous or correct responses] that need to be compared with data from other participants). 3 Considering the variety of faking behavior, a human analyst may be overburdened. And in fact, a study in which experts were asked to distinguish fakers from non-fakers on the basis of measurement protocols (i.e., response patterns) found that experts were unable to distinguish between these groups above chance (Fiedler & Bluemke, 2005).
To sum up, faking detection seems to work better when response patterns instead of scores are included. However, human analysts are typically overwhelmed by the amount of data related to analyzing response patterns.

Faking Indices are not Available for all Measures
Faking indices seem to offer the ideal solution because they do not require researchers to investigate entire response patterns. Instead, researchers can inspect only certain indicators, thus making the analyses much more manageable. Usually, cutoff scores for these indices are suggested. When the indices miss the cutoffs, researchers can assume that participants have faked. Indices are typically based on theories about how people fake (e.g., Röhner et al., 2013). However, indices that have received empirical support are available for only a few measures (e.g., Cvencek et al., 2010; Röhner et al., 2013).
To sum up, efforts to detect faking have faced a kind of dead end. Inspecting response patterns is overwhelming for a human analyst and probably does not even lead to faking detection above chance levels, and although faking indices are more manageable, they are not yet available for all measures. Therefore, it makes sense to ask whether there might be another solution.

Machine Learning as a Solution?
In recent years, machine learning has sparked immense interest and has been applied to several psychological problems (e.g., Calanna et al., 2020; Youyou et al., 2015). Machine learning may help solve the problem of complexity in faking detection. Artificial intelligence, in contrast to human analysts, can easily compare hundreds of responses on measures under different conditions, point to differences, and provide advice on how to detect faking. Thus, machine learning seems to be an ideal approach when the goal is to find out what fakers do and how their behavior differs from non-fakers (i.e., identifying the traces of faking; e.g., Calanna et al., 2020).

The Process of Machine Learning
Classifiers are machine learning algorithms that classify objects (e.g., participants' data) into groups (e.g., faker vs. non-faker). In principle, the goal of such classifiers is to use a chosen set of variables (i.e., features; e.g., response patterns, scores, or faking indices) to predict an outcome (i.e., faker vs. non-faker) on the basis of mathematical models (Kotsiantis et al., 2006). Supervised machine learning makes the classifier learn how to map observations (e.g., responses) onto categories (e.g., faker vs. non-faker) in a training process that is similar to human inductive reasoning (e.g., Xue & Zhu, 2009). In this process, the classifier is confronted with training data. The goal of the learning process is for the classifier to be able to correctly predict the categories (here, fakers and non-fakers) when it is confronted with new data. In a process of tuning, there is a search for the model that performs best while the settings of hyperparameters are adjusted. In the testing process, the classifier is applied to data that have not been part of the training data to validate the quality of the classification results (testing the generalizability of the classifier).
It is important to note that classifiers search for differences between the groups (e.g., fakers and non-fakers) in order to make the classifications. Thus, the stronger the difference in the behavior of fakers and non-fakers, the better the classifiers are at spotting the fakers.
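The training, tuning, and testing steps described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data are simulated (one hypothetical score per participant), a k-nearest-neighbor vote stands in for the classifiers discussed in this article, and the function names `simulate`, `knn_predict`, `accuracy`, and `run` are invented for this sketch.

```python
import random

def simulate(n_per_group, separation, seed):
    """Draw one hypothetical score per participant; fakers are shifted by `separation` SDs."""
    rng = random.Random(seed)
    rows = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_group)]          # non-fakers
    rows += [(rng.gauss(separation, 1.0), 1) for _ in range(n_per_group)]  # fakers
    rng.shuffle(rows)
    return rows

def knn_predict(train, x, k):
    """Majority vote among the k training scores closest to x (1 = faker)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return 1 if 2 * sum(label for _, label in nearest) > k else 0

def accuracy(train, rows, k):
    """Share of rows whose predicted label matches the true label."""
    return sum(knn_predict(train, x, k) == label for x, label in rows) / len(rows)

def run(separation, seed=1):
    rows = simulate(150, separation, seed)
    train, valid, test = rows[:180], rows[180:240], rows[240:]
    # Tuning: choose the hyperparameter k that does best on the validation split.
    best_k = max((1, 3, 5, 7, 9), key=lambda k: accuracy(train, valid, k))
    # Testing: evaluate once on data that played no role in training or tuning.
    return best_k, accuracy(train, test, best_k)

best_k, test_accuracy = run(separation=2.0)
print(best_k, round(test_accuracy, 2))
```

Calling `run` with a larger `separation` should typically yield higher test accuracy than calling it with a `separation` of 0, which mirrors the point that classifiers profit from stronger differences between fakers and non-fakers.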

Performance Evaluation of Classifiers
The performance of classifiers is typically evaluated with the following performance indices (e.g., Calanna et al., 2020): F1, Precision, Recall, Accuracy, and the Area Under the Curve (AUC). F1 represents the harmonic mean of Precision and Recall. 4 Precision (or Positive Predictive Power) is the ratio of correctly classified positive observations (here, correctly identified fakers) to the number of observations labeled positive by the model (here, all participants who have been classified as fakers, including those who were non-fakers [i.e., false positives]). Recall (or Sensitivity) represents the ratio of correctly classified positive observations (here, correctly identified fakers) to the number of positive observations in the data (here, the number of fakers who were included in the data). Accuracy (or Efficiency) represents the ratio of observations that have been classified correctly (here, fakers as being fakers and non-fakers as being non-fakers) to the number of all observations in a given data set (here, fakers and non-fakers). The AUC is the Area Under the Curve in Receiver Operating Characteristic (ROC) curve analyses. In ROC curve analyses, hit rates (here, successfully identifying individuals as fakers) are plotted as a function of false-alarm rates (here, falsely identifying non-fakers as fakers; i.e., false positives). The AUC shows the success rate of correct classifications (see also Röhner et al., 2013). It should be different from chance (i.e., .50) in a binary classification.
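As a worked illustration of these indices, the following sketch computes them for a small set of invented labels and classifier scores (1 = faker, 0 = non-faker). The data, the .50 decision threshold, and the function names are assumptions made for this example. The AUC is computed through its rank interpretation, that is, as the probability that a randomly drawn faker receives a higher score than a randomly drawn non-faker, which equals the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Precision, Recall, F1, and Accuracy from true and predicted labels (1 = faker)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # fakers classified as fakers
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-fakers classified as fakers
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # fakers classified as non-fakers
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-fakers classified as non-fakers
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": (tp + tn) / len(pairs)}

def auc(y_true, scores):
    """Share of faker/non-faker pairs in which the faker scores higher (ties count half)."""
    fakers = [s for t, s in zip(y_true, scores) if t == 1]
    non_fakers = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((f > n) + 0.5 * (f == n) for f in fakers for n in non_fakers)
    return wins / (len(fakers) * len(non_fakers))

# Invented example: four fakers, six non-fakers, and a score per participant.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
y_pred = [int(s >= 0.5) for s in scores]

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, area)  # precision = recall = F1 = .75, accuracy = .80, AUC = .875
```

In this example, three of the four fakers are detected and one non-faker is falsely flagged, so Precision and Recall coincide. With unbalanced errors they would diverge, and F1 would lie between them.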

Feature Importance
Exploring the importance of features (i.e., the variables that are used here to distinguish fakers from non-fakers) allows researchers to peer into the black box of faking (e.g., Röhner & Ewers, 2016). Taking a look at the importance of the features offers insights into what (most) fakers did and whether their behavior varied across conditions.
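One common way to quantify feature importance is permutation importance: a feature's values are shuffled across participants, and the resulting drop in accuracy indicates how much the classifier relied on that feature. The sketch below illustrates the idea only; it is not necessarily the procedure used in this study. The data, the two features, and the hand-written `model` (which by construction uses only the first feature) are hypothetical.

```python
import random

def accuracy(model, X, y):
    """Share of participants whose predicted label matches the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the labels
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical data: feature 0 separates fakers from non-fakers, feature 1 is noise.
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def model(row):
    """Stand-in for a trained classifier; it looks at feature 0 only."""
    return int(row[0] > 0.5)

importances = permutation_importance(model, X, y)
print(importances)
```

Because the stand-in model ignores the second feature, shuffling that feature leaves accuracy unchanged and its importance is zero, whereas shuffling the first feature costs a large share of the accuracy.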

Status Quo Faking Detection With Machine Learning
Machine Learning is Able to Detect Fakers
Boldt et al. (2018) used naive Bayes, support vector machines, multinomial logistic regression, multilayer perceptron, simple logistic regression, propositional rule learner, and random forest on data from a self-developed IAT and showed that machine learning was able to detect fakers successfully. Machine learning performed better than Agosta et al.'s (2011) IAT faking index. A study by Calanna et al. (2020) used logistic regression, random forest, and XGBoost on data from a self-report measure (i.e., Big Five Questionnaire-2; BFQ2; Caprara et al., 2007). They found that machine learning was able to correctly classify fakers and non-fakers beyond a faking index (i.e., the lie scale from the BFQ2). However, neither study analyzed different faking conditions.

Calanna et al. (2020) varied their input data (i.e., response patterns vs. scores) and showed that response patterns led to better classification performances than scores. From a practical and theoretical point of view, the use of faking indices in combination with machine learning (i.e., as input data) seems to provide a meaningful extension for detecting faking because classifiers perform best when the input data are relevant for classification (e.g., Plonsky et al., 2019). Stated differently, using large quantities of data (e.g., response patterns) that are partly irrelevant for the classification problem (e.g., trials or items that are not faked at all) does not necessarily improve classification. However, focusing on relevant input data (e.g., validated indices) has the potential to outperform classification with response patterns and scores. Still, research has yet to test whether a combination of machine learning and faking indices may work better than using only response patterns or scores.

Calanna et al. (2020) found that XGBoost worked best in faking detection. Boldt et al. (2018) showed that logistic regression worked best. 5 Because these two studies differed with respect to measures, constructs, and faking directions, this difference may be explained by factors in the study designs. Still, both studies showed that the classifier impacts how well faking can be detected.

Shortcomings and Open Questions
Impact of Faking Conditions
So far, research on the ability of machine learning to detect faking has not considered the complexity of faking under different faking conditions. First, faking depends on the measure (e.g., Röhner et al., 2011), and thus, a comparison between different measures seems essential. Previous research has focused on faking either an IAT (Boldt et al., 2018) or a self-report (Calanna et al., 2020), but results have not been compared between the two measures. Typically there is more evidence of faking on self-reports than on IATs, and thus, classifiers (which search for differences between fakers and non-fakers) should be superior at spotting fakers on self-reports than on IATs. Second, faking direction impacts faking (e.g., Bensch et al., 2019; Röhner et al., 2013). There is more faking of low scores than of high scores, and thus, classifiers should be better at detecting faked low scores than at detecting faked high scores. However, previous studies have either included only one faking direction (i.e., faking good; Calanna et al., 2020) or did not distinguish between faking directions (Boldt et al., 2018). Third, faking differs between naïve and informed conditions (e.g., Röhner et al., 2013), and there is more evidence of faking when participants have information than when they are naïve (Röhner et al., 2011). Thus, it is plausible that faking detection is superior in informed than in naïve faking. However, Calanna et al. (2020) used naïve faking conditions, whereas Boldt et al. (2018) used only informed faking. Fourth, the impact of faking practice has not been taken into account. Thus, we do not know whether machine learning is able to detect both experienced fakers and novices. This distinction is important because one study indicated more evidence of faking with practice (Röhner et al., 2011), which in turn should somewhat increase its detection. Fifth, faking may depend on the constructs that are being faked (Steffens, 2004). So far, studies either did not discriminate systematically between constructs (Calanna et al., 2020) or used only one construct (Boldt et al., 2018). In order to show that a result can be generalized, different constructs have to be investigated and analyzed separately.

Implementation of Faking Indices
Both studies tested machine learning against faking indices but did not combine the two approaches by using these indices as input data. Given that classifiers perform best when input data are relevant for classification, research that includes empirically validated faking indices as input data is still needed. 6

Peering Into the Black Box of Faking
The classification process has so far remained a black box because previous studies have not investigated the information the classifiers use to separate fakers from non-fakers under varying faking conditions. However, such an investigation is warranted to understand what makes fakers stand out.

The Present Study
To advance knowledge about the ability of classifiers to detect faking, we built on research by Boldt et al. (2018) and Calanna et al. (2020) and reanalyzed seven data sets to address the abovementioned shortcomings. We compared two frequently used types of measures (self-reports vs. IATs). We included the faking of high scores and the faking of low scores. Although we focused on naïve faking attempts because they would provide the biggest challenge to the classifiers, we also included informed faking. 7 We used data from participants with and without faking experience to investigate practice effects. We used data on four different constructs (extraversion, conscientiousness, need for cognition, and self-esteem). Concerning the IATs, we additionally took advantage of the benefits of having empirically supported faking indices by including them as input data. Finally but importantly, we investigated feature importance so that we could peer into the black box of faking. For reasons of comparison, we used the classifiers that turned out to be the best in Boldt et al.'s (2018) and Calanna et al.'s (2020) studies and those that had been used in both studies. Thus, we used logistic regression, random forest, and XGBoost as classifiers.
In doing so, we aimed to test the following hypotheses:
1. Considering that there is more evidence of faking on self-reports than on IATs, we expected classifiers to spot fakers better on self-reports than on IATs.
2. Considering that there is more evidence of faking when people fake low scores, we expected classifiers to spot faking low better than faking high.
3. Considering that there is more evidence of faking in informed conditions, we expected classifiers to spot informed faking better than naïve faking.
4. Considering that there is more evidence of faking after practice, we expected faking detection by classifiers to be superior when fakers are experienced than when they are not.
5. Considering that there might be differences in faking behavior with respect to constructs, we explored whether classifiers can detect faking to comparable extents across constructs (extraversion, conscientiousness, need for cognition, self-esteem).
6. Concerning self-reports, we tried to replicate the superiority of using response patterns over using scores as input data for machine learning in faking detection. For IATs, we tried to extend previous knowledge by showing that the use of empirically supported faking indices as input data in machine learning outperforms the use of response patterns and scores.
7. We wanted to replicate differences in faking detection with respect to types of classifiers.
8. We explored which kind of information classifiers use to detect faking under the varying conditions.

6 In addition, the majority of the data in the study by Calanna et al. (2020) were retrieved from a repository of real-world assessments that had been conducted prior to their study. Thus, although experimental manipulations could in principle also be conducted in naturalistic settings, Calanna et al. (2020) did not experimentally manipulate faking. Instead, they used a post hoc strategy and defined participants as fakers or non-fakers on the basis of their scores on a lie scale (faking index). However, the validity of such scales for identifying fakers has been criticized (e.g., De Vries et al., 2014; Goffin & Christiansen, 2003; Uziel, 2010). A recent meta-analysis by Lanz et al. (2021) revealed that scales that are intended to measure socially desirable responding are not suitable for measuring response biases (e.g., faking). Consequently, whether the assignment of fakers and non-fakers was valid is not clear for the majority of data in the study by Calanna et al. (2020), and thus, there is a need for an investigation based on experimentally manipulated faking attempts.
7 Informed fakers must follow a small set of faking strategies that strongly limit their behavior and are thereby "eye-catching" for classifiers. Thus, we focused on data from participants who were given freedom in how they faked low or high scores because this provided the more critical test for the classifiers.

Data
Altogether we used seven data sets (N = In each data set, participants worked on a baseline assessment and afterwards were randomly assigned to one of the following conditions: faking high scores, faking low scores, or working under the standard instructions of the measures (i.e., control condition). Whether they were asked to fake naïvely or whether they additionally received information about faking strategies varied between the studies (see Table 1). Also, whether they had faking practice varied between the studies (Table 1). In each data set, the constructs were assessed via IATs and self-reports, with the IATs always preceding the self-reports. When participants had missing values, we dropped those participants from the respective analyses.

Naïve Faking With Faking Practice 9
Naïve faking of high and low scores with one, two, or three practice trials was assessed for two constructs: extraversion (Data Set 2: Röhner, 2014a) and conscientiousness (Data Set 5: Röhner, 2014b). In both data sets, naïve faking with one, two, or three practice trials followed a baseline assessment on the respective measure and the assessment of an initial naïve faking attempt without practice on the respective measure.

Informed Faking Without Faking Practice 10
Informed faking of high and low scores without practice was assessed for two constructs: extraversion (Data Set 1: Röhner et al., 2013) and self-esteem (Data Set 7: Röhner et al., 2011). In both studies, informed faking without practice followed a baseline assessment on the respective measure and the assessment of an initial naïve faking attempt without practice in faking on the respective measure. Concerning Data Set 1, participants had to fake low if they had faked high under naïve faking conditions, and vice versa.

Informed Faking With Faking Practice 11
Informed faking of high and low scores with one or two practice trials was assessed for self-esteem (Data Set 7: Röhner et al., 2011). Concerning informed faking with two practice trials, participants faked low if they had faked high under naïve faking conditions, and vice versa.

Descriptives for self-reports were based on questionnaire data with a possible range from 0 to 4 (extraversion), 0 to 4 (conscientiousness), -3 to +3 (need for cognition), or 0 to 3 (self-esteem). Descriptives for the IAT were based on IAT data, which were treated with the recommended D2 scoring algorithm (Greenwald et al., 2003a, 2003b). α was calculated as Cronbach's α. Split-half reliability was based on split-half correlations incorporating Spearman-Brown adjustments.

Measures to be Faked
According to their randomly assigned experimental condition, participants were asked to fake either high or low scores or to work under standard instructions.

Self-Reports
Extraversion Scale
Participants worked on the respective scale from the NEO-Five Factor Inventory (Borkenau & Ostendorf, 2008; English version: Costa Jr. & McCrae, 1992). This scale consists of 12 items that are answered on a 5-point rating scale ranging from 1 (strongly disagree) to 5 (strongly agree). Scale characteristics and Cronbach's alpha reliability (Table 1) were comparable to Borkenau and Ostendorf's (2008) values of M = 28.38, SD = 6.70, and α = .80.

Conscientiousness Scale
Participants worked on the respective scale from the NEO-Five Factor Inventory (Borkenau & Ostendorf, 2008; English version: Costa Jr. & McCrae, 1992). The scale consists of 12 items that are answered on a 5-point rating scale ranging from 1 (strongly disagree) to 5 (strongly agree). Scale characteristics and reliability (Table 1) were comparable to Borkenau and Ostendorf's (2008) values of M = 30.87, SD = 7.13, and α = .84.

Need for Cognition Scale
Participants worked on the German adaptation of the 16-item short version of the need for cognition scale (Bless et al., 1994; English version: Cacioppo & Petty, 1982). The scale consists of 16 items that are answered on a 7-point scale ranging from -3 (strongly disagree) to +3 (strongly agree). Scale characteristics and reliability (Table 1) were comparable to Fleischhauer et al.'s (2010) values of M = 15.28, SD = 11.14, and α = .84.

Rosenberg Self-Esteem Scale
Participants worked on the German adaptation of the Rosenberg Self-Esteem Scale (von Collani & Herzberg, 2003;English version: Rosenberg, 1965). The scale consists of 10 items that are answered on a 4-point scale ranging from 0 (strongly disagree) to 3 (strongly agree). Scale characteristics and reliability (Table 1)
The need for cognition IAT consisted of five blocks of trials (Fleischhauer et al., 2013). The single dimension Practice Blocks 1, 2, and 4 each included 22 trials (20 practice trials and two warm-up trials). The combined Blocks 3 and 5 each included 22 + 62 trials (20 practice trials and two warm-up trials; 60 experimental trials and 2 warm-up trials). Between participants, IATs were counterbalanced for the order of combined phases 12 to control for the effect that IAT scores tend to show stronger associations for the first pair of categories (Schnabel et al., 2008). Within participants, the presentation of combined phases was held constant. We used the R code provided by Röhner and Thoss (2019) to compute the D2 algorithm suggested by Greenwald et al. (2003a, 2003b) as a measure of the IAT effect. In addition, we calculated the diffusion-model-based IAT effect IATv (Klauer et al., 2007) by subtracting parameter v of the compatible phase from parameter v of the incompatible phase. For diffusion modeling, we followed the tutorial by Röhner and Thoss (2018) and used the EZ software, which can be downloaded (http://www.ejwagenmakers.com/papers.html). 13

12 We use the term combined phase to refer to the combination of the critical practice block and the critical test block (compatible phase = compatible practice trials and compatible test trials; incompatible phase = incompatible practice trials and incompatible test trials; see, e.g., Röhner & Ewers, 2016).
13 We followed Voss and Voss' (2008) and Voss et al.'s (2013) recommendation to exclude outliers from the individual response-time distribution for participants who had reaction times lower than 200 ms or higher than 5,000 ms. Altogether, we removed 11,201 trials (2.02% of the trials). We removed 141 trials from the IAT (0.4% of the trials) in Data Set 1, 4,850 trials from the IAT (3.2% of the trials) in Data Set 2, 377 trials from the IAT (0.8% of the trials) in Data Set 3, 626 trials from the IAT (1.1% of the trials) in Data Set 4, 4,054 trials from the IAT (2.7% of the trials) in Data Set 5, 1,017 trials from the IAT (2.1% of the trials) in Data Set 6, and 136 trials from the IAT (0.2% of the trials) in Data Set 7. As suggested by Wagenmakers et al. (2007), we corrected the percentage of correct responses that equaled exactly 1.0 by subtracting half an error from the percentage of correct responses before running further analyses. We also corrected the percentage of correct responses that equaled exactly 0 and 0.5 by adding half an error, respectively. Because of the approximation formula, t0 can be negative in sign (e.g., when the mean of the reaction time is less than the mean decision time, which is defined as $\frac{a}{2v} \times \frac{1 - e^{y}}{1 + e^{y}}$; Wagenmakers et al., 2007). However, a negative t0 cannot be interpreted theoretically because it represents the nondecisional portion of the response time, and time cannot take on negative values (Voss et al., 2004). Thus, participants with negative t0 should be removed before further analyses (Wagenmakers et al., 2007). Altogether, we excluded N = 68 (4.80% of participants) from further analyses because t0 was negative in sign (N = 4 [4.8%] participants from Data Set 1, N = 22 [11.2%] participants from Data Set 2, N = 5 [1.9%] participants from Data Set 3, N = 11 [3.8%] participants from Data Set 4, N = 14 [7.0%] participants from Data Set 5, N = 9 [3.0%] participants from Data Set 6, and N = 3 [3.6%] participants from Data Set 7). With fakers, there were N = 52 [3.7%] t0-based exclusions, whereas there were only N = 16 [1.1%] t0-based exclusions with non-fakers, indicating that faking attempts had a strong impact on reaction time distributions so that the mean decision time exceeded the mean reaction time, and thus, t0 was impacted.
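The EZ-based computation described above can be sketched as follows. This is a minimal illustration of the closed-form EZ-diffusion equations from Wagenmakers et al. (2007), not the EZ software itself. The function names, the reading of "half an error" as 1/(2n) with n trials, the scaling parameter s = 0.1, the definition y = -va/s², and the input values in the usage lines are assumptions made for this sketch.

```python
import math

def trim_trials(rts_sec, correct, low=0.2, high=5.0):
    """Drop trials with reaction times below 200 ms or above 5,000 ms."""
    return [(rt, c) for rt, c in zip(rts_sec, correct) if low <= rt <= high]

def correct_edge(pc, n_trials):
    """Shift proportions correct of exactly 0, 0.5, or 1 by half an error (assumed: 1 / (2n))."""
    half_error = 1.0 / (2 * n_trials)
    if pc == 1.0:
        return pc - half_error
    if pc in (0.0, 0.5):
        return pc + half_error
    return pc

def ez_diffusion(pc, vrt, mrt, s=0.1):
    """Return drift rate v, boundary separation a, and nondecision time t0.

    pc = proportion correct, vrt = variance of reaction times (s^2),
    mrt = mean reaction time (s), s = scaling parameter (assumed 0.1).
    """
    logit = math.log(pc / (1 - pc))
    x = logit * (logit * pc ** 2 - logit * pc + pc - 0.5) / vrt
    v = math.copysign(1.0, pc - 0.5) * s * x ** 0.25
    a = s ** 2 * logit / v
    y = -v * a / s ** 2
    mean_decision_time = (a / (2 * v)) * (1 - math.exp(y)) / (1 + math.exp(y))
    return v, a, mrt - mean_decision_time  # t0 is negative if decision time exceeds mrt

def iat_v(v_compatible, v_incompatible):
    """IATv as described in the text: v of the incompatible phase minus v of the compatible phase."""
    return v_incompatible - v_compatible

# Illustrative summary statistics for one phase of one hypothetical participant.
v, a, t0 = ez_diffusion(pc=0.802, vrt=0.112, mrt=0.723)
print(round(v, 3), round(a, 3), round(t0, 3))  # roughly 0.1, 0.14, 0.3
```

A participant whose estimated `t0` comes out negative would be excluded under the rule described above, because the estimated mean decision time then exceeds the observed mean reaction time.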
Extraversion IAT
This IAT (Back et al., 2009) included the target discrimination between self-relevant (e.g., I, mine) and non-self-relevant (e.g., they, their) words and attribute discrimination between extraversion-related words (e.g., talkative, active) and introversion-related words (e.g., shy, passive). The IAT's characteristics (Table 1) were comparable to the values of M = 0.02, SD = 0.38, α = .85 reported by Back et al. (2009). Back et al. (2009) computed their mean with the D1 measure, which does not involve a lower tail treatment. This explains why their mean was somewhat lower than ours: We used the recommended D2 measure (i.e., trials below 400 ms are deleted).

Manipulation Check
We computed robust ANCOVAs (Wilcox, 2017) on each measure's score to check whether participants in the faking groups were motivated and able to fake on all measures and whether their scores still differed when the baseline scores were controlled for (Vickers & Altman, 2001). As expected, the significant differences between trimmed means in nearly all design points revealed that participants in the faking conditions were motivated and able to fake on all measures. The results of the robust ANCOVAs are stored in the Supplement on the OSF (https://osf.io/bj492/). Moreover, faking led to typical consequences (e.g., Salgado, 2016); the means decreased, and the standard deviations and reliability scores increased (see Table 1).

Computation of the Input Data
We used the data from the data sets described above and prepared the respective input data (i.e., response patterns, scores, and faking indices). Response patterns consisted either of all IAT trials (IATs) or of all item responses (self-reports).¹⁴ Scores consisted of either D2 and IAT_v (IATs) or the test score (self-reports).¹⁵ We combined the potential of faking indices with the potential of machine learning by using faking indices as additional input data for classifiers. We based our set of faking indices on recommendations from prior research. We were unable to consider faking indices for self-reports because validated indices of this kind are missing. Lie scales have come under heavy criticism (e.g., De Vries et al., 2014; Lanz et al., 2021; Uziel, 2010), and even the scale's authors strongly advise against the use of lie scales to detect faking (e.g., Borkenau & Ostendorf, 2008). Faking indices for IATs were created on the basis of recommendations from prior research (see Agosta et al., 2011; Cvencek et al., 2010; Röhner et al., 2013; Röhner & Thoss, 2018). Accordingly, they consisted of CTS, IAT_a, IAT_t0, Ratio 150-10000, Slow_Co, and IncErr_Co for the naïve faking and informed faking of low scores. They consisted of CTS, IAT_a, IAT_t0, Ratio 150-10000, and Accel_Co for the naïve faking of high scores and CTS, IAT_a, IAT_t0, Ratio 150-10000, and Slow_In for the informed faking of high scores.¹⁶

Computation of Faking Indices
Combined Task Slowing (CTS)
CTS was computed by subtracting the average reaction time of the faster combined phase of the baseline IATs from the average reaction time of the slower combined phase of the faked IATs (Cvencek et al., 2010). Therefore, average reaction times on the combined phases from the faked IATs were examined relative to the average reaction times on the combined phases from the baseline IATs.
IAT_a and IAT_t0
Both indices were computed using the diffusion model analyses (e.g., Klauer et al., 2007; Röhner & Ewers, 2016) that we explained above. IAT_a represents participants' speed-accuracy tradeoffs and was computed by subtracting parameter a of the compatible phase from parameter a of the incompatible phase, whereas IAT_t0 represents participants' non-decision-related processes and was computed by subtracting parameter t0 of the compatible phase from parameter t0 of the incompatible phase (Klauer et al., 2007).
Ratio 150-10000
This index was calculated according to the procedures described in Agosta et al. (2011). Thus, only reaction times between 150 and 10,000 ms were used, and the others were excluded from further analyses. Errors were substituted with the mean of the corresponding IAT phase with an added penalty of 600 ms. The average reaction times from the fastest combined phase (i.e., either compatible or incompatible) were then divided by the average reaction times from the corresponding single blocks (i.e., Single Blocks 1 & 2, or Single Blocks 1 & 5 for extraversion, conscientiousness, and self-esteem IATs; Single Blocks 1 & 2, or Single Blocks 1 & 4 for the need for cognition IAT).
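The CTS and Ratio 150-10000 computations described above can be illustrated with a short Python sketch. It is not the code used for the reported analyses. The function names and the data layout (each trial as a pair of reaction time in ms and a correctness flag) are hypothetical, and two details are assumptions: the phase mean that replaces an error is taken from the correct trials of that phase, and the same trimming and error replacement are applied to the single blocks.

```python
from statistics import mean

def combined_task_slowing(baseline_phase_means, faked_phase_means):
    """CTS: slower faked combined phase minus faster baseline combined phase.
    Each argument holds the mean RTs (ms) of the two combined phases."""
    return max(faked_phase_means) - min(baseline_phase_means)

def penalized_mean_rt(trials, lower=150, upper=10_000, penalty=600):
    """Mean RT of one phase after trimming and error replacement.
    trials: list of (rt_ms, is_correct) pairs."""
    kept = [(rt, ok) for rt, ok in trials if lower <= rt <= upper]
    # Assumption: the phase mean is computed from correct trials only.
    phase_mean = mean([rt for rt, ok in kept if ok])
    return mean([rt if ok else phase_mean + penalty for rt, ok in kept])

def ratio_150_10000(phases):
    """phases maps a phase name to (combined_trials, single_block_trials).
    The fastest combined phase is divided by its corresponding single blocks."""
    means = {
        name: (penalized_mean_rt(combined), penalized_mean_rt(single))
        for name, (combined, single) in phases.items()
    }
    fastest = min(means, key=lambda name: means[name][0])
    combined_mean, single_mean = means[fastest]
    return combined_mean / single_mean

example = {
    "compatible": (
        [(500, True), (700, True), (100, True), (600, False)],  # combined phase
        [(400, True), (400, True)],                             # single blocks
    ),
    "incompatible": (
        [(900, True), (1100, True)],
        [(500, True)],
    ),
}
print(ratio_150_10000(example))
print(combined_task_slowing(baseline_phase_means=(650, 800),
                            faked_phase_means=(700, 1200)))
```

In the example, the 100-ms trial is trimmed, the error trial is replaced by the phase mean plus 600 ms, and the compatible phase turns out to be the fastest one, so its single blocks serve as the denominator.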

Slow_Co, IncErr_Co, Slow_In, and Accel_Co
We computed these indices as described in Röhner et al. (2013). Thus, for the naïve as well as informed faking of low scores, we computed slowing down on the congruent phase (i.e., Slow_Co) as the difference in reaction times between the congruent IAT phase after faking instructions and the congruent IAT phase at baseline. For the naïve faking of low scores, we additionally computed increasing errors on the congruent phase (i.e., IncErr_Co) as the difference in errors between the congruent IAT phase under faking instructions and the congruent IAT phase at baseline. Albeit not necessarily related to faking success, this index was shown to mirror a faking strategy that is commonly used under the naïve faking of low scores.¹⁷ Concerning the naïve faking of high scores, we computed acceleration on the congruent phase (i.e., Accel_Co) as the difference in reaction times between the congruent IAT phase at baseline and the congruent IAT phase under faking. Concerning the informed faking of high scores, we computed slowing down on the incongruent phase (i.e., Slow_In) as the difference in reaction times between the incongruent IAT phase under faking and the incongruent IAT phase at baseline.
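Because the four indices differ mainly in which phase is used and in the direction of the subtraction, a compact sketch may help to keep the signs apart. The Python code below is illustrative only; the `PhaseSummary` container, the dictionary keys, and the example values are hypothetical and do not come from the original analysis scripts.

```python
from dataclasses import dataclass

@dataclass
class PhaseSummary:
    mean_rt: float  # mean reaction time of the phase in ms
    errors: float   # number of errors in the phase

def difference_indices(baseline, faked):
    """baseline and faked map 'congruent' and 'incongruent' to a PhaseSummary.
    Positive values indicate the behavior named by the index."""
    return {
        # Slowing down on the congruent phase: faked minus baseline RT.
        "Slow_Co": faked["congruent"].mean_rt - baseline["congruent"].mean_rt,
        # Increasing errors on the congruent phase: faked minus baseline errors.
        "IncErr_Co": faked["congruent"].errors - baseline["congruent"].errors,
        # Acceleration on the congruent phase: baseline minus faked RT.
        "Accel_Co": baseline["congruent"].mean_rt - faked["congruent"].mean_rt,
        # Slowing down on the incongruent phase: faked minus baseline RT.
        "Slow_In": faked["incongruent"].mean_rt - baseline["incongruent"].mean_rt,
    }

baseline = {"congruent": PhaseSummary(700, 2), "incongruent": PhaseSummary(900, 5)}
faked = {"congruent": PhaseSummary(1000, 6), "incongruent": PhaseSummary(950, 5)}
print(difference_indices(baseline, faked))
```

Note that Accel_Co is simply Slow_Co with the opposite sign; which of the indices enters a model depends on the faking condition, as described above.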

Machine Learning
In order to investigate the ability of machine learning to detect faking, we used the following three types of classifiers on the faked and non-faked data: logistic regression, random forest, and XGBoost. We decided to use logistic regression and random forest for reasons of comparability. Both were used in Boldt et al. (2018) as well as in Calanna et al. (2020). We also included the classifier that worked best in each study: logistic regression (Boldt et al., 2018) and XGBoost (Calanna et al., 2020). Each of the classifiers was applied to response patterns and scores for the self-reports and to response patterns, scores, and faking indices for the IAT. We thereby discriminated between the abovementioned faking conditions. Additionally, we made sure that the groups (i.e., faking and non-faking) were equal in size before we ran the analyses. A detailed overview of the resulting models is stored on the OSF (https://osf.io/bj492/).
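One simple way to obtain equally sized groups is to randomly undersample the larger group. The text above does not state how the groups were equalized, so the Python sketch below shows only one possible approach; the function name, the seed, and the example records are hypothetical.

```python
import random

def balance_groups(fakers, non_fakers, seed=0):
    """Randomly undersample the larger group so that both groups are equal
    in size. Assumption: undersampling; other balancing schemes are possible."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n = min(len(fakers), len(non_fakers))
    return rng.sample(fakers, n), rng.sample(non_fakers, n)

fakers = [f"faker_{i}" for i in range(30)]
non_fakers = [f"control_{i}" for i in range(45)]
balanced_fakers, balanced_non_fakers = balance_groups(fakers, non_fakers)
print(len(balanced_fakers), len(balanced_non_fakers))
```

With balanced groups, a classifier that guesses at random is expected to reach about 50% accuracy, which makes chance level easy to interpret.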

Multilayer Cross-Validation
To ensure the generalizability of the results, we followed Calanna et al. (2020) and adopted a multilayer cross-validation procedure. We ran a five-fold cross-validation to tune the algorithms and additionally ran another 10-fold cross-validation to estimate their performance (see Cawley & Talbot, 2010). Training data and test data were independent from each other in every fold (i.e., data split). This was true for the five-fold cross-validation that was used to tune the algorithms and also for the 10-fold cross-validation that was used to estimate the performance.
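The logic of the multilayer (nested) procedure can be sketched as follows: an outer 10-fold loop estimates performance, and within each outer training set an inner five-fold loop selects the hyperparameter. The Python sketch below uses only the standard library and swaps in a deliberately simple stand-in, a one-feature k-nearest-neighbor classifier with a small fixed grid for k, instead of the classifiers and the random search used in the study. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def k_folds(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def split(data, fold):
    """Return (train, test); the two parts never share a row."""
    held_out = set(fold)
    train = [row for i, row in enumerate(data) if i not in held_out]
    return train, [data[i] for i in fold]

def knn_predict(train, x, k):
    """Stand-in classifier: majority vote among the k nearest training rows."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return 1 if 2 * sum(label for _, label in nearest) > len(nearest) else 0

def fold_f1(train, test, k):
    predictions = [knn_predict(train, x, k) for x, _ in test]
    return f1_score([y for _, y in test], predictions)

def nested_cv(data, grid=(1, 3, 5), outer=10, inner=5, seed=0):
    rng = random.Random(seed)
    outer_scores = []
    for outer_fold in k_folds(len(data), outer, rng):
        train, test = split(data, outer_fold)
        inner_folds = k_folds(len(train), inner, rng)

        def inner_score(k):
            # Tuning uses the outer training data only.
            return mean(fold_f1(*split(train, fold), k) for fold in inner_folds)

        best_k = max(grid, key=inner_score)
        # Performance is estimated on the untouched outer test fold.
        outer_scores.append(fold_f1(train, test, best_k))
    return mean(outer_scores), outer_scores

# Toy data: (feature, label) with label 1 = faker, 0 = non-faker.
toy = [(-1.0 - 0.01 * i, 1) for i in range(50)] + [(1.0 + 0.01 * i, 0) for i in range(50)]
mean_f1, fold_scores = nested_cv(toy)
print(round(mean_f1, 2), len(fold_scores))
```

The key design point is that the outer test fold is never seen during tuning, which is what keeps the performance estimate from being optimistically biased by the hyperparameter search.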

Performance Evaluation
When it comes to faking, Precision and Recall are equally important. Thus, we used random search to find the set of hyperparameters that maximized the F1 score in order to optimize the tradeoff between Precision and Recall (e.g., Calanna et al., 2020).

Feature Importance
To gain insight into the black box of faking, we explored the features that were used by the classifiers to discriminate between fakers and non-fakers (see Fig. 4; for more details, see also Tables S7 to S9 and Figures S1 to S4 in the Supplement). To reduce complexity in the Results section, we evaluated performance by reporting the means and standard deviations of only the most important performance index with regard to faking detection (i.e., F1; the harmonic mean between Precision and Recall). Higher values on this performance index indicate better faking detection. In order to facilitate interpretation, we compared the F1 performance evaluations using Cohen's d.
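How two sets of F1 evaluations can be compared with Cohen's d may be easier to see in code. The Python sketch below assumes the pooled-standard-deviation variant of d for two independent groups; the text does not state which variant was used, and the function name and the example values are hypothetical.

```python
import math
from statistics import mean, stdev

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (an assumed variant).
    Negative values mean that group_a has the lower mean."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_variance)

# Hypothetical F1 evaluations from two faking conditions.
f1_condition_a = [0.6, 0.7, 0.8]
f1_condition_b = [0.8, 0.9, 1.0]
print(round(cohens_d(f1_condition_a, f1_condition_b), 2))
```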

Ability of Classifiers to Detect Fakers
Summing up, in most cases, the classifiers were able to detect faking above chance. As expected, however, faking conditions, input data, and type of classifier determined how well faking could be detected. F1 varied from .44 (faking condition: naïve faking of high scores on the conscientiousness IAT without practice; classifier: random forest; input data: scores) to .98 (faking condition: informed faking of low scores on the self-esteem IAT without practice; classifier: logistic regression or random forest; input data: scores or indices).
We want to exemplify the results for these models. Concerning the model that was computed for the condition involving the naïve faking of high scores on the conscientiousness IAT without practice using the random forest classifier and scores as the input data, F1 was .44. Precision was .45. Thus, only 45% of the participants who were classified as fakers actually were fakers (i.e., 55% were non-fakers). Recall was .44. Thus, only 44% of the fakers that existed were detected (i.e., 56% of the fakers were not detected). Accordingly, F1 was below .50, and the probability of detecting fakers as fakers was below chance. Conversely, in the models that were computed for the condition involving the informed faking of low scores on the self-esteem IAT without practice and using the logistic regression or random forest classifier and scores or indices as input data, the chances of classifying fakers correctly as fakers were largely above chance. Concerning the model that was computed for this condition using the logistic regression or random forest classifier and scores as input data, Precision was 1.00 (i.e., 100% of the participants who were classified as fakers actually were fakers; thus, no non-fakers were classified as fakers), and Recall was .97 (i.e., 97% of the fakers that existed were detected; thus, only 3% of the fakers were missed). Concerning the model that was computed for this condition using the logistic regression or random forest classifier and faking indices as input data, Precision was .97 (i.e., 97% of the participants who were classified as fakers actually were fakers; 3% were non-fakers that had been wrongly assigned to the group of fakers), and Recall was 1.00 (i.e., 100% of the fakers that existed were detected; no faker was missed).
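The relation between the Precision, Recall, and F1 values reported above can be checked directly because F1 is the harmonic mean of Precision and Recall. The short Python sketch below (the function name is hypothetical) recomputes F1 from the rounded Precision and Recall values given in the text, so small rounding differences from the exact model output are possible.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of Precision and Recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Naïve faking of high scores, conscientiousness IAT, random forest, scores.
print(round(f1_from_precision_recall(0.45, 0.44), 2))
# Informed faking of low scores, self-esteem IAT.
print(round(f1_from_precision_recall(1.00, 0.97), 2))
print(round(f1_from_precision_recall(0.97, 1.00), 2))
```

Both self-esteem IAT models arrive at the same F1 even though one misses a few fakers and the other wrongly flags a few non-fakers, which illustrates why F1 alone does not reveal the type of error.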
Under informed faking conditions, the F1 performance evaluations of classifiers on self-reports and IATs were more comparable than under naïve conditions. This was true concerning informed faking without practice (d = -0.55, 95% CI [-1.08, -0.03]) and with practice (d = -0.43, 95% CI [-0.95, 0.10]; Table 2; Fig. 3). Thus, differences in faking detection were less pronounced here.
[...] Table 2; Figs. 1, 2, and 3). Thus, faking was spotted much better for the faking of low scores than for the faking of high scores.
Figure note. Performance evaluation can vary between 0.00 and 1.00 (y-axis). Geometrical shapes code the classifiers: Circles represent performance evaluations from logistic regression, triangles represent performance evaluations from random forest, and squares represent performance evaluations from XGBoost. Colors code the kind of input data: Yellow represents response patterns, red represents scores, and blue represents faking indices.

Faking Without Versus with Practice
The F1 performance evaluations of classifiers were comparable between experienced and inexperienced fakers for naïve faking without practice versus one practice trial (d = -0.08, 95% CI [-0.39 Fig. 3). Thus, classifiers worked equally well irrespective of faking practice.
When aiming to detect the naïve faking of participants with practice trials, F1 performance evaluations of classifiers were comparable for the constructs extraversion and conscientiousness (d = -0.08, 95% CI [-0.37, 0.21]; Table 2; Fig. 2). Also, when aiming to detect informed faking, F1 performance evaluations of classifiers were comparable for the constructs extraversion and self-esteem (d = -0.09, 95% CI [-0.60, 0.42]; Table 2; Fig. 3). Thus, classifiers were comparably good at detecting fakers on different constructs when participants had practice or information.
Table note. Grey cells indicate that these models were not part of our reanalyses because of the nonavailability of the recommended faking indices for self-reports or because we did not collect data concerning this condition.

Opening the Black Box: Which Information did Classifiers Use to Detect Faking?
Fig. 4 Forest Plots of the Evaluation of Feature Importance in Logistic Regression. Note. The x-axis represents the mean feature importance, which can vary between 0 = not important at all to 1 = most important. The larger the distance from zero, the more important the feature is. Point size is proportional to the number of occurrences (N) used to calculate the mean feature importance and can vary on the basis of the underlying data or the results of the algorithm that was used. Horizontal lines represent confidence intervals. Confidence intervals that exceeded the margins of -0.5 and 1.5 were clipped. Clipping is indicated by an "x." Confidence intervals that fall below zero are colored in a lighter shade of grey, or else they are blue. Response patterns represent the features of self-reports. Faking indices represent the features in IATs.
Because logistic regressions worked best to detect faking, we decided to focus on analyses of the feature importance of logistic regressions in order to reduce complexity. Also, we decided to focus on the feature importance of faking indices in IATs and response patterns in self-reports because, overall, these approaches were the most successful for detecting faking. Figure 4 provides an overview of the aggregated feature importance of logistic regressions in the form of forest plots (see Figure S4 in the Supplement for the plots from random forest and XGBoost). It clearly demonstrates that, for IATs, participants' speed-accuracy setting (i.e., IAT_a) was consistently the most important feature for detecting faking,¹⁸ whereas the results on the response pattern in self-reports were more diverse. Although there was clear variation within feature importance, the differences between the relevance of various items were less strongly pronounced.
On this general level, the most important feature on the extraversion scale represents activity, the one on the conscientiousness scale represents handling of time, the one on the need for cognition scale represents enjoyment of problem-solving, and the one on the self-esteem scale represents self-satisfaction. Thus, additional analyses of feature importance on a more detailed level (i.e., with respect to faking conditions) seemed relevant. Figures S1 to S3 in the Supplement show the feature importance of classifiers under the different faking conditions. Feature importance clearly demonstrates that faking occurs along different pathways, which is why we decided to present the most important feature and compare the ordering of feature importance with Spearman's rank correlation coefficients. Tables S6 to S8 in the Supplement provide an overview of the M and SD values for features between fakers and non-fakers. To assess the correspondence between feature importance under different faking conditions, we calculated Spearman's rank correlation coefficients between the ranked (descriptive) feature importance values from each pair of faking conditions (1 = most important; 6 to 16¹⁹ = least important).
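The comparison of feature orderings can be made concrete with a small example. The Python sketch below computes Spearman's rank correlation as the Pearson correlation of average ranks, using only the standard library; it is an illustration, not the code used for the reported coefficients, and the function names and the two example rankings are hypothetical.

```python
import math
from statistics import mean

def average_ranks(values):
    """Rank values from 1 (smallest) upward; ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rank correlation between two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical importance ranks of six features (1 = most important)
# under two faking conditions.
ranks_condition_a = [1, 2, 3, 4, 5, 6]
ranks_condition_b = [2, 1, 4, 3, 6, 5]
print(round(spearman(ranks_condition_a, ranks_condition_b), 3))
```

A coefficient near 1 indicates that the same features matter in the same order under both conditions, whereas a coefficient near 0 indicates unrelated orderings.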

Self-Reports
Conscientiousness Scale
Under naïve faking conditions without practice, the most important feature for detecting the faking of low scores was the lower ratings of fakers on Item 8 (i.e., "When I make a commitment, I can always be counted on to follow through"), whereas the most important feature for detecting the faking of high scores was the higher ratings of fakers on Item 6 (i.e., "I waste a lot of time before settling down to work"; Table S7). The same was true for faking after one or two practice trials (Table S8). It was different when faking with three practice trials: The most important feature for detecting the faking of low scores was the lower ratings of fakers on Item 10 (i.e., "I am a productive person who always gets the job done"), and the most important feature for detecting the faking of high scores was the higher ratings of fakers on Item 11 (i.e., "I never seem to be able to get organized"; Table S8).
The ordering of feature importance varied with respect to faking direction. The importance of features for detecting the faking of low scores was always unrelated to the importance for detecting the faking of high scores under naïve faking without practice (r_s = -.13, p = .697) as well as with one (r_s = -.20, p = .542), two (r_s = .08, p = .795), or three (r_s = .50, p = .095) practice trials. Feature importance orderings also varied with respect to practice. The features that had the strongest impact on detecting fakers under naïve faking conditions without practice did not have the strongest impact on detecting fakers under naïve faking conditions with one practice trial for faking low (r_s = .40, p = .191) or faking high (r_s = .24, p = .475) or two practice trials for faking low (r_s = .46, p = .131) or faking high (r_s = .55, p = .067). However, the results for naïve faking without practice held for three practice trials for faking high (r_s = .68, p = .015) but not for faking low (r_s = .10, p = .762).
Extraversion Scale
Under naïve faking conditions without practice, the most important feature was the lower ratings of fakers on Item 2 (i.e., "I laugh easily") when detecting the faking of low scores and the higher ratings of fakers on Item 7 (i.e., "I often feel as if I'm bursting with energy") when detecting the faking of high scores (Table S7). With one or two practice trials in faking, the most important feature for detecting faked low scores was the lower ratings of fakers on Item 11 (i.e., "I am a very active person"), whereas the most important feature for detecting faked high scores again was the higher ratings of fakers on Item 7 (Table S8). When participants had three practice trials, the most important feature for detecting faked low scores was the lower ratings of fakers on Item 4 (i.e., "I really enjoy talking to people"), whereas the most important feature for detecting faked high scores was the higher ratings of fakers on Item 1 (i.e., "I like to have a lot of people around me"; Table S8). Under informed faking conditions, the most important feature was the lower ratings of fakers on Item 1 when detecting the faking of low scores and the higher ratings of fakers on Item 11 when detecting the faking of high scores (Table S9).
The ordering of feature importance for detecting the faking of low scores under naïve faking was not or only weakly related to the detection of the faking of high scores without practice (r_s = .26, p = .417) or with one (r_s = .15, p = .649), two (r_s = .64, p = .026), or three practice trials (r_s = .25, p = .443) under naïve faking as well as under informed faking (r_s = .46, p = .131). Features varied with respect to practice concerning faking low but not concerning faking high. Features that had the strongest impact on detecting fakers of high scores under naïve faking conditions without practice also had the strongest impact on detecting fakers of high scores under naïve faking conditions with one practice trial (r_s = .60, p = .039), two practice trials (r_s = .82, p = .001), and three practice trials (r_s = .67, p = .017). Features that had the strongest impact on detecting fakers of low scores under naïve faking conditions without practice did not have the strongest impact on detecting fakers of low scores under naïve faking conditions with one practice trial (r_s = .51, p = .090), two practice trials (r_s = .04, p = .914), or three practice trials (r_s = .40, p = .199). Additionally, features varied with respect to whether participants faked naïvely or were informed about faking strategies for faking low (r_s = .11, p = .729) and faking high (r_s = .50, p = .101).

Need for Cognition Scale
Under naïve faking conditions without practice, the most important feature was the higher²⁰ ratings of fakers on Item 3 (i.e., "I tend to set goals that can be accomplished only by expending considerable mental effort") when detecting the faking of low scores and the higher ratings of fakers on Item 13 (i.e., "I prefer my life to be filled with puzzles that I must solve") when detecting the faking of high scores (Table S7). Again, features varied with respect to faking direction (r_s = .37, p = .154).

Self-Esteem Scale
Under naïve faking conditions without practice, the most important feature was the lower ratings of fakers on Item 3 (i.e., "I feel that I have a number of good qualities") when detecting the faking of low scores and the higher ratings of fakers on Item 1 (i.e., "On the whole, I am satisfied with myself") when detecting the faking of high scores (Table S7). Under informed faking conditions without practice, the most important feature was the lower ratings of fakers on Item 4 (i.e., "I am able to do things as well as most other people") when detecting the faking of low scores and the higher ratings of fakers on Item 1 when detecting the faking of high scores (Table S9). With one practice trial, the most important feature was the higher²¹ ratings of fakers on Item 5 (i.e., "I feel I do not have much to be proud of") when detecting the faking of low scores and the lower²² ratings of fakers on Item 9 (i.e., "All in all, I am inclined to think that I am a failure") when detecting the faking of high scores (Table S9). With two practice trials, the most important feature was the lower ratings of fakers on Item 4 when detecting the faking of low scores and the higher ratings of fakers on Item 5 when detecting the faking of high scores (Table S9). Again, the ordering of feature importance for detecting the faking of low scores was unrelated to the ordering for detecting the faking of high scores for naïve faking (r_s = .03, p = .934), informed faking without practice (r_s = -.08, p = .829), informed faking with one practice trial (r_s = -.07, p = .855), and informed faking with two practice trials (r_s = .24, p = .511). Additionally, feature importance did not vary much with respect to whether participants faked naïvely or were informed about how to fake when faking low (r_s = .69, p = .029), but it did vary when faking high (r_s = .33, p = .347).
Finally, under informed faking, features varied with respect to practice when faking low (r_s = .42, p = .229) and when faking high (r_s = .29, p = .425).

IATs
Conscientiousness IAT
Concerning the detection of faking under naïve faking conditions without practice, the most important feature was the lower IAT_a of fakers (i.e., participants' speed-accuracy setting) when faking low scores and the lower Ratio 150-10000 of fakers (i.e., the ratio that measures a slowing down behavior on either the compatible or incompatible IAT phase compared with the single blocks) when faking high scores (Table S7). With practice in faking, the lower IAT_a of fakers was the most important feature for detecting the faking of low scores, and the higher IAT_a of fakers was the most important feature for detecting the faking of high scores (Table S8). The ordering of feature importance varied with respect to faking direction. Under naïve faking without and with practice, the ordering of feature importance for detecting the faking of low scores was not related to the ordering of feature importance for detecting the faking of high scores without practice (r_s = .54, p = .258), with one practice trial (r_s = .69, p = .060), with two practice trials (r_s = .30, p = .479), or with three practice trials (r_s = .69, p = .060). The ordering of feature importance did not vary greatly with respect to practice. Features that had the strongest impact on the detection of fakers under naïve faking conditions without practice also had the strongest impact on the detection of fakers under naïve faking conditions with one practice trial for faking low (r_s = .93, p = .001) and faking high (r_s = .73, p = .042), for two practice trials for faking low (r_s = .98, p ≤ .001) and faking high (r_s = .55, p = .158), and for three practice trials for faking low (r_s = .95, p ≤ .001) and faking high (r_s = .75, p = .032).
Extraversion IAT
Concerning the detection of faking under naïve faking without practice, the most important feature was the lower IAT_a of fakers when detecting the faking of low scores and the higher IAT_a of fakers when detecting the faking of high scores (Table S7). With practice in faking, a lower IAT_t0 (one practice trial), a lower IAT_a (two practice trials), and a higher Ratio 150-10000 (three practice trials) of fakers were most important for detecting the faking of low scores, whereas IAT_a was consistently the most important feature for detecting the faking of high scores: a higher IAT_a of fakers with one and two practice trials and a lower IAT_a of fakers with three practice trials. Under informed faking conditions, the most important feature was the lower IAT_a of fakers when detecting the faking of low scores and the higher IAT_a of fakers when detecting the faking of high scores. Differences with respect to faking direction were also apparent on the extraversion IAT. Under naïve faking without practice, the ordering of the importance of features for detecting the faking of low scores was somewhat related to the ordering for detecting the faking of high scores (r_s = .76, p = .028). It was not related when participants had one (r_s = .23, p = .578) or three practice trials (r_s = .46, p = .244), but it was related when they had two practice trials (r_s = .79, p = .021). Under informed faking conditions, the ordering of the importance of features for detecting the faking of low scores was strongly related to the ordering for detecting the faking of high scores without practice (r_s = .90, p = .002).
The ordering of the importance of features did not vary much with respect to practice. The features that had the strongest impact on detecting fakers under naïve faking conditions without practice also had the strongest impact on detecting fakers under naïve faking conditions with one practice trial for faking low (r_s = .93, p = .001) and faking high (r_s = .93, p = .001), two practice trials for faking low (r_s = 1.00, p ≤ .001) and faking high (r_s = .98, p ≤ .001), and three practice trials for faking low (r_s = .81, p = .015) and faking high (r_s = .98, p ≤ .001).
Additionally, feature importance did not vary much with respect to whether participants faked naïvely or whether they were informed for faking low (r_s = .86, p = .006) and faking high (r_s = .88, p = .004).

Need for Cognition IAT
Under naïve faking conditions without practice, the most important feature was the lower IAT_a of fakers when detecting the faking of low scores and the higher CTS (i.e., combined task slowing) of fakers when detecting the faking of high scores (Table S7). Again, features varied with respect to faking direction (r_s = .09, p = .840).
Self-Esteem IAT
Concerning faking detection under naïve faking conditions without practice, the most important feature was the lower IAT_a of fakers when detecting the faking of low and high scores (Table S7). Under informed faking conditions without and with practice, the most important feature was also the lower IAT_a of fakers when detecting the faking of low scores and the higher IAT_a of fakers when detecting the faking of high scores (Table S9). The ordering of the importance of features varied with respect to faking direction. Under naïve faking, feature importance differed regarding the detection of low and high scores (r_s = .61, p = .106). As was true for extraversion, the orderings of the importance of features for high and low scores were more strongly related under informed faking (r_s = .80, p = .017), informed faking with one practice trial (r_s = .75, p = .032), and informed faking with two practice trials (r_s = .78, p = .024). Feature importance did not vary much with respect to whether participants faked naïvely or whether they were informed about faking strategies for faking low (r_s = .71, p = .048) or faking high (r_s = .90, p = .002). Finally, under informed faking, feature orderings did not vary with respect to practice for faking low (r_s = .98, p ≤ .001) or faking high (r_s = 1.00, p ≤ .001).

Discussion
We reanalyzed seven data sets (N = 1,039) to investigate the ability of machine learning to detect faking under different faking conditions. We analyzed the detection of faking on two frequently used and well-established psychological measures (self-reports and IATs) regarding the faking of high and low scores, naïve and informed faking, faking with and without practice, and on four constructs (extraversion, conscientiousness, need for cognition, and self-esteem), thus varying factors that have been shown to impact faking behavior (i.e., traces of faking). We also compared three types of classifiers (logistic regression, random forest, and XGBoost) and three types of input data (response patterns, scores, and faking indices). Last but not least, to peer into the black box of faking and its detection, we explored feature importance.
Our results are in line with Boldt et al.'s (2018) and Calanna et al.'s (2020) earlier findings, which identified machine learning as a promising approach for detecting faking. In most cases, classifiers were able to detect faking above chance. Our results extend previous findings by showing that besides the type of classifier and besides the type of input data, the conditions under which faking occurs affect how faking is done and how well it can be detected. Accordingly, faking detection ranged from chance levels to nearly 100%. For example, detection was rather poor with naïve faking on the conscientiousness IAT when using scores and random forest, but it worked very well for detecting the informed faking of low scores on the self-esteem IAT on the basis of scores or faking indices with logistic regression.

Faking Detection is Better on Self-Reports than on IATs Under Naïve Conditions but not Under Informed Conditions
Under naïve faking and irrespective of practice levels, classifiers had more trouble recognizing fakers on IATs than on self-reports. Under informed faking, the opposite was true, albeit this effect was much smaller and nonsignificant when people had practice.²³ Various theoretical accounts have suggested that faking on IATs is more difficult and thus less possible than faking on self-reports (see, e.g., De Houwer, 2006). This argument has been supported by empirical research (e.g., Röhner et al., 2011; Steffens, 2004). In fact, the reduced transparency of the measurement procedure in IATs as compared with self-reports is one core attribute of IATs (e.g., De Houwer, 2006). Consequently, naïve faking conditions in particular challenge participants when they try to fake, whereas information makes faking easier (e.g., Röhner et al., 2011). One explanation for this finding comes from research that shows that participants develop and use successful but also unsuccessful faking strategies in naïve faking conditions (Röhner et al., 2013). By contrast, faking on self-reports is quite easy because participants basically choose responses that fit the impression they want to make. Correspondingly, research has shown that faking on self-reports is not impacted much by knowledge about faking strategies (Röhner et al., 2011). Most likely, successful faking strategies are very obvious on self-reports, and thus, any potential gains from information about how to fake are less pronounced than they are on IATs.
Thus, the measure to be faked plays a role in faking detection. As expected, faking detection was better on self-reports than on IATs. However, keeping in mind the results on feature importance, this better detection on self-reports came at the expense of a lower generalizability of features to detect faking across faking conditions on the self-report measures than on the IATs. Moreover, this advantage of self-reports was only true for naïve faking. Thus, the impact of the type of measure on faking detection changes with information about faking strategies. Faking on less transparent measures (e.g., on IATs) was detected to almost the same degree as on self-reports when participants had information about how to fake them.

The Detection of Faking Low is Superior to the Detection of Faking High
Earlier findings have emphasized that faking behavior differs by faking direction (e.g., Bensch et al., 2019; Röhner et al., 2013) and found more evidence of faking when participants faked low scores than when they faked high scores (see, e.g., Röhner et al., 2011). Extending these results and in line with expectations, classifiers were better at detecting faking low than at detecting faking high.
Thus, faking direction played a role in the detection of faking in the current study. Faked low scores were spotted better than faked high scores.

The Detection of Informed Fakers is Superior to the Detection of Naïve Fakers
Previous research has found more evidence of faking when participants were informed than when they were naïve with respect to faking strategies, as informed faking is easier and thus more pronounced than naïve faking (e.g., Röhner et al., 2011). In line with this idea and as expected, classifiers performed somewhat better for informed faking than for naïve faking. Thus, although faking detection was possible both for fakers who faked naïvely and for those who were informed about how to fake, knowledge about faking strategies impacted faking detection: Detection was better when participants had knowledge about faking strategies than when they did not.

Practice in Faking has no Impact on Detection
Faking detection was equally good regardless of practice levels. Apparently, information (see paragraph above) is more relevant than practice.

Without Practice and Without Information, Faking Detection is Better on Need for Cognition and on Extraversion Than on Self-Esteem and Conscientiousness
When participants faked naïvely and had no practice, the construct to be faked played a role. Detection was better for extraversion and need for cognition than for self-esteem and conscientiousness. These findings are in line with a finding by Lukoff (2012), who gave warnings to potential fakers and found that constructs impacted how well fakers and non-fakers were classified with machine learning.
However, when participants in our studies had practice in faking or were informed about faking strategies, detection did not differ between constructs. Apparently, faking became more homogeneous under these conditions.
To sum up, although it was possible to detect faking for all four constructs, the construct that was being faked impacted faking detection for conditions involving naïve faking without practice. Faking was more often detected when it involved extraversion or need for cognition than self-esteem or conscientiousness in this case.
Replicating Calanna et al.'s (2020) prior findings, our study demonstrated that faking detection is better when using response patterns than when using scores as input data. These results are in line with the assumption that faking is represented more strongly in a kind of profile (response patterns) than in scores (Geiger et al., 2018). Apparently, faking is too multifaceted to be captured by one overall score (e.g., Röhner et al., 2013). The findings also underscore the advantage of machine learning in faking detection: Machines can analyze complex response patterns efficiently. However, whereas the effect for IATs was significant, it remained nonsignificant for self-reports. Most likely, the number of responses in a pattern plays an important role with respect to whether response patterns perform better than scores. In our analyses, response patterns on IATs consisted of 220 to 264 responses, whereas response patterns on self-reports consisted of 10 to 16 responses. The self-report measure used by Calanna et al. (2020) included 134 responses. Thus, the advantage of using response patterns seems especially strong for measures with large sets of responses. An obvious explanation for this may be that with more items, faking can be more multifaceted, and it becomes more important to inspect response patterns.
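The difference between a score and a response pattern can be illustrated with a toy example. The sketch below is a hedged illustration with invented responses to a hypothetical 10-item questionnaire answered on a 1-5 scale; the two pattern descriptors (share of extreme responses, within-person spread) are assumptions chosen for illustration and are not features taken from the analyses reported here.

```python
import statistics


def total_score(responses):
    """Collapse a response pattern into a single sum score."""
    return sum(responses)


def extreme_share(responses, low=1, high=5):
    """Share of responses that fall on either endpoint of the scale."""
    return sum(r in (low, high) for r in responses) / len(responses)


def within_person_spread(responses):
    """Population standard deviation of one person's responses across items."""
    return statistics.pstdev(responses)


# Invented patterns: both respondents obtain the same sum score.
honest = [4, 4, 3, 4, 3, 4, 3, 3, 4, 3]
faked = [5, 5, 5, 5, 5, 2, 2, 2, 2, 2]  # extreme on items judged important

for label, pattern in (("honest", honest), ("faked", faked)):
    print(label, total_score(pattern), extreme_share(pattern),
          within_person_spread(pattern))
```

Both patterns yield the same score, so a classifier that sees only the score cannot separate them, whereas the item-level pattern still differs.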

Faking Detection With Faking Indices as Input Data is Superior to Faking Detection With Response Patterns or Scores
Extending these findings, we demonstrated that response patterns can be outperformed by theoretically derived and empirically supported faking indices, at least for IATs, where such indices are available. This is in line with our expectation and can be explained by the fact that machine learning performs best if the input data are all relevant for classification. Thus, focusing on relevant input data only (e.g., indices that reflect empirically supported faking strategies) works better than including all IAT trials, on some of which participants do not fake at all.

Faking Detection With Logistic Regression and Random Forest is Superior to XGBoost
Whereas Calanna et al. (2020) showed that faking detection with XGBoost was superior to faking detection with random forest and logistic regression, Boldt et al. (2018) demonstrated that logistic regression worked best. In combining the detection of faking on self-reports and IATs, our research showed that in general, logistic regression and random forest worked comparably well, and logistic regression outperformed XGBoost. Calanna et al. (2020) focused on faking on a self-report and on faking high scores only, whereas Boldt et al. (2018) restricted their research to faking on an IAT. Thus, faking conditions most likely impact the performance of classifiers and thereby have to be taken into consideration when choosing which classifiers to use to detect faking.
Moreover, the level of measurement of input variables (continuous vs. categorical) may impact the performance of different machine learning algorithms. For instance, in many cases, logistic regression works better with continuous predictors (i.e., response patterns, scores, and faking indices in IATs as well as scores in self-reports) than with categorical predictors (i.e., response patterns in self-reports), whereas one strength of random forest is that its performance is excellent with categorical predictors. Thus, the level of measurement of input variables should also be taken into consideration when choosing potential machine learning algorithms.
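To make the distinction between continuous and categorical predictors concrete, the following sketch shows two ways in which the same self-report responses could be handed to a classifier. It is an illustration under stated assumptions: the source does not report how predictors were encoded, and the 1-5 response scale and both function names are hypothetical.

```python
def as_continuous(responses):
    """Treat each response as one numeric predictor (one value per item)."""
    return [float(r) for r in responses]


def as_one_hot(responses, categories=(1, 2, 3, 4, 5)):
    """Treat each response as categorical: one 0/1 indicator per category."""
    encoded = []
    for r in responses:
        if r not in categories:
            raise ValueError(f"response {r!r} is not a valid category")
        encoded.extend(1 if r == c else 0 for c in categories)
    return encoded


# Invented responses of one participant to three items.
responses = [1, 5, 3]
print(as_continuous(responses))  # three predictors
print(as_one_hot(responses))     # fifteen indicator predictors
```

The categorical encoding multiplies the number of predictors by the number of response options, which is one reason why the choice of algorithm may interact with the level of measurement.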

Which Behavior Revealed Fakers?
Exploring the importance of features provides insight into the processes of faking and into its detection. On a general level, for IATs, participants' speed-accuracy setting (i.e., IAT_a) was consistently the most important feature for detecting faking, whereas the results on the response pattern in self-reports were more diverse. On self-reports, self-descriptions concerning activity (extraversion), handling of time (conscientiousness), enjoyment of problem-solving (need for cognition), and self-satisfaction (self-esteem) were the most important for revealing faking on a general level, but there was much variation between faking conditions. Thus, overall, there was considerably more correspondence across IATs than across self-report measures, which especially supports the generalizability of findings for the detection of faking with faking indices on the IAT. Nevertheless, to a certain extent, our results allow for a look into the black box of faking processes in self-report measures. So far, there is little theoretical background to explain why some items strongly discriminated between fakers and non-fakers, whereas others were less important. However, research using a cognitive interview technique revealed that people evaluated the importance of an item in terms of the situational demand (e.g., Ziegler, 2011). If participants judge an item as important with regard to the situation, they will attempt to fake on that item, but they will not attempt to fake on items they regard as unimportant regarding their faking goal. According to Ziegler (2011), people use specific knowledge and implicit theories about the desired impressions to evaluate item importance. Further, the stakes of the situation may impact the evaluation of what is important (Ziegler, 2011).
In our studies, for example, participants were confronted with a personnel selection scenario, which most likely triggered specific knowledge and implicit theories about the characteristics of an ideal employee (e.g., Klehe et al., 2012). In our studies, the ideal employee on a general level may be described as someone who is active, does not waste time, enjoys problem-solving, and is happy with themselves. Still, there were differences with respect to faking conditions, and thus, there were no front-runners in feature importance across conditions. All in all, there is some evidence that, depending on the respective faking conditions, people consider different items to be relevant and thereby fake on different items. In addition, the following insights were indicated by more fine-grained analyses of feature importance.
First, the classifier used more than one feature (i.e., more than one faking index on IATs or more than one item on self-reports, respectively) to distinguish fakers from non-fakers. This finding is in line with the assumption that faking occurs through several pathways (Bensch et al., 2019; Röhner & Schütz, 2019). At maximum, all features were used (i.e., six features on IATs, up to 16 features on self-reports) for classification. Second, feature importance varied with respect to faking conditions. This finding shows that faking differs between conditions and that faking is consequently detected on the basis of different behaviors. The feature that had the largest impact on the classification varied with respect to faking direction. Concerning self-reports, different items (features) were considered to be most important for classification when detecting the faking of low scores and when detecting the faking of high scores. Also, the rank-orderings of features typically differed between the faking of high and low scores. With IATs, the strategy to adapt speed-accuracy tradeoffs was most important for both faking directions. This finding is in line with previous research that demonstrated that faking impacts the extent to which participants prioritize accuracy or speed in decision-making (Röhner & Lai, 2021; Röhner & Thoss, 2018). As in self-reports, the rank-orderings of features typically differed between the faking of high and low scores, except for informed faking. In other words, faking on IATs becomes more uniform with information. Thus, in line with previous theorizing (e.g., Bensch et al., 2019; Röhner & Schütz, 2019), two different processes appear to be behind the faking of high versus low scores. However, how different these processes are depends on the type of measure. Besides the differences with respect to faking direction, the rank-orderings of the importance of features also varied with respect to practice trials.
On self-reports, practice in faking impacted the way participants responded to items when faking low scores in a naïve manner, but its impact was smaller when they faked high scores. By contrast, variation in the ordering of feature importance concerning the IAT was low: IAT participants used very similar faking strategies irrespective of practice levels. Last but not least, the ordering of the importance of features varied with respect to whether participants faked naïvely or in an informed manner on self-reports but not on IATs. Thus, informed fakers were detected on the basis of other features than naïve ones on self-reports, but on IATs, the features were similar between the two.
To sum up, feature importance analyses underpin prior theories that faking processes differ (e.g., Bensch et al., 2019). However, not only do they shed light on the question of how people fake under different faking conditions, but they also show that faking detection, in line with different faking behavior, occurs along very different pathways. Nevertheless, especially with regard to the self-report measures, correspondence across conditions is limited. Moreover, the statistical power differed between conditions. Thus, the generalizability of these results is a relevant issue for future research.
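One way to quantify how strongly the rank-orderings of feature importance agree between two faking conditions is a rank correlation. The sketch below computes Spearman's rho for untied importance values; it is offered as an illustration only, as the source does not state that such a coefficient was computed, and the importance values and feature labels are invented.

```python
def ranks(values):
    """Rank values from most (rank 1) to least important; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(importance_a, importance_b):
    """Spearman's rho for two importance vectors over the same features."""
    n = len(importance_a)
    if n != len(importance_b) or n < 2:
        raise ValueError("need two vectors of equal length (at least 2)")
    rank_a, rank_b = ranks(importance_a), ranks(importance_b)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented importance values for six hypothetical features f1..f6.
condition_1 = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
condition_2 = [0.24, 0.41, 0.16, 0.09, 0.07, 0.03]  # top two features swapped
condition_3 = list(reversed(condition_1))           # fully reversed ordering

print(spearman_rho(condition_1, condition_2))
print(spearman_rho(condition_1, condition_3))
```

Values near 1 would indicate that the same features matter in the same order across conditions, whereas values near 0 or below would indicate diverging orderings.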

Limitations
We considered a large number of variables that impact faking and its detection in order to advance knowledge about faking and its detection with machine learning. Nevertheless, our study is limited in that our data came only from participants who were instructed to fake. However, we purposefully did not include data from applied settings. Not only does instructing participants to fake represent the most common methodology that is used to investigate faking (Smith & McDaniel, 2012), but it also provides valuable insights into the extent to which people can fake and into the strategies people apply when asked to fake (Smith & Ellingson, 2002; Smith & McDaniel, 2012). This was what we were interested in and what we needed for our analyses. If the motivation to fake in applied settings had been the focus of our research, we would have preferred to use data from applied settings. So, on the one hand, the data fit our research goal. On the other hand, there is one even more important reason for not including data from applied settings. In applied settings, participants are usually not instructed to fake, which creates a circular problem if researchers want to investigate the detection of faking. To classify fakers and non-fakers, one has to know first who was trying to fake, and this is exactly what the research is trying to find out. Instead of applying other faking indices that bear their own risks of misclassification, we decided to restrict ourselves to using instructed faking sets. Although faking has been suggested to be the sum of at least two substantive sources of variance (i.e., traits and faking; e.g., Bensch et al., 2019; Ziegler et al., 2015), variance shared across multiple traits could still be affected by various response sets and response styles. Thus, in applied settings without experimental manipulations, faking is not the only type of response distortion that occurs.
To avoid this problem, we chose laboratory settings and experimentally manipulated faking by explicitly asking participants to fake in order to minimize the activation and impact of other response sets and response styles that might cloud the results (e.g., acquiescence, midpoint or extreme point responding, carelessness). Thus, future research should investigate whether the analytical procedures tested here can be generalized to other response sets and response styles.
Faking strategies can also differ between settings (e.g., applied settings vs. laboratory settings). On the one hand, it seems plausible that they are more diverse in applied than in laboratory settings (e.g., because of more diverse test-taker characteristics that stipulate more diverse faking strategies). On the other hand, even the contrary might be the case. Faking strategies could be less diverse in applied settings because of certain information, such as one prominent test-cracking manual or training that recommends one "most successful" faking strategy. These factors most likely impact the success of detecting faking with machine learning. Future research should investigate whether the procedures applied here can be generalized to real-world faking.
Furthermore, the machine learning approach that we applied in our study is based on the assumption that faking can be considered a dichotomous variable with two categories (i.e., faking and non-faking). This reasoning is supported by previous research that has demonstrated that faking can be grouped into distinct latent classes (Zickar et al., 2004) and is also in line with previous procedures that aimed to detect faking with machine learning (e.g., Calanna et al., 2020). However, there is also evidence that faking could be considered a continuous variable (i.e., it can be measured at any level of precision; Geiger et al., 2021; Geiger et al., 2018; Ziegler et al., 2015). Using dichotomous variables to predict continuous variables can result in information loss, and thus, in nonoptimal findings. Future research should therefore compare the results of attempts to measure faking as a dichotomous versus a continuous variable.
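The information loss that comes with dichotomization can be illustrated numerically. In the following sketch, all values are invented: a hypothetical continuous "extent of faking" is cut at an arbitrary threshold, and its association with a hypothetical behavioral indicator is compared before and after the cut. It is a toy demonstration under these assumptions and not a reanalysis of the data used here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def dichotomize(values, cutoff):
    """Map each value to 1 (at or above the cutoff) or 0 (below it)."""
    return [1 if v >= cutoff else 0 for v in values]


# Invented data for eight hypothetical participants.
indicator = [1, 2, 3, 4, 5, 6, 7, 8]  # some behavioral indicator
extent = [0.05, 0.10, 0.20, 0.35, 0.55, 0.70, 0.85, 0.95]  # continuous faking
label = dichotomize(extent, cutoff=0.5)  # faker vs. non-faker

print(len(set(extent)), "distinct values before,", len(set(label)), "after")
print(pearson(indicator, extent))  # association with the continuous variable
print(pearson(indicator, label))   # association with the dichotomized variable
```

In this invented example, participants with clearly different extents of faking receive the same label, and the association with the indicator is weaker for the dichotomized variable than for the continuous one.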
Also, we restricted ourselves exclusively to using faking indices that have already been empirically validated in past research because we wanted to avoid intermingling potential concerns about the validity of faking indices with the validity of the machine learning approach. The applied indices differ in their meaning and limitations. Slow_Co, IncErr_Co, Slow_In, and Accel_Co have been theoretically derived and empirically shown to indicate faking. However, they can be used only when data are available from both a baseline and a faking condition, which researchers do not always have at hand. The same is true for CTS, which in addition is a bit difficult to interpret as it confounds a substantial IAT effect (i.e., a difference between compatible and incompatible effects; here, between different IATs) and a possibly superimposed faking strategy (e.g., intentional slowing in an IAT phase). By contrast, Ratio 150-10000 can be applied without participants' baseline data. In addition, it is a very intuitive index of relative slowing on the compatible or incompatible phase relative to the preceding single blocks. Not only have IAT_a and IAT_t0 been shown to be related to faking, but both indices additionally (and in contrast to the other indices) also correspond with a theoretical model (the drift diffusion model; e.g., Klauer et al., 2007). However, not only do they represent faking, but they also reflect substantial differences. For example, IAT_a reflects differences in participants' perceptions of task difficulty, and IAT_t0 reflects interferences during the selection of responses (Schmitz & Voss, 2012). Thus, in contrast to other indices, IAT_a and IAT_t0 should not be interpreted as pure faking indices. Future research might evaluate additional experimental indices (e.g., standard distribution of reaction times) and compare them against indices that have already been empirically validated with machine learning.
Interestingly, feature importance was more consistent in IATs than in self-reports. Self-reports lack empirically validated faking indices, but such indices were used in our analyses on faking in IATs and performed best there. Thus, feature importance was most likely more consistent for IATs than for self-reports because the input data for IATs (faking indices), as compared with those for self-reports (response patterns) were superior in predicting faking. In combination with varying sample sizes, this might explain the differences between IATs and self-reports. The small amounts of data in certain conditions do not warrant tests of generalizability on the basis of multiple independent data sets, which might be a relevant extension of future research. Nevertheless, the results emphasize that a machine learning approach works best when input data are relevant for classification (e.g., Plonsky et al., 2019) as is the case with validated faking indices. By contrast, using large amounts of data (e.g., response patterns) that are partly irrelevant for the classification problem (e.g., trials or items that are not faked at all) does not necessarily improve classification. Instead, focusing on relevant input data (e.g., validated indices) has the potential to outperform classification with response patterns and scores.

In a Nutshell: Can Machine Learning Assist in Faking Detection?
Under naïve faking, the detection of faking was better on self-reports than on IATs, whereas this was not the case under informed faking. Thus, the type of measure plays a role, and nontransparent measurement procedures lead to lower success in faking detection, but this effect disappears with practice or information. In general, faking detection was superior for the faking of low scores compared with the faking of high scores. This finding is in line with prior theorizing that faking low and high represent different processes. This assumption is also backed up by feature importance analyses because the features that can be used to detect the faking of low scores typically differed from the ones that can be used to detect the faking of high scores. Faking detection was also superior for informed as compared with naïve faking. Thus, the good news is that test-cracking manuals might aid the detection of fakers because naïve faking is less homogeneous and, thus, more difficult to detect. Fakers could be spotted comparably well regardless of their practice levels. Thus, information about how to fake is more relevant than practice in faking. Similarly, whereas the choice of construct impacted faking detection under naïve faking, it did not under informed faking or when participants had practice. Also, fakers were spotted best by machine learning with empirically validated faking indices or response patterns and worst by the use of scores, especially when there were long response patterns. Last but not least, the machine learning algorithm affected the quality of faking detection. As a consequence of the interplay of these conditions, faking detection varied from chance levels to 100%.

Conclusion
Faking detection indeed resembles the work of a pathologist. By carefully anatomizing faked responses, our results showed that faking conditions largely impact faking behavior and thereby affect the quality of faking detection with machine learning. Additionally, faking behavior is reflected in different input data, which then impact the quality of faking detection. Moreover, the type of machine learning algorithm impacts the quality of faking detection. Our analyses provided insights into faking processes and can explain why faking detection is such a complex endeavor. Not only do fakers fake on different pathways when confronted with different faking conditions, but in most cases, more than one pathway is used for faking. Thus, it is challenging to find typical traces left by fakers, which renders faking detection with machine learning a promising approach. However, a variety of factors that impact how well (from chance levels to excellent) machine learning works in faking detection have to be taken into consideration in this endeavor.

Author Note
We want to express our gratitude to the people who participated in our studies and to the students who helped collect the data over the years: Rose Bose, Anna Dirk, Christina Doukas, Hannes Duve, Elke Hütten, Nathalie Käther, Hannah Klink, Carmen Möller, Franziska Nötzold, Nadine Richtsteiger, Anna-Marie Rudat, Elisabeth Tenberge, Stefan Wachter, and Claudia Wetzel.
This research was partly funded by a grant from the equal opportunities office at the University of Bamberg. The funding source was not involved in designing the study or analyzing the data.
Funding: Open Access funding enabled and organized by Projekt DEAL.
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.