Forced-choice questionnaires have long been used to prevent faking and other response distortions (Cao & Drasgow, 2019; Jackson et al., 2000; Saville & Willson, 1991). However, recent studies show that discrete forced-choice formats often yield reliabilities that are too low for individual diagnostics and, moreover, allow comparisons between individuals only to a limited extent (Bürkner et al., 2019; Schulte et al., 2020). One reason for this is that, psychometrically, the selection or rejection of a forced-choice item within an item block provides less information about the trait score than a rating (i.e., Likert) item does.

A promising solution that might combine the high information yield of Likert items with the faking resistance of the forced-choice format is the graded paired comparison (GPC). In a GPC, two items are placed at opposite ends of a rating scale, and respondents indicate the degree to which they prefer one item over the other (De Beuckelaer et al., 2013; Huber & Holbrook, 1982). GPCs are thus a combination of Likert-type rating items and conventional dichotomous forced-choice tasks with two items. The two included items maintain the characteristic property of forced-choice scales that not all items can be fully endorsed simultaneously, because one item’s gain is the alternative item’s loss. At the same time, the graded response provides information about the relative preference of one item over the other, increasing the information about the latent traits. Figure 1 shows a GPC sample item.

Fig. 1 Example of a graded paired comparison (GPC)

So far, little is known about GPCs as a response format for personality measures. This pre-registered study was designed to investigate the potential merits of GPCs over Likert items regarding their ability to resist faking attempts in the context of personnel selection. Additionally, it investigates whether the expected reliability gains are sufficient to achieve a measurement error small enough for individual diagnostic purposes in practice. Although a recent meta-analysis emphasizes the influence of item desirability on the fakeability of forced-choice questionnaires (Cao & Drasgow, 2019), little is known about what to consider when matching items on desirability during the construction of forced-choice item blocks. To inform test constructors about desirability matching, which is perhaps the most important feature of forced-choice tests, we investigate in detail the effects of item and trait desirability as well as their association with the keyed direction of items (i.e., the sign of the factor loading).

Forced-Choice Questionnaires as a Potential Remedy for the Problems Associated with Self-Reports

Questionnaires designed to measure personality, attitudes, interests, or values usually use self-report measures with Likert scales. On a Likert scale (Likert, 1932), test-takers rate their agreement with self-describing statements, e.g., from 1 (strongly disagree) to 5 (strongly agree). However, these items are subject to a number of biases, such as faking in a socially desirable direction, acquiescence (confirmation tendency), exaggerated coherence between items of different traits (“halo” effect), and several others (Paulhus & Vazire, 2007; Wetzel & Greiff, 2018; Wetzel et al., 2016).

These distortions may compromise the validity of the conclusions derived from Likert scales (Christiansen et al., 2005), especially in high-stakes situations such as personnel selection (Birkeland et al., 2006; Christiansen et al., 2005) and performance appraisal (Brown et al., 2017), when clinically relevant constructs are measured (Young, 2018), or when the measured traits are particularly undesirable, such as “dark” personality traits (Guenole et al., 2018; Paulhus & Jones, 2014). Response biases can also be problematic when differentiation is particularly important, such as in market research or career advice (Wang et al., 2017). Problems can also arise in applications like comparative cultural research, where the groups under comparison may differ in these biases (J. Lee et al., 2002).

As a solution to these issues, forced-choice response formats have been suggested, in which respondents have to decide between two or more items, thus preventing many Likert scale biases by design (e.g., Hontangas et al., 2015; Jackson et al., 2000; Saville & Willson, 1991; Wetzel et al., 2016). To distinguish between forced-choice format types, we use the term discrete forced-choice format for the classical format whose responses can be decomposed into binary comparisons (its item blocks may consist of more than two items) and the term GPC for two items with a rating scale in between. Discrete forced-choice items, however, are associated with two central problems: ipsativity and loss of information in comparison with Likert scales. A person parameter estimate is ipsative if all measured dimensions add up to the same total for each individual (Clemans, 1966). Therefore, a respondent’s score on one dimension depends on their scores on all other dimensions, making inter-individual comparisons based on ipsative scores questionable (Cattell, 1944; Hicks, 1970). This is highly problematic for many applications of forced-choice questionnaires, e.g., in personnel selection, where the comparison between applicants is the central objective (Johnson et al., 1988). Quasi-ipsative measures (e.g., Rasheed & Robie, 2023) only partially solve these problems.

To derive normative parameter estimates from forced-choice responses, Thurstonian IRT models were introduced (Brown & Maydeu-Olivares, 2011). An estimate is normative if it indicates the relative trait level compared to other subjects of the population distribution (Cattell, 1944). Nevertheless, the loss of information relative to Likert items remains problematic. Simulation studies suggest that even when scored with Thurstonian IRT models, conventional forced-choice items yield low reliabilities in most applied conditions (Bürkner et al., 2019; Schulte et al., 2020). Because in GPCs respondents indicate their preference for one item over the other on a rating scale instead of only discretely selecting one of them, more information is captured about their standing on the latent traits. This is a promising starting point for achieving higher reliabilities compared with discrete forced-choice methods.

Graded Paired Comparisons as a Psychometric Method

In the past, GPCs have primarily been used to investigate consumer preferences (Agresti, 1992; Alfaro-Rodriguez et al., 2005; De Beuckelaer et al., 2013; Ofir, 2004). For example, in marketing research, GPCs are often employed in the context of conjoint analyses (e.g., Scholz et al., 2010). GPCs are also known by slightly different names, such as constant sum paired comparisons (Skedgel et al., 2015) or ordinal paired comparisons (Agresti, 1992).

Recently, Brown and Maydeu-Olivares (2018) proposed applying GPCs in personality assessments, as GPCs entail advantages over discrete forced-choice formats: First, while discrete forced-choice formats worsen participant reactions (Borman et al., 2023), GPCs can lead to greater acceptance, as participants are not forced as strictly to choose between options that might describe them equally (in)appropriately (Dalal et al., 2019). Second, the aforementioned information gain achieved by re-introducing a rating scale might help to increase reliability. Indeed, simulations demonstrate higher levels of reliability for graded compared with discrete forced-choice formats (Lingel et al., 2022). Unfortunately, re-implementing a rating scale also brings back typical method-related response biases such as extreme responding, but empirical results suggest that these effects are negligible (De Beuckelaer et al., 2013). Further, test-takers still have to contrast the two opposing statements within one comparison, which means they cannot fully endorse all desirable statements.

Faking in Graded Paired Comparisons

Whether GPCs have the potential to actually be more faking resistant in high-stakes situations has not, to our knowledge, been empirically tested. In the following, we present recent findings on the fakeability of discrete forced-choice formats as well as Likert scales. Based on this, we discuss which faking effects can be expected for GPCs, especially in comparison to Likert scales as a standard method of questionnaire-based personality measurement. Further, we will propose two important boundary conditions of faking in GPCs, namely the faking intention of the individual respondent and the difference in social desirability of both items involved in a given GPC.

Overall Faking Effects

Are people able to deliberately distort their answers in personality questionnaires, for instance, in order to present themselves more favorably when applying for a job? The simple answer to this question is yes. According to meta-analyses, respondents can improve their scores substantially if they are instructed to do so (meta-analytically by \(d=.73\) in Edens & Arthur, 2000; depending on the trait, effect sizes range from \(d=.47\) for agreeableness to \(d=.93\) for emotional stability in Viswesvaran & Ones, 1999). In situations where participants were faced with real-life incentives to distort (e.g., money or a job), the distortion seems to be somewhat weaker (\(d=.30\) in a meta-analysis by Edens & Arthur, 2000; see also Birkeland et al., 2006; Martínez & Salgado, 2021).

We are not aware of any empirical evidence regarding the specific fakeability of GPCs in personality assessment. However, since the theoretical rationale for GPCs’ potential to resist faking is the same as for discrete forced-choice formats, we will review the results of the corresponding research. Concerning the effects of faking on discrete forced-choice measures, recent meta-analyses show that respondents do alter their scores. Meta-analytic effect sizes for the overall mean score inflation between the honest and the faking conditions are \(d=.06\) (Cao & Drasgow, 2019) and \(d=.41\) (Speer et al., 2023; see also Martínez & Salgado, 2021). Thus, test-takers can inflate their scores in both Likert and forced-choice questionnaires. Because essentially the same mechanism is supposed to inhibit response distortion in forced-choice items and GPCs (i.e., one item’s gain is the alternative item’s loss), we assume that a certain amount of score inflation will also be found in GPCs:

  • Hypothesis 1: Both rating scales and graded paired comparisons are fakeable. The average trait scores are higher (in the socially desired direction) when respondents try to distort them than when they answer honestly.

However, the meta-analytic faking effects are smaller for forced-choice formats (\(d=.41\)) than for Likert scales (\(d=.75\); Speer et al., 2023; see also Cao & Drasgow, 2019). In addition to meta-analyses, which consider the fakeability of one response format (Likert vs. forced-choice) in isolation, individual studies that have drawn a direct comparison can also be used to assess the relative susceptibility to faking. In fact, the vast majority of these studies found faking-induced score inflation to be smaller for forced-choice scales (Christiansen et al., 2005; Huber, 2017; Jackson et al., 2000; Lee et al., 2019; Pavlov et al., 2019; Vasilopoulos et al., 2006); but see Heggestad et al. (2006) for contrary results.

Thus, although test-takers can elevate their mean trait scores in forced-choice questionnaires when motivated to do so, the magnitude is substantially smaller than with Likert scales. As GPCs adopt the crucial forced-choice characteristic of contrasting different statements, they should also make it harder to endorse all desirable items, thus impeding faking. We therefore expect:

  • Hypothesis 2: Graded paired comparisons exhibit a higher resistance to faking attempts than Likert scales.

When analyzing such faking effects, two different approaches can be taken: The first is mean score inflation, as described in the previous section; this is calculated by subtracting the honest score from the faked score and then dividing the result by the standard deviation of the honest scores. The second approach is to assess the correlation between honest and faked trait scores (Pavlov et al., 2019). In this study, we will test fakeability with both approaches. As we have already discussed, we expect lower mean score inflation in the GPC condition.
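Formally, the two indices can be written as follows (restating the definitions above; \(\theta\) denotes an estimated trait score):

\[
d_{\text{inflation}}=\frac{\bar{\theta}_{\text{faked}}-\bar{\theta}_{\text{honest}}}{SD\!\left(\theta_{\text{honest}}\right)}, \qquad r_{\text{honest, faked}}=\operatorname{cor}\!\left(\theta_{\text{honest}},\,\theta_{\text{faked}}\right).
\]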

  • Hypothesis 2a: The mean score inflation between honest and faked answers is lower in graded paired comparisons than in Likert scales.

The mean score inflation approach is most suitable for assessing fakeability in questionnaires in which a fixed threshold must be reached, e.g., in order to advance to the next round of an application procedure. However, in many selection processes, absolute values are irrelevant, and instead, the best candidates relative to all applicants are selected. For such situations, the correlation-based approach shows how well the scores from a faking condition mirror the honest scores. It is sensitive to relative changes of trait scores between respondents, and, unlike the mean score inflation approach, it does not implicitly assume constant score shifts for all respondents. This makes it particularly important for potential faking in personnel selection, since faking in this context is closely linked to changes in applicants’ rank order and, therefore, directly relevant to selection decisions. We thus use the correlation between scores of the honest and faking condition as a second measure in which a potentially enhanced resistance of GPCs to faking attempts (as addressed in Hypothesis 2) should be manifested. In this respect, findings for the discrete forced-choice format are mixed. While in one study forced-choice scores from an applicant condition explained more “true” trait variance than their Likert scale counterparts (Christiansen et al., 2005), other studies have failed to consistently show this supposed superiority of discrete forced-choice scales (Guan, 2015; Heggestad et al., 2006; Pavlov et al., 2019). This might be a consequence of reliability issues associated with discrete forced-choice measures (Guan, 2015; Pavlov et al., 2019) or, for the study by Heggestad and colleagues, an artifact of ipsative scoring methods. Importantly, reliability issues could be counteracted by the GPC format, and ipsative scoring artifacts could be avoided by scoring responses with Thurstonian IRT models, as these yield normative trait score estimates. Furthermore, in the study by Heggestad and colleagues, honest scores were obtained exclusively with a Likert scale, resulting in a common method bias in favor of Likert scales. We therefore assume that GPCs can preserve more of the test-takers’ “true” variance than Likert scales can in a study design that addresses these shortcomings, namely by collecting data for each of the two test formats under both honest and faking conditions and by scoring the GPCs with Thurstonian IRT models, which yields normative scores:

  • Hypothesis 2b: The correlation of honest and faked trait scores will be higher in graded paired comparisons compared with Likert scales.

The Role of Individual Faking Intention

As stated before, respondents differ in how much they fake their answers. In personnel selection contexts, estimates suggest that about 40 to 60% of applicants actually alter their scores (Donovan et al., 2003; Griffin & Wilson, 2012; Griffith et al., 2007). Further, there is considerable variance in the extent to which respondents fake their answers (Rosse et al., 1998). Even in studies where participants are asked to fake as much as possible, test-takers differ in the degree to which they fake (e.g., McFarland & Ryan, 2000; Pavlov, 2015). Individuals with a stronger intention to present themselves positively will change their answers more between the honest and the applicant condition. As a consequence, the association of trait scores from the honest and faking condition will be weaker with increasing intention to fake. This moderating effect has recently been demonstrated for discrete forced-choice response formats (Pavlov, 2015; Pavlov et al., 2019). Likewise, we assume that the association of honest and faked trait scores (Hypothesis 2b) will be influenced by the individual intention to fake for both the Likert and the GPC format:

  • Hypothesis 3: As the individual faking intention increases, the association of honest and faked scores will decrease across the two questionnaire formats.

While the theory on response distortion and the empirical findings for discrete forced-choice tasks are in agreement here, another observation calls for more research: In the Pavlov et al. (2019) study, the association of honest and faked trait scores decreased to an equal extent for both the Likert and the forced-choice format as individual faking intention increased. To explain this, the authors mention the lower reliability of forced-choice measurements or the fact that manipulating one trait’s indicator (i.e., item) in forced-choice formats inevitably leads to changes in another trait’s indicator, which could not be distinguished in their design (Pavlov et al., 2019). The increased reliability of GPCs (Brown & Maydeu-Olivares, 2018) makes it possible to test whether the empirical results are in line with the theory of a reduced fakeability of forced-choice formats when the reliabilities of both formats are comparable. In line with the assumption that forced-choice techniques can resist even stronger faking attempts, we hypothesize:

  • Hypothesis 4: The association of honest and faked trait scores will be less affected by the individual faking intention in the GPC format compared with the Likert format.

Social Desirability of Items

Based on the assumptions that GPCs are also fakeable to a certain degree and that subjects can intentionally control their faking behavior, the question arises as to how exactly the process of response distortion takes place. Which specific item properties do respondents use when shifting from an honest answer to a distorted response? We assume that respondents prefer more socially desirable items over less socially desirable ones. This assumption emerges from the mechanism by which forced-choice techniques are postulated to reduce faking: by forcing a choice between equally socially desirable items. One could argue that social desirability matching should be a prerequisite for forced-choice tests and that, therefore, unequal desirabilities should not occur. In practice, however, it is virtually impossible that all items of a forced-choice questionnaire that are compared with each other are exactly equally socially desirable. Moreover, many scholars do not match their items for social desirability at all (Cao & Drasgow, 2019). This tendency is amplified by the requirements of Thurstonian IRT models, whose reliability critically depends on the inclusion of unequally keyed items in the questionnaires (Brown & Maydeu-Olivares, 2011; Bürkner et al., 2019; Schulte et al., 2020). Unequally keyed item blocks contain positively and negatively keyed items that are unlikely to be equally desirable. We believe that this basic assumption about the functioning of forced-choice formats, which has so far hardly been tested empirically, should be examined more closely, as it has fundamental implications for test design and the usefulness of the Thurstonian IRT approach. The impact of desirability differences on faking behavior can be tested more efficiently with GPCs than with discrete forced-choice formats, because responses shift along a more fine-grained ordinal scale, so that even the effects of small desirability differences can be detected. Thus, we will test the following hypothesis:

  • Hypothesis 5: The mean score inflation for a given graded paired comparison is positively correlated with the difference in perceived desirability of its two statements.

The hypotheses, the method, and the data analysis procedure of this study were specified prior to the start of data collection. For the preregistration form, see https://osf.io/8emj7.

Method

Participants

Based on a power analysis assuming an expected effect size of \(d=.16\) for the mean score inflation in forced-choice questionnaires (see the meta-analysis by Cao & Drasgow, 2019), an \(\alpha\) level of \(.05\), a power of \(.80\), and equal sample sizes in all conditions, the intended sample size was 618 observations. Simulations show that sample sizes of \(N=300\) are sufficient to estimate Thurstonian IRT models for GPCs (Lingel et al., 2022).
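One way to reproduce this target with standard software is a paired t test power calculation per format group (a sketch in R; we assume here that \(d=.16\) refers to the within-subject honest vs. faking difference in each of the two format groups, which may differ from the exact procedure used in the preregistration):

```r
# Sketch: planned sample size, assuming a paired (honest vs. faked) t test
# with d = .16, alpha = .05, and power = .80 in each of the two format groups.
library(pwr)

res <- pwr.t.test(d = 0.16, sig.level = 0.05, power = 0.80, type = "paired")
n_per_group <- ceiling(res$n)   # approximately 309 participants per group
n_total     <- 2 * n_per_group  # approximately 618 observations in total
```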

Participants were recruited via an online panel of professionals (PsyWeb), social media channels (Facebook, Xing, LinkedIn, Instagram), the online platform SurveyCircle, and the student participant pools at the universities of Münster and Ulm (Germany). The data collection took place between September 2019 and February 2020.

Of the original 591 participants, seven were excluded because they were younger than 18 years, one person’s skills in the study language (German) were insufficient, and five individuals asked for their data to be excluded from the analysis after they had completed the study. Moreover, four people were excluded due to implausibly short response times in combination with extremely monotonous response patterns. Furthermore, one person was excluded solely due to a very short overall response time. All included participants provided data on all study variables.

Thus, the final sample size was \(N=573\) (78% female). Participants’ ages ranged from 18 to 56 (\(M=27.50\), \(SD=9.93\)). In our sample, the majority (67%) were students, 20% were employed full-time in a large variety of jobs, and 9% were part-time employees, while 4% were unemployed. The highest educational level was a secondary school certificate for 3%, a university entrance qualification for 61%, and a university degree for 36% of the participants. As an incentive, participants were offered feedback on their ability to fake in applicant settings and psychology students could earn course credits. Additionally, the four “best” applicants were rewarded with 50 € each.

Design

We conducted an online experiment with a 2 × 2 mixed design. Participants completed either a GPC personality measure (\(n=283\)) or the Likert version of the same questionnaire (\(n=290\)). We used a between-subject design for this factor to ensure sufficient attention of the participants over the entire duration of the online experiment. After the participants had first answered the questionnaire honestly, we presented a job advertisement for the position of a project manager. The participants were then asked to answer the same questionnaire again, but in a way that maximized their chances of getting the job. We used a within-subject design for the honest/faking factor, as several of our hypotheses had to be tested with regression analyses. Additionally, meta-analyses show that within-subject designs detect faking effects better (Edens & Arthur, 2000; Viswesvaran & Ones, 1999). Respondents were randomly assigned to the two response format conditions, with equal group sizes being enforced for the last participants. At the end of the applicant condition, respondents were asked how much they had tried to fake their answers (referred to as individual faking intention in the following). Finally, both groups rated the desirability of the traits that had been measured and of the personality items that had been used.

Materials

We used work-related personality items developed for a personnel selection test in a large German governmental organization. They measured the four Big Five dimensions that predict work performance for the main job profile in this organization, namely emotional stability, extraversion, agreeableness, and conscientiousness.

Likert Personality Questionnaire

The questionnaire in the Likert condition consisted of 42 self-describing statements, 10 each measuring extraversion and agreeableness and 11 each measuring neuroticism and conscientiousness. Participants indicated their level of agreement on a 7-point Likert scale from 1 (do not agree at all) to 7 (fully agree). Empirical reliabilities are reported in Table 1 and can be interpreted as high. Comparisons of measurement accuracy between conditions should be made on the basis of scale-independent reliability estimates and not on the basis of SEs, as the variance of factor scores differs between conditions (see Table 5 in the online supplement).

Table 1 Reliabilities for Likert items and graded paired comparisons

GPC Personality Questionnaire

The questionnaire in the GPC condition consisted of exactly the same items as the Likert version; all items from the Likert version, and only these, were used. However, in the GPC version, some items were used in several item pairs to achieve sufficient reliability at a level similar to the Likert version. In sum, the questionnaire in the GPC condition consisted of 119 item pairs (95 two-dimensional, 24 unidimensional), of which 48 (two-dimensional) GPCs were unequally keyed. Participants were asked to express the degree to which they preferred one of the two statements within a GPC over the other on a 9-point rating scale ranging from 1 (the statement on the left describes me very much better) to 9 (the statement on the right describes me very much better). Indicators of measurement precision are reported in Table 1. The GPC reliabilities were at the same level as or slightly higher than those of the Likert scales, and the SEs were equal or lower. However, this comparison should be interpreted with caution, because the GPC scales were longer and included more item presentations than the Likert scales in our study. Nevertheless, it can be stated that the GPCs achieved excellent reliability.

The item pairs were taken from an existing test and were initially not matched for social desirability. The main analyses and hypothesis tests are all based on this long questionnaire with both equally and unequally (i.e., mixed) keyed items. However, for additional explorative analyses, we conducted all hypothesis tests again for two further questionnaires, which were formed from subsets of the item pairs of the full questionnaire described above. First, the unequally keyed item pairs were excluded, resulting in a questionnaire of 71 equally keyed item pairs. Second, we exclusively analyzed those 50 item pairs with the lowest social desirability difference (for details on social desirability measurement, see below). The desirability difference within an item pair of this questionnaire was less than or equal to \(1.17\), measured on a 7-point scale, and all of these item pairs were equally keyed. Data for these analyses were taken from the complete questionnaire version. Reliabilities over both the honest and the faking condition were sufficient to excellent for all traits in both additional questionnaire versions, ranging from .76 to .91. The mean standardized factor loading was \(M=0.72\) \(\left(SD=0.12\right)\) for the complete questionnaire, which is similar to the CFA-based loadings of the Likert items (\(M=0.73\), \(SD=0.17\)). Factor loadings were \(M=0.84\) \(\left(SD=0.06\right)\) for the equally keyed GPC questionnaire and \(M=0.80\) \(\left(SD=0.09\right)\) for the GPC questionnaire with item pairs matched for social desirability.

Job Advertisement

A job advertisement for the position of a project manager was used to create a realistic context for the faking condition that would be similarly appealing to persons with varying professional backgrounds. It consisted of a general task description and a desired personality/skill profile. The profile listed four general competencies, each corresponding to one of the measured dimensions (e.g., for agreeableness: “You are cooperative and sensitive to the needs of your team members”). The job description was designed in such a way that the participants would perceive all measured characteristics as equally desirable for the job described.

Individual Faking Intention

As proposed by Pavlov (2015), the individual intention to fake was measured by asking participants to what extent they faked their answers in order to present themselves positively with respect to the position of a project manager. The item used an 11-point rating scale ranging from 0 (I did not distort my answers at all) to 10 (I distorted my answers strongly). In all regression analyses, faking intention was \(z\)-standardized across both formats.

Perceived Social Desirability

We measured the social desirability of all 42 single items and the 4 global traits on a 7-point scale ranging from 1 (very undesirable) to 7 (very desirable). Participants were instructed to rate the desirability with respect to the position of a project manager.

Trait Score Estimation and Standardization

In this section, we briefly describe the estimation procedures for participants’ trait scores in the Likert and GPC conditions as well as the standardization procedures, which are essential for interpreting the results.

Trait Score Estimation

We estimated GPC trait scores (person parameters) with Thurstonian IRT models for graded preference data (Brown & Maydeu-Olivares, 2011, 2018). See also Bürkner (2022) for a more detailed model description.
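As a schematic illustration (our rendering in the notation of Brown & Maydeu-Olivares, 2011; the exact parameterization used in the software may differ in details), the graded response to a GPC comparing items \(i\) and \(k\) can be written as an ordinal outcome of a latent utility difference:

\[
y_{ik}^{*} = \lambda_i\,\eta_{d(i)} - \lambda_k\,\eta_{d(k)} + \varepsilon_{ik}, \qquad \varepsilon_{ik}\sim N\!\left(0,\ \psi_i^{2}+\psi_k^{2}\right),
\]
\[
y_{ik}=c \quad\Longleftrightarrow\quad \tau_{ik,c-1} < y_{ik}^{*} \le \tau_{ik,c}, \qquad c=1,\dots,9,
\]

where \(\lambda\) are factor loadings, \(\eta_{d(i)}\) is the latent trait measured by item \(i\), \(\psi^2\) are uniqueness variances, and \(\tau_{ik,0}=-\infty<\tau_{ik,1}<\dots<\tau_{ik,8}<\tau_{ik,9}=\infty\) are the thresholds of the 9-point response scale for the pair \((i,k)\), which also absorb the item intercepts.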

For parameter estimation, we used the thurstonianIRT package (Bürkner, 2019) in R (R Core Team, 2020). Within this package, we specified an ordinal model using the Bayesian software Stan (Carpenter et al., 2017) as the underlying engine and the EAP estimator. For all model parameters, we used the default priors from the thurstonianIRT package, which are weakly informative and do not notably change the estimates compared with frequentist software (Bürkner et al., 2019). We estimated factor scores from the honest and faking condition in a common model in order to place both types of scores on the same scale. Only to test whether the assumption of identical item parameters in both conditions notably affects the results, we additionally estimated one model for each of the two conditions (Zhang et al., 2020); the factor loadings of the two models were highly correlated (\(r=.98\)). All further analyses are based on the joint model. Two participants reached extreme values on several traits and were excluded from the factor score-based regression analyses (Hypotheses 1 to 4); their exclusion had no consequences for the acceptance or rejection of hypotheses.

To keep GPC and Likert scores as comparable as possible, trait scores for the Likert condition were estimated using a multidimensional graded response IRT model (Samejima, 1969), which is ordinal as well. As software, we used the R package mirt (Chalmers, 2012) with the Metropolis–Hastings Robbins–Monro (MHRM) algorithm and MAP factor scores. Sensitivity analyses showed no differences between the selected MHRM estimator and other estimation methods for multidimensional models, namely Monte Carlo Expectation Maximization and Quasi-Monte Carlo Expectation Maximization; the estimates of all methods correlated at \(r=.99\) or higher.
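For illustration, a minimal sketch of this estimation step in R (object names and the column order of the items are illustrative, not taken from the published analysis code):

```r
# Sketch: multidimensional graded response model for the 42 Likert items,
# estimated with MHRM; MAP factor scores serve as trait score estimates.
# 'likert_df' is assumed to hold the item responses in trait-blocked order.
library(mirt)

model_syntax <- mirt.model("
  ES = 1-11
  EX = 12-21
  AG = 22-31
  CO = 32-42
  COV = ES*EX, ES*AG, ES*CO, EX*AG, EX*CO, AG*CO
")
fit   <- mirt(likert_df, model = model_syntax, itemtype = "graded", method = "MHRM")
theta <- fscores(fit, method = "MAP")  # person parameters for the four traits
```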

Trait Score Standardization

To facilitate the interpretation of the regression models, we standardized all trait scores with the mean and standard deviation of the honest scores (within each questionnaire format). As a consequence, each trait score can be interpreted as the difference from the average honest score in honest \(SD\)s. For example, if person A has a faking (i.e., applicant) score of 2 on extraversion, this means that this score is two honest \(SD\)s above the average honest score on extraversion. Likewise, Likert and GPC honest scores will have a mean of zero and a standard deviation of one, while the mean of the faked scores will reflect the average difference of Likert/GPC faked scores from the average Likert/GPC honest score in honest \(SD\)s. Therefore, the means of the faked scores can be interpreted as Cohen’s \(d\) effect sizes.
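In code, this standardization amounts to the following (a minimal sketch; `honest` and `faked` are vectors of estimated scores for one trait within one format):

```r
# Sketch: scale both conditions by the honest-condition mean and SD so that
# the mean of the faked scores is directly interpretable as Cohen's d.
standardize_by_honest <- function(honest, faked) {
  m <- mean(honest)
  s <- sd(honest)
  list(honest = (honest - m) / s,
       faked  = (faked  - m) / s)
}
```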

The instructions, data, the R code for all analyses, a code book, and an online supplement with additional results are available on OSF (https://osf.io/fx4yz/).

Results

As intended by the experimental manipulation (job profile), all traits were perceived as highly desirable for the position in the job advertisement (\(M=6.61\), \(SD=0.84\) for emotional stability, \(M=6.18\), \(SD=0.95\) for extraversion, \(M=5.73\), \(SD=1.17\) for agreeableness, and \(M=6.82\), \(SD=0.59\) for conscientiousness; all measured on a 7-point scale). Agreeableness had comparatively lower but still high trait desirability ratings. The faking instruction within the applicant scenario combined with the financial incentive also appears to have worked: The individual faking intention was relatively high, with an average value of \(7.71\) (\(SD=2.30\)) in the Likert and \(7.73\) (\(SD=2.53\)) in the GPC condition, both measured on the 11-point scale described above. The faking intention did not significantly differ between groups (\(\Delta M=0.02\), 95% CI \(\left[-0.37,0.42\right]\), \(t\left(569.23\right)=0.10\), \(p=.920\)). The effects of demographic variables on faking strength were negligible (see Table 8 in the online supplement). In this regard, no differences were found between students and professionals either. Unless mentioned otherwise, all analyses refer to the complete version of the GPC questionnaire. The results for the other two versions (equally keyed and matched for social desirability) do not differ substantially; they are addressed at the end of the Results section.

Table 2 shows the inter-correlations of trait estimates for GPCs (lower triangle) and Likert scales (upper triangle) for the honest and faking condition. The GPC inter-trait correlation estimates are similar, but not identical, to their Likert counterparts. Inter-trait correlations are higher in the faking condition than in the honest condition, which is a typical effect in application situations and indicates that an “ideal applicant” factor emerged in both formats.

Table 2 Correlation matrix for GPC (lower triangle) and Likert (upper triangle) factor scores

Due to the inter-correlations of traits (Table 2) and the heterogeneity of variances, we tested all hypotheses with multivariate linear models that account for heterogeneous error variances across the range of predictors. We estimated all multivariate models with the R package brms for Bayesian regression modeling (Bürkner, 2017, 2018) based on Stan (Carpenter et al., 2017). See the OSF online supplement for the model equations. The regression coefficients of these models can be interpreted like those of separate multiple regressions, but the credible intervals take the inter-correlations of traits into account. In the Bayesian analyses, we consider a result significant if the value under the null hypothesis lies outside the 95% credible interval. Since these methods model the data better than the pre-registered tests for Hypotheses 1 and 2 (\(t\) tests and bivariate correlations) while maintaining the basic idea of the pre-registered methods, we decided to deviate from the original analysis plan in this respect. We report results for the original analysis plan in the online supplement on OSF. The adjustments to the analysis plan had no effect on the acceptance or rejection of hypotheses. Our analysis approach is based on the regression-based moderation framework suggested by Pavlov et al. (2019).
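To illustrate the structure of these models, a sketch in brms syntax of the simplest case (trait scores regressed on the dummy-coded condition, as used for Hypothesis 1 below; data frame and variable names are ours, not taken from the published analysis code):

```r
# Sketch: multivariate Bayesian regression of the four trait scores on the
# condition dummy (honest = 0, faking = 1), with residual correlations among
# the traits. The published models additionally allow the residual variances
# to differ across predictors via a distributional part (e.g., sigma ~ condition).
library(brms)

bform_h1 <-
  bf(mvbind(ES, EX, AG, CO) ~ condition) +
  set_rescor(TRUE)

fit_h1 <- brm(bform_h1, data = d, chains = 4, cores = 4, seed = 1)
summary(fit_h1)  # 95% credible intervals for the condition (faking) effects
```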

We now report the results of our main analyses, which address the question of whether and under what conditions GPCs are faked, and how they compare to Likert scales in this regard. Hypothesis 1 stated that average trait scores are higher when respondents try to distort them than when they answer honestly. Table 3 shows the model results for the regression of trait scores on the condition (dummy coded with honest = 0 and faking = 1). The intercepts represent the predicted values for the honest condition, which are zero for all traits due to the standardization procedure. The slope coefficients represent the predicted values for the faking condition and can be interpreted as Cohen’s \(d\) effect sizes, that is, they represent the mean difference between the honest and the faking condition with the standard deviation of the honest condition as units. Across all traits, effects for both formats are large and \(95\mathrm{\%}\) CIs for slope parameters (i.e., the differences between honest and faking condition) do not include zero. Thus, results support Hypothesis 1, that is, participants were able to fake both questionnaire types when they were instructed to do so.

Table 3 Multivariate regression results with the condition as the predictor of trait scores

Hypothesis 2a stated that GPC scores are less inflated than Likert scores. To test this hypothesis, we regressed the faked scores, which can be interpreted as the scores’ inflation from the honest to the faking condition, on the questionnaire format (dummy coded with Likert = 0 and GPC = 1) while controlling for the honest scores and the interaction of honest scores and questionnaire format. At an honest score of 0, the score inflation was significantly lower for GPCs than for Likert scales in three out of four traits (see credible intervals for the format coefficients in Table 4). Consequently, respondents seem less able to raise their scores on GPCs than on Likert scales (as predicted in Hypothesis 2a).

Table 4 Multivariate multiple regression results with the interaction of honest scores and format as predictors for faked scores

Hypothesis 2b stated that the correlation of honest and faked scores would be higher in GPCs compared with Likert scales. We tested this on the basis of the interaction term of honest score and questionnaire format (see Table 4). The interaction terms are significantly different from 0 for all traits but agreeableness, and their signs indicate an effect contrary to the hypothesis and to what we would expect based on the results for the mean score inflation: The association of honest and faked scores is higher for Likert scales than for GPCs. This is consistent with the results of trait-wise direct comparisons of the honest–faking correlations for Likert scales and GPCs (see the OSF online supplement), which likewise do not support Hypothesis 2b but suggest that honest and faked scores are less strongly correlated in GPCs than in Likert scales.

The Role of Individual Faking Intention

Next, we analyzed the impact of participants’ individual faking behavior as a potential boundary condition of the observed faking effects. We hypothesized that the association of participants’ honest and faked scores decreases the more they distort their responses in the faking condition, regardless of the test format (Hypothesis 3). To investigate this hypothesis, we ran multivariate multiple regression models with standardized variables separately for the Likert and GPC formats, as sketched below. Faked trait scores were regressed on the corresponding honest scores, individual faking intention, and the interaction of these variables. Assuming a positive regression coefficient for honest scores, the interaction term should have a negative sign, thus reducing the regression weight of honest scores as faking intention increases.
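A sketch of this model in brms syntax (variable names are illustrative; one such model is fitted per format, and the distributional variance parts are omitted here for brevity):

```r
# Sketch: faked trait scores regressed on honest scores, faking intention,
# and their interaction (Hypothesis 3), with residual correlations.
library(brms)

bform_h3 <-
  bf(ES_fake ~ ES_honest * intention) +
  bf(EX_fake ~ EX_honest * intention) +
  bf(AG_fake ~ AG_honest * intention) +
  bf(CO_fake ~ CO_honest * intention) +
  set_rescor(TRUE)

fit_h3_gpc <- brm(bform_h3, data = subset(d_wide, format == "GPC"),
                  chains = 4, cores = 4, seed = 1)
```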

The interaction coefficients were negative in all cases and significant for all traits but agreeableness in the Likert format and for all traits in the GPC format (see CIs for the interaction terms in Table 5). Figure 2 plots the interactions of honest scores and individual faking intention. For Likert scales (upper row of plots), we see that the slope becomes flatter the higher the faking intention becomes, that is, the more respondents try to fake, the less the “true” values are mirrored in their faked responses. The same is true for GPCs (lower row), but here, we see that when the faking intention is high (i.e., one SD above the mean), the association of honest and faked scores is actually mostly negative (see negative slopes in three out of four traits). We will return to this point in the explorative results section.

Table 5 Multivariate multiple regression results with the interaction of honest scores and individual faking intention as predictors for faked scores
Fig. 2 Interaction plots for Hypothesis 3

The differences between the Likert and the GPC format concerning the effects of faking intention were addressed by Hypothesis 4: We hypothesized that the individual faking intention would affect the association between honest and faked scores in GPCs less than in the Likert format. To examine this issue, we again ran multivariate multiple regression analyses regressing faked trait scores on the format (Likert vs. GPC), the honest scores, individual faking intention, and all possible interactions of these variables.

The relevant statistics for Hypothesis 4 are the regression coefficients of the three-way interaction \(Format\times Honest\ Scores\times Faking\ Intention\). As shown in the section on Hypothesis 3, the association of honest and faked scores (i.e., the regression weight for the predictor \(Honest\ Scores\)) becomes smaller as faking intention increases (i.e., a negative coefficient for the \(Honest\ Scores\times Faking\ Intention\) interaction term). If this effect is more pronounced for the Likert format, the coefficient of the three-way interaction should be positive, shifting the coefficient of the \(Honest\ Scores\times Faking\ Intention\) interaction closer to zero when switching from the Likert format (reference group) to the GPC format. However, the three-way interaction effects were very small and non-significant for all four traits (regression coefficients and 95% credible intervals were \(-0.03\) \(\left[-0.13,0.06\right]\) for emotional stability, \(-0.03\) \(\left[-0.12,0.06\right]\) for extraversion, \(-0.10\) \(\left[-0.25,0.05\right]\) for agreeableness, and \(-0.03\) \(\left[-0.14,0.08\right]\) for conscientiousness). Thus, Hypothesis 4 is not supported. For all coefficients of the regression model, see the online supplement.

The Role of Perceived Item Desirability

Hypothesis 5 aimed at explaining which item characteristics respondents draw on in GPCs to produce the faking effects observed above. We hypothesized that the mean score inflation for a given GPC is positively associated with the difference in perceived desirability of its two statements. To test this, we computed the difference of perceived desirability for each GPC by subtracting the average perceived desirability of the left statement from the average perceived desirability of the right statement within each GPC. We also calculated the mean score inflation for each GPC by subtracting the raw mean score of a given GPC in the honest condition from its corresponding raw mean score in the faking condition. Therefore, the higher the absolute value of the score inflation, the more the mean for this item has changed between the honest and faking condition.
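In code, this item-level computation amounts to the following (a sketch with illustrative object names; `honest_resp` and `faked_resp` are respondent-by-GPC matrices of raw 1–9 responses from the honest and faking condition, and `des_left`/`des_right` are the average desirability ratings of each GPC’s left and right statement):

```r
# Sketch: per-GPC desirability difference and raw mean score inflation (H5).
des_diff  <- des_right - des_left                          # one value per GPC
inflation <- colMeans(faked_resp) - colMeans(honest_resp)  # one value per GPC

cor.test(des_diff, inflation)       # association of the two item-level indices
fit_h5 <- lm(inflation ~ des_diff)  # unstandardized regression (cf. Fig. 3)
coef(fit_h5)                        # intercept = predicted inflation at zero difference
```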

The correlation between the desirability difference of the two items in a GPC and the score inflation of that GPC is \(r=.94\), 95% confidence interval \([.91\), \(.96]\), \(t\left(117\right)=29.21\), \(p<.001\). Moreover, the intercept of the unstandardized regression of the score inflation on the desirability difference is \(b=0.10\), 95% confidence interval \([0.00\), \(0.21]\). Thus, when both items are equally desirable (i.e., their desirability difference is zero), the predicted score inflation is only \(.10\) \(SD\)s. Results are displayed graphically in Fig. 3, where red points represent equally keyed GPCs and turquoise points represent unequally keyed GPCs. Equally keyed GPCs clearly show less absolute score inflation than unequally keyed GPCs, forming nearly perfectly separable classes with very little overlap.

Fig. 3 Scatter plot and unstandardized regression line for the influence of desirability difference on score inflation in GPCs

Explorative Analyses

In the following, we report analyses that were not pre-registered and that we performed with knowledge of the results presented above. In this context, we are particularly interested in the counterintuitive results on the interaction of honest scores and individual faking intention in the prediction of faked scores, as well as in the influence of item desirability on response behavior.

In Fig. 2, GPC honest and faked scores are mostly negatively associated when faking intention is high. We wanted to analyze these effects more closely and to rule out the possibility that they are a statistical artifact of the model specification, in which potential non-linear effects could only be captured by the interaction term, so that the data may have been modeled inadequately. Therefore, we allowed for linear and quadratic effects of both predictors and in their error variances. Figure 4 shows the same association from a different perspective, with faking intention as a continuous variable. The color indicates the trait score in the faking condition, and the lines connect constant predicted values. Two different types of respondents reached high scores in the faking condition: first, those with a high faking intention and low scores in the honest condition, and second, those with high honest scores but a low faking intention. Respondents with medium or even high levels of both predictors seem to obtain lower scores in the faking condition than those with a very high honest score and a very low faking intention, or vice versa. Even though the patterns are very consistent across the traits, the results have to be interpreted with caution because the trait-wise credible intervals of the honest scores’ quadratic effects all include zero.

Fig. 4 GPC interaction plots for Hypothesis 3 with continuous variables and quadratic effects

We also conducted additional analyses regarding social desirability. The results for Hypothesis 5 suggest that removing the unequally desirable item pairs from the GPC questionnaire might reduce its fakeability. We therefore ran the regression models for Hypotheses 1 to 4 again with the two other versions of the GPC questionnaire, namely the equally keyed version and the version with the 50 item pairs with the lowest desirability difference. The Likert questionnaire remained unchanged. The results of these analyses are presented in the online supplement. They indicate no improvement in the faking resistance of the GPC factor scores, neither when unequally keyed item pairs were removed nor when item pairs were matched for social desirability.

Discussion

The main purpose of this study was to judge the utility of GPCs for personality assessment in high-stakes situations such as personnel selection. As expected, participants managed to fake both Likert and GPC scales, i.e., the mean trait scores were higher when respondents tried to distort them than when they answered honestly. However, the mean score inflation was lower in GPCs compared with Likert scales. Remarkably, the reduced score inflation did not translate into a closer association of honest and faked trait scores. Even though the GPC scores are less inflated, they do not seem to capture the real differences between respondents more adequately in situations where respondents are motivated to distort. Thus, the assertion that GPCs would exhibit a higher faking resistance was only partially supported. One factor on the respondent side that influences the association of honest and faked scores is the individual faking intention. In both questionnaire types, scores can be manipulated if the respondent tries to do so. In Likert scales, the scores in the faking condition increased with higher faking intentions and higher scores in the honest condition; if one of the two (faking intention or honest trait score) was highly pronounced, the other showed little additional effect. In contrast, in GPCs, trait scores in the faking condition were highest when either the honest trait scores or the faking intention was high and the other variable was low. The association of honest and faked trait scores was even mostly negative when the faking intention was high. Moreover, the association of these variables in GPC questionnaires could be non-linear, but our evidence is not conclusive in this respect. Finally, an extremely important questionnaire feature that determines the response inflation in a given GPC is the desirability difference between the two items involved. This desirability difference is much larger in unequally keyed item pairs than in equally keyed ones.

Theoretical Implications

The results confirm and expand our knowledge about the forced-choice format in general and GPCs in particular. The meta-analytic finding that forced-choice scores are inflated when respondents are motivated to fake, but less so than in Likert questionnaires (Cao & Drasgow, 2019; Speer et al., 2023), seems to apply to GPCs as well. The observed effect sizes were large, and, if no further precautions are taken to avoid faking, the association of honest and faked scores is low. Compared with most other faking studies and meta-analytic estimates, our effect sizes are relatively large for both GPC and Likert questionnaires. Note that in faking studies, the effect sizes depend on several design decisions. Faking effects are typically higher in instructed-faking designs such as ours than in studies with real-life motivational distortion (Cao & Drasgow, 2019; Edens & Arthur, 2000). This is because designs with faking instructions directly manipulate the motivation to distort, which is more heterogeneous in real application contexts (i.e., some applicants do not distort at all; Donovan et al., 2003; Griffin & Wilson, 2012; Griffith et al., 2007). By omitting the null effects of respondents without faking motivation, we isolated the effects of respondents trying to improve their scores. Thus, our results are representative of people who are willing to show high levels of impression management in order to be selected. These are exactly the applicants for whom faking reduction methods such as forced-choice formats were developed. For such conditions, our results suggest that GPCs can reduce score inflation to some extent compared with Likert scales, but they still raise questions about the construct validity of personality tests when administered to faking-motivated respondents. Specifically, meta-analyses (e.g., Barrick & Mount, 1991; Schmidt & Hunter, 1998) may overestimate the actual predictive validity of personality tests for this part of the applicant population: A closer look at the corresponding studies shows that they rely on participants already employed by a company, who may have a lower faking motivation than job applicants.

The lower correlation between honest and faked trait scores for forced-choice formats compared with Likert items has previously been observed for discrete forced-choice formats (Guan, 2015; Heggestad et al., 2006; Pavlov et al., 2019). However, considering our study’s design and analysis strategy, we can rule out alternative interpretations mentioned in earlier studies. In particular, in our study, neither the ipsative classical test theory scoring method (Heggestad et al., 2006) nor the lower reliability of discrete forced-choice formats (Guan, 2015; Pavlov et al., 2019) is a valid explanation. Instead, respondents with medium or high honest scores reached systematically lower scores in the applicant condition than those who had low true (i.e., honest) scores. The negative association of honest and faked scores in a subpopulation diminishes the overall correlation for the full population. Furthermore, non-linear effects might negatively influence the level of correlation. Together with the diminished correlations between honest and faked scores in forced-choice questionnaires reported previously (Guan, 2015; Heggestad et al., 2006; Pavlov et al., 2019), these observations point to a severe psychometric weakness of the forced-choice format. We assume that this effect is caused by artifacts of the interdependent nature of the response process and/or format. Each response change in favor of one item inevitably causes a change in the response to another item and, therefore, in the corresponding trait score estimate. The consequences of these manipulations are difficult for respondents to foresee and might negatively affect scores on traits that they did not intend to affect.

Our study also aimed to reveal the questionnaire properties that respondents use to decide on the shift from an honest answer to distorted responses. In this respect, social desirability seems to be crucial. The desirability difference between the two items of a pair and the score inflation in that pair are very closely associated. Thus, if the difference between the desirabilities of two items in a given situation is known, it can be used to predict almost perfectly how much responses to these items are distorted. Moreover, GPCs have the potential to reduce mean response shifts from the honest to the faking condition to an absolute minimum if item desirabilities are exactly matched. In addition, results at the raw response level suggest very clearly that unequally keyed item pairs are faked particularly strongly. The responses to these item pairs changed considerably more between the honest and faking conditions than responses to equally keyed items did. This is most likely due to the large desirability differences within these item pairs. The present study provides empirical support for the assumption that in unequally keyed item pairs, one item typically represents the desired end and one the undesired end of a trait continuum, resulting in large desirability differences (Bürkner et al., 2019; Schulte et al., 2020).

Based on the high score inflation in item pairs with larger social desirability differences (which particularly concerns unequally keyed item pairs), it would have been reasonable to assume that the score inflation in Thurstonian IRT factor scores could be reduced by removing these item pairs. However, this was not the case in our explorative analyses. This is counterintuitive at first view but is in line with the meta-analysis by Cao and Drasgow (2019), which found lower faking effects for normative than for ipsative scores in discrete forced-choice questionnaires. Ipsative scores rely on sums of raw responses (comparable with classical test theory scoring for Likert scales). Thus, changes in raw responses directly translate into changes in trait scores. In contrast, normative scoring methods like Thurstonian IRT (as used in our study) weight each response differently. Items that are strongly faked by almost all respondents contribute little to precise trait score estimation under faking conditions due to the homogeneous responses to these items. We assume that this is the reason why the results remained nearly unchanged when the most strongly faked item pairs were removed in the explorative analyses.

Another noteworthy observation from the present study relates to the faking differences between traits, especially with regard to agreeableness. In line with meta-analytic results (Speer et al., 2023), traits that were perceived as more desirable seemed to have been faked more strongly. Note that social desirability is context-specific. Items and traits that are highly desirable for one job can be undesirable for another job and vice versa. Accordingly, faking behavior will depend on the job context as well.

Practical Implications

GPCs represent an important extension of forced-choice-based response formats. In particular, they seem to allow for highly reliable trait estimates even with only equally keyed item pairs, which successfully eliminates a major weakness of discrete forced-choice formats (Bürkner et al., 2019; Schulte et al., 2020). In this respect, they are preferable to discrete forced-choice formats. Regarding their fakeability, it seems that GPCs have similar strengths and weaknesses as have been reported for discrete forced-choice formats. They can reduce the score inflation in job applicants or other respondents who are motivated to fake in high-stakes situations, but for practical applications, they can currently be recommended only to a limited extent. GPCs seem to be less capable than Likert scales of recovering applicants’ true aptitude relative to other applicants. To put it simply, GPCs are less helpful than Likert scales in answering the question of whether candidate A is better suited than candidate B in a situation where both are motivated to fake. Thus, these scales should not be used to make important decisions until these problems have been resolved. Further research must show whether this is possible.

When test developers attempt to construct forced-choice questionnaires, they should pay particular attention to the social desirability of the items, as the responses can only be expected to be relatively free of faking effects if the items to be compared are equally socially desirable. Accordingly, questionnaire development should start with an item pool that is large enough to exclude items that cannot be paired based on their desirability. It is reasonable to assume that unequally keyed item pairs/blocks contribute little or nothing to construct-valid trait estimation in high-stakes situations, as they are strongly faked.

On the other hand, since the exclusion of differently socially desirable pairs had no significant influence on the strength of the faking effects at the factor score level, it can also be argued that these item pairs do not cause any harm. This holds at least for IRT-based analysis methods, since homogeneously answered items can be weighted less due to the little information they provide. If one follows this argumentation, however, excluding unequally desirable item pairs would still be advisable, because in this case they would be unnecessary for the factor score estimation and would thus extend questionnaire length without any need. Further, our study used an induced faking design, which leads to higher and more similar intentions to distort. In practice, not all respondents fake to an equal extent; some applicants simply respond honestly. This leads to more heterogeneous responses, which in turn increases the amount of (partly distorted) information in the faked items and thus the influence of these item pairs on trait scores. Consequently, the need to exclude these items might be even more pronounced under real applicant conditions.

Limitations and Future Research

Since sufficient levels of reliability can be achieved with GPCs, the most severe remaining weakness of GPCs (and forced-choice formats in general) seems to be their limited ability to recover the true rank order of respondents’ trait scores. Further research is necessary to determine why this is the case, and doing so would require gaining a better understanding of the faking processes in forced-choice questionnaires. Although social desirability seems to play a major role here, the exact response processes and their consequences for the estimated factor scores remain largely unknown. Further, the negative effects of high faking intentions need to be examined in depth, as it would be interesting to learn whether this phenomenon is caused by the cognitive response process, the response format, response strategies specific to this format, or the analysis method.

One question that arises in the construction of GPCs concerns the use of a middle response category. On the one hand, equal preferences do occur, and the middle category gives respondents the chance to indicate them correctly. On the other hand, an even number of response categories could force more (but possibly artificial) differentiation. It is also possible that the middle category is particularly attractive under faking conditions. The advantages and disadvantages of a middle category should, therefore, be examined in terms of how well it is able to capture true trait values.

Finally, as some of our results are explorative, they should be interpreted with caution until successfully replicated.

Conclusion

To our knowledge, this study is the first to investigate the performance of the GPC format under high-stakes (i.e., faking) conditions. The graded response format can provide reliable personality measures and should, thus, be considered as an alternative to discrete forced-choice formats. From a research perspective, it allows for more detailed insights into the general response distortion in forced-choice response formats. Our study also indicates that some important aspects should be considered when designing faking-resistant forced-choice questionnaires. The social desirability of the items, which in turn is closely linked to item keying, is of outstanding importance: it has a very high impact on response behavior and score inflation. Furthermore, the faking effects we found resemble those of discrete forced-choice formats: forced-choice questionnaires constructed according to current knowledge can reduce score inflation but do not seem to outperform Likert scales in recovering applicants’ true trait standing. Although the mean score inflation paradigm has previously dominated faking research, especially on forced-choice questionnaires, it does not seem to account for important disadvantages of forced-choice-based formats. If the effects found here are also found for discrete forced-choice formats, which earlier studies suggest (Guan, 2015; Heggestad et al., 2006; Pavlov et al., 2019), then the assessment of the relative fakeability of forced-choice vs. Likert scales should be revised. To judge whether the issues we found can be resolved by improved questionnaire design and/or alternative analysis methods, a better understanding of the faking process and its effects on trait estimates is necessary.