When queried about sensitive personal attributes, some respondents conceal their true status by responding untruthfully, presenting themselves in a socially desirable manner (Krumpal, 2013; Marquis, Marquis, & Polich, 1986; Tourangeau & Yan, 2007). To increase respondents’ willingness to respond honestly, indirect questioning procedures such as the randomized response technique (Warner, 1965) enhance the confidentiality of individual answers to sensitive questions. Consequently, prevalence estimates for sensitive personal attributes obtained through indirect questioning are considered more valid than prevalence estimates based on conventional, direct questioning. However, the use of indirect questioning relies on the assumption that participants understand all instructions and understand how the procedures increase privacy protection (Landsheer, van der Heijden, & van Gils, 1999). Violation of this assumption is potentially at odds with a method’s acceptance and the validity of its results. In this study, employing a quasi-experimental design, we investigated the influence of questioning technique and education on comprehension and perceived privacy protection. Four indirect questioning techniques were investigated, and a conventional direct question served as a control condition.

Indirect questioning techniques

To minimize bias due to respondents not answering truthfully to sensitive questions, Warner (1965) introduced the randomized response technique (RRT). With the original RRT procedure, respondents are confronted simultaneously with two related questions: a sensitive question A (“Do you carry the sensitive attribute?”) and its negation, question B (“Do you not carry the sensitive attribute?”). Participants answer one of these two questions, depending on the outcome of a randomization procedure whose result is known only to the respondent, not the experimenter. When using a die as a randomization device, for example, respondents might be asked to answer question A if the die shows a number between 1 and 4 (randomization probability p = 4/6), and to answer question B if the die shows a 5 or 6 (p = 2/6). Hence, a “Yes” response does not allow conclusions regarding a respondent’s true status: He or she might be a carrier of the sensitive attribute who was instructed to respond to question A, or a noncarrier instructed to respond to question B. Since the randomization probability p is known, the proportion π of carriers of the sensitive attribute can be estimated at the sample level from the observed proportion λ of “Yes” responses, because λ = pπ + (1 – p)(1 – π) (Warner, 1965). Since the collection of individual data related directly to the sensitive attribute is avoided, respondents queried about sensitive topics are expected to answer more truthfully when asked indirectly rather than through direct questioning (DQ). Prevalence estimates obtained via the RRT are therefore expected to exceed DQ estimates, and this pattern has been found repeatedly (Lensvelt-Mulders, Hox, van der Heijden, & Maas, 2005). However, RRT estimates that did not differ significantly from DQ estimates, and even estimates that were higher under DQ than under the RRT, have also been reported (e.g., Holbrook & Krosnick, 2010; Wolter & Preisendörfer, 2013). Moreover, given identical sample sizes, RRT estimates are always accompanied by a higher standard error than DQ estimates, since the randomization adds unsystematic variance to the estimator (Ulrich, Schröter, Striegel, & Simon, 2012).
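Solving this equation for π yields Warner’s estimator. The following minimal Python sketch (the function name and counts are hypothetical illustrations, not taken from any study) computes the estimate and its standard error:

```python
import numpy as np

def warner_estimate(yes_count, n, p):
    """Prevalence estimate under Warner's (1965) RRT model.

    A "Yes" occurs with probability lambda = p*pi + (1 - p)*(1 - pi),
    so pi = (lambda + p - 1) / (2*p - 1), provided p != 0.5.
    """
    lam = yes_count / n                                 # observed "Yes" rate
    pi_hat = (lam + p - 1) / (2 * p - 1)                # method-of-moments estimator
    se = np.sqrt(lam * (1 - lam) / n) / abs(2 * p - 1)  # randomization inflates the SE
    return pi_hat, se

# Hypothetical example: 120 of 300 respondents said "Yes" with p = 4/6
print(warner_estimate(120, 300, 4/6))  # pi_hat = .20, SE = .08
```

The 1/|2p – 1| factor in the standard error makes explicit why, at identical sample sizes, RRT estimates are always less precise than DQ estimates.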

Following the original model from Warner (1965), various more advanced RRT models have been proposed that focus on optimizing the statistical efficiency, validity, and applicability of the method (e.g., Dawes & Moore, 1980; Horvitz, Shah, & Simmons, 1967; Mangat & Singh, 1990). Several reviews and monographs have provided detailed descriptions of RRT models and their applications (e.g., Chaudhuri & Christofides, 2013; Fox & Tracy, 1986; Umesh & Peterson, 1991). Below, we present four indirect questioning procedures that have been used in studies investigating the prevalence of sensitive personal attributes, and we compare them in terms of comprehensibility and perceived privacy protection.

The cheating detection model

With the cheating detection model (CDM; Clark & Desharnais, 1998), participants are confronted with a forced-response paradigm. After presentation of a single sensitive question, the outcome of a randomization procedure determines whether respondents answer this question truthfully (with probability p) or ignore it and answer “Yes” (with probability 1 – p). Since the outcome of the randomization procedure remains confidential, a “Yes” response does not allow conclusions concerning an individual’s status with respect to the sensitive attribute. Clark and Desharnais (1998) suspected that some participants disobey the instructions by responding “No” regardless of the outcome of the randomization, to avoid the risk of being marked as a carrier of a sensitive attribute. Consequently, three disjoint and exhaustive classes are considered within the CDM: carriers of the sensitive attribute responding truthfully (π), honest noncarriers (β), and respondents concealing their true statuses by answering “No” without regard for the instructions. Clark and Desharnais refer to the latter class as cheaters (γ). An example of a CDM question using a respondent’s month of birth as a randomization device is shown in Fig. 1.

Fig. 1 Example of a question regarding academic dishonesty as presented in surveys employing the cheating detection model (Clark & Desharnais, 1998). The respondent’s month of birth is used as a randomization device, with randomization probability p = 2/12 = .17

The CDM has been shown repeatedly to produce higher, and thus presumably more valid, prevalence estimates than direct questions or other indirect questioning techniques that do not consider instruction disobedience (e.g., Ostapczuk, Musch, & Moshagen, 2011). Validation studies frequently arrive at estimates of γ that exceed zero substantially, demonstrating the usefulness of a cheating detection approach (e.g., Moshagen, Musch, Ostapczuk, & Zhao, 2010). However, when γ > 0, the CDM provides only a lower and an upper bound for the proportion of carriers, since the true statuses of the respondents classified as cheaters are unknown. Hence, the true rate of carriers lies between π (if no cheater is a carrier) and π + γ (if all cheaters are carriers).
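Because two free parameters remain after the constraint π + β + γ = 1, estimation under the CDM requires two independent samples questioned with different randomization probabilities (Clark & Desharnais, 1998). A minimal sketch of the resulting moment estimators follows; the function name and all counts are hypothetical:

```python
def cdm_estimate(yes1, n1, p1, yes2, n2, p2):
    """Moment estimators for the cheating detection model.

    Each sample answers truthfully with probability p_i and is forced to
    say "Yes" with probability 1 - p_i. Honest carriers (pi) thus always
    answer "Yes", honest noncarriers (beta) answer "Yes" only when forced,
    and cheaters (gamma) always answer "No", so the expected "Yes" rate
    in sample i is lambda_i = pi + (1 - p_i) * beta.
    """
    lam1, lam2 = yes1 / n1, yes2 / n2
    beta = (lam1 - lam2) / (p2 - p1)   # solve the two-equation system
    pi = lam1 - (1 - p1) * beta
    gamma = 1 - pi - beta              # cheaters as the residual class
    return pi, beta, gamma

# Hypothetical example: two samples of 500 with p1 = 10/12 and p2 = 4/12
print(cdm_estimate(200, 500, 10/12, 350, 500, 4/12))  # pi = .30, beta = .60, gamma = .10
```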

The stochastic lie detector

Similar to the original RRT procedure (Warner, 1965), the more recently proposed stochastic lie detector (SLD; Moshagen, Musch, & Erdfelder, 2012) confronts respondents with a sensitive question A and its negation B. As in the modified RRT model proposed by Mangat (1994), however, only some of the participants are instructed to engage in randomization. Carriers of the sensitive attribute respond to question A unconditionally; if they respond truthfully, their answer is always “Yes.” Noncarriers respond to question A with randomization probability p, and to question B with probability 1 – p. Consequently, neither a “Yes” nor a “No” response unequivocally reveals a respondent’s true status. However, Moshagen et al. (2012) argued that some carriers of the sensitive attribute might feel a desire to lie and respond “No,” even if instructed otherwise. This assumption is represented by an additional parameter t, which denotes the proportion of carriers answering truthfully, whereas the remaining proportion of carriers (1 – t) are assumed to lie about their statuses. Noncarriers, in contrast, should have no reason to lie. An example of an SLD question is shown in Fig. 2.

Fig. 2 Example of a question regarding academic dishonesty using the stochastic lie detector (Moshagen et al., 2012). The respondent’s month of birth is used as a randomization device, with randomization probability p = 2/12 = .17

In a pilot study, application of the SLD resulted in a prevalence estimate for domestic violence that exceeded the estimate obtained using a direct question. Moreover, the SLD estimated the proportion of nonvoters in the 2009 German federal elections in concordance with the known true prevalence (Moshagen et al., 2012). In a second study, by Moshagen, Hilbig, Erdfelder, and Moritz (2014), cheating behaviors were induced experimentally to allow direct determination of the proportion of cheaters as an external validation criterion. Again, the SLD closely reproduced the known proportion of carriers of the sensitive attribute, whereas DQ produced an underestimate. In contrast to these results, a recent experimental comparison of the SLD with competing questioning techniques found that the SLD overestimated the known prevalence of a nonsensitive control question (Hoffmann & Musch, 2015). Although this mixed pattern of results might be explained in terms of sampling error, difficulties in understanding the SLD instructions offer an alternative explanation.
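Like the CDM, the SLD has two unknowns (π and t) and therefore requires two independent samples questioned with different randomization probabilities (Moshagen et al., 2012). A minimal sketch with hypothetical counts (the function name and numbers are ours):

```python
def sld_estimate(yes1, n1, p1, yes2, n2, p2):
    """Moment estimators for the stochastic lie detector.

    Expected "Yes" rate in sample i: lambda_i = pi*t + (1 - pi)*(1 - p_i),
    where pi is the prevalence and t the proportion of truthful carriers.
    """
    lam1, lam2 = yes1 / n1, yes2 / n2
    pi = 1 - (lam1 - lam2) / (p2 - p1)     # solve the two-equation system
    t = (lam1 - (1 - pi) * (1 - p1)) / pi  # truthful carriers among all carriers
    return pi, t

# Hypothetical example: two samples of 500 with p1 = 10/12 and p2 = 4/12
print(sld_estimate(178, 500, 10/12, 353, 500, 4/12))  # pi ~ .30, t ~ .80
```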

The crosswise model

A new class of nonrandomized response techniques has been proposed (Tian & Tang, 2014) that offers simplified assessment of the prevalence of sensitive attributes, since no external randomization device is required. One of the most promising candidates among these is the crosswise model (CWM; Yu, Tian, & Tang, 2008), because it offers symmetric answer categories (i.e., none of the answer options is a safe alternative that rules out identification as a carrier). With the CWM, participants are presented with two statements simultaneously: one referring to the sensitive attribute with unknown prevalence π, and a second referring to a nonsensitive control attribute with known prevalence p (e.g., a respondent’s month of birth). Participants indicate whether “both statements are true or both statements are false,” or whether “exactly one of the two statements is true (irrespective of which one).” As long as an individual respondent’s month of birth is unknown to the questioner, the CWM guarantees the confidentiality of respondents’ true statuses, presumably leading to undistorted prevalence estimates for sensitive attributes. Figure 3 shows an example of a CWM question.

Fig. 3 Example of a question regarding academic dishonesty using the crosswise model (Yu et al., 2008). The respondent’s month of birth is used as a randomization device, with randomization probability p = 2/12 = .17
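Formally, the probability of a “both true or both false” response under the CWM is λ = πp + (1 – π)(1 – p), which is identical in structure to Warner’s model, so the same estimator applies. A minimal sketch with hypothetical counts:

```python
# Under the CWM, P("both true or both false") = pi*p + (1 - pi)*(1 - p),
# so Warner's estimator applies, with p the known control-attribute prevalence.
lam = 140 / 300                       # hypothetical rate of "both/neither" responses
p = 2 / 12                            # known prevalence of the control attribute
pi_hat = (lam + p - 1) / (2 * p - 1)  # = .55 for these counts
print(pi_hat)
```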

In various studies, application of CWM resulted in higher prevalence estimates for sensitive attributes than did DQ (e.g., Coutts, Jann, Krumpal, & Näher, 2011; Kundt, Misch, & Nerré, 2013). An experimental comparison of CWM, SLD, and a DQ condition showed that the CWM and SLD prevalence estimates of xenophobia and Islamophobia exceeded those obtained via DQ (Hoffmann & Musch, 2015). In another study, the CWM estimated the known prevalence of experimentally induced cheating behavior accurately (Hoffmann, Diedenhofen, Verschuere, & Musch, 2015). Yu et al. (2008) argued that nonrandomized models are “easy to operate for both interviewer and interviewee” (p. 261), which offers an explanation for the promising results observed to date using the CWM.

The unmatched count technique

Introduced by Miller (1984), the unmatched count technique (UCT) also offers comparably simple instructions. Respondents are assigned randomly to an experimental or a control group, both of which are confronted with a list of nonsensitive statements. In the experimental group, the list additionally contains a sensitive statement. In both groups, respondents indicate how many, but not which, of the statements apply to them. Since the only disparity between the two groups is the addition of the statement referring to the sensitive attribute in the experimental group, the difference in the mean reported total counts estimates the proportion π of carriers of the sensitive attribute (Erdfelder & Musch, 2006; Miller, 1984). The individual statuses of respondents in the experimental group remain confidential as long as the total reported count is both different from zero (in which case all statements could be deduced to have been answered negatively) and different from the maximum possible count (in which case all statements, including the sensitive one, could be deduced to have been answered affirmatively). Experimenters should therefore take care to prevent such extreme counts by including a sufficient number of nonsensitive statements (Erdfelder & Musch, 2006; Fox & Tracy, 1986). An example of a UCT question with one sensitive and three nonsensitive items is shown in Fig. 4.

Fig. 4 Example of a question regarding academic dishonesty using the unmatched count technique (Miller, 1984), with one sensitive question (A) and three nonsensitive questions (B to D)
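Prevalence estimation under the UCT reduces to a difference in mean counts between the two groups. A minimal sketch with hypothetical data (the function name and counts are ours):

```python
import numpy as np

def uct_estimate(counts_exp, counts_ctrl):
    """Difference-in-means estimator for the unmatched count technique.

    The experimental list contains one extra (sensitive) item, so the
    difference between the mean counts estimates the prevalence pi.
    """
    counts_exp = np.asarray(counts_exp, dtype=float)
    counts_ctrl = np.asarray(counts_ctrl, dtype=float)
    pi_hat = counts_exp.mean() - counts_ctrl.mean()
    se = np.sqrt(counts_exp.var(ddof=1) / len(counts_exp)
                 + counts_ctrl.var(ddof=1) / len(counts_ctrl))
    return pi_hat, se

# Hypothetical counts from a 4-item experimental and a 3-item control list
print(uct_estimate([2, 3, 1, 2, 3], [1, 2, 1, 2, 2]))  # pi_hat = .60
```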

The UCT has repeatedly provided higher prevalence estimates for sensitive attributes than have DQ approaches (e.g., Ahart & Sackett, 2004; Coutts & Jann, 2011; Wimbush & Dalton, 1997). The comprehensibility of the UCT instructions and trust in the method were also found to exceed the corresponding values for the RRT and a conventional DQ approach (Coutts & Jann, 2011). These results, however, were limited to a comparison of the UCT with a forced-response RRT design, and comprehension was evaluated only by means of self-ratings, which can easily be distorted.

A meta-analytic evaluation of indirect-questioning studies (Lensvelt-Mulders et al., 2005) revealed that prevalence estimates obtained through the RRT largely meet the more-is-better criterion; that is, RRT estimates for socially undesirable attributes exceed estimates based on DQ, indicating increased validity, because they are less biased by social desirability. Another meta-analysis, of strong validation studies in which the known true prevalence of a sensitive attribute served as an objective criterion, found that the RRT yielded prevalence estimates that were substantially less biased than DQ estimates (Lensvelt-Mulders et al., 2005). However, some studies have reported RRT estimates that did not differ from (e.g., Kulka, Weeks, & Folsom, 1981), or were even lower than (e.g., Holbrook & Krosnick, 2010), DQ estimates. Moreover, in some strong validation studies, RRT estimates deviated substantially from known population values (e.g., Kulka et al., 1981; van der Heijden, van Gils, Bouts, & Hox, 2000). These results might be explained in terms of participants’ noncompliance with the instructions even under RRT conditions, especially in surveys covering highly sensitive personal attributes (e.g., Clark & Desharnais, 1998; Edgell, Himmelfarb, & Duchan, 1982; Moshagen et al., 2012). Two psychological aspects that are likely to play a role in respondents’ willingness to cooperate are (a) the ability to understand the instructions and (b) whether respondents trust the promise of confidentiality associated with the use of indirect questioning.

Comprehensibility and perceived privacy protection from indirect questioning

Most indirect questioning relies on the assumption that participants comply with the instructions: that they are able and willing to cooperate (Abul-Ela, Greenberg, & Horvitz, 1967; Edgell et al., 1982). Many researchers have raised concerns that some participants might not fully understand the instructions of indirect questions, since these instructions are generally more complex than those of DQ (Coutts & Jann, 2011; Landsheer et al., 1999). Participants might also not trust indirect questioning to protect their privacy, and might therefore disregard the instructions (Clark & Desharnais, 1998; Landsheer et al., 1999). Response bias resulting from a lack of understanding of, or trust in, a method threatens the validity of prevalence estimates determined through indirect questions (Holbrook & Krosnick, 2010; James, Nepusz, Naughton, & Petroczi, 2013). Hence, trust and understanding are two psychological factors that determine the validity of indirect questioning (Fox & Tracy, 1980; Landsheer et al., 1999).

One strategy used to evaluate comprehensibility and perceived privacy protection is assessment of the response rates in surveys that use indirect questioning. Following the logic of these studies, higher response rates indicate higher trust and understanding. Although some studies have shown reduced response rates in RRT conditions as compared to DQ (Coutts & Jann, 2011), other studies have reported comparable response rates for indirect and direct questioning (e.g., I-Cheng, Chow, & Rider, 1972; Locander, Sudman, & Bradburn, 1976) or higher response rates during indirect questioning (e.g., Fidler & Kleinknecht, 1977; Goodstadt & Gruson, 1975). However, these results only allow indirect conclusions regarding the comprehensibility and perceived privacy protection of the questioning techniques used, since numerous alternative explanations exist for disparities in response rates (e.g., motivational factors and the content of sensitive questions). Therefore, differential influences of trust and understanding cannot be disentangled on the basis of an analysis of response rates.

In more controlled approaches, some validation studies have used the known individual statuses of respondents regarding sensitive attributes to determine whether they responded in accordance with the instructions. The rate of demonstrably untrue responses was used to estimate the rate of participants who did not understand or trust the questioning procedure. Edgell et al. (1982) and Edgell, Duchan, and Himmelfarb (1992) argued that the low rates of 2 to 4 % incorrect responses to moderately sensitive questions indicate a high level of comprehension. However, the rate of false answers rose to 10 to 26 % for highly sensitive questions. This stronger bias might plausibly be caused in part by respondents distorting their answers to distance themselves from more sensitive attributes (Edgell et al., 1982). A meta-analytic investigation of strong validation studies in which participants’ true statuses concerning a sensitive attribute were known identified a mean rate of 38 % incorrect responses to RRT questions, whereas other questioning formats produced up to 49 % false answers (Lensvelt-Mulders et al., 2005). The disparities between RRT and DQ estimates increased for questions with higher sensitivity. This pattern could be interpreted as evidence that respondents trust the confidentiality offered by indirect questioning, but that they require, and make use of, enhanced privacy protection only when a sensitive issue is at stake. However, the designs used in these studies did not separate the influences of comprehension and perceived privacy protection.

A more direct strategy for determining trust in and understanding of different questioning procedures is to assess these two constructs within the survey itself. Various studies based on the reports of interviewees and interviewers have estimated the rate of respondents who fully understood the RRT procedure at 94 % (I-Cheng et al., 1972), 78 to 90 % (Locander et al., 1976), 79 to 83 % (van der Heijden, van Gils, Bouts, & Hox, 1998), and 80 to 93 % (Coutts & Jann, 2011). For the UCT (Miller, 1984), the rate was 92 %. In another study, the comprehensibility of an RRT question was rated as normal or easy by 89 % of respondents, whereas 10 % indicated that it was difficult (Hejri, Zendehdel, Asghari, Fotouhi, & Rashidian, 2013).

To estimate trust in an RRT question, some researchers have asked participants whether they thought there was a trick to the RRT procedure. Since 20 to 40 % (Abernathy, Greenberg, & Horvitz, 1970) and 15 to 37 % (I-Cheng et al., 1972) of the respondents answered this question affirmatively, a considerable fraction of respondents appear to mistrust the RRT despite its promise of confidentiality. When confronted with an indirect question, respondents estimated the probability of the researcher knowing which question they had answered at 55 to 72 % (Soeken & Macready, 1982); consequently, the probability of the procedure granting confidentiality was estimated at only 28 to 45 %. In a study by Coutts and Jann (2011), few respondents (15 to 22 %) believed that the RRT guaranteed the anonymity of their answers; for a UCT question, the rate was slightly higher, though still low, at 29 %.

Aside from assessments of the total rates of trust and understanding, some studies have compared the perceived privacy protection of direct versus indirect questions. In one study, 91 % of respondents felt that the RRT enhanced confidentiality as compared to DQ (Edgell et al., 1982). In another, only 72 % of respondents trusted the RRT procedure, unexpectedly fewer than the 83 % who trusted a DQ condition (van der Heijden et al., 1998), implying that the RRT failed to establish higher trust. Only 29 % of the participants in a study by Hejri et al. (2013) perceived that the RRT increased confidentiality as compared to DQ. Other studies comparing indirect questioning techniques have indicated that the UCT might be superior to the RRT with regard to trust and understanding (Coutts & Jann, 2011; James et al., 2013).

Few studies have examined the influences of cognitive skill and education on the comprehension and perceived privacy protection of indirect questioning designs. I-Cheng et al. (1972) found a positive effect of education on the rate of cooperative respondents: Whereas 72 % of participants failed to understand an RRT question, the rate dropped to 27 % among participants who had graduated from primary school, and to 2 % among participants who held a junior high school degree. Landsheer et al. (1999) found no influence of participants’ formal education on the incidence of incorrect answers. Holbrook and Krosnick (2010) reported that the most implausible results in their study occurred in a subgroup of highly educated participants, indicating that the “failure of the RRT was not due to the cognitive difficulty of the task” (p. 336).

Overall, the results from studies investigating participants’ trust in and understanding of indirect questioning have been inconclusive. Some studies have reported high rates of trust and understanding, whereas others have shown that a substantial share of participants failed to understand indirect questions or did not trust the procedures. The available data do not allow these two factors to be separated; thus, independent assessments of trust and understanding are needed to identify indirect questioning techniques that are both comprehensible and able to inspire trust. Moreover, the roles of cognitive skill and education as moderators of trust and understanding are not yet understood.

Present study

In this study, four indirect questioning techniques that have frequently been used in survey research on sensitive questions were entered into an experimental comparison of comprehensibility and perceived privacy protection. The CDM (Clark & Desharnais, 1998) and the SLD (Moshagen et al., 2012) allow for separate estimation of the proportion of noncompliant respondents in a sample by implementing an additional cheating parameter. The CWM (Yu et al., 2008) is presumably easier to understand than other RRT models and offers a symmetric design, which might facilitate honest responding. The UCT (Miller, 1984) is similarly easy to employ, and some participants have preferred UCT over RRT questions with regard to trust and understanding. This study evaluated the comprehensibility and perceived privacy protection of these four indirect questioning techniques separately, since the two factors might be intertwined without being linked by a unidirectional causal connection: Some participants might understand the instructions but not trust the protection of their privacy, whereas others might fail to comprehend the task but still perceive that indirect questions offer more confidentiality than do conventional DQ approaches.

To allow an objective and rigorous evaluation of participants’ comprehension of the instructions, we used a scenario-based design. To assess whether they understood each procedure, participants responded to a number of questions vicariously, on behalf of various fictional characters. Participants were first given information regarding these characters (e.g., “Wilhelm has never cheated on an exam” or “Wilhelm was born in July”), were subsequently provided with the instructions for one of the indirect questioning techniques, and finally indicated which answer the fictional character would have to give. This approach ensured that participants would not respond untruthfully to conceal their personal statuses regarding sensitive attributes. As a benefit of the scenario-based design, the true status of each fictional character was known to both the respondent and the questioner, and thus served as an objective criterion for assessing the correctness of a respondent’s answers. The mean proportion of questions answered correctly served as an estimate of the comprehensibility of each questioning procedure. We also assessed how participants rated the privacy protection offered by the various questioning techniques. Finally, by questioning two groups of participants with high versus low education, we investigated the moderating role of education as a proxy for cognitive skill.
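To illustrate the scoring logic, consider the CWM condition: The correct vicarious answer follows deterministically from a character’s description. A minimal sketch (the function name is ours), consistent with the example in Fig. 5:

```python
def correct_cwm_answer(is_carrier: bool, born_in_target_months: bool) -> str:
    """Correct CWM response for a fictional character.

    Option A: "both statements are true or both are false"
    Option B: "exactly one of the two statements is true"
    """
    return "A" if is_carrier == born_in_target_months else "B"

# Wilhelm never cheated on an exam (noncarrier) and was born in July,
# outside the target months, so both statements are false -> option A.
assert correct_cwm_answer(False, False) == "A"
```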

This study addressed the following research questions: (1) Do indirect questions differ from conventional direct questions regarding comprehensibility? If so, which of the four models under investigation is most comprehensible? (2) Do indirect questions offer higher perceived privacy protection than direct questions do? If so, which model is perceived as most protective? (3) Do cognitive skills, as indexed by respondents’ education, moderate the influence of questioning technique on comprehension or perceived privacy protection? (4) Is there an association between comprehension and perceived privacy protection?

Method

Participants

A total of 766 participants were recruited for an online survey through a commercial online panel. Since education was part of the experimental design, an online quota ensured matching proportions of participants with lower versus higher education. Participants in the lower-education group had finished at most 9 years of school (the German Hauptschule), and participants in the higher-education group had finished at least 12 years of education (the German Abitur). To optimize our statistical power to detect differences between the experimental conditions, we increased the homogeneity of our sample by allowing only respondents between 25 and 35 years of age to participate. This particular range was chosen because it matches the age range of the respondents who participate most often in online studies (Gosling, Vazire, Srivastava, & John, 2004). Of the initially invited participants, 171 (22 %) were rejected due to full quotas, 58 (8 %) were screened out on the first page of the questionnaire because they did not match the inclusion criteria (education and age range), and 136 (18 %) were excluded because they failed to complete the questionnaire. Of the 136 participants who started but did not complete the questionnaire, 41 (5 % of those initially invited) aborted the experiment before any of the experimental questions were presented, and 95 (12 %) viewed at least one of the questioning techniques. To test for selective dropout with respect to the experimental conditions, we compared which types of questions the participants saw last before dropping out (N = 95). As a reference, we compared these proportions against those for the last type of question seen by participants completing the study (N = 401). Within the CDM (21 vs. 22 %), CWM (23 vs. 21 %), and UCT (18 vs. 20 %) conditions, the distributions did not differ between incomplete and complete data sets. There was a trend toward a lower dropout rate in the simpler DQ condition (6 vs. 16 %) and a higher dropout rate in the more complex SLD condition (32 vs. 21 %); this trend was, however, small and nonsignificant, χ2(4, N = 496) = 8.55, p = .07, w = .13. Educational levels (high vs. low) did not differ between the aborting and finishing participants, either, χ2(1, N = 496) = 2.67, p = .10, w = .07. The participants in the final sample (N = 401, 52 % of those initially invited) had a mean age of 30.72 years (SD = 3.35); 211 (53 %) were female, and 386 (97 %) indicated German as their first language. The education groups were represented evenly, with 199 lower- and 202 higher-education participants. Power analyses conducted using the G*Power 3 software (Faul, Erdfelder, Buchner, & Lang, 2009; Faul, Erdfelder, Lang, & Buchner, 2007) revealed that this sample size provided sufficient power for the detection of medium effects in analyses of mean differences between groups (f = 0.25, 1 – β = .99) and in (both parametric and nonparametric) correlations (r/rS = .30, 1 – β > .99).
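The reported power values can be approximated as follows. This is a sketch rather than the authors’ G*Power session; the two-group ANOVA structure and the Fisher z approximation for the correlation test are our assumptions:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import FTestAnovaPower

# Power for a medium effect (f = 0.25) in a two-group comparison with N = 401
anova_power = FTestAnovaPower().solve_power(effect_size=0.25, nobs=401,
                                            alpha=.05, k_groups=2)

# Power for detecting r = .30 with N = 401, via the Fisher z approximation
n, r, alpha = 401, .30, .05
z_crit = stats.norm.ppf(1 - alpha / 2)
nc = np.arctanh(r) * np.sqrt(n - 3)   # noncentrality of the test statistic
corr_power = (1 - stats.norm.cdf(z_crit - nc)
              + stats.norm.cdf(-z_crit - nc))
print(anova_power, corr_power)        # both > .99, matching the reported values
```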

Design

The scenario-based experiment implemented a 5 (questioning technique) × 2 (educational level) quasi-experimental mixed design. Questioning technique varied within subjects and was realized in five blocks: the CDM (Clark & Desharnais, 1998), SLD (Moshagen et al., 2012), CWM (Yu et al., 2008), UCT (Miller, 1984), and a conventional DQ approach. The second, quasi-experimental, between-subjects independent variable was participants’ education (high vs. low).

Academic cheating served as the sensitive attribute, as in several previous studies of indirect questioning techniques (e.g., Hejri et al., 2013; Lamb & Stem, 1978; Ostapczuk, Moshagen, Zhao, & Musch, 2009; Scheers & Dayton, 1987). The wording of the sensitive question was identical in all questioning technique conditions: “Have you ever cheated on an exam?” Three additional, nonsensitive attributes were used to implement the indirect questioning techniques. First, the participant’s month of birth served as the randomization device for the CDM, SLD, and CWM questions. To allow application of the UCT format, we constructed a list of four items: the sensitive item, the nonsensitive month-of-birth item, and two further nonsensitive attributes (i.e., gender and whether the person had ever visited London). The indirect questioning techniques were implemented as shown in Figs. 1, 2, 3 and 4. Each questioning technique was applied to four fictional characters named Ludwig, Ernst, Hans, and Wilhelm, who were characterized differently regarding the sensitive and nonsensitive attributes. Ludwig and Ernst were presented as carriers of the sensitive attribute, whereas Hans and Wilhelm were described as noncarriers. The birthdays of Ludwig and Hans were chosen to fall into one of the outcome categories of the binary randomization procedure, and the months of birth of Ernst and Wilhelm fell into the other category. All four characters were male, and none was described as having visited London. The descriptions were chosen so as to avoid extreme counts in the UCT condition, and they were accessible to participants at any time during the experiment. To control for effects of serial position, the sequence of the five questioning technique blocks was randomized across participants. Additionally, the four fictional characters were presented in a random order within each questioning technique block.

To examine the comprehensibility of the questioning techniques, participants vicariously indicated the answers that the four fictional characters would have to give when confronted with each questioning technique. The descriptions of the characters were displayed along with the questions. As an example, a screenshot of a CWM question that had to be answered from the perspective of Wilhelm is shown in Fig. 5. The comprehensibility of each questioning technique was operationalized as the percentage of correct answers, computed across all four fictional characters separately for each participant.

Fig. 5 Screenshot of a CWM question that had to be answered from the perspective of the fictional character Wilhelm. Since Wilhelm never cheated on an exam and was born in July, the first answer option (“Yes to both questions or no to both questions.”) would have been correct

To assess perceived privacy protection, participants rated the confidentiality offered by each questioning technique on a 7-point Likert-type scale ranging from –3 (no confidentiality) to +3 (perfect confidentiality). The scales were presented directly below the comprehension questions. Perceived privacy protection was operationalized as the mean score on these Likert scales across all four fictional characters.

Results

Comprehensibility

The mean proportions of correct responses as a function of questioning technique and education are shown in Figs. 6 and 7, respectively. Reliability analyses for the proportions of correct responses across all five questioning techniques revealed that the variable measured a homogeneous construct (Cronbach’s α = .75). Descriptively, the mean proportion of correct responses in the DQ control condition was higher than those in the CDM (ΔM = 15.04 %, r = .44, dz = 0.70; effect sizes classified according to Cohen, 1988), SLD (ΔM = 21.73 %, r = .23, dz = 0.79), CWM (ΔM = 7.07 %, r = .49, dz = 0.33), and UCT (ΔM = 13.38 %, r = .52, dz = 0.49) conditions. Among the indirect questioning techniques, the mean proportion of correct responses was descriptively highest in the CWM condition, followed by those in the UCT (CWM vs. UCT: ΔM = 6.3 %, r = .52, dz = 0.23), CDM (CWM vs. CDM: ΔM = 8.0 %, r = .39, dz = 0.33; UCT vs. CDM: ΔM = 1.7 %, r = .42, dz = 0.06), and SLD (CWM vs. SLD: ΔM = 14.7 %, r = .29, dz = 0.52; UCT vs. SLD: ΔM = 8.4 %, r = .25, dz = 0.24; CDM vs. SLD: ΔM = 6.7 %, r = .38, dz = 0.26) conditions. The descriptive differences in the mean proportions of correct responses between participants with high versus low education were negligible in the DQ control condition (ΔM = 1.39 %, d = 0.07). Within the CDM condition, participants with lower education had slightly lower scores (ΔM = 4.98 %, d = 0.24). For the SLD (ΔM = 9.70 %, d = 0.41), CWM (ΔM = 7.61 %, d = 0.34), and UCT (ΔM = 11.07 %, d = 0.36) conditions, lower education resulted in substantially lower mean proportions of correct responses.

Given the binary nature of the correct/incorrect responses, inferential statistics were determined by fitting a generalized linear mixed model with a logit link function, implementing the fixed factors Questioning Technique (within subjects) and Education (between subjects) and their interaction (cf. Jaeger, 2008). Responses coded as incorrect (0; reference category) versus correct (1) served as the criterion, and a by-subjects random intercept accounted for the dependency of the measurements. This model revealed a significant main effect of questioning technique [F(4, 8010) = 77.51, p < .001]. Sequentially Bonferroni-corrected pairwise contrasts widely mirrored the descriptive results: Comprehensibility in the DQ control condition was higher than in the CDM [t(8010) = –5.64, p < .001], SLD [t(8010) = –10.41, p < .001], CWM [t(8010) = –5.99, p < .001], and UCT [t(8010) = –11.11, p < .001] conditions. Pairwise comparisons among the indirect questioning techniques yielded significant differences for all combinations [CDM vs. SLD: t(8010) = –7.53, p < .001; CDM vs. UCT: t(8010) = –6.96, p < .001; SLD vs. CWM: t(8010) = 7.51, p < .001; SLD vs. UCT: t(8010) = 2.36, p < .05; CWM vs. UCT: t(8010) = –6.96, p < .001], except for the difference between the CDM and CWM, which was not statistically reliable [t(8010) = –0.158, p = .88]. Thus, participants demonstrated the highest comprehension for direct questions. Comprehension was slightly but significantly reduced for CWM and CDM questions; for the CDM, comprehensibility was descriptively, but not significantly, lower than for the CWM. For the UCT, comprehension was further reduced significantly, but it was still significantly higher than for SLD questions, for which comprehension was lowest.
Furthermore, the model revealed a significant main effect of education [F(1, 8010) = 9.07, p < .01]: As hypothesized, higher education resulted in a higher proportion of correct responses. Finally, the model showed a significant interaction of the factors Questioning Technique and Education [F(4, 8010) = 5.58, p < .001]. Sequentially Bonferroni-corrected pairwise contrasts indicated that high versus low education did not result in significantly different proportions of correct responses in the DQ [t(8010) = –0.98, p = .33] and CDM [t(8010) = –0.63, p = .53] conditions. For the SLD [t(8010) = –2.17, p < .05], CWM [t(8010) = –3.36, p < .01], and UCT [t(8010) = –4.65, p < .001] conditions, lower education resulted in lower comprehension. Hence, although the proportions of correct responses were comparable between the educational groups for DQ, education moderated comprehension in three of the four indirect questioning formats.
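The analysis corresponds to a logistic mixed model with a by-subjects random intercept. As a minimal sketch of such a model in Python (the original analysis likely used different software; the simulated data frame and the variational Bayes fit are our assumptions):

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Simulated long-format data: one row per response, with a binary 'correct'
# criterion, the within-subjects factor 'technique', the between-subjects
# factor 'education', and a 'subject' identifier (4 characters x 5 blocks).
rng = np.random.default_rng(0)
techniques = ["DQ", "CDM", "SLD", "CWM", "UCT"]
n_subj = 40
data = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), len(techniques) * 4),
    "technique": np.tile(np.repeat(techniques, 4), n_subj),
    "education": np.repeat(rng.integers(0, 2, n_subj), len(techniques) * 4),
})
data["correct"] = rng.integers(0, 2, len(data))

# Fixed effects for technique, education, and their interaction;
# a by-subjects random intercept captures the dependency of measurements.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ C(technique) * C(education)",
    {"subject": "0 + C(subject)"},
    data,
)
result = model.fit_vb()   # variational Bayes approximation
print(result.summary())
```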

Fig. 6 Mean percentages of correct responses as a function of questioning technique in the total sample (N = 401). Error bars denote ±1 standard error

Fig. 7 Mean percentages of correct responses as a function of questioning technique and low (N = 199) versus high (N = 202) education. Error bars denote ±1 standard error

Perceived privacy protection

The mean ratings of perceived privacy protection as a function of questioning technique and education are shown in Figs. 8 and 9, respectively. Reliability analyses for the mean ratings of perceived privacy protection across all five questioning techniques revealed that the variable measured a homogeneous construct (α = .87). A univariate 5 (questioning technique) × 2 (education) mixed-model ANOVA revealed a main effect of questioning technique [F(4, 1596) = 18.76, p < .001, η2 = .05], but no effect of education [F(1, 399) < 1]. However, the two factors interacted [F(4, 1596) = 9.21, p < .001, η2 = .02]. A Bonferroni post-hoc test of the factor Questioning Technique revealed that the mean scores in the DQ control condition were lower than those in the CDM (ΔM = 0.26, p < .001; r = .57, dz = 0.19), SLD (ΔM = 0.25, p < .01; r = .53, dz = 0.18), CWM (ΔM = 0.39, p < .001; r = .39, dz = 0.25), and UCT (ΔM = 0.52, p < .001; r = .40, dz = 0.33) conditions. Post-hoc tests between the indirect questioning techniques showed that the UCT format received the highest scores, which were not significantly different from those in the CWM condition (ΔM = 0.13, p = .21; r = .64, dz = 0.12) but were higher than those in the CDM (ΔM = 0.26, p < .001; r = .61, dz = 0.22) and SLD (ΔM = 0.27, p < .001; r = .64, dz = 0.24) conditions. The mean scores in the CWM condition were comparable to those in the CDM (ΔM = 0.13, p = .31; r = .61, dz = 0.11) and SLD (ΔM = 0.14, p = .10; r = .67, dz = 0.13) conditions. Finally, the CDM and SLD scores did not differ (ΔM = 0.01, p > .99; r = .65, dz = 0.01). Combined, all indirect questioning techniques enhanced perceived privacy protection in comparison with conventional DQ. Participants perceived the highest privacy protection when confronted with UCT and CWM questions, and the perceived privacy ratings for the CWM, CDM, and SLD questions did not differ.

Since no main effect of education emerged, results are presented only for the interaction of education and questioning technique. Five pairwise t tests for independent groups at a Bonferroni-corrected α level (corrected α = .05/5 = .01) were computed to compare the participants with high versus low education separately within each questioning technique condition. The comparisons revealed an education effect only in the DQ condition [ΔM = 0.51; t(399) = 3.35, p < .001, d = 0.33], whereas the education groups did not differ significantly at the corrected α level within the CDM [ΔM = 0.08; t(399) = 0.64, p = .53, d = 0.07], SLD [ΔM = 0.10; t(399) = 0.78, p = .43, d = 0.08], CWM [ΔM = 0.10; t(399) = 0.77, p = .44, d = 0.07], and UCT [ΔM = 0.26; t(399) = 1.98, p = .05, d = 0.20] conditions. Hence, participants with lower education perceived higher privacy protection than did participants with higher education when confronted with a direct question, whereas perceived privacy protection did not differ between the education groups within the indirect questioning conditions.
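The reported analysis is a standard mixed-design ANOVA. A minimal sketch in Python, using the pingouin package on simulated (not the original) data:

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Simulated long-format ratings: one privacy rating per subject and technique.
rng = np.random.default_rng(1)
techniques = ["DQ", "CDM", "SLD", "CWM", "UCT"]
n_subj = 60
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), len(techniques)),
    "technique": np.tile(techniques, n_subj),
    "education": np.repeat(rng.integers(0, 2, n_subj), len(techniques)),
    "privacy": rng.normal(0, 1, n_subj * len(techniques)),
})

# 5 (technique, within) x 2 (education, between) mixed-model ANOVA
aov = pg.mixed_anova(data=df, dv="privacy", within="technique",
                     subject="subject", between="education")
print(aov)
```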

Fig. 8 Mean perceived privacy protection on a 7-point Likert scale from –3 (no confidentiality) to +3 (perfect confidentiality) as a function of questioning technique in the total sample (N = 401). Error bars denote ±1 standard error

Fig. 9 Mean perceived privacy protection on a 7-point Likert scale from –3 (no confidentiality) to +3 (perfect confidentiality) as a function of questioning technique and low (N = 199) versus high (N = 202) education. Error bars denote ±1 standard error

Association of comprehension and perceived privacy protection

To investigate whether participants’ comprehension of a questioning technique was associated with perceived privacy protection, bivariate Spearman correlations were computed for the total sample and separately for the two education groups (Table 1). Comprehension and perceived privacy protection showed no significant associations.

Table 1 Nonparametric correlation coefficients (Spearman’s rho) measuring the association of comprehension and perceived privacy protection
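Such an analysis amounts to rank correlations between the two per-participant scores. A minimal sketch on simulated scores (not the study data):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
comprehension = rng.uniform(0, 100, 401)  # percent correct per participant
privacy = rng.uniform(-3, 3, 401)         # mean privacy rating per participant

rho, p_value = spearmanr(comprehension, privacy)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
```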

Discussion

In the present study, we compared four indirect questioning procedures in terms of comprehensibility and perceived privacy protection. A conventional direct question served as a control condition. The moderating effects of participants’ level of education were investigated.

Comprehensibility of indirect questioning techniques

All indirect questioning techniques showed lower comprehensibility than the DQ condition. This result accords with extant studies suggesting that the instructions of indirect questions are more complex, and thus more difficult to comprehend, than those of direct questions (e.g., Böckenholt, Barlas, & van der Heijden, 2009; Coutts & Jann, 2011; Edgell et al., 1992; Landsheer et al., 1999; O’Brien, 1977). In a qualitative interview study, Boeije and Lensvelt-Mulders (2002) reported that the reduced comprehensibility of indirect RRT questions might be explained partially by participants experiencing difficulties when “doing two things at the same time” (p. 30): Participants struggle to focus on RRT questions and the randomization procedure simultaneously. This account applies to the present study, since participants had to integrate two types of information to identify the correct responses in all indirect questioning conditions: first, a fictional character’s status regarding the sensitive attribute, and second, his status concerning the nonsensitive randomization attribute(s).

Our results suggest that some indirect questioning formats were more comprehensible than others. The CWM appears to have been the most comprehensible format, corroborating Yu et al.’s (2008) assertion that the CWM is easy to follow. Integrating two types of information, or “doing two things at the same time” (Boeije & Lensvelt-Mulders, 2002, p. 30; see also Lensvelt-Mulders & Boeije, 2007, p. 598), might have been easiest for participants in the CWM condition, since this questioning format incorporates the randomization procedure and the response to the sensitive statement in a single step: Respondents simply have to read two answer options and identify the appropriate one. In contrast, comprehension was lowest in the SLD condition. A more detailed inspection of the SLD’s instructions reveals that participants must make three sequential decisions to identify the correct response: (a) decide whether the fictional character is a carrier of the sensitive attribute, (b) identify the question that must be answered as determined by the randomization procedure (if the character is a noncarrier), and (c) identify the correct response to the respective question. Answering an SLD question is therefore arguably more difficult, and more prone to errors, than answering a CWM question. However, since this explanation is rather speculative, future studies should consider qualitative interviews similar to the one conducted by Boeije and Lensvelt-Mulders (2002) to shed further light on the exact mechanisms that account for the differential comprehensibility of the four indirect questioning models investigated here.

The lower-education group demonstrated decreased comprehension of all indirect questioning techniques, with the exception of CDM. Researchers investigating the prevalence of sensitive personal attributes should consider that the comprehension of indirect questions might be reduced in samples that include less-educated participants, and thus should refrain from applying indirect questioning techniques if less-educated individuals report difficulties while completing a survey. This caveat should receive particular attention if education is expected to be associated with the sensitive attribute under investigation (e.g., negative attitudes toward foreigners; cf. Ostapczuk, Musch, & Moshagen, 2009).

On the one hand, since a within-subjects, scenario-based design was used, the comprehension rates reported here likely mark lower bounds for the comprehensibility of the questioning procedures under investigation. The mean comprehension in the DQ condition was high (>90 %) and unaffected by education, indicating that participants were generally capable of answering questions from the perspective of the four fictional characters. However, participants’ comprehension would likely improve if they had to deal with only one questioning technique, and if they were not required to respond vicariously for fictional characters but for themselves. On the other hand, as was remarked by one of the reviewers of this article, the participants in our study were provided with all relevant information on screen, which possibly facilitated the identification of the correct response. In real applications, this information has to be retrieved from memory, and issues with the retrieval of autobiographical information regarding sensitive and/or nonsensitive attributes may therefore make it more difficult to identify the correct response. Moreover, the instructions for all indirect questioning procedures were kept as concise as possible. In real applications, more comprehensive instructions could be presented along with extended explanations, and could be combined with comprehension checks to ensure that respondents understand the procedure. Finally, in contrast to many extant studies that used face-to-face questioning or paper-and-pencil tests, this study confronted participants with an online questionnaire that utilized indirect questioning techniques. Although the RRT has yielded valid results in previous online studies (e.g., Musch, Bröder, & Klauer, 2001), a face-to-face setting offers better opportunities to assist participants who experience difficulties, and might help respondents achieve better comprehension and avoid errors when answering questions.

Perceived privacy protection

Regarding perceived privacy protection, all indirect questioning techniques showed higher mean scores than the conventional DQ, suggesting that participants developed higher trust in indirect questions. The highest mean score was achieved in the UCT condition, followed by a slightly but nonsignificantly reduced mean score for the CWM. The scores for the CWM, CDM, and SLD were similar, although the latter two differed from the UCT condition. Education influenced perceived privacy protection only in the DQ condition, with lower-education participants reporting higher perceived protection; this education effect did not occur in any of the indirect questioning conditions. Hence, the influence of education on perceived privacy protection apparently reduces to a failure of less educated respondents to recognize the poor privacy protection offered by direct questions. When sensitive attributes are assessed using indirect questioning, the effect of education on perceived protection might be negligible.

Comprehension was not associated with perceived privacy protection, either in the entire sample or in the two education groups. This pattern suggests that some participants understood the instructions but did not necessarily trust the procedure, whereas others developed trust despite failing to comprehend the instructions fully. The lack of association between comprehension and perceived privacy protection underlines the importance of examining the differential impacts of these two constructs separately when assessing sensitive topics with indirect questioning techniques. To allow for a valid assessment of the prevalence of sensitive personal attributes, participants should ideally both understand and trust the questioning technique.

Limitations and future directions

Several limitations of our study have to be acknowledged. For example, despite the successful separation of comprehension and perceived privacy protection, a confounding influence of task motivation on the comprehensibility of the questioning techniques cannot be ruled out. Although comprehension in the DQ condition was generally high, about 10 % of the participants’ responses were incorrect, suggesting a potential lack of motivation among at least some participants. However, in a recent study, Baudson and Preckel (2016) found that in other rather simple cognitive tasks, the proportion of successful participants was also only 90 %, and thus close to the accuracy we observed in the DQ condition. This finding suggests that it is probably unrealistic to expect perfect scores in tasks like the ones we investigated.

Arguably, a lack of motivation is likely to exert a stronger influence on cognitively more demanding tasks, such as responding to indirect rather than direct questions. Indeed, our dropout analyses showed a small (yet nonsignificant) trend toward a lower dropout rate in the less cognitively demanding DQ condition and a higher dropout rate in the presumably more demanding SLD condition.

It is conceivable that participants with lower education might also have been less motivated. However, given that comprehension in the DQ condition did not differ between the high- and low-education groups, a general difference in motivation between these two groups seems rather unlikely. Moreover, whereas the design of our experiment did not allow us to directly observe evidence for a lack of motivation, any such motivational differences would be likely to affect real applications of indirect questioning techniques as well. Even though comprehensibility in our study may actually have measured a mixture of comprehension and motivation, there is little reason to expect a higher share of valid responses in real applications than in the present study. To further explore the exact mechanisms underlying incorrect responses, however, future studies should try to measure task motivation more directly, or might try to increase task motivation by offering financial incentives.

Because participants had to take on artificial characters’ perspectives in a scenario-based design, the absolute comprehension rates and perceived privacy scores might not be directly transferable to real applications. However, if participants respond to sensitive questions from their own perspectives, comprehension and perceived privacy protection are intertwined by default. For example, carriers of a sensitive attribute who do not trust a questioning technique will necessarily tend to provide untruthful (i.e., incorrect) responses; conversely, carriers who fully trust the procedure will probably answer truthfully (i.e., correctly). For this reason, only a scenario-based approach allows comprehension to be separated from perceived privacy protection in RRT designs investigating sensitive attributes; arguably, at least the rank order of the questioning techniques we investigated is therefore likely to remain valid, even if the absolute values may differ in real applications.

Another limitation of the present study is that we measured perceived privacy protection in a within-subjects design. Although this may have affected the responses, it allowed us to achieve higher statistical power, and also helped to avoid an effect that has been shown to potentially distort the results of between-subjects comparisons of numerical rating scales (Birnbaum, 1999). In particular, contexts that differ between experimental conditions can lead to erroneous conclusions in between-subjects designs if participants provide relative judgments according to the range principle. For example, in a between-subjects design, participants have been shown to perceive the number 9 as being higher than the number 221 if the former evoked a frame of reference that consisted of single-digit numbers, whereas the latter evoked a frame of reference that consisted of three-digit numbers (Birnbaum, 1999). Similarly, an absolute judgment of the privacy protection afforded by a direct question may be distorted if participants are not aware of the possibility of privacy-protecting indirect questioning techniques because they are not given an opportunity to acquaint themselves with such techniques. Our decision to employ a within-subjects design helped to avoid such range effects, because participants were given an opportunity to compare all questioning techniques.

A final limitation of our study is the relatively narrow age range of the participants (25 to 35 years old). Although this relatively homogeneous sample increased the statistical power to detect differences between the experimental conditions, it also limits the generalizability of the findings. Future studies should therefore include older participants to investigate the replicability of our results in samples with a broader range of age.

This study supports the application of indirect questioning designs, since they were shown to increase perceived privacy protection. When selecting among the available techniques, the best advice at present is to use the CWM (Yu et al., 2008) to assess sensitive personal attributes. This model showed the highest comprehensibility among the indirect questioning techniques and substantially increased perceived privacy protection in comparison to direct questioning. This recommendation is further supported by findings from various extant studies suggesting that the CWM yields more valid prevalence estimates than conventional direct questioning (e.g., Coutts et al., 2011; Hoffmann & Musch, 2015; Jann, Jerke, & Krumpal, 2012; Kundt et al., 2013; Nakhaee, Pakravan, & Nakhaee, 2013). If the attribute under investigation is extraordinarily sensitive (e.g., deviant sexual interests or severe criminal behavior), researchers may want to consider using the UCT (Miller, 1984) to maximize perceived privacy.