The application of factorial surveys to study recruiters’ hiring intentions: comparing designs based on hypothetical and real vacancies

Factorial survey experiments have been widely used to study recruiters’ hiring intentions. Respondents are asked to evaluate hypothetical applicant descriptions, which are experimentally manipulated, for hypothetical job descriptions. However, this methodology has been criticized for putting respondents in hypothetical situations that often only partially correspond to real-life hiring situations. It has been proposed that this criticism can be overcome by sampling real-world vacancies and the recruiters responsible for filling them. In such an approach, only the applicants’ descriptions are hypothetical; respondents are asked about a real hiring problem, which might increase internal and external validity. In this study, we test whether using real vacancies triggers more valid judgments compared to designs based on hypothetical vacancies. The growing number of factorial survey experiments conducted in employer studies makes addressing this question relevant, both for methodological and practical reasons. However, despite the potential implications for the validity of data, it has been neglected so far. We conducted a factorial survey experiment in Luxembourg, in which respondents evaluated hypothetical applicants referring either to a currently vacant position in their company or to a hypothetical job. Overall, we found little evidence for differences in responses by the design of the survey experiment. However, the use of real vacancies might prove beneficial depending on the research interest. We hope that our comparison of designs using real and hypothetical vacancies contributes to the emerging methodological inquiry on the possibilities and limits of using factorial survey experiments in research on hiring.


Introduction
Factorial survey experiments (FSEs) have been used in the social sciences to study human judgments since the 1970s (Rossi 1979; Wallander 2009) and, in recent years, have become an established method to study how employers select job applicants (McDonald 2019). In a typical set-up, respondents are asked to evaluate fictitious descriptions of applicants, called "vignettes." Researchers can study how employers interpret and value specific applicant characteristics during the hiring process through the experimental manipulation of information on applicants. Analyzing employers' hiring preferences is crucial to our understanding of how inequalities arise at the individual and job level (Bills et al. 2017). Previous studies have employed FSEs to analyze, for example, gender discrimination (Kübler et al. 2018; Oesch et al. 2017), ethnic discrimination (Auer et al. 2019; Baert and De Pauw 2014), the value of foreign qualifications (Damelang et al. 2019; Protsch and Solga 2017), and the signaling value of educational credentials (Di Stasio 2014; Di Stasio and Van De Werfhorst 2016) and unemployment experience (Van Belle et al. 2018; Shi et al. 2018).
However, FSEs have been criticized for measuring the hiring intentions of employers, rather than their real behavior (e.g., Pager and Quillian 2005). Consequently, field experiments, such as correspondence or audit studies in which fictitious applications are sent to real jobs (e.g., Pedulla 2018; Protsch and Solga 2014), are often seen as better suited to detecting inequalities in the hiring process. 1 This criticism applies particularly to FSEs in which respondents are put in a situation that may be completely unknown to them in real life. This would be the case, for example, if respondents are students (e.g., Baert and De Pauw 2014) or only partly familiar with the recruitment process for a specific job type (e.g., Van Belle et al. 2018; Damelang and Abraham 2016). However, FSEs can also be based on a careful selection of study participants, so that the pool of respondents closely matches the target population (i.e., individuals with recruiting experience for the occupation of interest and familiar with the situations described in the FSE). Adopting this strategy may increase external validity (Hainmueller et al. 2015).
Furthermore, FSEs have several advantages over field experiments, which make them an attractive option for studying hiring processes (see also McDonald 2019). First, unlike field experiments, FSEs may capture the multidimensionality of hiring decisions by considering several applicant characteristics simultaneously instead of focusing on only one or two features. Moreover, the survey part of FSEs allows the researcher to gather rich information about recruiters, companies, and jobs; this enables an in-depth analysis of the mechanisms underlying recruiters' decision-making processes, an analytical feature typically unavailable in field experiments. Finally, FSEs are easily compatible with ethical research practices, as the respondents (i.e., recruiters) give informed consent to participate in the survey, whereas in field experiments, real employers are led to believe that they are dealing with real applications. 2 Resolving the FSE versus field experiment debate is beyond the scope of this article. Given that prior research suggests that FSEs are useful, some of the justified criticisms about their use should be addressed by improving the design of the FSEs applied to study hiring intentions. A research group from the NEGOTIATE project proposed basing FSEs on samples of current real-world vacancies and the recruiters responsible for filling them (NEGOTIATE 2020). 3 This is in contrast to most previous FSEs on recruiting, which relied on rather abstract descriptions of hypothetical vacancies that may have lacked psychological realism (e.g., Van Belle et al. 2018; Damelang et al. 2019; Di Stasio 2014). "Psychological realism [emphasis in original] is the extent to which the psychological processes that occur in an experiment are the same as the psychological processes that occur in everyday life" (Brewer and Crano 2014, p. 21). This is an important factor that should be considered when designing FSEs, to enhance the external validity.
Conducting FSEs with recruiters who are currently in charge of filling a real-world vacancy might increase psychological realism and, hence, contribute to the internal and external validity of the results (NEGOTIATE 2020). However, collecting large sets of job advertisements and employer details is costly and time-consuming. Furthermore, given that respondents are still presented with a fictitious situation that entails no real-life consequences, the question arises whether it really makes a difference for the measured hiring intentions of recruiters whether vignette ratings refer to a real hiring problem or a hypothetical one. If the FSE is based on hypothetical vacancies, the employers might have trouble putting themselves in the actual hiring situation, particularly when the respondents are not familiar with recruiting for the specific job type studied. Hence, the internal and external validity might be reduced. This is presumably different if the FSE is conducted with recruiters who are currently responsible for filling a real-world vacancy on which the experiment is based. Only the applicant profiles are fictitious in this case, which might increase internal and external validity. However, the question remains whether this is enough to trigger more valid judgments compared to a completely hypothetical situation.
The main objective of this study is to help answer this question. While the advantages and disadvantages of survey experiments in general have been discussed (e.g., Sniderman 2018), more research is needed on the designs of FSEs to study hiring intentions. We conducted an FSE in Luxembourg, where one group of recruiters rated hypothetical applicants for a real vacancy for which they were personally responsible in real life (i.e., for which they know all the requirements and characteristics) and another group of recruiters rated hypothetical applicants for a hypothetical vacancy (i.e., without information on job characteristics). Since random assignment of recruiters to a condition with real or hypothetical vacancies (i.e., a split ballot experiment) was impossible due to data limitations, we compared two different samples of recruiters. We manipulated three applicant characteristics: migrant background, gender, and unemployment. We asked whether the effects of these characteristics on observed hiring intentions differed if the rating task corresponded to a real instead of a hypothetical hiring problem and whether these effects were closer to certain benchmarks when real vacancies were used. "Behavioral benchmarks" (see Hainmueller et al. 2015) corresponding to real-life hiring behavior were not available for our study context, which is why we derived "theoretical" benchmarks from previous research. Given the growing number of FSEs in employer studies and the potential implications for the validity of those studies' results, answering this question is relevant for both methodological and practical reasons. However, to the best of our knowledge, this question has not been addressed empirically.
After a literature review in Sect. 2, we discuss our expectations about the impact of the FSE design on recruiters' stated hiring intentions in Sect. 3. In Sect. 4, we describe our research design. Our results are presented in Sect. 5. Section 6 contains our conclusions.

Literature review
Table 1 provides an overview of previous FSEs used to study recruiters' hiring intentions. 4 All listed studies ask respondents to rate either the hiring chances of applicants or the likelihood of inviting applicants to the next selection stage (e.g., interview). Moreover, in all cases, the presented applicant profiles (i.e., vignettes) are fictitious. These studies concern how recruiters interpret and value certain applicant characteristics, such as educational background, age, or nationality.
The last column of Table 1 shows the type of jobs presented to respondents. To the best of our knowledge, so far only in NEGOTIATE (2020) have respondents been asked about a real vacancy in their company that they are responsible for filling. 5 Scholars interested in measuring hiring intentions by means of FSEs clearly prefer using hypothetical vacancies. However, the presentation of hypothetical jobs to respondents varies in certain ways. For example, some studies give specific information about the job, such as characteristics required for the position (e.g., Baert and De Pauw 2014; De Wolf and Van Der Velden 2001); other studies are more general and provide no details about the job (e.g., Oesch et al. 2017). Furthermore, some rating tasks do not refer to a job per se but to the organization as a whole (e.g., Karpinska et al. 2011).
The pool of respondents used in the reviewed studies also differs noticeably. The respondents in most studies are managers or human resource professionals, who have at least some recruitment experience (see the 5th column in Table 1). Yet, they are not always experts in the specific occupations under investigation or in the relevant recruitment procedures (e.g., Damelang and Abraham 2016). In a few cases, the sample consists fully or partly of students (Baert and De Pauw 2014; Karpinska et al. 2011). Such a mismatch between the target population (i.e., real recruiters with job-specific knowledge) and the analytic sample could affect the internal and external validity of the results, as discrepancies between measured intentions and real behavior are more likely if respondents are not familiar with the rating task in real life (see Todesco 2017, p. 17). Hence, FSEs based on hypothetical vacancies can enhance the validity of the results by carefully selecting respondents with relevant recruiting experience (ideally for the types of jobs under investigation) and designing hypothetical vacancies that are familiar to them. Nevertheless, this issue can be easily addressed if the sample consists of real-world vacancies, given that, in this case, the job described in the hiring situation is concrete and real for the respondents. The respondents (i.e., real recruiters) are asked about a currently vacant position in their company, for which they are personally responsible. This means that they know all the relevant characteristics of this vacancy and do not have to make any assumptions when rating vignettes (see Sect. 3). This is different for respondents who
4 McDonald (2019) provides a more general systematic review of previous FSEs used in the study of employers' preferences. In our literature review, we compare previous FSEs based on their use of hypothetical and real-world vacancies. We focus on FSEs published in peer-reviewed journals written in English and exclude behavioral validation studies (i.e., studies that combine field and survey experiments to compare the two methods). Conjoint analyses and other types of vignette studies, which are similar to FSEs but typically ask respondents to rate applicant profiles in pairs rather than individually (e.g., Auer et al. 2019; Van Beek et al. 1997; Biesma et al. 2007; Humburg and Van Der Velden 2015), are also excluded as our study design diverges from this approach.
5 Van Beek et al. (1997) also asked employers about a real vacancy in their company in their FSE, but they adopted a different approach. They asked a sample of employers to describe a real (current or past) vacancy in their firm, and the vignette ratings were based on this description. In contrast, NEGOTIATE (2020) sampled published real vacancies, and the study participants were personally responsible for filling them. Moreover, Van Beek et al. (1997) asked respondents to rate pairs of applicants rather than rate the applicants in consecutive order.
Further differences among FSEs used to measure hiring intentions involve the type of samples used. Respondents were either drawn from a representative population or employer surveys or were sampled specifically for the respective study context (see column 4 in Table 1). For example, Liechti et al. (2017) sampled respondents based on the membership register of a Swiss hotel employer association, whereas other studies used a (random) sample of firms to find suitable study participants (e.g., Van Beek et al. 1997; Di Stasio 2014). Similar to the NEGOTIATE research group, Petzold (2017) sampled real job advertisements through which recruiters were contacted. As far as we can judge, however, Petzold (2017), in contrast to NEGOTIATE (2020), presented respondents with a hypothetical job description.
Some of these differences may be due to the different study contexts and goals, but they also indicate an evolving field that has not yet converged upon shared standards. Whereas the vast majority of the reviewed studies arrive at conclusions in line with the respective theory, common guidelines will increase their comparability.

Theory and hypotheses
Our contribution focuses on the applicant's migrant background, gender, and unemployment experience to examine whether the FSE design affects recruiters' hiring intentions. Hiring-related inequalities based on these characteristics have been of particular interest to scholars in labor market research. In the following, we develop the argument that the effects of these characteristics on recruiters' hiring intentions are closer to what we would observe in reality when real vacancies are used. To test this hypothesis, behavioral benchmarks corresponding to the true effects of these characteristics on hiring decisions would be ideal (see Hainmueller et al. 2015), but these were not available for our study context. Therefore, we had to rely on the extensive empirical evidence on the effects of our three applicant characteristics on recruiters' hiring decisions to find plausible "theoretical" benchmarks. These will be briefly discussed below, before we turn to our expectations about the impact of using real vacancies in FSEs on recruiters' hiring intentions.
In line with well-established discrimination theories, hiring discrimination toward foreigners and ethnic minorities has been documented through numerous field (e.g., Zschirnt and Ruedin 2016; Quillian et al. 2019) and survey experiments (e.g., Auer et al. 2019). Therefore, we assume a negative effect of a foreign background on recruiters' hiring intentions as the benchmark for the effect of migrant background for our study context. Moreover, experimental studies suggest an advantage for women in female-dominated occupations and a disadvantage for women in male-dominated occupations (e.g., Birkelund et al. 2019; Koch et al. 2015; Fernandez and Mors 2008). Based on this evidence, we assume two benchmarks for the effect of gender for our study context: a negative (positive) effect of being female in male-dominated (female-dominated) occupations. However, the empirical evidence on the effect of gender on recruitment decisions is somewhat mixed, and these benchmarks might therefore be less reliable. For example, some studies contest the notion of hiring discrimination towards women in male-dominated occupations (e.g., Petersen and Togstad 2006; Carlsson 2011). Finally, with only a few exceptions (e.g., Nunley et al. 2017), several experimental studies have shown that employers are more reluctant to hire applicants who have been unemployed (e.g., Van Belle et al. 2018; Eriksson and Rooth 2014; Kroft et al. 2013; Oberholzer-Gee 2008). Following this literature, we assume a negative effect of unemployment on recruiters' hiring intentions as our benchmark.
We explore why and how these effects might differ when the rating task refers to a real hiring problem as opposed to a hypothetical one. First, the hypothetical job descriptions and vignettes presented in prior FSEs likely differ from real-life hiring situations, which might lead to a "hypothetical bias" (Ajzen et al. 2004), meaning that respondents answer differently than they would in reality. However, the psychological realism of the rating task increases decisively when the FSE is based on real vacancies for which the respondents are personally responsible in real life. In this case, the vacancies are familiar to the respondents, and they can draw on the same information (e.g., regarding requirements) they would normally use to form hiring decisions for this particular job type. Although hypothetical vacancies might allow researchers to control for possible differences in job requirements, it is difficult to determine whether the information provided in fictitious job descriptions will sufficiently chime with the psychological realism held by respondents. A careful design, such as asking recruiters about typical vacancies in their company (e.g., Van Beek et al. 1997; Di Stasio 2014), could alleviate the potential problem of low psychological realism. Nevertheless, respondents must rely on assumptions about requirements and characteristics of the respective hypothetical vacancy on which the experiment is based, which may differ from real vacancies. The latter is all the more likely the less familiar the respondents are with typical vacancies in the respective occupation studied and the less specific the descriptions of these vacancies are. In contrast, while vignettes only represent simplified applicant profiles (i.e., they contain less information than real application documents), they usually provide realistic information about applicants according to the researchers' interest (if carefully designed). 6 This holds regardless of whether real or hypothetical vacancies are used. Yet, respondents who are presented with a real vacancy will have less difficulty deciding whether the provided information corresponds to reality and matches the respective job characteristics. As a corollary, the information respondents consider when rating the vignettes might be closer to real-life decisions made throughout the hiring process. Therefore, the main effects of gender, migration background, and unemployment should be closer to our derived benchmarks if real vacancies are used.
Second, social desirability is a well-known problem in survey research that might bias results. Although the multidimensional design of FSEs should mitigate the risk of normative influences on vignette ratings, the possibility of social desirability bias cannot be fully excluded. Socially biased results would imply less discrimination by gender or migration background and less negative unemployment effects than theoretically expected. Recruiters might be more cautious with their answers in both FSE versions, as they cannot be sure about the real reason for the experiment; however, they might feel more confident about their vignette ratings when the FSE is based on real vacancies. This is because, as we have argued, real vacancies provide psychological realism to the rating task and recruiters may find it easier to justify their answers if needed. For similar reasons, the use of real vacancies might also reduce the risk of other response biases, such as acquiescence (Krosnick et al. 2014; Schwarz 1999), as respondents might be generally less susceptible to external influences if they are confronted with a real-life hiring problem. Consequently, in line with our considerations above, the effects of applicant characteristics should be closer to our benchmarks when real instead of hypothetical vacancies are used. However, the effects of applicant characteristics might not be affected in the same way. Reluctance to hire based on unemployment could be more easily justified with productivity-related characteristics (e.g., skill loss) than reluctance based on characteristics such as gender or migration background, where social desirability is more likely an issue.
Third, FSEs have been criticized for presenting simplified hiring situations that cannot possibly convey the urgency and pressures surrounding actual hiring (e.g., difficulty finding suitable candidates or a company's need to fill a position promptly). These aspects are more likely to be considered if respondents are confronted with a real hiring problem. Prior research suggests that gender, unemployment, or migration background might be less relevant in such a case (e.g., Baert et al. 2015). Unfortunately, we cannot analyze aspects of urgency in detail due to data limitations (see Sect. 4).
Against this background, we expect differences in the effect of applicant characteristics on recruiters' hiring intentions based on the FSE's design. More specifically, in line with our theory, we expect the effects of gender, migrant background and unemployment on vignette ratings to be closer to the defined benchmarks when real vacancies are used as compared to hypothetical vacancies (Hypothesis 1). Although in both cases, respondents rate hypothetical applicants without real-life consequences, we argue that the perceived realism associated with the rating task is markedly higher in the former setting. Also, the effect of unemployment might be generally less affected by social desirability than the effects of gender and migration background are. Hence, we expect the difference in the effects of gender and migrant background between the FSE designs (see Hypothesis 1) to be greater than the same difference regarding the effect of unemployment (Hypothesis 2). The hypotheses will be tested for each applicant characteristic separately.
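One way to make the two hypotheses concrete is an interaction model. The following is only a sketch under illustrative assumptions; the notation is introduced here and is not taken from the model specification of the study. Here $y_{ij}$ is the rating of vignette $i$ by respondent $j$, $x_{ij}$ is an indicator for one applicant characteristic (e.g., unemployment), $\mathrm{RV}_j$ equals 1 if respondent $j$ belongs to the real-vacancy sample, $u_j$ is a respondent-level error, and $\varepsilon_{ij}$ is a vignette-level error.

```latex
y_{ij} = \beta_0 + \beta_1 x_{ij} + \beta_2 \mathrm{RV}_j
       + \beta_3 \,\bigl(x_{ij} \times \mathrm{RV}_j\bigr) + u_j + \varepsilon_{ij}
```

In this notation, the effect of the characteristic is $\beta_1$ under hypothetical vacancies and $\beta_1 + \beta_3$ under real vacancies. For a characteristic with a negative benchmark, Hypothesis 1 could be read as $\beta_3 < 0$, and Hypothesis 2 as a larger $|\beta_3|$ for gender and migrant background than for unemployment.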

Research design
We conducted an FSE in Luxembourg in 2018 and 2019. 7 To enable comparability with NEGOTIATE (2020), we built on that design using the same pictorial representation of CV and answer scale. 8 Respondents were asked to rate hypothetical descriptions of applicants (vignettes) that varied systematically on a number of different characteristics (dimensions). The rating task either referred to a real vacancy in their company or to a similar but hypothetical job.
The relatively small Luxembourgish labor market did not provide enough real vacancies to conduct a split ballot experiment, where we could have randomly assigned respondents to a real vacancy or a hypothetical job description in the FSE. 9 Therefore, we employed a two-step approach to gather our data. First, we collected real vacancies and recruiter contact information published on different online job portals and company websites in Luxembourg. Second, we sampled recruiters from publicly available lists of companies and businesses associated with the same occupational fields. Respondents sampled based on the former approach rated hypothetical applicants in the FSE based on the real vacancies they were responsible for filling. Respondents sampled based on the latter approach rated applicants for a hypothetical but similar type of job. In the present analysis, we focus on jobs in catering and in information technology (IT). 10 Both groups of respondents have at least some recruitment experience in the respective occupations studied (see Sects. 4.2 and 4.3), but only respondents sampled based on real vacancies are presented with a concrete, realistic hiring problem in the FSE. This enables us to test whether the use of real vacancies triggers different responses while avoiding distortion in our results from differences in the recruitment experience between the two samples (and thus fundamental differences in perceived realism). In the following sections, we will refer to the "real vacancy" (RV) design and "hypothetical vacancy" (HV) design when discussing the two versions of the FSE. However, the reader should keep in mind that we are essentially comparing two different samples.

Vignettes
The vignettes with hypothetical applicants varied in the values of three experimental variables (see Table 2) 11 : gender (male/female), unemployment (no unemployment/one year of unemployment after graduation/one year of current unemployment), and migrant group, which was signaled to respondents by the applicant's nationality and country of residence (Luxembourgish, foreigners: Portuguese/Luxembourgish-Portuguese/French, border workers: French/German). Luxembourgish-Portuguese nationality was used to signal dual citizenship. The migrant groups selected result from the unique demographic situation in Luxembourg, which is characterized by a multi-cultural population and workforce. 12 The amount of work experience was held constant in the experimental design: each vignette showed 48 months of occupation-specific work experience at a company located in Luxembourg. To make the information provided in our vignettes as realistic as possible, the professional titles all matched the most common job titles in each occupation under investigation.
The experimental design included 36 vignettes, representing all possible combinations of applicant characteristics ($2^1 \times 3^1 \times 6^1$). We divided the 36 vignettes into 6 equally sized decks, each including 6 vignettes. Following Kuhfeld (2010), we used D-efficient blocking to allocate the vignettes to decks, that is, we optimized the decks for maximum variance and orthogonality between each vignette dimension (Atzmüller and Steiner 2010; Dülmer 2007). We assigned each deck randomly to respondents. As recommended in the literature, we randomized the order of vignettes across respondents to avoid ordering bias. For each vignette, we asked the respondents to rate the likelihood of considering the respective applicant for the advertised position in their company (RV design) or a typical position from the same field (HV design) on an 11-point scale (0 = practically zero to 10 = excellent). 13 The vignettes presented to the respondents were identical in both versions of the FSE. Figure 1 shows an example of a vignette presenting a male foreigner with Portuguese nationality and one year of unemployment after graduation applying for a catering job. The layout of the vignettes resembles the structure of a CV.
10 The FSE was actually conducted in five occupations. However, due to low sample sizes in some occupations, we are only able to consider the two occupations named in our study; we had to exclude respondents from the fields of mechanics, finance, and nursing.
11 Between five and nine dimensions are usually recommended in the literature to increase the variation between vignettes and thereby avoid fatigue effects (Sauer et al. 2011). However, a more complex design would have required a larger sample size that would have been difficult to achieve in Luxembourg. To control for possible fatigue effects, we included indicators for the vignette order in our regression models (see Sect. 4.5).
12 Almost half of the general population in 2019 were foreigners (47%), of which 33% were Portuguese, the largest foreign group, followed by the French (16%); own calculations based on data retrieved from Luxembourg's statistical office (STATEC): https://statistiques.public.lu/stat/TableViewer/tableView.aspx?ReportId=12853&IF_Language=eng&MainTheme=2&FldrName=1, January 4, 2020. Moreover, a large proportion of workers (46%) are border workers living in one of the neighboring countries but working in Luxembourg, of which French border workers make up over half (53%), followed by Germans (24%); own calculations based on data retrieved from STATEC: https://statistiques.public.lu/stat/TableViewer/tableView.aspx?ReportId=12919&IF_Language=eng&MainTheme=2&FldrName=3&RFPath=92, January 4, 2020.
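The structure of the vignette universe described above can be illustrated with a short Python sketch. It enumerates the 2 × 3 × 6 full factorial, splits it into 6 decks of 6 vignettes, and shuffles the presentation order within a deck. Note the important simplification: the decks here are formed by simple random blocking, which is a stand-in and not the D-efficient blocking the study reports. The level labels and the function names (`build_universe`, `random_decks`, `presentation_order`) are hypothetical and chosen only for illustration.

```python
import itertools
import random

# Dimension levels paraphrased from the text; the labels are illustrative.
GENDER = ["male", "female"]
UNEMPLOYMENT = ["none", "one year after graduation", "one year, current"]
MIGRANT_GROUP = [
    "Luxembourgish",
    "foreigner: Portuguese",
    "foreigner: Luxembourgish-Portuguese",
    "foreigner: French",
    "border worker: French",
    "border worker: German",
]


def build_universe():
    """Return every combination of the three dimensions (full factorial)."""
    return list(itertools.product(GENDER, UNEMPLOYMENT, MIGRANT_GROUP))


def random_decks(vignettes, n_decks=6, seed=0):
    """Split the vignettes into equally sized decks by random blocking.

    Assumption: this replaces the D-efficient blocking used in the study
    with a plain shuffle, so the decks are balanced in size only.
    """
    rng = random.Random(seed)
    shuffled = list(vignettes)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_decks
    return [shuffled[i * size:(i + 1) * size] for i in range(n_decks)]


def presentation_order(deck, rng):
    """Return a shuffled copy of one deck for a single respondent."""
    order = list(deck)
    rng.shuffle(order)
    return order


universe = build_universe()
decks = random_decks(universe)
print(len(universe), len(decks), len(decks[0]))  # 36 6 6
```

A respondent would then receive one randomly chosen deck, for example `presentation_order(random.choice(decks), random.Random())`. Reproducing the reported design would require replacing `random_decks` with a D-efficient blocking procedure.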
In the introduction to the rating task, the respondents in both FSE versions were instructed to assume that all applicants fulfilled the minimum language requirements as well as minimum requirements regarding educational credentials. 14 All respondents were informed that they would have the possibility to indicate selection criteria that are important for the respective position (and are missing in the vignettes) after the rating task.
In the HV version of the FSE, respondents were asked before the experiment whether they typically hired workers for that job type. Respondents who did not were excluded from the survey. The job types consisted of general occupational categories based on codes of the International Standard Classification of Occupations 2008 (ISCO-08; e.g., waiter/waitress, system administrator) that matched the sampled real vacancies for the RV version of the FSE (see Sect. 4.2). The rating task in the HV design was based on these generic job types as opposed to the RV design, where the rating task was based on the sampled real vacancies. We did not provide the respondents with details about the hypothetical job, as the main objective of this study is to test whether using real vacancies triggers more valid ratings of vignettes than using vague hypothetical vacancies does.

Sampling of real vacancies
We sampled job advertisements from various online job portals and company websites in Luxembourg based on pre-defined ISCO-08 categories. We focused on entry-level jobs located in Luxembourg. We searched seven generic job portals as well as occupation-specific job portals and company websites using a set of keywords and search categories that fit the ISCO categories. The online Appendix includes a complete list of these job portals and websites as well as the ISCO categories. The vacancies were collected manually; for the field of IT, additional vacancies were collected with the help of a web-scraping tool to facilitate the sampling process. Screenshots of all job advertisements were taken and later displayed to the respondents in the RV design before the FSE to remind them of the respective vacancy. When a job advertisement provided no contact information for recruiters, we called the company and requested the name and email address of the person responsible for filling the vacancy. The vacancies were collected over 16 weeks between July and November 2018 (four weeks were reserved for telephone calls). We received email addresses of 203 recruiters in catering and 168 recruiters in IT.

Sampling strategy: hypothetical vacancies
For the HV design of the FSE, we sampled respondents from publicly available lists of companies for each occupational field. Since the availability of such lists differed between occupations, we used different sources (e.g., yellow pages) and strategies for each occupation. 15 A detailed description of this sampling process is provided in the online Appendix. Company names, addresses, and phone numbers were collected manually and with the help of a computer program. If available, we collected the names and email addresses of the persons responsible for recruitment in the respective companies or of persons who were most likely involved in or had experience in hiring within the respective field (e.g., managing directors or business owners). We called the sampled companies and requested the contact information of a person actively involved in recruitment decisions. In total, we collected the email addresses of 173 people in catering and 279 in IT.

Data collection and realized sample
We invited respondents via email to participate in an online survey about recruitment and operational staffing needs. 16 The survey was offered in three languages (French, German, and English). The data collection period started on November 22, 2018 and ended on January 25, 2019. In total, we sent out 823 invitations across the two occupations, and 140 respondents participated in the online survey (overall response rate of about 17%). The response rate for the RV version of the FSE was almost two times higher (22%) than the response rate for the HV version of the FSE (13%). The response rate was 19.7% in catering and 14.8% in IT. As in the whole sample, the response rate in the respective RV sample was considerably higher than in the HV sample. These results suggest that using real vacancies might actually increase interest in the survey and the willingness to participate. 17 The response rates are within the range of similar recruiter surveys using FSEs (e.g., Van Belle et al. 2018;Damelang et al. 2019).
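The reported invitation counts and response rates can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text; the variable and function names are illustrative, and the implied respondent counts are approximations because the published percentages are rounded.

```python
# Invitations per design and occupation, as reported in the sampling sections.
invitations = {
    ("RV", "catering"): 203,
    ("RV", "IT"): 168,
    ("HV", "catering"): 173,
    ("HV", "IT"): 279,
}
respondents_total = 140  # reported number of survey participants


def invited(design=None, occupation=None):
    """Sum invitations, optionally restricted to one design or occupation."""
    return sum(
        n for (d, o), n in invitations.items()
        if design in (None, d) and occupation in (None, o)
    )


total_invited = invited()
overall_rate = 100 * respondents_total / total_invited

# The reported (rounded) rates only imply approximate respondent counts.
reported_rates = {"RV": 22.0, "HV": 13.0}
implied_respondents = {
    design: rate / 100 * invited(design=design)
    for design, rate in reported_rates.items()
}
ratio_rv_to_hv = reported_rates["RV"] / reported_rates["HV"]

print(total_invited, round(overall_rate, 1), round(ratio_rv_to_hv, 2))
```

Under these figures, the RV response rate is roughly 1.7 times the HV rate.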
Our sample consisted of 553 vignette ratings from 93 respondents who participated in the FSE: 300 observations from 50 respondents in catering and 253 observations from 43 respondents in IT. 18 We found differences in the distribution of key respondent characteristics between the two FSE versions (RV versus HV) in each occupation, particularly regarding gender and citizenship. Therefore, we used entropy balancing (Hainmueller and Xu 2013) to adjust the distribution of gender, age, citizenship, and education in the two sub-samples (RV and HV) to the distribution of these characteristics in the overall sample in each occupation. We used these weights in our regression analyses. 19 Due to missing values, our final analytic sample was slightly reduced, to 282 observations from 47 recruiters in catering and 240 observations from 40 recruiters in IT. Tables 3 (catering) and 4 (IT) in the Appendix show that the weights account well for the differences in the distribution of respondent characteristics between the two FSE versions.
15 Each company in Luxembourg is required to announce vacancies at Luxembourg's National Employment Agency (ADEM), which keeps records of company names and contact persons in each company. Unfortunately, we were not granted access to this list.
16 M.I.S. Trend, a Swiss-based research institute, carried out the data collection on our behalf. The invitation requested that the email be forwarded to the person responsible for recruitment when the contacted person was not involved in their company's recruitment process (HV design) or not responsible for filling the given vacancy (RV design). As is usually the case in self-administered surveys, this is no guarantee that the target person is the one who filled out our questionnaire.
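The idea behind the entropy-balancing step can be illustrated with a minimal sketch. It is not the implementation used in the study: the toy sub-sample, the target shares, and the function name `balance_weights` are hypothetical, and a plain gradient descent on the dual problem stands in for the optimizer of the original software. The weights take the form w_i proportional to exp(lambda · (x_i − target)), and lambda is chosen so that the weighted covariate means equal the targets.

```python
import math


def balance_weights(rows, targets, step=1.0, iters=2000):
    """Return weights summing to 1 whose weighted covariate means equal `targets`."""
    k = len(targets)
    # Centre each covariate on its target so that balance means sum(w * z) = 0.
    z = [[x[j] - targets[j] for j in range(k)] for x in rows]

    def weights(lam):
        scores = [math.exp(sum(l * v for l, v in zip(lam, zi))) for zi in z]
        total = sum(scores)
        return [s / total for s in scores]

    lam = [0.0] * k
    for _ in range(iters):
        w = weights(lam)
        # Gradient of log(sum(exp(lam . z_i))) is the weighted mean of z.
        grad = [sum(wi * zi[j] for wi, zi in zip(w, z)) for j in range(k)]
        lam = [l - step * g for l, g in zip(lam, grad)]
    return weights(lam)


# Hypothetical sub-sample: (female, national citizenship) per respondent.
sub_sample = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (1, 0), (0, 1), (0, 0)]
# Hypothetical shares in the overall sample that the sub-sample should match.
targets = (0.5, 0.4)

w = balance_weights(sub_sample, targets)
weighted_means = [
    sum(wi * x[j] for wi, x in zip(w, sub_sample)) for j in range(len(targets))
]
print([round(m, 4) for m in weighted_means])
```

In this toy case the unweighted shares are 0.375 for both covariates; after reweighting, they match the targets while the weights stay as close to uniform as the constraints allow.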
On average, each vignette was evaluated 8.7 times in the RV version of the FSE (catering: 5.2, IT: 3.5) and 6.3 times in the HV version (catering: 3.2, IT: 3.7). In both FSE versions, respondents used the whole answer scale, and the distribution of vignette ratings was slightly left-skewed in both occupations (see Figs. 4 and 5 in the Appendix). Given that the respondents were instructed to assume that requirements regarding educational qualifications were met, it is not surprising that the distribution tended toward positive values. Tables 5 and 6 in the Appendix further show that correlations between vignette variables were close to zero and not statistically significant for both occupations.
Because most of the vignettes presented potentially negative signals (e.g., unemployment) and because, according to our theory, real vacancies might help reduce the risk of response bias, we expected the average vignette ratings to be lower in the RV version of the FSE. However, a Wilcoxon-Mann-Whitney test comparing the vignette ratings between the FSE versions showed no significant differences in either occupation.
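A comparison of this kind can be sketched as follows. The implementation is a simplified illustration, not the routine used in the study: it uses midranks for ties and a tie-corrected normal approximation without continuity correction, and the two rating vectors are invented for demonstration.

```python
import math
from collections import Counter


def mann_whitney_u(a, b):
    """Two-sided Wilcoxon-Mann-Whitney test (normal approximation, tie-corrected)."""
    pooled = list(a) + list(b)
    n1, n2, n = len(a), len(b), len(pooled)
    counts = Counter(pooled)
    # Midrank of each distinct value: the average of the positions it occupies.
    rank, start = {}, 0
    for value in sorted(counts):
        t = counts[value]
        rank[value] = start + (t + 1) / 2
        start += t
    # U statistic for the first group, from its rank sum.
    u1 = sum(rank[v] for v in a) - n1 * (n1 + 1) / 2
    tie_term = sum(t**3 - t for t in counts.values()) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    if variance == 0:
        return u1, 1.0  # all observations identical
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, p


# Hypothetical vignette ratings for the two FSE versions.
ratings_rv = [3, 5, 6, 7, 7, 8, 9, 4]
ratings_hv = [4, 5, 6, 6, 7, 8, 8, 10]
u_stat, p_value = mann_whitney_u(ratings_rv, ratings_hv)
print(u_stat, round(p_value, 3))
```

Because vignette ratings are nested within respondents, a test like this treats the ratings as independent observations and should be read as a descriptive check rather than a substitute for the multilevel models.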

Method
We estimated the effect of applicant characteristics on vignette ratings (i.e., recruiters' hiring intentions) using linear multilevel models. Given that each respondent rated six vignettes, the ratings were nested within respondents. Multilevel modeling accounted for this clustering in our data (Hox 2010). We estimated separate models for each occupation to account for possible general differences in staffing needs and recruitment processes that might affect our results. Moreover, the FSEs conducted for the two occupational fields (catering and IT) can be seen as two separate experiments.
Our hypotheses pertain to the potential difference in the overall effects of migration background, gender, and unemployment based on the FSE design. We tested these hypotheses by combining categories 2 and 3 of unemployment and categories 2 to 6 of migrant background (see Table 2) into one category each, indicating unemployment experience and a foreign background, respectively. For each occupation, we estimated a model interacting the dummies for gender (1 = female), unemployment (= 1), and foreign background (= 1) with the indicator for sample type (i.e., the FSE version). We further controlled for possible fatigue or primacy effects by including dummies for each vignette position (first to sixth). Our regression analyses were weighted (see Sect. 4.4). We calculated the marginal effects (Williams 2012) of each vignette variable on hiring intentions. Full models are shown in Table 7 in the Appendix. 20
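One way to write down the model described above is given below. The notation is ours and rests on assumptions not stated in the text: a random intercept u_j for respondent j, the first vignette position as the reference category, and the symbols F (female), U (unemployment), M (foreign background), RV (real-vacancy design), and P (vignette position) for rating i of respondent j.

```latex
\begin{aligned}
y_{ij} ={}& \beta_0 + \beta_1 F_{ij} + \beta_2 U_{ij} + \beta_3 M_{ij} + \delta\, RV_j \\
 &+ \gamma_1 (F_{ij} \times RV_j) + \gamma_2 (U_{ij} \times RV_j) + \gamma_3 (M_{ij} \times RV_j) \\
 &+ \sum_{p=2}^{6} \pi_p P_{ij}^{(p)} + u_j + \varepsilon_{ij}
\end{aligned}
```

In this formulation, the marginal effect of, for example, being female is \beta_1 in the HV design and \beta_1 + \gamma_1 in the RV design, so the interaction coefficients \gamma_1, \gamma_2, and \gamma_3 capture the design differences that Hypothesis 1 refers to.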

Main results
The marginal effects of applicant characteristics on recruiters' hiring intentions are displayed in Fig. 2a (catering) and Fig. 2b (IT). Having a foreign background significantly reduced recruiters' hiring intentions in the catering field (see Fig. 2a) at the 5% level in the HV version and at the 10% level in the RV version, matching the respective benchmark. We observed a slightly more negative effect of being a foreigner in the HV version (against our expectations), but the difference is not statistically significant. Hence, our findings contradict Hypothesis 1 for migrant background in catering. Regarding IT jobs (Fig. 2b), we found no significant effect of migrant background in either FSE version, contradicting the benchmark; while the effects tended to be negative in the RV version and positive in the HV version, the difference was not statistically significant. Consequently, we again found no support for Hypothesis 1 regarding migrant background in IT.
As for the applicants' gender, we observed positive effects of being female in catering in both FSE versions, in line with the associated benchmark (Fig. 2a), but the effect was significant only in the RV version (p < 0.05). This finding might be explained by the relatively high share of female workers in the catering sector in Luxembourg. 21 The observed effect of applicant gender was slightly less positive in the RV design, but the difference was not significant (contradicting Hypothesis 1). In turn, the gender effect on vignette ratings was close to zero and not significant in each IT FSE version (Fig. 2b). Unsurprisingly, we also found no support for Hypothesis 1 regarding gender in IT.
Finally, as shown in Fig. 2a, unemployment significantly reduced recruiters' hiring intentions toward applicants for catering jobs in both FSE versions (RV: p < 0.01, HV: p < 0.01). These findings support our benchmark regarding the effect of unemployment. Similar to the results regarding migrant background, we observed a slightly more negative effect of unemployment in the HV version; however, the difference was not significant (contradicting Hypothesis 1 regarding unemployment in catering). Fig. 2b reveals that unemployment also had a highly significant negative effect on recruiters' hiring intentions in both IT FSE versions (p < 0.001). The unemployment effect did not differ between the two FSE versions, and we found no support for Hypothesis 1 in IT.
Our results also do not support Hypothesis 2. The differences in the effect sizes between the two FSE designs for unemployment do not statistically differ from those for gender or migration background. This holds for both occupations.

Additional analyses
We repeated our analyses using the operationalization of vignette variables presented in Table 1 to test whether we find similar effects and design differences when considering specific categories of migration background and unemployment. Figure 3 shows the marginal effects resulting from these models in catering (Fig. 3a) and IT jobs (Fig. 3b). The full models are presented in Table 8 in the Appendix.
As shown in Fig. 3a for jobs in catering, the vignette ratings were on average lower for German border workers (HV: p < 0.01, RV: p < 0.05), French foreigners (HV: p < 0.05, RV: p = 0.058), and French border workers (HV: p = 0.065, RV: p < 0.05) compared to Luxembourgish natives. For both categories indicating French nationality, the effects seemed to be more negative in the HV than in the RV design. However, we did not find significant differences in the effects of the above characteristics between the FSE versions. This is in line with our findings regarding Hypothesis 1 when looking at the combined effect of migrant background. Figure 3b shows that there were no significant effects in the HV version for any category of migrant background in IT. Also, whereas most effects tended to be negative in the RV version, all effects were positive in the HV version, contrary to our benchmarks. We only observed a significant effect for French border workers in the RV version (p < 0.001). Most importantly, the difference in this effect between the two FSE versions was significant at the 10% level (p = 0.064), supporting Hypothesis 1.
Our additional analyses revealed no substantial changes in the effect of applicants' gender on vignette ratings.
Regarding the negative unemployment effect in catering (Fig. 3a), we found that only current unemployment significantly reduced recruiters' hiring intentions for both FSE versions (p < 0.001). The effect was slightly less negative in the RV version, but the difference was not statistically significant. Regarding IT jobs (Fig. 3b), current unemployment had a highly significant negative effect on vignette ratings in both FSE versions (p < 0.001); however, the effect seemed to be more negative in the RV version. The effect of past unemployment was (marginally) significant at the 5% level in both FSE designs (HV: p = 0.052, RV: p < 0.05). None of the differences in unemployment effects between the FSE versions were significant. Consequently, in line with our results reported in Sect. 5.1, we found no support for Hypothesis 1 regarding unemployment in either catering or IT. Finally, we found no support for Hypothesis 2 in either occupation.

Discussion and conclusions
In this study, we conducted an FSE for two occupational fields (catering and IT) in Luxembourg to examine whether the effects of applicant characteristics on observed hiring intentions differ when the vignette ratings refer to a real instead of a hypothetical vacancy. We focused on three applicant characteristics in our analyses: migration background, gender, and unemployment.
Based on multilevel regressions, our results regarding the main effects of applicant characteristics largely correspond to our benchmarks. However, we observed some differences based on the occupational field. Regarding migration background, applicants with a foreign background face disadvantages in catering, which holds for both FSE versions. In the RV version, the effect was slightly less negative than in the HV version. In contrast, migration background had no significant effect on vignette ratings in IT, although the respective effect tended toward different directions depending on the FSE design. The effect of gender (1 = female) tended to be positive in both FSE versions in catering but only in the HV version in IT. In the RV version in IT, the effect of gender tended to be negative. Women might have an advantage in catering because service jobs are typically considered to be "female" jobs (e.g., Booth and Leigh 2010). A female disadvantage in IT is, in turn, in line with prior research suggesting that women are discriminated against in STEM (science, technology, engineering, math) occupations (e.g., Kübler et al. 2018). However, the effect of gender was not significant in IT in our study. Finally, we found very similar results regarding unemployment for both occupations: in both cases, unemployment significantly reduced hiring intentions. The respective effect barely differed between the FSE designs in IT but was slightly less negative in the RV version in catering. The differing patterns in the observed effects between occupations might be explained by general structural differences between the two sectors.
Overall, none of the observed differences in the effects between the FSE versions were statistically significant. A more fine-grained analysis of single categories of migration background revealed that French border workers were particularly disadvantaged in IT in the RV version. In the HV version in IT, the effect of being a French border worker tended to be positive but was not significant. The observed difference was statistically significant at the 10% level. This finding lends some support to the use of real-world vacancies, as our benchmark suggested a negative effect of being a non-native person on recruiters' hiring intentions. Excluding this finding, our results provide little evidence for differences in the effect of applicant characteristics on vignette ratings between RV and HV FSE designs.
We observed in this study and the pilot study that response rates in the RV design were about twice those in the HV design of the FSE. If response rates are of special concern, efficient sampling of real vacancies might be an attractive option. Most importantly, higher involvement of recruiters in the FSE due to real vacancies probably also contributes to higher quality of answers and higher internal validity of the results.
This study has some limitations, which make an overall conclusion about the impact of the FSE design on hiring intentions difficult. Our findings must therefore be interpreted with caution. First, our small sample of real vacancies made it impossible to employ a true split ballot experiment. A random assignment of respondents to one of the two FSE designs would have provided a more robust estimate of the effect of using real vacancies. Instead, we compared two different samples that slightly differed in the composition of the respondents: (i) actual recruiters responsible for filling a real vacancy in their company and (ii) a mix of human resource professionals, managers, and directors. Hence, the comparability of the two samples is limited. In our regression analyses, we used weights to balance both FSE samples for key respondent characteristics. However, using weights does not change our main findings compared to an analysis without weights. Additional analyses, which we present in the online Appendix, further revealed that the randomization processes within our experiment were successful. We found correlations very close to zero and not statistically significant between vignette variables and observed respondent characteristics (see Tables A3 and A4). We also estimated fixed effects regressions with vignette variables as predictors to test the influence of unobserved respondent characteristics. The correlation corr(u_j, X) between the error term and control variables can be taken as an indicator for omitted respondent characteristics that influence vignette ratings. In both occupations, these correlations were close to zero (see Table A5). Second, to reliably test whether using real vacancies in FSEs leads to more valid vignette ratings than using hypothetical vacancies, behavioral benchmarks reflecting the true effects of our applicant characteristics on hiring decisions would have been desirable; however, these were not available for Luxembourg.
Hence, we compared our results to "theoretical" benchmarks derived from previous experimental research from other contexts (see Sect. 3). Third, our sample size was rather small. Some of our null findings might be due to limited statistical power because of the low number of observations. Moreover, we were unable to go beyond a broad comparison of two occupations (catering and IT specialists) and consider potential contextual moderators in more detail. The job requirements as well as structural characteristics (e.g., unemployment rates and labor market tightness) typically differ between labor market sectors. These factors may have an impact on how employers interpret and act upon the information provided in vignettes. While this limits the generalizability of our findings concerning the main effects of applicant characteristics on recruiters' hiring intentions, it should not affect the generalizability of our methodological effects. Nevertheless, as stated in Sect. 3, differences in vignette ratings when real vacancies as opposed to hypothetical ones are used might be particularly pronounced if the urgency of finding suitable candidates is high. This might be the case, for example, if recruitment is perceived to be difficult. Most respondents in our sample generally found it difficult to recruit suitable workers, 22 so the low variance in perceived recruitment difficulty made a comparative analysis along this dimension impossible.
Finally, the idiosyncrasies of Luxembourg limit the comparability of our results with other studies. For example, the Luxembourgish labor market is unique in the multi-cultural composition of its workforce. As mentioned above, different outcomes, particularly regarding the main effects of applicant characteristics on recruiters' hiring intentions, may be expected in other contexts.
The question of whether we should use real or hypothetical vacancies when studying hiring prompts more general considerations about sampling and inference. Researchers will have to assess whether they wish to make statements about mechanisms at the level of companies, occupations, vacancies, or recruiters or a combination of these. For example, if the main interest is to investigate potential discrimination, it will be beneficial to sample randomly from a pool of actually available vacancies, as they may differ systematically from filled positions in a given occupation. If the primary aim is to advance our general understanding of recruiter behavior in a given sector, on the other hand, presenting hypothetical vacancies to a random selection of recruiters from this sector might be adequate.
This study draws attention to potential issues surrounding the routine use of hypothetical vacancies in FSEs, an area that has been widely neglected in previous studies on recruiters' hiring intentions. Our results suggest that using hypothetical vacancies might be, within the limits of FSEs, a valid approach. Note, however, that the effect of using real vacancies might have been more pronounced if the differences between our two samples had been larger (e.g., recruiters with only general recruiting experience vs. recruiters with job-specific recruiting knowledge). Identifying the effect of using real vacancies would, however, have been more difficult with such an approach, as both the type of sample and the type of vacancy potentially affect the perceived realism of the rating task.
In any case, the potential effect of using real-world vacancies in FSEs on response behavior deserves further scrutiny. This seems particularly important given the potential methodological and practical implications for the increasing number of FSEs in employer studies. Substantial and theoretically plausible differences between FSE designs would further encourage the use of real vacancies, as internal and external validity would likely be increased. Besides addressing aspects that could not be analyzed in this study (e.g., the role of recruitment difficulties), the present analyses could be extended by comparing the survey responses from vignette ratings referring to real vacancies with behavioral data to test the internal and external validity of the results obtained from both types of FSE (i.e., RV versus HV design). In doing so, researchers should ensure comparability in experimental designs and sample composition between studies (Petzold and Wolbring 2019).
A better understanding of the implications of the differences in FSE designs between studies is surely needed to establish best practices in employer studies. Prior research on general methodological questions related to the design of FSEs (e.g., Auspurg et al. 2019;Auspurg and Jäckle 2017;Sauer et al. 2011;Shamon et al. 2019) continues to be highly relevant for the application of FSEs to study employer preferences, as it provides common guidelines for designing vignettes and answer scales. More studies on methodological aspects of FSEs are required, however, to advance our understanding of their possibilities and limits in research on hiring. We hope that our comparison of designs using real and hypothetical vacancies contributes to this emerging strand of methodological inquiry.

Table 5
Correlations between vignette variables: catering Pairwise correlations between all levels of vignette variables (as dummies), Pearson's correlation coefficient.