1 Introduction

The topic of discrimination based on, for example, ethnic origin and gender in the labour market came under scrutiny of the economics discipline after the influential doctoral dissertation entitled The Economics of Discrimination by the Nobel laureate Gary Becker in 1957. Becker proposed the concept of taste discrimination in which prejudiced persons receive disutility from their interaction with certain groups of people. Hence, they monetise their prejudice by applying a mark-up to the transaction. Even if two demographic groups were to have identical productive characteristics, such a mark-up leads to differences in compensation. In the labour market, taste discrimination can be classified by the source of the prejudice into employer, employee, and customer discrimination.

To consistently estimate the impacts of discrimination, researchers would ideally like to observe labour market performances of two groups which are the same in every aspect except for the characteristic of interest such as gender or ethnic origin (see Baert (2018); Bertrand and Duflo (2016); Lane (2016); Zschirnt and Ruedin (2016) for extensive reviews). Many studies (Bertrand and Mullainathan 2004; Carlsson and Rooth 2007) focused on racial discrimination at the very first stage of entry into the labour market and adopted the technique called “correspondence testing”, which is to create fake CVs, allocate a fake ethnicity at random to each CV, and then send these CVs in response to job adverts. They found that ethnic minorities received significantly fewer callbacks for job interviews. However, in some studies (Kaas and Manger 2012), such a difference disappears when favourable information about the applicants’ personality is included in the applications. The authors interpreted this finding as evidence for statistical discrimination (Footnote 1).

In terms of appearance, Hamermesh and Biddle (1994) found significant effects of beauty on earnings in the USA. Yet there is evidence of sorting by looks and of a beauty premium in some occupations, such as salespersons and lawyers, where the workers have to appear in public or deal with buyers directly (Biddle and Hamermesh 1998). Hence, a fraction of the beauty premium could result from productive characteristics of beauty. A few studies used the correspondence testing technique in this domain as well. Rooth (2009) showed that an obese applicant received 20% fewer callbacks for an interview. Based on the correlation between job performance and being obese, he concluded that customer discrimination and/or statistical discrimination could be the explanation. Similarly, Kraft (2012) found that unattractive candidates are 14% less likely to get an invitation and have to wait a couple of days longer for a callback. However, he did not find differential effects between high and low customer contact positions.

Recently, studies have used correspondence testing to uncover discrimination against certain religious practices. In France, Valfort (2017) distinguished the effects of applicants’ religion from those of their country of origin by comparing the callback rates among fictitious candidates whose religion is Catholicism, Judaism, or Islam. All of them came from Lebanon, completed high school, then received a certificate in Paris and became naturalised French citizens. She found that being a practising Catholic raises the callback rate by 30% and 100% relative to being a Jew or being a Muslim, respectively. However, these disadvantages could be alleviated if the applicants signalled through extra-curricular activities in their CVs that they were secular rather than serious practitioners of these religions.

With regard to religious practice signalled by attire, Weichselbaumer (2016) sent out the same CV of a female candidate with different combinations of names and photos to job openings for secretaries, accountants, and chief accountants in Germany. In particular, she took photos of the same woman with and without a headscarf and assigned either German- or Turkish-sounding names to the CV. Her results showed that a photo with a German name was significantly more likely to get a callback than the same photo with a Turkish name. In addition, a woman with a Turkish name and a headscarf suffered additional discrimination compared to the same Turkish woman without a headscarf.

Incorporating more variation in characteristics of both applicants and recruiters than previous studies, this paper demonstrates that interaction between religious practice and positive characteristics of applicants can mitigate impacts of discrimination, while it provides some evidence of additional heterogeneous effects based on recruiters’ characteristics. In particular, our study design combines a randomised CV approach with a laboratory experiment, which allows us to assess the effects of beauty, ethnicity, and religious practice in the same empirical model. We recruited students from universities in Hannover, Germany, to participate in an experiment where they were asked to select applicants for an interview for fictitious positions from a pool of candidates whose CVs were randomised in every other aspect except for appearance. An additional advantage of conducting such a lab experiment is the ability to control for personal characteristics of the participants (acting as HR recruiters) involved in the selection process (Footnote 2).

We also tracked the time each participant used to evaluate each part of the presented CVs. This extra information allows us to (partially) control for the dual-process framework in judgement and decision making that could lead to bias in the hiring process (Derous et al. 2016). The dual-process theories involve type 1 and type 2 processes where the former is spontaneous, intuitive, effortless, and fast, while the latter is deliberate, rule-governed, effortful, and slow (Kahneman and Frederick 2002). Specifically, we can control for the relative time each participant spends on each component of a particular CV, especially the photo page, with respect to his or her own average. Hence, any influences of heuristics involving type 1 process could be partialled out from our results.

Exploiting the sizeable proportion of Turkish descendants in Germany, we randomly insert photos of the same Turkish-looking women with and without headscarf into CVs in the experiment (Footnote 3). Apart from providing consistent estimates of the differences in the probability of being selected for an interview owing to beauty, ethnicity, and headscarf, we attempt to identify the source of such discrimination based on job positions, characteristics of the CVs, and the participants. Specifically, we classify our job openings into high- and low-skilled occupations as well as jobs with and without (or with minimal) customer contact. We hypothesise that recruiters for occupations with more customer interaction would prefer better-looking persons and avoid minorities or females with headscarf due to anticipated customer discrimination, expecting this to lead to potentially higher productivity such as higher sales. Conversely, any discrimination observed in low customer contact jobs such as back-office operations could mainly arise from either within-firm (employer/employee) discrimination or statistical discrimination.

Our results suggest that the beauty premium prevails in all types of occupations and is quite large in high-skilled occupations. So, the beauty premium could be driven by both taste discrimination and potential productive attributes of beauty. Nevertheless, a slightly larger premium in high-skilled jobs supports the argument for employee discrimination because this is a sector where our participants could relate to the candidates as their future co-workers. Interestingly, better-looking candidates with the same gender as the recruiter are less likely to be chosen for the interview. Although a simple difference in the probability of being chosen between Turkish and German applicants shows a significant discrimination against Turks, such an effect disappears after controlling for beauty and interactions between some applicants’ characteristics and headscarf. This finding provides an alternative explanation to the previous studies that racial discrimination in Germany might be partly explained by the fact that Turkish applicants (at least in our sample) are perceived as less beautiful than German-looking counterparts.

Similar to Weichselbaumer (2016), Baert et al. (2017), and Valfort (2017), we find that positive characteristics mitigate negative impacts of religious practice and ethnicity in job recruitment. Such results provide circumstantial evidence for the importance of statistical discrimination (Haan et al. 2017) and biased beliefs (Reuben et al. 2014) as potential causes of discrimination against headscarf. In other words, supplying more productivity-relevant information could reduce average differences in perceived unobservable characteristics between candidates with and without headscarf. Yet unlike, for example, Baert and De Pauw (2014) who gather information on key attitudes underlying different mechanisms, our method is only an indirect way (yet popular among researchers) to try to isolate taste-based discrimination from statistical discrimination (Bertrand and Duflo 2016). Specifically, the main goal of our study is to investigate when people discriminate against the headscarf, rather than why.

Despite the non-dynamic setting of our experiment, the reversal in discrimination effects against headscarf, from over-penalising candidates with low characteristics to over-rewarding those with high characteristics, is in line with the theory of the dynamics of discrimination proposed by Bohren, Imas, and Rosenberg (The dynamics of discrimination: theory and evidence, unpublished). Using data from an online platform, they showed that women with no prior evaluation score on the platform were discriminated against but women with a history of positive evaluations were favoured over their male counterparts. Similarly, in our case, the recruiters may hold certain biased beliefs about the abilities of females with a headscarf. Thus, extra information on positive characteristics helps to reverse negative effects of discrimination into advantages for this group of applicants.

Furthermore, the extent of discrimination against headscarf (conditioning on being Turkish) is more prominent in the case of high-skilled occupations and jobs with customer contact. Interestingly, we find evidence that such discrimination seems to be driven by male recruiters (Footnote 4). Older participants are those who discriminate against headscarf but at the same time they value good characteristics of these applicants. Our results imply that the characteristics and behaviours of the “recruiter” could be the main driver of observed discrimination, particularly against headscarf, during the hiring process. Yet such practices might not reflect the best interest of their employers in terms of firms’ productivity or profit.

The paper is organised as follows. Section 2 explains the experimental design. Section 3 presents our methodology. Sections 4 and 5 show the results and robustness checks respectively. Section 6 discusses and concludes; figures and tables are included in the Appendix.

2 Experimental design

The experiment is divided into two parts, and all participants are students from local universities. Both parts were programmed using the software z-Tree (Fischbacher 2007). The first part was carried out in December 2015, when 120 students had to act as HR personnel and choose candidates for a callback based on CVs. This part took 1 h, and the students were paid 20 Euro for participation. The decisions were not incentivised in order to capture an unbiased perception of the students. Otherwise, participants might choose what they think the researchers prefer and not reveal their true preferences. Descriptive statistics for the participants are summarised in Table 1. To assess the importance of beauty for the probability of being selected, we conducted the second experiment, where the only task was to rate the photos from the first experiment. The rating was performed in March 2016, with 40 students in total. The second part took around 20 min, and the students were paid 7 Euro. Both experiments took place in the computer lab of the Leibniz Universität Hannover in several sessions, each of them with 10 to 17 participants. We did not acquire any photos in Hannover, so the possibility that “recruiters” had ever met the candidates in real life is negligible.

Table 1 Descriptive statistics for the experiments

As the objective of the first part was to simulate the recruiting process, participants were asked to select applicants for an interview for fictitious positions. For each position, they selected applicants after reviewing seven characteristics, application photos (Footnote 5), and the names of the applicants. The seven characteristics were presented in the following order:

  • Work experience (ranging from 0 to 3 years)

  • Expected wage (average wage, 10% higher/lower than average, 20% higher/lower than average)

  • Grade university/school (average, higher/lower than average)

  • Quality of education: the reputation of the college in the case of high-skilled jobs or the number of absence days in the case of low-skilled jobs (Footnote 6) (average, higher/lower than average)

  • Current unemployment spell (currently not unemployed, 1–6 months unemployed, 7–12 months unemployed, 13–18 months unemployed)

  • Computer skills (sufficient, good, very good)

  • English skills (basic, advanced, fluent)

The characteristics were randomly assigned, the photos appeared in a random order, and the names were randomly assigned (corresponding to gender and ethnic background of the photo). We presented the photos and the characteristics separately (see Figs. 3 and 4) so that we could record the time used in each part and verify if the amount of time participants looked at the photo is correlated with discrimination. As for the headscarf, we constructed the control group by asking a professional photographer to take photos of three Turkish-looking women with and without the headscarf, while keeping everything else the same (see Fig. 5).

Each participant had to review 32–48 applications for 8–12 positions (see Tables 10 and 11 for a full list of positions) (Footnote 7). These positions were organised in blocks of four positions. After each block, the participants were asked to take a break and answer a questionnaire (a paper-and-pencil questionnaire about socio-demographic characteristics) and then to take part in an experiment for another project dedicated to analysing methods of multidimensional scaling (Jelnov, P: Latent dimensions and similarity, unpublished) (Footnote 8). We also used his data in our analysis in order to measure the consistency of decisions and to exclude the 10% of participants with inconsistent decisions. We dropped these individuals based on the correlation coefficient between the two parts of Jelnov’s experiment (see Fig. 1 for the distribution, where an accuracy score of one indicates perfect consistency of decisions, while zero shows the answers are not correlated at all). We implicitly assumed that those who made inconsistent decisions in his part, which is unrelated to our research question, were more likely to make inconsistent decisions in our simplified recruiting process.

Fig. 1 Distribution of accuracy score and 10% dropped individuals

We assigned pictures, names, and characteristics to each CV in three steps. First, we restricted the pool of possible characteristics to 834 combinations, thereby ensuring that no applicant could have very high or very low scores in all dimensions, i.e. that scoring low in one dimension increases the probability of scoring high in another dimension and vice versa. Hence, we eliminated almost all the chance that one CV could dominate another CV in all dimensions. Each CV was randomly assigned one of these 834 combinations. In the second step, we assigned the photos for each CV so that in the first block (consisting of 16 CVs) three pictures of Turkish-looking women would certainly appear but at random positions. In the second block, where the participants have to fill four more jobs (with another 16 applicants), three different photos of Turkish women would also be shown randomly. Our programme guaranteed that if a recruiter saw one Turkish-looking woman with the headscarf in the first block, he or she would evaluate the same Turkish woman without the headscarf in the second block and vice versa. After random positions for the three Turkish-looking women were chosen, we randomly assigned photos for the other CVs out of a pool of 10 female and 13 male German-looking photos, with the restriction that no photo could appear twice in one block (Footnote 9). In the last step, we randomly assigned the first and last name to each photo according to the gender and ethnicity of the photo. So if the photo was from a German-looking applicant, we would randomly assign one of the 50 most common German last names, whereas if the applicant was Turkish looking, we would assign one of the 50 most common Turkish last names. Similarly, we also had three pools of the 50 most common first names for Turkish female, German female, and German male photos respectively in order to assign appropriate first names randomly.
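The second step of this procedure can be illustrated with a minimal Python sketch. It is not the z-Tree programme used in the experiment: the function name `assign_photos`, the photo labels, and in particular the coin flip that decides whether a woman appears with the headscarf in the first block are assumptions made for illustration, since the text does not state how the first-block version was chosen.

```python
import random

def assign_photos(rng, slots_per_block=16, n_turkish=3, n_german=23):
    """Return two blocks of (photo_id, wears_headscarf) pairs in random order."""
    german_pool = [f"german_{i}" for i in range(n_german)]
    # Assumption: the headscarf state in block 1 is a fair coin flip per woman.
    with_scarf_first = [rng.random() < 0.5 for _ in range(n_turkish)]
    blocks = []
    for block in (0, 1):
        # Block 2 shows each Turkish-looking woman in the opposite state.
        turkish = [
            (f"turkish_{w}", scarf if block == 0 else not scarf)
            for w, scarf in enumerate(with_scarf_first)
        ]
        # Sampling without replacement: no German-looking photo repeats in a block.
        drawn = rng.sample(german_pool, slots_per_block - n_turkish)
        cvs = turkish + [(photo, False) for photo in drawn]
        rng.shuffle(cvs)  # Turkish-looking photos land at random positions
        blocks.append(cvs)
    return blocks

blocks = assign_photos(random.Random(0))
```

Each block then holds 16 distinct photos, and every Turkish-looking woman is seen once with and once without the headscarf across the two blocks.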

We used pictures from 13 females and 13 males. For each of the three Turkish women, we had two pictures, one with a headscarf and one without. Therefore, we used 29 different pictures in total. According to the actual number of times each photo appeared in our experiment, roughly 10% were photos of Turkish women with headscarves and roughly 10% were photos of Turkish women without headscarves (Footnote 10). We ensured that all pictured persons were roughly the same age, around 25–30 years old. We framed the task such that all presented candidates were at the beginning of their careers and satisfied the formal requirements for the respective position but differed by the seven characteristics only. The participants of the experiment were asked to carry out a pre-selection and choose their first and second preferences out of four candidates for each position.

In order to investigate the potential productive characteristics of beauty (Biddle and Hamermesh 1998), the jobs in each block can be classified into four groups by level of skill (high or low skilled) and interaction with customers (high or low levels of contact). Concerning the rating of characteristics of each photo, we ran part 2 of the experiment, where we asked another group of students to rate the persons in the photos on five characteristics: beauty, trustworthiness, friendliness, intelligence, and physical resilience. We standardised these rating scores within each rater, then averaged the scores for each photo across all raters, before using only beauty as our main explanatory variable.
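The standardise-then-average step can be sketched as follows. This is only an illustration under stated assumptions: the function name `composite_scores` and the toy ratings are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, because the text does not say which variant was used.

```python
from statistics import mean, pstdev

def composite_scores(ratings):
    """ratings: {rater: {photo: raw score}} -> {photo: mean within-rater z-score}."""
    z_by_photo = {}
    for scores in ratings.values():
        mu = mean(scores.values())
        sd = pstdev(scores.values())  # assumption: population standard deviation
        for photo, raw in scores.items():
            # A rater who gives every photo the same score carries no information.
            z = 0.0 if sd == 0 else (raw - mu) / sd
            z_by_photo.setdefault(photo, []).append(z)
    # Average the standardised scores for each photo across raters.
    return {photo: mean(zs) for photo, zs in z_by_photo.items()}

# Hypothetical raw beauty ratings: rater_b uses a wider scale than rater_a.
ratings = {
    "rater_a": {"p1": 1, "p2": 2, "p3": 3},
    "rater_b": {"p1": 2, "p2": 6, "p3": 10},
}
scores = composite_scores(ratings)
```

Standardising within rater removes differences in how generously or widely each rater uses the scale, so both hypothetical raters contribute identical standardised scores here despite their different raw ranges.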

3 Methodology

We adopt the linear probability model (LPM) to estimate the impacts of beauty, ethnicity, and headscarf on the chance of being selected for an interview by the participants in our experiment.

$$\begin{array}{*{20}l} y_{ijk}=\beta_{0}+X'_{i}\beta + Z'_{j}\gamma +B'_{k}\delta +{Int}_{ik}\theta +{time}_{ij}+D_{i}+\varepsilon_{ij} \end{array} $$

where \(y_{ijk}\) is a dummy variable equal to 1 if CV \(i\) with photo \(k\) is chosen by “recruiter” \(j\). \(X^{\prime }_{i}\) is a vector of the CV’s seven characteristics discussed in Section 2, while \(Z^{\prime }_{j}\) is a vector of “recruiter” \(j\) characteristics from the participant’s responses to a questionnaire. \(B^{\prime }_{k}\) is a vector of our main explanatory variables (based on the photo \(k\) attached to each CV), which are a female dummy if the applicant is a female, a composite beauty rating score of photo \(k\), a dummy variable for ethnic Turkish, and a dummy if the applicant wears a headscarf.

\({Int}_{ik}\) are vectors of interaction terms between the photo’s specific characteristic of interest, which is the headscarf, and selected CV characteristics (a subset of \(X^{\prime }_{i} \times B'_{k}\)). These interactions should capture additional effects of having desirable characteristics among candidates wearing headscarf. Due to our moderate sample size of photos and participants, we decide not to code each CV characteristic as a categorical variable represented by a set of dummy variables. Instead, we redefine the levels of each characteristic as integers where the higher the value, the more preferable it is (0, 1, 2, 3, or 4 depending on the number of levels of that characteristic). For instance, being unemployed for 13–18 months is coded as 0, while currently not unemployed is coded as 3. Further, we assess the effect of the beauty rating when the candidate is the same gender as the recruiter (student participant) by adding an interaction term between the beauty score and a dummy variable for whether the recruiter and the candidate are the same gender.
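The integer coding and the headscarf interactions can be illustrated with a short Python sketch. The dictionary `LEVELS`, the function names, and the variable names are hypothetical. Only the unemployment coding (0 for 13–18 months, 3 for not unemployed) is stated in the text; the ordering of the skill levels is an assumption based on the order in which they are listed in Section 2.

```python
# Levels listed from least to most preferable; the integer code is the list index.
LEVELS = {
    "unemployment": ["13-18 months", "7-12 months", "1-6 months", "not unemployed"],
    "computer_skills": ["sufficient", "good", "very good"],  # assumed ordering
    "english_skills": ["basic", "advanced", "fluent"],       # assumed ordering
}

def encode(cv):
    """Map each categorical CV characteristic to its ordinal integer code."""
    return {name: LEVELS[name].index(level) for name, level in cv.items()}

def headscarf_interactions(codes, headscarf):
    """Interaction terms: headscarf dummy times each coded characteristic."""
    return {f"headscarf_x_{name}": headscarf * code for name, code in codes.items()}

cv = {
    "unemployment": "not unemployed",
    "computer_skills": "good",
    "english_skills": "basic",
}
codes = encode(cv)
interactions = headscarf_interactions(codes, headscarf=1)
```

Treating the codes as cardinal imposes equal spacing between adjacent levels, which saves degrees of freedom at the cost of a functional-form assumption.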

Since our variables of interest are drawn from the photo accompanying each CV, we control for the proportion of time each participant \(j\) looked at the photo page of CV \(i\) relative to \(j\)’s total time used in that job position (\({time}_{ij}\)). Meanwhile, \(D_{i}\) are dummies for the order of CV \(i\), i.e. applicant numbers 2, 3, or 4 in each job position (with the first applicant as the reference group). These dummies are included to control for a tendency that some participants might systematically choose the first, second, third, or fourth applicants more often than other choices. Lastly, \(\beta\), \(\gamma\), \(\delta\), and \(\theta\) are vectors of parameters and \(\varepsilon_{ij}\) are the error terms, which are clustered by the 29 photos. As for robustness checks, we also cluster standard errors by participant as well as use two-way clustered robust standard errors by both photo and participant (Cameron and Miller 2015).
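To make the clustering by photo concrete, the following Python sketch computes the OLS slope of a linear probability model with a single regressor and its cluster-robust standard error. It is a deliberately simplified illustration, not the full specification above: it uses one regressor plus an intercept, applies no small-sample correction, and the function name and toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def lpm_slope_clustered(x, y, clusters):
    """OLS slope of y on x (with intercept) and its cluster-robust standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x_dev = [xi - x_bar for xi in x]
    s_xx = sum(d * d for d in x_dev)
    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / s_xx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Sum the score (demeaned x times residual) within each cluster, then square.
    score = defaultdict(float)
    for d, e, g in zip(x_dev, resid, clusters):
        score[g] += d * e
    variance = sum(s * s for s in score.values()) / (s_xx ** 2)
    return slope, sqrt(variance)

# Hypothetical data: eight CV evaluations, clustered by four photos.
headscarf = [0, 0, 0, 0, 1, 1, 1, 1]
selected = [1, 1, 0, 1, 0, 0, 1, 0]
photo = ["a", "a", "b", "b", "c", "c", "d", "d"]
slope, se = lpm_slope_clustered(headscarf, selected, photo)
```

Summing the scores within a photo before squaring allows errors for the same photo to be arbitrarily correlated across participants, which is the motivation for clustering at the photo level.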

In order to verify if the results are driven by characteristics of job openings or those of participants, we estimate the model for several subsamples based on job classifications (comprising high skilled, low skilled, with and without customer contact), gender of participants, age of participants (either older or younger than 23 years), and their total time used in the experiment.

We also explore the differential role of characteristics on the ranking chosen by participants in our experiment. Using the same regression specification as before, we redefine the dependent variable (\(y_{ijk}\)) for the first-choice analysis to be a dummy variable for being chosen as the first preference, i.e. the second choice as well as those unselected are coded as 0. Regarding the second rank, the main LPM with \(y_{ijk}\) equal to 1 for the second choice only is applied to a restricted sample, dropping all the first choices. Of course, this method implicitly assumes that participants decided on their first-preference candidate before comparing the remaining three candidates in order to select their second choice. We then estimate a conditional logit model as a robustness check in Section 5.
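The three dependent variables described here can be constructed for one job opening as in the sketch below. The function name `choice_outcomes` and the candidate labels are hypothetical and serve only to illustrate the sample restriction for the second-choice regression.

```python
def choice_outcomes(candidates, first, second):
    """Build the dependent variables for one job opening with four candidates."""
    # Main LPM: 1 if the candidate was selected as first or second choice.
    selected = [(c, int(c in (first, second))) for c in candidates]
    # First-choice analysis: the second choice and unselected candidates are 0.
    first_choice = [(c, int(c == first)) for c in candidates]
    # Second-choice analysis: drop the first choice, then flag the second choice.
    second_choice = [(c, int(c == second)) for c in candidates if c != first]
    return selected, first_choice, second_choice

# Hypothetical job opening: applicant 3 is ranked first, applicant 1 second.
selected, first_choice, second_choice = choice_outcomes([1, 2, 3, 4], first=3, second=1)
```

The second-choice sample contains three rows per job opening rather than four, which reflects the sequential-decision assumption stated above.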

4 Results

Following the model in Section 3, we focus mainly on the results of the linear probability model. The results from participant fixed effects estimation are very similar to the OLS and are reported in Table 7. Table 2 shows that our randomisation process worked quite well and there is no significant difference in relevant characteristics between different sub-groups except for experience, which is significantly different at the 10% level. Yet we always control for all CV characteristics in our models. Table 3 shows simple regressions with dummy variables for gender, Turkish background, and headscarf. We do not observe significant discrimination against headscarf in this setting but only discrimination against Turkish background in high-skilled jobs and jobs with less customer interaction. This result remains quite robust after including controls (see Table 4) for characteristics and the order in which a candidate appears in each job opening (\(D_{i}\)).

Table 2 t test results by headscarf
Table 3 No controls, no beauty, and no interaction effects
Table 4 Controls, no beauty, and interaction effects

However, Table 5 shows that these negative and significant coefficients for Turkish-looking applicants disappear after controlling for beauty, i.e. the lower chance of female candidates with Turkish origin in our sample being selected results from their lower beauty rating compared to German-looking applicants (Footnote 11). This result is quite surprising because most studies on correspondence testing in different countries tend to find significantly lower average callback rates for minorities. Yet there are some exceptions; for example, recent studies by Kraft (2012) in Germany and Edo et al. (2017) in France show no significant discrimination against female foreigners who signal good language skills. Owing to our focus on young applicants, all Turkish applicants are females in their 20s. Hence, they correspond to second- or third-generation migrants, who in general speak German as their mother tongue.

In addition, Tables 4 and 5 show intuitive and robust results for the significance of the seven characteristics. We observe that labour market experience, final grade, and quality of education increase the chances of being called back in every job category. Computer skills also have positive and significant effects but become less important among jobs with customer contact and low-skilled jobs, whereas English skills are positively associated with selection in occupations with customer interaction and high-skilled jobs. Surprisingly, the coefficients of previous unemployment and expected wage are not significant at the 5% level in any type of occupation.

Table 5 also shows a very significant and robust effect of appearance on the chance of being selected. We do not observe a clear difference in the effect of beauty between jobs with and without customer interaction. The effect of beauty differs slightly between low-skilled and high-skilled occupations. Our experiment implies that the recruiters do not give extra rewards for potential productivity improvement of beauty in jobs with customer interaction but favour beauty more in high-skilled jobs. Perhaps, they can relate themselves to these high-skilled occupations and prefer having more attractive people as their colleagues. The advantage of our laboratory experiment compared to a fake CV approach is our ability to control for characteristics of the recruiters and the time used for each part of the application. So we are able to detect that the beauty premium is only positive and significant for applicants of a different gender to the recruiter. This interaction effect significantly reduces the beauty premium for same gender applicants in column (1) for the whole sample and in column (4) of the low-skilled subsample.

As for the headscarf, Table 6 shows that wearing a headscarf significantly decreases the probability of being selected in all types of occupations for Turkish women (only significant at the 10% level for low-skilled jobs and jobs with less customer contact). However, having good characteristics such as work experience and education also significantly increases the chances of those with headscarf. Such a mitigating role of favourable characteristics echoes findings from previous studies such as Valfort (2017), Kaas and Manger (2012), and Weichselbaumer (2016). Following the idea of belief-based discrimination proposed by Bohren, Imas, and Rosenberg (The dynamics of discrimination: theory and evidence, unpublished), we infer that our participants might have prior biases in beliefs about unobservable characteristics of candidates with headscarves. Yet positive signals from observable characteristics could reduce such uncertainty and help the decision makers to rely less on their biased beliefs about the group average of these unobservables. Moreover, Bohren, Imas, and Rosenberg (unpublished) argued that, knowing of the existence of biased beliefs against themselves, those who are discriminated against have to put in more effort and hence become better than the rest of the population despite the same observable characteristics. In our case, the decision makers might take this possibility into consideration. That is why we observe a reversal where good candidates with headscarves become more favourable than other good candidates without headscarves.

Table 6 Controls and beauty, interaction effects
Table 7 Controls and beauty, interaction (by time), participant fixed effects

Our laboratory design also allows us to divide the sample into different subgroups. Table 12 shows that decision makers who take more time in our experiment place less weight on appearance. Although the magnitude of the headscarf dummy is very high for fast participants, this coefficient is insignificant. Such a finding may result from the heterogeneity in this group, which should comprise some who did not take the experiment seriously (i.e. just randomly chose a candidate, hence should not discriminate) and others who did not carefully look at the characteristics but were guided by their intuition (heuristic judgement of type 1 process) and thus discriminated more than the average. On the other hand, those who took more time for the experiment were more judicious (i.e. relying more on type 2 process) and probably tried not to judge by appearance. Hence, the beauty premium is reduced for this subsample and only significant at the 10% level. In Table 13, there is no significant difference between the genders in the beauty premium, but male participants seem to drive negative responses toward headscarf. As for the age of participants, we observe a similar level of discrimination by appearance, but the magnitude of discrimination against females and headscarf increases with the respondents’ age.

We present results from separate LPMs for the first and second choices in Table 8. For the first choice, the dependent variable is a dummy variable equal to 1 only if the candidate was selected as the first choice. In the columns with the second choice, we drop those individuals who were selected as the first choice and perform another regression with a dummy dependent variable taking value 1 if the candidate was selected as the second choice or 0 otherwise. We observe that the headscarf mostly affects the second preference, i.e. when it is a borderline decision. Beauty also matters more for the second choice except for jobs with more customer contact, where beauty becomes more important for the first preference. We interpret this result as a sign that for the first preference the decision makers are more objective and focus on relevant characteristics such as work experience, appearance, and gender for contact jobs. Yet for the second choice, their decisions are nudged by other less important characteristics like religious practice, gender, or appearance in no-contact jobs. In other words, the participants seem to discriminate in situations where it is easier to discriminate because the decision is very close, but they do not discriminate against the headscarf in cases where the woman with the headscarf is the most qualified of the four, i.e. when discrimination is very costly to the employer.

Table 8 Controls and beauty, interaction effects

5 Robustness analyses

We perform several robustness checks to ensure that our findings are not sensitive to sample selection or model specifications. The full sample, including participants whose responses in another experiment were extremely inconsistent, is used to estimate regressions with four subgroups of job categories in Table 14 (Footnote 12). The sign and significance of the estimated coefficients are qualitatively similar albeit less precise. Using the main sample, we also estimate the linear probability model with participant fixed effects and standard errors clustered at the participant level. Controlling for unobservable heterogeneity across participants, LPM-FE should circumvent endogeneity concerns, i.e. non-zero correlation between CV characteristics such as Turkish background or wearing headscarf and participants’ characteristics such as their migration background. Again, the results based on LPM-FE in Table 7 are very similar to those from LPM in Table 6. We also report two-way clustered robust standard errors by both photo and participant in Table 15.

Since the main specification assumes that our seven characteristics in the CV can be coded as cardinal variables, we also relax this assumption and treat these characteristics as categorical variables represented by sets of dummy variables. Figure 2 illustrates the marginal effects of the four main characteristics on the predicted probability of being chosen for candidates with and without headscarves. Although only some levels of these characteristics show significant differences in the predicted probability of being chosen owing to the headscarf, their general trends conform to our main findings.

Fig. 2 Predicted probability for a callback by categories of selected characteristics

As for the role of characteristics in the ranking, instead of estimating two separate regressions for the first and second choices (Table 8), we assume that our participants evaluate their first and second choices as a pair against all other possible pairs in each job opening. Hence, their response for each job opening can be rearranged from selecting 2 out of 4 candidates (j) to choosing 1 pair representing their first and second ranking out of all 12 possible ordered pairs of first and second choices (k). We then follow Cameron and Trivedi (2005) and specify the conditional logit model with the probability of a pair of CVs k being chosen by participant j, \(p_{jk}\), as follows:

$$ p_{jk} = \frac{e^{\widetilde{\eta_{jk}}}}{\sum_{l=1}^{12} e^{\widetilde{\eta_{jl}}}}, \qquad k = 1,\dots,12 $$

$$\begin{aligned} \widetilde{\eta_{jk}} &= \widetilde{\beta_{0}} + X'_{{jk}_{1}}\widetilde{\beta}^{1} + X'_{{jk}_{2}}\widetilde{\beta}^{2} + B'_{k_{1}}\widetilde{\delta}^{1} + B'_{k_{2}}\widetilde{\delta}^{2} \\ &\quad + {Int}_{{ik}_{1}}\widetilde{\theta}^{1} + {Int}_{{ik}_{2}}\widetilde{\theta}^{2} + D_{k_{1}} + D_{k_{2}} + {time}_{{jk}_{1}} + {time}_{{jk}_{2}} \end{aligned}$$

where \(X^{\prime}_{{jk}_{s}}, B^{\prime}_{k_{s}}, {Int}_{{ik}_{s}}, D_{k_{s}}\), and \({time}_{{jk}_{s}}\) are vectors of the explanatory variables defined earlier, with \(k_{s} = k_{1}\) or \(k_{2}\) standing for the first and second ranking in the pair k, respectively. Therefore, the main differences in explanatory variables between the main LPM and this conditional logit model are the inclusion of characteristics for both first and second choices and the exclusion of participant-level characteristics from the conditional logit model.
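The choice set and the choice probabilities can be made concrete with a short sketch. It enumerates the 12 ordered (first choice, second choice) pairs that can be formed from 4 candidates and turns a linear index for each pair into conditional logit probabilities. The candidate labels, scores, and weights are invented for illustration, and the function name `pair_probabilities` is hypothetical; nothing here reproduces the estimated model.

```python
from itertools import permutations
from math import exp


def pair_probabilities(index_by_pair):
    """Conditional logit probabilities from a dict {pair: linear index eta}."""
    # Subtracting the maximum leaves the probabilities unchanged and avoids overflow.
    top = max(index_by_pair.values())
    weights = {pair: exp(eta - top) for pair, eta in index_by_pair.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


candidates = ["A", "B", "C", "D"]
# Ordered (first choice, second choice) pairs: 4 * 3 = 12 alternatives.
pairs = list(permutations(candidates, 2))

# Invented CV scores and weights, chosen only to make the example concrete.
score = {"A": 0.2, "B": 1.0, "C": 0.5, "D": -0.3}
eta = {(first, second): 1.0 * score[first] + 0.5 * score[second]
       for first, second in pairs}

probs = pair_probabilities(eta)
best_pair = max(probs, key=probs.get)
print(len(pairs), best_pair)
```

Because any term that is identical across the 12 pairs cancels in the ratio, only characteristics that vary across pairs affect the probabilities, which is consistent with participant-level characteristics being excluded from this model.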

Table 9 shows odds ratios for the chance that a pair of CVs is selected given a one-unit change in a particular characteristic of the first- or second-ranked candidate, with z-scores in square brackets. Almost all of the seven characteristics of both the first and second CVs of a pair have odds ratios higher than 1 and are significant at the 1% level, i.e. the better the characteristics, the more likely it is that a pair of CVs is chosen. Odds ratios of beauty are also higher than 1 and significant at the 5 or 10% level in all job categories except for the beauty of the second rank in customer-oriented jobs. As in our main results, beauty has negative effects on the chance of being selected if the decision maker and the candidate are of the same gender. Yet for the second choice, such effects are significant only for jobs with less customer interaction and for high-skilled jobs, at the 5 and 10% level respectively.
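As a reading aid, and assuming that the reported odds ratios are exponentiated conditional logit coefficients (the usual convention, although it is not stated explicitly here), the interpretation follows from the choice probability above. The symbols \(x\) and \(\beta\) are generic placeholders for one characteristic and its coefficient, not notation from the paper. For any two pairs k and l, the denominator cancels:

$$ \frac{p_{jk}}{p_{jl}} = e^{\widetilde{\eta_{jk}} - \widetilde{\eta_{jl}}} $$

If \(x\) increases by one unit for pair k only, \(\widetilde{\eta_{jk}}\) rises by \(\beta\) while \(\widetilde{\eta_{jl}}\) is unchanged, so that

$$ \frac{p_{jk}^{\text{new}}}{p_{jl}} = e^{\beta} \cdot \frac{p_{jk}}{p_{jl}} $$

The odds ratio \(e^{\beta}\) is therefore higher than 1 exactly when \(\beta > 0\). A purely hypothetical odds ratio of 1.2, for instance, would mean that a one-unit improvement in the characteristic raises the odds of pair k relative to any other pair by 20%.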

Table 9 Conditional logit results (controls and beauty with interactions)

In all jobs, wearing a headscarf negatively affects the chance of being selected as the second choice but not as the first choice, with significant and positive interaction effects for some characteristics of the CVs. In other words, the headscarf reduces the chances of being selected for applicants with unfavourable characteristics and increases the chances for applicants with good characteristics. Regarding subgroups, the headscarf reduces the chance of being chosen (as both the first and second choice) in high-skilled jobs only.

One concern for the conditional logit model is the validity of the independence of irrelevant alternatives (IIA) assumption. We test this assumption by comparing the estimated coefficients of the restricted model (excluding one choice combination) to those of the full model. The Wald test indicates that our model fails the IIA condition. This is plausible given that our choice combinations consist of, for example, a pair of the first and second CV, the first and third, or the second and third CV. It is therefore hard to argue that the choice between any two of these combinations would be independent of the other possible “irrelevant” pairs. Nonetheless, we report these findings as a robustness check for our “two-step” LPM.

6 Discussion and conclusion

Exploiting the German practice of including photos in a CV allows us to measure how differences in appearance can influence job recruitment. Our laboratory experiment contributes to the sizeable literature on correspondence testing (fake CVs) in several respects. Our research design enables us to gather information on the socio-economic background of our participants as well as on how long they look at the photo of each applicant. We also consider different aspects of appearance, i.e. beauty, ethnicity, and religious attire, simultaneously. Further, based on job categories, our paper can shed light on the sources of discrimination. Since the participants were asked to fill fictitious positions, discrimination based on appearance could stem from taste-based discrimination, statistical discrimination, or productive characteristics of appearance. In correspondence testing experiments, the composition of a team might be an additional source of discrimination owing to HR’s endeavour to hire someone who fits in with the existing team. However, our participants did not have such information; hence, we can rule out this channel.

Our findings suggest significant beauty effects on the hiring decision in every job category, and such effects do not depend on the gender or age of the recruiters. Although we do not find any discrimination against Turkish applicants after controlling for beauty, our results indicate significant discrimination against those Turkish-looking candidates who wear a headscarf. Yet desirable characteristics of these applicants do help to reduce or even reverse their disadvantages. The older subgroup seems to discriminate more against the headscarf, but the decision is not driven by how long they look at the photos (Footnote 13). Looking closely into the mechanisms, employer discrimination might be underestimated in our experiment because the participants are young and have little labour market experience. Regarding customer discrimination, the negative and significant effects in more customer-oriented occupations signal our participants’ concern about customers’ perception of the headscarf. However, the negative result is not robust to the inclusion of participant-level fixed effects; thus, it could just be a biased estimate due to some unobservable characteristics of the participants.

The only subgroup with a robust negative result is high-skilled jobs. Since it is the only job category in which our participants can imagine themselves working with the fictitious candidates in the future, we interpret this as supporting evidence for employee discrimination. Also, compared to low-skilled occupations, high-skilled occupations could be perceived as requiring more personal interaction with other colleagues or managers. Furthermore, good characteristics such as better education and more work experience are rewarded more strongly for candidates with a headscarf. This implies that statistical discrimination could play an important role during the selection process. On balance, our participants might feel that they are not particularly against the headscarf because they favour those with good characteristics, while being harder on candidates with a headscarf and undesirable characteristics.

Unfortunately, the choice between correspondence testing and a laboratory experiment comes with trade-offs. Although a laboratory experiment allows us to control for more variables, there are several reasons why our results might not reflect the true level of discrimination in the labour market. First, the participants were instructed by the experimenter and knew that they were observed. Therefore, the “true” effect might be shrouded by experimenter demand effects (Zizzo 2010), where people tend to act according to social norms even if these contradict their true beliefs. This could result in a much lower magnitude of discrimination against the headscarf compared to a field experiment such as Weichselbaumer (2016). Additionally, our experimental design fixes the number of applicants at four per job opening, and the participants were asked to always choose two of them. This is of course an unrealistic restriction compared with real-world recruitment. Moreover, according to Becker (1957), discrimination should be more prevalent in industries with highly competitive labour markets because it is cheaper for firms to discriminate. Our setting, however, does not consider such a competition effect.

Despite our effort to mimic the field as much as possible, we are aware that our experimental setting most likely underestimates the true level of taste-based discrimination in the labour market, which jeopardises its external validity. Nevertheless, our experiment sheds some light on the mechanism behind discrimination, i.e. the participants discriminated against minorities with undesirable characteristics, but balanced this out by favouring those with good characteristics. The experiment suggests that good characteristics may not only compensate for but also reverse the biased negative perception of minorities in the labour market, possibly through implicit or subconscious affirmative action.

Our findings are in line with the model of dynamic discrimination by Bohren, Imas, and Rosenberg (The dynamics of discrimination: theory and evidence, unpublished), who show that a reversal of discrimination can occur if some evaluators hold a biased stereotype against a certain group, while others are aware of this. The basic idea is that at the first stage, where the quality of the applicant is unknown, a certain group is discriminated against due to biased beliefs, but the evaluators account for that at later stages after they have received a positive signal of the applicant’s quality. In our experiment, this is signalled by good characteristics (e.g. labour market experience or good educational outcomes). Interestingly, there are also studies which show that racial discrimination in the USA is higher if a signal of high productivity is included (Nunley et al. 2015). Our results indicate the opposite, namely that discrimination is reduced if a signal of high expected productivity, e.g. good grades and labour market experience, is included. This difference might come from the fact that many field experiments use first names as a signal of ethnicity. If names carry additional information about the applicant such as socioeconomic status (Fryer and Levitt 2004), this might bias the results, especially when socioeconomic status is more important for high-productivity applicants (Footnote 14).

As for policy implications, one might wonder whether a policy prohibiting photos on CVs could help reduce discrimination based on appearance. In France, Manant et al. (2014) found that despite having no photo on the CV, recruiters did gather information about their fictitious candidates’ looks and religious practice from their Facebook profiles. Although photos make the ethnicity of candidates clear to the recruiters, such disadvantages can be mitigated if the candidates from ethnic minorities look attractive, friendly, or likeable (Weichselbaumer and Schuster, The effect of photos and name change on discrimination against migrants in Austria, 2017, unpublished). Such a finding is consistent with our robust beauty premium. Further, our results suggest that the gap in job opportunities due to a religious practice such as wearing a headscarf could be narrowed by signalling favourable characteristics. This gives room for policy interventions such as education or apprenticeship programmes targeting these groups. We hope that our paper will spur more discussion and encourage future research to consider appearance as a package of, for instance, beauty, ethnicity, and observable religious practice simultaneously.

7 Appendix

Fig. 3 Example of the experiment screen 1 (the photo is anonymised here for privacy reasons)

Fig. 4 Example of the experiment screen 2

Fig. 5 An example of photos of the same person with and without veil (the photos are anonymised here for privacy reasons)

Table 10 Job description (low-skilled jobs)
Table 11 Job description (high-skilled jobs)
Table 12 Controls and beauty, interaction (by time)
Table 13 Controls and beauty, interaction (by subgroups)
Table 14 Controls and beauty, interaction effects (full sample)
Table 15 Controls and beauty, interaction effects (two-way clustering by photo and participant)