Measuring gender attitudes using list experiments

We elicit adolescent girls’ attitudes towards intimate partner violence and child marriage using purposefully collected data from rural Bangladesh. Alongside direct survey questions, we conduct list experiments to elicit true preferences for intimate partner violence and marriage before age 18. Responses to direct survey questions suggest that very few adolescent girls in the study accept the practices of intimate partner violence and child marriage (5% and 2%, respectively). However, our list experiments reveal significantly higher support for both intimate partner violence and child marriage (30% and 24%, respectively). We further investigate how numerous variables relate to preferences for egalitarian gender norms in rural Bangladesh.

The objective of this paper is to measure attitudes towards intimate partner violence and child marriage among adolescent girls in rural Bangladesh. We do this by using standard survey questions (direct questions) as well as methods to elicit responses to sensitive survey questions (list experiments, see "Section 2" for a description and examples). We find that the method of measurement matters when eliciting responses to sensitive questions. Standard direct survey questions under-estimate support for socially harmful practices: few adolescent girls in the study accept the practice of intimate partner violence (5%) or child marriage (2%) when asked directly. List experiments reveal substantially higher support for both intimate partner violence (30%) and child marriage (24%). Adolescent girls with lower levels of education under-report their support for child marriage by 16 percentage points compared with adolescent girls with higher education. We also find that girls randomly exposed to village-level adolescent clubs set up by the Bangladeshi non-government organization BRAC, which educated them on marital rights and laws, under-report their support for intimate partner violence relative to non-exposed adolescent girls. To our knowledge, ours is the first study to use list experiment methods to elicit attitudes towards child marriage as well as domestic violence, and it has greater external validity than similar empirical investigations.
The rest of the paper is organised as follows. "Section 2" provides a literature review and lists the contributions this paper makes to the literature. "Section 3" describes the background of our study and the list experiment that generated the data we use in this paper. "Section 4" lays out the empirical analysis that we perform on the data and discusses our findings, while "Section 5" concludes.

Literature review and contribution
An important concern when eliciting gender attitudes in surveys is related to measurement error. [4] Suppose a survey respondent is queried on whether they consider domestic violence to be acceptable. It is very likely that they would either choose not to respond to the query (leading to systematic item nonresponse) or respond to the effect that they do not consider domestic violence to be acceptable (misreporting, which might arise from social desirability). In either case, the resulting measurement error will lead to biased estimates when investigating the relationship between gender attitudes and other outcome variables that we might be interested in (Bound et al. 2001).
[4] A related literature explores the formation of attitudes regarding gender equality in developing country contexts. Beaman et al. (2009) find that prior exposure to female leaders increased the chances of women's electoral success in India. They find that changes in voters' gender attitudes (measured using Implicit Association Tests) arising from exposure to female leaders are an important channel through which this change happens. Jensen and Oster (2009) use panel data to find that introduction of cable television in rural India reduced the reported acceptability of domestic violence and son preference. They also find accompanying increases in female autonomy, decreases in fertility and increased school enrolment of young children. Dhar et al. (2016) examine how intergenerational transmission plays a role in the formation of gender attitudes in India.
There are different strategies that have been employed to deal with measurement error when eliciting responses to sensitive questions. One is to rely on administrative data rather than self-reports, even though in developing countries such data are usually neither systematically collected nor well registered. [5] Another is to use intensive qualitative fieldwork, as done by Blattman et al. (2016), in which local researchers spend several days with a random sub-sample of survey respondents after a survey has taken place. They then obtain verbal confirmation of sensitive behaviours, allowing the employment of a validation technique to examine the nature of measurement error in survey responses. Alternatively, one may use quantitative survey methods to examine measurement error in responses to sensitive questions, such as randomized response techniques, endorsement experiments or list experiments. [6] In this paper, we use a list experiment, which is also known as the item count or unmatched count technique.
In a list experiment, survey respondents are queried on the number of items they agree with on a list, which (randomly) either includes or excludes a sensitive item (Miller 1984; Imai 2011). We use list experiments to deal with measurement error in elicited gender attitudes in Bangladesh.
Several recent empirical studies make use of list experiments. Karlan and Zinman (2012) use a list experiment to indirectly elicit how borrowers from microfinance institutions (MFI) use their loan proceeds in Peru and the Philippines. Using the results from the list experiment and comparing them with responses to direct survey questions, they find that direct elicitation underreports the non-enterprise use of the loan proceeds by borrowers from MFIs. List experiments have also been employed to elicit truthful responses about sexual behaviours, such as condom use, number of partners and unfaithfulness in Uganda (Jamison et al. 2013), Colombia (Chong et al. 2013) and Côte d'Ivoire (Chuang et al. 2019); harmful traditional practices against women in Ethiopia (De Cao et al. 2017; De Cao and Lutz 2018; Gibson et al. 2018); and anti-gay sentiment in the USA (Coffman et al. 2016).
A recent study examining gender attitudes (specifically those related to female genital mutilation/cutting) in Ethiopia finds under-reporting in direct attitude questions of 10 percentage points (De Cao and Lutz 2018). This study also provides suggestive evidence that under-reporting is more pronounced among uneducated women and among women who were targeted by a non-government organization (NGO) intervention to strengthen the health system as well as sexual and reproductive health knowledge.
A few recent studies have also used list experiments to examine measurement error in domestic violence reporting and prevalence. Peterman et al. (2017) use a list experiment combined with an unconditional cash transfer given to female caregivers of children younger than 5 in rural Zambia. They find that 15% of the women had experienced physical intimate partner violence in the last 12 months. They also find no effect of the cash transfer on intimate partner violence 4 years after the program. Since direct questions are not asked, it is not possible to examine the direction or magnitude of the measurement error in such questions. Joseph et al. (2017) use a list experiment in Kerala, India to find that the level of under-reporting of domestic violence is over 9 percentage points, while being negligible for physical harassment on buses. They analyse the list experiment using difference-in-means across sub-groups of the population. Unlike Peterman et al. (2017) and Joseph et al. (2017), Aguero and Frisancho (2018) follow WHO guidelines and protocol to ask female respondents direct questions on violence (which are comparable to the widely used domestic violence questions asked in the Demographic and Household Surveys) and compare these with a list experiment used to elicit information on experiences of physical and sexual intimate partner violence. They use a sample of female clients of a micro credit organisation operating in urban areas of Lima, Peru. Aguero and Frisancho (2018) find that more educated women systematically underreport violence more often, but that there is no under-reporting by women who have less education. They also describe a low-cost solution to correct for bias in a setting in which there is non-classical measurement error in the dependent variable (for instance, intimate partner violence, as in Aguero and Frisancho 2018), there is no measurement error in the independent variable, and endogeneity is present. This solution involves using estimates generated from list experiments carried out alongside other survey instruments.
[5] Palermo et al. (2014) show that administrative data in developing countries capture only a small fraction of women who experienced domestic violence, and this depends on different socio-economic characteristics of the women. See also Jamison et al. (2013) and Moseson et al. (2015).
[6] In a randomised response technique, respondents are asked to use a randomisation device such as a die or coin whose outcome is unknown to the enumerator (Warner 1965). In an endorsement experiment, randomly selected survey respondents are asked for their support of policies which have been endorsed by a socially sensitive actor whilst other survey respondents are asked for their support for the same policies without the endorsement. If endorsement increases support for the policies, then this is taken as evidence of support for the socially sensitive actor (Bullock et al. 2011).
Our work contributes to this literature in several important ways. It is the first study to use a list experiment to elicit attitudes towards child marriage, at the same time being among the first to develop a list experiment for domestic violence alongside Peterman et al. (2017), Joseph et al. (2017) and Aguero and Frisancho (2018). Since our sample comprises a third of all districts of Bangladesh, it has greater external validity than many of the other empirical investigations in this area (e.g., urban Lima in Aguero and Frisancho (2018)). Ours is also the first study that makes use of an RCT (non-formal education intervention) to analyse how support for domestic violence and child marriage changes with the intervention while also using a list experiment. We analyse our list experiments by using regression techniques that allow us to investigate how the probability of supporting the sensitive item varies as a function of respondents' characteristics (as in Coffman et al. 2016; Aguero and Frisancho 2018; De Cao and Lutz 2018), improving on earlier papers that only compute difference-in-means across sub-groups of the population (e.g., Karlan and Zinman 2012; Joseph et al. 2017). Finally, we discuss the validity of our list experiments in relation to recent criticisms raised by Chuang et al. (2019).

Survey design
Data on list experiments eliciting attitudes towards domestic violence and child marriage used in this study were collected in February 2017 as part of an end-line survey to evaluate the Adolescent Development Programme (ADP), a randomized control trial (RCT) intervention implemented by BRAC, the largest NGO in Bangladesh. The ADP intervention introduced village-level random variation in adolescent girls' exposure to non-formal education on marital rights and laws. The program design of the intervention is described in detail in Appendix 2. The baseline survey design considered 27 BRAC branch offices located in the 19 poorest districts [7] where BRAC was about to implement and scale up the ADP scheme by the end of 2012. Of a total of 216 villages in the sample under 27 BRAC branches, half were assigned to the program and the remaining served as non-program villages. Randomisation was done at the village level where BRAC branches were considered as clusters, i.e. it is a clustered RCT. Within the catchment area of each sample village, 20 adolescents (ages 11-16), of whom 15 were girls and five boys, were interviewed. A total of 4320 adolescents (3240 females) ages 11-16 years were interviewed across all villages as part of the baseline survey in June 2012. [8] The same adolescents were interviewed in the end-line survey in February 2017, with 2732 (or 63% of the baseline respondents) being successfully re-interviewed (of whom 2020 are female). [9] Appendix Table 8 gives a comparison of characteristics across the sample of 3240 adolescent girls and the sub-sample of 2020 who completed the end-line survey. An important concern is related to potentially selective attrition of subjects from baseline to end-line. However, observable characteristics of adolescent girls which are related to age, religion and education as well as their mother's age, education and empowerment measures at baseline for the complete sample and the sub-sample successfully re-interviewed at end-line are very similar, making selective attrition unlikely in this setting.

List experiments to elicit gender attitudes
The list experiment question used to elicit attitudes towards domestic violence included the following items: (1) if the father is too busy with outside work, this has a negative impact on children's education; (2) it is not acceptable to use contraceptives to avoid pregnancy; (3) in a marriage both husband and wife should decide on how many children to have; (4) a wife can be hit, slapped, kicked or physically hurt by the husband under any circumstances. [10] The list experiment question used to elicit attitudes towards child marriage included the following items: (1) it is important for girls to attend school; (2) birth of a girl brings as much happiness to a family as birth of a boy does; (3) literate mothers can take care of their children better than illiterate mothers; (4) a girl should be married off before 18.
We randomly divided our respondents into two groups, A and B, that acted as either control or treatment for the first or second list experiment. This allowed us to reduce bias in the answers, given that only one list experiment with one sensitive item was asked of each respondent. In both list experiments, the sensitive item is the last item. [11] For each list experiment, the control group was asked the list experiment with only items (1)-(2)-(3). We carefully selected our non-sensitive questions after discussions with BRAC, and although the items (1)-(3) in both list experiments seem sensitive too, in the local setting they fit as non-sensitive items given the illegality of the sensitive one. [12] Moreover, recent research shows that non-sensitive items more closely related to the sensitive one perform better because they make the sensitive item less salient (Chuang et al. 2019).
[7] These were selected based on national poverty ranking.
[8] The adolescent sample size was considered sufficient with 80% power and 95% confidence level for a 20% effect size (Khatoon et al. 2018).
[9] Given random program placement at the village level, all adolescent girls in the program area had an equal probability of participation in the ADP intervention. As such, a comparison of mean outcomes across ADP program and control villages yields an intention-to-treat (ITT) estimate of the program effect.
[10] The instructions given to the respondent before the list experiment module are as follows: "Now I will read out a number of statements to you and request you to anonymously give me your answer indicating how many statements you agree with. I'll give you 4 stones which I'll request you to keep in your right hand. Both of your hands will have to be kept behind you so that they are not visible to me. If you agree with the statement I read out, transfer one stone from your right to your left hand. Please do not show this to me or tell me verbally your answer or whether you transferred any stone from right to left hand. If you do not agree with a sentence, do not transfer any stone from your right hand. At the end of this exercise, tell me the total number of stones in your left hand."
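The two-group design described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the mapping of group A to the domestic violence treatment list (and group B to the child marriage treatment list) is assumed, since the text does not say which group received which sensitive item.

```python
import random

def assign_groups(respondent_ids, seed=0):
    """Randomly split respondents into two equal-sized groups, A and B."""
    rng = random.Random(seed)
    ids = list(respondent_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {rid: ("A" if k < half else "B") for k, rid in enumerate(ids)}

def list_lengths(group):
    """Number of items read out in each list experiment for a given group.

    Assumed mapping (not stated in the text): group A hears the sensitive
    item in the domestic violence (DV) list, group B in the child marriage
    (CM) list. The other list contains only the three non-sensitive items.
    """
    if group == "A":
        return {"DV": 4, "CM": 3}
    return {"DV": 3, "CM": 4}

# 2020 is the size of the estimation sample reported in the text.
groups = assign_groups(range(2020))
n_a = sum(g == "A" for g in groups.values())
print("group A:", n_a, "group B:", len(groups) - n_a)
```

Under this design every respondent hears exactly one four-item list, so each respondent is exposed to only one sensitive item and serves as a control for the other list.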
In early January 2017, BRAC researchers piloted the list experiment questions in the Karail slums in Dhaka district with 10 female adolescents participating in the pilot. The primary objective was to verify the adequacy of the list experiment statements, and to assess the appropriateness of using stones (marbles) by interviewees to describe their responses. Stones were used to avoid numeracy-related bias in responses (as in De Cao and Lutz 2018). The survey work was conducted by a team of 50 enumerators who received an intensive week-long training at the BRAC head office, which was directly supervised by study team members as well as field management trainers from BRAC's Research and Evaluation Division (RED). The majority of enumerators (38 out of 50) were female, keeping in mind the study population (where 75% of adolescent respondents were female). In total, the enumerators were organized in 15 teams so that each team in a sample site had enough female enumerators available to interview a female adolescent.
The list experiment questions were asked on the last page of a long questionnaire (about 40 pages long). Direct questions phrased in the same way as the sensitive item in the list experiments were asked of all respondents, but at around page 20 of the questionnaire. We have no reason to believe that respondents were cognisant of this design or that the format influenced the list experiment results, as so many different issues were dealt with during the interview. However, when we analyse the direct questions, we focus on the sample corresponding to the list experiment control group. See also "Section 4.4" for further discussion on the validity of our list experiments.

Estimation sample
Given that the targets of the NGO intervention were primarily adolescent girls, we restrict our estimation sample to adolescent girls who responded to the list experiment questions asked in the end-line survey. This gives us an estimation sample of 2020 adolescent girls. Table 1 reports descriptive statistics for this sample. Half of the sample was exposed to the ADP program. [13] Of the respondents, 42.5% had less than 9 years of schooling or had (at most) completed junior secondary education while the rest had either secondary or tertiary education. The average age while completing the end-line survey was 17.5 years. Of the adolescent girls, 28% were married by the time they completed the end-line survey, of whom 71% were married before age 18.
[11] List experiments should not be too short, in order to avoid ceiling and floor effects, and usually include a 3-item or 4-item list (Kuklinski et al. 1997). We did not have enough power to randomize the order of the items, but it is common practice to have the sensitive item at the very end.
[12] There is a variety of punishments against perpetrators under the following relevant laws: the Domestic Violence Prevention and Protection Act of 2010; the Prevention of Oppression against Women and Children Act of 2000; the Child Marriage Restraint Act of 1929 (later revised as "the Child Marriage Restraint Act of 2017").
[13] Balance tests on the ADP intervention are given in Appendix Table 9.
When directly asked about gender attitudes, only 2% of the adolescent girls agreed that a girl should be married off before age 18. Similarly, only 5% agreed that a wife can be hit, slapped, kicked or physically hurt by the husband under any circumstances. This is striking because of the high prevalence of early marriage as well as violence against women in the study area. For example, turning to maternal characteristics, in 15% of the cases, respondents' mothers reported being beaten at least once by their husband in the last 12 months. Moreover, about 46% of the girls' mothers in the estimation sample were pregnant before age 18. Female respondents in rural Bangladesh are also subject to patriarchal social norms: 89% of the mothers of adolescent respondents reportedly practiced Purdah [14] when they went out.

Empirical strategy
In a standard list experiment design, a sample of $N$ respondents is randomly divided into two groups: control and treatment. Each respondent in the control group ($T_i = 0$, where $i$ indexes the individual) receives a list of $J$ non-sensitive, yes/no items, and is asked to provide the total number of items he/she agrees with. The same applies to each respondent in the treatment group ($T_i = 1$), where the list is extended by one item to include the sensitive item ($J + 1$ items). Let $Z^*_{ij}$ be respondent $i$'s truthful preference for the $j$th item, $j = 1, \ldots, J + 1$ (Imai 2011). Let $Z_{ij}(T)$ be one if the answer to the $j$th item is one, and zero otherwise. The econometrician only observes the total count

$$Y_i = \sum_{j=1}^{J} Z_{ij}(T_i) + T_i Z_{i,J+1}(1).$$

Support for the sensitive item can be identified from this count (Imai 2011; Blair and Imai 2012) if: (a) the randomisation is good, meaning that for each respondent the potential answers $\{Z_{ij}(0), Z_{ij}(1)\}_{j=1}^{J}$ and $Z_{i,J+1}(1)$ are independent of $T_i$; (b) there are no design effects, meaning that the inclusion of the sensitive item does not change the sum of affirmative answers to the non-sensitive items ($\sum_{j=1}^{J} Z_{ij}(0) = \sum_{j=1}^{J} Z_{ij}(1)$); and (c) there are no liars, meaning that the respondent replies truthfully to the sensitive item ($Z_{i,J+1}(1) = Z^*_{i,J+1}$). Violations of assumption (c) are also referred to as ceiling and floor effects. Ceiling effects occur when a respondent in the treatment group gives the answer $Y_i = J$ even if he/she would have replied $Y_i = J + 1$. Floor effects occur instead when a respondent in the treatment group answers $Y_i = 1$ even if he/she would have replied $Y_i = 0$.
[Table 1 (descriptive statistics; only partially recoverable). Residing in ADP-exposed village (0 = ADP non-exposed, 1 = ADP exposed): 0.497, 0.500. ≤ Primary education (0 = secondary/tertiary, 1 = primary schooling or below): values not recoverable. Notes: Adolescent- and mother-specific variables (except domestic violence and child marriage attitudes) are measured at baseline. ADP refers to the Adolescent Development Programme.]
If the list experiment satisfies (a), (b) and (c), then support for the sensitive item can be obtained by simply using a difference-in-means estimator:

$$\hat{\tau} = \frac{1}{N_1} \sum_{i=1}^{N} T_i Y_i - \frac{1}{N_0} \sum_{i=1}^{N} (1 - T_i) Y_i, \qquad (1)$$

where $N_1 = \sum_{i=1}^{N} T_i$ is the treatment group size and $N_0 = N - N_1$ the control group size. To investigate how preferences over the sensitive item change with changes in respondents' characteristics, a multivariate regression model can be used. [15] In particular, the following equation can be estimated:

$$Y_i = X_i'\gamma + T_i X_i'\delta + \varepsilon_i, \qquad (2)$$

where $X_i$ are the respondent's characteristics and $(\gamma, \delta)$ are the parameters to estimate. We can estimate $(\gamma, \delta)$ using ordinary least squares (OLS).
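As a rough illustration of the two estimators just described, the sketch below computes the difference in means and, for the special case of a single binary characteristic, the coefficients of the interacted linear model. With one binary covariate the model is saturated, so its OLS solution coincides with differences of cell means, which avoids any matrix algebra. The function names, the simulated data and the support rates used in the simulation are all hypothetical and are not taken from the study.

```python
import random
from statistics import mean

def diff_in_means(y, t):
    """Mean item count in the treatment group minus mean count in the control group."""
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    return mean(treated) - mean(control)

def interacted_cell_means(y, t, x):
    """Coefficients of Y = g0 + g1*x + T*(d0 + d1*x) + e for binary x.

    The model has four parameters and four (T, x) cells, so the OLS
    solution equals the differences of cell means computed here.
    Returns (gamma0, gamma1, delta0, delta1).
    """
    def cell(tv, xv):
        return mean(yi for yi, ti, xi in zip(y, t, x) if ti == tv and xi == xv)
    gamma0 = cell(0, 0)
    gamma1 = cell(0, 1) - cell(0, 0)
    delta0 = cell(1, 0) - cell(0, 0)
    delta1 = (cell(1, 1) - cell(0, 1)) - delta0
    return gamma0, gamma1, delta0, delta1

def simulate(n=4000, seed=1):
    """Hypothetical data: J = 3 non-sensitive items plus one sensitive item."""
    rng = random.Random(seed)
    y, t, x = [], [], []
    for _ in range(n):
        xi, ti = rng.randint(0, 1), rng.randint(0, 1)
        baseline = sum(rng.random() < 0.7 for _ in range(3))
        sensitive = rng.random() < (0.2 + 0.2 * xi)  # assumed support rates
        y.append(baseline + ti * int(sensitive))
        t.append(ti)
        x.append(xi)
    return y, t, x

y, t, x = simulate()
print("difference in means:", round(diff_in_means(y, t), 3))
print("gamma0, gamma1, delta0, delta1:",
      [round(v, 3) for v in interacted_cell_means(y, t, x)])
```

Here delta0 is the estimated support for the sensitive item among respondents with x = 0, and delta1 is the additional support among respondents with x = 1. With several or continuous characteristics the cell-mean shortcut no longer applies and a general OLS routine would be needed.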

Estimation results
In Table 2, we present the distribution of responses to our two list experiments (LE). The proportion of women in favour of domestic violence (DV) and child marriage (CM) is computed using the difference-in-means estimator and is respectively 30% (SE = 0.028) and 24% (SE = 0.026). [16] Table 3 reports the analysis when we run a linear regression model, as in Eq. (2). The first four columns report the results where the outcome is the list experiment outcome for domestic violence (LE DV), while the remaining four refer to the list experiment outcome for child marriage (LE CM). Columns (1) and (5) report regressions where the list experiment outcomes are regressed only on a list experiment indicator ($T_i$); these correspond to the difference-in-means estimate from Eq. (1). The next columns in Table 3 add the most important individual characteristics. The coefficients of interest are the ones interacted with the list experiment dummy ($\delta$). Column (2) reports the effect of being exposed to ADP on LE DV and shows a surprisingly positive effect, indicating an increase in support for domestic violence. Column (4) also includes age, primary education and marital status, and it shows that ADP-exposed adolescent girls are 11.4 percentage points (p value = 0.005) more likely to be in favour of domestic violence than girls not exposed to the ADP intervention, even after controlling for other individual characteristics. The results of the list experiment outcome for child marriage, instead, reveal an interesting effect of education. Column (8) shows that less educated girls, who have at most completed primary education, are 16.2 percentage points (p value = 0.012) more likely to support child marriage compared to more educated girls.
[15] See Imai (2011) for the methodological contribution, and De Cao and Lutz (2018) for a recent application.
[16] The average answer in the control group for the LE DV question is 2.19 (SE = 0.019), while in the treated it is 2.48 (SE = 0.020); the difference-in-means is then 0.297 (SE = 0.028). The average answer in the control group for the LE CM question is 2.48 (SE = 0.019), while in the treated it is 2.71 (SE = 0.019); the difference-in-means is then 0.238 (SE = 0.026).
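The reported group means and standard errors can be combined by hand to see where the headline estimates come from. The short check below uses only the rounded figures quoted in the text and assumes the treatment and control groups are independent samples, so the standard error of the difference is the square root of the sum of squared standard errors. The names are illustrative.

```python
from math import sqrt

# Rounded group means and standard errors as quoted in the text.
REPORTED = {
    "LE DV": {"control": (2.19, 0.019), "treated": (2.48, 0.020)},
    "LE CM": {"control": (2.48, 0.019), "treated": (2.71, 0.019)},
}

def diff_and_se(control, treated):
    """Difference in means and its standard error for two independent groups."""
    (m0, se0), (m1, se1) = control, treated
    return m1 - m0, sqrt(se0 ** 2 + se1 ** 2)

for name, g in REPORTED.items():
    d, se = diff_and_se(g["control"], g["treated"])
    print(name, round(d, 3), round(se, 3))
```

The rounded inputs give differences of about 0.29 and 0.23 with standard errors of about 0.028 and 0.027, close to the reported 0.297 (0.028) and 0.238 (0.026). The small gaps are what one would expect from working with means and standard errors rounded to two and three decimals.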
The results discussed in the previous paragraph are interpreted using robust standard errors, which we report. Because there are only 27 NGO branches, we also report p values obtained from the wild bootstrap when clustering at the NGO branch level. Our results remain robust to clustering at the NGO branch level.
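To make the idea of a wild cluster bootstrap concrete, the following is a minimal sketch for the simplest case, a regression of the item count on the list experiment indicator alone. It is not the procedure used for the tables: the choice of Rademacher weights, imposing the null hypothesis when generating bootstrap samples, the uncorrected cluster-robust variance, the function names and the simulated data are all assumptions made for illustration.

```python
import random
from statistics import mean

def cluster_t(y, t, cluster):
    """t statistic for the slope in a regression of y on (1, t),
    using a cluster-robust variance without small-sample correction."""
    t_bar, y_bar = mean(t), mean(y)
    t_dev = [ti - t_bar for ti in t]
    sxx = sum(d * d for d in t_dev)
    tau = sum(d * yi for d, yi in zip(t_dev, y)) / sxx
    alpha = y_bar - tau * t_bar
    # Sum the scores (t_i - t_bar) * residual_i within each cluster.
    scores = {}
    for d, yi, ti, g in zip(t_dev, y, t, cluster):
        scores[g] = scores.get(g, 0.0) + d * (yi - alpha - tau * ti)
    var = sum(s * s for s in scores.values()) / sxx ** 2
    return tau / var ** 0.5

def wild_cluster_p(y, t, cluster, reps=499, seed=0):
    """Bootstrap-t p value for H0: slope = 0, with the null imposed and
    one Rademacher weight (+1 or -1) drawn per cluster in each replication."""
    rng = random.Random(seed)
    t_obs = abs(cluster_t(y, t, cluster))
    y_bar = mean(y)
    resid0 = [yi - y_bar for yi in y]  # residuals under the null (intercept only)
    groups = sorted(set(cluster))
    exceed = 0
    for _ in range(reps):
        w = {g: rng.choice((-1.0, 1.0)) for g in groups}
        y_star = [y_bar + w[g] * r for r, g in zip(resid0, cluster)]
        if abs(cluster_t(y_star, t, cluster)) >= t_obs:
            exceed += 1
    return exceed / reps

# Hypothetical data: 27 clusters of 40 respondents, three non-sensitive
# items and an assumed 30% support for the sensitive item.
rng = random.Random(5)
cluster = [g for g in range(27) for _ in range(40)]
t = [rng.randint(0, 1) for _ in cluster]
y = [sum(rng.random() < 0.7 for _ in range(3)) + ti * int(rng.random() < 0.3) for ti in t]
print("wild cluster bootstrap p value:", wild_cluster_p(y, t, cluster, reps=199))
```

Drawing one weight per cluster keeps the within-branch dependence of the residuals intact in every bootstrap sample, which is the reason this approach is often preferred when the number of clusters is small.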

Social desirability bias
In this section, we examine social desirability bias by comparing attitudes towards domestic violence and child marriage measured via a list experiment with the same attitudes measured via a standard direct survey question (DQ). Table 1 reports that only 5% and 2% of respondents support domestic violence and child marriage when asked directly. When considering the direct question on domestic violence (DQ DV), we restrict our sample to the control group in the list experiment for domestic violence. Similarly, when considering the direct question on child marriage (DQ CM), we restrict our sample to the control group in the list experiment for child marriage. This implies that when we compare the direct question response with the list experiment, each respondent would have answered the sensitive question only once. In Table 4, we estimate linear probability models, using an indicator variable taking the value one if a girl supports domestic violence in columns (1)-(4) as the dependent variable of interest. [17] An indicator variable taking the value one if a girl supports child marriage is used as the dependent variable of interest in columns (5)-(8). Explanatory variables include a girl's main characteristics (ADP exposure, age, marital status and education).
[Table 3 notes: ADP refers to the Adolescent Development Programme. The dependent variable is the response to the list experiment questions. It is either 0, 1, 2 or 3 for respondents in the control group or 0, 1, 2, 3 or 4 for respondents in the treatment group. Data on education is missing for 16 observations. Estimated coefficients are from the item count technique linear regression model (see Imai 2011). Robust standard errors in parentheses, and wild bootstrap p values clustered at the NGO branch level in italics.]
While being exposed to ADP has no effect on adolescent girls' attitudes, primary education is positively associated with the probability of supporting domestic violence, while age is negatively associated with the probability of supporting child marriage. Lower educated girls (with at most primary education) are about 3 percentage points more likely to support domestic violence, while being a year older decreases the probability that child marriage is supported by 0.5 percentage points. [18] Next, we empirically test whether there are statistically significant differences between the estimates obtained using list experiments vs. direct questions eliciting gender attitudes. This difference tells us how much the true support for domestic violence or child marriage is under-reported. The underlying assumption is that true support for domestic violence or child marriage is measured using the list experiment. A second assumption is that the measurement error in the direct questions and list experiments has the same sign. Formally, let us define $Z_{i,J+1}(0)$ as respondent $i$'s potential answer to the sensitive item when asked directly (Blair and Imai 2012). Then, social desirability bias is as follows:

$$S(x) = \Pr(Z^*_{i,J+1} = 1 \mid X_i = x) - \Pr(Z_{i,J+1}(0) = 1 \mid X_i = x). \qquad (3)$$

The first term can be estimated as in Eq. (2), while the second can be estimated with a linear probability model regressing the observed value of $Z_{i,J+1}(0)$ on $X_i$. Given that the LE only allows us to identify the total number of items the respondent agrees with but not which ones (i.e., $Z^*_{i,J+1}$ cannot be identified), we cannot study the social desirability bias at the individual level, but only at the aggregate level. Table 5 reports the differences in the estimated proportion of girls answering the sensitive item in the affirmative when using the list experiment or the direct question by socio-demographic characteristics. [19] The direct question estimates correspond to the list experiment control group sub-samples.
[17] Results are very similar if we instead compute average marginal effects from a non-linear probit model (see Table 10).
[18] The R-squared is low for DQ in specifications that include an intercept only or an intercept and ADP exposure only (columns (1)-(2) and (5)-(6), Table 4); these specifications were estimated to facilitate comparison with the LE results (Table 3). We also do not find a good model fit for these specifications when using a non-linear probit model (Appendix Table 10). Inclusion of additional variables (such as education, age and whether married) improves model fit by increasing the R-squared. We interpret this as showing that it is difficult to predict individual responses to direct questions with much accuracy using the models at hand (either linear OLS or non-linear probit), at least in specifications where a full set of controls is not used.
[Table 4 notes, beginning truncated: … (4)) or child marriage (columns (5)-(8)). The sample used for the direct question on domestic violence (child marriage) corresponds to the control group in the list experiment on domestic violence (control group in the list experiment on child marriage). Data on education is missing for 16 observations. Robust standard errors in parentheses, and wild bootstrap p values clustered at the NGO branch level in italics.]
The first row of Table 5 shows the unconditional results; these reveal a large difference of 24 percentage points in support for domestic violence, and 22 percentage points in support for child marriage. In the following rows of Table 5, we report differences in gender attitudes elicited using list experiments and direct questions by examining the estimated proportions for different groups whilst controlling for all other characteristics. All differences are highly statistically significant and between 15 and 30 percentage points. When indirectly questioned, girls seem to be much more in favour of both domestic violence and child marriage than when asked directly. Which girls under-report their support the most? By taking differences again between groups (e.g., married versus non-married and primary educated versus secondary/tertiary educated) from columns (5) and (6), we find two interesting results. First, girls exposed to the ADP intervention are more likely to under-report their support (by 12 percentage points) for domestic violence compared to girls who are not exposed to the intervention (p value = 0.044). Second, less educated girls (i.e. primary schooling or below) are 15 percentage points more likely to under-report their support for child marriage than the higher educated girls (p value = 0.008).
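The unconditional comparison in the first row can be sketched as the list experiment estimate minus the share agreeing with the direct question, where the direct-question share is taken from the list experiment control group only, as described above. The function name and the toy data are hypothetical, and the sketch ignores covariates and standard errors.

```python
from statistics import mean

def social_desirability_bias(le_count, le_treated, direct_yes):
    """List experiment difference-in-means minus the direct-question share.

    le_count   : item counts reported in the list experiment
    le_treated : 1 if the respondent heard the sensitive item, else 0
    direct_yes : 1 if the respondent agreed with the direct question, else 0
    The direct-question share uses only list experiment control respondents.
    """
    treated = [c for c, t in zip(le_count, le_treated) if t == 1]
    control = [c for c, t in zip(le_count, le_treated) if t == 0]
    le_support = mean(treated) - mean(control)
    dq_support = mean(d for d, t in zip(direct_yes, le_treated) if t == 0)
    return le_support - dq_support

# Toy data for illustration only.
le_count = [2, 2, 2, 2, 3, 2, 3, 2]
le_treated = [0, 0, 0, 0, 1, 1, 1, 1]
direct_yes = [0, 0, 0, 1, 0, 0, 0, 0]
print(social_desirability_bias(le_count, le_treated, direct_yes))
```

Applied to the rounded headline figures in the text, the same subtraction gives roughly 0.297 - 0.05 = 0.25 for domestic violence and 0.238 - 0.02 = 0.22 for child marriage, in line with the 24 and 22 percentage points reported, allowing for rounding and for the direct-question shares being computed on the control sub-samples.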
We also estimate the proportions for the DQ using probit models. Table 13 shows the social desirability bias when the DQ predictions and their standard errors (columns (3)-(4)) come from the probit models used in Table 12. Reassuringly, the results are very similar to Table 5.

Validity of the list experiments
In "Section 4.1", we discussed the conditions for list experiments to be valid. Here, we discuss the validity of each of them in the context of the list experiments that we implemented. The balance tests for the randomisation of the list experiments can be seen in Table 6. Column (5) reports the p value of the t test statistic where each main variable in the control group is compared with the one in the treatment group. None of the differences are statistically significant, indicating that our list experiment randomisation is good.
To test for a violation of the no design effects assumption, we use the statistical test developed by Blair and Imai (2012). The null hypothesis of this test is that there are no design effects, and we fail to reject it (see footnote 20). This is consistent with the inclusion of the sensitive item not changing the responses to the non-sensitive items.
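The quantities behind this test can be illustrated with a short Python sketch. Under the no design effects and no liars assumptions, the share of respondents who would give y affirmative answers to the non-sensitive items and whose truthful answer to the sensitive item is t can be written as π(y, 1) = Pr(Y ≤ y | control) − Pr(Y ≤ y | treatment) and π(y, 0) = Pr(Y ≤ y | treatment) − Pr(Y ≤ y − 1 | control). The function name, the toy data and the choice of three non-sensitive items are assumptions for illustration. The sketch only computes point estimates, whereas the formal test of Blair and Imai (2012) also accounts for sampling variability.

```python
def type_proportions(treated_counts, control_counts, n_control_items):
    """Estimate respondent-type proportions pi(y, t) from item counts.

    Returns a dict keyed by (y, t), where y is the number of affirmative
    answers to the non-sensitive items and t is the truthful answer to the
    sensitive item. A clearly negative estimate would point to design effects.
    """
    def cdf(sample, y):
        """Empirical Pr(Y <= y) in a sample of item counts."""
        return sum(1 for v in sample if v <= y) / len(sample)

    pi = {}
    for y in range(n_control_items + 1):
        pi[(y, 1)] = cdf(control_counts, y) - cdf(treated_counts, y)
        pi[(y, 0)] = cdf(treated_counts, y) - cdf(control_counts, y - 1)
    return pi


# Hypothetical toy item counts (not the study data), three non-sensitive items.
treated = [2, 2, 3, 4]
control = [1, 2, 2, 3]
pi = type_proportions(treated, control, n_control_items=3)
negative_types = [key for key, value in pi.items() if value < 0]
support = sum(value for (y, t), value in pi.items() if t == 1)
print(f"negative types: {negative_types}, implied support: {support:.2f}")
```

By construction the proportions sum to one, and the proportions with t = 1 sum to the difference-in-means estimate of support for the sensitive item.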
The third requirement for a valid list experiment is the absence of ceiling or floor effects. This assumption, called "no liars", cannot be statistically tested with the linear model used in this paper (Blair and Imai 2012), but we can analyse the distribution of responses to our list experiments (see Table 2 and Fig. 1). As can be seen, responses to the list experiments are well distributed, being mainly concentrated around 2 and 3.
19 Results are similar if we use a non-linear probit rather than a linear probability model to estimate the second term in equation (3), see Appendix Table 11.
20 Results of this test are reported in Appendix Table 12. For technical details, we refer to Blair and Imai (2012, pp. 64-65).
None of the respondents responded zero to either list experiment, but floor effects are expected to play a minor role. Table 2 shows that 4% and 5% of the respondents have no problems in revealing their support for domestic violence and child marriage, but there are quite a few girls who replied "3" to both list experiments, particularly the one on child marriage. In Table 7, we run different regressions to analyse floor and ceiling effects. We create an outcome called floor LE DV (floor LE CM) that takes the value one if the LE DV (LE CM) is equal to one and zero otherwise; and an outcome ceiling LE DV (ceiling LE CM) that takes the value one if the LE DV (LE CM) is equal to three and zero otherwise. We regress these outcomes on the main respondent characteristics for the list experiment control group. This allows us to see who is most likely to hit the floor or the ceiling and may thus be over- or under-reporting her support for domestic violence or child marriage by not reporting "0" or "4". We find no statistically significant effect of any of those characteristics on the outcomes, except for primary education on ceiling effects for the list experiment on child marriage. This result could indicate a ceiling effect for less educated girls. Bearing in mind this limitation, it has been shown that when there are ceiling (or floor) effects, the true support for the sensitive item is underestimated (Blair and Imai 2012).
[...] (4) and (8) of Table 4), and on the linear model for the indirect question (columns (4) and (8) of Table 3). The results are averaged over the sample distribution of covariates. Standard errors are robust.
Given that our list experiments show heterogeneous effects by ADP exposure for domestic violence, and by education for child marriage, we examine the distribution of responses to the list experiments by these characteristics in Figs. 2 and 3.
This is not a formal test, but the idea behind these figures is to understand whether these girls (less educated and ADP exposed) have understood the mechanism behind the list experiment and have manipulated their responses. For the sake of comparison, we also report the distribution of list experiment responses for the highly educated girls and girls not exposed to ADP. In Fig. 2, we can see that responses to the list experiment on domestic violence in the different groups are well distributed, with only a small number of cases at the extremes. In Fig. 3, the distribution of responses to the list experiment on child marriage shows that some girls gave the response 3, which might indicate the presence of a ceiling effect. In this case, we might have an underestimate of the true support for child marriage. Tests for design effects run on the sub-samples of low educated, high educated, ADP-exposed and ADP-non-exposed adolescent girls always fail to reject the null hypothesis of no design effects (results available upon request).
Fig. 1 Distribution of list experiment responses
In a recent paper, Chuang et al. (2019) critically examine the usefulness of indirect survey methods such as list experiments and randomized response techniques. They implement a large number of double list experiments within a single survey taken by respondents in Côte d'Ivoire where groups A and B acted as treatment and controls for the same sexual or reproductive health sensitive behaviour; in this design, the non-sensitive items for groups A and B need to be different by construction. Use of double list experiments allows the generation of two difference-in-means estimators that can be compared, which to date had only been used to reduce the variance compared to a single list experiment (Droitcour et al. 1991; Glynn 2013). For most [...] Chuang et al. (2019). We preferred to ask our respondents direct questions on attitudes to compare them with the indirect list experiment questions.
We believe this method is better suited to our objective to examine social desirability bias since in a double list experiment everyone is asked about the sensitive item twice, both directly or via the list experiment (see footnote 21). An important exercise in the work of Chuang et al. (2019) is the variation in the type of non-sensitive items, ranging from innocuous items to items related to the sensitive item. The authors find that non-sensitive items more closely related to the sensitive one perform better. None of the non-sensitive items in our list experiments are innocuous, making the sensitive item less salient, and supporting the validity of our design.
Tables 13 and 14 report analyses similar to those in Tables 3 and 4, respectively, but with additional controls. We included controls for the following measures of maternal empowerment: whether the respondent's mother has been beaten by her husband, was married early, became pregnant early, and practices purdah. None of these additional variables are statistically significant, except for the mother practicing purdah, which increases the likelihood of supporting child marriage when asked directly. Nonetheless, adding these variables does not change our main findings.
21 Ideally, the direct question should only be asked of the list experiment control group to avoid potential underreporting (see our discussion on how we deal with this in our list experiment, "Section 4.1").

Discussion
Our findings show that measurement error is important when examining attitudes towards sensitive issues such as domestic violence or child marriage. Under-reporting can be quite high. We find that only 5.4% of adolescent girls support domestic violence when questioned directly, but 29.7% support domestic violence when questioned indirectly via a list experiment. Similar results are shown for child marriage, where 2.1% of the respondents think a girl should be married off by age 18 when asked a direct question, but support increases to 23.9% when asked via a list experiment. Interestingly, we find that girls who have lower education under-report their support for child marriage more than girls who have higher education. To the best of our knowledge, this is the first study which implements a list experiment to examine attitudes towards child marriage; therefore, we cannot compare this result with existing studies. There are no heterogeneous effects by education when looking at attitudes towards domestic violence. In contrast, Aguero and Frisancho (2018) use a list experiment to study domestic violence experiences in urban Lima (Peru), and find high under-reporting among the most educated respondents. This difference could be related to the different contexts or to the fact that we aim at measuring attitudes, while Aguero and Frisancho focus on behaviours. Our survey asks girls if "a wife can be hit, slapped, kicked or physically hurt by the husband under any circumstances". In our context, girls with at most primary education might have more to lose if they do not support domestic violence, while educated girls might have better outside options (e.g., better jobs) and depend less on their husband. De Cao and Lutz (2018) examine attitudes towards female genital cutting in Ethiopia and find, similarly to us, that uneducated women are less willing to share their support for the practice.
Finally, we find suggestive evidence that the social desirability bias for domestic violence is larger among adolescent girls exposed to ADP. ADP is a randomly assigned intervention; hence, we can interpret its effect as causal, even if it is only marginally statistically significant (see footnote 22). The intervention focuses on changing traditional attitudes through non-formal training and dissemination of information regarding sexual health, gender rights and legal provisions for violence against women including child marriage. It is certainly possible that respondents in ADP-exposed areas conform to the expectations of those providing the program treatment. The ADP campaign aims at changing the local customs, and this may increase social pressures around gender attitudes, resulting in a stronger incentive to give a biased answer. We provide a more detailed comparison of the ADP program with other similar programs in developing countries in Appendix 2.

Conclusion
Traditional "gender attitudes" or beliefs regarding the appropriateness and/or acceptability of gender-specific roles and behaviour in society are considered important drivers of women's well-being. While measures of gender attitudes are now included in many representative international and national surveys, they suffer from potential measurement error, limiting their usefulness in empirical research. Using a unique data set from Bangladesh, we confirm that subjective responses to sensitive direct questions under-estimate support for regressive social practices such as wife beating and child marriage. We find that girls with higher education are more supportive of egalitarian gender norms pertaining to child marriage. While we do not claim this to be a causal relationship, our finding is supportive of expanding access to education to young girls in developing country settings. We also find that exposure to a program that disseminated knowledge on gender empowerment led girls to hide their true support for domestic violence. This indicates that (at least in the short-term) programs like the ADP might not have the desired effects on gender attitudes. We also find that different individual characteristics are associated with under-reporting of different aspects of gender attitudes. For instance, education matters for under-reporting of attitudes which pertain to child marriage, while ADP exposure matters for under-reporting of attitudes regarding domestic violence. This indicates that there are no simple prescriptions or general rules that apply across all aspects of gender attitudes. Our research suggests that survey methods matter in eliciting attitudes towards gendered violence and child marriage. The evidence presented in this paper also highlights the difficulty in permanently shifting gender attitudes exclusively through social empowerment programs even in a setting where girls' schooling and economic opportunities have improved considerably in recent decades. 
Our results confirm the relevance of potential bias in responses to standard direct questions when the outcome of interest is sensitive. We suggest that practitioners measure each sensitive outcome using different survey methodologies to test whether there is indeed under- or over-reporting. We believe this is particularly important in the context of policy impact evaluations, where gathering complementary evidence about the effectiveness of a program or intervention is crucial when attitudes or behaviour concern sensitive topics.
22 Similar findings were reported by De Cao and Lutz (2018), who study attitudes towards female genital cutting in Ethiopia. The intervention they consider, however, is not random, which prevents them from claiming causal effects.
To examine differences in female labor force participation, as well as the incidence of and attitudes towards child marriage and domestic violence across cohorts and over time, we use data from the 2007 and 2014 Demographic and Health Surveys (DHS) for Bangladesh. These are nationally representative surveys that interview repeated cross-sections of Bangladeshi households. For the following discussion, we make use of responses to the women's questionnaire from the 2007 and 2014 surveys, where the respondents were ever-married women from these households between the ages of 15 and 49. Figure 4 shows the fraction within different age groups of women who report that they are currently working. While the fraction of women currently working is less than 40% within all age groups in both the 2007 and 2014 surveys, these fractions have increased over time if we compare the 2007 and 2014 respondents. Among the 2014 respondents, the fraction of currently working women has particularly increased within the older age groups of 35-39, 40-44 and 45-49 in comparison with the 2007 respondents. Figure 5 provides the average age at first cohabitation (or marriage) within different age groups for both the 2007 and 2014 respondents. This provides us with information on the incidence of child marriage (see footnote 23). As may be seen in Fig. 5, there is an increase in average age at first cohabitation over time within all age groups, and particularly among the oldest age groups of 40-44- and 45-49-year-old women. For both the 2007 and 2014 respondents, average age at first cohabitation is lowest for the youngest age group 15-19 (largely because the average is over the few women who are already cohabiting at this age), and then for older women belonging to age groups 30-34, 40-44 and 45-49.
Questions on incidence of domestic violence were only asked in the 2007 DHS. Figure 6 shows the fraction of women within different age groups who experienced either less severe violence (left panel) or severe violence (right panel). While relatively high fractions of women experience less severe violence (> 40% for all age groups) and severe violence (> 10% for all age groups), there does not seem to be much variation in incidence across age groups. In other words, younger and older women seem to be equally likely to experience either less severe or severe violence from an intimate partner.
Next, we turn to attitudes towards domestic violence as given in Figs. 7 and 8. These are constructed from a set of questions asking women whether they agree that wife beating is justified in the following situations (for 2007 respondents): [...]
In the 2014 survey, female respondents are asked if they agree with the above four statements, and also whether they agree that wife beating is justified:
5. If the wife burns the food
Figure 7 provides a summary of responses for the 2007 respondents by age group and Fig. 8 provides this summary for the 2014 respondents. First, despite potential measurement error in these responses, a relatively large fraction of women agree that wife beating is justified in these situations. Approximately 20% of women support wife beating in situations 1-3 and approximately 10% in situations 4-5 (see footnote 24). Second, there is very little variation in support for wife beating across age groups within the 2007 respondents or within the 2014 respondents. In other words, younger women seem as likely to support wife beating as older women, and this is true in 2007 as well as in 2014. Finally, from a comparison of Figs. 7 and 8, it does not seem that attitudes towards wife beating have changed over time, since approximately the same fraction of women support wife beating in 2007 and 2014. This is despite the improvements in labor force participation over this period that we discussed earlier, as shown in Fig. 4. Comparing over the recent past, Bangladesh has seen an improvement in female labor force participation, particularly among older women, and a reduction in the incidence of child marriage. However, gender attitudes, specifically towards domestic violence, remain unchanged.
24 We find a lower fraction of respondents support intimate partner violence when asked directly in our survey, but our direct question is very different, asking if wife beating is justified under any rather than specific circumstances.
While we have not ruled out potential confounds in this descriptive discussion, the patterns we have shown indicate that improvements in female labor force participation could have driven reductions in the incidence of child marriage, but that it is unlikely that changes in female labor force participation led to changes in gender attitudes (at least those related to domestic violence) in Bangladesh.

Appendix 2: The ADP program
BRAC has innovated a range of club-based adolescent development programmes which expose adolescents to a variety of activities such as (i) livelihood training courses, (ii) a special network for (female) adolescent photographers, (iii) communication, awareness and advocacy through dialogues among adolescents, their parents and influential persons in the community (see footnote 25), and (iv) the Adolescent Peer Organised Network (APON). All educational activities are organized in adolescent clubs (also known as "Kishori clubs"), where lessons are delivered in structured courses. These clubs offer a safe space where adolescent girls can read, socialise, play games, take part in cultural activities and have an open discussion on personal and social issues with their peers. These clubs are set up at the village level using a former BRAC school building as the venue. In terms of pedagogic structure, there is one 2-hour session per week, which takes place every Thursday at the club. In total, four sessions take place in a month and 48 sessions altogether in 1 year. The APON/life skill-based education offers, in total, 12 subjects on different social and health-related issues in which 31 learning stories are articulated. Clubs are managed by an adolescent leader who is responsible for implementing all club activities. The leader is chosen based on leadership abilities.
25 Those activities involve various initiatives such as interactive popular theatre, adolescent fairs, cultural competitions and sports for development.
Each club consists of 25-35 adolescent members aged 10-19 years, with 75% girls and the rest boys. Participation in the club is conditioned by socioeconomic status. Adolescents who dropped out from school and come from a poor socio-economic background are given priority. While individual adolescents from the eligible groups self-selected into an ADP club (i.e. participation is non-random), all eligible adolescents in ADP program villages were equally exposed (i.e. intervention exposure is random): the study design randomly assigned treatment (i.e. the ADP program placement) at the village level. Moreover, both program and non-program villages have benefited from BRAC's non-formal education in the past.

Comparison with similar interventions in developing countries
There are also a number of other developing country studies that have evaluated related programs and their impact on gender attitudes and outcomes. These include the "empowerment and livelihoods for adolescents" (ELA) training scheme in Uganda (Bandiera et al. 2018), BALIKA (Bangladeshi Association for Life Skills, Income, and Knowledge for Adolescents) in Bangladesh (Amin et al. 2018) and Kishori Kendra (KK) scheme of training-based gender empowerment and financial incentives to delay marriage in Bangladesh (Buchmann et al. 2018). Both ELA and KK include safe space components where, in clubs, adolescents receive life skill lessons about gender rights and sexual education. However, they differ in other aspects. For instance, ELA simultaneously provides a vocational training component for income generating activities while KK includes a financial incentives component to delay marriage (in addition to a 6-month empowerment program) as well as an additional treatment arm offering empowerment plus incentive.
The existing developing country interventions differ considerably in terms of intensity of the treatment, target population, design and geographic coverage. For example, the ADP scheme's dosage was 96 hours in total over 1 year. In contrast, girls in the safe space groups in Bangladesh received about 200 hours of training over 6 months (Buchmann et al. 2018), girls in the BALIKA scheme in Bangladesh received 144 hours in total (Amin et al. 2018), and girls in the ELA project in Uganda received over 500 hours in five sessions per week for 2 years (Bandiera et al. 2018) (see footnote 26). In this context, ELA and KK are both variants of the scheme to which our sample respondents are exposed. However, in contrast to the Bangladesh ADP scheme, ELA and KK are multifaceted programs and their evaluations have a longer window (4 years post-intervention). Moreover, these interventions did not have a conclusive impact on attitudes towards child marriage and do not report an impact on domestic violence (see footnote 27). Buchmann et al. (2018) contains data on a rich set of outcomes on age at marriage as well as indices for gender attitudes, but the estimated impact on the standard empowerment component is insignificant. While ELA is reported to be effective in improving girls' expectations for ages at first marriage for women, the most suitable age to start childbearing and delaying pregnancy, it is not known what the impact would have been in the absence of livelihood training. Compared to KK, BALIKA and ELA, BRAC's ADP scheme studied in this paper only focuses on the standard empowerment component. While these differences can undermine the size of the program impact, they do not necessarily explain the greater support for domestic violence among ADP participants, which remains a puzzle.
26 Another recent study is Dhar et al. (2018), which examines a school-based randomized intervention in a north Indian state (Haryana) where gender discrimination is entrenched. In contrast to club-based safe space interventions in Bangladesh and Uganda, the intervention in Dhar et al. (2018) is integrated within regular classrooms/schools and conditional on school attendance and government school enrolment. The sample includes both rural and urban locations. While dosage was only a total of 20 h in the secondary school-based program in Haryana (India), it is a non-community-level multi-year school-based intervention. This study finds a positive effect of the intervention on adolescents' support for gender equality. The evaluation study (i.e. Dhar et al. 2018) relies on aggregate indices of gender attitudes and does not report treatment effects for attitude questions specific to the appropriate age of marriage for girls.
[...] (4)) or child marriage (columns (5)-(8)). The sample used for estimates reported in columns (1)-(4) corresponds to the control group in the list experiment on domestic violence while the sample used for estimates reported in columns (5)-(8) corresponds to the control group in the list experiment on child marriage. Data on education is missing for 16 observations. Robust standard errors in parentheses, and bootstrap standard errors clustered at the NGO branch level in brackets
[...] (4) and (8) of Table 10), and on the linear model for the indirect question (columns (4) and (8) of Table 3). The results are averaged over the sample distribution of covariates. Standard errors are robust
Type               Estimate  SE     Estimate  SE
π(y = 0, t = 1)    0.000     0.000  0.000     0.000
π(y = 1, t = 1)    0.064     0.012  0.026     0.008
π(y = 2, t = 1)    0.196     0.021  0.160     0.022
π(y = 3, t = 1)    0.037     0.006  0.050     0.007
π(y = 0, t = 0)    0.000     0.000  0.000     0.000
π(y = 1, t = 0)    0.043     0.006  0.024     0.005
π(y = 2, t = 0)    0.406     0.019  0.262     0.016
π(y = 3, t = 0)    0.255     0.015  0.478     0.017
The table shows the estimated proportion (and standard error) of respondent types, π(y, t), characterised by the total number of affirmative answers to the control questions, y, and the truthful answer for the sensitive item, t. The null hypothesis of no design effects implies that π(y, t) ≥ 0 for all y and t. Since no estimated proportion is negative, we cannot reject the null. The SE of these tests are robust
[...] (4)) or child marriage (columns (5)-(8)). Robust standard errors in parentheses, and wild bootstrap p values clustered at the NGO branch level in italics
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.