## Abstract

The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These different null hypotheses arise naturally in this setting for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
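To make the contrast with the Holm correction concrete, the following sketch compares Holm-adjusted p-values with a bootstrap stepdown adjustment based on the maximum statistic. It is a simplified illustration and not the procedure developed in this paper: it assumes multiple outcomes with a single binary treatment, uses unstudentized statistics, and omits both the balancing step and the logical restrictions across hypotheses. The function names `holm_adjust` and `stepdown_maxstat_adjust` are hypothetical.

```python
import numpy as np


def holm_adjust(pvals):
    """Holm (1979) step-down adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The j-th smallest p-value (j = 0, ..., m-1) is multiplied by (m - j);
    # a running maximum then enforces monotonicity of the adjusted values.
    scaled = (m - np.arange(m)) * p[order]
    adj_sorted = np.minimum(1.0, np.maximum.accumulate(scaled))
    adj = np.empty(m)
    adj[order] = adj_sorted
    return adj


def stepdown_maxstat_adjust(y, d, n_boot=2000, seed=0):
    """Bootstrap step-down adjusted p-values for K outcomes (illustrative).

    y : (n, K) array of outcomes; d : (n,) array of 0/1 treatment indicators.
    Hypothesis k states that the treated and control means of outcome k agree.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    n, K = y.shape

    def mean_diff(yy, dd):
        return yy[dd].mean(axis=0) - yy[~dd].mean(axis=0)

    theta = mean_diff(y, d)
    t_obs = np.sqrt(n) * np.abs(theta)

    # Resample whole units so that dependence across outcomes is preserved,
    # and recentre each bootstrap difference at the original estimate.
    t_boot = np.empty((n_boot, K))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        t_boot[b] = np.sqrt(n) * np.abs(mean_diff(y[idx], d[idx]) - theta)

    # Step down from the largest observed statistic, each time comparing it
    # with the bootstrap maximum over the hypotheses not yet considered.
    order = np.argsort(-t_obs)
    adj_sorted = np.empty(K)
    for j, k in enumerate(order):
        max_boot = t_boot[:, order[j:]].max(axis=1)
        adj_sorted[j] = np.mean(max_boot >= t_obs[k])
    adj_sorted = np.maximum.accumulate(adj_sorted)

    adj = np.empty(K)
    adj[order] = adj_sorted
    return adj
```

Because the bootstrap maximum is taken over jointly resampled outcomes, positive dependence among the statistics lowers its upper quantiles relative to the bound implicit in Bonferroni or Holm, which is the source of the power gain described above.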

## References

Anderson, M. (2008). Multiple inference and gender differences in the effects of early intervention: A re-evaluation of the Abecedarian, Perry Preschool, and Early Training Projects. *Journal of the American Statistical Association*, *103*(484), 1481–1495.

Bettis, R. A. (2012). The search for asterisks: Compromised statistical tests and flawed theories. *Strategic Management Journal*, *33*(1), 108–113.

Bhattacharya, J., Shaikh, A. M., & Vytlacil, E. (2012). Treatment effect bounds: An application to Swan-Ganz catheterization. *Journal of Econometrics*, *168*(2), 223–243.

Bonferroni, C. E. (1935). *Il calcolo delle assicurazioni su gruppi di teste*. Rome: Tipografia del Senato.

Bugni, F., Canay, I., & Shaikh, A. (2015). Inference under covariate-adaptive randomization. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.

Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., et al. (2016). Evaluating replicability of laboratory experiments in economics. *Science*, *351*(6280), 1433–1436.

Fink, G., McConnell, M., & Vollmer, S. (2014). Testing for heterogeneous treatment effects in experimental data: False discovery risks and correction procedures. *Journal of Development Effectiveness*, *6*(1), 44–57.

Flory, J. A., Gneezy, U., Leonard, K. L., & List, J. A. (2015a). Gender, age, and competition: The disappearing gap. Unpublished Manuscript.

Flory, J. A., Leibbrandt, A., & List, J. A. (2015b). Do competitive workplaces deter female workers? A large-scale natural field experiment on job-entry decisions. *The Review of Economic Studies*, *82*(1), 122–155.

Gneezy, U., Niederle, M., & Rustichini, A. (2003). Performance in competitive environments: Gender differences. *The Quarterly Journal of Economics*, *118*(3), 1049–1074.

Heckman, J., Moon, S. H., Pinto, R., Savelyev, P., & Yavitz, A. (2010). Analyzing social experiments as implemented: A reexamination of the evidence from the HighScope Perry Preschool Program. *Quantitative Economics*, *1*(1), 1–46.

Heckman, J. J., Pinto, R., Shaikh, A. M., & Yavitz, A. (2011). Inference with imperfect randomization: The case of the Perry Preschool Program. National Bureau of Economic Research Working Paper w16935.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, *6*(2), 65–70.

Hossain, T., & List, J. A. (2012). The behavioralist visits the factory: Increasing productivity using simple framing manipulations. *Management Science*, *58*(12), 2151–2167.

Ioannidis, J. (2005). Why most published research findings are false. *PLoS Med*, *2*(8), e124.

Jennions, M. D., & Moller, A. P. (2002). Publication bias in ecology and evolution: An empirical assessment using the ‘trim and fill’ method. *Biological Reviews of the Cambridge Philosophical Society*, *77*(02), 211–222.

Karlan, D., & List, J. A. (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. *The American Economic Review*, *97*(5), 1774–1793.

Kling, J., Liebman, J., & Katz, L. (2007). Experimental analysis of neighborhood effects. *Econometrica*, *75*(1), 83–119.

Lee, S., & Shaikh, A. M. (2014). Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of Progresa on school enrollment. *Journal of Applied Econometrics*, *29*(4), 612–626.

Lehmann, E., & Romano, J. (2005). Generalizations of the familywise error rate. *The Annals of Statistics*, *33*(3), 1138–1154.

Lehmann, E. L., & Romano, J. P. (2006). *Testing statistical hypotheses*. Berlin: Springer.

Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2012). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. National Bureau of Economic Research w18165.

List, J. A., & Samek, A. S. (2015). The behavioralist as nutritionist: Leveraging behavioral economics to improve child food choice and consumption. *Journal of Health Economics*, *39*, 135–146.

Machado, C., Shaikh, A., Vytlacil, E., & Lunch, C. (2013). Instrumental variables, and the sign of the average treatment effect. Unpublished Manuscript, Getulio Vargas Foundation, University of Chicago, and New York University. [2049].

Maniadis, Z., Tufano, F., & List, J. A. (2014). One swallow doesn’t make a summer: New evidence on anchoring effects. *The American Economic Review*, *104*(1), 277–290.

Niederle, M., & Vesterlund, L. (2007). Do women shy away from competition? Do men compete too much? *The Quarterly Journal of Economics*, *122*(3), 1067–1101.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. *Perspectives on Psychological Science*, *7*(6), 615–631.

Romano, J. P., & Shaikh, A. M. (2006a). On stepdown control of the false discovery proportion. In *Lecture Notes-Monograph Series* (pp. 33–50).

Romano, J. P., & Shaikh, A. M. (2006b). Stepup procedures for control of generalizations of the familywise error rate. *The Annals of Statistics*, *34*, 1850–1873.

Romano, J. P., & Shaikh, A. M. (2012). On the uniform asymptotic validity of subsampling and the bootstrap. *The Annals of Statistics*, *40*(6), 2798–2822.

Romano, J. P., Shaikh, A. M., & Wolf, M. (2008a). Control of the false discovery rate under dependence using the bootstrap and subsampling. *Test*, *17*(3), 417–442.

Romano, J. P., Shaikh, A. M., & Wolf, M. (2008b). Formalized data snooping based on generalized error rates. *Econometric Theory*, *24*(02), 404–447.

Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. *Econometrica*, *73*(4), 1237–1282.

Romano, J. P., & Wolf, M. (2010). Balanced control of generalized error rates. *The Annals of Statistics*, *38*, 598–633.

Sutter, M., & Glätzle-Rützler, D. (2014). Gender differences in the willingness to compete emerge early in life and persist. *Management Science*, *61*(10), 2339–2354.

Westfall, P. H., & Young, S. S. (1993). *Resampling-based multiple testing: Examples and methods for p value adjustment* (Vol. 279). New York: Wiley.

## Acknowledgements

We would like to thank Joseph P. Romano for helpful comments on this paper. We also thank Joseph Seidel for his excellent research assistance. The research of the second author was supported by National Science Foundation Grants DMS-1308260, SES-1227091, and SES-1530661.

## Additional information

Documentation of our procedures and our Stata and Matlab code can be found at https://github.com/seidelj/mht.

## Appendix

### 1.1 Proof of Theorem 3.1

First note that under Assumption 2.1, \(Q\in \omega _{s}\) if and only if \(P\in {\tilde{\omega }}_{s}\), where

The proof of this result now follows by verifying the conditions of Corollary 5.1 in Romano and Wolf (2010). In particular, we verify Assumptions B.1–B.4 in Romano and Wolf (2010).

In order to verify Assumption B.1 in Romano and Wolf (2010), let

and note that

where

with \(A_{n,i}(P)\) equal to the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

and \(B_{n}\) is the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

and \(f:{\mathbf {R}}^{2|{\mathcal {S}}|}\times {\mathbf {R}}^{2|{\mathcal {S}}|}\rightarrow {\mathbf {R}}^{2|{\mathcal {S}}|}\) is the function of \(A_{n}(P)\) and \(B_{n}\) whose *s*th argument for \(s\in {\mathcal {S}}\) is given by the inner product of the *s*th pair of terms in \(A_{n}(P)\) and the *s*th pair of terms in \(B_{n}\), i.e., the inner product of (10) and (11). The weak law of large numbers and central limit theorem imply that

where *B*(*P*) is the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

Next, note that \(E_{P}[A_{n,i}(P)]=0\). Assumption 2.3 and the central limit theorem therefore imply that

for an appropriate choice of \(V_{A}(P)\). In particular, the diagonal elements of \(V_{A}(P)\) are of the form

The continuous mapping theorem thus implies that

for an appropriate variance matrix *V*(*P*). In particular, the *s*th diagonal element of *V*(*P*) is given by

In order to verify Assumptions B.2–B.3 in Romano and Wolf (2010), it suffices to note that (12) is strictly greater than zero under our assumptions. Note that it is not required that *V*(*P*) be non-singular for these assumptions to be satisfied.
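As a reading aid, the following sketch writes out one decomposition that matches the verbal description above. It assumes that the root for hypothesis \(s=(k,d,d',z)\) is the centered, unstudentized difference in subgroup sample means. The symbols \(R_{s,n}(P)\) and \({\hat{\mu }}_{k|d,z}\) (the sample mean of \(Y_{i,k}\) among units with \(D_{i}=d\) and \(Z_{i}=z\)) and the explicit forms below are assumptions introduced for illustration; they are not the paper's numbered displays.

\[
R_{s,n}(P)=\sqrt{n}\Big(\big({\hat{\mu }}_{k|d,z}-{\tilde{\mu }}_{k|d,z}(P)\big)-\big({\hat{\mu }}_{k|d',z}-{\tilde{\mu }}_{k|d',z}(P)\big)\Big)
\]

With \(A_{n}(P)=n^{-1/2}\sum _{1\le i\le n}A_{n,i}(P)\), a pair of terms in \(A_{n,i}(P)\) and a pair of terms in \(B_{n}\) whose inner product reproduces \(R_{s,n}(P)\) is

\[
\begin{pmatrix}
\big(Y_{i,k}-{\tilde{\mu }}_{k|d,z}(P)\big)I\{D_{i}=d,Z_{i}=z\}\\
-\big(Y_{i,k}-{\tilde{\mu }}_{k|d',z}(P)\big)I\{D_{i}=d',Z_{i}=z\}
\end{pmatrix},
\qquad
\begin{pmatrix}
\Big(\frac{1}{n}\sum _{1\le i\le n}I\{D_{i}=d,Z_{i}=z\}\Big)^{-1}\\
\Big(\frac{1}{n}\sum _{1\le i\le n}I\{D_{i}=d',Z_{i}=z\}\Big)^{-1}
\end{pmatrix}.
\]

Under this reading, the corresponding pair in \(B(P)\) is \(\big(1/P\{D_{i}=d,Z_{i}=z\},\,1/P\{D_{i}=d',Z_{i}=z\}\big)\), the diagonal elements of \(V_{A}(P)\) take the form \(\text {Var}_{P}[Y_{i,k}|D_{i}=d,Z_{i}=z]\,P\{D_{i}=d,Z_{i}=z\}\), and, because the two indicator functions cannot both equal one when \(d\ne d'\), the *s*th diagonal element of *V*(*P*) is

\[
\frac{\text {Var}_{P}[Y_{i,k}|D_{i}=d,Z_{i}=z]}{P\{D_{i}=d,Z_{i}=z\}}+\frac{\text {Var}_{P}[Y_{i,k}|D_{i}=d',Z_{i}=z]}{P\{D_{i}=d',Z_{i}=z\}}.
\]

This expression is strictly positive whenever at least one of the two conditional variances is positive, regardless of whether *V*(*P*) as a whole is non-singular.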

In order to verify Assumption B.4 in Romano and Wolf (2010), we first argue that

under \(P_{n}\) for an appropriate sequence of distributions \(P_{n}\) for \((Y_{i},D_{i},Z_{i})\). To this end, assume that

- (a) \(P_{n}{\mathop {\rightarrow }\limits ^{d}}P\).

- (b) \({\tilde{\mu }}_{k|d,z}(P_{n})\rightarrow {\tilde{\mu }}_{k|d,z}(P)\).

- (c) \(B_{n}{\mathop {\rightarrow }\limits ^{P_{n}}}B(P)\).

- (d) \(\text {Var}_{P_{n}}[A_{n,i}(P_{n})]\rightarrow \text {Var}_{P}[A_{n,i}(P)]\).

Under (a) and (b), it follows that \(A_{n,i}(P_{n}){\mathop {\rightarrow }\limits ^{d}}A_{n,i}(P)\) under \(P_{n}\). By arguing as in Theorem 15.4.3 in Lehmann and Romano (2006) and using (d), it follows from the Lindeberg–Feller central limit theorem that

under \(P_{n}\). It thus follows from (c) and the continuous mapping theorem that (13) holds under \(P_{n}\). Assumption B.4 in Romano and Wolf (2010) now follows simply by noting that the Glivenko–Cantelli theorem, the strong law of large numbers, and the continuous mapping theorem ensure that \({\hat{P}}_{n}\) satisfies (a)–(d) with probability one under *P*.
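The last step says, in effect, that the bootstrap distribution computed from \({\hat{P}}_{n}\) mimics the limiting distribution. A minimal numerical check of that idea is sketched below under strong simplifications that are assumptions of this sketch rather than part of the proof: a single outcome, a binary treatment, no subgroups, and the centered, unstudentized difference in means as the root. The function name `bootstrap_vs_plugin_variance` is hypothetical.

```python
import numpy as np


def bootstrap_vs_plugin_variance(y, d, n_boot=2000, seed=0):
    """Compare the bootstrap variance of sqrt(n) * (difference in means)
    with a plug-in estimate of its large-sample variance.

    y : (n,) array of outcomes; d : (n,) array of 0/1 treatment indicators.
    Returns (bootstrap_variance, plugin_variance).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    d = np.asarray(d).astype(bool)
    n = y.size

    theta = y[d].mean() - y[~d].mean()

    # Sample analogue of Var[Y | D = 1] / P{D = 1} + Var[Y | D = 0] / P{D = 0}.
    share_treated = d.mean()
    plugin = y[d].var() / share_treated + y[~d].var() / (1.0 - share_treated)

    # Nonparametric bootstrap: resample (Y_i, D_i) pairs with replacement and
    # recentre the resampled difference in means at the original estimate.
    roots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        yb, db = y[idx], d[idx]
        roots[b] = np.sqrt(n) * ((yb[db].mean() - yb[~db].mean()) - theta)

    return roots.var(), plugin
```

In moderately large samples the two returned values should be close, which is the finite-sample counterpart of the convergence statements used in the argument above.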

## About this article

### Cite this article

List, J.A., Shaikh, A.M. & Xu, Y. Multiple hypothesis testing in experimental economics. *Exp Econ* **22**, 773–793 (2019). https://doi.org/10.1007/s10683-018-09597-5

DOI: https://doi.org/10.1007/s10683-018-09597-5

### Keywords

- Experiments
- Multiple hypothesis testing
- Multiple treatments
- Multiple outcomes
- Multiple subgroups
- Randomized controlled trial
- Bootstrap
- Balance