
Multiple hypothesis testing in experimental economics

  • Original Paper
Experimental Economics


The analysis of data from experiments in economics routinely involves testing multiple null hypotheses simultaneously. These different null hypotheses arise naturally in this setting for at least three different reasons: when there are multiple outcomes of interest and it is desired to determine on which of these outcomes a treatment has an effect; when the effect of a treatment may be heterogeneous in that it varies across subgroups defined by observed characteristics and it is desired to determine for which of these subgroups a treatment has an effect; and finally when there are multiple treatments of interest and it is desired to determine which treatments have an effect relative to either the control or relative to each of the other treatments. In this paper, we provide a bootstrap-based procedure for testing these null hypotheses simultaneously using experimental data in which simple random sampling is used to assign treatment status to units. Using the general results in Romano and Wolf (Ann Stat 38:598–633, 2010), we show under weak assumptions that our procedure (1) asymptotically controls the familywise error rate—the probability of one or more false rejections—and (2) is asymptotically balanced in that the marginal probability of rejecting any true null hypothesis is approximately equal in large samples. Importantly, by incorporating information about dependence ignored in classical multiple testing procedures, such as the Bonferroni and Holm corrections, our procedure has much greater ability to detect truly false null hypotheses. In the presence of multiple treatments, we additionally show how to exploit logical restrictions across null hypotheses to further improve power. We illustrate our methodology by revisiting the study by Karlan and List (Am Econ Rev 97(5):1774–1793, 2007) of why people give to charitable causes.
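For reference, the two classical corrections named above can be sketched in a few lines. This is only an illustration of Bonferroni and Holm adjusted p-values, not the bootstrap-based procedure developed in the paper; the function names and the four p-values are hypothetical.

```python
import numpy as np

def bonferroni_adjust(pvals):
    """Bonferroni-adjusted p-values: multiply each p-value by the number of tests."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def holm_adjust(pvals):
    """Holm (1979) stepdown-adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # smallest p-value first
    scaled = p[order] * (m - np.arange(m))      # multipliers m, m-1, ..., 1
    # Running maximum keeps the adjusted values monotone along the ordering.
    stepdown = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = stepdown                  # restore the original ordering
    return adjusted

pvals = [0.010, 0.020, 0.030, 0.005]            # hypothetical unadjusted p-values
bonf = bonferroni_adjust(pvals)
holm = holm_adjust(pvals)
```

In this made-up example, at the 5% level Bonferroni rejects two of the four hypotheses while Holm rejects all four. Neither correction uses any information about the dependence between the test statistics.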

References


  • Anderson, M. (2008). Multiple inference and gender differences in the effects of early intervention: A re-evaluation of the Abecedarian, Perry Preschool, and Early Training projects. Journal of the American Statistical Association, 103(484), 1481–1495.

  • Bettis, R. A. (2012). The search for asterisks: Compromised statistical tests and flawed theories. Strategic Management Journal, 33(1), 108–113.

  • Bhattacharya, J., Shaikh, A. M., & Vytlacil, E. (2012). Treatment effect bounds: An application to Swan-Ganz catheterization. Journal of Econometrics, 168(2), 223–243.

  • Bonferroni, C. E. (1935). Il calcolo delle assicurazioni su gruppi di teste. Rome: Tipografia del Senato.

  • Bugni, F., Canay, I., & Shaikh, A. (2015). Inference under covariate-adaptive randomization. Technical report, cemmap working paper, Centre for Microdata Methods and Practice.

  • Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., et al. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436.

  • Fink, G., McConnell, M., & Vollmer, S. (2014). Testing for heterogeneous treatment effects in experimental data: False discovery risks and correction procedures. Journal of Development Effectiveness, 6(1), 44–57.

  • Flory, J. A., Gneezy, U., Leonard, K. L., & List, J. A. (2015a). Gender, age, and competition: The disappearing gap. Unpublished Manuscript.

  • Flory, J. A., Leibbrandt, A., & List, J. A. (2015b). Do competitive workplaces deter female workers? A large-scale natural field experiment on job-entry decisions. The Review of Economic Studies, 82(1), 122–155.

  • Gneezy, U., Niederle, M., & Rustichini, A. (2003). Performance in competitive environments: Gender differences. The Quarterly Journal of Economics, 118(3), 1049–1074.

  • Heckman, J., Moon, S. H., Pinto, R., Savelyev, P., & Yavitz, A. (2010). Analyzing social experiments as implemented: A reexamination of the evidence from the HighScope Perry Preschool Program. Quantitative Economics, 1(1), 1–46.

  • Heckman, J. J., Pinto, R., Shaikh, A. M., & Yavitz, A. (2011). Inference with imperfect randomization: The case of the Perry Preschool Program. National Bureau of Economic Research Working Paper w16935.

  • Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65–70.

  • Hossain, T., & List, J. A. (2012). The behavioralist visits the factory: Increasing productivity using simple framing manipulations. Management Science, 58(12), 2151–2167.

  • Ioannidis, J. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.

  • Jennions, M. D., & Moller, A. P. (2002). Publication bias in ecology and evolution: An empirical assessment using the ‘trim and fill’ method. Biological Reviews of the Cambridge Philosophical Society, 77(02), 211–222.

  • Karlan, D., & List, J. A. (2007). Does price matter in charitable giving? Evidence from a large-scale natural field experiment. The American Economic Review, 97(5), 1774–1793.

  • Kling, J., Liebman, J., & Katz, L. (2007). Experimental analysis of neighborhood effects. Econometrica, 75(1), 83–119.

  • Lee, S., & Shaikh, A. M. (2014). Multiple testing and heterogeneous treatment effects: Re-evaluating the effect of Progresa on school enrollment. Journal of Applied Econometrics, 29(4), 612–626.

  • Lehmann, E., & Romano, J. (2005). Generalizations of the familywise error rate. The Annals of Statistics, 33(3), 1138–1154.

  • Lehmann, E. L., & Romano, J. P. (2006). Testing statistical hypotheses. Berlin: Springer.

  • Levitt, S. D., List, J. A., Neckermann, S., & Sadoff, S. (2012). The behavioralist goes to school: Leveraging behavioral economics to improve educational performance. National Bureau of Economic Research w18165.

  • List, J. A., & Samek, A. S. (2015). The behavioralist as nutritionist: Leveraging behavioral economics to improve child food choice and consumption. Journal of Health Economics, 39, 135–146.

  • Machado, C., Shaikh, A., Vytlacil, E., & Lunch, C. (2013). Instrumental variables, and the sign of the average treatment effect. Unpublished Manuscript, Getulio Vargas Foundation, University of Chicago, and New York University. [2049].

  • Maniadis, Z., Tufano, F., & List, J. A. (2014). One swallow doesn’t make a summer: New evidence on anchoring effects. The American Economic Review, 104(1), 277–290.

  • Niederle, M., & Vesterlund, L. (2007). Do women shy away from competition? Do men compete too much? The Quarterly Journal of Economics, 122(3), 1067–1101.

  • Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631.

  • Romano, J. P., & Shaikh, A. M. (2006a). On stepdown control of the false discovery proportion. In Lecture Notes-Monograph Series (pp. 33–50).

  • Romano, J. P., & Shaikh, A. M. (2006b). Stepup procedures for control of generalizations of the familywise error rate. The Annals of Statistics, 34, 1850–1873.

  • Romano, J. P., & Shaikh, A. M. (2012). On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6), 2798–2822.

  • Romano, J. P., Shaikh, A. M., & Wolf, M. (2008a). Control of the false discovery rate under dependence using the bootstrap and subsampling. Test, 17(3), 417–442.

  • Romano, J. P., Shaikh, A. M., & Wolf, M. (2008b). Formalized data snooping based on generalized error rates. Econometric Theory, 24(02), 404–447.

  • Romano, J. P., & Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4), 1237–1282.

  • Romano, J. P., & Wolf, M. (2010). Balanced control of generalized error rates. The Annals of Statistics, 38, 598–633.

  • Sutter, M., & Glätzle-Rützler, D. (2014). Gender differences in the willingness to compete emerge early in life and persist. Management Science, 61(10), 2339–2354.

  • Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p value adjustment (Vol. 279). New York: Wiley.



We would like to thank Joseph P. Romano for helpful comments on this paper. We also thank Joseph Seidel for his excellent research assistance. The research of the second author was supported by National Science Foundation Grants DMS-1308260, SES-1227091, and SES-1530661.

Author information

Corresponding author

Correspondence to Yang Xu.

Additional information

Documentation of our procedures and our Stata and Matlab code can be found at



1.1 Proof of Theorem 3.1

First note that under Assumption 2.1, \(Q\in \omega _{s}\) if and only if \(P\in {\tilde{\omega }}_{s}\), where

$${\tilde{\omega }}_{s}=\{P(Q):Q\in \varOmega ,E_{P}[Y_{i,k}|D_{i}=d,Z_{i}=z]=E_{P}[Y_{i,k}|D_{i}=d',Z_{i}=z]\}.$$

The proof of this result now follows by verifying the conditions of Corollary 5.1 in Romano and Wolf (2010). In particular, we verify Assumptions B.1–B.4 in Romano and Wolf (2010).

In order to verify Assumption B.1 in Romano and Wolf (2010), let

$$T_{s,n}^{*}(P)=\sqrt{n}\left( \frac{1}{n_{d,z}}\sum _{1\le i\le n:D_{i}=d,Z_{i}=z}(Y_{i,k}-{\tilde{\mu }}_{k|d,z}(P))-\frac{1}{n_{d',z}}\sum _{1\le i\le n:D_{i}=d',Z_{i}=z}(Y_{i,k}-{\tilde{\mu }}_{k|d',z}(P))\right),$$

and note that

$$T_{n}^{*}(P)=(T_{s,n}^{*}(P):s\in {\mathcal {S}})=f(A_{n}(P),B_{n}),$$

where

$$A_{n}(P)=\frac{1}{\sqrt{n}}\sum _{1\le i\le n}A_{n,i}(P),$$

with \(A_{n,i}(P)\) equal to the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

$$\left( \begin{array}{c} (Y_{i,k}-{\tilde{\mu }}_{k|d,z}(P))I\{D_{i}=d,Z_{i}=z\}\\ (Y_{i,k}-{\tilde{\mu }}_{k|d',z}(P))I\{D_{i}=d',Z_{i}=z\} \end{array}\right), \tag{10}$$

and \(B_{n}\) is the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

$$\left( \begin{array}{c} \frac{1}{\frac{1}{n}\sum _{1\le i\le n}I\{D_{i}=d,Z_{i}=z\}}\\ -\frac{1}{\frac{1}{n}\sum _{1\le i\le n}I\{D_{i}=d',Z_{i}=z\}} \end{array}\right), \tag{11}$$

and \(f:{\mathbf {R}}^{2|{\mathcal {S}}|}\times {\mathbf {R}}^{2|{\mathcal {S}}|}\rightarrow {\mathbf {R}}^{|{\mathcal {S}}|}\) is the function of \(A_{n}(P)\) and \(B_{n}\) whose sth component for \(s\in {\mathcal {S}}\) is given by the inner product of the sth pair of terms in \(A_{n}(P)\) and the sth pair of terms in \(B_{n}\), i.e., the inner product of (10) and (11). The weak law of large numbers and continuous mapping theorem imply that

$$B_{n}{\mathop {\rightarrow }\limits ^{P}}B(P),$$

where B(P) is the \(2|{\mathcal {S}}|\)-dimensional vector formed by stacking vertically for \(s\in {\mathcal {S}}\) the terms

$$\left( \begin{array}{c} \frac{1}{P\{D_{i}=d,Z_{i}=z\}}\\ -\frac{1}{P\{D_{i}=d',Z_{i}=z\}} \end{array}\right).$$
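The decomposition above can be checked numerically for a single hypothesis. The following is a minimal sketch on simulated data: the data-generating process, the coding of \(d\) as 1 and \(d'\) as 0, the choice of subgroup, and all variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
D = rng.integers(0, 2, size=n)            # hypothetical treatment status: d = 1, d' = 0
Z = rng.integers(0, 2, size=n)            # hypothetical subgroup indicator
Y = 1.0 + 0.5 * D + rng.normal(size=n)    # hypothetical outcome

z = 1
mu_d, mu_dp = 1.5, 1.0                    # conditional means implied by the simulation design

in_d = (D == 1) & (Z == z)                # I{D_i = d,  Z_i = z}
in_dp = (D == 0) & (Z == z)               # I{D_i = d', Z_i = z}

# Direct form: sqrt(n) times the difference of centred cell averages.
T_direct = np.sqrt(n) * ((Y[in_d] - mu_d).mean() - (Y[in_dp] - mu_dp).mean())

# A_n(P): centred outcomes times cell indicators, summed and scaled by 1/sqrt(n).
A_n = np.array([((Y - mu_d) * in_d).sum(), ((Y - mu_dp) * in_dp).sum()]) / np.sqrt(n)
# B_n: signed reciprocals of the empirical cell frequencies.
B_n = np.array([1.0 / in_d.mean(), -1.0 / in_dp.mean()])

# The inner product of the two pairs reproduces the statistic.
T_product = A_n @ B_n
```

The two forms agree up to floating-point error because each entry of `B_n` turns a sum scaled by \(1/\sqrt{n}\) into \(\sqrt{n}\) times a cell average.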

Next, note that \(E_{P}[A_{n,i}(P)]=0\). Assumption 2.3 and the central limit theorem therefore imply that

$$A_{n}(P){\mathop {\rightarrow }\limits ^{d}}N(0,V_{A}(P))$$

for an appropriate choice of \(V_{A}(P)\). In particular, the diagonal elements of \(V_{A}(P)\) are of the form

$${\tilde{\sigma }}_{k|d,z}^{2}(P)P\{D_{i}=d,Z_{i}=z\}.$$

The continuous mapping theorem thus implies that

$$T_{n}^{*}(P){\mathop {\rightarrow }\limits ^{d}}N(0,V(P))$$

for an appropriate variance matrix V(P). In particular, the sth diagonal element of V(P) is given by

$$\frac{{\tilde{\sigma }}_{k|d,z}^{2}(P)}{P\{D_{i}=d,Z_{i}=z\}}+\frac{{\tilde{\sigma }}_{k|d',z}^{2}(P)}{P\{D_{i}=d',Z_{i}=z\}}. \tag{12}$$
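
To see where this diagonal element comes from, here is a sketch for a single \(s\). It assumes, as the notation suggests, that \({\tilde{\mu }}_{k|d,z}(P)\) and \({\tilde{\sigma }}_{k|d,z}^{2}(P)\) denote the mean and variance of \(Y_{i,k}\) conditional on \(D_{i}=d,Z_{i}=z\). Write \(a_{1},a_{2}\) for the two entries of the sth pair in \(A_{n,i}(P)\) and \(b_{1},b_{2}\) for the two entries of the sth pair in \(B(P)\); these four symbols are introduced here only for illustration. Because the events \(\{D_{i}=d,Z_{i}=z\}\) and \(\{D_{i}=d',Z_{i}=z\}\) are disjoint when \(d\ne d'\), we have \(a_{1}a_{2}=0\), and since both entries have mean zero they are uncorrelated. Hence the limiting variance of the sth component is

$$b_{1}^{2}\,\text {Var}_{P}[a_{1}]+b_{2}^{2}\,\text {Var}_{P}[a_{2}]=\frac{{\tilde{\sigma }}_{k|d,z}^{2}(P)P\{D_{i}=d,Z_{i}=z\}}{P\{D_{i}=d,Z_{i}=z\}^{2}}+\frac{{\tilde{\sigma }}_{k|d',z}^{2}(P)P\{D_{i}=d',Z_{i}=z\}}{P\{D_{i}=d',Z_{i}=z\}^{2}},$$

which simplifies to the sum of the two ratios \({\tilde{\sigma }}_{k|d,z}^{2}(P)/P\{D_{i}=d,Z_{i}=z\}\) and \({\tilde{\sigma }}_{k|d',z}^{2}(P)/P\{D_{i}=d',Z_{i}=z\}\).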

In order to verify Assumptions B.2–B.3 in Romano and Wolf (2010), it suffices to note that (12) is strictly greater than zero under our assumptions. Note that it is not required that V(P) be non-singular for these assumptions to be satisfied.

In order to verify Assumption B.4 in Romano and Wolf (2010), we first argue that

$$T_{n}^{*}(P_{n}){\mathop {\rightarrow }\limits ^{d}}N(0,V(P)) \tag{13}$$

under \(P_{n}\) for an appropriate sequence of distributions \(P_{n}\) for \((Y_{i},D_{i},Z_{i})\). To this end, assume that

  (a) \(P_{n}{\mathop {\rightarrow }\limits ^{d}}P\).

  (b) \({\tilde{\mu }}_{k|d,z}(P_{n})\rightarrow {\tilde{\mu }}_{k|d,z}(P)\).

  (c) \(B_{n}{\mathop {\rightarrow }\limits ^{P_{n}}}B(P)\).

  (d) \(\text {Var}_{P_{n}}[A_{n,i}(P_{n})]\rightarrow \text {Var}_{P}[A_{n,i}(P)]\).

Under (a) and (b), it follows that \(A_{n,i}(P_{n}){\mathop {\rightarrow }\limits ^{d}}A_{n,i}(P)\) under \(P_{n}\). By arguing as in Theorem 15.4.3 in Lehmann and Romano (2006) and using (d), it follows from the Lindeberg–Feller central limit theorem that

$$A_{n}(P_{n}){\mathop {\rightarrow }\limits ^{d}}N(0,V_{A}(P))$$

under \(P_{n}\). It thus follows from (c) and the continuous mapping theorem that (13) holds under \(P_{n}\). Assumption B.4 in Romano and Wolf (2010) now follows simply by noting that the Glivenko–Cantelli theorem, strong law of large numbers and continuous mapping theorem ensure that \({\hat{P}}_{n}\) satisfies (a)–(d) with probability one under P.
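
The last step says, informally, that the centred statistic behaves under the empirical distribution as it does under \(P\). The simulation below is a rough sketch of that idea for one hypothesis with no subgroups: it resamples from the empirical distribution, recentres at the estimated difference in means, and compares the bootstrap variance with a plug-in version of the limiting variance. The data-generating process, sample sizes, and names are assumptions for illustration, and this is not the authors' stepwise procedure.

```python
import numpy as np

def centred_stat(Y, D, center):
    """sqrt(n) * (difference in treated/control means minus `center`)."""
    n = Y.size
    return np.sqrt(n) * (Y[D == 1].mean() - Y[D == 0].mean() - center)

rng = np.random.default_rng(1)
n = 2000
D = rng.integers(0, 2, size=n)                    # hypothetical treatment assignment
Y = 0.3 * D + rng.normal(scale=1.0 + D, size=n)   # hypothetical outcome, unequal variances

diff_hat = Y[D == 1].mean() - Y[D == 0].mean()

# Resample (Y_i, D_i) pairs from the empirical distribution and recentre
# at the full-sample estimate, mimicking the centred statistic under P-hat.
n_boot = 2000
boot = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)
    boot[b] = centred_stat(Y[idx], D[idx], diff_hat)

# Plug-in analogue of the limiting variance: sigma^2 / P for each arm, summed.
plug_in = Y[D == 1].var() / (D == 1).mean() + Y[D == 0].var() / (D == 0).mean()
```

With these settings the bootstrap variance `boot.var()` should land close to `plug_in`, which is the pattern the limiting result predicts.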

Table 1 Multiple outcomes
Table 2 Multiple subgroups
Table 3 Multiple treatments (Comparing multiple treatments with a control)
Table 4 Multiple treatments (All pairwise comparisons across multiple treatments and a control)
Table 5 Multiple outcomes, subgroups, and treatments

Cite this article

List, J.A., Shaikh, A.M. & Xu, Y. Multiple hypothesis testing in experimental economics. Exp Econ 22, 773–793 (2019).
