Simulating power of economic experiments: the powerBBK package
Abstract
In this article, we highlight how simulation methods can be used to analyze the power of economic experiments. We provide the powerBBK package, programmed for experimental economists, which can be used to perform these simulations in STATA. Power can be simulated using a single command line for various statistical tests (nonparametric and parametric), estimation methods (linear, binary, and censored regression models), treatment variables (binary, continuous, time-invariant or time-varying), sample sizes, experimental periods, and other design features (within- or between-subjects design). The package can be used to predict the minimum sample size required to reach a user-specified level of power, to maximize the power of a design given a researcher-supplied budget constraint, or to compute the power to detect a user-specified treatment order effect in within-subjects designs. The package can also be used to compute the probability of sign errors: the probability of rejecting the null hypothesis in the wrong direction, as well as the share of rejections pointing in the wrong direction. The powerBBK package is provided as an .ado file along with a help file, both of which can be downloaded here (http://www.bbktools.org).
Keywords
Statistical power · Simulation methods · Economic experiments
JEL Codes
C8 · C9 · D03
1 Introduction
Underpowered experimental designs can have important consequences for the representativeness of published experimental research (Fanelli and Ioannidis 2013). In particular, low power may result in publication bias if papers failing to detect a significant treatment effect face a lower acceptance probability in academic journals (Button et al. 2013; Nosek et al. 2012). This, in turn, may discourage researchers from even submitting papers reporting insignificant treatment effects. Moreover, underpowered experimental designs can also generate significant treatment effects in the wrong direction (sign errors; see Gelman and Carlin 2014). These studies suggest that significant treatment effects in underpowered studies provide little information about the true treatment effects.
Researchers planning an experimental study have to decide, among other things, about the number of treatment variations, the number of subjects to recruit, the number of experimental periods, and whether to conduct a within- or between-subjects design. All these decisions require a careful balancing of the chance of finding an existing effect, the precision with which this effect can be measured, and the available research budget. Statistical power computations using closed-form expressions are typically derived for simple statistical models and tests and tend to be valid only under very specific conditions (e.g., large sample sizes and normally distributed errors).^{1} Simulation methods, on the other hand, have the advantage of approximating the statistical power of experimental designs under relatively general assumptions about the distribution of model unobservables and experimental design configurations.
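To see what the closed-form route delivers, and why it is restrictive, the sketch below evaluates the textbook power formula for a two-sided two-sample z-test under exactly those specific conditions (normal errors, common variance, equal group sizes). This is a plain Python illustration with scipy; the function name and parameters are ours, not part of powerBBK.

```python
from scipy.stats import norm

def two_sample_power(effect, sigma, n_per_group, alpha=0.05):
    """Closed-form power of a two-sided two-sample z-test, valid only
    under normal errors with common standard deviation `sigma` and
    equal group sizes."""
    se = sigma * (2.0 / n_per_group) ** 0.5   # standard error of the mean difference
    z_crit = norm.ppf(1 - alpha / 2)          # two-sided critical value
    ncp = effect / se                         # standardized (noncentral) shift
    # rejection probability in either tail
    return norm.cdf(-z_crit + ncp) + norm.cdf(-z_crit - ncp)

# e.g., two_sample_power(0.1, 0.3, 100) for an effect of 0.1,
# error sd 0.3, and 100 subjects per group
```

Simulation replaces this formula with replicated draws from the data-generating process, which is what allows arbitrary error distributions and design features.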
In this paper, we illustrate how to simulate the power of economic experiments and provide the powerBBK package, which can be used to perform the simulations in STATA. Power can be simulated for various statistical tests (nonparametric and parametric), estimation methods (linear, binary, and censored regression models), treatment variables (binary or continuous), sample sizes, experimental periods, and other design features (within- or between-subjects designs). The powerBBK package can be used to achieve different objectives. It can be used to maximize ex ante statistical power of a given design subject to a user-specified budget constraint, taking into account the treatment-specific costs. The package can alternatively be used to simulate the minimal sample size necessary to reach a user-specified level of statistical power. It can also be used to compute the statistical power of a particular design. In doing so, users have the option to predict the probability of detecting a user-specified treatment order effect in the context of within-subjects designs, and the probability of sign errors: the probability of rejecting the null hypothesis in the wrong direction, as well as the share of rejections pointing in the wrong direction. Finally, the package can be used to conduct ex post power analyses of published results to evaluate their credibility and to get (posterior) estimates of plausible effect sizes. In all cases, powerBBK requires that users enter a single command line specifying the desired options and parameters necessary to conduct the simulations.
Other software programs and packages currently available to conduct power analyses are presented in Table 1 (#1–15) along with the current package (#16). They are offered as stand-alone programs, web-based applications, or STATA/SAS modules, and are either free of charge or offered for purchase with free trials. Some of these programs (#1–5, 7, 10) are adapted to the needs of particular fields, such as psychology, health, epidemiology, biology, and education, while others (#1, 2, 5, 6, 8, 11–15) target a general audience. However, no package currently addresses the special needs of economists. Most of the programs (#1–15) rely on asymptotic approximations, and none implements simulation-based methods adapted to the needs of (experimental) economists: none measures power to detect treatment order effects, compares the power of within- and between-subjects designs with multiple periods, proposes an optimal allocation of subjects to treatment and control within a given budget, or allows for a continuous treatment variable.^{2}
Table 1 Statistical programs and packages for performing power analysis and/or calculating optimal sample sizes
Software/package  Version  Primary audience  

Standalone software—free  
1  G*Power  3.1.9.2 (2014)  Psychologists and general 
2  MorePower  6.0.1 (2013)  Psychologists and general 
3  Optimal design  3.01 (2011)  Social scientists 
4  PEPI  11.62 (2016)  Epidemiologists 
5  PS: power and sample size calculation  3.1.2 (2014)  Biologists and general 
Standalone software—for purchase  
Prices for academic use: approx. $800, or annual subscription approx. $400; 7- or 30-day free trials available  
6  PASS (power analysis and sample size software)  14  General 
7  Power and precision  2  Health 
8  Systat  13.1  General 
Online—free  
9  Power and sample size (very basic)  General  
SAS/STAT: addons—free or included  
10  UnifyPow (discontinued, followed by 11, 12)  2002.08.17a  Biologists, epidemiologists, health 
11  PROC POWER (as of SAS 9.1)  SAS/STAT(R) 13.2  General 
12  GLMPOWER (as of SAS 9.1)  SAS/STAT(R) 13.2  General 
STATA: addons—included  
13  sampsi (discontinued, followed by 15)  STATA  General 
14  stpower (discontinued, followed by 15)  STATA  Survival analysis 
15  power (as of STATA 13)  STATA 14  General 
The module of this article: STATA addon—free  
16  powerBBK  STATA 14 (2016)  Economists and general 
2 Power computation using powerBBK
Consider the latent outcome of subject \(i\) in period \(t\),
\(y_{it}^* = \beta _0 + \beta _{1,t}d_{it} + \mu _i + \epsilon _{it}, \quad i=1,\ldots ,N,\ t=1,\ldots ,T,\)     (1)
where \(d_{it}\) denotes the treatment variable, \(\beta _{1,t}\) the period \(t\) treatment effect (the elements of \(\varvec{\beta }_1\)), \(\mu _i\) an individual-specific effect with distribution \(F_\mu \), and \(\epsilon _{it}\) an error term whose distribution \(F_{\epsilon \mathbf {d}}\) may depend on the treatment received. The observed outcome \(y_{it}\) is linked to \(y_{it}^*\) in one of three ways.

Case 1. Linear model: \(y_{it} = y_{it}^*.\)

Case 2. Binary choice model: \(y_{it} = 1 \text { if } y_{it}^* \ge 0\), and 0 otherwise.

Case 3. Model with censoring from below at \(a\): \(y_{it} = \max (a,y_{it}^*).\)
The data-generating process described above is relatively flexible in terms of the type of outcome distributions it can capture. This is especially true for Case 1. The package currently does not support other discrete outcomes, notably multinomial choices or ordered responses. The powerBBK package is free and open-source, allowing users to extend it to suit their needs.
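To make the three observation rules concrete, the following sketch draws outcomes under each case. It is our own Python illustration, not package code: it assumes a time-invariant binary between-subjects treatment and normal errors, one admissible configuration among those the package supports.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_outcomes(case, N, T, beta0, beta1, sigma_mu, sigma_eps, a=0.0):
    """Draw outcomes y_it from the latent model
    y*_it = beta0 + beta1 * d_it + mu_i + eps_it
    under the three observation rules.  Here d_it assigns the first
    half of subjects to treatment, and both error terms are normal."""
    d = np.zeros((N, T)); d[: N // 2, :] = 1.0        # treated subjects
    mu = rng.normal(0.0, sigma_mu, size=(N, 1))       # individual heterogeneity
    eps = rng.normal(0.0, sigma_eps, size=(N, T))     # idiosyncratic noise
    y_star = beta0 + beta1 * d + mu + eps
    if case == 1:                                     # linear: y = y*
        return y_star
    if case == 2:                                     # binary: y = 1{y* >= 0}
        return (y_star >= 0).astype(float)
    if case == 3:                                     # censored from below at a
        return np.maximum(a, y_star)
    raise ValueError("case must be 1, 2 or 3")
```

The same latent draw thus produces continuous, binary, or censored data depending only on the observation rule.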
The powerBBK package requires the user to specify details of the experimental design, such as the number of subjects, the number of periods, WS or BS design, the balance of the WS design, and so on. There are options to evaluate statistical power over a range of values of N and to assess the power of WS and BS designs simultaneously. The user can specify whether or not to include individual heterogeneity by means of random-effects terms (i.e., the variance of \(\mu _i\) is greater than 0) or to include treatment-specific heteroscedasticity (i.e., the variance of \(\epsilon _{it}\) depends on the treatment received). Users can also specify, where appropriate (e.g., in linear models), the distributions of the errors \((F_\mu , F_{\epsilon \mathbf {d}})\) they require for their simulations, thus allowing, for example, heavy-tailed distributions in linear models. The package further permits users to simulate the power of nonparametric rank-based tests and can accommodate several common nonlinear models (i.e., logit, probit, tobit).^{3} Users can also use the package to predict the maximal power a design can reach given a user-specified budget constraint with treatment-specific costs. Additional information and examples are available in the help file provided with the package.
Computing power of a given design is straightforward using the following steps.
Step 1 Fix N and T and, for a given design (WS or BS), values of \((\beta _0,\varvec{\beta }_1)\), and a choice of \((F_\mu , F_{\epsilon \mathbf {d}})\), generate a sample \(\{\{(y_{it},d_{it}):t=1,\ldots ,T\}:i=1,\ldots ,N\}\).
Step 2—parametric Estimate \((\beta _0,\varvec{\beta }_1)\) and the parameters of \((F_\mu , F_{\epsilon \mathbf {d}})\), compute \(\hat{z}_t = \hat{\beta }_{1,t}/se(\hat{\beta }_{1,t})\), and obtain the corresponding p value of the null hypothesis \(H_0:\beta _{1,t}=0\) against either a one-sided or two-sided alternative. Here \(se(\hat{\beta }_{1,t})\) denotes the standard error of the estimated period t treatment effect.^{4}
Step 2—nonparametric Aggregate the individual data over T and use nonparametric rank-based tests (e.g., the Wilcoxon rank-sum test for BS data, the Wilcoxon signed-rank test for WS data) of the null hypothesis that the distribution of the aggregated values of y is the same under control and treatment conditions, and compute the p value of the test.
Step 3 Repeat steps 1 and 2 for a large number of samples. Compute the fraction of p values which are less than the significance level of the test (e.g., 5 %). This represents the power of the test.
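The three steps can be sketched in a few lines. The following is an illustrative Python stand-in for the Stata routine (our own function and parameter names, normal errors, a time-invariant binary BS treatment), not the package's actual code:

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

def simulate_power_bs(N, T, beta1, sigma_mu, sigma_eps,
                      reps=500, alpha=0.05, test="t", seed=0):
    """Steps 1-3 for a between-subjects design with a binary,
    time-invariant treatment and normal error terms."""
    rng = np.random.default_rng(seed)
    rejections = 0
    treated = np.zeros(N, dtype=bool); treated[: N // 2] = True
    for _ in range(reps):
        # Step 1: generate a sample from the data-generating process
        mu = rng.normal(0.0, sigma_mu, size=(N, 1))
        eps = rng.normal(0.0, sigma_eps, size=(N, T))
        y = beta1 * treated[:, None] + mu + eps
        # Step 2: aggregate over T, then test treated vs control outcomes
        ybar = y.mean(axis=1)
        if test == "t":
            p = ttest_ind(ybar[treated], ybar[~treated]).pvalue
        else:   # Wilcoxon rank-sum test (a.k.a. Mann-Whitney U)
            p = mannwhitneyu(ybar[treated], ybar[~treated],
                             alternative="two-sided").pvalue
        rejections += (p < alpha)
    # Step 3: fraction of p values below the significance level
    return rejections / reps
```

Under the null (`beta1=0`) the function returns roughly the nominal significance level; power rises toward one as the effect, the sample size, or the number of periods grows.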
Repeating the three steps above for a range of N and T values for each design enables the researcher to plot power curves for each element of \(\varvec{\beta }_1\). Power curves are useful for comparing designs at a given sample size, for determining the minimal sample size needed to reach a given statistical power separately for each design, and for examining the effects of the number of periods and of how participants are balanced across treatments. The package also offers users the possibility of predicting the maximal power an experimental design can reach given a specified budget constraint. In this case, users are additionally required to specify the expected payoff of a participant in each treatment as well as the total available budget. The package then evaluates the power of a series of user-specified allocations, which easily allows users to determine the allocation that maximizes power. Finally, an issue concerning WS designs is possible treatment order effects. These effects imply that the response depends on whether treatment or control conditions are experienced first. The powerBBK package can be used to predict the probability of detecting a user-specified treatment order effect for a given experimental design. This option is currently only implemented for the time-invariant binary treatment effect model where all elements of \(\varvec{\beta }_1\) are identical.
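The budget-constrained search over allocations can be illustrated as follows. For speed, this sketch scores each feasible allocation with a closed-form normal approximation rather than full simulation; the costs, budget, and function names are hypothetical illustration values, not package defaults.

```python
from scipy.stats import norm

def approx_power(n_treat, n_control, effect, sigma, alpha=0.05):
    """Normal approximation to two-sided test power with possibly
    unequal group sizes (used only to keep the search cheap)."""
    se = sigma * (1.0 / n_treat + 1.0 / n_control) ** 0.5
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(-z + effect / se) + norm.cdf(-z - effect / se)

def best_allocation(budget, cost_treat, cost_control, effect, sigma):
    """Enumerate feasible (n_treat, n_control) allocations within the
    budget and return the one with maximal approximate power."""
    best = (0, 0, 0.0)
    for n_t in range(2, budget // cost_treat + 1):
        n_c = (budget - n_t * cost_treat) // cost_control
        if n_c < 2:
            continue
        p = approx_power(n_t, n_c, effect, sigma)
        if p > best[2]:
            best = (n_t, n_c, p)
    return best

# e.g., a $3000 budget where treated subjects cost $30 (they receive
# the gift) and controls cost $20
n_t, n_c, p = best_allocation(3000, 30, 20, 0.1, 0.3)
```

When treatment is more expensive than control, the power-maximizing allocation assigns fewer subjects to the costlier condition, consistent with the classic square-root-of-cost-ratio rule.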
3 Illustration: gift exchange in the field
We illustrate the power analysis presented in Sect. 2 with an application in the context of field experiments designed to measure reciprocal preferences of workers. The Appendix provides one of the command lines used to perform this analysis. Our analysis exploits data from two different studies in this area. Gneezy and List (2006) use a BS design in the context of a single-day spot labor market experiment with a data entry task. They assign 9 workers to their treatment condition (gift) and 10 workers to the control condition (no gift). They estimate a linear random-effects panel data model (Case 1 in Sect. 2) with individual-specific effects \(\mu _i\) and where t indexes the hour of work within the experimental day. Bellemare and Shearer (2009) use a WS design with 18 workers (tree planters). They test how workers respond to a gift from their employer. Their WS design is unbalanced: workers first planted for 5 days under control conditions (no gift) and then received a gift on the final day of planting on the experimental block. Bellemare and Shearer (2009) estimate a linear fixed-effects panel data model (Case 1) with individual-specific effects \(\mu _i\) and where t indexes the day of work during the experiment. Both studies use roughly the same total number of subjects and time periods, but the notion of time varies across studies.
We first estimated the random-effects panel data model of Eq. (1) using the Gneezy and List data, with the dependent variable being the natural log of productivity. We get \((\widehat{\beta }_0,\widehat{\beta }_1)= (3.674, 0.055)\), \(\widehat{\sigma }^2_\mu = 0.088\), and \(\widehat{\sigma }^2_{\epsilon } = 0.018\). The corresponding estimates using the Bellemare and Shearer data are \((\widehat{\beta }_0,\widehat{\beta }_1)= (6.955, 0.061)\), \(\widehat{\sigma }^2_\mu = 0.046\), and \(\widehat{\sigma }^2_{\epsilon } = 0.018\). The estimated treatment effect (\(\beta _1\)) and estimated error variance \(\sigma ^2_{\epsilon }\) are very similar for both studies. The estimated value of \(\sigma ^2_{\mu }\) (unobserved heterogeneity), on the other hand, is twice as high in the Gneezy and List data.^{5}
We next used the estimated model parameters from both data sets to simulate the power of WS and BS designs for two scenarios: a low-noise and a high-noise scenario. The low-noise scenario sets \(\sigma ^2_{\mu }=0.045\) and \(\sigma ^2_{\epsilon }=0.02\), while the high-noise scenario sets \(\sigma ^2_{\mu }=0.09\) and \(\sigma ^2_{\epsilon }=0.02\). The variance of \(\mu _i\) in the high-noise scenario is thus exactly twice the corresponding value for the low-noise scenario. We consider three values for \(\beta _1\) (0.05, 0.1 and 0.15) for both scenarios. The value of \(\beta _0\) plays no role in our analysis and is set to 7.0 in all our simulations. We also consider setting T to 2 and 6. Setting \(T=6\) proxies the number of periods used in both studies. The case \(T=2\) is interesting because it proxies experiments with very few periods, and hence very few observations per subject, while still allowing a meaningful comparison of WS and BS designs. It also represents a case where researchers have little information to control for the presence of unobserved individual heterogeneity \(\mu _i\). It is straightforward to consider other values of T. We perform a separate power analysis for each scenario for a two-sided test with a 5 % level of significance. We implement the BS design by assigning the same number of subjects to control and treatment conditions. We implement a balanced WS design by assigning subjects to the same number of periods under control and treatment conditions. We also simulated power for an "unbalanced" WS design assigning subjects to the treatment condition for only one out of six periods. Simulated power of the unbalanced WS design was not very different from that of the balanced WS design. This is to be expected, as the variance of the outcome variable is kept constant under control and treatment conditions. We thus focus our analysis on the balanced WS design.
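The qualitative WS/BS comparison in these scenarios can be reproduced with a small simulation. The sketch below is again an illustrative Python stand-in with our own function names and simplifying choices (normal errors, paired and two-sample t-tests on subject-level means), not the package's code:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

def scenario_power(design, N, T, beta1, sigma_mu, sigma_eps,
                   reps=300, alpha=0.05, seed=1):
    """Power of a balanced design with normal errors: BS splits the
    N subjects across conditions; WS splits each subject's T periods."""
    rng = np.random.default_rng(seed)
    rej = 0
    for _ in range(reps):
        mu = rng.normal(0.0, sigma_mu, size=(N, 1))
        eps = rng.normal(0.0, sigma_eps, size=(N, T))
        if design == "BS":
            treated = np.arange(N) < N // 2
            y = beta1 * treated[:, None] + mu + eps
            ybar = y.mean(axis=1)
            p = ttest_ind(ybar[treated], ybar[~treated]).pvalue
        else:  # WS: first T//2 periods control, remaining periods treatment
            d = (np.arange(T) >= T // 2).astype(float)
            y = beta1 * d + mu + eps
            # paired comparison of each subject's treatment and control means
            p = ttest_rel(y[:, T // 2:].mean(axis=1),
                          y[:, : T // 2].mean(axis=1)).pvalue
        rej += (p < alpha)
    return rej / reps
```

Because the within-subject comparison differences out \(\mu _i\), the WS power is unaffected by \(\sigma ^2_{\mu }\), while BS power deteriorates as \(\sigma ^2_{\mu }\) grows, mirroring the pattern reported below.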
Finally, we use the OLS estimator with standard errors clustered at the individual level. All our results are very similar when using the (asymptotically more efficient) GLS estimator.
Figure 1 presents the simulated power curves for the low-noise scenario. Several regularities emerge. First, we find that power is systematically higher for the WS design for all six combinations of \(\beta _1\) and T values used. This result is expected, given that the WS design exploits within-subject variation in decisions for a given individual (for a given level of \(\mu _i\)). This advantage of the WS design over the BS design is well documented (see, e.g., Keren 1993). We also find that increasing the number of periods raises the power of the WS design, but has a relatively minor impact on the power of the BS design. The quantitative differences in power between both designs are perhaps more surprising. A natural way to compare both designs is to compare the minimal number of subjects (MNS) required to reach a given level of power. Social scientists often argue that an experiment should aim to correctly detect a treatment effect 80 % of the time (see Cohen 1988) when using a two-sided test along with a 5 % significance level. Table 2 presents the simulated MNS required to reach this power threshold, derived from the curves in Fig. 1. We find that the MNS exceeds 400 subjects for the BS design for both values of T when \(\beta _1=0.05\). In comparison, the MNS of the WS design is 122 subjects when \(T=2\), and 42 subjects when \(T=6\). As expected, the required MNS decreases with \(\beta _1\). The MNS of the BS design when \(\beta _1=0.1\) are 182 subjects and 162 subjects for 2 and 6 periods, respectively. The corresponding MNS of the WS design are 30 subjects and fewer than 20 subjects, thus 6–8 times smaller than the corresponding MNS of the BS design. Finally, the MNS of the BS design when \(\beta _1=0.15\) are 84 subjects and 74 subjects for 2 and 6 periods, respectively. The corresponding MNS of the WS design are both below 20 subjects, roughly 4 times smaller than those of the BS design.
Figure 2 presents the simulated power curves for the high-noise scenario. Several interesting regularities emerge. First, the power curves of the WS design in the high-noise scenario are very similar to those of the WS design in the low-noise scenario. Power of the BS design, on the other hand, is substantially worse under the high-noise scenario than under the low-noise scenario. These regularities are captured by the corresponding MNS of both designs (see Table 2). We find that the MNS of the WS design in the high-noise scenario are very similar to the corresponding values in the low-noise scenario. The MNS of the BS design, on the other hand, are considerably higher. In particular, we find that the BS design requires between 286 and 302 subjects to detect a value of \(\beta _1=0.1\) with a power of 80 %. This is roughly 120 subjects more (approx. 65 % more) than required in the low-noise scenario. Similarly, we find that the MNS of the BS design lies between 130 and 140 subjects when \(\beta _1=0.15\). This is roughly 60 subjects more (approx. 70 % more) than required in the low-noise scenario. These results suggest that researchers planning to conduct BS design experiments in this area should carefully consider the level of noise they expect to be present in the data.^{6}
In most power analyses, model parameter values are calibrated using data from either existing studies or pilot experiments conducted by the researchers themselves. These values are estimates of the true but unobserved underlying population parameters and are thus subject to sampling variability. Sampling variability is especially relevant when values are calibrated using small data sets. In these cases, researchers may consider repeating their power analysis for a selected range of values of each model parameter. One straightforward approach is to draw parameter values from the sampling distribution of the model parameters and evaluate power for each draw, thus approximating the sampling distribution of the predicted power.
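This approach can be sketched as follows. The sketch assumes a normal sampling distribution for the calibrated effect size and, for speed, evaluates closed-form two-sample z-test power at each draw; the numbers and function names are illustrative, not package defaults.

```python
import numpy as np
from scipy.stats import norm

def power_distribution(beta1_hat, se_beta1, sigma, n_per_group,
                       draws=1000, alpha=0.05, seed=2):
    """Propagate sampling uncertainty in the calibrated effect size:
    draw beta1 from a normal sampling distribution centred at the
    estimate, then evaluate closed-form power at each draw."""
    rng = np.random.default_rng(seed)
    betas = rng.normal(beta1_hat, se_beta1, size=draws)
    se = sigma * (2.0 / n_per_group) ** 0.5    # std. error of the mean difference
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(-z + betas / se) + norm.cdf(-z - betas / se)

# e.g., effect estimate 0.055 with standard error 0.02, error sd 0.3,
# 150 subjects per group
powers = power_distribution(0.055, 0.02, 0.3, 150)
# summarize the induced uncertainty in predicted power
lo, hi = np.percentile(powers, [5, 95])
```

A wide interval between `lo` and `hi` signals that the calibration data are too noisy to pin down the design's power, so the design should be robust across the plausible parameter range.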
4 Conclusion
This paper highlighted the usefulness of simulation methods for power analysis of economic experiments and provided the powerBBK package to perform such analyses, taking into account several common design features and allowing experimental designs to be optimized under budget constraints.
Footnotes
 1.
See List et al. (2011) for results and discussion.
 2.
 3.
In those cases, the distributions \((F_\mu , F_{\epsilon \mathbf {d}})\) are prespecified.
 4.
Note that the estimators used in Step 2 depend on the nature of the outcome variable. The maximum likelihood estimator can be used in all three cases (linear, binary choice, or models with censoring). Linear regression with clustered standard errors can also be used in the case of the linear model. This is a popular choice for experimental economists, as the distributions \((F_\mu , F_{\epsilon \mathbf {d}})\) are often unknown in practice.
 5.
Both studies estimate regression models using the variable \(y_{it}\) in levels; they do not use the natural logarithm of productivity as the dependent variable. Using the natural logarithm of productivity simplifies the comparison of the estimated treatment effects of both studies. Bellemare and Shearer (2009) additionally control for weather effects, while Gneezy and List (2006) allow the effect of the gift to vary across time. Estimated model parameters with those additional controls are very similar to the results we report here.
 6.
We repeated the power analysis using nonparametric rankbased tests (Wilcoxon ranksum tests and Wilcoxon signedrank tests). Results are very similar and available upon request.
References
 Bellemare, C., & Shearer, B. (2009). Gift giving and worker productivity: Evidence from a firm-level experiment. Games and Economic Behavior, 67(1), 233–244.
 Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., et al. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376.
 Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New Jersey: Routledge Academic.
 Dattalo, P. (2009). A review of software for sample size determination. Evaluation & the Health Professions, 32(3), 229–248.
 Fanelli, D., & Ioannidis, J. P. A. (2013). US studies may overestimate effect sizes in softer research. Proceedings of the National Academy of Sciences, 110(37), 15031–15036.
 Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9, 641–651.
 Gneezy, U., & List, J. A. (2006). Putting behavioral economics to work: Testing for gift exchange in labor markets using field experiments. Econometrica, 74(5), 1365–1384.
 Keren, G. (1993). Between- or within-subjects design: A methodological dilemma. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences (Chapter 8). New Jersey: Lawrence Erlbaum Associates.
 List, J., Sadoff, S., & Wagner, M. (2011). So you want to run an experiment, now what? Some simple rules of thumb for optimal experimental design. Experimental Economics, 14(4), 439–457.
 Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific Utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631.
 Peng, C. Y. J., Long, H., & Abaci, S. (2012). Power analysis software for educational researchers. The Journal of Experimental Education, 80(2), 113–136.
 Thomas, L., & Krebs, C. J. (1997). A review of statistical power analysis software. Bulletin of the Ecological Society of America, 78(2), 126–139.