1 Introduction

The last few years have seen a growing interest in agent-based modeling (ABM) and its potential to benefit management and organization studies (Fioretti 2013). As a technique for modeling complex adaptive social systems, ABM has recently been advocated as particularly well suited to team research (Secchi 2015) and, more broadly, to the study of organizational behavior (Secchi and Neumann 2016).

Research on organizational behavior and management is slowly taking ABM into consideration, with the help of a few specialized academic outlets such as Computational and Mathematical Organization Theory and the Journal of Artificial Societies and Social Simulation, and of increasing room at international management conferences such as EURAM and AOM. Given this expanding trend, we believe the behavioral sciences and management studies have a lot to offer to the way computational simulations are performed. We refer to the typical toolkit of the management and organizational behavior researcher, the questions they ask, and the solutions they adopt when conducting a study. ABM simulations can be understood as experiments of a computational nature (Coen 2009a; Hoser 2013), consistent with the experimental designs and methods that flourish in our disciplines.

This article is concerned with an issue that runs in the opposite direction to the articles cited above (i.e. how ABM can be imported into management and organizational behavior studies): it aims at exporting one of the most relevant concerns of experimental methods to ABM. One of the issues that every experimental researcher deals with is statistical power and sample size determination (Cohen 1988, 1992). When performing any computer simulation, a researcher faces the question of how many times the model should be run. As known from the literature (e.g., Liu 2014), this is a problem of sample size determination that is usually addressed by power analysis.

By improving the way ABM research is conducted by means of power analysis, we also improve the tools in the hands of those organizational behavior researchers who have embraced this simulation technique. Moreover, by using statistical power analysis, the organizational researcher may feel more “at home,” being able to apply tests with which he or she is already familiar.

It is clear that statistical power analysis is relevant only to ABM that meet certain conditions. First of all, the models under scrutiny have to be stochastic (see North and Macal 2007, Sect. 2 for a comparison between deterministic and stochastic models). Second, the objective of the model should be the testing of assumptions and not, say, the observation of emergent behavior or the detailed description of phenomena, as is often the case for exploratory ABM. Third, while our discussion of statistical power analysis can be applied to all kinds of tests, in the second part of the paper we focus on the following situation: one or more outcome measures are identified, and the hypothesis to be tested is that the expected outcomes under J configurations of parameters are the same; the expected outcome for each configuration is estimated through the mean over a certain number n of runs.

Although statistical power and Type-I and Type-II errors are well-known topics in statistics, some authors (e.g., Gigerenzer 2004; Friston 2012) highlight that many scholars are confused by the interpretation of some of the key elements and approaches involved. For this reason, we believe we should clarify the methodological backbone of the testing theory we are discussing. Hence, the following section features a description of the classic theory of power, setting the ground for the basic concepts used in this article. We then review ABM studies published between 2010 and 2013 and present calculations of power for all 69 articles (whenever possible). A discussion of findings follows, and we conclude the article with some recommendations for organizational and ABM researchers.

2 The traditional treatment of power

In the following, we consider the Neyman–Pearson framework, introduced by Jerzy Neyman and Egon Sharpe Pearson between the 1920s and the 1930s, in which statistical tests start from the definition of a null hypothesis, \(\mathsf {H}_{0}\), and of an alternative hypothesis, \(\mathsf {H}_{1}\). This article is concerned with parametric models, i.e. statistical models defined through a set of parameters \(\Theta\), called the parameter space. As an example, a normal distribution can be characterised by the mean \(\mu\), which can take any real value, and the standard deviation \(\sigma\), which must be nonnegative. The parameter space is \(\Theta = \left( -\infty ,+\infty \right) \times \left[ 0,+\infty \right)\). The null hypothesis is generally characterized as a subset of the parameter space, say \(\Theta _{0}\). The most common hypotheses correspond to the nullity of a parameter (in the previous example, say, \(\mu = 0\) and \(\Theta _{0} = \left\{ 0\right\} \times \left[ 0,+\infty \right)\)) or of a vector of parameters, or to the equality of some parameters. The alternative hypothesis is given by the values of the parameters that are not in \(\Theta _{0}\); this set is generally called \(\Theta _{1}\) and is defined as the complement of \(\Theta _{0}\) in \(\Theta\) (in the previous example, \(\Theta _{1} = \left( -\infty ,0\right) \times \left[ 0,+\infty \right) \cup \left( 0,+\infty \right) \times \left[ 0,+\infty \right)\)). It should therefore be clear, in what follows, that either the null or the alternative hypothesis is true.

The objective of a statistical test is to help the researcher decide which hypothesis between \(\mathsf {H}_{0}\) and \(\mathsf {H}_{1}\) is true. Usually this is done through a test statistic T and a subset \(\mathcal {A}\) of the range of T, called the acceptance region. In the sample at hand, the test statistic T takes the value t. It is customary to write that, if t does not fall inside the region \(\mathcal {A}\), the test “rejects” the null hypothesis. On the other hand, if t falls inside the region \(\mathcal {A}\), the test “fails to reject” or “does not reject” the null hypothesis. This apparently odd circumlocution is preferred to the more direct “accept.” The widespread and uncontroversial use of this expression in the recent literature conceals a disagreement in earlier theoretical references, as witnessed by the contradiction between the use of “acceptance region” and of “fail to reject.” Indeed, Neyman himself (Neyman 1950, p. 259) agreed with the use of the word “accept.” The dichotomy between acceptance and rejection is coherent with the original purpose of tests in the Neyman–Pearson framework. Neyman, in particular, was extremely clear about the fact that tests should lead to decisions based on acceptance or rejection, and even to action (Neyman 1950, p. 259). Pearson, instead, was more reluctant (Pearson 1955, p. 206). It was Fisher (Fisher 1955, p. 73), in a different approach to testing, who strongly and consistently argued against the use of the word “accept,” and this contributed to creating the terminological confusion. In fact, the prescription that “accept” should not be used inside the Neyman–Pearson approach is yet another part of what Gigerenzer (2004) calls the “null ritual.” However, in this article, we follow Neyman and use “fail to reject” and “accept” interchangeably.

Two types of errors can be committed while testing the hypotheses. If we reject the null hypothesis when it is true, we commit a Type-I error (or false positive). This happens with a rate equal to the probability that T does not belong to \(\mathcal {A}\) when \(\mathsf {H}_{0}\) is true. This probability is usually denoted as \(\alpha\). If we accept the null hypothesis when it is false, we commit a Type-II error (or false negative). This happens with rate \(\beta\), the probability that T belongs to \(\mathcal {A}\) when \(\mathsf {H}_{0}\) is false (and \(\mathsf {H}_{1}\) is true). The two probabilities, \(\alpha\) and \(\beta\), are linked by a trade-off: in order for \(\alpha\) to decrease, one needs to increase \(\mathcal {A}\) (because \(\alpha\) is the probability that T does not belong to \(\mathcal {A}\) under \(\mathsf {H}_{0}\)), and therefore \(\beta\) increases too (because \(\beta\) is the probability that T belongs to \(\mathcal {A}\) under \(\mathsf {H}_{1}\)).Footnote 1 In the classic Neyman–Pearson approach, \(\mathcal {A}\) is chosen in such a way that \(\alpha\) is fixed and small (5 and 1 % are customary values). On the other hand, \(\beta\) is not directly (and rarely indirectly) controlled. However, when the sample size N increases, \(\beta\) generally tends to 0, so that, for N large enough, one can take \(\alpha\) small and hope for \(\beta\) to be not too large.

2.1 An example on the relation between α and β

Consider the test for the nullity of the mean in a normally distributed population with mean μ and variance \(\sigma ^{2}\). The null hypothesis is \(\mathsf {H}_{0}:\mu =0\), the alternative hypothesis is \(\mathsf {H}_{1}:\mu \ne 0\). Suppose we observe a sample \(\left\{ X_{1},\dots ,X_{N}\right\}\) from the population. A test statistic for this hypothesis is

$$\begin{aligned} T=\tfrac{\sum _{i=1}^{N}X_{i}}{\sqrt{\tfrac{N}{N-1}\left( \sum _{i=1}^{N}X_{i}^{2}-\tfrac{1}{N}\left( \sum _{i=1}^{N}X_{i}\right) ^{2}\right) }}. \end{aligned}$$

If \(\mathsf {H}_{0}\) holds true, then the distribution of T is a Student’s t with \(N-1\) degrees of freedom, indicated as \(t_{N-1}\). An acceptance region at level \(\alpha\) will be given by the interval \(\mathcal {A}=\left[ t_{\frac{\alpha }{2},N-1},t_{1-\frac{\alpha }{2},N-1}\right]\) (this is not the only possible one, but it is the most common), where \(t_{\gamma ,n}\) is the \(\gamma\)-quantile of the distribution of a \(t_{n}\) random variable. This region will have a Type-I error rate equal to \(\alpha\) by construction. Suppose now that \(\mathsf {H}_{0}\) does not hold true, i.e. \(\mu \ne 0\) and \(\mathsf {H}_{1}\) holds. Then, T will be distributed as a noncentral Student’s t with \(N-1\) degrees of freedom and noncentrality parameter \(\lambda _{N}=\sqrt{N}\mu /\sigma\), indicated as \(t_{N-1}\left( \lambda _N\right)\). The Type-II error rate \(\beta\) is:

$$\begin{aligned} \beta &= \mathbb {P}\left\{ t_{N-1}\left( \lambda _{N}\right) \in \left[ t_{\frac{\alpha }{2},N-1},t_{1-\frac{\alpha }{2},N-1}\right] \right\} \\ &= \mathbb {P}\left\{ t_{\frac{\alpha }{2},N-1}\le t_{N-1}\left( \lambda _{N}\right) \le t_{1-\frac{\alpha }{2},N-1}\right\} . \end{aligned}$$

One can verify the properties briefly described above. When \(\alpha\) decreases, \(t_{\frac{\alpha }{2},N-1}\) and \(t_{1-\frac{\alpha }{2},N-1}\) move farther from 0, and \(\beta\) increases. When N increases, \(\alpha\) stays constant while \(\beta\) goes to 0, as \(\lambda _{N}\) moves farther from 0.
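These properties can also be checked numerically. The following sketch (in Python with SciPy; the helper name and the values of \(\alpha\), N and \(\mu /\sigma\) are our own illustrative assumptions, not part of the original example) computes \(\beta\) for the two-sided one-sample t test from the noncentral t distribution.

```python
# A minimal numerical check of the alpha-beta trade-off for the one-sample t test.
# Assumes SciPy is available; values of alpha, N and mu/sigma are illustrative only.
from scipy import stats

def type_two_error(alpha, n_obs, effect):
    """beta for the two-sided one-sample t test.

    effect = mu / sigma; under H1 the t statistic follows a noncentral t
    with n_obs - 1 degrees of freedom and noncentrality sqrt(n_obs) * effect.
    """
    df = n_obs - 1
    lo = stats.t.ppf(alpha / 2, df)        # lower bound of the acceptance region
    hi = stats.t.ppf(1 - alpha / 2, df)    # upper bound of the acceptance region
    ncp = n_obs ** 0.5 * effect            # noncentrality parameter lambda_N
    return stats.nct.cdf(hi, df, ncp) - stats.nct.cdf(lo, df, ncp)

if __name__ == "__main__":
    # Decreasing alpha (wider acceptance region) raises beta...
    for alpha in (0.10, 0.05, 0.01):
        print(f"N=30, mu/sigma=0.5, alpha={alpha}: beta={type_two_error(alpha, 30, 0.5):.3f}")
    # ...while increasing N drives beta towards 0 for fixed alpha.
    for n_obs in (10, 30, 100):
        print(f"alpha=0.05, mu/sigma=0.5, N={n_obs}: beta={type_two_error(0.05, n_obs, 0.5):.3f}")
```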

2.2 The power of a test

The power of a statistical test is the probability that it correctly rejects a false null hypothesis, namely one minus \(\beta\). The previous reasoning leads one to wonder which levels of power should be considered acceptable. The value that seems to be most commonly accepted is 80 %, as it appears in several places in Cohen (1992) and also in Lehr (1992). Studies with lower values are often seen as underpowered. This value, corresponding to \(\beta =20\) %, does not seem very high, especially when \(\alpha =5\) % or even 1 %. The profound asymmetry between the value required for \(\alpha\) and the value deemed acceptable for \(\beta\) calls for a clarification.Footnote 2 We provide, in what follows, three different possible explanations, not necessarily alternative to each other.

First, it has been repeatedly indicated in the literature (Sedlmeier and Gigerenzer 1989; Hallahan and Rosenthal 1996; Cohen 1992) that many researchers do not perform formal power analysis but rely on sample size as an indicator of error. Indeed, when the sample size N increases, even if \(\alpha\) is fixed, \(\beta\) tends to decrease to 0. Together with the increase in precision, this is probably the most important reason why a large sample size is usually considered positively. Therefore, a researcher may feel exempt from power considerations if he or she believes that the sample size is sufficiently large. However, we show below that this false sense of security often leads to underpowered studies. (We will see more on the role of sample size later.)

Second, suppose that we have two hypotheses, say \(\mathsf {H}^{\prime }\) and \(\mathsf {H}^{\prime \prime }\), such that either one or the other is true but both of them cannot be true at the same time. Since, in the Neyman–Pearson approach, \(\alpha\) is controllable while \(\beta\) is not completely so, it is often reasonable to choose as \(\mathsf {H}_{0}\) the hypothesis, say \(\mathsf {H}^{\prime }\), whose rejection is considered more serious. Indeed, if \(\mathsf {H}^{\prime \prime }\) were chosen as the null, the rejection of \(\mathsf {H}^{\prime }\) when true (which would now correspond to a Type-II error rate \(\beta\)) would be out of control. This is related to the reason behind Cohen’s choice (Cohen 1988, Sect. 2.4) of \(\beta =0.20\) when \(\alpha =0.05\) (see also Lakens 2013). Indeed, he explicitly states that the ratio \(\beta /\alpha\) should be near 4 when Type-I errors are about four times as serious as Type-II errors.

Third, the different emphasis on \(\alpha\) and \(\beta\) is largely due to a misunderstanding between the Neyman–Pearson and Fisher approaches that is made particularly clear in Royall (1997, pp. 109–110). In the Fisher approach, the computation of the test statistic does not lead to any decision but to the determination of the p value: this is the probability, under the null hypothesis, of obtaining values that are as extreme as, or more extreme than, the one that is observed, i.e. t. A small p value is considered evidence against the null hypothesis because, if the null is true, it is difficult to suppose that chance alone would lead to such a small probability of observing a sample as extreme as the one we have observed. Although Fisher himself took a position against this choice (Fisher 1956, p. 42),Footnote 3 the extraordinariness of the p value is often evaluated by comparing it with \(\alpha\) (if smaller, rejection ensues). Therefore \(\alpha\) takes on the further role of a gauge of extraordinariness, and it is smaller than \(\beta\) because of the double meaning with which it is charged, i.e. error rate in a Neyman–Pearson framework and p value threshold in a Fisher framework.

This is complicated by the fact that the alternative hypothesis is generally composed of more than one possible value for the parameters. This means that for any possible value of the parameters respecting the alternative hypothesis \(\mathsf {H}_{1}\) it is possible to define a different value of \(\beta\) and power. However, it is generally the case that the parameters enter into the power function through a single number ES, called effect size and identified with \(\mu /\sigma\) in the previous section. There are two kinds of power analysis that can be performed involving ES. The first one is called a priori power analysis and is generally used, before data is collected, to evaluate the sample size needed to obtain a certain value of \(\beta\) for fixed \(\alpha\) under a hypothesized value for ES (see the sketch below). This procedure is customarily performed by imputing a value to ES on the basis of the evidence collected in previous similar studies (see Lenth 2001, Sect. 2; Lakens 2013). Cohen (1992) has compiled tables of ES values indicated as small, medium and large that can be used for this task. This kind of power analysis is universally considered an important and statistically sound tool (although the use of “canned” effect sizes has been criticized, see Lenth 2001, Sect. 6).
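As a minimal illustration of a priori power analysis in the setting of Sect. 2.1, the sketch below (Python with SciPy; the function names and the chosen ES, \(\alpha\) and target power are illustrative assumptions) searches for the smallest N at which the two-sided one-sample t test reaches a given power under a hypothesized ES \(=\mu /\sigma\).

```python
# A priori power analysis for the one-sample t test: a minimal sketch.
# The target power, alpha and hypothesized effect size below are illustrative.
from scipy import stats

def power_t(n_obs, effect, alpha):
    """Power of the two-sided one-sample t test for ES = mu/sigma."""
    df = n_obs - 1
    lo = stats.t.ppf(alpha / 2, df)
    hi = stats.t.ppf(1 - alpha / 2, df)
    ncp = n_obs ** 0.5 * effect
    return 1 - (stats.nct.cdf(hi, df, ncp) - stats.nct.cdf(lo, df, ncp))

def sample_size_t(effect, alpha=0.05, target_power=0.80, n_max=10**6):
    """Smallest N whose power reaches the target (simple incremental search)."""
    n_obs = 2
    while power_t(n_obs, effect, alpha) < target_power:
        n_obs += 1
        if n_obs > n_max:
            raise ValueError("target power not reached within n_max observations")
    return n_obs

if __name__ == "__main__":
    # With a medium effect size (0.5 in Cohen's d metric), alpha = 0.05 and a
    # target power of 0.80, the search returns an N in the mid-thirties.
    print(sample_size_t(effect=0.5, alpha=0.05, target_power=0.80))
```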

The second is called post hoc power analysis and is performed after estimation in order to obtain an a posteriori estimate of the power. This is achieved using the value of ES in which the parameters are replaced by their estimated values. There is some evidence that this technique has several drawbacks (for further details on the issues related to this analysis, see Korn 1990; Hoenig and Heisey 2001).

2.3 The importance of power for ABM

In this subsection we briefly discuss the role of Type-I and Type-II error rates for ABM as a theoretical preamble to the review performed in the section below. Indeed, in this case, the researcher makes decisions on the parameters of the simulation and on how many times the simulation should be performed. The interest of this class of models often lies in showing whether and how a certain outcome measure varies with the parameters of the simulated model. Agent-based simulations are particularly useful in the social sciences for their ability to model complex adaptive systems (e.g., Miller and Page 2007). This makes “emergence” one of the main features of these models (e.g., Fioretti 2013; Secchi 2015), sometimes as a result of complexity. This points right at the core of the use of power for this class of models, that is, to make sure that simulation results are not mostly driven by random effects.Footnote 4

ABM researchers may be running the simulation for different purposes. Some of them may be interested in the mean values of the outcome variable, some others may focus on extreme values, while other simulations may be descriptive or tied to a particular set of empirical data. In this article, we assume that the researcher considers the average value of the outcome variable to be informative, hence relevant to one’s data analysis. This assumption may exclude some ABM simulations, but we believe most approaches are covered. The modeler will generally identify, for example, a certain number J of parameter configurations to be experimentally tested in a computational simulation. The theoretical expectations of the outcome measure for each of these configurations are given by \(\mu _{j}\), \(j=1,\dots, J\). Several null hypotheses can be tested, but most of them state that some of these means are equal or, in the extreme case, that all of them are equal (\(\mathsf {H}_{0}:\mu _{1}=\dots =\mu _{J}\)). We suppose that the simulation is balanced across configurations so that each mean is estimated through n runs, and the total number of simulations to be run is \(N=n\cdot J\) (a minimal sketch of this setup follows below).
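To fix ideas, the sketch below (Python; the toy “simulation” and all numerical values are illustrative assumptions, not taken from any reviewed model) generates n runs of a stochastic outcome for each of J parameter configurations and tests \(\mathsf {H}_{0}:\mu _{1}=\dots =\mu _{J}\) with a one-way ANOVA.

```python
# Testing H0: mu_1 = ... = mu_J on J parameter configurations with n runs each.
# The "simulation" below is a stand-in for an actual ABM outcome measure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

def run_simulation(config, n_runs):
    """Toy stochastic simulation: returns n_runs values of the outcome measure.

    Here the configuration only shifts the mean outcome slightly; in a real
    ABM this would be one model run per value, under the given parameters.
    """
    true_mean = 10.0 + 0.2 * config          # illustrative effect of the parameters
    return rng.normal(loc=true_mean, scale=2.0, size=n_runs)

J = 4      # number of parameter configurations (groups)
n = 30     # runs per configuration
groups = [run_simulation(config=j, n_runs=n) for j in range(J)]

# One-way ANOVA across the J configurations (N = n * J runs in total).
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```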

In ABM, and more generally in simulations, most of the reasons for which a large \(\beta\) can be tolerated in real-world experimental studies cease to be acceptable. On the one hand, the statistical tests are quite standard and their power analysis is easily performed. On the other hand, both \(\alpha\) and \(\beta\) could be reduced with respect to the values in use in most statistical practice. However, before we can elaborate further on the adequate level of power for ABM research, the next section shows not only that formal power analysis—whether a priori or post hoc—is by no means common in ABM, but also that the values that can be reconstructed from the papers show that most studies are strongly underpowered.

3 ABM and power: a review

Having clarified the theoretical need for statistical power analysis, and before explaining the consequences of ignoring it, we may ask whether ABM research is actually exempt from these issues. One may claim that, for simulation studies, all it takes to avoid Type-II error is to increase the number of runs or, for example, to conduct convergence analysis.

The simplest action to obtain high power, thus having a low probability that the null hypothesis is accepted when it is false, is that of increasing the number of runs. This is equivalent to increasing the number of subjects in an experiment, and it has a strong positive effect on power (Liu 2014). However, given the nature of agent-based simulations, sometimes even one more run can be particularly hard to perform. This is due to the fact that some of these models can be complex. Some advocate a KIDS (“Keep It Descriptive, Stupid”) principle as opposed to the classic KISS (“Keep It Simple, Stupid”) to signal that ABM can be very detailed representations of reality (Edmonds and Moss 2005). These are models that some consider “expensive” (Ritter et al. 2011), because each run may take a significant amount of time to complete. Of course, not all ABM are complex and expensive so, in principle, the strategy of increasing the number of runs may pay off, although it can be difficult to achieve in practice.

The other claim—i.e. performing convergence analysis—tackles a different issue that may affect test power but is not directly related to it. In fact, convergence or sensitivity analyses are usually performed to understand whether a given simulation reaches some sort of equilibrium around a given pattern of results (Robinson 2014). Clearly, this is a very important check to be run on a simulation but, as far as ABM is concerned, it deals with time rather than with runs. In other words, it provides information on when a given configuration of parameters provides meaningful results within a single run. It may help with Type-I error, because if data are of low quality (e.g., they have not reached the above-quoted equilibrium) the test is applied to a set of data intrinsically different from the one that the researcher would like to submit to the test, and the size of the test may be incorrect.

Even if power analysis is deemed unnecessary because Type-II error can supposedly be avoided easily, researchers should have a benchmark, a point of reference. For this reason, we strongly advocate that power analysis should always be performed (Secchi 2014).

Given the importance of power analysis for the social and behavioral sciences (Cohen 1988; Liu 2014), it is not uncommon to find publications documenting its scarce use in empirical studies (e.g., Mone et al. 1996; Sedlmeier and Gigerenzer 1989). To the best of our knowledge, however, a review of power has not been conducted for simulation studies. Given the prominence that ABM is gaining in the social and behavioral sciences, we have reviewed some of the publications featuring a model and calculated power where appropriate and possible. In the following, we describe the method of the review study and comment on its results.

3.1 Methods

The study was conducted on articles published over a limited period of time when ABM-related publications seemed to increase. We considered the four years from 2010 to 2013. Since our interest lies in the management and organizational behavior literature, we screened the two simulation journals that have the closest ties with our discipline (Meyer et al. 2009, 2011): Computational and Mathematical Organization Theory and the Journal of Artificial Societies and Social Simulation. Then, we also screened articles published in a list of top management journals (based on ISI Thompson’s Impact Factor) and decided to include the four presenting ABM-related publications (Table 1): Organization Science, Journal of Management Studies, Strategic Management Journal, and MIS Quarterly. The total sample of articles selected for this study is 69, for a total of 75 experiments (some articles feature more than one computational experiment or model).

The criteria for the selection of articles to include in our study were very straightforward. We checked for publications built around an agent-based simulation or where the model was a significant part of the study. We did not screen for models that were more or less descriptive, nor did we check whether the article made enough information available for us to allow (or replicate) power calculations. The reason for keeping all the ABM we could possibly find (in the time period considered) was to be able to look at all model types. Some of the models reported as NC in Table 2 may be descriptive or of the kind mentioned earlier in this paragraph.

Table 1 Summary of results for power calculations performed (number of models)

Once the articles were selected, data on power or Type-II error was extracted whenever possible. If no data or calculations were found in the article, we attempted to gather the information needed to compute statistical power. Given that the most difficult information to gather is the effect size ES, we hypothesized two worst-case scenarios, with a small (0.1) and a medium (0.3) ES à la Cohen (1992). Since we treat ABM simulations as experiments, we calculated the statistical power of the test hypothesizing that an ANOVA was performed to test the differences provided by results from the different configurations of parameters. According to Cohen (1992), the medium ES for the case of ANOVA is 0.25. We decided to go a bit above that, with 0.30, to reach a more significant impact on measurements. When ES is large, high power is reached with a limited number of runs, and we deem that this is the case that may not present particular concerns. Another variable that requires careful consideration is the significance level \(\alpha\) to which power calculations should refer. There is no consensus over its value and we have decided to take the standard \(\alpha = 0.05\) and also a more stringent criterion of \(\alpha = 0.01\) as references for our calculations.
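For transparency, the power values just described can be reproduced from the noncentral F distribution. The sketch below (Python with SciPy) is an illustrative reimplementation of the standard ANOVA power computation à la Cohen (1988); the values of J, n, ES and \(\alpha\) in the example are ours and do not refer to any specific study in Table 2.

```python
# Power of a one-way ANOVA with J parameter configurations and n runs each,
# for a hypothesized effect size f (Cohen 1988). A minimal sketch; the values
# of J, n, effect size and alpha below are illustrative.
from scipy import stats

def anova_power(n_runs, n_groups, effect_f, alpha):
    """Power of the one-way ANOVA F test.

    Under H1 the F statistic follows a noncentral F distribution with
    (J - 1, N - J) degrees of freedom and noncentrality N * f^2, where
    N = n_runs * n_groups is the total number of runs.
    """
    n_total = n_runs * n_groups
    dfn, dfd = n_groups - 1, n_total - n_groups
    ncp = n_total * effect_f ** 2
    f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
    return 1 - stats.ncf.cdf(f_crit, dfn, dfd, ncp)

if __name__ == "__main__":
    # Example: J = 6 configurations, n = 50 runs each, under the two
    # hypothesized effect sizes and significance levels used in the review.
    for effect_f in (0.1, 0.3):
        for alpha in (0.05, 0.01):
            p = anova_power(n_runs=50, n_groups=6, effect_f=effect_f, alpha=alpha)
            print(f"ES={effect_f}, alpha={alpha}: power={p:.3f}")
```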

3.2 Findings

Table 2 shows our calculations for the 75 ABM studies found in the 69 articles selected for the present study in the years 2010–2013 (see above for details). All calculations are performed using the ANOVA test with \(\alpha =0.01\), \(\alpha =0.05\), and for small (0.1) and medium (0.3) ES. Table 2 then presents how many configurations of parameters (CoP), J, the model in each paper uses, together with the number of runs n actually performed—as declared by the authors. The final column is the calculation of the recommended number of runs resulting from our formula (2) below when \(1 - \beta = 0.95\), \(\alpha =0.01\), and \(ES=0.1\); those simulations with \(J=2\) CoP only could have benefited from using a t formula, like Lehr’s—we acknowledge this limitation and slight imprecision in the calculations. Power calculations appearing in columns 2 to 5 are obtained using Cohen’s (1988) formulas as implemented in the pwr package for R—an open source software environment for statistical analysis. The information provided in Table 2 allows full replication of our study. A particularly sensitive quantity is the number of configurations of parameters (CoP), i.e. J, which we computed, as far as possible, from the original articles, where the interested reader can check this information. Some of the studies employ a full factorial design, so that every parameter assumes multiple values and the model is simulated a number n of times for every possible combination of parameter values ceteris paribus. When this is the case, it is relatively easy to calculate J by multiplying the numbers of levels as they appear in the text of the article.Footnote 5 Most studies do not perform any calculation to estimate the robustness of the simulation. In the selected period, only one article (Radax and Rengs 2010) presents statistical power analysis with the intention to determine the appropriate number of runs.

Table 2 Power analysis estimations on ABM articles published between 2010 and 2013

There are multiple strategies to determine either the number of runs or the number of steps. Among the latter, some authors (e.g., Mungovan et al. 2011; Shimazoe and Burton 2013) report convergence analysis to estimate the steady state. Among the former, Siebers and Aickelin (2011) refer to Robinson (2004) to justify the choice of 20 runs per configuration of parameters. This is an approach that uses confidence intervals, but it does not seem to specify for what purpose these numbers are sufficient. Another strategy for justifying the number of runs is that of Chebyshev’s theorem (Shannon 1975), indicated in Lee et al. (2013). The logic seems to be similar to what is found in Siebers and Aickelin (2011), in that it is based on 95 % confidence intervals for the performance measure (outcome).

In the panel of data, we also found 19 published experiments (26 %) where it was impossible (for us, at least) to understand how to calculate power. This signals that the information on methods was not easily accessible from just reading the paper. This may not be a significant problem, given that most ABM are made available on open-source platforms, which may eventually give access to all the information needed. However, this time we limited our analysis to what was available in the published article, which is the piece of information with the widest diffusion among academics.

From Table 2, it is apparent that when ES is small (i.e. 0.1), mean power for the studies in our sample is 0.415 (SD = 0.395) at the more restrictive significance level of \(\alpha =0.01\), and it is 0.526 (SD = 0.373) when \(\alpha =0.05\). Both values are well below any known standard, indicating most studies are significantly underpowered. When ES is medium (i.e. 0.3), on average, the test is above the threshold recommended for power in empirical studies (i.e. 0.80; see Cohen 1988; Liu 2014) with \(mean=0.842\) (SD = 0.284) at the less stringent significance level of \(\alpha =0.05\). This power threshold is, on average, still not met when \(\alpha =0.01\), with \(mean=0.783\) (SD = 0.346).

Fig. 1 Impact of the coefficient alpha and small effect size on power in selected studies (2010–2013; 55 observations; both axes are in logarithmic coordinates)

Figures 1 and 2 are a graphical reorganization of the information in Table 2. In these two figures, only the papers and experiments on which we performed the calculations are shown (55 observations). Experiments are also sorted by publication outlet, using a different color and mark: red dots for JASSS, blue dots for CMOT, and green or other shapes for the other journals. The logic behind the two figures is to map what happens to power in the same study when the significance level \(\alpha\) is relaxed, when the assumed ES is small (Fig. 1) and medium (Fig. 2) respectively. This exercise is interesting because it shows how assumptions on the stringency of a seemingly irrelevant element of power—i.e. the significance level \(\alpha\)—affect power. Note that the axes in both figures are on logarithmic scales, to help make sense of the distribution of results.

Figure 1 intuitively shows that some studies have power below 0.50 and that the majority appears below 0.90—the actual numbers are 51 % below 0.50 and 73 % below 0.90. The change in the significance level does not seem to affect power much in the ABM studies reviewed when ES is small. For overpowered studies—i.e. those studies that have excessively high power (see below for an overview of the risks this entails)—a change in significance level does not bear any effect at all. For the other (underpowered) studies, there is some effect, in that power levels seem to double when \(\alpha\) is relaxed for \(1 - \beta \le 0.30\). Instead, for \(1 - \beta > 0.30\), the impact of \(\alpha = 0.05\) is never enough for the study to reach sufficient power. Hence, Fig. 1 makes it even more apparent that, when ES is small, a less stringent significance level \(\alpha\) does not bear meaningful results. This implies that the most sensible strategy for researchers would be to increase the number of runs performed in the simulation. Of course, this requires power analysis to be taken into consideration.

Figure 2 shows the impact of significance levels when ES is medium (0.3). With this larger ES, only 14 % of studies have power below 0.50 under both conditions, while studies with power below 0.90 are 29 % of the total. The distribution is skewed towards higher levels of power, highlighting two interesting facts. On the one hand, when ES is relatively large, underpowered studies do not benefit significantly from a relaxation of \(\alpha\) levels. On the other hand, there is a very limited number of borderline studies that would pass the threshold and reach \(1 - \beta > 0.90\) (from just below 0.90), although none would reach 0.95 solely because of an \(\alpha\) effect. This means, once again, that a higher ES impacts power levels more effectively, but the only viable way to tackle low power is to increase the number of runs.

In short, both figures substantiate what is in Table 2 and highlight the importance of assuming a reasonable value of ES and running the simulation an appropriate number of times. The following section discusses these results further.

Fig. 2 Impact of the coefficient alpha and medium effect size on power in selected studies (2010–2013; 55 observations; both axes are in logarithmic coordinates)

4 Discussion of results

The results of the review presented in the previous section point out a few issues with ABM research. Before discussing the results of the review and presenting two formulas for sample size determination, we need to specify what the threshold for power analysis should be in the case of computer simulation.

We have claimed above and elsewhere (Secchi 2014) that computer simulation studies cannot be subjected to the same standards to which empirical studies are. Not only is computer simulation—especially ABM—obviously different from empirical study, but it is the nature of the difference that supports the need for different standards. The distinctiveness of computational modeling with respect to other scientific methods has been advocated by many scholars (e.g., Gilbert 2008; Coen 2009b; Miller and Page 2007), and we believe it is particularly relevant in the case of power analysis. ABM simulation studies are based on a simplification of reality where a given phenomenon is analyzed according to rules and to environmental and agent characteristics. The control exercised on this artificial micro-world is much higher than that exercised, for example, in a lab experiment. For this reason, it is possible to structure the ABM in order to make sure errors are not plaguing or fogging results. More than a possibility, this should be the aim of every modeler. Any simplification of reality carries the risk of being too imprecise, lax, or unfocused. Thus, given the assumptions, errors should be brought down to the bare minimum, so that unclear findings may be directly identified as coming from the model’s theoretical framework, not from its statistical shortcomings. We suggest that the reference for every ABM should be to reach a power of 0.95 or higher at a 0.01 significance level. More rigorous simulation studies have more potential to contribute to the advancement of our field because they would appear robust and more consistent within the range of assumptions.

It is fair to note that several authors (Johnson 2013; Colquhoun 2014) have recently advocated similarly stringent standards for Type-I error, pushing \(\alpha\) to 0.001. However, these authors did not accompany this suggestion with an analogous one concerning a decrease in \(\beta\). This creates a paradoxical situation, because none of the reasonings in Sect. 2.2 above is compatible with such a huge difference between \(\alpha\) and \(\beta\). This fact has been remarked upon by other authors (Fiedler et al. 2012; Lakens 2013; Lakens and Evers 2014) who have stressed the relevance of statistical power and Type-II errors for statistical inference, as well as the need to balance the two errors. Our proposal of reducing both \(\alpha\) and \(\beta\) in the stated proportions embraces the suggestion of decreasing the frequency of Type-I errors while keeping the ratio of the two probabilities quite near to the original value proposed by Cohen (1988, Sect. 2.4). While both these reductions can be difficult to accomplish in laboratory experiments, we think that most simulations are compatible with them.

4.1 The current norm: under-powered studies

Having clarified that the threshold for power in ABM and simulation studies should be 0.95 or higher, we can interpret the results more clearly. Most studies appear to be underpowered (medium ES) or strongly underpowered (small ES). This is consistent with the constituents of power, the ES being one of the elements affecting power the most. Even small increases in ES lead to higher power. Surprisingly enough, the increase we used in our analysis (\(+0.2\)) seems not to be enough in most cases. If we take this four-year sample to be representative of ABM research published in the social sciences, we obtain a very meagre picture. With small or even medium ES, the studies published in most of these articles are not able to tell whether Type-II error is under control. Given that a large ES depends on the characteristics of the phenomenon under analysis, it seems unlikely that all these studies can claim to have large ES. Hence, ABM research needs power calculations to make sure results are sound enough.

To further elaborate on the issues surrounding underpowered studies, we provide four arguments that lean on the variables entering the formula of the statistical power \(1 - \beta\) of an ANOVA test, namely ES, J, n and \(\alpha\). We start with the problems associated with ES. The first implication is that low power may be a symptom of faulty design, which becomes apparent when effects that are, in fact, relevant to understanding the dynamics of a simulation model are discarded. In addition to that, low power may depend on the fact that the researcher is testing configurations of parameters that are irrelevant (i.e. too close to each other). Low power may also derive from an insufficient number of groups J or runs n, so that results are more or less significant at random. Finally, one may have lax testing standards, in the belief that setting a more stringent \(\alpha\) for simulation studies is not an issue and does not affect power. Are faulty design, testing irrelevant differences, an insufficient number of groups or runs, or lax standards a problem for ABM research? Indeed, we think they are. We will take on each of these in turn in the following.

First, faulty design may affect power in that the simulation model is not capable of discriminating significantly enough between parameter configurations (this is reflected in a low ES). This may result from coding the impact of parameters on the outcome variable in a way that fails to make differences apparent, and it may be related to coding or equation errors, parameter misspecification, etc. If the simulation is affected by these errors (that we call “faulty design”), with the given number of groups and runs, power remains low and some effects may remain hidden. Hence, although in this case power would not fix poor simulation design, checking for power would help the modeler control the model further. Of course, the ES may be low because that is the nature of the simulated relation among different configurations of parameters and, in that case, power analysis would only suggest the appropriate number of runs for that effect to become apparent.

Second, results may be relevant but the ES may be so small that more runs are needed for it to become apparent. Sometimes this may affect the interpretation of the model, hence making the contribution to the academic discussion less relevant than it could have been. Take, for example, the simulation by Fioretti and Lomi (2010) that is based on the famous “garbage can” model (Cohen et al. 1972). In that very fine article, Fioretti and Lomi show that the agent-based version of the model confirms some of the results and discards some others. If we follow one of our hypotheses above, and consider that the ES for the parameter configurations is small, power is always insufficient (Table 2), independently of the significance level. This means that the findings of such a fine piece of modeling are not accurate and, potentially, we can neither confirm nor reject any of the features that are in the “garbage can” model. In particular, we cannot discard some of the effects that Fioretti and Lomi (2010) did not find. We may well end up confirming some of the features of the model and rejecting some others, but more runs would surely bring clarity to the results.

Third, an insufficient number of runs n and/or groups (parameter configurations) J is the most common cause of low power as per our review. This is a very important issue because it undermines results. Not only does low power make researchers discard results corresponding to nonnull effect sizes—this is the very concept of a false negative—but it also raises questions about conditions that are accepted as significant. As the number of runs in underpowered studies is low (less than needed to reach a certain value of power), it is likely that at least some large ES is just a random occurrence. Stated differently, we cannot confirm that the large ES will remain large when more runs are performed. One of the characteristics of good agent-based modeling is that simulations can be made to vary significantly, so that every run is different from another with the same configuration of parameters. Of course, these differences should be less relevant than those between runs from different configurations of parameters. However, in order for this to happen, the modeler needs to make sure that each configuration runs a number of times that is sufficient to exclude that the similarities (or the differences) are a product of random variation. This is why appropriately powered studies in agent-based simulation research are absolutely key. Discarding this issue on the claim that one is being conservative amounts to stating that one does not know whether results come from a random effect or from a stable, reliable, replicable effect. The fewer runs or groups one has in the simulation, the more one is exposed to the risk that results are inconsistent and/or anchored to a random occurrence. This is, we believe, the strongest argument for the need for power analysis in ABM.

Finally, agent-based modelers and simulation researchers in general are particularly keen on transposing the standards of empirical research to computational simulated environments. In this article, we advocate more stringent standards for computational simulation models (see above). From that stance, what matters is that, for example, the reference significance level for modelers is still \(\alpha = 0.05\), or the reference power \(1 - \beta = 0.80\). These are very lax standards in simulated environments. However, as we show in the article, on average, ABM simulation studies do not even match them. Hence, there probably is an issue with lax standards in the community or, maybe, with a lack of academic discussion on these issues. As we wrote above, these standards may still affect the interpretation of results, although they are probably a secondary issue compared to the complete absence of power testing. We hope this article contributes to starting a discussion on these important topics.

In short, all the four aspects above point at the fact that power analysis is a tool to make results more robust and reliable. There is no shortcut around these two aspects as we believe they are much needed in simulations (as well as in any other scientific analysis). Disregarding power may be considered as a conservative move when, in fact, it may just be that one is leaning on random effects reflected in the simulation results. Additional runs may end up changing the “face” of results, hence making them more robust and reliable. This, we believe, is a very important aspect that has the potential to make simulations more palatable to other organizational scholars.

4.2 The subtle risks of overpowered studies

Some of the studies in Table 2 appear to be overpowered, i.e. calculations show a number that is very close to 1.00 within the range of computational precision. This means that the number we show in the table (0.999) is practically indistinguishable from 1.00, although it can never reach it. What happened in these cases is that researchers overran their project, performing an astonishingly high number of runs and reaching an extremely low probability of Type-II error under the hypothesized effect size. For example, our estimation from the information available in Sharpanskykh and Stroeve (2011) and Sobkowicz (2010) shows that they performed 8000 runs, while Zappala and Logan (2010) did 5000 runs. The most extreme case we found is Aggarwal et al. (2011), with 10,000 runs. These are researchers who showed some awareness of the issues related to low power and decided to produce a number of runs so high that the problem would not appear to be relevant any more. This can only happen when the simulation is not time consuming or, in case it is, when researchers have supercomputers available. However, is this approach sound? What are the risks of overpower? Although a full article would be needed to show analytically what the actual risks of overpower are, we can discuss a few points here as they seem particularly relevant to our results.

In other disciplines, such as medicine, overpowering studies bears high financial costs (Girard 2005). Luckily enough, the decrease in the cost of computing power over the last decades has been so steady that the cost of most ABM is nowadays negligible with respect to more traditional experiments. However, there are risks of overpower besides waste of time.

In particular, the risk is that overpowered studies may lead modelers to notice effects so small that they are not worth considering. Mone et al. (1996, p. 115) clearly state that “Excessively large samples [...] raise a serious concern [...] of oversensitivity to trivial or irrelevant findings.” This happens when secondary or marginal elements appear to be statistically significant, just because of very large samples. What we are trying to convey may appear clearer when one fixes \(\alpha\) and \(\beta\) and looks at the relation between sample size n and effect size (ES). The larger the ES between two configurations of parameters, the fewer runs are needed to reach the stated values of \(\alpha\) and \(\beta\); conversely, the smaller the ES, the larger the number of runs needed. This implies that, when the number of runs increases for fixed \(\alpha\) and \(\beta\), hypothesis testing procedures associated with a very small ES will also reach the stated values of \(\alpha\) and \(\beta\), so that trivially small effects become statistically detectable.

Consequently, researchers may end up not being able to distinguish between more and less important effects, because both of them appear statistically significant, as the sketch below illustrates.Footnote 6
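As a numerical illustration of this point, the sketch below (Python with SciPy; the choice of J, \(\alpha\), target power and the grid of run numbers are illustrative assumptions) computes the smallest ANOVA effect size f that attains the stated \(\alpha\) and \(\beta\) for an increasing number of runs per configuration: as n grows, ever smaller effects become statistically detectable.

```python
# Minimum detectable effect size (Cohen's f) of a one-way ANOVA as the number
# of runs per configuration grows, for fixed alpha and target power.
# A minimal sketch; J, alpha, target power and the grid of n are illustrative.
from scipy import stats
from scipy.optimize import brentq

def anova_power(n_runs, n_groups, effect_f, alpha):
    """Power of the one-way ANOVA F test (noncentrality = N * f^2)."""
    n_total = n_runs * n_groups
    dfn, dfd = n_groups - 1, n_total - n_groups
    f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
    return 1 - stats.ncf.cdf(f_crit, dfn, dfd, n_total * effect_f ** 2)

def min_detectable_es(n_runs, n_groups, alpha=0.05, target_power=0.80):
    """Smallest f whose power reaches the target, found by root finding."""
    return brentq(
        lambda f: anova_power(n_runs, n_groups, f, alpha) - target_power,
        1e-4, 10.0,
    )

if __name__ == "__main__":
    for n_runs in (10, 100, 1000, 10000):
        f_min = min_detectable_es(n_runs, n_groups=4)
        print(f"n = {n_runs:>5} runs per configuration: detectable f ~= {f_min:.3f}")
```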

This consequence of excessive power is rarely stressed in statistical textbooks, but notable exceptions are DeGroot (1986, p. 497), Bickel and Doksum (2001, p. 231) and Larsen and Marx (2012, p. 383).Footnote 7 The topic is more often brought to the fore in applied statistics (Hochster 2008; McPhaul and Toto 2012, p. 61).

All in all, overpowered simulations end up being less reliable than appropriately powered simulations. This is not to state that results are to be discarded completely but they are less sound than better calibrated simulations. This property can be turned to good account in order to test how a model performs under extreme or boundary conditions. After obtaining the number of runs via power analysis and testing different parameter configurations, researchers have a first set of results. The following step would be to indiscriminately increment the number of runs to reach overpower with the purpose of testing when previously irrelevant (insignificant) results become statistically significant, if they do. This procedure would give modelers two pieces of information at least: (a) it is a “stress” test for the model and, as such, it may reveal modeling inaccuracies or faults (referred to as ‘faulty design’ for underpowered studies above), and (b) it allows researchers to have a better understanding of how/when a particular set of conditions is meaningful to the modeling effort. Of course, this is feasible only when the simulation is not time consuming.Footnote 8

As our results seem to suggest, this risk of overpower is very significant for ABM research, where the ease of producing additional runs of the model may affect how “clean” and “relevant” results are (Chase and Tucker 1976; Lykken 1968). An appropriate sample size, chosen in accordance with a prescribed level of power, may be the answer for obtaining clean data. Our results also suggest that there is no clear indication of how to implement statistical power analysis in ABM research. This may be at the basis of most studies not reporting power or misunderstanding the importance of determining the number of runs. The following section is dedicated to this specific point and presents two formulas we derived for sample size calculations for agent-based models.

5 Two new formulas for the determination of the number of runs

In ABM, the researcher has direct control over more factors than in most traditional data collection situations (e.g., real experiments, surveys, etc.), because parameter values have been chosen by the researcher and the incremental cost of adding further observations to the sample is generally low. Despite this, the previous sections delineated a situation in which most papers fail to achieve the most elementary power requirements. On the one hand, this is probably due to the fact that most researchers are unaware of the concept of power and of its importance in sample size determination. On the other hand, formulas helping researchers in the determination of the sample size as a function of power are not readily available in the literature.

Apart from the classic and computationally demanding method of numerically inverting the formula for the power, popularized by Cohen and embodied in the pwr package for R, it is customary (Norman and Streiner 1998, pp. 214–215) to approach the multivariate case by reducing it to the univariate case, covered by Lehr’s formula (Lehr 1992).

In this section, we provide and discuss two formulas for sample size determination (runs, in the case of ABM) that explicitly take into account the multivariate nature of the comparisons.Footnote 9

5.1 A general formula for n

Let \(\alpha\) be the Type-I error rate, and \(\beta\) the Type-II error rate that one wants to achieve. We consider an ANOVA test of the null hypothesis \(\mathsf {H}_{0}:\mu _{1}=\dots =\mu _{J}\). Let ES be the effect size of the test (Cohen 1988, 1992). In this case the formula of ES is more complex than the one seen in Sect. 2.2 for the t test, but the general interpretation is similar, i.e. ES is a measure of the distance of the real values \(\mu _{1},\dots ,\mu _{J}\) with respect to the null hypothesis \(\mathsf {H}_{0}\). It turns out that in this case n asymptotically behaves as \(n^{\star }\) where:

$$\begin{aligned} n^{\star }=\frac{1}{J \cdot ES^{2}}\left( \sqrt{\chi _{1-\alpha,\,J-1}^{2}}+\sqrt{\ln \tfrac{\left( \chi _{1-\alpha,\,J-1}^{2}\right) ^{\frac{J-2}{2}}}{2\pi \beta ^{2}}}\cdot \left( 1-\tfrac{J}{4}\cdot \tfrac{\ln \ln \frac{\left( \chi _{1-\alpha,\,J-1}^{2}\right) ^{\frac{J-2}{2}}}{2\pi \beta ^{2}}}{\ln \frac{\left( \chi _{1-\alpha,\,J-1}^{2}\right) ^{\frac{J-2}{2}}}{2\pi \beta ^{2}}}\right) \right) ^{2}. \end{aligned}$$
(1)

This formula shows several facts. First, when the effect size ES is small, a larger sample size is required. Second, when \(\beta\) decreases, n increases: in particular, when \(\beta\) is near 0, n behaves like \(\frac{2}{J \cdot ES^{2}}\cdot \left| \ln \beta \right|\). Third, when \(\alpha\) decreases, n increases: in this case too, n increases like \(\frac{2}{J \cdot ES^{2}}\cdot \left| \ln \alpha \right|\).
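For convenience, the sketch below (Python with SciPy) transcribes formula (1) into code; the example values of J, ES, \(\alpha\) and \(\beta\) are illustrative, and ES is assumed to be Cohen’s ANOVA effect size, as elsewhere in the paper.

```python
# Asymptotic approximation n* of the number of runs per configuration,
# transcribed from formula (1). Example values of J, ES, alpha, beta are
# illustrative; ES is interpreted as Cohen's f for the ANOVA.
import math
from scipy import stats

def n_star(n_groups, effect_size, alpha, beta):
    """Approximate runs per configuration needed for Type-I rate alpha
    and Type-II rate beta in a one-way ANOVA with n_groups configurations."""
    chi2_q = stats.chi2.ppf(1 - alpha, n_groups - 1)      # chi-square quantile
    log_term = math.log(chi2_q ** ((n_groups - 2) / 2) / (2 * math.pi * beta ** 2))
    correction = 1 - (n_groups / 4) * (math.log(log_term) / log_term)
    bracket = math.sqrt(chi2_q) + math.sqrt(log_term) * correction
    return bracket ** 2 / (n_groups * effect_size ** 2)

if __name__ == "__main__":
    # Example: J = 2 configurations, ES = 0.3, alpha = 0.01, beta = 0.05.
    print(round(n_star(n_groups=2, effect_size=0.3, alpha=0.01, beta=0.05)))
```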

5.2 An empirical formula

The previous formula is valid for fixed \(\alpha\) and ES, and is accurate for not too large J and small \(\beta\). In this section, the task is to find an accurate formula for \(n=n\left( J,ES\right)\), valid for a wider range of J and effect sizes ES, but restricted to the values \(\alpha =0.01\) and \(\beta =0.05\) (see above). We have resorted to a response surface analysis (Box and Wilson 1951). More details on the derivation are in the Appendix.

Fig. 3 Accuracy of empirical formula (2): the value of the function \(\frac{n\left( J,ES\right) -14.091\cdot J^{-0.640}\cdot ES^{-1.986}}{n\left( J,ES\right) }\) is displayed through the level curves; the value of \(n\left( J,ES\right)\) is displayed in shades of grey (each hue corresponds respectively, from darker to lighter, to \(n\,<\,4\), \(4\,<\,n\,<\,16\), \(16\,<\,n\,<\,64\), \(64\,<\,n\,<\,256\), \(256\,<\,n\,<\,1024\), \(1024\,<\,n\)); the area on which the function has been calibrated is displayed as a trapezium

The proposed formula is:

$$\begin{aligned} n\left( J,ES\right) \simeq 14.091\cdot J^{-0.640}\cdot ES^{-1.986}. \end{aligned}$$
(2)

A related formula for N can be obtained through the equality \(N=n \cdot J\).

A graphical representation of the accuracy is given in Fig. 3, which displays the function \(\frac{n\left( J,ES\right) -14.091\cdot J^{-0.640}\cdot ES^{-1.986}}{n\left( J,ES\right) }\) for J varying between 2 and 200 and ES between 0.01 and 0.6. The value of the function is displayed through the level curves. The value of \(n\left( J,ES\right)\) is displayed in shades of grey (each hue corresponds respectively, from darker to lighter, to \(n\,<\,4\), \(4\,<\,n\,<\,16\), \(16\,<\,n\,<\,64\), \(64\,<\,n\,<\,256\), \(256\,<\,n\,<\,1024\), \(1024\,<\,n\)). The area on which the formula has been calibrated is displayed as a trapezium. It is clear from the figure that the accuracy deteriorates rapidly when both J and ES are large but, when ES is moderate, the formula is overall quite accurate even for large J. Moreover, note that the percentage error is higher where n is smaller, so that the error is comparatively less serious.

It is fair to note that, while formula (1) is a theoretical result, formula (2) is empirical. As such, it is not backed by a rigorous mathematical derivation, but it offers a bounded percentage error for the values of the parameters in the calibration region.
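For practical use, the sketch below (Python; the example values of J and ES are illustrative) implements formula (2) together with the relation \(N=n\cdot J\). Recall that the formula is calibrated for \(\alpha =0.01\) and a power of 0.95, the thresholds proposed in this article.

```python
# Empirical formula (2): recommended runs per configuration for alpha = 0.01
# and power = 0.95. The example values of J and ES below are illustrative.
import math

def runs_per_configuration(n_groups, effect_size):
    """n(J, ES) ~= 14.091 * J^(-0.640) * ES^(-1.986), rounded up."""
    return math.ceil(14.091 * n_groups ** -0.640 * effect_size ** -1.986)

def total_runs(n_groups, effect_size):
    """Total number of runs N = n * J."""
    return runs_per_configuration(n_groups, effect_size) * n_groups

if __name__ == "__main__":
    # Example: a design with J = 8 parameter configurations and a small
    # hypothesized effect size (ES = 0.1).
    j, es = 8, 0.1
    print(runs_per_configuration(j, es), "runs per configuration,",
          total_runs(j, es), "runs in total")
```

Rounding up is a deliberate choice here: rounding down would leave the study slightly below the target power.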

6 Conclusions

In this article, we have described the importance of statistical power analysis for ABM research, especially when applied to the field of management and organizational behavior. We have then reviewed the literature on ABM from selected outlets in the social and behavioral sciences, years 2010–2013, and found that most studies are underpowered or do not provide any indication of ES, \(\alpha\) levels, number of runs, or power. This is a very surprising and worrying result because it calls into question the reliability and significance of ABM research. Most importantly, it points at the need for every ABM researcher to at least ask how to avoid Type-II error, making results more robust and consistent. In the previous section, we have derived some implications and presented formulas for sample size (i.e. number of runs) determination of particular interest for agent-based and simulation research. Although the focus of this article is on ABM, the question of “how many runs” a simulation should perform is not foreign to other techniques, and we cannot exclude that this approach can be successfully adopted in other areas of computational simulation. It is very likely that what we suggest may be useful to those running fitness landscape (or NK) models as well, although NK models may be considered “close relatives” of ABM. Given the scope of the current article, we leave further considerations on this possibility to future research.

ABM is a very promising technique, and it is spreading among the many disciplines of the social and behavioral sciences. Management and organizational behavior seem to lag behind this “new wave” of simulation research (Secchi 2015; Neumann and Secchi 2016; Fioretti 2013). However, this can be a strength more than a weakness. The first years of ABM research have been years of experimentation and of challenges in finding appropriate and sound methods. Although these efforts are ongoing, our field can step into simulation research from more solid ground, thanks to what has been done in the last twenty years. This may put management and organizational behavior in a more advanced position, ready to develop the next generation of ABM simulation and research. Power analysis is part of the toolkit of the advanced simulation modeler.

Another aspect of the use of power is clearly related to the type of results that come out of ABM models. If the results of any given simulation are not solid enough, there is the risk that scholars may go back to old prejudices about simulation studies. In the recent past, computer simulation suffered from the excessive simplification of assumptions, abstraction (i.e. distance from reality), and complicated design. Results were often deemed very difficult to grasp and practical implications were lacking. Inappropriate power carries the risk of seeing these prejudices come back and undermine what is the most promising advancement in computer simulation we have seen in decades. This is particularly important in management and organization studies because the use of ABM has only just started.

As we argue in the article, not only do we need to encourage researchers to be more precise in the determination of the number of runs for their simulations, but we also need to establish thresholds that are meaningful for ABM research. Our proposal is to set a power of 0.95 at a 0.01 significance level.

There are a few limitations to this article, and we mention a couple. First of all, we do not know what the ES of the selected studies actually is, and our review may be based on a misjudgement, if one were to show that the ES of those articles is higher than hypothesized. However, we cannot do science hoping that data and results are sound enough. On the contrary, we should develop scenarios that allow us to make informed decisions on possibly unfavorable as well as more favorable occurrences. Another limitation is that our proposed thresholds—i.e. a power of 0.95 at a 0.01 significance level—may turn out to be inadequate or too restrictive. More research is needed to assure modelers that these are reasonable levels for producing sound and clean results.

Despite these limitations, the article shows that there are important reasons why statistical power analysis matters for ABM research per se and for the diffusion of this technique in management and organization studies.