Early stopping of RCTs: two potential issues for error statistics


Abstract

Error statistics (ES) is an important methodological view in the philosophy of statistics and philosophy of science that can be applied to scientific experiments such as clinical trials. In this paper, I raise two potential issues for ES when it comes to guiding and explaining the early stopping of randomized controlled trials (RCTs): (a) ES (via its severity principle) provides limited guidance in cases of early unfavorable trends, owing to the possibility of trend reversal; (b) ES is silent on how to prospectively control error rates in experiments requiring multiple interim analyses. The method of conditional power, together with a rationing principle for RCTs, can assist ES in addressing these issues.


Notes

  1. The study principal investigator is responsible for the design of the study protocol, and for providing all required interim data to the data monitoring committee. 

  2. ES provides a standpoint regarding a cluster of statistical tools, including their interpretations and justification, as well as a general philosophy of science based on the principle of learning from error, and the specific roles probability plays in learning from error.

  3. A more realistic formulation of \(H_0\) is that the antiviral therapy does no better than the control, i.e., the proportion of patients in the treatment group experiencing a substantive decrease in the level of viral load is no greater than the corresponding proportion in the control group.

  4. The probability of obtaining a test statistic result equal to or more extreme than the one observed, assuming that the null hypothesis \(H_0\) is true.

  5. The Greek letter \(\alpha\) (alpha) denotes the rate (probability) of type I error, where a type I error occurs when the null hypothesis \(H_0\) is true but rejected (an error of the first kind). \(\alpha\) is also known as the size of the test. In statistical terms, the test prescribes rejecting \(H_0\) in favor of the alternative whenever the \(p\) value \(< \alpha\).

  6. One way to understand the claim ‘decrease in the level of viral load’ is to read it as the proportion of patients who experienced a decrease in their level of viral load. The threefold comparison therefore refers to the difference in the proportions of patients experiencing a decrease in viral load, treatment versus control group; e.g., 5 % of the control individuals versus 15 % of those under treatment. I should add, however, that this example of viral load response, although commonly used in RCTs and sufficient for my purposes, is not ideal. If, in the absence of an effective treatment, viral load would increase with time, patients who showed no improvement but also no deterioration would be considered responding. Another example: if regular use of an ‘anti-aging’ cream over 20 years removed no wrinkles yet prevented new ones from appearing, the treatment would hardly be regarded as ineffective. I would like to thank one of the reviewers for calling attention to this point.

  7. A technical note for completeness: according to Mayo (1988), the difference between power and severity is that severity is a function of the particular observed difference \(D_{\mathrm{obs}}\), i.e., a function of the outcome \(x_0\), whereas power is a function of the smallest difference judged significant \(D^*\) by a given test, i.e., \(D^*\) is the critical boundary beyond which the result is taken to reject H. The power of test T against an alternative \(H_1\) equals the probability of a difference as large as \(D^*\), given that \(H_1\) is true. Severity, in contrast, substitutes the observed difference \(D_{\mathrm{obs}}\) for the fixed critical boundary \(D^*\). (See Mayo 1988, footnote 15.) I return to the difference between power and severity when considering my numerical example in Sect. 2.1.

  8. The DMC (also known as Data and Safety Monitoring Board) is an external and independent group—presumably the only group reviewing the outcome data by treatment assignment. It assumes responsibility for the safety of trial participants and the future patients who might potentially use the new intervention. In fulfilling these duties the DMC has an obligation to make recommendations to trial sponsors and principal investigators on the proper conduct of RCTs, whether for government or industry-sponsored trials. The particular task of data monitoring involves the repeated examination of the data as it accumulates, with an eye to possible early termination of the trial. See Moyé (2006) for an introduction to statistical monitoring.

  9. A secondary endpoint is an outcome variable that is typically known to be related to the primary outcome of statistical interest in the study. For instance, in addition to a primary outcome variable (e.g., mortality) used to evaluate the efficacy of an intervention (e.g., antiretroviral treatment), a secondary outcome is also informative of the benefits of the intervention (e.g., CD4 cell count) regardless of its causal relationship to the primary outcome—CD4 cell counts are presumed to be associated with survival, and CD4 cell trends are presumed to be predictors of survival.

  10. A surrogate marker is a laboratory measurement used in RCTs as a substitute for a clinically meaningful endpoint (a direct measure of how a patient responds or survives), and is expected to predict the effect of the intervention being tested.

  11. Given that \(Z(t)\) is the standardized test statistic at information fraction \(t\), and \(z_{\alpha/2}\) the critical value for type I error, the conditional power (CP) for some drift \(\theta\) is given by: \(\mathrm{CP}(\theta) = \mathrm{P}(Z(1) \ge z_{\alpha/2} \mid Z(t), \theta) = 1 - \Phi\{[z_{\alpha/2} - t^{1/2} Z(t) - \theta(1 - t)]/(1 - t)^{1/2}\}\).
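
As a quick numerical illustration of this formula, here is a minimal sketch in Python (my own code, not the author's; the interim values are hypothetical):

```python
from scipy.stats import norm

def conditional_power(z_t, t, theta, alpha=0.05):
    """Conditional power P(Z(1) >= z_{alpha/2} | Z(t), theta).

    z_t   : standardized statistic observed at information fraction t
    theta : assumed drift, i.e., the expected value of Z(1) under the
            hypothesized treatment effect
    """
    z_crit = norm.ppf(1 - alpha / 2)          # z_{alpha/2}; 1.96 for alpha = 0.05
    num = z_crit - t ** 0.5 * z_t - theta * (1 - t)
    return 1 - norm.cdf(num / (1 - t) ** 0.5)

# Hypothetical interim look halfway through the trial (t = 0.5), observed Z = 1.0,
# with design drift theta = 2.8 (roughly a trial planned for 80% power):
print(conditional_power(z_t=1.0, t=0.5, theta=2.8))  # ~0.58 under the design effect
print(conditional_power(z_t=1.0, t=0.5, theta=0.0))  # ~0.04 under the null (futility check)
```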

  12. The ‘assurance’ of the event \(R\) is the expected power \(\mathrm{P}(R) = E(\pi(\theta))\), where the expectation is with respect to the prior probability distribution of \(\theta\). Assurance avoids the need to condition on a fixed treatment effect at the design stage; “rather it quantifies the ability of the trial to achieve a desired outcome based on the available evidence.” (O’Hagan et al. 2005, p. 189)
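
Since the assurance integral rarely has a closed form, a Monte Carlo sketch makes the idea concrete (my illustration; the one-look normal model and the prior values are assumptions, not from O'Hagan et al.):

```python
import numpy as np
from scipy.stats import norm

def assurance(prior_mean, prior_sd, alpha=0.05, n_draws=100_000, seed=0):
    """Expected power E[pi(theta)] with theta drawn from a normal prior.
    For a single-look trial with drift theta, power is 1 - Phi(z_{alpha/2} - theta)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(prior_mean, prior_sd, n_draws)
    z_crit = norm.ppf(1 - alpha / 2)
    return float(np.mean(1 - norm.cdf(z_crit - theta)))

# A prior centred on the design effect (drift 2.8, i.e., ~80% power) but with
# real uncertainty about it: assurance drops to roughly 0.72.
print(assurance(prior_mean=2.8, prior_sd=1.0))
```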

  13. Sample size computed according to the z test for the difference between two independent proportions: \(N = \left[\frac{z_{1-\alpha}\phi_0 + z_{1-\beta}\phi_1}{\pi_c - \pi_e}\right]^2\), where \(\phi_0^2 = 4\pi(1 - \pi)\) and \(\phi_1^2 = 2\pi_e(1 - \pi_e) + 2\pi_c(1 - \pi_c)\), with \(z_{1-\alpha} = 1.96\), \(z_{1-\beta} = 1.282\), \(\pi_c = 0.3\), \(\pi_e = 0.21\), and \(\pi = 0.255\).
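
A quick check of the arithmetic (my own sketch; note that \(\pi\) is the average of \(\pi_c\) and \(\pi_e\), i.e., \((0.3 + 0.21)/2 = 0.255\), consistent with the note's values):

```python
from math import sqrt

def sample_size_two_proportions(pi_c, pi_e, z_alpha=1.96, z_beta=1.282):
    """Per-group sample size for the z test of two independent proportions
    (the formula in note 13)."""
    pi = (pi_c + pi_e) / 2
    phi0 = sqrt(4 * pi * (1 - pi))
    phi1 = sqrt(2 * pi_e * (1 - pi_e) + 2 * pi_c * (1 - pi_c))
    return ((z_alpha * phi0 + z_beta * phi1) / (pi_c - pi_e)) ** 2

print(sample_size_two_proportions(pi_c=0.3, pi_e=0.21))  # ~982 patients per group
```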

  14. By a loss to follow-up I am referring to patients who have either opted to withdraw from the study or have become unreachable by the interim point.

  15. Computed according to the z test for the difference between two independent proportions.

  16. The actual definition: \(\mathrm{SEV}(\theta < \theta_a) = \mathrm{P}(z(X) > z(x_0);\, \theta < \theta_a \text{ false}) = \mathrm{P}(z(X) > z(x_0);\, \theta \ge \theta_a)\), with the z test statistic \(z(X)\) as an appropriate measure of agreement or distance from H.

  17. Another way of saying it: if \(H_0\) were false, there is a very low probability that our interim test would have produced results that accord with \(H_0\) as well as (or better than) the sample estimate does.

  18. Power, on the other hand, is calculated relative to the cut-off point \(c_\alpha\) for rejecting \(H_0\); it treats all values of \(z(x_0)\) in the acceptance region the same. Power at \(\theta = 0.3\): \(\mathrm{P}(z(X) > c_\alpha;\, \theta = 0.3)\). One should instead calculate the attained power: \(\mathrm{P}(z(X) > z(x_0);\, \theta = 0.3)\). The attained power against the alternative \(\theta = 0.3\) gives the severity with which \(\theta < 0.3\) passes the interim test when \(H_0\) is accepted.
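
To make the contrast concrete, here is a small sketch (mine, on a standardized scale, which simplifies the paper's proportion-based example):

```python
from scipy.stats import norm

def attained_power(z_obs, theta_a):
    """P(z(X) > z(x0); theta = theta_a) for a standardized statistic:
    the severity with which 'theta < theta_a' passes when H0 is accepted."""
    return 1 - norm.cdf(z_obs - theta_a)

def power(theta_a, alpha=0.05):
    """Ordinary power against theta_a, relative to the fixed cut-off c_alpha."""
    c_alpha = norm.ppf(1 - alpha)
    return 1 - norm.cdf(c_alpha - theta_a)

# Accepting H0 with a small observed z warrants 'theta < 2' with high severity;
# a z near the cut-off does not, even though the ordinary power is the same:
print(attained_power(z_obs=0.5, theta_a=2.0))  # ~0.93
print(attained_power(z_obs=1.6, theta_a=2.0))  # ~0.66
print(power(theta_a=2.0))                      # ~0.64, fixed regardless of z_obs
```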

  19. Proschan et al. (2006) define a B value (a discounted variable) given by \(B(t) = \sqrt{t}\, Z(t)\), where \(Z(t)\) is the z score at information time \(t\); the B value drifts about an expected value that is linear in \(t\), going from the start of the trial \((t = 0)\) to the unknown expected z statistic at the end of the trial \((t = 1)\). See Proschan et al. (2006), Chapter 3, for details.
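
A tiny sketch of the linear-drift extrapolation (my illustration of the idea, not Proschan et al.'s code):

```python
from math import sqrt

def b_value(z_t, t):
    """Proschan et al.'s B value: B(t) = sqrt(t) * Z(t)."""
    return sqrt(t) * z_t

def projected_drift(z_t, t):
    """Extrapolate the B value linearly to t = 1 using the current
    trend estimate theta_hat = B(t) / t."""
    b = b_value(z_t, t)
    theta_hat = b / t
    return b + theta_hat * (1 - t)  # algebraically equal to theta_hat

# Hypothetical interim at one quarter of the information (t = 0.25), Z = 1.2:
print(projected_drift(z_t=1.2, t=0.25))  # 2.4, the projected end-of-trial drift
```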

  20. See Proschan et al. (2006) Chapter 3, Section 3.2 for details.

  21. Another way of assigning priors is to follow Proschan et al.'s (2006) recommendation: a mean consistent with the alternative hypothesis and a standard deviation chosen to weight the alternative hypothesis appropriately against the data at interim. This method combines the empirical effect with the hypothesized effect, but ‘flat’ priors are a simpler and sufficient way of computing predictive power for illustrating the hybrid method of conditional power.
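
For concreteness, here is the flat-prior version as I reconstruct it from the standard B-value algebra (a sketch under my assumptions, not the paper's code): with a flat prior, the posterior for the drift is \(\theta \sim N(\hat{\theta}, 1/t)\) with \(\hat{\theta} = Z(t)/\sqrt{t}\), so \(Z(1)\) is marginally \(N(\hat{\theta}, (1-t)/t)\).

```python
from math import sqrt
from scipy.stats import norm

def predictive_power_flat_prior(z_t, t, alpha=0.05):
    """Predictive power at information fraction t under a flat prior on the drift."""
    z_crit = norm.ppf(1 - alpha / 2)
    theta_hat = z_t / sqrt(t)                    # current trend estimate
    return 1 - norm.cdf((z_crit - theta_hat) / sqrt((1 - t) / t))

# Same hypothetical interim as in note 11: t = 0.5, observed Z = 1.0.
print(predictive_power_flat_prior(z_t=1.0, t=0.5))  # ~0.29
```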

  22. The Coronary Drug Project Research Group (1981) “Practical aspects of decision making in clinical trials: the coronary drug project as a case study.” Controlled Clinical Trials 1:363–376.

  23. Coronary Drug Project Research Group (1975).

  24. That is, special methods such as “the development, on the basis of the accumulated data, of new hypotheses relating to treatment differences and the testing of these hypotheses a few months later on the basis of new events occurring during that interval.” (ibid, p. 363)

  25. According to the original study investigator “if the DSMC had decided to stop the study and declare clofibrate therapeutically efficacious on the basis of these early ‘statistical significant’ results, it is evident that in retrospect it would very likely have been a wrong decision. Fortunately, the DSMC was careful not to react quickly and drastically to results that reached ‘statistical significance’ at the conventional 5 % level.” (Canner et al. p. 372)

  26. SEV estimated from the interim relative risks and computed as \(\mathrm{P}(z(X) > z(x_0);\, \theta \ge \theta_a)\), with the z test statistic \(z(X)\) as an appropriate measure of agreement or distance from H. The severity rationale is this: the study was designed with 80 % power to detect a difference of (at least) a 25 % advantage for ddI; at interim it had a much smaller chance of detecting such an effect (i.e., smaller power at interim, since it had only 1/4 of the total trial information); and it nevertheless did detect the effect of interest. With such impressive results at interim (\(p\) value = 0.009), we would have even more reason to think that the effect of interest (a 25 % advantage for ddI) exists.

  27. An additional example of ES and conditional power methods used in complementary ways, but this time converging on the decision to stop the trial given evidence at interim, was the Beta-Blocker Heart Attack Trial (BHAT Research Group 1982). According to BHAT investigators, the DMC made use of “two statistical methods for declaring the overall mortality results significant.” (DeMets et al. 1984, p. 362) The first method used ES considerations to control overall type I error rates, whereas the second method “evaluated whether the observed trend was so impressive that the conclusion was unlikely to change even if the trial should continue to the scheduled end.” (ibid, p. 362) In other words, compliance with a spending function (ES supplemented with a rationing principle for controlling and spending error rates—more on this in Sect. 3) was used to prospectively control error rates, and conditional power was used to assess the probability that mortality rates would wane and no longer show statistical significance, should the trial continue to its original scheduled plan. With 12 months left in the trial, after deliberation at interim, the DMC decided to stop the study given the interim data (i.e., z value \(= 2.82 >\) critical value of 2.23; \(p\) value \(= 2[1 - \Phi(2.82)] = 0.005\); ‘reject the null’ with high severity). Among the relevant factors cited for the decision to stop were: (1) low conditional power values ranging over a reasonable minimum and maximum number of deaths for the remainder of the study, (2) observed internal consistency between primary and secondary outcomes, and (3) the possible impact of early termination of the study on clinical practice. (BHAT Research Group 1982; Ellenberg et al. 2003)

  28. Antidepressant drugs present the converse case: trials of antidepressant drugs typically do not last for years, yet many patients remain on the drugs for years.

  29. The overall alpha error is simply the overall probability of a type I error (incorrectly rejecting a true null hypothesis) given a decision procedure that includes multiple statistical tests. It is the type I error rate for the entire collection of statistical tests applied to a dataset. By increasing the number of statistical tests (e.g., tests of significance, hypothesis tests) conducted on a fixed dataset where the null hypothesis is true, there is a corresponding increase in the probability of a statistically significant result. A variety of statistical procedures have been developed to allow the researcher to control the overall alpha error; the Bonferroni method is an example of such a procedure. Bonferroni allows the researcher to control a decision procedure for a desired overall alpha error (e.g., 0.05). For instance, if the researcher wants to perform a significance test 8 times on the same dataset, then in order to have an overall alpha of 0.05, the researcher would set alpha to 0.00625 for each of the 8 tests to be performed, as illustrated below. Bonferroni adjustments are only applicable if the tests are independent. The field of sequential analysis is vast and provides numerous other methods of type I error correction. (e.g., Armitage 1975; Pocock 1977; O'Brien and Fleming 1979; Spiegelhalter et al. 2004; Proschan et al. 2006)
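
The arithmetic behind the example (my sketch; the inflation formula assumes independent tests, as the note says):

```python
def overall_alpha(alpha_per_test, k):
    """Probability of at least one false rejection among k independent
    tests of a true null, each conducted at level alpha_per_test."""
    return 1 - (1 - alpha_per_test) ** k

def bonferroni(alpha_overall, k):
    """Per-test level that keeps the overall type I error near alpha_overall."""
    return alpha_overall / k

print(overall_alpha(0.05, 8))     # ~0.34: eight unadjusted tests at 0.05
print(bonferroni(0.05, 8))        # 0.00625, the per-test level in the note
print(overall_alpha(0.00625, 8))  # ~0.049: back under the nominal 0.05
```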

  30. A critical boundary specifies the range of \(p\) values that warrants rejection of the null hypothesis; e.g., with \(\alpha = 0.05\), any observed \(p\) value less than 0.05 warrants rejection of the null hypothesis.

  31. Clearly the tests are not independent during the sequential analysis of RCTs, since the data accumulate, but the simplified scenario is given only to illustrate the intuition behind, and the difficulty of dealing with, the increase in the type I error rate in RCTs.

  32. Armitage (1975), Pocock (1977), O'Brien and Fleming (1979), Proschan et al. (2006) and others have considered the problem from the point of view of the tests (interim analyses) being dependent on one another. This dependency adds difficulty to the problem of error rate increase, and it is approached either through numerical integration, e.g., Armitage (1975), or through simulation techniques (approximation methods), e.g., Proschan et al. (2006). Regardless of the approach used, however (whether integration or simulation), the end result is the same: as the number of interim analyses approaches infinity, the type I error rate slowly approaches 1. The simulation sketch below illustrates the effect.
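
A minimal Monte Carlo version of the point (my sketch; the normal-data setup and look schedule are assumptions for illustration):

```python
import numpy as np

def inflated_type1(n_looks, n_per_look=50, n_sims=10_000, seed=0):
    """Estimate the overall type I error when a true null (mean zero) is
    tested at the two-sided 5% level after each accumulating batch of data.
    The interim z statistics are dependent, as in sequential RCT monitoring."""
    rng = np.random.default_rng(seed)
    total = np.zeros(n_sims)
    rejected = np.zeros(n_sims, dtype=bool)
    for k in range(1, n_looks + 1):
        total += rng.standard_normal((n_sims, n_per_look)).sum(axis=1)
        z = total / np.sqrt(k * n_per_look)   # cumulative z statistic
        rejected |= np.abs(z) > 1.96
    return rejected.mean()

for looks in (1, 5, 10, 20):
    print(looks, inflated_type1(looks))
# Roughly 0.05, 0.14, 0.19, 0.25: the overall error rate keeps climbing,
# consistent with the classical results of Armitage and others.
```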

  33. As a concrete example of early stopping affecting the results of a trial, the reader might want to look at the TORCH study (Calverley et al. 2007). Here the intervention would have been significant with regard to mortality rates (the primary end point) had there been no interim analyses, but as a result of such monitoring the \(p\) value was quoted as 0.052. “Interim analyses were performed by the independent safety and efficacy data monitoring committee according to the method of Whitehead (1999). As a consequence, the P value for the primary comparison between the combination regimen and placebo was adjusted upward to conserve an overall significant level of 0.05.” (Emphasis mine; Calverley et al. 2007, p. 777)

  34. The term “spending function” or “alpha spending function” has a technical meaning in the biostatistical literature: it typically denotes a function that approximates a group sequential method for RCTs. I use the term more broadly, as a rule or function that distributes the overall error rate over the intended interim points, whether or not the rule approximates a group sequential method.

  35. There are several studies that compare different spending function approaches with respect to their multiple interim analysis properties, e.g., Pocock versus O'Brien–Fleming with respect to their overall power (see Skovlund 1999; Gillen and Emerson 2013, for examples). These comparative studies are stochastic simulation experiments. For instance, simulations comparing the Pocock and O'Brien–Fleming approaches using survival data sampled from a breast cancer trial with 200, 350, and 500 patients, and varying the number of interim analyses, have shown that the overall statistical “power is consistently lower for the method proposed by Pocock for all sample sizes applied, compared to O'Brien–Fleming's method which is almost identical to the power of a fixed sample test.” (Skovlund 1999, p. 1086) The two approaches differ in how they spend alpha over the course of the trial, as the sketch below illustrates.
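
To show the difference in spending behavior, here is a sketch of the two standard Lan–DeMets spending functions that approximate these designs (my illustration; Skovlund's simulations used survival data, not this simple comparison):

```python
import numpy as np
from scipy.stats import norm

def obrien_fleming_spend(t, alpha=0.05):
    """Lan-DeMets spending function approximating O'Brien-Fleming boundaries."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / np.sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Lan-DeMets spending function approximating Pocock boundaries."""
    return alpha * np.log(1 + (np.e - 1) * t)

for t in (0.25, 0.5, 0.75, 1.0):
    print(t, round(obrien_fleming_spend(t), 4), round(pocock_spend(t), 4))
# O'Brien-Fleming spends almost no alpha early (~0.0001 at t = 0.25), making
# early stopping hard; Pocock spends it much more evenly (~0.018 at t = 0.25).
```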

  36. On the general issue of what might be deemed substantively important in statistical testing, Mayo says:

    [O]f course, the particular risk increase that is considered substantively important depends on factors quite outside what the test itself provides. But this is no different, and no more problematic, than the fact that my scale does not tell me what increases in weight I need to worry about. Understanding how to read statistical results, and how to read my scale, informs me of the increases that are or are not indicated by a given result. That is what instruments are supposed to do. Here is where severity considerations supply what a text book reading of standard tests does not. (Mayo 1996, pp. 197–198)

  37. Individuals (trial participants), information, care, goods and services are all scarce resources. These resources are utilized within the confines of RCTs.

  38. A New York Times article reports on the introduction of a recent “controversial” “major shift of HIV treatment policy” in the city of San Francisco, where public health doctors had begun to advise HIV asymptomatics “to start taking antiviral medicines as soon as they are found to be infected, rather than waiting for signs that their immune systems have started to fail.” The author suggests that “the turning point” in San Francisco's thinking might have been guided by “a study in the New England Journal of Medicine on April 1, 2009, that compared death rates among thousands of North American H.I.V. patients,” showing that “patients who put off therapy until their immune system showed signs of damage had a nearly twofold greater risk of dying from any cause [compared to patients that were healthy and with] T-cell counts above 500.” Leaving aside the fact that antiviral drugs can cost US$12,000 a year per patient—taking approximately US$350 million of the state of California's annual budget—there remains disagreement among researchers over whether such a change in policy is a good idea after all. The disagreement is illustrated by Dr. Anthony Fauci (director of the NIAID), quoted as claiming that the new policy of early treatment is “an important step in the right direction,” whereas Dr. Jay Levy (virologist at UCSF and one of the pioneers in identifying HIV as the cause of AIDS) is quoted as saying of the policy, “it's just too risky,” since “no one knows the effects of taking them for decades”—even though the new drugs may be less toxic than they used to be. The author also reports that “San Francisco's decision follows a split vote in December [2009] by a 38-member federal panel on treatment guidelines.” Russell, Sabin, “City Endorses New Policy for Treatment of H.I.V.,” The New York Times, published online April 3, 2010.

  39. See Stanev 2011, 2012a, b for further examples.

References

  • Abrams, D., Goldman, A., Launer, C., et al. (1994). A comparative trial of didanosine or zalcitabine after treatment with zidovudine in patients with human immunodeficiency virus infection. The New England Journal of Medicine, 330, 657–662.

  • Andersen, P. K. (1987). Conditional power calculations as an aid in the decision whether to continue a clinical trial. Controlled Clinical Trials, 8, 67–74.


  • Armitage, P. (1975). Sequential medical trials (2nd ed.). New York: Wiley.

  • Beta-Blocker Heart Attack Trial Research Group. (1982). A randomized trial of propranolol in patients with acute myocardial infarction. Journal of the American Medical Association, 247, 1707–1714.

  • Calverley, P. M., Anderson, J. A., Celli, B., Ferguson, G. T., Jenkins, C., Jones, P. W., et al. (2007). Salmeterol and fluticasone propionate and survival in chronic obstructive pulmonary disease. The New England Journal of Medicine, 356(8), 775–789.


  • Choi, S. C., & Pepple, P. A. (1989). Monitoring clinical trials based on predictive probability of significance. Biometrics, 45(1), 317–323.


  • Coronary Drug Project Research Group. (1975). Clofibrate and niacin in coronary heart disease. JAMA: The Journal of the American Medical Association, 231, 360–381.


  • DeMets, D. L., Hardy, R., Friedman, L., & Lan, K. (1984). Statistical aspects of early termination in the Beta-Blocker Heart Attack Trial. Controlled Clinical Trials, 5, 362–372.

  • Ellenberg, S., et al. (2003). Data monitoring committees in clinical trials. New York: Wiley.


  • Fleming, T. R., Neaton, J. D., Goldman, A., DeMets, D. L., Launer, C., Korvick, J., et al. (1995). Insights from monitoring the CPCRA didanosine/zalcitabine trial. Journal of Acquired Immune Deficiency Syndromes and Human Retrovirology, 10, S9–S18.

  • Gelman, A., et al. (2004). Bayesian data analysis. New York: Chapman & Hall/CRC Press.


  • Gillen, D. (2008). A random walk approach for quantifying uncertainty in group sequential survival trials. Computational Statistics and Data Analysis, 53, 609–620.

  • Gillen, D., & Emerson, S. (2013). Designing, monitoring, and analyzing group sequential clinical trials using the RCTdesign package for R. In T. R. Fleming & B. S. Weir (Eds.), Proceedings of the Fourth Seattle Symposium in Biostatistics: Clinical Trials. Lecture Notes in Statistics, 1205, 177–207.

  • Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.


  • Halperin, M., Lan, K. K. G., Ware, J. H., Johnson, N. J., & Demets, D. L. (1982). An aid to data monitoring in long-term clinical trials. Controlled Clinical Trials, 3, 311–323.


  • Lan, K. K. G., Simon, R., & Halperin, M. (1982). Stochastically curtailed testing in long-term clinical trials. Communications in Statistics C, 1, 207–219.


  • Mayo, D. (1988). Toward a more objective understanding of the evidence of carcinogenic risk. In Proceedings of the biennial meeting of the philosophy of science association (pp. 489–503).

  • Mayo, D. (1996). Error and the growth of experimental knowledge. Chicago: The University of Chicago Press.


  • Mayo, D. (1997). Error statistics and learning from error: Making a virtue of necessity. Philosophy of Science, 64, S195–S212.


  • Mayo, D. (2000). Experimental practice and an error statistical account of evidence. Philosophy of Science, 67, S193–S207.


  • Mayo, D., & Kruse, M. (2001). Principles of inference and their consequences. In D. Corfield & J. Williamson (Eds.), Foundations of Bayesianism (pp. 381–403). Dordrecht: Kluwer Academic Publishers.


  • Mayo, D., & Spanos, A. (2006). Severe testing as a basic concept in a Neyman-Pearson philosophy of induction. British Journal for the Philosophy of Science, 57(2), 323–357.


  • Mills, E. J., et al. (2006). Randomized trials stopped early for harm in HIV/AIDS: A systematic survey. HIV Clinical Trials, 7(1), 24–33.


  • Montori, V. M., et al. (2005). Randomized trials stopped early for benefit: A systematic review. The Journal of the American Medical Association, 294, 2203–2209.


  • Morrison, D., & Henkel, R. (Eds.). (1973). The significance test controversy. Chicago: Aldine.


  • Moyé, L. (2006). Statistical monitoring of clinical trials. New York: Springer.


  • O’Brien, P. C., & Fleming, T. R. (1979). A multiple testing procedure for clinical trials. Biometrics, 35, 549–556.


  • O’Hagan, A., Stevens, J. W., & Campbell, M. J. (2005). Assurance in clinical trial design. Pharmaceutical Statistics, 4, 187–201.


  • Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical trials. Biometrika, 64, 191–199.


  • Proschan, M. A., Lan, K. K., & Wittes, J. T. (2006). Statistical monitoring of clinical trials. Berlin: Springer.


  • Russell, S. (2010). City endorses new policy for treatment of H.I.V. The New York Times, published online April 3.

  • Skovlund, E. (1999). Repeated significance tests on accumulating survival data. Journal of Clinical Epidemiology, 52(11), 1083–1088.

  • Spiegelhalter, D., Abrams, K., & Myles, J. (2004). Bayesian approaches to clinical trials and health-care evaluation. New York: Wiley.


  • Stanev, R. (2011). Statistical decisions and the interim analyses of clinical trials. Theoretical Medicine and Bioethics, 32(1), 61–74.


  • Stanev, R. (2012a). Modelling and simulating early stopping of RCTs: a case study of early stop due to harm. Journal of Experimental & Theoretical Artificial Intelligence, 24(4), 513–526.

  • Stanev, R. (2012b). Stopping rules and data monitoring in clinical trials. In H. W. de Regt, S. Hartmann, & S. Okasha (Eds.), EPSA philosophy of science: Amsterdam 2009 (Vol. 1, pp. 375–386). Netherlands: Springer.

  • Steel, D. (2001). Bayesian statistics in radiocarbon calibration. Philosophy of Science, 68, S153–S164.


  • The Coronary Drug Project Research Group. (1981). Practical aspects of decision making in clinical trials: The coronary drug project as a case study. Controlled Clinical Trials, 1, 363–376.


  • Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26–30.


  • Whitehead, J. (1999). The design and analysis of sequential trials (2nd ed.). New York: Wiley.



Acknowledgments

I thank the anonymous Synthese reviewers, as well as Paul Bartha, Alan Richardson, Hubert Wong, Roger Ariew, Alex Levine, Charles Weijer, Tim Ramsay, Tinghua Zhang, and Bill Cameron, for helpful discussion and feedback. This work was supported by the Banting Fellow award from the Canadian Institutes of Health Research, and a previous Provost Postdoctoral fellowship from the University of South Florida for which I am also thankful.

Author information

Correspondence to Roger Stanev.

Cite this article

Stanev, R. Early stopping of RCTs: two potential issues for error statistics. Synthese 192, 1089–1116 (2015). https://doi.org/10.1007/s11229-014-0602-3

