Abstract
Scientists often adjust their significance threshold (alpha level) during null hypothesis significance testing in order to take into account multiple testing and multiple comparisons. This alpha adjustment has become particularly relevant in the context of the replication crisis in science. The present article considers the conditions in which this alpha adjustment is appropriate and the conditions in which it is inappropriate. A distinction is drawn between three types of multiple testing: disjunction testing, conjunction testing, and individual testing. It is argued that alpha adjustment is only appropriate in the case of disjunction testing, in which at least one test result must be significant in order to reject the associated joint null hypothesis. Alpha adjustment is inappropriate in the case of conjunction testing, in which all relevant results must be significant in order to reject the joint null hypothesis. Alpha adjustment is also inappropriate in the case of individual testing, in which each individual result must be significant in order to reject each associated individual null hypothesis. The conditions under which each of these three types of multiple testing is warranted are examined. It is concluded that researchers should not automatically (mindlessly) assume that alpha adjustment is necessary during multiple testing. Illustrations are provided in relation to joint studywise hypotheses and joint multiway ANOVAwise hypotheses.
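To make the logic of alpha adjustment during disjunction testing concrete, the familywise error rate can be written out explicitly. This is the standard calculation (cf. Šidák, 1967) rather than a formula from the abstract itself, and it assumes k independent constituent tests, each conducted at the αConstituent level:

αJoint = 1 − (1 − αConstituent)^k

For example, five independent tests at αConstituent = .05 yield αJoint = 1 − .95^5 ≈ .226. Holding αJoint at .05 instead requires the adjusted constituent level αConstituent = 1 − (1 − αJoint)^(1/k) ≈ .0102 for k = 5, of which the Bonferroni adjustment αConstituent = αJoint/k = .01 is a close, conservative approximation.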
Availability of data and materials
There is no data associated with this article.
Code availability
There is no code associated with this article.
Notes
In the Neyman–Pearson approach, some researchers may consider alpha size tests rather than alpha level tests (Casella & Berger, 2002). However, alpha size tests are difficult to construct in the case of disjunction and conjunction testing (Casella & Berger, 2002, p. 385). Consequently, I refer to alpha level tests here.
The researchers could also collapse the green and red jelly bean conditions together and compare jelly beans versus the control (sugar pill) group, this time on two measures of acne (e.g., inflammatory and noninflammatory). In this case, the researchers would be undertaking two tests of the same null hypothesis using two different outcome variables or endpoints. To keep things simple, I refer to the multiple comparisons example throughout this article. However, my arguments are equally applicable to the multiple endpoints situation.
The familywise error rate assumes that test results are independent. As Greenland (2020, p. 17) explained, the term independence is used to refer to several different concepts. In particular, he distinguished between logical and statistical independence. Logical independence refers to the mathematical independence of parameter values such that variation in one value is not logically dependent on variation in another. Logical independence may be demonstrated via the mathematics of a model. Statistical independence refers to independence among variables, estimators, standard errors, and tests, and it may be achieved via study design (e.g., randomisation). A weak form of statistical independence is uncorrelatedness, which assumes that there is no linear association between the variables (e.g., no positive correlation). As Greenland noted, “uncorrelatedness and hence statistical independence are rarely satisfied in nonexperimental studies.” Although this may be the case, two points allow a qualified interpretation of the familywise error rate under the assumption of independence. First, when interpreting the results of a disjunction test, researchers may adopt a counterfactual interpretation that (a) the joint null hypothesis is true and (b) all of the associated test assumptions are true, including the assumption of independence. Second, researchers may complement this qualified interpretation with an acknowledgment that, if the constituent test results were positively dependent, then the actual familywise error rate would be less than the nominal familywise error rate, because a family of dependent tests provides less opportunity to incorrectly reject the joint null hypothesis than a family of independent tests (e.g., Weber, 2007, p. 284). Hence, although the assumption of independence may not be met in reality, researchers may nonetheless interpret the familywise error rate as indicating a worst-case scenario that assumes that the constituent test results are independent.
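The second point above can be checked directly with a short simulation. The following Python sketch is my illustration rather than anything from the original note; the choice of k = 5 tests, a .05 alpha level, and a .6 pairwise correlation is arbitrary. It estimates the familywise error rate under a true joint null for an independent family and for a positively dependent family, and the dependent family's rate comes out below the nominal 1 − .95^5 ≈ .226.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, alpha, n_sims = 5, 0.05, 200_000
z_crit = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value

def familywise_error_rate(rho):
    # Estimated probability of at least one significant result among k tests
    # whose z statistics share pairwise correlation rho, with all nulls true.
    cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)
    z = rng.multivariate_normal(np.zeros(k), cov, size=n_sims)
    return np.mean((np.abs(z) > z_crit).any(axis=1))

print(familywise_error_rate(0.0))  # close to the nominal .226
print(familywise_error_rate(0.6))  # smaller: dependence lowers the actual rate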
Some commentators have argued that conjunction testing decreases the Type I error rate and therefore warrants a corresponding increase in the αConstituent level above the αJoint level (e.g., Capizzi & Zhang, 1996; Massaro, 2009; Weber, 2007). This argument is based on the assumption that the Type I error rate for k independent tests is the product of the Type I error rate for each test (i.e., α^k). Hence, for example, the probability of obtaining two independent false positive results at the .05 alpha level is only .0025. However, during conjunction testing, all of the tests are required to be significant in order to reject the joint null hypothesis. Consequently, when undertaking conjunction testing, the alpha level for each of the constituent null hypotheses (αConstituent) cannot be higher than the alpha level for the joint null hypothesis (αJoint; Berger, 1982; Julious & McIntyre, 2012; Kordzakhia et al., 2010).
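To spell out the intersection–union logic behind this note (a standard argument, e.g., Berger, 1982, rather than text from the article itself): if all k constituent nulls are true and the tests are independent, then the probability that all k tests are falsely significant is (αConstituent)^k, which is where the .05^2 = .0025 figure comes from. However, the joint null hypothesis in conjunction testing is that at least one constituent null is true. In the least favorable configuration, one constituent null is true and the remaining effects are so large that their tests are almost certainly significant, so the probability of rejecting the joint null approaches αConstituent. The size of the conjunction test is therefore αConstituent itself, not (αConstituent)^k, which is why αConstituent cannot be raised above αJoint.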
Tukey (1953), who was a pioneer in the area of multiple testing, described this individual testing error rate as the per determination error rate (i.e., αIndividual). This error rate should not be confused with the per comparison error rate (i.e., αConstituent). Both error rates use unadjusted alpha levels. However, the per determination error rate is used in the context of the individual testing of an individual null hypothesis, whereas the per comparison error rate is used in the context of the disjunction testing of a joint null hypothesis. Tukey (p. 90) was firmly against the use of the per comparison error rate. However, he believed that the per determination error rate was “entirely appropriate” (p. 82) for some research questions (i.e., individual testing; see also Hochberg & Tamhane, 1987, p. 6). For example, he argued that a per determination rate was suitable when diagnosing potentially diabetic patients based on their blood sugar levels. As Tukey (1953, p. 82) explained:
the doctor’s action on John Jones would not depend on the other 19 determinations made at the same time by the same technician or on the other 47 determinations on samples from patients in Smithville. Each determination is an individual matter, and it is appropriate to set error rates accordingly.
Selection bias remains problematic during individual testing because it involves the suppression of hypotheses after the results are known, or SHARKing (Rubin, 2017d). SHARKing is problematic when suppressed falsifications are theoretically (as opposed to statistically) relevant to the research conclusions. For example, in the jelly bean study, it is theoretically informative to know not only that green jelly beans appear to cause acne but also that non-green jelly beans do not appear to cause acne.
Studywise and multiway ANOVAwise error rates are not the only types of error rates that have caused confusion in the area of multiple testing. Other examples include datasetwise error rates (in which the family includes all hypotheses that are tested using a specific dataset; Bennett et al., 2009, p. 417; Thompson et al., 2020), careerwise error rates (in which the family includes all hypotheses that are tested by a specific researcher during their career; O’Keefe, 2003; Stewart-Oaten, 1995), and fieldwise error rates (in which the family includes all hypotheses that are tested in a specific field). A key argument in the current article is that researchers do not usually make decisions about data sets, researchers, and fields. Instead, they make decisions about hypotheses.
Multiple testing corrections may be necessary in multiway ANOVAs when a factor contains more than two levels and multiple comparisons are conducted between those levels in order to test a joint intersection null hypothesis (Benjamini & Bogomolov, 2011; Yekutieli et al., 2006). However, in this case, familywise error rates are limited to the comparisons that are made within factors. Familywise error is not computed across all factors in the ANOVA.
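As a concrete sketch of this within-factor scope (my illustration; the data, group labels, and the Bonferroni choice are hypothetical stand-ins for whatever design and adjustment method a researcher actually uses), the three pairwise comparisons within a single three-level factor form the family, regardless of how many other factors the ANOVA contains:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical data: one three-level "colour" factor drawn from a larger
# multiway design. Per the note above, the family for the familywise
# adjustment is the set of pairwise comparisons within this factor only.
groups = {
    "green": rng.normal(10.0, 2.0, 30),
    "red": rng.normal(10.5, 2.0, 30),
    "purple": rng.normal(9.8, 2.0, 30),
}

names = list(groups)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
alpha_joint = 0.05
alpha_constituent = alpha_joint / len(pairs)  # Bonferroni across 3 comparisons

for a, b in pairs:
    t_stat, p = stats.ttest_ind(groups[a], groups[b])
    print(f"{a} vs {b}: p = {p:.3f}, significant = {p < alpha_constituent}")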
References
An, Q., Xu, D., & Brooks, G. P. (2013). Type I error rates and power of multiple hypothesis testing procedures in factorial ANOVA. Multiple Linear Regression Viewpoints, 39, 1–16.
Armstrong, R. A. (2014). When to use the Bonferroni correction. Ophthalmic and Physiological Optics, 34, 502–508. https://doi.org/10.1111/opo.12131
Bender, R., & Lange, S. (2001). Adjusting for multiple testing—When and how? Journal of Clinical Epidemiology, 54, 343–349. https://doi.org/10.1016/S0895-4356(00)00314-0
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C., & Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z
Benjamini, Y., & Bogomolov, M. (2011). Adjusting for selection bias in testing multiple families of hypotheses. https://arxiv.org/abs/1106.3670
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (methodological), 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2010). Neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon: An argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results, 1(1), 1–5. https://teenspecies.github.io/pdfs/NeuralCorrelates.pdf
Bennett, C. M., Wolford, G. L., & Miller, M. B. (2009). The principled control of false positives in neuroimaging. Social Cognitive and Affective Neuroscience, 4, 417–422. https://doi.org/10.1093/scan/nsp053
Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24, 295–300. https://doi.org/10.2307/1267823
Berger, R. L., & Hsu, J. C. (1996). Bioequivalence trials, intersection-union tests, and equivalence confidence sets. Statistical Science, 11, 283–319. https://doi.org/10.1214/ss/1032280304
Bretz, F., Hothorn, T., & Westfall, P. (2011). Multiple comparisons using R. CRC Press.
Capizzi, T., & Zhang, J. I. (1996). Testing the hypothesis that matters for multiple primary endpoints. Drug Information Journal, 30, 949–956. https://doi.org/10.1177/009286159603000410
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312. https://doi.org/10.1037/0003-066X.45.12.1304
Cook, R. J., & Farewell, V. T. (1996). Multiplicity considerations in the design and analysis of clinical trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 159, 93–110. https://doi.org/10.2307/2983471
Cox, D. R. (1965). A remark on multiple comparison methods. Technometrics, 7, 223–224. https://doi.org/10.1080/00401706.1965.10490250
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., Waldorp, L. J., & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23, 640–647. https://doi.org/10.3758/s13423-015-0913-5
De Groot, A. D. (2014). The meaning of “significance” for different types of research. Translated and annotated by Wagenmakers, E. J., Borsboom, D., Verhagen, J., Kievit, R., Bakker, M., Cramer, A., … van der Maas, H. L. J. Acta Psychologica, 148, 188–194. https://doi.org/10.1016/j.actpsy.2014.02.001
Dennis, B., Ponciano, J. M., Taper, M. L., & Lele, S. R. (2019). Errors in statistical inference under model misspecification: Evidence, hypothesis testing, and AIC. Frontiers in Ecology and Evolution, 7, 372. https://doi.org/10.3389/fevo.2019.00372
Dmitrienko, A., Bretz, F., Westfall, P. H., Troendle, J., Wiens, B. L., Tamhane, A. C., & Hsu, J. C. (2009). Multiple testing methodology. In A. Dmitrienko, A. C. Tamhane, & F. Bretz (Eds.), Multiple testing problems in pharmaceutical statistics (pp. 35–98). Chapman & Hall.
Dmitrienko, A., & D’Agostino, R. (2013). Traditional multiplicity adjustment methods in clinical trials. Statistics in Medicine, 32, 5172–5218. https://doi.org/10.1002/sim.5990
Drachman, D. (2012). Adjusting for multiple comparisons. Journal of Clinical Research Best Practice, 8, 1–3.
Dudoit, S., & Van Der Laan, M. J. (2008). Multiple testing procedures with applications to genomics. Springer.
Efron, B. (2008). Simultaneous inference: When should hypothesis testing problems be combined? The Annals of Applied Statistics, 2, 197–223. https://doi.org/10.1214/07-AOAS141
Feise, R. J. (2002). Do multiple outcome measures require p-value adjustment? BMC Medical Research Methodology, 2, 8. https://doi.org/10.1186/1471-2288-2-8
Fisher, R. A. (1971). The design of experiments (9th ed.). Hafner Press.
Forstmeier, W., Wagenmakers, E. J., & Parker, T. H. (2017). Detecting and avoiding likely false-positive findings—A practical guide. Biological Reviews, 92, 1941–1968. https://doi.org/10.1111/brv.12315
Francis, G., & Thunell, E. (2021). Reversing Bonferroni. Psychonomic Bulletin & Review. https://doi.org/10.3758/s13423-020-01855-z
Frane, A. V. (2015). Planned hypothesis tests are not necessarily exempt from multiplicity adjustment. Journal of Research Practice, 11(1), Article P2.
Glickman, M. E., Rao, S. R., & Schultz, M. R. (2014). False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. Journal of Clinical Epidemiology, 67, 850–857. https://doi.org/10.1016/j.jclinepi.2014.03.012
Goeman, J. J., & Solari, A. (2014). Multiple hypothesis testing in genomics. Statistics in Medicine, 33, 1946–1978. https://doi.org/10.1002/sim.6082
Goodman, S. N., Fanelli, D., & Ioannidis, J. P. (2016). What does research reproducibility mean? Science Translational Medicine, 8, 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027
Greenland, S. (2020). Analysis goals, error-cost sensitivity, and analysis hacking: Essential considerations in hypothesis testing and multiple comparisons. Paediatric and Perinatal Epidemiology, 35, 8–23. https://doi.org/10.1111/ppe.12711
Haig, B. D. (2009). Inference to the best explanation: A neglected approach to theory appraisal in psychology. The American Journal of Psychology, 122(2), 219–234. http://www.jstor.org/stable/27784393
Hewes, D. E. (2003). Methods as tools. Human Communication Research, 29, 448–454. https://doi.org/10.1111/j.1468-2958.2003.tb00847.x
Hochberg, Y., & Tamhane, A. C. (1987). Multiple comparison procedures. Wiley.
Hsu, J. (1996). Multiple comparisons: Theory and methods. CRC Press.
Huberty, C. J., & Morris, J. D. (1988). A single contrast test procedure. Educational and Psychological Measurement, 48, 567–578. https://doi.org/10.1177/0013164488483001
Hung, H. M. J., & Wang, S. J. (2010). Challenges to multiple testing in clinical trials. Biometrical Journal, 52, 747–756. https://doi.org/10.1002/bimj.200900206
Hurlbert, S. H., & Lombardi, C. M. (2012). Lopsided reasoning on lopsided tests and multiple comparisons. Australian & New Zealand Journal of Statistics, 54, 23–42. https://doi.org/10.1111/j.1467-842X.2012.00652.x
Jannot, A. S., Ehret, G., & Perneger, T. (2015). P < 5 × 10^−8 has emerged as a standard of statistical significance for genome-wide association studies. Journal of Clinical Epidemiology, 68, 460–465. https://doi.org/10.1016/j.jclinepi.2015.01.001
Julious, S. A., & McIntyre, N. E. (2012). Sample sizes for trials involving multiple correlated must-win comparisons. Pharmaceutical Statistics, 11, 177–185. https://doi.org/10.1002/pst.515
Kim, K., Zakharkin, S. O., Loraine, A., & Allison, D. B. (2004). Picking the most likely candidates for further development: Novel intersection-union tests for addressing multi-component hypotheses in comparative genomics. In Proceedings of the American Statistical Association, ASA Section on ENAR Spring Meeting (pp. 1396–1402). http://www.uab.edu/cngi/pdf/2004/JSM%202004%20-IUTs%20Kim%20et%20al.pdf
Klockars, A. J. (2003). Multiple comparisons texts: Their utility in guiding research practice. Journal of Clinical Child and Adolescent Psychology, 32, 613–621. https://doi.org/10.1207/S15374424JCCP3204_15
Kordzakhia, G., Siddiqui, O., & Huque, M. F. (2010). Method of balanced adjustment in testing co-primary endpoints. Statistics in Medicine, 29, 2055–2066. https://doi.org/10.1002/sim.3950
Kotzen, M. (2013). Multiple studies and evidential defeat. Noûs, 47(1), 154–180. http://www.jstor.org/stable/43828821
Kozak, M., & Powers, S. J. (2017). If not multiple comparisons, then what? Annals of Applied Biology, 171, 277–280. https://doi.org/10.1111/aab.12379
Kromrey, J. D., & Dickinson, W. B. (1995). The use of an overall F test to control Type I error rates in factorial analyses of variance: Limitations and better strategies. Journal of Applied Behavioral Science, 31, 51–64. https://doi.org/10.1177/0021886395311006
Lew, M. J. (2019). A reckless guide to p-values: Local evidence, global errors. In A. Bespalov, M. C. Michel, & T. Steckler (Eds.), Good research practice in experimental pharmacology. Springer. https://arxiv.org/abs/1910.02042
Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn’t). Psychophysiology, 54, 146–157. https://doi.org/10.1111/psyp.12639
Mascha, E. J., & Turan, A. (2012). Joint hypothesis testing and gatekeeping procedures for studies with multiple endpoints. Anesthesia and Analgesia, 114, 1304–1317. https://doi.org/10.1213/ANE.0b013e3182504435
Massaro, J. (2009). Experimental design. In D. Robertson & G. H. Williams (Eds.) Clinical and translational science: Principles of human research (pp. 41–57). Academic Press. https://doi.org/10.1016/B978-0-12-373639-0.00003-0
Matsunaga, M. (2007). Familywise error in multiple comparisons: Disentangling a knot through a critique of O’Keefe’s arguments against alpha adjustment. Communication Methods and Measures, 1, 243–265. https://doi.org/10.1080/19312450701641409
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (Vol. 1, 2nd edn.). Psychology Press.
Mead, R. (1988). The design of experiments. Cambridge University Press.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834. https://doi.org/10.1037/0022-006X.46.4.806
Mei, S., Karimnezhad, A., Forest, M., Bickel, D. R., & Greenwood, C. M. (2017). The performance of a new local false discovery rate method on tests of association between coronary artery disease (CAD) and genome-wide genetic variants. PLoS ONE, 12, e0185174. https://doi.org/10.1371/journal.pone.0185174
Miller, R. G., Jr. (1981). Simultaneous statistical inference (2nd ed.). Springer.
Morgan, J. F. (2007). p value fetishism and use of the Bonferroni adjustment. Evidence-Based Mental Health, 10(2), 34–35. https://doi.org/10.1136/ebmh.10.2.34
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE, 7, e32734. https://doi.org/10.1371/journal.pone.0032734
Mudge, J. F., Martyniuk, C. J., & Houlahan, J. E. (2017). Optimal alpha reduces error rates in gene expression studies: A meta-analysis approach. BMC Bioinformatics, 18, 312. https://doi.org/10.1186/s12859-017-1728-3
Munroe, R. (2011). Significant. https://xkcd.com/882/
Neuhäuser, M. (2006). How to deal with multiple endpoints in clinical trials. Fundamental & Clinical Pharmacology, 20, 515–523. https://doi.org/10.1111/j.1472-8206.2006.00437.x
Nichols, T., Brett, M., Andersson, J., Wager, T., & Poline, J. B. (2005). Valid conjunction inference with the minimum statistic. NeuroImage, 25, 653–660. https://doi.org/10.1016/j.neuroimage.2004.12.005
Nosek, B. A., Beck, E. D., Campbell, L., Flake, J. K., Hardwicke, T. E., Mellor, D. T., van’t Veer, A. E., & Vazire, S. (2019). Preregistration is hard, and worthwhile. Trends in Cognitive Sciences, 23(10), 815–818. https://doi.org/10.1016/j.tics.2019.07.009
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115, 2600–2606. https://doi.org/10.1073/pnas.1708274114
Nosek, B. A., & Lakens, D. (2014). Registered reports: A method to increase the credibility of published results. Social Psychology, 45, 137–141. https://doi.org/10.1027/1864-9335/a000192
O’Keefe, D. J. (2003). Colloquy: Should familywise alpha be adjusted? Human Communication Research, 29, 431–447. https://doi.org/10.1111/j.1468-2958.2003.tb00846.x
Otani, T., Noma, H., Nishino, J., & Matsui, S. (2018). Re-assessment of multiple testing strategies for more efficient genome-wide association studies. European Journal of Human Genetics, 26, 1038–1048. https://doi.org/10.1038/s41431-018-0125-3
Pan, Q. (2013). Multiple hypotheses testing procedures in clinical trials and genomic studies. Frontiers in Public Health, 1, 63. https://doi.org/10.3389/fpubh.2013.00063
Panagiotou, O. A., Ioannidis, J. P., & Genome-Wide Significance Project. (2011). What should the genome-wide significance threshold be? Empirical replication of borderline genetic associations. International Journal of Epidemiology, 41, 273–286. https://doi.org/10.1093/ije/dyr178
Parker, R. A., & Weir, C. J. (2020). Non-adjustment for multiple testing in multi-arm trials of distinct treatments: Rationale and justification. Clinical Trials, 17(5), 562–566. https://doi.org/10.1177/1740774520941419
Perneger, T. V. (1998). What’s wrong with Bonferroni adjustments. British Medical Journal, 316, 1236–1238. https://doi.org/10.1136/bmj.316.7139.1236
Proschan, M. A., & Waclawiw, M. A. (2000). Practical guidelines for multiplicity adjustment in clinical trials. Controlled Clinical Trials, 21, 527–539. https://doi.org/10.1016/S0197-2456(00)00106-9
Rodriguez, M. (1997). Non-factorial ANOVA: Test only substantive and interpretable hypotheses. Paper presented at the Annual Meeting of the Southwest Educational Research Association, Austin, Texas, USA. http://files.eric.ed.gov/fulltext/ED406444.pdf
Rosset, S., Heller, R., Painsky, A., & Aharoni, E. (2018). Optimal procedures for multiple testing problems. https://arxiv.org/abs/1804.10256
Rothman, K. J. (1990). No adjustments are needed for multiple comparisons. Epidemiology, 1, 43–46. https://www.jstor.org/stable/20065622
Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern epidemiology (3rd ed.). Lippincott Williams & Wilkins.
Roy, S. N. (1953). On a heuristic method of test construction and its use in multivariate analysis. The Annals of Mathematical Statistics, 24, 220–238. https://doi.org/10.1214/aoms/1177729029
Rubin, M. (2017a). An evaluation of four solutions to the forking paths problem: Adjusted alpha, preregistration, sensitivity analyses, and abandoning the Neyman–Pearson approach. Review of General Psychology, 21, 321–329. https://doi.org/10.1037/gpr0000135
Rubin, M. (2017b). Do p values lose their meaning in exploratory analyses? It depends how you define the familywise error rate. Review of General Psychology, 21, 269–275. https://doi.org/10.1037/gpr0000123
Rubin, M. (2017c). The implications of significance testing based on hypothesiswise and studywise error. PsycArXiv. https://doi.org/10.17605/OSF.IO/7YFRV
Rubin, M. (2017d). When does HARKing hurt? Identifying when different types of undisclosed post hoc hypothesizing harm scientific progress. Review of General Psychology, 21, 308–320. https://doi.org/10.1037/gpr0000128
Rubin, M. (2020). Does preregistration improve the credibility of research findings? The Quantitative Methods for Psychology, 16(4), 376–390. https://doi.org/10.20982/tqmp.16.4.p376
Rubin, M. (2021). What type of Type I error? Contrasting the Neyman–Pearson and Fisherian approaches in the context of exact and direct replications. Synthese, 198, 5809–5834. https://doi.org/10.1007/s11229-019-02433-0
Rubin, M. (2022). The costs of HARKing. British Journal for the Philosophy of Science. https://doi.org/10.1093/bjps/axz050
Ryan, T. A. (1962). The experiment as the unit for computing rates of error. Psychological Bulletin, 59, 301–305. https://doi.org/10.1037/h0040562
Sainani, K. L. (2009). The problem of multiple testing. PM&R, 1, 1098–1103. https://doi.org/10.1016/j.pmrj.2009.10.004
Savitz, D. A., & Olshan, A. F. (1995). Multiple comparisons and related issues in the interpretation of epidemiologic data. American Journal of Epidemiology, 142, 904–908. https://doi.org/10.1093/oxfordjournals.aje.a117737
Schochet, P. Z. (2009). An approach for addressing the multiple testing problem in social policy impact evaluations. Evaluation Review, 33, 539–567. https://doi.org/10.1177/0193841X09350590
Schulz, K. F., & Grimes, D. A. (2005). Multiplicity in randomised trials I: Endpoints and treatments. The Lancet, 365, 1591–1595. https://doi.org/10.1016/S0140-6736(05)66461-6
Senn, S. (2007). Statistical issues in drug development (2nd ed.). Wiley.
Shaffer, J. P. (1995). Multiple hypothesis testing. Annual Review of Psychology, 46, 561–584. https://doi.org/10.1146/annurev.ps.46.020195.003021
Shaffer, J. P. (2006). Simultaneous testing. Encyclopedia of Statistical Sciences. https://doi.org/10.1002/0471667196.ess2452.pub2
Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626–633. https://doi.org/10.1080/01621459.1967.10482935
Sinclair, J., Taylor, P. J., & Hobbs, S. J. (2013). Alpha level adjustments for multiple dependent variable analyses and their applicability—A review. International Journal of Sports Science and Engineering, 7, 17–20.
Stacey, A. W., Pouly, S., & Czyz, C. N. (2012). An analysis of the use of multiple comparison corrections in ophthalmology research. Investigative Ophthalmology & Visual Science, 53, 1830–1834. https://doi.org/10.1167/iovs.11-8730
Stewart-Oaten, A. (1995). Rules and judgments in statistics: Three examples. Ecology, 76, 2001–2009. https://doi.org/10.2307/1940736
Streiner, D. L. (2015). Best (but oft-forgotten) practices: The multiple problems of multiplicity—Whether and how to correct for many statistical tests. The American Journal of Clinical Nutrition, 102, 721–728. https://doi.org/10.3945/ajcn.115.113548
Thompson, W. H., Wright, J., Bissett, P. G., & Poldrack, R. A. (2020). Dataset decay and the problem of sequential analyses on open datasets. eLife, 9, e53498. https://doi.org/10.7554/eLife.53498
Tsai, J., Kasprow, W. J., & Rosenheck, R. A. (2014). Alcohol and drug use disorders among homeless veterans: Prevalence and association with supported housing outcomes. Addictive Behaviors, 39, 455–460. https://doi.org/10.1016/j.addbeh.2013.02.002
Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University.
Turkheimer, F. E., Aston, J. A., & Cunningham, V. J. (2004). On the logic of hypothesis testing in functional imaging. European Journal of Nuclear Medicine and Molecular Imaging, 31, 725–732. https://doi.org/10.1007/s00259-003-1387-7
Tutzauer, F. (2003). On the sensible application of familywise alpha adjustment. Human Communication Research, 29, 455–463. https://doi.org/10.1111/j.1468-2958.2003.tb00848.x
van der Zee, T. (2017). What are long-term error rates and how do you control them? The Skeptical Scientist. http://www.timvanderzee.com/long-term-error-rates-control/
Veazie, P. J. (2006). When to combine hypotheses and adjust for multiple tests. Health Services Research, 41(3), 804–818. https://doi.org/10.1111/j.1475-6773.2006.00512.x
Wang, S. J., Bretz, F., Dmitrienko, A., Hsu, J., Hung, H. J., Koch, G., Maurer, W., Offen, W., & O’Neill, R. (2015). Multiplicity in confirmatory clinical trials: A case study with discussion from a JSM panel. Statistics in Medicine, 34, 3461–3480. https://doi.org/10.1002/sim.6561
Wason, J. M., Stecher, L., & Mander, A. P. (2014). Correcting for multiple-testing in multi-arm trials: Is it necessary and is it done? Trials, 15, 364. https://doi.org/10.1186/1745-6215-15-364
Weber, R. (2007). Responses to Matsunaga: To adjust or not to adjust alpha in multiple testing: That is the question. Guidelines for alpha adjustment as response to O’Keefe’s and Matsunaga’s critiques. Communication Methods and Measures, 1, 281–289. https://doi.org/10.1080/19312450701641391
Westfall, P. H., Ho, S. Y., & Prillaman, B. A. (2001). Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials. Journal of Biopharmaceutical Statistics, 11, 125–138. https://doi.org/10.1081/BIP-100107653
Westfall, P. H., & Young, S. S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. Wiley.
Wilson, W. (1962). A note on the inconsistency inherent in the necessity to perform multiple comparisons. Psychological Bulletin, 59, 296–300. https://doi.org/10.1037/h0040447
Winkler, A. M., Webster, M. A., Brooks, J. C., Tracey, I., Smith, S. M., & Nichols, T. E. (2016). Non-parametric combination and related permutation tests for neuroimaging. Human Brain Mapping, 37, 1486–1511. https://doi.org/10.1002/hbm.23115
Wu, P., Yang, Q., Wang, K., Zhou, J., Ma, J., Tang, Q., Jin, L., Xiao, W., Jiang, A., Jiang, Y., & Zhu, L. (2018). Single step genome-wide association studies based on genotyping by sequence data reveals novel loci for the litter traits of domestic pigs. Genomics, 110, 171–179. https://doi.org/10.1016/j.ygeno.2017.09.009
Yekutieli, D., Reiner-Benaim, A., Benjamini, Y., Elmer, G. I., Kafkafi, N., Letwin, N. E., & Lee, N. H. (2006). Approaches to multiplicity issues in complex research in microarray analysis. Statistica Neerlandica, 60, 414–437. https://doi.org/10.1111/j.1467-9574.2006.00343.x
Funding
No funding was received in relation to this article.
Ethics declarations
Conflict of interest
The author declares no conflict of interest.
Additional information
This article belongs to the topical collection “Recent Issues in Philosophy of Statistics: Evidence, Testing, and Applications”, edited by Sorin Bangu, Emiliano Ippoliti, and Marianna Antonutti.
Cite this article
Rubin, M. When to adjust alpha during multiple testing: a consideration of disjunction, conjunction, and individual testing. Synthese 199, 10969–11000 (2021). https://doi.org/10.1007/s11229-021-03276-4